More Ways of Detecting Voice Gender

While speech processing has come a long way, the study of paralinguistic properties like emotion and gender has lagged behind. The state of the art (e.g., VALL-E, Tacotron) tends to focus on end-to-end pipelines rather than Statistical Parametric Speech Synthesis (SPSS) systems that explicitly model speech properties like emotion across tasks such as speaker recognition, voice conversion, and speech generation. There remains work to be done to understand how paralinguistic information is represented in voice recordings. Using manifold learning and a neural network, I explore some of the ways gender shows up as a latent structure in human speech waveforms.

This project builds on Kory Becker's Voice Gender project (GitHub repository here). I use the feature dataset she extracted from speech clips with the warbleR R package. As a Carnegie Mellon student, I was pleased to find that many of the speech clips were sourced from projects conducted at CMU. ^_^

First, I load in the dataset and some of the libraries I'll be using:

In [112]:
import umap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
In [10]:
voice = pd.read_csv("voice.csv")
voice
Out[10]:
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm ... centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx label
0 0.059781 0.064241 0.032027 0.015071 0.090193 0.075122 12.863462 274.402906 0.893369 0.491918 ... 0.059781 0.084279 0.015702 0.275862 0.007812 0.007812 0.007812 0.000000 0.000000 male
1 0.066009 0.067310 0.040229 0.019414 0.092666 0.073252 22.423285 634.613855 0.892193 0.513724 ... 0.066009 0.107937 0.015826 0.250000 0.009014 0.007812 0.054688 0.046875 0.052632 male
2 0.077316 0.083829 0.036718 0.008701 0.131908 0.123207 30.757155 1024.927705 0.846389 0.478905 ... 0.077316 0.098706 0.015656 0.271186 0.007990 0.007812 0.015625 0.007812 0.046512 male
3 0.151228 0.072111 0.158011 0.096582 0.207955 0.111374 1.232831 4.177296 0.963322 0.727232 ... 0.151228 0.088965 0.017798 0.250000 0.201497 0.007812 0.562500 0.554688 0.247119 male
4 0.135120 0.079146 0.124656 0.078720 0.206045 0.127325 1.101174 4.333713 0.971955 0.783568 ... 0.135120 0.106398 0.016931 0.266667 0.712812 0.007812 5.484375 5.476562 0.208274 male
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3163 0.131884 0.084734 0.153707 0.049285 0.201144 0.151859 1.762129 6.630383 0.962934 0.763182 ... 0.131884 0.182790 0.083770 0.262295 0.832899 0.007812 4.210938 4.203125 0.161929 female
3164 0.116221 0.089221 0.076758 0.042718 0.204911 0.162193 0.693730 2.503954 0.960716 0.709570 ... 0.116221 0.188980 0.034409 0.275862 0.909856 0.039062 3.679688 3.640625 0.277897 female
3165 0.142056 0.095798 0.183731 0.033424 0.224360 0.190936 1.876502 6.604509 0.946854 0.654196 ... 0.142056 0.209918 0.039506 0.275862 0.494271 0.007812 2.937500 2.929688 0.194759 female
3166 0.143659 0.090628 0.184976 0.043508 0.219943 0.176435 1.591065 5.388298 0.950436 0.675470 ... 0.143659 0.172375 0.034483 0.250000 0.791360 0.007812 3.593750 3.585938 0.311002 female
3167 0.165509 0.092884 0.183044 0.070072 0.250827 0.180756 1.705029 5.769115 0.938829 0.601529 ... 0.165509 0.185607 0.062257 0.271186 0.227022 0.007812 0.554688 0.546875 0.350000 female

3168 rows × 21 columns

Then I separate the predictor variables (mean frequency of the waveform, frequency range, etc.) from the label I'm trying to predict: gender.

In [140]:
voice_pred = voice.loc[:, voice.columns != "label"]

Dimensionality Reduction

The dimensionality reduction algorithms I plan to use are not invariant to feature scaling, so I standardize the data first. Otherwise, the Euclidean distances they rely on would be dominated by the features with the largest numeric ranges and would fail to accurately represent the data.
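As a quick sanity check (my addition, a sketch using the voice_pred frame defined above), comparing the raw spreads of the features shows why this matters: kurt takes values in the hundreds while meanfreq stays well below 1, so unscaled distances would be driven almost entirely by a handful of columns.

In [ ]:
# compare raw feature spreads; the largest would dominate Euclidean distances
print(voice_pred.std().sort_values(ascending=False).head())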

In [ ]:
voice_pred_scaled = StandardScaler().fit_transform(voice_pred)

First, I use Uniform Manifold Approximation and Projection (UMAP), a relatively recent dimensionality reduction algorithm that is perhaps best known in the literature for visualizing single-cell RNA sequencing datasets. The manifold learning algorithm is quite adept at finding nonlinear projections of the data, and it scales particularly well compared to t-SNE, its older cousin. After reducing the speech dataset to two dimensions, I plot the results.

In [120]:
Y = umap.UMAP(n_neighbors=10, min_dist=0.01).fit_transform(voice_pred_scaled)
In [145]:
# color mapping and legend patches for the gender labels
import matplotlib.patches as mpatches

f = lambda x: '#0038A8' if x == 'male' else '#D60270'
gender_colors = [f(i) for i in voice.label]
male_patch = mpatches.Patch(color='#0038A8', label='male')
female_patch = mpatches.Patch(color='#D60270', label='female')
In [147]:
plt.scatter(Y[:,0], Y[:, 1], c=gender_colors, s=5)
plt.legend(handles=[male_patch, female_patch])
Out[147]:
<matplotlib.legend.Legend at 0x16b8c7880>

OK, UMAP has done a decent job of separating the voices by gender. It's worth noting that the algorithm is unsupervised, so the gendered property of voice seems to emerge as an underlying structure rather than one imposed on the data. Of course, there are not just two uniform and distinct clusters; there is still overlap and variation. Put differently, voice does not reflect a simple, mutually exclusive binary of speaker genders. Nevertheless, gender does seem to be prominently represented in the fundamental structure of the voice data.
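Since it's hard to eyeball how strong this separation is, here's a hedged way to quantify it (a sketch I'm adding, not part of the original pipeline): cluster the 2-D embedding without using the labels, then measure agreement with the known genders via the adjusted Rand index, where 1 means the unsupervised clusters recover the labels exactly and 0 means chance-level agreement.

In [ ]:
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# cluster the UMAP embedding into two groups, ignoring the labels,
# then check how well the clusters line up with gender
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Y)
print(adjusted_rand_score(voice.label, clusters))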

Next, I use t-distributed stochastic neighbor embedding (t-SNE). This is an older and better-established dimensionality reduction algorithm than UMAP, but it suffers from a higher runtime complexity.

In [148]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, init='pca', learning_rate='auto', random_state=238)
voice_tsne = tsne.fit_transform(voice_pred_scaled)
In [149]:
plt.scatter(voice_tsne[:,0], voice_tsne[:,1], c=gender_colors, s=10)
plt.legend(handles=[male_patch, female_patch])
Out[149]:
<matplotlib.legend.Legend at 0x16b660eb0>

Neural Networks

Neural networks are also adept at finding latent structures in high-dimensional datasets. In fact, one explanation for the effectiveness of deep learning, the manifold hypothesis, holds that neural networks learn the underlying manifold the data lies on, much like classical dimensionality reduction algorithms.

The author of the original voice gender project eschewed neural networks in favor of algorithms with more easily explainable results. However, I think it's worth seeing how well a neural network can classify the data given only the features extracted from the voice waveforms.

I import the relevant modules and train a neural network on the data:

In [71]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
In [125]:
X_train, X_test, Y_train, Y_test = train_test_split(voice_pred, voice.label, random_state=11523)
mlp_cl = MLPClassifier(hidden_layer_sizes=[15, 15, 15], random_state=12)
mlp_cl.fit(X_train, Y_train)
Out[125]:
MLPClassifier(hidden_layer_sizes=[15, 15, 15], random_state=12)
In [126]:
print("In-sample Accuracy Rate:")
print(sum(mlp_cl.predict(X_train) == Y_train) / Y_train.shape[0])
print("Out-of-sample Accuracy Rate:")
print(sum(mlp_cl.predict(X_test) == Y_test) / Y_test.shape[0])
In-sample Accuracy Rate:
0.9419191919191919
Out-of-sample Accuracy Rate:
0.9330808080808081
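
For a closer look at where the remaining ~7% of test error falls (a quick sketch I'm adding, using only objects defined above), a confusion matrix splits the mistakes by class:

In [ ]:
from sklearn.metrics import confusion_matrix

# rows = true labels, columns = predictions, ordered [male, female];
# off-diagonal entries count misclassified voices of each gender
print(confusion_matrix(Y_test, mlp_cl.predict(X_test), labels=['male', 'female']))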

It looks like we achieve a high accuracy rate: around 93% out of sample! In further work, I'd like to examine what latent structure the neural net found; a first step in that direction is sketched below.
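
As a rough sketch of how that might look (my addition, assuming relu activations, which is MLPClassifier's default, and reusing the color helper f and legend patches from earlier), one could forward-pass the test data through the hidden layers by hand and project the final 15-unit representation with UMAP:

In [ ]:
def hidden_activations(mlp, X):
    # forward-pass X through all hidden layers of a fitted MLPClassifier,
    # stopping before the output layer
    a = np.asarray(X, dtype=float)
    for W, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        a = np.maximum(a @ W + b, 0)  # relu
    return a

# embed the last hidden layer's activations in 2-D, colored by gender
H = hidden_activations(mlp_cl, X_test)
H_umap = umap.UMAP(n_neighbors=10, min_dist=0.01).fit_transform(H)
plt.scatter(H_umap[:, 0], H_umap[:, 1], c=[f(i) for i in Y_test], s=10)
plt.legend(handles=[male_patch, female_patch])

If the clusters look cleaner here than in the raw-feature UMAP plot above, that would suggest the network has learned a representation in which gender is even more cleanly separable.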