Modeling Gaussian mixtures
Thanks to its roots as a statistical programming language, R has very strong support for common statistical tasks like modeling and prediction. For modeling data as a mixture of distributions, I know of the mixtools package, though there are many others, as reviewed by this paper, and presumably more are being developed all the time. My colleague Chad told me about mclust, which seems to strike a good balance between features, ease of use, and speed (thanks Chad!).
Modeling my daily iPhone step counts in R
Analyzing my iPhone step count dataset is pretty straightforward. For density estimation, it is literally a one-liner using mclust:
dens <- densityMclust(steps$stepsWalked)
Absent explicit parameters, mclust will pick the number of distributions for you using BIC, or you can specify it yourself (e.g. G=10 for exactly 10 clusters; the default is G=1:9).
The output contains the parameters of the component Gaussians, as well as the model selection results (the BIC values) for diagnostic purposes. For my steps data, the algorithm found four univariate Gaussian components, with approximate mean step counts of 3000, 6000, 10000 and 12000. The analysis is documented in modeling_gaussian_mixtures.R.
mclust also works on higher-dimensional data with the same syntax, which I applied to the merged biking and step count data. It can also do dimension reduction, clustering, discriminant analysis, and cross-validation. But since my step data is neither very large nor very metadata-rich, there was little more to do.
Using mixtools, the process is pretty similar, starting with the one-liner:
mixauto <- normalmixEM(steps$stepsWalked)
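Unlike mclust, normalmixEM does not select the number of components for you: k defaults to 2 unless you set it, and when no starting values are supplied they are derived by binning the data. The fit itself uses classic EM.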
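To make "classic EM" concrete, here is a minimal sketch of the algorithm for a 1D Gaussian mixture, written in plain Python so it runs without any packages. It is not what mixtools does internally: the function name fit_mixture_em, the quantile-based starting means, the fixed iteration count, and the synthetic step counts are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_mixture_em(data, k=2, n_iter=200):
    """Fit a k-component 1D Gaussian mixture with plain EM.

    Hypothetical helper for illustration; returns (weights, means, sigmas)
    with components ordered by increasing mean.
    """
    xs = sorted(data)
    n = len(xs)
    # Starting values: evenly spaced quantiles as means, pooled spread, equal weights.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    spread = math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / n)
    sigmas = [spread] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            total = sum(p) or 1e-300  # guard against division by zero
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean and spread from the responsibilities.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)

    order = sorted(range(k), key=lambda j: means[j])
    return (
        [weights[j] for j in order],
        [means[j] for j in order],
        [sigmas[j] for j in order],
    )


# Synthetic stand-in for daily step counts (NOT the real dataset):
# a "quiet day" group around 3000 and an "active day" group around 10000.
rng = random.Random(42)
synthetic_steps = [rng.gauss(3000, 500) for _ in range(300)]
synthetic_steps += [rng.gauss(10000, 1500) for _ in range(300)]

weights, means, sigmas = fit_mixture_em(synthetic_steps, k=2)
print("weights:", weights)
print("means:", means)
print("sigmas:", sigmas)
```

A fixed number of iterations keeps the sketch short; a real implementation would stop once the log-likelihood stops improving, and would try several starting points because EM only finds a local optimum.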
What about Python?
Scikit-learn has the neighbors.KernelDensity class for kernel density estimation in the simple 1D case, where you can specify different kernels. For modeling higher dimensions, the mixture module provides good infrastructure: you can specify the covariance type as well as the expected number of components.
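As a rough sketch of the KernelDensity route for the 1D case, the snippet below estimates a density on a grid. The data is a synthetic stand-in for a step count column, and the kernel, the bandwidth of 500 steps, and the grid are arbitrary choices for illustration rather than tuned values.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic stand-in for daily step counts (NOT the real dataset).
rng = np.random.default_rng(0)
steps = np.concatenate([
    rng.normal(3000, 500, size=300),
    rng.normal(10000, 1500, size=300),
])
X = steps.reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

# The bandwidth is in the same units as the data (steps); 500 is an assumption.
kde = KernelDensity(kernel="gaussian", bandwidth=500.0).fit(X)

# Evaluate the estimated density on a regular grid, one point every 100 steps.
grid = np.linspace(0, 16000, 161).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))  # score_samples returns log-density

print("grid point with highest density:", float(grid[np.argmax(density), 0]))
```

Other values of the kernel argument, such as "tophat" or "epanechnikov", can be swapped in the same way.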
For model selection you can use either BIC or AIC as the metric, via the bic() and aic() methods. A fitted mixture.GaussianMixture estimator exposes the component means and the corresponding covariance matrices (means_ and covariances_), which you can use to make diagnostic plots.
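Putting the pieces together, a minimal sketch of BIC-based model selection with GaussianMixture might look like the following. The data is again synthetic, and the candidate range of 1 to 6 components, the "full" covariance type, and the fixed random_state are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for daily step counts (NOT the real dataset).
rng = np.random.default_rng(0)
steps = np.concatenate([
    rng.normal(3000, 500, size=300),
    rng.normal(10000, 1500, size=300),
])
X = steps.reshape(-1, 1)  # shape (n_samples, n_features)

# Fit one model per candidate component count.
candidates = range(1, 7)
models = [
    GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    for k in candidates
]

# In scikit-learn, a lower BIC is better.
bics = [m.bic(X) for m in models]
best = models[int(np.argmin(bics))]

print("components chosen by BIC:", best.n_components)
print("weights:", best.weights_)
print("means:", best.means_.ravel())
print("variances:", best.covariances_.ravel())
```

Replacing m.bic(X) with m.aic(X) switches the selection criterion; AIC penalizes extra components less heavily, so it may favor larger models.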