Gaussian Mixture Model

Suppose there are a set of data points that need to be grouped into several parts or clusters based on their similarity. In machine learning, this is known as Clustering.

There are several methods available for clustering like:

K Means Clustering

Hierarchical Clustering

Gaussian Mixture Models

etc.

In this article, the Gaussian Mixture Model will be discussed.

Normal or Gaussian Distribution

In real life, many datasets can be modeled by Gaussian Distribution (Univariate or Multivariate). So it is quite natural and intuitive to assume that the clusters come from different Gaussian Distributions. Or in other words, it is tried to model the dataset as a mixture of several Gaussian Distributions. This is the core idea of this model.

In one dimension the probability density function of a Gaussian Distribution is given by

$G(X\h|\mu, \sigma) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{{{ - \left( {x - \mu } \right)^2 } \mathord{\left/ {\vphantom {{ - \left( {x - \mu } \right)^2 } {2\sigma ^2 }}} \right. \kern-\nulldelimiterspace} {2\sigma ^2 }}}$
where $\mu$ and $\sigma^2$ are respectively mean and variance of the distribution.

For Multivariate ( let us say d-variate) Gaussian Distribution, the probability density function is given by

$G(X|\mu, \Sigma)= \frac{1}{\sqrt{(2\pi)|\boldsymbol\Sigma|}} \exp\left(-\frac{1}{2}({X}-{\mu})^T{\boldsymbol\Sigma}^{-1}({X}-{\mu}) \right)$

Here $\mu$ is a d dimensional vector denoting the mean of the distribution and $\Sigma$ is the d X d covariance matrix.

Gaussian Mixture Model

Suppose there are K clusters (For the sake of simplicity here it is assumed that the number of clusters is known and it is K). So $\mu$ and $\Sigma$ is also estimated for each k. Had it been only one distribution, they would have been estimated by maximum-likelihood method. But since there are K such clusters and the probability density is defined as a linear function of densities of all these K distributions, i.e.

$p(X) =\sum_{k=1}^K \pi_k G(X|\mu_k, \Sigma_k)$

where $\pi_k$ is the mixing coefficient for k-th distribution.

For estimating the parameters by maximum log-likelihood method, compute p(X| $\mu$ , $\Sigma$ , $\pi$ ).

$ln\hspace{0.25 cm} {p( X|\mu, \Sigma, \pi)} \newline =\sum_{i=1}^N p(X_i) \newline =\sum_{i=1}^N ln {\sum_{k=1}^K \pi_k G(X_i | \mu_k, \Sigma_k)}$

Now define a random variable $\gamma_k(X)$ such that $\gamma_k(X)$ =p(k|X).

From Bayes’theorem,

$\gamma_k(X) \newline\newline=\frac{p(X|k)p(k)}{\sum_{k=1}^K p(k)p(X|k)} \newline\newline=\frac{p(X|k)\pi_k}{\sum_{k=1}^K \pi_k p(X|k)}$

Now for the log likelihood function to be maximum, its derivative of $p(X|\mu, \Sigma, \pi)$ with respect to $\mu$ , $\Sigma$ and $\pi$ should be zero. So equaling the derivative of $p(X|\mu, \Sigma, \pi)$ with respect to $\mu$ to zero and rearranging the terms,

$\mu_k=\frac{\sum_{n=1}^N \gamma_k(x_n)x_n}{\sum_{n=1}^N \gamma_k(x_n)}$

Similarly taking derivative with respect to $\Sigma$ and $\pi$ respectively, one can obtain the following expressions.

$\Sigma_k=\frac{\sum_{n=1}^N \gamma_k(x_n)(x_n-\mu_k)(x_n-\mu_k)^T}{\sum{n=1}^N \gamma_k(x_n)} \newline$
And
$\pi_k=\frac{1}{N} \sum_{n=1}^N \gamma_k(x_n)$

Note: $\sum_{n=1}^N\gamma_k(x_n)$ denotes the total number of sample points in the k-th cluster. Here it is assumed that there are total N number of samples and each sample containing d features are denoted by $x_i$ .

So it can be clearly seen that the parameters cannot be estimated in closed form. This is where Expectation-Maximization algorithm is beneficial.

Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood estimates for model parameters when the data is incomplete or has some missing data points or has some hidden variables. EM chooses some random values for the missing data points and estimates a new set of data. These new values are then recursively used to estimate a better first data, by filling up missing points, until the values get fixed.
These are the two basic steps of the EM algorithm, namely E Step or Expectation Step or Estimation Step and M Step or Maximization Step.

Estimation step:
- initialize $\mu_k$ , $\Sigma_k$ and $\pi_k$ by some random values, or by K means clustering results or by hierarchical clustering results.
- Then for those given parameter values, estimate the value of the latent variables (i.e $\gamma_k$ )
Maximization Step:
- Update the value of the parameters( i.e. $\mu_k$ , $\Sigma_k$ and $\pi_k$ ) calculated using ML method.

Algorithm:


Initialize the mean , 
the covariance matrix  and 
the mixing coefficients  
by some random values. (or other values)
Compute the  values for all k.
Again Estimate all the parameters 
using the current  values.
Compute log-likelihood function.
Put some convergence criterion
If the log-likelihood value converges to some value 
( or if all the parameters converge to some values ) 
then stop, 
else return to Step 2.

Example:
In this example, IRIS Dataset is taken. In Python there is a GaussianMixture class to implement GMM.

Note: This code might not run in an online compiler. Please use an offline ide.

Load the iris dataset from datasets package. To keep things simple, take only first two columns (i.e sepal length and sepal width respectively).

Now plot the dataset.

filter_none

brightness_4

import numpy as np 

import pandas as pd 

import matplotlib.pyplot as plt 

from pandas import DataFrame 

from sklearn import datasets 

from sklearn.mixture import GaussianMixture 

# load the iris dataset 

iris = datasets.load_iris() 

# select first two columns  

X = iris.data[:, :2] 

# turn it into a dataframe 

d = pd.DataFrame(X) 

# plot the data 

plt.scatter(d[0], d[1])

Now fit the data as a mixture of 3 Gaussians.

Then do the clustering, i.e assign a label to each observation. Also find the number of iterations needed for the log-likelihood function to converge and the converged log-likelihood value.

filter_none

brightness_4

gmm = GaussianMixture(n_components = 3) 

# Fit the GMM model for the dataset  

# which expresses the dataset as a  

# mixture of 3 Gaussian Distribution 

gmm.fit(d) 

# Assign a label to each sample 

labels = gmm.predict(d) 

d['labels']= labels 

d0 = d[d['labels']== 0] 

d1 = d[d['labels']== 1] 

d2 = d[d['labels']== 2] 

# plot three clusters in same plot 

plt.scatter(d0[0], d0[1], c ='r') 

plt.scatter(d1[0], d1[1], c ='yellow') 

plt.scatter(d2[0], d2[1], c ='g')

Print the converged log-likelihood value and no. of iterations needed for the model to converge

filter_none

brightness_4

# print the converged log-likelihood value

print(gmm.lower_bound_)

# print the number of iterations needed

# for the log-likelihood value to converge

print(gmm.n_iter_)</div>
Hence, it needed 7 iterations for the log-likelihood to converge. If more iterations are performed, no appreciable change in the log-likelihood value, can be observed.

For more discussions about GMM...you can visit the website below.

https://towardsdatascience.com/gaussian-mixture-modelling-gmm-833c88587c7f

Statistics Lessons

Pages

Saturday, August 3, 2019

The Gaussian Mixture Model

Gaussian Mixture Model

Normal or Gaussian Distribution

Gaussian Mixture Model

Expectation-Maximization (EM) Algorithm

No comments:

Post a Comment

Postulates/Axioms of Probability

Popular Posts

Like us on Facebook

Contact Us