Are you looking for concepts that will help you understand the basics of theoretical statistics? If so, you have come to the right place. In today’s blog, we will take an overview of the mathematical concepts of statistics used in Natural Language Processing.
Two mathematical concepts that are widely used in statistics and probability, and that are especially useful for understanding how variables relate, are covariance and correlation. Both are commonly used in natural language processing to compare data samples from different populations: covariance measures how much two random variables vary together, while correlation, which is a normalized version of covariance, measures the strength and direction of their linear relationship.
Before diving into details about covariance and correlation, let us first try to understand what is meant by variance and standard deviation.
What do variance and standard deviation mean? What is their mathematical representation? Let’s find out.
Variance is a measure of variability: it is calculated by averaging the squared deviations of a random variable from its mean. In other words, it measures how far a set of numbers is spread out from its average value; the more spread out the data, the larger the variance. The mathematical representation of the sample variance (which divides by n - 1 rather than n) is:
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
$S^2$ = sample variance
$x_i$ = value of one observation
$\bar{x}$ = mean value of all observations
$n$ = number of observations
Standard deviation is a statistical term that measures the amount of dispersion of a set of values relative to its mean (the absolute variability of a random variable). It is calculated as the square root of the variance. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread over a wider range, further from the mean. Thus, the more spread out the data, the higher the standard deviation.
The general mathematical formula for the sample standard deviation of a given dataset, using the same symbols as for the variance, is as follows:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
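To make the two definitions concrete, here is a minimal sketch in plain Python. The function names `sample_variance` and `sample_std` and the list of values are made up for illustration, and the sketch assumes the sample versions of both measures (dividing by n - 1).

```python
import math

def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Hypothetical observations, used only for illustration
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("variance:", sample_variance(values))
print("standard deviation:", sample_std(values))
```

Because the standard deviation is the square root of the variance, it is expressed in the same units as the data, which is often why it is the easier of the two to interpret.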
So now, what do we mean by covariance and correlation?
In simple words, covariance is a measure of the extent to which two random variables change in tandem, whereas correlation is a measure of how strongly two random variables are related to each other.
In a more explanatory sense, covariance is a quantitative measure of the extent to which the deviation of one variable from its mean matches the deviation of the other from its mean. It shows whether pairs of variables tend to move together. For example, it can describe how height and weight are related: taller people tend to be heavier than shorter people.
To understand its mathematical representation, suppose we have two variables X and Y; we write the covariance between them as Cov(X, Y). If E(X) and E(Y) are the expected values of the variables, the covariance formula can be represented as:
$$\mathrm{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big]$$
For a sample of n paired observations, it is estimated as:
$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
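The sample version of this definition can be sketched in a few lines of plain Python. The function name `sample_covariance` and the height and weight figures are hypothetical, chosen only to mirror the height and weight example in the text.

```python
def sample_covariance(xs, ys):
    """Sample covariance: average product of paired deviations, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical paired observations, used only for illustration
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [52.0, 58.0, 69.0, 77.0, 84.0]

cov_hw = sample_covariance(heights_cm, weights_kg)
print("Cov(height, weight):", cov_hw)  # positive: the two tend to rise together

# The covariance of a variable with itself is its variance
print("Cov(height, height):", sample_covariance(heights_cm, heights_cm))
```

A positive result means the variables tend to deviate from their means in the same direction; the size of the number, however, depends on the units used, so it is hard to interpret on its own.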
Correlation, on the other hand, works primarily for quantifiable data where the numbers themselves carry meaning. It cannot be used for purely categorical data, such as brands or goods purchased, gender, or favorite color. The word correlation is also used in everyday life to denote some form of association between two quantitative variables; for example, we might observe a correlation between foggy days and attacks of wheezing.
So, how do the two compare? They are closely related, but it is important to be clear about the theory when discussing both together. What sets them apart is that correlation values are standardized, whereas covariance values are not: covariance depends on the units of the variables, while correlation always lies between -1 and 1. The correlation coefficient of two variables is obtained by dividing their covariance by the product of their standard deviations.
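The division by the standard deviations can be sketched with NumPy as follows. The helper name `correlation` and the data are hypothetical; `np.cov`, `np.std` and `np.corrcoef` are the only library calls used. The example also converts the heights from centimetres to metres to show what standardization buys us.

```python
import numpy as np

# Hypothetical paired observations, used only for illustration
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weights_kg = np.array([52.0, 58.0, 69.0, 77.0, 84.0])

def correlation(x, y):
    """Correlation as covariance divided by the product of standard deviations."""
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Same heights expressed in metres instead of centimetres
heights_m = heights_cm / 100.0

cov_cm = np.cov(heights_cm, weights_kg, ddof=1)[0, 1]
cov_m = np.cov(heights_m, weights_kg, ddof=1)[0, 1]
r_cm = correlation(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)

print("covariance (cm):", cov_cm, " covariance (m):", cov_m)
print("correlation (cm):", r_cm, " correlation (m):", r_m)
```

Changing the unit rescales the covariance by the same factor, while the correlation stays the same, which is what "standardized" means in practice.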
Both terms describe the linear relationship between variables. If one variable increases and the other also moves in the same direction, the covariance and correlation are positive. If the variables move in opposite directions, they are negative. When there is no linear relationship, both values are close to zero.
A correlation matrix is a table that shows the correlation coefficients between multiple variables at one go. It has three main applications: to summarize large amounts of data, to serve as an input to other analyses, and to act as a diagnostic when checking other analyses.
What kind of data do we normally compute covariance and correlation on? In practice, a sample is randomly drawn from the population, and so we calculate covariance and correlation on that sample rather than on the complete population.
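As a small illustration of what such a matrix looks like, the sketch below builds one with pandas. The column names `a`, `b` and `c` and the randomly generated data are assumptions made for the example; `DataFrame.corr()` computes Pearson correlations by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Hypothetical data: b is built from a, c is independent noise
df = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr()  # pairwise Pearson correlation between all columns
print(corr.round(2))
```

The diagonal is always 1 (each variable is perfectly correlated with itself) and the matrix is symmetric, so only the entries above or below the diagonal need to be read.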
Now that we have a brief idea of the mathematical theory of covariance and correlation, let us explore how and where we can apply it in the field of data analytics. Principal Component Analysis (PCA) is one such application. PCA is a dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the large set.
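One common way to carry out PCA is through the eigen-decomposition of the covariance matrix, which is where covariance enters the picture. The following is a minimal sketch of that approach, not a full implementation; the function name `pca_via_covariance` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_via_covariance(X, n_components):
    """Project X onto the top eigenvectors of its covariance matrix."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)      # features are in columns
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    explained = eigvals[:n_components] / eigvals.sum()
    return X_centered @ components, explained

# Hypothetical data: the second column is almost a multiple of the first
rng = np.random.default_rng(1)
t = rng.normal(size=100)
X = np.column_stack([
    t,
    2.0 * t + rng.normal(scale=0.1, size=100),
    rng.normal(size=100),
])

scores, explained = pca_via_covariance(X, n_components=2)
print("projected shape:", scores.shape)
print("share of variance explained:", explained)
```

Because two of the three columns carry nearly the same information, two components are enough to retain most of the variance in this made-up dataset.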
So how do we decide which to use, the correlation matrix or the covariance matrix? Let’s try to understand this with the help of an example, and to showcase agility of implementation across technologies, I shall execute this example in Python.
We will use the ‘iris’ dataset for this.
First, we have to import the required libraries and then load the iris dataset. After that, we have to create a dataframe and drop empty records. Not to worry, we will go through the code line by line below:
#importing necessary libraries
from sklearn import datasets
import pandas as pd
import numpy as np
# load iris dataset
iris = datasets.load_iris()
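# Since this is a Bunch, create a dataframe
# (assumed construction: features from iris.data, label from iris.target)
iris_df = pd.DataFrame(iris.data)
iris_df['class'] = iris.target
iris_df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
iris_df.dropna(how="all", inplace=True)  # remove any empty lines
# selecting only the first 4 columns as they are the independent (X) variables
# any kind of feature selection or correlation analysis should be first done on these
iris_X = iris_df.iloc[:, [0, 1, 2, 3]]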
#This data-set will now be standardized using the inbuilt function.
from sklearn.preprocessing import StandardScaler
# let us now standardize the dataset
iris_X_std = StandardScaler().fit_transform(iris_X)
# I have then computed 3 matrices:
# covariance matrix on standardized data
mean_vec = np.mean(iris_X_std, axis=0)
# divide by the number of rows minus one (shape[0] - 1), not by the shape tuple
cov_matrix = (iris_X_std - mean_vec).T.dot(iris_X_std - mean_vec) / (iris_X_std.shape[0] - 1)
print('Covariance matrix \n%s' % cov_matrix)
# Correlation matrix on standardized data
cor_matrix = np.corrcoef(iris_X_std.T)
print('Correlation matrix using standardized data\n%s' % cor_matrix)
# Correlation matrix on unstandardized data
cor_matrix2 = np.corrcoef(iris_X.T)
print('Correlation matrix using base unstandardized data \n%s' % cor_matrix2)
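To see how the three matrices relate to one another, the following sketch repeats the same three computations on a small synthetic feature matrix using NumPy only. The data is made up, and the standardization step is written by hand under the assumption that it should mimic `StandardScaler`, which scales each column to zero mean and unit population standard deviation (ddof=0).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for a feature matrix with columns on very different scales
base = rng.normal(size=(150, 1))
X = np.hstack([
    10.0 * base + rng.normal(size=(150, 1)),
    0.1 * base + rng.normal(scale=0.05, size=(150, 1)),
    rng.normal(scale=3.0, size=(150, 1)),
])

# Standardize by hand: zero mean, unit population standard deviation per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

n = X.shape[0]
cov_std = X_std.T @ X_std / (n - 1)         # covariance matrix on standardized data
cor_std = np.corrcoef(X_std, rowvar=False)  # correlation matrix on standardized data
cor_raw = np.corrcoef(X, rowvar=False)      # correlation matrix on raw data

# Correlation is unaffected by standardization
print(np.allclose(cor_std, cor_raw))
# Covariance of standardized data is the correlation matrix scaled by n / (n - 1)
print(np.allclose(cov_std, cor_raw * n / (n - 1)))
```

In other words, the correlation matrix is the same whether or not the data is standardized, and the covariance matrix of standardized data matches it up to a constant factor of n / (n - 1). The covariance matrix of the raw data, by contrast, would be dominated by whichever column has the largest scale.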