Descriptive statistics of a numerical or frequency table of a categorical variable.

Variable | mean | sd | var |
---|---|---|---|
Miles/(US) gallon | 20.09062 | 6.026948 | 36.3241 |

The standard deviation equals *6.03* (variance: *36.3*), which shows the unstandardized degree of homogeneity: how much variation exists around the average. The expected value is around *20.1*, somewhere between *18* and *22.2* (roughly a 95% confidence interval), with a standard error of *1.07*.
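The standard error and the interval can be reproduced from the reported mean and standard deviation. The following is only a minimal sketch: the sample size `n = 32` is an assumption (it is not stated in the report, but it is consistent with the reported standard deviation and standard error), and so is the normal quantile `1.96` for an approximate 95% interval.

```python
import math

mean = 20.09062   # reported mean
sd = 6.026948     # reported standard deviation
n = 32            # ASSUMPTION: sample size, not stated in the report

# Standard error of the mean: sd divided by the square root of n
se = sd / math.sqrt(n)

# Approximate 95% confidence interval (ASSUMPTION: normal quantile 1.96)
z = 1.96
ci_low = mean - z * se
ci_high = mean + z * se

print(round(se, 2), round(ci_low, 1), round(ci_high, 1))
```

Under these assumptions the rounded results agree with the standard error and the interval quoted in the text.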

The highest value found in the dataset is *33.9*, which is about *3.26* times as high as the minimum (*10.4*). The difference between the two is described by the range: *23.5*.

A histogram shows the distribution of the data visually by grouping the values into intervals (bins). Each bar represents one interval, and its height shows the count or density of the observations falling into it.

If we *suppose* that *Miles/(US) gallon* is not close to normally distributed (see for example skewness: *0.611*, kurtosis: *-0.373*), the median (*19.2*) might be a better measure of the center than the mean. The interquartile range (*7.38*) measures the statistical dispersion of the variable (similarly to the standard deviation), but it is based on the quartiles rather than on the mean.
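To make these shape and spread measures concrete, here is a hedged sketch of how they could be computed. The estimators are assumptions: moment-based skewness and excess kurtosis, and quartiles from Python's `statistics.quantiles` with the `inclusive` method. The report may use different estimators, so small differences from its values would be expected. The function name `shape_summary` and the sample data are hypothetical.

```python
import statistics

def shape_summary(xs):
    """Return skewness, excess kurtosis, median and IQR of a sample."""
    n = len(xs)
    mean = statistics.fmean(xs)
    # Central moments (ASSUMPTION: divide by n, not n - 1)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3   # 0 for a normal distribution
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return {
        "skewness": skewness,
        "kurtosis": excess_kurtosis,
        "median": median,
        "iqr": q3 - q1,
    }

# Hypothetical, perfectly symmetric sample: skewness is zero
summary = shape_summary([1, 2, 3, 4, 5])
print(summary)
```

A positive skewness, as reported above, indicates a longer right tail, which pulls the mean above the median.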

Correlation is one of the most commonly used statistical tools. It gives information about a possible linear relationship between two variables. By its definition, the correlation can also be called the standardized covariance.

The maximum possible value of the correlation (the so-called correlation coefficient) is 1 and the minimum is -1. In the first case there is a perfect positive (in the second case a perfect negative) linear relationship between the two variables, though perfect relationships, especially in the social sciences, are quite rare. If two variables are independent of each other, the correlation between them is 0, but a correlation coefficient of 0 only rules out a linear relationship: the variables may still be related in a nonlinear way.
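The idea of correlation as standardized covariance, and the caveat about a zero coefficient, can be illustrated with a short sketch. The helper `pearson_r` and the toy data are hypothetical and are not taken from the report.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sx * sy)

xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # perfect positive linear relationship
quadratic = [x * x for x in xs]    # fully determined by x, but not linearly

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(r_linear, r_quadratic)
```

The quadratic example is completely dependent on `xs`, yet its coefficient is zero, which is why a zero correlation cannot be read as independence.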

Because extreme values seldom occur, we use rules of thumb for the coefficients, as in other fields of statistics:

- we call two variables highly correlated if the absolute value of the correlation coefficient between them is higher than 0.7 and
- we call them uncorrelated if that is smaller than 0.2.
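These two rules of thumb can be written as a small helper. This is only a sketch; the function name `correlation_strength` and the label for the middle range are hypothetical.

```python
def correlation_strength(r):
    """Classify a correlation coefficient by the 0.7 / 0.2 rules of thumb."""
    magnitude = abs(r)
    if magnitude > 0.7:
        return "highly correlated"
    if magnitude < 0.2:
        return "uncorrelated"
    return "in between"   # hypothetical label for the remaining range

for r in (0.6812, -0.85, 0.1):
    print(r, correlation_strength(r))
```

The coefficient reported for the two variables in this analysis, 0.6812, falls into the middle range: neither highly correlated nor uncorrelated.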

Please note that correlation has nothing to do with causal models: it only shows association, not effects.

There are no highly correlated (r < -0.7 or r > 0.7) variables.

There are no uncorrelated (-0.2 < r < 0.2) variables.

Variable | mpg | drat |
---|---|---|
mpg | | 0.6812 *** |
drat | 0.6812 *** | |

Where the stars represent the significance levels of the bivariate correlation coefficients: one star for a p value below `0.05`, two for below `0.01` and three for below `0.001`.
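The star notation maps directly onto p-value thresholds. A minimal sketch, with a hypothetical function name:

```python
def significance_stars(p):
    """Map a p value to the star notation used in the correlation table."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""   # not significant at the 0.05 level

for p in (0.0004, 0.003, 0.02, 0.3):
    print(p, repr(significance_stars(p)))
```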

On the plot one can see the correlation in two forms: visually below the diagonal, and as coefficient(s) above it.

With the help of linear regression we can investigate the relationship between two variables. More precisely, we can examine whether one of the variables, the so-called dependent variable, depends significantly on the other variable, that is, whether an increase/decrease in the independent variable's values is accompanied by an increase/decrease in the dependent variable. Here we only consider linear relationships.
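For a single independent variable, the least-squares line has a simple closed form. The sketch below uses hypothetical toy data and a hypothetical helper `fit_line`. It illustrates the technique only and is not the code that produced this report.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx               # change in y per unit change in x
    intercept = my - slope * mx     # the line passes through the means
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)
```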

# Overview

Linear regression was carried out with *Rear axle ratio* as the independent variable and *Miles/(US) gallon* as the dependent variable. Interactions between independent variables were allowed for, although with a single independent variable there are none.

In order to have reliable results, we have to check whether the assumptions of linear regression are met by the data we used.

Test | Value | p-value |
---|---|---|
Global Stat | 3.44370595 | 0.48648944 |
Skewness | 0.37262924 | 0.54157458 |
Kurtosis | 0.01114036 | 0.91594106 |
Link Function | 0.17640054 | 0.67448498 |
Heteroscedasticity | 2.88353581 | 0.08948932 |

Test | Decision |
---|---|
Global Stat | Assumptions acceptable. |
Skewness | Assumptions acceptable. |
Kurtosis | Assumptions acceptable. |
Link Function | Assumptions acceptable. |
Heteroscedasticity | Assumptions acceptable. |

To check these assumptions we use the so-called GVLMA, the Global Validation of Linear Model Assumptions. Its results are shown in the tables above.

The GVLMA runs a thorough check of the linear model, including tests of the overall fit, the shape of the distribution of the residuals (skewness and kurtosis), the linearity and the homoskedasticity. The tables show whether our model meets the assumptions. As a generally accepted rule of thumb, we use the critical p-value of 0.05.

So let's see the results the test gave us:

The global statistic tells us that the linear model is acceptable for our data.

According to the GVLMA, the skewness of our model's residuals does not differ significantly from the skewness of the normal distribution.

Based on the result of the GVLMA, the kurtosis of our model's residuals does not differ significantly from the kurtosis of the normal distribution.

In the row of the link function we can read that the linearity assumption of our model was accepted.

Last but not least, the GVLMA does not reject homoscedasticity.

In summary: at the 0.05 level, none of the tests gives evidence against the assumptions, so the linear model used here can be considered appropriate for the data.
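The decisions follow from comparing each p-value with the critical value of 0.05. The sketch below reproduces that comparison from the reported p-values. The wording of the rejecting label is hypothetical, since no test is rejected in this report.

```python
ALPHA = 0.05  # critical p-value used in the text

# p-values copied from the GVLMA table
p_values = {
    "Global Stat": 0.48648944,
    "Skewness": 0.54157458,
    "Kurtosis": 0.91594106,
    "Link Function": 0.67448498,
    "Heteroscedasticity": 0.08948932,
}

def decide(p, alpha=ALPHA):
    """An assumption is kept when its test is not significant."""
    if p >= alpha:
        return "Assumptions acceptable."
    return "Assumptions not acceptable."   # hypothetical wording

decisions = {test: decide(p) for test, p in p_values.items()}
for test, decision in decisions.items():
    print(f"{test}: {decision}")
```

Note that the heteroscedasticity test is the closest to the threshold, so this conclusion is the least robust of the five.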

References:

- Pena, E. A. and Slate, E. H. (2006): Global validation of linear model assumptions. *J. Amer. Statist. Assoc.* **101**(473):341-354.

As we want to fit a linear regression model, it is advisable to check whether the relationship between the variables is indeed linear. In addition to the test statistics of the GVLMA, it is advisable to use a graphical device to check linearity as well. Here we use the so-called crPlots function, whose name abbreviates Component + Residual Plot.

Here comes the question: what do we see on the plot? First of all we can see two lines and several circles. The red dashed line is the best-fitting straight line, that is, the line for which the sum of the squared residuals is the smallest. The green curved line is the best-fitting smooth line, which does not have to be straight. The circles are the observations we investigate. We can speak of linearity if the green line does not lie too far from the red one.

Besides these options, it is possible to take a glance at the so-called diagnostic plots, on which we can see the residuals both in raw and in standardized form.

Having successfully checked the assumptions, we can finally turn to the main point of interest, the results of the linear regression model. From the table we can read which variables and interactions have a significant effect on the dependent variable.

Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | -7.52 | 5.48 | -1.37 | 0.18 |
drat | 7.68 | 1.51 | 5.1 | 1.78e-05 |

K-means clustering with automatically estimated number of clusters

K-means clustering is a specific and one of the most widespread methods of clustering. With clustering we want to divide our data into groups in which the objects are similar to each other. K-means clustering is special in that we set the number of groups we want to form in advance. In our case we take the following variables into account: *Miles/(US) gallon* and *Rear axle ratio*, to find out which observations are the nearest to each other.

MacQueen, J. B. (1967). "Some Methods for Classification and Analysis of Multivariate Observations". *Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability*. 1:281-297

As mentioned above, the special feature of the K-means method is that we set the number of groups we want to produce. Let's see how to decide on the ideal number!

We can figure that out by looking at how much the within-groups sum of squares decreases when we set a higher number of groups. The smaller the decrease, the smaller the gain from increasing the number of clusters (and the larger the decrease, the bigger the gain).
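The within-groups sum of squares is the total squared distance of the observations from the mean of their own group. A minimal sketch with hypothetical two-dimensional toy data; the function name is also hypothetical.

```python
def within_groups_ss(points, labels):
    """Sum of squared distances of each point to its own group mean."""
    total = 0.0
    for group in set(labels):
        members = [p for p, l in zip(points, labels) if l == group]
        # Group mean, coordinate by coordinate
        center = [sum(c) / len(members) for c in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, center)) for p in members
        )
    return total

# Hypothetical data: two well-separated pairs of points
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
wss_one_group = within_groups_ss(points, [0, 0, 0, 0])
wss_two_groups = within_groups_ss(points, [0, 0, 1, 1])
print(wss_one_group, wss_two_groups)
```

Here moving from one group to two reduces the sum of squares sharply, which is the kind of large decrease that signals a worthwhile additional cluster.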

The ideal number of clusters seems to be *2*.

K-means clustering starts by setting k centroids, which serve as the initial centers of the groups we want to form. After that come several iterations, during which the ideal centers are calculated.
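The iteration can be sketched as alternating between assigning every observation to its nearest centroid and moving each centroid to the mean of its group. The code below is an illustrative version of this common scheme, not necessarily the variant or the starting values used for this report. The data and the initial centroids are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(old)   # keep an empty cluster's center
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical (mpg, drat)-like observations and starting centroids
points = [(15.0, 3.0), (16.0, 3.2), (17.0, 3.4),
          (25.0, 4.0), (26.0, 3.9), (24.0, 4.1)]
labels, centers = kmeans(points, [points[0], points[3]])
print(labels, centers)
```

Because distances are computed on the raw values, the variable with the larger scale dominates the grouping in this sketch; whether the variables were standardized for the report is not stated.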

The centroids are the centers of the clusters: each one is the mean of the observations assigned to its group. It is also interesting to see the typical values of the clusters, and one way to find them is to look at the group means. The *2* cluster averages are:

Cluster | mpg | drat |
---|---|---|
1. | 15.9 | 3.3 |
2. | 25.5 | 3.98 |

On the chart below we can see the resulting groups. To distinguish which observation belongs to which cluster, objects from the same group share the same symbol, and a circle shows the border of each cluster.