**Data Pilot - Statistical methods**

Historically, most statistical methods were developed before the computer era, when large-sample calculations were impractical. That is why methods relying on assumptions about the sampled population's distribution became widespread. However, distribution-dependent methods are difficult to apply correctly, and using them without verifying the appropriate distribution model leads to erroneous results.

The Monte Carlo method appeared back then as well, but did not become popular because of the large amount of computation it requires. Now that computers are widely available, two methods based on it are in common use: the bootstrap method and the shuffling (permutation) method. They give reliable results regardless of the distribution of the original data. The jackknife can also provide accurate results, but it requires a large sample size.

**Drawbacks of distribution-dependent methods**

If we know that a sampled population has a normal distribution, then a small sample is enough to estimate its mean and variance, which makes it easy to obtain a confidence interval, compare the mean to those of other samples, and so on. However, there are the following problems:

- A large sample is needed to reliably test the distribution: the Kolmogorov-Smirnov test requires 100 or more observations, and samples that large are rare in practice. Fifteen observations are enough for the chi-square test, but it is less accurate and depends heavily on how the data are grouped. Moreover, there are frequently fewer than 15 observations.
- Even when there are many observations, they are most often not normally distributed. In that case, it takes a lot of time and effort to normalize the data through various transformations. Unfortunately, normalization is sometimes impossible, for example for multimodal distributions.
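
For contrast, a minimal sketch of the distribution-dependent approach: a confidence interval for the mean under a normality assumption, using only the standard library. The sample values and the critical t value are illustrative (t ≈ 2.365 for a 95% interval with 7 degrees of freedom), not part of the original text.

```python
import math
import statistics

def normal_mean_ci(sample, t_crit):
    # Two-sided confidence interval for the mean under a normality
    # assumption: mean +/- t * s / sqrt(n). The caller supplies the
    # critical t value for the desired confidence level and n - 1
    # degrees of freedom.
    n = len(sample)
    mean = statistics.fmean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)
    return mean - t_crit * sem, mean + t_crit * sem

# Illustrative sample of 8 measurements
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
lo, hi = normal_mean_ci(sample, t_crit=2.365)  # 95% CI, df = 7
```

The interval is only as good as the normality assumption behind the chosen t value, which is exactly the weakness the bullet points above describe.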

**Bootstrap method**

The advantage of the bootstrap method is that instead of a single value of a statistical parameter, it yields a sampling distribution for that parameter.

For example, when we have 10 pairs of observations, the classic method gives us only one correlation coefficient. The bootstrap method draws random subsamples (with replacement) and computes a new correlation coefficient each time. Once we have many values of the parameter, we build its distribution curve in order to estimate its significance. When about half of the values are positive and half are negative, the correlation is insignificant. If, however, 95% of the correlation coefficients are positive, we can state that there is a positive correlation between the variables with a confidence probability of 0.95. Frequently, instead of confidence probability, people use the significance level, which is obtained by subtracting the confidence probability from 1; in our case, it is 0.05.
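
The procedure above can be sketched as follows. The data, helper names, and resample count are illustrative assumptions; only the standard library is used.

```python
import random
import statistics

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length lists
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

def bootstrap_correlations(xs, ys, n_resamples=2000, seed=0):
    # Resample the observation pairs with replacement and recompute
    # the correlation coefficient for each resample.
    rng = random.Random(seed)
    n = len(xs)
    coeffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            coeffs.append(pearson([xs[i] for i in idx],
                                  [ys[i] for i in idx]))
        except ZeroDivisionError:
            continue  # degenerate resample: a variable came out constant
    return coeffs

# 10 illustrative pairs with a clear positive relationship
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 1.9, 3.3, 3.8, 5.4, 5.7, 7.1, 8.2, 8.8, 10.3]
coeffs = bootstrap_correlations(x, y)
positive_share = sum(c > 0 for c in coeffs) / len(coeffs)
```

If `positive_share` exceeds 0.95, the correlation is declared positive at confidence probability 0.95, exactly as described above.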

**Shuffling method**

The shuffling (permutation) method is used to determine the probability that two variables are correlated. The original data are used to calculate the correlation coefficient. Then the values of one of the variables are "shuffled" and the statistic is recalculated, again and again. This way we obtain many values of the statistic. If the shuffled data yield correlation coefficients just as high as the original data, there is no correlation between the variables. If, however, the resulting values are lower, a correlation is assumed and a confidence level can be determined.
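
A sketch of the same idea in code: keep one variable fixed, repeatedly shuffle the other, and count how often the reshuffled data match or beat the original correlation. The data, function names, and shuffle count are illustrative assumptions.

```python
import random
import statistics

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length lists
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

def shuffle_test(xs, ys, n_shuffles=2000, seed=0):
    # Repeatedly shuffle ys and count how often the shuffled data
    # produce a correlation at least as strong as the observed one.
    # The share of such cases estimates the significance level.
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    ys = list(ys)  # work on a copy so the caller's data stay intact
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(ys)
        if abs(pearson(xs, ys)) >= observed:
            hits += 1
    return hits / n_shuffles

# Illustrative strongly correlated data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 1.9, 3.3, 3.8, 5.4, 5.7, 7.1, 8.2, 8.8, 10.3]
p = shuffle_test(x, y)  # small value means the correlation is significant
```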

**Jackknife method**

The jackknife method is similar to the bootstrap method except for how subsamples are selected. In the above example, with 10 observations we would get 10 subsamples of 9 observations each (if we throw out 1 observation), or 10·9/2 = 45 subsamples of 8 observations each if we throw out 2.
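
A sketch of jackknife subsampling, enumerating every subsample with a fixed number of observations removed (the data and names are illustrative; note that removing 2 of 10 observations gives 45 distinct subsamples):

```python
import itertools
import statistics

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length lists
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

def jackknife_correlations(xs, ys, leave_out=1):
    # Enumerate every subsample with `leave_out` observations removed
    # and compute the correlation coefficient on each one.
    n = len(xs)
    coeffs = []
    for kept in itertools.combinations(range(n), n - leave_out):
        coeffs.append(pearson([xs[i] for i in kept],
                              [ys[i] for i in kept]))
    return coeffs

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 1.9, 3.3, 3.8, 5.4, 5.7, 7.1, 8.2, 8.8, 10.3]
leave_one = jackknife_correlations(x, y, leave_out=1)  # 10 subsamples of 9
leave_two = jackknife_correlations(x, y, leave_out=2)  # 45 subsamples of 8
```

Unlike the bootstrap, the enumeration is exhaustive rather than random, which is why the jackknife needs a larger sample to produce a useful distribution.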

Examples:

Confidence Interval

Comparing Averages

Correlation Significance

Classification Importance

Send us your own example