Underrated statistical measures that you may want to start considering
I first started working with statistics when I wrote my bachelor's thesis, but looking back at the manuscript, I can see that I was lacking a basic foundation in statistics. This does not mean that the conclusions in my thesis were unacceptable; it's just that I did not fully justify the methods I used, which kind of bothers my present-day self.
The reason was not that I had never taken a statistics class. It was simply that when I took the class, I did not yet have first-hand, real experience of applied statistics. Looking back, I now know what I could have improved in my bachelor project, which measures I should have used, and so on. This realization did not come because I am now an expert in statistics; rather, it is mostly because I now have experience using statistics in real-life conditions and can make sense of the basic calculations.
So I have grown more accustomed to processing and analyzing data than I was before. But that does not mean my interest came together with knowing statistical theory by heart. So, in between applying for jobs, I started taking a basic statistics class on Khan Academy, where I realized that there are statistical measures that do not get as much attention as they deserve. In this article I propose including the median and paying more attention to the standard deviation when you describe your data.
Start including the median in your analysis
When we have our dataset, the first thing we want to do before proceeding to other steps is to understand what is happening in it. In inferential statistics, we usually deal with a sample that gives us a small picture of the population. Central tendency is a powerful way to describe our dataset and is important to assess before we do anything further. There are three important measures: the (arithmetic) mean, the median, and the mode.
The mean is the measure people most commonly use when explaining their data. However, the mean is very sensitive to outliers, that is, values that lie far from the bulk of the data. There is also the median which, despite usually being overlooked, is worth our time in the data analysis process (really, in R you just need to type median(x), and you will get it). If we compare the two, the median can give a better estimate of central tendency than the mean in the presence of outliers.
With the median, we simply sort the values in ascending order and pick the middle value (or the average of the two middle values when the number of observations is even), while with the mean, we sum all the values and divide by the number of observations. Finding the median does not involve summing the values, so it is less sensitive to very large or very small outliers. In the calculation of the mean, on the other hand, large outliers can lead to overestimation, while small outliers can lead to underestimation.
However, even if the mean is more sensitive to outliers than the median, the question is not "when to use the mean and when to use the median?". What I propose is simply to put both in your descriptive statistics.
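To illustrate the difference, here is a minimal sketch in Python using the standard library's statistics module. The numbers are made up for illustration and do not come from any real dataset.

```python
from statistics import mean, median

# Hypothetical values, invented for this example only.
typical = [21, 22, 23, 24, 25]
with_outlier = typical + [250]  # add one unusually large value

# Without the outlier, the mean and the median agree.
print(mean(typical), median(typical))

# With the outlier, the mean is pulled far upwards,
# while the median barely moves.
print(mean(with_outlier), median(with_outlier))
```

A single extreme value is enough to move the mean well away from every typical value, whereas the median stays among them.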
There are times when the median provides a better explanation, such as in asymmetrical or skewed distributions (i.e. left- or right-skewed). For example, in a right-skewed distribution, many values trail off to the right of the most typical values (the peak of the histogram). Here the median gives a middle value that most probably falls among the typical values. The mean, on the other hand, sums all the values, including the tail of values higher than the typical ones, and so ends up higher than the typical values in our dataset. In this case, the median shows the typical behavior of the sample where the mean fails to do so, even though the mean is mathematically correct.
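As a rough illustration of the right-skewed case, the sketch below uses a small invented sample in which most values sit near 30 and a few trail off to the right. The variable names and numbers are assumptions for the example.

```python
from statistics import mean, median

# Hypothetical right-skewed sample: a cluster near 30 and a long right tail.
incomes = [28, 29, 30, 30, 30, 31, 32, 35, 45, 60, 90]

avg = mean(incomes)
mid = median(incomes)

# Count how many observations are smaller than the mean.
below_mean = sum(1 for x in incomes if x < avg)

print(f"mean = {avg}, median = {mid}")
print(f"{below_mean} of {len(incomes)} values are below the mean")
```

In this sample most observations are below the mean, which is why the mean is a poor description of a "typical" value here, while the median stays inside the cluster.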
In a symmetrical distribution, such as the normal distribution, there is usually no large discrepancy between the mean and the median, although outliers can still have an influence. From my experience, however, when I dealt with outliers and used both the mean and the median, there wasn't really a large discrepancy between the two. I guess the discrepancy depends on the nature of your outliers. My threshold for calling a value an outlier is based on the quartiles: the value falls well outside the range between the lower quartile (Q1, the 25th percentile) and the upper quartile (Q3, the 75th percentile). A common convention extends this range by 1.5 times the interquartile range (Q3 - Q1) on each side. The discrepancy also depends on the number of observations (n) that are outliers and on how big or how small the outliers are. If there are many more observations with typical values than with outlying values, I guess the mean will not be very different from the median, and the mean is fine for explaining the data. The thing is, with outliers, the accuracy of your mean and median estimates can be low.
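One widely used way to flag outliers from the quartiles is Tukey's rule, which marks values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1. The sketch below assumes that rule; it is not necessarily the exact threshold described above, and the function name tukey_outliers and the data are hypothetical. Note that different tools compute quartiles slightly differently, so the fences can vary a little between Python and R.

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data with one obviously extreme value.
data = [21, 22, 23, 24, 25, 26, 27, 250]
flagged = tukey_outliers(data)
print(flagged)
```

The multiplier k is a convention rather than a law; a larger k flags fewer values.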
Either way, I guess it's better to include both in your descriptive statistics, so that you can explain your data better and do not risk over- or underestimating it. Another solution is to consider other types of mean. At the beginning of this article, I mentioned the (arithmetic) mean; it's the type of mean that we usually use, and the type that is sensitive to outliers. I recently found other types of mean that can reduce the influence of outliers and can explain the data better. I will not cover them in full, but I will mention them here and you can look them up in any statistics reference. These simple explanations are taken from an NCBI article.
- Weighted mean, where you calculate the average by giving each value a weight based on its importance.
- Geometric mean, which is the n-th root of the product of n observations or, equivalently, the arithmetic mean computed on the log scale and transformed back. This is an appropriate measure when you're dealing with a skewed distribution or when the values change exponentially.
- Harmonic mean, which is the reciprocal of the arithmetic mean of the reciprocals of the values.
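For a quick feel of how these means differ, here is a minimal sketch using Python's statistics module. The values and the weights are invented for illustration; in practice the weights would come from whatever "importance" means in your data.

```python
from statistics import fmean, geometric_mean, harmonic_mean

values = [2, 4, 8]

# Weighted mean: each value counts in proportion to its weight.
weights = [1, 1, 2]  # hypothetical importance weights
weighted = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print("arithmetic:", fmean(values))
print("weighted:  ", weighted)
print("geometric: ", geometric_mean(values))  # n-th root of the product
print("harmonic:  ", harmonic_mean(values))   # reciprocal of the mean of reciprocals
```

For positive values that are not all equal, the harmonic mean is smaller than the geometric mean, which is smaller than the arithmetic mean.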
Pay more attention to standard deviation
Standard deviation does not seem to get the attention it deserves. It's such an important measure of spread, but maybe because its calculation is a bit complicated, or because its interpretation is not straightforward, it is sometimes overlooked when a dataset is interpreted. Take salary information for a certain region as an example. What I usually see is that articles show the average salary for a certain position in a certain region but do not show the variation in salary which, if you think about it, can give a better picture of how much you should expect to earn in that position. I will explain in the following paragraphs.
Standard deviation is a measure of how far the values in a dataset typically are from the mean. In descriptive statistics, it complements the measures of central tendency, so it should accompany the median and the mean when we describe our data. Another way to interpret standard deviation: it is a measure of variation in your dataset. It shows whether your values are similar to one another or widely spread out.
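The calculation itself can be sketched in a few lines. The version below computes the sample standard deviation (dividing by n - 1) and compares two invented datasets that share the same mean; the function name sample_sd and the data are assumptions for the example.

```python
from math import sqrt
from statistics import mean, stdev

def sample_sd(values):
    """Sample standard deviation: root of the summed squared deviations over n - 1."""
    m = mean(values)
    squared_dev = [(x - m) ** 2 for x in values]
    return sqrt(sum(squared_dev) / (len(values) - 1))

# Two hypothetical datasets with the same mean but different spread.
tight = [48, 49, 50, 51, 52]
wide = [20, 35, 50, 65, 80]

print(mean(tight), sample_sd(tight))
print(mean(wide), sample_sd(wide))

# The library function should agree with the hand-written version.
print(stdev(tight), stdev(wide))
```

Both datasets have a mean of 50, so the mean alone cannot tell them apart; the standard deviation can.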
Coming back to the salary example, imagine that you are searching for the average salary of a data analyst in Paris. It turns out to be 50 000 euros/year, and you may think that's great. But then you find out that the standard deviation is 15 000 euros/year, which indicates high variation in salaries for data analyst positions in Paris. When you look into the factors, it turns out that the salary depends on the industry: some industries pay high salaries while others do not. So, with the standard deviation you get a better picture of the real conditions in a population.
Furthermore, standard deviation is an important measure for comparing two sets of data. Staying with salaries, suppose you are torn between becoming a data analyst and a data scientist. You find out that the average salary for a data analyst in Paris is 50 000 euros with a standard deviation of 15 000 euros, while for a data scientist it is 50 000 euros with a standard deviation of 2 000 euros. From this description, you can say that salaries vary more for data analyst positions than for data scientist positions, so you stand a better chance of earning close to 50 000 euros/year if you become a data scientist in Paris. However, to compare several datasets effectively, you should also pay attention to the sample size. If, for example, there are 5 000 respondents in the survey of data analyst salaries and only 200 in the survey of data scientist salaries, then, all else being equal, the estimate based on the larger sample is the more reliable one. This brings us to another measure that you need to understand when describing your data: the number of observations.
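To make the comparison between the two positions more concrete, the sketch below computes the range of one standard deviation around each mean, using the figures from the example above. If salaries were roughly normally distributed, about 68% of them would fall inside this range; real salary data are often skewed, so treat this as a rough guide. The function name one_sd_band is hypothetical.

```python
def one_sd_band(mean_salary, sd):
    """Return the (low, high) range of one standard deviation around the mean."""
    return mean_salary - sd, mean_salary + sd

# Figures from the salary example: same mean, different spread.
analyst = one_sd_band(50_000, 15_000)
scientist = one_sd_band(50_000, 2_000)

print("data analyst:  ", analyst)
print("data scientist:", scientist)
```

The same average of 50 000 euros translates into a much wider plausible range for the data analyst than for the data scientist.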
Bottom line: median and standard deviation are as important as mean
So, the mean is not the only way to describe your data. There are occasions when the median is the better choice, especially when your data is skewed, but it's always good to provide both in your data description (or descriptive statistics, or summary statistics). Moreover, complementing the measures of central tendency with measures of spread gives you a better explanation of your data.
From my experience, this is an important step before doing anything else with my data. I take my raw data and check the median, mean, and standard deviation. If there is a large discrepancy between the median and the mean, I check for outliers, and so on, until my data finally makes sense and I can move on to further analysis.
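A first pass like this can be wrapped in a small helper. The sketch below is only one possible version: the function name describe and the rule for what counts as a "large" gap between mean and median (here, more than 0.2 standard deviations) are assumptions made for illustration, not an established standard.

```python
from statistics import mean, median, stdev

def describe(values, gap_threshold=0.2):
    """Summarize a dataset and flag a large gap between mean and median."""
    avg, mid, sd = mean(values), median(values), stdev(values)
    # Hypothetical rule of thumb: a gap above gap_threshold * sd
    # suggests looking for outliers or skew before going further.
    gap = abs(avg - mid)
    return {
        "n": len(values),
        "mean": avg,
        "median": mid,
        "sd": sd,
        "check_outliers": gap > gap_threshold * sd,
    }

# Hypothetical datasets: one symmetric, one with an extreme value.
print(describe([48, 49, 50, 51, 52]))
print(describe([21, 22, 23, 24, 25, 26, 27, 250]))
```

The flag is only a prompt to look at the data more closely, for example with a histogram, not a verdict on its own.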
Previously posted in Medium blog: Firza Riany's Medium Story