Statistics in Data Science

Statistics is a broad field with applications in many industries. Wikipedia defines it as *the study of the collection, analysis, interpretation, presentation, and organization of data*. It should therefore come as no surprise that statistics plays a significant role in data science. At a minimum, data analysis requires descriptive statistics and probability theory. These concepts will help you make better business decisions from data.

Some of the more important statistics concepts used in data science include probability distributions, statistical significance, hypothesis testing, and regression. Machine learning also requires a good understanding of Bayesian thinking, a dynamic mode of reasoning in which our beliefs and conclusions may change as more data is introduced. Several machine learning methods are built on this model. Other key concepts are conditional probability, maximum likelihood, and priors and posteriors. These terms may be unintelligible if you have not studied statistics, but given enough time and a little practice, it is very possible to learn them.
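As a small illustration of conditional probability, the sketch below applies Bayes' rule to a made-up fraud-screening scenario. The function name `bayes_posterior` and all of the rates are hypothetical and chosen only to show the arithmetic.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(H | E) from P(H), P(E | H), and P(E | not H) via Bayes' rule."""
    # Total probability of seeing the evidence, whether or not H is true.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical fraud screen: 1% of orders are fraudulent, the flag fires on
# 90% of fraudulent orders and on 5% of legitimate ones.
p_fraud_given_flag = bayes_posterior(
    prior=0.01, likelihood=0.9, false_positive_rate=0.05
)
print(round(p_fraud_given_flag, 3))
```

Even with a fairly accurate flag, most flagged orders in this scenario are legitimate, because fraud is rare to begin with.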

To work out how to learn statistics for data science, it helps to start by looking at how it will be used. Here are some examples of real analyses or applications you might need to implement as a data scientist:

Your company is rolling out a new product line, but it sells through offline retail stores. You need to design an A/B test that controls for differences across geographies. You also need to estimate how many stores to include in the pilot to get statistically significant results. This may be one of the more significant and common uses of statistics in data science.
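As a rough sketch of the store-count estimate, the standard normal-approximation formula for comparing two means gives the number of stores needed per group. The effect size, standard deviation, significance level, and power below are illustrative assumptions, and the formula ignores geographic clustering, so treat the result as a starting point only.

```python
import math
from statistics import NormalDist


def stores_per_group(effect, sd, alpha=0.05, power=0.8):
    """Approximate stores needed per arm to detect a difference in mean sales.

    Uses the two-sided, two-sample normal approximation:
    n = 2 * ((z_{1 - alpha/2} + z_{power}) * sd / effect) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)


# Hypothetical numbers: detect a 50-unit lift in weekly sales when the
# store-to-store standard deviation is 120 units.
print(stores_per_group(effect=50, sd=120))
```

Halving the detectable effect roughly quadruples the number of stores required, which is why the minimum effect worth detecting is usually agreed on before the pilot starts.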

Your company needs to better predict the demand for individual product lines in its stores. Under-stocking and over-stocking are both expensive. You consider building a series of regularized regression models.
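To make "regularized regression" concrete, here is a minimal sketch of ridge regression with a single predictor, written in plain Python. The data, the penalty values, and the function name `ridge_slope` are made up for illustration. A real demand model would have many features and would typically rely on a library.

```python
from statistics import mean


def ridge_slope(x, y, penalty):
    """Slope of a one-feature ridge regression on centered data.

    Minimizes sum((y_i - b * x_i) ** 2) + penalty * b ** 2 after centering
    x and y, which gives b = Sxy / (Sxx + penalty).
    """
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / (sxx + penalty)


# Hypothetical data: weekly promotions (x) vs. units sold (y).
promos = [0, 1, 2, 3, 4]
units = [10, 14, 15, 21, 22]

for penalty in (0, 1, 10):
    print(penalty, round(ridge_slope(promos, units, penalty), 3))
```

A penalty of 0 reproduces ordinary least squares. Larger penalties shrink the slope toward zero, trading a little bias for lower variance.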

You have multiple machine learning model candidates you're testing. Several of them assume specific probability distributions of input data, and you need to be able to identify them and either transform the input data appropriately *or* know when underlying assumptions can be relaxed.
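One common version of this check is to measure skewness and see whether a log transform brings right-skewed inputs closer to symmetric. The sketch below uses simulated log-normal data and a hand-written `skewness` helper. Both are assumptions for illustration, not part of any particular model's requirements.

```python
import math
import random
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])


random.seed(0)  # fixed seed so the run is repeatable
raw = [random.lognormvariate(0, 1) for _ in range(5000)]
logged = [math.log(v) for v in raw]

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

The raw values are strongly right-skewed, while the logged values are close to symmetric, which is what a model assuming roughly normal inputs would want.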

A data scientist makes hundreds of decisions every day. They range from small ones, like how to tune a model, to big ones, like the team's R&D strategy.

One of the philosophical debates in statistics is between Bayesians and frequentists. The Bayesian side is more relevant when using statistics in data science.

In a nutshell, frequentists use probability only to model sampling processes. This means they only assign probabilities to describe the data they've already collected.

On the other hand, Bayesians use probability to model sampling processes and to quantify uncertainty before collecting data. If you'd like to learn more about this divide, check out the Quora post "For a non-expert, what's the difference between Bayesian and frequentist approaches?"

In Bayesian thinking, the level of uncertainty before collecting data is called the prior probability. It's then updated to a posterior probability after data is collected. This concept is central to many machine learning models, so it's important to master.
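A minimal sketch of this prior-to-posterior update uses a Beta prior on a conversion rate with binomial data, a textbook conjugate pair. The prior parameters and the observed counts are hypothetical.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Prior belief: the conversion rate is around 10% (Beta(2, 18)), held loosely.
prior = (2, 18)

# Hypothetical data: 30 conversions out of 100 visitors.
posterior = update_beta(*prior, successes=30, failures=70)

print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", round(beta_mean(*posterior), 3))
```

The posterior mean lands between the prior belief and the observed rate, and it moves closer to the data as more observations arrive.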

Again, all of these concepts will make sense once you implement them.

If you want to learn statistics for data science, there's no better way than playing with statistical machine learning models after you've learned core concepts and Bayesian thinking.

The statistics and machine learning fields are closely linked, and "statistical" machine learning is the main approach to modern machine learning.