Common Mistakes Amateur Data Scientists Make

This article, honestly, is rather funny for me to write. That's mostly because, if anything, I am personally an amateur data scientist. So given my apparently high (I'm kidding, by the way) level of credibility on the subject, let me share some of my research on the mistakes we amateur data scientists tend to make and how to avoid them. Specifically, I'll go into three big ones.

The first mistake is skipping the math. Data science is literally the science of data, and I'm sure you all know science is no joke. It's the process of questioning, understanding, and trying to grasp the depths of how things work. This is an analytical field, so immediately we're talking mathematical maturity. You need to be able to use math effectively so you can understand nature, and in our case, data. Let me give you an example from a field very near and dear to me: physics.

Consider the following problem. I take a bar of metal that sits at room temperature and touch it to a really hot bar. Qualitatively, this is pretty simple! Over time, the bar gets warm and reaches some equilibrium temperature. You know what's really crazy? Describing this process takes one of the more complicated objects in mathematics: a partial differential equation (the heat equation, to be specific). Its solutions describe how the temperature evolves at every point along the bar over time, even though it's literally just a bar touching a hot one.
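To make that a little more concrete, here is a minimal sketch of the bar example in Python. It steps a simple finite-difference approximation of the one-dimensional heat equation, u_t = alpha * u_xx, forward in time. Everything specific here is my own assumption for illustration: the function name simulate_bar, the temperatures (20 for room temperature, 100 for the hot bar), the number of cells, and the choice to treat the far end of the bar as insulated.

```python
def simulate_bar(n_cells=20, hot_temp=100.0, room_temp=20.0, alpha=0.2, steps=5000):
    """Explicit finite-difference sketch of u_t = alpha * u_xx on a 1D bar.

    Assumptions (all illustrative): the left end is held at hot_temp because it
    touches the hot bar, the right end is insulated, and alpha is the
    dimensionless step factor (diffusivity * dt / dx**2). This explicit scheme
    is only stable for alpha <= 0.5.
    """
    u = [room_temp] * n_cells
    u[0] = hot_temp  # the end touching the hot bar
    for _ in range(steps):
        new = u[:]
        for i in range(1, n_cells - 1):
            # Each cell drifts toward the average of its two neighbours.
            new[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
        new[-1] = new[-2]  # insulated end: no heat flows out
        u = new
    return u


if __name__ == "__main__":
    for steps in (10, 500, 5000):
        temps = simulate_bar(steps=steps)
        print(steps, [round(t, 1) for t in temps[::5]])
```

Running it with more and more steps should show the heat creeping down the bar until the whole thing settles near the hot temperature, which is the "gets warm and reaches equilibrium" story from above, just with numbers attached.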

In that light, knowing how to deal with the math behind data is essential. As Analytics Vidhya eloquently puts it, "You should get to know how techniques work before you apply them in a problem. Learning this will help you understand how an algorithm works, what you can do to fine tune it, and will also help you build on existing techniques. Mathematics plays an important role here so it's always helpful to know certain concepts. In a day-to-day corporate data scientist role you may not need to know advanced calculus, but having a high-level overview definitely helps." Specifically, knowing the fundamentals of linear algebra, calculus, statistics, and probability will (as I've heard) take you a long way in the field. You can learn a good deal of it from Khan Academy, Gilbert Strang's linear algebra books, or MIT OpenCourseWare (personal favorite!).

So that's one. The next is a classic: using ML mindlessly. That means focusing only on the accuracy of a model and ignoring the interpretability and intuition that should come with it. Let me give you an interesting example.

During the Second World War, planes were shot down. We know that. To use resources effectively, the Allies analyzed which areas of the returning planes had taken the most hits, and the natural plan was to reinforce those areas before battle. However, a very smart statistician, Abraham Wald, argued that the cockpit, engines, and back were the parts that mattered more. Why, when the data seemed to say otherwise? Almost like an ML model that one could train and tell to "reinforce parts," the data was biased! It only covered the planes that made it home, and those were, by definition, the ones that survived their hits. The real story was in the planes that never returned! As Wald reasoned, a plane hit in the back, engine, or cockpit would not survive, so those hits rarely showed up in the data at all. By using intuition, some common sense, and looking outside the given data, he was able to correct for the bias. This kind of selection bias is often called survivorship bias.
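If you want to see that bias appear in numbers, here is a small simulation sketch of the story above. To be clear, nothing in it is historical data: the section names, the loss probabilities in LOSS_PROB, and the function name simulate are all made up for illustration. Every section of the plane is equally likely to be hit, but hits to the back, engine, and cockpit usually bring the plane down, so they rarely appear among the survivors.

```python
import random

SECTIONS = ["wings", "fuselage", "back", "engine", "cockpit"]

# Hypothetical chance that a single hit to each section downs the plane.
# These numbers are invented for the sketch, not taken from any record.
LOSS_PROB = {"wings": 0.05, "fuselage": 0.05, "back": 0.8, "engine": 0.8, "cockpit": 0.8}


def simulate(n_planes=10_000, hits_per_plane=3, seed=0):
    """Count hits per section over all planes and over survivors only."""
    rng = random.Random(seed)
    all_hits = dict.fromkeys(SECTIONS, 0)
    survivor_hits = dict.fromkeys(SECTIONS, 0)
    for _ in range(n_planes):
        # Every section is equally likely to be hit.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every one of its hits.
        survived = all(rng.random() > LOSS_PROB[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                survivor_hits[s] += 1
    return all_hits, survivor_hits


if __name__ == "__main__":
    all_hits, survivor_hits = simulate()
    for s in SECTIONS:
        print(f"{s:9s} all planes: {all_hits[s]:5d}   survivors only: {survivor_hits[s]:5d}")
```

Looking only at the "survivors only" column, the wings and fuselage appear to take most of the damage, which is exactly the trap. The "all planes" column, which nobody on the ground could actually observe, shows the hits were spread evenly.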

And that's an example of looking beyond accuracy numbers! Data isn't just numbers; it represents real results. Remembering that, and truly interpreting the data, can save a lot of work and pain and make the whole process more effective and useful. So, look beyond the numbers.

Although there are plenty more, I'll highlight one last mistake: poor communication. Unfortunately, the world doesn't totally speak data science. You can ramble about your groupbys, your SVMs, and your SGD models, but not everyone gets that. At the end of the day, the purpose of our work is to solve problems. That means understanding them conceptually and, more importantly, being able to present them to the people who care.

So what can you do to stay away from that? Well, as Analytics Vidhya once again advises, try explaining your work to someone non-technical! Show them what everything really means. Nobody outside of your field really cares about how you've reset the index or changed data types. So be personal, be real, and dive into meaning.

And that's pretty much all from me! Obviously, there's a lot more out there, so feel free to roam the interwebs for advice: it's got plenty.