Balance Between Code and Statistics

Posted by Steven Jasper on October 15, 2020

Originally, my choice to join a coding bootcamp was a tough one; I could not decide where I wanted my career to go. With my background in computer science and mathematics, I figured data science would be a good fit, even though my initial understanding of it was next to nothing. I knew it involved some math sprinkled with coding, but when I heard "Data Science" I always thought of Big Data and top tech companies such as Amazon, Apple, and Microsoft. This understanding was quickly altered when I began the Flatiron program: I learned that there is an almost harmonic relationship between the mathematics behind machine learning algorithms and the craft of software development.

If we look at mathematics over the last 40 years, computers have begun to take on a massive role in allowing more complex algorithms to be run. With vast improvements in computational power, more and more data can be used in calculations, and this leads to an interesting outcome: better predictions about, and a more accurate picture of, the world around us. Furthering this role, fantastically easy-to-use tools such as Python, scikit-learn, SciPy, and many more have revolutionized the way we look at mathematics and statistics.

It is relatively simple to use highly intricate libraries such as scikit-learn to create a simple linear regression model, or some other more complex model; however, with the simplicity of the code comes the complexity of the math. In fewer than 100 lines of code you can take a relatively messy dataset, clean it, explore it, and model it with these libraries. But does that actually allow anyone to obtain meaningful information from the data? I do not believe so. Without at least a working knowledge of what these algorithms are actually doing, it becomes difficult to decipher what their results mean. Yes, you could just look at the array of different scores that come out of the statistical calculations, but when using something like a KNN algorithm, you could pull out useful information by understanding WHY certain points fall within the distance parameters you set. Are there hidden patterns your ML algorithm has discovered? Is there any bias behind your model? By just coding it and not understanding the statistics behind it, you may never see where underlying biases exist.
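To make the "few lines of code" point concrete, here is a minimal sketch of that clean-then-model workflow, plus a peek at the neighbor distances a KNN-style method actually computes. The dataset here is synthetic and invented purely for illustration; a real project would substitute its own data and feature names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)

# "Messy" synthetic data: a linear signal plus noise, with some missing targets.
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1.0, size=200)
y[::25] = np.nan  # simulate missing values

# Clean: drop rows whose target is missing.
mask = ~np.isnan(y)
X, y = X[mask], y[mask]

# Model: a handful of lines hides all of the least-squares math.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2: {r2:.3f}  slope: {model.coef_[0]:.2f}  intercept: {model.intercept_:.2f}")

# Peeking behind a distance-based method: which training points are "near"
# a given test point, and how near? This is the kind of detail the scores
# alone never show you.
nn = NearestNeighbors(n_neighbors=3).fit(X_train)
distances, indices = nn.kneighbors(X_test[:1])
print("nearest-neighbor distances for one test point:", np.round(distances[0], 3))
```

The library calls make the fit look trivial, but interpreting the slope, the R² score, or those neighbor distances still requires knowing what the underlying statistics mean.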

I also believe that data science is a breeding ground for people entering the field to experience the Dunning-Kruger effect. When first entering the field and seeing how easy it can be to create a fairly well-performing linear regression model, one may become incredibly confident about their future career as a data scientist. Riding that high level of confidence, they begin to dive deeper into more complex models, such as neural networks and other deep learning methods. This is where the downhill portion of the Dunning-Kruger effect kicks in: the deeper you dive, the more you tend to get overwhelmed by complexity, and only pushing through and practicing lets you come out the other side. The lowest point of confidence coincides with the highest complexity, and until you understand this you are at a higher risk of giving up. If you want to read more about the Dunning-Kruger effect, you can see more here.

In essence, data science is a beautiful dance between computer science, mathematics, and business. I did not speak much to the business portion, as it can vary even more widely than the mathematics. However, you cannot become a successful data scientist without a working knowledge of all three components of this dance.