Data Science is considered one of the most modern and fascinating jobs of our time. It can be fun and satisfying, but is it really what it’s described to be?
In this article, I’ll show you the reality of a Data Scientist’s life.
What you think it is
At the beginning of their careers, Data Scientists think that Data Science is a wonderful, magical world full of algorithms, Python functions that perform every possible spell in a single line of code, and statistical models able to detect the most useful correlations in data, the kind that could make you an invincible superhero in your company. You start dreaming about your CEO congratulating you and shaking your hand. You begin to see decision trees and clusters everywhere and, of course, the most terrifying neural network architectures your mind can dream up.
But from the very first day of your first Data Science project, you start to realize what reality is.
What it really is
Expectation for results
Managers often think that Data Science is the Holy Grail of information technology. They have huge expectations about it and they want them to be satisfied here and now.
In reality, results are very difficult to achieve and take a lot of time. Sometimes a result can’t be reached at all. Think about clustering, for example. You can spend an entire lifetime searching for a clustering pattern that simply doesn’t exist in your data. Most managers don’t understand this, and it can be very stressful for you and your team.
The only thing better than a good algorithm is an explainable algorithm. Never forget this. No sane manager in the world would follow an unknown algorithm for managing their company’s money only because its AUROC is greater than 95%. Managers need to understand algorithms and figure out how they reason about data, and making that possible is often a major task for a Data Scientist. Explaining algorithms to somebody with no scientific background can be quite difficult, but it’s very common in large companies and you must face this fact.
Most of the time you’ll find yourself trying to erase that awful question mark on your boss’s face, simplifying as much as possible to help them understand your results. Remember: if you can’t explain your results, managers will start to ask themselves whether you are useful to the company or not.
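One way to make a metric like AUROC less mysterious is to describe it as a probability: the chance that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The sketch below computes it exactly that way on a tiny, made-up dataset. The function name, the labels and the scores are all assumptions for illustration, not taken from a real project.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random positive example is scored
    higher than a random negative one (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive score with every negative score.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = event happened, 0 = it did not.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

# 8 of the 9 positive/negative pairs are ranked correctly.
print(round(auroc(y_true, y_score), 3))  # 0.889
```

The pairwise loop is quadratic, so it only suits small examples, but it is usually easier to explain on a slide than the area under a curve.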
You’ll spend a lot of time interviewing product owners and ICT professionals to understand the information hidden inside the business data they know or produce. There’s no way you can make it without their help.
Data often comes from complex and heterogeneous systems, which usually means lines and lines of log files that you need to understand. Data isn’t everything; information is everything. Never forget this. Information is buried inside data and you’ll need somebody to tell you where to dig.
The larger the company, the more difficult it is to find the right people to interview. When you finally do, their answers will generate more questions, and these people may not have enough time for you and your “nerdy stuff”.
You’ll find yourself using data visualization more often than you would ever have imagined. Charts, slides and other graphical tools will be like silver bullets in your shotgun. Maybe you have magic formulas in your mind, graphs and so on. Forget about them. Data Science is told through graphical representations, and it’s often difficult to find the visualization technique that suits your audience.
Then there are deadlines. We are slaves in a world of deadlines and expectations. When you were a software engineer you had milestones in your plan and you weren’t allowed to slip by a second. In Data Science, things aren’t any easier.
There are deadlines and milestones even in Data Science, and they carry a particular difficulty: Data Science is very close to academic research, so it doesn’t fit well into the classical, waterfall ICT project management style. Instead, an Agile framework (e.g. Scrum or Kanban) should work well, thanks to its inherent ability to adapt quickly to change. But Agile is difficult to teach to managers. It can give them the false idea that there’s no clear delivery date, and this is very difficult for companies to accept.
Algorithms and programming
And finally, the fun part. Python, R, KNIME, reading scientific papers, optimization algorithms, cross-validation and so on. The technical, nerdy, real fun is a very small part of the work and takes up very little time over the whole project lifetime. Maybe you have already lost your enthusiasm in the previous phases, before writing your first line of code, and things no longer seem as fun as you thought at the beginning.
What’s the best way to do Data Science?
In my experience, I can answer with a single word: Agile. There’s no need to complete all of the business understanding before writing your first line of Python code. Start with a simple business understanding of a small piece of data, explore it, visualize it and begin with a simple model. Produce the first quantifiable results week by week, keeping your customers constantly engaged in the process. Deliver small results at a constant rate and, please, don’t fall into the waterfall trap.
Simplicity is the key. Never forget it. Start with the simplest things possible and add a small piece of complexity only if needed.
There’s a psychological sense of relief in constant, small results, and this is another weapon you have to use if you want to survive in the jungle of company deadlines and business processes. This way, every colleague involved in your project will see your difficulties first-hand and start to understand how hard Data Science is.
Remember, companies still think about Data Science as an ICT branch; they are not completely wrong, but they shouldn’t expect you to follow the waterfall approach. So you have to take on the struggle of guiding your company toward an Agile way of thinking.
Concerning the explanation part of the job, I prefer to start with the simplest machine learning model possible: k-nearest neighbors. It’s very easy to understand. You only need paper, a pencil and a Cartesian plane with some points drawn on it. That’s it. If it produces very nice results, everybody will finally see you as the great business partner you think you are.
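The paper-and-pencil idea translates almost directly into code. Here is a minimal sketch of k-nearest neighbors using only the Python standard library, assuming two-dimensional points and Euclidean distance. The function name, the toy points and the "low"/"high" labels are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_predict(points, labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Rank training points by their distance to the query and keep the k closest.
    nearest = sorted(range(len(points)), key=lambda i: dist(points[i], query))[:k]
    # The most common label among those neighbors wins the vote.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two groups of points on a Cartesian plane.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.5)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, (2.0, 2.0)))  # low
print(knn_predict(points, labels, (6.0, 7.0)))  # high
```

Everything the model does is visible in two lines: measure distances, then count votes. That is what makes it easy to walk a non-technical audience through a prediction.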
If KNN doesn’t work, then you can use regressions and decision trees, which are very easy to explain, followed by their ensembles (random forests, gradient boosted tree classifiers and so on), which take a little more effort, or Bayesian networks, which have a very useful graphical representation.
Finally, visualize. Visualize everything. Ask your boss to buy you a course in data visualization, learn as much as possible about the best visualization techniques and, please, remember to avoid pie charts. They are pretty useless and misleading. If you provide a simple scatter or bar plot, people will grasp all the relevant information.
Simple results are the best ones. Some days ago, my team and I presented the results of a time series analysis using only three slides: high-level KPIs describing the business phenomenon, a confusion matrix and some performance metrics. Our audience was enthusiastic from the first slide, simply because we started with clear numbers explaining the business in a simple way. In many situations, a small building block can really save your life.
Data Science is an exciting job, but it can be very difficult to perform if you speak to a non-technical audience. Data and business are intimately related to each other and you must remember this when you work with business-oriented people. The only way to survive is to find a middle ground between a data-driven bottom-up approach and a business-driven top-down approach.
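As an illustration of how little is needed for that kind of slide, the sketch below builds a binary confusion matrix and three common metrics using plain Python. The labels and predictions are invented for the example and are not the results from the presentation described above. The function names are also assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def basic_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall, returning 0.0 when a ratio is undefined."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy data: 1 = anomaly, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 5
print(basic_metrics(tp, fp, fn, tn))
```

Four counts and three ratios fit comfortably on a single slide, and each of them can be explained in one sentence.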
Finally, as Data Science is hard and time-consuming, delivering small results at a constant rate is the only way you can keep your customers engaged.
Originally published on Towards Data Science.