Data science is a wonderful job because it mixes several different skills. However, while data scientists usually follow (and sometimes create) innovation, some “good old” skills are still useful in this line of work. One such skill is SQL.
Are you really working with Big Data?
The buzzword “Big Data” has been around for years, but let’s face it: data scientists rarely work with true Big Data. They usually work with tabular data taken from relational databases like MySQL, PostgreSQL, Teradata, Oracle or MS Access (yes, it happens). All of these technologies share a common language: SQL. So, if you want to explore such databases and build your training datasets, you must know SQL very well.
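As a small sketch of what “exploring a database with SQL” looks like in practice, here’s a toy example using Python’s built-in sqlite3 module as a stand-in for MySQL or PostgreSQL (the table and data are made up for illustration):

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL, PostgreSQL, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Alice", "IT"), ("Bob", "FR"), ("Carol", "IT")],
)

# A typical exploratory query: filter and project with plain SQL
rows = conn.execute(
    "SELECT name FROM customers WHERE country = 'IT' ORDER BY name"
).fetchall()
print(rows)  # [('Alice',), ('Carol',)]
```

The same SELECT would run almost unchanged against any of the engines mentioned above, which is exactly why the language is worth mastering.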
Relational databases are the most common type of corporate database and dozens of legacy systems rely on such technologies. No company will ever switch to Big Data in one day, so SQL is still a very important tool for a data scientist who wants to work hard where data really is.
Yes? Ok, no problem
If you work with non-relational databases like MongoDB or CosmosDB, you can still analyze the results of your aggregations using SQL. Almost all relational database engines now implement specific functions for working with JSON data. Think about the powerful “->>” operator in MySQL and PostgreSQL, which lets you navigate into a JSON object and extract values according to a JSON path. Working with JSON documents in SQL has never been easier than it is today.
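To make this concrete without a running MySQL or PostgreSQL instance, here’s a sketch using SQLite, whose json_extract() function plays roughly the same role as the “->>” operator (the table and JSON payload are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "INSERT INTO events (payload) VALUES (?)",
    ('{"user": {"name": "Alice", "age": 30}}',),
)

# json_extract() with a JSON path: the SQLite analogue of MySQL/PostgreSQL "->>"
name = conn.execute(
    "SELECT json_extract(payload, '$.user.name') FROM events"
).fetchone()[0]
print(name)  # Alice
```

In PostgreSQL the equivalent expression would look like `payload ->> 'user'` chained with further extraction, but the idea is identical: a JSON path query evaluated inside plain SQL.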
Technologies like Apache Spark can expose NoSQL data as SQL views, using a distributed computing engine (often deployed alongside Hadoop) that speeds up computation. So, working with JSON (and other data formats) in SQL is not a real problem.
The same concept applies to formats like Parquet and CSV. Spark allows merging SQL tables, Parquet files, JSON documents and CSV tables using a common SQL language. That’s very useful for data scientists because by mastering a single language they can manipulate datasets of different types.
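Spark does this at cluster scale, but the core idea, many formats flowing into one SQL vocabulary, can be sketched in miniature with the standard library: load a CSV snippet and a JSON document into tables, then query both with a single statement (all data below is invented for illustration):

```python
import csv
import io
import json
import sqlite3

# Toy stand-ins for a CSV file and a JSON document
csv_text = "id,city\n1,Rome\n2,Paris\n"
json_text = '[{"id": 1, "score": 0.9}, {"id": 2, "score": 0.4}]'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (id INTEGER, city TEXT)")
conn.execute("CREATE TABLE scores (id INTEGER, score REAL)")

for row in csv.DictReader(io.StringIO(csv_text)):
    conn.execute("INSERT INTO cities VALUES (?, ?)", (int(row["id"]), row["city"]))
for rec in json.loads(json_text):
    conn.execute("INSERT INTO scores VALUES (?, ?)", (rec["id"], rec["score"]))

# One SQL query now spans both original formats
merged = conn.execute(
    "SELECT c.city, s.score FROM cities c JOIN scores s ON c.id = s.id ORDER BY c.id"
).fetchall()
print(merged)  # [('Rome', 0.9), ('Paris', 0.4)]
```

With Spark the loading step is a one-liner per format and the query runs in parallel, but the data scientist’s mental model is the same: everything becomes a table you can join.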
Data analysis is better with SQL
Have you ever tried writing a query against a NoSQL database? It’s pretty hard to get a result in a decent amount of time. With a SQL database, you can easily perform aggregations by writing just a few lines of code. That’s because NoSQL databases are made to store data, not to analyze it. Data scientists need to analyze data, and that’s exactly what SQL databases are efficient at. You can do almost everything in SQL and extract KPIs and insights more easily than with a NoSQL database. So, SQL is a must-have for anyone who analyzes data.
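“A few lines of code” is not an exaggeration. Here’s a minimal sketch of a KPI computed with a single GROUP BY, again using sqlite3 with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("book", 10.0), ("book", 15.0), ("pen", 2.0)],
)

# Total revenue per product: one aggregation, one line of SQL
totals = dict(conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product"
).fetchall())
print(totals)  # {'book': 25.0, 'pen': 2.0}
```

Doing the same on a raw document store would typically mean writing an aggregation pipeline or application-side loop by hand.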
If you need to apply special, custom logic to your data (for example, the prediction of a machine learning model), you can extend SQL with User-Defined Functions (UDFs). A UDF is a custom function, written by the user, that adds new operations the database can perform. UDFs can be written in several programming languages and let you do almost anything you want.
So, instead of creating a complex program to analyze a NoSQL database, you can just write your custom UDF and use it in a simple relational database. The effort will be minimal and the results will come faster.
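As a sketch of the idea, sqlite3 lets you register any Python function as a SQL UDF via create_function(); the sigmoid below is a stand-in for where a real model’s predict() could be plugged in:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_scores (raw REAL)")
conn.executemany("INSERT INTO raw_scores VALUES (?)", [(0.0,), (2.0,)])

def sigmoid(x):
    """Toy stand-in for a model's prediction function."""
    return 1.0 / (1.0 + math.exp(-x))

# Register the Python function so SQL queries can call it by name
conn.create_function("sigmoid", 1, sigmoid)

preds = [r[0] for r in conn.execute("SELECT sigmoid(raw) FROM raw_scores")]
print(preds)  # [0.5, 0.8807...]
```

PostgreSQL (PL/pgSQL, PL/Python), MySQL and most commercial engines offer their own UDF mechanisms with the same spirit: push the custom logic to where the data already lives.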
Joins are crucial
Data scientists usually have to merge data coming from different sources. Doing this with NoSQL databases can be tricky and time-consuming. With SQL, you can perform joins natively without writing much code. Database engines are heavily optimized to join data as fast as possible, so knowing SQL is a must for any data scientist who needs to join data.
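For instance, keeping every user even when they have no matching orders, a classic data-prep need, is a single LEFT JOIN in SQL (toy tables below, via sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, name TEXT);
CREATE TABLE orders (user_id INTEGER, total REAL);
INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders VALUES (1, 30.0), (1, 12.5);
""")

# LEFT JOIN keeps users with no orders; COALESCE turns the missing sum into 0
rows = conn.execute("""
    SELECT u.name, COALESCE(SUM(o.total), 0) AS spent
    FROM users u LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.id ORDER BY u.id
""").fetchall()
print(rows)  # [('Alice', 42.5), ('Bob', 0)]
```

Reproducing this by hand over two document collections means nested loops or a multi-stage pipeline; in SQL it’s declarative and the engine picks the join strategy.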
Joining data in NoSQL databases can be quite hard, and common libraries usually work around this by converting NoSQL data into tabular form (for example, by turning a JSON document into a pandas DataFrame).
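That JSON-to-DataFrame conversion is a one-liner with pandas’ json_normalize, which flattens nested documents into columns (the documents below are invented for illustration):

```python
import pandas as pd

# Nested JSON documents, as they might come from a document store
docs = [
    {"id": 1, "user": {"name": "Alice", "country": "IT"}},
    {"id": 2, "user": {"name": "Bob", "country": "FR"}},
]

# Nested keys become dotted column names: id, user.name, user.country
df = pd.json_normalize(docs)
print(df.columns.tolist())  # ['id', 'user.name', 'user.country']
```

Once the data is tabular, it can be joined with `DataFrame.merge` or loaded into a relational database and joined with plain SQL.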
When does SQL fail?
SQL falls short when we want to work with images, sounds or, generally speaking, unstructured data. We could store an image inside a BLOB field of a table, but it’s difficult to analyze without a computer vision model.
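A short sketch makes the limitation visible: storing raw bytes in a BLOB column is trivial, but all SQL can give you back is the bytes themselves (the “image” here is just a fake four-byte header, for illustration):

```python
import sqlite3

# Storing raw bytes (e.g., an image file) in a BLOB column is easy...
fake_image = bytes([0x89, 0x50, 0x4E, 0x47])  # the first bytes of a PNG header
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER, data BLOB)")
conn.execute("INSERT INTO images VALUES (?, ?)", (1, fake_image))

# ...but SQL cannot "see" inside it: a query only returns the bytes unchanged
blob = conn.execute("SELECT data FROM images WHERE id = 1").fetchone()[0]
print(blob == fake_image)  # True
```

Any actual analysis of that content has to happen outside SQL, in a vision or audio model.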
Moreover, SQL needs a well-structured database to fully express its power. Many operational databases are not designed for analytical purposes, so there’s always a data preparation phase, which can be difficult and tedious. Although SQL is able to perform data preparation, you’ll often have to write a lot of code to shape your dataset to your needs.
SQL is still a relevant topic in the modern landscape of Data Science. It’s not so difficult to learn, but mastering it requires practice and patience. The results will come and will be satisfying, bringing value to each data science project.
If this language is still a problem for you, book a call with me and let’s talk about it.