A quick way to get laughed out of your first data science job interview is to walk in and start talking about S-Q-L. To those in the know, the acronym for Structured Query Language is pronounced sequel.
This isn’t just pronunciational laziness… the original version was actually named SEQUEL, for Structured English QUEry Language. Invented by researchers at IBM in the early 1970s, SQL started off as a basic high-level language for working with databases but in popular usage, it’s become much more than that.
SQL provided the first implementation for the revolutionary relational database data storage model, a method of preserving relationships between discrete items of data that formed the underlying foundation to the technological revolution. As The Next Web pointed out, just about every single computer or device application that most of us use daily relies on a database.
It is in the analysis of the items in those databases that data scientists make their living. Going into the field of data science without a working knowledge of SQL is like going into ditch digging without knowing how to use a shovel.
Solving the Data Problem: Relationships Are Key
Human beings have been storing data for thousands of years. Some of the earliest known examples of cuneiform writing, invented by the ancient Sumerians, is found on tablets recording the distribution of beer among workers… a function that would certainly be handled in a relational database today.
But like most written records, the Sumerian cuneiform is essentially free-form, unordered data. An analyst is forced to read and understand the recording process before coming to any conclusions about the information so stored. If Eshkar had two bread beers on one day and Minesh had three over two days, there was no easy way to total how many bread beers had been consumed or who had them other than by reading completely through the records.
In 1970, E.F. Codd, an English computer scientist with IBM, realized that if one stored the data discretely, but included keys describing the relationships between certain pieces, logical query statements could retrieve it.
SQL is the most popular language in which those tables, relationships, and queries are conveyed to the computer. It is an integral part of most Relational Database Management Systems (RDBMS), or, simply, what most people just think of as a “database.”
Let a Thousand Flowers be SELECTed: SQL Standards Vary
Although there is a formal definition for SQL (ISO/IEC 9075:1-2003, if you are up for a little light reading), in practice almost every platform that adopts the language implements it with slightly different quirks. SQL is essentially a declarative language in the ISO specification; restricted to stating computational logic without any of the flow control statements such as IF-THEN that programmers find so useful.
So useful, in fact, that most SQL implementations are altered to include such procedural statements. Unfortunately, since they are not included in the specification, they are often implemented quite differently from platform to platform. Popular, or at least commonly used, implementations are:
Due to these disparities, SQL today is more often thought of in terms of the specific implementations associated with individual or subclasses of RDBMS, than as a distinct, inviolate language. Transact-SQL is the Microsoft/Sybase flavor of the language used in SQL Server, while PL/SQL is used on Oracle platforms along with SQL-PSM by MySQL and most other open-source RDBMS.
Although the basics will be consistent from platform to platform, data types and more advanced statements may differ in function. Data scientists in the field will have to be careful about noting the RDBMS source when writing SQL code, unless (for example), they attempt to use a FULL JOIN – a statement that works perfectly well for SQL Server – in a query for MySQL, which doesn’t support it.
Using SQL For Data Analysis
Although some of the largest and fastest data stores in the world are getting away from the RDBMS backend, chances are high that the average data scientist will spend most of his or her career pulling information out of a SQL database for analysis. Creating a SQL query will often be the very first step taken in any sequence of evaluation. A 2014 post on Revolutions, an R language blog, shows that SQL is far and away the most common tool data scientists use.
Queries, as suggested in the name, form the basis of SQL use. The basic SELECT statement defines what information, with what filters in what order, the data scientist wants to retrieve to answer a specific question, or as the first step in a much longer sequence of analysis.
Quite a lot of basic manipulation can happen within the SQL framework, but the language can also be used in concert with other popular data analysis tools for even more powerful examination. Microsoft offers R Services for SQL Server that allow programmers to use R directly with SQL databases. Going in the other direction, an R package called sqldf allows R users to write SQL into their R projects to query R data frames. And, as described in this April, 2015 Data Science Central post, many data scientists are opting for the Dagwood approach and throwing together Python, R, and SQL for more power and flexibility.
And even outside the RDBMS framework, SQL is finding traction for data analysis.
- Apache’s HIVE builds a SQL-like interface on top of Hadoop datasets
- Teradata offers SQL-Mapreduce to map SQL statements onto MapReduce functions
- Cloudera’s Impala uses a similar process to allow SQL to query unstructured file systems or almost any ODBC (Open DataBase Connectivity) data source