At Facebook, data scientists pull a massive chunk of post data out of the company’s galactic-sized Hadoop data warehouse cluster. They want to know what percentage of a user’s friends see that person’s posts over a certain time range. With millions of users, aggregating this data and plotting it on a chart could take weeks… but instead, they turn to a little-known statistical programming language called “R” and zip out a few lines of code that generate the plot in minutes.
At Google, at Pfizer, at Bank of America, at Shell Oil, their compatriots are facing similar challenges and are turning to the same tool: the R programming language.
“R” only seems like a funny name for a language until you realize that more than half the alphabet has been used up for one-letter programming language names. And when you learn that “R” is just an implementation of another language called “S”, well, the logic just all falls together.
Statistically oriented programming languages such as R and S are rarely encountered outside the cloisters of data science. Within those precincts, however, R has become the analytical programming tool of choice for data scientists in every industry from insurance to banking to marketing to pharmaceutical development.
For data scientists, R offers a multitude of features making statistical analysis of large data sets simple:
- Linear and non-linear modeling
- Time-series analysis
- Clustering
- Easy extensibility and interfaces to other programming languages
- Sizable shared code package repository
R has a strong Integrated Development Environment (IDE) available in RStudio and is accessible from a number of scripting languages widely used in the data science community, including Python. R is also free, both in the financial and intellectual property sense: it is an official part of the GNU project and is distributed under the GNU General Public License. It will compile and run on almost any popular modern operating system.
Anyone considering a career in data science is going to need more than a passing familiarity with R.
R is the Right Tool When the Job is Data Analysis
Computers run on numbers and, at heart, every programming language is just shuffling numbers around in increasingly complex sequences until something useful happens, like making the “Like” counter increment on your latest Facebook post.
But certain languages are designed for, and easier to use with, certain tasks. And R is all about data manipulation and visualization.
The language has been around since the early 1990s, when Ross Ihaka and Robert Gentleman began writing the first version at the University of Auckland in New Zealand. S, on which R is based, was designed at Bell Labs in the mid-1970s as a more interactive alternative to performing data manipulation with Fortran. But like most languages of that era, S had a clunky syntax and didn’t easily implement objects.
R is an implementation of S that borrows many features from Scheme, another 1970s-vintage language, whose heritage is rooted in Lisp, the other main early branch of programming languages. From Scheme, R adopts lexical scoping, under which a function looks up its variables in the environment where it was defined, and a more object-friendly syntax.
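Lexical scoping is easier to see than to describe. The sketch below is written in Python rather than R, since Python follows the same rule; the names `make_counter` and `increment` are hypothetical and chosen only for illustration. The inner function keeps using the `step` and `count` from the place it was defined, even though a different `step` exists where it is later called.

```python
def make_counter(step):
    # `count` and `step` belong to the scope in which `increment` is defined.
    count = 0

    def increment():
        # Under lexical scoping, `count` and `step` resolve to the
        # enclosing make_counter() call, not to the caller's variables.
        nonlocal count
        count += step
        return count

    return increment


step = 100  # a global with the same name; the closure does not use it
by_two = make_counter(2)
first = by_two()
second = by_two()
print(first, second)
```

Under dynamic scoping, the calls would have picked up the global `step` of 100 instead. In R, the same pattern is typically written as a function that returns another function.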
Other high-level programming languages are entirely capable of implementing the features and functions of R, but all require additional coding to do so. With R, almost every tool a data scientist might need to manipulate and evaluate structured data is included. What isn’t in the base package has often been built and shared by other programmers and is freely available for download.
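To make the "additional coding" point concrete, here is a rough sketch of a simple linear fit written by hand in plain Python with no libraries. The function name `fit_line` and the data are made up for illustration. R's base distribution covers the same task with a single call to its built-in `lm()` function, for example `lm(y ~ x)`.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

This covers only the coefficients. Standard errors, residual diagnostics, and plots would each need more hand-written code, which is the gap the paragraph above describes.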
R Pulls Ahead of the Competition
Success begets success for programming languages, as users implement and share various useful solutions and extensions. In R’s case, the Comprehensive R Archive Network (CRAN) represents the largest collection of packages available for a dedicated statistical programming language today. Almost 8,000 user-contributed packages cover such core analytical segments as:
- Finance
- Genetics
- High-performance computing
- Medical Imaging
- Social Sciences
If a statistical technique exists, R probably has a package implementing it. The graphics and charting capabilities of the language are unparalleled. This advantage can’t be overstated for data scientists: pulling statistics out of a data set in other languages may require only a few lines of code (just like R), but writing functions to show the results as anything other than an endless stream of numbers can take thousands more lines.
Because R is a formal programming language and because it has been so widely adopted in the field of data science, it helps ensure results are easily duplicated, a challenge for other, less rigorous statistical analysis techniques. R first gained traction in academia, where reproducibility was and remains a key to credibility. The basic advantage of reproducible work has found appeal in the business sector, too, as larger and larger data sets require repeated analysis.
A Few Limitations, but Part of R’s Appeal is How Easy it is to Learn
R suffers from notable problems with memory management and performance, which are significant drawbacks for researchers working with very large datasets. But the capabilities of the language outweigh the issues. Packages such as dplyr and data.table have been written and published on CRAN to help R cope with the challenges of mining massive datasets.
Non-programmers with a statistics or math background tend to find R easier to learn than conventional programming languages, and that is one of the big reasons it has been a hit outside of academia: it retains much of the power of a dedicated programming language without demanding a programmer's training. Engineers, scientists, and statisticians pick it up readily because it uses concepts and terms they already know from their general background in math and statistics.
R in Action … From Facebook to Pfizer to Finance
Facebook’s data scientists turn to R for a fast way to get an overview of new data. R’s visualization components make it easy to turn huge numbers into easy-to-understand linear or scatterplot charts, putting a cogent face on the data.
Facebook is such a massive user of R that the company has created its own Massive Open Online Course (MOOC) to help teach prospective data scientists how to work with the language.
At Google, R is used by the company’s advertising segments to examine underlying trends in their bid-driven ad pricing model, AdWords. The company also uses R extensively to churn through the buckets of search data it generates every second.
At Pfizer, customized R implementations allow non-programming researchers to plumb their drug trials data without having to send the information off to specialized statisticians first. The scientists are able to alter the direction of their research almost instantly in response to the information they receive.
R hit it big early on in the financial services sector. Dozens of specialized packages have been developed to allow fast and easy analysis of market data in real time.