Every day, around the United States, more than 36,000 weather forecasts are issued covering 800 different regions and cities. You probably notice the forecast was wrong when it starts raining in the middle of your picnic on what was supposed to be a sunny day, but did you ever wonder just how accurate those forecasts really are?
The folks at ForecastWatch.com did. Every day, they gather all 36,000 forecasts, put them in a database, and compare them to the actual conditions recorded in each location on that day. Forecasters around the country then use the results to improve their forecast models for the next round.
All that collection, analysis, and reporting takes a lot of analytical horsepower, but ForecastWatch does it all with one programming language: Python.
The company isn’t alone. According to a 2013 survey by industry analyst O’Reilly, 40 percent of data scientists responding use Python in their day-to-day work. They join the many other programmers in all fields who have made Python one of the top ten most popular programming languages in the world every year since 2003.
Organizations such as Google, NASA, and CERN use Python for almost every programming purpose under the sun… including, in increasing measure, data science.
Python: Good Enough Means Good for Data Science
Python is a multi-paradigm programming language: a sort of Swiss Army knife for the coding world. It supports object-oriented programming, structured programming, and functional programming patterns, among others. There’s a joke in the Python community that “Python is generally the second-best language for everything.”
But that is no knock on the language for organizations faced with a confusing proliferation of “best of breed” solutions, which quickly render their codebases incompatible and unmaintainable. Python can handle every job from data mining to website construction to running embedded systems, all in one unified language.
At ForecastWatch, for example, Python was used to write a parser to harvest forecasts from other websites, an aggregation engine to compile the data, and the website code to display the results. PHP was originally used to build the website until the company realized it was easier to only deal with a single language throughout.
And Facebook, according to a 2014 article in Fast Company magazine, chose to use Python for data analysis because it was already used so widely in other parts of the company.
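To make the multi-paradigm point concrete, here is a minimal sketch that computes the same average forecast error in structured, functional, and object-oriented styles. The temperature readings and the ErrorTracker class are invented for illustration; they are not taken from ForecastWatch or any other company mentioned here.

```python
from functools import reduce

# Hypothetical (forecast, actual) high temperatures in degrees Fahrenheit.
readings = [(72, 70), (65, 68), (80, 80), (55, 59)]


# Structured style: an explicit loop with a running total.
def mean_error_structured(pairs):
    total = 0
    for forecast, actual in pairs:
        total += abs(forecast - actual)
    return total / len(pairs)


# Functional style: compose map and reduce without mutating any state.
def mean_error_functional(pairs):
    errors = map(lambda pair: abs(pair[0] - pair[1]), pairs)
    return reduce(lambda a, b: a + b, errors) / len(pairs)


# Object-oriented style: a small class that owns its data and behavior.
class ErrorTracker:
    def __init__(self):
        self.errors = []

    def record(self, forecast, actual):
        self.errors.append(abs(forecast - actual))

    def mean_error(self):
        return sum(self.errors) / len(self.errors)


tracker = ErrorTracker()
for forecast, actual in readings:
    tracker.record(forecast, actual)

results = (
    mean_error_structured(readings),
    mean_error_functional(readings),
    tracker.mean_error(),
)
print(results)
```

All three approaches give the same answer, so the choice between them comes down to what reads best in the surrounding codebase.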
Python: The Meaning of Life in Data Science
The name is borrowed from Monty Python; creator Guido van Rossum chose it to signal that Python should be fun to use. It’s common to find obscure Monty Python sketches referenced in Python code examples and documentation.
For this reason and others, Python is much beloved by programmers. Data scientists coming from engineering or scientific backgrounds might feel like the barber turned axe-man in The Lumberjack Song the first time they try to use it for data analysis—a little bit out of place.
But Python’s inherent readability and simplicity make it relatively easy to pick up, and the number of dedicated analytical libraries available today means that data scientists in almost every sector will find packages already tailored to their needs freely available for download.
Because of Python’s extensibility and general purpose nature, it was inevitable as its popularity exploded that someone would eventually start using it for data analytics. As a jack of all trades, Python is not especially well-suited to statistical analysis, but in many cases organizations already heavily invested in the language saw advantages to standardizing on it and extending it to that purpose.
The Libraries Make the Language: Free Data Analysis Libraries for Python Abound
As is the case with many other programming languages, it’s the available libraries that lead to Python’s success: some 72,000 of them in the Python Package Index (PyPI) and growing constantly.
With Python explicitly designed to have a lightweight and stripped-down core, the standard library has been built up with tools for every sort of programming task… a “batteries included” philosophy that allows language users to quickly get down to the nuts and bolts of solving problems without having to sift through and choose between competing function libraries.
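As a rough sketch of what “batteries included” looks like in practice, the example below parses a small CSV and averages forecast errors per city using only standard library modules (csv, io, and statistics). The sample data and column names are made up for illustration.

```python
import csv
import io
import statistics

# Made-up sample data standing in for a real forecast archive.
RAW = """city,forecast_high,actual_high
Seattle,61,58
Los Angeles,75,77
Seattle,63,63
Los Angeles,78,74
"""

# csv.DictReader parses each row into a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(RAW)))

# Group the absolute forecast errors by city.
errors_by_city = {}
for row in rows:
    error = abs(int(row["forecast_high"]) - int(row["actual_high"]))
    errors_by_city.setdefault(row["city"], []).append(error)

# statistics.mean is part of the standard library as well.
mean_error = {
    city: statistics.mean(errors) for city, errors in errors_by_city.items()
}
print(mean_error)
```

Nothing here needs to be installed separately, which is the point of the philosophy: small analysis jobs can start immediately.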
Who’s Who in the Data Science Zoo: Pythons and Munging Pandas
Python is free, open-source software, and consequently anyone can write a library package to extend its functionality. Data science has been an early beneficiary of these extensions, particularly Pandas, the big daddy of them all.
Pandas is the Python Data Analysis Library, used for everything from importing data from Excel spreadsheets to processing sets for time-series analysis. Pandas puts pretty much every common data munging tool at your fingertips. This means that basic cleanup and some advanced manipulation can be performed with Pandas’ powerful dataframes.
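A minimal sketch of the kind of cleanup described above, assuming pandas is installed. The observations are invented and deliberately messy: one missing value, inconsistent capitalization, and a duplicated row.

```python
import io

import pandas as pd

# Made-up observations with typical problems: a missing value,
# inconsistent capitalization, and a duplicated row.
RAW = """date,city,high_f
2015-03-01,seattle,52
2015-03-02,Seattle,
2015-03-03,Seattle,55
2015-03-03,Seattle,55
"""

df = pd.read_csv(io.StringIO(RAW), parse_dates=["date"])

# Normalize capitalization so "seattle" and "Seattle" match.
df["city"] = df["city"].str.title()

# Drop the repeated row, then index by date for time-series work.
df = df.drop_duplicates().set_index("date")

# Fill the missing reading by linear interpolation between its neighbors.
df["high_f"] = df["high_f"].interpolate()

print(df)
```

Whether interpolation is an appropriate way to fill a gap depends on the data; it is used here only to show how little code a common munging step takes.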
Pandas is built on top of NumPy, one of the earliest libraries behind Python’s data science success story. NumPy’s functions are exposed in Pandas for advanced numeric analysis.
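The relationship between the two libraries can be sketched as follows, assuming both are installed. The error values are hypothetical; the point is only that NumPy functions accept pandas objects directly.

```python
import numpy as np
import pandas as pd

# Hypothetical forecast errors in degrees Fahrenheit.
errors = pd.Series([3.0, -4.0, 0.0, 1.0], index=["Mon", "Tue", "Wed", "Thu"])

# NumPy universal functions apply element-wise and keep the pandas index.
squared = np.square(errors)

# Root-mean-square error, mixing a pandas method with a NumPy function.
rmse = np.sqrt(squared.mean())

# The underlying data is available as a plain NumPy array when needed.
as_array = errors.to_numpy()

print(squared)
print(rmse)
```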
If you need something more specialized, chances are it’s out there:
- SciPy builds on NumPy, adding tools for scientific computing such as optimization, integration, and signal processing.
- Statsmodels focuses on tools for statistical analysis.
- Scikit-learn and PyBrain are machine learning libraries that provide modules for building neural networks and data preprocessing.
And these just represent the people’s favorites. Other specialized libraries include:
- SymPy – for symbolic mathematics, including symbolic statistics
- Shogun, PyLearn2 and PyMC – for machine learning
- Bokeh, d3py, ggplot, matplotlib, Plotly, prettyplotlib, and seaborn – for plotting and visualization
- csvkit, PyTables, SQLite3 – for storage and data formatting
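Of the storage options listed, sqlite3 ships with Python itself. The sketch below stores a few made-up forecast rows in an in-memory database and lets SQL compute the average error per city; the table and column names are assumptions chosen for the example.

```python
import sqlite3

# An in-memory database keeps the example self-contained;
# passing a file path instead would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE forecasts (city TEXT, forecast_high INTEGER, actual_high INTEGER)"
)

# Made-up rows for illustration.
rows = [("Seattle", 61, 58), ("Seattle", 63, 63), ("Los Angeles", 75, 77)]
conn.executemany("INSERT INTO forecasts VALUES (?, ?, ?)", rows)

# Let SQL do the aggregation: average absolute error per city.
query = """
    SELECT city, AVG(ABS(forecast_high - actual_high))
    FROM forecasts
    GROUP BY city
    ORDER BY city
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)
```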
There’s Always Someone to Ask for Help in the Python Community
The other great thing about Python’s broad and diverse base is that there are millions of users who are happy to offer advice or suggestions when you get stuck on something. Chances are, someone else has been stuck there first.
Open-source communities are known for their open discussion policies, but some of them have fierce reputations for not suffering newcomers lightly.
Python, happily, is an exception. Both online and in local meetup groups, many Python experts are happy to help you stumble through the intricacies of learning a new language.
And because Python is so prevalent in the data science community, there are plenty of resources that are specific to using Python in the field of data science. Meetup groups for data scientists using Python exist all over the country in places like Seattle and Los Angeles.
And if you have trouble finding a meetup near you with the right qualifications, there’s even a data science hack (using Python!) to search Meetup.com for the right match.