As we covered earlier, Drew Conway over at Zero Intelligence Agents has gotten off to a great start with his first two tutorials on collecting and managing web data with Python. However, critics of such automated collection might argue that the cost of writing and maintaining this code is higher than the return for small datasets. Furthermore, someone still needs to manually enter the players of interest for this code to work.
To convince these remaining skeptics, I decided to put together an example where automated collection is clearly the winner.
Problem: Imagine you wanted to compare Drew’s NY Giants draft picks with the league as a whole. How would you go about obtaining data on the rest of the league’s players?
Human Solution: If you planned to do this the old-fashioned manual way, you would probably decide to collect the player data team-by-team. On the NFL.com website, the first step would thus be to find the list of team rosters:
Now, you’d need to click through to each team’s roster. For instance, if you’re from Ann Arbor, you might be a Lions fan…
This is the list of current players for the Detroit Lions. To collect the desired player info, however, you’d again have to follow the link to each player’s profile page. For instance, you might want to check out the Lions’ own first-round pick:
At last, you can copy down Stafford’s statistics. Simple enough, right? This might take all of 30 seconds with page load times and your spreadsheet entry.
The Lions have more than 70 players rostered (more than just active players); let’s assume this is representative. There are 32 teams in the NFL, so by even a conservative estimate there are over 2,000 players you’d need to collect data on. If each of those 2,000 players took 30 seconds, you’d need about 17 man-hours to collect the data. You might hand this data entry over to a team of bored undergrads or graduate students, but then you’d need to worry about double-coding and the cost of labor. Furthermore, what if you wanted to extend this analysis to historical players as well? You’d better start looking for a source of funding…
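The back-of-the-envelope arithmetic above takes only a couple of lines to check (the 70-player roster, the 30-second entry time, and the rounding down to 2,000 are the assumptions stated in the text):

```python
# Back-of-the-envelope estimate of the manual-collection effort.
players_per_roster = 70    # Lions roster size, assumed representative
teams = 32                 # NFL teams
seconds_per_player = 30    # time to load a page and copy one player's stats

total_players = players_per_roster * teams    # 2240, i.e., "over 2000"
conservative_players = 2000                   # rounding down, as in the text
hours = conservative_players * seconds_per_player / 3600

print(total_players, round(hours, 1))   # 2240 16.7 -> about 17 man-hours
```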
What if there was an easier way?
The solution requires just 100 lines of code. An experienced Python programmer can produce this kind of code in half an hour over a beer at a place like Ashley’s. The program itself can download the entire data set in less than half an hour. All told, this data set is the product of less than an hour of work.
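The full script isn’t reproduced here, but the core of such a program – regex extraction over downloaded pages, using only the standard library – looks roughly like this. The HTML snippet, field names, and player values below are made up for illustration; NFL.com’s real markup differs:

```python
import re

# Illustrative page fragment; real NFL.com markup differs.
# In the real script, each page would come from something like:
#   page = urllib.request.urlopen(profile_url).read().decode()
html = """
<td class="player-name"><a href="/player/9999">John Doe</a></td>
<td class="height">6-3</td><td class="weight">232</td>
"""

# One regex per field, applied to the downloaded page text.
name = re.search(r'class="player-name"><a href="[^"]+">([^<]+)</a>', html).group(1)
height = re.search(r'class="height">(\d+)-(\d+)<', html)
inches = int(height.group(1)) * 12 + int(height.group(2))   # convert 6-3 to inches
weight = int(re.search(r'class="weight">(\d+)<', html).group(1))

print(name, inches, weight)   # John Doe 75 232
```

Loop that extraction over every roster page and every profile link, and you have the whole league in one pass.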
How long would it take your team of undergrads? Think about all the paperwork, explanations, formatting problems, delays, and cost…
The end result is a spreadsheet with the name, weight, age, height in inches, college, and NFL team for 2,520 players. This isn’t even the full list – for the purpose of this tutorial, players with missing data, e.g., unknown height, are not recorded.
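Writing the collected records out as a spreadsheet is itself a few lines with the standard library’s csv module. The column names match the fields described above; the sample row is a placeholder, not real player data:

```python
import csv
import io

# Columns matching the final spreadsheet described above.
fields = ["name", "weight", "age", "height_inches", "college", "team"]
rows = [
    ["John Doe", 232, 23, 75, "Georgia", "Detroit Lions"],   # placeholder row
]

buf = io.StringIO()               # swap in open("players.csv", "w", newline="") for a file
writer = csv.writer(buf)
writer.writerow(fields)           # header row
writer.writerows(rows)            # one row per player

print(buf.getvalue())
```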
You can view the spreadsheet here. In upcoming tutorials, I’ll cover how to visualize and analyze this data in both standard statistical models as well as network models.
In the meantime, think about which of these two solutions makes for a better world.
9 thoughts on “How Python can Turn the Internet into your Dataset: Part 1”
I’m not pointing this out just to be a pain in the tuckus. I want / need to do this sort of scraping all of the time but am often hamstrung by the ToS. To some degree, if you are only doing this sort of thing as a personal project you may be safe simply because you’ll fly under the radar. But to the degree that you are doing this sort of thing as a research project and trying to publish repeatable results, the terms of service are a serious impediment.
This is a serious issue with using the web as data for publishable research and visualizations. The web may seem like a giant, open repository of data ready for easy grabbing with 100 line python scripts. But there are often real legal hurdles to actually using the data that is tantalizingly within reach.
I also have a specific interest in NFL statistics. I actually spent a month of my life writing a much more involved version of what you’ve done above to capture the complete play-by-play data from NFL.com (much more than 100 lines!), but I dropped the project largely because I didn’t feel safe publishing the resulting data. I’ve been bouncing around in my head the possibility of building a site to support the community production of NFL stats, precisely so that folks would have an open set of NFL data with which to work.
My favorite part is that you do everything with the baseline packages, showing the power of regex + Python.
Updated my post at ZIA to include a link here, looking forward to part 2.
You raise excellent points with respect to the legal constraints on this kind of research. I’ll probably be emailing you in the next few days to pick your brain as to the most current trajectory of legal standing here.
One thing I will say, though, is that these examples are mostly pedagogical – that is, the only way to teach PhDs how to do this is to dangle the content carrot in front of them. Most of the research these students perform down the line takes place on governmental, non-profit/NGO, or academic sites, where the legal framework is much different.
Thanks! Not sure whether to do an igraph or matplotlib example next…
Check out BeautifulSoup – it is the bomb, and it makes scraping a lot easier. I’ve used it for scraping baseball info myself…
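For anyone curious, the BeautifulSoup version of the extraction does read more cleanly than raw regexes. A minimal sketch, assuming the third-party bs4 package is installed; the markup and player name are made up:

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for a real roster page.
html = '<td class="player-name"><a href="/player/9999">John Doe</a></td>'
soup = BeautifulSoup(html, "html.parser")

# Tag/attribute lookup instead of hand-rolled regexes.
cell = soup.find("td", class_="player-name")
print(cell.a["href"], cell.a.text)   # /player/9999 John Doe
```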
Then use XPaths and get the href for the first result, something like this:
Pretty much as simple as that.
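The commenter’s code didn’t survive, but the idea can be sketched with the standard library’s ElementTree, which supports a limited XPath subset (the full lxml library gives complete XPath support). The results fragment here is made up:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed search-results fragment.
html = """<div>
  <ul class="results">
    <li><a href="/player/9999">John Doe</a></li>
    <li><a href="/player/8888">Jane Roe</a></li>
  </ul>
</div>"""

root = ET.fromstring(html)
first = root.find(".//li/a")   # XPath-style path to the first result link
print(first.get("href"))       # /player/9999
```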
Instead of all that, you can just use XPath to search and get the href for the first result, something like this:
Get the data from Freebase.
There’s your data, ready for download. 🙂
It doesn’t work on Python 3.