9 thoughts on “How Python can Turn the Internet into your Dataset: Part 1”

  1. I 100% agree with the idea here of teaching legal (and other social science) researchers the value of this sort of easy web data collection. But the problem with this (especially for legal researchers, who are expected to be aware of this stuff) is that all of the NFL stats on all of the NFL sites are generated by a single company called ‘Stats LLC’. They, in conjunction with the NFL and the sites to which they sell the data, attach very strict terms of use to reading the stats, including that you cannot scrape the data and use it as your own. For example, here’s the relevant bit from the NFL.com ToS: “Systematic retrieval of data or other content from the Service, whether to create or compile, directly or indirectly, a collection, compilation, database or directory, is prohibited absent our express prior written consent.” The copyright issues are less clear, but these kinds of ToS seem to be clearly enforceable.

    I’m not pointing this out just to be a pain in the tuckus. I want / need to do this sort of scraping all of the time but am often hamstrung by the ToS. To some degree, if you are only doing this sort of thing as a personal project you may be safe simply because you’ll fly under the radar. But to the degree that you are doing this sort of thing as a research project and trying to publish repeatable results, the terms of service are a serious impediment.

    This is a serious issue with using the web as data for publishable research and visualizations. The web may seem like a giant, open repository of data ready for easy grabbing with 100-line Python scripts. But there are often real legal hurdles to actually using the data that is tantalizingly within reach.

    I also have a specific interest in NFL statistics. I actually spent a month of my life writing a much more involved version of what you’ve done above to capture the complete play-by-play data from NFL.com (much more than 100 lines!), but I dropped the project largely because I didn’t feel safe publishing the resulting data. I’ve been bouncing around in my head the possibility of building a site to support community production of NFL stats, precisely so that folks would have an open set of NFL data to work with.

  2. Very elegant!

    My favorite part is that you do everything with the baseline packages, showing the power of regex + Python.
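    As a minimal sketch of that stdlib-only approach (hypothetical HTML and a hypothetical helper name, not the post’s actual script), hrefs can be pulled out of a page with nothing but the `re` module:

    ```python
    import re

    # Hypothetical sketch of the regex + stdlib approach the post demonstrates:
    # grab every href value from a chunk of HTML with a single pattern.
    HREF_RE = re.compile(r'href="([^"]+)"')

    def extract_links(html_text):
        """Return all href attribute values found in an HTML string."""
        return HREF_RE.findall(html_text)

    sample = '<a href="/players/profile?id=1">QB</a> <a href="/teams">Teams</a>'
    print(extract_links(sample))  # ['/players/profile?id=1', '/teams']
    ```

    In real scraping the `sample` string would come from `urllib` fetching a page; the point is that no third-party packages are needed.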

    Updated my post at ZIA to include a link here; looking forward to part 2.

  3. @Hal Roberts
    You raise excellent points with respect to the legal constraints on this kind of research. I’ll probably be emailing you in the next few days to pick your brain as to the most current trajectory of legal standing here.

    One thing I will say, though, is that these examples are mostly pedagogical – that is, the only way to teach PhDs how to do this is to dangle the content carrot in front of them. Most of the research these students will perform down the line takes place on governmental, non-profit/NGO, or academic sites, where the legal framework is much different.

  4. Check out BeautifulSoup; it is the bomb and it makes scraping baseball info a lot easier. I’ve used it for exactly that…
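    A minimal sketch of the BeautifulSoup approach the commenter suggests, on a made-up snippet of stats-table HTML (requires the third-party `beautifulsoup4` package; the table contents are hypothetical):

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical stats-table HTML standing in for a fetched page.
    page = """
    <table>
      <tr><th>Player</th><th>HR</th></tr>
      <tr><td>Ruth</td><td>54</td></tr>
    </table>
    """

    soup = BeautifulSoup(page, "html.parser")
    # Walk every row and pull the text out of each cell.
    rows = [[cell.get_text() for cell in tr.find_all(["th", "td"])]
            for tr in soup.find_all("tr")]
    print(rows)  # [['Player', 'HR'], ['Ruth', '54']]
    ```

    Compared with hand-rolled regexes, the parser handles malformed markup and nesting for you, which is most of the pain in real-world scraping.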

  5. Then use XPaths and get the href for the first result, something like this:
    from lxml import etree, html
    page = html.parse("http://www.nfl.com/players/search?category=name&filter=" + player + "&playerType=current")
    find_profile = etree.XPath("xpath/@href")  # substitute the real XPath expression here
    profile = find_profile(page)[0]

    Pretty much as simple as that.
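    The same idea in a self-contained form (hypothetical inline HTML instead of the live search page, so no network request is needed): parse the document with `lxml` and pull the first matching href with an XPath.

    ```python
    from lxml import html

    # Hypothetical search-results markup standing in for the fetched page.
    doc = html.fromstring(
        '<ul><li><a href="/players/profile?id=00-001">Player One</a></li>'
        '<li><a href="/players/profile?id=00-002">Player Two</a></li></ul>'
    )

    # "//a/@href" returns every href in document order; take the first hit.
    first_href = doc.xpath("//a/@href")[0]
    print(first_href)  # /players/profile?id=00-001
    ```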

  6. Instead of all that, you can just use XPath: search, then get the href for the first result.
