ABSTRACT: OpenEDGAR is an open source Python framework designed to rapidly construct research databases based on the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system operated by the US Securities and Exchange Commission (SEC). OpenEDGAR is built on the Django application framework, supports distributed compute across one or more servers, and includes functionality to (i) retrieve and parse index and filing data from EDGAR, (ii) build tables for key metadata like form type and filer, (iii) retrieve, parse, and update CIK to ticker and industry mappings, (iv) extract content and metadata from filing documents, and (v) search filing document contents. OpenEDGAR is designed for use in both academic research and industrial applications, and is distributed under MIT License at https://github.com/LexPredict/openedgar
ABSTRACT: “Array programming provides a powerful, compact and expressive syntax for accessing, manipulating and operating on data in vectors, matrices and higher-dimensional arrays. NumPy is the primary array programming library for the Python language. It has an essential role in research analysis pipelines in fields as diverse as physics, chemistry, astronomy, geoscience, biology, psychology, materials science, engineering, finance and economics. For example, in astronomy, NumPy was an important part of the software stack used in the discovery of gravitational waves and in the first imaging of a black hole. Here we review how a few fundamental array concepts lead to a simple and powerful programming paradigm for organizing, exploring and analyzing scientific data. NumPy is the foundation upon which the scientific Python ecosystem is constructed. It is so pervasive that several projects, targeting audiences with specialized needs, have developed their own NumPy-like interfaces and array objects. Owing to its central position in the ecosystem, NumPy increasingly acts as an interoperability layer between such array computation libraries and, together with its application programming interface (API), provides a flexible framework to support the next decade of scientific and industrial analysis.” Access Paper via Nature.
Our next paper — OpenEDGAR – Open Source Software for SEC Edgar Analysis is now available.This paper explores a range of #OpenSource tools we have developed to explore the EDGAR system operated by the US Securities and Exchange Commission (SEC). While a range of more sophisticated extraction and clause classification protocols can be developed leveraging LexNLP and other open and closed source tools, we provide some very simple code examples as an illustrative starting point.
Click here for Paper: < SSRN > < arXiv >
Access Codebase Here: < Github > Abstract: OpenEDGAR is an open source Python framework designed to rapidly construct research databases based on the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system operated by the US Securities and Exchange Commission (SEC). OpenEDGAR is built on the Django application framework, supports distributed compute across one or more servers, and includes functionality to (i) retrieve and parse index and filing data from EDGAR, (ii) build tables for key metadata like form type and filer, (iii) retrieve, parse, and update CIK to ticker and industry mappings, (iv) extract content and metadata from filing documents, and (v) search filing document contents. OpenEDGAR is designed for use in both academic research and industrial applications, and is distributed under MIT License at https://github.com/LexPredict/openedgar
Paper Abstract – LexNLP is an open source Python package focused on natural language processing and machine learning for legal and regulatory text. The package includes functionality to (i) segment documents, (ii) identify key text such as titles and section headings, (iii) extract over eighteen types of structured information like distances and dates, (iv) extract named entities such as companies and geopolitical entities, (v) transform text into features for model training, and (vi) build unsupervised and supervised models such as word embedding or tagging models. LexNLP includes pre-trained models based on thousands of unit tests drawn from real documents available from the SEC EDGAR database as well as various judicial and regulatory proceedings. LexNLP is designed for use in both academic research and industrial applications, and is distributed at https://github.com/LexPredict/lexpredict-lexnlp
We have spent the past couple days at the University of Pennsylvania where we presented information about our efforts to compile a complete United States Supreme Court Corpus. As noted in the slides below, we are interested in creating a corpus containing not only every SCOTUS opinion, but also every SCOTUS disposition from 1791-2010. Slight variants of the slides below were presented at the Penn Computational Linguistics Lunch (CLunch) and the Linguistic Data Consortium(LDC). We really appreciated the feedback and are looking forward to continue our work with the LDC. For those who might be interested, take a look at the slides embedded below or click on this link:
In this post, we will continue building on the basic models we discussed in the first and second tutorials. If you haven’t had a chance to take a look at them yet, definitely go back and at least skim them, since the ideas and code there form the backbone of what we’ll be doing here.
In this tutorial, we will build a model that can simulate outbreaks of disease on a small-world network (although the code can support arbitrary networks). This tutorial represents a shift away from both:
a) the mass-action mixing of the first two and and
b) the assumption of social homogeneity across individuals that allowed us to take some shortcuts to simplify model code and speed execution. Put another way, we’re moving more in the direction of individual-based modeling.
When we’re done, your model should be producing plots that look like this:
Red nodes are individuals who have been infected before the end of the run, blue nodes are never-infected individuals and green ones are the index cases who are infectious at the beginning of the run.
And your model will be putting out interesting and unpredictable results such as these:
In order to do this one, though, you’re going to needto download and install have igraph for Python on your system.
It is important to make the subtle distinction between individual and agent based models very clear here. Although the terms are often used interchangeably, referring to our nodes, who have no agency, per se, but are instead fairly static receivers and diffusers of infection, as agents, seems like overreaching. Were they to exhibit some kind of adaptive behavior, i.e., avoiding infectious agents or removing themselves from the population during the infective period, they then become more agent-like.
This is not to under- or over-emphasize the importance or utility of either approach, but just to keep the distinction in mind to avoid the “when all you have is a hammer, everything looks like a nail” problem.
The guiding idea behind small-world networks is that they capture some of the structure seen in more realistic contact networks: most contacts are regular in the sense that they are fairly predicable, but there are some contacts that span tightly clustered social groups and bring them together.
In the basic small-world model, an individual is connected to some (small, typically <=8) number of his or her immediate neighbors. Some fraction of these network connections are then randomly re-wired, so that some individuals who were previously distant in network terms – i.e., connected by a large number of jumps – are now adjacent to each other. This also has the effect of shortening the distance between their neighbors and individuals on the other side of the graph. Another way of putting this is that we have shortened the average path length and increased the average reachability of all nodes.
These random connections are sometimes referred to as “weak ties”, as there are fewer of these ties that bridge clusters than there are within clusters. When these networks are considered from a sociological perspective, we often expect to find that the relationship represented by a weak tie is one in which the actors on either end have less in common with each other than they do with their ‘closer’ network neighbors.
Random networks also have the property of having short average path lengths, but they lack the clustering that gives the small-world model that pleasant smell of quasi-realism that makes them an interesting but largely tractable, testing ground for theories about the impact of social structure on dynamic processes.
Installation and Implementation Issues
If you have all the pre-requisites installed on your system, you should be able to just copy and paste this code into a new file and run it with your friendly, local Python interpreter. When you run the model, you should first see a plot of the network, and when you close this, you should see a plot of the number of infections as a function of time shortly thereafter.
Aside from the addition of the network, the major conceptual difference is that the model operates on discrete individuals instead of a homogeneous population of agents. In this case, the only heterogeneity is in the number and identity of each individual’s contacts, but there’s no reason we can’t (and many do) incorporate more heterogeneity (biological, etc.) into a very similar model framework.
With Python, this change in orientation to homogeneous nodes to discrete individuals seems almost trivial, but in other languages it can be somewhat painful. For instance, in C/++, a similar implementation would involve defining a struct with fields for recovery time and individual ID, and defining a custom comparison operator for these structs. Although this is admittedly not a super-high bar to pass, it adds enough complexity that it can scare off novices and frustrate more experienced modelers.
Perhaps more importantly, it often has the effect of convincing programmers that a more heavily object-oriented approach is the way to go, so that each individual is a discrete object. When our individuals are as inert as they are in this model, this ends up being a waste of resources and makes for significantly more cluttered code. The end result can often be a model written in a language that is ostensibly faster than Python, such as C++ or Java, that runs slower than a saner (and more readable) Python implementation.
For those of you who are playing along at home, here are some things to think about and try with this model:
Change the kind of network topology the model uses (you can find all of the different networks available in igraph here).
Incorporate another level of agent heterogeneity: Allow agents to have differing levels of infectivity (Easier); Give agents different recovery time distributions (Harder, but not super difficult).
Make two network models – you can think of them as separate towns – and allow them to weakly influence each other’s outbreaks. (Try to use the object-oriented framework here with minimal changes to the basic model.)
That’s it for tutorial #3, (other than reviewing the comment code which is below) but definitely check back for more on network models!
In future posts, we’ll be thinking about more dynamic networks (i.e., ones where the links can change over time), agents with a little more agency, and tools for generating dynamic visualizations (i.e., movies!) of stochastic processes on networks.
That really covers the bulk of the major conceptual issues. Now let’s work through the implementation.
Click Below to Review the Implementation and Commented Code!
In the next few tutorials, we’re going to transition to exploring how to model dynamics on a network.
The first tutorial was a bit of a blockbuster length-wise because there was a lot of ground to cover to get the basic model up and running. Moving forward, we’ll be able to go a bit more incrementally, adding elements to the basic model as we go. If you don’t feel comfortable with the original, go back and take a look at it and make sure that it makes sense to you before moving on.
We’re first going to deal with some of the efficiency issues in the first model. After this, we’ll make some basic changes to the architecture of the SIR program that makes it more amenable to contact patterns on a social network.
Finally, we’ll show you how to how to take the output of your epidemic model and generate animations like this one:
Blue nodes are exposed but uninfected, red nodes are infectious, and yellow oneshave recovered.
The movie is a bit of a carrot to get you through the less flashy, but, I promise, important and actually interesting nuts and bolts of putting these kinds of models together.
This tutorial is going to cover the last two big things that we need to tackle before we get to the model of an outbreak on a network. So, here we go!
1. Arbitrarily Distributed Infectious Periods
First, we’re going to deal with the duration of the infectious period. The assumption of an exponentially distributed infectious period is unnecessarily restrictive for a general model of diffusion, and the way the original code goes about recovering individuals – drawing a random number on every step for every infectious individual – should strike you as both inelegant and computationally inefficient, particularly when the rate of recovery is slow and there are many infectious individuals.
In order to deal with this, we’re going to introduce two new tools. The first is the scipy.stats toolkit and the second is a neat (and very easy to use) data structure called a heap.
A heap is in very many ways what it sounds like: imagine a pile of trash in a landfill; the tires and rusting washing machines are on the bottom, while the pop cans and grocery store receipts are closer to the top.
As a programming tool, a heap is useful because it always keeps the smallest (or largest, depending on your preference) item at the top of the list. It also allows for linear-time insertion and removal of objects. This means that the time it takes to execute an action grows proportionally to the size of the list, so if it has N items, it takes N*C steps (where C is a constant) to process the list, and if it has 2*N items, it takes 2*N*C steps. Other ways of sorting could take N^2 or worse steps to do the same.
In our outbreak model, the top item on the heap is always going to be the time at which the next individual recovers. By doing this, we can avoid the loop in the first tutorial (and replicated in one implementation here) that checks whether each infectious individual is going to recover on each step.
Looping over everyone is the most intuitive way to check if they’re going to recover, but it’s very inefficient, especially when infectious periods are long and the population is large. It’s also problematic from a theoretical perspective, because it chains us to exponentially distributed recovery periods.
Exponentially distributed infectious periods make analytic sense for a deterministic model, but your disease or *insert diffusible here* may have a constant or normally distributed ‘infectious’ period.
By using a heap-based implementation, as you will see, we can use arbitrary recovery periods, and Python’s implementation of the heap is very straightforward – just a small twist on the usual list using the heapq module.
We won’t go too deep into the details of OOP here, but the real strength of OOP implementations are that they allow code to be easily re-used in other programs (Python’s all-powerful ‘import‘ statement really makes this true) and also forces some structure on what functions have access to what variables, etc.
Click Below to Review the Implementation and Commented Code!
In this series of tutorials, we are going to focus on the theory and implementation of transmission models in some kind of population.
In epidemiology, it is common to model the transmission of a pathogen from one person to another. In the social sciences and law, we may be interested in thinking about the way in which individuals influence each other’s opinions, ideology and actions.
These two examples are different, but in many ways analogous: it is not difficult to imagine the influence that one individual has on another as being similar to the infectivity of a virus in the sense that both have the ability to change the state of an individual. One may go from being susceptible to being infected, or from unconvinced to convinced.
Additionally, social networks have become an important area of study for epidemiological modelers. We can imagine that the nature of the network is different than the ones we think about in the social sciences: when studying outbreaks of a sexually transmitted disease, one doesn’t care all that much about the friendship networks of the people involved, while this would be very important for understanding the impact of social influence on depression and anxiety.
As someone who spends a lot of time working in the space where epidemiology and sociology overlap, I end up thinking a lot about these models – and their potential application to new and different problems and am really excited to share them with a broader audience here. In this first tutorial, I’m going to introduce a simpleSusceptible-Infected-Recovered (SIR) model from infectious disease epidemiology and show a simple, Pythonic implementation of it. We’ll work through the process of writing and optimizing this kind of model in Python and, in the final tutorials, will cover how to include a social network in the simulation model.
In order to use the example below, all you need to have installed is a current version of Python (2.4+ is probably best) and the excellent Python plotting package Matplotlib in order to view output. If you don’t have Matplotlib and don’t want to go and install it (although I guarantee you won’t regret it), just comment out import for Pylab and any lines related to plotting.
1. State Space / Markov Model
Before getting into the mechanics of the model, let’s talk about the theory and assumptions behind the model as it is implemented here:
The SIR model is an example of a ‘state space‘ model, and the version we’ll be talking about here is a discrete time, stochastic implementation that has the Markov property, which is to say that its state at time t+1 is only conditional on the parameters of the model and its state at time t.
For the uninitiated, in a state-space model, we imagine that each individual in the system can only be in one state at a time and transitions from state to state as a function of the model parameters, i.e., the infectivity of the pathogen or idea and the rate of recovery from infection…and the other states of the system. In other words, the system has endogenousdynamics. This is what makes it both interesting and in some ways difficult to work with.
In the SIR model, we assume that each infectious individual infects each susceptible individual at rate beta. So, if beta = .5, there is a 50% chance that each susceptible individual will be infected by an exposure to an infectious individual. For this reason, as the number of infected individuals in the system grows, the rate at which the remaining susceptible individuals is infected also grows until the pool of susceptible individuals is depleted and the epidemic dies out.
The other parameter we care about is gamma, or the rate of recovery. If gamma is also equal to .5, we assume that the average individual has a 50% chance of recovering on a given day, and the average duration of infectiousness will be 1/gamma, or 2 days.
We refer to the ratio beta/gamma as the basic reproductive ratio, or Ro (‘R naught’). When this number is less than one, we typically expect outbreaks to die out quickly. When this quantity is greater than one, we expect that the epidemic will grow and potentially saturate the whole population.
2. Homogeneous Mixing:
We’re assuming a world in which everyone has simultaneous contact with everyone else. In other words, we’re thinking of a totally connected social network. If you’re a regular reader of this blog, a social network enthusiast, or in some other way a thinking person, this assumption probably seems unreasonable. It turns out, however, that for many diseases, this assumption of homogeneous or ‘mass-action’ mixing, which was actually originally borrowed from chemistry, turns out to be a reasonable approximation.
For instance, if we are trying to approximate the transmission dynamics of a very infectious pathogen like measles in a city or town, we can usually overlook social network effects at this scale and obtain a very good fit to the data. This is because even very weak contacts can transmit measles, so that friendships and other types of close contacts are not good predictors of risk. Instead, we we are better off looking at a higher level of organization – the pattern of connection between towns and cities to understand outbreaks. In a social context, something like panic may be thought of as being super-infectious (for a really interesting study about the potential relationship between social panic and flu dynamics, see this paper by Josh Epstein).
This is, however, a generally problematic assumption for most problems of social influence, but an understanding of this most basic version of the model is necessary to move on to more complicated contact patterns.
3. Exponentially distributed infectious periods:
In the most basic implementation of the SIR model, we assume that each infectious individual has some probability of recovering on every step. If our model steps forwards in days and individuals have a .5 probability of recovery on each day, we should expect that the time to recovery follows an exponential distribution. This means that most people will be pretty close to the mean, but some will take a relatively long time to recover. This is accurate for a lot of cases, but definitely not for all. In some diseases, recovery times may be lognormal, power-law or bimodally disributed. For social models, the notion of an ‘infectious period’ may not make a tremendous amount of sense at all. But it allows for a very simple and transparent implementation, so we’ll use it here.
CLICK THROUGH TO SEE THE IMPLEMENTATION and RELEVANT PYTHON CODE!
When it comes to quickly motivating a point or engaging students in a classroom, one of the most effective tools is visualization. Not only do movies provide fun and excitement, but they also allow viewers to leverage the abilities of the visual cortex to infer dynamics and patterns in the animated system.
For our recent research, dynamic graphs are the type of system of interest. As I’ve covered before, Python is my language of choice for most programming tasks. Furthermore, Python is a very accessible language, even for beginners. However, when it comes to visualizing dynamic networks, we need another tool. Our tool of choice is SONIA, the Social Network Image Animator.
I thought I’d provide a helpful little function to generate SONIA input files from igraph objects, along with a few examples.
This function takes as input an igraph.Graph object and a file name to store the SONIA output in. Every vertex in the Graph object should have a time attributed specified, either simply as an integer indicating the start time, or as a tuple or list of the form (startTime,endTime). Check out the following two examples if you need more guidance. Both examples visualize the construction of a periodic lattice. However, in the second example, nodes decay after some random time. Make sure not to miss the second video at the bottom of the post!
As we covered earlier, Drew Conway over at Zero Intelligence Agents has gotten off to a great start with his first two tutorials on collecting and managing web data with Python. However, critics of such automated collection might argue that the cost of writing and maintaining this code is higher than the return for small datasets. Furthermore, someone still needs to manually enter the players of interest for this code to work.
To convince these remaining skeptics, I decided to put together an example where automated collection is clearly the winner.
Problem: Imagine you wanted to compare Drew’s NY Giants draft picks with the league as a whole. How would you go about obtaining data on the rest of the league’s players?
Human Solution: If you planned to do this the old-fashioned manual way, you would probably decide to collect the player data team-by-team. On the NFL.com website, the first step would thus be to find the list of team rosters:
This is the list of current players for Detroit Lions. In order to collect the desired player info, however, you’d again have follow the link to each player’s profile page. For instance, you might want to check out the Lion’s own first round pick:
At last, you can copy down Stafford’s statistics. Simple enough, right? This might take all of 30 seconds with page load times and your spreadsheet entry.
The Lions have more than 70 players rostered (more than just active players); let’s assume this is representative. There are 32 teams in the NFL. By even a conservative estimate, there are over 2000 players you’d need to collect data. If each of the 2000 players took 30 seconds, you’d need about 17 man hours to collect the data. You might hand this data entry over to a team of bored undergrads or graduate students, but then you’d need to worry about double-coding and cost of labor. Furthermore, what if you wanted to extend this analysis to historical players as well? You better start looking for a source of funding…
What if there was an easier way?
The solution requires just 100 lines of code. An experienced Python programmer can produce this kind of code in half an hour over a beer at a place like Ashley’s. The program itself can download the entire data set in less than half an hour. In total, this data set is the product of less than an hour of total time.
How long would it take your team of undergrads? Think about all the paperwork, explanations, formatting problems, delays, and cost…
The end result is a spreadsheet with the name, weight, age, height in inches, college, and NFL team for 2,520 players. This isn’t even the full list – for the purpose of this tutorial, players with missing data, e.g., unknown height, are not recorded.
You can view the spreadsheet here. In upcoming tutorials, I’ll cover how to visualize and analyze this data in both standard statistical models as well as network models.
In the meantime, think about which of these two solutions makes for a better world.
We wanted to highlight a couple of very interesting posts by Drew Conway of Zero Intelligence Agents. While not simple, the programming language python offers significant returns upon investment. From a data acquisition standpoint, python has made what seemed impossible quite possible. As a side note, this code looks like our first Bommarito led Ann Arbor Python Club effort to download and process NBA Box Scores…. you know it is all about trying to win the fantasy league…!