Netflix Challenge for SCOTUS Prediction?

During our break from blogging, Ian Ayres offered a very interesting post over at Freakonomics entitled “Prediction Markets vs. Super Crunching: Which Can Better Predict How Justice Kennedy Will Vote?” In general terms, the post compares the well-known statistical model offered by Martin-Quinn to the new Supreme Court Fantasy League created by Josh Blackman. We were particularly interested in a sentence located at the end of the post … “[T]he fantasy league predictions would probably be more accurate if market participants had to actually put their money behind their predictions (as with intrade.com).” This point is well taken. Extending the idea of having some “skin in the game,” we wondered what sort of intellectual returns could be generated for the field of quantitative Supreme Court prediction by some sort of Netflix-style SCOTUS challenge.

The Martin-Quinn model has significantly advanced the field of quantitative analysis of the United States Supreme Court. However, despite all of the benefits the model has offered, it is unlikely to be the last word on the question. While only time will tell, an improved prediction algorithm might very well be generated through the application of ideas in machine learning and via incorporation of additional components such as text, citations, etc.

With a significant financial sum at stake … even one far smaller than the real Netflix challenge … it is certainly possible that a non-trivial improvement could be generated. In a discussion among a few of us here at the Michigan CSCS lab, we generated the following non-exhaustive set of possible ground rules for a Netflix-style SCOTUS challenge:

  1. To unseat the existing model, the winning team should be required to make a non-trivial improvement upon the out-of-sample historical success of the Martin-Quinn model.
  2. To prevent overfitting, the authors of this non-trivial improvement should be required to best the existing model for some prospective period.
  3. All of those who submit agree to publish their code in a standard programming language (C, Java, Python, etc.) with reasonable commenting / documentation.
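
As a rough illustration of how rules 1 and 2 might be checked, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function names, the encoding of votes as 0/1, and especially the `min_gain` threshold, since we have not settled on what counts as a "non-trivial" improvement.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of votes predicted correctly (votes encoded as 0 or 1)."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("predictions and outcomes must be non-empty and aligned")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def unseats_baseline(
    challenger: Sequence[int],
    baseline: Sequence[int],
    actual: Sequence[int],
    min_gain: float = 0.05,  # hypothetical threshold for a "non-trivial" improvement
) -> bool:
    """True if the challenger beats the baseline by at least min_gain.

    To respect rule 2, `actual` should hold only votes cast AFTER both
    sets of predictions were frozen (a prospective evaluation period).
    """
    return accuracy(challenger, actual) - accuracy(baseline, actual) >= min_gain


# Toy data, invented purely to exercise the functions.
actual_votes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
baseline_preds = [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]    # 6 of 10 correct
challenger_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # 8 of 10 correct
print(unseats_baseline(challenger_preds, baseline_preds, actual_votes))
```

A real contest would need far more than ten votes before any difference in accuracy could be treated as meaningful.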

New Paper: Properties of the United States Code Citation Network

We have been working on a larger paper applying many concepts from structural analysis and complexity science to the study of bodies of statutory law such as the United States Code. To preview the broader paper, we’ve published to SSRN and arXiv a shorter, more technical analysis of the properties of the United States Code’s network of citations.

Click here to Download the Paper!

Abstract: The United States Code is a body of documents that collectively comprises the statutory law of the United States. In this short paper, we investigate the properties of the network of citations contained within the Code, most notably its degree distribution. Acknowledging the text contained within each of the Code’s section nodes, we adjust our interpretation of the nodes to control for section length. Though we find a number of interesting properties in these degree distributions, the power law distribution is not an appropriate model for this system.

Citation In-Degree

United States Court of Appeals & Parallel Tag Clouds from IBM Research

Ct of Appeals

Download the paper: Collins, Christopher; Viégas, Fernanda B.; Wattenberg, Martin. Parallel Tag Clouds to Explore Faceted Text Corpora. To appear in Proceedings of the IEEE Symposium on Visual Analytics Science and Technology (VAST), October, 2009. [Note: The Paper is 24.5 MB]

Here is the abstract: Do court cases differ from place to place? What kind of picture do we get by looking at a country’s collection of law cases? We introduce Parallel Tag Clouds: a new way to visualize differences amongst facets of very large metadata-rich text corpora. We have pointed Parallel Tag Clouds at a collection of over 600,000 US Circuit Court decisions spanning a period of 50 years and have discovered regional as well as linguistic differences between courts. The visualization technique combines graphical elements from parallel coordinates and traditional tag clouds to provide rich overviews of a document collection while acting as an entry point for exploration of individual texts. We augment basic parallel tag clouds with a details-in-context display and an option to visualize changes over a second facet of the data, such as time. We also address text mining challenges such as selecting the best words to visualize, and how to do so in reasonable time periods to maintain interactivity.

The Structure of the United States Code

United States Code (All Titles)

Formally organized into 50 titles, the United States Code is the repository for federal statutory law. While each of the 50 titles defines a particular substantive domain, the structure within and across titles can be represented as a graph/network. In a series of prior posts, we offered visualizations at various “depths” for a number of well-known U.S.C. titles. Click here and click here for our two separate visualizations of the Tax Code (Title 26). Click here for our visualization of the Bankruptcy Code (Title 11). Click here for our visualization of Copyright (Title 17). While our prior efforts were devoted to displaying the structure of a given title of the US Code, the visualization above offers a complete view of the structure of the entire United States Code (Titles 1-50).

The visual is displayed using Seadragon from Microsoft Labs, and each title is labeled with its respective number. The small black dots are “vertices” representing all sections in the aggregate US Code (~37,500 total sections). Given the size of the total undertaking, in the visual above, every title is represented to the “section level.” As we described in earlier posts, a “section level” representation halts at the section and thus does not represent any subsection depth. For example, all subsections under 26 U.S.C. § 501, including the well-known § 501(c)(3), are reattributed upward to their parent section.
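
To make the “section level” idea concrete, here is a minimal Python sketch of rolling a subsection citation up to its parent section. This is not the code behind the visual; the function name and the regular expression are our own assumptions, and the pattern only handles citations written in the simple `26 U.S.C. § 501(c)(3)` form.

```python
import re

# Title number, "U.S.C.", the section sign, then the section identifier.
# Identifiers may contain letters and hyphens (e.g. "2000e-2"); anything
# after that, such as "(c)(3)", is subsection depth and is discarded.
_SECTION = re.compile(r"^\s*(\d+)\s+U\.S\.C\.\s+§\s*([0-9]+[A-Za-z0-9-]*)")


def to_section_level(citation: str) -> str:
    """Reattribute a subsection citation upward to its parent section."""
    match = _SECTION.match(citation)
    if match is None:
        raise ValueError(f"unrecognized citation format: {citation!r}")
    title, section = match.groups()
    return f"{title} U.S.C. § {section}"


print(to_section_level("26 U.S.C. § 501(c)(3)"))    # 26 U.S.C. § 501
print(to_section_level("26 U.S.C. § 501 (c) (3)"))  # 26 U.S.C. § 501
```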

There are two sources of structure within the United States Code. The explicitly defined structure / linkage / dependency derives from the sections contained under a given title. The more nuanced version of structure is obtained from references or definitions contained within particular sections. This class of connections not only links sections within a given title but also connects sections across titles. In the visual above, we represent these important cross-title references by coloring them red.
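
One way to hold both kinds of structure in a single edge list is sketched below in Python. The node-ID scheme (`"title:section"`), the `Edge` type and the three edges are all invented for illustration; they are not drawn from our data set.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str  # hypothetical ID scheme: "26" is a title, "26:501" a section
    target: str
    kind: str    # "hierarchy" (title contains section) or "reference"


def title_of(node: str) -> str:
    """Return the title portion of a node ID."""
    return node.split(":", 1)[0]


def is_cross_title(edge: Edge) -> bool:
    """True for reference edges whose endpoints sit in different titles."""
    return edge.kind == "reference" and title_of(edge.source) != title_of(edge.target)


# Toy edges, made up to exercise the functions.
edges = [
    Edge("26", "26:501", "hierarchy"),
    Edge("26:501", "26:170", "reference"),
    Edge("11:101", "26:501", "reference"),
]

# These are the edges that would be drawn in red.
cross_title_edges = [e for e in edges if is_cross_title(e)]
print(cross_title_edges)
```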

Taken together, this full graph of the United States Code is quite large (i.e., a directed graph with |V| = 37500, |E| = 197749). There exist 37,500 total sections distributed across the 50 Titles. However, these sections are not distributed in a uniform manner. For example, components such as Title 1 feature very few sections while Titles such as 26 and 42 contain many sections. The number of edges far outstrips the number of vertices, with a total of 197,000+ edges in the graph.
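
To put those two counts in perspective, a quick back-of-the-envelope calculation in Python follows. The density figure assumes a simple directed graph (no self-loops, no repeated edges), which is our simplifying assumption here rather than a verified property of the data.

```python
num_vertices = 37_500   # |V|, sections in the Code
num_edges = 197_749     # |E|, directed edges

# In a directed graph every edge adds one unit of out-degree and one of
# in-degree, so mean in-degree equals mean out-degree equals |E| / |V|.
mean_degree = num_edges / num_vertices

# Share of all possible ordered pairs of distinct sections that are linked.
density = num_edges / (num_vertices * (num_vertices - 1))

print(f"mean in/out-degree: {mean_degree:.2f}")  # about 5.27
print(f"density: {density:.2e}")
```

In other words, the average section has roughly five incoming and five outgoing edges, yet only a tiny fraction of all possible section pairs are connected.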

Seadragon has a number of nice features which enhance the experience of the end user. For example, a user can drag the image around by clicking and holding down the mouse button. Most important is the symbol shown to the left (Picture 1). If you run your mouse over the zoomable visual above … look for this symbol to appear in the southeast corner. Click on it and it will make the visual full size … as you will see … the full size visual makes for a far more compelling HCI.

Power Laws, Preferential Attachment and Positive Legal Theory [Part 2] [Repost]

Law as a Complex System?

As was stated in Part 1 of this thread, it is by no means a given that the statistical artifact displayed above would appear. Namely, such large scale patterns need not assume this flavor as many social and physical systems feature substantially different properties.

For purposes of generating an empirically grounded theory of American Common Law development … explaining these artifacts would seem to be critical. Fortunately, with respect to the above pattern, there exists a definable set of generative processes plausibly responsible for producing what is displayed. While certainly not the only generative process responsible for a power law, the preferential attachment model, first outlined in the physics literature by Barabási & Albert, is among the likely candidates.
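
For readers who have not seen preferential attachment in action, here is a minimal Python simulation. It is a simplified, directed variant in the spirit of the model and not Barabási & Albert's exact specification: each new node cites exactly one earlier node, chosen with probability proportional to that node's in-degree plus one. The function name and both of those modeling choices are our assumptions.

```python
import random
from typing import List


def grow_citation_network(n_nodes: int, seed: int = 0) -> List[int]:
    """Grow a toy citation network and return each node's in-degree."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    rng = random.Random(seed)
    in_degree = [0]
    # The urn holds each node once, plus once more per citation received.
    # A uniform draw from it is therefore a draw proportional to
    # (in-degree + 1): the "rich get richer" step.
    urn = [0]
    for new_node in range(1, n_nodes):
        cited = rng.choice(urn)
        in_degree[cited] += 1
        urn.append(cited)
        in_degree.append(0)
        urn.append(new_node)
    return in_degree


degrees = grow_citation_network(10_000)
print("max in-degree:", max(degrees))
print("median in-degree:", sorted(degrees)[len(degrees) // 2])
```

Running it should show a handful of heavily cited nodes alongside a large mass of nodes cited rarely or never, the kind of skew discussed above.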

Confronting much of the extant literature, we query whether a closed-form, equilibrium-based analytical apparatus (punctuated or otherwise) is up to the task of describing the relevant dynamics. If anything, the distributions displayed above provide first-order evidence of a system which is likely to feature dynamics of a non-linear flavor. Indeed, while significant work still remains, the weight of available evidence indicates Law is a Complex Adaptive System. As such, we believe it would be appropriate to leverage the methods typically reserved for the study of complexity. For purposes of generating positive legal theory, we believe agent-based models, dynamic network analysis and other methods of computational social science offer great potential. We encourage scholars to consider learning more about these approaches.

Sea Dragon Visualization of the American Legal Academy


Here is another visual run through the SeaDragon Visualization from Microsoft Labs.  Similar to the Title 17 United States Code visual from earlier in the week, one can zoom in and out. Using the button in the far southeast corner it is possible to generate a full screen visual of the network.

Previous posts discussing this visualization are located here and here. In addition, results of our model of diffusion on the network are located here while an interactive version of the agent-based model generated in NetLogo is located here. For those interested in the full draft … it is entitled Reproduction of Hierarchy? A Social Network Analysis of the American Law Professoriate.

Citation Analysis in Continental Jurisdictions

Citation Analysis

Anton Geist has posted Using Citation Analysis Techniques for Computer-Assisted Legal Research in Continental Jurisdictions to SSRN. While this is certainly longer than most papers, we believe it offers a good review of the broader information retrieval and law literature. In addition, it offers some empirical insight into citation patterns within continental jurisdictions. The findings in this paper are similar to those shown in important papers by Thomas Smith in The Web of the Law and by David Post & Michael Eisen in How Long is the Coastline of Law? Thoughts on the Fractal Nature of Legal Systems.

In our view, the next step for this research is to determine whether the pattern does indeed follow a power law distribution. Specifically, there exists a Maximum Likelihood based test developed in the applied physics paper Power-law Distributions in Empirical Data by Aaron Clauset, Cosma Shalizi and Mark Newman which can help adjudicate whether the detected pattern merely represents a highly skewed distribution or is indeed a power law.
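
As a starting point, the maximum likelihood estimator for the exponent of a continuous power law with a known lower cutoff can be sketched in a few lines of Python. This covers only the estimation step: the full procedure in that paper also selects the cutoff and runs goodness-of-fit and model comparison tests, none of which appear here. Citation counts are discrete, so the continuous formula should be read as an approximation. The function names and the fixed `x_min` are our assumptions.

```python
import math
import random
from typing import Iterable, List, Tuple


def power_law_alpha_mle(data: Iterable[float], x_min: float) -> Tuple[float, float]:
    """Continuous MLE: alpha = 1 + n / sum(ln(x_i / x_min)) over x_i >= x_min.

    Returns (alpha, standard error), with the standard error
    approximated by (alpha - 1) / sqrt(n).
    """
    if x_min <= 0:
        raise ValueError("x_min must be positive")
    tail = [x for x in data if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum <= 0:
        raise ValueError("need at least one observation strictly above x_min")
    n = len(tail)
    alpha = 1.0 + n / log_sum
    return alpha, (alpha - 1.0) / math.sqrt(n)


def sample_power_law(alpha: float, x_min: float, n: int, seed: int = 0) -> List[float]:
    """Draw synthetic power-law data by inverse-transform sampling."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Sanity check on synthetic data: the estimate should land near 2.5.
synthetic = sample_power_law(alpha=2.5, x_min=1.0, n=20_000, seed=42)
alpha_hat, std_err = power_law_alpha_mle(synthetic, x_min=1.0)
print(f"alpha = {alpha_hat:.3f} +/- {std_err:.3f}")
```

An estimated exponent alone does not establish that the data follow a power law; that is exactly why the goodness-of-fit step matters.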

Either way, we are excited by this paper as we believe comparative research is absolutely critical to broader theory development.

Law as a Seamless Web? Part III

Seamless Web III

This is the third installment of posts related to our paper Law as a Seamless Web? Comparison of Various Network Representations of the United States Supreme Court Corpus (1791-2005). Previous posts can be found (here) and (here). As previewed in the earlier posts, we believe comparing the Union, the Intersect and the Complement of the SCOTUS semantic and citation networks is at the heart of an empirical evaluation of Law as a Seamless Web … from the paper …

“Though law is almost certainly a web, questions regarding its interconnectedness remain. Building upon themes of Maitland, Professor Solum has properly raised questions as to whether or not the web of law is “seamless”. By leveraging the tools of computer science and applied graph theory, we believe that an empirical evaluation of this question is at last possible.  In that vein, consider Figure 9, which offers several possible topological locations that might be populated by components of the graphs discussed herein. We believe future research should consider the relevant information contained in the union, intersection, and complement of our citation and semantic networks.

While we leave a detailed substantive interpretation for subsequent work, it is worth broadly considering the information defined in Figure 9.  For example, the intersect (∩) displayed in Figure 9 defines the set of cases that feature both semantic similarity and a direct citation linkage. In general, these are likely communities of well-defined topical domains.  Of greater interest to an empirical evaluation of the law as a seamless web, is likely the magnitude and composition of the Citation Only and Semantic Only subsets.  Subject to future empirical investigation, we believe the Citation Only components of the graph may represent the exact type of concept exportation to and from particular semantic domains that would indeed make the law a seamless web.”
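
As a toy illustration of the union, intersection and complement described in the excerpt above, the Python sketch below compares two small edge sets. The case names and edges are invented, and treating citations as undirected pairs so that they can be compared with (symmetric) semantic similarity is our simplifying assumption, not necessarily the paper's construction.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def as_undirected(edges: Iterable[Tuple[str, str]]) -> Set[Pair]:
    """Drop edge direction (and self-loops) so two networks can be compared."""
    return {frozenset(edge) for edge in edges if edge[0] != edge[1]}


# Invented toy networks over four hypothetical cases.
citation = as_undirected([("Case A", "Case B"), ("Case B", "Case C"), ("Case C", "Case D")])
semantic = as_undirected([("Case B", "Case A"), ("Case A", "Case C")])

union = citation | semantic          # linked in either network
both = citation & semantic           # cited AND semantically similar
citation_only = citation - semantic  # cited without semantic similarity
semantic_only = semantic - citation  # similar but never cited

print(len(union), len(both), len(citation_only), len(semantic_only))  # 4 1 2 1
```

The `citation_only` set corresponds to the Citation Only region that the excerpt flags as most relevant to the seamless web question.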