Tax Day: A Mathematical Approach to the Study of the United States Code

United States Code

April 15th is Tax Day! Unless you’ve filed for an extension or you’re a corporation on your own fiscal year, you’ve hopefully finished your taxes by now!

While you were filing your return, you may have noticed references to the Internal Revenue Code (IRC). The IRC, also known as Title 26, is what most people simply call the “Tax Code.” Along with the Treasury Regulations compiled into the Code of Federal Regulations (26 C.F.R.), the Internal Revenue Code contains many of the rules and regulations governing how we can and can’t file our taxes. Even if you prepared your taxes using software like TurboTax, the questions generated by these programs are determined by the rules and regulations within the Tax Code and Treasury Regulations.

Many argue that there are too many of these rules and regulations or that they are too complex. Many also claim that the “Tax Code” is becoming larger or more complex over time. Unfortunately, most individuals do not support this claim with solid data. When they do, they often rely on the number of pages in either Title 26 or the CCH Standard Federal Tax Reporter. Neither of these measures takes into consideration the real complexity of the Code, however.

In honor of Tax Day, we’re going to highlight a recent paper that we’ve written that tries to address some of these issues – A Mathematical Approach to the Study of the United States Code. The first point to make is that this paper is a study of the entire United States Code. Title 26, the Tax Code, is actually only one small part of the set of rules and regulations defined in the United States Code. The United States Code as a whole is the largest and arguably most important source of Federal statutory law. Compiled from the legislation and treaties published in the Statutes at Large in 6-year intervals, the entire document contains over 22 million words.

In this paper, we develop a mathematical approach to the study of large bodies of statutory law and, in particular, the United States Code. This approach is guided by a representation of the Code as an object with three primary parts:

  1. A hierarchical network that corresponds to the structure of concept categories.
  2. A citation network that encodes the interdependence of concepts.
  3. The language contained within each section.
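
For readers who like to see things concretely, here is a minimal sketch of how the three parts listed above might sit together in one object. This is not the data model from the paper: the class name `CodeRepresentation`, its field names, and the toy section identifiers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CodeRepresentation:
    """Hypothetical container for the three-part view of the Code."""

    # 1. Hierarchy: child element -> parent element (e.g. section -> title).
    parent: dict[str, str] = field(default_factory=dict)
    # 2. Citations: directed (citing section, cited section) pairs.
    citations: set[tuple[str, str]] = field(default_factory=set)
    # 3. Language: section identifier -> text of that section.
    text: dict[str, str] = field(default_factory=dict)

    def summary(self) -> dict[str, int]:
        """Count sections, citations, and whitespace-delimited words."""
        return {
            "sections": len(self.text),
            "citations": len(self.citations),
            "words": sum(len(t.split()) for t in self.text.values()),
        }


# Toy data only: the identifiers and text are placeholders, not real statutes.
toy = CodeRepresentation(
    parent={"sec-A": "title-X", "sec-B": "title-X"},
    citations={("sec-B", "sec-A")},
    text={
        "sec-A": "placeholder text for section A",
        "sec-B": "section B refers to section A",
    },
)
print(toy.summary())
```

Keeping the three parts separate means each one can be counted or compared across snapshots on its own.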

Given this representation, we then calculate a number of properties for the United States Code in 2008, 2009, and 2010 as provided by the Legal Information Institute at the Cornell University Law School. Our results can be summarized in three points:

  1. The structure of the United States Code is growing by over 3 sections per day.
  2. The interdependence of the United States Code is increasing by over 7 citations per day.
  3. The amount of language in the United States Code is increasing by over 2,000 words per day.
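
The per-day figures above are average rates of change between snapshots of the Code. As a rough sketch of that arithmetic, assuming two dated snapshots with a simple count in each, the rate is just the difference divided by the number of days. The function name `daily_growth` and the counts below are hypothetical; they are toy inputs, not numbers from the paper.

```python
from datetime import date


def daily_growth(count_start: int, count_end: int, start: date, end: date) -> float:
    """Average change per calendar day between two snapshots."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    return (count_end - count_start) / days


# Toy inputs, NOT figures from the paper: a count that rises by 1,095 in 365 days.
rate = daily_growth(1000, 2095, date(2009, 1, 1), date(2010, 1, 1))
print(rate)
```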

The figure above is an actual image of the structure and interdependence of the United States Code. The black lines correspond to structure and the red lines correspond to interdependence.  Though visually stunning, the true implication of this figure is that the United States Code is a very interdependent set of rules and regulations, both within and across concept categories.

If you’re interested in more detail, make sure to read the paper – A Mathematical Approach to the Study of the United States Code. If you’re really interested, make sure to check back in the near future for our forthcoming paper entitled Measuring the Complexity of the United States Code.

Relational Topic Models for Document Networks — Chang & Blei

This is a very important paper by Jonathan Chang and David Blei. Suffice it to say, it has potential use in a wide class of social science applications. Click here to access related material on Professor Blei’s Princeton homepage. Click here for some slides (note: 7.0 MB). Check it out!

Slides from our Presentation at UPenn Computational Linguistics (CLUNCH) / Linguistic Data Consortium (LDC)

We have spent the past couple of days at the University of Pennsylvania, where we presented information about our efforts to compile a complete United States Supreme Court Corpus. As noted in the slides below, we are interested in creating a corpus containing not only every SCOTUS opinion, but also every SCOTUS disposition from 1791 to 2010. Slight variants of the slides below were presented at the Penn Computational Linguistics Lunch (CLunch) and the Linguistic Data Consortium (LDC). We really appreciated the feedback and are looking forward to continuing our work with the LDC. For those who might be interested, take a look at the slides embedded below or click on this link:

Law as a Seamless Web … Poster for WIN Conference @ NYU Stern

Seamless Web Poster

As we mentioned in previous posts, Seadragon is a really cool product. Please note that load times may vary depending upon your specific machine configuration as well as the strength of your internet connection. For those not familiar with how to operate it, please see below. In our view, Full Screen is the best way to go ….

Law as a Seamless Web? Part III

Seamless Web III

This is the third installment of posts related to our paper Law as a Seamless Web? Comparison of Various Network Representations of the United States Supreme Court Corpus (1791-2005). Previous posts can be found (here) and (here). As previewed in the earlier posts, we believe comparing the Union, the Intersect and the Complement of the SCOTUS semantic and citation networks is at the heart of an empirical evaluation of Law as a Seamless Web. From the paper:

“Though law is almost certainly a web, questions regarding its interconnectedness remain. Building upon themes of Maitland, Professor Solum has properly raised questions as to whether or not the web of law is “seamless”. By leveraging the tools of computer science and applied graph theory, we believe that an empirical evaluation of this question is at last possible.  In that vein, consider Figure 9, which offers several possible topological locations that might be populated by components of the graphs discussed herein. We believe future research should consider the relevant information contained in the union, intersection, and complement of our citation and semantic networks.

While we leave a detailed substantive interpretation for subsequent work, it is worth broadly considering the information defined in Figure 9.  For example, the intersect (∩) displayed in Figure 9 defines the set of cases that feature both semantic similarity and a direct citation linkage. In general, these are likely communities of well-defined topical domains.  Of greater interest to an empirical evaluation of the law as a seamless web, is likely the magnitude and composition of the Citation Only and Semantic Only subsets.  Subject to future empirical investigation, we believe the Citation Only components of the graph may represent the exact type of concept exportation to and from particular semantic domains that would indeed make the law a seamless web.”
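
To make the union, intersect, and complement idea concrete, here is a minimal sketch using plain Python sets. It assumes both networks are built over the same case identifiers and, as a simplification that is our assumption rather than something stated in the paper, treats citations as undirected so they can be compared with semantic edges. The helper `undirected` and the case labels are hypothetical toy values.

```python
def undirected(edges):
    """Normalize edges so that (a, b) and (b, a) compare equal."""
    return {frozenset(edge) for edge in edges}


# Toy edge lists over hypothetical case identifiers A..E.
citation = undirected([("A", "B"), ("B", "C"), ("C", "D")])
semantic = undirected([("B", "A"), ("C", "D"), ("D", "E")])

both = citation & semantic            # intersect: cited AND semantically similar
citation_only = citation - semantic   # cited, but not semantically similar
semantic_only = semantic - citation   # similar, but no direct citation
either = citation | semantic          # union of the two networks

print(len(both), len(citation_only), len(semantic_only), len(either))
```

The Citation Only and Semantic Only subsets discussed in the quotation correspond to the two set differences here.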

Law as a Seamless Web? Part II

Semantic Network
In our paper Law as a Seamless Web, we offer a first-order method to generate case-to-case and opinionunit-to-opinionunit semantic networks. As constructed in the figure above, nodes represent cases decided between 1791 and 1865, while edges are drawn when two cases possess a certain threshold of semantic similarity. Except for the definition of edges, the process of constructing the semantic graph is identical to that of the citation graph we offered in the prior post. While computer science/computational linguistics offers a variety of possible semantic similarity measures, we choose to employ a commonly used measure. Here is a description from the paper:

“Semantic similarity measures are the focus of significant work in computational linguistics. Given the scope of the dataset, we have chosen a first-order method for calculating similarity.  After lemmatizing the text of the case with WordNet, we store the nouns with the top N frequencies for each case or opinion unit. We define the similarity between two cases or opinion units A and B as the percentage of words that are shared between the top words of A and top words of B.

An edge exists between A and B in the set of edges if σ(A,B) exceeds some threshold. This threshold is the minimum similarity necessary for the graph to represent the presence of a semantic connection.”
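
The quoted description can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the paper: it takes already lemmatized noun lists as input (the WordNet step is skipped), it assumes the "percentage shared" is the overlap divided by N, and the function names and toy cases are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_nouns(nouns, n):
    """Return the set of the n most frequent items in a list of lemmatized nouns."""
    return {word for word, _ in Counter(nouns).most_common(n)}


def similarity(top_a, top_b, n):
    """Shared top words as a fraction of n (assumed denominator)."""
    return len(top_a & top_b) / n


def semantic_edges(cases, n, threshold):
    """Draw an edge between two cases when their similarity exceeds the threshold."""
    tops = {case_id: top_nouns(nouns, n) for case_id, nouns in cases.items()}
    return {
        (a, b)
        for a, b in combinations(sorted(tops), 2)
        if similarity(tops[a], tops[b], n) > threshold
    }


# Toy input: hypothetical case ids mapped to pre-lemmatized nouns.
cases = {
    "c1": ["tax"] * 4 + ["income"] * 3 + ["court"] * 2 + ["state"],
    "c2": ["tax"] * 3 + ["income"] * 2 + ["statute"],
    "c3": ["contract"] * 3 + ["damages"] * 2 + ["court"],
}
edges = semantic_edges(cases, n=3, threshold=0.5)
print(edges)
```

Raising the threshold thins the graph; lowering it connects more loosely related cases.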

As this is a technical paper, it is slanted toward demonstrating proof of methodological concept rather than covering significant substantive ground. With that said, we do offer a hint of our broader substantive goal of detecting the spread of legal concepts between various topical domains. Specifically, with respect to enriching positive political theory, we believe the union, intersect and complement of the semantic and citation networks are really important. More on this point is forthcoming in a subsequent post…

Artificial Intelligence and Law — Barcelona 2009

AI & Law

Live from Barcelona, we are on the road at the International Association for Artificial Intelligence and Law.  Henry Prakken has just delivered the keynote address and we will soon be giving our presentation. The conference is interesting as it embraces a wide range of topics and intellectual traditions. For example, there is a significant emphasis on ontological reasoning, computational models of argumentation and the use of XML schemas. In addition, there are a number of folks using graph theoretic techniques and applying them to the development of the law. It has been a nice few days and we have enjoyed our time here. Tomorrow, the trip continues…. 

Tax Day! A First-Order History of the Supreme Court and Tax


Click to view the full image.

In honor of Tax Day, we’ve produced a simple time series representation of the Supreme Court and tax. The above plot shows how often the word “tax” occurs in the cases of the Supreme Court for each year – that is, what proportion of all words in every case in a given year are the word “tax.” The underlying data includes non-procedural cases from 1790 to 2004. The arrows highlight important legislation and cases for income tax as well.
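
For the curious, the proportion described above can be sketched as follows. This is a minimal illustration rather than the code behind the plot: it assumes each case arrives as a (year, text) pair, uses a simple lowercase letters-only tokenizer, and counts only exact matches of "tax" (so "taxes" is not counted). The function name `word_share_by_year` and the sample texts are hypothetical.

```python
import re
from collections import defaultdict


def word_share_by_year(cases, target="tax"):
    """Return {year: share of all tokens in that year's cases equal to target}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in cases:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token == target)
    # Skip years with no tokens to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Toy input only: made-up snippets, not real opinions.
sample = [
    (1900, "The tax is a direct tax"),
    (1900, "No duty applies"),
    (1901, "Income tax upheld today"),
]
shares = word_share_by_year(sample)
print(shares)
```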

Make sure to click through the image to view the full size.

Happy Tax Day!