Law is Code: A Software Engineering Approach to Analyzing the United States Code

William Li, Pablo Azar, David Larochelle, Phil Hill & Andrew Lo, Law is Code: A Software Engineering Approach to Analyzing the United States Code

ABSTRACT:  “The agglomeration of rules and regulations over time has produced a body of legal code that no single individual can fully comprehend. This complexity produces inefficiencies, makes the processes of understanding and changing the law difficult, and frustrates the fundamental principle that the law should provide fair notice to the governed. In this article, we take a quantitative, unbiased, and software-engineering approach to analyze the evolution of the United States Code from 1926 to today. Software engineers frequently face the challenge of understanding and managing large, structured collections of instructions, directives, and conditional statements, and we adapt and apply their techniques to the U.S. Code over time. Our work produces insights into the structure of the U.S. Code as a whole, its strengths and vulnerabilities, and new ways of thinking about individual laws. For example, we identify the first appearance and spread of important terms in the U.S. Code like “whistleblower” and “privacy.” We also analyze and visualize the network structure of certain substantial reforms, including the Patient Protection and Affordable Care Act (PPACA) and the Dodd-Frank Wall Street Reform and Consumer Protection Act, and show how the interconnections of references can increase complexity and create the potential for unintended consequences. Our work is a timely illustration of computational approaches to law as the legal profession embraces technology for scholarship, to increase efficiency, and to improve access to justice.”

Mike and I are excited to see this paper as it is related to two of our prior papers:
Daniel Martin Katz & Michael J. Bommarito II, Measuring the Complexity of the Law: The United States Code, 22 Journal of Artificial Intelligence & Law 1 (2014)

Michael J. Bommarito II & Daniel Martin Katz, A Mathematical Approach to the Study of the United States Code, 389 Physica A 4195 (2010)

Measuring the Complexity of the Law: The United States Code (By Daniel Martin Katz & Michael J. Bommarito)

From our abstract:  “Einstein’s razor, a corollary of Ockham’s razor, is often paraphrased as follows: make everything as simple as possible, but not simpler.  This rule of thumb describes the challenge that designers of a legal system face—to craft simple laws that produce desired ends, but not to pursue simplicity so far as to undermine those ends.  Complexity, simplicity’s inverse, taxes cognition and increases the likelihood of suboptimal decisions.  In addition, unnecessary legal complexity can drive a misallocation of human capital toward comprehending and complying with legal rules and away from other productive ends.

While many scholars have offered descriptive accounts or theoretical models of legal complexity, empirical research to date has been limited to simple measures of size, such as the number of pages in a bill.  No extant research rigorously applies a meaningful model to real data.  As a consequence, we have no reliable means to determine whether a new bill, regulation, order, or precedent substantially effects legal complexity.

In this paper, we address this need by developing a proposed empirical framework for measuring relative legal complexity.  This framework is based on “knowledge acquisition,” an approach at the intersection of psychology and computer science, which can take into account the structure, language, and interdependence of law. We then demonstrate the descriptive value of this framework by applying it to the U.S. Code’s Titles, scoring and ranking them by their relative complexity.  Our framework is flexible, intuitive, and transparent, and we offer this approach as a first step in developing a practical methodology for assessing legal complexity.”

This is a draft version, so we invite your comments.  Also, for those who might be interested, we are building out a full replication page for the paper.  In the meantime, all of the relevant code and data can be accessed at GitHub and from the Cornell Legal Information Institute.

UPDATE: Paper was named “Download of the Week” by Legal Theory Blog.

Building a Better Legal Search Engine, Part 1: Searching the U.S. Code

Cross Post from Michael Bommarito’s Blog – “Last week, I mentioned that I am excited to give a keynote in two weeks on Law and Computation at the University of Houston Law Center alongside Stephen Wolfram, Carl Malamud, Seth Chandler, and my buddy Dan Katz from here at the CLS Blog.  The first part in my blog series leading up to this talk will focus on indexing and searching the U.S. Code with structured, public domain data and open source software.

Before diving into the technical aspects, I thought it would be useful to provide some background on what the U.S. Code is and why it exists.  Let’s start with an example – the Dodd-Frank Wall Street Reform and Consumer Protection Act. After the final version of HR 4173 was passed by both houses and enrolled in July of 2010, it received a new identifier, Public Law 111-203.  This public law, along with private laws, resolutions, amendments, and proclamations, is published in order of enactment in the Statutes at Large.  The Statutes at Large is therefore a compilation of all these sources of law dating back to the Declaration of Independence itself, and as such, is the authoritative source of statutory law.

If we think about the organization and contents of the Statutes at Large, it quickly becomes clear why the Code exists.  The basic task of a legal practitioner is to determine what the state of law is with respect to a given set of facts at a certain time, typically now.  Let’s return to the Dodd-Frank example above.  Let’s say we’re in the compliance department at a financial institution and we’d like to know how the new proprietary trading rules affect us. To do this, we might perform the following tasks:

  • Search for laws by concept, e.g., depository institution or derivative.
  • Ensure that these laws are current and comprehensive.
  • Build a set of rules or guidelines from these laws.
  • Interpret these rules in the context of our facts.


However, the Statutes at Large is not well-suited to these tasks.

  • It is sorted by date of enactment, not by concept.
  • It contains laws that may affect multiple legal concepts.
  • It contains laws that reference other laws for definitions or rules.
  • It contains laws that amend or repeal other laws.


Based on our goal and these properties of the Statutes, we need to perform an exhaustive search every time we have a new question. This is pretty clearly bad if we want to get anything done (but hey, maybe you’re not in-house and you bill by the hour). So what might we do to re-organize the Statutes to make it easier for us to use the law?

  • Organize the law by concept, possibly hierarchically.
  • Combine laws that refer to or amend one another.
  • Remove laws that have expired or have been repealed.
  • Provide convenient citations or identifiers for legal concepts.


A systematic organization of the Statutes at Large that followed these rules would make our lives significantly easier. We could search for concepts and use the hierarchical context of these results to navigate related ideas. We could rest assured that the material we read was near-comprehensive and current. Furthermore, we could communicate more succinctly by referencing a small number of organized sections instead of hundreds of Public Laws.

As you might have guessed, this organizational scheme defines the United States Code as produced by the Office of the Law Revision Counsel. While the LRC traditionally distributes copies of the Code as ASCII files on CD-ROMs, they recently began distributing copies of the code in XHTML. We’ll be using these copies to build our index, so if you’d like to follow along, you should download them from here –

If we’d like to build a legal search engine, the Code is arguably the best place to start. While there are other important statutory and judicial sources like the Code of Federal Regulations or the Federal Reporter, the Code is as close to capital-L Law as it gets.

In this part of the post series, I’m going to build an index of the text of the Code from the 2009 and 2010 LRC snapshots. To do this, we’ll use the excellent Apache Lucene library for Java. Lucene is, in their own words, “a high-performance, full-featured text search engine library written entirely in Java.” As we’ll see in later posts, Lucene (with its sister project, Solr) is a very easy and powerful tool to develop fast, web-based search interfaces. Before we dive into the code below the break, let’s take a look at what we’re working towards. Below is a search for the term “swap” across the entire Code. We’re displaying the top five results, and these were produced in a little over a second on my laptop.”

To view the images, click over to Michael Bommarito’s Blog (click here for direct access). Additional technical specifications and code are also available. 
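For readers who want a feel for what such an index does before clicking through, here is a minimal sketch in Python. It is not the Lucene code from Mike's post. It is a toy inverted index in pure Python, and the section identifiers (sec-1, sec-2, sec-3) and their text are invented placeholders rather than real provisions of the Code. Ranking by raw term frequency is also an assumption made for brevity; Lucene's scoring is considerably more sophisticated.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sections: the identifiers and text are invented for illustration.
SECTIONS = {
    "sec-1": "A swap is an agreement. An interest rate swap is one kind of swap.",
    "sec-2": "A security-based swap is a swap based on a single security.",
    "sec-3": "An exempt organization is not subject to this tax.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(sections):
    """Map each term to a Counter of {section id: term frequency}."""
    index = defaultdict(Counter)
    for section_id, text in sections.items():
        for token in tokenize(text):
            index[token][section_id] += 1
    return index


def search(index, term, top_n=5):
    """Return up to top_n (section id, frequency) pairs, most frequent first."""
    postings = index.get(term.lower(), Counter())
    # Sort by descending frequency, breaking ties by section id.
    return sorted(postings.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


index = build_index(SECTIONS)
results = search(index, "swap")
print(results)
```

The point of the structure is that a query touches only the postings for the queried term, rather than rescanning every section, which is the same basic idea that lets a real search library answer quickly over the full Code.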


Measuring the Complexity of the Law: The United States Code

Understanding the sources of complexity in legal systems is a matter long considered by legal commentators. In tackling the question, scholars have applied various approaches including descriptive, theoretical and, in some cases, empirical analysis. The list is long but would certainly include work such as Long & Swingen (1987), Schuck (1992), White (1992), Kaplow (1995), Epstein (1997), Kades (1997), Wright (2000) and Holz (2007). Notwithstanding the significant contributions made by these and other scholars, we argue that an extensive empirical inquiry into the complexity of the law still remains to be undertaken.

While certainly just a slice of the broader legal universe, the United States Code represents a substantively important body of law familiar to both legal scholars and laypersons. In published form, the Code spans many volumes. Those volumes feature hundreds of thousands of provisions and tens of millions of words. The United States Code is obviously complicated; however, measuring its size and complexity has proven to be non-trivial.

In our paper entitled A Mathematical Approach to the Study of the United States Code, we hope to contribute to the effort by formalizing the United States Code as a mathematical object with a hierarchical structure, a citation network and an associated text function that projects language onto specific vertices.

In the visualization above, Figure (a) is the full United States Code visualized to the section level. In other words, each ring is a layer of a hierarchical tree that halts at the section level. Of course, many sections feature a variety of nested sub-sections, etc. For example, the well known 26 U.S.C. 501(c)(3) is only shown above at the depth of Section 501.  If we added all of these layers there would simply be additional rings. For those interested in the visualization of specific Titles of the United States Code … we have previously created fully zoomable visualizations of Title 17 (Copyright), Title 11 (Bankruptcy),  Title 26 (Tax) [at section depth], Title 26 (Tax) [Capital Gains & Losses] as well as specific pieces of legislation such as the original Health Care Bill — HR 3962.

In the visualization above, Figure (b) combines this hierarchical structure together with a citation network.  We have previously visualized the United States Code citation network and have a working paper entitled Properties of the United States Code Citation Network. Figure (b) is thus a realization of the full United States Code through the section level.

With this representation in place, it is possible to measure the size of the Code using its various structural features such as its vertices V and its edges E.  It is possible to measure the full Code at various time snapshots and consider whether the Code is growing or shrinking. Using a limited window of data, we observe growth not only in the size of the Code but also in its network of dependencies (i.e. its citation network).
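To make the representation concrete, here is a minimal sketch in Python, assuming a very simple encoding of our own choosing: a parent map for the hierarchy, a text map for the language attached to each vertex, and a set of directed pairs for citations. The class name CodeSnapshot, the helper build_snapshot, and the vertex labels are all hypothetical and do not come from the paper or its replication code.

```python
from dataclasses import dataclass, field


@dataclass
class CodeSnapshot:
    """One snapshot: a hierarchy tree, attached text, and a citation network."""
    parent: dict = field(default_factory=dict)   # vertex -> parent vertex (None for the root)
    text: dict = field(default_factory=dict)     # vertex -> text attached to that vertex
    citations: set = field(default_factory=set)  # directed (source, target) pairs

    def add_vertex(self, vertex, parent=None, text=""):
        self.parent[vertex] = parent
        self.text[vertex] = text

    def add_citation(self, source, target):
        if source not in self.parent or target not in self.parent:
            raise KeyError("both endpoints must already be vertices")
        self.citations.add((source, target))

    def size(self):
        """Return (vertex count, hierarchy edge count, citation edge count)."""
        tree_edges = sum(1 for p in self.parent.values() if p is not None)
        return len(self.parent), tree_edges, len(self.citations)


def build_snapshot(vertices, citations):
    """Build a snapshot from (vertex, parent, text) triples and citation pairs."""
    snapshot = CodeSnapshot()
    for vertex, parent, text in vertices:
        snapshot.add_vertex(vertex, parent, text)
    for source, target in citations:
        snapshot.add_citation(source, target)
    return snapshot


# Two invented snapshots: the later one adds a section and a citation.
OLD_VERTICES = [
    ("code", None, ""),
    ("code/title-1", "code", ""),
    ("code/title-1/sec-1", "code/title-1", "Definitions."),
    ("code/title-1/sec-2", "code/title-1", "See section 1."),
]
OLD_CITATIONS = [("code/title-1/sec-2", "code/title-1/sec-1")]
NEW_VERTICES = OLD_VERTICES + [("code/title-1/sec-3", "code/title-1", "See section 1.")]
NEW_CITATIONS = OLD_CITATIONS + [("code/title-1/sec-3", "code/title-1/sec-1")]

old = build_snapshot(OLD_VERTICES, OLD_CITATIONS)
new = build_snapshot(NEW_VERTICES, NEW_CITATIONS)
growth = tuple(after - before for before, after in zip(old.size(), new.size()))
print(old.size(), new.size(), growth)
```

Comparing size() across snapshots is the toy analogue of asking whether the Code and its citation network are growing or shrinking over time.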

Of course, growth in the size of the United States Code alone is not necessarily analogous to an increase in complexity.  Indeed, while we believe in general the size of the Code tends to contribute to “complexity,” some additional measures are needed.  Thus, our paper features structural measurements such as number of sections, section sizes, etc.

In addition, we apply the well known Shannon Entropy measure (borrowed from Information Theory) to evaluate the “complexity” of the message passing / language contained therein.  Shannon Entropy has a long intellectual history and has been used as a measure of complexity by many scholars.  Here is the formula for Shannon entropy:
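The formula image from the original post is not reproduced here, so what follows is the standard textbook definition rather than a copy of that figure. For a discrete distribution with probabilities p_1, ..., p_n, Shannon entropy in bits is H = -(p_1 log2 p_1 + ... + p_n log2 p_n). The short Python sketch below applies that definition to the word frequencies of a piece of text. Treating lowercase alphabetic tokens as the symbols is our assumption for illustration; the paper's own tokenization may differ.

```python
import math
import re
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy, in bits, of the word distribution in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # H = -sum(p * log2(p)) over the observed word probabilities.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


repetitive = shannon_entropy("tax tax tax tax")
mixed = shannon_entropy("tax credit tax credit")
varied = shannon_entropy("tax credit income deduction")
print(repetitive, mixed, varied)
```

Text that repeats a single word carries no entropy, while text that spreads its mass evenly over more distinct words scores higher, which is the intuition behind using the measure as one ingredient of linguistic complexity.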

For those interested in reviewing the full paper, it is forthcoming in Physica A: Statistical Mechanics and its Applications. For those not familiar, Physica A is a journal published by Elsevier and is a popular outlet for Econophysics and Quantitative Finance. A current draft of the paper is available on the SSRN and the physics arXiv.

We are currently working on a follow up paper that is longer, more detailed and designed for a general audience.  Even if you have little or no interest in the analysis of the United States Code, we hope principles such as entropy, structure, etc. will prove useful in the measurement of other classes of legal documents including contracts, treaties, administrative regulations, etc.

Computational Legal Studies – The Interactive Gallery

Click on the above picture and you will be taken to the Interactive Gallery of Computational Legal Studies. Once inside the gallery, click on any thumbnail to see the full size image. Each image features a link to supporting materials such as documentation and/or the underlying academic paper. We hope to add more content to the gallery over the coming weeks and months, so please check back!  Please note that load time may vary depending upon your connection, machine, etc.

Computational Legal Studies Presentation Slides from the Meetings

Thanks to Carl Malamud and the good folks at the University of Colorado Law School and University of Texas Law School for allowing us to participate in their respective meetings. For those interested in governmental transparency, we believe that Carl Malamud’s on-going national conversation is very important. The video above represents a fixed spaced movie combining the majority of the slides we presented at the two meetings. If the video will not load, click here to access the YouTube Version of the Slides. Enjoy!

The United States Code — The Movie — Featuring Title 16 — Conservation

Above is a movie displaying Title 16 (Conservation), a subset of the content contained within the United States Code. At more than 2,400 pages (download it here), Title 16 is one of the larger titles in the US Code.  Yet, it is not the largest.  For example, Title 26 (Internal Revenue Code) and Title 42 (Public Health and Welfare) are far larger than the object displayed above.

Now, you might be wondering why we chose to generate this movie. We envisioned at least two purposes.

(1) The title of this blog is Computational Legal Studies.  One of our major goals is to either develop or apply tools that scale to life in the era of Big Data. Given the scope of an object such as the United States Code, it is clear that a significant class of potential analysis cannot reasonably be undertaken without the use of computational tools.  Thus, with respect to developing new insights, we believe computational linguistics, information theory, and applied graph theory can be of great use.  For those interested, our new paper entitled A Mathematical Approach to the Study of the United States Code offers our initial exploration of the possibilities.

(2) We believe this movie can be a meaningful pedagogical device.  Many students enter law school and are dismayed when, even in statutory-based classes, they are not exclusively reviewing the black letter law. Given the scope of this and other large bodies of documents, no model of legal education can be dedicated exclusively to teaching black letter law. Instead, such training is appropriately devoted to a mixture of existing legal rules as well as the development of information acquisition protocols that train students to navigate the relevant landscape.

The Structure and Complexity of the United States Code

Mike and I have been working on a paper we hope to soon post to the SSRN entitled “The Structure and Complexity of the United States Code.”  Yesterday, we presented a pre-alpha version of the paper in the Michigan Center for Political Studies Workshop. For those who might be interested, the working abstract for the paper is below. If you are interested in accessing documentation for the above visualization please click here.

“The United States Code is the substantively important body of information that collectively constitutes the federal statutory law of the United States.  The Code is a compiled hierarchical document organized into fifty substantive titles including Bankruptcy (Title 11), Judiciary and Judicial Procedure (Title 28), Public Health and Welfare (Title 42) and Tax (Title 26).  In addition to its hierarchical organization, the Code contains an extensive citation network where cross-references connect its provisions in a variety of novel manners.

Claims regarding complexity of the Code, in particular the Internal Revenue Code, are consistently part of the public discourse. Undoubtedly, the Code is complicated. However, quantifying its complexity is a far more difficult proposition.  While there have been some initial attempts to identify the size of certain pieces of the Code, few comprehensive or comparative investigations of the entire United States Code have been undertaken.

In this article, we ask how complex is the United States Code and in comparative terms which titles are the most and least complex? Employing a wide variety of approaches including techniques drawn from information theory, computer science, linguistics and applied graph theory, we develop and apply a series of distinct measures for the structural and linguistic complexity of the Code.  After developing these discrete approaches, we generate a composite measure and use it to comparatively score each of the Code’s titles. While we recognize other composite measures for size and complexity could legitimately be offered, we believe our interdisciplinary approach represents a significant advance and provides much needed rigor to questions of code complexity.”

New Paper: Properties of the United States Code Citation Network

We have been working on a larger paper applying many concepts from structural analysis and complexity science to the study of bodies of statutory law such as the United States Code. To preview the broader paper, we’ve published to SSRN and arXiv a shorter, more technical analysis of the properties of the United States Code’s network of citations.

Click here to Download the Paper!

Abstract: The United States Code is a body of documents that collectively comprises the statutory law of the United States. In this short paper, we investigate the properties of the network of citations contained within the Code, most notably its degree distribution. Acknowledging the text contained within each of the Code’s section nodes, we adjust our interpretation of the nodes to control for section length. Though we find a number of interesting properties in these degree distributions, the power law distribution is not an appropriate model for this system.
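As a rough illustration of the quantities discussed in the abstract, the Python sketch below computes citation in-degree, the in-degree distribution, and a length-adjusted in-degree. The adjustment shown (citations received per 1,000 words of the cited section) is only one plausible way to control for section length and is our assumption for this example, not necessarily the adjustment used in the paper. The section labels, citation pairs, and word counts are invented.

```python
from collections import Counter


def in_degree(citations):
    """Count how many citations point at each target section."""
    return Counter(target for _source, target in citations)


def degree_distribution(degrees, all_sections):
    """Return {k: number of sections with in-degree k}, including k = 0."""
    return Counter(degrees.get(section, 0) for section in all_sections)


def in_degree_per_1000_words(degrees, word_counts):
    """Scale each section's in-degree by its length (an assumed normalization)."""
    return {
        section: 1000.0 * degrees.get(section, 0) / words
        for section, words in word_counts.items()
        if words > 0
    }


# Invented data: directed (citing section, cited section) pairs and word counts.
CITATIONS = [
    ("sec-2", "sec-1"),
    ("sec-3", "sec-1"),
    ("sec-4", "sec-1"),
    ("sec-1", "sec-2"),
]
WORD_COUNTS = {"sec-1": 2000, "sec-2": 500, "sec-3": 250, "sec-4": 1000}

degrees = in_degree(CITATIONS)
distribution = degree_distribution(degrees, WORD_COUNTS)
adjusted = in_degree_per_1000_words(degrees, WORD_COUNTS)
print(dict(degrees), dict(distribution), adjusted)
```

In this toy data the longest section attracts the most citations in raw terms, yet a shorter section ranks higher once length is taken into account, which is why the interpretation of the nodes matters when reading a degree distribution.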

Citation In-Degree