Building a Better Legal Search Engine, Part 1: Searching the U.S. Code

Cross Post from Michael Bommarito’s Blog – “Last week, I mentioned that I am excited to give a keynote in two weeks on Law and Computation at the University of Houston Law Center alongside Stephen Wolfram, Carl Malamud, Seth Chandler, and my buddy Dan Katz from here at the CLS Blog.  The first part in my blog series leading up to this talk will focus on indexing and searching the U.S. Code with structured, public domain data and open source software.

Before diving into the technical aspects, I thought it would be useful to provide some background on what the U.S. Code is and why it exists.  Let’s start with an example – the Dodd-Frank Wall Street Reform and Consumer Protection Act. After the final version of HR 4173 was passed by both houses and enrolled in July of 2010, it received a new identifier, Public Law 111-203.  This public law, along with private laws, resolutions, amendments, and proclamations, is published in order of enactment in the Statutes at Large.  The Statutes at Large is therefore a compilation of all these sources of law dating back to the Declaration of Independence itself, and as such, is the authoritative source of statutory law.

If we think about the organization and contents of the Statutes at Large, it quickly becomes clear why the Code exists.  The basic task of a legal practitioner is to determine what the state of law is with respect to a given set of facts at a certain time, typically now.  Let’s return to the Dodd-Frank example above.  Let’s say we’re in the compliance department at a financial institution and we’d like to know how the new proprietary trading rules affect us. To do this, we might perform the following tasks:

  • Search for laws by concept, e.g., depository institution or derivative.
  • Ensure that these laws are current and comprehensive.
  • Build a set of rules or guidelines from these laws.
  • Interpret these rules in the context of our facts.


However, the Statutes at Large is not well-suited to these tasks.

  • It is sorted by date of enactment, not by concept.
  • It contains laws that may affect multiple legal concepts.
  • It contains laws that reference other laws for definitions or rules.
  • It contains laws that amend or repeal other laws.


Based on our goal and these properties of the Statutes, we need to perform an exhaustive search every time we have a new question. This is pretty clearly bad if we want to get anything done (but hey, maybe you’re not in-house and you bill by the hour). So what might we do to re-organize the Statutes to make it easier for us to use the law?

  • Organize the law by concept, possibly hierarchically.
  • Combine laws that refer or amend one another.
  • Remove laws that have expired or have been repealed.
  • Provide convenient citations or identifiers for legal concepts.


A systematic organization of the Statutes at Large that followed these rules would make our lives significantly easier. We could search for concepts and use the hierarchical context of these results to navigate related ideas. We could rest assured that the material we read was near-comprehensive and current. Furthermore, we could communicate more succinctly by referencing a small number of organized sections instead of hundreds of Public Laws.

As you might have guessed, this organizational scheme defines the United States Code as produced by the Office of the Law Revision Counsel. While the LRC traditionally distributes copies of the Code as ASCII files on CD-ROMs, they recently began distributing copies of the code in XHTML. We’ll be using these copies to build our index, so if you’d like to follow along, you should download them from here –

If we’d like to build a legal search engine, the Code is arguably the best place to start. While there are other important statutory and judicial sources like the Code of Federal Regulations or the Federal Reporter, the Code is as close to capital-L Law as it gets.

In this part of the post series, I’m going to build an index of the text of the Code from the 2009 and 2010 LRC snapshots. To do this, we’ll use the excellent Apache Lucene library for Java. Lucene is, in their own words, “a high-performance, full-featured text search engine library written entirely in Java.” As we’ll see in later posts, Lucene (with its sister project, Solr) is a very easy and powerful tool to develop fast, web-based search interfaces. Before we dive into the code below the break, let’s take a look at what we’re working towards. Below is a search for the term “swap” across the entire Code. We’re displaying the top five results, and these were produced in a little over a second on my laptop.”
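For readers who want a feel for what "indexing" and "searching" mean here before clicking through, the following is a minimal sketch of an inverted index in plain Python. It is not Lucene and not the code from Michael's post: the function names, the section identifiers, and the sample texts are all placeholders made up for illustration, and the ranking is a raw term count rather than Lucene's more sophisticated scoring.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, term, top_n=5):
    """Return up to top_n (doc_id, count) pairs, highest count first."""
    postings = index.get(term.lower(), {})
    # Sort by descending count, then by doc_id so ties are deterministic.
    return sorted(postings.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical placeholder sections; NOT actual text from the U.S. Code.
docs = {
    "section-a": "The term swap means any agreement that is a swap agreement.",
    "section-b": "A depository institution may not enter into a swap.",
    "section-c": "Definitions relating to derivative contracts.",
}

index = build_index(docs)
results = search(index, "swap")
for doc_id, count in results:
    print(doc_id, count)
```

The design point is the same one a full-text library relies on: the expensive work (reading and tokenizing every section) happens once at index time, so each query is just a dictionary lookup followed by a sort of the matching sections.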

To view the images, click over to Michael Bommarito’s Blog (click here for direct access). Additional technical specifications and code are also available. 


Ignite Law 2011 & The ABA Techshow

On the eve of the ABA Techshow, I am looking forward to attending Ignite Law 2011 @ The Chicago Hilton. For those not familiar, Ignite offers a unique style of presentation (6 minutes total with automatically advancing slides). For a certain class of ideas, Ignite offers the maximal information compression approach to concept introduction.  Anyway, the topics of the talks look interesting.

Tomorrow, I will be attending some of the sessions at the Techshow. If anyone is attending the conference and would like to touch base, feel free to ping me.

Oyez @ Chicago-Kent Releases Free OyezToday App for iPhone

Kudos to Jerry Goldman and the other folks at the Oyez Project, as well as the Chicago-Kent College of Law, for making this free resource available to the public!

From the description: “OYEZTODAY at IIT Chicago-Kent College of Law offers you the latest information and media on the current business of the Supreme Court of the United States. OYEZTODAY provides: easy-to-grasp abstracts for every case granted review, timely and searchable audio of oral arguments + transcripts, and up-to-date summaries of the Court’s most recent decisions including the Court’s full opinions. You will have access to all this information on your iPhone with the ability to share reactions on Facebook, Twitter, or by email. (Recordings of opinion announcements from the bench will follow when the Court releases these files to the National Archives at the start of the Court’s next Term).  Chicago-Kent is proud to provide this free service to enhance the public’s understanding of the Supreme Court and current legal controversies.”

Salman Khan: Let’s Use Video to Reinvent Education [ TED 2011 ]


“In 2004, Salman Khan, a hedge fund analyst, began posting math tutorials on YouTube. Six years later, he has posted more than 2,000 tutorials, which are viewed nearly 100,000 times around the world. In this TED 2011 Talk, Salman talks about how and why he created the remarkable Khan Academy, a carefully structured series of educational videos offering complete curricula in math and, now, other subjects. He shows the power of interactive exercises, and calls for teachers to consider flipping the traditional classroom script: give students video lectures to watch at home, and do “homework” in the classroom with the teacher available to help.”

This offers a pretty interesting alternative model for education delivery.  It is worth checking out!

Rock / Paper / Scissors – Man v. Machine (as t→∞ you are not likely to win) [via NY Times]

From the site … “A truly random game of Rock-Paper-Scissors would result in a statistical tie with each player winning, tying and losing one-third of the time …  However, people are not truly random and thus can be studied and analyzed. While this computer won’t win all rounds, over time it can exploit a person’s tendencies and patterns to gain an advantage over its opponent.

Computers mimic human reasoning by building on simple rules and statistical averages. Test your strategy against the computer in this rock-paper-scissors game illustrating basic artificial intelligence. Choose from two different modes: novice, where the computer learns to play from scratch, and veteran, where the computer pits over 200,000 rounds of previous experience against you.”
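To make the "exploit a person's tendencies" idea concrete, here is a small sketch in Python. It is only an assumption about how such a player could work, not the NY Times implementation: the `PatternPlayer` class and the `play` helper are hypothetical names, and the strategy (count which move the human tends to make after each of their previous moves, then play the counter to the most frequent one) is just one simple choice among many.

```python
import random
from collections import Counter, defaultdict

# BEATS[x] is the move that beats x (paper beats rock, and so on).
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


class PatternPlayer:
    """Hypothetical bot that learns which move tends to follow the human's last move."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.transitions = defaultdict(Counter)  # last move -> counts of next move
        self.last = None

    def choose(self):
        seen = self.transitions.get(self.last)
        if not seen:
            # No history for this situation yet: fall back to a random move.
            return self.rng.choice(list(BEATS))
        predicted = seen.most_common(1)[0][0]
        return BEATS[predicted]

    def observe(self, human_move):
        """Record the human's move after the round has been played."""
        if self.last is not None:
            self.transitions[self.last][human_move] += 1
        self.last = human_move


def play(rounds=30):
    """Pit the bot against a human who always cycles rock, paper, scissors."""
    bot = PatternPlayer(seed=0)
    cycle = ["rock", "paper", "scissors"]
    bot_wins = 0
    for i in range(rounds):
        human = cycle[i % 3]
        if bot.choose() == BEATS[human]:
            bot_wins += 1
        bot.observe(human)
    return bot_wins


print(play(30), "bot wins out of 30 rounds")
```

Against a perfectly predictable opponent like this one, the bot guesses at random for the first four rounds and then wins every round after that. A human who really does play uniformly at random gives it nothing to learn, which is the "statistical tie" described in the quote.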

Time to dust off your pseudorandom number generators … good luck!

Academic Earth: University Instruction from Some of the World’s Top Scholars

I have been a huge fan of Academic Earth for quite some time. For those not familiar with the site, I wanted to highlight some of my favorite material. Academic Earth features full-length video for a number of useful and interesting courses. If you are simply interested in absorbing the content, it is a great way to learn. It is arguably better than sites such as MIT OpenCourseWare because it has the full video (MIT OpenCourseWare often has only lecture notes).

Thomas Goetz: It’s Time to Redesign Medical Data [TEDMed]

Thomas Goetz is the executive editor of Wired and author of “The Decision Tree: Taking Control of Your Health in the New Era of Personalized Medicine.”  From the Talk Abstract: “Your medical chart: it’s hard to access, impossible to read, and full of information that could make you healthier if you just knew how to use it. At TEDMED, Thomas Goetz looks at medical data, making a bold call to redesign it and get more insight from it.”

Yesterday’s Fast is Today’s Slow – The 2011 Season Starts Today!

Well, my Ducks did not quite get it done last night in the BCS National Championship Game. Despite the loss, I think it is important to emphasize that innovation on a variety of fronts is responsible for bringing Oregon to the title game. Simply put, Oregon has redefined the game, and there is no doubt that copycats will soon begin to follow their model. The Ducks still have one more big step to take, but they will be one of the favorites in 2011, and that season begins today.

The AI Revolution Is On [ Via Wired Magazine ]

From the Full Article: “AI researchers began to devise a raft of new techniques that were decidedly not modeled on human intelligence. By using probability-based algorithms to derive meaning from huge amounts of data, researchers discovered that they didn’t need to teach a computer how to accomplish a task; they could just show it what people did and let the machine figure out how to emulate that behavior under similar circumstances. … They don’t possess anything like human intelligence and certainly couldn’t pass a Turing test. But they represent a new forefront in the field of artificial intelligence. Today’s AI doesn’t try to re-create the brain. Instead, it uses machine learning, massive data sets, sophisticated sensors, and clever algorithms to master discrete tasks. Examples can be found everywhere …”