Archive

Posts Tagged ‘computational linguistics’

ICPSR 2010 Summer Program — Introduction to Computing for the Study of Complex Systems

July 21st, 2010 dmartink 1 comment

This summer I will be teaching a course entitled Introduction to Computing for the Study of Complex Systems at the ICPSR Summer Program in Quantitative Methods. For those not familiar, ICPSR has been offering summer classes in methods since 1963. The Summer Program currently features dozens of courses including basic and advanced econometrics, Bayesian statistics, game theory, complex systems, network analysis, quantitative analysis of crime, etc.

The Complex Systems Computing Module runs together with the Complex Systems lectures offered by Ken Kollman (Michigan), Scott E. Page (Michigan), P.J. Lamberson (MIT-Sloan) and Kate Anderson (Carnegie Mellon).  Here is the syllabus for the lecture.

The first computing session is tonight from 6-8pm at the ICPSR Computer Lab.  If you click here or click on the image above you will be taken to a dedicated page that will host the syllabus and course slides!  Note: The slides and assignments will not be posted until the conclusion of each class.

The Google Prediction API [From Google Labs]

July 19th, 2010 dmartink No comments

Wordle of the Declaration of Independence — Enjoy the 4th of July!

July 4th, 2010 dmartink No comments

Tax Day: A Mathematical Approach to the Study of the United States Code

April 15th, 2010 mjbommar No comments
United States Code

April 15th is Tax Day! Unless you’ve filed for an extension or you’re a corporation on your own fiscal year, you’ve hopefully finished your taxes by now!

While you were filing your return, you may have noticed references to the Internal Revenue Code (IRC). The IRC, also known as Title 26, is what people colloquially call the “Tax Code.” Along with the Treasury Regulations compiled into the Code of Federal Regulations (26 C.F.R.), the Internal Revenue Code contains many of the rules and regulations governing how we can and can’t file our taxes. Even if you prepared your taxes using software like TurboTax, the questions generated by these programs are determined by the rules and regulations within the Tax Code and Treasury Regulations.

Many argue that there are too many of these rules and regulations or that they are too complex. Many also claim that the “Tax Code” is becoming larger or more complex over time. Unfortunately, most individuals do not support this claim with solid data. When they do, they often rely on the number of pages in either Title 26 or the CCH Standard Federal Tax Reporter. Neither of these measures takes into consideration the real complexity of the Code, however.

In honor of Tax Day, we’re going to highlight a recent paper that we’ve written that tries to address some of these issues – A Mathematical Approach to the Study of the United States Code. The first point to make is that this paper is a study of the entire United States Code. Title 26, the Tax Code, is actually only one small part of the set of rules and regulations defined in the United States Code. The United States Code as a whole is the largest and arguably most important source of Federal statutory law. Compiled from the legislation and treaties published in the Statutes at Large in 6-year intervals, the entire document contains over 22 million words.

In this paper, we develop a mathematical approach to the study of large bodies of statutory law and, in particular, the United States Code. This approach is guided by a representation of the Code as an object with three primary parts:

  1. A hierarchical network that corresponds to the structure of concept categories.
  2. A citation network that encodes the interdependence of concepts.
  3. The language contained within each section.

Given this representation, we then calculate a number of properties for the United States Code in 2008, 2009, and 2010 as provided by the Legal Information Institute at the Cornell University Law School. Our results can be summarized in three points:

  1. The structure of the United States Code is growing by over 3 sections per day.
  2. The interdependence of the United States Code is increasing by over 7 citations per day.
  3. The amount of language in the United States Code is increasing by over 2,000 words per day.
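For readers who want something concrete, here is a minimal sketch of how the three-part representation and the per-day rates above could be expressed in Python. It is not the code used in the paper: the `CodeSnapshot` class, its field names, and the `daily_growth` helper are all hypothetical, and the word count is a naive whitespace split.

```python
from dataclasses import dataclass, field


@dataclass
class CodeSnapshot:
    """One snapshot of a statutory code (hypothetical names throughout)."""
    # Hierarchy: child identifier -> parent identifier (section -> chapter -> title).
    parent: dict = field(default_factory=dict)
    # Citation network: directed (citing section, cited section) pairs.
    citations: set = field(default_factory=set)
    # Language: section identifier -> raw text of that section.
    text: dict = field(default_factory=dict)

    def sizes(self):
        """Return simple size measures for each of the three parts."""
        return {
            "sections": len(self.text),
            "citations": len(self.citations),
            # Assumption: words are counted by splitting on whitespace.
            "words": sum(len(t.split()) for t in self.text.values()),
        }


def daily_growth(old, new, days):
    """Average change per day in each size measure between two snapshots."""
    old_sizes, new_sizes = old.sizes(), new.sizes()
    return {key: (new_sizes[key] - old_sizes[key]) / days for key in old_sizes}
```

Given two snapshots taken a known number of days apart, `daily_growth(old, new, days)` returns the average daily change in sections, citations, and words, which is the form in which the three results above are stated.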

The figure above is an actual image of the structure and interdependence of the United States Code. The black lines correspond to structure and the red lines correspond to interdependence.  Though visually stunning, the true implication of this figure is that the United States Code is a very interdependent set of rules and regulations, both within and across concept categories.

If you’re interested in more detail, make sure to read the paper, A Mathematical Approach to the Study of the United States Code. If you’re really interested, make sure to check back in the near future for our forthcoming paper entitled Measuring the Complexity of the United States Code.

Relational Topic Models for Document Networks — Chang & Blei

February 16th, 2010 dmartink No comments

This is a very important paper by Jonathan Chang and David Blei. Suffice it to say, it has potential use in a wide class of social science applications. Click here to access related material on Professor Blei’s Princeton homepage. Click here for some slides (note: 7.0 MB). Check it out!

Slides from our Presentation at UPenn Computational Linguistics (CLUNCH) / Linguistic Data Consortium (LDC)

January 26th, 2010 mjbommar No comments

We have spent the past couple of days at the University of Pennsylvania, where we presented information about our efforts to compile a complete United States Supreme Court Corpus. As noted in the slides below, we are interested in creating a corpus containing not only every SCOTUS opinion, but also every SCOTUS disposition from 1791-2010. Slight variants of the slides below were presented at the Penn Computational Linguistics Lunch (CLunch) and the Linguistic Data Consortium (LDC). We really appreciated the feedback and are looking forward to continuing our work with the LDC. For those who might be interested, take a look at the slides embedded below or click on this link:

Law as a Seamless Web … Poster for WIN Conference @ NYU Stern

September 22nd, 2009 dmartink 2 comments

Seamless Web Poster

As we mentioned in previous posts, Seadragon is a really cool product. Please note that load times may vary depending upon your specific machine configuration as well as the strength of your internet connection. For those not familiar with how to operate it, please see below. In our view, Full Screen is the best way to go.

Computational Linguistics and Law — Some Useful Introductory Slides

July 7th, 2009 dmartink No comments

Comp Linguistics & Law

Wordle of the Declaration of Independence — Enjoy the 4th of July!

July 4th, 2009 dmartink No comments

Declaration of Independence (Wordle)

Law as a Seamless Web? Part III

June 30th, 2009 dmartink No comments

Seamless Web III

This is the third installment of posts related to our paper Law as a Seamless Web? Comparison of Various Network Representations of the United States Supreme Court Corpus (1791-2005). Previous posts can be found (here) and (here). As previewed in the earlier posts, we believe comparing the Union, the Intersect and the Complement of the SCOTUS semantic and citation networks is at the heart of an empirical evaluation of Law as a Seamless Web. From the paper:

“Though law is almost certainly a web, questions regarding its interconnectedness remain. Building upon themes of Maitland, Professor Solum has properly raised questions as to whether or not the web of law is “seamless”. By leveraging the tools of computer science and applied graph theory, we believe that an empirical evaluation of this question is at last possible.  In that vein, consider Figure 9, which offers several possible topological locations that might be populated by components of the graphs discussed herein. We believe future research should consider the relevant information contained in the union, intersection, and complement of our citation and semantic networks.

While we leave a detailed substantive interpretation for subsequent work, it is worth broadly considering the information defined in Figure 9.  For example, the intersect (∩) displayed in Figure 9 defines the set of cases that feature both semantic similarity and a direct citation linkage. In general, these are likely communities of well-defined topical domains.  Of greater interest to an empirical evaluation of the law as a seamless web, is likely the magnitude and composition of the Citation Only and Semantic Only subsets.  Subject to future empirical investigation, we believe the Citation Only components of the graph may represent the exact type of concept exportation to and from particular semantic domains that would indeed make the law a seamless web.”
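To make the union, intersect and complement concrete, here is a minimal sketch using plain Python sets. It assumes both networks are given as lists of case pairs and treats edges as undirected so that the two graphs are comparable; the function name and the result keys are hypothetical, and nothing here is taken from the paper's own code.

```python
def compare_networks(citation_edges, semantic_edges):
    """Split the edges of two graphs over the same cases into comparable subsets."""
    # Assumption: direction is ignored, so (A, B) and (B, A) are the same edge.
    citation = {frozenset(edge) for edge in citation_edges}
    semantic = {frozenset(edge) for edge in semantic_edges}
    return {
        "union": citation | semantic,          # linked in at least one network
        "intersect": citation & semantic,      # cited AND semantically similar
        "citation_only": citation - semantic,  # cited but not semantically similar
        "semantic_only": semantic - citation,  # similar but no direct citation
    }
```

Under this reading, the Citation Only subset is simply the set difference between the citation edges and the semantic edges, which is the part of the graph the quoted passage singles out for future investigation.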

Law as a Seamless Web? Part II

June 21st, 2009 dmartink No comments

Semantic Network
In our paper Law as a Seamless Web, we offer a first-order method to generate case-to-case and opinionunit-to-opinionunit semantic networks. As constructed in the figure above, nodes represent cases decided between 1791 and 1865, while edges are drawn when two cases possess a certain threshold of semantic similarity. Except for the definition of edges, the process of constructing the semantic graph is identical to that of the citation graph we offered in the prior post. While computer science/computational linguistics offers a variety of possible semantic similarity measures, we choose to employ a commonly used measure. Here is a description from the paper:

“Semantic similarity measures are the focus of significant work in computational linguistics. Given the scope of the dataset, we have chosen a first-order method for calculating similarity.  After lemmatizing the text of the case with WordNet, we store the nouns with the top N frequencies for each case or opinion unit. We define the similarity between two cases or opinion units A and B as the percentage of words that are shared between the top words of A and top words of B.

An edge exists between A and B in the set of edges if σ(A,B) exceeds some threshold. This threshold is the minimum similarity necessary for the graph to represent the presence of a semantic connection.”
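The quoted description can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's implementation: it takes already lemmatized noun lists as input (the WordNet lemmatization step is skipped), it reads "percentage of words that are shared" as the shared count divided by the size of the larger top-word set, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_words(nouns, n):
    """Return the set of the n most frequent nouns in one case or opinion unit."""
    return {word for word, _ in Counter(nouns).most_common(n)}


def similarity(top_a, top_b):
    """Share of top words common to A and B (assumed denominator: larger set)."""
    if not top_a or not top_b:
        return 0.0
    return len(top_a & top_b) / max(len(top_a), len(top_b))


def semantic_edges(docs, n, threshold):
    """Draw an undirected edge when the similarity of two documents exceeds threshold.

    docs maps a document name to its list of (already lemmatized) nouns.
    """
    tops = {name: top_words(nouns, n) for name, nouns in docs.items()}
    return {
        frozenset((a, b))
        for a, b in combinations(tops, 2)
        if similarity(tops[a], tops[b]) > threshold
    }
```

Both N and the threshold are free parameters here, as they are in the quoted passage; raising the threshold yields a sparser graph.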

As this is a technical paper, it is slanted toward demonstrating proof of methodological concept rather than covering significant substantive ground. With that said, we do offer a hint of our broader substantive goal of detecting the spread of legal concepts between various topical domains. Specifically, with respect to enriching positive political theory, we believe the union, intersect and complement of the semantic and citation networks are really important. More on this point is forthcoming in a subsequent post…
