Archive

Author Archive

Computational World Cup

June 14th, 2010 mjbommar No comments

The Financial Times’s Alphaville blog recently covered a number of quantitative models for predicting World Cup outcomes – models developed by well-known “quant” desks.  Though this may seem like a waste of brains and shareholder value, World Cup outcomes are historically predictive of regional equity performance; furthermore, recent trends in securitization have not passed over sports as large as soccer.  Here are the respective desks’ picks:

  • JPM: England 1st, Spain 2nd, Netherlands 3rd (notes)
  • UBS: Brazil 1st, Germany 2nd, Italy 3rd (notes, p. 37)
  • GS: England, Argentina, Brazil, Spain (unranked) (notes, p. 71)
  • Dankse Bank: Brazil 1st, Germany 2nd (notes)

As could be expected, there is some disagreement as to the value of these predictions.  Gary Jenkins of Evolution Securities chimes in with his own thoughts:

Yes it’s that time again when analysts like me who can barely predict what is going to happen in the market the following day turn away from our area of so called expertise and instead focus our attention on who is going to win the World Cup. I first got involved in this attempt to get some publicity 8 years ago, when Goldman Sachs produced a report combining economics and the World Cup and included their predictions as to who would get to the last four (I believe they got them all wrong) and had Sir Alex Ferguson pick his all time best World Cup team. I decided to do the same thing but had to explain that we could not afford Sir Alex. Thus I got my dad to pick his all time team. It caused more client complaints than most of my research and my favourites to win the tournament got knocked out early, so I abandoned this kind of research for a while.

Again, for more interesting coverage of the real-world effects of the World Cup, see FT Alphaville’s South Africa 2010 series.  P.S. Go Azzurri this afternoon!

Tags:

Bursts: The Hidden Pattern Behind Everything We Do

April 17th, 2010 mjbommar No comments

Albert-László Barabási, in his usual creative fashion, has produced an interesting game to help publicize his new book, Bursts: The Hidden Pattern Behind Everything We Do.

Read their description of the game below and check it out if you’re interested!

BuRSTS

BuRSTS is a performance in human dynamics, a game of cooperation and prediction, that will gradually unveil the full text of Bursts. In a nutshell, if you register at http://brsts.com, you will be able to adopt one of the 84,245 words of the book. Once you adopt, the words adopted by others will become visible to you — thus as each words finds a parent, the whole book will become visible to the adopters. But if you invite your friends (and please do!) and you are good at predicting hidden content, the book will unveil itself to you well before all words are adopted. We will even send each day free signed copied of Bursts to those with the best scores.

From http://barabasi.com/bursts/.

Tax Day: A Mathematical Approach to the Study of the United States Code

April 15th, 2010 mjbommar No comments
United States Code
United States Code

April 15th is Tax Day! Unless you’ve filed for an extension or you’re a corporation on your own fiscal year, you’ve hopefully finished your taxes by now!

While you were filing your return, you may have noticed references to the Internal Revenue Code (IRC). The IRC, also known as Title 26, is legal slang for the “Tax Code.”  Along with the Treasury Regulations compiled into the Code of Federal Regulations (26 C.F.R.), the Internal Revenue Code contains many of the rules and regulations governing how we can and can’t file our taxes.  Even if you prepared your taxes using software like TurboTax, the questions generated by these programs are determined by the rules and regulations within the Tax Code and Treasury Regulations.

Many argue that there are too many of these rules and regulations or that these rules and regulations are too complex. Furthermore, many also claim that the “Tax Code” is becoming larger or more complex over time. Unfortunately, most individuals do not support this claim with solid data. When they do, they often rely on either the number of pages in Title 26 or the CCH Standard Federal Tax Reporter. None of these measures take into consideration the real complexity of the Code, however.

In honor of Tax Day, we’re going to highlight a recent paper that we’ve written that tries to address some of these issues – A Mathematical Approach to the Study of the United States Code. The first point to make is that this paper is a study of the entire United States Code. Title 26, the Tax Code, is actually only one small part of the set of rules and regulations defined in the United States Code. The United States Code as a whole is the largest and arguably most important source of Federal statutory law. Compiled from the legislation and treaties published in the Statutes at Large in 6-year intervals, the entire document contains over 22 million words.

In this paper, we develop a mathematical approach to the study of the large bodies of statutory law and in particular, the United States Code. This approach can be summarized as guided by a representation of the Code as an object with three primary parts:

  1. A hierarchical network that corresponds to the structure of concept categories.
  2. A citation network that encodes the interdependence of concepts.
  3. The language contained within each section.

Given this representation, we then calculate a number of properties for the United States Code in 2008, 2009, and 2010 as provided by the Legal Information Institute at the Cornell University Law School. Our results can be summarized in three points:

  1. The structure of the United States Code is growing by over 3 sections per day.
  2. The interdependence of the United States Code is increasing by over 7 citations per day.
  3. The amount of language in the United States Code is increasing by over 2,000 words per day.

The figure above is an actual image of the structure and interdependence of the United States Code. The black lines correspond to structure and the red lines correspond to interdependence.  Though visually stunning, the true implication of this figure is that the United States Code is a very interdependent set of rules and regulations, both within and across concept categories.

If you’re interested in more detail, make sure to read the paper -A Mathematical Approach to the Study of the United States Code. If you’re really interested, make sure to check back in the near future for our forthcoming paper entitled Measuring the Complexity of the United States Code.

H.R. 4872 Word Cloud

March 22nd, 2010 mjbommar No comments

With the passage of H.R. 4872 in the House last night, our previous research on the relative size of H.R. 3962 has been in high demand. While we are not yet prepared to run similar calculations for H.R. 4872, at first glance we can say – as passed, H.R. 4872 is marginally longer H.R. 3200 and nearly 20% longer than H.R. 3962.

We have created the word cloud below … it is mainly just for fun. However, one interesting thing that jumps out from the cloud is the prevalence of the word “secretary.” Indeed, this may indicate the legislation provides a non-trivial amount of discretion to the relevant various administrative agencies.

H.R. 4872 Word Cloud

H.R. 4872 Word Cloud

New Paper Available on SSRN: A Profitable Trading and Risk Management Strategy Despite Transaction Cost

March 16th, 2010 mjbommar 3 comments

Readers might be interested in an article that A. Duran and I have coming out in Quantitative Finance this year entitled A Profitable Trading and Risk Management Strategy Despite Transaction Cost. In the article, we develop a strategy which outperforms the “market” in rigorous out-of-sample testing. We’ve made sure to check the robustness of the results by performing Monte Carlo simulations on both the S&P 500 and Russell 2000 while varying the subsets of stocks and time periods used in the simulation.

The strategy is interesting in that it is based on behavioral patterns.  Unlike many other algorithmic trading models, our strategy is modeled after a human trader with quarterly memory who categorizes the market return distribution and market risk into low, medium, and high categories.  Technically, it accomplishes this by non-parametrically categorizing windowed estimates of the first four moments of the return distribution and the normalized leading eigenvalue of the windowed correlation matrix.  Based on the assessment of these low/medium/high categories and past experience in similar states, the strategy then decides whether to invest in the market index, invest in the risk-free asset, or short the market.   The strategy soundly outperforms the market index in multiple markets over random windows and on random subsets of stocks.

While you’re waiting for its publication in Quantitative Finance, you might check out a copy over at SSRN.  Here’s the abstract and a figure below comparing the log-return of our strategy with the market over one realization:

We present a new profitable trading and risk management strategy with transaction cost for an adaptive equally weighted portfolio. Moreover, we implement a rule-based expert system for the daily financial decision making process by using the power of spectral analysis. We use several key components such as principal component analysis, partitioning, memory in stock markets, percentile for relative standing, the first four normalized central moments, learning algorithm, switching among several investments positions consisting of short stock market, long stock market and money market with real risk-free rates. We find that it is possible to beat the proxy for equity market without short selling for S&P 500-listed 168 stocks during the 1998-2008 period and Russell 2000-listed 213 stocks during the 1995-2007 period. Our Monte Carlo simulation over both the various set of stocks and the interval of time confirms our findings.

On the Stability of Community Detection Algorithms on Longitudinal Citation Data: Paper & New Code

March 4th, 2010 mjbommar No comments

Last summer, Dan, Jon and I wrote our conference paper for ASNA 2009 entitled On the Stability of Community Detection Algorithms on Longitudinal Citation Data.  The purpose of this paper was to experimentally explore a number of properties of community detection algorithms, especially as applied to citation networks (c.f. Supreme Court, Tax Court, United States Code).

The model is a fairly simple discrete-time directed growing graph.  At time 0, we create a completely disconnected graph with some initial number of vertices.  The number of new vertices after time 0 is then modeled by a homogeneous Poisson process.  For each of these new vertices, we also model the number of edges per vertex as IID Poisson distributions.  The probability distribution over these edges is specified by a modified preferential attachment mechanism.

Most of our questions consider what we’ve called the average pairwise stability.  This can be thought of as answering the following question: “what is the probability that Alice and Bob are friends tomorrow if they were friends today?” Here, friendship corresponds to sharing the same community.  By asking this question for all pairs of vertices (dyads) for all time steps in which both vertices existed, we get a probability that a community dyad is preserved from step to step.  It is important to note that we’re not claiming all algorithms applied to all systems should have high average pairwise stability. In fact, for systems that involve dynamics like random rewiring, the only way to get high pairwise stability is to put all vertices in the same community or their own community at all steps – obviously trivial and unhelpful solutions in practice.

Given this model and this conception of stability, we want to perform the following experiments:

  1. How do the edge-betweenness, fast greedy, and leading eigenvector community detection algorithms compare in terms of their average pairwise stability…
    1. for varying levels of preferential attachment?
    2. for varying vertex and edge rates?
  2. Is there a significant tradeoff between the number of communities and the average pairwise stability of these community detection algorithms?

The answers are yes, yes, and yes!  You should read the paper for more details.

I’ve also produced some code to help you assess the average pairwise stability for your dataset.  The code requires igraph and is only in Python at this point due to an issue with R’s vertex label handling (which I can hopefully work around).  You can get the average pairwise stability methods on github and check out the example below:

Slides from our Presentation at UPenn Computational Linguistics (CLUNCH) / Linguistic Data Consortium (LDC)

January 26th, 2010 mjbommar No comments

We have spent the past couple days at the University of Pennsylvania where we presented information about our efforts to compile a complete United States Supreme Court Corpus.  As noted in the slides below, we are interested in creating a corpus containing not only every SCOTUS opinion, but also every SCOTUS disposition from 1791-2010. Slight variants of the slides below were presented at the Penn Computational Linguistics Lunch (CLunch) and the Linguistic Data Consortium(LDC).  We really appreciated the feedback and are looking forward to continue our work with the LDC.  For those who might be interested, take a look at the slides embedded below or click on this link:

GraphMovie: A Library for Generating Movies from Dynamic Graphs with igraph

January 17th, 2010 mjbommar No comments

Over the past few months, we’ve developed a library for simply generating dynamic network animations. We’ve used this library in visualizations like (1) Visualizing the Gawaher Interactions of Umar Farouk Abdulmutallab, the Christmas Day Bomber and (2) Dynamic Animation of the East Anglia Climate Research Unit Email Network.  Prior to these visualizations, we’ve used Sonia to produce animations like this one. While certainly a useful program for those without programming expertise, Sonia suffers from a number of issues that make it unusable for large graphs or graphs with many “slices.”  Furthermore, in our experience rendering various movies a number of platform issues with the Quicktime and Flash rendering engines have arisen.  Fixing these problems is possible, but Sonia’s large Java codebase makes for a steep learning curve.  As a result, we’ve decided to release this GraphMovie class so that others can use or possibly improve this library.

In order to use the GraphMovie, you’ll need the following:

  • python (tested with 2.6)
  • igraph for network manipulation and visualization
  • Python Imaging Library for manipulating the image frames
  • mencoder from the MPlayer package for encoding the image frames into a movie

Here are the files, hosted on github:

GraphMovie: Example 1 from Computational Legal Studies on Vimeo.

GraphMovie: Example 2 from Computational Legal Studies on Vimeo.

Gawaher Forum Content from Farouk1986, the Christmas Day Bomber

December 31st, 2009 mjbommar No comments

Drew Conway over at Zero Intelligence Agents brought to our attention the Farouk1986 data set provided by Evan Kohlman from the NEFA Foundation. For those not familiar, Umar Farouk Abdulmutallab, who is charged with willful attempt to destroy an aircraft in connection with the Christmas Day Flight 253 from Amsterdam to Detroit, made a  number of posts to the Islamic Forum Gawaher.com using the handle “Farouk1986.”

This data is useful and some good initial analysis has been offered at ZIA as well as other sites across the blogosphere. Moving beyond this initial analysis, we thought it would be helpful to analyze this set within the broader context of Umar Farouk Abdulmutallab’s posts. Thus, we downloaded the entire thread for each post in the NEFA data set — including content from many other authors.

Having this content allows us to understand the broader context of Abdulmutallab communications … including to what Abdulmutallab was responding and how others in the relevant community responded to his contributions.  We have parsed the data into threads and posts, and for each post, we have indicated the author, date, and content. For those interested in executing their own analysis, you can find an XML document with all this data here:  http://www-personal.umich.edu/~mjbommar/farouk.xml.

Feel free to use this data with proper attribution and keep your eyes posted for further analysis of the Abdulmutallab communications on this blog in the coming days.

And to all of our readers Happy New Year!

Tags:

Visualizing the East Anglia Climate Research Unit Leaked Email Network

November 27th, 2009 mjbommar No comments

As reported in a wide variety of news outlets, last week, a large amount of data was hacked from the Climate Research Unit at the University of East Anglia.  This data included both source code for the CRU climate models, as well as emails from the individuals involved with the group.  For those interested in background information, you can read the NY Times coverage here and here.  Read the Wall Street Journal  here.  Read the Telegraph here.  For those interested in searching the emails, the NY Times directs the end user to http://www.eastangliaemails.com/.

Given the data is widely available on the internet, we thought it would be interesting to analyze the network of contacts found within these leaked emails.  Similar analysis has been offered for large datasets such as the famous Enron email data set. While there may be some selection issues associated with observing this subset of existing emails, we believe this network still gives us a “proxy” into the structure of communication and power in an important group of researchers (both at the individual and organization level).

To build this network, we processed every email in the leaked data. Each email contains a sender and at least one recipient on the To:, Cc:, or Bcc: line.  The key assumption is that every email from a sender to a recipient represents a relationship between them.  Furthermore, we assume that more emails sent between two people, as a general proposition indicates a stronger relationship between individuals.

To visualize the network, we draw a blue circle for every email address in the data set.  The size of the blue circle represents how many emails they sent or received in the data set – bigger nodes thus sent or received a disproportionate number of emails.  Next, we draw grey lines between these circles to represent that emails were sent between the two contacts.  These lines are also sized by the number of emails sent between the two nodes.

Typically, we would also provide full labels for nodes in a network.  However, we decided to engage in partial “anonymization” for the  email addresses of those in the data set.  Thus, we have removed all information before the @ sign.  For instance, an email such as johndoe@umich.edu is shown  as umich.edu in the visual.  If you would like to view this network without this partial “anonymization,” it is of course possible to download the data and run the source code provided below.

Note: We have updated the image.  Specifically, we substituted a grey background for the full black background in an effort to make the visual easier to read/interpret. 

Click here for a zoomable version of the visual on Microsoft Seadragon.

 Network Zoom

Don’t forget to use SeaDragon’s fullscreen option:

Picture-24

Hubs and Authorities:

In addition to the visual, we provide hub and authority scores for the nodes in the network.  We provide names for these nodes but do not provide their email address.

Authority

  1. Phil Jones: 1.0
  2. Keith Briffa: 0.86
  3. Tim Osborn: 0.80
  4. Jonathan Overpeck: 0.57
  5. Tom Wigley: 0.54
  6. Gavin Schmidt: 0.54
  7. Raymond Bradley: 0.52
  8. Kevin Trenberth: 0.49
  9. Benjamin Santer: 0.49
  10.   Michael Mann: 0.46

Hubs returns nearly identical ranks with slightly perturbed orders with the notable exception that the UK Met Office IPCC Working Group has the highest hub score.

Thus, so far as these emails are a reasonable “proxy” for the true structure of this communication network, these are some of the most important individuals in the network.

Source Code:

Unlike some existing CRU code, the code below is documented, handles errors, and is freely available. 

New Paper: Properties of the United States Code Citation Network

November 11th, 2009 mjbommar No comments

We have been working on a larger paper applying many concepts from structural analysis and complexity science to the study of bodies of statutory law such as the United States Code. To preview the broader paper, we’ve published to SSRN and arXiv a shorter, more technical analysis of the properties of the United States Code’s network of citations.

Click here to Download the Paper!

Abstract: The United States Code is a body of documents that collectively comprises the statutory law of the United States. In this short paper, we investigate the properties of the network of citations contained within the Code, most notably its degree distribution. Acknowledging the text contained within each of the Code’s section nodes, we adjust our interpretation of the nodes to control for section length. Though we find a number of interesting properties in these degree distributions, the power law distribution is not an appropriate model for this system.

Citation In-Degree
Citation In-Degree

WP SlimStat