
Posts Tagged ‘data mining’

Introduction to Computing for Complex Systems — ICPSR 2010 — My Full Course Slides Available Online!

August 13th, 2010 dmartink No comments

I am going to bump this post to the front of the blog one last time. We have now completed the full four-week class here at the ICPSR Summer Program in Quantitative Methods. In this course, I (together with my colleagues) highlight the methods of complex systems as well as several environments designed to explore the field. These include NetLogo (agent-based models and network models), Vensim (system dynamics / ecological modeling) and Pajek (empirical network analysis). In the final week, we cover a variety of advanced topics.

Although we do not work with more advanced languages within the course, those who need to conduct complex analysis are directed to alternatives such as R, Python, Java, etc.

Anyway, the slides are designed to be fully self-contained and thus allow for individually paced study of the relevant material. If you work through the slides carefully you should be able to learn the software as well as many of the core principles associated with the science of complex systems. The material should be available online indefinitely. If you have questions, feel free to email me.

The Google Prediction API [From Google Labs]

July 19th, 2010 dmartink No comments

Computational World Cup

June 14th, 2010 mjbommar No comments

The Financial Times’s Alphaville blog recently covered a number of quantitative models for predicting World Cup outcomes – models developed by well-known “quant” desks.  Though this may seem like a waste of brains and shareholder value, World Cup outcomes are historically predictive of regional equity performance; furthermore, recent trends in securitization have not passed over sports as large as soccer.  Here are the respective desks’ picks:

  • JPM: England 1st, Spain 2nd, Netherlands 3rd (notes)
  • UBS: Brazil 1st, Germany 2nd, Italy 3rd (notes, p. 37)
  • GS: England, Argentina, Brazil, Spain (unranked) (notes, p. 71)
  • Danske Bank: Brazil 1st, Germany 2nd (notes)

As could be expected, there is some disagreement as to the value of these predictions.  Gary Jenkins of Evolution Securities chimes in with his own thoughts:

Yes it’s that time again when analysts like me who can barely predict what is going to happen in the market the following day turn away from our area of so called expertise and instead focus our attention on who is going to win the World Cup. I first got involved in this attempt to get some publicity 8 years ago, when Goldman Sachs produced a report combining economics and the World Cup and included their predictions as to who would get to the last four (I believe they got them all wrong) and had Sir Alex Ferguson pick his all time best World Cup team. I decided to do the same thing but had to explain that we could not afford Sir Alex. Thus I got my dad to pick his all time team. It caused more client complaints than most of my research and my favourites to win the tournament got knocked out early, so I abandoned this kind of research for a while.

Again, for more interesting coverage of the real-world effects of the World Cup, see FT Alphaville’s South Africa 2010 series.  P.S. Go Azzurri this afternoon!


Legal Studies in the Era of ‘Big Data’ – Google Releases 10 Terabytes of Patent and Trademark Data

June 2nd, 2010 dmartink No comments

Bursts: The Hidden Pattern Behind Everything We Do

April 17th, 2010 mjbommar No comments

Albert-László Barabási, in his usual creative fashion, has produced an interesting game to help publicize his new book, Bursts: The Hidden Pattern Behind Everything We Do.

Read their description of the game below and check it out if you’re interested!

BuRSTS

BuRSTS is a performance in human dynamics, a game of cooperation and prediction, that will gradually unveil the full text of Bursts. In a nutshell, if you register at http://brsts.com, you will be able to adopt one of the 84,245 words of the book. Once you adopt, the words adopted by others will become visible to you — thus as each word finds a parent, the whole book will become visible to the adopters. But if you invite your friends (and please do!) and you are good at predicting hidden content, the book will unveil itself to you well before all words are adopted. We will even send each day free signed copies of Bursts to those with the best scores.

From http://barabasi.com/bursts/.

Data on the Legal Blogosphere [Via the Library of Congress]

April 8th, 2010 dmartink No comments

From the LOC Website … “The Law Library of Congress began harvesting legal blawgs in 2007. The collection has grown to more than one hundred items covering a broad cross section of legal topics. Blawgs can also be retrieved by keywords or browsed by subject, name, or title.” To access our visualization of the legal blogosphere (pictured above) … please click here.

Detrending Career Statistics in Professional Baseball: Accounting for the Steroids Era and Beyond

March 20th, 2010 dmartink No comments

From the abstract: “There is a long standing debate over how to objectively compare the career achievements of professional athletes from different historical eras. Developing an objective approach will be of particular importance over the next decade as Major League Baseball (MLB) players from the “steroids era” become eligible for Hall of Fame induction. Here we address this issue, as well as the general problem of comparing statistics from distinct eras, by detrending the seasonal statistics of professional baseball players. We detrend player statistics by normalizing achievements to seasonal averages, which accounts for changes in relative player ability resulting from both exogenous and endogenous factors, such as talent dilution from expansion, equipment and training improvements, as well as performance enhancing drugs (PED). In this paper we compare the probability density function (pdf) of detrended career statistics to the pdf of raw career statistics for five statistical categories — hits (H), home runs (HR), runs batted in (RBI), wins (W) and strikeouts (K) — over the 90-year period 1920-2009. We find that the functional form of these pdfs are stationary under detrending. This stationarity implies that the statistical regularity observed in the right-skewed distributions for longevity and success in professional baseball arises from both the wide range of intrinsic talent among athletes and the underlying nature of competition. Using this simple detrending technique, we examine the top 50 all-time careers for H, HR, RBI, W and K. We fit the pdfs for career success by the Gamma distribution in order to calculate objective benchmarks based on extreme statistics which can be used for the identification of extraordinary careers.”
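For readers who want a feel for what "normalizing achievements to seasonal averages" might look like in practice, here is a minimal Python sketch. It is not the authors' code: the function name, the choice of an all-season baseline, and the toy numbers are assumptions made purely for illustration, and the paper's actual normalization may differ (for example, by working with per-opportunity rates).

```python
from collections import defaultdict


def detrend_career_totals(records):
    """Sum each player's seasonal values after rescaling by the season average.

    records: iterable of (player, season, value) tuples (hypothetical layout).
    Each value is multiplied by baseline / season_mean, where baseline is the
    mean of the seasonal means. This is an assumed, simplified normalization.
    """
    records = list(records)

    # Average value of the statistic within each season.
    by_season = defaultdict(list)
    for _player, season, value in records:
        by_season[season].append(value)
    season_mean = {s: sum(v) / len(v) for s, v in by_season.items()}

    # Common reference level shared by all seasons.
    baseline = sum(season_mean.values()) / len(season_mean)

    # Accumulate detrended career totals per player.
    careers = defaultdict(float)
    for player, season, value in records:
        mean = season_mean[season]
        if mean > 0:  # skip seasons with no recorded production
            careers[player] += value * baseline / mean
    return dict(careers)


# Toy data (invented): a low-scoring season and a high-scoring season.
records = [
    ("A", 1968, 20), ("B", 1968, 10),
    ("C", 2000, 40), ("D", 2000, 20),
]
totals = detrend_career_totals(records)
print(totals)
```

In this toy example, player A's 20 in the low-scoring season and player C's 40 in the high-scoring season come out as equal detrended totals, since each is the same multiple of its own season's average.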

160,000 Hours of C-SPAN Coverage at Your Fingertips

March 16th, 2010 dmartink No comments

As reported in the NY Times …  roughly 160,000 hours of C-SPAN coverage is going live for your consumption.  Yet another example that the Era of Big Data is upon us!

The Data Deluge [Via The Economist]

March 1st, 2010 dmartink No comments

The cover story of this week’s Economist is entitled The Data Deluge. This is, of course, a favorite topic of the hosts of this blog. While a number of folks have already highlighted this trend, we are happy to see a mainstream outlet such as The Economist reporting on the era of big data. Indeed, the convergence of rapidly increasing computing power and decreasing data storage costs, on one side, and large-scale data collection and digitization on the other … has already impacted practices in the business, government and scientific communities. There is ample reason to believe that more is on the way.

In our estimation, for the particular class of questions for which data is available, two major implications of the deluge are worth reiterating: (1) there is no need to make assumptions about the asymptotic performance of a particular sampling frame when population-level data is readily available; and (2) what statistical sampling was to the 20th century, data filtering may very well be to the 21st ….

United States Court of Appeals & Parallel Tag Clouds from IBM Research [Repost from 10/23]

February 3rd, 2010 dmartink No comments

Download the paper: Collins, Christopher; Viégas, Fernanda B.; Wattenberg, Martin. Parallel Tag Clouds to Explore Faceted Text Corpora. To appear in Proceedings of the IEEE Symposium on Visual Analytics Science and Technology (VAST), October 2009. [Note: The paper is 24.5 MB]

Here is the abstract: Do court cases differ from place to place? What kind of picture do we get by looking at a country’s collection of law cases? We introduce Parallel Tag Clouds: a new way to visualize differences amongst facets of very large metadata-rich text corpora. We have pointed Parallel Tag Clouds at a collection of over 600,000 US Circuit Court decisions spanning a period of 50 years and have discovered regional as well as linguistic differences between courts. The visualization technique combines graphical elements from parallel coordinates and traditional tag clouds to provide rich overviews of a document collection while acting as an entry point for exploration of individual texts. We augment basic parallel tag clouds with a details-in-context display and an option to visualize changes over a second facet of the data, such as time. We also address text mining challenges such as selecting the best words to visualize, and how to do so in reasonable time periods to maintain interactivity.

Slides from our Presentation at UPenn Computational Linguistics (CLUNCH) / Linguistic Data Consortium (LDC)

January 26th, 2010 mjbommar No comments

We have spent the past couple of days at the University of Pennsylvania, where we presented information about our efforts to compile a complete United States Supreme Court Corpus.  As noted in the slides below, we are interested in creating a corpus containing not only every SCOTUS opinion, but also every SCOTUS disposition from 1791-2010. Slight variants of the slides below were presented at the Penn Computational Linguistics Lunch (CLunch) and the Linguistic Data Consortium (LDC).  We really appreciated the feedback and are looking forward to continuing our work with the LDC.  For those who might be interested, take a look at the slides embedded below or click on this link:
