Tag: Google for Government
Visualizing the Gawaher Interactions of Umar Farouk Abdulmutallab, the Christmas Day Bomber
Based on the Farouk1986 Gawaher data posted earlier this week, we have analyzed the communication network of the alleged Christmas Day Bomber, Umar Farouk Abdulmutallab.
Using the handle “Farouk1986,” Abdulmutallab was a regular participant on the Islamic forum Gawaher.com. Several years prior to the Christmas Day incident, the alleged Christmas Day Bomber took part in a significant number of communications. Of course, these communications can be analyzed in a number of ways. For example, over at Zero Intelligence Agents, Drew Conway has already done some useful initial analysis. We sought to contribute our analysis of the time-evolving communication network contained within these posts. While more extensive documentation is available below, in reviewing the dynamic network visualization, consider the following observations:
Click on the Full Screen Button! (4 Arrow Symbol in the Vimeo Bottom Banner)
#1) “Farouk1986” Entered an Existing Network Which Appeared to Increase the Salience of Religion in His Life
Although individuals in society may feel isolated or appear to be loners, the internet offers like-minded, potentially meaningful networks of people with whom to connect. The internet is full of communities of individuals whose interests are wide ranging, including topics such as Blizzard’s World of Warcraft, sports, culinary interests or religion. With whatever prior beliefs he held, “Farouk1986” entered this subset of the broader Islamic online community in late 2004. While it is not possible for us to draw definitive conclusions, it appears that the community with whom he connected increased the salience of religion in his life. In other words, through the internet “Farouk1986” experienced a reinforcing feedback, and this likely primed him for further radicalization.
#2) The Network of “Farouk1986” Grows Increasingly Stable Once Established
“Farouk1986” increasingly communicated with the same set of individuals over the window in question. Thus, while communication continued to flow through the network … the network, once established, remains fairly stable. In other words, instead of being exposed to diverse sets of individuals, “Farouk1986” continued to communicate with the same individuals. In turn, those direct contacts also continued to communicate with the same individuals.
#3) Additional Streams of Data Would Enhance Analysis
The forum posts which serve as the data for this analysis are only a subset of the communication network experienced by Umar Farouk Abdulmutallab. Additional streams of relevant data, such as phone records, emails, and participation in other forums, would likely enhance the granularity of our analysis. If you have access to such data and are legally authorized to share it, please feel free to contact us.
Background
For those not already familiar with the case, Umar Farouk Abdulmutallab is charged with willful attempt to destroy an aircraft in connection with the December 25, 2009 Delta Flight 253 from Amsterdam to Detroit. Like many, we wondered precisely what path led the son of a wealthy banker to find himself as a would-be suicide bomber. While these communications represent only a small portion of the broader picture, a number of illuminating analyses can still be conducted using this available information.
Using the handle “Farouk1986,” Umar Farouk Abdulmutallab was a regular participant on the popular Islamic forum Gawaher.com. Thus, as a small contribution to the broader analysis of the Christmas Day Incident, we have generated a basic visualization and analysis of the time-evolving structure of the Farouk1986 online communication network.
Filtration of the Gawaher.com Forum
As a major Islamic forum, Gawaher.com features a tremendous number of participants and a wide range of posts on topics including Islamic culture, religion, international football, politics, etc.
Given our specific interest in the online behavior of Umar Farouk Abdulmutallab, we were most interested in analyzing the direct and indirect communication network associated with the handle “Farouk1986” (aka Umar Farouk Abdulmutallab). Therefore, it was necessary to filter the broader universe of communication on Gawaher.com to the relevant subset.
A portion of this information is contained in the publicly available NEFA dataset. While useful, we determined that this dataset alone did not include the information necessary for us to construct the Farouk1986 secondary/indirect communications network. In order to obtain a better understanding of this communication network, we retrieved every “topic” in which Farouk1986 participated at least once. Each “topic” is comprised of one or more “posts” from one or more users. Each “post” may be in response to another user’s “post.” The NEFA data contains only posts made by Farouk1986 – our data contains the entire context within which his posts existed.
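As a rough illustration of this filtering step (not the code behind the analysis), the sketch below keeps every post from any topic in which a given handle appears at least once. The record fields `topic_id`, `author` and `reply_to`, the function name, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict

def topics_with_author(posts, handle):
    """Return every post belonging to a topic in which `handle` posted at least once."""
    by_topic = defaultdict(list)
    for post in posts:
        by_topic[post["topic_id"]].append(post)

    kept = []
    for topic_posts in by_topic.values():
        # Keep the whole topic (all authors), not just the handle's own posts.
        if any(p["author"] == handle for p in topic_posts):
            kept.extend(topic_posts)
    return kept

# Toy records; the field names are hypothetical, not the forum's real schema.
sample_posts = [
    {"topic_id": 1, "author": "Farouk1986", "reply_to": None},
    {"topic_id": 1, "author": "user_a", "reply_to": "Farouk1986"},
    {"topic_id": 2, "author": "user_b", "reply_to": None},
]
filtered = topics_with_author(sample_posts, "Farouk1986")
```

Keeping whole topics rather than single posts is what preserves the surrounding context, including exchanges between other users that never involve the handle directly.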
Building the Time-Evolving Network of Direct and Indirect Communications
Building from this underlying data, we sought to both visualize and analyze the time-evolving structure of the “Farouk1986” communication network. For those not familiar with network visualization and analytic techniques, networks consist of both nodes and edges.
In the animation offered above, each “node” is an author. The labels of all but the most central authors have been removed for visibility purposes. Each “edge” is a weighted connection between two authors, where the weight is the strength of connection between each individual. Thus, within the communication network, thicker edges represent more communications while thinner edges reflect fewer communications.
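To make the node/edge description concrete, here is a minimal sketch of one way to turn reply pairs into weighted edges. It assumes edges are undirected and that the weight is simply the number of replies exchanged between two authors; the source does not spell out the exact weighting, and the function name and toy data are hypothetical.

```python
from collections import Counter

def build_weighted_edges(replies):
    """Count communications between each unordered pair of authors."""
    weights = Counter()
    for author, target in replies:
        if author != target:  # ignore self-replies
            weights[frozenset((author, target))] += 1
    return weights

# Toy (author, replied_to) pairs.
sample_replies = [
    ("user_a", "Farouk1986"),
    ("Farouk1986", "user_a"),
    ("user_b", "user_a"),
]
edge_weights = build_weighted_edges(sample_replies)
```

In a drawing, the weight of each pair would map to edge thickness, so the Farouk1986/user_a edge here would be drawn twice as thick as the user_a/user_b edge.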
In the visualization, you will notice most nodes are colored black. For purposes of ocular differentiation, the Farouk1986 node is colored red. In addition, we color direct communications with Farouk1986 in red and communications not directly involving Farouk1986 in black.
Given each forum post is datestamped, we can order the network such that the animation reflects the changing composition of the Farouk1986 online communication network. The datestamp is reflected in the upper left corner of animation. Our analysis is limited to the 2004-2005 time period when Farouk1986 was a regular participant.
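One simple way to order a network by datestamp is to accumulate edges day by day and emit a snapshot after each date. The sketch below assumes a cumulative network (edges are never dropped), which is an assumption for illustration; a sliding window would be an equally plausible design. The dates and handles are toy values.

```python
from collections import Counter
from datetime import date

def cumulative_snapshots(dated_replies):
    """Yield (day, edge_weights) with the network accumulated up to each date."""
    weights = Counter()
    ordered = sorted(dated_replies, key=lambda r: r[0])
    i = 0
    while i < len(ordered):
        day = ordered[i][0]
        # Fold in every reply stamped with this day before emitting a frame.
        while i < len(ordered) and ordered[i][0] == day:
            _, author, target = ordered[i]
            weights[frozenset((author, target))] += 1
            i += 1
        yield day, dict(weights)

# Toy (date, author, replied_to) triples, deliberately out of order.
dated = [
    (date(2005, 2, 1), "user_b", "user_a"),
    (date(2004, 11, 5), "user_a", "Farouk1986"),
    (date(2005, 2, 1), "Farouk1986", "user_a"),
]
snapshots = list(cumulative_snapshots(dated))
```

Each snapshot can then be laid out and rendered as one keyframe of an animation, with its date shown in the corner.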
The network is visualized in each time step using the Kamada-Kawai Visualization Algorithm. Kamada-Kawai is a spring-embedded, force-directed placement algorithm commonly used to visualize networks similar to the one considered herein. In order to smooth the visual while not undercutting the qualitative results, we apply linear interpolation between frames.
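The interpolation step can be sketched independently of the layout algorithm. The example below assumes each keyframe is a dictionary mapping a node to its (x, y) position, as produced by any Kamada-Kawai implementation, and generates evenly spaced in-between frames. The function name and the toy coordinates are hypothetical.

```python
def interpolate_layouts(start, end, steps):
    """Return `steps` intermediate layouts strictly between two keyframes."""
    frames = []
    shared = start.keys() & end.keys()  # only nodes present in both keyframes
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # fraction of the way from start to end
        frames.append({
            node: (
                (1 - t) * start[node][0] + t * end[node][0],
                (1 - t) * start[node][1] + t * end[node][1],
            )
            for node in shared
        })
    return frames

# Toy keyframes: node -> (x, y).
key_a = {"Farouk1986": (0.0, 0.0), "user_a": (1.0, 0.0)}
key_b = {"Farouk1986": (0.0, 2.0), "user_a": (1.0, 4.0)}
tweens = interpolate_layouts(key_a, key_b, steps=1)
```

Nodes that appear in only one keyframe are skipped in this sketch; a fuller version would need a rule for fading new nodes in.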
10 Most Central Participants in the Farouk1986 Network
The following are the ten most central participants in the network, as measured by weighted eigenvector centrality:
Author | Centrality |
Crystal Eyes | 1 |
property_of_allah | 0.84 |
Farouk1986 | 0.81 |
amani | 0.69 |
Mansoor Ansari | 0.61 |
sis Qassab | 0.55 |
muslim mujahid | 0.49 |
Arwa | 0.43 |
sister in islam | 0.31 |
Anj | 0.29 |
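For readers unfamiliar with the measure, weighted eigenvector centrality can be approximated by power iteration on the weighted adjacency structure. The sketch below is a generic textbook version, not the code used to produce the table; it scales scores so the top author receives 1, matching the table's convention. The toy edge weights are invented for illustration.

```python
def weighted_eigenvector_centrality(edge_weights, iterations=200, tol=1e-10):
    """Power iteration on a symmetric weighted graph; scores scaled so max = 1."""
    neighbors = {}
    for (a, b), w in edge_weights.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w

    scores = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        # Iterate with (I + A) rather than A: same leading eigenvector,
        # but avoids oscillation on bipartite-like graphs.
        updated = {
            n: scores[n] + sum(w * scores[m] for m, w in nbrs.items())
            for n, nbrs in neighbors.items()
        }
        top = max(updated.values())
        updated = {n: v / top for n, v in updated.items()}
        converged = max(abs(updated[n] - scores[n]) for n in scores) < tol
        scores = updated
        if converged:
            break
    return scores

# Toy weighted edges between three hypothetical authors.
toy_edges = {("hub", "a"): 3, ("hub", "b"): 2, ("a", "b"): 1}
scores = weighted_eigenvector_centrality(toy_edges)
```

An author scores highly under this measure by communicating heavily with other high-scoring authors, which is why a participant other than Farouk1986 can top a network built around his posts.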
Directions for Additional Analysis
(a) Computational Linguistic Analysis of the Underlying Posts
Over what substantive dimensions did these networks of direct and indirect communication form?
(b) Recursive Growth of the Network
Friends-Friends-Friends and so on….
(c) Complete Analysis of the Gawaher.com Forum
Were the patterns of communications by Farouk1986 noticeably different from other forum participants?
(d) Linkage of Content and Structure
What is the nature of information diffusion across Gawaher.com?
How did this differ by substantive topic?
Visualizing the Structure of H.R. 3962 — The Health Care Bill
In addition to the facts we have presented on HR 3962, we wanted to offer a visualization for the structure of the Bill. Like many other bills, HR 3962 is divided into Divisions, Titles, Subtitles, Parts, Subparts, Sections, Subsections, Clauses, and Subclauses. These hierarchical splits represent the drafters’ conception of its organization, and thus the relative size of these categories may provide an indication of both the importance of each section of the Bill as well as the overall size of the document. By clicking through the image below, you can navigate a zoomable representation of the structure of HR 3962 using Microsoft’s Seadragon zoom interface. Many of the Divisions, Titles, Subtitles, Parts, and Subparts of the Bill are labeled. The balance are not labeled because they fell at an angle on the radial layout which rendered them impossible to read.
The graph is laid out in a radial manner with the center node labeled “H.R. 3962.” Legislation, the broader United States Code, as well as many other classes of information are organized as hierarchical documents. H.R. 3962 is no different. For those less familiar with this type of document, we thought it useful to provide a tutorial regarding (1) how to use this zoomable visualization and (2) the correspondence between the visual and the Library of Congress version of H.R. 3962.
How Do I Open/Navigate the Visualization?
(1) Open the Library of Congress version of H.R. 3962 in another browser window.
(2) Open the visualization by clicking on the large image above.
(3) Clicking on the image above will take you to the Seadragon platform. (Note: Load times will vary from machine to machine… so please be patient.)
(4) Seadragon allows for zoomable visualizations and for full screen viewing. Full screen is really the best way to go. If you run your mouse over the black box where the visual is located you will see four buttons in the southeast corner. The “full screen” button is the last one on the right. Click the button and you will be taken to full screen viewing!
(5) Click to zoom in and out, hold the mouse down and drag the entire visual, etc. Now, you are ready to traverse the graph using this visualization as your very own “H.R. 3962 Magic Decoder Wheel.”
How Do I Understand the Visualization?
To introduce the substance of the visualization, we have color coded two separate examples right into the visualization.
Example 1: Bills such as HR 3962 often feature a “short title” provision at the very beginning of the legislation. For example, if you download the PDF copy of the bill, you can see the short title at the bottom of page 1 of the bill. You can also see this in the Library of Congress version of H.R. 3962.
SECTION 1. SHORT TITLE; TABLE OF DIVISIONS, TITLES, AND SUBTITLES.
(a) Short Title- This Act may be cited as the `Affordable Health Care for America Act’.
Zoom in close to start in the center, at the large node labeled “HR 3962.” Notice the blue colorized path features the blue label 1. and terminates with the label (a). The labels in the graph are the labels in the text above. While this is a simple example, the same logic defines the entire graph.
Example 2: This is a bit more difficult as it requires the traversal of several provisions in order to reach a terminal node. In this case, the terminal node reads as follows … “SEC. 401. INDIVIDUAL RESPONSIBILITY. For an individual’s responsibility to obtain acceptable coverage, see section 59B of the Internal Revenue Code of 1986 (as added by section 501 of this Act).”
DIVISION A–AFFORDABLE HEALTH CARE CHOICES
TITLE IV–SHARED RESPONSIBILITY
Subtitle A–Individual Responsibility
SEC. 401. INDIVIDUAL RESPONSIBILITY.
Again, zoom in close to start in the center, at the large node labeled “HR 3962.” Notice the blue colorized path begins with the blue label A and terminates with the label 401. In between the start and finish, there are stops at IV and A, respectively. Just as before, the labels in the graph are the labels in the text above. The end user can follow the same journey without the visual by using the Library of Congress version of H.R. 3962.
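The two colored paths amount to walking a labeled tree from the root. The sketch below models a tiny fragment of the bill as nested dictionaries, containing only the two example paths, and follows a list of labels from the center node outward. The data structure and function name are assumptions made for illustration.

```python
# A tiny fragment of the hierarchy; the real graph has thousands of nodes.
bill = {
    "label": "H.R. 3962",
    "children": [
        {"label": "1.", "children": [{"label": "(a)", "children": []}]},
        {"label": "A", "children": [               # Division A
            {"label": "IV", "children": [          # Title IV
                {"label": "A", "children": [       # Subtitle A
                    {"label": "401", "children": []},  # Sec. 401
                ]},
            ]},
        ]},
    ],
}

def follow_path(node, labels):
    """Walk from `node` through children whose labels match `labels` in order."""
    visited = [node["label"]]
    for label in labels:
        # Raises StopIteration if no child carries the requested label.
        node = next(c for c in node["children"] if c["label"] == label)
        visited.append(node["label"])
    return visited

example_1 = follow_path(bill, ["1.", "(a)"])
example_2 = follow_path(bill, ["A", "IV", "A", "401"])
```

Note that the label A appears twice on the second path, once for the Division and once for the Subtitle; it is the position in the hierarchy, not the label alone, that identifies a node.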
Facts About the Length of H.R. 3962, the Affordable Health Care for America Act (AHCAA)
In light of last night’s vote on H.R. 3962, the Affordable Health Care for America Act, we decided to calculate a few numbers on the current bill. Based on the Library of Congress’s XML representation of the bill (which can be obtained here), we have calculated a number of linguistic and citation properties of the Bill. The House of Representatives approved HR 3962 by a 220-215 margin. The New York Times features a useful analysis of the vote including a breakout by party and region here.
On the Sunday morning talk shows as well as in other outlets, there has been significant discussion regarding the size of H.R. 3962. Specifically, many critics have decried the length of the bill citing its 1990 pages. The bill is indeed 1990 pages as you can see if you choose to download a PDF copy of the bill.
The purpose of this post is to provide a perspective regarding the length of H.R. 3962. Those versed in the typesetting practices of the United States Congress know that the printed version of a bill contains a significant amount of whitespace including non-trivial space between lines, large headers and margins, an embedded table of contents, and large font. For example, consider page 12 of the printed version of H.R. 3962. This page contains fewer than 150 substantive words.
We believe a simple page count vastly overstates the actual length of the bill. Rather than use page counts, we counted the number of words contained in the bill and compared these counts to the number of words in the existing United States Code. In addition, we consider the number of text blocks in the bill, where a text block is a unit of text under a section, subsection, clause, or sub-clause.
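A word and text-block count of this kind can be sketched with Python's standard XML parser. The example assumes that body text lives in elements named `text` and that headers sit in separate `header` elements; those tag names, the sample document, and the rule for what counts as "substantive" are assumptions for illustration, not a description of how the figures below were produced.

```python
import xml.etree.ElementTree as ET

# Invented miniature document; tag names are assumed, not taken from the bill.
SAMPLE = """<bill>
  <section><header>Short title</header>
    <subsection><text>This Act may be cited as the Example Act.</text></subsection>
    <subsection><text>Nothing in this Act applies retroactively.</text></subsection>
  </section>
</bill>"""

def count_words_and_blocks(xml_string, block_tag="text"):
    """Count whitespace-delimited words inside each text block, skipping headers."""
    root = ET.fromstring(xml_string)
    blocks = ["".join(el.itertext()) for el in root.iter(block_tag)]
    words = sum(len(block.split()) for block in blocks)
    return words, len(blocks)

words, blocks = count_words_and_blocks(SAMPLE)
```

Counting only the body elements while skipping headers and tables of contents is one way a "substantive" count can come out well below a total word count.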
Basic Information about the Length of H.R. 3962
Number of words in H.R. 3962 impacting substantive law: 234,812 words (w/ generous calculation)
Number of total words in H.R. 3962: 363,086 words (w/ titles, tables of contents …)
Number of text blocks: 7,961
Average number of words per text block: 24.18
Average words per section: 267.03
Is this a Large or Small Number? Comparison to Harry Potter
Number of substantive words in H.R. 3962: 234,812 words
Harry Potter and the Order of the Phoenix – 257,000 words
Harry Potter and the Goblet of Fire – 190,000 words
Harry Potter and the Deathly Hallows – 198,000 words
Is this a Large or Small Number? Comparison to Other Legislation
Number of substantive words in Energy Bill of 2007: 157,835 words
Number of substantive words in Defense Authorization Act for 2010: 119,960 Words
H.R. 3962 is roughly 2x the size of the Medicare Rx Bill of 2003 (given there is no public XML version of that bill, the exact “substantive words” number is not available)
Is this a Large or Small Number? Comparison to the Full U.S. Code
Size of the United States Code: 42+ Million Words
Relative Size of H.R. 3962: H.R. 3962 is roughly 1/2 of one percent of the size of the United States Code
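The arithmetic behind the "roughly 1/2 of one percent" figure can be checked directly. The snippet treats the "42+ million" figure as exactly 42 million, which is an assumption; since the Code is at least that large, the true share is at most the value computed here.

```python
substantive_words = 234_812      # substantive words in H.R. 3962 (from above)
us_code_words = 42_000_000       # lower bound implied by "42+ Million Words"

share = substantive_words / us_code_words
print(f"{share:.2%}")            # prints 0.56%
```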
Longest Sections in H.R. 3962
- Sec 341. Availability Through Health Insurance Exchange
- Sec 1222. Demonstration to promote access for Medicare beneficiaries with limited English proficiency by providing reimbursement for culturally and linguistically appropriate services.
- Sec 1160: Implementation, and Congressional review, of proposal to revise Medicare payments to promote high value health care
- Sec 305: Funding for the construction, expansion, and modernization of small ambulatory care facilities
- Sec 1417: Nationwide program for national and State background checks on direct patient access employees of long-term care facilities and providers
Modifications of the Existing U.S. Code By H.R. 3962
Number of Strikeouts: 332
Number of Inserts: 390
Number of Re-designations: 65
Acts Most Cited By H.R. 3962
Social Security Act: 622 times
Public Health Service Act: 134 times
Affordable Health Care for America Act: 60 times
Indian Health Care Improvement Act: 56 times
Indian Self-Determination and Education Assistance Act: 45 times
Employee Retirement Income Security Act: 39 times
Medicare Prescription Drug, Improvement, and Modernization Act: 11 times
American Recovery and Reinvestment Act: 7 times
Sections of the U.S. Code Cited (Properly) Most By H.R. 3962
25 U.S.C. §450. Congressional statement of findings: 38
25 U.S.C. §13. Expenditure of appropriations by Bureau: 13
42 U.S.C. §1396a(a). State plans for medical assistance: 10
42 U.S.C. §1396d(a). Definitions: 7
42 U.S.C. §2004a. Sanitation facilities: 7
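Tallies like these can be approximated by pattern-matching well-formed citations and counting them. The sketch below uses a deliberately simple regular expression for citations of the form "25 U.S.C. 450" or "42 U.S.C. §1396a" on an invented sentence; the pattern and sample text are assumptions, and a real bill would need a richer pattern (subsections, "et seq.", tagged cross-references in the XML).

```python
import re
from collections import Counter

# Title number, "U.S.C.", optional section sign, then a section such as 450 or 1396a.
USC_PATTERN = re.compile(r"(\d+)\s+U\.S\.C\.\s+§?\s*(\d+[a-z]*)")

# Invented sample text, not quoted from the bill.
sample_text = (
    "as defined in 25 U.S.C. 450 and 42 U.S.C. §1396a; "
    "see also 25 U.S.C. 450."
)

citation_counts = Counter(
    f"{title} U.S.C. §{section}"
    for title, section in USC_PATTERN.findall(sample_text)
)
top_cited = citation_counts.most_common(1)[0]
```

Citations that do not fit the pattern are silently skipped, which is one reason a count of "properly" formed citations can understate the total.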
Real Time Visualization of US Patent Data [Via Infosthetics]
Using data dating back to 2005 and updating weekly with information from data.gov, the Typologies of Intellectual Property project, created by information designer Richard Vijgen, offers almost real-time visualization of US Patent Data.
From the documentation … “[T]ypologies of intellectual property is an interactive visualization of patent data issued by the United States Patent and Trademark Office. Every week an xml file with about 3000 new patents is published by the USTPO and made available through data.gov. This webapplication provides a way to navigate, explore and discover the complex and interconnected world of idea, inventions and big business.”
Once you click through, remember to adjust the date in the upper right corner to observe earlier time periods. Also, for additional information and/or documentation, click the “about this site” link in the upper right corner. Enjoy!
Death and Taxes 2010 — Using the Zoomorama Interface
Death and Taxes is an infographic classic created by Jess Bachman. The new version for 2010 is now available. Place the cursor over the graphic and wait for the {+,-} to show up. Then, zoom in to read any part of the poster. Click and hold to move side to side. For more information or to order a poster … click through to Wall Stats. It is worth the click through as Wall Stats features a fully searchable legend which will autozoom on major executive agencies.
The Rise of the Data Scientist [From Flowing Data]
Earlier in the month, there was a very interesting discussion over at Flowing Data entitled the Rise of the Data Scientist. We decided to highlight it in this post because it raises important issues regarding the relationship between Computational Legal Studies and other movements within law.
As we consider ourselves empiricists, we are strong supporters of the Empirical Legal Studies movement. For those not familiar, the vast majority of existing Empirical Legal Studies scholarship employs econometric techniques. For some substantive questions, these approaches are perfectly appropriate, while for others we believe techniques such as network analysis, computational linguistics, etc. are better suited. Even when appropriately employed, as displayed above, we believe the use of traditional statistical approaches should be seen as nested within a larger process. Namely, for a certain class of substantive questions, there exist tremendous amounts of readily available data. Thus, on the front end, the use of computer science techniques such as web scraping and text parsing could help unlock existing large-N data sources, thereby improving the quality of inferences collectively produced. On the back end, the use of various methods of information visualization could democratize the scholarship by making the key insights available to a much wider audience.
It is worth noting that our commitment to Computational Legal Studies actually embraces a second important prong. From a mathematical modeling/formal theory perspective, at least for a certain range of questions, agent based models/computational models ≥ closed form analytical models. In other words, we are concerned that many paper & pencil game theoretic models fail to incorporate interactions between components or the underlying heterogeneity of agents. Alternatively, they demonstrate the existence of a P* without concern of whether such an equilibrium is obtained on a timescale of interest. In some instances, these complications do not necessarily matter but in other cases they are deeply consequential.
On the Road Again… Trip to Colorado Law School
We just finished a few very interesting days at Colorado Law School. Given that the intersection of computer programming and law is a fairly narrow set, it was great to spend some time at CU Law School because its faculty features two scholars with a significant programming background: Paul Ohm and Harry Surden.
In addition to discussing CLS, we participated in a workshop on New Institutional Economics (NIE) and Law. I found this workshop very interesting as outside of my work in Computational Legal Studies, I have authored scholarship at the crossroads of New Institutionalism and Constitutional Political Economy. For example, I have this article and this work in progress. My work follows the tradition of the Bloomington School of NIE. In two weeks, I will be presenting work in progress at ISNIE in Berkeley.
Following the Colorado NIE Workshop, we participated in the Silicon Flatirons Government 3.0 roundtable. I do not want to preempt the forthcoming white paper but I will say that it was a very worthwhile discussion. It solidified my views on some topics and changed my mind on some others. So, the road show continues… AI & Law in Barcelona starts tomorrow… so light blogging for the next week. But as I like to say… more to come…
Tracking the TARP [From Information Aesthetics]
Information Aesthetics is now highlighting Subsidyscope, a project designed to track how various institutions receive federal monies. Of particular interest is their visualization of disbursements under the Troubled Asset Relief Program (TARP). Sponsored by the Pew Charitable Trusts, the site also contains .csv files for most of the underlying data.
The Bailout Breakdown from the Associated Press
Datavisualization.ch recently highlighted this interactive “Bailout Breakdown” offered by the Associated Press: “Bailout Breakdown from Associated Press is an interactive applet that lets the user analyze the recipients and amounts of the $700 Billion bailout plan from the American government. The data is presented as a scatterplot with additional information about the representations when the hovers over a plotted item. The markers are color coded to distinguish between pending, pre-approved, approved and paid status.” At the end of his post, Benjamin Wiederkehr offers some principled critiques of the visualization techniques employed by the authors. Notwithstanding those critiques, we still thought it was worthy of highlighting.
Google for Government? Broad Representations of Large N DataSets
In our previous post, which has generated tremendous interest from a variety of sources, we demonstrated how applying the tools of network science can provide a broad representation for thousands of lines of information. Throughout the 2008 Presidential Campaign, then-Senator Obama consistently discussed his Google for Government initiative.
From the Obama for America Website:
Google for Government: Americans have the right to know how their tax dollars are spent, but that information has been hidden from public view for too long. That’s why Barack Obama and Senator Tom Coburn (R-OK) passed a law to create a Google-like search engine to allow regular people to approximately track federal grants, contracts, earmarks, and loans online.
We agree with both President Obama and Senator Coburn that universal accessibility of such information is a worthwhile goal. However, we believe this is only a first step.
In a deep sense, our prior post is designed to serve as a demonstration project. We are just two graduate students working on a shoestring budget. With the resources of the federal government, however, it would certainly be possible to create a series of simple interfaces designed to broadly represent large amounts of information. While these interfaces should rely upon the best available analytical methods, such methods could probably be built in behind the scenes. At a minimum, government agencies should follow the suggestion of David G. Robinson and his co-authors, who argue the federal government “should require that federal websites themselves use the same open systems for accessing the underlying data as they make available to the public at large.”
Anyway, we will be back on Monday providing more thoughts on our initial representation of the 110th Congress. In addition, we hope to highlight other work in the growing field of Computational Legal Studies. Have a good rest of the weekend!