Culturomics research uses quarter-century of media coverage to forecast human behavior

Posted on: September 7, 2011

This article on Culturomics is very interesting. The Scan Man wonders what would happen if we took some of the massive scanned archives and fed them into the Culturomics models for human behavior. From the pages of Gizmag:

“Culturomics” is an emerging field of study into human culture that relies on the collection and analysis of large amounts of data. A previous culturomic research effort used Google’s culturomic tool to examine a dataset made up of the text of about 5.2 million books to quantify cultural trends across seven languages and three centuries. Now a new research project has used a supercomputer to examine a dataset made up of a quarter-century of worldwide news coverage to forecast and visualize human behavior. Using the tone and location of news coverage, the research was able to retroactively predict the recent Arab Spring and successfully estimate the final location of Osama Bin Laden to within 200 km (124 miles).

The research used the large shared-memory supercomputer called Nautilus, which is part of the National Institute for Computational Sciences (NICS) network of advanced computing resources at Oak Ridge National Laboratory (ORNL) and boasts 1,024 cores and 4 terabytes of global shared memory. The dataset used was formed by combining three massive news archives that totaled more than 100 million articles worldwide. They included the complete New York Times (NYT) from 1945 to 2005, the unclassified edition of the Summary of World Broadcasts (SWB) from 1979 to 2010, and an archive of English-language Google News articles spanning 2006 to 2011. These archives provided a cross-section of the U.S. media spanning half a century and the global media over a quarter-century.

Using this data, study author Kalev Leetaru of the University of Illinois at Urbana-Champaign applied advanced tonal, geographic, and network analysis methods to produce a network 2.4 petabytes in size. It contained more than 10 billion people, places, things, and activities linked by over 100 trillion relationships, and provided a cross-section of Earth as seen through the news media. Leetaru let the supercomputer find interesting patterns in the bulk of the data, which he then recreated using a more traditional targeted and smaller-scale approach. In this way, Leetaru was able to produce real-time forecasts of human behavior, such as national conflicts and the movement of specific individuals.


Leetaru says that examining the tone of a news story is one of the most important aspects of his version of culturomics and the most reliable metric for conflict. He cites the example of the Foreign Broadcast Information Service (FBIS) news-monitoring service, which produced an analytical report on December 6, 1941 – the day before the bombing of Pearl Harbor – noting that the bitterness of Japanese radio broadcasts toward the U.S. had increased and that appeals for peace had ceased.

“They recognized the most valuable part about the news was not the factual parts, but the latent parts – the tone, the emotion,” said Leetaru.

“Almost every Fortune 500 company monitors the tone of news and social media coverage about their products,” Leetaru added. “There’s been a huge amount of research coming out of the business literature on the power of news tone to predict economic behavior, yet there hasn’t been as much work in using it to predict social behavior.”

To create a numeric measurement of overall tone in a document, Leetaru used algorithms that counted the “positive” and “negative” words that appear in it and assigned the document a positive or negative value. Both of his tone-mining methods relied on dictionaries of words pre-assigned as positive or negative. The first counted the density of positive words and the density of negative words, then subtracted one from the other to get a measure of overall tone. The second used a dictionary that numerically rated each word from extremely negative to extremely positive, then averaged the scores of all the rated words found in the story for a more nuanced result.
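The two tone-mining methods can be illustrated with a minimal Python sketch. This is not Leetaru's code: the word lists, the graded scores, and the function names `density_tone` and `graded_tone` are all hypothetical stand-ins chosen for illustration, and the dictionaries used in the study were far larger.

```python
import re

# Hypothetical toy dictionaries, for illustration only.
POSITIVE = {"peace", "hope", "agreement", "calm"}
NEGATIVE = {"attack", "anger", "bombing", "crisis"}

# Hypothetical graded dictionary: -1.0 (extremely negative) to +1.0 (extremely positive).
GRADED = {
    "peace": 0.8, "hope": 0.6, "calm": 0.4,
    "anger": -0.6, "attack": -0.8, "bombing": -1.0,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def density_tone(text):
    """Method 1: positive-word density minus negative-word density."""
    words = tokenize(text)
    if not words:
        return 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return (positive - negative) / len(words)


def graded_tone(text):
    """Method 2: average the graded score of every rated word that appears."""
    scores = [GRADED[word] for word in tokenize(text) if word in GRADED]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    story = "The bombing provoked anger, but leaders voiced hope for calm."
    print(density_tone(story), graded_tone(story))
```

The first function treats every dictionary word as equally strong, while the second lets a word like "bombing" weigh more heavily than "anger", which is the sense in which the graded approach is more nuanced.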

Location, location, location

Leetaru also used fulltext geocoding to provide an approximate geographic coordinate for locations referenced in a news article and network analysis to show how global media groups countries together in “civilizations.”

“Using global news coverage, you count how many times every city on Earth is mentioned with every other city in an article,” explained Leetaru. “Group those results by country and you have a network of how the world news media relates and frames all the countries on Earth.”
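The counting step Leetaru describes can be sketched in a few lines of Python. The city-to-country lookup, the sample articles, and the function name `country_cooccurrence` are assumptions made for this illustration. The sketch also assumes each article has already been reduced to a list of recognized city names, and it skips pairs of cities in the same country, a choice the article does not specify.

```python
from collections import Counter
from itertools import combinations

# Hypothetical lookup table; a real system would rely on a full gazetteer.
CITY_COUNTRY = {
    "Cairo": "Egypt",
    "Alexandria": "Egypt",
    "Tunis": "Tunisia",
    "Islamabad": "Pakistan",
}


def country_cooccurrence(articles):
    """Count how often cities from two different countries share an article.

    `articles` is an iterable of lists of city names, one list per article.
    Returns a Counter keyed by alphabetically ordered country pairs.
    """
    edges = Counter()
    for cities in articles:
        # Keep only known cities, and count each city once per article.
        known = sorted({city for city in cities if city in CITY_COUNTRY})
        for city_a, city_b in combinations(known, 2):
            country_a, country_b = CITY_COUNTRY[city_a], CITY_COUNTRY[city_b]
            if country_a != country_b:  # assumption: ignore same-country pairs
                edges[tuple(sorted((country_a, country_b)))] += 1
    return edges


if __name__ == "__main__":
    sample = [
        ["Cairo", "Tunis"],
        ["Alexandria", "Tunis", "Cairo"],
        ["Islamabad"],
    ]
    for pair, weight in country_cooccurrence(sample).items():
        print(pair, weight)
```

The resulting weighted country pairs form the edges of a network, which a community-detection step could then divide into groups like the "civilizations" described here.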

The SWB and NYT archives provided an insight into how the media of different countries group countries together. The SWB news led to seven civilizations, while the NYT archive led to only five, with a greater proportion of countries grouped with the U.S.

World “civilizations” according to SWB, 1979-2009 (Image: Leetaru)

World “civilizations” according to NYT, 1945-2005 (Image: Leetaru)

“Each country’s media will depict the world differently,” explained Leetaru. “It’s a standard principle of journalism – you write for your audience. Still, it vividly reinforces that what we get here in the U.S. is a very U.S. centric view of the world.”

Culturomics crystal ball

Using the three key data mining techniques of tone-mining, fulltext geocoding and network analysis, Leetaru was able to produce some interesting results. He says that “pooling together the global tone of all news mentions of a country over time appears to accurately forecast its near-term stability, including predicting the revolutions in Egypt, Tunisia, and Libya, conflict in Serbia, and the stability of Saudi Arabia.”

While Leetaru says Tunisia played a huge role in the Egyptian revolution, the real beginnings of the revolt can be traced back to the New Year’s Eve bombing of a Coptic Church in Alexandria that killed 21 and injured 70. It was this domestic terrorist attack that provoked local anger at the government. The global news media captured this negative shift in sentiment, along with the potential for the bombing, coming on the heels of the Tunisian revolution, to destabilize the country.

Not only was Leetaru able to retroactively predict the Arab Spring and dissect the basis for the uprisings, but he was also able to narrow his focus and use the news to map the movement of a specific individual – Osama Bin Laden. Although the city of his death, Abbottabad, is mentioned only once in all the articles within the dataset, it is less than 200 km from the two cities most frequently associated with him – Islamabad and Peshawar. In fact, nearly 49 percent of all the articles mentioning Bin Laden included a city in Pakistan.

Global geocoded tone of all Summary of World Broadcasts content, January 1979-April 2011 mentioning “bin Laden” (Image: Leetaru)

While Leetaru admits the global news content couldn’t provide a definite lock on Bin Laden’s location, it suggested that he was almost twice as likely to be found in Pakistan as Afghanistan and that a 200 km radius around Islamabad and Peshawar was his most likely location.
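The 200 km radius is a great-circle distance test, which a short Python sketch can make concrete. The haversine formula below is a standard way to compute such distances and is not taken from Leetaru's paper. The city coordinates are rough values supplied for this illustration, and the names `haversine_km` and `within_radius` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate (latitude, longitude) in degrees; rough values for illustration only.
CITIES = {
    "Islamabad": (33.69, 73.05),
    "Peshawar": (34.01, 71.58),
    "Abbottabad": (34.15, 73.21),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_radius(point, centers, radius_km=200.0):
    """Return True if `point` lies within `radius_km` of every center."""
    return all(haversine_km(*point, *center) <= radius_km for center in centers)


if __name__ == "__main__":
    target = CITIES["Abbottabad"]
    for name in ("Islamabad", "Peshawar"):
        print(name, round(haversine_km(*target, *CITIES[name])), "km")
    print(within_radius(target, [CITIES["Islamabad"], CITIES["Peshawar"]]))
```

Running the same check against every city mentioned alongside a person's name is one simple way to turn a cloud of geocoded mentions into a candidate search area.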

“I never expected to pinpoint him so accurately,” admitted Leetaru. “But it’s fascinating – if you make a map of all the cities mentioned in articles about him over the last decade it leads to a 200-kilometer radius around where he was found. It begs the question, ‘Why did that work so well?'”

Although Leetaru says the findings of his study are captivating, his real goal is to encourage further study.

“The purpose of this paper is not to say, ‘Here’s the magic bullet that solves these problems,’ but more as a road map for future research,” he said. “I see it as diving beneath the ocean – we’ve been so focused on the surface that we’re only just beginning to start exploring the entire new world that’s underneath.”

Leetaru’s paper, “Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone in Time and Space,” can be read in full in the journal First Monday. The research was funded by the National Science Foundation and managed by the University of Tennessee’s Remote Data Analysis and Visualization Center.