Trend Detection through Large-Scale News and Blog Analysis

Sentiment analysis is an emerging field employing algorithmic methods to identify and summarize opinions expressed in text. Professor Steven Skiena's Lydia system uses text analysis to build a relational model of people, places, and things. Natural language processing of news, blog, and other web sources coupled with statistical analysis of entity frequencies and co-locations enables us to track the temporal and spatial distribution of news entities: who is being talked about, by whom, when, and where?

Our analyses of several large-scale text corpora (including news, blogs, and scientific/medical abstracts) are available to the research community through our Access portal. Users can check our analysis of any of the 100 million entities we monitor daily in our terabyte-scale news/blog corpus -- including over 500 daily online newspapers we have monitored continually since November 2004, 20 million blog sources, the historical archives of selected newspapers dating back over 150 years, and millions of patent and medical abstracts.

Lydia analysis is now being used in serious social science research, including political science, sociology, and business/marketing. Research results demonstrate that our sentiment analysis can be incorporated into models to predict movie grosses and stock prices, and even to inform betting on NFL football games. Lydia technology has been licensed to General Sentiment, a social-media monitoring startup which tracks public opinion expressed regarding brands, products, politicians, celebrities, companies, and more.

Lydia has evolved through two generations and over six years of university research and development. We now employ a high-performance Hadoop-based parallel architecture for text analysis which enables us to easily work with massive text corpora on our 28-node dedicated cluster computer.

At a technical level, the Lydia system consists of four primary components -- spidering, NLP markup, entity analysis and aggregation, and data visualization:

  • Input document collection -- We actively spider text sources ranging from mainstream news sources to blogs on a continual, daily basis.
  • Natural language processing -- Documents are passed through the Lydia NLP pipeline, which performs part-of-speech tagging, marks up and classifies named entities through rule-based and Bayesian analysis, resolves pronouns, normalizes geographical location names, and identifies the sentiment polarity of each entity occurrence. Local entity co-reference resolution is performed to unify references to the same entity under different names; e.g., Barack Obama might be referred to as "Barack Obama," "Obama," and "President Obama" in the same article.
  • Aggregation -- Lydia extracts references from these NLP-processed documents through a series of Hadoop-based map-reduce jobs, storing the results in a persistent data structure that we call a depository. This includes comprehensive reference counts, juxtaposition statistics, article / entity search indices, co-referential (synonymous) entity sets, derived entity classifications, and aggregated statistics for these derived groups.
  • Data Visualization -- The final Lydia depository can be accessed through a set of flexible APIs, exposing interesting slices of the data. Our website provides an interactive user interface for data exploration through data maps and time-series processing.
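
The aggregation step can be illustrated with a small, single-process sketch of the map-reduce pattern. This is not Lydia's code. The input format (documents carrying a date and a list of entity mentions already resolved to one canonical name), the key layout, the +1/0/-1 polarity values, and the definition of a juxtaposition as two entities appearing in the same document are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical output of an NLP stage: made-up documents, each with a date
# and (canonical entity name, polarity) mentions. Polarity is assumed to be
# +1 (positive), 0 (neutral), or -1 (negative).
DOCS = [
    {"date": "2010-11-01",
     "mentions": [("Barack Obama", 1), ("Stony Brook", 0), ("Barack Obama", -1)]},
    {"date": "2010-11-01",
     "mentions": [("Barack Obama", 1), ("NFL", 1)]},
    {"date": "2010-11-02",
     "mentions": [("NFL", -1), ("Stony Brook", 1)]},
]


def map_document(doc):
    """Map phase: emit (key, value) pairs for a single document."""
    for entity, polarity in doc["mentions"]:
        yield ("ref", entity, doc["date"]), 1         # reference count
        yield ("pol", entity, doc["date"]), polarity  # net polarity
    # Assumed juxtaposition rule: each unordered pair of distinct entities
    # in the same document counts once.
    entities = sorted({entity for entity, _ in doc["mentions"]})
    for a, b in combinations(entities, 2):
        yield ("pair", a, b), 1


def reduce_pairs(pairs):
    """Reduce phase: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def aggregate(docs):
    """Run map then reduce over all documents in one process."""
    return reduce_pairs(kv for doc in docs for kv in map_document(doc))


totals = aggregate(DOCS)
for key in sorted(totals):
    print(key, totals[key])
```

Because each key is summed independently, the same map and reduce functions could in principle be distributed across a cluster; a real depository would also persist the results and build search indices, which this sketch omits.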

Lydia news analysis distinguishes itself from previous efforts in several ways, including (1) news entity vs. document-level analytics, (2) more sophisticated NLP than simple bag-of-words techniques, (3) dissemination of general frequency/sentiment time-series data via an interactive web interface vs. project-specific custom programming, and (4) comprehensive and up-to-date analysis of substantial collections of news sources. The data streams resulting from our analysis readily lend themselves to statistical investigation, using techniques from time series and spatial analysis.
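
As a rough illustration of the kind of time-series investigation such data streams allow, the sketch below smooths a daily series with a trailing moving average and measures the linear correlation between two series. The numbers are made up, and the function names and the choice of these two techniques are assumptions for illustration; they are not taken from the Lydia system.

```python
from math import sqrt


def moving_average(series, window):
    """Trailing moving average; early points average over fewer values."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(series):
        running += value
        if i >= window:
            running -= series[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up daily values for one entity: reference counts and net polarity.
daily_references = [12, 15, 9, 30, 28, 14, 11]
daily_polarity = [3, 4, 1, 10, 9, 2, 2]

print(moving_average(daily_references, window=3))
print(pearson(daily_references, daily_polarity))
```

A trailing window is used so that each smoothed value depends only on past days, which matters if the series is later fed into a predictive model.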

Article created on: November 2010

Department of Computer Science • Stony Brook University, Stony Brook, NY 11794-4400 • 631-632-8470 or 631-632-8471