The Blog Mining and Emotion Argumentation project aims to mine the information available in blogs, one of the most important communicative and informative repositories of text-based emotional content in Web 2.0. The torrent of facts, data, figures and insights that blogs deliver daily is random and chaotic, yet immensely valuable in the right context.
Blogs are widely used by internet users because they make it easy to disseminate information and present ideas on various topics. They have increasingly become an important source of information about users’ ideas, sensitivities and sentiments. This subjective information helps in understanding a blogger’s views and observations about various topics.
Since bloggers form a broadly representative sample of the population, it is important to identify the sentiments, both positive and negative, expressed about a topic in order to understand public views in detail. As the amount of on-line text keeps growing, it becomes increasingly difficult for humans to process the deluge of information in the time available. Automatic text processing is an obvious solution to the information overload problem in blog mining. We need automatic text processing systems to help us scan through huge volumes of text, route them to relevant parties, filter them into prespecified categories, or even summarize them. One crucial step towards this is identifying the major topics of the texts, since summarization, text routing and similar tasks all depend on knowing the topics.
In the present thesis, we mainly target the following topics:
Topic Identification using Bigram, Named Entity, and Sentiment approaches:
The rapid growth of blog documents in Web 2.0 and the categorization of search applications by topic motivate us to develop a system that identifies the topic names of blog documents using Bigram, Named Entity and Sentiment features. Named Entity features are extracted using the Stanford NER, and sentiment scores are associated with the blog documents using SentiWordNet. Each individual module (Bigram, Named Entity or Sentiment) produces a topic bag for each blog document, containing the probable topic names of that blog. Although the combined Bigram and Sentiment module performs better than the combined Bigram and Named Entity module, the combination of all three modules produces satisfactory results. We evaluated these approaches using the Mean Reciprocal Rank. We identify the sentiment of the documents at the phrase level as well as the sentence level.
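As an illustration of the evaluation measure mentioned above, the following is a minimal sketch of Mean Reciprocal Rank over topic bags, assuming each topic bag is a ranked list of candidate topic names and each document has exactly one gold topic name. The function names, the case-insensitive matching and the sample topics other than "2G Scam" are assumptions made for this example, not details taken from the project.

```python
from typing import Sequence


def reciprocal_rank(ranked_topics: Sequence[str], gold_topic: str) -> float:
    """Return 1/rank of the first candidate equal to the gold topic, else 0.0."""
    gold = gold_topic.strip().lower()
    for rank, candidate in enumerate(ranked_topics, start=1):
        # Assumption: candidates match the gold topic case-insensitively.
        if candidate.strip().lower() == gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(topic_bags: Sequence[Sequence[str]],
                         gold_topics: Sequence[str]) -> float:
    """Average the reciprocal ranks over all documents."""
    if len(topic_bags) != len(gold_topics):
        raise ValueError("need exactly one gold topic per topic bag")
    if not topic_bags:
        return 0.0
    total = sum(reciprocal_rank(bag, gold)
                for bag, gold in zip(topic_bags, gold_topics))
    return total / len(topic_bags)


# Hypothetical topic bags for three blog documents (illustrative data only).
bags = [
    ["2g scam", "public money"],    # gold topic at rank 1 -> 1.0
    ["world cup", "cricket world"],  # gold topic at rank 2 -> 0.5
    ["lokpal bill"],                 # gold topic missing   -> 0.0
]
gold = ["2G Scam", "cricket world", "budget"]
mrr = mean_reciprocal_rank(bags, gold)
```

A document whose topic bag does not contain the gold topic contributes zero, so the score rewards modules that place the correct topic name near the top of the bag.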
WordNet Affect for Identifying Emotions in the Telugu Text:
To the best of our knowledge, most of the lexical resources for emotion or sentiment analysis have been created for English. A recent study shows that non-native English speakers support the growing use of the Internet. Hence, there is a demand for automatic text analysis tools and linguistic resources for languages other than English (e.g. Telugu). The present work reports the development of a Telugu WordNet Affect from the English WordNet Affect lists with the help of the English SentiWordNet, the Google dictionary and an English to Telugu bilingual dictionary. The available synsets of the English WordNet Affect were first expanded using SentiWordNet, and the expanded lists were then translated into Telugu using the Google dictionary and the English to Telugu bilingual dictionary. We design a basic system for identifying emotions in Telugu text. A large number of lexical resources (e.g. WordNet and SentiWordNet) and tools (e.g. the Stanford parser and the Stanford Named Entity Recognizer) are currently available for English. Resources such as WordNet and SentiWordNet have been widely used for syntactic and semantic analysis in various NLP tasks for English. For Telugu, however, the only publicly available resource is the Shallow Parser developed by the International Institute of Information Technology, Hyderabad.
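The translation and emotion identification steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny affect lists and the bilingual dictionary are hypothetical stand-ins for the real resources, tokens are split on whitespace, and matching is by exact word form, which would miss inflected Telugu words in practice.

```python
from typing import Dict, List, Set

# Hypothetical excerpt of expanded English affect lists (emotion -> words).
ENGLISH_AFFECT: Dict[str, List[str]] = {
    "joy": ["joy", "happiness"],
    "anger": ["anger"],
    "fear": ["fear"],
}

# Hypothetical toy English-to-Telugu dictionary (a stand-in for the real one).
EN_TE: Dict[str, List[str]] = {
    "joy": ["ఆనందం"],
    "happiness": ["సంతోషం", "ఆనందం"],
    "anger": ["కోపం"],
    "fear": ["భయం"],
}


def translate_affect_lists(affect: Dict[str, List[str]],
                           bilingual: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each emotion class to the set of Telugu translations of its words."""
    telugu: Dict[str, Set[str]] = {}
    for emotion, words in affect.items():
        translated: Set[str] = set()
        for word in words:
            # Words without a dictionary entry are simply skipped.
            translated.update(bilingual.get(word, []))
        telugu[emotion] = translated
    return telugu


def spot_emotions(text: str, lexicon: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count exact-match lexicon hits per emotion (simple keyword spotting)."""
    hits: Dict[str, int] = {}
    for token in text.split():
        for emotion, words in lexicon.items():
            if token in words:
                hits[emotion] = hits.get(emotion, 0) + 1
    return hits


telugu_affect = translate_affect_lists(ENGLISH_AFFECT, EN_TE)
sample = "నాకు చాలా సంతోషం"  # roughly: "I am very happy"
sample_hits = spot_emotions(sample, telugu_affect)
```

Because one English word may have several Telugu translations, the translated lists are kept as sets per emotion class; a real system would also need morphological analysis so that inflected forms still match the lexicon.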
Anaphora Resolution for Telugu Language:
In blogs, bloggers comment on their own views as well as on those of others. Thus, the statements in blogs usually have a nested structure. To track the emotion or opinion holder, it is sometimes necessary to resolve pronouns, which requires anaphora resolution. Most of the recent work on anaphora resolution has been related to Hindi, Malayalam and Tamil. We have attempted to build a rule-based system for anaphora resolution for the Telugu language. The system is based mostly on syntactic information, with only a few semantic and morphological features. We define syntactic cues for each Telugu pronoun and, based on these cues, write rules for pronominal resolution.
For emotion argumentation, we have prepared a rule-based baseline system followed by a machine learning framework and have applied them to two corpora, the ECHR (European Court of Human Rights) corpus and the Araucaria Database. We use the Naïve Bayes, SMO (Sequential Minimal Optimization) and Decision Tree classifiers to evaluate the machine learning framework. The results of the rule-based framework were evaluated manually by experts. We used Bayes’ theorem to find the effect of emotion on generating the conclusion from the set of premises, using the notion of argumentation.
Necessity of WordNet Affect
Emotion analysis, a recent subdiscipline at the crossroads of information retrieval and computational linguistics, is becoming increasingly important from the application viewpoint of affective computing. Most emotion-related affect analysis methods are based on textual keyword spotting and therefore require specific lexical resources to be built. WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (Synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual relations, both semantic and lexical. WordNet Affect is an extension of WordNet Domains and includes a subset of Synsets suitable for representing affective concepts correlated with affective words.
SentiWordNet (Stefano et al., 2010) is a lexical resource used in opinion mining and sentiment analysis that assigns positive, negative and objective scores to each Synset of WordNet (Miller A. G, 1995). The Subjectivity wordlist (Banea et al., 2008) marks words as strongly or weakly subjective and assigns them a prior polarity of positive, negative or neutral. The Affective lexicon (Carlo Strapparava and Valitutti A, 2004), one of the most efficient resources for emotion analysis, contains words that convey emotion. It is a small but widely used lexical resource, valued for its affective annotation. To the best of our knowledge, all of these lexical resources have been created for English. A recent study shows that non-native English speakers support the growing use of the Internet. Hence, there is a demand for automatic text analysis tools and linguistic resources for languages other than English (e.g. Telugu).
About Emotion Argumentation
Argumentation mining occupies a position between natural language processing, argumentation theory and information retrieval. Argumentation mining aims to automatically detect, classify and structure argumentation in text. Argumentation mining focuses on the detection of all the arguments in a text and their relationships with their preceding and following arguments (Mochales and Moens, 2011). Argumentation mining does not analyze the validity of the argumentation or its correctness. The aim is to detect those pieces of text which seem to function as argumentative (from a linguistic and semantic point of view) and the relations between them, i.e., their structure.
The result is an argumentative structure of the text derived from the linguistic analysis of its propositions. Argumentation mining is therefore an important part of a complete argumentation analysis, which involves understanding the content of serial arguments, their linguistic structure and the relationship between preceding and following arguments, recognizing the underlying conceptual beliefs, and understanding all of these within the overall coherence of the specific topic. Argumentation theory is an increasingly important area of artificial intelligence, and mechanisms that can automatically detect argument structure offer a novel area of research.
Argumentation is the process by which arguments are constructed and handled, and it constitutes a major component of human intelligence. An argument is a collection of propositions, all of which are premises except at most one, which is the conclusion. Emotion argumentation means evaluating the consistency of emotions from a set of premises to the corresponding conclusion.
Our Aims and Approaches
The goals of Blog Mining are to collect the blog corpus and design a system for
(i) topic identification and other text processing tasks such as text summarization, text categorization and information routing,
(ii) identifying the emotions in Telugu text, and
(iii) resolving anaphora for the Telugu language.
The goals of Emotion Argumentation are
a) developing, analyzing and categorizing the arguments,
b) finding the effect of emotions on the generation of conclusions, and
c) evaluating the consistency of emotions from a set of premises to the corresponding conclusion.
The System Diagram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram" and so on. We calculated unigram counts and then retrieved the unigrams for the blog documents with respect to five different topics. We observed, however, that unigrams fail to produce complete topic names. For example, the top-5 relevant unigrams for the topic “2G Scam” are ‘public’, ‘money’, ‘people’, ‘like’ and ‘scam’. Thus, to improve the performance of the topic identification system, we moved to a bigram count approach. Bigram counts follow the same principle as unigram counts, but instead of counting occurrences of single words, they count the frequency of pairs of adjacent words. We calculated bigram frequencies, tagged the bigrams in the input file, and retrieved the top-5 bigrams by frequency count for the blog documents.
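The counting step described above can be sketched in a few lines. This is a minimal illustration, assuming a simple lowercase alphanumeric tokenizer and no stopword filtering; the function names and the two sample sentences are invented for the example and are not taken from the project's corpus.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_ngrams(documents: Iterable[str], n: int = 2,
               k: int = 5) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most frequent n-grams over all documents.

    n-grams are formed inside each document only, never across documents.
    """
    counts: Counter = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        # zip over n shifted copies of the token list yields contiguous n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Hypothetical mini-corpus about one topic (illustrative data only).
docs = [
    "The 2G scam cost public money.",
    "People say the 2G scam is like no other scam.",
]
top_bigrams = top_ngrams(docs, n=2, k=2)
top_unigrams = top_ngrams(docs, n=1, k=1)
```

On this toy input the most frequent unigram is the single word "scam", while the two most frequent bigrams are "the 2g" and "2g scam". This mirrors the observation that word pairs come closer to a complete topic name, and it also suggests that some stopword filtering may be useful before ranking.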
In this project, we have collected a blog corpus on recent topics and developed a prototype system for evaluating the performance of topic name identification. We have incorporated some simple features, namely Bigram, Named Entity and Sentiment Words, to identify the topic names from the blog documents. To the best of our knowledge, most of the lexical resources for emotion or sentiment analysis have been created for English.
A recent study shows that non-native English speakers support the growing use of the Internet. Hence, there is a demand for automatic text analysis tools and linguistic resources for languages other than English (e.g. Telugu). The present work reports the development of a Telugu WordNet Affect from the English WordNet Affect lists with the help of the English SentiWordNet, the Google dictionary and an English to Telugu bilingual dictionary.