Collection of Research Papers: Web Search and Data Mining
Keywords:| Social Web | Web Spam | Propaganda | evolution | Google | WWW |
Abstract: Search engines have greatly influenced the way we experience the web. Since the early days of the web, users have been relying on them to get informed and make decisions. When the web was relatively small, web directories were built and maintained using human experts to screen and categorize pages according to their characteristics. By the mid-1990s, however, it was apparent that the human expert model of categorizing web pages does not scale. The first search engines appeared and they have been evolving ever since, taking over the role that web directories used to play. But what need makes a search engine evolve? Beyond the financial objectives, there is a need for quality in search results. Search engines know that the quality of their ranking will determine how successful they are. Search results, however, are not simply based on well-designed scientific principles, but they are influenced by web spammers. Web spamming, the practice of introducing artificial text and links into web pages to affect the results of web searches, has been recognized as a major search engine problem. It is also a serious problem for users because they are not aware of it and they tend to confuse trusting the search engine with trusting the results of a search. In this paper, we analyze the influence that web spam has on the evolution of the search engines and we identify the strong relationship of spamming methods on the web to propagandistic techniques in society. Our analysis provides a foundation for understanding why spamming works and offers new insight on how to address it. In particular, it suggests that one could use social anti-propagandistic techniques to recognize web spam.
Keywords:| Social Web | Web 2.0 | video | ads | politics | elections | Google | WWW |
Abstract: With 39% of Americans admitting the use of the Web to get unfiltered campaign materials, it becomes important to evaluate how they are searching for these materials and what they are finding. Assuming that the search will take place on one of the major search engines, such as Google, the results need to be scrutinized to ensure that standards of fairness and balanced coverage are upheld. In this paper, we offer an exploratory analysis of political online video data collected in the framework of a broader project aimed at capturing efforts of spamming search engine results for political motives. By exploiting online video features such as the added date, number of views, and ranking position, as well as content related features such as description keywords, political inclination of the submitter, the political message, and comments associated with a video, we depict a picture of how the online video medium was used during the last congressional political campaign. Our analysis takes into account three players: video providers (usually the campaigns or other interested parties), video consumers (the users), and facilitators (Google and YouTube). The results show that online video coverage might be susceptible to technological bias that adds to the political bias common in electoral campaigns. Educating the wide audience of users about this inherent bias should be a common effort of the involved players and fairness advocacy groups.
Keywords:| Social Web | Web 2.0 | video | ads | politics | elections | Google | WWW |
Abstract: We have collected a set of 1131 textual ads that appeared in the Google Search results when searching for a candidate name running in the 2008 US Congressional elections. We have categorized the advertisers in four different categories: commercial, partisan, non-affiliated, and media. By analyzing the content of the collected ads, we discovered that the majority of them (63%) are commercial ads that have no political message, while the partisan group contributed only 14% of the ads. Furthermore, only 21 out of 124 monitored candidates were actively participating in sponsored search, by providing their own political message. We describe the different ways in which the advertisements were used and several problems that damage the quality of sponsored search, providing some suggestions to avoid such issues in the future.
Keywords:| Web search | Information Reliability | Web graph | Link structure | Propaganda | Trust | Web Spam |
Abstract: Search engines have greatly influenced the way we experience the web. Since the early days of the web, people have been relying on search engines to find useful information. However, their ability to provide useful and unbiased information can be manipulated by Web spammers. Web spamming, the practice of introducing artificial text and links into web pages to affect the results of searches, has been recognized as a major problem for search engines. But it is mainly a serious problem for web users because they tend to confuse trusting the search engine with trusting the results of a search. In this paper, we first discuss the relationship between Web spam in the cyber world and social propaganda in the real world. Then, we propose “backwards propagation of distrust,” as an approach to finding spamming untrustworthy sites. Our approach is inspired by the social behavior associated with distrust. In society, recognition of an untrustworthy entity (person, institution, idea, etc.) is a reason for questioning the trustworthiness of those that recommended this entity. People that are found to strongly support untrustworthy entities become untrustworthy themselves. In other words, in society, distrust is propagated backwards. Our algorithm simulates this social behavior on the web graph with considerable success. Moreover, by respecting the user’s perception of trust through the web graph, our algorithm makes it possible to resolve the moral question of who should be making the decision of weeding out untrustworthy spammers in favor of the user, not the search engine or some higher authority. Our approach can lead to browser-level, or personalized server-side, web filters that work in synergy with the powerful search engines to deliver personalized, trusted web results.
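The idea of propagating distrust backwards along links can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `propagate_distrust`, the `threshold` rule (a page is distrusted once at least half of its out-links point at distrusted pages), and the toy graph are all assumptions made for illustration of "strongly supporting" an untrustworthy entity.

```python
from collections import defaultdict


def propagate_distrust(out_links, seeds, threshold=0.5, max_rounds=10):
    """Return the set of pages marked untrustworthy.

    out_links: dict mapping each page to the list of pages it links to.
    seeds: pages already known to be untrustworthy.
    threshold: ASSUMED fraction of a page's out-links that must point at
        distrusted pages before the page itself is distrusted.
    """
    # Build the reverse graph so links can be walked backwards.
    back_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            back_links[dst].add(src)

    distrusted = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        # Only pages linking to a newly distrusted page can change status.
        candidates = set()
        for node in frontier:
            candidates |= back_links[node]
        candidates -= distrusted

        newly = set()
        for page in candidates:
            targets = out_links.get(page, [])
            bad = sum(1 for t in targets if t in distrusted)
            if targets and bad / len(targets) >= threshold:
                newly.add(page)
        if not newly:
            break
        distrusted |= newly
        frontier = newly
    return distrusted


# Hypothetical toy web graph: two link-farm pages support a spam page,
# while a blog links to it only incidentally.
toy_graph = {
    "spam": [],
    "farm1": ["spam"],
    "farm2": ["spam", "farm1"],
    "blog": ["spam", "news", "wiki"],
    "news": ["wiki"],
    "wiki": [],
}
flagged = propagate_distrust(toy_graph, {"spam"})
```

In this toy graph the two farm pages are flagged while the blog, which devotes only one of its three links to the spam page, is not. Because the seed set is an input, a user-chosen seed set yields a user-specific filter, which is in the spirit of the personalization the abstract describes.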
Keywords:| Smart Mobs | Social Web | Reputation Systems |
Abstract: This paper explores the creation of smart mobs and their ability to gain knowledge, social capital, and expand individual authority through participation. The evolution of smart mobs through computer-mediated communications has extended and expanded the traditional realm of individual influence and authority. In the decade since Rheingold identified smart mobs, this type of behavior has continued to grow and expand, influencing social, political and economic domains. But what makes or breaks a smart mob? Researchers in the last decade have tried unsuccessfully to identify their key components. This paper offers a “recipe” for creating smart mobs, by discovering the must-have characteristics that are necessary for their success: Desire for Communication; Affordable Communication Devices; Opportunities for Instantaneous Communication; Shared Goal; and Small Time Frame. In conclusion we propose that the future of effective smart mobs can be based on a template that has been successfully implemented since 2008, and that as individuals gain authority through the vehicle of smart mobs, governments will have to re-define their role in response to the collective action facilitated by smart mobs.
Keywords:| Social Web | Web 2.0 | Twitter | Real-Time Web | politics | elections | Twitter-bomb | Google | WWW |
Abstract: Recently, all major search engines introduced a new feature: real-time search results, embedded in the first page of organic search results. The content appearing in these results is pulled within minutes of its generation from the so-called “real-time Web” such as Twitter, blogs, and news websites. In this paper, we argue that in the context of political speech, this feature provides disproportionate exposure to personal opinions, fabricated content, unverified events, lies and misrepresentations that otherwise would not find their way in the first page, giving them the opportunity to spread virally. To support our argument we provide concrete evidence from the recent Massachusetts (MA) senate race between Martha Coakley and Scott Brown, analyzing political community behavior on Twitter. In the process, we analyze the Twitter activity of those involved in exchanging messages, and we find that it is possible to predict their political orientation and detect attacks launched on Twitter, based on behavioral patterns of activity.
Keywords:| Google | economy | Google Trends | prediction |
Abstract: Can Google queries help predict economic activity? Economists, investors, and journalists avidly follow monthly government data releases on economic conditions. However, these reports are only available with a lag: the data for a given month are generally released about halfway through the next month, and are typically revised several months later. Google Trends provides daily and weekly reports on the volume of queries related to various industries. We hypothesize that this query data may be correlated with the current level of economic activity in given industries and thus may be helpful in predicting the subsequent data releases. We are not claiming that Google Trends data help predict the future. Rather we are claiming that Google Trends may help in predicting the present. For example, the volume of queries on a particular brand of automobile during the second week in June may be helpful in predicting the June sales report for that brand, when it is released in July. Our goals in this report are to familiarize readers with Google Trends data, illustrate some simple forecasting methods that use this data, and encourage readers to undertake their own analyses. Certainly it is possible to build more sophisticated forecasting models than those we describe here. However, we believe that the models we describe can serve as baselines to help analysts get started with their own modeling efforts and can subsequently be refined for specific applications. The target audience for this primer is readers with some background in econometrics or statistics. Our examples use R, a freely available open-source statistics package; we provide the R source code for the worked-out example in Section 1.2 in the Appendix.
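"Predicting the present" can be sketched as a small regression in which this period's value depends on last period's value plus the current query index. The sketch below is in Python rather than the paper's R, and it is only a baseline of the general kind the abstract mentions, not the authors' code: the model form, the function names `ols`, `fit_nowcast`, and `nowcast`, and the synthetic series are all assumptions.

```python
def ols(rows, targets):
    """Least-squares coefficients via the normal equations (pure Python)."""
    k = len(rows[0])
    # Build X^T X and X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def fit_nowcast(sales, trends):
    """Fit the ASSUMED model sales[t] ~ b0 + b1 * sales[t-1] + b2 * trends[t]."""
    rows = [[1.0, sales[t - 1], trends[t]] for t in range(1, len(sales))]
    return ols(rows, sales[1:])


def nowcast(coefs, last_sales, current_trends):
    """Estimate the not-yet-released figure from the current query index."""
    b0, b1, b2 = coefs
    return b0 + b1 * last_sales + b2 * current_trends


# Synthetic, made-up series generated from known coefficients (2, 0.5, 3),
# so the fit should recover approximately those values.
trends = [1, 4, 2, 5, 3, 6, 2, 7]
sales = [10.0]
for t in range(1, len(trends)):
    sales.append(2 + 0.5 * sales[-1] + 3 * trends[t])

coefs = fit_nowcast(sales, trends)
estimate = nowcast(coefs, last_sales=sales[-1], current_trends=4)
```

The query index enters contemporaneously (`trends[t]`, not `trends[t-1]`), which is what distinguishes estimating the current, unreleased figure from forecasting a future one.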
Keywords:| Social Web | Web 2.0 | Twitter | politics | elections | social networks |
Abstract: We connect measures of public opinion measured from polls with sentiment measured from text. We analyze several surveys on consumer confidence and political opinion over the 2008 to 2009 period, and find they correlate to sentiment word frequencies in contemporaneous Twitter messages. While our results vary across datasets, in several cases the correlations are as high as 80%, and capture important large-scale trends. The results highlight the potential of text streams as a substitute and supplement for traditional polling.
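A comparison between text sentiment and poll numbers can be sketched in a few steps: count sentiment words per day, smooth the daily score, and correlate it with the poll series. The tiny word lists, the add-one ratio, and the trailing moving average below are all assumptions for illustration; the abstract does not specify the lexicon or the smoothing used.

```python
import math

# Hypothetical miniature lexicon; a real study would use a published one.
POSITIVE = {"good", "great", "hope"}
NEGATIVE = {"bad", "worse", "fear"}


def sentiment_ratio(messages):
    """Ratio of positive to negative word counts for one day's messages.

    Add-one smoothing (an assumption) avoids division by zero.
    """
    pos = neg = 0
    for text in messages:
        for word in text.lower().split():
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    return (pos + 1) / (neg + 1)


def moving_average(series, window):
    """Trailing moving average; early points use a shorter window."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Given one list of messages per day and a poll series aligned to the same days, a call such as `pearson(moving_average(daily_ratios, 7), poll_values)` would give the kind of correlation the abstract reports; the window of 7 is again only an example.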
Keywords:| graph | network | structure | algorithms |
Abstract: Graphs (networks) are very common data structures which are handled in computers. Diagrams are widely used to represent the graph structures visually in many information systems. In order to automatically draw the diagrams which are, for example, state graphs, data-flow graphs, Petri nets, and entity-relationship diagrams, basic graph drawing algorithms are required. There have been only a few algorithms for general undirected graphs. This paper presents a simple but successful algorithm for drawing undirected graphs and weighted graphs. The basic idea of our algorithm is as follows. We regard the desirable "geometric" (Euclidean) distance between two vertices in the drawing as the "graph theoretic" distance between them in the corresponding graph. We introduce a virtual dynamic system in which every two vertices are connected by a "spring" of such desirable length. Then, we regard the optimal layout of vertices as the state in which the total spring energy of the system is minimal. This paper brings a new significant result in graph drawing based on the spring model.
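The spring model described above can be sketched as follows. The energy used here, a sum over vertex pairs of half the spring stiffness times the squared difference between drawn and ideal length, with stiffness taken as the inverse square of the graph distance, is one common formulation and is an assumption here, since the abstract does not give the formula. The plain gradient-descent loop is a stand-in chosen for brevity and should not be read as the paper's minimization procedure. The graph is assumed connected and unweighted.

```python
import math
from collections import deque


def graph_distances(adj):
    """All-pairs shortest-path lengths by BFS (connected, unweighted graph)."""
    dist = {}
    for src in adj:
        d = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[src] = d
    return dist


def spring_energy(pos, dist, edge_len=1.0):
    """Total energy: sum over pairs of 0.5 * k * (drawn - ideal)^2."""
    nodes = list(pos)
    total = 0.0
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            i, j = nodes[a], nodes[b]
            d = dist[i][j]
            ideal = edge_len * d          # desired Euclidean distance
            k = 1.0 / (d * d)             # ASSUMED stiffness
            drawn = math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1])
            total += 0.5 * k * (drawn - ideal) ** 2
    return total


def relax(pos, dist, steps=200, rate=0.1, edge_len=1.0):
    """Lower the spring energy by plain gradient descent (a stand-in)."""
    pos = {n: list(p) for n, p in pos.items()}
    nodes = list(pos)
    for _ in range(steps):
        grad = {n: [0.0, 0.0] for n in nodes}
        for i in nodes:
            for j in nodes:
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                drawn = math.hypot(dx, dy) or 1e-9  # guard coincident points
                d = dist[i][j]
                coeff = (drawn - edge_len * d) / (d * d * drawn)
                grad[i][0] += coeff * dx
                grad[i][1] += coeff * dy
        for n in nodes:
            pos[n][0] -= rate * grad[n][0]
            pos[n][1] -= rate * grad[n][1]
    return {n: tuple(p) for n, p in pos.items()}


# Hypothetical example: a three-vertex path a - b - c, badly placed at first.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
dist = graph_distances(path)
start = {"a": (0.0, 0.0), "b": (0.2, 0.1), "c": (0.1, 0.3)}
layout = relax(start, dist)
```

For the path graph, a layout with the three vertices evenly spaced on a line has zero energy, since every pair then sits exactly at its graph distance.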
Keywords:| government | framework | digital network | democracy | internet |
Abstract: In this paper, we take preliminary steps toward building a framework that explains the potential impact of digital networks on democratic processes and demonstrate why they may be ineffective or even detrimental. When most people talk about the impact of the Internet on democracy, they cite the use of digital tools during political revolutions, protests, and other "flashpoints" that often make headlines briefly and then fall from the public eye. We argue that to understand the real potential impact of the Internet, it is necessary to look at how the Internet will strengthen the quality of democracy in individual states over the long term. To better explain how this may happen, we draw on the work of previous authors who have described the key differences between the horizontal processes within governments and the vertical processes between citizens and governments. We contend that the Internet is transforming peer-to-peer relationships (the way citizens interact with one another) as well as the vertical relationships between citizens and government. However, the Internet and digitally networked technologies are not as good at improving the relationships and processes among government institutions, in other words, the horizontal processes. We believe this may explain the potential and limits of using networked technologies to strengthen democracy. This also leaves a number of open questions about the lasting impact of the Internet on democracy and its ability to affect the consolidation of democracy globally.