Nobody's Papers
“Why we twitter: understanding Microblogging usage and Communities” by Java et al.

One year and a half ago I spent lot of times reading those foundational papers about Web Information Retrieval where first analysis of usage on search engines (main information retrieval systems for Web) were performed.

Those papers showed that the average user on the web search for information in a very different way, hardly describing their goals, visiting few alternatives and spending not much time.

Some months ago I was thinking about the quality and quantity of information that a search engine shows to the user in order to accomplish their goals and then a discussion with a colleague of mine about twitter made me think that a study focusing on how and why people uses twitter would be very interesting.

Obviously you cannot figure out the reasons because every single user is twitting, but we cannot figure out the real motivations behind every single query submitted to a search engine and we are still trying to have a better understanding.

My first thought was to apply Broder’s taxonomy to tweets, but it’s clear that Broder’s Taxonomy can only be applied to Information Retrieval actions, not Information Creation actions (such as a tweet). An equivalent taxonomy should be found and that path led too far from current interests.

Last days I was thinking again about the information exposed by a search engine and Dani recommended me:

Why We Twitter: Understanding Microblogging Usage and Communities by Java et al. (Akshay Java, Xiaodan Song, Tim Finin and Belle Tseng) http://doi.acm.org/10.1145/1348549.1348556

In this paper the authors perform an analysis over a dataset from twitter collected by the authors during a period of two months focusing on the network properties, geographic distribution of the users and user intention.

According to user intention the authors implement the HITS algorithm in order to find authorities (users with a high number of followers) and hubs (users who follows lot of authorities). This way they identify 3 kind of users:

  1. Those who share information.
  2. Those who seek information.
  3. Those who want to be connected with friends.

A user can overlap in these categories. For example, I have an account in Twitter because I want to stay in touch with my friends and I want to receive information about my interests.

Categories 1 and 2 can be solved with HITS algorithm but category 3 demands a community analysis which is also performed in this paper, showing some examples of overlapping comunities.

A term trends (in fact this was the related part to my research right now) study was also performed, using a log-likelihood measure to find descriptive term for a day.

I think this is a very interesting paper and I’m quite sure that we’re going to see some other studies about twitter and other microblogging communities.

Akshay Java, Xiaodan Song, Tim Finin, & Belle Tseng (2007). Why we twitter: understanding microblogging usage and communities Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, 56-65 : 10.1145/1348549.1348556

Comments
“The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems” by Krause et al.

After reading “Combating Spam in Tagging Systems” by Koutrika et al. I followed the citation path looking for other papers and I found this:

The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems by Krause et al. (Beate Krause, Christoph Schmitz, Andreas Hotho and Gerd Stumme) http://doi.acm.org/10.1145/1451983.1451998

This paper deals with spam detection in Bibsonomy website. Bibsonomy is an alternative social bookmarking to del.icio.us focused in the academic community, as it includes the option of sharing papers, not only URL’s (as del.icio.us does).

As any well-known social network, Bibsonomy must deal with spam users who try to give relevance to their own bookmarks. As the community grows inspecting manually all the introduced bookmarks is unfeasible, so this paper shows a way to find the spam in an automatic way.

The paper applies a machine learning approach, where an algorithm is trained with a previously analyzed dataset (training dataset) to find the features shown by spam users, in order to detect new spam users (which are supposed to have the same features). The features are divided in four categories:

  • Profile-based: Those based on the information given by the user when registering (i.e. length of the nickname, digits in the nickname, etc.)
  • Location-based: Those based on the location of the user extracted from the e-mail address (i.e. the domain) or the IP address.
  • Activity-based: Those based on the behaviour of the user (i.e. The number of tags per resource, the time before the first post, etc.)
  • Semantic: Those based on the tags submitted (i.e. tags previously defined as spam) and the coocurrence with other spam users.

Once they are identified the paper makes an evaluation over a test dataset. 4 different clasification techniques were used being SVM (Support Vector Machinesthe one which achieved the best performance. Features were also analyzed separately, finfing out that those based on coocurrence (not all the semantic group, but also coocurrence) are the best ones for predicting spam (followed by other semantic features).

Unfortunately the most interesting group for me (the based on activity features) is the second less useful group, although I can think some other activity based features which may seem interesting and have not been used here.

By reading this paper I have also come to the Bibsonomy dataset homepage which offers database dumps with research purposes. I will try to get some of these datasets as they can be used as training dataset for some of my ideas.

From this datasets homepage I got to ECML PKDD Discovery Challenge 2008 which hosted a competition of spam detection in a Bibsonomy dataset. I think some of its papers may be interesting for me.

Comments
“Combating Spam in Tagging Systems” by Koutrika et al.

I’m interested in how SPAM in social networks is avoided when it comes to resources tagging.

By resources tagging I mean “Ok, this resource is correctly associated with this tag” which differs a little bit from the usual behaviour when a user enters a URL in del.icio.us.

In my case I have a series of resources that are shown to the user when he/she asks for a specific tag (or query if you want) and the user is able to confirm the association between the resource and the tag.

Searching a little bit I have found

Combating Spam in Tagging Systems” by Koutrika et al. (Georgia Koutrika, Frans Effendi, Zoltan Gyongyi, Paul Heymann y Hector Garcia-Molina) http://doi.acm.org/10.1145/1409220.1409225

in the following URL: http://heymann.stanford.edu/tagspam.html

The paper is easy to read (not too much maths involved), well structured and very useful. It offers lots of data and itneresting concussions.

I’m a little bit disappointed though because in the paper no real widely used tagging system is studied. Instead, an ideal system is proposed where ideal users follow previously defined behaviours (including spammers behaviours) and consequences of modifying some system parameters are analyzed (i.e. how many taggings a user performs, how many bad users are in the system, how many resources are in the network, etc.).

Despite this the paper is highly interesting as it’s a glimpse of how the system degrade as bad users begin to hack it.

Some interesting thoughts for me:

  • The idea of an spam “ghetto”, a set of queries to the system where most of the spammers are confined and where they don’t disturb good users as these doesn’t search in this “ghettos”.
  • The interest of analizing the time where the taggings are performed. A possible research line may appear if we apply some time series analysis.
  • The user profiles described in section 3 are interesting as the allow an easy implementation so a simulator can be built (in fact, they built one for their ideal system). I miss some bad behaviours though (i.e. the bad user which overemphasizes the relevance of a resource for a tag which, in fact, is valid).
  • Some of the concussions are interesting, but they cannot be summarized here (you must read the paper :) ).
  • Most of the methods rely on user profile. This is a pity, as I probably won’t have such a profile (if I have a spammer could easily fake a new one).

As a summary, this paper was interesting, but it’s not exactly what I was looking for. It’s not because of the lack of real analysis (in the end, it seems to be a research subject in progress and the may perform this kind of analysis in the future) but because I’m more interested in a time-centered annonymous analysis.

How the network evolves when a spam attack is on course? How does the ant colony defend itself?

Comments