“The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems” by Krause et al.
After reading “Combating Spam in Tagging Systems” by Koutrika et al., I followed the citation trail looking for related papers and found this one:
The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems by Krause et al. (Beate Krause, Christoph Schmitz, Andreas Hotho and Gerd Stumme) http://doi.acm.org/10.1145/1451983.1451998
This paper deals with spam detection on the Bibsonomy website. Bibsonomy is a social bookmarking alternative to del.icio.us aimed at the academic community: it lets users share papers, not only URLs (as del.icio.us does).
Like any well-known social network, Bibsonomy must deal with spam users who try to promote their own bookmarks. As the community grows, manually inspecting every new bookmark becomes infeasible, so this paper shows a way to detect spam automatically.
The paper takes a machine learning approach: an algorithm is trained on a previously labelled dataset (the training dataset) to learn the features exhibited by spam users, so that new spam users (who are assumed to show the same features) can be detected. The features are divided into four categories:
- Profile-based: Those based on the information given by the user when registering (e.g. length of the nickname, digits in the nickname, etc.)
- Location-based: Those based on the location of the user, extracted from the e-mail address (e.g. the domain) or the IP address.
- Activity-based: Those based on the behaviour of the user (e.g. the number of tags per resource, the time before the first post, etc.)
- Semantic: Those based on the tags submitted (e.g. tags previously defined as spam) and the co-occurrence with other spam users.
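To make the four categories more concrete, here is a minimal Python sketch that computes one or two features per category for a single user. The `User` record, its field names and the `spam_tags` set are my own assumptions for illustration; they are not Bibsonomy's schema nor the exact features used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # Hypothetical user record; the field names are illustrative only.
    nickname: str
    email: str
    posts: list = field(default_factory=list)  # each post is a list of tags
    seconds_to_first_post: float = 0.0


def extract_features(user, spam_tags):
    """Return a few illustrative features, grouped by the four categories."""
    all_tags = [tag for post in user.posts for tag in post]
    n_posts = len(user.posts)
    return {
        # Profile-based
        "nickname_length": len(user.nickname),
        "nickname_digits": sum(ch.isdigit() for ch in user.nickname),
        # Location-based (only the e-mail domain; no IP lookup here)
        "email_domain": user.email.rsplit("@", 1)[-1].lower(),
        # Activity-based
        "tags_per_post": len(all_tags) / n_posts if n_posts else 0.0,
        "seconds_to_first_post": user.seconds_to_first_post,
        # Semantic: share of the user's tags already known as spam tags
        "spam_tag_ratio": (
            sum(tag in spam_tags for tag in all_tags) / len(all_tags)
            if all_tags else 0.0
        ),
    }
```

A dictionary like this could then be turned into a numeric vector and fed to whatever classifier is being trained. The co-occurrence features are left out because they need the whole user population, not a single user.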
Once the features are identified, the paper evaluates them on a test dataset. Four different classification techniques were used, with SVM (Support Vector Machines) achieving the best performance. Features were also analyzed separately, finding that those based on co-occurrence (not the whole semantic group, only the co-occurrence ones) are the best predictors of spam (followed by the other semantic features).
Unfortunately, the group most interesting to me (the activity-based features) is the second least useful one, although I can think of some other activity-based features that might be interesting and have not been used here.
Reading this paper also led me to the Bibsonomy dataset homepage, which offers database dumps for research purposes. I will try to get some of these datasets, as they can be used as training data for some of my ideas.
From the datasets homepage I got to the ECML PKDD Discovery Challenge 2008, which hosted a spam detection competition on a Bibsonomy dataset. I think some of its papers may be interesting for me.
“Combating Spam in Tagging Systems” by Koutrika et al.
I’m interested in how spam in social networks is avoided when it comes to resource tagging.
By resource tagging I mean “OK, this resource is correctly associated with this tag”, which differs a little from the usual behaviour when a user enters a URL in del.icio.us.
In my case I have a series of resources that are shown to the user when he/she asks for a specific tag (or query, if you prefer), and the user can confirm the association between the resource and the tag.
Searching a little, I found
“Combating Spam in Tagging Systems” by Koutrika et al. (Georgia Koutrika, Frans Effendi, Zoltan Gyongyi, Paul Heymann and Hector Garcia-Molina) http://doi.acm.org/10.1145/1409220.1409225
at the following URL: http://heymann.stanford.edu/tagspam.html
The paper is easy to read (not too much maths involved), well structured and very useful. It offers lots of data and interesting conclusions.
I’m a little disappointed, though, because the paper does not study any real, widely used tagging system. Instead, it proposes an ideal system in which ideal users follow predefined behaviours (including spammer behaviours) and analyzes the consequences of changing some system parameters (e.g. how many taggings a user performs, how many bad users are in the system, how many resources are in the network, etc.).
Despite this, the paper is highly interesting, as it gives a glimpse of how the system degrades as bad users begin to hack it.
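To give a feeling for this kind of experiment, here is a toy simulation in Python. It is a sketch under my own simplifying assumptions and not the model from the paper: every resource has exactly one correct tag, good users always post the correct tag, bad users always post a wrong one, and a query for a tag returns the `top_k` resources with the most postings for that tag. The function name `simulate` and all its parameters are hypothetical.

```python
import random
from collections import Counter


def simulate(bad_fraction, n_users=200, n_resources=100, n_tags=10,
             taggings_per_user=10, top_k=5, seed=0):
    """Return the share of wrong resources among the top-k results of all tag queries."""
    rng = random.Random(seed)
    # Toy assumption: each resource has exactly one correct tag (needs n_tags >= 2).
    correct = [rng.randrange(n_tags) for _ in range(n_resources)]
    n_bad = round(n_users * bad_fraction)
    counts = Counter()  # (tag, resource) -> number of postings
    for user in range(n_users):
        is_bad = user < n_bad
        for _ in range(taggings_per_user):
            resource = rng.randrange(n_resources)
            if is_bad:
                # Bad user: any tag except the correct one.
                wrong_tags = [t for t in range(n_tags) if t != correct[resource]]
                tag = rng.choice(wrong_tags)
            else:
                # Good user: always the correct tag.
                tag = correct[resource]
            counts[(tag, resource)] += 1
    wrong = shown = 0
    for tag in range(n_tags):
        # Rank the resources posted under this tag by number of postings.
        tagged = [r for r in range(n_resources) if counts[(tag, r)] > 0]
        ranked = sorted(tagged, key=lambda r: counts[(tag, r)], reverse=True)[:top_k]
        wrong += sum(correct[r] != tag for r in ranked)
        shown += len(ranked)
    return wrong / shown if shown else 0.0
```

Calling `simulate` for a range of `bad_fraction` values (or varying `taggings_per_user` or `n_resources`) shows how the quality of the query results changes with each parameter, which is the kind of question the paper explores with a much richer model.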
Some interesting thoughts for me:
- The idea of a spam “ghetto”: a set of queries to the system where most of the spammers are confined and where they don’t disturb good users, as the latter don’t search in these “ghettos”.
- The interest of analyzing the time at which the taggings are performed. A possible research line may appear if we apply some time series analysis.
- The user profiles described in section 3 are interesting, as they allow an easy implementation, so a simulator can be built (in fact, they built one for their ideal system). I miss some bad behaviours though (e.g. the bad user who overemphasizes the relevance of a resource for a tag which is, in fact, valid).
- Some of the conclusions are interesting, but they cannot be summarized here (you must read the paper :) ).
- Most of the methods rely on the user profile. This is a pity, as I probably won’t have such a profile (and if I did, a spammer could easily fake a new one).
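As a very rough starting point for the time-based idea in the list above, the following Python sketch buckets tagging timestamps into fixed windows and flags the windows whose activity is far above the usual level. It uses no user identity at all, only when the taggings happened. The function name `burst_windows`, the one-hour window and the "three times the median" threshold are arbitrary assumptions of mine, not something taken from the paper.

```python
from collections import Counter
from statistics import median


def burst_windows(timestamps, window=3600, factor=3.0):
    """Return the start times of windows with more than factor * median taggings.

    timestamps are seconds since some epoch; the median is taken over
    non-empty windows only, which is a simplification.
    """
    # Map every timestamp to the start of the window that contains it.
    counts = Counter(int(t // window) * window for t in timestamps)
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(start for start, n in counts.items() if n > factor * baseline)
```

A proper time series analysis would go well beyond this (seasonality, trends, per-tag series), but even a crude burst detector like this one gives something to compare against.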
In summary, this paper was interesting, but it’s not exactly what I was looking for. That is not because of the lack of analysis of a real system (after all, it seems to be research in progress and they may perform this kind of analysis in the future) but because I’m more interested in a time-centered, anonymous analysis.
How does the network evolve when a spam attack is under way? How does the ant colony defend itself?