After reading “Combating Spam in Tagging Systems” by Koutrika et al., I followed the citation path looking for other papers and found this one:
The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems by Krause et al. (Beate Krause, Christoph Schmitz, Andreas Hotho and Gerd Stumme) http://doi.acm.org/10.1145/1451983.1451998
This paper deals with spam detection on the Bibsonomy website. Bibsonomy is a social bookmarking alternative to del.icio.us aimed at the academic community, as it includes the option of sharing papers, not only URLs (as del.icio.us does).
Like any well-known social network, Bibsonomy must deal with spam users who try to give relevance to their own bookmarks. As the community grows, manually inspecting every submitted bookmark becomes infeasible, so this paper shows a way to find the spam automatically.
The paper applies a machine learning approach, where an algorithm is trained with a previously analyzed dataset (training dataset) to find the features shown by spam users, in order to detect new spam users (which are supposed to have the same features). The features are divided into four categories:
- Profile-based: Those based on the information given by the user when registering (e.g. length of the nickname, digits in the nickname, etc.)
- Location-based: Those based on the location of the user extracted from the e-mail address (e.g. the domain) or the IP address.
- Activity-based: Those based on the behaviour of the user (e.g. the number of tags per resource, the time before the first post, etc.)
- Semantic: Those based on the tags submitted (e.g. tags previously defined as spam) and the co-occurrence with other spam users.
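To make the four categories more concrete for myself, here is a minimal sketch in Python that computes one illustrative feature per category. The `User` record, its field names, the `extract_features` function and the spam tag list are all my own assumptions for illustration; they are not the paper's feature set or the Bibsonomy schema.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # Hypothetical record layout, not the actual Bibsonomy schema.
    nickname: str
    email: str
    posts: list = field(default_factory=list)  # each post is a list of tags


def extract_features(user, spam_tags):
    """Return one illustrative feature per category (assumed, not the paper's exact features)."""
    all_tags = [tag for post in user.posts for tag in post]
    n_posts = len(user.posts)
    return {
        # Profile-based: taken from the registration data
        "nickname_length": len(user.nickname),
        "nickname_digits": sum(ch.isdigit() for ch in user.nickname),
        # Location-based: domain part of the e-mail address
        "email_domain": user.email.rsplit("@", 1)[-1].lower(),
        # Activity-based: average number of tags per post
        "tags_per_post": len(all_tags) / n_posts if n_posts else 0.0,
        # Semantic: share of the user's tags found in a known spam tag list
        "spam_tag_ratio": (
            sum(tag in spam_tags for tag in all_tags) / len(all_tags)
            if all_tags else 0.0
        ),
    }


user = User("cheap4pills2008", "promo@Example.com",
            [["viagra", "cheap", "pills"], ["casino"]])
features = extract_features(user, spam_tags={"viagra", "casino"})
print(features)
```

A dictionary like this would still need to be turned into a numeric vector (for example, by encoding the e-mail domain) before it could be fed to a classifier.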
Once the features are identified, the paper evaluates them on a test dataset. Four different classification techniques were used, with SVM (Support Vector Machines) achieving the best performance. Features were also analyzed separately, finding that those based on co-occurrence (not the whole semantic group, only co-occurrence) are the best ones for predicting spam (followed by the other semantic features).
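Since co-occurrence turned out to be the strongest signal, I tried to sketch what such a feature could look like. This is only my own guess at a simple variant: among the users who share at least one tag with a given user, it measures the fraction that are already known spammers. The function name `cooccurrence_with_spammers` and the toy data are hypothetical, and the paper's actual definition may well differ.

```python
def cooccurrence_with_spammers(user_tags, tags_by_user, known_spammers):
    """Fraction of tag-sharing users that are known spammers (assumed definition)."""
    user_tags = set(user_tags)
    # Users who have at least one tag in common with the user under inspection
    sharing = [name for name, tags in tags_by_user.items()
               if user_tags & set(tags)]
    if not sharing:
        return 0.0
    return sum(name in known_spammers for name in sharing) / len(sharing)


# Toy data: tags already used by other users, two of them labelled as spammers
tags_by_user = {
    "alice": {"python", "ml"},
    "spam1": {"casino", "loans"},
    "spam2": {"loans", "pills"},
}
score = cooccurrence_with_spammers({"loans", "ml"}, tags_by_user,
                                   {"spam1", "spam2"})
print(round(score, 3))
```

A value close to 1 would mean the user mostly shares tags with known spammers, which is the intuition behind using co-occurrence as a predictor.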
Unfortunately, the most interesting group for me (the activity-based features) is the second least useful group, although I can think of some other activity-based features that might be interesting and have not been used here.
Reading this paper also led me to the Bibsonomy dataset homepage, which offers database dumps for research purposes. I will try to get some of these datasets, as they can be used as training datasets for some of my ideas.
From this dataset homepage I got to the ECML PKDD Discovery Challenge 2008, which hosted a spam detection competition on a Bibsonomy dataset. I think some of its papers may be interesting for me.