Nobody's Papers
“A Local Maxima method and a Fair Dispersion Normalization for extracting Multi-Word Units from corpora” by Ferreira and Pereira

Because of my research I’m interested in term correlation not just in pairs but in groups of ‘n’ terms (ngrams). Looking for some statistic measures and explanations about the advantages and implementations of Log-Likelihood measures I reached:

Joaquim Ferreira da Silva, & Gabriel Pereira Lopes (1999). A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora Sixth Meeting on Mathematics of Language

In this paper the authors present a new algorithm for extracting MultiWords Units (MWU) which corresponds with what I called ngrams before.

This algorithm, called LocalMaxs, finds the best MWU from a given sentence finding which MWU has a higher value than any of it’s antecessors (MWUs with less words) and succesors (MWUs with more words), that is a local maxima just as in any other mathematical function.

What I was really interested into was in “how can I know how appropriate is a MWU?”. Here is where the pure statistical knowledge is needed and the second part of the paper.

In this second part they make a very helpful and understandable introduction to a serious of statistical measures. The only problem is all of these measures are thought to evaluate a pair of terms, not a group of ‘n’ terms.

The authors then propose a simple way to generalize the measures from bigram to a ngram (MWU). They consider a ngram just as a bigram being the first term of the bigram the first “x” terms of the MWU and the second term of the bigram the last”n - x” terms.

As an example, consider we have the MWU “michael jordan nba statistics”. We could consider this ngram as a bigram formed by the “terms” “michael jordan nba” and “statistics”.

Of course, depending where you cut (what they call the dispersion point) the MWU you can have different “bigrams” with very different statistical relevance. In our example we could have the bigram [“michael” “jordan nba statistics”], [“michael jordan” “nba statistics”] or [“michael jordan nba” “statistics”].

To solve this dispersion point problem and obtain a simple statistic value they make the arithmetical average of the value of every possible “bigram”.

As if all of this was not enough to justify the reading of this paper every statistical measure is explained with a mathematical formula (not simple, but very helpful) which make all the paper easily reproducible and implementable for your own system.

In fact, I have implemented all of them for a brief evaluation in my own environment. You can download my implementation and play with it. I think I didn’t make any mistake, but if you find something wrong, please let me know it.

In every PHP file are included the required functions and there’s always a function which receives the ngram to evaluate and a text string which will act as a little corpus to find the ngram occurences. It shoud be easy improve it in a more sofisticated way, but my intention was giving a simple implementation of the mathematic formula.

Comments
“Why we twitter: understanding Microblogging usage and Communities” by Java et al.

One year and a half ago I spent lot of times reading those foundational papers about Web Information Retrieval where first analysis of usage on search engines (main information retrieval systems for Web) were performed.

Those papers showed that the average user on the web search for information in a very different way, hardly describing their goals, visiting few alternatives and spending not much time.

Some months ago I was thinking about the quality and quantity of information that a search engine shows to the user in order to accomplish their goals and then a discussion with a colleague of mine about twitter made me think that a study focusing on how and why people uses twitter would be very interesting.

Obviously you cannot figure out the reasons because every single user is twitting, but we cannot figure out the real motivations behind every single query submitted to a search engine and we are still trying to have a better understanding.

My first thought was to apply Broder’s taxonomy to tweets, but it’s clear that Broder’s Taxonomy can only be applied to Information Retrieval actions, not Information Creation actions (such as a tweet). An equivalent taxonomy should be found and that path led too far from current interests.

Last days I was thinking again about the information exposed by a search engine and Dani recommended me:

Why We Twitter: Understanding Microblogging Usage and Communities by Java et al. (Akshay Java, Xiaodan Song, Tim Finin and Belle Tseng) http://doi.acm.org/10.1145/1348549.1348556

In this paper the authors perform an analysis over a dataset from twitter collected by the authors during a period of two months focusing on the network properties, geographic distribution of the users and user intention.

According to user intention the authors implement the HITS algorithm in order to find authorities (users with a high number of followers) and hubs (users who follows lot of authorities). This way they identify 3 kind of users:

  1. Those who share information.
  2. Those who seek information.
  3. Those who want to be connected with friends.

A user can overlap in these categories. For example, I have an account in Twitter because I want to stay in touch with my friends and I want to receive information about my interests.

Categories 1 and 2 can be solved with HITS algorithm but category 3 demands a community analysis which is also performed in this paper, showing some examples of overlapping comunities.

A term trends (in fact this was the related part to my research right now) study was also performed, using a log-likelihood measure to find descriptive term for a day.

I think this is a very interesting paper and I’m quite sure that we’re going to see some other studies about twitter and other microblogging communities.

Akshay Java, Xiaodan Song, Tim Finin, & Belle Tseng (2007). Why we twitter: understanding microblogging usage and communities Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, 56-65 : 10.1145/1348549.1348556

Comments
“The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems” by Krause et al.

After reading “Combating Spam in Tagging Systems” by Koutrika et al. I followed the citation path looking for other papers and I found this:

The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems by Krause et al. (Beate Krause, Christoph Schmitz, Andreas Hotho and Gerd Stumme) http://doi.acm.org/10.1145/1451983.1451998

This paper deals with spam detection in Bibsonomy website. Bibsonomy is an alternative social bookmarking to del.icio.us focused in the academic community, as it includes the option of sharing papers, not only URL’s (as del.icio.us does).

As any well-known social network, Bibsonomy must deal with spam users who try to give relevance to their own bookmarks. As the community grows inspecting manually all the introduced bookmarks is unfeasible, so this paper shows a way to find the spam in an automatic way.

The paper applies a machine learning approach, where an algorithm is trained with a previously analyzed dataset (training dataset) to find the features shown by spam users, in order to detect new spam users (which are supposed to have the same features). The features are divided in four categories:

  • Profile-based: Those based on the information given by the user when registering (i.e. length of the nickname, digits in the nickname, etc.)
  • Location-based: Those based on the location of the user extracted from the e-mail address (i.e. the domain) or the IP address.
  • Activity-based: Those based on the behaviour of the user (i.e. The number of tags per resource, the time before the first post, etc.)
  • Semantic: Those based on the tags submitted (i.e. tags previously defined as spam) and the coocurrence with other spam users.

Once they are identified the paper makes an evaluation over a test dataset. 4 different clasification techniques were used being SVM (Support Vector Machinesthe one which achieved the best performance. Features were also analyzed separately, finfing out that those based on coocurrence (not all the semantic group, but also coocurrence) are the best ones for predicting spam (followed by other semantic features).

Unfortunately the most interesting group for me (the based on activity features) is the second less useful group, although I can think some other activity based features which may seem interesting and have not been used here.

By reading this paper I have also come to the Bibsonomy dataset homepage which offers database dumps with research purposes. I will try to get some of these datasets as they can be used as training dataset for some of my ideas.

From this datasets homepage I got to ECML PKDD Discovery Challenge 2008 which hosted a competition of spam detection in a Bibsonomy dataset. I think some of its papers may be interesting for me.

Comments
“Combating Spam in Tagging Systems” by Koutrika et al.

I’m interested in how SPAM in social networks is avoided when it comes to resources tagging.

By resources tagging I mean “Ok, this resource is correctly associated with this tag” which differs a little bit from the usual behaviour when a user enters a URL in del.icio.us.

In my case I have a series of resources that are shown to the user when he/she asks for a specific tag (or query if you want) and the user is able to confirm the association between the resource and the tag.

Searching a little bit I have found

Combating Spam in Tagging Systems” by Koutrika et al. (Georgia Koutrika, Frans Effendi, Zoltan Gyongyi, Paul Heymann y Hector Garcia-Molina) http://doi.acm.org/10.1145/1409220.1409225

in the following URL: http://heymann.stanford.edu/tagspam.html

The paper is easy to read (not too much maths involved), well structured and very useful. It offers lots of data and itneresting concussions.

I’m a little bit disappointed though because in the paper no real widely used tagging system is studied. Instead, an ideal system is proposed where ideal users follow previously defined behaviours (including spammers behaviours) and consequences of modifying some system parameters are analyzed (i.e. how many taggings a user performs, how many bad users are in the system, how many resources are in the network, etc.).

Despite this the paper is highly interesting as it’s a glimpse of how the system degrade as bad users begin to hack it.

Some interesting thoughts for me:

  • The idea of an spam “ghetto”, a set of queries to the system where most of the spammers are confined and where they don’t disturb good users as these doesn’t search in this “ghettos”.
  • The interest of analizing the time where the taggings are performed. A possible research line may appear if we apply some time series analysis.
  • The user profiles described in section 3 are interesting as the allow an easy implementation so a simulator can be built (in fact, they built one for their ideal system). I miss some bad behaviours though (i.e. the bad user which overemphasizes the relevance of a resource for a tag which, in fact, is valid).
  • Some of the concussions are interesting, but they cannot be summarized here (you must read the paper :) ).
  • Most of the methods rely on user profile. This is a pity, as I probably won’t have such a profile (if I have a spammer could easily fake a new one).

As a summary, this paper was interesting, but it’s not exactly what I was looking for. It’s not because of the lack of real analysis (in the end, it seems to be a research subject in progress and the may perform this kind of analysis in the future) but because I’m more interested in a time-centered annonymous analysis.

How the network evolves when a spam attack is on course? How does the ant colony defend itself?

Comments