Because of my research I’m interested in term correlation not just in pairs but in groups of ‘n’ terms (ngrams). While looking for statistical measures, and for explanations of the advantages and implementations of log-likelihood measures, I came across a paper that covers exactly this.
In this paper the authors present a new algorithm for extracting Multiword Units (MWUs), which correspond to what I called ngrams above.
This algorithm, called LocalMaxs, picks the best MWUs from a given sentence by checking which MWU has a higher value than any of its antecessors (MWUs with fewer words) and successors (MWUs with more words), that is, a local maximum just as in any other mathematical function.
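The selection step described above can be sketched in a few lines of Python. This is only a minimal sketch of the idea as summarized in this post (strictly higher than every shorter and longer neighbour), not the paper's exact formulation. The names `local_maxs`, `score` and `toy_scores` are hypothetical, and the numbers in `toy_scores` are invented purely to exercise the code.

```python
def local_maxs(sentence, score, min_len=2):
    """Return the word sequences whose score is a local maximum.

    `score` is any function mapping a tuple of words to a number
    (an assumption: the real measure comes from corpus statistics).
    """
    words = sentence.split()
    total = len(words)
    selected = []
    for n in range(min_len, total + 1):
        for start in range(total - n + 1):
            end = start + n
            value = score(tuple(words[start:end]))
            # Antecessors: the two (n-1)-word sequences contained in this one.
            shorter = []
            if n - 1 >= min_len:
                shorter = [words[start:end - 1], words[start + 1:end]]
            # Successors: one-word extensions to the left or right.
            longer = []
            if start > 0:
                longer.append(words[start - 1:end])
            if end < total:
                longer.append(words[start:end + 1])
            if all(value > score(tuple(nb)) for nb in shorter + longer):
                selected.append(" ".join(words[start:end]))
    return selected


# Invented scores, for illustration only.
toy_scores = {
    ("michael", "jordan"): 0.9,
    ("jordan", "nba"): 0.4,
    ("nba", "statistics"): 0.6,
    ("michael", "jordan", "nba"): 0.5,
    ("jordan", "nba", "statistics"): 0.3,
    ("michael", "jordan", "nba", "statistics"): 0.2,
}

print(local_maxs("michael jordan nba statistics", lambda gram: toy_scores[gram]))
```

With these made-up scores, “michael jordan” and “nba statistics” are kept because each scores higher than every longer sequence that contains it.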
What I was really interested in was “how can I know how appropriate an MWU is?”. This is where pure statistical knowledge is needed, and it is the subject of the second part of the paper.
In this second part they give a very helpful and understandable introduction to a series of statistical measures. The only problem is that all of these measures are designed to evaluate a pair of terms, not a group of ‘n’ terms.
The authors then propose a simple way to generalize the measures from bigrams to ngrams (MWUs). They treat an ngram as a bigram whose first term is the first “x” terms of the MWU and whose second term is the last “n - x” terms.
As an example, consider we have the MWU “michael jordan nba statistics”. We could consider this ngram as a bigram formed by the “terms” “michael jordan nba” and “statistics”.
Of course, depending on where you cut the MWU (what they call the dispersion point), you get different “bigrams” with very different statistical relevance. In our example we could have the bigrams [“michael” “jordan nba statistics”], [“michael jordan” “nba statistics”] or [“michael jordan nba” “statistics”].
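To make the dispersion points concrete, here is a small Python sketch that lists every way of cutting an ngram into two parts. The function name `split_points` is my own choice for illustration and does not come from the paper.

```python
def split_points(ngram):
    """Return every (left, right) cut of a whitespace-separated ngram."""
    words = ngram.split()
    return [
        (" ".join(words[:x]), " ".join(words[x:]))
        for x in range(1, len(words))
    ]


for left, right in split_points("michael jordan nba statistics"):
    print([left, right])
```

An ngram of ‘n’ words always yields n - 1 such “bigrams”, so a single word yields none.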
To solve this dispersion point problem and obtain a single statistical value, they take the arithmetic mean of the values of every possible “bigram”.
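The averaging step, as described here, could look like the following Python sketch. It assumes the Dice coefficient as a stand-in for the bigram measure (any of the paper's measures could be plugged in through the `measure` parameter), and the names `count`, `dice` and `ngram_score` are hypothetical.

```python
from statistics import mean


def count(phrase, tokens):
    """Count occurrences of a whitespace-separated phrase in a token list."""
    words = phrase.split()
    n = len(words)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)


def dice(left, right, tokens):
    """Dice coefficient: 2 * f(left right) / (f(left) + f(right))."""
    f_left, f_right = count(left, tokens), count(right, tokens)
    if f_left + f_right == 0:
        return 0.0
    return 2 * count(left + " " + right, tokens) / (f_left + f_right)


def ngram_score(ngram, tokens, measure=dice):
    """Average a bigram measure over every dispersion point of the ngram."""
    words = ngram.split()
    if len(words) < 2:
        raise ValueError("need at least two words")
    values = [
        measure(" ".join(words[:x]), " ".join(words[x:]), tokens)
        for x in range(1, len(words))
    ]
    return mean(values)


# A tiny made-up corpus, for illustration only.
corpus = "michael jordan nba statistics and michael jordan nba records".split()
print(ngram_score("michael jordan nba", corpus))
print(ngram_score("michael jordan nba statistics", corpus))
```

Passing the measure as a parameter keeps the averaging logic independent of whichever statistic is being generalized.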
As if all of this were not enough to justify reading the paper, every statistical measure is given as a mathematical formula (not simple, but very helpful), which makes the whole paper easy to reproduce and implement in your own system.
In fact, I have implemented all of them for a brief evaluation in my own environment. You can download my implementation and play with it. I don’t think I made any mistakes, but if you find something wrong, please let me know.
Each PHP file includes the functions it requires, and there’s always a function which receives the ngram to evaluate and a text string which acts as a small corpus in which to find the ngram’s occurrences. It should be easy to improve it in a more sophisticated way, but my intention was to give a simple implementation of the mathematical formulas.