Introduction to TF-IDF and Machine Learning: Part 2 – Machine Learning

Ranking Search Results

A search engine returns hundreds of documents for a query of a few words. Which documents are “better” matches to the query? One approach is to treat the query itself as a document and calculate the cosine similarity (see Part 1 of this series) between it and each of the returned documents. The documents are then sorted by that similarity score.

This generally has the effect of bringing the documents that contain the more distinctive words in the corpus as well as the documents with larger counts of the matching words to the top.

This isn’t perfect since it doesn’t know anything about the user’s intent. It could be that one of the less distinctive words that has a smaller term frequency is actually the one most important to the user. But it does a decent enough job that it is a useful tool (but not the only one) for sorting through a large number of results.

Supervised Machine Learning


Let’s imagine we have a million documents that we want to classify into some classification system. An example might be news articles, and we want to classify them as politics, weather, science, technology, crime, etc.

So, we take a random sample of two thousand articles and manually classify them. We then split that sample into two random samples of a thousand each: one we’ll call the training sample, and the other the test sample.

We’re going to use the training sample to train a classifier. This is the “supervised” part of supervised machine learning. A human supervisor has defined the classifications and annotated a subset of documents with those classifications. With unsupervised machine learning, we don’t tell the classifier what the categories are a priori. Rather, the classifier must come up with those on its own.

A Simple Classifier

For the purpose of illustrating the concepts around supervised machine learning, we’re going to use cosine similarity to illustrate a simple classifier. There are many classifiers out there, and new ones are created every day. But this simple one will do for our purposes.

As we saw in Part 1, every document can be represented by a vector of the TF-IDFs of its terms. For each classification, calculate the sum of all the vectors of the training set documents with that classification. Such a vector would point in the average direction of the training set documents with that classification (dividing by the number of documents so summed to get the average vector will just change the vector’s magnitude, but not the direction it is pointing).

The vectors of the documents in the training set will likely cluster around that summed vector, and we would expect that any other document that should receive the same classification would also be near the summed vector.

So we can classify unclassified documents by calculating the cosine similarity between them and the summed vectors. The most similar summed vector tells us the classification of the unclassified document.


To get an idea of how well our classifier works, we can run the annotated test sample through the classifer to determine it’s accuracy. We can grade it like one would a school test, saying out of 1,000 test samples, it got 921, right, giving it a 92% score (that was a B+ when I went to school).

Two other calculations are often made to characterize the kinds of errors seen for a particular classification: precision and recall. Say that for a particular classification we classified 100 documents in the test sample as having that classification. Let’s also say that there were 110 documents in the test sample that really have that classification. But let’s say the classifier only identified 90 of them. Our precision would be 90/100=90%. 90% of the documents we classified with a particular classification really had that classification. The other 10% were false positives

Precision = true positives/(true positives + false positives)

Equation 3: Precision

In this example recall would be the portion of the 110 documents that really have a particular classification were correctly identified. This would be 90/110≈82%. The other 18% were false negatives.

Recall = true positives/(true positives + false negatives)

Equation 4: Recall

Unsupervised Machine Learning


In unsupervised learning, we don’t tell the computer what the classifications are. It is up to the algorithm to discover the clusters itself. Of course, there’s always some degree of human intervention, so the difference between supervised and unsupervised learning is one of degree rather than a simple Boolean.


On simple approach is generate a set of random vectors in lieu of the summed vectors of supervised vectors. When we classify the documents by their proximity to the random vectors, then sum the vectors within each classification, that sum should be closer to the center of a cluster that appears within that classification than the original random vector. Use those summed vectors to run the classifier again, repeating the process until in converges.

Of course the user must make an accurate guess of the number of clusters. Too few, and a classification might include two clusters; too many, and some clusters might get split into two classifications.

Nearest Neighbor

If the number of documents is not too large, then the cosine similarity of every pair of documents can be calculated. We can start with the closest pair and assume they are in the same cluster. Then proceed to the second closed pair and assume they are in the same cluster. We can either guess a target number of clusters or look at the distribution of distances. Within a cluster there should be a larger frequency of short distances. The frequency will drop off with distance, then increase as other clusters come into range. Loop through the pairs of documents until that first drop of in frequency occurs.

If the number of documents is huge, then the computing the distance between every pair is not practical. One million documents translates into a half trillion unorder pairs. Instead, a reasonably small sample could be used. If the size of the sample is too small though, some smaller clusters might be missed.

For an extra measure, the summed vectors of the sample’s classifications could be used to seed the k-Means algorithm so that in the end, all documents are taken into account in defining the classifications.

Human Intelligence

Human intelligence is never out of the loop. With unsupervised learning, there is much less human effort because people are not manually classifying training and testing sets. But people still have to evaluate the results and adjust the parameters. Ultimately, there is no Real Intelligence in Artificial Intelligence outside the humans using the tools.

Continued in Part 3…

Part 3 will examine things other than words as terms and alternate equations to calculate TF-IDF.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: