Introduction to TF-IDF and Machine Learning: Part 2 – Machine Learning

Ranking Search Results

A search engine returns hundreds of documents for a query of a few words. Which documents are “better” matches to the query? One approach is to treat the query itself as a document and calculate the cosine similarity (see Part 1 of this series) between it and each of the returned documents. The documents are then sorted by that similarity score.

This generally has the effect of bringing to the top the documents that contain the more distinctive words in the corpus, as well as the documents with larger counts of the matching words.

This isn’t perfect since it doesn’t know anything about the user’s intent. It could be that one of the less distinctive words that has a smaller term frequency is actually the one most important to the user. But it does a decent enough job that it is a useful tool (but not the only one) for sorting through a large number of results.
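
To make this concrete, here is a rough Java sketch of such a ranker. It assumes each document (and the query) has already been reduced to a sparse map from term to TF-IDF weight; the class and method names are mine, purely for illustration.

import java.util.*;

public class SearchRanker {

    // Cosine similarity between two sparse TF-IDF vectors (term -> weight).
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Sort the returned documents by their similarity to the query, best match first.
    public static void rank(Map<String, Double> query, List<Map<String, Double>> results) {
        results.sort((d1, d2) -> Double.compare(cosine(query, d2), cosine(query, d1)));
    }
}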

Supervised Machine Learning

Definition

Let’s imagine we have a million documents that we want to classify into some classification system. An example might be news articles, and we want to classify them as politics, weather, science, technology, crime, etc.

So, we take a random sample of two thousand articles and manually classify them. We then split that sample into two random samples of a thousand each: one we’ll call the training sample, and the other the test sample.

We’re going to use the training sample to train a classifier. This is the “supervised” part of supervised machine learning. A human supervisor has defined the classifications and annotated a subset of documents with those classifications. With unsupervised machine learning, we don’t tell the classifier what the categories are a priori. Rather, the classifier must come up with those on its own.

A Simple Classifier

For the purpose of illustrating the concepts around supervised machine learning, we’re going to build a simple classifier out of cosine similarity. There are many classifiers out there, and new ones are created every day, but this simple one will do for our purposes.

As we saw in Part 1, every document can be represented by a vector of the TF-IDFs of its terms. For each classification, calculate the sum of all the vectors of the training set documents with that classification. Such a vector would point in the average direction of the training set documents with that classification (dividing by the number of documents so summed to get the average vector will just change the vector’s magnitude, but not the direction it is pointing).

The vectors of the documents in the training set will likely cluster around that summed vector, and we would expect that any other document that should receive the same classification would also be near the summed vector.

So we can classify unclassified documents by calculating the cosine similarity between them and the summed vectors. The most similar summed vector tells us the classification of the unclassified document.
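
A minimal sketch of that summed-vector classifier, reusing the cosine routine from the ranking sketch earlier and again treating each document as a sparse term-to-weight map (the names are mine):

import java.util.*;

public class CentroidClassifier {

    // One summed TF-IDF vector per classification.
    private final Map<String, Map<String, Double>> summed = new HashMap<>();

    // Add a training document's vector into its classification's summed vector.
    public void train(String classification, Map<String, Double> tfidf) {
        Map<String, Double> sum = summed.computeIfAbsent(classification, k -> new HashMap<>());
        tfidf.forEach((term, weight) -> sum.merge(term, weight, Double::sum));
    }

    // The most similar summed vector gives the classification.
    public String classify(Map<String, Double> tfidf) {
        String best = null;
        double bestSimilarity = -1.0;
        for (Map.Entry<String, Map<String, Double>> entry : summed.entrySet()) {
            double similarity = SearchRanker.cosine(tfidf, entry.getValue());
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                best = entry.getKey();
            }
        }
        return best;
    }
}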

Testing

To get an idea of how well our classifier works, we can run the annotated test sample through the classifier to determine its accuracy. We can grade it like one would a school test: out of 1,000 test samples, it got 921 right, giving it a 92% score (that was a B+ when I went to school).

Two other calculations are often made to characterize the kinds of errors seen for a particular classification: precision and recall. Say that for a particular classification we classified 100 documents in the test sample as having that classification. Let’s also say that there were 110 documents in the test sample that really have that classification, but the classifier only identified 90 of them. Our precision would be 90/100 = 90%: 90% of the documents we classified with that classification really had it. The other 10% were false positives.

Precision = true positives/(true positives + false positives)

Equation 3: Precision

In this example, recall would be the portion of the 110 documents that really have the classification which were correctly identified. That is 90/110 ≈ 82%. The other 18% were false negatives.

Recall = true positives/(true positives + false negatives)

Equation 4: Recall
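
In code, grading the test sample for a given classification reduces to three counts. A small sketch, using the numbers from the example above:

public class TestGrading {

    // tp = true positives, fp = false positives, fn = false negatives
    // for one classification, counted against the human annotations.
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }

    public static void main(String[] args) {
        System.out.println(precision(90, 10));  // 0.90
        System.out.println(recall(90, 20));     // 0.818...
    }
}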

Unsupervised Machine Learning

Definition

In unsupervised learning, we don’t tell the computer what the classifications are. It is up to the algorithm to discover the clusters itself. Of course, there’s always some degree of human intervention, so the difference between supervised and unsupervised learning is one of degree rather than a simple Boolean.

k-Means

One simple approach is to generate a set of random vectors in lieu of the summed vectors of supervised learning. When we classify the documents by their proximity to the random vectors, then sum the vectors within each classification, each sum should be closer to the center of a cluster that appears within that classification than the original random vector was. Use those summed vectors to run the classifier again, repeating the process until it converges.

Of course the user must make an accurate guess of the number of clusters. Too few, and a classification might include two clusters; too many, and some clusters might get split into two classifications.
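
Here is a rough sketch of that loop in Java, reusing the earlier cosine routine. Instead of literally generating random vectors, it seeds the iteration by assigning documents to random clusters, which amounts to starting from the sums of random subsets; all names are illustrative.

import java.util.*;

public class KMeansSketch {

    // docs: sparse TF-IDF vectors; k: the guessed number of clusters.
    // Returns the cluster index assigned to each document.
    public static int[] cluster(List<Map<String, Double>> docs, int k, long seed) {
        Random random = new Random(seed);
        int[] assignment = new int[docs.size()];
        for (int i = 0; i < assignment.length; i++) assignment[i] = random.nextInt(k);

        boolean changed = true;
        while (changed) {
            // Sum the vectors within each classification.
            List<Map<String, Double>> sums = new ArrayList<>();
            for (int c = 0; c < k; c++) sums.add(new HashMap<>());
            for (int i = 0; i < docs.size(); i++) {
                Map<String, Double> sum = sums.get(assignment[i]);
                docs.get(i).forEach((term, weight) -> sum.merge(term, weight, Double::sum));
            }

            // Reclassify each document against the summed vectors; repeat until nothing moves.
            changed = false;
            for (int i = 0; i < docs.size(); i++) {
                int best = assignment[i];
                double bestSimilarity = -1.0;
                for (int c = 0; c < k; c++) {
                    double similarity = SearchRanker.cosine(docs.get(i), sums.get(c));
                    if (similarity > bestSimilarity) {
                        bestSimilarity = similarity;
                        best = c;
                    }
                }
                if (best != assignment[i]) {
                    assignment[i] = best;
                    changed = true;
                }
            }
        }
        return assignment;
    }
}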

Nearest Neighbor

If the number of documents is not too large, then the cosine similarity of every pair of documents can be calculated. We can start with the closest pair and assume they are in the same cluster, then proceed to the second-closest pair and assume they are in the same cluster, and so on. We can either guess a target number of clusters or look at the distribution of distances. Within a cluster there should be a larger frequency of short distances; the frequency will drop off with distance, then increase as other clusters come into range. Loop through the pairs of documents until that first drop-off in frequency occurs.

If the number of documents is huge, then computing the distance between every pair is not practical: one million documents translates into half a trillion unordered pairs. Instead, a reasonably small sample could be used. If the size of the sample is too small, though, some smaller clusters might be missed.

For an extra measure, the summed vectors of the sample’s classifications could be used to seed the k-Means algorithm so that in the end, all documents are taken into account in defining the classifications.
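
A sketch of the pairwise approach, using a guessed target number of clusters and a simple union-find merge (the distance-distribution stopping rule described above is left out to keep the example short; names are mine):

import java.util.*;

public class NearestNeighborSketch {

    // Merge the closest pairs first until only targetClusters groups remain.
    // Returns, for each document, the index of its cluster's representative document.
    public static int[] cluster(List<Map<String, Double>> docs, int targetClusters) {
        int n = docs.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;

        // Every unordered pair with its cosine similarity, closest pairs first.
        List<double[]> pairs = new ArrayList<>();
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                pairs.add(new double[] { SearchRanker.cosine(docs.get(i), docs.get(j)), i, j });
        pairs.sort((p, q) -> Double.compare(q[0], p[0]));

        int clusters = n;
        for (double[] pair : pairs) {
            if (clusters <= targetClusters) break;
            int a = find(parent, (int) pair[1]);
            int b = find(parent, (int) pair[2]);
            if (a != b) {
                parent[a] = b;
                clusters--;
            }
        }
        for (int i = 0; i < n; i++) parent[i] = find(parent, i);
        return parent;
    }

    private static int find(int[] parent, int i) {
        while (parent[i] != i) i = parent[i];
        return i;
    }
}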

Human Intelligence

Human intelligence is never out of the loop. With unsupervised learning, there is much less human effort because people are not manually classifying training and testing sets. But people still have to evaluate the results and adjust the parameters. Ultimately, there is no Real Intelligence in Artificial Intelligence outside the humans using the tools.

Continued in Part 3…

Part 3 will examine things other than words as terms and alternate equations to calculate TF-IDF.

Retirement Planning: Bucket Budget Debunked

Spreadsheets Are Programming Too

Watching the stock market recover dramatically since the 2009 bottom, when all hopes of not dying at my desk in a windowless office somewhere were dashed, I’m suddenly of a mind to work out how best to budget retirement, though it’s still likely ten years or more away. A popular method is called bucket budgeting, where retirement savings are divided into three (or more) buckets with increasing amounts of risk (and returns) for the near, medium, and far term.

I decided to program a spreadsheet to simulate decades of retirement under different assumptions, using various stochastic models for future market behavior as well as replaying historic returns for various periods. This spreadsheet may be downloaded below for your own adaptation and experiments.

Disclaimers: Past history is not indicative of future performance. I am not a financial advisor. I do not warrant this spreadsheet to be free of defects in any way. This blog is not financial advice, but rather programming advice on how one might program a simulation. You should make your own assumptions and models and draw your own conclusions in concert with a professional fiduciary financial advisor. The assumptions and tradeoffs I make for myself are probably not the same ones you should make.

An Illustrated Tour of the Spreadsheet

Parameters

This block of values defines various parameters you can set to tune the simulation.

Inflation. This sets what inflation is expected to be. It has been running below 2% for a few years now, but something like 3% or even 4% is more normal.

Social Security (2019). This is the social security benefit per year in 2019 dollars, which is the value given by most social security calculators. By the way, none of the financial numbers here are my actual numbers. They are contrived for illustration purposes only.

After Tax Income. Working from my budget for today, I remove items like the mortgage that I won’t be paying in retirement, and add items I don’t have now but hope to have in retirement, like hobbies or travel, to come up with a realistic desired post-retirement income level.

After Tax Income @ 80. A different income level can be set for old age. It might be lower because one travels less, or more to pay for a nursing home.

Tax Rate. This is an estimate of federal, state, and local income taxes. It will vary depending on jurisdiction as well as on the size of the social security benefit. I expect an approximation in the right neighborhood is good enough. But politics might require radical revisions to this estimate in the future.

Stock Return. This estimates the amount stocks will grow per year. The long-term average since 1901 is a little over 5%, but around 7% since WW II, about 9% since Reaganomics, and even higher since the Great Recession. The “Rule of 25”, also called the “4% Rule”, is essentially based on 5% growth, since it looks at all 30-year periods since 1901.

The spreadsheet also allows setting returns for individual years (the Stock Variation column). In that case, this number is added to the Stock Variation value for a given year. This permits simulating an index fund that beats or falls short of the actual market numbers.

Bond Return. This estimates the return on bonds in the middle bucket of this simulation.

Cash Return. This estimates the return on the cash in the near-term bucket of this simulation. It should probably be something near zero these days.

Initial 401(k) Balance. This is the 401(k) balance on the day of retirement.

Social Security (July 2034 $). This is an inflation adjusted estimate of the social security benefit for the year one begins collecting social security. It is calculated from the above parameters. For me, this is 15 years in the future. You will need to change this for your own age and planned date for collecting social security. I’m assuming I’ll delay until age 70 while retiring at 65.

Multiplier. Unfortunately, social security will go bankrupt the same year I start collecting it. At that time, it will only be able to pay out what it also collects that year, which is 77 cents on the dollar of the defined benefit. I’m guessing a combination of tax increases and benefit cuts will eventually be enacted, so I split the difference and guess it will pay 88 cents on the dollar.

Social Security (2034). This is the actual estimated social security benefit for the year one starts collecting. All the social security numbers are full-year rates and should be amortized in the spreadsheet for the portion of the year one is actually 70.

Time

Timeline

Age. This is the age one turns each year. The first row sets the initial age, then subsequent rows add one to the previous row. Depending on personal preference, one might want it to be one’s age at the beginning of the year. It’s purely a convenience value and not used for any calculations directly.

Year. This is the calendar year. It is calculated the same way as age, setting an initial value explicitly and adding one to the previous row on subsequent rows.

Income Goals

Inflation adjusted goals for income and social security

Income Before Tax. The first row is calculated from the parameters, dividing after tax income by one minus the tax rate. Subsequent years are inflation adjusted. There is an adjustment at age 80 for the difference in income that can be configured for age 80 and later in the Parameters. It simply scales the regular calculation proportionally.
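For example, with made-up numbers like everything else here, a $60,000 after-tax goal at a 20% tax rate works out to $60,000 / (1 − 0.20) = $75,000 before tax in the first row.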

Social Security. Since I was born near the middle of the year, the first social security row is half the Social Security (2034) value in the Settings section. Subsequent years are full collection and inflation adjusted. This will need to be adjusted for individual situations.

There has been talk about changing the CPI index that social security is adjusted on, but I’m not modeling that possibility at this time.

Cash Draw Needed. This is the actual amount of cash drawn from the 401(k) each year of retirement. I’m assuming I’ll retire at the beginning of the year I turn 65. This could be adjusted for midyear retirement (by proportionally amortizing the calculated value) or for retiring a different year. The calculation is simply Income Before Tax minus Social Security.

Bucket Goals

Goals for the buckets’ balances

Deep dives in the stock market usually recover in about five years or less, so I somewhat arbitrarily chose five years’ worth of income as the horizon for the Cash and Bonds buckets. Others may, and perhaps should, choose other time horizons.

Cash. The cash bucket has the goal of having a balance of the next two years of 401(k) draw needed.

Bonds. The bonds bucket has the goal of having the following three years of 401(k) draw needed.

When the stock market is down, the plan is to defer transfers from stocks to bonds until it recovers, draining the cash and bonds buckets until then, and then catch them back up once the stock market recovers.

Buckets

Now we come to the meat of the spreadsheet. This is where future history is simulated. For each row, money flows from right to left. Money is transferred at the beginning of the year and balances represent the end of the year balance after transfers and then growth have occurred.

Cash bucket

Draw <-. This is the amount taken out of Cash and is normally the number from Income Goals, but it is limited by the Cash balance, which is calculated as the previous year’s Cash balance plus the amount transferred into cash this year.

Cash. This is the amount of cash at the end of the year. It is previous year’s balance plus the transfer into cash minus the draw, then that summation multiplied by one plus the Cash Return rate from Parameters.

<- Transfer <-. Cash is replenished by transferring money from Bonds to Cash. The transfer is calculated as the amount of money needed to bring the Cash balance back up to this year’s goal after the draw is taken. It is limited by the amount of money in Bonds after funds have been transferred from Stocks to Bonds.

Bonds. This is the amount of money in Bonds at the end of the year. It is the previous year’s balance plus the transfer from Stocks minus the transfer to Cash, then that sum multiplied by one plus the Bond Return rate from Parameters.

<- Transfer <-. Bonds are replenished by transferring money from Stocks to Bonds, unless they aren’t (more on this below). The transfer is calculated as the amount of money needed to bring the Bonds balance back up to this year’s goal after the transfer to Cash is done. It is limited by the amount of money in Stocks. It can also be adjusted by the Deferred Transfer column, described next.

Deferred Transfer. If the market is down for some year, it might be desirable not to pull money from Stocks until the market recovers, else losses are locked in. The amount to defer may be entered in this column; Bonds, and then Cash, will be drawn down instead. The following year will catch up and re-fund Bonds unless it has a deferral too. When copying values from other cells into this column, be sure to paste as values.

Stock Variation. This will be discussed in detail in the following Stochastic Data section below and in the description of the Stocks column immediately below.

Stocks. This is the stock balance at the end of the year. After the transfer is made to Bonds, the balance is multiplied by one plus the sum of Stock Return in Parameters and that year’s Stock Variation. So for a constant year-over-year growth, leave Stock Variation blank (zero). Variations relative to the Stock Return may be entered in the Stock Variation column. Or year-by-year returns may be specified by setting Stock Return to zero in Parameters and entering the actual return in the Stock Variation rows. This is discussed in more detail below under Stochastic Data.
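
Since this post is ultimately about programming a simulation, here is a rough Java sketch of one simulated year following the column descriptions above. The field and parameter names are mine, and the ordering of the transfers is a simplification of the actual cell formulas.

public class BucketYear {
    double cash, bonds, stocks;                  // balances carried in from the previous year
    double cashReturn, bondReturn, stockReturn;  // yearly growth rates from Parameters

    // drawNeeded comes from the Income Goals columns; cashGoal and bondGoal from Bucket Goals;
    // stockVariation and deferredTransfer mirror those spreadsheet columns.
    void simulateYear(double drawNeeded, double cashGoal, double bondGoal,
                      double stockVariation, double deferredTransfer) {
        // How much Cash needs to come in to cover this year's draw and still hit its goal.
        double cashNeeded = Math.max(cashGoal + drawNeeded - cash, 0);

        // Stocks -> Bonds: enough for Bonds to hit its goal after topping up Cash,
        // reduced by any deferral, and limited by what Stocks holds.
        double toBonds = Math.max(bondGoal + cashNeeded - bonds - deferredTransfer, 0);
        toBonds = Math.min(toBonds, stocks);
        stocks -= toBonds;
        bonds += toBonds;

        // Bonds -> Cash: limited by what Bonds holds after its own refill.
        double toCash = Math.min(cashNeeded, bonds);
        bonds -= toCash;
        cash += toCash;

        // Take the draw, limited by the Cash actually on hand.
        cash -= Math.min(drawNeeded, cash);

        // Growth is applied after the transfers and the draw, as in the spreadsheet.
        cash *= 1 + cashReturn;
        bonds *= 1 + bondReturn;
        stocks *= 1 + stockReturn + stockVariation;
    }
}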

Stochastic Data

In the real world, the market does not go up year after year by a constant amount. These columns provide some data to paste into the Stock Variation column to stochastically model realistic variation. The equations of most of them include random numbers that get reëvaluated every time a cell gets updated. To do experiments with this data, be sure to use Excel’s Paste-as-Values feature to lock in a snapshot of the random values.

Data for supplying random variation to simulations of the future

Random Variation. This provides normally distributed percentages. In Parameters give the average yearly Stock Return. Then Paste-as-Values numbers from this column. To change the standard deviation you’ll need to edit the equation in the cells as I didn’t parameterize it.

But in truth, this data hasn’t proven to be realistic. It comes up with long sequences of growth or decline much more often than happens in the real world. I suppose this could be useful for an extreme stress test of any set of parameters one wishes to try.

Historic Variation. This allows replaying historic market behavior in the simulated future. There’s no random number generator used, so it’s OK, even suggested, to copy the cells or equations into the Stock Variation column. Don’t copy the topmost number (1901 in the illustration); rather, edit that number to set the initial year of returns to use.

Set the Stock Return in Parameters to zero as that value will be added to every Stock Variation number. Or set it to some small percentage to simulate an index fund beating or missing the Dow’s capital return. I usually use 1% to simulate dividends adding to the capital return. This applies for using Random Year and Random Decade columns too.

Random Year. This picks a random year’s return in the range of 1945 to 2018 for each year. The character of the market and its volatility was very different prior to then. This tends to look much more like real markets, although it can have “too many” bad or good years in a row sometimes.

Random Decade. This picks random ten-year spans’ returns for each decade of retirement. This repeats historic patterns of boom and bust fairly well, like Historic Variation, but with some scrambling so you aren’t locked into a comparatively small sample of years. This is what I use most of the time.

Results

Assumptions

The Parameters are set to de-personalized values. They also violate the 4% rule. With these settings, the draw is a little over 6% once social security kicks in at 70 and much higher prior to that. This makes catastrophic failure of savings occur more often than if the 4% rule were followed. This was intentional, as the 4% rule results in catastrophe less than 5% of the time, and seeing worst-case scenarios would require too many experiments.

Fixed Rate Returns.

Return | Age  | Last Year | Last Draw
11%    | 125+ | 2089+     | N/A
9%     | 99   | 2063      | $87,528.45
7%     | 91   | 2055      | $20,690.10
5%     | 87   | 2051      | $24,938.24

The market has been approaching 11% returns lately. But the long term average since 1901 has been about 5.3%. So let’s look at how long money lasts for various fixed rate returns between 5 and 11 percent. This was done in the Bucket Model tab, setting different values for Stock Return in Parameters and leaving the Deferred Transfer and Stock Variation columns blank.

Random Decades.

Age  | Last Year | Last Draw
91   | 2055      | $25.73
83   | 2047      | $17,549.85
87   | 2051      | $15,941.36
95   | 2059      | $75,044.79
84   | 2048      | $65,189.22
86   | 2050      | $2,323.56
114  | 2078      | $8,554.19
90   | 2054      | $79,491.73
125+ | 2089+     | N/A
96   | 2060      | $80,889.02

Now let’s apply some random decades and assume beating the market by 1% for dividends. Using the Random Decades tab (a copy of the Bucket Model tab), I’ll set Stock Return to 1%, paste in the equations from the Random Decades column, and recalculate ten times to see the variation we get.

As you can see, the results can be dramatically variable. You flout the 4% rule at your peril.

Strategies for Extending Savings.

Now let’s take a typical bad year where we run out of money at 84 in 2048 after only 19 years of retirement. I’m still using the Random Decades tab with the values from the Random Decades column pasted in as values so that they don’t change every time the spreadsheet recalculates.

The other “Random Decades” tabs all reference the Stock Variation column in the Random Decades tab so we can experiment with different strategies in parallel without having to set the Stock Variation column in each tab.

Random Decades (Deferred). The first strategy is to defer transfers from Stocks when the market is down until it recovers. This drains Bonds and Cash. Once the market recovers, we refund Bonds and Cash.

So, the market was down in 2032, and in 2033 no transfer from Stocks was made. This was signaled by copying a value from the Transfer column to the Deferred Transfer column.

I’ve not figured out a way to calculate deferrals automatically in the spreadsheet without programming some macros. Instead I just do a manual process.

Deferrals continue happening through 2035. After the growth in 2035, the market recovered and transfers were allowed to catch the other buckets up to plan again.

A similar situation occurs immediately afterwards. The net effect is that money holds out till late in 2050, a couple years longer than the base situation.

Random Decades (No Bonds). What if we dispensed with bonds altogether and just kept the cash bucket? We can’t defer quite as much because Cash only has two years of buffer, so in a third down year we can only keep deferring the first two years of transfers, but little or no more.

For the sake of brevity (and laziness), I’ll not continue posting screenshots, and just summarize the results. The result for no bonds is running out of money in late 2054, a full four years later still.

Random Decades (No Cash). What if we also dispense with the cash bucket and just let it all sit in Stocks till we draw it down? No deferrals are done at all then. The result is running out of money in mid-2059, nearly five more years into the future!

Summary. It seems that almost always the losses one avoids from being able to draw from Cash or Bonds when the market is down are dwarfed by the gains one loses by holding cash and bonds for the duration of retirement.

I must reiterate, this is not financial advice. There are as many bucket strategies as there are advocates for bucket budgeting (and probably more). One of them might do better. A better simulation of bond returns, similar to what I do for stocks, might change the balance between avoiding losses and missing gains.

This blog entry is about how to program simulations with tunable assumptions. It illustrates techniques that can be expanded for a more complete simulation. The sample assumptions may be catastrophically poor. Your financial well being is your responsibility, not mine.

The following table summarizes when money runs out for the various strategies outlined above.

Strategy                            | Age | Last Year | Last Draw
All Three Buckets with No Deferrals | 84  | 2048      | $42,780.60
Deferral While Market Is Down       | 85  | 2049      | $7,696.57
Deferral with No Bonds Bucket       | 86  | 2050      | $71,291.21
No Bonds nor Cash Buckets           | 88  | 2052      | $41,793.47

Stress Tests

So as a further experiment, what if one retired at the beginning of 2000, right at the start of the dot-com bust? In the simulation, I’ll copy the data for 2000 through 2018 and play those years on a loop so we get back-to-back-to-back-to-back major stock tankings. This historic data is conveniently off to the far right of the spreadsheet. The results are summarized in the following table.

Strategy                            | Age | Last Year | Last Draw
All Three Buckets with No Deferrals | 82  | 2046      | $24,798.56
Deferral While Market Is Down       | 86  | 2050      | $58,826.29
Deferral with No Bonds Bucket       | 90  | 2047      | $60,231.10
No Bonds nor Cash Buckets           | 93  | 2057      | $16,914.71

Even with this extreme of a situation, it seems better to hold all stock by a wide margin.

Let’s go for even worse. Let’s loop 2000 through 2009. This would be a long term bear market for the entirety of retirement: a net loss every decade. Let’s further double the amount of initial savings to $2,000,000 so that the baseline retirement lasts an appreciable amount of time, till age 94.

Strategy                            | Age | Last Year | Last Draw
All Three Buckets with No Deferrals | 94  | 2059      | $78,138.33
Deferral While Market Is Down       | 97  | 2061      | $31,775.66
Deferral with No Bonds Bucket       | 91  | 2059      | $40,981.75
No Bonds nor Cash Buckets           | 90  | 2054      | $54,283.43

Only in this most extreme situation does it appear the bucket scheme puts you ahead. But the downside here is considerably smaller than the upside during “normal” retirements.

I honestly didn’t expect this extreme of a difference. Spreadsheets can be notoriously difficult to get right; I’d rather be modeling this with something like MATLAB, I think. But as I think about it, it seems right: with the market doubling every decade (or better of late), the five years of savings held in cash and bonds could translate into much more than that when held in stocks for 20 years or so.

What’s Next?

Given the possibility of extreme gains or losses during different periods of retirement, applying the 4% rule to pull out a constant amount (inflation adjusted) of money may not be the best approach. If the market turns into a real bear, it might be better to scale back the income a bit. If it turns into a raging bull, income might be increased significantly. I’m working on a model for dynamically adjusting income while stochastically modeling returns and still being assured (hopefully) of not running out of money too soon. Some day, I’ll turn it into a blog entry as well.

Introduction to TF-IDF and Machine Learning: Part 1 – Cosine Similarity

Goals

My goal here is not to provide a detailed how-to for creating a machine learning system. Most people who do anything with machine learning will be using systems already built into tools such as Splunk, Elastic Search or other such software. I’m only seeking to provide some basic understanding of how machine learning systems work.

An analogy would be that you don’t need to understand how a differential gear system works in detail, but a basic knowledge of what it is doing is useful to understand when you should lock the differential on an off-road vehicle.

Similarity as Determined by Human Intelligence

Consider that you have a set of documents, be they a collection of news articles, encyclopedia entries, essays, customer comments, etc. You as a human are tasked with identifying which documents are similar. What does similarity mean?

A human can read the documents and develop an understanding of them. He can form a mental ontology to classify them into categories. If they are news articles, this one is about politics. This one about the economy. This one argues against Marxism. This one against Capitalism.

Ontology: A set of concepts and categories in a subject area or domain that shows their properties and the relations between them.

OED

Although so-called Deep Learning attempts to mimic a hypothetical model of human learning, it is still not real human intelligence and is computationally demanding. Instead, we usually take shortcuts that will be more practical, if not as powerful.

Big Bag of Words

Specific concepts have unique vocabularies

Assumption 1

Our first simplification is to throw out all knowledge of grammar and even word meanings. We treat documents as a “big bag of words” without any grammatical relationship between them, and rely on presence or absence of individual words to measure the similarity of documents.

Figure 1: Vocabulary

Words have Weight

Some words have more importance in measuring the ontological similarity of documents. To estimate this importance, we make some assumptions about the relationship between words and concepts.

Term Frequency

Specific concepts use certain words with greater frequency.

Assumption 2

We incorporate Assumption 2 by weighting each word by the number of times it occurs in each document. This weight is referred to as the Term Frequency (TF).

Inverse Document Frequency

Specific concepts use certain words that are less likely
to be used by other concepts.

Assumption 3

We incorporate Assumption 3 by weighting each word by the reciprocal of the portion of documents it occurs in. A word that occurs in every document gets weight 1. One that occurs in only 1% of the documents gets weight 100. This factor is referred to as the Inverse Document Frequency (IDF). We are essentially assuming that documents that share a rare word are more likely to be similar.

The two factors are combined by multiplying them. This combined weighting factor is referred to by the acronym TF-IDF.
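
Written out, with N the total number of documents, n_t the number of documents containing term t, and tf_{t,d} the count of term t in document d, the combined weight as used here is:

$$ w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t = \mathrm{tf}_{t,d} \times \frac{N}{n_t} $$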

Stop Words

Some words intrinsically carry less semantic information
and so are less indicative of concepts
regardless of their TF-IDF weight.

Assumption 4

Words like the, a, prepositions, helping verbs, and the like carry little specific semantic information on their own related to ontological concepts. They are all very common, having a high term frequency but a low inverse document frequency. These words, referred to as stop words, are left out of the calculation, effectively giving them a weight of zero. Most turn-key systems will have a list of stop words built into them.

Stemming

For our purposes, it generally doesn’t matter if a verb is past or future tense. We’ve already omitted helping verbs as stop words above. We might as well discard inflections and conjugations as well, reducing words to their root forms.

There are several heuristic algorithms available in most natural language processing packages for chopping off the prefixes and suffixes added to words. This is referred to as stemming. The better algorithms also include dictionaries for stemming irregular forms as well. Thus go, gone, went, goes, and going will all be stemmed to go.

Vectors

We could continue using a Venn diagram like the one in Figure 1 above. A key difference from that previous diagram is that because TF-IDF is being utilized and different documents have different term frequencies, some of a word’s weight will be in the Common Vocabulary section, and some will be in a specific document’s unique vocabulary section.

Figure 2: Vocabulary with Weighted Words

However, there is an alternative approach used that can be concisely expressed as mathematical formulas with calculations done with readily available math packages. This approach is to express each document as an n-dimensional vector of weighted terms.
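
In symbols, with w_{t,d} the TF-IDF weight of term t in document d and n the number of distinct terms in the corpus vocabulary:

$$ \vec{d} = \left( w_{1,d},\ w_{2,d},\ \dots,\ w_{n,d} \right) $$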

Equation 1: Document Expressed as a Vector

Similar documents will be closer together in this n-dimensional Euclidean space, and dissimilar documents will be further apart. This is difficult for us to imagine, living in an only 3-dimensional world. However, we can easily illustrate the similarity of two documents in two dimensions. Two vectors are defined by three points: the point of origin and the end points of the two vectors. Two points define a 1-dimensional line, even when the points are located in n-dimensional space. The third point defines a 2-dimensional plane within that space. It is always possible to rotate our perspective around to look at the 2-dimensional plane within the n-dimensional space that contains our two vectors.

Figure 3: Cartesian Distance Between Two Documents

Size Doesn’t Matter

Imagine that you have a largish document about a specific concept. Now imagine that you edit a copy of that document to condense it, removing redundancies and writing shorter sentences, being in general more concise. The new document still discusses the same concept and makes all the same points about that concept. It still uses the same vocabulary for the most part. It probably even uses the words in the same proportions. But the length of the document and the term frequency of its words are reduced proportionally (or nearly so).

Figure 4: Cartesian Distance Between a Document and Its Digest

The Cartesian distance between these two documents in Figure 4 is nearly the same as in Figure 3, even though they are ontologically almost identical. The length of the treatments of a concept really doesn’t matter much to us for the purpose of measuring their similarity.

For this reason, we normalize the documents’ vectors to have a length of one unit.
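
That is, each document vector is divided by its own length:

$$ \hat{d} = \frac{\vec{d}}{\lVert \vec{d} \rVert} $$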

Figure 5: Normalized Vectors for a Document and Its Digest

Cosine Similarity

Because the TF-IDF values are never negative, the vectors for two documents will always be within 90 degrees of each other. Consider the two-dimensional case; the two vectors will be in the upper right quadrant. Add a third dimension. No matter how large the third coordinate, the vectors can only approach asymptotically the third dimension, which being perpendicular, limits them to becoming no more than 90 degrees apart. The only way they can exceed 90 degrees, is if one takes on a negative coordinate while the other takes on a positive one. For n dimensions, we can always rotate our perspective to two dimensions, so inductively we can see that for any number of dimensions, they’ll be within 90 degrees.

This gives the Cartesian distance a range of 0.0 (for maximum similarity) to the square root of 2.0 (for maximum dissimilarity).

A range of 0.0 (for maximum dissimilarity) to 1.0 (for maximum similarity) would be more intuitive. As it happens there is a simple way to get that: The cosine of the angle between the vectors. Better, there is a simple formula for calculating that cosine. This calculation is referred to as cosine similarity.
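
In the notation of Equation 1, it is the dot product of the two vectors divided by the product of their lengths (for unit-length vectors the denominator is 1, so the cosine is just the dot product):

$$ \cos\theta = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert} = \frac{\sum_{t} w_{t,a}\, w_{t,b}}{\sqrt{\sum_{t} w_{t,a}^{2}}\ \sqrt{\sum_{t} w_{t,b}^{2}}} $$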

Equation 2: Cosine Similarity

Continued in Part 2…

Part two will examine some uses of TF-IDF and Cosine Similarity…

Commenting Code

Bad Advice

I’ve encountered a lot of bad advice about commenting code over the decades.  I’ve known developers who insisted that commented code is actually worse than uncommented code because the comments will always be wrong or out-of-date.  But even dated comments give a clue about the original intentions of the code and how subsequent changes might have gone wrong.

I’ve seen style guides that recommend a one line comment above every paragraph of code. Mindlessly adding short comments every few lines ends up with obvious statements like “add one to index” that don’t actually provide any useful information.  Back in the 70s or 80s when hand coding assembly language was still a thing, such comments were bread and butter of the profession.  But modern high level languages are usually self-documenting in that regard.

Self-Documenting Code

Self-documenting code has all the sound and feeling of moral superiority.  But, like Communism’s “from each according to their ability, to each according to their need” which sounds all noble and just, in the real world it just doesn’t work.  This is often for the same reason: some people are just evil, and many more are just plain lazy.

But it also doesn’t work because programming languages are not intrinsically expressive enough to self-document more complex ideas.  More often than developers realize (because developers are too close to the problem and make assumptions about what a later developer will intuitively understand), the problem being solved is complex, with much jargon, hidden assumptions, and irreducibly complex algorithms being necessary to solve it.

A further issue is the fact that well-done comments serve more purposes than explaining what a bit of code does.  So leaving the political analogy snark above behind, I introduce another analogy to explain the many purposes of comments.

The Book Analogy

Let’s consider the text of a book as an analogy to the text of a program.  A well written book can be read from cover to cover without the aid of typographical conventions and be understood.  Consider the popularity of books-on-tape as proof of this idea.  The text of the book is effectively self-documenting.

But now let’s look at a couple of pages of a real book.  (Wallace, Daniel B. Greek Grammar Beyond the Basics.  Zondervan, 1996.)

We see many features that are not part of the text itself, but which aid the reader in navigating and understanding the text:

  • Titles
  • Headings
  • Subheadings…
  • Overviews
  • Bibliography
  • Footnotes
  • etc.

All these have useful counterparts in commenting code to make the code easier to navigate and understand.

Titles, Headings, and Subheadings…

A hierarchy of titles and headings makes navigating a book (and code) easier.  In a book they are usually set off by size, and often weight or font selection as well.  We don’t have such typographical conventions available in development environments.  Instead, a construct called “block comments” has traditionally been used.

/////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////
////                                                         ////
//// MyPojo                                                  ////
////                                                         ////
/////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////

public class MyPojo {

    /////////////////////////////////////////////////////////////
    //                                                         //
    // CONSTRUCTORS                                            //
    //                                                         //
    /////////////////////////////////////////////////////////////

    public MyPojo() {
        …
    }

    public MyPojo(MyPojo source) {
        …
    }

    /////////////////////////////////////////////////////////////
    //                                                         //
    // PROPERTIES                                              //
    //                                                         //
    /////////////////////////////////////////////////////////////

    /////////////////////////////////////////////////////////////
    // primaryKey                                              //
    /////////////////////////////////////////////////////////////

    private String primaryKey;

    public String getPrimaryKey() {
        return primaryKey;
    }

    private void setPrimaryKey(String primaryKey) {
        this.primaryKey = primaryKey;
    }

    ///////////////////////////////////////////////////////////
    // widgetName                                            //
    ///////////////////////////////////////////////////////////

    private String widgetName;

    …

    ///////////////////////////////////////////////////////////
    //                                                       //
    // BUILDER                                               //
    //                                                       //
    ///////////////////////////////////////////////////////////

    public class Builder {

        ///////////////////////////////////////////////////////
        // Constructors                                      //
        ///////////////////////////////////////////////////////

        MyPojo template;

        public Builder() {
            …
        }

        …

        //////////////////////////////////////////////////////
        // Initializers                                     //
        //////////////////////////////////////////////////////

        public Builder primaryKey(String primaryKey) {
            template.setPrimaryKey(primaryKey);
            return this;
        }

        public Builder widgetName(String widgetName) {
            …
        }
    }
}

Headings in code can be given the equivalent of different font sizes and weights by using heavier or lighter boxes around the heading text and by using all-capital, mixed-case, or all-lowercase text.

Creating headers this way has several advantages:

Style guides recommend grouping related constructs together, such as constructors, properties, inner classes, etc.  By exposing the structure imposed by these guidelines in such a hard-to-miss fashion, future developers are less likely to place constructs in the wrong place.  Experience shows that without such headings, groupings of constructs inevitably break down as the code is further modified.

The typographical conventions are what they are because several centuries worth of experience has discovered that these conventions cooperate with the way the brain works.  By being able to pick out the higher level structure of the text (or code) prior to actually reading it, the mind is prepared to more quickly learn what the text has to say (or what the code does).  Headings speed up the learning curve.

Readily visible visual markers make it easier for even the readers who are most intimately familiar with the text (or code) to quickly navigate from place to place.  The visual shape of the different headings becomes a map in the reader’s brain; they are the mileposts relative to which the reader remembers where specific passages are located.

Overviews

Many developers like to gather the declarations of backing-store variables ahead of all the getters and setters to serve as a makeshift Overview section.  This is problematic for three reasons.  First, breaking up the implementation of a property into two places makes the code harder to follow or modify.  Second, and more importantly, it frequently doesn’t serve as an adequate summary because there is not always a one-to-one correspondence between backing-store variables and properties.  Third, there is frequently more than just properties that can benefit from an Overview.

My preference generally is to add an overview to the title box for the class, giving a quick list of constructors, properties, inner classes, methods, etc.

/////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////
////                                                         ////
//// MyPojo                                                  ////
////                                                         ////
//// Constructors:                                           ////
////     MyPojo()                                            ////
////     MyPojo(MyPojo source)                               ////
//// Properties:                                             ////
////     primaryKey                                          ////
////     widgetName                                          ////
//// Inner Classes:                                          ////
////     Builder                                             ////
////                                                         ////
/////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////

An alternative would be to put the overview in a Javadoc comment on the class.  But generally IDEs already provide this information in name completion and the tags needed within the Javadoc for semantic annotation and formatting are hardly human readable.  The Overview is for the human reader of the source code to get a high level view of what the class contains, not for the IDE to read and generate separate documentation.

Bibliography

Any well written code most probably has design documents that may or may not be part of the source code itself.  That is, they may be held in a repository of Word documents such as One Drive or in a Wiki.  Where documentation is not held in the source file itself, the source file should contain citations of those documents.

Code that is more scientific in nature will often be based on published algorithms.  Citations to these documents should also be included in the source to aid future developers.

There are numerous styles of bibliographic citations.  Which one to use is not particularly important.  But it is important for intranet or internet citations to include more than just a URL, as URLs have an unfortunate tendency to become stale.  When citing a URL, include the date that the URL was valid.  This will aid in finding it in an internet archive site.  But also include standard bibliographic information such as author, title, publisher, and date of publication, as would be done for any paper resource, since these will aid in finding the resource again after the URL has become stale.

When citing a particular location within a document, don’t just cite the page number, as a future developer may be accessing an online version of the document.  Cite the section heading and possibly multiple levels of section headings, as this will help the future developer find the intended location in documents stripped of page number information.

Citations that cover the whole source file can be placed in the title box along with the overview.  Otherwise they may be attached as regular comments.  See “Footnotes” below.
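
As a sketch, a citation comment might look something like the following.  Every field below is a placeholder, not a real reference; adapt the layout to whatever citation style your team prefers.

// Design: <author>, "<document title>", <wiki or document repository>,
//     <URL> (valid as of <date accessed>), section "<heading>" > "<subheading>".
// Algorithm: <author(s)>, "<paper title>", <journal or publisher>, <year>,
//     section <n.n>.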

Footnotes

The way that comments are most often thought of, as a short remark explaining what a small piece of code does, most closely corresponds to footnotes in book typography.  There are, however, many functions that may be served by these footnotes.

Javadoc comments (or the equivalent in other program languages and development environments) serve as documentation for individual classes, methods, properties, and variables.  They often have semantic or formatting annotations within for generating printed documentation or documentation within integrated development environments, and so aren’t necessarily extremely human readable.

Just like the Bibliography section above, it is often useful to provide cross references to other documents, or even other code within the project, on smaller pieces of code.  This is analogous to what is probably the most common form of a footnote in books: citing sources.

Summarizing what a paragraph of code does is often useful.  But caution must be exercised to not go overboard.  Writing a comment that says “Add one to index” is not very useful, whereas “collect sums needed for calculation of average and standard deviation” probably is.

More often than not, the summarizing comment will actually provide a rationale for the code.  For example, where a comment saying “Add two to index” isn’t useful, a comment that says “Process array elements in even-odd pairs” on an increment by two probably is useful because it explains why we are incrementing by two.
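
For instance, something along these lines (samples and processPair() are hypothetical, just to show the rationale-style comment in place):

// Process array elements in even-odd pairs.
for (int index = 0; index + 1 < samples.length; index += 2) {
    processPair(samples[index], samples[index + 1]);
}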

Self-Documenting Code Redux

I’ve seen style guides that champion self-documenting code that would argue that the right way to document the “even-odd pairs” processing in the above example would be to refactor the code to have a processEvenOddPair() method.  This can certainly be a useful technique, but pushed to extremes tends to fragment the code into a non-linear mishmash of methods.  And extremes are what some style guides recommend.  I’ve seen one that insists no method should be more than 5 or 7 lines long!

Such code is nearly unreadable and very difficult to debug.  I know, because I’ve tried to read and debug such code before.  The human brain tends to process input in a linear fashion.  When one has to interrupt the brain to go looking for the next method, the mental stack gets overwhelmed.  Trying to trace through seven levels of function calls, each with long parameter lists typically, is just as hard as trying to follow seven levels of deeply nested loops spanning multiple pages of code.

A few judicious comments on well-paragraphed code result in much easier to read and understand code.  This is not to say that the concept of self-documenting code is useless and should be abandoned.  By all means, methods and variables should be given precise, accurate, self-documenting names.  And overly large blocks of code should be divided up into smaller methods, especially when an obvious, intuitive, object-oriented abstraction can be made, or when a complex piece of code has too many levels of deeply nested loops.

Paragraphing

While not strictly a comment in the explicit syntactical sense, grouping lines of code together into paragraphs of related code (with paragraphs separated by a blank line) is an important way to make code self-documenting.  When the lines of code that do a “thing” are scattered, the human brain has to make considerable effort to work out how the “thing” is done.  Paragraphing also makes future refactoring easier when an intuitive abstraction becomes obvious.
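
A small illustration of paragraphed code, with a one-line comment heading each paragraph (the variable names are hypothetical):

// Collect sums needed for calculation of average and standard deviation.
double sum = 0.0;
double sumOfSquares = 0.0;
for (double sample : samples) {
    sum += sample;
    sumOfSquares += sample * sample;
}

// Derive the statistics from the sums.
double average = sum / samples.length;
double variance = sumOfSquares / samples.length - average * average;
double standardDeviation = Math.sqrt(variance);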

Afterthoughts

Writing good comments and keeping them up-to-date is hard.  Often even harder than writing good code in the first place.  When up against a deadline, it becomes tempting to crank out the code without concern for comments.  This almost always results in inferior code because the developer is not forced to impose structure and thoughtfulness on the code, and it becomes technical debt that slows down future enhancements and bug fixes.

While crunches are unavoidable, they should always include scheduling time after the crunch is over to clean up the code and comments.  Developers should consider it a matter of pride in their work to create code that is both stable and maintainable long after they are gone.

An analogy (I do like analogies don’t I?) would be a building contractor who puts up temporary walls and pillars while inserting new beams in a structure.  The contractor will always come back and replace the temporary structures with finely finished structures.  The programmer in a crunch is doing the equivalent of temporary walls.  They shouldn’t still be the primary wall ten years later as all too often happens in the software world.

Writing Builders for Complex Data Structures

1 Introduction

New objects are usually created by using the “new” operator and possibly passing a number of arguments to the constructor.  But for classes with many properties or with a more complex structure, the number of parameters the constructor needs becomes unwieldy or intractable.  Allocating an uninitialized object and then initializing its properties with setters gets very repetitive and unwieldy as well.  The builder design pattern offers a concise alternative.

A builder is a class that has a method (or methods) for each property to be initialized that initializes that property.  Each of these methods returns the builder so calls can be chained together using the fluent syntax pattern.  At the end of the chain, a build() method is called to actually build the object.  Typically,  a builder’s build() method can be called multiple times to construct multiple instances of an object.

This tutorial will examine what I have found to be the best way to design and implement the builder’s API.

2 Where to Define the Builder

Ideally, a class’s builder should be defined as a public part of the class itself and simply called Builder (though later on, a specialized builder with a variation on that name will be explored).

public class Resource {
    public Long id;
    public String resourceType;

    public static class Builder {
        ………
    }
}

However there will be times when one doesn’t have control over the definition of the class being built, requiring the builder to be coded as a standalone class.  In this case, I recommend concatenating the class’s name with the word “Builder” to name the class’s builder.

public class ResourceBuilder {
   ………
}

3 Cloneable

In order to return a unique instance of an object for each call to build(), it is necessary to be able to make a deep copy of the object being built.  A copy constructor could be used, but I find implementing the Cloneable interface more convenient as it requires less maintenance work when the class changes.

public class Resource implements Cloneable {
    // declare properties
    ………

    @Override
    public Resource clone() {
        try {
            Resource clone = (Resource)super.clone();

            // do any deep copying needed here
            ………

            return clone;
        } catch (CloneNotSupportedException e) {
            // this should never happen
            throw new RuntimeException(e);
        }
    }
}

In cases where one doesn’t have control over the implementation of the class being built, it may be necessary to have a utility method to perform the deep copy in the builder itself.

public class ResourceBuilder {
    ………

    private Resource clone(Resource resource) {
        Resource clone = new Resource();

        // deep copy from resource to clone
        ………

        return clone;
    }
}

4 The Builder’s Constructors and build() Method

4.1 build()

The builder needs to have an object of the type being built to serve as the prototype for the object returned by the build() method.  The build() method will return a clone of the prototype object, allowing the prototype object to be further modified to build other similar object(s).

Though build() is the most common name for this method, something like toResource() where Resource is the type of the object being built is acceptable as well.  The important thing is for the method’s name to not resemble the name of a field being initialized.

4.2 Constructors

For building an object from scratch, a parameterless constructor is defined which allocates a new uninitialized prototype object.

For building an object similar to another object, a constructor is defined which takes an object as a parameter which the constructor will clone to use as the builder’s prototype object. The user might then modify just a few properties before calling the build() method to get an object that’s the same as the original object, except for the handful of properties that were modified.

For building an object similar to what would be built by another builder, a constructor is defined which takes another builder as a parameter. The constructor will clone that other builder’s prototype object to use as the new builder’s prototype object.

public static class Builder {
    Resource resource;

    public Builder() {
        resource = new Resource();
    }

    public Builder(Resource resource) {
        this.resource = resource.clone();
    }

    public Builder(Builder builder) {
        this.resource = builder.resource.clone();
    }

    public Resource build() {
        return resource.clone();
    }

    ………
}

4.3 Prototype Object

The “resource” field serves as the prototype or template for the object to be built.  It is common to name this variable “prototype” or “template” as well, but I prefer a name to describe what it is rather than how it’s used.  This is a matter of personal taste though.

In many cases, it can be private, but for complex objects it’s more convenient to leave it package scoped so that it is visible to sub-objects’ builders.  These cases will be explored in a later section.

4.4 Abstract Base Class

If a program has lots of builders, this boilerplate could be refactored into a parameterized abstract class for all builders to extend.  But given the simplicity of the code (all the methods are one-liners), this may be more obfuscating than it’s worth.  The derived classes would still need to define constructors that call super(…), and because Java generics cannot instantiate or clone a bare type parameter, they would also need to pass in the construction and copy operations.

import java.util.function.Supplier;
import java.util.function.UnaryOperator;

abstract class AbstractBuilder<T> {
    T prototype;
    private final UnaryOperator<T> copier;

    // Java cannot write "new T()" or call clone() on a bare type parameter,
    // so the subclass passes in a factory and a deep-copy function
    // (e.g. Resource::new and Resource::clone) when it calls super(…).
    protected AbstractBuilder(Supplier<T> factory, UnaryOperator<T> copier) {
        this.copier = copier;
        this.prototype = factory.get();
    }

    protected AbstractBuilder(T resource, UnaryOperator<T> copier) {
        this.copier = copier;
        this.prototype = copier.apply(resource);
    }

    protected AbstractBuilder(AbstractBuilder<T> builder) {
        this.copier = builder.copier;
        this.prototype = builder.copier.apply(builder.prototype);
    }

    public T build() {
        return copier.apply(prototype);
    }
}

5 Simple Builders

Consider the following simple class (a real class would define setters and getters, but for the purposes of illustration, I omit these details along with the Cloneable implementation):

public class Resource implements Cloneable {
    public Long id;
    public String resourceType;
    ………

    public static class Builder {
        ………
    }
}

A builder for this class might be used to create an object in this manner:

Resource resource = new Resource.Builder()
        .id(1234L)
        .resourceType("PLACE")
        .build();

Each property has its own method, with the same name as the property, for initializing that property.  Each method returns the Resource.Builder object so calls can be chained together using fluent syntax.  Also, there’s a build() method that returns a unique instance of the built Resource.

Some people like to name the initialization methods withId(), withResourceType(), etc.  But the addition of “with” to the name is syntactic sugar that just makes the syntax look cluttered.  I don’t recommend it.

public Builder id(Long id) {
    resource.id = id;
    return this;
}

public Builder resourceType(String resourceType) {
    resource.resourceType = resourceType;
    return this;
}

6 Objects with Sub-Objects

Objects can have other objects as properties (in order to group related properties, for example).

public class Resource {
    public Long id;
    public String resourceType;
    public PersonDetails personDetails;
    public PlaceDetails placeDetails;
    public ThingDetails thingDetails;
}

public class PlaceDetails {
    public String city;
    public String state;
}

………

A builder could be used to build each of the sub-objects.  Then the sub-object properties could be initialized in the object like any other property.

Resource resource = new Resource.Builder()
        .id(1234L)
        .resourceType("PLACE")
        .placeDetails(new PlaceDetails.Builder()
            .city("Phoenix")
            .state("Arizona")
            .build())
        .build();

But this is really ugly and messy, and it gets worse when sub-sub-objects and arrays of sub-objects get involved.  Better would be to access a PlaceDetails.Builder directly in the fluent syntax, like this:

Resource resource = new Resource.Builder()
        .id(1234L)
        .resourceType("PLACE")
        .startPlaceDetails()
            .city("Phoenix")
            .state("Arizona")
        .endPlaceDetails()
        .build();

6.1 The Start Method

Here’s how this can be coded.  In Resource.Builder, define startPlaceDetails() to return a builder for the PlaceDetails sub-object.

If the placeDetails property in the builder’s prototype is not already initialized, it is initialized here with an uninitialized PlaceDetails object; otherwise it is left alone.  This allows startPlaceDetails() to be called multiple times when the Resource.Builder is being used to build multiple similar objects.

public PlaceDetails.NestedBuilder startPlaceDetails() {
    if (resource.getPlaceDetails()==null) {
        resource.setPlaceDetails(new PlaceDetails());
    }
    return new PlaceDetails.NestedBuilder(this,
                                          resource.getPlaceDetails());
}

6.2 NestedBuilder

The sub-object’s builder here is called NestedBuilder to distinguish it from a regular standalone builder that terminates in a build() method.  In a later section, I’ll discuss having both a Builder and a NestedBuilder class for the same type.

The nested builder’s constructor is package scoped so that it can only be used from within the package defining resources and sub-resources.

The nested builder takes a pointer to the parent builder as an argument. When it terminates with endPlaceDetails(), that parent builder must be returned for the fluent syntax to continue with initializing the parent object.

The nested builder also takes a pointer to the sub-object it will use as its prototype object. For nested builders, it is the concern of the parent builder to manage that property and the concern of the sub-object builder to manage its initialization.

A parent object could conceivably have multiple properties of the same type.  Resource could have PlaceDetails properties specifying a birth location, marriage location, and burial location, for example.  The same nested builder can initialize all three (a sketch follows the NestedBuilder code below).

Recall above that the prototype object was declared with package scope.  This was so that these nested builders can access it.

public static class NestedBuilder {
    private Resource.Builder parent;
    private PlaceDetails placeDetails;

    NestedBuilder(Resource.Builder parent,
                  PlaceDetails placeDetails) {
        this.parent = parent;
        this.placeDetails = placeDetails;
    }

    public Resource.Builder endPlaceDetails() {
        return parent;
    }

    public NestedBuilder city(String city) {
        placeDetails.city = city;
        return this;
    }

    public NestedBuilder state(String state) {
        placeDetails.state = state;
        return this;
    }
}
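
To make the multiple-properties point concrete, here is a sketch assuming hypothetical birthPlace, marriagePlace, and burialPlace fields (names invented for illustration).  Each gets its own start method in Resource.Builder, and all of them return the same PlaceDetails.NestedBuilder:

public PlaceDetails.NestedBuilder startBirthPlace() {
    if (resource.birthPlace == null) {              // hypothetical field
        resource.birthPlace = new PlaceDetails();
    }
    return new PlaceDetails.NestedBuilder(this, resource.birthPlace);
}

// startMarriagePlace() and startBurialPlace() follow the same pattern

The terminating method on the nested builder is still endPlaceDetails(); if that reads oddly in the fluent syntax, a more generically named end method could be used instead.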

6.3 Clear Method

Ordinarily, I omit a placeDetails(…) method when there’s a startPlaceDetails() method.  But that leaves no way to clear the placeDetails property.  If this functionality is needed, a clearPlaceDetails() method may be defined.  This sort of functionality should be used sparingly, however (as discussed below).

public Builder clearPlaceDetails() {
    resource.placeDetails = null;
    return this;
}

6.4 Building Multiple Similar Objects

Several similar objects can be built with a shared builder using this setup.  Admittedly, this example is contrived enough that separate builders would be the better choice here, but for larger, more complex objects with only minor differences between them, this capability can be very convenient.  Methods like clearPlaceDetails() should be used only very sparingly so that the building of multiple objects doesn’t become fragile through interdependence.

// create builder to reuse
Resource.Builder builder = new Resource.Builder().resourceType("PLACE");

// Phoenix
Resource phoenix = builder
        .id(1L)
        .startPlaceDetails()
            .city("Phoenix")
            .state("Arizona”)
        .endPlaceDetails()
        .build();

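// Tucson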
Resource tucson = builder
        .id(2L)
        .startPlaceDetails()
            .city("Tucson")
        .endPlaceDetails()
        .build();

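// Abraham Lincoln: reuse the builder, changing the type and details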
Resource abrahamLincoln = builder
        .id(3L)
        .resourceType("PERSON")
        .clearPlaceDetails()
        .startPersonDetails()
            .name("Abraham Lincoln")
        .endPersonDetails()
        .build();

7 Objects with Arrays of Sub-Objects

What if Resource has, say, an inventory list?  This can be handled by Resource.Builder having a startInventoryList() method that returns an InventoryListBuilder, which in turn has a startInventoryItem() method that returns an InventoryItem.NestedBuilder.

InventoryListBuilder is coded as a nested class in Resource.Builder, so its constructor doesn’t need to be passed the Resource.Builder object as an explicit parameter.  Its startInventoryItem() method adds a new item to the list.

The InventoryItem.NestedBuilder’s constructor gets passed the InventoryItem to use as its prototype object.  It is the responsibility of the InventoryListBuilder to do all manipulation of the inventoryList.

public InventoryListBuilder startInventoryList() {
    return new InventoryListBuilder();
}

public class InventoryListBuilder {
    List<InventoryItem> inventoryList;

    InventoryListBuilder() {
        if (Builder.this.resource.getInventoryList()==null) {
            Builder.this.resource.setInventoryList(new ArrayList<>());
        }
        inventoryList = Builder.this.resource.getInventoryList();
    }

    public InventoryItem.NestedBuilder startInventoryItem() {
        InventoryItem inventoryItem = new InventoryItem();
        inventoryList.add(inventoryItem);
        return new InventoryItem.NestedBuilder(this, inventoryItem);
    }

    // return to the parent Resource.Builder (used by endInventoryList()
    // in the fluent syntax below)
    public Builder endInventoryList() {
        return Builder.this;
    }
}

An object containing an array property can then be initialized like this:

Resource resource = new Resource.Builder()
        .id(100L)
        .startInventoryList()
            .startInventoryItem()
                .item("Sword")
                .count(2)
            .endInventoryItem()
            .startInventoryItem()
                .item("Knife")
                .count(4)
            .endInventoryItem()
        .endInventoryList()
        .build();

When using builders to create several similar objects, it is convenient to add additional methods to the InventoryListBuilder to manipulate the list.  I find this particularly useful when constructing test values and expected results in tests.

// modify an existing element
public InventoryItem.NestedBuilder startInventoryItem(int index) {
    return new InventoryItem.NestedBuilder(this, inventoryList.get(index));
}

// insert before an existing element
public InventoryItem.NestedBuilder insertInventoryItem(int index) {
    InventoryItem inventoryItem = new InventoryItem();
    inventoryList.add(index, inventoryItem);
    return new InventoryItem.NestedBuilder(this, inventoryItem);
}

// remove an existing element
public InventoryListBuilder removeInventoryItem(int index) {
    inventoryList.remove(index);
    return this;
}

InventoryItem.NestedBuilder needs a constructor that takes both the parent builder (to be returned by endInventoryItem()) and a prototype object to use.  It doesn’t clone the prototype object like the public Builder constructors do because we explicitly want this builder to edit a particular prototype object held by the parent builder.

NestedBuilder(Resource.Builder.InventoryListBuilder parent,
              InventoryItem inventoryItem) {
    this.parent = parent;
    this.inventoryItem = inventoryItem;
}

8 Multiple Builders for a Class

8.1 AbstractBuilder

Sometimes, both a Builder for building standalone objects and one or more NestedBuilders for building objects that are sub-objects of another object might be needed.  It is desirable for these builders to share as much code as possible (the DRY principle: Don’t Repeat Yourself).  For encapsulating the common code between the builders, a private AbstractBuilder can be defined which Builder and NestedBuilder extend.  This AbstractBuilder defines the property initializer methods while the derived builders define the constructors and the build() and end…() methods.

Writing an AbstractBuilder is a little tricky because it must be parameterized so that the property initializer methods return the derived builder object, not the AbstractBuilder object.  So, AbstractBuilder gets parameterized with a type that’s an extension of AbstractBuilder.  Amazingly, this works.

The builders derived from AbstractBuilder will provide a subThis() method that returns the subclass’s “this” pointer.  It is used to provide the value returned by the initializer methods.

private static abstract class 
        AbstractBuilder<T extends AbstractBuilder<T>> {
    protected InventoryItem inventoryItem;

    // sub classes provide a this pointer of the appropriate type
    protected abstract T subThis();
    private T subThis = subThis();

    public T item(String item) {
        inventoryItem.setItem(item);
        return subThis;
    }

    public T count(int count) {
        inventoryItem.setCount(count);
        return subThis;
    }
}

8.2 Extensions of AbstractBuilder

Extensions of AbstractBuilder should parameterize AbstractBuilder with their own type so that the initialization methods can return the subtype’s object instead of the AbstractBuilder object. They provide a constructor, a subThis() method, and an appropriate build() or end…() method. The constructors and build() or end…() methods are coded just like they are for regular Builder and NestedBuilder classes as described above.

The subThis() method just returns the extension’s “this” pointer and is called by the AbstractBuilder’s initialization methods to return the appropriate builder object.

public static class Builder extends AbstractBuilder<Builder> {
    public Builder() {
        this.inventoryItem = new InventoryItem();
    }

    // more constructors can go here as needed
    ………

    public InventoryItem build() {
        return inventoryItem.clone();
    }

    @Override
    protected Builder subThis() {
        return this;
    }
}

public static class NestedBuilder extends AbstractBuilder<NestedBuilder> {
    private Resource.Builder.InventoryListBuilder parent;

    NestedBuilder(Resource.Builder.InventoryListBuilder parent,
                  InventoryItem inventoryItem) {
        this.parent = parent;
        this.inventoryItem = inventoryItem;
    }

    public Resource.Builder.InventoryListBuilder endInventoryItem() {
        return parent;
    }

    @Override
    protected NestedBuilder subThis() {
        return this;
    }
}
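
With both classes in place, the same initializer methods serve a standalone build (a minimal usage sketch), while the nested path continues to work exactly as shown in section 7:

InventoryItem sword = new InventoryItem.Builder()
        .item("Sword")
        .count(2)
        .build();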

8.3 Multiple NestedBuilders

Conceivably, a class may be used in several different places within multiple complex data structures.  For this case, parameterize the NestedBuilder class with the type of the parent builder.

public static class NestedBuilder<T> 
        extends AbstractBuilder<NestedBuilder<T>> {
    private T parent;

    NestedBuilder(T parent, InventoryItem inventoryItem) {
        this.parent = parent;
        this.inventoryItem = inventoryItem;
    }

    public T endInventoryItem() {
        return parent;
    }

    @Override
    protected NestedBuilder<T> subThis() {
        return this;
    }
}

Then, in the definition of the start…() method in the parent builder, parameterize the NestedBuilder return type with the parent builder’s type.

InventoryItem.NestedBuilder<InventoryListBuilder> startInventoryItem() {
    InventoryItem inventoryItem = new InventoryItem();
    inventoryList.add(inventoryItem);
    return new InventoryItem.NestedBuilder<InventoryListBuilder>
            (this, inventoryItem);
}

9 Cookbook

This essay presents a broad cookbook of builder features and how to implement them. While the implementation of some of these features is complex, and some of the features can easily be abused to create fragile code, I have found these ideas to be very powerful and useful in building complex data structures in a succinct and easy-to-read manner.

They are especially useful in writing tests because lots of similar objects with minor variations are often needed.  For example, in a test for a REST service, a number of similar resources might be built for POSTing.  The result returned from the REST service may then consist of what was posted, plus some additional bits and pieces like a generated primary key and HATEOAS links.  The expected results to compare against can be built from the POSTed resources with some additional initialization.
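
As a rough sketch of that workflow (restClient here is a hypothetical test helper, and Resource is assumed to define equals()):

// resource to POST
Resource toPost = new Resource.Builder()
        .resourceType("PLACE")
        .startPlaceDetails()
            .city("Phoenix")
            .state("Arizona")
        .endPlaceDetails()
        .build();

Resource returned = restClient.post(toPost);    // hypothetical helper

// expected result: what was posted, plus the server-generated id
Resource expected = new Resource.Builder(toPost)
        .id(returned.id)
        .build();

assertEquals(expected, returned);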

Pragmatic Unit Testing, Part 2

4 Integration Testing REST Services

Now, we’ll look at using JUnit to do integration testing of a full REST service.  This will work much like functional tests in that we’ll first test creating resources, then test the rest of the CRUD operations by operating on such created resources.  But first, some preliminaries.


Pragmatic Unit Testing, Part 1

1  Extreme Programming Extremism

Dark chocolate is good for you.  It’s good for your heart and may prevent cancer.  But no one should take consumption of dark chocolate to the extreme and eat it for breakfast, lunch, and dinner.  That wouldn’t just be silly, it would be decidedly unhealthy.  Yet a current fad in software development goes under the moniker of “extreme programming”, where otherwise good ideas are taken to extremes.  Code reviews are good, so have someone sitting next to you continuously code reviewing every keystroke you type.  Or, the subject of this essay: testing code in isolation is good and using mocks is useful, so obsessively test every class in isolation while mocking every other class it depends on.  It is the goal of this essay to argue for a more pragmatic, moderate approach to unit testing.