Naming Unit Tests

This entry is for anyone who hasn’t done much TDD in Java yet and plans on making the transition.

If you have ever written or looked at specs written with frameworks like RSpec or Jasmine, you will have noticed the “describe” and “it” keywords that lend the spec a certain level of legibility. These frameworks are considered “BDD” tools because when you write tests a certain way, you effectively specify behavior right in the test names.

While JUnit does not lend itself to this as readily, I believe it can be simulated well enough to bother with. In the spirit of specification by example, I will proceed with an example.

Let’s take a method named isAPalindrome.

If this were Ruby, it would probably end with a question mark but, in any case, the specs would read something like:

describe "isAPalindrome?" do
it "should return true given short palindrome string" do ...
it "should return false given short non-palindrome string" do ...
it "should ignore non-breaking whitespace" do ...
it "should return false given empty string" do ...
it "should throw error given nil" do ...
it "should return false given non-alpha string" do ...
...

This isn’t a complete list of specs but it gives you an idea of the method’s behavior and it reads fairly well.

Here’s what I often see when I look at a JUnit test for a method named isAPalindrome:

@Test
public void testIsAPalindrome() {
...
}

Yep. That’s it.

It’s not descriptive. It doesn’t tell you anything about what isAPalindrome does. In fact it’s not even necessarily clear that isAPalindrome is a method under test.

There’s also redundancy. We know it’s a test — it’s annotated with @Test. JUnit 4 has been around for a long time. There’s no need to include the test- prefix. Of course, then you’re left with a test method name that happens to exactly match the method being tested, and that’s not great.

The solution, then, is to follow a better naming convention in JUnit. There are many that are decent (some are here: https://dzone.com/articles/7-popular-unit-test-naming). There’s a convention that I like to use, and while it’s fairly verbose, it has always served me very well.

${methodName}Should${exhibitBehavior}Given${inputOrCondition}

It starts with the name of the method under test. That is actually key. This lets you look at your code outline in Eclipse or IntelliJ and see, at a glance, what methods are directly covered by tests. It also lets you group tests by method name.

Not only that, but the convention lets you easily see what behaviors are being covered for each method. This means that when it is time to add a new behavior, you know exactly how the test for it should read. Everyone on the team does. And when it is time for a peer review, your peers will know what test to look for.

If you encounter a bug, it will mean either an existing test is incorrect or a test is missing. Naming your tests according to this pattern will make that discovery trivial.

Here are my JUnit tests for the method we looked at earlier:

@Test
public void isAPalindromeShouldReturnTrueGivenShortPalindromeString() {...}
@Test
public void isAPalindromeShouldReturnFalseGivenShortNonpalindromeString() {...}
@Test
public void isAPalindromeShouldIgnoreNonbreakingWhitespace() {...}
@Test
public void isAPalindromeShouldReturnFalseGivenEmptyString() {...}
@Test
public void isAPalindromeShouldThrowExceptionGivenNull() {...}
@Test
public void isAPalindromeShouldReturnFalseGivenNonalphaString() {...}
...
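
To show what one of these looks like when fleshed out, here is a minimal sketch of a complete test class. The StringChecker class, the assertion choices, and the decision to expect a NullPointerException for null input are all assumptions made for illustration, not part of any real codebase:

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

// Hypothetical class under test: StringChecker, with an isAPalindrome(String) method.
public class StringCheckerTest {

    private final StringChecker checker = new StringChecker();

    @Test
    public void isAPalindromeShouldReturnTrueGivenShortPalindromeString() {
        assertTrue(checker.isAPalindrome("racecar"));
    }

    @Test
    public void isAPalindromeShouldReturnFalseGivenShortNonpalindromeString() {
        assertFalse(checker.isAPalindrome("racecars"));
    }

    @Test(expected = NullPointerException.class)
    public void isAPalindromeShouldThrowExceptionGivenNull() {
        checker.isAPalindrome(null);
    }
}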

And there you have it. It’s not as easy to read as RSpec but it’s not RSpec. Adding underscores can help readability in some cases and not others. I would rather not worry about trying to push a convention about when those should and shouldn’t be used; “Should” and “Given” are good enough delimiters for me at the moment. And if you want to use “When” instead of “Given” some or all of the time, that also works.

The important thing is to have the method name, then a portion of the spec for that method. That’s how you end up with readable, useful unit test names that make life easier in the long run. A test report can tell you precisely what’s being tested (and what’s not), which is really great. It then makes it easier to identify what integration tests to write, what end-to-end tests to write, and so forth. It makes duplication of effort less likely.

If you have another good naming convention that provides the same amount of value (or more), I definitely want to know about it.

An Overview of Estimates

Oh boy, here we go again. If there’s one topic in agile development that has been talked to death it is estimation. That is, if one has been in the “agile space” long enough and done enough research and collaboration. To everyone else, the various methods of estimation and what they yield are frequently news, to this day. Until that stops being the case, I am happy to help people new to agile development figure out what tools they have at their disposal.

As with most of my ideas, they are only current with respect to when I put them on paper. Tomorrow, with new information or perhaps some more sleep, they may change.

Back on topic, there are so many things we can estimate. The amount of time we think we will spend on a deliverable (in hours, or in “ideal days”, or in “moons”, or whatever time measurement you fancy). The relative complexity of that deliverable. The business value of that deliverable. The effort of testing that deliverable. And so on.

Ultimately, these estimates are tools, and the focus of this discussion will be on what these tools are for and how to use them properly. I will start by saying this: using them improperly is likely worse than not using them at all, because using them invariably comes at a cost.

So let’s look more closely at our estimation toolbox. Once we have an idea of what kind of data different estimates can provide, we’ll look at situations in which we should consider applying them to gather that data.

Hours to completion
A common estimate is how long a task or set of tasks (such as a User Story) will take in hours. Estimating whether a task will take one hour, or 10 hours, can help with capacity planning.

Pros

  • simple concept to understand
  • all kinds of work, from deliverables to technical discovery efforts to training, can be estimated in the same units (time)
  • easy to plot in a chart since hours scale linearly

Cons

  • very large margins of error
  • precision of estimate often at odds with error margin
  • cannot yield per-time data since units will cancel
  • does not take into account context-switching
  • depends on who does the work
  • difficult to track due to meetings and other context switches

Ideal days to completion
Some find it simpler to estimate using “ideal days,” an ideal day being a full workday without any interruptions. The point is to acknowledge that such days rarely exist and that interruptions and context switches inevitably occur. If something is estimated to take two ideal days, it will likely take longer than two days, and may even take several depending on how much availability someone actually has. This also helps with thinking about capacity when planning a sprint. Note that performing a conversion to hours using something like 6 or 8 hours per day and using the resulting hours for planning largely misses the point. Consider: going from two ideal days to 16 hours introduces another significant digit of precision to the estimate. It also creates a burden of accounting for interruptions and context switching rather than just acknowledging that they exist.

Pros

  • less time spent estimating than with hours
  • precision is kept reasonably low
  • implicitly acknowledges that days are in fact not “ideal” and have interruptions
  • simplifies capacity planning

Cons

  • cannot yield per-time data since units will cancel
  • depends on who does the work
  • may not improve predictability over hour estimation
  • management may unnecessarily question discrepancy between ideal and actual days

Relative complexity
The most frequently suggested form of estimation is through relative complexity. Sometimes this is referred to as “story points.” Deliverables, or user stories, are assigned a measure of complexity relative to each other, using a “simple” story as a baseline. That story gets one point, and the rest are compared to it and assigned two, three, or more points. Points of complexity follow the Fibonacci sequence in order to reflect the growing margin of error as complexity increases. So point values are 1, 2, 3, 5, 8, 13, and so forth. A common practice is to take any story that is, say, eight or more points and try to make sure that effort is taken to break it down into smaller deliverables. The more small, lower complexity deliverables there are, the easier it is to track work and make predictions. Complexity is also not tied to any person’s experience, skill level, or availability.
As a general practice, if points of complexity are used to measure velocity (points delivered per unit of time), then points are often only assigned to deliverables that provide business value. The idea is that other tasks, while important to call out and track, exist in order to improve cycle time and quality of business deliverables. Tasks without points, such as technical discovery efforts, can be time-boxed if appropriate, and in any case should be factored into capacity planning.
Metrics such as velocity can be used to determine how well a team manages who does what work. Complexity is irrespective of a person’s ability or experience, so as velocity improves it may mean that a team has reduced expertise-related bottlenecks and is sharing knowledge. Velocity can also measure how much faster or slower a team delivers software in general as they evolve their process.

Pros

  • less time spent estimating than when using time-based values
  • can yield per-time data since it is not itself a measure of time
  • built-in acknowledgment of error margins by using Fibonacci sequence
  • does not depend on who does the work
  • can be used to predict how long a project will take based on velocity
  • immune as a measurement to distractions and context-switches
  • will not be factored into utilization planning or metrics
  • less likely to have a large discrepancy in estimated vs actual

Cons

  • require some experience to use appropriately in planning
  • impossible to convert to hours but people often try
  • team-specific; one point means something different to each team
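
To make the velocity arithmetic concrete, here is a minimal sketch of deriving a single iteration’s velocity from completed, pointed deliverables, with unpointed tasks excluded as described above; the Story class, the story names, and the point values are all made up for illustration:

import java.util.List;

// A minimal sketch: only pointed, completed stories count toward velocity.
public class VelocityCalculator {

    record Story(String title, int points, boolean done) {}

    static int velocityFor(List<Story> iteration) {
        return iteration.stream()
                .filter(Story::done)
                .mapToInt(Story::points)
                .sum();
    }

    public static void main(String[] args) {
        List<Story> sprint = List.of(
                new Story("User login", 3, true),
                new Story("Password reset", 5, true),
                new Story("Set up CI server", 0, true),    // unpointed task; adds nothing to velocity
                new Story("Reporting dashboard", 8, false)  // not done; carries over
        );
        System.out.println("Velocity this iteration: " + velocityFor(sprint)); // prints 8
    }
}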

Relative size or effort
Another way to estimate relative values is through something like t-shirt sizes rather than story points. People assign values like “small,” “medium,” “large,” or “extra large” to deliverables. This is a fast way of estimating but has some disadvantages compared to story points. One is that to get a metric such as velocity, the relative sizes need to be mapped to numbers in the first place. This means basically switching over to a point-based system to take advantage of any predictability of the data.
Another concern with relative sizes is that some people are thinking in terms of complexity while others are thinking in terms of how long it will take them to do, which can lead to misleading data.

Pros

  • little time spent estimating
  • if representing complexity
    • can yield per-time data once converted to points
    • does not depend on who is doing the work
  • will not be factored into utilization planning or metrics

Cons

  • if representing time to completion (avoid this)
    • cannot yield per-time data since units cancel
    • depends on who does the work
  • team-specific; sizes mean something different for each team
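
If a team that sizes work with t-shirts later wants velocity-style data, the sizes have to be mapped to numbers first, as mentioned above. Here is a minimal sketch of such a mapping (the specific point value per size is an arbitrary assumption, not a standard):

import java.util.Map;

// Hypothetical size-to-points mapping so that sized work can feed the same
// velocity math as story points. The values are illustrative only.
public class ShirtSizes {

    static final Map<String, Integer> POINTS = Map.of(
            "S", 1,
            "M", 3,
            "L", 5,
            "XL", 8
    );

    public static void main(String[] args) {
        System.out.println("An 'L' deliverable counts as " + POINTS.get("L") + " points");
    }
}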

Value Points
This is a very different measure that looks at the business value of work rather than its complexity or the time it will take to complete. It is something orthogonal to time and complexity — it has nothing to do with them and can be mapped on a perpendicular axis. It is in fact this mapping that can help with product planning.
Time and complexity are ultimately measures that drive cost. Business value is the opposite — it provides a return. And so, the higher business value a certain deliverable has, the higher it can be prioritized by the product owner. More precisely, the higher its value relative to the cost of producing it, the higher the deliverable can be prioritized. High-value, low-cost deliverables may be selected over high-value, high-cost ones. Low-value, low-cost deliverables are typically left for last, and low-value, high-cost deliverables may not be completed at all.
What are value points, ultimately? Are they abstractions for dollar amounts, similar to “ideal days” being abstractions for time? They can be, since using actual dollar amounts is difficult and similar to using hours to estimate time — the precision is too high given the amount of unknowns and there is no baseline to compare to. But deciding that a “value point” is, say, $10000, is also too much of a guessing game because returns on investment are quite difficult to predict in terms of absolute numbers. So comparing relative value through points may make more sense. Using the Fibonacci sequence, one can say that a three-value-point story feels like it is three times more valuable than a one-value-point story, and that is usually good enough to inform how work should be prioritized.
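
As a rough illustration of using value points alongside complexity points, here is a small sketch that orders a backlog by value relative to cost; the deliverable names and point values are made up, and a real product owner would weigh plenty of other factors:

import java.util.Comparator;
import java.util.List;

// A minimal sketch: prioritize deliverables by value points relative to
// complexity points. All names and numbers are illustrative.
public class ValueOverCost {

    record Deliverable(String name, int valuePoints, int complexityPoints) {
        double ratio() {
            return (double) valuePoints / complexityPoints;
        }
    }

    public static void main(String[] args) {
        List<Deliverable> backlog = List.of(
                new Deliverable("One-click checkout", 13, 8),
                new Deliverable("Saved carts", 5, 2),
                new Deliverable("Admin audit log", 2, 8)
        );

        backlog.stream()
                .sorted(Comparator.comparingDouble(Deliverable::ratio).reversed())
                .forEach(d -> System.out.printf("%s (value/cost = %.2f)%n", d.name(), d.ratio()));
        // Saved carts (2.50), One-click checkout (1.63), Admin audit log (0.25)
    }
}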

To understand what various estimates can be used for, we have to take a step back and look at our end goals as a software development shop (or team).

  1. be successful in an engagement with a client (i.e. a software project)
  2. generally improve as a software development team

When working on a software project, you will invariably have constraints. A nearly unavoidable one is money — the client only has a certain amount that will be spent on your team, whether it is per a given time period or overall. Another is product quality — whether it is stated outright or not, it is unlikely that a client will accept work that falls below some standard of quality and leave it at that. Next, we have constraints such as time — there may be a hard deadline for the project — and scope, where there is a minimal viable feature set that must be implemented. Together these constraints may be familiar to some as the “iron triangle” (https://en.wikipedia.org/wiki/Project_management_triangle), which maps cost, schedule, and scope as constraints tied to one another. Quality is included as a fourth constraint, often depicted in the middle of the triangle. It is also the one that is least likely to be actually negotiable despite being frequently compromised through poor processes and planning.

The estimate that is the most difficult to avoid is an early one: “how much will this project cost?” It actually pertains to the first constraint mentioned — money. If your engagement happens to be a fixed-bid contract where you state that you will do a certain amount of work for a particular price, you have to have arrived at that number somehow. In this case, your best friend is historical data. If you’ve done similar work before, use the actual time you spent on that work to inform the estimate for this new project. If this work is completely new, then you have to do some investigation and come up with a ballpark time estimate that you then convert to a dollar amount based on cost per unit of time. The only suggestions I have for this estimate are to use low precision, put a margin of error around the final number, and use the high end for your negotiations. This estimate’s main purpose is to close a deal and set some basic expectations.

When you are constrained by time because there is a fixed deadline for your project, then it is useful to know whether you are on track to meet that deadline given current scope.

When you are constrained by scope because there is a minimum viable product, then it is useful to know whether you are on track to deliver that scope given the current deadline.

When you are constrained by scope and by time, then it is useful to know whether you are on track to deliver that scope within that time given the current resources.

If you want to know that information — that given your current scope, timeline, and resources, you are on track for success — then you probably have to come up with some estimates.

The estimation methods described above are your options here: hours, ideal days, points, or t-shirt sizes. Depending on your team makeup, how many priorities you have to juggle (especially different projects), whether you care about measuring utilization (you probably shouldn’t), how much time you are spending on estimation, how well-groomed your product backlog is, and so forth, choose the option that works best for you.

Another way to look at it is this: which of these options you choose to pursue should depend on what actionable knowledge you expect these estimates to provide you, and what actions you would be capable of taking in turn. For example, if you want a measurement of work completed per unit of time, then you need a velocity and points are your best bet.

The other side of the equation is, what kind of information do you need in order to be successful in your engagement with a client? For example, do you need to predict how long it will take to complete the current product backlog?

Something to consider is, if you can’t think of how you could use the knowledge that gathering estimates yields, then it’s not valuable knowledge and there is no point in gathering it. Having it for its own sake is a great example of generating waste.

If you choose to estimate the cost of work, then I would recommend points of complexity because the pros outweigh the cons by so much compared to the other available options. But if you use points, you have to make sure there is discipline around when they are assigned and when they are not. If capacity planning is important to you, then cards with points must live alongside cards without points that nevertheless take away from capacity (administrative tasks, training, technical discoveries, defect fixes, and so forth). Other things such as meetings will take away from capacity as well, as will team size fluctuations, naturally.

If you choose to estimate the value of work provided, which is actually not a common practice as far as I know, then value points are a good option compared to dollar amounts. But few businesses use them because the initial several sprints are typically devoted to creating the “minimum viable product,” which implies that all deliverables within it are equally and maximally valuable — no matter what, they must get done. But once a product gets into straight “maximize ROI” mode, they can help prioritization and yield a velocity that may even be more meaningful (business value per unit of time) than one that points of complexity can provide.

Effective planning is something I might look at in an upcoming writing.

I want to conclude by mentioning that there ARE alternatives to estimating altogether, depending on circumstances. If you have a fixed bid project that has a frozen scope and a hard deadline, then further estimation is a waste of what precious time you have to actually complete the work at high quality.

There is also the case, albeit an uncommon one, where a team is capable of breaking the backlog out into deliverables of roughly equal size, all “small enough.” In that case the team can just track how many deliverables are completed per iteration and how many are left in the backlog.

Good User Stories and the Definition of Done

On occasion, I see dev shops complaining of work taking too long to finish, of deliverables being carried over from Sprint to Sprint, and the root cause turns out to be that the developers don’t know what it means for their deliverables to be “done” until they are done. They start work, having a “fairly good idea” of what the customer is asking for, hoping that they’ll finish the work in time to demo it at the end of the iteration. And they find out the hard way that “fairly good” often just doesn’t cut it.

So how to get from “I think I have an idea of what they want” to “we are on the same page about exactly what needs to be done”? It’s a question with a potentially very long answer, but ultimately comes down to creating quality requirements.

The quality of software requirements has a direct influence on the quality of deliverables. You need a sufficiently clear understanding of the business domain, of the users involved and their needs, of the scope of the work to be done, and any edge cases in the logic that needs implementing.

Eliciting all of that requires a certain baseline of understanding and commitment on the part of the product owner (or customer) and the team gathering and implementing these requirements. Further, when it comes to building the actual requirements, there are many kinds of documents to consider, from a specification of required product features, to user personas and how they work day-to-day, to a list of business challenges that currently have no solution. Some of these can be captured in free-form documents, others in bulleted lists, and others in specialized formats. Product features, broken down into small deliverables, can be recorded as User Stories. Today I will take a very narrow focus and look at User Stories in particular.

A key idea behind User Stories is that they are “atomic,” the idea being that a single User Story cannot be broken down further into requirements that themselves are good User Stories. (If it can, then that should be done, and the original user story may become an Epic or some other metadata that can logically group the resulting, smaller User Stories.)

A good User Story should be valuable (for the business), independent (provide that value on its own), negotiable (its details can be refined through conversation rather than fixed up front), testable, estimable (understood well enough that its complexity can be determined within an acceptable range), and small (ideally, so small that it cannot be broken down further into independent, valuable, testable user stories).

The I.N.V.E.S.T. mnemonic is sometimes used to consider how “good” a User Story is.
https://en.wikipedia.org/wiki/INVEST_(mnemonic)

The typical format for the opening statement of a User Story can capture the “valuable” criterion.

As a ~
I want ~
So that ~

For example: As the site owner, I want users to authenticate prior to browsing, so that I can create a small barrier to entry, personalize user experience, and track user behavior.

Each part of the 3-phrase story contains useful information. The “As a ~” will describe the role from whose perspective the story is written. This can imply all sorts of things. “As an administrator” may imply that only those with administrator-level privileges are able to experience the described behavior and others should not. The “I want ~” briefly describes the desired behavior, and “So that ~” describes the value that this behavior provides. If a product owner cannot articulate the “So that ~” phrase — that is, if he or she cannot think of a good reason to desire the behavior — then it is likely that the work does not need to be done or in any case requires further conversation.

The rest is down to the details. Is the story independent or does it not actually contribute value on its own? Is it small or can it be further broken up into two or more user stories? Is it testable?

Most of the time, the “As a ~, I want ~, so that ~” statement is not enough to answer those questions. And, importantly, it is not enough to tell you the scope of the work. It is not clear how much work you have to do in order for the story to be considered done.

No User Story is complete without a clear definition of “done.”

One way to think about what “done” means is that if the product owner has accepted the work as complete, then there is no more work to be done. In other words, the User Story has a set of “Acceptance Criteria” that must be met. These should be written out so that they are clear to everyone, including the product owner, those implementing the Story, and those testing the Story.

Acceptance Criteria should certainly include everything relevant from the user’s perspective. They may also call out other requirements that concern the development team more than the product owner.

An example with our authentication story:

As the site owner
I want users to authenticate prior to browsing
So that I can create a small barrier to entry, personalize user experience, and track user behavior

Acceptance Criteria

  • The user should be shown a login form upon visiting the site
  • The login form should appear over site content and disappear upon authentication (that is, the login form does not live on a separate page)
  • The user should not be directed to a different URL upon successful authentication
  • The user should be shown an error message if incorrect credentials are entered

If there are any unknowns in this story that prevent it from being estimable, those should be brought up as soon as possible so that the User Story is ready to be implemented once it is high-enough priority. One crucial detail is missing in the story above — the credentials themselves. What are the accepted user credentials? Is it a username and password? Is it an email and password?

In this case the missing criterion for this story is: User credentials are an email and password. This story now has enough clarity that the work described can be estimated in terms of complexity.

There are other behaviors that relate to user authentication that are not touched on in this story. They may be worth asking about as part of requirement elicitation. An example is “can the user link his or her Facebook or Google account instead of using a site-specific one?” or “what happens when a user tries to log in with the wrong password several times in a row? should the site lock that user out to prevent brute-force account hacks?” or “what if the user has forgotten his or her password? should there be a ‘forgot password’ flow?” Mind you, the answer to these questions, if yes, does not mean additional acceptance criteria for the story described above. These behaviors are independent enough that the story above, done as-is, will provide business value on its own. Then features such as password reset, account locking, and so forth, would be described in separate User Stories that follow up on this one.

Perhaps this is the first time that the notion of user account management has even come up, which can spawn a whole separate discussion around security and how credentials are collected and stored. And shouldn’t there be user registration since there is user authentication?

Note that some things are not mentioned in the Acceptance Criteria that nevertheless become a decision point for the implementer and tester. For example, does the user submit credentials by clicking a “Log in” button? Or does the user press the Enter key on the keyboard? Or is it both? These details are specific to the solution rather than the problem and therefore the product owner may not care which direction is taken. Nevertheless, it is usually worth a quick conversation as work progresses so that there is less need for rework down the road.

I attended a talk by Ken Schwaber (a founder of Scrum) at a conference, where he succinctly explained that a story is done when “there is no more work to be done.” In other words, nothing is hidden, such as database migration scripts, or creation of user accounts, or whatever else may not be called out in a user story and therefore left until later but nevertheless must happen before that work can be live in production.

So Acceptance Criteria may not fully capture the definition of “done” for a User Story. In fact, many “nonfunctional requirements” are typically called out somewhere else. For example, stability and scalability. Does the fact that the site must support 10000 concurrent users logging in mean that should be called out in the Acceptance Criteria for user authentication? It almost certainly won’t be. Things such as performance, usability, accessibility, running on multiple platforms (Chrome, Edge, Safari, etc.), architectural conformance — these are probably considerations for every single bit of work done, and therefore are universally implied, provided they are called out somewhere and with sufficient precision (e.g. “must support 100 transactions per second for a sustained (1 hour +) period of time without experiencing statistically significant performance degradation”).

As you collaborate with the product owner and elicit requirements that are broken down and organized into User Stories with accompanying Acceptance Criteria, always keep in mind what “done” means and seek to clarify that definition before you commit to implementation.

Velocity, Estimates, and Cost

I recently read a blog post by Gojko Adzic (see here) about velocity and using its measurement. It’s a good post — it warns of relying on velocity as an indicator of success. In short, he argues that while low velocity can point to problems, a high-enough velocity doesn’t imply long-term success.

I agree. I also think velocity can be used to revisit estimated completion dates and costs. Much as developers don’t like those notions (I know I don’t).

One of the inescapable issues of software development, at least for me (if you have found a way to escape it, please share it), is the need for upper management to always have an “end date” for the work in sight. For better or worse, estimated completion dates are used to inform high-level budgets, resource allocation, and so forth.

Naturally, regardless of what estimation methods developers use, “the business” wants hours. The idea is to convert those hours to dollars.

But estimating in hours sucks because it’s not reliable (let’s be honest) and it doesn’t really yield a velocity. So some people end up with an arbitrary conversion of hours to “story points” (maybe 1 point is 1 day which is 6 hours or something like that). Which makes the word “point” mean “X hours” and nothing more.

Well, that’s a horrible thing to do. For one, what becomes of velocity? If all you have to measure rate of delivery with is hours, then isn’t that just how many hours of work a team does? Why even measure that number? It will stay constant from iteration to iteration unless you add or remove team members.

If velocity were to actually reflect the pace at which a team is delivering, then it could be used as a predictor. But I submit that then it can’t be tied to hours!

I accept that some things cannot be [easily] changed. Budget approvals have to happen very early in a project’s lifecycle and so you need a relatively high cost and time estimate because the earlier in the lifecycle you are, the more unknowns you have to accept. And no matter what you use to estimate that early in the game, your margin of error will be very large.

But the cool thing about having a properly measured velocity is that revisiting those estimates becomes a trivial exercise. And there’s some value to that, since the sooner it is known that work is taking longer than initially promised, the more time there is to make appropriate adjustments. Conversely, the sooner it is known that work is taking less time than initially expected, the sooner any benefits of this realization can be leveraged.

How to do it? Well, like I already said, hours suck, so stop using hours altogether. They are too precise and they mean different things for different people in terms of effort. Using points of complexity is less precise (particularly if you go with the usual sequence of 1, 2, 3, 5, 8) and more universal. Whether a 2-point deliverable takes me one day or takes another developer half a day, it’s the same amount of complexity delivered, and that’s cool because now you are able to look at delivering more or fewer points per unit of time (proper velocity!).

I am serious about not using hours. If someone asks you “how many hours is a point?” you say “NO.” If someone asks you “how long will this take?” you say “NO.” Points are a swag at relative complexity. That’s it. There is no conversion and there are no hours. Deliverables have points as metadata and those points are used by the team to derive velocity — points delivered per iteration. And from velocity, a few other things can be derived.

One is an approximate amount of work for the team to pull into the next iteration. Historical data  — the velocity and its trend over the last few iterations — can give the team an indicator of about how much work it can reasonably commit to right now.

The other thing is more relevant to this blog entry, and that is a prognosis for the completion date assuming the current backlog. The backlog is continuously groomed, with work broken down for easier estimation and work added or removed as business needs dictate. A team can calculate their velocity, weighted by recent trends, and use that to get a rough count of how many iterations would be needed to clear out the current backlog.

How do you arrive at a weighted velocity? However you want, as long as it makes sense. I’ve used a formula I basically made up on the spot and it has worked “okay.” Given n is the latest completed iteration and n-1 is the previous iteration:

Weighted Velocity = V(n) * 0.5 + V(n-1) * 0.3 + V(n-2) * 0.2

You can fiddle with the weighting factors; I don’t know how good they are. I just want to try and capture a trend if there is one and also help diminish the effect of any outliers (e.g. everyone having the flu during some iteration).

Anyway, so you get your velocity, and you look at how many points of work currently remain in your product backlog, and you do simple division and round up. With a velocity of 20 and 175 points remaining, you have 9 more iterations to go. If each iteration is 2 weeks (call it 80 working hours) and your team costs you about $400 per hour altogether, then you are looking at $400 * 80 * 9, which yields $288,000, so you round to $300,000 to preserve the appropriate number of significant digits.

Bam. 18 weeks (9 two-week iterations) and about $300k. And that’s as of right now. Look at this at the end of every iteration and adjust. You still should not think of these numbers as any sort of promise, but at least the numbers themselves will be better and better informed, and hopefully closer and closer to accurate as work goes on.
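
Here is that whole back-of-the-envelope projection as a small sketch, using the weighting factors above. The individual iteration velocities are made-up numbers chosen to land near the example’s velocity of 20, and the backlog size, iteration length, and hourly rate come straight from the example:

// A minimal sketch of the projection above: weighted velocity from the last
// three iterations, then remaining iterations, weeks, and cost.
public class CompletionForecast {

    public static void main(String[] args) {
        // Velocities for iterations n, n-1, n-2 (made-up sample data).
        double vN = 21, vN1 = 19, vN2 = 20;
        double weightedVelocity = vN * 0.5 + vN1 * 0.3 + vN2 * 0.2; // 20.2

        int backlogPoints = 175;
        int iterationsLeft = (int) Math.ceil(backlogPoints / weightedVelocity); // 9

        int weeksPerIteration = 2;
        int hoursPerIteration = weeksPerIteration * 40; // about 80 working hours
        int teamCostPerHour = 400;

        int weeksLeft = iterationsLeft * weeksPerIteration;                          // 18
        long costLeft = (long) iterationsLeft * hoursPerIteration * teamCostPerHour; // 288,000

        System.out.printf("About %d iterations (%d weeks), roughly $%,d (call it $300k)%n",
                iterationsLeft, weeksLeft, costLeft);
    }
}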

Is there a whole lot of value to doing this exercise with your project? Well, if you don’t have a need to predict completion date and remaining cost, then of course don’t waste your time with this stuff at all. But if you do have that need, and many organizations continue to, then I think this is a reasonable approach.

Though, if you have ideas about how it can be changed for the better, I am very interested in hearing them.

On Individual Performance Metrics

My previous entry was all about how individual performance metrics in a collaborative, agile environment are misguided. But to keep things focused on reality, where demands for such metrics are sometimes unavoidable, I ended on a challenge for myself to come up with some individual, quantitative performance metrics that did not conflict with notions of self-organization, trust, collaboration, teamwork, and all that stuff that makes agile development teams happy and productive.

While driving home from work I came up with a couple of ideas.

To begin, I’ll jot down some attributes of what I would consider an acceptable individual, quantitatively measurable performance metric.

  1. [ ]  it is not adversely affected by taking time to help others
  2. [ ]  it is not adversely affected by doing high priority work over high complexity work
  3. [ ]  it does not discourage doing high complexity work for fear of failure
  4. [ ]  it makes individual praise sensible and individual punishment nonsensical
  5. [ ]  the closer to an ideal value it is, the more successful the team at delivering valuable work
  6. [ ]  it does not loom over people’s heads and lower morale

Let’s say that if an idea ticks all six of the boxen, then it’s fully acceptable. If it doesn’t, then it’s to be avoided. With that in mind, let’s look at a handful of “candidate” metrics.

Individual velocity
I ranted about this at length in my previous entry. This metric fails points 1, 2, 3, and 6, and makes a weak case for passing 5. It incentivizes selfishness and compromises team collaboration.

Defect count per person
Whether used to evaluate developers based on how few defects are discovered in their work or to evaluate testers based on how many defects they discover in others’ work, this metric fails points 3 and 6. In the case of evaluating testers for, effectively, reporting how crappy the developers’ code is, it also fails points 1 and 5.

Number of lines of code
LOL.

Ok on to the ideas that I think might at least be conversation starters…

Contributions to an “agility index”
This one came to me while reflecting on a conversation featuring Ken Schwaber and the notion of having an Agility Index that would indicate how “agile” a company is. From what I could tell, it amounted to a checklist of agility-enabling practices that in turn yielded a score between 0 and 100. While I can’t seem to find this particular checklist on Scrum.org — presumably they’d want to sell its usage — I think a comparable checklist can be formulated by anyone with sufficient experience. The more items checked off, the higher the score on the “agility index.”

Making progress towards a higher index benefits the entire team and should not compromise any other good practices. In other words, all six requirements mentioned above would be met. What’s left is figuring out how to make it a quantifiable individual measurement. Well, when it comes time for a performance evaluation, simply ask each developer to mark the “agility index” checklist items that he or she contributed to, with a short blurb specifying the nature of the contribution. The more items contributed to, the better. And the nature of the items should ensure that nothing beneficial got compromised.

Kudos received through team retrospectives
This one is a bit wacky, but I think it’s worth a go. Some variations on team retrospectives include “giving kudos” to fellow team members. These could be for help offered, or ideas presented, or the quality of some deliverable, or really anything at all that was appreciated by others on the team. By being all about collaboration and contribution to team success, this approach ticks all six of the above boxes. And if a process or project manager keeps track of the number of “kudos” each team member receives, that can later be turned into a performance indicator.

So, that’s what I came up with on my drive home. There may be other things of this nature that focus much more on success than failure, and team success at that, all the while allowing for an individual perspective. And I think these are the kinds of metrics we should focus on if we absolutely have to. If it were up to me, I’d just focus on enabling team success.

Don’t Measure Individuals

I recently started looking at some project management software called AtTask, evaluating whether it is appropriate for agile development. While it seemed to be quite capable as a “waterfall” PM tool, I wasn’t thrilled by its take on “Agile” (yes, the AtTask people use that word as a proper noun, which annoys me, but I’ll write on that another time). I brought up AtTask’s inadequacies to the client, briefly mentioning my recommendation to go with Greenhopper… oh wait, Atlassian renamed it to JIRA Agile… (dang, they’re also using it as a proper noun).

Anyway, what project management tool ends up being used remains to be seen. But what the client impressed on me is that the software used should allow for measuring individual velocity. (By velocity I am referring to “points of complexity” delivered per iteration.)

HOLD ON NOW!

Individual velocity?

Folks, if someone says that they want to quantitatively measure individual performance in an environment that’s supporting teams, just say no. That idea is bad in so many ways that I will actually enumerate some of them.

1. It is a conceptual non-starter
What does individual velocity mean when you are talking about a software development team? In the environment in question, each deliverable will be handled by at least three people — a primary developer, a developer who provides peer review, and a dedicated tester. In many cases there will be even more team members involved. So whose “velocity” is at stake when a deliverable isn’t done at the end of an iteration? If a developer hands off a bunch of stuff to testers, does that developer have a high velocity? If those testers find a million defects, is that a low velocity for the developer and a high one for the testers? If a technical lead who is needed for peer review is at a conference for a few days and some deliverables don’t make progress, does the primary developer’s velocity take a hit? Nothing about this concept makes sense to me.

Software is a team effort. Isolating each team member’s individual effort in delivering points of complexity is a suspect task. Just as an example, something like sending an email can be easy enough to implement but a pain in the arse to write automated tests for. Does Bob the tester get more credit than Mary the developer? Perhaps Bob got a whole bunch of assistance from Vick the technical lead. Should Vick get in on some of that sweet velocity loot? There is a team velocity because the team as a unit delivers software.

2. It is impossible to actually do
Besides the troubles listed above, velocity, like all such metrics, can be gamed. And I want to be very clear when I say that it will be gamed, at the expense of the team and the product.

Suppose that not all work has “points of complexity” (sometimes only business deliverables get this attribute, while other technical and non-technical tasks do not; the latter is then seen as helping get the former “done”). Presumably people that work on tasks that don’t have points won’t be looked at negatively when they don’t “take points across the wall.” So, let’s say Sam is a below-average developer and is having trouble writing as much production-ready code as some of his team members. Sam could just take on tasks that don’t have any points, so as he finishes them at his comparatively slow pace, he doesn’t have to worry about being compared to his colleagues. Sam makes himself difficult to measure.

Alternately, Sam may avoid doing any work that doesn’t have story points (e.g. setting up continuous integration) so that he can maximize how many points he delivers, leaving the “non-measured” work to the rest of the team.

Or, even worse, Sam just decides to forego any notion of maintainability and cuts corners like he’s got a Super Star in Mario Kart. His points delivered for the iteration go up. But the technical debt incurred from his shenanigans raises the complexity of future work. Team velocity suffers down the road.

But come on, Sam wouldn’t do that. Sam is a good team member who knows what it means to write quality code. He’s just inexperienced and needs a bit more guidance than his peers. But when he asks for help, Jack and Satou tell him “ain’t nobody got time for that” because they’ve got their own points to worry about!

Satou would never say something like that, though. He’s from Rhode Island. Also, he would help Sam out, and in turn his own work would stall while Sam’s moved along. This might be good for the team (and the business), but it is at Satou’s expense. That is, unless Satou also “took some credit,” which would only be fair, right?

At this point, we’ve really lost all sense of a metric for individual performance. Who did “more” work, Satou or Sam? Who did more important work? What about Jack — if part of Jack’s value to the team is the knowledge he is able to share, then does his refusal to help Sam reflect on his performance and how does it weigh against the “points” he was able to deliver?

“Velocity” on an individual level is a metric that’s vulnerable to so many deliberate and non-deliberate breakages that it’s effectively void. And hopefully it’s clear that the actual numbers that you’d come up with when trying to measure “individual velocity” in a real situation would be very hard to distinguish from some you’d get by rolling dice. There is nothing to ensure that they reflect anyone’s skill level or actual productivity. And there are still other reasons not to go down this dark path.

3. It shifts focus to all the wrong things
When a business asks for software, what is ultimately promised by the team tasked to deliver the software? That the right functionality and quality will be implemented for a reasonable cost? Or that each team member will perform adequately in accordance with some performance metric? I’m guessing the first one.

When a team is focused solely on writing valuable, high-quality software, they will assist each other as needed and avoid compromising their goal for no good reason.

But I submit that reward for visible high contribution and/or punishment for visible low contribution can be quite compelling reasons indeed. When one is incentivized to look better than one’s peers (or to at least not look worse), then a conflict of interest arises where the actual quality of the software competes with the perceived quality of one’s individual contribution.

4. It lowers morale
Individual performance metrics are stressors as well as a potential source of tension and discord among team members. Rather than emphasizing success and movement in a positive direction; rather than encouraging collaboration and teamwork; rather than fostering a feeling of joint ownership, they introduce the fear of punishment for failure; they discourage altruism, knowledge sharing, and generally working together; they incentivize people to mask their inexperience. They can single-handedly make an otherwise positive experience into a negative one. Developers can become less happy. And when morale is low, so is productivity.

5. It is a net value loss
I suppose I should address the elephant in the room at this point, so here we go…

Why would anyone want individual performance metrics? Is it to give everyone cookies and donuts and Clif bars relative to how awesome they are? Probably not; it’s much more likely an attempt to target the underperformers. It’s a gathering of “objective evidence” that people that you already perceive to totally suck in fact do.

I have yet to see any other reason put forth that makes sense. If you want to reward people for good performance, nobody is going to challenge you for “proof.” If you want to manage resources in such a way that teams are balanced in skill and capability then you can do better than rely on fuzzy math to do it.

So this endeavor adds rather little value and carries a rather high cost. As mentioned, people will game the system, focusing on perception at the expense of ultimate quality. There’s the problem of lower morale and in turn lower productivity. These result in higher costs for the business to get what it needs. And the supposed value-add? The “addition by subtraction” of removing an underperforming team member? It’s far from a guarantee, not least because the system is vulnerable to gaming from all angles.

So what happens is the person who you think totally sucks merely continues to totally suck except now you’ve introduced a whole bunch more problems to worry about in terms of damaging team dynamics.

6. It goes against the principle of empowered, self-organizing teams
If a team is entrusted with delivering software then why should that team be burdened with a handicap like “individual velocity” just for the sake of gathering evidence against “bad” developers? Let the team figure out how to deliver the best software it can, let people collaborate as they see fit, and if the team decides a member is having a negative effect then trust the team to make that decision. (Naturally, asking for proof in the form of some numbers can take you down a very ugly path of infighting, subterfuge, and sabotage as people try to game a flawed system in conflicting ways. So don’t do it.) Find someone empowered to manage team personnel and remove the problem member if the team deems it necessary.

To conclude, individual performance metrics look to be a terribly unproductive endeavor at best and a highly damaging one at worst. Development teams, especially ones that have a good level of transparency built into their approach, already have no secrets about who’s good and who sucks. Efforts can be made to let team members help and improve each other and remove negative members if necessary, or efforts can be made to undermine what a productive dev team should be all about. Don’t fall into a trap of going for the latter.


As an afterword, suppose you absolutely have to obtain some “quantitative” measurement for individual performance reviews due to some stinky contract that was signed eons ago when software was written by fish. The challenge is to come up with metrics that do not compromise the principles of agility, trust, and self-organization that are worth so much to a dev shop — metrics that don’t introduce a conflict of interest. This is actually a bit of a puzzle and I will think on it some. I’ll post my thoughts in my next entry.

Greetings

This is the beginning of my blog. I plan on writing primarily about the following topics:

  • software (various topics)
  • travel, photography
  • fitness
  • movies, music, and other media entertainment
  • cars
  • the NBA (possibly through another blog of mine)
  • whatever else I please. This is my blog; I do what I want!

Everyone is welcome to post comments.

Also, I plan on playing with the color scheme of this blog for a little while so if things are hard to read it’s probably because I haven’t bothered to style them intelligently yet. Give it some time.

Cheers
–Sciros (I also go by “Sci” or sometimes even my real name o_O)