Popular software engineering metrics, and how they're gamed
When people talk about software engineering metrics, there are two contradicting truisms one often hears:
What's measured improves. -Peter Drucker
When a measure becomes a target, it ceases to be a good measure. -Goodhart's Law
Both of these statements describe software engineering metrics in part, but taken together they imply that it's impossible to use measurement to improve. How can a data-driven manager bridge this divide?
The bad news is that it takes some commitment to learning, especially for those coming from distant (i.e., non-technical) roles. The good news is that carefully chosen metrics can and do continue to work in spite of "becoming a target." In fact, for many of the metrics we'll recommend, the more an employee "games the system," the more the business' long-term interests are served. The end goal is metrics that continue to work well while being "gamed." There are five metrics that meet this bar.
Target audience: Action-minded Manager
Any manager who has tried to pry more completed tickets from their engineering team knows the struggle. They want to take fair, data-backed actions that improve their velocity. They know that making decisions based on gut instincts is expensive and unscalable. Yet, when they look at their to-do list, it's full of tasks that seem more pressing than "improve measurement." In practice, embracing measurement often happens after a catastrophic failure, like a buggy release build or noticing a developer hasn't closed any tickets in a month.
Even when it's clear that measurement needs to happen, it's ambiguous where to start. You can Google "software engineering metrics," click through the entire first page of results, and remain unclear about what next steps to take. Take the oft-touted metric "Team Velocity" as an example. Almost every Google result recommends measuring "Team Velocity," but nobody wants to say whether it's being measured in... tickets closed? Lines of code? Story points? Whatever it is, managers don't have time to figure it out. They need to get from this 15,000-foot mountain of theory down into something they can understand, trust, and benefit from... preferably immediately.
The organizing premise of this article is that you are a manager who wants to understand software engineering metrics just well enough to make your team more effective. Being a manager means you're too busy to learn about theory, except where it ties straight back to how you can benefit from that theory.
Metrics must be practical for real businesses to try out
Before writing this article, we digested the metrics proposed by the top 20 Google results in the domain of software metrics. As you might imagine, every source had its own theories about which metrics mattered most. Their articles often bucketed metrics into groups like "process metrics" and "project metrics." We'll follow their cue on grouping metrics, but our groups will be "quality metrics" and "everything else."
Here's what qualifies as a Quality Metric:
- Business value. Is it possible to draw a straight line from metric => business value created?
- Measurable. Can we define a specific number and its units of measure (e.g., points, percentage), so that it can be tracked over time and correlated with events and experiments?
- Actionable. Can it regularly inform action that leads to positive results? Preferably with minimal false positives?
- Available/convenient. Can an average software engineering team get access to this metric without changing their existing development processes?
- Popular. Does it have enough of a following to be well-documented and credible?
Inclusion in the "Quality Metrics" bucket requires all five of the above, plus a satisfactory answer to our stretch question: if we trust this metric, and developers subsequently optimize for it, what happens then? In our time building and iterating our own metric ("Line Impact"), we've learned that toxic byproducts are an expected consequence of measurement, unless the metric is precisely aligned with long-term business value. As an example, we will later show how developers trying to game the "Lead Time" metric cause reduced transparency for management.
Quality software engineering metrics
Here are five software metrics that check all the boxes to help Managers run a more efficient team.
Any list that's focused on maximizing business value ought to start with OKR-driven metrics. Leading companies like Google use OKRs as a primary axis on which to evaluate senior engineers. The drawback of OKR-driven metrics is that they are, by definition, specific to business particulars. This makes it impossible to generalize a formula by which to capture them, or to generalize the path by which they'll get gamed.
How to game it? The risk of OKRs being gamed is low since they are usually business-centric. The main danger of OKR-driven metrics is that since they're typically formulated on a one-off (or periodic) basis, they may be less reliable than metrics that have been vetted over years. There are always possible edge cases lurking. For example, setting an OKR to "introduce fewer than 10 bugs per month" could be gamed by not launching any new features. Assuming both parties are acting in good faith, OKR gaming should in practice be rare.
How to get it? Via existing measurement dashboards, or by working with companies like Weekdone who help provide OKR inspiration.
🎲 Story Points
Story Points could lay claim to being the most "classic" of software engineering metrics. There are many different ways to calculate Story Points, but all of them map back to an estimate of the developer time expected to complete some task. Managers can use Story Points to calculate the cost implied by a task. For example, if a developer earns $10k/month and a ticket is estimated at 10 Story Points, which the team equates to half a month of work, then the ticket's cost is $10,000 * 0.5 = $5,000. Using some version of that calculation helps prioritize tasks based on which yield the highest ratio of Projected Value / Projected Implementation Cost.
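The cost arithmetic above can be sketched in a few lines of Python. This is only an illustration: the `Task` class, the task names, and the projected dollar values are hypothetical, and the conversion of 20 Story Points per developer-month is inferred from the "10 Story Points = half a month" example rather than being any standard.

```python
from dataclasses import dataclass

# Assumptions taken from the example in the text: a developer costs
# $10,000/month, and 10 Story Points is about half a month of work,
# so one developer-month is about 20 Story Points.
MONTHLY_COST = 10_000
POINTS_PER_MONTH = 20


@dataclass
class Task:
    name: str
    story_points: float
    projected_value: float  # dollars; a hypothetical business estimate


def implementation_cost(task, monthly_cost=MONTHLY_COST,
                        points_per_month=POINTS_PER_MONTH):
    """Projected cost in dollars implied by the task's Story Points."""
    return monthly_cost * task.story_points / points_per_month


def prioritize(tasks):
    """Sort tasks by Projected Value / Projected Implementation Cost, best first."""
    return sorted(
        tasks,
        key=lambda t: t.projected_value / implementation_cost(t),
        reverse=True,
    )


backlog = [
    Task("csv-export", story_points=10, projected_value=20_000),
    Task("single-sign-on", story_points=20, projected_value=30_000),
    Task("promo-banner", story_points=2, projected_value=5_000),
]

for task in prioritize(backlog):
    print(task.name, implementation_cost(task))
```

The small "promo-banner" task sorts first here because its value-to-cost ratio is highest, even though its absolute value is the lowest of the three.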
How to game it? Most attention here belongs on how the Story Points get calculated. The easiest way to "game" Story Points is for the developer team to bias the rating scale, usually toward overestimating difficulty. One option to address this is to allow tasks to be selected in a "task marketplace," where overvalued tasks can be spread evenly between developers to maintain relative calibration. Another way to address it is to regularly recalibrate the constant used to translate Story Points into "developer days." Note that since the implementation of Story Points often differs by team, it's not advisable to use them to compare cross-team performance.
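One way to picture the recalibration step is to recompute the points-to-days constant from recently finished tickets. The sketch below assumes you can export, per ticket, the Story Points estimate and the developer days actually spent; the function name and the sample numbers are hypothetical.

```python
def days_per_point(completed):
    """Return the observed developer days per Story Point.

    `completed` is a list of (story_points, actual_developer_days) pairs
    for tickets finished in the calibration period.
    """
    total_points = sum(points for points, _ in completed)
    total_days = sum(days for _, days in completed)
    if total_points == 0:
        raise ValueError("no Story Points completed in this period")
    return total_days / total_points


# Hypothetical tickets from last sprint: (estimate in points, days actually spent)
last_sprint = [(3, 4.5), (5, 6.0), (2, 1.5)]
rate = days_per_point(last_sprint)
print(f"{rate:.2f} developer days per Story Point")
```

If the team drifts toward overestimating difficulty, the observed days per point falls from one period to the next, which is the signal that the constant needs updating.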
How to get it? Most any issue tracker this side of GitHub allows specifying Story Points. In terms of viewing them, Jira provides a "next-gen velocity report" dedicated to the purpose of showing Story Points completed over time. They also offer a "release burndown report" that uses Story Points to estimate how the team is tracking toward their goals for the sprint. GitClear also offers rudimentary (for the moment) graphs that illustrate Story Points completed over time.
How to game it? Any metric that claims "Lines of Code" (LoC) as a data source (as Line Impact does) ought to be approached with skepticism. Subtle imperfections in processing quickly get magnified into noise that crowds out signal. Our position is that any metric related to LoC ought to show the user how it interpreted their work using a diff viewer. Unless you can see how the metric was calculated on a per-commit, per-file basis, there is potential for metric gaming.
This makes GitClear's refined diff viewer a good first step toward sufficient transparency to prevent gaming. Additional protection comes in the form of notifications that get triggered when a developer contributes commits at an unusual velocity. As well, a per-committer graph of previous Line Impact makes it simple to visually spot if there are abnormal spikes in a user's contribution patterns. All these measures form a tapestry of safeguards that makes gaming unknown in the wild. That said, it would be disingenuous to claim that the metric is somehow impervious to gaming, so there is a list of known paths by which to accumulate rapid Line Impact in the Rich Footnotes of our source document.
🐞 Bug Rate / Escaped Defects
This metric is defined in various ways by various sources, but they all get at the same essential question: how many bugs are getting out into the world, and how well are we responding to them? Stackify relates it in specific numerical terms:
Your defect escape rate is expressed as a percentage based on how many defects you find before they get to production or how many make it to production, however you prefer. Every software project and team will be different, but we suggest striving to find 90% of all defects before they get to production.
Stackify's definition makes the assumption that one will designate a field in their issue tracker to capture what phase of development the bug occurred at. For teams that don't have the time to set up and track development phases, a substitute definition can be:
How often does code authored in a new feature become code changed in the course of resolving a bug?
A data-linked answer to this question is provided to GitClear users, but it is difficult to calculate manually. A guaranteed-available fallback measurement of bug rate can be: "What is the ratio of bugs created vs. bugs resolved?"
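Both the phase-based definition and the fallback ratio reduce to simple counts from an issue tracker. The following sketch assumes you can obtain those counts; the function names and sample numbers are hypothetical, and the escape rate is expressed here as the share of defects that reached production (the quote above allows either framing).

```python
def defect_escape_rate(found_before_production, found_in_production):
    """Percentage of all known defects that made it to production."""
    total = found_before_production + found_in_production
    if total == 0:
        return 0.0
    return 100.0 * found_in_production / total


def created_vs_resolved(bugs_created, bugs_resolved):
    """Fallback bug rate: bugs created per bug resolved in the same period.

    Returns None when nothing was resolved, since the ratio is undefined.
    """
    if bugs_resolved == 0:
        return None
    return bugs_created / bugs_resolved


# Hypothetical month: 45 defects caught before release, 5 found in production.
print(defect_escape_rate(45, 5))    # share of defects that escaped, in percent
print(created_vs_resolved(12, 16))  # below 1.0 means the bug backlog is shrinking
```

Under this framing, finding 90% of defects before production corresponds to an escape rate of 10% or lower.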
How to game it? This metric is difficult to game. A developer can take on more small tickets, but each ticket introduces an incremental risk of bugs if they aren't careful. The risk of using Bug Rate is that it needs a counterbalance for "progress" like Story Points or Line Impact, otherwise a developer could linger on a single issue for the entire sprint, ensuring a 0% bug rate.
How to get it? If your team has the bandwidth to label issues by the phase of development at which they occurred, then Escaped Defects can be manually calculated through any issue tracker. GitClear is another available option to calculate the percentage of features that become bugs.
✅ Pull Request Completion Rate
What percentage of pull requests were closed within a week of being opened? This rolls up several more granular questions, like "is the team engaged in responding promptly to new pull requests," "are they able to cooperate," and "have the developers polished their code before submitting it for merge?"
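As a rough sketch, the headline percentage can be computed from each pull request's open and close timestamps. The data shape below is hypothetical. One further assumption is labeled in the code: pull requests opened less than a week before the reporting date are left out, since they have not yet had a full week to close.

```python
from datetime import datetime, timedelta


def pr_completion_rate(pull_requests, as_of, window=timedelta(days=7)):
    """Percentage of pull requests closed within `window` of being opened.

    `pull_requests` is a list of (opened_at, closed_at) pairs, with
    closed_at set to None for pull requests that are still open.
    Assumption: PRs younger than `window` at `as_of` are excluded.
    """
    eligible = [(opened, closed) for opened, closed in pull_requests
                if as_of - opened >= window]
    if not eligible:
        return None
    completed = sum(
        1 for opened, closed in eligible
        if closed is not None and closed - opened <= window
    )
    return 100.0 * completed / len(eligible)


# Hypothetical pull requests for one team
prs = [
    (datetime(2024, 2, 1), datetime(2024, 2, 3)),   # closed in 2 days
    (datetime(2024, 2, 5), datetime(2024, 2, 20)),  # closed in 15 days
    (datetime(2024, 2, 10), None),                  # still open, older than a week
    (datetime(2024, 2, 12), datetime(2024, 2, 14)), # closed in 2 days
    (datetime(2024, 2, 27), None),                  # too recent, excluded
]
print(pr_completion_rate(prs, as_of=datetime(2024, 3, 1)))
```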
How to game it? Acing the "Pull request completion rate" test means always closing PRs within a week. This is relatively hard to game, but it must be counterbalanced by a propulsive force such as Story Points or Line Impact, otherwise all of the team's incentives will be to polish work rather than ship it (or submit it for PR review).
How to get it? All of the "name brand" Engineering Insight providers (GitClear, Pluralsight, Code Climate, Pinpoint) offer stats that illustrate the efficiency of a team's pull request process. We help compare the providers here.
Beneath the "top tier" metrics, we find an interstitial tier. These are metrics that possess desirable features alongside important shortcomings. They can still deliver value, so long as a Manager knows enough to account for their limitations.
Lead Time/Cycle Time
While it is listed (twice!) in Pluralsight's list of Top 5 Metrics, Cycle Time is a good idea that is highly susceptible to gaming.
"Lead Time" is most often defined as "What is the interval between when a task is filed in the issue tracker and when it is delivered to the customer?" Its cousin, "Cycle Time," is a subset of Lead Time, defined as "the time between when the product team plucked it out of the queue and when it reached production."
How to game it? Lead Time performance depends to some degree on the mechanism by which Jira tickets get filed, which means that its applicability across teams is low. Cycle Time hinges on the conventions by which the committer chooses to make their first commit. Thus, this metric is straightforward to game if a developer saves up their commits locally and releases them in one burst of work immediately before submitting the PR. This failure is what relegates Cycle Time to the "Honorable mentions" section. Also not ideal: by incentivizing a developer to save up their commits without pushing, it's impossible for teammates to intervene while that work is in progress. This side effect reduces transparency between a manager and their developers.
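To make the two definitions and the gaming path concrete, here is a minimal sketch. The `Ticket` fields are hypothetical, and using the first visible commit as the moment work "started" is an assumption that mirrors the convention described above rather than a universal rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    filed_at: datetime         # task filed in the issue tracker
    first_commit_at: datetime  # assumed proxy for "work started"
    delivered_at: datetime     # change reached production


def lead_time(ticket: Ticket) -> timedelta:
    """Interval from filing the task to delivering it."""
    return ticket.delivered_at - ticket.filed_at


def cycle_time(ticket: Ticket) -> timedelta:
    """Interval from the first visible commit to delivery."""
    return ticket.delivered_at - ticket.first_commit_at


# Same work, two commit habits: pushing as you go vs. saving commits locally
# and releasing them in one burst the day before delivery.
pushed_early = Ticket(datetime(2024, 1, 2), datetime(2024, 1, 10), datetime(2024, 1, 15))
saved_up = Ticket(datetime(2024, 1, 2), datetime(2024, 1, 14), datetime(2024, 1, 15))

for ticket in (pushed_early, saved_up):
    print(lead_time(ticket).days, cycle_time(ticket).days)
```

Lead Time is identical for both tickets, but the saved-up commits shrink the measured Cycle Time from 5 days to 1 without any change in the actual work.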
How to get it? Azure DevOps Services provides the graph featured above. Jira offers a version of it as well. All of the major Engineering Insight offerings (GitClear, Pluralsight, Code Climate Velocity, Pinpoint) implement some version of Lead Time calculation.
Trying to capture some measure of test coverage is a reasonable goal to consider pursuing. However, it doesn't quite make the cut as a Top Tier metric for two reasons:
- It's an incomplete solution. Adequate testing must include functional and integration tests alongside unit tests, and no automated solution can know which pages need to be visited within the app or website. This still takes human direction.
- It's very difficult to tie back to top-line business metrics. Test coverage is worth assessing as part of a Lead Developer's plan to monitor code quality, but it doesn't connect to business value directly enough to make the cut unto itself. A related but more business-functional metric to track code defects is "Bug Rate / Escaped Defects."
"Everything Else" Metrics
The list above is relatively short compared to the corpus of past metrics that have been proposed by other authors. For the sake of completeness, let's review some of the most commonly mentioned metrics elsewhere, and why they didn't make the cut.
Team Velocity was one of the most popular metrics cited by other articles. It's recommended by TechBeacon, InfoPulse, and SeaLights.io. However, authors seem wary of defining the units of this important metric. TechBeacon defines it as "how many 'units' of software the team typically completes in an iteration." Not so tangible. InfoPulse doesn't try specifying units. SeaLights says the units are "Story Points." Thus, we conclude that the notion of "velocity" is adequately captured by Story Points, or Line Impact by proxy.
The CEO of GitPrime (now acquired by Pluralsight) was among those who advocated for considering Code Churn. We argue in response that the implications of Code Churn are all over the map, which renders it very difficult to act on high churn. Metrics need to be actionable.
SeaLights and InfoPulse recommend paying attention to how a team is trending toward their sprint goals. We agree this is important, but it's already covered by the "Story Points" section above.
Mean time to repair / Mean time between failures
Among the handful of articles that champion these metrics, none ventures to describe what units the metric is based in, or what service can be used to gather these data points. Our research found that Atlassian offers a help article on the subject. If this article weren't already 2,500 words long, MTTR might make the "Honorable Mentions" section. But it isn't easy to tie back to business value, and it isn't straightforward for an average team to get access to without changing their routines.
Did we miss anything?
If you have a metric that meets our criteria above but isn't yet on the list, drop us a line in the comments section on our blog post.