Measuring code activity in 2021 for data-driven decision makers
Welcome to the most comprehensive guide to measuring software developer coding activity in 2021. To minimize your time here, the sections of this guide were designed to be self-contained. Feel free to jump straight to whatever topic you're most curious about:
- Why did we write this guide? About 100 hours have been spent collecting and arranging the content for this article. Why did we bother?
- Can developer productivity really be measured? What modern academic and scientific research has to say about the extent to which dev productivity can be measured.
- Better data, better policies. Specific examples where measurement improved retention, morale, and amount of work completed.
- Best tools for measuring developers in 2020. Reviews the four tools available to measure developer performance, and how they differ.
- On securing developer buy-in. Addresses the first question asked by many companies pursuing developer measurement: how do I get the team on board?
- Transparency always wins on a long enough time horizon. What can be learned from the past about creating improved transparency?
- Conclusion and consolidated pricing. A brief summary of code measurement tools, including an apples-to-apples pricing comparison.
This guide will be updated continually to reflect new advancements in developer measurement. If you'd like to talk about this article, drop by our corresponding blog post.
Why did we write this guide?
In a top Google result for "measure developer productivity," Dustin Barnes neatly summarizes the industry's past efforts to outperform intuition, paraphrased here:
If there is a holy grail (or white whale) of the technology industry, especially from a management standpoint, it's the measurement of developer productivity. Measuring and managing developer productivity, however, has consistently eluded us ... In business school, it's still taught that "you can't plan if you can't measure," ... [but] as we've shown above, there still doesn't exist a reliable, objective metric of developer productivity. I posit that this problem is unsolved, and will likely remain unsolved.
Dustin's attitude, aka the prevailing outlook of 2015, hammers home how far developer measurement has come in just five years. As of early 2021, there are now four credible options for directly measuring team and developer performance. Each of the options approaches measurement in its own data-driven way. Each option has its adherents, and each provides documentation to help you evaluate its alignment with your own philosophy.
Can developer productivity really be measured?
Skeptics of code measurement have years of history, not to mention primitive tools like GitHub Stats, that they can cite to justify skepticism of "lines of code" as a measurement tool. Newer tools more often rely on "commit count," but is that any better?
Historically, sample sizes for developer research have been extremely small, so it has been difficult to know to what extent developer productivity can be measured. But in early 2021, research was undertaken that uses 2,729 data points to study how the classic git metrics ("lines of code" and "commit count") correlate with "software effort." The full analysis can be downloaded for free here.
In a nutshell, this research shows how various dev teams' estimation of effort correlates with observable git metrics. Since all developer metric companies use the same few data sources as signal to drive the reports they offer, the quality of that signal puts an upper bound on how useful their reports can be. That is, if the correlation between "effort" and "commit count" is only 10%, you learn essentially nothing by looking at any graph or report based on commit count.
The research design separates correlation analysis by repo, since every repo has its own method by which Story Points are calculated. This chart shows the five largest repos in the study, and how strong the Pearson correlation was between popular git metrics and estimated software effort:
Or in table form,
Effort vs Metric Correlation

| Count of Data Points in Repo | Line Impact | Commit Count | Lines of Code Changed |
|---|---|---|---|
| Large repos (n=1847) weighted average | 41% | 32% | 25% |
| All repos (n=2729) in dataset weighted average | 38% | 27% | 25% |
These results suggest that git metrics can offer a useful first-pass approximation of developer productivity, but that even the most strongly correlated metric (Line Impact) in the most strongly correlated repo (n=655 issues analyzed) has only a 61% correlation with effort. Thus, no existing git metric can tell you the full story behind which developers are most productive.
The good news for managers is that they're not usually looking for a silver bullet. Even if a git metric can't tell you the entire story of who is carrying the greatest load, having a metric that is 61% correlated in a repo of 655 issues proves that it is possible to create circumstances where effort is reasonably well approximated by git metrics.
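To make the per-repo analysis concrete, here is a minimal Python sketch of computing a Pearson correlation for each repo and then combining the results. It assumes the "weighted average" in the table weights each repo by its number of data points, which the n values suggest but the study summary above does not spell out. The repo names and numbers are invented for illustration and are not taken from the research.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def weighted_average_correlation(repos):
    """Average per-repo correlations, weighting each repo by its point count.

    `repos` maps a repo name to a (metric_values, effort_values) pair.
    Weighting by point count is an assumption, not the study's stated method.
    """
    total_points = 0
    weighted_sum = 0.0
    for metric_values, effort_values in repos.values():
        n = len(metric_values)
        weighted_sum += n * pearson(metric_values, effort_values)
        total_points += n
    return weighted_sum / total_points


# Hypothetical data: (git metric per issue, story points per issue).
example_repos = {
    "repo_a": ([10, 20, 30, 40], [1, 2, 3, 4]),  # metric tracks effort exactly
    "repo_b": ([5, 15, 10, 20], [3, 1, 4, 2]),   # metric runs against effort
}

for name, (metric, effort) in example_repos.items():
    print(name, round(pearson(metric, effort), 2))
print("weighted average:", round(weighted_average_correlation(example_repos), 2))
```

Keeping the correlation per repo before averaging mirrors the research design described above: Story Points are calibrated differently in every repo, so pooling raw values across repos would mix incompatible scales.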
Better data, better policies
Data opens the door to improve the throughput of specific committers, but it gets even more interesting when we zoom out to apply it at the team level.
At the individual level, there are many paths by which measurement accelerates output. The most basic benefit is a "heads up" for the manager when a developer's velocity is more than two standard deviations below their average. The less formal way to describe such situations is that "the developer is stuck." Knowing when to intervene to help a stuck developer can save hours of time and frustration -- especially for junior developers, who are prone to suffer in silence. Pluralsight Flow, GitClear, and Velocity identify stuck developers (in varying ways) as part of their base package.
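As a rough illustration of the "two standard deviations below their average" heads-up, here is a minimal Python sketch. It is not how Pluralsight Flow, GitClear, or Velocity implement their detection (each does it in its own way); the function name `is_stuck`, the weekly velocity figures, and the choice of a sample standard deviation are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_stuck(history, current, threshold=2.0):
    """Return True when `current` falls more than `threshold` sample standard
    deviations below the mean of the developer's own `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    return current < mean(history) - threshold * stdev(history)


# Hypothetical weekly velocity for one developer (units are arbitrary).
past_weeks = [100, 110, 90, 105, 95]

print(is_stuck(past_weeks, 80))  # well below this developer's usual range
print(is_stuck(past_weeks, 90))  # a slow week, but within normal variation
```

Comparing each developer against their own history, rather than against teammates, keeps the alert about a change in that person's pace instead of a ranking.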
A more advanced tactic to drive up individual efficiency is to match developers with tickets that play to their strengths. On GitClear, every developer's efficiency is shown relative to their peers, across code categories:
At the team level, measurement's greatest contribution lies in allowing relentless experimentation. Step one is to establish the baseline Line Impact during a typical week. This tends to be consistent to within about 20% for most GitClear customers. Step two is to try an experiment for a couple weeks and measure the change in Line Impact. It's like an A/B test for your business processes.
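The two steps above can be sketched in a few lines of Python. This is a simplified illustration, not a GitClear feature: the function name `experiment_effect` and the weekly totals are hypothetical, and the 20% noise band simply reuses the week-to-week consistency figure quoted above as a crude threshold rather than a proper statistical test.

```python
from statistics import mean


def experiment_effect(baseline_weeks, experiment_weeks, noise_band=0.20):
    """Compare average weekly output during an experiment against a baseline.

    Returns (relative_change, exceeds_noise). `exceeds_noise` is True only
    when the change is larger than the assumed week-to-week variation.
    """
    baseline = mean(baseline_weeks)
    if baseline <= 0:
        raise ValueError("baseline output must be positive")
    relative_change = (mean(experiment_weeks) - baseline) / baseline
    return relative_change, abs(relative_change) > noise_band


# Hypothetical weekly Line Impact totals for one team.
typical_weeks = [1000, 1100, 900, 1000]
trial_weeks = [1050, 1150]

change, meaningful = experiment_effect(typical_weeks, trial_weeks)
print(f"change: {change:+.0%}, outside normal variation: {meaningful}")
```

A change that stays inside the noise band, like the one in this example, is best read as "no measurable effect" rather than as a small win or loss.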
One of the first experiments we ran upon gaining access to developer measurement was to analyze the impact of working from home. Here were the results of our analysis, and here's the corresponding graph that exhibits our per-hour output over the past year:
At face value, it appears that working from home on Wednesday has zero impact on the team's output. But the real story is more nuanced. It's typical for our team to schedule all chores, errands, dental visits, etc., on Wednesdays to reduce the hassle of getting to their appointments. You can see this reflected in Wednesday morning cells in the above graph, which are less active (= lighter) than Tuesday or Thursday morning. How do they make up the lost time? Check out Wednesday evening -- the only day of the week with significant activity occurring after 5pm. Our developers are rewarding our trust by making up time after hours. It's a win/win that gives us the confidence to continue a cycle of trust. It would also let us spot if we hired a developer bent on abusing that trust (since any stat can be scoped to the individual level).
While working from home yields a neutral result on productivity, it has positive implications on morale, so this experiment is now ensconced as our company policy. Being able to run interesting experiments like this is a cornerstone of the well-measured company.
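For readers who want to try a similar analysis on their own history, here is a minimal Python sketch of the weekday-by-hour grid behind a graph like the one described above. It counts commits rather than Line Impact (a deliberate simplification, since Line Impact is specific to GitClear), and the timestamps and function names are invented for the example.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def activity_grid(commit_times):
    """Count commits per (weekday name, hour of day) cell."""
    grid = Counter()
    for ts in commit_times:
        grid[(WEEKDAYS[ts.weekday()], ts.hour)] += 1
    return grid


def after_hours_share(commit_times, weekday, cutoff_hour=17):
    """Fraction of one weekday's commits made at or after `cutoff_hour`."""
    day_commits = [ts for ts in commit_times
                   if WEEKDAYS[ts.weekday()] == weekday]
    if not day_commits:
        return 0.0
    late = sum(1 for ts in day_commits if ts.hour >= cutoff_hour)
    return late / len(day_commits)


# Hypothetical commit timestamps (local time); 2021-01-06 was a Wednesday.
commits = [
    datetime(2021, 1, 5, 10, 0),
    datetime(2021, 1, 6, 14, 0),
    datetime(2021, 1, 6, 19, 30),
    datetime(2021, 1, 6, 20, 15),
]

print(activity_grid(commits)[("Wednesday", 19)])
print(round(after_hours_share(commits, "Wednesday"), 2))
```

Timestamps should be converted to each developer's local time before bucketing, otherwise a distributed team's "after hours" activity will be misattributed.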
Best tools for measuring developers in 2020
More detailed comparison available
We've written a detailed standalone piece comparing Pluralsight Flow, GitClear, Pinpoint and Code Climate. If you're considering paying for developer measurement, we'd highly recommend visiting this article to get acquainted with how each provider sources their data.
Now that we've seen how teams can benefit from making decisions using data, let's pivot to explore the different flavors of measurement offered by the best developer productivity tools in 2020.
If this article were written five years ago, your options would lie somewhere between "shabby" and "pitiful." GitHub Stats already existed by that point, but its focus on raw lines of code and commits made was exactly as useless as one might predict. In these past few years, the performance measurement space has blossomed to host four "high polish" options worthy of your consideration. Each brings a distinct set of strengths and its own measurement philosophy. Let's review them in chronological order of their launch date.
The first company to plant their flag in the performance measurement space was GitPrime, debuting in 2015. In 2019, GitPrime was acquired by Pluralsight, which subsequently rebranded the product as "Pluralsight Flow" (home page, pricing) in January 2020. The Pluralsight business model relies on cross-selling customers into their tutorial and instruction business ("Pluralsight Skills") when customers seek out git analytics.
Work on Pluralsight Flow is interpreted via four key metrics: "active days," "commits per day," "impact," and "efficiency." The company offers many different report types with limited explanation as to how the data was derived.
In terms of data sources, Pluralsight sources fairly evenly between commits, issues, and lines of code. We have written about the perils of these data sources elsewhere. Like GitClear, Pluralsight offers an on-prem version for enterprise-sized customers, although prices are higher for on-prem installs.
A 50 developer team on the fully featured "Plus" plan will pay $2,495 per month when billed annually.
In early 2017, GitClear (homepage, pricing) debuted with the mission to become the best solution for technical managers who wanted to quickly review code activity across 10+ repos or organization units. To this end, GitClear built robust code review tools, including a graphical Directory Browser that pinpoints tech debt.
In the time since 2017, GitClear has shifted to differentiating based on data source quality. Since competing products rely so heavily on commits, pull requests, and issues, GitClear spent 2019 doubling down to prove that data quality matters. To the extent GitClear succeeds in creating a single reliable metric to capture code activity, managers can unambiguously see how active each developer has been -- whether they're comparing between teams' location (remote or on-site), domain (front-end or back-end), seniority (senior, mid-level, code school graduate), country (outsourcing vs not), etc. Plenty of companies claim to allow such broad comparisons. As of 2020, a new one falls out of a VC incubator every few months. But among competitors that back up their claims with detailed analysis of how signal is maximized and noise is minimized, GitClear stands alone in the fervor of this pursuit. Here is a graph from another article that outlines how much more exhaustively GitClear has worked to refine Lines of Code into a consistent, reliable data source. Note how many Data Quality Factors on the right side are unique to GitClear:
The bonus for having a data source as consistent as Line Impact? You don't have to hide it from developers. They benefit from being able to review their teammates' work with a set of tools that add clarity where even GitHub and GitLab fall short (for browsing commits and directories).
GitClear offers a fully functional on-prem version for enterprise customers for only $29 per contributor per month, billed annually.
A 50 developer team on the fully featured "Pro" plan will pay $450 per month when billed annually.
Launched in mid-2017, Pinpoint aims to provide a platform to help managers (especially senior managers) compare their teams using various data sources. Pinpoint is different from other choices reviewed in this survey in that analyzing source code is not their main focus. In addition to source code analysis, they "synthesize data from a range of software lifecycle systems" to provide stats like these:
The Pinpoint demo video is one of the few resources the company makes available to evaluate their product. I suppose this lack of transparency makes them more... enterprise-ish... than Pluralsight, GitClear, or Code Climate? In their demo video they state:
Factors we use to evaluate developer performance include number of commits made, how much code that person has contributed across repositories, number of issues worked, as well as his or her person cycle time and rework rate. A detailed view for each signal is available, just as it is for team performance.
It's straightforward to understand they evaluate "commits made" and "number of issues worked," but the rest of these are pretty opaque. One might deduce that keeping an air of mystery has helped them drive more demo signups?
A 50 developer team using Pinpoint will pay $1,599 per month when billed annually.
A recent entrant in the performance metrics space is Code Climate, which launched Velocity (homepage, pricing) in 2018. As of early 2019, the v2 launch of Velocity is now the focus of the Code Climate homepage, supplanting their long-popular code quality tools. This placement suggests that Code Climate expects Velocity to be the primary focus of their company moving forward.
Velocity recently published a robust comparison between their features and Pluralsight's. If you're considering Velocity, I recommend checking out their article. It illustrates how Velocity shares with Pluralsight a common thread of ideology, features, and design elements. For example, both offer a "commit activity log" that uses shapes of various colors and sizes to represent commit activity:
In contrasting with Pluralsight Flow, the article points toward "Surfacing issues" as the area where the products diverge most. In their words,
This is the category in which the two analytics tools differ most. Velocity, with PR-related metrics at the core of the product, does a better job drawing attention (inside and outside of the app) to actual artifacts of work that could be stuck or problematic. GitPrime, with mostly people-focused metrics, draws attention to contributors who could be stuck or problematic.
The article concludes that, relative to GitPrime (now Pluralsight), "Velocity has put collaboration at the center of its product, so you’ll have access to more insights into the code review process and into PR-related details right out the gate."
A 50 developer team will pay $2,199 per month when billed annually (whereas a 51 developer team will pay $3,299/month 🤔)
A scrappy product seeking to make waves in git analytics is waydev.co (homepage, pricing). Under the heading of "Why we built Waydev," the company states, "as a technology company working with developers, it is almost impossible to understand what your team is doing, and your intuition doesn't help. But Waydev can." Waydev has gone through about 5 different incarnations over the years, as they've struggled to find product/market fit. On many pages, the current version of the product looks as though it could have been taken straight from the Pluralsight Flow help docs. Sometimes the similarity gets a little too close for comfort, like when their help page explains the Waydev "Impact" metric by obviously plagiarizing Pluralsight's explanation of the same.
If Waydev goes on to publish more pages describing how their code processing works (in their own words), then the attention dedicated to them may increase. At present, their differentiation amounts to one of price. You'll pay about $1,250/month for a team of 50 on Waydev, but we would recommend taking a long look at the data quality before treading these waters. Misleading engineering data can cause strife among developers and drive business decision-making that spawns from a misleading premise (i.e., noisy data).
Uplevel emerged from stealth mode in January 2020, well armed with $7.5m in its coffers from Seattle investors. As of late January 2020, they have yet to become the first Google result for their name, so suffice to say, they're a young company.
While documentation on their site is still scant (including no pricing page yet), their home page emphasizes how they intend to analyze "deep work time," "meeting health," "bandwidth," "work activity," and "collaboration." In terms of data sources, this sounds like a blend of pull requests, Jira, and Google Calendar. The scattershot variety of metrics and data sources harkens back to an early version of Pinpoint. It will be interesting to see how the team works to differentiate from that comparison point.
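Since each 50-developer price above is quoted as a monthly total, a per-developer figure makes the tools easier to compare. The short Python sketch below does that arithmetic using only the prices quoted in this section. The Waydev figure is approximate ("about $1,250/month"), Uplevel is omitted because it has no published pricing, and the result holds only at exactly 50 developers, since at least one vendor's price jumps at 51.

```python
# Monthly prices (USD) for a 50-developer team, billed annually, as quoted
# in the reviews above. The Waydev figure is approximate.
TEAM_SIZE = 50
monthly_price_usd = {
    "Pluralsight Flow (Plus)": 2495,
    "GitClear (Pro)": 450,
    "Pinpoint": 1599,
    "Code Climate Velocity": 2199,
    "Waydev": 1250,
}


def per_developer(prices, team_size):
    """Return {tool: monthly price per developer}, sorted cheapest first."""
    per_seat = {tool: price / team_size for tool, price in prices.items()}
    return dict(sorted(per_seat.items(), key=lambda item: item[1]))


for tool, cost in per_developer(monthly_price_usd, TEAM_SIZE).items():
    print(f"{tool}: ${cost:.2f} per developer per month")
```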
On securing developer buy-in
One of the inevitable questions asked by new clients as they begin down the road toward measuring developer output: how do I explain this to my team? It is a fundamental question to address, especially if the manager intends to use code metrics as a component in performance reviews.
There's no single answer that is going to satisfy every developer personality. Programmers skew toward high intellect, which often goes hand-in-hand with skepticism toward authority and toward broad mandates issued by management. The general approach I recommend to get your developers to embrace measurement: be transparent and just. Put yourself in their shoes and consider how you'd feel if a newly introduced tool put you in the bottom half of performers. If it's fair, and fairly applied, you'd come to terms with it. But that probably wouldn't be your first instinct.
Below are three ideas we've seen used to help developers grow comfortable with performance measurement. These focus on strategies available to GitClear users, since those are the ones with which we have the most familiarity. Here's a blog post describing how a Pluralsight manager gets buy-in from his small team.
Let developers see how specific commit activity corresponds to measured output. This is where code review tools can do double duty. They were created because looking through commits on GitHub one. at. a. time. is an incredibly inefficient use of a developer's time and we knew we could do better. But when it comes to securing developer buy-in, the code review tools serve a second purpose: allowing developers to get a tangible sense for how their work becomes Line Impact on a file-by-file, commit-by-commit basis. The longer a developer uses GitClear, the less mysterious it feels, as they develop an intrinsic sense for how measurement occurs.
Get empirical with your leaderboard. If you're a CTO or Engineering Manager using GitClear, you can configure Line Impact so that it matches your empirical sense of your team's top performers. Most managers begin with at least a vague sense for who's making the greatest contribution to their team. The better a new measurement aligns with the team's existing beliefs about its top performers, the better the confidence in the measurement.
Make clear that no single tool or approach tells the entire story. This is the last and most important factor to aid buy-in. One of the most common concerns among new users goes something like: "the developers that write the most lines of code aren't the most desirable -- it's closer to the opposite. The most valuable developers are those who don't create bugs, mentor others, pay down tech debt, trim crufty methods, are easy to get along with, etc." To this we respond: yes! Absolutely. And that's exactly why an algorithm will never replace the value of a great engineering manager. Looking at a metric like Line Impact provides one valuable piece of the puzzle when it comes to evaluating a developer's total package. But there are many pieces that software could never begin to approximate -- like whether a prolific developer's work is on the tickets they were assigned. It's essential that, when you introduce a measurement tool, you make clear that code measurement is just one aspect among many that inform your evaluation.
Transparency always wins on a long enough time horizon
The idea that we can measure development is new, so it's natural for it to face early skepticism. Consider how big a departure this is from tools a manager has used in the past. It carries the baggage of all the ineffective measurement systems of yesteryear. It begets winners and losers, which requires a strong manager to message effectively. Even with effective messaging, the possibility of early, instinctual pushback from developers is real. It seems expensive.
For some, it will take a leap of faith to believe that improved transparency justifies the cost. This final section glances back through similar inventions of the past to extrapolate whether trust is warranted. Over the last 20 years, the arrival of inventions that improve transparency has followed a pattern. The first phase is that the invention is ignored, because it lacks sufficient data to inform its projections. The next phase is that it begins to gain traction. Oftentimes in this phase, the invention dramatically improves life for a tiny number of users who love it. The third phase is pushback, which begins tepidly and grows fireball-hot, as parties who benefited from past opaqueness begin to suffer from improved information. The final phase is broad public awareness. With this comes general acknowledgement that, while the system will never be perfect, its users could never go back to the way things were.
A small sampling of companies that have followed this arc, along with their detractors (i.e., worst product in category):
- Consumer Reports. Hated by: General Motors
- Tripadvisor. Hated by: Motel 6
- Zillow / Zestimate. Hated by: Realtors
- Metacritic. Hated by: Adam Sandler
- Airbnb. Hated by: Holiday Inn
- Yelp. Hated by: Applebee's
- Redfin. Hated by: Realtors
- Glassdoor. Hated by: Amazon
Think about the first time you encountered Zillow. If it was early enough in their history, you probably weren't impressed. It takes time and a lot of iterations to gather enough data to make an algorithm like Zestimate consistent and reliable. Estimating the price of a home requires factoring in more nebulous, real world variables than any code measurement algorithm. But, before long, the early adopters discover the tool, and it affords them an advantage over total opaqueness, so they use it. Once enough early adopters tell their friends, the parties that had benefited from lack of information begin to fight back. The Wikipedia page for Consumer Reports dedicates an entire section to the lawsuits that have been pursued against it. They're proud to have won all 13.
In the final phase -- that of broad public awareness -- the world becomes a little bit clearer, more predictable place. If you live in a major city, the quality of the food you eat and the service you receive is better than it has been at any point in history. Ask a business owner about Yelp and you'll hear a litany of complaints about the unjust reviews they've received. Most business owners prioritize local effects over global effects, and that's fine. What matters is not perfection -- just that the biases of the measurement are minimized. In spite of Yelp's imperfect methods, most Americans would never consider going back to relying on intuition alone.
As awareness around developer measurement grows, the early adopters gain an advantage over their competitors by running their team 5-15% more efficiently. The difference sounds small-ish, until you add in the benefits to morale from the experiments you can run on more liberal, employee-friendly policies. Code transparency is still approaching phase two of its adoption pattern. But it won't take long until smarter companies recognize they can gain an information advantage in engineering. As the cycle picks up steam, the tools will keep growing more polished, until it's hard to remember a world when managers had to guess what their developers were working on. Multiplying the yield of your engineering budget by even a paltry 5% adds up in a hurry given current developer salaries.
Thanks for making it down to full scroll bar territory! I hope you better understand how developer measurement has changed over the past few years. If you manage more than 5 developers, you can get ahead of the curve.