Measuring developer productivity in 2020 for data-driven decision makers
Welcome to the most comprehensive guide to measuring software developer coding activity in 2020. To minimize your time here, the sections of this guide were designed to be self-contained. Feel free to jump straight to whatever topic you're most curious about:
- Why did we write this guide? About 100 hours have been spent collecting and arranging the content for this article. Why did we bother?
- Is measuring developer productivity really necessary? Reviews the emerging data on how software measurement impacts results.
- Better data, better policies. Specific examples where measurement improved retention, morale, and amount of work completed.
- Best tools for measuring developers in 2020. Reviews the four tools available to measure developer performance, and how they differ.
- What makes code measurement possible? What changed that made it possible to measure developer output, which has long been assumed impossible?
- On securing developer buy-in. Addresses the first question asked by many companies pursuing developer measurement: how do I get the team on board?
- Transparency always wins on a long enough time horizon. What can be learned from the past about creating improved transparency?
- Conclusion and consolidated pricing. A brief summary of code measurement tools, including an apples-to-apples pricing comparison.
This guide will be updated continually to reflect new advancements in developer measurement. If you'd like to talk about this article, drop by our corresponding blog post.
Regarding the distinction between "Developer Productivity" and "Code Activity"
Our experience is that managers tend to reach for the phrase "developer productivity" when they need a proxy for who's coding what on their team. In this sense, "developer productivity" == "code activity." This is why GitClear's on-site header links to a version of this article that refers to "code activity."
If you want to get more === about the difference between "productivity" and mere "activity": the distinction is whether the activity is pointed in the right direction. This is why having good managers is invaluable. A good engineering manager will interpret the sometimes subtle distinctions that make for productive engineering work, as distinct from high code activity toward the wrong end.
Why did we write this guide?
What makes this topic sufficiently essential to earn the "comprehensive guide" treatment? My 20-year obsession with collecting developer productivity metrics can be tracked straight back to my first programming job.
The year was 1999. I'd been coding for a while, but had just secured my first paid gig, writing software in a professional environment. The job posting promised a summer of experience, at $17.50/hour (!), to help a small Seattle company build software that controlled digital projectors. I still recall the rush of walking into the office on my first day, and discovering I had my own workstation, my own telephone number, even my own cube..! Heady stuff for a teenager.
My domain of ∞ infinite bliss ∞ aka the cube farm, circa 1999
The team was about 10 developers -- far more than I'd ever created software with before. With so many developers working together, it seemed there would be nothing we couldn’t achieve. But then, a few weeks after that heady onboarding, I noticed something that baffled me: the company's software was bug-ridden, slow, and, most perplexingly -- not improving.
I would try to use our software on the test projector we kept in the office, and it could take 20 seconds simply to turn the projector off (when it worked at all). No user was going to wait that long for a button to take effect. How could such a basic issue be overlooked by a team of experienced developers working 40+ hours per week? We had several meetings, where the non-technical CEO gave pep talks to get our team "fired up" to make progress against the growing bug backlog. Still, the issues persisted. Our CTO was a fire-breathing despot who once bellowed at us that "I'm not going to wipe your asses, you morons need to test your code before you commit it!" He berated us for not putting in 10-hour days as the team crept ever further behind schedule. A cat-and-mouse game ensued among employees trying to escape the office without his noticing; on one occasion I got caught in the stairwell and sent back to my desk. Still, the issues persisted.
In the three months I worked at this company, I witnessed the CEO work tirelessly to drum up meetings with clients who would have benefited from our product -- if it worked as advertised. He was a charismatic leader that worked harder than anyone I'd ever known. Yet, by Summer's end, we had lost more clients than we added. It didn’t make sense. How could such a large team, working such long hours, get so little done?
The answer was that management had zero visibility into the team's, ahem, work. The CEO was non-technical, so he didn't have a fighting chance. The CTO was disliked, so the developers ignored his directives while looking busy. They worked on pet projects, kept up on instant messenger, and, under sufficient duress, addressed an urgent Jira issue or two. Since all the salaried developers had offices, they were both literally and figuratively insulated from concern about having their process observed. When I began preparing to interview for my next job, I logged into our issue tracker and counted how many tickets each developer had resolved that Summer. I discovered that the hourly intern (me) had closed more tickets than any one of the salaried developers. Naivete manifest.
While this company was an extreme example, it foreshadowed themes that remain ubiquitous across the software industry today. Even inside companies led by seasoned tech experts, managers struggle to gain visibility into their team's work. Instead of directly measuring developer output and using that data to improve team productivity, each manager devises their own set of intuitions to serve as their personal "North Star." Upper management then clamors to hire and retain technical managers whose past intuitions led to success.
If this arrangement sounds tenuous, it's not for lack of trying to find something better. In a top Google result for "measure developer productivity," Dustin Barnes neatly summarizes the industry's past efforts to outperform intuition, paraphrased here:
If there is a holy grail (or white whale) of the technology industry, especially from a management standpoint, it's the measurement of developer productivity. Measuring and managing developer productivity, however, has consistently eluded us ... In business school, it's still taught that "you can't plan if you can't measure," ... [but] as we've shown above, there still doesn't exist a reliable, objective metric of developer productivity. I posit that this problem is unsolved, and will likely remain unsolved.
Dustin's attitude, aka the prevailing outlook of 2015, hammers home how far developer measurement has come in just five years. As of early 2020, there are now four credible options for directly measuring team and developer performance. Each of the options approaches measurement in its own data-driven way. Each option has its adherents, and each provides documentation to help you evaluate its alignment with your own philosophy.
Is measuring developer productivity really necessary?
Engineering Managers don't have minutes to waste on anything non-essential (related: why the best developers tend to utilize a small set of tools). But, depending on the week, they might not even have time to perform all of their job's essential activities. If managers this busy are going to embrace a big new idea like measuring developer performance, there must be credible evidence that the benefits justify the costs -- both financial and temporal.
GitPrime, the earliest entrant to today's developer productivity space, has done an excellent job of following up with their customers and documenting the impact that performance metrics can have on results. Their case studies include a 137% increase in Impact by Storyblocks, and a 25% increase in measured Impact enjoyed by Adext.
Code Climate, the most recent company to join the space, has documented an 83% increase in productivity while dogfooding their Velocity product. They ascribe their success to ditching their flat hierarchy and reducing their time spent in meetings. This anecdote from their story was especially compelling:
[Our manager] first started noticing the disengagement during daily standups. Some people were zoning out; others were on their phones. Almost no discussion was happening. It was 20 minutes of engineers taking turns speaking, one by one. When she brought this up in her 1:1s that week, she found that the team was in unanimous agreement: meetings were inefficient and taking up too much time.
This is consistent with our findings at GitClear. The "standup meeting" culture that permeates most Agile development teams has been assumed (as opposed to measured) to provide benefits that outweigh costs. We recommend teams measure the productivity difference of any regularly scheduled meetings to ensure that they're comfortable with the trade-off being made.
When it comes to standup meetings, it turns out the cost can be roughly expressed in terms of time spent:
(15 minutes per standup meeting
+ 10 minutes before the meeting, when new tasks aren't started
+ 10 minutes after the meeting, to restore flow state)
× number of meetings per week
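Expressed as a quick calculation (a minimal sketch -- the per-meeting overhead figures come from the estimates above, and the meeting count is a hypothetical example):

```python
def standup_cost_minutes(meetings_per_week: int,
                         meeting_len: int = 15,
                         warmup_loss: int = 10,
                         refocus_loss: int = 10) -> int:
    """Minutes lost per developer per week to standup meetings,
    counting the meeting itself plus the before/after overhead."""
    return (meeting_len + warmup_loss + refocus_loss) * meetings_per_week

# A daily standup (5 meetings per week) costs each developer
# (15 + 10 + 10) * 5 = 175 minutes, i.e. roughly 3 hours per week.
per_developer = standup_cost_minutes(5)
```

Multiply that per-developer figure by team size and the cost of a daily standup becomes easy to weigh against whatever benefit the meeting provides.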
The deeper one digs into the case studies emerging at companies that measure their development throughput, the more evidence supports the narrative that a 5-15% improvement to developer output is about the norm after beginning measurement.
Even if your business case yielded only a 5% increase to developer throughput, how many other avenues do you have available to multiply the efficacy of your engineering budget by 1.05? What does that math look like?
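As a back-of-envelope sketch (all dollar figures here are hypothetical, purely for illustration):

```python
def annual_value_of_gain(annual_eng_budget: float, gain: float = 0.05) -> float:
    """Dollar value per year of a fractional throughput gain,
    applied against an engineering payroll budget."""
    return annual_eng_budget * gain

# Hypothetical: a 10-developer team at ~$150k fully loaded cost each.
# A 5% throughput gain is worth 1,500,000 * 0.05 = $75,000/year.
value = annual_value_of_gain(10 * 150_000)
```

Even at the conservative 5% end of the range, few other levers available to an engineering manager return that much for so little ongoing effort.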
Better data, better policies
Data opens the door to improve the throughput of specific committers, but it gets even more interesting when we zoom out to apply it at the team level.
At the individual level, there are many paths by which measurement accelerates output. The most basic benefit is a "heads up" for the manager when a developer's velocity is more than two standard deviations below their average. The less formal way to describe such situations is that "the developer is stuck." Knowing when to intervene to help a stuck developer can save hours of time and frustration -- especially for junior developers, who are prone to suffer in silence. Pluralsight Flow, GitClear, and Velocity identify stuck developers (in varying ways) as part of their base package.
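The "two standard deviations below average" heuristic described above can be sketched in a few lines. This is a simplified illustration, not how any particular vendor implements it -- the real tools blend richer signals:

```python
from statistics import mean, stdev

def is_stuck(weekly_velocity: list[float], current: float) -> bool:
    """Flag a developer whose current velocity falls more than two
    sample standard deviations below their historical average."""
    if len(weekly_velocity) < 2:
        return False  # not enough history to establish a baseline
    avg = mean(weekly_velocity)
    sd = stdev(weekly_velocity)
    return current < avg - 2 * sd

# History averaging ~100 units/week with modest variance: a week at 40
# trips the alarm, while a week at 95 does not.
alert = is_stuck([100, 110, 90, 105, 95], 40)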
A more advanced tactic to drive up individual efficiency is to match developers with tickets that play to their strengths. On GitClear, every developer's efficiency is shown relative to their peers, across code categories:
At the team level, measurement's greatest contribution lies in allowing relentless experimentation. Step one is to establish the baseline Line Impact during a typical week. This tends to be consistent to within about 20% for most GitClear customers. Step two is to try an experiment for a couple of weeks and measure the change in Line Impact. It's like an A/B test for your business processes.
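The A/B-style comparison amounts to measuring the percent change in average Line Impact between the baseline period and the experiment period (a minimal sketch, with hypothetical weekly totals):

```python
from statistics import mean

def experiment_lift(baseline_weeks: list[float],
                    experiment_weeks: list[float]) -> float:
    """Percent change in average weekly output between a baseline
    period and an experiment period."""
    base = mean(baseline_weeks)
    return (mean(experiment_weeks) - base) / base * 100

# Hypothetical weekly Line Impact totals: baseline averages 1000,
# the experiment averages 1100 -- a 10% lift.
lift = experiment_lift([1000, 950, 1050], [1100, 1150, 1050])
```

Given the ~20% week-to-week variance noted above, treat small lifts over short windows as noise; run the experiment long enough for the signal to clear that bar.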
One of the first experiments we ran upon gaining access to developer measurement was to analyze the impact of working from home. Here were the results of our analysis, and here's the corresponding graph that exhibits our per-hour output over the past year:
At face value, it appears that working from home on Wednesday has zero impact on the team's output. But the real story is more nuanced. It's typical for our developers to schedule chores, errands, dental visits, etc., on Wednesdays to reduce the hassle of getting to their appointments. You can see this reflected in the Wednesday morning cells in the above graph, which are less active (= darker) than Tuesday or Thursday mornings. How do they make up the lost time? Check out Wednesday evening -- the only day of the week with significant activity occurring after 5pm. Our developers are rewarding our trust by making up time after hours. It's a win/win that gives us the confidence to continue a cycle of trust. It would also let us spot if we hired a developer bent on abusing that trust (since any stat can be scoped to the individual level).
While working from home yields a neutral result on productivity, it has positive implications on morale, so this experiment is now ensconced as our company policy. Being able to run interesting experiments like this is a cornerstone of the well-measured company.
Best tools for measuring developers in 2020
More detailed comparison available
We've written a detailed standalone piece comparing Pluralsight vs GitClear vs Pinpoint vs Code Climate that you may enjoy if you'd like to dive into all the metrics available and their data source quality.
We've reviewed specific scenarios where measuring developer performance boosts results and creates opportunities. Once you've gathered how your team can benefit from data, the next step is to consider which flavor of measurement tool best matches your needs.
If this article were written five years ago, your options would lie somewhere between "shabby" and "pitiful." GitHub Stats already existed by that point, but their focus on raw lines of code and commits made was exactly as useless as you would predict. In the five years since, the performance measurement space has blossomed to host four "high polish" options worthy of your consideration. Each brings a distinct set of strengths and its own measurement philosophy. Let's review them in chronological order of their launch date.
The first company to plant their flag in the performance measurement space was GitPrime, debuting in 2015. In 2019, GitPrime was acquired by Pluralsight, which subsequently rebranded the product as "Pluralsight Flow" (home page, pricing) in January 2020. The Pluralsight business model relies on cross-selling customers into their tutorial and instruction business ("Pluralsight Skills") when customers seek out git analytics.
Work on Pluralsight Flow is interpreted via four key metrics: "active days," "commits per day," "impact," and "efficiency." The company offers many different report types with limited explanation as to how the data was derived.
In terms of data sources, Pluralsight sources fairly evenly between commits, issues, and lines of code. We have written about the perils of these data sources elsewhere. Like GitClear, Pluralsight offers an on-prem version for enterprise-sized customers, although prices are higher for on-prem installs.
A 50 developer team on the fully featured "Plus" plan will pay $2,495 per month when billed annually.
In early 2017, GitClear (homepage, pricing) debuted with the mission to become the best solution for technical managers who wanted to quickly review code activity across 10+ repos or organization units. To this end, GitClear built robust code review tools, including a graphical Directory Browser that pinpoints tech debt.
In the time since 2017, GitClear has shifted to differentiating based on data source quality. Since competing products rely so heavily on commits, pull requests, and issues, GitClear spent 2019 doubling down to prove that data quality matters. To the extent GitClear succeeds in creating a single reliable metric to capture code activity, managers can unambiguously see how active each developer has been -- whether they're comparing teams by location (remote or on-site), domain (front-end or back-end), seniority (senior, mid-level, code school graduate), or country (outsourced or not). Plenty of companies claim to allow such broad comparisons. As of 2020, a new one falls out of a VC incubator every few months. But among competitors that back up their claims with detailed analysis of how signal is maximized and noise is minimized, GitClear stands alone in the fervor of this pursuit. Here is a graph from another article that outlines how much more exhaustively GitClear has worked to refine Lines of Code into a consistent, reliable data source. Note how many Data Quality Factors on the right side are unique to GitClear:
The bonus for having a data source as consistent as Line Impact? You don't have to hide it from developers. They benefit from being able to review their teammates' work with a set of tools that add clarity where even GitHub and GitLab fall short (for browsing commits and directories).
GitClear offers a fully functional on-prem version for enterprise customers. GitClear's on-prem version carries no additional cost for use.
A 50 developer team on the fully featured "Pro" plan will pay $1,495 per month when billed annually.
Launched in mid-2017, Pinpoint aims to provide a platform to help managers (especially senior managers) compare their teams using various data sources. Pinpoint is different from other choices reviewed in this survey in that analyzing source code is not their main focus. In addition to source code analysis, they "synthesize data from a range of software lifecycle systems" to provide stats like these:
The Pinpoint demo video is one of the few resources the company makes available to evaluate their product. I suppose this lack of transparency makes them more... enterprise-ish... than Pluralsight, GitClear, or Code Climate? In their demo video they state:
Factors we use to evaluate developer performance include number of commits made, how much code that person has contributed across repositories, number of issues worked, as well as his or her personal cycle time and rework rate. A detailed view for each signal is available, just as it is for team performance.
It's straightforward to understand they evaluate "commits made" and "number of issues worked," but the rest of these are pretty opaque. One might deduce that keeping an air of mystery has helped them drive more demo signups?
A 50 developer team using Pinpoint will pay $1,599 per month when billed annually.
A recent entrant in the performance metrics space is Code Climate, which launched Velocity (homepage, pricing) in 2018. As of early 2019, the v2 launch of Velocity is now the focus of the Code Climate homepage, supplanting their long-popular code quality tools. This placement suggests that Code Climate expects Velocity to be the primary focus of their company moving forward.
Velocity recently published a robust comparison between their features and Pluralsight's. If you're considering Velocity, I recommend checking out this page. It illustrates how Velocity shares with Pluralsight a common thread of ideology, features, and even some design elements. For example, both offer a "commit activity log" that use shapes of various color and size to represent commit activity:
In contrasting with Pluralsight Flow, the article points toward "Surfacing issues" as the area where the products diverge most. In their words,
This is the category in which the two analytics tools differ most. Velocity, with PR-related metrics at the core of the product, does a better job drawing attention (inside and outside of the app) to actual artifacts of work that could be stuck or problematic. GitPrime, with mostly people-focused metrics, draws attention to contributors who could be stuck or problematic.

The article concludes that, relative to GitPrime (now Pluralsight), "Velocity has put collaboration at the center of its product, so you'll have access to more insights into the code review process and into PR-related details right out the gate."
A 50 developer team will pay $2,199 per month when billed annually (whereas a 51 developer team will pay $3,299/month 🤔)
A scrappy product seeking to make waves in git analytics is waydev.co (homepage, pricing). Under the heading of "Why we built Waydev," the company states, "as a technology company working with developers, it is almost impossible to understand what your team is doing, and your intuition doesn't help. But Waydev can." Waydev has gone through about 5 different incarnations over the years as they've struggled to find product/market fit. The current version of the product looks on many pages as if it could have been taken straight from the Pluralsight Flow help docs. Sometimes the similarity gets a little too close for comfort, like when their help page explains the Waydev "Impact" metric by obviously plagiarizing Pluralsight's explanation of the same.
If Waydev goes on to publish more pages describing how their code processing works (in their own words), then the attention dedicated to them may increase. At present, their differentiation amounts to one of price. You'll pay about $1,250/month for a team of 50 on Waydev, but we would recommend taking a long look at the data quality before treading these waters. Misleading engineering data can cause strife among developers and drive business decision-making that spawns from a misleading premise (i.e., noisy data).
Uplevel emerged from stealth mode in January 2020, well-armed with $7.5m in its coffers from various investors. As of late January 2020, they have yet to become the first Google result for their name, so suffice it to say, they're a young company.
While documentation on their site is still scant (including no pricing page yet), their home page emphasizes how they intend to analyze "deep work time," "meeting health," "bandwidth," "work activity," and "collaboration." In terms of data sources, this sounds like a blend of pull requests, Jira, and Google Calendar. The scattershot variety of metrics and data sources harkens back to an early version of Pinpoint. It will be interesting to see how the team works to differentiate from that comparison point.
What makes code measurement possible?
Skeptics of code measurement have years of history, not to mention primitive tools like GitHub Stats, that they can cite to substantiate their distrust in "lines of code" as a measurement tool. Historically, their skepticism has been well-founded. This companion article goes in depth on the data sources used by modern code data processors.
In essence, there are three data sources that are used to drive code measurement: lines of code, issues/pull requests, and commits. These data sources are not created equal -- our list of the worst data metrics used to measure developers showcases how things can go awry when these data sources aren't processed with sufficient precision.
On securing developer buy-in
One of the inevitable questions asked by new clients as they begin down the road toward measuring developer output is: how do I explain this to my team? It is a fundamental question to address, especially if the manager intends to use code metrics as a component in performance reviews.
To earn developer buy-in, bring them along as you adopt measurement. Image credit: pexels.com
There's no single answer that is going to satisfy every developer personality. Programmers skew toward high intellect, which often goes hand-in-hand with skepticism toward authority and broad mandates issued by management. The general approach I recommend to get your developers to embrace measurement: be transparent and just. Put yourself in their shoes and consider how you'd feel if a newly introduced tool put you in the bottom half of performers. If it's fair, and fairly applied, you'd come to terms with it. But that probably wouldn't be your first instinct.
Below are three ideas we've seen used to help developers grow comfortable with performance measurement. These focus on strategies available to GitClear users, since those are the ones with which we have the most familiarity. Here's a blog post describing how a Pluralsight manager gets buy-in from his small team.
Let developers see how specific commit activity corresponds to measured output. This is where code review tools can do double duty. They were created because looking through commits on GitHub one. at. a. time. is an incredibly inefficient use of a developer's time and we knew we could do better. But when it comes to securing developer buy-in, the code review tools serve a second purpose: allowing developers to get a tangible sense for how their work becomes Line Impact on a file-by-file, commit-by-commit basis. The longer a developer uses GitClear, the less mysterious it feels, as they develop an intrinsic sense for how measurement occurs.
Get empirical with your leaderboard. If you're a CTO or Engineering Manager using GitClear, you can configure Line Impact so that it matches your empirical sense of your team's top performers. Most managers begin with at least a vague sense for who's making the greatest contribution to their team. The better a new measurement aligns with the team's existing beliefs about its top performers, the better the confidence in the measurement.
Make clear that no single tool or approach tells the entire story. This is the last and most important factor to aid buy-in. One of the most common concerns among new users goes something like: "the developers that write the most lines of code aren't the most desirable -- it's closer to the opposite. The most valuable developers are those who don't create bugs, mentor others, pay down tech debt, trim crufty methods, are easy to get along with, etc." To this we respond: yes! Absolutely. And that's exactly why an algorithm will never replace the value of a great engineering manager. Looking at a metric like Line Impact provides one valuable piece of the puzzle when it comes to evaluating a developer's total package. But there are many pieces that software could never begin to approximate -- like whether a prolific developer's work is on the tickets they were assigned. It's essential that, when you introduce a measurement tool, you make clear that code measurement is just one aspect among many that inform your evaluation.
Transparency always wins on a long enough time horizon
The idea that we can measure development is new, so it's natural for it to face early skepticism. Consider how big a departure this is from tools a manager has used in the past. It carries the baggage of all the ineffective measurement systems of yesteryear. It begets winners and losers, which requires a strong manager to message effectively. Even with effective messaging, the possibility of early, instinctual pushback from developers is real. It seems expensive.
For some, it will take a leap of faith to believe that improved transparency justifies the cost. This final section glances back through similar inventions of the past to extrapolate whether that trust is warranted. Over the last 20 years, inventions that improve transparency have followed a pattern. In the first phase, the invention is ignored, often because it lacks sufficient data to inform its projections. In the next phase, it begins to gain traction. Oftentimes in this phase, the invention dramatically improves life for a tiny number of users who love it. The third phase is pushback, which begins tepidly and grows fireball-hot, as parties who benefited from past opaqueness begin to suffer from improved information. The final phase is broad public awareness. With this comes general acknowledgement that, while the system will never be perfect, its users could never go back to the way things were.
A small sampling of companies that have followed this arc, along with their detractors:
- Consumer Reports. Hated by: Companies with substandard products.
- Tripadvisor. Hated by: Motel 6.
- Zillow / Zestimate. Hated by: Realtors and their lobbyists.
- Metacritic. Hated by: Adam Sandler.
- Airbnb. Hated by: Marriott, Hyatt, and especially Holiday Inn .
- Yelp. Hated by: Every restaurant, especially Applebees.
- Redfin. Hated by: Realtors and their lobbyists.
- Glassdoor. Hated by: Retail stores.
Think about the first time you encountered Zillow. If it was early enough in their history, you probably weren't impressed. It takes time and a lot of iterations to gather enough data to make an algorithm like Zestimate consistent and reliable. Estimating the price of a home requires factoring in more nebulous, real-world variables than any code measurement algorithm. But before long, early adopters discover the tool, and if it affords them an advantage over total opaqueness, they use it. Once enough early adopters tell their friends, the parties that had benefited from lack of information begin to fight back. The Wikipedia page for Consumer Reports dedicates an entire section to the lawsuits that have been pursued against it. They're proud to have won all 13.
In the final phase -- that of broad public awareness -- the world becomes a little clearer, a more predictable place. If you live in a major city, the quality of the food you eat and the service you receive is better than it has been at any point in history. Ask a business owner about Yelp and you'll hear a litany of complaints about the unjust reviews they've received. Most business owners prioritize local effects over global effects, and that's fine. What matters is not perfection -- just that the biases of the measurement are minimized. In spite of its imperfect methods, most Americans would never consider going back to relying on intuition over Yelp.
As awareness around developer measurement grows, the early adopters gain an advantage over their competitors by running their team 5-15% more efficiently. The difference sounds small-ish, until you add in the benefits to morale from the experiments you can run on more liberal, employee-friendly policies. Code transparency is still approaching phase two of its adoption pattern. But it won't take long until smarter companies recognize they can gain an information advantage in engineering. As the cycle picks up steam, the tools will keep growing more polished, until it's hard to remember a world where managers had to guess what their developers were working on. Multiplying the yield of your engineering budget by even a paltry 5% adds up in a hurry given current developer salaries.
Thanks for making it down to full scroll bar territory! I hope you better understand how developer measurement has changed over the past few years. If you manage more than 5 developers, you can get ahead of the curve.