Lines of Code Breakdown: A Compositional Analysis
Results of analyzing 10 million lines of code across the largest Open Source projects
Last updated January 21, 2021
There are a lot of tools that provide stats on lines of code (LoC). Conventional wisdom has long held that these metrics are fraught, but absent hard data, it has only been possible to gesture toward the disadvantages of relying on LoC, without statistical proof.
GitClear has previously asserted that only 5% of lines of code meaningfully evolve the repo's code base. Because it is an extraordinary claim that 95% of LoC is noise, it is incumbent upon us to substantiate this claim with data. That is the purpose of this page.
The funnel below aggregates real-world lines of code measurements across 171,435 commits in 60 open source repos between October 22, 2020 and January 21, 2021.
First step: All changed code lines
11,176,047 changed lines of code factored into analysis
The total lines of code in our most recent data set. This includes all lines that changed in any commit, so it is equivalent to the "Lines of Code" metric provided by GitHub or Pluralsight Flow.
Distinct commits
7,962,507 lines remain
Distinct: Ignore duplicated fragments
This step removes all lines of code that occurred in a branch that was later discarded, as well as code that was committed in multiple branches or repos.
Removes 3,213,540 lines
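As a rough illustration of what a deduplication pass like this might look like, the sketch below drops lines from discarded branches and counts a repeated change only once. It is not GitClear's implementation: the `ChangedLine` record, the `distinct_lines` helper, the sample branch names, and the choice of `(path, content)` as the duplicate key are all assumptions made for the example.

```python
from typing import NamedTuple


class ChangedLine(NamedTuple):
    # Hypothetical record for one changed line; not a GitClear data type.
    branch: str
    path: str
    content: str


def distinct_lines(lines, discarded_branches):
    """Keep each (path, content) change once and skip discarded branches."""
    seen = set()
    kept = []
    for line in lines:
        if line.branch in discarded_branches:
            continue  # work that never survived should not be counted
        key = (line.path, line.content)
        if key in seen:
            continue  # same change already counted on another branch
        seen.add(key)
        kept.append(line)
    return kept


lines = [
    ChangedLine("main", "app.py", "total = price * qty"),
    ChangedLine("release", "app.py", "total = price * qty"),  # duplicate change
    ChangedLine("spike", "app.py", "print('debug')"),         # discarded branch
    ChangedLine("main", "app.py", "return total"),
]
kept = distinct_lines(lines, discarded_branches={"spike"})
```

In this toy input, four changed lines reduce to two distinct ones. A real tool would need a sturdier notion of identity than exact text matching.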
Effecting
6,707,761 lines remain
Effecting: Remove semantic lines
This step removes changes that only modify white space, blank lines, language keywords (e.g., begin, include), or other types of lines that don't contain meaningful code content relative to the file type.
Removes 1,254,746 lines
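A filter of this kind can be sketched in a few lines. The version below is illustrative only: the `KEYWORD_ONLY` set is a hypothetical list (the text names only begin and include as examples), and `is_noise_line` and `whitespace_only_change` are made-up helper names, not part of any GitClear API.

```python
import re

# Hypothetical set of keyword-only or punctuation-only lines.
# The page itself only gives "begin" and "include" as examples.
KEYWORD_ONLY = {"begin", "end", "include", "else", "{", "}"}


def is_noise_line(line: str) -> bool:
    """Return True when a changed line carries no meaningful code content."""
    stripped = line.strip()
    if not stripped:
        return True  # blank or whitespace-only line
    return stripped.rstrip(";") in KEYWORD_ONLY


def whitespace_only_change(old: str, new: str) -> bool:
    """Return True when two versions of a line differ only in whitespace.

    Stripping all whitespace is a simplification; it would also treat
    "a b" and "ab" as equal, which a real tool would need to avoid.
    """
    def collapse(text: str) -> str:
        return re.sub(r"\s+", "", text)

    return old != new and collapse(old) == collapse(new)


changed = ["", "    ", "end", "}", "total = price * qty", "begin"]
kept = [line for line in changed if not is_noise_line(line)]
```

Of the six sample lines, only `total = price * qty` survives. The other five are blank, whitespace-only, or a bare keyword or brace.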
Substantive
4,794,529 lines remain
Substantive: Negate batch operations
Line Impact approximates cognitive load per commit. Operations like cut/paste and find/replace change many lines but do not represent high cognitive load, so they are discarded by this step.
Removes 1,913,232 lines
Purposeful
702,174 lines remain
Purposeful: Rinse commit artifacts
To normalize away the difference between a developer who commits 100 times vs 1 time daily, we identify churned code, and we devalue large-scale additions (like new libraries).
Removes 4,092,355 lines
Meaningful lines of code
Once you've cut through all the layers of noise that cloud lines of code, you find only a fraction of code evolving its repo in a meaningful way.
702,174 (6.3%) impacting lines remain
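The funnel arithmetic can be checked with a short script. The counts below are copied from the steps above; the `run_funnel` helper and the variable names are introduced here only for illustration.

```python
# Counts copied from the funnel steps on this page.
TOTAL = 11_176_047
STEPS = [
    ("Distinct", 3_213_540),
    ("Effecting", 1_254_746),
    ("Substantive", 1_913_232),
    ("Purposeful", 4_092_355),
]


def run_funnel(total, steps):
    """Return (step name, lines remaining, share of total) after each step."""
    remaining = total
    rows = []
    for name, removed in steps:
        remaining -= removed
        rows.append((name, remaining, remaining / total))
    return rows


rows = run_funnel(TOTAL, STEPS)
for name, remaining, share in rows:
    print(f"{name:<12} {remaining:>10,} lines remain ({share:.1%})")
```

Each subtraction reproduces the "lines remain" figure reported for its step, and the final share rounds to 6.3% of the 11,176,047 changed lines.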
How much noise does your analysis tool let through?
Since other git stat tools (including those that profess to offer "Engineering Insights") skip some or all of the steps above, the "insights" that they offer are as likely as not to be false positives or commit artifacts.
If you would like to extract the fraction of lines of code that corresponds to meaningful work by developers, consider signing up for a free GitClear trial, or a demo.