Bill is driven by the challenge of how best to quantify valuable questions that defy quantification. It's possible this instinct may have been awakened in Bill at age 14, when he won a soft, stuffed bunny at the orthodontist for guessing the number of jelly beans in the gumball machine.
Comments
Thanks for writing this! An obvious one for me is what is going to be the impact on the junior developers? Are they missing something while shifting into "system integrators"? How do we restructure our mentorship to replace the learning that used to happen when working hard through logical problems or algorithms that AI can now handle instantly?
Is the "Cost of Verification" starting to outweigh the "Speed of Creation"? Generating 500 lines of code takes seconds; verifying that those 500 lines don’t contain a security vulnerability or a logic bomb can take an hour. If our senior devs are spending their time auditing "fast" code from juniors, ROI could be negative. There is probably a sweet spot, just as there is between skipping code review entirely and having multiple reviewers in deep discussion with final approval from a single benevolent dictator. I think ultimately we need a paradigm shift here - AI reviewing AI with the humans involved at a meta level -- something like sampling code for inspection.
The last question I'll ask is around the cost of these tools. Today, we are seeing great returns on investment. We can sure find ways to spend tokens... how stable is this pricing?
> An obvious one for me is what is going to be the impact on the junior developers
Great point, one that I've been meaning to write about for a while. My guess is that junior developers are going to need to start doing more to prove their value before they get an interview. It's now possible for a junior dev to build their own sprawling side project, and given that the big companies will want fewer, better devs, I think that they are going to be choosy about picking devs that show proactive tendencies & a proven record of leveraging AI to its fullest.
> I think ultimately we need a paradigm shift here - AI reviewing AI with the humans involved at a meta level -- something like sampling code for inspection.
I hadn't considered "sampling" per se, but I've been thinking that any time I am quickly scrolling through a PR, I might be scrolling past code that doesn't need to be seen at all, to reserve more attention for the code that *really* matters.
The problem is, I keep leaving almost as many comments on CSS & tests as I do on substantive libraries. I can't decide whether that's because I really believe that having well-factored CSS/tests is going to be a big deal in the long-term, or more because it's easier to evaluate those than it is changes to models & libraries.
At any rate, it will be unsustainable for humans to review the volumes of code currently being generated, so there is going to be a lot of momentum toward delegating review to AI, for sure.
Thanks for taking the time to share your thoughts. Hope your questions inspire other answers!
My name is Brian and I am the Chief Engineering Officer at Apporto. When I joined 15 months ago, the team was struggling to release a new version of a legacy product. GitClear helped me identify engineers lagging behind in valuable contributions and I used GitClear to help them identify patterns to success and value add. The release cycles are now repeatable and impactful.
AI is the big topic now, and as an engineering leader I need to balance the hype the industry puts out there against the reality of working on complex legacy code. My questions are:
1) Can I measure the impact and multiplying factor of using AI in various types of projects? For example: greenfield, maintenance, or a mix where we are adding new features to existing legacy (spaghetti) code. Do greenfield projects see a much greater impact from AI?
2) Longevity - does code submitted by AI, or maybe by various levels of AI user groups, lead to more or fewer reports of defects from QA or customers, leading to REWORK?
3) What is the return value on tests created while using AI to generate new code or add tests to existing code?
Bonus: Is there a way to measure the complexity of a PR (and whether AI was used or not)? I would like to see if complex debugs spanning multiple connected services benefit from AI. Are the models able to deal with the scope and complexity of the interconnected endpoints?
Bonus Bonus: Can GitClear and AI measure CD projects - containerization, DevOps, deployment?
Hey Brian, great to hear from you on this one, given how much attention I know you've spent thinking about this stuff.
1) My intuition (from experience over the last few months) is that AI does indeed contribute much greater value to greenfield projects at the moment, but I think that every generation gets a bit more capable of working in the complex circumstances of a production-scale product. Since we started assigning AI model => changed code lines in the past month, I think that the Directory Browser will increasingly help us to see how the velocity of a repo differs based on how new/greenfield it is. Beyond that, your comment makes me wonder if our intention to begin instrumenting "Diff Delta per dollar" will be a usable approach to understanding how much value AI is imparting. I presume it would show that repos with more AI use cost less per Diff Delta than projects without, since humans tend to be more expensive for implementing big features on greenfield projects at this point.
2) Yep, this is the $1m question! We've already found (in our previous research report) that more AI = about 9x more churn, so in that sense, longevity is certainly reduced. But I think the better way to evaluate this is by analyzing the data that GitClear recently began collecting on "which lines were authored by which LLM model?", combined with our existing infrastructure that can traverse, when a Bug issue is resolved, back to the lines that preceded the code that fixes the bug. I suspect that we will find a greater percent of these lines are AI authored than usual, but the "AI ROI Stats" also recently began showing the propensity of AI-authored code to result in defects on a per-team basis, which seems like it will pair well with the large-scale research on the topic we plan to publish in the coming weeks.
3) My hunch from experience is that the return value ranges from "small" to "tiny", with much of the value being lost when AI indulges its tendency to want to mock every response in a way that ensures the tests will pass. I'm hoping that our new (as of last weekend) tracking of how much AI uses mocks can help to illuminate its worst tendencies, so as to force it to work harder to write meaningful tests.
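To make the "Diff Delta per dollar" idea from (1) a bit more concrete, here is a minimal sketch of the arithmetic. It is only an illustration: the `RepoPeriod` fields, the way cost is split between engineers and AI spend, and every number below are made-up placeholders, not GitClear's API or real data.

```python
from dataclasses import dataclass


@dataclass
class RepoPeriod:
    """Hypothetical per-repo totals for one reporting period."""
    name: str
    diff_delta: float     # total Diff Delta credited in the period
    engineer_cost: float  # engineer cost attributed to the repo, in dollars
    ai_cost: float        # token/subscription spend, in dollars


def diff_delta_per_dollar(repo: RepoPeriod) -> float:
    """Diff Delta produced per dollar of combined human and AI spend."""
    total_cost = repo.engineer_cost + repo.ai_cost
    if total_cost <= 0:
        raise ValueError(f"{repo.name}: total cost must be positive")
    return repo.diff_delta / total_cost


# Placeholder numbers, invented purely to exercise the function.
repos = [
    RepoPeriod("greenfield-app", diff_delta=12_000, engineer_cost=40_000, ai_cost=2_000),
    RepoPeriod("legacy-monolith", diff_delta=6_000, engineer_cost=55_000, ai_cost=500),
]

# Rank repos from most to least Diff Delta per dollar spent.
for repo in sorted(repos, key=diff_delta_per_dollar, reverse=True):
    print(f"{repo.name}: {diff_delta_per_dollar(repo):.3f} Diff Delta per dollar")
```

Comparing this ratio across repos with different levels of AI use would only be meaningful if costs are attributed consistently, which is the hard part in practice.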