Hey LLM! Please help me understand this system

Think “rubber duck development”

Venkatesh-Prasad Ranganath
3 min read · May 8, 2023

In a recent blog post, I expressed my reservations about using ML-based code completion support, including using LLM-based agents to write code for me. Since then, I have had conversations about whether and how such agents can be used to increase the productivity of software developers. And I had an aha moment in one such conversation.

What if I can have a conversation with an LLM-based agent to understand an existing legacy system?

Think of rubber duck development

Context

Onboarding developers onto an existing system takes a lot of work.

We can easily point developers to the associated code base and information sources such as the version history of the code base, test execution logs, communication artifacts linked to the code (e.g., bug reports, group emails, release notes), and documents related to the code (e.g., requirements docs, design docs, user-facing docs).

Even with rich information sources, developers spend a lot of time searching and stitching information to reconstruct the institutional knowledge about the system, i.e., the constraints and decisions along with their surrounding context that led the system to evolve to its current version. Along the way, they spend a good amount of time reconciling different views of the system. This process exposes the limitations in the system that they may try to fix or decide to work around.

Heck, even developers familiar with a system spend quite a bit of time reconciling the effects of recent changes when they return to the system after being away from it.

How would this work?

In the above context, let’s assume we have trained an LLM-based agent to have conversations in a natural language (say, English) about prevalent programming and coding best practices. Furthermore, let’s assume this agent has been extensively trained on the existing system of interest to consider both the explicit artifacts, such as the code base, and the implicit artifacts that capture institutional knowledge.

With such an agent, I want to have the following kind of conversation.

Me: What’s the architecture of this system?

Agent: Here’s the architecture diagram, along with a brief description of the responsibility/purpose of each component.

Me: What are the artifacts associated with component X?

Agent: Here’s the list of artifacts associated with component X.

Me: Can I safely update the name2connection map at line 630 in Connections.java without breaking any constraints?

Agent: The name2connection map is always updated along with the connections list. Here are a few example locations. Also, the test test_updateConnection will succeed only if connections is updated along with name2connection. So, you should update connections immediately after line 630.

And, so it goes.
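To make the agent’s answer concrete, here is a minimal Java sketch of the kind of invariant it is describing. The class and member names (Connections, name2connection, connections, test_updateConnection) come from the example conversation above; everything else is assumed purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the Connections.java discussed above.
// Assumed invariant: the map and the list always describe the same
// set of connections.
public class Connections {

    public static class Connection { /* details omitted */ }

    private final Map<String, Connection> name2connection = new HashMap<>();
    private final List<Connection> connections = new ArrayList<>();

    public void updateConnection(String name, Connection conn) {
        // Updating only the map (the change contemplated at "line 630")
        // would silently break the invariant ...
        Connection previous = name2connection.put(name, conn);
        // ... so the list must be kept in sync, as the agent recommends.
        if (previous != null) {
            connections.remove(previous);
        }
        connections.add(conn);
    }

    // A test such as test_updateConnection would assert that both views
    // of the data stay in sync after every update.
    public boolean isConsistent() {
        return connections.containsAll(name2connection.values())
                && name2connection.values().containsAll(connections);
    }
}
```

The point of the conversation is not this specific code but that the agent surfaces such implicit constraints, and the tests that encode them, exactly when a developer is about to violate them.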

Is this a novel idea?

Existing techniques like program slicing and specification mining aim to automatically extract such dependencies and relations in code bases to ease program comprehension (among other things). Modern-day IDEs already make such tasks easier. To integrate institutional knowledge into developers’ workflows, various efforts have adapted techniques from information retrieval and recommender systems to speed up tasks like bug triaging, debugging (e.g., Debug Advisor), and customer support.

So, the idea of intelligent support to help developers understand an existing system is old.

So, what is new here?

With the ever-increasing complexity of modern software systems, can we leverage the following to improve developer productivity?

  1. The ability to interact with a context-aware conversational agent in a natural language.
  2. The ability to create an intelligent agent that can learn from a large corpus of data and provide relevant information tailored to the context of a conversation.
  3. The availability of rich information sources (e.g., bug repository, version control system, documentation) about software systems.
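Putting these three pieces together, such an agent might be wired roughly as sketched below. This is only an illustration under assumptions: the KnowledgeSource, LanguageModel, and SystemUnderstandingAgent types are hypothetical and do not correspond to any particular library or product.

```java
import java.util.List;

/** Ability 3: a rich information source about the system
    (code base, version history, bug reports, design docs, ...). */
interface KnowledgeSource {
    List<String> retrieveRelevantSnippets(String question);
}

/** Abilities 1 and 2: a conversational model that can answer in natural language. */
interface LanguageModel {
    String complete(String prompt);
}

/** A context-aware, conversational "search engine" over institutional knowledge. */
class SystemUnderstandingAgent {
    private final List<KnowledgeSource> sources;
    private final LanguageModel model;
    private final StringBuilder conversationSoFar = new StringBuilder();

    SystemUnderstandingAgent(List<KnowledgeSource> sources, LanguageModel model) {
        this.sources = sources;
        this.model = model;
    }

    String ask(String question) {
        // Gather explicit and implicit artifacts relevant to the question.
        StringBuilder context = new StringBuilder();
        for (KnowledgeSource source : sources) {
            for (String snippet : source.retrieveRelevantSnippets(question)) {
                context.append(snippet).append('\n');
            }
        }
        // Carry the conversation history so follow-up questions stay context-aware.
        String prompt = "Context:\n" + context
                + "\nConversation so far:\n" + conversationSoFar
                + "\nDeveloper: " + question + "\nAgent:";
        String answer = model.complete(prompt);
        conversationSoFar.append("Developer: ").append(question).append('\n')
                         .append("Agent: ").append(answer).append('\n');
        return answer;
    }
}
```

Whether the artifacts reach the model via retrieval at question time (as sketched), via training or fine-tuning, or both is an open design choice; the point is that the conversational front end sits on top of the same information sources developers already consult.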

Looking through the lens of tech in the 2000s

Can LLM-based agents act as continuously evolving, focused “search engines” that provide conversational access to both explicit (inherent) and implicit (institutional) knowledge about a system in a context-aware manner?
