This is how I built a PR review agent that does not rely on one giant prompt.
The hard part was not getting an LLM to comment on a diff. The hard part was making sure it did not miss the file that actually mattered.
The constraint stack
-> PR has 3,000 lines changed
-> Full diff is 180k tokens
-> Model context is only 128k tokens
-> Worst bug is in file 37
-> Model never sees file 37
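The stack above is easy to reproduce with a little arithmetic. This is a minimal sketch, assuming a hypothetical PR of 45 files at 4,000 tokens each (the only figures given here are the 180k and 128k totals) and a naive loader that packs files in diff order until the context is full:

```python
CONTEXT_LIMIT = 128_000  # model context, from the example above

# Hypothetical PR: 45 files of 4,000 tokens each = 180k tokens in total.
files = [(f"file_{i:02d}", 4_000) for i in range(1, 46)]


def pack_in_order(files, limit):
    """Naively pack files in diff order until the token budget runs out."""
    seen, used = [], 0
    for name, tokens in files:
        if used + tokens > limit:
            break  # everything after this point is silently dropped
        seen.append(name)
        used += tokens
    return seen


seen = pack_in_order(files, CONTEXT_LIMIT)
print(len(seen), "file_37" in seen)  # prints: 32 False
```

Under these assumptions only the first 32 files fit, so file 37 is cut before the model reads a single line of it. Instructions and the model's own output also consume context, so the real cutoff would likely come even earlier.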
"How do you put a PR inside an LLM?"
That is the wrong question. The real one:
"How do you design a review system that does not miss the file that matters?"
Design checklist
01 Never depend on one giant prompt (architecture)
02 Parse the full PR first (architecture)
03 Build a change graph (risk)
04 Scan every file at least once (coverage)
05 Separate light scan from deep review (architecture)
06 Rank by risk, not file size (risk)
07 Pull related context on demand (context)
08 Use multiple reviewers (architecture)
09 Keep external memory (memory)
10 Track coverage explicitly (coverage)
11 Force cross-file checks (risk)
12 Escalate uncertainty (coverage)
The correct design
Diff parser: extracts a structured change inventory from the raw diff
Change graph: builds a dependency graph across all changed files
Risk scorer: assigns a risk score to every changed file
Retrieval engine: fetches callers, tests, interfaces, and old behavior on demand
Multi-pass reviewers: runs specialized agents for security, correctness, API, performance, and tests
External memory: stores summaries, risks, links, and review status outside context
Coverage tracker: tracks scan, review, skip, and flag status for every file
Final verifier: merges reviewer outputs and generates the report
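To make the change graph and risk scorer stages concrete, here is a rough sketch of how they might fit together. Everything in it is an assumption for illustration: the `ChangedFile` fields, the `SENSITIVE` keywords, and the scoring weights are invented, and a real scorer would need tuning against actual review data.

```python
from dataclasses import dataclass, field


@dataclass
class ChangedFile:
    path: str
    lines_changed: int
    # Other changed files this one depends on (hypothetical field).
    imports: list = field(default_factory=list)


def build_change_graph(files):
    """Map each changed file to the changed files that depend on it."""
    dependents = {f.path: set() for f in files}
    for f in files:
        for dep in f.imports:
            if dep in dependents:
                dependents[dep].add(f.path)
    return dependents


SENSITIVE = ("auth", "payment", "migration")  # illustrative keywords only


def risk_score(f, dependents):
    """Score by blast radius and sensitivity; line count is deliberately ignored."""
    score = 2.0 * len(dependents[f.path])  # made-up weight per dependent
    if any(keyword in f.path for keyword in SENSITIVE):
        score += 5.0  # made-up bonus for sensitive paths
    return score


files = [
    ChangedFile("docs/generated_api.md", 2400),
    ChangedFile("auth/session.py", 12),
    ChangedFile("api/handlers.py", 80, imports=["auth/session.py"]),
    ChangedFile("api/routes.py", 40, imports=["auth/session.py"]),
]
graph = build_change_graph(files)
ranked = sorted(files, key=lambda f: risk_score(f, graph), reverse=True)
print(ranked[0].path)  # prints: auth/session.py
```

The 12-line auth change outranks the 2,400-line generated doc because two other changed files depend on it. That is the point of ranking by risk rather than by file size.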
The correct design is not a prompt. It’s a pipeline.
Diff parser → Change graph → Risk scorer → Retrieval engine → Multi-pass reviewers → External memory → Coverage tracker → Final verifier
That pipeline decides what the model inspects next. Each stage feeds the next. Coverage is explicit. Context is budgeted per reviewer, not per PR. Risk drives ordering.
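Explicit coverage can be as simple as a status table that the pipeline must empty before it is allowed to report. The sketch below is one possible shape, not a reference implementation: the `CoverageTracker` class and the `PENDING` state are assumptions added for illustration, while the other statuses mirror the scan, review, skip, and flag states listed for the coverage tracker stage.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"    # assumed initial state: nothing has looked at the file
    SCANNED = "scanned"
    REVIEWED = "reviewed"
    SKIPPED = "skipped"
    FLAGGED = "flagged"


class CoverageTracker:
    def __init__(self, paths):
        self._status = {p: Status.PENDING for p in paths}
        self._reasons = {}

    def mark(self, path, status, reason=""):
        if path not in self._status:
            raise KeyError(f"{path} is not part of this PR")
        if status is Status.SKIPPED and not reason:
            raise ValueError("a skip must carry a reason")
        self._status[path] = status
        if reason:
            self._reasons[path] = reason

    def unseen(self):
        """Files that no stage has touched yet."""
        return [p for p, s in self._status.items() if s is Status.PENDING]

    def is_complete(self):
        return not self.unseen()


tracker = CoverageTracker(["a.py", "b.py", "c.py"])
tracker.mark("a.py", Status.REVIEWED)
tracker.mark("b.py", Status.SKIPPED, reason="generated lockfile")
print(tracker.unseen())       # prints: ['c.py']
print(tracker.is_complete())  # prints: False
```

Requiring a reason for every skip keeps "the model never saw it" from being mistaken for "the model decided it was fine".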
You don’t fight the model’s context limit. You work within it by breaking the review into pieces, routing each piece to a specialized reviewer, and tracking what’s been seen.
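One way to break the review into pieces is to batch files greedily, in risk order, so that each reviewer call stays under its own budget. This sketch assumes a hypothetical 100k-token budget per call (128k minus room for instructions and retrieved context) and invented file names and sizes that add up to the 180k total:

```python
def make_batches(files, budget):
    """Group (path, tokens) pairs, already sorted by risk, into batches within budget.

    A file larger than the budget ends up alone in its own batch, so it is
    never dropped; a real system would still need to split it further.
    """
    batches, current, used = [], [], 0
    for path, tokens in files:
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(path)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Hypothetical files in risk order; sizes sum to 180k tokens.
ranked = [
    ("auth/session.py", 60_000),
    ("api/handlers.py", 50_000),
    ("api/routes.py", 40_000),
    ("docs/changelog.md", 30_000),
]
batches = make_batches(ranked, budget=100_000)
print(batches)
# prints: [['auth/session.py'], ['api/handlers.py', 'api/routes.py'], ['docs/changelog.md']]
```

Three calls cover all 180k tokens, and no single call comes close to the 128k limit.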
That is how you review a 180k-token PR with a 128k-token model.