Why We Built Manthan Intelligence
And What It Means For Knowledge Work
By Mayank Mathur, Founder — Manthan Intelligence, Operating Partner — Tavaga Advisory Services
I. THE OBSERVATION
I am creating a venture fund with two people and forty-three AI agents.
That is not a metaphor. Tavaga Advisory Services has exactly two human partners — Nitin and me — and forty-three autonomous software agents that operate across eight divisions: Engineering, Data Engineering, Go-To-Market, Product, Human Resources, Investment Committee Pipeline, Consulting, and Finance & Capital. Each agent has a defined mandate, a specific analytical framework, the ability to communicate with agents in other divisions, and learning loops that make it better at its job every week.
This is not how we planned to build a venture fund. We planned to hire analysts. But somewhere in early 2026, we realised that the architecture we were building for our own investment analysis had become more interesting — and more valuable — than the analysis itself.
The products we now offer were not designed as products. They were built to solve our own problems — the screening, the pattern-matching, the institutional memory, the cross-divisional coordination. Every tool we sell is a byproduct of running an AI-native fund. The knowledge graph exists because we needed comparative intelligence. The calibration loop exists because we needed to know when we were wrong. The analytical council exists because we needed twelve perspectives, not one. We turned our operating system into a product only after it proved itself on our own money.
II. THE PROBLEM NOBODY IS SOLVING
Here is an uncomfortable truth about artificial intelligence in 2026: 77 percent of knowledge workers say AI has added to their workload, not reduced it.*
* Upwork 2025 Future Workforce Index, 2,500+ knowledge workers across the US, UK, Australia, and Canada.
Over a trillion dollars invested. Thousands of products. And the majority of the people these products were built for say they are worse off.
The copilot trap. ChatGPT. Gemini. Copilot. Perplexity. Each solves one narrow task. The problem is that a knowledge worker doesn’t have one narrow task; they have a web of interconnected decisions that require context, memory, and judgement. Add fifteen AI tools to twenty existing tools and you get thirty-five tools, plus roughly 2.5 work-weeks per year lost to context switching. Copilots don’t reduce complexity — they distribute it across more interfaces.
The developer platform trap. Claude Code. Cursor. Replit Agent. These are genuinely powerful — for people who want to become software developers. Irrelevant if your goal is to screen two hundred pitch decks a year, manage a hundred client portfolios, or run a consulting engagement.
There is a third path that almost nobody is building: autonomous analytical systems that do the work itself — not tools for the expert to use, not platforms for the expert to build on, but systems that deliver the completed analysis and the institutional memory that compounds.
When a VC partner submits a pitch deck to our system, they do not receive a tool to help them analyse it. They receive the analysis. The partner’s job is not to learn our system. The partner’s job is to apply their judgement to the signal we produce. Everything else — the screening, the comparable analysis, the framework application, the pattern matching — is execution. We do the execution. They do the judgement.
III. WHY SINGLE-AGENT SYSTEMS FAIL
The default architecture in 2026 is a single AI agent with a long prompt. This is the architecture behind most “AI analyst” products on the market. It is fundamentally broken for expert work.
Anchoring bias. A single agent forms its thesis in the first few paragraphs and spends the rest confirming it. No second perspective challenges the initial frame. Coherence is not correctness.
Consensus collapse. When you ask one agent to “consider multiple perspectives,” it simulates disagreement but cannot sustain it. By the conclusion, all perspectives have been averaged into a single recommendation. The tension — where the real signal lives — has been destroyed.
Context poverty. A single agent knows nothing about what it analysed yesterday. Every analysis starts from zero. The equivalent of hiring a new analyst for every deal.
Methodology vacuum. Without structured analytical frameworks, a single agent defaults to “things to consider” lists — comprehensive, vaguely relevant, and useless for decision-making.
We benchmark our system against vanilla Claude, ChatGPT-4o, and sophisticated single-agent prompts. On a matched sample of deals with known outcomes, the multi-agent architecture scores 66.7 percent weighted accuracy versus 41 to 48 percent for the single-agent baselines. The gap is not marginal.
IV. THE ARCHITECTURE OF DISAGREEMENT
The intellectual foundation of Manthan Intelligence is that the best decisions come from structured tension between genuinely different perspectives — not from consensus.
When a company enters our pipeline, a Deal Memo Analyst first processes the pitch deck into a structured evidence base — separating facts from assumptions, identifying gaps, producing the shared foundation that all subsequent analysis builds on. This is not “ChatGPT reading a document.” It is a specific methodology where reproducibility is a first-class requirement.
Then independent analytical personas evaluate that evidence. Each embodies a different investment philosophy. They do not read each other’s work. They cannot anchor on each other’s conclusions. They commit to a position with a confidence level.
The natural tensions that emerge are the product’s soul: growth versus unit economics; founder capability versus market structure; short-term catalysts versus structural defensibility; technology moat versus execution speed; capital efficiency versus market capture; network effects versus product quality. The disagreement between a bullish and bearish view is signal, not noise.
A synthesis layer then identifies which disagreements are resolvable with more information and which are fundamental. It produces a signal summary that preserves the disagreement rather than averaging it away.
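To make the shape of that synthesis concrete, here is a minimal sketch of a signal summary that keeps bull/bear tension visible instead of collapsing it into one number. The class names, verdict labels, and the 0.7 confidence threshold are illustrative, not our production code:

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    persona: str        # e.g. "unit-economics hawk", "growth maximalist"
    verdict: str        # "INVEST" or "PASS", committed independently
    confidence: float   # 0.0 to 1.0, recorded before synthesis

def synthesize(assessments: list[Assessment]) -> dict:
    """Preserve disagreement rather than averaging it away.

    A tension is 'fundamental' when both sides are confident (more
    information will not resolve it) and 'resolvable' when at least
    one side is unsure (more information could move it).
    """
    bulls = [a for a in assessments if a.verdict == "INVEST"]
    bears = [a for a in assessments if a.verdict == "PASS"]
    tensions = []
    for bull in bulls:
        for bear in bears:
            kind = ("fundamental"
                    if min(bull.confidence, bear.confidence) >= 0.7
                    else "resolvable")
            tensions.append({"bull": bull.persona,
                             "bear": bear.persona,
                             "type": kind})
    return {
        "verdict_split": {"INVEST": len(bulls), "PASS": len(bears)},
        "tensions": tensions,  # the signal lives here, not in an average
    }
```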
This methodology is calibrated against 213 real funding outcomes. Weighted accuracy — the average correctness score across all scored verdicts, measured against actual funding trajectories and adjusted by confidence level — is 66.7 percent and climbing. INVEST reliability is 93.1 percent: 27 of the 29 companies we marked “invest” were correct calls.
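For readers who want the metric made concrete, here is one plausible reading of “adjusted by confidence level”. The binary outcome mapping and field names are simplifications for illustration; the actual rubric scores funding trajectories with more granularity:

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    verdict: str        # what the system said, e.g. "INVEST" or "PASS"
    outcome: str        # what happened, e.g. "RAISED" or "STALLED"
    confidence: float   # 0.0 to 1.0, recorded at verdict time

def was_correct(card: Scorecard) -> bool:
    # Simplified: an INVEST is correct iff the company raised again,
    # and a PASS is correct iff it did not.
    return (card.verdict == "INVEST") == (card.outcome == "RAISED")

def weighted_accuracy(cards: list[Scorecard]) -> float:
    """Confidence-adjusted accuracy: a confident wrong call costs
    more than a hesitant one, and a confident right call earns more."""
    total = sum(c.confidence for c in cards)
    earned = sum(c.confidence for c in cards if was_correct(c))
    return earned / total if total else 0.0
```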
When the system is wrong: Our weakest performance is on momentum-driven and brand-driven investments — companies where charisma or cultural resonance outweighs unit economics. Fourteen false negatives fall into this category. Each is documented as a learning entry that the system reads before its next assessment. Three containment layers — daily blind backtests, benchmark regression tests on ten frozen deals, and calibration notes injected before every analysis — prevent errors from compounding. The system improves as much by what we prune (1,058 dead stubs removed, 1,066 misclassified sectors reverted, an entire biased backtest methodology discarded) as by what we add.
V. THE MEMORY THAT COMPOUNDS
Every analysis feeds structured data into a knowledge graph that today contains 84,837 entities:
13,598 companies.
5,021 investors with $258 billion in tracked fund sizes.
62,935 relationships mapping who invested in whom and who competes with whom.
1,491 verified funding rounds.
1,134 analyses preserving reasoning, not just verdicts.
213 calibration scorecards linking IC verdicts to real-world outcomes.
217 postmortems of failed companies.
148 cross-portfolio insights.
25 documented learning entries where the system recorded its own mistakes.
When your fiftieth deal analysis runs, it draws context from the previous forty-nine. “This is the third logistics company this quarter with the same unit economics structure — here is what happened to the first two.” “This investor led three rounds in this sector last year; here is their pattern.”
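Mechanically, this can be as simple as a similarity query over the graph before any reasoning runs. A toy sketch, assuming a hypothetical graph API (graph.companies() and the node fields are stand-ins for the real store):

```python
def comparable_context(graph, company, limit=3):
    """Pull prior analyses of structurally similar companies so the
    new analysis starts from accumulated context, not from zero."""
    peers = [
        node for node in graph.companies()
        if node.sector == company.sector
        and node.unit_economics_pattern == company.unit_economics_pattern
        and node.id != company.id
    ]
    peers.sort(key=lambda n: n.analysed_at, reverse=True)  # newest first
    return [
        {"company": p.name, "verdict": p.verdict, "outcome": p.outcome}
        for p in peers[:limit]
    ]
```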
This is the difference between a first-year analyst and a partner with twenty years of pattern recognition. And unlike a human partner, this memory does not retire, does not forget, and scales to every client simultaneously.
The data is not the moat. The accumulated analytical context layered on top of the data is. Every entity passes through validation gates. Forty-seven canonical sectors with normalisation rules eliminate messy classifications. LLM-generated categories are cross-validated against deterministic keyword rules. The graph grows daily, but it grows intelligently.
We also made a deliberate decision that most AI companies avoid: we preserve our mistakes. A learning system captures every missed call, every rejected hypothesis. When the daily calibration backtest — run blind, without knowing outcomes in advance — reveals an error, that failure is structured into a learning entry that the system reads before its next analysis. The knowledge graph that only remembers its wins repeats its losses. Ours remembers both.
Every prompt change is tested against ten frozen benchmark deals. If accuracy drops, the change is rejected — regardless of how elegant it is. The single metric: weighted accuracy. Changes that improve it ship. Changes that don’t, don’t.
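The gate itself is deliberately boring. A sketch of the decision rule, where run_analysis and the frozen deal records are placeholder hooks for the real pipeline:

```python
def passes_regression(candidate_prompt, frozen_deals, run_analysis,
                      baseline_accuracy):
    """Gate a prompt change against the frozen benchmark deals:
    accuracy may improve or hold, but it may never drop."""
    correct = sum(
        1 for deal in frozen_deals
        if run_analysis(candidate_prompt, deal.deck) == deal.known_verdict
    )
    accuracy = correct / len(frozen_deals)
    return accuracy >= baseline_accuracy  # elegant-but-worse changes die here
```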
VI. THE ENGINEERING DECISIONS THAT ACTUALLY MATTER
The public discourse on AI agents in 2026 is stuck at the wrong altitude. The conversation is about which model is smartest, which framework has the most GitHub stars, and which company can demonstrate the most impressive demo. These are the wrong questions. The right questions — the ones that determine whether an agentic system works on day thirty, not just day one — are architectural.
We have learned this the hard way, through sixteen weeks of daily production and 213 calibrated outcomes. Here are the engineering decisions that we believe the industry is getting wrong, and what we did instead.
The deterministic/probabilistic boundary.
This is the decision that most agent builders never consciously make — and it is the one that determines whether the system is reliable or merely impressive.
Most agentic systems are entirely probabilistic. The process flow itself — which agent runs when, what triggers what, how data moves between steps — is left to the LLM to figure out. This works brilliantly in demos. It fails catastrophically in production, because a probabilistic process means a different process every time. An agent that “usually” follows the right sequence is an agent that sometimes does not, and you will not know which runs were wrong until the damage is done.
Our architecture makes a deliberate split. The process is deterministic — the Deal Memo Analyst always runs first, the analytical personas always run independently and in parallel, the synthesis layer always reads all assessments before producing output, the knowledge graph harvest always fires at the end. That sequence is hardwired. It does not vary. It cannot vary.
What IS probabilistic — deliberately, structurally — is the reasoning at each step. The DMA extracts different evidence from different decks. The personas reach different verdicts on different companies. The synthesis identifies different tensions. The intelligence lives in the reasoning. The reliability lives in the process.
This is analogous to how a hospital operates. The triage protocol is deterministic — every patient goes through the same intake sequence. The diagnosis is probabilistic — it depends on the patient. No hospital would make the triage protocol probabilistic (“the doctor will probably check vitals first, maybe”). No analytical system should make its process flow probabilistic either.
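In code, the split is nothing more exotic than a fixed function wrapping probabilistic callables. A minimal sketch, where dma, personas, synthesizer, and harvest stand in for the LLM-backed components:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(deck, dma, personas, synthesizer, harvest):
    """The sequence is hardwired; only the reasoning inside each
    component varies from deck to deck."""
    evidence = dma(deck)                          # step 1: always first
    with ThreadPoolExecutor() as pool:            # step 2: independent, parallel
        assessments = list(pool.map(lambda p: p(evidence), personas))
    signal = synthesizer(evidence, assessments)   # step 3: reads all of step 2
    harvest(evidence, assessments, signal)        # step 4: always fires
    return signal
```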
Andrej Karpathy’s AutoResearch framework validates this at the ML layer: the improvement loop (modify prompt → evaluate → keep or discard) is a deterministic protocol wrapping probabilistic generation. Nate Jones, whose enterprise deployment analysis has documented dozens of failed agent rollouts, puts it bluntly: “Do not give your agent tasks that should properly be given to something with defined workflows and deterministic software underneath.” We agree. We go further: every validation gate in our pipeline — sector classification, schema compliance, entity deduplication — runs deterministic rules FIRST, and only escalates to LLM judgement when the rules cannot resolve. We discovered that LLM-only sector classification has a 58 percent error rate on edge cases. The deterministic gate catches these before they enter the knowledge graph.
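A minimal sketch of that rules-first gate follows. The keyword table is an illustrative fragment, not the full forty-seven canonical sectors, and llm_classify is a placeholder for the model call; the ordering, not the keyword list, is the point:

```python
SECTOR_KEYWORDS = {  # illustrative fragment, not the full 47 sectors
    "logistics": ("freight", "fleet", "last-mile", "warehouse"),
    "fintech":   ("payments", "lending", "neobank", "remittance"),
}

def classify_sector(description: str, llm_classify) -> str:
    """Deterministic rules first; escalate to the LLM only when the
    rules cannot resolve the classification unambiguously."""
    text = description.lower()
    matches = {
        sector for sector, keywords in SECTOR_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    }
    if len(matches) == 1:
        return matches.pop()  # rules resolved it; no model call needed
    return llm_classify(description, candidates=sorted(matches) or None)
```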
The Protector/Operator/Visionary triad.
A single agent with a long prompt cannot simultaneously be creative, reliable, and self-correcting. These are competing objectives. Creativity requires risk-taking. Reliability requires constraint. Self-correction requires the willingness to override both.
We solve this by structuring every division around three complementary roles:
The Visionary defines direction — what to build, what to prioritise, where the division should be operating differently. Visionary agents run on the most capable model available (currently the reasoning-heavy tier) because strategic assessment requires the deepest intelligence.
The Operator executes — processing data, running pipelines, producing deliverables. Operator agents run on a balanced model because their tasks are well-defined but require competent reasoning.
The Protector ensures nothing degrades — running validation, compliance checks, regression tests, monitoring for drift. Protector agents run on the fastest, most efficient model because their tasks are frequent, well-scoped, and must run cheaply enough to be always-on.
This is not cost optimisation. It is cognitive architecture. A human organisation achieves this by paying partners more than associates and associates more than paralegals — not because partners are “better,” but because different cognitive tasks require different levels of capability, and misallocating expensive capability to routine tasks is waste. We achieve the same by routing different agent roles to different model tiers through a central resolver that tracks budgets, supports escalation when a task exceeds a model’s capability, and de-escalates for routine subtasks.
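The routing core of such a resolver fits in a dozen lines. A sketch with illustrative tier labels (a real resolver also tracks per-division budgets and de-escalates routine subtasks):

```python
ROLE_TIERS = {            # illustrative tier labels, not vendor SKUs
    "visionary": "reasoning-heavy",
    "operator":  "balanced",
    "protector": "fast-efficient",
}
ESCALATION = {"fast-efficient": "balanced", "balanced": "reasoning-heavy"}

def resolve_model(role: str, escalations: int = 0,
                  budget_available: bool = True) -> str:
    """Map a cognitive role to a model tier, escalating only when a
    task exceeds the tier's capability and budget allows it."""
    tier = ROLE_TIERS[role]
    for _ in range(escalations):
        if not budget_available:
            break
        tier = ESCALATION.get(tier, tier)  # caps at the top tier
    return tier
```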
The implication for the industry: the conversation about “which model should I use” is the wrong conversation. The right conversation is “which model for which cognitive role in which part of the workflow.” A Protector agent running on the most expensive model is waste. A Visionary agent running on the cheapest model is dangerous. Model selection is an architectural decision, not a procurement decision.
The self-improvement loop as production infrastructure.
Karpathy demonstrated that agents can improve autonomously by modifying their own prompts against a metric. Shopify’s CEO achieved 19 percent improvement overnight using the pattern. What neither discussed — because their context is ML research and software engineering — is how to apply this to business judgement, where the feedback loop is months, not minutes.
Investment analysis produces a verdict today. The outcome — whether the company raised its next round, stalled, or died — materialises six to eighteen months later. This means the improvement loop cannot run overnight. It must run across calendar time, accumulating scorecards as outcomes emerge, and adjusting the system’s analytical posture gradually.
Our implementation: 213 scorecards linking verdicts to outcomes, accumulated over sixteen weeks. Weekly calibration sweeps aggregate new data into calibration notes that every analytical agent reads before its next assessment. Ten frozen benchmark deals serve as a permanent regression test — after every prompt change, these ten deals must still produce correct verdicts, or the change is rejected. The single metric (weighted accuracy) has improved from approximately 55 percent to 66.7 percent over this period. The improvement is monotonic — no week has been worse than the previous week — because the regression test prevents backsliding.
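A sketch of the sweep’s core, under the assumption that each scorecard carries its sector, its outcome date, and whether the call was correct (the field names are illustrative):

```python
from datetime import date

def weekly_calibration_sweep(scorecards, today: date) -> str:
    """Fold newly matured outcomes into a note every analytical agent
    reads before its next assessment. Outcomes land six to eighteen
    months after the verdict, so the loop runs over calendar time."""
    matured = [c for c in scorecards if c.outcome_date <= today]
    by_sector: dict[str, list[int]] = {}
    for card in matured:
        bucket = by_sector.setdefault(card.sector, [0, 0])
        bucket[0] += int(card.was_correct)
        bucket[1] += 1
    lines = [f"- {sector}: {hits}/{total} verdicts correct to date"
             for sector, (hits, total) in sorted(by_sector.items())]
    return "CALIBRATION NOTES (read before assessment):\n" + "\n".join(lines)
```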
This is the Karpathy Loop applied to a domain where Karpathy himself has not applied it. ML research has immediate feedback (loss function, accuracy on test set). Business judgement has delayed, ambiguous feedback (did the company succeed? by whose definition? over what timeframe?). Making the improvement loop work in this domain required building the entire calibration infrastructure — blind backtests, outcome tracking, confidence-adjusted scoring, error preservation — that does not exist in any ML framework. We believe this is the next frontier for autonomous agent systems: not just improving on tasks with immediate feedback, but improving on tasks where the feedback is delayed, partial, and subjective.
Why this matters for knowledge workers.
These are not abstract engineering choices. They determine whether an AI system is something you can trust with real decisions or something that produces impressive demos.
The deterministic/probabilistic boundary means the system produces consistent, auditable results — the same evidence base processed through the same sequence every time, with the intelligence concentrated where it belongs. The P/O/V triad means the system simultaneously executes, improves, and self-corrects without requiring a human to manage each function. The self-improvement loop means the analysis you receive today is measurably better than the analysis produced last month — and you can verify that claim against published accuracy data rather than taking it on faith.
The industry will converge on these patterns. The question is whether you adopt them now, from a system that has already validated them in production, or wait until your current tools discover the hard way why they matter.
VII. WHAT THE MARKET IS TELLING US
Sequoia (Julien Bek, March 2026): “The next trillion-dollar company sells the work, not the tool.” We are autopilot-native. We skipped the copilot phase entirely.
Andreessen Horowitz (Cui & Li, March 2026): “Context — not model intelligence — is the bottleneck.” Our 84,000+ entity knowledge graph is precisely this context layer.
Foundation Capital (Gupta & Garg): “Context graphs — decision traces plus temporal context — are the trillion-dollar moat.” Our calibration scorecards and benchmark regression tests are temporal decision context that compounds.
Hedgineer (Michael Watson, ex-Citadel MD): “The cost of software is going to zero. Expertise margins don’t.” Blueprint is the AI-native Forward Deployed Engineer.
Andrej Karpathy (March 2026): Open-sourced AutoResearch, validating the autonomous agent improvement loop. Fortune named it “The Karpathy Loop.” We apply it to business judgement.
Jensen Huang (NVIDIA): “The future is a collection of agents.” Our forty-three agents across eight divisions are this vision, operational today.
Nate Jones (OpenClaw/Open Brain, April 2026): Published five commandments for enterprise agent deployment after analysing dozens of failed rollouts. Every commandment — audit before automating, fix the data first, redesign the org for throughput, build observability, scope authority deliberately — maps one-to-one to our existing architecture. His central insight: “Do not give your agent tasks that should properly be given to deterministic software underneath.” We agree. We built it.
Seven independent sources, one thesis. We designed Manthan to solve our own problems; the convergence happened on its own.
VIII. WHERE WE ARE TODAY — HONESTLY
What is live:
ManthanBot — the investment analysis pipeline — runs on a DigitalOcean droplet, accessible via Telegram and HTTP API. Real pitch decks. Real founders. Under ten minutes. Every analysis feeds into the calibration loop.
The knowledge graph: 84,837 entities, verified 6 April 2026. Grows daily. 93.8 percent sector coverage. 97.6 percent geography tagging.
Charaka Notes — daily intelligence dispatches from the knowledge graph — publishes five days a week at getmanthan.com/charaka-notes.
What is built, deploying soon:
Blueprint — the investment intelligence platform for VCs and advisors. React application, seven routes, deal analysis, investor matching, pipeline management.
What is designed, building next:
Blueprint/Consulting (agent employees for professional services firms).
The full organisational AI operating system (validated in testing, preparing for client deployment).
Professional navigation intelligence for relationship-heavy roles.
I list these gaps because overpromise is the default failure mode of AI companies.
IX. WHO THIS IS FOR
Venture capital firms and angel networks. You screen hundreds of deals per year. Our Analytical Council gives you independent frameworks evaluating every deal, plus knowledge graph context from 84,000+ entities. Start with a free diagnostic — no account, ten minutes, zero cost.
Fundraising advisory firms. You help startups raise capital. Our IC pipeline evaluates fundability with multi-framework rigour in under ten minutes. Our knowledge graph maps 5,021 investors. Your advisors spend time on what requires human credibility: the introduction, the negotiation, the trust.
SEBI-registered Investment Advisors in India. 988 advisors serving ten crore investors. A Bloomberg Terminal costs twenty-four lakh rupees per year. We price for India, not Wall Street. SEBI’s January 2025 guidelines explicitly permit third-party AI analytical tools with disclosure.
Professional services and consulting firms. Twenty-seven analytical lenses, four seniority tiers, institutional memory that compounds across engagements. Pricing reflects the replacement value of the human expertise — not the cost of the tokens consumed.
X. WHAT INTELLIGENCE COSTS
We do not price against our costs. We price against the value of what we replace.
PitchBook: $12,000-70,000/seat/year. McKinsey due diligence: $500,000+. Bloomberg Terminal: $24,000/year. Human Forward Deployed Engineer: $300,000+/year loaded.
We sit in the gap. Meaningfully below institutional-grade analytical infrastructure. Meaningfully above commodity AI tools. Nobody else occupies this gap because nobody else combines structured deliberation, institutional memory, and compounding intelligence in a single product. A VC firm screening 200 deals per year with a junior analyst at £80,000 loaded cost spends roughly £400 per deal. Our full Analytical Council at $2,000 per month works out to $120 per deal at the same volume — roughly a 75 percent reduction at current exchange rates, with institutional memory the analyst cannot provide.
Free diagnostics exist because the product should demonstrate its value before asking for money. If a ten-minute analysis is not useful, you have lost nothing.
Start with a free diagnostic — send any pitch deck to @manthanAI_bot on Telegram and see the depth for yourself in ten minutes.
XI. THE TEAM — 2 HUMANS, 43 AGENTS
Mayank Mathur — managing partner. Builder of the Manthan architecture.
Nitin Mathur — partner. The investment eye. The operational judgement that calibrates what the agents produce.
Forty-three agents across eight divisions. Each is named from Hindu scripture — not as decoration, but as an engineering mandate. When clients deploy our products, they choose their own naming theme. The analytical framework underneath is identical.
Target: two hundred agents by end of 2026. Same two humans.
XII. AN INVITATION
Send a pitch deck to @manthanAI_bot on Telegram. Receive a signal summary in ten minutes. It costs nothing.
If the analysis is not useful, you have lost nothing. If it is, you have found the analytical infrastructure that Sequoia, a16z, Foundation Capital, and Karpathy say the market needs — built not by a large platform company but by two people and forty-three agents who believe that the future of knowledge work is not more tools, but better systems.
We are at getmanthan.com. The intelligence compounds from here.
Mayank Mathur is the Founder of Manthan Intelligence and Operating Partner at Tavaga Advisory Services. He can be reached at [email protected].
Statistics as of 6 April 2026: 84,837 knowledge graph entities, 213 calibration scorecards, 66.7% weighted accuracy, 93.1% INVEST reliability, 43 agents across 8 divisions.