The engineering behind AI research
When I started this Substack I planned to write more about AI Engineering itself and since there was a long weekend and I had a bit of time I decided to write about how I use AI to research companies.
Fair disclaimer: this is not a “here is my skill/prompt to do due diligence” post. This is a piece about wiring AI research together with multiple models and software engineering, with practical examples of what a real multi agent setup can look like. It might be a bit too technical, but I believe eventually we will all end up working with agents as colleagues in some way, so better to get to know them.
Nowadays many people use AI to research equities and companies, from the simplest case, which is asking a chat assistant like Claude or ChatGPT questions about a particular company, to the most advanced cases of multi agent pipelines, so I decided to write about a small real usage.
By the way, before reading this you might want to read about my own multi agent research. The link below is more about the concept, while this piece is more about what happens behind the curtain. You can check it by clicking here.
In this piece I want to describe a simple solution to one problem: using AI to research companies. Think of Investor Relations pages, conferences, finding relationships, specialized forums, etc. Why am I doing this? Well, I see a problem today: when you try to find out how to research with AI, most of the solutions on the internet are vendor locked, sponsored by companies selling their turnkey solutions. My idea is to offer a vendor neutral view: combine small models trained in specific knowledge areas to get specific tasks done, ideally running locally on your own GPU, with large frontier lab models to get “criteria”, and glue it all together with a layer of software that allows you to schedule, aggregate, extract, parse, and integrate with external tools.
As I have mentioned before, we should never ask an LLM for answers: you will be a victim of whatever training data is in there, and of whatever the inference brings you given the way you prompted. Instead, use them as an incredible research machine to automate fuzzy processes where plain software has its limitations.
One thing I often do is chain companies together, collect their Investor Relations presentations and results, and aggregate the info to compare it with previous presentations and previous filings. It is very interesting to see how a company shifts in what it sees as risks or potential areas of growth, and of course how the financials evolve. But integrating this into multi agent research isn’t always easy. Why? Well, imagine you have to parse an IR library. You get a number of documents, so what do you let the agent do? Download and parse every document? When does it stop? Is that the agent’s call or a human decision? How do you bypass the security mechanisms, the anti bot and anti agent protections? What is interesting to extract and what isn’t? The answer is probably not so exciting, because there are no magic bullets and in the end it all comes down to the judgment of the person developing the multi agent system, but I can mention a few things that have been working for me.
Let’s start top down. Let’s say you want to watch companies A, B and C. When do you watch them? On demand? On earnings? Every time there is a filing? Each answer needs a different solution built on one orchestration layer, but let’s take a mild case: you are researching a company and you want the LLM to pick up whatever is on IR and compare it with the past. The first step: can you bypass the security of the IR site in case there is some? Many IR websites block bots and LLMs. They don’t want the extra load on their servers, and nowadays it is very easy to block them with Cloudflare or similar services. But I have good news for you, everything has a solution if you know how. In one of my first jobs, close to 15 years ago, I was scraping thousands of sites per day, so there are a few tricks I can show you.
First, the easiest way to detect whether something is a bot or an LLM bot is the user-agent. The user-agent is the signature of the process requesting a web page, sent as a header in the HTTP request when you visit a website. So every time your agent pulls an IR page and gets blocked, the first step is to fake the user-agent. How? Easy, user agents are just a string of information. If you are using something like Claude, they usually look like this:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
What can you do if the IR site is blocking that? Just replace it in the web request with the string a regular browser would send. This is a good starting point, but then you have another set of problems.
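A minimal sketch of overriding the user-agent, using only the Python standard library. The URL and the browser string are placeholders made up for illustration, and the function name build_request is hypothetical.

```python
import urllib.request

# Placeholder browser string (an assumption for illustration, not a recommendation).
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def build_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Return a request object that announces itself with the given user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical IR URL. Nothing is sent over the network until urlopen is called.
req = build_request("https://example.com/investors")
print(req.get_header("User-agent"))

# To actually fetch the page:
# html = urllib.request.urlopen(req, timeout=30).read()
```

Note that urllib normalizes the header name internally, which is why it is read back as "User-agent".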
If you are not familiar with how a browser works I can give you a crash course, because when you visit a website your browser does a lot of work for you. It has to take the JavaScript that the server sent over, interpret it and draw the page so you can see it. If you make your agent do simple HTTP requests without the JavaScript interpretation, depending on the case you will basically get something useless. So what is the solution? Well, it depends on the use case, but Playwright or some flavor of headless browser helps. What is that? Well, as you might know (or not), Chrome and Edge are both built on Chromium, the open source project that gives Chrome its name. A headless browser runs that same engine without a visible window: it interprets the JavaScript, renders the page and lets the agent get the information the way a human would. Links to decks or presentations can be built on the fly while the JavaScript renders, via templates or similar, and without this step you might find surprises. I think it is important to understand these sorts of limitations that don’t show up in demos.
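To make the limitation concrete, here is a small sketch of what a plain HTTP client sees. It parses raw HTML with the standard library and never executes any JavaScript. The sample page is made up for illustration: one PDF link is in the markup, the other only exists after the script runs, so the static parser misses it.

```python
from html.parser import HTMLParser


class PdfLinkCollector(HTMLParser):
    """Collect href values ending in .pdf from static <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".pdf"):
                self.links.append(value)


def static_pdf_links(html: str) -> list[str]:
    """Return the PDF links visible without running any JavaScript."""
    parser = PdfLinkCollector()
    parser.feed(html)
    return parser.links


# Made-up page: the second link is only created when the script runs in a browser.
SAMPLE_HTML = """
<html><body>
  <a href="/files/annual-report.pdf">Annual report</a>
  <div id="decks"></div>
  <script>
    const a = document.createElement("a");
    a.href = "/files/q1-deck.pdf";
    document.getElementById("decks").appendChild(a);
  </script>
</body></html>
"""

print(static_pdf_links(SAMPLE_HTML))  # only the annual report is found
```

A headless browser would run the script first and then expose both links in the rendered page.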
But back to work: you fake the user-agent and you use Playwright, problem solved, right? Well, not so fast. That is the naive solution, and modern anti bot systems are way more sophisticated than that. For instance you can face TLS fingerprinting (JA3/JA4). When your client opens an HTTPS connection it sends a TLS ClientHello with a specific order of cipher suites, extensions and elliptic curves. Every library has its own signature: Python requests, Node fetch, Chrome, Firefox, all of them look different at this level. Cloudflare and similar services match this against known browser fingerprints, and if your user agent says “I am Chrome on macOS” but your TLS handshake screams “I am Python requests” you are flagged immediately. There are fixes of course, but they are not as simple as changing the user-agent.
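To see why the order matters, here is a sketch of how a JA3 fingerprint is computed: the ClientHello fields are written as decimal values, joined with dashes inside each field and commas between fields, and the resulting string is hashed with MD5. The two clients below use made-up values chosen for illustration, not captured from real libraries.

```python
import hashlib


def ja3_fingerprint(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string and its MD5 digest from ClientHello fields.

    Each argument holds decimal values in the exact order the client sent them.
    """
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    digest = hashlib.md5(ja3_string.encode("ascii")).hexdigest()
    return ja3_string, digest


# Illustrative values only: same cipher suites, different order.
client_a = ja3_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
client_b = ja3_fingerprint(771, [4867, 4866, 4865], [0, 23, 65281], [29, 23, 24], [0])

print(client_a[0])
print(client_a[1])
print(client_b[1])  # differs from client_a even though the cipher set is the same
```

Since the fingerprint comes from the handshake itself, no HTTP header can change it.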
You can also face headless detection. Even with Playwright there is a long list of leaks. The flag navigator.webdriver is set to true on a vanilla headless Chrome, the CDP (Chrome DevTools Protocol) traffic leaves traces, certain JS APIs return empty or weird values (plugins, languages, permissions), and AudioContext and WebGL contexts behave differently. There are stealth plugins like puppeteer-extra-plugin-stealth or playwright-stealth that patch most of the well known leaks, but it is a cat and mouse game and every few months a new detection vector pops up. On top of that comes behavior: mouse movements, scroll patterns, time spent on each section, time between clicks, focus and blur events. Pure programmatic navigation looks nothing like a human moving a mouse, and advanced anti bot systems score this in real time.
Then there is something trickier that I had to deal with a lot in the past: IP reputation. Even if you nail everything above, if you hit 100 pages in 30 seconds something will catch you and block your IP. You will also run into CAPTCHA and challenge pages: Cloudflare Turnstile, hCaptcha, reCAPTCHA v3. Some can be bypassed with paid solver services, some require a manual solve, some are basically impossible (sorry).
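One cheap way to avoid the “100 pages in 30 seconds” pattern is to space out requests. This sketch enforces a minimum delay plus random jitter between calls. The class name PoliteThrottle and the default values are assumptions for illustration, and the clock and sleep functions are injectable so the logic can be checked without actually waiting.

```python
import random
import time


class PoliteThrottle:
    """Enforce a minimum delay, plus random jitter, between consecutive requests."""

    def __init__(self, min_interval=2.0, jitter=1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first one

    def wait(self) -> float:
        """Block until the next request is allowed. Returns the seconds slept."""
        slept = 0.0
        if self._last is not None:
            target = self._min_interval + random.uniform(0, self._jitter)
            remaining = target - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# Usage sketch (fetch is a hypothetical function, not defined here):
# throttle = PoliteThrottle(min_interval=3.0, jitter=2.0)
# for url in urls:
#     throttle.wait()
#     fetch(url)
```

This only addresses request rate. It does nothing for an IP address that already has a bad reputation.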
But ok, once you have bypassed the website protections and managed to get proper links, you have to extract the information, and this is where it gets trickier depending on the case. On one hand, a multi billion or trillion parameter model would of course have no problem parsing a PDF and extracting all the info. They know how to read PDFs and they have built-in tools to do it. The problem is that it can get very expensive in tokens, so this is where small specialized models come in handy, ones you can run either locally if you have the appropriate hardware or on a rented GPU paid per hour. There are a few options you could try out: for instance NuExtract has an 8 billion parameter model specialized in structured data extraction, or you can try a different strategy with MinerU, also a small model, specialized in extracting every kind of format and representing it as plain text. If you have an IR presentation in PDF or PowerPoint format you can easily extract it without using Anthropic or OpenAI tokens and later feed it to an agent, so you spend the tokens on the reasoning tasks and not on the mechanical ones. Another alternative I used recently is the Vision Language models fine-tuned on top of Qwen, like Qwen2.5-VL, which is available as a 7 billion parameter model with amazing OCR (Optical Character Recognition) capabilities.
Ok, now you have bypassed the website protections and extracted the information, so what do you do with it? Again it depends on the use case, but I will give a few real life examples. But first a story. Many years ago, when you were building a data pipeline with an immense amount of text, you used a search engine. Back in the day Solr was the state of the art, but a lot of water has gone under the bridge since then, and I guess today’s de facto standard for new systems is Elasticsearch, which now ships with vector search and integrations that make it usable as a retrieval layer for RAG and agent pipelines (it is not an “agent platform” per se, but it gives you the search primitives an agent needs). A search engine allows you to search a lot of data fast if you have the right keywords. A layer of algorithms can always be applied on top of that, but nowadays there are more sophisticated solutions. For instance, vectorization.
What is vectorization? You might know this, but if not: inside, an LLM is just numbers called weights, related to each other via a neural network and the attention mechanism. One thing those models can do is turn a piece of text into an embedding, a vector of hundreds or thousands of numbers that encodes its meaning. Something interesting is to build a vector database where you store your data in the shape of those embeddings. This allows you to do “smart” queries, in the sense that you don’t look for similar keywords but for similar concepts, even across different languages. In traditional search there is no way to relate the same concept in different languages unless you translate them first, and even those translations might not fit. But once you vectorize the data, texts with similar meaning end up close to each other, and you can run real AI powered queries inside a production grade database. Not asking random things to a chat. A very naive example: with a multilingual embedding model, “半導体装置” (Japanese for “semiconductor equipment”), “equipos de semiconductores” (Spanish) and “semiconductor equipment” all land in roughly the same spot in vector space, so once the embedding is done you can relate them, which keyword search alone cannot do.
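A toy sketch of what “landing in the same spot” means in practice. Closeness is usually measured with cosine similarity. The three-dimensional vectors below are invented for illustration. Real embeddings come out of a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors, NOT real model output.
TOY_EMBEDDINGS = {
    "semiconductor equipment": [0.90, 0.10, 0.05],
    "半導体装置": [0.88, 0.12, 0.07],
    "equipos de semiconductores": [0.91, 0.09, 0.04],
    "dividend policy": [0.05, 0.20, 0.95],
}


def nearest(query, k=2):
    """Return the k stored phrases most similar to the query phrase."""
    query_vec = TOY_EMBEDDINGS[query]
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for text, vec in TOY_EMBEDDINGS.items()
        if text != query
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]


print(nearest("semiconductor equipment"))
```

A vector database does the same ranking, only with approximate indexes so it stays fast over millions of vectors.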
To do that there are multiple options, and again it all depends on the case. There are vector databases that work well with multilingual embeddings, for instance Milvus, Weaviate, Qdrant, Zvec, or even PostgreSQL with pgvector and CJK extensions for Asian languages. If the database does not generate the embeddings itself, there are multiple tiny models designed and trained for the purpose of turning text into embeddings, for instance BGE-M3 or Qwen3-Embedding. You can use them for embedding and retrieving data and, given the multi language context, get a real search that connects the same concept across multiple languages. In one of my last projects I used Nomic Embed Multimodal for embedding and it worked ok.
So is this it? Well, again sorry but not even close.
So far we covered scrape, extract, store. And that already gives a lot of headaches to most people, but honestly this is the easy half. Then you get a lot of other problems when the agents actually begin doing the work on top of all that data and I can mention a few of them.
Besides the most obvious problem of token costs (they are getting more and more expensive), you have the second most discussed problem, which is context window limits. Even with 200k or 1M token windows, when you start feeding in multiple filings, IR documents, prior context, tool outputs, reasoning steps and the agents talking to each other, you fill the window very fast. And even if everything fits, models degrade badly with long contexts (the “lost in the middle” effect) and your agent may start to “not behave”, to put it nicely.
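One simple mitigation is to give each call an explicit token budget and only pack the most relevant chunks into it. The sketch below is a greedy version of that idea. The four characters per token estimate is a rough assumption (real counts depend on the tokenizer), and the relevance scores are assumed to come from elsewhere, for example from the vector search.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def pack_context(chunks, budget_tokens):
    """Pick chunks by descending relevance until the token budget is used up.

    chunks is a list of (relevance_score, text) pairs.
    Returns the selected texts and the estimated tokens they use.
    """
    selected = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # this chunk does not fit, try a smaller one
        selected.append(text)
        used += cost
    return selected, used


# Made-up chunks standing in for pieces of filings and IR decks.
example_chunks = [
    (0.9, "a" * 400),  # about 100 tokens
    (0.5, "b" * 400),  # about 100 tokens
    (0.8, "c" * 200),  # about 50 tokens
]
texts, used = pack_context(example_chunks, budget_tokens=160)
print(len(texts), used)
```

With a budget of 160 estimated tokens, the two most relevant chunks fit and the third is left out.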
Then there are the memory issues. LLMs are stateless, so every call starts from scratch, but there are working solutions around that. I have strong opinions about the memory solutions based on .md files that are so heavily promoted right now. I think they are not only bad but terrible, and that is a hill I plan to die on. I think there are better solutions based on graph databases and vectorization. Of course they are less easy to implement.
You also have to consider that in long chains agents will eventually drift. How much? It depends on how good the engineering behind them is, but the pattern is this: they start with task A, get a tool output, get carried away with something in that output, pursue B, forget A, then loop back and redo something they already did. This is partially mitigated with explicit task lists, reflection steps, and a planner/executor separation, but it is never fully solved.
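As an illustration of the “explicit task list” mitigation, here is a minimal sketch where the task state lives in code instead of in the model’s context. The executor can only complete the task at the head of the queue, and a task that is already queued or done cannot be added again. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskList:
    """Ordered task queue that refuses duplicates and out-of-order completion."""

    pending: list = field(default_factory=list)
    done: dict = field(default_factory=dict)

    def add(self, task: str) -> bool:
        """Queue a task. Returns False if it is already queued or already done."""
        if task in self.done or task in self.pending:
            return False
        self.pending.append(task)
        return True

    def next_task(self):
        """Return the task the executor must work on now, or None when finished."""
        return self.pending[0] if self.pending else None

    def complete(self, task: str, result: str) -> None:
        """Mark the current task as done. Any other task is rejected."""
        if task != self.next_task():
            raise ValueError(f"not the current task: {task!r}")
        self.pending.pop(0)
        self.done[task] = result


# Made-up tasks for illustration.
tasks = TaskList()
tasks.add("collect Q1 deck")
tasks.add("compare risk sections")
print(tasks.next_task())
```

The planner decides what goes into the list, and the executor only ever sees next_task(), which is one simple way to get the planner/executor separation.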
Telemetry and observability. This is one of my specialities, because this is where my day job lives: solving observability issues. When something goes wrong in a fifty step agent chain, how do you even debug it? You need traces, replay, the exact prompt and exact response of each step, the tool calls, the timings, the costs. There are tools, but production-grade multi agent observability is its own engineering problem.
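A bare-bones sketch of the kind of record you want for every step: the exact prompt, the exact response, the duration and the cost, written out as JSON lines so a run can be inspected later. The names are hypothetical, the cost is passed in by the caller, and the model call is a stand-in function. Real setups would usually send this to a tracing backend instead of keeping it in memory.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass


@dataclass
class StepTrace:
    run_id: str
    step: int
    kind: str          # for example "llm" or "tool"
    prompt: str
    response: str
    duration_s: float
    cost_usd: float


class RunTracer:
    """Record every step of one agent run in memory."""

    def __init__(self):
        self.run_id = uuid.uuid4().hex
        self.steps = []

    def record(self, kind, prompt, fn, cost_usd=0.0):
        """Call fn(prompt), which must return a string, and store what happened."""
        start = time.perf_counter()
        response = fn(prompt)
        duration = time.perf_counter() - start
        self.steps.append(
            StepTrace(self.run_id, len(self.steps) + 1, kind, prompt, response, duration, cost_usd)
        )
        return response

    def total_cost(self):
        return sum(step.cost_usd for step in self.steps)

    def to_jsonl(self):
        """One JSON object per line, in step order."""
        return "\n".join(json.dumps(asdict(s), ensure_ascii=False) for s in self.steps)


# Stand-in for a model call, for illustration only.
tracer = RunTracer()
tracer.record("llm", "summarize the deck", lambda p: "summary of: " + p, cost_usd=0.5)
print(tracer.to_jsonl())
```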
And there are a lot of other problems that probably deserve a separate piece, like sycophancy, or prompt injection and the security issues it brings. There is also a simple problem that no one really knows how to solve right now: is this research pipeline good? Can I reproduce the same results again? In traditional software you have all kinds of tests for that: integration tests, black box tests, red/green tests. With an LLM, which is non deterministic, how do you test? Well, that is a totally different engineering problem.
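There is no settled answer, but one rough starting point is to run the same question several times and measure how often the pipeline agrees with itself. The sketch below does that for pipelines that return a short, comparable answer. The fake pipeline is a deterministic stand-in for a real, non deterministic one, and exact string matching is a simplifying assumption.

```python
from collections import Counter


def agreement_rate(pipeline, question, runs=5):
    """Run the pipeline several times and return (most common answer, share of runs)."""
    answers = [pipeline(question) for _ in range(runs)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / runs


def make_fake_pipeline(canned_answers):
    """Stand-in pipeline that replays a fixed list of answers, one per call."""
    remaining = iter(canned_answers)

    def pipeline(question):
        return next(remaining)

    return pipeline


# Made-up answers: four runs agree, one does not.
fake = make_fake_pipeline(["raise", "raise", "cut", "raise", "raise"])
print(agreement_rate(fake, "Did guidance go up or down?", runs=5))
```

A low agreement rate does not tell you which answer is right, only that the pipeline is not stable enough to trust a single run.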
As you can see, scraping and extracting text was actually the “easy” part. The hard part is gluing together a system that can reason on top of it without breaking, drifting, lying with confidence, getting injected, or bankrupting you in tokens. The good news is that all of this is solvable one way or another. The bad news is that it requires the same discipline that real software engineering has required for the last twenty years: logs, tests, budgets, abstractions, fallbacks, version pinning, etc.
But bear in mind this is how the enterprise future will look, and these are the problems every organization will eventually run into. Better to be early.
We can’t all be goose farmers.
If you are interested in incorporating these kinds of workflows into your work or your projects, feel free to reach out.

