Fine-tuning open models for financial analysis
Evaluation tells us which models to use. Training research tells us how far we can push the parts of the pipeline that matter most for grounded, domain-specific output.
Choosing the right model is one half of the work. The other half is understanding what these models can be taught. Beyond evaluating off-the-shelf models, we run our own training research on open base models, focused narrowly on the kind of financial work Perch is built for.
This is not about replacing a general model with a fine-tune. It is about learning, in a controlled setting, where domain adaptation helps and where it does not.
Parameter-efficient fine-tuning
Our first line of training research uses low-rank adaptation (LoRA), a parameter-efficient method that trains a small set of additional weights alongside a frozen base model rather than retraining the whole network. It is cheap enough to iterate on quickly, and the resulting adapters are small enough that we can keep and compare many variants.
We built a LoRA adapter on Llama 3.1 8B for equity and volatility analysis, trained on a curated, real-world market dataset. The adapter targets the attention and feed-forward projections across the model at a modest rank, which gives it enough capacity to shift behavior on the target domain without overwriting the base model's general ability.
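To give a sense of scale for "a small set of additional weights", here is a back-of-the-envelope sketch that counts the trainable weights a low-rank adapter would add to the attention and feed-forward projections of Llama 3.1 8B. The layer shapes come from the published model configuration. The rank of 16 and the projection names (which follow the Hugging Face naming convention) are assumptions for illustration, not the settings used in our adapter.

```python
# Back-of-the-envelope count of LoRA trainable weights.
# Assumptions (not from the article): the rank is illustrative only, and the
# projection names follow the Hugging Face Llama naming convention.

# Llama 3.1 8B shapes from the published model config.
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (feed-forward width)
KV_DIM = 1024          # 8 key/value heads x 128 head_dim (grouped-query attention)
NUM_LAYERS = 32

# (in_features, out_features) for each adapted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_matrix(in_features: int, out_features: int, rank: int) -> int:
    """Weights added for one projection.

    LoRA learns the update to a frozen (out x in) matrix as B @ A,
    where A is (rank x in) and B is (out x rank).
    """
    return rank * (in_features + out_features)


def lora_trainable_params(rank: int) -> int:
    """Total adapter weights across all layers for the given rank."""
    per_layer = sum(
        lora_params_per_matrix(i, o, rank) for i, o in PROJECTIONS.values()
    )
    return per_layer * NUM_LAYERS


def frozen_params_in_targets() -> int:
    """Frozen base weights in the projections the adapter sits on."""
    per_layer = sum(i * o for i, o in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


if __name__ == "__main__":
    rank = 16  # illustrative value, not the rank used in the article
    trainable = lora_trainable_params(rank)
    frozen = frozen_params_in_targets()
    print(f"trainable adapter weights: {trainable:,}")
    print(f"frozen weights in adapted projections: {frozen:,}")
    print(f"ratio: {trainable / frozen:.2%}")
```

At the illustrative rank of 16 this works out to about 42 million trainable weights, roughly 0.6% of the weights in the projections being adapted. Doubling the rank doubles the adapter size, which is why rank is a convenient dial for trading capacity against cost.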
The goal was deliberately narrow: better grounding on domain documents, cleaner structured output, and steadier behavior on the specific shapes of data that financial work produces. Narrow goals are easier to measure and harder to fool than broad ones.
Domain-adaptive pretraining
The second line is domain-adaptive pretraining, where a base model continues training on a large, unlabeled corpus from the target domain before any task-specific tuning. Where low-rank adaptation teaches a model to behave a certain way on a task, domain-adaptive pretraining shifts the model's underlying familiarity with the domain's language and structure.
We ran domain-adaptive pretraining experiments on Qwen to study that shift directly: how much domain exposure changes grounding and terminology handling, and where the returns flatten out.
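One practical difference from task-specific tuning is that the training signal here comes from raw text alone. As a minimal sketch of a common data-preparation step for continued pretraining, the code below concatenates tokenized documents and cuts them into fixed-length blocks for next-token prediction. It is an illustration under stated assumptions, not our pipeline: the function names are hypothetical, and the integer token ids stand in for the output of a real tokenizer.

```python
from typing import Iterable, List, Tuple


def pack_into_blocks(
    docs: Iterable[List[int]], block_size: int, eos_id: int
) -> List[List[int]]:
    """Concatenate tokenized documents and cut them into equal blocks.

    Each document is followed by eos_id so the model can see document
    boundaries. A trailing remainder shorter than block_size is dropped.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


def next_token_pairs(block: List[int]) -> Tuple[List[int], List[int]]:
    """Build (inputs, targets) for causal language modeling.

    No labels are needed: the target at position t is simply the token
    at position t + 1 in the same block.
    """
    return block[:-1], block[1:]


if __name__ == "__main__":
    # Hypothetical token ids; a real run would use the model's tokenizer.
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for block in pack_into_blocks(docs, block_size=4, eos_id=0):
        inputs, targets = next_token_pairs(block)
        print(inputs, "->", targets)
```

Packing keeps every block full, so no compute is spent on padding, at the cost of some blocks spanning two documents.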
What we are actually measuring
Training a model is easy. Knowing whether it got better at the thing you care about is the hard part, and it calls for the same discipline we apply when evaluating any model. For this research, that means asking:
- Does it ground claims in the source material more reliably, or just sound more fluent in the domain?
- Is the structured output cleaner and more consistent, or only different?
- Does it hold up on documents it was not trained on, or has it narrowed to the data it saw in training?
A fine-tune that sounds more like a financial analyst but grounds its claims no better is not progress. It is a more convincing way to be wrong, which is exactly the failure mode we are trying to design out.
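The questions above can be turned into crude, automatable proxies. The sketch below is one hypothetical way to do that, not the evaluation we run: it scores how many numbers in an answer also appear in the source document, and how many outputs parse as JSON objects with the expected fields. Both function names and the choice of numbers as a stand-in for "claims" are assumptions made for illustration.

```python
import json
import re
from typing import Iterable, List

# Matches integers and simple decimals such as 2023 or 18.5.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def _numbers(text: str) -> List[float]:
    """Extract numeric values, ignoring thousands separators."""
    return [float(n) for n in _NUMBER.findall(text.replace(",", ""))]


def numeric_grounding_rate(answer: str, source: str) -> float:
    """Fraction of numbers in the answer that also occur in the source.

    A crude proxy for grounding: it says nothing about whether a number
    is attached to the right entity, only that it was not invented.
    Returns 1.0 when the answer contains no numbers to check.
    """
    claimed = _numbers(answer)
    if not claimed:
        return 1.0
    available = set(_numbers(source))
    return sum(n in available for n in claimed) / len(claimed)


def valid_structured_rate(outputs: Iterable[str], required_keys: Iterable[str]) -> float:
    """Fraction of outputs that are JSON objects containing every required key."""
    required = set(required_keys)
    outputs = list(outputs)
    if not outputs:
        return 0.0
    ok = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and required.issubset(parsed):
            ok += 1
    return ok / len(outputs)
```

The third question is addressed by where the scores are computed rather than how: running the same checks on the base model and the tuned model, using only documents withheld from training, shows whether an apparent gain survives outside the training set.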
Why this work informs the product
This research does not mean a custom model powers every answer. It means we understand, from the inside, what these models can and cannot be taught about a domain, and that understanding shapes how Perch routes and grounds work today. The parts of the pipeline that matter most for verifiable, decision-useful output are the parts we are willing to do the slow, empirical work on.
For the broader picture of why this matters, see "Why most AI agents fail in production."