The gap between an impressive demo and a working system is engineering.
Applied AI work at iHC covers forecasting, document automation, retrieval-augmented systems, and agentic workflows. Every build starts with a success metric and ends with a system in production, not a slide deck.
The full picture
The model is the smallest box
In a production ML system, the model code is a small fraction of the whole. Around it sits the engineering that keeps it correct, current, and safe: data pipelines, evaluation, serving, monitoring, and the automation that ties them together. Demos show the small box. Production is everything else.
Data collection & labeling
Sourcing, freshness, consent, ground truth.
Feature & data pipelines
The same transformations in training and serving.
Experiment tracking
Every run logged and reproducible.
Evaluation & testing
Code tests, data tests, model tests.
Config & secrets
Versioned, reviewed, no magic constants.
Serving infrastructure
Batch, API, or in-database scoring.
CI/CD & rollback
Release gradually, retreat automatically.
Monitoring & drift
Watch inputs and outputs, retrain on evidence.
Adapted from Sculley et al., "Hidden Technical Debt in Machine Learning Systems," NeurIPS 2015.
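To make "the same transformations in training and serving" concrete, here is a minimal Python sketch. The field names (`amount`, `days_since_last_order`) and the function names are hypothetical, chosen only for illustration; the point is that both paths call one function, so the feature logic cannot drift apart.

```python
import math
from typing import Mapping


def build_features(raw: Mapping[str, float]) -> list[float]:
    """Single source of truth for feature logic (field names are hypothetical)."""
    amount = float(raw["amount"])
    days = float(raw["days_since_last_order"])
    # Log-scale the amount; cap recency at one year and scale it to [0, 1].
    return [math.log1p(amount), min(days, 365.0) / 365.0]


def training_rows(records: list[Mapping[str, float]]) -> list[list[float]]:
    # Training path: transform every historical record.
    return [build_features(r) for r in records]


def serve_one(record: Mapping[str, float]) -> list[float]:
    # Serving path: the identical function, applied to one live record.
    return build_features(record)


example = {"amount": 99.0, "days_since_last_order": 730.0}
# The same record yields the same features on both paths.
assert training_rows([example])[0] == serve_one(example)
```

Keeping the transformation in one importable function is one simple way to avoid training/serving skew; a feature store achieves the same goal at larger scale.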
The main ideas
How we think about it
The principles behind the work, in plain language. If these make sense to you, we'll get along.
01
Success criteria first
Before any model is trained, we agree what number it has to move and how we'll know. 'Accuracy' is not a business outcome; hours saved and errors prevented are.
02
The right tool, not the biggest
Plenty of problems are solved better by a regression model than a large language model, at a fraction of the cost. We match the tool to the problem and tell you why.
03
Honest evaluation
AI systems fail quietly, especially ones that generate text. We build evaluation in from day one: test sets, groundedness checks, and human review where the stakes demand it.
04
A path to production
Versioning, monitoring, retraining, rollback. The pilot is built as the first version of the production system, so 'productionizing' is a step, not a rebuild.
What teams miss
A notebook is not a system
The classic data science workflow lives in a notebook: load a CSV, train a model, admire the accuracy. That is exploration, and it matters, but it is the easy half. Production ML is software engineering with models inside.
The notebook
- Data pulled by hand, once
- Runs on one person's laptop, in one person's head
- Accuracy measured on a single held-out file
- 'Done' when the chart looks good
- The model is a file saved somewhere
- Breaks silently the day the data changes
The production system
- Data, code, and models versioned; every run reproducible
- An automated pipeline: train, evaluate, deploy, on a trigger
- Evaluation suites run before and after every release
- CI/CD with rollback when a new model underperforms
- A model registry: what is live, since when, approved by whom
- Drift and performance monitoring, with retraining on evidence
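The registry bullet above ("what is live, since when, approved by whom") maps onto a very small data structure. The sketch below is an in-memory illustration under assumed names (`RegistryEntry`, `ModelRegistry`, the `churn` model); a real registry would persist this history and sit behind access control.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RegistryEntry:
    model_name: str
    version: str
    live_since: datetime
    approved_by: str


class ModelRegistry:
    """Keeps the promotion history per model; the last entry is the live one."""

    def __init__(self) -> None:
        self._history: dict[str, list[RegistryEntry]] = {}

    def promote(self, model_name: str, version: str, approved_by: str,
                when: Optional[datetime] = None) -> RegistryEntry:
        entry = RegistryEntry(model_name, version,
                              when or datetime.now(timezone.utc), approved_by)
        self._history.setdefault(model_name, []).append(entry)
        return entry

    def live(self, model_name: str) -> RegistryEntry:
        return self._history[model_name][-1]

    def rollback(self, model_name: str) -> RegistryEntry:
        """Drop the live version and return the one that was live before it."""
        versions = self._history[model_name]
        if len(versions) < 2:
            raise ValueError("no earlier version to roll back to")
        versions.pop()
        return versions[-1]


registry = ModelRegistry()
registry.promote("churn", "1", approved_by="alice")
registry.promote("churn", "2", approved_by="bob")
```

With this in place, "what is live?" is `registry.live("churn")`, and a rollback is a one-line, auditable operation rather than a search for an old file.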
In production
What production ML actually involves
This is the work between a promising notebook and a system the business can lean on. It is also where most AI initiatives quietly stall.
01
Experiment tracking
Every training run logged with its data version, parameters, and metrics. Results you can reproduce and compare, not folklore in a notebook.
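A minimal sketch of what that logging can look like, using only the Python standard library. The file names, parameter names, and metric values are made up for illustration; a dedicated tracking tool would replace the JSON Lines file, but the record has the same three parts: data version, parameters, metrics.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def data_version(path: Path) -> str:
    """Version the data by content: a short SHA-256 of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]


def log_run(log_path: Path, data_path: Path, params: dict, metrics: dict) -> dict:
    """Append one training run to a JSON Lines log and return the record."""
    record = {
        "logged_at": time.time(),
        "data_version": data_version(data_path),
        "params": params,
        "metrics": metrics,
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(log_path: Path) -> list[dict]:
    """Read every logged run back, oldest first."""
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


# Usage: two runs on the same data are directly comparable.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("x,y\n1,2\n", encoding="utf-8")
    log = Path(tmp) / "runs.jsonl"
    log_run(log, data, {"max_depth": 3}, {"mae": 4.2})
    log_run(log, data, {"max_depth": 5}, {"mae": 3.9})
    runs = load_runs(log)
    best = min(runs, key=lambda r: r["metrics"]["mae"])
```

Because the data version is a content hash, two runs with the same hash were trained on byte-identical data, so any difference in metrics comes from the parameters or the code.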
02
Testing beyond code
Unit tests for code, expectations for data, behavioral tests for models. Three test suites, because ML systems fail in three different ways: in the code, in the data, and in the model's behavior.
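The three kinds of test can each be a few lines. The sketch below is illustrative only: `normalize_amount`, the `amount` field, and `toy_model` are hypothetical stand-ins, and the behavioral test shown (monotonicity) is just one of several possible behavioral checks.

```python
def normalize_amount(cents: int) -> float:
    """Code under test: convert integer cents to currency units."""
    return cents / 100


def test_code() -> None:
    # 1. Unit test: the transformation itself is correct.
    assert normalize_amount(1250) == 12.5


def check_data(rows: list[dict]) -> list[str]:
    """2. Data expectations: collect every violation instead of stopping at the first."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("amount") is None:
            problems.append(f"row {i}: amount is missing")
        elif row["amount"] < 0:
            problems.append(f"row {i}: amount is negative")
    return problems


def toy_model(features: dict) -> float:
    # Stand-in for a trained model: predicts delivery days from distance in km.
    return 1.0 + 0.01 * features["distance_km"]


def check_monotonic(predict, base: dict, field: str, steps: list[float]) -> bool:
    """3. Behavioral test: raising `field` must never lower the prediction."""
    preds = [predict({**base, field: value}) for value in steps]
    return all(a <= b for a, b in zip(preds, preds[1:]))


test_code()
violations = check_data([{"amount": 5}, {"amount": None}, {"amount": -1}])
is_monotonic = check_monotonic(toy_model, {"distance_km": 0}, "distance_km", [10, 50, 200])
```

The unit test guards the code, the expectations guard the incoming data, and the behavioral check guards properties the model must keep no matter how it was trained.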
03
CI/CD for models
Shipping a model is a pipeline, not a hand-off: build, evaluate against the current champion, release gradually, roll back automatically if it underperforms.
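As a rough sketch of that pipeline's two gates, assuming a single offline score where higher is better and a live error rate that can be observed at each traffic share. The stage sizes, the minimum gain, and the error threshold are placeholder values, not recommendations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    version: str
    offline_score: float  # higher is better, e.g. F1 on the evaluation suite


def should_promote(champion: Candidate, challenger: Candidate,
                   min_gain: float = 0.01) -> bool:
    """Gate 1: the challenger must beat the current champion by a clear margin."""
    return challenger.offline_score >= champion.offline_score + min_gain


def gradual_release(live_error_rate: Callable[[float], float],
                    stages: tuple[float, ...] = (0.05, 0.25, 1.0),
                    max_error_rate: float = 0.02) -> tuple[float, bool]:
    """Gate 2: widen traffic stage by stage; return (final_share, rolled_back).

    `live_error_rate(share)` is a hypothetical hook that reports the error
    rate observed while the new model serves that share of traffic.
    """
    share = 0.0
    for stage in stages:
        if live_error_rate(stage) > max_error_rate:
            return 0.0, True  # automatic rollback: all traffic back to the champion
        share = stage
    return share, False


champion = Candidate("v1", 0.80)
strong = Candidate("v2", 0.82)
marginal = Candidate("v3", 0.805)
```

Here `should_promote(champion, strong)` passes and `should_promote(champion, marginal)` does not, which keeps noise-level improvements from triggering a release.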
04
Serving, chosen deliberately
Batch scoring, a real-time API, or in-database inference. Latency, volume, and cost decide the pattern, not habit.
05
Monitoring and drift
Input data drifts, user behavior shifts, performance decays. Production ML watches inputs and outputs, alerts on change, and retrains on evidence.
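One common way to put a number on input drift is the population stability index (PSI), which compares how a feature was distributed at training time with how it looks in live traffic. The sketch below is a plain-Python illustration with equal-width bins; the thresholds often quoted for PSI (around 0.1 for "watch" and 0.25 for "act") are rules of thumb and should be tuned per feature.

```python
import math


def psi(expected: list[float], actual: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant features

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the reference range into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]   # feature values seen in training
shifted = [x + 0.5 for x in reference]      # the same feature after a shift
```

`psi(reference, reference)` is zero because the two distributions are identical, while `psi(reference, shifted)` is far above the usual alert levels. Alerting on a value like this is what "retrains on evidence" can mean in practice.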
06
Evals for LLM systems
Generative systems need their own discipline: groundedness checks, regression prompt suites, cost and latency budgets, and human review where stakes are high.
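A minimal sketch of a regression prompt suite with a groundedness check and a latency budget. The groundedness measure here is deliberately naive (word overlap between each answer sentence and the retrieved sources) and is an assumption for illustration; production checks are usually stronger, and a cost budget can be enforced the same way as the latency one. The case data and the two `generate` functions are invented stand-ins for a real system.

```python
import re
import time


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, sources: list[str], min_overlap: float = 0.8) -> float:
    """Share of answer sentences whose words mostly appear in the sources."""
    source_tokens = set().union(*(tokens(s) for s in sources))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = tokens(sentence)
        if words and len(words & source_tokens) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)


def run_suite(generate, cases, min_groundedness=1.0, max_latency_s=2.0):
    """Run every regression case; return the names of the cases that failed."""
    failures = []
    for case in cases:
        start = time.perf_counter()
        answer = generate(case["prompt"], case["sources"])
        latency = time.perf_counter() - start
        grounded = groundedness(answer, case["sources"])
        if grounded < min_groundedness or latency > max_latency_s:
            failures.append(case["name"])
    return failures


CASES = [{
    "name": "invoice_total",
    "prompt": "What is the invoice total?",
    "sources": ["The invoice total is 420 euros and is due on 1 March."],
}]


def faithful(prompt, sources):
    return "The invoice total is 420 euros."


def hallucinating(prompt, sources):
    return "The invoice total is 900 dollars."
```

Running the same suite before and after every prompt or model change turns "it seemed fine" into a pass/fail result. Word overlap will miss subtle errors such as negation, which is one reason human review stays in the loop where stakes are high.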
The flow
How an engagement runs
Four steps, ending in a deployed system with a measured result, owned by your team.
01
Frame
Define the decision the AI supports, the metric it must move, and the cost of being wrong.
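"The cost of being wrong" can be made concrete early. The sketch below assumes a binary decision (flag or don't flag), a model that outputs a score, and two agreed costs: one for a false alarm and one for a miss. All numbers are invented for illustration. It shows how the same model leads to a different operating threshold depending on which mistake is more expensive.

```python
def total_cost(scores: list[float], labels: list[int], threshold: float,
               cost_fp: float, cost_fn: float) -> float:
    """Cost of acting on `score >= threshold` across a labeled sample."""
    cost = 0.0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 0:
            cost += cost_fp      # false alarm: acted when nothing was wrong
        elif not flagged and label == 1:
            cost += cost_fn      # miss: did nothing when something was wrong
    return cost


def best_threshold(scores: list[float], labels: list[int],
                   cost_fp: float, cost_fn: float) -> float:
    """Try each observed score as a threshold and keep the cheapest one."""
    candidates = sorted(set(scores))
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


scores = [0.1, 0.4, 0.6, 0.9]
labels = [0, 1, 0, 1]

# Misses are ten times worse than false alarms: flag aggressively.
cautious = best_threshold(scores, labels, cost_fp=1, cost_fn=10)
# False alarms are ten times worse than misses: flag only when very sure.
conservative = best_threshold(scores, labels, cost_fp=10, cost_fn=1)
```

Agreeing on these two costs during framing is what lets the later evaluation report a business number rather than an accuracy figure.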
02
Prove
A tightly scoped proof of concept on your real data, evaluated against the agreed criteria, in weeks.
03
Harden
The proven approach is engineered for production: pipelines, monitoring, guardrails, and human-in-the-loop review where needed.
04
Measure & hand over
We measure against the original success criteria, document everything, and train your team to own it.
Sound like your situation?
A free 30-minute call to talk it through. No pitch, no obligation.