Services / Applied ML & LLMs

The gap between an impressive demo and a working system is engineering.

Applied AI work at iHC covers forecasting, document automation, retrieval-augmented systems, and agentic workflows. Every build starts with a success metric and ends with a system in production, not a slide deck.

Talk this through

The full picture

The model is the smallest box

In a production ML system, the model code is a small fraction of the whole. Around it sits the engineering that keeps it correct, current, and safe: data pipelines, evaluation, serving, monitoring, and the automation that ties them together. Demos show the small box. Production is everything else.

Data collection & labeling

Sourcing, freshness, consent, ground truth.

Feature & data pipelines

The same transformations in training and serving.

Experiment tracking

Every run logged and reproducible.

ML code

The part the demo shows.

Evaluation & testing

Code tests, data tests, model tests.

Config & secrets

Versioned, reviewed, no magic constants.

Serving infrastructure

Batch, API, or in-database scoring.

CI/CD & rollback

Release gradually, retreat automatically.

Monitoring & drift

Watch inputs and outputs, retrain on evidence.

Adapted from Sculley et al., Hidden Technical Debt in Machine Learning Systems, NeurIPS 2015.

The main ideas

How we think about it

The principles behind the work, in plain language. If these make sense to you, we'll get along.

Success criteria first

Before any model is trained, we agree what number it has to move and how we'll know. 'Accuracy' is not a business outcome; hours saved and errors prevented are.

The right tool, not the biggest

Plenty of problems are solved better by a regression model than a large language model, at a fraction of the cost. We match the tool to the problem and tell you why.

Honest evaluation

AI systems fail quietly, especially ones that generate text. We build evaluation in from day one: test sets, groundedness checks, and human review where the stakes demand it.

A path to production

Versioning, monitoring, retraining, rollback. The pilot is built as the first version of the production system, so 'productionizing' is a step, not a rebuild.

What teams miss

A notebook is not a system

The classic data science workflow lives in a notebook: load a CSV, train a model, admire the accuracy. That is exploration, and it matters, but it is the easy half. Production ML is software engineering with models inside.

The notebook

Data pulled by hand, once
Runs on one person's laptop, in one person's head
Accuracy measured on a single held-out file
'Done' when the chart looks good
The model is a file saved somewhere
Breaks silently the day the data changes

The production system

Data, code, and models versioned; every run reproducible
An automated pipeline: train, evaluate, deploy, on a trigger
Evaluation suites run before and after every release
CI/CD with rollback when a new model underperforms
A model registry: what is live, since when, approved by whom
Drift and performance monitoring, with retraining on evidence

In production

What production ML actually involves

This is the work between a promising notebook and a system the business can lean on. It is also where most AI initiatives quietly stall.

Experiment tracking

Every training run logged with its data version, parameters, and metrics. Results you can reproduce and compare, not folklore in a notebook.
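As a minimal sketch of what "logged with its data version, parameters, and metrics" can look like, the snippet below appends each run to a JSON Lines file using only the Python standard library. The function names (`data_fingerprint`, `log_run`, `best_run`) and the file layout are illustrative assumptions, not a description of any particular tracking tool; a real project would more likely use a dedicated experiment tracker.

```python
import hashlib
import json
import time
from pathlib import Path


def data_fingerprint(path: Path) -> str:
    """Return a short SHA-256 digest of the training file, used as its version."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]


def log_run(log_path: Path, data_path: Path, params: dict, metrics: dict) -> dict:
    """Append one training run to a JSON Lines log and return the record."""
    record = {
        "timestamp": time.time(),
        "data_version": data_fingerprint(data_path),
        "params": params,
        "metrics": metrics,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(log_path: Path, metric: str) -> dict:
    """Return the logged run with the lowest value of `metric`."""
    lines = log_path.read_text(encoding="utf-8").splitlines()
    runs = [json.loads(line) for line in lines if line]
    return min(runs, key=lambda r: r["metrics"][metric])
```

Because the data fingerprint is stored with every run, two runs can be compared knowing whether they saw the same data, which is the part a notebook usually loses.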

Testing beyond code

Unit tests for code, expectations for data, behavioral tests for models. Three test suites, because ML systems fail in three different ways: in the code, in the data, and in the model's behavior.
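To make the three kinds of test concrete, here is a small sketch with one test of each kind. Everything in it is hypothetical: `normalize_amount` stands in for pipeline code, `check_rows` for a data expectation, and `score` for a trained model. The point is only the shape of each test.

```python
import math


def normalize_amount(raw: str) -> float:
    """Code under test: parse a currency string such as '1,250.50' into a float."""
    return float(raw.replace(",", "").strip())


def check_rows(rows: list[dict]) -> list[str]:
    """Data expectations: return a list of violations found in the input rows."""
    problems = []
    for i, row in enumerate(rows):
        amount = row.get("amount")
        if amount is None or math.isnan(amount):
            problems.append(f"row {i}: amount is missing")
        elif amount < 0:
            problems.append(f"row {i}: amount is negative")
    return problems


def score(amount: float, customer_name: str) -> float:
    """Stand-in model: a risk score that should depend on the amount only."""
    return min(1.0, amount / 10_000)


def test_code():
    # Unit test: the parsing logic handles separators and whitespace.
    assert normalize_amount(" 1,250.50 ") == 1250.5


def test_data():
    # Data test: bad rows are reported instead of flowing into training.
    rows = [{"amount": 10.0}, {"amount": -1.0}]
    assert check_rows(rows) == ["row 1: amount is negative"]


def test_model_invariance():
    # Behavioral test: changing an irrelevant field must not change the score.
    assert score(500.0, "Alice") == score(500.0, "Bob")
```

The functions are named so that a test runner such as pytest would collect them, but they can also be called directly.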

CI/CD for models

Shipping a model is a pipeline, not a hand-off: build, evaluate against the current champion, release gradually, roll back automatically if it underperforms.
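The "evaluate against the current champion, release gradually, roll back automatically" steps can be sketched as two small decision functions. The thresholds (`min_gain`, `tolerance`), the doubling schedule, and the names below are assumptions chosen for illustration; real values depend on the metric and the cost of a bad release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model_id: str
    error: float  # lower is better, measured on the shared evaluation set


def should_promote(champion: EvalResult, challenger: EvalResult,
                   min_gain: float = 0.01) -> bool:
    """Promote only if the challenger beats the champion by at least
    `min_gain`, expressed as a relative reduction in error."""
    return challenger.error <= champion.error * (1.0 - min_gain)


def next_traffic_share(current: float, live_error: float, champion_error: float,
                       tolerance: float = 0.05) -> float:
    """Gradual release: double the challenger's traffic share while it holds up,
    and return 0.0 (full rollback) as soon as its live error exceeds the
    champion's by more than `tolerance` (relative)."""
    if live_error > champion_error * (1.0 + tolerance):
        return 0.0
    return min(1.0, current * 2)
```

A pipeline would call `should_promote` once after offline evaluation, then call `next_traffic_share` on a schedule during rollout, so that retreating requires no human in the loop.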

Serving, chosen deliberately

Batch scoring, a real-time API, or in-database inference. Latency, volume, and cost decide the pattern, not habit.

Monitoring and drift

Input data drifts, user behavior shifts, performance decays. Production ML watches inputs and outputs, alerts on change, and retrains on evidence.
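One common way to "watch inputs" is the Population Stability Index (PSI), which compares how a feature is distributed today against a reference sample such as the training data. The sketch below is a plain-Python version under simple assumptions: equal-width bins taken from the reference range, and a small floor to avoid taking the logarithm of zero. Whatever alert threshold is used should be calibrated on your own data.

```python
import math


def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a current sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(bins - 1, max(0, int((v - lo) / width)))
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Identical samples give a PSI of zero, and the value grows as the current distribution moves away from the reference. The same function can be applied to model outputs to watch for shifts in predictions.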

Evals for LLM systems

Generative systems need their own discipline: groundedness checks, regression prompt suites, cost and latency budgets, and human review where stakes are high.
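A regression prompt suite with a groundedness check can be sketched as follows. The groundedness score here is a deliberately crude proxy (the share of answer words that also appear in the retrieved sources), and `Case`, `run_suite`, and the `generate` callable are hypothetical names. Production checks typically use stronger methods, and cost and latency budgets are left out to keep the example short.

```python
import re
from dataclasses import dataclass


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, sources: list[str]) -> float:
    """Share of answer tokens that also appear in the sources (0.0 to 1.0)."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    source_tokens = set().union(*(_tokens(s) for s in sources))
    return len(answer_tokens & source_tokens) / len(answer_tokens)


@dataclass(frozen=True)
class Case:
    prompt: str
    sources: list[str]
    must_contain: str  # a fact the answer is required to state


def run_suite(generate, cases: list[Case], min_grounded: float = 0.8) -> list[str]:
    """Run every regression case and return the prompts that failed.

    `generate(prompt, sources)` is whatever function calls the system under test.
    """
    failures = []
    for case in cases:
        answer = generate(case.prompt, case.sources)
        if case.must_contain.lower() not in answer.lower():
            failures.append(case.prompt)
        elif groundedness(answer, case.sources) < min_grounded:
            failures.append(case.prompt)
    return failures
```

Running the same suite before and after every prompt or model change turns "it seems fine" into a list of specific prompts that regressed.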

The flow

How an engagement runs

A deployed system with a measured result, owned by your team.

Frame

Define the decision the AI supports, the metric it must move, and the cost of being wrong.

Prove

A tightly scoped proof-of-concept on your real data, evaluated against the agreed criteria, in weeks.

Harden

The proven approach is engineered for production: pipelines, monitoring, guardrails, and human-in-the-loop where needed.

Measure & hand over

We measure against the original success criteria, document everything, and train your team to own it.

Sound like your situation?

A free 30-minute call to talk it through. No pitch, no obligation.

Book a call

Other services

Solution Architecture & Advisory

Data Foundations & Pipelines

Dashboards & Decision Support