marcduerst.com

Integrating Machine Learning — A Software Engineer's Guide

This is not a post about how to train models or write Python notebooks. It’s about what I’ve learned as a software engineer working alongside data scientists — how to collaborate effectively, what to watch out for when integrating ML into production, and why the offline/online distinction matters more than most people think.

But let's start with some basics…

What is a Data Scientist

There are two roles that sound similar to a software engineer: Data Analyst and Data Scientist. The two are often confused, but from an engineering perspective they have very different needs and outputs. Think of it in software terms:

Data Analysts

Data analysts are like your monitoring and debugging experts — but for the business. They look at what happened and try to explain why.

They query collected data, build dashboards, spot anomalies, and present findings to stakeholders. Their output is insights and recommendations, not code that runs in production.

In engineering terms: if a data analyst were a developer, they’d be the one reading logs and metrics after an incident, writing the post-mortem, and suggesting what to fix — but not writing the fix themselves.

Example: “Conversion dropped 12% last week. I dug into the funnel data and found that 80% of drop-offs happen on the new checkout page on mobile devices. I recommend we prioritize a mobile UX review.”

Data Scientists

Data scientists are closer to R&D engineers. They don’t just analyze what happened — they build things that predict what will happen or automate decisions. Their output is algorithms, trained ML models, or heuristics that engineering integrates into production.

In engineering terms: if a data scientist were a developer, they’d be the one writing a proof-of-concept for a new feature — except their “feature” is a mathematical model that needs your infrastructure to run.

Example: “We trained a recommendation model that predicts which products a user is likely to buy based on their browsing history. It improves click-through rate by 15% in our offline tests. We need engineering to integrate it into the product page — the model takes a user ID and returns a ranked list of product IDs in ~10ms.”

How Data Science and Engineering Work Together

Data science and software engineering don’t work in isolation. There are three distinct phases where the two disciplines intersect — and each has different needs.

Phase 1: Data Science Research

This is the exploratory phase. Think of it as the data scientist’s playground.

  • Needs ad-hoc query capabilities on big data
  • Needs historical data — current data alone is mostly unusable for research; timelines matter
  • Very individual — every data scientist works differently, with their own tools and workflows

Phase 2: Model Training

A more structured phase where models get trained and validated.

  • More aligned tooling, possibly standardized across a team or company
  • Needs query capabilities on big data
  • Needs historical data (e.g. snapshots from various points in the past)
  • A more stable, repeatable approach to querying data
  • Data lineage becomes important: who uses what data, for what purpose, and what are the dependencies

Phase 3: Software Engineering Integration

This is where engineering takes over — integrating the algorithm or ML model into the production system.

  • Can be offline (batch) or online (real-time) — more on this below
  • Requires tight alignment on data formats, types, and semantics
  • Fast database queries: all data including feature data must be loaded in milliseconds
  • No historical data needed — only current data matters in production
  • Data access may be charged per query, so optimize for efficient database usage
  • Smart caching for feature data may be needed to reduce latency and cost
  • May need to log predictions for future model training (feedback loop)
  • Operations monitoring, SLA/SLO — are ML predictions working as expected? Latency, error rates, prediction quality

Keys to Good Collaboration

Start early. Data scientists and software engineers need to collaborate from the late research phase onwards. Why? Because the choice of features, as well as the type of model, can be a deal-breaker for production integration. Features that sound easy from a data science perspective are often hard to implement in engineering, and vice versa. The engineering team must also be able to load and use the model on their tech stack.

Get the data right. When engineering integrates an ML model, it’s crucial that the data types and semantics match exactly what the data scientist used during training. If you use two different tech stacks, things get tricky: there can be semantic differences between online and offline data, or there can simply be bugs. Even if the tech stack is the same, the data sources may not be. Ensure that the data science database (e.g. the DWH) aligns with the engineering database (online).

Agree on the model format early. Both sides need to be able to work with the chosen type of ML model. Data science needs to train it effectively, while engineering needs to load it and run predictions/inferences in production. From an engineering perspective, what matters is: how big is the model (memory, load time), and do we have libraries to load this model type in our stack? This becomes critical when the tech stacks differ — e.g. data science uses Python/Airflow while engineering runs .NET or Java backends. A model format that’s trivial to export in Python might have no mature loader in your production runtime. Solve this before anyone starts training.
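To make that agreement concrete, the handover checklist can be encoded as data that both sides validate against before training starts. The manifest fields and the `check_manifest` helper below are hypothetical, a sketch of the idea rather than a real tool:

```python
# Hypothetical handover manifest agreed between data science and engineering.
# It pins the model format, version, size budget, and exact feature schema.
manifest = {
    "model_name": "product_recommender",
    "format": "onnx",          # a format both Python and .NET/Java can load
    "version": "1.3.0",
    "max_size_mb": 200,        # engineering's memory/load-time budget
    "features": [
        {"name": "user_age", "dtype": "float32"},
        {"name": "product_id", "dtype": "int64"},
    ],
}

def check_manifest(m: dict, supported_formats: set) -> list:
    """Return a list of problems engineering would flag at handover."""
    problems = []
    if m["format"] not in supported_formats:
        problems.append(f"no loader for format '{m['format']}' in our stack")
    if m["max_size_mb"] > 500:
        problems.append("model too large for the serving memory budget")
    return problems

print(check_manifest(manifest, supported_formats={"onnx"}))  # → []
```

An empty problem list means both sides can proceed; a non-empty one surfaces the "no mature loader in your production runtime" conversation before anyone trains a model.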

Offline vs. Online: The Most Important Distinction

Offline vs. Online ML predictions

This is where most of the engineering complexity lives.

Offline (Batch Predictions)

From an engineering perspective, offline means running ML predictions in batch jobs — typically on a schedule.

  • Latency doesn’t matter — jobs can take minutes or hours
  • Data freshness doesn’t matter — you work with whatever snapshot is available
  • Results are pre-calculated and stored for fast lookup by the online system
  • Stability is secondary — a failed batch can be retried without user impact
  • Resource costs (memory, storage, compute) still need attention

A practical example: Calculate trending products every 2 hours. Once done, write the results to a fast key-value store where the online system picks them up.
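Assuming a Redis-like key-value store (replaced here by a plain dict so the sketch is self-contained), the trending-products job could look like this; all names are illustrative:

```python
from collections import Counter
from datetime import datetime, timezone

# Stand-in for a fast key-value store such as Redis (hypothetical interface).
kv_store = {}

def compute_trending(view_events, top_n=3):
    """Batch job body: rank product IDs by view count in the last window."""
    return [pid for pid, _ in Counter(view_events).most_common(top_n)]

def run_batch_job(view_events):
    # Pre-calculate offline, then write results where the online system reads them.
    kv_store["trending:products"] = compute_trending(view_events)
    kv_store["trending:updated_at"] = datetime.now(timezone.utc).isoformat()

run_batch_job([7, 7, 3, 7, 3, 9, 1])
print(kv_store["trending:products"])  # → [7, 3, 9]
```

If this job fails, a retry two hours later costs nothing; the online system keeps serving the previous snapshot in the meantime.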

Online (Near Real-Time)

Online means the user is waiting — your service needs to load the model, gather feature-data, run inference, and return a result within a single request.

Engineering challenges:

  • Total response time often needs to be below 50ms — including feature loading and inference
  • The ML model must be loaded in memory and ready to serve (cold starts are a killer)
  • Model updates must be loaded while the service stays fully operational
  • May need to support A/B testing of ML models: not one but two models must be hot-loaded in memory
  • To detect model skew you need some kind of model validation (e.g. run a well-defined test case after the model is loaded into memory, but before you use it to serve user requests)
  • Feature data must be queryable in milliseconds — this is often the actual bottleneck, not the inference itself
  • Data must be kept current via streaming pipelines (e.g. Kafka, Pub/Sub) into a fast store
  • Everything must be failsafe with graceful fallbacks — if the model fails, the user shouldn’t notice
  • Must scale horizontally under load
  • Cloud costs can spike quickly — every query, every byte of storage adds up
  • Massive amounts of tracking data for model inferences need to be collected and stored while under full load
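A few of the items above (the latency budget and the graceful fallback) can be sketched in a single request handler. `load_features` and `run_inference` are hypothetical stand-ins for the feature store and the model:

```python
import time

FALLBACK_RECOMMENDATIONS = [101, 102, 103]  # e.g. precomputed bestsellers

def load_features(user_id):
    # In production this hits the millisecond-latency feature store;
    # here it just returns dummy features.
    return [0.2, 0.7, 0.1]

def run_inference(features):
    # Stand-in for the in-memory model; would raise on a model failure.
    return [205, 118, 342]

def recommend(user_id, budget_ms=50.0):
    """Serve a recommendation within the latency budget, with graceful fallback."""
    start = time.perf_counter()
    try:
        features = load_features(user_id)
        result = run_inference(features)
    except Exception:
        return FALLBACK_RECOMMENDATIONS  # the user never sees a model failure
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > budget_ms:
        # Over budget: log it and fall back rather than blocking the page.
        return FALLBACK_RECOMMENDATIONS
    return result

print(recommend(user_id=42))  # → [205, 118, 342]
```

The key design choice: the fallback path is always valid content, so a model outage degrades quality, not availability.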

When you can’t avoid online predictions:

  • When real-time data is needed as model input (e.g. products viewed in the current session)
  • When the number of feature combinations is too large to pre-calculate — you can’t pre-compute every possibility

What engineering typically needs to build:

  • A feature store with millisecond read latency, fed by streaming data. For a large e-commerce site like Digitec-Galaxus, expect hundreds of GB if not TBs of data; collections/tables can contain millions of documents/rows each.
  • Model serving infrastructure (load & validation, version, swap models without downtime)
  • A distributed caching layer for frequently requested feature data (not for prediction results)
  • Monitoring for prediction latency, error rates, and model quality (drift & skew detection)
  • Logging of predictions and inputs for future model retraining
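The caching layer from the list above can start as an in-process TTL cache in front of the feature store. A minimal sketch, assuming a slow backing lookup behind the `loader` callback; a production system would use a distributed cache instead:

```python
import time

class FeatureCache:
    """Tiny in-process TTL cache for hot feature data (illustrative sketch)."""

    def __init__(self, ttl_seconds=30.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (inserted_at, value)

    def get(self, key, loader):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]            # fresh cache hit, no feature-store query
        value = loader(key)          # fall through to the feature store
        self._store[key] = (now, value)
        return value

cache = FeatureCache(ttl_seconds=30.0)
calls = []

def slow_feature_lookup(key):
    calls.append(key)                # count how often we really hit the store
    return {"views_last_hour": 12}

cache.get("product:7", slow_feature_lookup)
cache.get("product:7", slow_feature_lookup)  # second call served from cache
print(len(calls))  # → 1
```

Caching feature data (not predictions) keeps the cache valid across model versions and directly cuts the per-query data-access cost mentioned earlier.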

Decision Framework: Offline or Online?

Default to offline if you can. Challenge requirements so you don’t need real-time online ML predictions. Two factors push you toward online:

  1. Data freshness — if you need very current data, you can’t pre-calculate every few hours. At some point, you have to do it in real-time.

  2. Granularity — if the model has many features or a feature has a wide range of values, pre-calculating all combinations becomes infeasible. But if the model only needs a product ID and an age group, you can pre-calculate rows like (product_id, age_group) → recommendation — even millions of rows are fine for today’s cloud databases.
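A quick back-of-the-envelope check makes the granularity argument concrete. The counts below are made up, but the orders of magnitude are typical:

```python
# Rough feasibility check: can we pre-calculate all feature combinations offline?
n_products = 50_000
n_age_groups = 4

coarse = n_products * n_age_groups
print(coarse)  # → 200000 rows: easy to pre-calculate and store in a cloud database

# Add one fine-grained feature (say, 10_000 distinct session contexts)
# and the combination count explodes:
n_session_contexts = 10_000
fine = coarse * n_session_contexts
print(fine)  # → 2000000000 rows: pre-calculation is no longer feasible, go online
```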

Tips for Online ML

  • Don’t use slow or costly models in online scenarios
    • Push for an efficient model that fits the job, not the fanciest, newest thing
    • Quantize the model for production (data scientists can do this after training, but often you need to tell them to)
  • Custom ML models can be quantized — this reduces model size dramatically and makes inference fast
  • Remember: the bottleneck is usually data loading, not inference
  • Design your feature store for millisecond lookups with streaming updates

Drift vs. Skew — Know the Difference

Two terms you’ll hear from data scientists that matter for engineering:

Model drift means the model’s predictions get worse over time because the real world changed. The patterns it learned during training no longer match reality. Example: a recommendation model trained on pre-pandemic shopping behaviour performs poorly after consumer habits shifted. Engineering’s job: monitor prediction quality and trigger retraining when metrics degrade.

Model skew is a mismatch between training data and serving data — same point in time, but the data looks different due to technical inconsistencies. Example: during training the feature “user_age” was a float, but in production it arrives as an integer. Engineering’s job: ensure the feature pipeline produces identical data formats and semantics in both offline training and online serving.

In short: drift = the world changed, skew = your pipeline has a bug.
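A cheap guard against skew is to validate live serving rows against the schema the model was trained with. The field names and the dtype mapping below are illustrative:

```python
# Schema the model was trained with (hypothetical).
TRAINING_SCHEMA = {"user_age": "float", "product_id": "int"}

def find_skew(serving_row):
    """Return mismatches between the training schema and a live serving row."""
    type_names = {int: "int", float: "float", str: "str"}
    mismatches = []
    for field, expected in TRAINING_SCHEMA.items():
        if field not in serving_row:
            mismatches.append(f"{field}: missing in serving data")
            continue
        actual = type_names.get(type(serving_row[field]), "unknown")
        if actual != expected:
            mismatches.append(f"{field}: trained on {expected}, serving sends {actual}")
    return mismatches

# The exact bug from the example above: user_age arrives as an integer.
print(find_skew({"user_age": 34, "product_id": 7}))
# → ['user_age: trained on float, serving sends int']
```

Running a check like this in the serving path (or on sampled traffic) turns a silent quality problem into a loud, debuggable error.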

What Does “Quantize a Model” Mean?

ML models store their learned knowledge as millions of numerical weights — typically as 32-bit floating point numbers. Quantization reduces the precision of these weights, for example from 32-bit floats to 8-bit integers. The model file gets dramatically smaller (often 4x), loads faster, and runs inference quicker — with only a small, often negligible loss in prediction accuracy.

Why this matters for engineering: a quantized model uses less memory, starts faster (critical for scaling and cold starts), and is cheaper to serve. Data scientists can quantize after training, but they often don’t think about it unless you ask. Make it part of your handover checklist.
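To illustrate the idea, here is a toy post-training quantization of a handful of weights in plain Python. Real pipelines would use framework tooling (for example ONNX Runtime or TensorFlow Lite quantization); this only shows the core mapping and the precision trade-off:

```python
# Toy post-training quantization: map float-precision weights to int8 and back.
weights = [0.82, -1.54, 0.03, 2.71, -0.66]

scale = max(abs(w) for w in weights) / 127  # map the largest weight to ±127

def quantize(ws):
    # 8-bit integers: a quarter the storage of 32-bit floats.
    return [round(w / scale) for w in ws]

def dequantize(qs):
    return [q * scale for q in qs]

q = quantize(weights)
restored = dequantize(q)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)                    # → [38, -72, 1, 127, -31]
print(round(max_error, 3))  # → 0.009 (small reconstruction error)
```

The file shrinks roughly 4x while each weight is reconstructed to within about 1% of the dynamic range, which is why the accuracy loss is usually negligible.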

Closing Thoughts

Neither data science nor engineering can deliver ML value alone. Collaborate early, push back when something won’t work in production, and stay in regular contact — not just at handover points. A tricky feature for engineering might have a simpler alternative that data science can explore while still in research. The best ML systems are built by teams where both sides understand each other’s constraints.