A workbench for model tuning at Lilly
Turning dataset contribution, model training, and inference into one workflow a researcher can actually move through.
01 The problem
TuneLab is an internal Eli Lilly platform where researchers contribute structured datasets, train and run inference models on drug data, and review results at scale. The users are computational scientists working under real pressure, and a single tuning task spans multiple stages: ingest an assay dataset, configure a training or inference job, monitor it, then judge whether the output can be trusted. Most tooling in this space dumps all of that on one screen, which buries the one decision the person needs to make right now.
The stakes are reproducibility and regulatory defensibility. If a scientist can't trace which datasets and preprocessing produced a model, the result is unusable no matter how good the science is. The job was to turn a dense, multi-step scientific process into a system a researcher can actually move through without losing that traceability.
02 My role
I was the product designer and owned the interface across all three core workflows: dataset contribution, model training and inference, and results review. I drove the interaction model, the information architecture, and the shared UI system, and I worked in a tight loop with the Rhino Federated Computing engineering team so that what shipped matched the designs instead of drifting from the specs.
03 The approach
I broke each workflow into sequential steps and used progressive disclosure so a step reveals only what matters at that moment. Early on I sat with Lilly computational scientists and walked through their existing process; the recurring complaint was that they'd start a job, lose track of it, and have no idea why a model behaved the way it did. That reframed the work: the throughline became visibility and traceability, not just a cleaner form. Watching them stumble over assay terminology also pushed me to keep domain language intact rather than abstracting it into generic UI.
For v1 I deliberately scoped down. I cut side-by-side experiment comparison and a richer model-lineage graph, because engineering flagged that the metadata needed to power them wasn't reliably captured yet, and a comparison view built on incomplete data would erode the trust the whole product depended on. I'd rather ship fewer surfaces that are honest than more that mislead. I also designed the system states a long-running scientific job actually produces: queued, running, partial-failure, and failed states with recoverable error messaging, empty states that tell a first-time contributor what to upload, loading states for large result sets, and keyboard-navigable, screen-reader-labeled tables so dense data stays accessible.
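As a rough illustration, the job states above could be modeled as a discriminated union so that every state is forced to have its own message. This is a sketch under stated assumptions, not TuneLab's actual code: the type name `JobState`, the fields `failedSteps`, `reason`, and `retryable`, the extra `finished` state, and all message wording are hypothetical.

```typescript
// Hypothetical model of the job states named above; not TuneLab's real schema.
type JobState =
  | { kind: "queued" }
  | { kind: "running" }
  | { kind: "partial-failure"; failedSteps: string[] }
  | { kind: "failed"; reason: string; retryable: boolean }
  | { kind: "finished" }; // assumed terminal success state

// Map each state to a user-facing message. The switch is exhaustive, so
// adding a new state without a message becomes a compile-time error.
function statusMessage(state: JobState): string {
  switch (state.kind) {
    case "queued":
      return "Queued: waiting to start.";
    case "running":
      return "Running.";
    case "partial-failure":
      return `Completed with errors in: ${state.failedSteps.join(", ")}.`;
    case "failed":
      return state.retryable
        ? `Failed: ${state.reason}. You can retry this job.`
        : `Failed: ${state.reason}.`;
    case "finished":
      return "Finished.";
    default: {
      // Unreachable while every kind above is handled.
      const unhandled: never = state;
      return unhandled;
    }
  }
}

// Usage: a recoverable failure tells the researcher what to do next.
console.log(
  statusMessage({ kind: "failed", reason: "dataset unavailable", retryable: true })
);
```

Keeping the recovery hint inside the state itself is one way to make sure error messaging stays recoverable by construction rather than by convention.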
04 What I built
I shipped the dataset contribution flow (with assay-protocol support), the inference and training job builder, the unified results views, a Job Runs surface that monitors status across experiments in real time, and a Model Catalog that surfaces the metadata a scientist needs to trust a model: training datasets, preprocessing methods, and evaluation context.
The Model Catalog was shaped directly by an engineering constraint. Model metadata arrived asynchronously from the training pipeline, so a record could exist before its lineage was fully populated. Rather than hide those records or block on them, I designed a 'metadata pending' state so a model is visible and usable while its provenance backfills, with the trust-critical fields clearly marked as incomplete until confirmed. All of it draws on a shared component system so forms, tables, and detail panels stay consistent as the product grows.
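One way to picture the 'metadata pending' state is as a record whose trust-critical fields are each either pending or confirmed, so the UI can show the model right away and mark exactly which fields are still backfilling. The sketch below is illustrative only: `ModelRecord`, `Provenance`, the field names, and the badge labels are assumptions, not the actual TuneLab data model.

```typescript
// Hypothetical shapes; field names are illustrative, not TuneLab's schema.
type Provenance<T> =
  | { status: "pending" }
  | { status: "confirmed"; value: T };

interface ModelRecord {
  id: string;
  name: string;
  trainingDatasets: Provenance<string[]>;
  preprocessing: Provenance<string[]>;
  evaluationContext: Provenance<string>;
}

// The trust-critical fields named in the text above.
const TRUST_FIELDS = ["trainingDatasets", "preprocessing", "evaluationContext"] as const;
type TrustField = (typeof TRUST_FIELDS)[number];

// Which trust-critical fields have not arrived from the pipeline yet.
function pendingFields(model: ModelRecord): TrustField[] {
  return TRUST_FIELDS.filter((field) => model[field].status === "pending");
}

// The record is always listed; the badge only reflects how complete it is.
function catalogBadge(model: ModelRecord): "Metadata pending" | "Metadata confirmed" {
  return pendingFields(model).length > 0 ? "Metadata pending" : "Metadata confirmed";
}

// Usage: a model whose lineage is only partly populated.
const example: ModelRecord = {
  id: "model-001",
  name: "example-model",
  trainingDatasets: { status: "confirmed", value: ["assay-set-a"] },
  preprocessing: { status: "pending" },
  evaluationContext: { status: "pending" },
};
console.log(catalogBadge(example), pendingFields(example));
```

Modeling "pending" as an explicit value, rather than a missing or null field, is what lets the interface distinguish "not yet confirmed" from "does not exist".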
05 Outcome
TuneLab shipped as the workbench Lilly researchers use to contribute data and run tuning jobs, and the design-to-development loop kept the built product close to intent. After launch, researchers told us the Job Runs view was where they lived, but v1 only showed live status; the moment a job finished or failed, the reasoning was gone. So in v2 I added a run history and per-run detail with the full config and error trail, which is what let scientists diagnose failures themselves instead of filing tickets to the data team.
Directionally, the sequential flows and honest states cut the back-and-forth that used to define this work: setup that had taken a scientist most of a day became something they could complete in one sitting, and questions to the data team about job status and model provenance fell noticeably once the catalog and run history answered them in-product.
06 Reflection
Cutting experiment comparison from v1 was the right call given the data quality, but I underestimated how quickly researchers would want it once they trusted the results view. If I did it again I'd design the comparison interaction in parallel and gate it behind metadata readiness, so the pattern would be ready the moment the pipeline could support it rather than becoming a v3 scramble.
Interfaces
The interface that shipped.
34 screens from the work.