Making black-box AI transparent and controllable
Turning weak supervision and heuristic creation into patterns data scientists and domain experts can actually steer.
01 The problem
Jaxon builds domain-specific AI tuned for high accuracy and minimal hallucination. Its users are data scientists and the domain SMEs they lean on (lawyers, clinicians, and analysts), who together label training data and align a model to a narrow domain. The problem was that the model was a black box. SMEs wrote heuristics whose effect they couldn't see, data scientists couldn't tell which labeling functions were dragging accuracy down, and nobody trusted a model they couldn't audit.
This is workflow-heavy enterprise work, not a dashboard. Weak supervision is a multi-step loop: define labeling functions, run them across an unlabeled corpus, inspect where they conflict or abstain, resolve the conflicts, retrain, and audit confidence, then repeat. My job was to make every step of that loop legible and steerable for two very different kinds of user.
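To make the loop concrete, here is a minimal sketch of its first three steps: define labeling functions, run them across a corpus, and inspect where they conflict or abstain. The function names, labels, and sample clauses are all hypothetical illustrations, not Jaxon's actual heuristics or API.

```python
from typing import Callable, Optional

# Assumption: a labeling function returns a label string, or None to abstain.
ABSTAIN = None
LabelingFunction = Callable[[str], Optional[str]]


def lf_mentions_indemnify(text: str) -> Optional[str]:
    return "liability" if "indemnify" in text.lower() else ABSTAIN


def lf_mentions_terminate(text: str) -> Optional[str]:
    return "termination" if "terminate" in text.lower() else ABSTAIN


def lf_mentions_breach(text: str) -> Optional[str]:
    return "liability" if "breach" in text.lower() else ABSTAIN


def apply_functions(corpus, functions):
    """Run every labeling function on every row; one {name: vote} dict per row."""
    return [{fn.__name__: fn(text) for fn in functions} for text in corpus]


def row_status(votes):
    """Classify a row by how its labeling functions voted."""
    labels = {v for v in votes.values() if v is not ABSTAIN}
    if not labels:
        return "abstain"  # every function declined to label the row
    return "conflict" if len(labels) > 1 else "agree"


# Hypothetical unlabeled corpus of contract clauses.
corpus = [
    "Supplier shall indemnify the Buyer against all claims.",
    "Either party may terminate on material breach.",
    "This agreement is governed by the laws of Delaware.",
]
functions = [lf_mentions_indemnify, lf_mentions_terminate, lf_mentions_breach]

votes = apply_functions(corpus, functions)
statuses = [row_status(v) for v in votes]
print(statuses)
```

The second clause trips two functions with different labels, and the third trips none. Those are the conflict and abstain cases the interface had to make visible.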
02 My role
I was the product design partner and owned the labeling and confidence-audit experience end to end: the interaction model, the system states, and the pattern library the engineers built against. I drove how weak supervision and heuristic creation surfaced in the UI, ran the SME interviews that shaped those decisions, and iterated the flows directly with the founding engineer and PM. I did not own the model internals. I owned how humans see, question, and correct them.
03 The approach
I started with interviews and think-alouds with data scientists and two SMEs on a live customer engagement. The clearest signal: SMEs didn't think in confidence scores; they thought in examples. Watching an SME hesitate over a 0.62 score changed the design. I moved from leading with raw probabilities to leading with concrete labeled examples per heuristic, with confidence as supporting evidence rather than the headline.
For v1 I deliberately scoped to the audit-and-correct loop and cut automated heuristic suggestion. It was tempting, but SMEs told us they wouldn't trust a suggestion they couldn't trace, and building trust in the manual loop had to come first. I also designed the states a happy-path demo skips: empty states for a corpus with no labeling functions yet, the abstain state where every function declines to label a row, conflict states where functions disagree, and loading states for retraining runs that take minutes. Confidence was never encoded by color alone. I paired every band with a label and a shape so it held up for color-blind users and in high-contrast mode.
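A rough sketch of how those states and bands could be modeled follows. The thresholds, band names, shapes, and the order in which states are checked are my assumptions for illustration, not the values that shipped.

```python
from dataclasses import dataclass
from enum import Enum


class ViewState(Enum):
    EMPTY = "empty"        # corpus has no labeling functions yet
    LOADING = "loading"    # a retraining run is in progress
    ABSTAIN = "abstain"    # every function declined to label the row
    CONFLICT = "conflict"  # functions disagree on the row
    LABELED = "labeled"    # functions that voted all agree


def row_view_state(function_count, retraining, votes):
    """Pick the state to render for a row. The precedence is an assumption."""
    if function_count == 0:
        return ViewState.EMPTY
    if retraining:
        return ViewState.LOADING
    labels = {v for v in votes if v is not None}
    if not labels:
        return ViewState.ABSTAIN
    return ViewState.CONFLICT if len(labels) > 1 else ViewState.LABELED


@dataclass(frozen=True)
class Band:
    label: str  # text shown next to the value
    shape: str  # redundant non-color cue
    color: str  # color is one cue among three, never the only one


# Hypothetical thresholds, checked from highest floor to lowest.
BANDS = [
    (0.8, Band("High", "circle", "green")),
    (0.5, Band("Medium", "triangle", "amber")),
    (0.0, Band("Low", "square", "red")),
]


def band_for(confidence):
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    for floor, band in BANDS:
        if confidence >= floor:
            return band


band = band_for(0.62)
print(band.label, band.shape)
```

Because each band carries a label and a shape as well as a color, dropping the color channel entirely still leaves two cues that distinguish it.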
04 What I built
A platform interface for programmatic labeling, confidence auditing, and SME/data-scientist collaboration on custom AI: a heuristic builder, a conflict-resolution view that shows where labeling functions disagree, and a confidence-audit surface that ties each model prediction back to the functions that produced it.
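The tracing idea behind the confidence-audit surface can be sketched as below. This stand-in uses a simple majority vote, not the trained model the product actually audits, and its "confidence" is only the vote share. The function names and labels are hypothetical.

```python
from collections import Counter
from typing import Optional


def predict_with_trace(votes: dict[str, Optional[str]]) -> dict:
    """Majority-vote a label and record which functions backed or opposed it."""
    cast = {name: label for name, label in votes.items() if label is not None}
    if not cast:
        return {"label": None, "confidence": 0.0, "supporting": [], "dissenting": []}

    tally = Counter(cast.values())
    # Ties resolve to the alphabetically first label so results are repeatable.
    label, count = max(sorted(tally.items()), key=lambda item: item[1])

    return {
        "label": label,
        "confidence": count / len(cast),  # vote share, not a model probability
        "supporting": sorted(n for n, v in cast.items() if v == label),
        "dissenting": sorted(n for n, v in cast.items() if v != label),
    }


result = predict_with_trace({
    "lf_indemnify": "liability",
    "lf_breach": "liability",
    "lf_terminate": "termination",
    "lf_governing_law": None,  # abstained, so it appears in neither list
})
print(result["label"], result["supporting"], result["dissenting"])
```

Keeping the supporting and dissenting lists alongside the label is what lets a reviewer jump from a prediction to the heuristics behind it.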
Working with engineering, we hit a real constraint. Retraining couldn't run synchronously, so confidence updates lagged the user's edits by a full training cycle. Rather than fake instant feedback, I designed an explicit 'stale, pending retrain' state on affected rows with a job indicator, so users always knew whether they were looking at current or pending confidence. That negotiation with the PM and engineer traded a snappier feel for honesty about model state, which mattered more to a distrustful audience.
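One way to model that state is sketched below, assuming a simple edit counter rather than whatever versioning the real system used. All class and field names are hypothetical. A row is stale whenever an edit touched it after the snapshot its confidence was computed from.

```python
from dataclasses import dataclass


@dataclass
class Row:
    row_id: str
    confidence: float
    scored_at: int = 0  # edit counter value the current confidence reflects
    last_edit: int = 0  # edit counter value of the latest edit touching this row

    @property
    def status(self) -> str:
        return "stale, pending retrain" if self.last_edit > self.scored_at else "current"


class AuditSession:
    def __init__(self, rows):
        self.rows = {row.row_id: row for row in rows}
        self.edit_counter = 0
        self.job_running = False  # drives the job indicator

    def record_edit(self, affected_ids):
        """A heuristic edit marks the rows it touches and queues a retrain."""
        self.edit_counter += 1
        for row_id in affected_ids:
            self.rows[row_id].last_edit = self.edit_counter
        self.job_running = True

    def complete_retrain(self, trained_through, new_confidences):
        """Apply results of a retrain that saw edits up to `trained_through`."""
        for row in self.rows.values():
            row.confidence = new_confidences.get(row.row_id, row.confidence)
            row.scored_at = trained_through
        # Edits made while the job ran still need another retrain.
        self.job_running = self.edit_counter > trained_through


session = AuditSession([Row("a", 0.62), Row("b", 0.91)])
session.record_edit(["a"])  # edit 1: retrain starts from this snapshot
session.record_edit(["b"])  # edit 2: lands while the retrain is running
session.complete_retrain(trained_through=1, new_confidences={"a": 0.88})
print({row_id: row.status for row_id, row in session.rows.items()})
```

Row "a" returns to current with its new score, while row "b" stays flagged because its edit arrived after the retrain's snapshot. The old number is never presented as fresh.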
05 Outcome
V1 shipped, and SMEs could finally trace a prediction back to the heuristic behind it, which is what moved them from spectators to participants. The clearest post-launch finding was that conflicting labeling functions were the real accuracy killer, and the original conflict view buried them in a long list. For v2 I promoted conflicts into a prioritized queue that surfaced the highest-impact disagreements first. After that change, teams stopped treating conflict resolution as an afterthought and worked it deliberately, and data scientists reported reaching a trusted model in noticeably fewer retraining cycles than before.
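A minimal sketch of such a queue is below. It assumes "impact" means the number of rows on which a pair of labeling functions disagree. That metric is my illustration, not necessarily the ranking the product used, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def conflict_queue(votes_per_row):
    """Rank pairs of labeling functions by how many rows they disagree on."""
    impact = Counter()
    for votes in votes_per_row:
        # Drop abstentions; sort by name so each pair has one canonical key.
        cast = sorted((name, label) for name, label in votes.items() if label is not None)
        for (name_a, label_a), (name_b, label_b) in combinations(cast, 2):
            if label_a != label_b:
                impact[(name_a, name_b)] += 1
    # Highest impact first; ties broken by name for a stable order.
    return sorted(impact.items(), key=lambda item: (-item[1], item[0]))


votes_per_row = [
    {"lf_a": "x", "lf_b": "y", "lf_c": None},
    {"lf_a": "x", "lf_b": "y", "lf_c": "x"},
    {"lf_a": "x", "lf_b": "x", "lf_c": "y"},
]

queue = conflict_queue(votes_per_row)
for pair, rows_affected in queue:
    print(pair, rows_affected)
```

Sorting by rows affected puts the disagreements that touch the most data at the top, rather than leaving them wherever they fall in a flat list.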
06 Reflection
I over-invested early in visualizing raw confidence distributions because they looked impressive in demos, and SMEs mostly ignored them. If I did it again I'd lead with examples and conflicts from day one and treat the probability charts as a power-user detail, not the centerpiece.
Interfaces
The interface that shipped.
5 screens from the work.