1d 13h 20m, 3,596,831 tokens.
Goal achieved? Not quite.
It was a hard problem. The agent tried its best and went through 20 full model/eval rounds.
In the end, the agent talked itself out of the original contract and declared the goal achieved. I probably would have stopped it anyway, since I could also see from the sidecar that it was struggling.
Still, it was a good experiment.
My 14" MacBook Pro held up well under the sustained run, with no throttling or overheating.
Qwen3.6 35B A3B OptiQ 4-bit running locally on MLX also held up well. It generated thousands of training data samples, averaging around 50 tps with reasonably good quality. Very impressive.
DeepSeek 4 Pro was a good teacher for the training, though there are still areas for improvement.
The end result: a LoRA-tuned expert model, Qwen3-4B-Instruct-2507 with an MLX LoRA adapter.
We produced a compact 56 MB LoRA adapter on a 4B Qwen base that reaches ~59% three-way decision agreement on the original eval slice, ~91% violation recall, and ~98% valid JSON, but with a high false-positive rate.
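For reference, here is a minimal sketch of how metrics like these could be computed. It is not the actual eval harness: the label names (`violation`, `ok`, `unclear`), the `decision` JSON field, and the `score` helper are all assumptions made up for illustration.

```python
import json

# Hypothetical label for the "violation" class. The post only says the
# decision is three-way and that one class is a violation.
VIOLATION = "violation"


def parse_output(raw):
    """Return (is_valid_json, decision) for one raw model output string."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False, None
    # Valid JSON that is not an object still counts as valid JSON,
    # but carries no decision. "decision" is an assumed field name.
    decision = obj.get("decision") if isinstance(obj, dict) else None
    return True, decision


def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def score(outputs, gold):
    """Compute agreement, violation recall, false-positive rate, and valid-JSON rate."""
    parsed = [parse_output(o) for o in outputs]
    preds = [decision for _, decision in parsed]

    gold_pos = sum(g == VIOLATION for g in gold)
    gold_neg = len(gold) - gold_pos
    true_pos = sum(p == VIOLATION and g == VIOLATION for p, g in zip(preds, gold))
    false_pos = sum(p == VIOLATION and g != VIOLATION for p, g in zip(preds, gold))

    return {
        # Exact match across all three decision classes.
        "agreement": ratio(sum(p == g for p, g in zip(preds, gold)), len(gold)),
        # Share of true violations that the model flagged.
        "violation_recall": ratio(true_pos, gold_pos),
        # Share of non-violations that the model wrongly flagged.
        "false_positive_rate": ratio(false_pos, gold_neg),
        # Share of outputs that parse as JSON at all.
        "valid_json": ratio(sum(valid for valid, _ in parsed), len(outputs)),
    }


# Tiny made-up example, not data from the run.
example_gold = ["violation", "ok", "unclear", "ok"]
example_outputs = [
    '{"decision": "violation"}',
    '{"decision": "violation"}',
    '{"decision": "unclear"}',
    "not json",
]
metrics = score(example_outputs, example_gold)
```

Under these definitions, high violation recall combined with a high false-positive rate means the model flags most real violations but also flags many non-violations, which is what drags the three-way agreement down.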
It is deployable, but probably not quite usable yet. Still, it gives me a clear direction for where to go next.
I’ll write more about the whole process later. Stay tuned.