Michael Guo (@Michaelzsguo): 1d 13h 20m, 3,596,831 tokens. Goal achieved? Not quite. It was a hard problem. The agent tried its best and went through 20 full model/eval rounds. In the end, the agent talked itself out of the original contract and declared the goal achieved. I probably would have stopped it anyway, since I could also see from the sidecar that it was struggling. Still, it was a good experiment. My 14" MacBook Pro held up well under a sustained run, with no throttling or heating issues. Qwen3.6 35B A3B OptiQ 4-bit running locally on MLX also held up well. It generated thousands of training data samples, averaging around 50 tokens per second with reasonably good quality. Very impressive. DeepSeek 4 Pro was a good teacher for the training, though there are still areas for improvement. The end result: we LoRAed an expert model, Qwen3-4B-Instruct-2507 + MLX LoRA. We produced a compact 56 MB LoRA adapter on a 4B Qwen base that reaches ~59% three-way decision agreement on the original eval slice, ~91% violation recall, and ~98% valid JSON, but with a high false-positive rate. It is deployable, but probably not quite usable yet. Still, it gives me a clear direction for where to go next. I’ll write more about the whole process later. Stay tuned.
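
For readers who want to see how the four numbers above relate to each other, here is a minimal sketch of how they could be computed. It is not the actual eval harness from this run. The label set (`allow`, `review`, `violation`), the `decision` field name, and the `score` and `parse_output` helpers are all assumptions made up for illustration; the post does not say what the three decisions are or how the model output is structured.

```python
import json

# Hypothetical three-way label set; the real labels are not given in the post.
LABELS = ("allow", "review", "violation")
POSITIVE = "violation"


def parse_output(raw):
    """Return (is_valid_json, decision) for one raw model output.

    An output counts as valid JSON only if it parses to a JSON object.
    The decision is None when the (assumed) "decision" field is missing
    or is not one of LABELS.
    """
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False, None
    if not isinstance(obj, dict):
        return False, None
    decision = obj.get("decision")
    if isinstance(decision, str) and decision in LABELS:
        return True, decision
    return True, None


def score(raw_outputs, gold_labels):
    """Compute valid-JSON rate, three-way agreement, recall, and FPR."""
    if len(raw_outputs) != len(gold_labels) or not gold_labels:
        raise ValueError("need two non-empty lists of equal length")

    parsed = [parse_output(raw) for raw in raw_outputs]
    preds = [decision for _, decision in parsed]
    n = len(gold_labels)

    # Predictions split by whether the gold label is a violation.
    on_pos = [p for p, g in zip(preds, gold_labels) if g == POSITIVE]
    on_neg = [p for p, g in zip(preds, gold_labels) if g != POSITIVE]

    def flagged_fraction(group):
        # Fraction of the group that the model flagged as a violation.
        if not group:
            return None
        return sum(p == POSITIVE for p in group) / len(group)

    return {
        "valid_json_rate": sum(ok for ok, _ in parsed) / n,
        # Unparseable or unknown outputs count as disagreement.
        "three_way_agreement": sum(p == g for p, g in zip(preds, gold_labels)) / n,
        "violation_recall": flagged_fraction(on_pos),
        "false_positive_rate": flagged_fraction(on_neg),
    }


if __name__ == "__main__":
    raw = [
        '{"decision": "violation"}',
        '{"decision": "violation"}',
        '{"decision": "allow"}',
        "not json",
        '{"decision": "review"}',
    ]
    gold = ["violation", "allow", "allow", "violation", "review"]
    print(score(raw, gold))
```

Under these definitions, a model that flags almost everything as a violation gets high recall and a high false-positive rate at the same time, while its three-way agreement stays low. That is the same shape as the ~91% recall and ~59% agreement reported above.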

2026.05.16 17:05
