Open-sourced Research-LLM (MIT): real StructureGuard T1 runs on long-context (~120k) synthetic corpora — committed JSON/MD, raw completions, failure analysis.
Same corpus + format enforcement (2026-06-04): grok-3 & claude-sonnet-4-6 IPR 1.0; gemini-2.5-flash & gpt-4o-mini IPR 0.0.
Baseline without enforcement: only grok-3 scored 1.0 (OpenAI row used gpt-4o-mini, not gpt-4o).