@AppenResearch independently evaluated
@subquadratic's SSA kernel, a learned sparse attention mechanism designed to mitigate the quadratic scaling cost of full attention.
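For context on the general technique (the report doesn't publish the kernel internals, so everything below is an illustrative sketch, not the SSA implementation): learned block-sparse attention scores blocks of keys per block of queries and attends only to the highest-scoring ones. The function name, the mean-pooled block scorer, and the `block_size`/`top_k` values here are all assumptions for illustration.

```python
# Illustrative sketch of learned block-sparse attention -- NOT the SSA kernel.
import torch
import torch.nn.functional as F

def learned_block_sparse_attention(q, k, v, block_size=64, top_k=8):
    """q, k, v: (batch, heads, seq_len, head_dim); seq_len divisible by block_size."""
    B, H, N, D = q.shape
    nb = N // block_size
    qb = q.view(B, H, nb, block_size, D)
    kb = k.view(B, H, nb, block_size, D)
    vb = v.view(B, H, nb, block_size, D)

    # Coarse block summaries (mean pooling) stand in for a learned scorer.
    q_sum = qb.mean(dim=3)                               # (B, H, nb, D)
    k_sum = kb.mean(dim=3)                               # (B, H, nb, D)
    block_scores = q_sum @ k_sum.transpose(-1, -2)       # (B, H, nb, nb)

    # Keep only the top-k key blocks per query block, so per-query work is
    # O(top_k * block_size) instead of O(N).
    top_idx = block_scores.topk(min(top_k, nb), dim=-1).indices  # (B, H, nb, k)

    out = torch.zeros_like(qb)
    for i in range(nb):  # a real kernel fuses this loop; kept explicit for clarity
        idx = top_idx[:, :, i]                                    # (B, H, k)
        gather = idx[..., None, None].expand(-1, -1, -1, block_size, D)
        k_sel = kb.gather(2, gather).flatten(2, 3)                # (B, H, k*block_size, D)
        v_sel = vb.gather(2, gather).flatten(2, 3)
        attn = F.softmax(qb[:, :, i] @ k_sel.transpose(-1, -2) / D**0.5, dim=-1)
        out[:, :, i] = attn @ v_sel
    return out.view(B, H, N, D)
```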
Key results (context lengths up to 1M tokens):
- 56.2× wall-clock speedup vs. FlashAttention-2 (FA2)
- 62.8× FLOP reduction, validated via torch.profiler with <4% variance from the theoretical count (measurement sketch below)
- 95.6% average score across RULER tasks at 128K
- 86.2% average score on the hardest MRCR 8-needle bucket (512K–1M contexts)
- 81.8% SWE-Bench Verified resolved rate
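As referenced in the FLOP-reduction bullet, operator-level FLOP counts can be collected with torch.profiler's `with_flops=True` option and compared against a hand-derived theoretical count. The shapes and the plain matmul attention below are placeholders, not the evaluated SSA kernel or the report's exact methodology.

```python
# Sketch of FLOP measurement with torch.profiler; placeholder workload only.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(1, 8, 4096, 64, device=device)
k, v = torch.randn_like(q), torch.randn_like(q)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             with_flops=True) as prof:
    # Replace with the kernel under test; explicit matmuls shown so the
    # profiler's FLOP formulas (mm/bmm) apply.
    attn = (q @ k.transpose(-2, -1) / 64 ** 0.5).softmax(dim=-1)
    out = attn @ v

# Sum operator-level FLOP estimates and compare against the theoretical count.
measured_flops = sum(evt.flops for evt in prof.key_averages() if evt.flops)
print(f"measured GFLOPs: {measured_flops / 1e9:.2f}")
```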
Full report: