Lei Wang (@alphalein) — X Web Viewer

Lei Wang Reposted

𝗿𝗮𝗺𝗮𝗸𝗿𝘂𝘀𝗵𝗻𝗮— 𝗲/𝗮𝗰𝗰@techwith_ram

2026.03.27 12:00

0

16

638

83

Lei Wang Reposted

alphaXiv@askalphaxiv

2026.05.07 06:03

A new paper from the Anthropic Fellows Program! "Model Spec Midtraining: Improving How Alignment Training Generalizes" A lot of alignment training teaches models what to say, but not why those behaviors are right. So before normal alignment fine-tuning (AFT), this research trains the model on synthetic documents that discuss its Model Spec, including its values, rules, and reasoning (Model Spec Midtraining, MSM), then does the usual supervised alignment. On agentic misalignment evals, MSM + AFT cuts misalignment from 68% -> 5% on Qwen2.5-32B and 54% -> 7% on Qwen3-32B, beating the baseline. The gain shows up especially out of distribution, where normal “say the aligned thing” training can look good in QA but break in harder scenarios.

0

6

130

22

Lei Wang Reposted

elvis@omarsar0

2026.02.08 14:48

NEW research from FAIR at Meta, Cornell, and CMU. This paper is a bigger deal than it seems. Apparently, you don't need billions of parameters to teach an AI model to reason.

The default approach to post-training language models for reasoning today remains finetuning millions or even billions of parameters. But what if the signal needed for reasoning is far sparser than we assume?

This new research introduces TinyLoRA, a method that scales low-rank adapters down to as few as a single trainable parameter. Using TinyLoRA with RL, they trained Qwen2.5-7B to 91% accuracy on GSM8K with only 13 parameters in bf16. That's 26 total bytes.

So what's the idea? RL and SFT require fundamentally different amounts of model capacity. SFT must absorb the full demonstration, encoding both task-relevant structure and irrelevant noise into the update. RL receives a sparser, cleaner signal. The reward separates what matters from what doesn't, so resampling amplifies useful information while noise cancels out.

Here are the results: On GSM8K, models trained with GRPO reach 90% accuracy with fewer than 100 parameters. Models of the same capacity trained with SFT barely outperform the base model. On harder benchmarks like MATH500, AIME, and AMC, finetuning just 196 parameters retains 87% of the absolute performance improvement averaged across six benchmarks.

The trend scales with model size, too. Larger models need proportionally smaller updates, suggesting trillion-scale models may be trainable for many tasks with just a handful of parameters.

The key takeaway is that reasoning may already live inside pretrained models. RL doesn't inject new knowledge; it surfaces what's already there, and it can do so with almost no parameter change at all.

Paper:
Learn to build effective AI agents in our academy:
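To make the size claim in the post concrete, here is a small arithmetic sketch. It checks the "13 parameters in bf16 is 26 bytes" figure and, for contrast, counts the parameters of an ordinary rank-1 LoRA adapter on one square weight matrix. The helper names and the hidden size of 3584 are illustrative assumptions, not values taken from the paper, and the code does not implement TinyLoRA itself.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the post.
# Helper names and HIDDEN_SIZE are illustrative assumptions.

BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def adapter_bytes(num_params: int, bytes_per_param: int = BYTES_PER_BF16) -> int:
    """Storage needed for num_params trainable values."""
    return num_params * bytes_per_param


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """A standard LoRA update B @ A has B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


HIDDEN_SIZE = 3584  # assumed width of one square projection matrix

if __name__ == "__main__":
    print(adapter_bytes(13))                             # 13 params * 2 bytes
    print(lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, 1))  # rank-1 LoRA on one matrix
```

Under these assumptions, even a rank-1 adapter on a single matrix carries thousands of trainable values, which is why a 13-parameter update is a notable reduction.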

0

21

583

94

Lei Wang Reposted

Sepp Hochreiter@HochreiterSepp

2025.09.03 05:36

xLSTM excels in time series forecasting. Introduces "stochastic xLSTM" (StoxLSTM). "StoxLSTM consistently outperforms state-of-the-art baselines with better robustness and stronger generalization ability." TiRex shows that xLSTM is time series king.

0

11

474

90

Lei Wang Reposted

Sander Dieleman@sedielem

2025.08.20 01:33

@DavidSHolz @NicolasPerezNi1 Not entirely clear to me, I've mostly worked on the audiovisual modalities since 2023 so I wasn't around for this😬 The diffusion duality paper is nice in this regard, it potentially enables discrete models to access some of those theoretical advantages

0

2

24

4

Lei Wang Reposted

Sander Dieleman@sedielem

2025.08.19 20:44

New survey on diffusion language models: (via @NicolasPerezNi1). Covers pre/post-training, inference and multimodality, with very nice illustrations. I can't help but feel a bit wistful about the apparent extinction of the continuous approach after 2023🥲

0

7

590

92

Lei Wang Reposted

Jinjie Ni@NiJinjie

2025.08.09 13:45

Token crisis: solved. ✅

We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch: up to 8B params, 480B tokens, 480 epochs.

Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens hits 56% HellaSwag & 33% MMLU, with no tricks and no cherry-picks.
> No saturation: more repeats = more gains.

🚨 We also dissected the serious methodological flaws in our parallel work “Diffusion Beats Autoregressive in Data-Constrained Settings”. Let’s raise the bar for open review!

🔗 Blog & details:
18 🧵s ahead:

0

42

1.6K

252

Lei Wang Reposted

Aadit Sheth@aaditsh

2025.08.08 03:59

Andrej Karpathy shares a 3-step blueprint on how to master anything

0

59

8.9K

633

Lei Wang Reposted

Sander Dieleman@sedielem

2023.01.09 14:40

New blog post about diffusion language models: Diffusion models have completely taken over generative modelling of perceptual signals -- why is autoregression still the name of the game for language modelling? And can we do anything about that?

0

21

833

167

Lei Wang Reposted

Andrej Karpathy@karpathy

2025.02.27 01:31

This is interesting as a first large diffusion-based LLM. Most of the LLMs you've been seeing are ~clones as far as the core modeling approach goes. They're all trained "autoregressively", i.e. predicting tokens from left to right. Diffusion is different - it doesn't go left to right, but all at once. You start with noise and gradually denoise into a token stream. Most of the image / video generation AI tools actually work this way and use Diffusion, not Autoregression. It's only text (and sometimes audio!) that have resisted. So it's been a bit of a mystery to me and many others why, for some reason, text prefers Autoregression, but images/videos prefer Diffusion. This turns out to be a fairly deep rabbit hole that has to do with the distribution of information and noise and our own perception of them, in these domains. If you look close enough, a lot of interesting connections emerge between the two as well. All that to say that this model has the potential to be different, and possibly showcase new, unique psychology, or new strengths and weaknesses. I encourage people to try it out!
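The contrast drawn in the post can be illustrated with a toy sketch. It is not how any real model works: a hypothetical `oracle` function stands in for a trained network and always returns the correct token, and a fully masked sequence stands in for "noise". The point is only the order in which positions get filled: strictly left to right in one case, several positions anywhere in the sequence per step in the other.

```python
import random

MASK = "_"  # placeholder for a not-yet-denoised position


def oracle(position: int, target: list[str]) -> str:
    """Hypothetical stand-in for a trained model: returns the right token."""
    return target[position]


def autoregressive_decode(target: list[str]) -> tuple[list[str], list[str]]:
    """Emit one token per step, strictly left to right."""
    seq: list[str] = []
    snapshots: list[str] = []
    for i in range(len(target)):
        seq.append(oracle(i, target))
        snapshots.append(" ".join(seq))
    return seq, snapshots


def diffusion_style_decode(
    target: list[str], num_steps: int = 3, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Start fully masked and reveal a batch of positions, in any order, per step."""
    rng = random.Random(seed)
    n = len(target)
    seq = [MASK] * n
    order = list(range(n))
    rng.shuffle(order)                # positions are not filled left to right
    per_step = -(-n // num_steps)     # ceiling division
    snapshots: list[str] = []
    for start in range(0, n, per_step):
        for pos in order[start:start + per_step]:
            seq[pos] = oracle(pos, target)
        snapshots.append(" ".join(seq))
    return seq, snapshots


if __name__ == "__main__":
    words = "the cat sat on the mat".split()
    for line in autoregressive_decode(words)[1]:
        print("AR  :", line)
    for line in diffusion_style_decode(words)[1]:
        print("DIFF:", line)
```

Both loops end with the same sequence; the autoregressive one needs one step per token, while the diffusion-style one uses a fixed number of refinement steps over the whole sequence.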

0

373

11.5K

1.5K

Lei Wang Reposted

Alex Finn@AlexFinn

2025.02.17 01:07

This is the most powerful Deep Research AI prompt I've ever used

It can literally make you thousands of dollars

The AI will do research based on your niche and give you a DETAILED plan on how to build software for that niche (no experience required)

BOOKMARK THIS

Prompt:

I create content about [YOUR SUBJECT OR NICHE HERE]. I want you to perform thorough, in-depth research on this niche by analyzing its common pain points, the root causes of those problems, and the types of solutions that exist (if any).

1. Identify Key Challenges:
• Provide me with 5 major challenges that people in my niche frequently encounter.
• Explain each challenge in detail, focusing on:
  • Why it occurs
  • Who is most impacted
  • What current (if any) solutions or workarounds exist

2. Propose Software Solutions:
• For each of the 5 challenges, propose one unique software idea that could solve or significantly reduce that challenge.
• Break each software idea down into:
  • Core Functionality: What does it do? How does it address the challenge directly?
  • Key Features: List 3–5 critical features that make it stand out from existing solutions.
  • Value Proposition: Clearly explain how it benefits users, saves time/money, or simplifies tasks compared to other tools on the market.
  • Potential Tech Stack / Implementation Notes: If applicable, suggest frameworks, languages, or libraries that might be well-suited to build this solution.

3. Cite Sources & Data Points (If Available):
• If you refer to any statistics, facts, or expert opinions, please provide references (studies, articles, or credible sources) to support the claim or finding.

4. Conclusion & Next Steps:
• Summarize why these challenges are significant.
• Emphasize how the proposed software ideas could disrupt or advance the niche.
• Suggest any further reading or research paths that could help refine these software concepts.
• Give me a detailed action plan on how I can get started building these ideas with Cursor.

Act as if I have no programming experience. At the end of your response, provide a concise action plan or checklist summarizing how to go from idea to product validation.

PLUG THIS INTO ANY DEEP RESEARCH TOOL (ChatGPT or Perplexity) AND WATCH THE MAGIC HAPPEN

0

219

1.9K

167

Lei Wang Reposted

Jiayi Pan@jiayi_pirate

2025.01.24 17:14

We reproduced DeepSeek R1-Zero in the CountDown game, and it just works

Through RL, the 3B base LM develops self-verification and search abilities all on its own

You can experience the Aha moment yourself for < $30

Code:

Here's what we learned 🧵
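RL setups like this rely on a reward that can be checked automatically. Below is a minimal sketch of what such a check could look like for a Countdown-style puzzle. It is not the authors' code: the function name `countdown_reward`, the binary 0/1 reward, and the rule that every given number must be used exactly once are assumptions made for illustration.

```python
import ast
import operator
from collections import Counter

# Only the four basic arithmetic operators are accepted (an assumed rule set).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST, used: list[int]) -> float:
    """Evaluate a parsed expression, recording every integer literal it uses."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def countdown_reward(expression: str, numbers: list[int], target: int) -> float:
    """Return 1.0 if the expression uses each number once and hits the target, else 0.0."""
    try:
        tree = ast.parse(expression, mode="eval")
        used: list[int] = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0  # malformed or disallowed answers earn nothing
    if Counter(used) != Counter(numbers):
        return 0.0  # numbers missing, repeated, or invented
    return 1.0 if abs(value - target) < 1e-9 else 0.0


if __name__ == "__main__":
    print(countdown_reward("(25 - 5) * 4", [25, 5, 4], 80))
    print(countdown_reward("25 * 4", [25, 5, 4], 100))
```

Parsing with `ast` instead of calling `eval` keeps the checker from executing arbitrary model output, which matters when the expressions come from sampled generations.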

0

192

6.3K

1.2K

Lei Wang Reposted

Andrej Karpathy@karpathy

2025.01.30 18:03

We have to take the LLMs to school. When you open any textbook, you'll see three major types of information:

1. Background information / exposition. The meat of the textbook that explains concepts. As you attend over it, your brain is training on that data. This is equivalent to pretraining, where the model is reading the internet and accumulating background knowledge.

2. Worked problems with solutions. These are concrete examples of how an expert solves problems. They are demonstrations to be imitated. This is equivalent to supervised finetuning, where the model is finetuning on "ideal responses" for an Assistant, written by humans.

3. Practice problems. These are prompts to the student, usually without the solution, but always with the final answer. There are usually many, many of these at the end of each chapter. They are prompting the student to learn by trial & error - they have to try a bunch of stuff to get to the right answer. This is equivalent to reinforcement learning.

We've subjected LLMs to a ton of 1 and 2, but 3 is a nascent, emerging frontier. When we're creating datasets for LLMs, it's no different from writing textbooks for them, with these 3 types of data. They have to read, and they have to practice.

0

380

11.8K

1.8K

Lei Wang Reposted

Andrej Karpathy@karpathy

2025.01.10 23:29

@alexanderchen @hapticdata very cool! when people use LLMs like this repeatedly and with very low latencies like it's some kind of free, persistent, almost disposable resource it gives me the "feel the AGI" feels.

0

10

289

10
