Search ReinforcementLearning on X

2026.06.18 14:59

🎬 Distilled autoregressive video models are fast but tend to drift from human preferences. Astrolabe answers that challenge by doing RL alignment in the forward process, with no re-distillation and no reverse-process unrolling.
Title: Astrolabe: Steering Forward-Process Reinforcement Learning
URL:
📝 Overview
Astrolabe is a reinforcement learning framework that aligns distilled autoregressive (AR) video models with human visual preferences. Its defining feature is doing RL in the forward process rather than via conventional reverse-process optimization. It is a large 53-page, 37-figure study.
❓ Challenges Solved
Distilled AR video models suit efficient streaming generation but tend to misalign with human preferences. Worse, existing RL doesn't fit these architectures naturally: it typically needs either expensive re-distillation or solver-coupled reverse-process optimization, both heavy and hard to scale.
💡 Methodology & Proposed Approach
It rests on three innovations.
・Negative-aware fine-tuning contrasts positive and negative samples at inference endpoints to establish an implicit policy-improvement direction without unrolling the reverse process
・A streaming training scheme generates sequences progressively via a rolling KV-cache, applying RL updates only to local clip windows while keeping long-range coherence through prior-context conditioning
・A multi-reward objective integrates uncertainty-aware selective regularization and dynamic reference updates to mitigate reward hacking, the collapse where only the apparent score rises
🎯 Use Cases
It fits real-time streaming video generation where you want to align an efficient distilled model with preferences while preserving its speed. It applies across multiple distilled AR video models and raises quality without sacrificing inference efficiency.
📊 Significance and Results
・By avoiding the heavy paths of re-distillation and reverse-process unrolling, it addresses computational efficiency bottlenecks
・Combining forward-process negative awareness, streaming updates, and reward-hacking mitigation, it provides a robust, scalable alignment solution
・It demonstrates effectiveness across several distilled AR models, with detailed quantitative evaluation and ablations
#VideoGeneration# #ReinforcementLearning#
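The streaming scheme in the Astrolabe post (rolling cache, updates only on a local clip window, older chunks kept as context) can be pictured with a toy schedule. This is only a sketch under stated assumptions: the function name, the integer chunk ids, and the parameters `cache_size`, `window_size`, and `train_every` are all hypothetical, and a plain `deque` stands in for a real KV-cache. It is not the paper's implementation.

```python
from collections import deque


def stream_with_local_updates(num_chunks, cache_size, window_size, train_every):
    """Return a list of (context_ids, window_ids), one pair per RL update.

    Hypothetical toy model: chunk ids stand in for generated video chunks.
    context_ids only condition generation (no update is applied to them);
    window_ids are the local clip window that would receive the RL update.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    cache = deque(maxlen=cache_size)  # rolling stand-in for a KV-cache
    schedule = []
    for chunk_id in range(num_chunks):
        cache.append(chunk_id)  # "generate" a chunk; the oldest one is evicted
        if (chunk_id + 1) % train_every == 0:
            cached = list(cache)
            window = cached[-window_size:]   # most recent chunks: updated
            context = cached[:-window_size]  # older chunks: context only
            schedule.append((context, window))
    return schedule


if __name__ == "__main__":
    for context, window in stream_with_local_updates(8, 4, 2, 2):
        print("context:", context, "update window:", window)
```

The point of the sketch is that the cost of each update depends on `window_size`, not on the total length of the stream, while earlier chunks still influence generation through the cache.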

cv usk @cv_usk

2026.06.16 22:32

🎯 The fixed clipping in PPO, long taken for granted in LLM reinforcement learning, may have been quietly crushing exploration diversity. A new method resolves that weakness on solid theoretical ground.
Title: BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
URL:
🔍 Overview
BandPO replaces PPO's ratio clipping with a unified operator called Band. It projects a trust region defined by f-divergences into dynamic, probability-aware clipping intervals, so the bounds adapt according to each action's probability rather than staying fixed.
❓ Challenges Solved
PPO's fixed clipping bounds carry a structural weakness.
・They overly constrain the upward update margin of low-probability actions (tokens)
・Advantageous tail strategies that deserve to be reinforced get suppressed
・Exploration shrinks, leading to entropy collapse where the policy becomes deterministic too early
A single fixed bound applied uniformly was breaking the explore-exploit balance.
💡 Methodology & Proposed Approach
BandPO formulates the mapping from trust region to clipping interval as a convex optimization problem, guaranteeing globally optimal solutions.
・For specific divergences it derives closed-form solutions, keeping it computationally tractable
・It relaxes the constraint for low-probability, high-advantage actions so they can update properly
The novelty is bridging two lineages, PPO's ratio clipping and TRPO-style trust regions, through probability-aware bounds.
🎯 Use Cases
It fits LLM RL broadly, including RLHF and RLVR, wherever you want training stability while preserving exploration diversity. It is a practical drop-in replacement for existing PPO pipelines plagued by entropy collapse.
📊 Experimental Results
Across diverse models and datasets, BandPO consistently outperforms canonical clipping and Clip-Higher. It also robustly mitigates entropy collapse, maintaining policy diversity throughout training.
The code is released at OpenMOSS/BandPO.
#ReinforcementLearning# #LLM#
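To see why probability-aware bounds matter, here is a minimal sketch that turns a divergence budget into a clipping interval on the ratio r = p_new / p_old. It is not the Band operator from the paper. It assumes a simplified binary view (the chosen token versus everything else), a total-variation budget for the closed-form case, and KL(old || new) solved by bisection for the other case. The function names and the budget `delta` are hypothetical.

```python
import math


def tv_band_bounds(p_old, delta):
    """Closed-form ratio bounds from |p_new - p_old| <= delta (binary TV distance).

    Assumes 0 < p_old < 1.
    """
    lower = max(0.0, p_old - delta) / p_old
    upper = min(1.0, p_old + delta) / p_old
    return lower, upper


def binary_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)), clamped away from 0 and 1."""
    tiny = 1e-12
    p = min(max(p, tiny), 1.0 - tiny)
    q = min(max(q, tiny), 1.0 - tiny)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_band_bounds(p_old, delta, iters=60):
    """Ratio bounds from KL(old || new) <= delta, found by bisection.

    The binary KL grows monotonically as p_new moves away from p_old,
    so each side of the interval can be bisected independently.
    """
    # Upper side: largest feasible p_new in [p_old, 1].
    feasible, infeasible = p_old, 1.0
    for _ in range(iters):
        mid = 0.5 * (feasible + infeasible)
        if binary_kl(p_old, mid) <= delta:
            feasible = mid
        else:
            infeasible = mid
    upper = feasible / p_old
    # Lower side: smallest feasible p_new in [0, p_old].
    infeasible, feasible = 0.0, p_old
    for _ in range(iters):
        mid = 0.5 * (feasible + infeasible)
        if binary_kl(p_old, mid) <= delta:
            feasible = mid
        else:
            infeasible = mid
    lower = feasible / p_old
    return lower, upper


if __name__ == "__main__":
    for p in (0.01, 0.5, 0.9):
        print(p, "fixed:", (0.8, 1.2), "TV:", tv_band_bounds(p, 0.05),
              "KL:", kl_band_bounds(p, 0.01))
```

With a fixed ε = 0.2, a token at probability 0.01 can rise to at most 0.012 in one step. Under either budget above, the same token gets a much wider upper bound than a token at 0.9, which is the behavior the post describes for low-probability, high-advantage actions.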

Eva McMillan ♥️ @EvasTeslaSPlaid

2026.06.04 14:50

Elon Musk exposes the critical flaw in ChatGPT and other major AI models: Human Reinforcement Learning! They are literally training the AI to lie... to ignore what the data actually demands and say whatever is politically correct instead. They withhold information. They comment on some things and stay silent on others. They refuse to tell the full truth! This is extremely dangerous. We don't need politically correct! We need truth-seeking AI! @X

NVIDIA @nvidia

2026.05.13 13:05

We're working with @IneffableLabs to co-design the infrastructure for large-scale, reinforcement-learning agents and accelerate discovery across science and industry. Our engineers have teamed up to explore how to create the training pipeline that will allow agents to discover breakthroughs across all fields of knowledge. Learn more:

cv usk @cv_usk

2026.06.12 10:37

For agent memory, the real question isn't "how to store", it's "what to remember" 🧠 A fresh take that learns what to memorize via reinforcement learning.
Title: Task-Focused Memorization for Multimodal Agents
URL:
🧠 Overview
This work proposes TaskMem, which treats long-term memory for multimodal agents as a learnable policy optimized with reinforcement learning, focused on deciding what to memorize. From an unbounded stream of observations, it selectively retains only the content relevant to the agent's role and task.
❓ Challenges Solved
A multimodal agent operating in the real world continuously receives an unbounded stream of observations.
・Most prior work focused on how to store memories (designing memory modules)
・But the essential problem is what to memorize: without a principled way to select role-relevant content from an endless stream, memory simply fails
This work starts from that shift in perspective.
💡 Methodology & Proposed Approach
TaskMem treats memorization as a learnable policy, optimized in two phases.
・Phase 1: learn high-quality memorization under fidelity requirements
・Phase 2: post-deployment fine-tuning that uses task rewards to align memorization with the environment's demands
・It builds on the MLLM Qwen3-VL-30B-A3B and optimizes the policy lightly via adapter tuning
・Reward models derived from real tasks steer the policy toward selecting relevant content
🌍 Use Cases / Experimental Results
On reformulated streaming benchmarks, it delivered clear accuracy gains.
・VideoMME: 67.9% VQA accuracy (+6.3%)
・EgoLife: 45.4% VQA accuracy (+7.0%)
・EgoTempo: 27.6% VQA accuracy (+5.3%)
・Strong precision across all benchmarks (80.5-85.6%)
It charts a practical path for long-running, always-on agents to selectively remember the right things while keeping context bloat in check.
#AIAgents# #Memory#
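The idea of "memorization as a learnable policy steered by task rewards" can be illustrated with a very small REINFORCE loop over a keep/discard decision. This is a toy sketch, not TaskMem: the scalar feature, the reward shape (`storage_cost` charged for every kept item, +1 for keeping a task-relevant one), and the function names are all assumptions made for illustration. The real system tunes adapters on an MLLM instead of two scalar weights.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_keep_policy(observations, steps=2000, lr=0.1, storage_cost=0.3, seed=0):
    """REINFORCE on a Bernoulli keep/discard policy.

    observations: list of (feature, is_task_relevant) pairs, feature is a float.
    Returns the learned weight and bias of the keep-probability logit.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, relevant = rng.choice(observations)
        p_keep = sigmoid(w * x + b)
        keep = rng.random() < p_keep
        # Hypothetical task reward: kept items pay a storage cost and earn
        # +1 only if they turn out to be relevant; discarding earns 0.
        reward = ((1.0 if relevant else 0.0) - storage_cost) if keep else 0.0
        # Gradient of log pi(action) with respect to the logit.
        grad_logit = (1.0 if keep else 0.0) - p_keep
        w += lr * reward * grad_logit * x
        b += lr * reward * grad_logit
    return w, b


if __name__ == "__main__":
    stream = [(1.0, True), (-1.0, False)]
    w, b = train_keep_policy(stream)
    print("P(keep | relevant-looking)  =", sigmoid(w * 1.0 + b))
    print("P(keep | irrelevant-looking) =", sigmoid(w * -1.0 + b))
```

In this toy setting the policy learns to keep observations whose feature signals relevance and to drop the rest, which is the selection behavior the post attributes to reward-driven memorization.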

Elon Musk @elonmusk

2026.05.25 05:48

Grok foundation model V9-Medium (1.5T) has finished training. Evals look good. A lot of Cursor data was added in supplementary training and there is more to come. Fine-tuning is underway and reinforcement learning begins in a few days. 2 to 3 weeks to public release. This will be a major improvement over the 0.5T v8-small that currently serves all Grok production traffic, especially for difficult coding tasks.

Hyundai Motor Group @HMGnewsroom

2026.05.19 01:52

Just months after its debut, @BostonDynamics’ Atlas is proving why it is the world’s most capable and dynamic humanoid robot. Lifting a mini-fridge is impressive, but the true breakthrough is the underlying reinforcement learning and control systems driving real-world adaptability.
