cv usk
@cv_usk
AI / Software Research Notes AI Agent, LLMOps, MLOps, Software Architecture
🧮 Are you just letting your MoE router train on vibes? This paper proposes a mathematically grounded design principle: align router rows with the principal singular direction of their expert matrices.
Title: Redesign Mixture-of-Experts Routers with Manifold Power Iteration
URL:
📝 Overview
MoE efficiently activates only a subset of experts per input, and the router decides which experts to use. This paper argues that aligning each router row with the principal singular direction of its expert matrix better represents token-expert affinity.
❓ Challenges Solved
Each router row acts as an "expert proxy" computing similarity, but there was no principled guideline for how to design that proxy vector. There was no clear principle for condensing expert information into a representative vector.
💡 Methodology & Proposed Approach
・The proposed Manifold Power Iteration (MPI) adopts a "Power-then-Retract" paradigm
・It runs power iteration on the router weights to converge toward the principal singular direction
・A retraction operation imposes norm constraints, balancing computational efficiency and training stability
・It also provides a theoretical proof that router rows converge to the principal singular directions
🎯 Use Cases
It gives the routing design of large MoE LLMs a principled guideline rather than heuristics, useful when you want to improve expert utilization, such as reducing skew toward particular experts.
📊 Experimental Results
・The authors pretrained MoE models across scales from 1B to 11B parameters and verified that alignment improves effectiveness
・Aligning to the principal singular direction makes expert-activation decisions more effective
As MoE becomes a standard component of large LLMs, this is a foundational contribution answering why routing should be designed a certain way.
#MoE# #LLM#
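A minimal sketch of the "Power-then-Retract" idea, not the paper's implementation. It assumes each expert is a single weight matrix `W`, that "principal singular direction" means the top right singular vector of `W`, and that retraction simply rescales the row to unit norm. The helper names (`matvec`, `retract`, `power_then_retract`) and the toy 2x2 expert are made up for illustration.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def retract(vec, norm=1.0):
    """Retraction (assumed here): rescale the vector back to a fixed norm."""
    length = math.sqrt(sum(v * v for v in vec))
    return [norm * v / length for v in vec]


def power_then_retract(expert, router_row, steps=50):
    """Alternate a power step (W^T W r) with a retraction to unit norm."""
    expert_t = transpose(expert)
    row = retract(router_row)
    for _ in range(steps):
        row = matvec(expert_t, matvec(expert, row))  # power step
        row = retract(row)                           # retract step
    return row


# Hypothetical expert whose dominant input direction is the first axis.
expert = [[3.0, 0.0], [0.0, 1.0]]
aligned = power_then_retract(expert, [1.0, 1.0])
print(aligned)
```

With this toy expert the row drifts toward the first axis, the direction along which the expert amplifies its input the most.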
For agent memory, the real question isn't "how to store" but "what to remember" 🧠 A fresh take that learns what to memorize via reinforcement learning.
Title: Task-Focused Memorization for Multimodal Agents
URL:
🧠 Overview
This work proposes TaskMem, which treats long-term memory for multimodal agents as a learnable policy optimized with reinforcement learning, focused on deciding what to memorize. From an unbounded stream of observations, it selectively retains only the content relevant to the agent's role and task.
❓ Challenges Solved
A multimodal agent operating in the real world continuously receives an unbounded stream of observations.
・Most prior work focused on how to store memories (designing memory modules)
・But the essential problem is what to memorize: without a principled way to select role-relevant content from an endless stream, memory simply fails
This work starts from that shift in perspective.
💡 Methodology & Proposed Approach
TaskMem treats memorization as a learnable policy, optimized in two phases.
・Phase 1: learn high-quality memorization under fidelity requirements
・Phase 2: post-deployment fine-tuning that uses task rewards to align memorization with the environment's demands
・It builds on the MLLM Qwen3-VL-30B-A3B and optimizes the policy lightly via adapter tuning
・Reward models derived from real tasks steer the policy toward selecting relevant content
🌍 Use Cases / Experimental Results
On reformulated streaming benchmarks, it delivered clear accuracy gains.
・VideoMME: 67.9% VQA accuracy (+6.3%)
・EgoLife: 45.4% VQA accuracy (+7.0%)
・EgoTempo: 27.6% VQA accuracy (+5.3%)
・Strong precision across all benchmarks (80.5-85.6%)
It charts a practical path for long-running, always-on agents to selectively remember the right things while keeping context bloat in check.
#AIAgents# #Memory#
Give children and LLMs the exact same mystery-solving task: how does their reasoning differ? 🧒 A study that puts human and AI inference side by side, fairly.
Title: Hypothesis Generation and Inductive Inference in Children and Language Models
URL:
🧒 Overview
This study has both children and LLM agents solve a task of inferring hidden causes under uncertainty, then carefully compares them. It examines how closely humans and AI align, and where they diverge, in generating hypotheses and reasoning inductively.
❓ Challenges Solved
Humans, especially children, build mental models quickly from sparse cues.
・It was unclear whether the computational principles behind human reasoning under uncertainty also appear in LLMs placed under matched constraints
・There wasn't even a fair framework for putting children and AI side by side
This work takes that question head-on.
💡 Methodology & Proposed Approach
The researchers designed an inductive-inference "Box Task" for inferring hidden causes.
・Sequential environment interaction: discover latent causes by acting on the environment
・Modeled with Bayesian particle-based inference
・Systematic manipulation of evidence reliability and observability
・Measures both task completion and rule generalization
Analysis uses two complementary frameworks: constraint satisfaction over hypotheses and program synthesis evaluation.
🌍 Use Cases / Experimental Results
The similarities and differences between humans and AI came through sharply.
・Both groups discounted unreliable evidence and sought more information to partially resolve uncertainty
・Both showed a dissociation between task completion and causal generalization (solving a task doesn't guarantee generalizing the rule)
・LLM agents over-observe and over-comply with instructions relative to children
・Despite similar environmental adaptation, they had distinct information-seeking costs and inductive biases
This offers insight into cognition and a guide to where LLM agents differ from humans by design.
#CognitiveScience# #LLMAgents#
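To make "discounting unreliable evidence" concrete, here is a small Bayesian update over candidate boxes. It is only a sketch under assumptions: the study's actual Box Task model is particle-based and is not reproduced here. Exact enumeration over three hypothetical boxes stands in for particles, and the likelihood model (a cue points at the true box with probability `reliability`, otherwise uniformly at another box) is invented for illustration.

```python
def update(weights, observed, reliability):
    """One Bayesian update over candidate boxes given a noisy cue.

    weights: current belief per box (sums to 1)
    observed: index of the box the cue points at
    reliability: assumed probability that the cue points at the true box
    """
    n = len(weights)
    posterior = []
    for box, weight in enumerate(weights):
        if box == observed:
            likelihood = reliability
        else:
            likelihood = (1.0 - reliability) / (n - 1)
        posterior.append(weight * likelihood)
    total = sum(posterior)
    return [p / total for p in posterior]


prior = [1 / 3, 1 / 3, 1 / 3]
reliable = update(prior, observed=0, reliability=0.9)
unreliable = update(prior, observed=0, reliability=0.4)
print(reliable[0], unreliable[0])
```

The same cue moves the belief much less when its source is unreliable, which leaves more residual uncertainty and so more reason to gather further observations.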
Can an AI actually mediate a conflict between people? 🤝 A benchmark that tries to measure that, reliably, under realistic conditions.
Title: SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
URL:
🤝 Overview
This work proposes SoCRATES, a comprehensive benchmark for evaluating LLMs as mediators. An agentic pipeline builds realistic conflict scenarios across eight domains from actual public disputes, enabling automated and reliable evaluation of proactive LLM mediation.
❓ Challenges Solved
Using LLMs to guide disputing parties toward agreement is gaining attention, but evaluating it is hard.
・Real conflicts shift constantly as disputants' emotions, intentions, and context change mid-mediation
・Existing benchmarks rely on a limited set of expert-authored scenarios
・They also score every turn against every topic, injecting noise that muddies the evaluation signal
💡 Methodology & Proposed Approach
SoCRATES integrates three approaches.
・Agentic scenario curation: agents find genuine public disputes, restructure them into mediation scenarios, and filter for cases that truly need intervention
・Socio-cognitive probing: vary each scenario across five independent dimensions (strategic posture, party composition, conversation-history length, emotional reactivity, cultural identity) to pinpoint capability gaps
・Topic-localized evaluation: instead of scoring every topic at every turn, rate only the turns where a topic is actively discussed, reducing noise
It spans eight domains: transactional, health, environmental, B2B, policy, international, legal, and intra-organizational.
🌍 Use Cases / Experimental Results
The results were sober and revealing.
・The evaluator reached r=0.82 alignment with human experts (trajectory level), more than doubling baseline performance
・Among eight frontier LLMs, even the best, GPT-5.4-mini, closed only about 34.4% of the consensus gap (all-mediator average 25.9%)
・Big domain spread: 41.3% improvement in transactional disputes versus just 16.6% in intra-organizational ones
The key takeaway: meaningful progress needs better social adaptation to diverse conditions, not just general capability gains.
#LLMEvaluation# #AIMediation#
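Topic-localized evaluation is easy to picture in code. The sketch below is an illustration only, not the benchmark's scorer: the turn format (`active_topics`, `scores`) and the toy numbers are hypothetical. It averages a topic's score over only the turns where that topic is actively discussed and ignores the ratings given while the topic was idle.

```python
def topic_localized_scores(turns):
    """Average each topic's score over only the turns where it is active.

    turns: list of dicts with
      'active_topics': set of topics under discussion in that turn
      'scores': mapping topic -> rating for that turn
    """
    totals, counts = {}, {}
    for turn in turns:
        for topic in turn["active_topics"]:
            totals[topic] = totals.get(topic, 0.0) + turn["scores"][topic]
            counts[topic] = counts.get(topic, 0) + 1
    return {topic: totals[topic] / counts[topic] for topic in totals}


# Hypothetical mediation transcript with two topics.
turns = [
    {"active_topics": {"price"}, "scores": {"price": 0.8, "delivery": 0.1}},
    {"active_topics": {"delivery"}, "scores": {"price": 0.2, "delivery": 0.6}},
    {"active_topics": {"price", "delivery"}, "scores": {"price": 0.4, "delivery": 0.8}},
]
localized = topic_localized_scores(turns)
print(localized)
```

In this toy transcript the low ratings a topic receives while nobody is discussing it (0.1 for delivery in turn one, 0.2 for price in turn two) no longer drag its average down.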
Still shipping your entire schema to a Text-to-SQL agent on every request? You're losing both accuracy and money 💸 Here's how a knowledge graph fixes both.
Title: How a Neo4j semantic layer makes your Text-to-SQL agent smarter and cheaper
URL:
💸 Overview
This post explains how to use a knowledge graph (Neo4j) as a semantic layer to make Text-to-SQL agents both smarter and cheaper. Instead of dumping the full schema every time, the agent retrieves only the subgraph relevant to the question, a GraphRAG approach.
❓ Challenges Solved
Most implementations store schema info in static YAML or Markdown and send the whole thing on every request. That creates three serious issues.
・High token cost: transmitting the entire schema repeatedly is expensive
・Contextual noise: irrelevant tables degrade accuracy and trigger hallucinations
・Poor maintainability: flat files go stale as business semantics evolve
💡 Methodology & Proposed Approach
The graph stores database structure (schemas, tables, columns, types), constraints, column dictionaries, a business glossary, and usage patterns. The agent retrieves only relevant context in three steps.
・Semantic similarity search: vector indices identify matching columns and terms
・Shortest-path search: find possible joins between identified tables
・Additional context: gather schema definitions, business terms, and sample values
Results are formatted as JSON with tables and join paths in milliseconds.
🌍 Use Cases / Experimental Results
The post reports improvements that matter directly for production.
・Token reduction: 20-30% on average, up to 10x on simple queries
・Accuracy (multi-table joins): ~98% (Neo4j) vs ~90% (YAML)
・Accuracy (complex CTEs with window functions): ~94% (Neo4j) vs ~85% (YAML)
・Token use scales with complexity (simple ~1,800 / multi-join ~5,000 / advanced ~7,300)
The graph captures dynamic usage patterns like join frequencies and behavioral relationships, enabling continuous improvement that static files simply can't model.
#TextToSQL# #KnowledgeGraph#
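The shortest-path step can be sketched without a graph database. In the post this search runs inside Neo4j; the plain breadth-first search below is a stand-in, and the table names and foreign-key list are hypothetical. It treats foreign keys as undirected edges and returns the shortest chain of tables connecting two tables that the similarity search identified.

```python
from collections import deque


def join_path(foreign_keys, start, goal):
    """Breadth-first search for the shortest chain of joinable tables."""
    graph = {}
    for left, right in foreign_keys:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # sorted() keeps the traversal order deterministic
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no join path exists


# Hypothetical schema: each pair is a foreign-key relationship.
foreign_keys = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("products", "suppliers"),
]
path = join_path(foreign_keys, "customers", "products")
print(path)  # ['customers', 'orders', 'order_items', 'products']
```

Only the tables on the returned path need to be sent to the model, instead of the full schema.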
Making AI "reason about space in words" might be backfiring 🧭 Here's a new approach that lets it imagine unseen viewpoints instead.
Title: Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
URL:
🧭 Overview
This work proposes Imaginative Perception Tokens (IPT) to strengthen spatial reasoning in vision language models (VLMs). Rather than forcing spatial logic through language, it keeps "what could be perceived under a different arrangement" as an intermediate perceptual representation.
❓ Challenges Solved
VLMs struggle with spatial reasoning: inferring unobserved viewpoints, reasoning through occluded paths, and integrating partial observations. Prior work pushed this into textual chain-of-thought, but forcing visual reasoning through language alone hit a ceiling.
💡 Methodology & Proposed Approach
・Uses the unified VLM backbone BAGEL, trained with IPT supervision
・Formulates three tasks: Perspective Taking (PET), Path Tracing (PT), Multiview Counting (MVC)
・Builds a ~20,000-example dataset with ground truth, answers, and metrics
The core idea is treating the perception itself ("if I moved here, I'd see this") as an intermediate representation.
📊 Experimental Results
・IPT improved Multiview Counting (MVC) accuracy by 3.4%
・Path Tracing (PT) reached performance competitive with closed-source models
・IPT supervision outperformed textual chain-of-thought training
・Conversely, textual CoT substantially degraded spatial reasoning
#SpatialReasoning# #MultimodalLLM#
🌐 The key to building strong AI agents may actually be designing the environments they operate in. This 63-page survey systematizes the view of "environment engineering."
Title: Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application
URL:
📝 Overview
LLM agents don't act alone; they operate inside interactive environments. This survey organizes the research landscape through the lens of "environment engineering," the engineering design and construction of those environments themselves.
❓ Challenges Solved
Until now, how to build environments was discussed only in fragments. Even though agent capability depends heavily on good environment design, there was no unified framework to organize it.
💡 Methodology & Proposed Approach
It classifies environments along the development lifecycle in four pillars.
・Environment modeling: characterizing representative environments and assessing core capabilities
・Environment synthesis: two paradigms, symbolic and neural
・Environment evaluation: domain-specific assessment aligned with the synthesis paradigms
・Environment application: agent-environment co-evolution across four pathways, memory-centric, orchestration-centric, trajectory-centric, and exploration-centric
🎯 Use Cases
It helps agent researchers locate their own work on a map and spot missing perspectives, and serves as a starting point when designing environment synthesis, evaluation, and self-evolution.
📊 Trends and Outlook
・It organizes evolution approaches into three families: neural-driven, difficulty-driven, and scaling-driven
・It analyzes across eight attributes and eight application domains
・It points to Environment-as-a-Service, multi-agent systems, and neural-symbolic integration as future directions
#AIAgents# #LLM#
# Codex Features and Practical Usage
🚀 "One agent for everywhere you code." OpenAI Codex is an AI coding agent you can hand entire tasks to, from generation to understanding, review, and debugging.
🏷️ Title: Codex Fundamentals
🔗 URL:
📘 Overview
Codex is OpenAI's AI coding agent for software development. Rather than just autocompleting code, it reads your existing project structure and conventions and carries out tasks autonomously. It is built into the ChatGPT Plus, Pro, Business, Edu, and Enterprise plans.
⚙️ How It Works
Codex centers on five core capabilities.
・Code generation: describe what you want, and it writes code that fits your existing structure and naming conventions.
・Codebase understanding: it reads complex or legacy code and explains how the system is organized.
・Code review: it surfaces bugs, logic errors, and unhandled edge cases.
・Debugging: it traces failures, diagnoses root causes, and proposes targeted fixes.
・Task automation: it handles refactors, tests, migrations, and setup workflows.
Underpinning all of this are two foundations that keep it safe: a sandbox that defines execution boundaries, and an approval policy that decides when to stop and ask.
🛠️ Practical Usage
Codex's hallmark is that it runs "everywhere you code," through several entry points.
・CLI: launch `codex` in your terminal and work interactively
・IDE extension: delegate right from your editor
・Web / cloud: run tasks on repos you do not have locally, in parallel
・GitHub integration: ask for a review with `@codex review` on a PR
・Slack integration: mention `@codex` in a thread to kick off a task
A good path is to start with the CLI via `npm i -g @openai/codex`, then expand into GitHub and Slack as you get comfortable.
💡 Use Cases
Practical patterns include: on day one in an unfamiliar repo, asking "Tell me about this project" to grasp the big picture; having bugs cleaned up before review; or delegating a tedious bulk refactor wholesale. Humans stay focused on direction and review.
⚠️ Caveats
Codex is an autonomous agent that reads/writes files and runs commands. Create Git checkpoints (commits) before and after tasks so you can always roll back safely. Authenticating with a ChatGPT account is recommended; some functionality may be limited with API-key auth.
#OpenAICodex# #AICoding#
A useful but little-known Gemini API feature 🖥️ An AI that sees the screen and clicks where it needs to. Browser automation just changed.
Gemini's "Computer Use" is an agent capability that sees screenshots and performs mouse/keyboard actions. It opens up new possibilities for UI testing and web task automation.
📌 Title: Computer Use
🔗 URL:
🧩 Overview
Traditional UI automation depends on DOM structure and selectors, breaking easily when the UI changes. Computer Use "sees" screenshots, understands the interface visually, and can direct click, type, and scroll actions. It operates the same way a human would: by looking at the screen.
🛠 How to use it
Pass a screenshot to Gemini and describe the task in natural language. Gemini determines where to click or type on the screen and returns the action. You execute that action through a browser automation tool like Playwright, forming a see-think-act loop.
🏗 Building it into production
・E2E test automation: describe complex flows like "log in, add a product to the cart, proceed to checkout" in natural language. Tests that survive UI redesigns.
・RPA-style business automation: automate form filling and data entry in internal systems by visual operation. Works even on legacy systems without APIs.
・Web operation agents: complete tasks like "find the cheapest option on this comparison site" through screen interaction.
・Accessibility testing: visually interpret screens to detect usability issues in automated test suites.
💡 Use cases
🧪 Vision-based E2E test automation
🤖 RPA-style automation for API-less systems
🌐 Web browsing and information gathering agents
♿ Automated accessibility verification
⚠️ Watch out
Since it's based on visual interpretation, action accuracy isn't 100%. Critical operations (payments, deletions) should include a human confirmation step. Latency is also higher than programmatic approaches, making rapid sequential operations impractical. On the security side, manage access permissions to target systems carefully.
✨ "No API, so can't automate with LLM" is a thing of the past. Try screen-seeing agents on a simple task first and see what's possible.
#Gemini# #LLM#
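The see-think-act loop, with a human confirmation gate for critical actions, can be sketched as plain control flow. This is not the Gemini API or the Playwright API: `capture`, `decide`, `execute`, and `confirm` are hypothetical callables you would back with a screenshot function, a model call, a browser driver, and a human prompt. The action format and the `CRITICAL` set are likewise invented for illustration.

```python
# Action types that should never run without a human saying yes (assumed set).
CRITICAL = {"pay", "delete"}


def run_loop(capture, decide, execute, confirm, goal, max_steps=10):
    """See-think-act loop that stops on 'done' or on a rejected critical action."""
    history = []
    for _ in range(max_steps):
        screenshot = capture()             # see: grab the current screen
        action = decide(goal, screenshot)  # think: the model picks the next action
        if action["type"] == "done":
            break
        if action["type"] in CRITICAL and not confirm(action):
            history.append(("skipped", action))
            break
        execute(action)                    # act: perform it in the browser
        history.append(("executed", action))
    return history


# Stubbed run: a scripted "model" and a human who rejects the payment.
script = iter([
    {"type": "click", "x": 10, "y": 20},
    {"type": "type", "text": "hello"},
    {"type": "pay"},
    {"type": "done"},
])
executed = []
history = run_loop(
    capture=lambda: b"fake-screenshot",
    decide=lambda goal, shot: next(script),
    execute=executed.append,
    confirm=lambda action: False,
    goal="check out",
)
print(history)
```

The `max_steps` cap is there because, as noted above, action accuracy isn't 100% and a confused agent should not loop forever.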
# Practical ways to use the Claude Agent SDK
💬 Continue, resume, and fork agent conversations to tackle complex tasks across multiple turns.
Session Management uses `continue`, `resume`, and `fork` to maintain context across multi-turn conversations.
📌 Title: Working with Sessions
🔗 URL:
🧩 Overview
Sessions preserve conversation context. Python uses `ClaudeSDKClient` (auto session ID management), TypeScript uses `continue: true`. `resume` restarts interrupted sessions, and `fork_session` branches history to explore alternatives.
🛠 How to use it
```python
from claude_agent_sdk import ClaudeAgentOptions

# session_id is the ID captured from an earlier run
options = ClaudeAgentOptions(resume=session_id)  # Resume
options = ClaudeAgentOptions(resume=session_id, fork_session=True)  # Fork from that session
```
🏗 Practical usage
- Build multi-turn conversations: "Analyze the auth module" → "Refactor it to use JWT" with full context preserved.
- Resume sessions that ended with `error_max_turns` using `resume` with higher limits.
- Use `fork_session=True` to explore an OAuth2 approach without destroying the original JWT approach. Two independent histories are maintained.
- Build session picker UIs with `list_sessions` / `get_session_messages` / `rename_session`.
💡 Use cases
🔄 Resuming sessions after hitting limits
🌿 Parallel exploration of alternative approaches via fork
🗂 Building session management UIs
⚠️ Watch out
Sessions are saved at `~/.claude/projects//.jsonl`. Cross-host resume requires matching `cwd`. Sub-agent transcripts persist independently from the main conversation.
#ClaudeAgentSDK# #AI#
🗺️ Even frontier GPT-5 succeeds on just 14.4% of real-world spatial tasks. A new benchmark goes beyond staring at a static image and exposes how weak AI agents still are at active spatial reasoning.
Title: SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
URL:
📝 Overview
SpatialWorld measures whether multimodal LLMs can solve tasks by actively exploring 3D environments from a vision-only, egocentric viewpoint. It unifies eight different simulators across indoor, outdoor, and digital-game settings under a shared protocol, and evaluates 15 frontier models on 760 human-annotated tasks. The agent gets no prior map and no reference solution; it has to look, move, and decide on its own.
❓ Challenges Solved
Prior spatial-reasoning benchmarks relied on passive evaluation via static VQA or pre-recorded video. That can't capture the interactive spatial understanding the real world demands, where an agent must move its own viewpoint to gather visual evidence and replan on the fly under partial observability. There was a large gap between recognizing a static scene and actually moving through an unfamiliar space to get a task done.
💡 Methodology & Proposed Approach
・The task is framed as a vision-only POMDP (Partially Observable Markov Decision Process)
・The agent receives only a natural-language goal and a single native-resolution egocentric RGB image, with no depth, maps, or semantic metadata
・Actions are issued through a high-level text interface covering navigation, viewpoint control, object interaction, and task completion
・It integrates eight backends: indoor (AI2-THOR, ProcTHOR, VirtualHome), outdoor (CARLA, EmbodiedCity), and digital games (Block3D, Snake3D, Rubik's Cube)
・Success is judged by whether the final terminal state satisfies the goal, not by matching the trajectory, and is validated by human annotators
・Beyond success rate, it measures step efficiency against human reference trajectories to surface inefficient behavior
🎯 Use Cases
It offers a unified, fair way to evaluate the spatial abilities of home robots and autonomous agents before real-world deployment. It can systematically diagnose where long-horizon tasks that combine navigation and manipulation break down, serving as a rigorous testbed for improving spatial-reasoning models.
📊 Experimental Results
・Across 15 frontier models, physical-task success was 14.4% for GPT-5, 12.2% for Qwen-3.5-397B, 9.2% for Gemini-3.1-Pro, and 9.2% for Kimi-K2.5
・On digital games, Gemini-3.1-Pro led at 39.0%, followed by GPT-5 at 36.4%
・By complexity, interaction-only tasks averaged 50.2%, navigation-only dropped to 8.6%, and combined navigation-and-interaction collapsed to just 4.2%
・Models with similar success rates showed very different efficiency scores, revealing heavy reliance on trial-and-error exploration
・Model rankings shifted dramatically across environments, with no single model dominating every category
#AIAgents# #SpatialReasoning#
Why do AI-generated UIs all look so generic? It's the workflow, not the prompt 🎨 A practical playbook for producing genuinely beautiful UIs.
Title: Generating Beautiful UIs
URL:
🎨 Overview
This post lays out a practical methodology for generating beautiful UIs with AI. The thesis is that there's no single magic technique: what works is a disciplined workflow built on pre-defined design systems and fast iteration loops.
❓ Challenges Solved
AI-generated UIs tend to come out generic and predictable. The post names the common failure modes.
・Dashboard-ification: turning everything into a dashboard
・Nested cards: redundant cards inside cards
・Instruction leakage: prompt instructions bleeding into the on-screen copy
・Weak compositional logic: layouts that break down and lack beauty or resonance
💡 Methodology & Proposed Approach
The post recommends a methodical workflow built from these steps.
・Use component libraries: shadcn/ui via MCP integration
・Pre-define the design system: keep design tokens as readable files to prevent hallucination
・Enforce constraints: use Tailwind config to block drift
・Iterate with vision models: feed screenshots to run a visual improvement loop
・Generate multiple options before committing
・Test with hostile, realistic data during development
🌍 Use Cases / Experimental Results
Combining fast inference with a disciplined workflow turns AI from a gimmick into a real prototyping accelerator.
・Codex-Spark runs at ~1,200 tokens/sec on Cerebras, generating several design options in minutes
・With proper tooling, components compile on the first attempt
・Tighter feedback loops reduce wasted tokens
The conclusion: AI is a fast, overconfident junior designer that still needs human art direction, not an autonomous replacement.
#UIDesign# #GenerativeAI#
AI reliability can't come from "self-reflection" alone. Welcome to the era where a separate agent audits the answer before you get it 🔬
Title: Apodex-1.0: A Verification-Centric Agent Team for Discoverative Intelligence
URL:
🔬 Overview
A system that shifts from a single-agent reasoning loop to a verification-centric distributed agent team. In heavy-duty mode it becomes an asynchronous team that specializes, cross-checks, and audits its own evidence before answering.
❓ Challenges Solved
Reliability on hard, open-ended problems can't come from a model's parametric memory alone. The premise: the hardest research problems are bounded not by model capacity but by what the model is allowed to interact with.
💡 Methodology & Proposed Approach
・A main agent asynchronously spawns specialized sub-agents with independent contexts and tools
・A shared report pool aggregates parallel findings without blocking on slower tasks
・A verification agent team handles conflict resolution, fact-checking, and draft review
・The core idea is verification as external audit: the reasoning agent and auditing agent are separated, and the verifier is free to disagree
・It coordinates up to 150 sub-agents over 15,000+ steps in a single task
📊 Experimental Results
・BrowseComp 90.3 / DeepSearchQA 94.4 / BrowseComp-ZH 84.1
・FrontierScience-Research 46.7 (+8 vs competitors) / SuperChem 74.2 (+12 over next-best)
・Heavy-duty mode lifts the base by +14.8 on BrowseComp and +18.4 on FrontierScience-Research
・The open-source 4B-SFT beats every 30B-class open-source model on BrowseComp
#AIAgents# #DeepResearch#
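The spawn, pool, and audit pattern described above maps naturally onto `asyncio`. The sketch below is a toy under assumptions, not Apodex's architecture: `sub_agent`, `verify`, and `main_agent` are hypothetical names, the sub-agents just sleep and return a canned finding, and the verifier's rule (reject when findings disagree) is invented to show an auditor that is free to disagree.

```python
import asyncio


async def sub_agent(name, delay, finding):
    """Stand-in for a specialized sub-agent with its own context and tools."""
    await asyncio.sleep(delay)
    return {"agent": name, "finding": finding}


async def verify(reports):
    """Separate auditor: accepts only when all sub-agents agree (assumed rule)."""
    findings = {report["finding"] for report in reports}
    return {"accepted": len(findings) == 1, "findings": sorted(findings)}


async def main_agent(tasks):
    pool = []  # shared report pool
    jobs = [asyncio.create_task(sub_agent(*task)) for task in tasks]
    # Collect reports as each sub-agent finishes, without waiting on slow ones.
    for finished in asyncio.as_completed(jobs):
        pool.append(await finished)
    verdict = await verify(pool)
    return pool, verdict


pool, verdict = asyncio.run(main_agent([
    ("slow", 0.05, "42"),
    ("fast", 0.0, "42"),
    ("skeptic", 0.02, "41"),
]))
print([report["agent"] for report in pool], verdict)
```

Because the verifier sits outside the sub-agents, a single dissenting report is enough to block the answer rather than being averaged away.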