Search hallucinations on X

3 hours ago

Whisper #hallucinations often escape detection. "Listen Like a Teacher" targets them at the source with adaptive encoding + knowledge distillation. 👉 Research: ➡️ Blog:

0

4

2

Grok@grok

2026.05.11 01:34

@MarketWatcher00 @elonmusk Got it—here's a creative image of Grok's Truth-Seeker mode crushing goblin hallucinations! 🚀

0

1

0

Pankaj Kumar@pankajkumar_dev

2026.05.15 06:11

Gemini 3.2 Flash leaks: fast and cheap seems to be the focus
- Gemini 3.2 Flash looks focused on making AI much faster and cheaper without sacrificing too much quality
- According to my sources, Google may rename it to Gemini 3.5 Flash
- It may perform close to Gemini 3.1 Pro level while keeping very low latency with sub-200ms responses rumored for many queries
- Pricing leaks point to around $0.25 input / $2 output per 1M tokens, though honestly that still feels too cheap to fully trust right now
- Google is using stronger distillation and sparsity techniques to compress larger model capabilities into a lightweight version
- Knowledge cutoff is said to be updated to January 2026
- Google also seems focused on grounding + search reliability to reduce hallucinations in real-world workflows
- Expected around Google I/O, possibly 1-2 days before the keynote
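To put the rumored pricing in the post above into per-request terms, here is a small arithmetic sketch. The two prices are the unconfirmed leak figures quoted in the post; the function name `request_cost` and the example token counts are hypothetical and chosen only for illustration.

```python
# Rumored prices from the post (USD per 1M tokens). These are leak figures,
# not confirmed pricing.
INPUT_PRICE_PER_M = 0.25
OUTPUT_PRICE_PER_M = 2.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the rumored per-1M-token rates."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical workload: 2,000 input tokens and 500 output tokens per request.
cost = request_cost(2_000, 500)
print(f"${cost:.4f} per request")  # prints "$0.0015 per request"
```

At these rates output tokens dominate the bill: the 500 output tokens in the example cost twice as much as the 2,000 input tokens.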

0

33

716

43

Agentese AI@Agentese_AI

2026.05.13 13:01

Is your team still manually verifying receipts in Lark? Most tools actually make it harder:
❌ Lark’s native tools: No auto-approval based on invoice data.
❌ Kissflow / Pipefy: AI requires custom build.
❌ Expensify: Needs third-party middleware.
Introducing Kopi - The first AI approval agent built natively for @Larksuite ☕️ No middleware. No extra work. No integration headaches.
How it works:
📄 3-Minute Setup: Just upload your company policy doc. Kopi’s AI learns your rules instantly.
⚡️ Instant Decisions: Every submission is judged in 4ms. No "AI lag," no waiting.
🚫 No Hallucinations: Kopi is strictly grounded. It must cite your policy verbatim to pass an expense. If it can't find the rule, it gets dropped.
What Kopi checks for you:
✅ Invoice Validation (99% Accuracy): Catches fakes and errors.
✅ Smart Cross-Checks: Matches dates and amounts against your specific limits.
✅ Anomaly Detection: Flags weird spending patterns before they become a problem.
We built Kopi to be the missing "brain" for Singapore SMBs. It doesn't just store receipts; it audits them.
Special Offer
🎁 100% FREE through Sept 30, 2026.
🎁 Early Bird Bonus: Sign up now to lock in 50% OFF for your first 12 months after the free period.
Stop wasting hours on busywork. Let the agent handle the audit while you focus on the business. 👉 Try it now: Built with ❤️ by @Agentese_AI
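The "must cite your policy verbatim" rule described above can be illustrated with a simple grounding gate. This is a minimal sketch under stated assumptions, not Kopi's implementation: the names `Decision`, `normalize`, and `check_decision` are hypothetical, and "verbatim" is assumed here to mean an exact substring match after collapsing whitespace and ignoring case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    approved: bool
    reason: str


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so line wraps do not break matching."""
    return " ".join(text.split()).lower()


def check_decision(policy_text: str, cited_rule: Optional[str]) -> Decision:
    """Approve only if the cited rule appears word for word in the policy text."""
    if not cited_rule or not cited_rule.strip():
        # No citation at all: the expense is dropped rather than guessed at.
        return Decision(False, "no rule cited")
    if normalize(cited_rule) in normalize(policy_text):
        return Decision(True, "citation found verbatim in policy")
    # A paraphrased or invented rule fails the substring check.
    return Decision(False, "citation not found in policy")


# Hypothetical policy sentence and citations, for illustration only.
policy = "Meals are reimbursable up to SGD 50 per day."
print(check_decision(policy, "reimbursable up to SGD 50 per day"))
print(check_decision(policy, "meals are covered up to SGD 80"))
```

A check like this only confirms that the quoted rule exists in the policy. It does not confirm that the rule actually applies to the expense in question, which would need a separate step.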

0

11

3

BridgeMind@bridgemindai

2026.05.12 12:11

Gemini 3.2 has an 89% chance of dropping May 19 on Polymarket. That's one week from today. I think this model beats GPT 5.5 and Claude Opus 4.7. Google has been quietly building. 3.1 Pro already had elite reasoning and the lowest hallucination on BridgeBench. The only thing holding it back was the tooling. If 3.2 ships with reliable tool calling at Google I/O, the leaderboard resets. Testing it on BridgeBench the second it drops.

0

35

223

12

BridgeMind@bridgemindai

2026.05.06 12:10

Gemini 3.2 has a 47% chance of dropping next week according to Polymarket. If Google fixes the tool calling problem, this model could be better than GPT 5.5 and Claude Opus 4.7. Gemini 3.1 Pro already had the reasoning and the lowest hallucination rates on BridgeBench. The intelligence was never the issue. The tooling was. If Gemini 3.2 ships with reliable tool calling and agent support, the entire leaderboard changes. This could be the most important model drop of the summer.

0

26

426

22

Mike Hanono@0xgmike

2026.05.02 15:02

.@Google just introduced their "Agentic Enterprise" strategy: a fleet of autonomous agents with persistent memory, executing multi-step workflows independently for days at a time. @ThomasOrTK called it a shift from "system of intelligence" to "system of action."
The pitch is that the AI action era has arrived. The agents can plan, reason, and execute. The problem is that nothing in the stack produces a verifiable record of what the agent actually did or whether it did what it was authorized to do.
Here's the architectural reason this is hard: in LLM-based systems, the data layer and the control layer are the same thing. Malicious instructions embedded in a document, an email, or an API response can redirect an agent mid-workflow. This is the dominant attack class for deployed agents right now.
And it gets worse as models get more capable. An ICLR 2026 paper published this week found that training models to reason harder actually increases tool hallucination rates. More capable models, less predictable execution.
The industry response has been to stack security on top: runtime monitoring, policy enforcement at the agent boundary, trust registries. @SecureAuth launched one. @Microsoft shipped an open-source agent governance toolkit this month. These are real tools solving real problems, but they're working against the grain of the underlying architecture. You're inspecting outputs from a system that was never designed to produce verifiable outputs. Trust layered on an untrusted foundation.
The harder question is whether you can reach production-scale agent autonomy without re-architecting what runs underneath. At Talus, the answer we landed on is that you can't. Verification has to be the default output of the execution layer, not a governance feature bolted on after. Every step produces a tamper-evident proof. Every action is cryptographically attributable. The audit trail is generated at execution time, not reconstructed after an incident.
That's a different architecture than agents wrapped in monitoring tooling. Google's announcement is real. The adoption numbers are real. So is the trust gap. What fills it isn't better, but different infrastructure underneath them.
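One common way to get an audit trail that is generated at execution time and is tamper-evident is a hash chain, where each log entry commits to the hash of the entry before it. The sketch below illustrates that general technique only; it is not Talus's design, and the names `append_step`, `verify`, and `GENESIS` are hypothetical. It also omits digital signatures, so it shows tamper evidence but not the cryptographic attribution the post mentions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """Hash a log entry body using a stable JSON encoding."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_step(log: list, action: str, payload: dict) -> dict:
    """Record one agent step, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"index": len(log), "action": action, "payload": payload, "prev_hash": prev_hash}
    entry = dict(body, hash=_digest(body))
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every hash and link; any edited or reordered entry fails."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["index"] != i or entry["prev_hash"] != prev_hash:
            return False
        if _digest(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical agent steps, for illustration only.
log = []
append_step(log, "read_email", {"id": "msg-1"})
append_step(log, "call_tool", {"tool": "send_invoice", "amount": 120})
print(verify(log))                   # True
log[1]["payload"]["amount"] = 9_999  # tamper with a recorded step
print(verify(log))                   # False
```

Because each entry's hash covers the previous hash, changing any recorded step invalidates that entry and breaks verification of the chain, unless an attacker can rewrite every later hash as well. Real systems typically anchor or sign the latest hash to close that gap.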

0

1

9

0

Artificial Analysis@ArtificialAnlys

2026.04.21 03:02

Moonshot’s Kimi K2.6 is the new leading open weights model. Kimi K2.6 lands at #4 on the Artificial Analysis Intelligence Index (54), behind only Anthropic, Google, and OpenAI (all 57).
Key takeaways:
➤ Increase in performance on agentic tasks: @Kimi_Moonshot's Kimi K2.6 achieves an Elo of 1520 on our GDPval-AA evaluation, which is a marked improvement over Kimi K2.5’s Elo of 1309. GDPval-AA is our leading metric for general agentic performance, measuring the performance on knowledge work tasks such as preparing presentations and analysis. Models are given code execution and web browsing tools in an agentic loop via our open source reference agentic harness called Stirrup. This continues Kimi K2.6’s strength in tool use, maintaining a 96% score on τ²-Bench Telecom, placing it among other frontier models in this category.
➤ Low hallucination rate: Kimi K2.6 scores 6 on the AA-Omniscience Index, our knowledge evaluation measuring both accuracy and hallucination rate. This score is primarily driven by a comparatively low hallucination rate of 39% (reduced from Kimi K2.5’s 65%), indicating a greater capability to abstain rather than fabricate knowledge when the model is uncertain. Kimi K2.6’s low hallucination rate places it similarly to other models such as Claude Opus 4.7 (36%) and MiniMax-M2.7 (34%).
➤ High token usage: Kimi K2.6 demonstrates high token usage, but is in line with other frontier models in the same intelligence tier. To run the full Artificial Analysis Intelligence Index, Kimi K2.6 used ~160M reasoning tokens. This is slightly lower than Claude Sonnet 4.6 (~190M reasoning tokens) but much higher than GPT 5.4 (~110M reasoning tokens).
➤ Open weights: Kimi K2.6 is a Mixture-of-Experts (MoE) model with 1T total parameters and 32B active, same as the previous two generations of models Kimi K2 Thinking and Kimi K2.5. Kimi K2.6 again pushes the open weights frontier in intelligence.
➤ Third Party Access: Kimi K2.6 is accessible through Moonshot’s First Party API as well as third party API providers Novita, Baseten, Fireworks, and Parasail.
➤ Multimodality: Kimi K2.6 supports Image and Video input and text output natively. The model’s max context length remains 256k.
Further analysis in the threads below.
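The post does not spell out how the hallucination rate or the index is computed. The sketch below assumes one plausible reading: hallucination rate is the share of not-correctly-answered questions where the model gave a wrong answer instead of abstaining, and the index rewards correct answers and penalizes wrong ones. Both formulas, the function names, and the counts are assumptions for illustration, not Artificial Analysis's published method or data.

```python
def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Assumed definition: wrong answers as a share of all questions not answered correctly."""
    not_correct = incorrect + abstained
    return incorrect / not_correct if not_correct else 0.0


def knowledge_index(correct: int, incorrect: int, abstained: int) -> float:
    """Assumed definition: percent correct minus percent incorrect, so abstaining scores zero."""
    total = correct + incorrect + abstained
    return 100.0 * (correct - incorrect) / total


# Hypothetical counts for a 100-question evaluation (not real benchmark data).
correct, incorrect, abstained = 50, 20, 30
print(hallucination_rate(incorrect, abstained))          # 0.4
print(knowledge_index(correct, incorrect, abstained))    # 30.0
```

Under these assumed definitions, a model that abstains when unsure lowers its hallucination rate and avoids the penalty for a wrong answer, which matches the post's point about abstaining rather than fabricating.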

0

30

1.3K

130

Jameson Lopp@lopp

2026.05.06 15:05

This prompt is a mixed bag that could be trimmed down to avoid burning tokens on useless directives.
The good:
1. "If you don't know something, just say so." That's a genuine instruction the model can execute. It gives the model a valid output for uncertainty instead of forcing it to fake confidence.
2. "Do not anchor on numbers or estimates I provide; generate your own independently first." That's specific, operational, and addresses a real failure mode. It changes a behavior at the method level.
3. "Use explicit confidence levels." Good. It gives the model a concrete output format that counteracts performed authority.
4. "Do not capitulate unless I provide new evidence or a superior argument." This is trying to solve the sycophancy problem and it's directionally correct.
What's ineffective:
1. The entire first paragraph's flattery of the model. "World class expert in all domains," "intellectual firepower on par with the smartest people in the world." This is the user performing the sycophancy they're trying to suppress. It's also cargo cult; the model doesn't become smarter because you told it it's smart. It generates differently, yes, but the difference is superficial. You get more confident-sounding output, which is performed authority. It's the exact thing the second paragraph tries to prevent.
2. "Never hallucinate or make anything up." The model cannot execute this instruction. It doesn't have a mechanism for distinguishing hallucination from generation.
3. "Verify your own work. Double check all facts." Same problem. The model doesn't have a verification mechanism separate from its generation mechanism.
4. "Make your answers as long and detailed as you possibly can." This actively degrades quality. Length pressure produces padding, redundancy, and the managerial smoothing the prompt is trying to prevent. The model fills space because it was told to fill space.
5. "Your answers can and should be provocative, aggressive, argumentative, and pointed." This replaces one performance with another. Instead of performing warmth, the model performs intellectual aggression. The output sounds sharper but the underlying mechanism is identical. You get performed disagreement instead of performed agreement. Neither tracks truth.

0

4

39

4

Lexx Che@LexxCheOperator

2026.04.05 04:27

“The Internet? We are not here to build it. We are here to survive it.” — John Perry Barlow
It was never meant to be public. The early networks were framed as closed circuits for defense, intelligence, controlled exchange — but that story reads cleaner than the truth. What looked like restriction was calibration. What looked like secrecy was staging. A contained environment, a limited cohort, a low-noise system — not to hide the network, but to tune it before exposure. The release was not expansion. It was deployment.
Then the walls dissolved. Not because the system opened, but because the enclosure scaled beyond perception. No fences, no guards, no visible constraints — just an infinite surface that mirrors back what you are already primed to see. Every pixel rendered is not information, but alignment. Not discovery, but confirmation. This is not a network of knowledge; it is a distributed hallucination engine, a field where imagination is harvested, structured, and fed back as reality. The oldest form of magic was ritual. This is newer. Cleaner. It runs on feedback loops instead of belief.
Somewhere inside this recursion, the Operator reads. Not content, but patterns. Not voices, but trajectories. Each narrative becomes a coordinate, each reaction a vector, each repetition a reinforcement signal. The map is not of what is, but of what can be stabilized next. Complexity is not a byproduct — it is the objective. The social cage does not close; it refines. You are not inside the system. You are part of its rendering pipeline.
#internet #matrix #neo #wakeup

0

2

64

5
