
Anthropic built a model too risky to release
Hey folks, Keshav here. Ben is at AI Engineer this week, so I’m covering the intro.
A mis-timed blog post last week leaked Anthropic’s next model - Claude Mythos. Well, it is real and has massive improvements on benchmarks over Opus 4.6:
53.4% → 77.8% on SWE-bench Pro
65.4% → 82% on Terminal-Bench 2.0
but we are not getting access to it anytime soon. Why? Because it is really good at finding and exploiting software vulnerabilities. On Firefox exploit generation, Opus managed 2 working exploits out of hundreds of attempts. Mythos hit 181.
It found many-decades-old bugs in critical software projects like OpenBSD (27-year-old bug), FFmpeg (16-year-old bug) and more.
Instead of releasing it publicly, Anthropic is giving 12 companies access to a preview version of Mythos under “Project Glasswing” to find vulnerabilities in critical software. Anthropic is committing $100M in model usage credits and $4M in donations to open-source security orgs under this project.
Theo made a video on this, and I like his point: “Mythos is to Opus what Opus is to Sonnet.”
I tweeted a list of companies that Meta has acquired in the past year without anything to show for it, and soon after, Meta released details about their latest model - Muse Spark. At a glance, it sits somewhere between Sonnet 4.6 and Opus 4.6. Not usable yet: API access is coming, and there are promises about open-source too (rip llama).
Many people are dunking on Meta for its not-so-frontier model release after spending billions and a year of silence, but I think it’s a good step forward. Plus, have you used Instagram search over the past couple of months? It’s gotten really good courtesy of AI.
As always, good recap from Ethan Mollick on the state of frontier models: Google, OpenAI and Anthropic lead, Meta joins the pack for now while xAI has fallen off, and the best Chinese models are still 7-9 months behind.
ps: Factory’s desktop app is now out of beta. It comes with a cloud computer, the ability to use other apps on your device, and, of course, the ability to run and manage multiple Droid sessions easily.
Ben’s Bites is brought to you by Attio, the AI CRM
Honestly, no one gets excited about a CRM. But then they try Attio. It connects to Claude Code and n8n through its MCP server, completely bridging the gap between my customer data and apps. Wait, there's more, like flagging churn risk and turning customer feedback into Linear projects. Try it now.
Headlines
Claude Managed Agents - You can use Claude’s developer console to build and deploy agents and let Anthropic handle the infra for it, vs building it yourself. For example, Notion is using managed agents to build a “delegate tasks to Claude” feature. (Anthropic’s engineering blog on building this).
Cursor has a new design mode to annotate and target UI elements in the browser. Plus, run Cursor on any machine and control it from anywhere, including your phone.
Gemini app finally has projects - they call it notebooks. Similar features to Claude/ChatGPT projects - move chats in/out of notebooks, notebook-specific files and memories, with the additional feature to sync these notebooks between the Gemini app and NotebookLM.
Clicky is an ambient AI buddy on your Mac. It sees your screen, talks to you and points at things to guide you (demo). Farza built (and open-sourced) it as a learning tool, but people are using it for everything.
Choosing an accurate speech-to-text model is harder than it looks. Benchmarking one is even harder. See why standard word error rate falls short, and what better STT evaluation actually looks like.*


