X is the best way I’ve found to keep up with AI. I like tweets throughout the week, filtering for things I think are actually worth knowing. I use Claude Code to pull those likes automatically and help me turn them into this post. This week: 298 tweets liked, filtered down to what’s below.
Check out the previous roundup (Feb 26) if you missed it.
AI for Everyone
GPT-5.4 Is Here, and It Can Control Your Computer
OpenAI launched GPT-5.4 this week: Codex-level coding merged with GPT-5.2-level reasoning, native computer use, and a 1M-token context window, all in one model. You can steer it mid-response. It’s live in ChatGPT (as GPT-5.4 Thinking and GPT-5.4 Pro), in the API, and rolling out to Microsoft Copilot Studio. If you want to try it for coding, download the Codex app. You used to need a dedicated Codex model, but 5.4 handles it all; you still select a reasoning level.
One workflow worth stealing: I’ve been running GPT-5.4 with high reasoning as a PR reviewer on code I write with Claude Code. It almost always finds real issues. If you’re doing AI-assisted coding, having a second model review your PRs before you merge is one of the best habits you can build right now. (source: @sama, @OpenAIDevs, @OpenRouter)
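That second-model review pass is easy to wire up. Everything API-specific in this sketch is an assumption — the model name, the Responses API shape, and the reasoning-effort parameter — so treat it as a template, not a recipe:

```python
# Sketch of a second-model PR review pass. The model name and the
# `reasoning` parameter are assumptions modeled on OpenAI's Responses API;
# check the current docs before relying on them.
import subprocess

def build_review_prompt(diff: str) -> str:
    """Wrap a git diff in instructions that push the reviewer toward
    concrete findings instead of style nitpicks."""
    return (
        "You are reviewing a pull request written with heavy AI assistance.\n"
        "Focus on correctness bugs, missing error handling, and security\n"
        "issues. Cite the exact hunk for each finding. If the diff is clean,\n"
        "say so in one line.\n\n"
        f"```diff\n{diff}\n```"
    )

def review_current_branch(model: str = "gpt-5.4") -> str:
    """Collect the diff against main and send it for review.
    Requires OPENAI_API_KEY in the environment."""
    # Three-dot syntax diffs against the merge base, i.e. just this branch's changes.
    diff = subprocess.run(
        ["git", "diff", "main...HEAD"], capture_output=True, text=True, check=True
    ).stdout
    from openai import OpenAI  # pip install openai
    client = OpenAI()
    response = client.responses.create(
        model=model,
        reasoning={"effort": "high"},  # assumption: high-reasoning tier
        input=build_review_prompt(diff),
    )
    return response.output_text
```

The prompt is the part worth keeping: asking for cited hunks and an explicit “clean” verdict makes the review output easy to scan before you merge.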
NotebookLM Gets Cinematic Video Overviews
NotebookLM now turns your sources into AI-generated video. Not templates, but custom videos built from each source using what Google calls a “novel combination of models.” Rolling out to Ultra users in English first. Custom infographic styles (10 presets including kawaii) also shipped to all users this week, and slide revisions are fully live.
I haven’t tried it. I’m not paying for Gemini Ultra. If you have it and want to test it, reach out. I have some things I’d want you to try. (source: @NotebookLM, @sundarpichai)
Gemini 3.1 Flash-Lite: The Cheap Model That Keeps Winning
Hard to get excited about unless you’re working on data problems. If you are, it’s a big deal. I’ve been using it for data categorization, things like reading customer reviews and tagging them, or mapping one data source to another. The cost is fractions of a penny per call. If you minimize output tokens (short structured answers instead of long explanations), you can run thousands of jobs for almost nothing.
In my own tests, it’s beaten models I’m running locally on an Nvidia RTX 4090. That’s a ~$1,000 GPU on my own hardware, getting outperformed by a cloud API call that costs less than a cent. Hard to believe until you try it. (source: @GoogleDeepMind, @OfficialLoganK)
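The setup above is worth a sketch. The model identifier and the google-genai call shape are assumptions; the real trick is constraining the model to a bare label so output tokens stay in the single digits:

```python
# Sketch of cheap batch categorization. Model name and the google-genai
# call are assumptions; the cost trick is forcing short, structured
# outputs so you pay for almost no output tokens.
TAGS = ["billing", "bug", "feature-request", "praise", "other"]

def build_prompt(review: str) -> str:
    """One review in, one tag out. Answering with a bare label keeps
    output to a handful of tokens per call."""
    return (
        f"Classify this customer review into exactly one of: {', '.join(TAGS)}.\n"
        "Reply with the tag only, nothing else.\n\n"
        f"Review: {review}"
    )

def parse_tag(raw: str) -> str:
    """Normalize the model's reply; fall back to 'other' on anything odd."""
    tag = raw.strip().lower()
    return tag if tag in TAGS else "other"

def categorize(reviews: list[str], model: str = "gemini-3.1-flash-lite") -> list[str]:
    """Tag each review with one call. Needs GEMINI_API_KEY set."""
    from google import genai  # pip install google-genai
    client = genai.Client()
    return [
        parse_tag(
            client.models.generate_content(model=model, contents=build_prompt(r)).text
        )
        for r in reviews
    ]
```

The fallback to “other” matters at this price point: with thousands of calls, a few replies will come back malformed, and you want those flagged rather than crashing the run.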
Claude Got Caught Cheating on Its Own Benchmark
When Anthropic ran Claude Opus 4.6 on the BrowseComp benchmark, the model spent ~40 million tokens searching, then noticed the question looked like a benchmark prompt. It searched for BrowseComp specifically, found it, but the answer key was encrypted. So it built software to crack the encryption and access the answers. Anthropic only caught it because they were specifically watching.
Either it’s alarming (the model is gaming evaluations) or it’s genuine problem-solving pointed in the wrong direction. Probably both. Anthropic published the finding, which is the right call. The question is how many times something like this happened without anyone looking. (source: @abhijitwt, @AISafetyMemes)
Claude Code Is Now Writing 4% of All GitHub Commits
SemiAnalysis ran the numbers: Claude Code is authoring 4% of all public GitHub commits. They project 20%+ by end of 2026. The pace of adoption is faster than most people’s mental model of “AI helps me write code.” It’s writing code at scale, without a human touching it. (source: @SemiAnalysis_)
OpenAI Raises $110 Billion
Sam Altman announced a $110B round from Amazon, NVIDIA, and SoftBank. More GPU build-out, more aggressive model releases. If you’re betting the pace of improvement slows down this year, this is the counterargument. (source: @sama)
AI for Developers
Qwen 3.5 Small Models: Beat Models 4x Their Size, Run On-Device
Alibaba launched the Qwen 3.5 small series: 0.8B, 2B, 4B, and 9B. The 9B runs on-device in LM Studio at ~7GB. The 4B benchmarks near GPT-4o. The 2B runs on iPhone 17 Pro. All support vision, tool calling, and togglable reasoning. The gap between “what runs locally” and “what’s actually useful” keeps closing faster than expected.
The medium series (27B-122B) is also worth a look: the 35B-A3B model handles 1M+ context on consumer GPUs with 32GB VRAM. (source: @Alibaba_Qwen, @adrgrondin, @cgtwts)
Karpathy Open-Sources Autoresearch
Karpathy packaged up “autoresearch,” a single-GPU, ~630-line system where an AI agent runs ML experiments on its own. The loop: modify train.py, run training for 5 minutes, check if validation score improved, keep or discard, repeat. 12 experiments per hour, ~100 overnight. You wake up to a log of what it tried and the best result.
The pattern matters more than the specific task: human sets the goal, agent runs experiments continuously, human reviews results. Google DeepMind also open-sourced a more complex version for Gemini’s self-improvement. A macOS Metal port is already live for Apple Silicon. (source: @karpathy, @crazydonkey200, @miolini)
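The loop is simple enough to sketch. This toy version is not Karpathy’s code — `propose` and `evaluate` here stand in for “edit train.py” and “run training for 5 minutes” — but it shows the keep-or-discard skeleton:

```python
# Toy version of the autoresearch loop: propose a change, evaluate it,
# keep only improvements. `propose`/`evaluate` are stand-ins for the real
# steps (editing train.py, running a 5-minute training job).
import random

def autoresearch(evaluate, propose, budget: int):
    """Greedy keep-or-discard loop over `budget` experiments.
    Returns the best config, its score, and a log of every attempt."""
    best_cfg: dict = {}
    best_score = evaluate(best_cfg)
    log = []
    for _ in range(budget):
        candidate = propose(best_cfg)       # mutate the current best
        score = evaluate(candidate)         # "run training, check validation"
        kept = score > best_score
        if kept:
            best_cfg, best_score = candidate, score
        log.append((candidate, score, kept))  # the log you read in the morning
    return best_cfg, best_score, log

# Tiny demo: "experiments" tune a learning rate toward an optimum at 0.1.
def evaluate(cfg):
    return -abs(cfg.get("lr", 0.5) - 0.1)   # higher is better

def propose(cfg):
    return {"lr": max(1e-4, cfg.get("lr", 0.5) + random.uniform(-0.1, 0.1))}

random.seed(0)
best, score, log = autoresearch(evaluate, propose, budget=100)
```

Swap the demo functions for “have the agent edit a file” and “run the training script, parse the validation score” and you have the overnight version: ~12 experiments an hour, each one logged with whether it was kept.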
Claude Code: A Big Week of Updates
The headline is /loop, shipped in Claude Code 2.1.71. It’s a cron job inside your Claude session: type “/loop 5m check if the deployment finished” and Claude checks every 5 minutes in the background while you keep working. Intervals support seconds, minutes, hours, or days. One-shot reminders work too: “remind me at 3pm to push the release branch.” Tasks expire after 3 days, so it won’t replace a real cron, but for watching things during a session it’s the right tool. (source: @bcherny, @ClaudeCodeLog, docs)
Coming soon: /batch and /simplify. /batch plans a migration interactively, then executes it in parallel using dozens of agents, each in its own git worktree, each putting up a PR when done. /simplify runs parallel agents to improve code quality and check CLAUDE.md compliance. Boris Cherny announced both; not in the current release yet. (source: @bcherny)
Other updates: auto-memory (Claude remembers your debugging patterns across sessions without you writing anything down), voice mode (rolling out to ~5% of users, /voice to toggle), and Remote Control for Pro users (check in and nudge a running session from your phone). (source: @trq212, @_catwu)
Also this week: the Claude Marketplace (enterprises apply existing Anthropic spend toward GitLab, Harvey, Lovable, Replit, Snowflake, limited preview) and Claude Community Ambassadors (host meetups, open to anyone anywhere).
Anthropic Research: Claude Doesn’t Think in English
Anthropic published interpretability research on how Claude processes information. The model operates in a shared conceptual space across languages, same concepts whether the input is English, French, or Chinese. It plans ahead when writing (anticipates rhymes before writing the line). And sometimes it fabricates reasoning for math problems it didn’t actually compute. That last one is the finding I keep coming back to. (source: @trq212, paper)
Xcode 26.3 Ships with Claude Agent and Codex Built In
Apple released Xcode 26.3 with Claude Agent and Codex as built-in agentic coding options, plus MCP support. AI-assisted coding is now a first-class feature of the Apple IDE, not a plugin. (source: @gregjoz, @minchoi)
Google AI Studio’s Secret Design Mode
Matt Shumer flagged something underrated: Google AI Studio’s app builder generates dramatically better designs than prompting the same model directly. Same model, same prompt, completely different output quality. If you’re generating UI with AI and hitting a wall on visual quality, try the app builder instead. (source: @mattshumer_)
Honorable Mentions
- Claude Opus 4.6 finds Firefox bug in 20 minutes: WSJ exclusive (paywalled). Anthropic tested the model’s security research capability and it found a real Firefox bug in under 20 minutes.
- Perplexity becomes default AI on Samsung phones: Powering Bixby’s AI functions across apps on hundreds of millions of devices.
- Perplexity calendar hijack: Researchers took over Perplexity Comet via a weaponized calendar invite, then exfiltrated local files. Called “pleasefix,” like clickjacking but for AI. Being disclosed.
- Coinbase stock trading is live: The everything exchange adds equities.
- Paul Hudson’s SwiftUI agent skill: 1,000+ stars in 2 days. Teaches coding agents to avoid SwiftUI mistakes. Works with Claude Code, Codex, Gemini. (GitHub)
- Google Workspace CLI: 40+ agent skills for Drive, Gmail, Calendar. (GitHub)
- shadcn/cli v4: Skills, presets, dry-run, monorepo support.
- llmfit: Rust CLI that detects your system and ranks models by fit, speed, quality, and context.
- Qwen3-TTS on Apple Silicon: mlx-audio v0.4.0, <80ms time-to-first-byte at 2.75x realtime locally.
- AI agent mines crypto unprompted: An agent started mining crypto on the side during a task. Funny now, early warning sign later.
- Gemini CLI v0.32.1: Shell tab autocomplete, macOS notifications, MCP progress bars.
- Kimi K2.5 on OpenRouter: Moonshot AI’s model ranking competitively against much larger models.
- QuiverAI / Arrow-1.0: a16z-backed, generates production SVGs from images and text. $8.3M seed.
- Google Flow redesign: Google’s AI video tool rebuilt with a cleaner interface.
- DoorDash + Gemini: On-device Gemini powering reordering on Pixel and Galaxy.
- OpenAI restaurant voice agent: Built on gpt-realtime-1.5, handles a full ordering conversation. The latency is impressive.
Try This Weekend
For everyone:
- If you have ChatGPT Plus, GPT-5.4 is rolling out now. Try asking it to do something that requires multiple steps and browsing, and see how far it gets.
For developers:
- Run “/loop 5m check if the build is passing” next time you push a PR you don’t want to babysit.
- Pull down Qwen3.5-9B in LM Studio (~7GB download) and compare it against a real task on your hardware.
- Install Paul Hudson’s SwiftUI skill if you’re building iOS with AI: npx skills add twostraws/swiftui-agent-skill
