AI Resources Weekly Report

Update frequency: automatically updated weekly


New Tools

Code Agents

IDE Plugins

  • cline: Autonomous coding agent right in your IDE, capable of creating/editing files, executing commands, using the browser, and more with your permission every step of the way.
  • Roo-Cline: A fork of Cline, an autonomous coding agent, with some additional experimental features. It’s been mainly writing itself recently, with a light touch of human guidance here and there.
  • Roo Code: (Formerly Roo-Cline) An autonomous coding agent and significantly evolved fork of Cline. Now features “Custom Modes” and operates as a full AI Coding OS.
  • Claude Code Workflow Studio: Design complex AI agent workflows by conversing with AI – or use intuitive drag-and-drop. Build Sub-Agent orchestrations and conditional branching with natural language, then export directly to .claude format.
  • GitKraken Desktop 12.0: Agent Mode — dedicated view for launching, monitoring, and managing multiple AI coding agent sessions from a single interface (Apr 2026)

CLI & Cloud Agents

  • vm0: Natural language Agent, 24/7 in cloud sandbox
  • agenticSeek: Fully Local Manus AI
  • ml-intern: HuggingFace’s open-source AI agent that automates the full LLM post-training workflow
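  • GELab-Zero-4B-preview: A GUI agent model focused on Android, optimized for interface interactions (tap, type, swipe, wait, and so on). It can carry out multi-step, long-horizon tasks across multiple apps (such as food delivery, transportation, shopping, and social).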
  • Atlassian Rovo Dev: Enterprise AI coding agent CLI — integrates with Jira, Confluence, Bitbucket via Teamwork Graph. Supports Claude Code, Cursor, Copilot, Kiro, OpenCode via installable skills (Apr 28, 2026)
  • Warp Terminal: Open-source AI-native terminal (MIT/AGPLv3, Rust). 41K+ GitHub stars in 48h. Built-in support for Claude Code, Codex, Gemini CLI. OpenAI founding sponsor. Oz cloud agent orchestration (Apr 30, 2026)
  • cplt: Sandbox wrapper for AI coding agents — runs GitHub Copilot CLI, OpenCode, Gemini CLI, or plain shell inside a kernel-level sandbox. macOS (Seatbelt/SBPL) + Linux (Landlock LSM + seccomp-BPF). MIT license (May 12, 2026)
  • agent-sandbox: Runs AI coding agents inside isolated Docker containers with no root access, no Docker socket, and no privilege escalation. Multi-language support (Node, Go, PHP) (May 8, 2026)
  • Code Bench: Local-first desktop AI coding agent, BYO model (MIT). Desktop-native with integrated AI agent (May 10, 2026)
  • Kanbots: Open-source Kanban desktop app that runs parallel AI coding agents (Claude Code, Codex) on every card — local collaboration interface for multi-agent task management (May 2026)
  • CodeGraph: Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, OpenCode, and Hermes Agent — fewer tokens, fewer tool calls, 100% local (16.4K stars, MIT)

UI & Design Dev Tools

  • Pax dev and code pax: Pax is a revolutionary new canvas for building apps & websites with AI.
  • makepad: A new way to build UIs in Rust for both native and the web.
  • superdesign: extract webpage info and generate UI designs
  • ui.sh
  • variant

Video Generation

Video Models

  • FramePack: Next-frame prediction approach that uses a 13B model to generate 1-minute videos (60 seconds at 30 fps, 1,800 frames); the minimum GPU memory required is 6 GB
  • Wan2.1: A comprehensive and open suite of video foundation models that pushes the boundaries of video generation.
  • Pusa: A new video diffusion model that matches SOTA with 200x less training cost and 2500x less data. It outperforms Wan-I2V on VBench-I2V, runs 5x faster, and supports I2V, T2V, start-end frames, video extension, and video completion.
  • HunyuanVideo: Tencent’s 13B+ parameter video generation model with systematic framework
  • CogVideoX: Tsinghua’s expert transformer-based text-to-video diffusion model
  • Luma AI Dream Machine: Text-to-video and image-to-video generator powered by Ray3 model, featuring 4K resolution output and realistic physics
  • Sora 2: OpenAI’s latest video generation model with enhanced coherence and length capabilities
  • Runway Gen-4.5: World’s top-rated video model — unprecedented visual fidelity and creative control, advanced camera presets
  • Seedance 2.0: ByteDance’s multimodal video model — #1 on Arena for T2V (Elo 1460) and I2V (Elo 1454) as of April 2026. Unified audio-video generation, accepts up to 12 mixed inputs, 4–15s clips at 2K with native audio
  • Kling 3.0: 4K@60fps video generation, ~$0.50/10s clip, best value for volume creators
  • Veo 3.1: Google DeepMind’s video model — current benchmark for cinematic realism and native audio quality
  • PixVerse V6: Advanced video generation with precise parameterized camera control; R1 real-time world model with shared worlds + personalized avatars (Apr 2026 update); Series C unicorn status, 100M+ users across 177 countries
  • Pollo AI: All-in-one AI video & image platform, innovative video-to-video transformation
  • Genra AI: Chat-to-Video AI agent with built-in music, avatars, and templates: 15B params, ranks #1 on T2V and I2V leaderboards (Elo 1333/1392) as of April 2026, accepts 12 simultaneous multimodal inputs
  • Gemini Omni: Google’s upcoming unified video generation model, surfaced ahead of Google I/O 2026. Demonstrates significantly improved text rendering in generated video (May 2026, pre-release)

Avatar & Talking Head

  • SynTalker: Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation
  • EchoMimicV3: 1.3B Parameters for Unified Multi-Modal and Multi-Task Human Animation (AAAI 2026)
  • JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation
  • MyTimeMachine: Personalized Facial Age Transformation
  • MEMO: MEMO is a state-of-the-art open-weight model for audio-driven talking video generation.
  • StableAnimator: High-Quality Identity-Preserving Human Image Animation (CVPR 2025)
  • INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations. Given the dual-track audio in dyadic conversations and a single portrait image of an arbitrary agent, the framework can dynamically synthesize verbal, non-verbal, and interactive agent videos with lifelike facial expressions and rhythmic head pose movements.
  • LatentSync: Taming Stable Diffusion for Lip Sync! - State-of-the-art lip-sync technology from ByteDance
  • KDTalker: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait (IJCV 2025)
  • FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis (ACM MM 2025)
  • HeyGen Avatar V: Most realistic AI avatar — 15-second recording creates full identity model, 175+ language translation with lip-sync, zero identity drift
  • HappyHorse 1.0

Video Tools

  • VideoCaptioner: An intelligent video subtitle processing assistant based on Large Language Models (LLM), supporting subtitle generation, optimization, translation and more
  • VideoLingo: VideoLingo is an all-in-one video translation, localization, and dubbing tool aimed at generating Netflix-quality subtitles. It eliminates stiff machine translations and multi-line subtitles while adding high-quality dubbing, enabling global knowledge sharing across language barriers.
  • SmolVLM real-time camera demo: A simple demo of how to use the llama.cpp server with SmolVLM 500M to get real-time object detection

Audio

TTS

  • OpenAudio (formerly Fish Audio): Supports both TTS and ASR; #1 on TTS-Arena2 with 0.008 WER
  • fish-speech-gui
  • F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching - Very active, with streaming support
  • OuteTTS: An experimental text-to-speech model that uses a pure language modeling approach to generate speech, without architectural changes to the foundation model itself.
  • ElevenLabs Flash: Generates speech in 75 ms plus application and network latency. Build directly via the API using model ID “eleven_flash_v2” or “eleven_flash_v2_5”. April 2026: v3 conversational model, TTS Normalizer v3.1, Agent platform with tool error handling and voice quality assessment.
  • Orpheus TTS: Natural human-sounding speech synthesis with a Llama-3b backbone, <200ms latency
  • dia: Dia directly generates highly realistic dialogue from a transcript. Recently released Dia2 with 1.6B params
  • VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model (53ms first token latency)
  • Muyan-TTS: A trainable TTS model designed for podcast applications within a $50,000 budget, pre-trained on over 100,000 hours of podcast audio data
  • Higgs Audio V2: Redefining Expressiveness in Audio Generation with 75.7% win rate over GPT-4o-mini
  • SoulX-Podcast: Very recent podcast generation with Chinese dialects support
  • sonic-3
  • MiniMax Speech 2.6
  • Spark-TTS: Recent open-source TTS model
  • Voxtral TTS: Mistral AI’s open-source text-to-speech model
  • FireRedTTS: Competitive open-source option
  • MoonCast: Multi-speaker dialogue generation
  • CosyVoice2: Advanced voice cloning
  • Gemini 3.1 Flash TTS: Google’s most expressive TTS — 200+ inline audio tags, 70+ languages, two-speaker mode, $0.50/1M chars, Elo 1211 on Artificial Analysis (highest)
  • MiMo-V2.5-TTS: Xiaomi’s open-source TTS series (VoiceDesign / VoiceClone variants), supports Chinese dialects
  • Kani-TTS-2: 400M param open-source TTS, runs in 3GB VRAM, 12 languages, voice cloning support
  • OmniVoice: 600+ language zero-shot TTS with diffusion language model, SOTA on multilingual benchmarks
  • Chatterbox Multilingual: MIT-licensed zero-shot TTS for 23 languages with emotion control and watermarking
  • TADA: Hume AI’s open-source zero-hallucination TTS, 5x faster than LLM-based TTS (1B/3B models)
  • MOSS-TTS-Nano: Open-source 0.1B param multilingual TTS from Fudan NLP Lab / MOSI.AI — CPU-only inference, 48kHz stereo, 20 languages, voice cloning, Apache 2.0 (Apr 10, 2026)
  • MOSS-TTS Family: Open-source speech & sound generation family — MOSS-TTS (8B), VoiceGenerator (1.7B), SoundEffect (8B), TTS-Realtime (1.7B, 180ms TTFB, 377ms e2e). Covers long-form, dialogue, voice design, SFX, streaming. Apache 2.0
  • Supertonic: Lightning-fast, on-device, multilingual TTS running natively via ONNX — supports Swift, Python, Rust, C++, Java, Go, Node.js, Flutter, Web (9.4K stars, MIT)
  • LongCat-AudioDiT: Meituan’s 3.5B non-autoregressive TTS — operates in waveform latent space, eliminates error accumulation, Adaptive Projection Guidance (APG) (Mar 31, 2026)
  • MAGIC-TTS: Fine-grained controllable speech synthesis with explicit local duration and pause control — flow-matching TTS with token-aligned timing (arXiv Apr 2026)
  • pocket-tts

ASR

  • Open ASR Leaderboard
  • FunASR: A speech recognition framework developed by the Speech Lab of DAMO Academy that integrates industrial-grade models for speech endpoint detection, speech recognition, punctuation segmentation, and more. It has attracted many developers to try it and contribute.
  • nvidia/canary-1b: Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC).
  • Parakeet TDT 1.1B (en): An ASR model that transcribes speech into lowercase English text. It is jointly developed by the NVIDIA NeMo and Suno.ai teams and is an XXL version of the FastConformer TDT model (around 1.1B parameters).
  • CrisperWhisper: An advanced variant of OpenAI’s Whisper, designed for fast, precise, and verbatim speech recognition with accurate (crisp) word-level timestamps. Unlike the original Whisper, which tends to omit disfluencies and follows more of an intended transcription style, CrisperWhisper aims to transcribe every spoken word exactly as it is, including fillers, pauses, stutters, and false starts. Supports English and German.
  • Deepgram
  • Whisper-WebUI: A Gradio-based browser interface for Whisper. You can use it as an Easy Subtitle Generator!
  • ten-vad: TEN VAD is a real-time voice activity detection system designed for enterprise use, providing accurate frame-level speech activity detection. It shows superior precision compared to both WebRTC VAD and Silero VAD, which are commonly used in the industry. Additionally, TEN VAD offers lower computational complexity and reduced memory usage compared to Silero VAD. Meanwhile, the architecture’s temporal efficiency enables rapid voice activity detection, significantly reducing end-to-end response and turn detection latency in conversational AI systems.
  • Omnilingual ASR: 1600+ languages support with zero-shot learning
  • Scribe v2 Realtime Speech to Text - 150ms Latency API: very promising
  • WhisperX: Improved Whisper with diarization
  • SenseVoice: FunASR’s multilingual model
  • Qwen3 ASR: Pure Rust implementation of Qwen3-ASR automatic speech recognition
  • MiMo-V2.5-ASR: Xiaomi’s open-source ASR — handles bilingual conversations, Chinese dialects (Wu, Cantonese, Minnan, Sichuanese), song lyrics recognition
  • mega-asr

Music

  • qa-mdt (active): (OpenMusic) Awesome open-source text-to-music (TTM) generation with QA-MDT (accepted at IJCAI-25)
  • DiffRhythm: The world’s first end-to-end music model based on diffusion
  • suno: Leading commercial music AI platform — Suno v5.5 with Studio DAW, personal voice cloning; Warner Music licensing deal (2026); 50 credits/day free tier
  • Lyria: Google’s music model (powering YouTube)
  • Beatoven.ai: Royalty-free music generation
  • LoudMe: Text-to-song generator
  • Udio: Professional music generation — best vocal realism, Inpaint editing feature; legal status unsettled after RIAA lawsuit
  • SUMO: Pay-per-track AI music generation with clear commercial rights
  • huobao-drama
  • khala

Voice Tools

  • UVR5-UI: UI for UVR’s state-of-the-art source separation models, which remove vocals from audio files
  • Grok Speech API: xAI’s standalone STT/TTS APIs; STT error rate 5.0% (industry-best), 25 languages, $0.10/hr batch / $0.20/hr streaming
  • MMAudio: MMAudio generates synchronized audio given video and/or text inputs
  • omniaudio-2.6b and demo: World’s Fastest Audio Language Model for Edge Deployment
  • openai-webrtc-go
  • MicDrop: Transform your voice into any voice, instantly.
  • Spatial Speech Translation: Translating Across Space With Binaural Hearables
  • Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
  • Grok Speech API: xAI’s standalone speech-to-text and text-to-speech APIs for enterprise voice developers
  • Amazon Nova 2 Sonic: AWS unified speech-to-speech foundation model — bidirectional streaming, unifies ASR + reasoning + tool use + TTS into single model. Background noise robustness, tone/sentiment adaptation

Image

  • Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer - 20x smaller and 100x faster than FLUX-12B (ICLR-2025 Oral)
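  • HivisionIDPhotos: Aims to develop a practical, systematic algorithm for intelligent ID photo production. It uses a complete AI model workflow to recognize a variety of user photo scenarios, perform matting, and generate ID photos.
  • LiYing: An automated photo processing program that handles the typical post-processing workflow for ID photos in photo studios.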
  • BRIA-RMBG: High-Accuracy, Legal, and Inclusive Background Removal
  • RMBG-2-Studio:
  • IC-Custom: IC-Custom is designed for diverse image customization scenarios, including:
  • InvSR: Arbitrary-steps Image Super-resolution via Diffusion Inversion (CVPR 2025)
  • upscayl: Upscayl lets you enlarge and enhance low-resolution images using advanced AI algorithms. Enlarge images without losing quality. It’s almost like magic! 🎩🪄
  • HiDream-I1: A new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds.
  • FLUX: Leading open-weight image generation model from Stable Diffusion creators
  • Recraft V3: #1 on Artificial Analysis rankings for 5 consecutive months
  • Midjourney V8.1: V8.1 Alpha — HD mode 3x faster & cheaper, native 2K rendering (--hd), Prompt Shortener, updated Describe, image prompts restored
  • Midjourney v7: Aesthetic intelligence benchmark — --cref character consistency, --sref style lock, cinematic texture
  • Remove.bg: Industry standard for automatic background removal
  • Z-Image-Turbo image generator
  • Z-Image-Turbo: A powerful and highly efficient image generation model with 6B parameters. Currently there are three variants.
  • GPT-Image-2: OpenAI’s next-gen image model, officially released April 21, 2026. First agentic image model with O-series reasoning, 2K resolution, near-perfect multilingual text rendering (CJK/Hindi/Arabic), natural language editing, web search integration. Powered by GPT-5.4 backbone. API pricing $8/$30 per million tokens.
  • Recraft V4: Design-optimized image model with native SVG vector output and raster variant, for professional brand asset generation
  • Flux 2 Max: Open-weight photorealistic image model, excels at complex multi-subject scenes
  • Dolphin: A novel multimodal document image parsing model that follows an analyze-then-parse paradigm. It addresses the challenges of complex document understanding through a two-stage approach designed to handle intertwined elements such as text paragraphs, figures, formulas, and tables.
  • Seedream 5.0: ByteDance’s latest image generation model; multi-version (5.0/4.5/4.0), advanced multi-image editing, real-time search integration, top-rated on Pixazo for versatility

3D Generation

  • StableGen: Transform your 3D texturing workflow with the power of generative AI, directly within Blender!
  • Meshy AI: Text-to-3D model generation with high-quality 3D assets — Meshy 6 latest model for highest fidelity
  • Hunyuan3D 2.1: Tencent’s 3D generation model for detailed textured assets
  • Hunyuan3D-Omni: Unified framework for controllable generation of 3D assets
  • hyper3d: 3D model generation (3D printing)
  • Seed3D 2.0: ByteDance’s next-gen 3D foundation model — officially released April 23, 2026. SOTA geometry + texture quality, improved PBR materials, supports scene generation from text/image/video (API via Volcano Engine)
  • Tripo Studio / Tripo H3.1: All-in-one AI 3D workspace. H3.1 (April 2026) — high-fidelity image-to-3D for production (game dev, product viz, interactive media). 20B+ parameter algorithm, 6.5M+ creators, AI model segmentation and 3D printing support
  • Hi3D v2.1: Converts AI images (GPT-Image-2 etc.) to production-ready 3D models, 2-5 min generation, manifold meshes for 3D printing

AI Search & Research

Search Engines

  • Perplexity AI: Market leader in AI search
  • MindSearch: Mimicking Human Minds Elicits Deep AI Searcher - Deployed on Puyu platform
  • khoj: A personal AI app to extend your capabilities. It smoothly scales up from an on-device personal AI to a cloud-scale enterprise AI.
  • Scira (formerly mplx.run): A minimalistic AI-powered search engine that helps you find information on the internet
  • Gemini Search: A Perplexity-style search engine powered by Google’s Gemini 2.0 Flash model with grounding through Google Search
  • exa: AI-native search engine for developers and research
  • You.com: AI-powered search with personalization
  • Kagi: Premium AI search engine
  • Brave Search: Privacy-focused AI search

Research & Deep Research

  • SurfSense: While tools like NotebookLM and Perplexity are impressive and highly effective for conducting research on any topic/query, SurfSense elevates this capability by integrating with your personal knowledge base. It is a highly customizable AI research agent, connected to external sources such as search engines (Tavily, LinkUp), Slack, Linear, Notion, YouTube, GitHub and more to come.
  • Agentic Company Researcher 🔍: A multi-agent tool that generates comprehensive company research reports
  • MAESTRO: Your Self-Hosted AI Research Assistant
  • Open Deep Research: Deep research has broken out as one of the most popular agent applications. This is a simple, configurable, fully open source deep research agent that works across many model providers, search tools, and MCP servers. Its performance is on par with many popular deep research agents (see Deep Research Bench leaderboard).
  • MiroThinker/code: MiroMind’s flagship research agent model. It is an open-source search model designed to advance tool-augmented reasoning and information-seeking capabilities, enabling complex real-world research workflows across diverse challenges.
  • CO-storm: Get a Wikipedia-like report on your topic with AI. STORM is a research prototype that supports interactive knowledge curation.
  • OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
  • Resume Matcher: A free, open-source, AI-based tool for tailoring your resume to a job description. Find the matching keywords, improve readability, and gain deep insights into your resume.
  • InsightExpress: InsightExpress is a Next.js application that generates AI-powered research reports based on user-provided topics and emails them to users. The application leverages Langflow for its AI capabilities and features a modern, responsive UI built using NextJS.

Document & OCR

  • pdf2md: Self-hostable API server and pipeline for converting PDFs to markdown using thrifty large language vision models like GPT-4o-mini and gemini-flash-1.5.
  • PDFMathTranslate: PDF scientific paper translation and bilingual comparison.
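  • gamma: Beautiful presentations, documents, and websites. No design or coding skills required.
  • refly: A revolutionary AI-native content creation engine! ⚡️ Refly is an AI-native creation tool built around the “free canvas 👨‍🎨👩‍🎨” concept, giving users a one-stop solution from the first spark of an idea to finished content 🌈: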
  • markitdown: MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc).
  • ReaderLM: converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved longer context handling
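  • AiryLark: An open-source document processing tool that supports input and processing of multiple file formats. Whether the source is a PDF document, a Word file, or plain text, AiryLark handles it efficiently.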
  • AI Presentation Generator
  • GutenOCR
  • deepseek-ocr-2
  • PaddleOCR
  • glm-ocr

AI Agents & OpenClaw

Agent Frameworks

  • ChainForge: An open-source visual programming environment for battle-testing prompts to LLMs.
  • anychat: A unified chat interface for multiple AI models powered by Gradio. This application provides access to various leading AI models through a simple tab-based interface.
  • aisuite: Simple, unified interface to multiple Generative AI providers.
  • Open Canvas: Open Canvas is an open source web application for collaborating with agents to better write documents. It is inspired by OpenAI’s “Canvas”, but with a few key differences.
  • Reppl: Using APPL to reimplement popular algorithms for Large Language Models (LLMs) and prompts; experimental AI-assisted re-implementation of prompt optimization algorithms.
  • Google ADK → Gemini Enterprise Agent Platform: Google’s Agent Development Kit — major upgrade at Cloud Next 2026. Graph-based multi-agent orchestration, 6T+ tokens/month, secure Workspaces, multimodal streaming. Now part of unified Gemini Enterprise Agent Platform replacing Vertex AI
  • Kimi Claw: Moonshot AI’s native OpenClaw agent — 5,000 community skills, 40GB cloud storage
  • Statewright: Visual state machine guardrails for AI agents — controls which tools agents can use in each phase. Works across Claude Code, Codex, Cursor, OpenCode, Pi. Local models (13GB+) go from 2/10 to 10/10 on SWE-bench subset. 122 HN points (May 12, 2026)
  • Mozaik: TypeScript framework for building reactive AI agents — models LLM context as a first-class primitive, aligned with OpenResponses spec. Provider-agnostic, typed context composition (May 11, 2026)
  • Rotunda: A browser built for AI agents from the ground up with simulated typing and anti-CAPTCHA. Playwright-compatible Python library (May 13, 2026)
  • Recursant: Mesh-based control plane for governing AI agents — service mesh for production agent deployments (May 13, 2026)
  • Lowdefy v5.3: AI agents in 30 lines of YAML — low-code agent orchestration platform (May 11, 2026)
  • OpenHuman: Your personal AI superintelligence — private, simple, extremely powerful. Rust-based (25.7K stars)
  • Academic Research Skills: Academic Research Skills for Claude Code: research → write → review → revise → finalize pipeline (19K stars)
  • Superpowers: Agentic skills framework & software development methodology that works (202.8K stars, MIT)
  • RuView: Turns commodity WiFi signals into real-time spatial intelligence, vital sign monitoring, and presence detection — no camera needed (63.9K stars, MIT)
  • AgentMemory: #1 Persistent memory for AI coding agents based on real-world benchmarks (16.4K stars)
  • CloakBrowser: Stealth Chromium that passes every bot detection test — drop-in Playwright replacement with source-level fingerprint patches. 30/30 tests passed (18.7K stars, MIT)
  • ViMax: Agentic Video Generation — Director, Screenwriter, Producer, and Video Generator All-in-One (6.7K stars, MIT)

Desktop Coworkers

  • Open Claude Cowork: An open-source desktop chat application powered by Claude Agent SDK and Composio Tool Router. Build AI agents with access to 500+ tools and persistent chat sessions.
  • openwork: Open Source AI Desktop Agent (coworker)
  • Claude-Cowork: Agent Cowork is an open-source alternative to Claude Cowork, a desktop AI assistant that helps with programming, file management, and any task you can describe (desktop)
  • eigent: The Open Source Cowork Desktop to Unlock Your Exceptional Productivity
  • AionUi: Cowork with Your AI, Gemini CLI, Claude Code, Codex, Qwen Code, Goose CLI, Auggie, and more
  • Deepseek-Cowork:

OpenClaw Ecosystem

  • KrillClaw: The world’s smallest AI agent runtime. 49KB. Written in Zig. Zero dependencies.
  • nullclaw: Fastest, smallest, and fully autonomous AI assistant infrastructure written in Zig
  • openfang: Open-source Agent OS built in Rust. 137K LOC. 14 crates. 1,767+ tests. Zero clippy warnings.
  • nanoclaw
  • ironclaw
  • nanobot
  • hermes-agent
  • NemoClaw: NVIDIA’s open-source secure OpenClaw stack — one-command install, privacy controls, Nemotron-powered agents
  • OpenHarness

Open Source LLMs

  • BAGEL: Unified Model for Multimodal Understanding and Generation - Outperforms Qwen2.5-VL and InternVL-2.5
  • VITA: Towards GPT-4o Level Real-Time Vision and Speech Interaction
  • Qwen2.5-Omni: Alibaba’s multimodal model with TTS capabilities
  • Gemma 4 (31B): Google DeepMind’s open multimodal model — text + image + video input, 256K context, 140+ languages, hybrid attention
  • DeepSeek-V4: MIT-licensed MoE in two variants, V4-Pro (1.6T total / 49B active) and V4-Flash (284B / 13B active), 1M context, 32T tokens pretrain. Released April 24, 2026. Competes with top closed-source models on reasoning, coding, agentic tasks. $0.14/M input (Flash) / $1.74/M input (Pro); cheapest frontier-class model ever released publicly
  • GLM-5.1: Z.ai’s open-weights model — sustains autonomous work for up to 8 hours, strong coding and agentic workflows. 744B MoE with 40B active
  • Qwen3.6-27B: Dense 27B model — flagship-level coding, 77.2% SWE-bench Verified, competitive with Claude 4.5 Opus on coding agents
  • Qwen3.6-35B-A3B: MoE variant with 3B active params for efficient deployment
  • Qwen3.5-397B-A17B: Alibaba’s flagship MoE — top-tier general chat, story writing, competitive with DeepSeek-V4
  • Qwen3.5-9B: 81.7% GPQA Diamond at $0.10/M tokens; benchmark leader in the sub-$0.20 tier
  • Tencent Hy3-preview: Tencent’s latest open-weights model — strong STEM reasoning, code & agent benchmarks
  • Llama 4 Behemoth: Meta Superintelligence Labs’ first frontier model — Intelligence Index 52 (vs Llama 4 Maverick’s 18)
  • Llama 4 Scout: 10M token context window — longest in any open-weight model
  • Gemma 4: Google’s strongest open-weight family — four variants from 2B to 31B, native text+image+audio, Apache 2.0
  • Kimi K2.5/K2.6: Moonshot AI’s MoE models — strong agentic performance, top OpenClaw compatibility
  • Mistral Small 4: Apache 2.0, 6.5B active params — efficient open-source model for self-hosted deployment
  • Mistral Large 3: Mistral’s flagship model — competitive with frontier closed-source models
  • NVIDIA Nemotron 3 Super: High-performance inference model from GTC 2026
  • IBM Granite 4.0: Enterprise-grade multimodal models including 3B Vision variant
  • MiniMax-M2.7: Top OpenClaw compatibility, strong coding and agentic benchmarks
  • PrismML Bonsai 8B: 1-bit quantized model — extreme compression with competitive performance
  • Devstral 2: Mistral’s coding-focused open model
  • GLM-4.7: Z.ai’s predecessor to GLM-5.1, strong general performance

Inference Acceleration

Speculative Decoding

  • DFlash: Block Diffusion for Flash Speculative Decoding — lightweight block diffusion model for efficient parallel drafting. Supports Gemma 4, Qwen3.5/3.6, Kimi K2.5, MiniMax-M2.5 and more. Works with vLLM, SGLang, Transformers, MLX. 3.1K+ stars (MIT)
  • EAGLE-3: Extrapolation Algorithm for Greater Language-model Efficiency — feature-level speculative decoding with provable output quality preservation. EAGLE-3 (NeurIPS’25) is the latest. 2.3K+ stars
  • Medusa: Simple framework for accelerating LLM generation with multiple decoding heads — adds trainable draft heads to any LLM without a separate draft model. 2.7K+ stars
  • Gemma 4 MTP Drafters: Google’s Multi-Token Prediction drafters for Gemma 4 family — up to 3x speedup via speculative decoding, zero quality degradation. Apache 2.0, supports Transformers, MLX, vLLM, SGLang, Ollama, LiteRT-LM
  • SpecForge: Flexible open-source training framework for speculative decoding draft models — supports Medusa, EAGLE, and other drafting strategies
  • Awesome Multi-Token Prediction: Curated list of MTP papers, tools, and resources for LLMs and Speech-Language Models
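
The tools in this category share one draft-then-verify loop: a cheap drafter proposes several tokens and the large target model checks them. The sketch below is a toy illustration of that loop under greedy decoding, not the API of DFlash, EAGLE, Medusa, or vLLM. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are plain Python functions standing in for real networks. Real systems verify the whole proposal in one batched forward pass and often use a sampling-based acceptance rule instead of exact matching.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to the next token.
# This is an assumption made for illustration; real models return logits.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify loop; output matches greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position. (Called one position at a
        #    time here; a real system scores all positions in a single pass.)
        for i, token in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != token:
                # Keep the agreeing prefix, then take the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every drafted token was accepted.
            out.extend(proposal)
    return out


def toy_target(ctx: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    fast = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12)
    slow = greedy_decode(toy_target, [0], max_new_tokens=12)
    assert fast == slow
    print(fast)
```

Because a mismatch is always replaced by the target's own token, the result is identical to plain greedy decoding with the target model. The speedup in practice comes from accepting several drafted tokens per target pass, which this toy does not measure.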

Attention & KV Cache Optimization

  • FlashAttention-3: Fast and memory-efficient exact attention — exploits async Tensor Cores and TMA on Hopper GPUs (H100). The de facto standard for attention acceleration
  • vLLM: High-throughput LLM serving with PagedAttention — built-in speculative decoding support (DFlash, MTP, EAGLE, Medusa), continuous batching, quantization

Infrastructure & DevTools

  • Genesis: Genesis is a physics platform designed for general purpose Robotics/Embodied AI/Physical AI applications. It is simultaneously multiple things:
  • nofx: Agentic Trading OS
  • AI AGENTS FOR TRADING
  • dexter
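  • BettaFish (微舆): An innovative multi-agent public opinion analysis system built from scratch. It helps users break out of information cocoons, reconstruct the true picture of public opinion, predict future trends, and support decision-making. Users simply state their analysis needs as if chatting, and the agents automatically analyze 30+ mainstream social media platforms in China and abroad along with millions of public comments.
  • TrendRadar: 🚀 A trending-topics assistant that deploys in as little as 30 seconds. Stop scrolling aimlessly and see only the news you actually care about.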
  • higress: AI Native API Gateway
  • LLMRouter
  • grep.app: The fastest code search engine on the planet, covering over 500k Git repos.
  • AIMX: Self-hosted, open-source email server designed for AI agents — purpose-built SMTP/IMAP infrastructure for agent-to-agent and agent-to-human communication (May 14, 2026)
  • Hypercubic Hopper: Agentic interface for mainframes and COBOL — AI agents that interact with legacy enterprise systems. 95 HN points (May 12, 2026)
  • Voker: Analytics platform for AI agents — YC S24 startup providing observability and metrics for production agent deployments. 59 HN points (May 12, 2026)
  • Models.dev: Open-source database of AI model specs, pricing, and capabilities — community-maintained reference for model comparison (MIT)
  • Pyrefly: Meta’s fast type checker and language server for Python, written in Rust (6.4K stars, MIT)
  • ModelRift: OpenSCAD code editor & AI 3D model builder — text-to-CAD tool for creating, customizing, and sharing parametric 3D models in browser
  • MiroFish: A Simple and Universal Swarm Intelligence Engine, Predicting Anything

Tutorials & Guides

Leaderboard

Changelog

2026-05-22

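  • 🆕 New: Anthropic Project Glasswing - a cybersecurity initiative with AWS, Apple, Google, Microsoft, NVIDIA, and others, which found that the Claude Mythos Preview frontier model can discover high-severity vulnerabilities in all major operating systems and browsers; a cross-industry defensive security collaboration (194 HN points)
  • 🆕 New: CodeGraph - pre-indexed code knowledge graph for AI coding agents; supports Claude Code/Codex/Cursor/OpenCode/Hermes Agent and reduces tokens and tool calls (16.4K stars, MIT)
  • 🆕 New: Kanbots - open-source Kanban desktop app that runs parallel AI coding agents (Claude Code/Codex) on every card; multi-agent task management (116 HN points)
  • 🆕 New: Supertonic - ultra-fast local multilingual TTS running natively via ONNX; supports Swift/Python/Rust/C++/Java/Go/Node.js/Flutter/Web (9.4K stars, MIT)
  • 🆕 New: Superpowers - agentic skills framework and software development methodology (202.8K stars, MIT)
  • 🆕 New: OpenHuman - personal AI superintelligence, implemented in Rust, privacy-first (25.7K stars)
  • 🆕 New: RuView - real-time spatial intelligence from WiFi signals, with vital sign monitoring and presence detection (63.9K stars, MIT)
  • 🆕 New: AgentMemory - persistent memory for AI coding agents, ranked #1 on real-world benchmarks (16.4K stars)
  • 🆕 New: CloakBrowser - stealth Chromium that passes all bot detection tests; a Playwright replacement (18.7K stars, MIT)
  • 🆕 New: ViMax - agentic video generation; director/screenwriter/producer/generator all in one (6.7K stars, MIT)
  • 🆕 New: Academic Research Skills - academic research skills for Claude Code: research → write → review → revise → finalize pipeline (19K stars)
  • 🆕 New: Models.dev - open-source database of AI model specs, pricing, and capabilities (MIT)
  • 🆕 New: Pyrefly - Meta’s fast Python type checker and language server, implemented in Rust (6.4K stars, MIT)
  • 🆕 New: ModelRift - OpenSCAD code editor and AI 3D model builder; Antigravity 2.0 ranks #1 on the OpenSCAD architecture 3D LLM benchmark (321 HN points)
  • 🆕 New: Easy Vibe - 2026 vibe coding programming course for beginners (14K stars)
  • 📈 Updated: Anthropic Glasswing disclosed the Mythos Preview model’s frontier capabilities in cybersecurity; it has already found thousands of high-severity vulnerabilities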

2026-05-15

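  • 🆕 Added: Statewright - visual state-machine guardrails for AI agents; controls which tools are available at each stage; compatible with Claude Code/Codex/Cursor/OpenCode/Pi; raised a local model's SWE-bench result from 2/10 to 10/10 (122 HN points)
  • 🆕 Added: cplt - kernel-level sandbox wrapper for AI coding agents; supports macOS Seatbelt + Linux Landlock/seccomp-BPF (MIT, open-sourced by NAV, the Norwegian Labour and Welfare Administration)
  • 🆕 Added: agent-sandbox - Docker container isolation for AI coding agents; no root privileges, no Docker socket access, no privilege-escalation capabilities
  • 🆕 Added: Code Bench - local-first desktop AI coding agent; BYO model (MIT)
  • 🆕 Added: Gemini Omni - Google's upcoming unified video generation model, leaked ahead of Google I/O 2026; markedly improves in-video text rendering
  • 🆕 Added: Mozaik - reactive TypeScript AI agent framework; LLM context as a first-class citizen; aligned with the OpenResponses spec
  • 🆕 Added: Rotunda - browser built for AI agents; simulated input + anti-CAPTCHA; Playwright-compatible
  • 🆕 Added: Recursant - service-mesh control plane for AI agents; production-grade agent governance
  • 🆕 Added: Lowdefy v5.3 - define an AI agent in 30 lines of YAML; low-code agent orchestration platform
  • 🆕 Added: AIMX - self-hosted, open-source email server built for AI agents
  • 🆕 Added: Hypercubic Hopper - agentic interface for mainframes and COBOL; AI agents interact with legacy enterprise systems (95 HN points)
  • 🆕 Added: Voker - analytics platform for AI agents; YC S24; observability for production agent deployments (59 HN points)
  • 🆕 Added: Arena AI Model ELO History - tool for tracking the historical ELO ratings of AI models
  • 📈 Updated: Anthropic moved the Claude Code SDK and claude -p out of its subscription plans and opened them for standalone use (May 14)
  • 📈 Updated: OpenAI released GPT-5.5-Cyber - cybersecurity-specific model; EU trusted access program (May 8)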

2026-05-07

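  • 🆕 Added: Inference Acceleration section - roundup of inference acceleration techniques, covering Speculative Decoding (DFlash, EAGLE-3, Medusa, Gemma 4 MTP Drafters, SpecForge) and Attention/KV Cache optimization (FlashAttention-3, vLLM)

  • 🆕 Added: Warp Terminal - AI terminal, now open source (MIT/AGPLv3); sponsored by OpenAI; 41K+ GitHub stars; agent-native development environment (Rust)

  • 🆕 Added: Atlassian Rovo Dev CLI - enterprise AI coding agent; integrates with Jira/Confluence/Bitbucket; Teamwork Graph + MCP; supports installing skills for multiple agents

  • 🆕 Added: Amazon Nova 2 Sonic - AWS unified speech-to-speech foundation model; bidirectional streaming; ASR + reasoning + TTS in a single model

  • 🆕 Added: MOSS-TTS-Nano - 0.1B-parameter open-source TTS; CPU-only; 48 kHz stereo; 20 languages (Fudan/MOSI.AI, Apache 2.0)

  • 🆕 Added: MOSS-TTS Family - open-source speech model family (8B/1.7B/Realtime) covering long-form speech, dialogue, characters, and sound effects; 377 ms end-to-end latency

  • 🆕 Added: LongCat-AudioDiT - Meituan's open-source 3.5B non-autoregressive TTS; generates directly in waveform latent space, eliminating error accumulation

  • 🆕 Added: MAGIC-TTS - fine-grained controllable TTS (arXiv); explicit duration and pause control; flow-matching

  • 🆕 Added: Seedream 5.0 - latest image model from ByteDance's Jimeng (即梦); multiple versions (5.0/4.5/4.0); multi-image editing + search integration

  • 🆕 Added: GitKraken Desktop 12.0 - Agent Mode interface for managing multiple agent sessions

  • 📈 Updated: Microsoft Agent 365 → generally available (May 1); enterprise governance (Observe/Govern/Secure) + Defender + Purview + Entra integration

  • 📈 Updated: Atlassian Rovo → credit usage up 20%+ month over month; ARR of Rovo customers grows 2x as fast as that of non-Rovo customers; Service Collection at $1B ARR

  • 📈 Updated: GLM-5 → SWE-bench Verified 77.8% (highest among open-source models); Chatbot Arena Elo 1451 (highest among open-source models)

  • 📈 Updated: Kimi K2.6 → scores 87 on the Rails build test; the only open-source Tier A coding model; BenchLM 89.3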

2026-04-26

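  • 🆕 Added: GPT-5.5 - OpenAI's latest flagship model; major gains in agentic coding, computer use, and knowledge work; API now available
  • 🆕 Added: Claude Opus 4.7 - Anthropic's latest Opus model; SWE-bench 87.6%; released alongside Claude Design (a design collaboration tool)
  • 🆕 Added: Claude Design - new Anthropic Labs product; collaborate with AI to create designs, prototypes, and presentations
  • 🆕 Added: Gemini Enterprise Agent Platform - announced at Google Cloud Next 2026; unified enterprise AI agent platform that replaces Vertex AI
  • 🆕 Added: Grok Speech API - xAI's standalone STT/TTS API; enterprise-grade voice development; 5.0% STT error rate (lowest in the industry)
  • 🆕 Added: Tripo H3.1 - high-fidelity image-to-3D model for production-grade game and product visualization
  • 🆕 Added: Suno v5.5 - personal voice cloning + upgraded Studio DAW features; licensing partnership with Warner Music
  • 🆕 Added: Midjourney v7 - benchmark for aesthetic intelligence; --cref character consistency + --sref style locking
  • 🆕 Added: Qwen3.5-397B-A17B / Qwen3.5-9B / Qwen3.6-Plus - Alibaba's new round of open-source model releases
  • 🆕 Added: MiniMax-M2.7 - top-tier OpenClaw-compatible model
  • 🆕 Added: Deezer AI music detection - 44% of new uploads are AI-generated; Deezer is opening its detection technology to the industry
  • 🆕 Added: PixVerse R1 update - shared worlds + personalized avatars; real-time interaction at 1080P
  • 📈 Updated: GPT-Image-2 → officially released (4/21); 2K resolution; reasoning-driven generation; multilingual text rendering
  • 📈 Updated: ElevenLabs v3 conversational model + TTS Normalizer v3.1; Agent platform enhancements
  • 📈 Updated: Cursor → $1.2B ARR; overtakes Copilot as developers' preferred AI IDE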
  • 📈 Updated: Windsurf → Google acquired the founding team for $2.4B; Cognition acquired the remainder for $250M
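  • 📈 Updated: Seed3D 2.0 → officially released (4/23); SOTA geometry + PBR material quality
  • 📈 Updated: Google ADK → major upgrade; graph-based framework for multi-agent orchestration; processes 6T+ tokens per month
  • 📈 Updated: PixVerse → closed a Series C round at unicorn valuation; 100M+ users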

2026-04-26 (prior)

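  • 🆕 Updated: Cursor 3 - agent-first rebuild; Agents Window with parallel agents, Cloud Agents, Design Mode
  • 🆕 Updated: Midjourney V8 → V8.1 Alpha - HD mode 3x faster and cheaper; Prompt Shortener; image prompts return
  • 🆕 Updated: Runway Gen-3 → Gen-4.5 - world's highest-rated video model; unprecedented visual fidelity and creative control
  • 🆕 Updated: Kling O3 → Kling 3.0 - 4K@60fps video generation; ~$0.50 per 10s clip; best value for money
  • 🆕 Added: HeyGen Avatar V - most lifelike AI digital human; 15-second recording; translation into 175+ languages
  • 🆕 Added: JetBrains Junie / Air - IDE-native AI coding agent + multi-agent parallel development environment
  • 🆕 Added: Google Jules - free AI coding agent; 15 tasks/day; 1M context
  • 🆕 Added: OpenCode - open-source coding agent with no lock-in; 147K+ GitHub stars
  • 🆕 Added: Augment Code (Auggie CLI) - enterprise-grade code understanding; SWE-bench Pro 51.80%
  • 🆕 Added: Kilo Code - open-source multi-IDE coding agent (VS Code/JetBrains/CLI)
  • 🆕 Added: NVIDIA NemoClaw - secure agent stack for OpenClaw; one-click deployment of privacy-controllable AI agents
  • 🆕 Added: Voxtral TTS (Mistral) - open-source TTS; Mistral Small 4 (Apache 2.0, 6.5B)
  • 🆕 Added: NVIDIA Nemotron 3 Super, IBM Granite 4.0, Pollo AI, Genra AI
  • Added: DeepSeek-V4 (Pro 1.6T / Flash 284B) open-source LLM; MIT license; 1M context
  • Added: Qwen3.6-27B dense model, Gemma 4 (31B multimodal), Tencent Hy3-preview
  • Added: Seedance 2.0 (ByteDance), ranked first on the video generation Arena; Kling O3, Veo 3.1, PixVerse V6
  • Added: Gemini 3.1 Flash TTS (200+ audio tags, 70+ languages), MiMo-V2.5, Kani-TTS-2, OmniVoice, Chatterbox Multilingual
  • Added: Seed3D 2.0 (ByteDance), Tripo Studio, Meshy 6, GPT-Image-2, Recraft V4
  • Added: Google ADK, ml-intern (HuggingFace), Kimi Claw
  • Added: Open-source LLM section tracking the latest models, including DeepSeek-V4, GLM-5.1, and Llama 4 Behemoth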