AI Resources

Published on 2024-11-26

Last updated on 2026-04-07

New Tools

Some new tools worth considering; I will try them later.

Editors
- cline: Autonomous coding agent right in your IDE, capable of creating/editing files, executing commands, using the browser, and more with your permission every step of the way.
- Roo-Cline: A fork of Cline, an autonomous coding agent, with some additional experimental features. It’s been mainly writing itself recently, with a light touch of human guidance here and there.
- Roo Code: (Formerly Roo-Cline) An autonomous coding agent and significantly evolved fork of Cline. Now features “Custom Modes” and operates as a full AI Coding OS.
UI
- Pax dev and code pax: Pax is a revolutionary new canvas for building apps & websites with AI.
- makepad: A new way to build UIs in Rust for both native and the web.
- superdesign: extract webpage info and generate UI designs
- ui.sh
- variant
Video
- SynTalker: Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation
- EchoMimicV3: 1.3B Parameters for Unified Multi-Modal and Multi-Task Human Animation (AAAI 2026)
- JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation
- VideoCaptioner: An intelligent video subtitle processing assistant based on Large Language Models (LLM), supporting subtitle generation, optimization, translation and more
- VideoLingo: VideoLingo is an all-in-one video translation, localization, and dubbing tool aimed at generating Netflix-quality subtitles. It eliminates stiff machine translations and multi-line subtitles while adding high-quality dubbing, enabling global knowledge sharing across language barriers.
- MyTimeMachine: Personalized Facial Age Transformation
- MEMO: MEMO is a state-of-the-art open-weight model for audio-driven talking video generation.
- StableAnimator: High-Quality Identity-Preserving Human Image Animation (CVPR 2025)
- INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations. Given the dual-track audio of a dyadic conversation and a single portrait image of an arbitrary agent, the framework dynamically synthesizes verbal, non-verbal, and interactive agent videos with lifelike facial expressions and rhythmic head pose movements.
- LatentSync: Taming Stable Diffusion for Lip Sync! State-of-the-art lip-sync technology from ByteDance
- KDTalker: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait (IJCV 2025)
- FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis (ACM MM 2025)
- FramePack: Revolutionary next-frame prediction model that uses a 13B model to generate 1-minute videos (60 seconds at 30 fps, 1800 frames); the minimum GPU memory required is 6 GB
- Wan2.1: A comprehensive and open suite of video foundation models that pushes the boundaries of video generation.
- SmolVLM real-time camera demo: A simple demo of how to use the llama.cpp server with SmolVLM 500M to get real-time object detection
- Pusa: A new video diffusion model that matches SOTA with 200x less training cost and 2500x less data. It outperforms Wan-I2V on VBench-I2V, runs 5x faster, and supports I2V, T2V, start-end frames, video extension, and video completion.
- HunyuanVideo: Tencent’s 13B+ parameter video generation model with systematic framework
- CogVideoX: Tsinghua’s expert transformer-based text-to-video diffusion model
- Luma AI Dream Machine: Text-to-video and image-to-video generator powered by Ray2 model, featuring 4K resolution output and realistic physics
- Sora 2: OpenAI’s latest video generation model with enhanced coherence and length capabilities
- Runway Gen-3: Professional video generation with advanced motion controls and editing suite
Audio
- UVR:
  - UVR5-UI: UI for UVR’s state-of-the-art source separation models, which remove vocals from audio files
- TTS
  - OpenAudio (formerly Fish Audio): supports both TTS and ASR; #1 on TTS-Arena2 with 0.008 WER
  - fish-speech-gui
  - F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. Very active, with streaming support
  - OuteTTS: An experimental text-to-speech model that uses a pure language modeling approach to generate speech, without architectural changes to the foundation model itself.
  - ElevenLabs Flash: generates speech in 75 ms plus application and network latency; build directly via the API using model IDs “eleven_flash_v2” and “eleven_flash_v2_5”
  - Orpheus TTS: Natural human speech synthesis with a Llama-3b backbone, <200ms latency
    - Multilingual support: including Chinese, French, German, Spanish, Italian, Korean, and Hindi.
    - Emotional expression: adjusts the emotional tone of the speech to match the text, such as joy or sadness.
    - Non-speech sounds: simulates natural human sounds such as laughter and sighs.
    - Conversational features: simulates the pauses, repetitions, and self-corrections of spoken language, bringing the speech closer to a real person.
  - dia: Dia directly generates highly realistic dialogue from a transcript. Recently released Dia2 with 1.6B params
  - VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model (53ms first token latency)
  - Muyan-TTS: A trainable TTS model designed for podcast applications within a $50,000 budget, pre-trained on over 100,000 hours of podcast audio data
  - Higgs Audio V2: Redefining Expressiveness in Audio Generation with 75.7% win rate over GPT-4o-mini
  - SoulX-Podcast: Very recent podcast generation with Chinese dialects support
  - sonic-3
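    - Sounds as natural as a real person; you can even hear excitement or sadness
    - Real-time voice responses with almost no latency
    - Multilingual, supporting 42 languages
    - Voice cloning from just 10 seconds of audio
    - Intelligent context understanding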
  - MiniMax Speech 2.6
    - ⚡Ultra-Fast: <250 ms latency for real-time conversational interactions
    - 💬Smart Text Normalization for URLs, emails, dates, numbers & more
    - 🎙️Full Voice Clone + Fluent LoRA: natural, expressive, effortless
    - 🌍40+ languages with inline code switching
  - Spark-TTS: Recent open-source TTS model
  - FireRedTTS: Competitive open-source option
  - MoonCast: Multi-speaker dialogue generation
  - CosyVoice2: Advanced voice cloning
- ASR
  - Open ASR Leaderboard
  - FunASR: A speech recognition framework developed by the Speech Lab of DAMO Academy, which integrates industrial-level models for speech endpoint detection, speech recognition, punctuation segmentation, and more. It has attracted many developers to try it out and contribute.
  - nvidia/canary-1b: Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC).
    - Canary is an encoder-decoder model with FastConformer [1] encoder and Transformer Decoder
    - With audio features extracted from the encoder, task tokens such as <source language>, <target language>, <task> and <toggle PnC> are fed into the Transformer Decoder to trigger the text generation process.
    - Canary uses a concatenated tokenizer from individual SentencePiece tokenizers of each language, which makes it easy to scale up to more languages.
    - The Canary-1B model has 24 encoder layers and 24 decoder layers.
  - Parakeet TDT 1.1B (en): An ASR model that transcribes speech in the lowercase English alphabet. This model is jointly developed by the NVIDIA NeMo and Suno.ai teams. It is an XXL version of the FastConformer [1] TDT [2] model (around 1.1B parameters).
    - This model uses a FastConformer-TDT architecture.
    - FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling;
    - TDT (Token-and-Duration Transducer) [2] generalizes conventional Transducers by decoupling token and duration predictions. A TDT model can skip the majority of blank predictions by using the duration output (up to 4 frames for this parakeet-tdt-1.1b model), which brings a significant inference speed-up.
  - CrisperWhisper: An advanced variant of OpenAI’s Whisper, designed for fast, precise, and verbatim speech recognition with accurate (crisp) word-level timestamps. Unlike the original Whisper, which tends to omit disfluencies and follows more of an intended transcription style, CrisperWhisper aims to transcribe every spoken word exactly as it is, including fillers, pauses, stutters, and false starts. Supports English and German.
  - Deepgram
  - FishAudio ASR
  - Whisper-WebUI: A Gradio-based browser interface for Whisper. You can use it as an Easy Subtitle Generator!
  - ten-vad: TEN VAD is a real-time voice activity detection system designed for enterprise use, providing accurate frame-level speech activity detection. It shows superior precision compared to both WebRTC VAD and Silero VAD, which are commonly used in the industry. TEN VAD also offers lower computational complexity and reduced memory usage compared to Silero VAD. Its temporal efficiency enables rapid voice activity detection, significantly reducing end-to-end response and turn detection latency in conversational AI systems.
  - Omnilingual ASR: Supports 1600+ languages with zero-shot learning
  - Scribe v2 Realtime Speech to Text - 150ms Latency API: very promising
  - WhisperX: Improved Whisper with diarization
  - SenseVoice: FunASR’s multilingual model
  - VibeVoice-ASR
  - glm-asr
  - Qwen3 ASR: Pure Rust implementation of Qwen3-ASR automatic speech recognition
- Music
  - qa-mdt (active): (OpenMusic) Awesome open-source text-to-music (TTM) generation: QA-MDT (accepted at IJCAI-25)
  - DiffRhythm: The world’s first end-to-end music model based on a diffusion model
  - suno: Leading commercial music AI platform - Suno v5 and Studio released (Sept 2025)
  - Lyria: Google’s music model (powering YouTube)
  - Beatoven.ai: Royalty-free music generation
  - LoudMe: Text-to-song generator
  - Udio: Professional music generation platform
  - List of music genres: https://rateyourmusic.com/genres/
- Short dramas
  - huobao-drama
- Other
  - MMAudio: MMAudio generates synchronized audio given video and/or text inputs
  - omniaudio-2.6b and demo: World’s Fastest Audio Language Model for Edge Deployment
  - openai-webrtc-go
  - MicDrop: Transform your voice into any voice, instantly.
  - SurfSense: While tools like NotebookLM and Perplexity are impressive and highly effective for conducting research on any topic/query, SurfSense elevates this capability by integrating with your personal knowledge base. It is a highly customizable AI research agent, connected to external sources such as search engines (Tavily, LinkUp), Slack, Linear, Notion, YouTube, GitHub and more to come.
  - Spatial Speech Translation: Translating Across Space With Binaural Hearables
    - We first enable speech translation under multi-speaker and interference conditions.
    - Our simultaneous and expressive speech translation model can run in real-time on Apple silicon.
    - The first binaural rendering of speech translation that preserves spatial cues from the input in the translated output.
  - http://listenhub.ai/
  - Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
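The Parakeet TDT entry in the ASR list above says a TDT model predicts a token and a duration separately and uses the duration to jump over frames (up to 4 for that model). A minimal sketch of that greedy decoding loop follows. It is an illustration under stated assumptions, not NVIDIA NeMo's implementation: `tdt_greedy_decode`, `toy_joint`, and `TOY_PREDICTIONS` are hypothetical names, and the lookup table stands in for the real neural joint network.

```python
from typing import Callable, List, Optional, Tuple

BLANK = None       # stand-in for the blank symbol (assumption for this sketch)
MAX_DURATION = 4   # the entry above mentions skips of up to 4 frames

Prediction = Tuple[Optional[str], int]  # (token or BLANK, frames to advance)


def tdt_greedy_decode(
    num_frames: int,
    joint: Callable[[int, List[str]], Prediction],
    max_symbols_per_frame: int = 8,
) -> Tuple[List[str], int]:
    """Greedy decoding that advances time by the predicted duration.

    Returns the emitted tokens and the number of joint evaluations.
    """
    tokens: List[str] = []
    joint_calls = 0
    emitted_at_frame = 0
    t = 0
    while t < num_frames:
        token, duration = joint(t, tokens)
        joint_calls += 1
        duration = max(0, min(duration, MAX_DURATION))
        if token is not BLANK:
            tokens.append(token)
            emitted_at_frame += 1
        # A blank must move time forward, and a frame may not emit forever.
        if duration == 0 and (token is BLANK or emitted_at_frame >= max_symbols_per_frame):
            duration = 1
        if duration > 0:
            t += duration
            emitted_at_frame = 0
    return tokens, joint_calls


# Hypothetical predictions keyed by frame index; a real model computes these.
TOY_PREDICTIONS = {0: ("h", 1), 1: ("i", 4), 5: (BLANK, 4), 9: ("!", 1)}


def toy_joint(frame: int, history: List[str]) -> Prediction:
    """Toy joint: ignores history and falls back to (BLANK, 1)."""
    return TOY_PREDICTIONS.get(frame, (BLANK, 1))


tokens, joint_calls = tdt_greedy_decode(10, toy_joint)
print(tokens, joint_calls)  # ['h', 'i', '!'] 4
```

In this toy run the decoder covers 10 frames with 4 joint evaluations, whereas a loop that advances one frame per blank would need at least 10. That gap is the source of the speed-up the entry describes.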
Image
- Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer - 20x smaller and 100x faster than FLUX-12B (ICLR-2025 Oral)
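- HivisionIDPhotos: Aims to develop a practical, systematic algorithm for intelligent ID photo production. It uses a complete AI model workflow to recognize a variety of user photo scenarios, perform matting, and generate ID photos.
- LiYing: An automated photo processing program for the typical photo studio post-processing workflow for ID photos.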
- BRIA-RMBG: High-Accuracy, Legal, and Inclusive Background Removal
- RMBG-2-Studio:
  - Background Removal: Powered by BRIA-RMBG-2.0
  - Image Composition: Place and adjust processed images onto new backgrounds
- IC-Custom: IC-Custom is designed for diverse image customization scenarios, including:
  - Position-aware: Input a reference image, target background, and specify the customization location (via segmentation or drawing)
    
    Examples: Product placement, virtual try-on
  - Position-free: Input a reference image and a target description to generate a new image with the reference image’s ID
    
    Examples: IP customization and creation
- InvSR: Arbitrary-steps Image Super-resolution via Diffusion Inversion (CVPR 2025)
- upscayl: Upscayl lets you enlarge and enhance low-resolution images using advanced AI algorithms. Enlarge images without losing quality. It’s almost like magic! 🎩🪄
- HiDream-I1: A new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds.
- BAGEL: Unified Model for Multimodal Understanding and Generation. Outperforms Qwen2.5-VL and InternVL-2.5
- FLUX: Leading open-weight image generation model from Stable Diffusion creators
- Recraft V3: #1 on Artificial Analysis rankings for 5 consecutive months
- Midjourney V7: Complete architectural overhaul with personalized realism
- Remove.bg: Industry standard for automatic background removal
- Z-Image-Turbo image generator
  - Z-Image-Turbo: A powerful and highly efficient image generation model with 6B parameters. Currently there are three variants
3D generation
- StableGen: Transform your 3D texturing workflow with the power of generative AI, directly within Blender!
- Meshy AI: Text-to-3D model generation with high-quality 3D assets
- Hunyuan3D 2.1: Tencent’s 3D generation model for detailed textured assets
- Hunyuan3D-Omni: Unified framework for controllable generation of 3D assets
- hyper3d: 3D model generation (3D printing)
AI Search Engine
- Perplexity AI: Market leader in AI search
- MindSearch: Mimicking Human Minds Elicits Deep AI Searcher - Deployed on Puyu platform
- khoj: A personal AI app to extend your capabilities. It smoothly scales up from an on-device personal AI to a cloud-scale enterprise AI.
- Scira (formerly mplx.run): A minimalistic AI-powered search engine that helps you find information on the internet
- Gemini Search: A Perplexity-style search engine powered by Google’s Gemini 2.0 Flash model with grounding through Google Search
- exa: AI-native search engine for developers and research
- Agentic Company Researcher 🔍: A multi-agent tool that generates comprehensive company research reports
- MAESTRO: Your Self-Hosted AI Research Assistant
- Open Deep Research: Deep research has broken out as one of the most popular agent applications. This is a simple, configurable, fully open source deep research agent that works across many model providers, search tools, and MCP servers. Its performance is on par with many popular deep research agents (see Deep Research Bench leaderboard).
- You.com: AI-powered search with personalization
- Kagi: Premium AI search engine
- Brave Search: Privacy-focused AI search
- 秘塔AI搜索: Chinese AI search engine
- 纳米搜索: Chinese AI search engine with podcast/video generation
- MiroThinker/code: MiroMind’s flagship research agent model. It is an open-source search model designed to advance tool-augmented reasoning and information-seeking capabilities, enabling complex real-world research workflows across diverse challenges
Document
- pdf2md: Self-hostable API server and pipeline for converting PDFs to Markdown using thrifty large language vision models like GPT-4o-mini and gemini-flash-1.5.
- PDFMathTranslate: PDF scientific paper translation and bilingual comparison.
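- gamma: Beautiful presentations, documents, and websites. No design or coding skills required.
- refly: A revolutionary AI-native content creation engine! ⚡️Refly is an AI-native creation tool built on the “free-form canvas 👨‍🎨👩‍🎨” concept, giving users a one-stop solution from the first spark of an idea to finished content 🌈: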
- markitdown: MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc).
- CO-storm: Get a Wikipedia-like report on your topic with AI, STORM is a research prototype that supports interactive knowledge curation.
- ReaderLM: converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved longer context handling
- OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
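- AiryLark: An open-source document processing tool that supports input and processing of multiple file formats. Whether PDF documents, Word files, or plain text, AiryLark handles them efficiently.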
- AI Presentation Generator
OCR:
Physics
- Genesis: Genesis is a physics platform designed for general purpose Robotics/Embodied AI/Physical AI applications. It is simultaneously multiple things:
Multi-modal
- VITA: Towards GPT-4o Level Real-Time Vision and Speech Interaction
- Dolphin: A novel multimodal document image parsing model that follows an analyze-then-parse paradigm. It addresses the challenges of complex document understanding through a two-stage approach designed to handle intertwined elements such as text paragraphs, figures, formulas, and tables.
- Qwen2.5-Omni: Alibaba’s multimodal model with TTS capabilities
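- GELab-Zero-4B-preview: A GUI agent model focused on Android, optimized for interface interactions (tap, type, swipe, wait, etc.), that can carry out multi-step, long-horizon tasks across multiple apps (such as food, transportation, shopping, and social apps).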
AI Agents & Automation
- ChainForge: An open-source visual programming environment for battle-testing prompts to LLMs.
- anychat: A unified chat interface for multiple AI models powered by Gradio. This application provides access to various leading AI models through a simple tab-based interface.
- aisuite: Simple, unified interface to multiple Generative AI providers.
- Open Canvas: Open Canvas is an open source web application for collaborating with agents to better write documents. It is inspired by OpenAI’s “Canvas”, but with a few key differences.
- Reppl: Using APPL to reimplement popular algorithms for Large Language Models (LLMs) and prompts; experimental AI-assisted re-implementation of prompt optimization algorithms.
- Resume Matcher: A free, open-source, AI-based tool for tailoring your resume to a job description. Find the matching keywords, improve readability, and gain deep insights into your resume.
- InsightExpress: A Next.js application that generates AI-powered research reports based on user-provided topics and emails them to users. It leverages Langflow for its AI capabilities and features a modern, responsive UI.
- Claude Code Workflow Studio: Design complex AI agent workflows by conversing with AI – or use intuitive drag-and-drop. Build Sub-Agent orchestrations and conditional branching with natural language, then export directly to .claude format.
- vm0: Natural language Agent, 24/7 in cloud sandbox
- agenticSeek: Fully Local Manus AI
Coworker
- webserver/client
  - Open Claude Cowork: An open-source desktop chat application powered by Claude Agent SDK and Composio Tool Router. Build AI agents with access to 500+ tools and persistent chat sessions.
- Desktop
  - openwork: Open Source AI Desktop Agent (coworker)
  - Claude-Cowork: Agent Cowork is an open-source alternative to Claude Cowork: a desktop AI assistant that helps with programming, file management, and any task you can describe (desktop)
  - eigent: The Open Source Cowork Desktop to Unlock Your Exceptional Productivity
  - AionUi: Cowork with Your AI, Gemini CLI, Claude Code, Codex, Qwen Code, Goose CLI, Auggie, and more
  - Deepseek-Cowork:
openclaw and derivatives
- KrillClaw: The world’s smallest AI agent runtime. 49KB. Written in Zig. Zero dependencies.
- nullclaw: Fastest, smallest, and fully autonomous AI assistant infrastructure written in Zig
- openfang: Open-source Agent OS built in Rust. 137K LOC. 14 crates. 1,767+ tests. Zero clippy warnings.
- nanoclaw
- ironclaw
- nanobot
- hermes-agent
- OpenHarness
AI Trading & Finance
- nofx: Agentic Trading OS
- AI AGENTS FOR TRADING
- dexter
Social Listening & Analytics
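- BettaFish (微舆): An innovative multi-agent public opinion analysis system built from scratch. It helps users break out of information cocoons, restore the true picture of public opinion, predict future trends, and support decision-making. Users simply state their analysis needs as in a chat, and the agents automatically analyze 30+ mainstream social media platforms at home and abroad along with millions of public comments.
- TrendRadar: 🚀 A trending-topics assistant that deploys in as little as 30 seconds. Say goodbye to mindless scrolling and see only the news you actually care about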
AI Infrastructure
- higress: AI Native API Gateway
- LLMRouter
Code Search & Discovery
- grep.app: The fastest code search engine on the planet (over 500k Git repos).
Tutorials & Guides
- Prompt Engineering Guide: Motivated by the high interest in developing with LLMs, we have created this new prompt engineering guide that contains all the latest papers, learning guides, lectures, references, and tools related to prompt engineering for LLMs.
- Build and design an AI image generator app for iOS without ANY design or coding experience
- Awesome Claude Prompts
- cursorhub: Master Cursor from scratch and build websites with AI alongside me
- Gemini for Google Workspace prompting guide 101
- Context Engineering
- A comprehensive guide to Claude Code best practices, community tips, and tools
- Frad’s .claude: A comprehensive development environment with specialized AI agents for code review, security analysis, and technical leadership.
- lovable Prompting 1.1
- Anthropic Prompt engineering
- prompt-optimizer
- Writing a good CLAUDE.md
- Vibe Coding Guide

Leaderboards

Open ASR Leaderboard
TTS Arena V2
Artificial Analysis: Model performance rankings
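
Several entries above quote a word error rate (WER), for example "0.008 WER", and ASR leaderboards commonly rank models by it. A minimal sketch of the usual definition follows: word-level edit distance divided by the number of reference words. The function name `word_error_rate` is my own, and the normalization here (lowercasing and whitespace splitting) is a simplifying assumption; real leaderboards typically apply heavier text normalization, so scores will not match exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

Under this definition, a WER of 0.008 corresponds to roughly 8 word errors per 1000 reference words.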
