
# AI Multimodal Timeline


Here we track the latest multimodal AI models, covering multimodal foundation models, LLMs, agents, audio, image, video, music, and 3D content. 🔥

## Table of Contents

- [Project List](#project-list)
  - [Multimodal Model](#multimodal-model)
  - [LLM](#llm)
  - [Agent](#agent)
  - [Audio](#audio)
  - [Image](#image)
  - [Video](#video)
  - [Music](#music)
  - [3D](#3d)

## Project List

### Multimodal Model

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Cambrian-1 | A Fully Open, Vision-Centric Exploration of Multimodal LLMs. | arXiv | Hugging Face |
| 2024-06 | MINT-1T | Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. | arXiv | |
| 2024-06 | OmniTokenizer | A Joint Image-Video Tokenizer for Visual Generation. | arXiv | Website |
| 2024-06 | ml-4m | A framework for training any-to-any multimodal foundation models. | arXiv | Website |
| 2024-06 | VideoLLaMA 2 | Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. | arXiv | Hugging Face |
| 2024-05 | ManyICL | Many-Shot In-Context Learning in Multimodal Foundation Models. | arXiv | |
| 2024-05 | Contrastive ALignment (CAL) | Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment. | arXiv | |
| 2024-05 | Groma | Grounded Multimodal Large Language Model with Localized Visual Tokenization. | arXiv | Hugging Face |
| 2024-05 | CogVLM2 | GPT4V-level open-source multi-modal model based on Llama3-8B. | | Hugging Face |
| 2024-05 | Chameleon | Mixed-Modal Early-Fusion Foundation Models. | arXiv | |
| 2024-05 | Lumina-T2X | Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |
| 2024-05 | MiniCPM-Llama3-V 2.5 | The latest model in the MiniCPM-V series, built on SigLip-400M and Llama3-8B-Instruct with 8B parameters in total. | | Hugging Face |
| 2024-05 | Gemini | Build with state-of-the-art generative models and tools to make AI helpful for everyone. | | API |
| 2024-05 | GPT-4o | GPT-4o ("o" for "omni") is a step towards much more natural human-computer interaction: it accepts any combination of text, audio, image, and video as input and generates any combination of text, audio, and image outputs (see the API sketch after this table). | | API |
| 2024-04 | MyGO | Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion. | arXiv | |
| 2024-04 | InternLM-XComposer2 | A vision-language large model (VLLM) excelling in free-form text-image composition and comprehension. | arXiv | Hugging Face |
| 2024-01 | MMVP | Exploring the Visual Shortcomings of Multimodal LLMs. | arXiv | |
| 2023-12 | V* | Guided Visual Search as a Core Mechanism in Multimodal LLMs. | arXiv | |
| 2023-12 | Tokenize Anything | Tokenize Anything via Prompting. | arXiv | Hugging Face |
| 2023-11 | ShareGPT4V | Improving Large Multi-Modal Models with Better Captions. | arXiv | Hugging Face |
| 2023-11 | Video-LLaVA | Learning United Visual Representation by Alignment Before Projection. | arXiv | Hugging Face |
| 2023-10 | LanguageBind | Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. | arXiv | Hugging Face |
| 2023-07 | Emu | Generative Multimodal Models from BAAI. | arXiv | Hugging Face |
| 2023-05 | ImageBind | One Embedding Space To Bind Them All. | arXiv | Website |
| 2022-11 | EVA | Visual Representation Fantasies from BAAI. | arXiv | Hugging Face |
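
The API-served entries above expose chat-style endpoints that take mixed text and image input. As a minimal sketch, here is a GPT-4o call with the official `openai` Python client (v1.x); the model name and message format follow OpenAI's documented Chat Completions API at the time of writing, and the image URL is a placeholder.

```python
# Minimal sketch: text + image input to GPT-4o via the OpenAI Chat Completions API.
# Assumes the `openai` package (v1.x) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    # Placeholder URL; any publicly reachable image works.
                    "image_url": {"url": "https://example.com/cat.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```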

[^ Back to Contents ^](#table-of-contents)

### LLM

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Claude 3.5 Sonnet | Anthropic's Claude 3.5 Sonnet model. | | API |
| 2024-06 | Nemotron-4 | Nemotron-4-340B-Instruct is a large language model (LLM) that can be used in a synthetic data generation pipeline to create training data for building LLMs. | arXiv | Hugging Face |
| 2024-04 | Llama 3 | Meta Llama 3 is the next generation of Meta's state-of-the-art open-source large language model (see the loading sketch after this table). | | Hugging Face |
| 2024-03 | Claude 3 | Talk with Claude, an AI assistant from Anthropic. | | API |
| 2024-03 | Grok-1 | The weights and architecture of xAI's 314-billion-parameter Mixture-of-Experts model, Grok-1. | | Hugging Face |
| 2023-09 | Baichuan 2 | A series of large language models developed by Baichuan Intelligent Technology. | | Hugging Face |
| 2023-07 | GPT-4 | GPT-4 is OpenAI's most advanced system, producing safer and more useful responses. | | API |
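
For the open-weight entries, Hugging Face `transformers` is the common loading path. Below is a minimal chat sketch for Llama 3 8B Instruct; the repo id and chat-template usage follow the model card (the repo is gated, so Meta's license must be accepted on the Hub first).

```python
# Minimal sketch: chat with Llama 3 8B Instruct via Hugging Face transformers.
# Assumes `transformers` and `torch` installed and the gated repo license accepted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize what a multimodal model is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```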

[^ Back to Contents ^](#table-of-contents)

### Agent

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Mixture of Agents (MoA) | Mixture-of-Agents Enhances Large Language Model Capabilities (see the sketch after this table). | arXiv | |
| 2024-06 | Buffer of Thoughts | Thought-Augmented Reasoning with Large Language Models. | arXiv | |
| 2024-06 | Translation Agent | Agentic translation using a reflection workflow. | | |
| 2024-06 | Atomic Agents | A framework designed to be modular, extensible, and easy to use. | | |
| 2024-05 | Pipecat | Open-source framework for voice and multimodal conversational AI. | | |
| 2024-02 | V-IRL | Grounding Virtual Intelligence in Real Life. | arXiv | |
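
The Mixture-of-Agents entry is easy to prototype: several "proposer" models answer a question independently, then an "aggregator" model synthesizes their answers. The sketch below is a simplification of the paper's multi-layer design down to one proposer layer; it uses the `openai` client against any OpenAI-compatible endpoint, and the model names are placeholders, not real identifiers.

```python
# Minimal sketch of the Mixture-of-Agents (MoA) idea: one proposer layer + an aggregator.
# Model names are placeholders; any OpenAI-compatible chat endpoint works.
from openai import OpenAI

client = OpenAI()

PROPOSERS = ["model-a", "model-b", "model-c"]  # hypothetical model names
AGGREGATOR = "model-d"                         # hypothetical model name

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def mixture_of_agents(question: str) -> str:
    # Layer 1: each proposer answers the question independently.
    proposals = [ask(m, question) for m in PROPOSERS]
    # Layer 2: the aggregator sees all proposals as auxiliary context and synthesizes.
    aggregate_prompt = (
        "You are given several candidate answers to a question. "
        "Synthesize them into one accurate, well-reasoned answer.\n\n"
        f"Question: {question}\n\n"
        + "\n\n".join(f"Candidate {i + 1}: {p}" for i, p in enumerate(proposals))
    )
    return ask(AGGREGATOR, aggregate_prompt)

print(mixture_of_agents("Why is the sky blue?"))
```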

[^ Back to Contents ^](#table-of-contents)

### Audio

#### Text-to-Speech

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | ChatTTS | A text-to-speech model designed specifically for dialogue scenarios such as LLM assistants. | | |
| 2023-06 | StyleTTS 2 | Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. | arXiv | Hugging Face |

#### Automatic Speech Recognition

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | TeleSpeech-ASR | A large speech model for multi-dialect ASR. | | Hugging Face |
| 2022-12 | Whisper | Whisper is a general-purpose speech recognition model (see the open-source usage sketch after this table). | arXiv | API |
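
Besides the hosted API, Whisper also ships as the open-source `openai-whisper` package. A minimal transcription sketch, assuming the package is installed and ffmpeg is on the PATH (the audio path is a placeholder):

```python
# Minimal sketch: transcribe an audio file with the open-source `openai-whisper` package.
# Assumes `pip install openai-whisper` and ffmpeg available on the PATH.
import whisper

model = whisper.load_model("base")      # sizes: tiny, base, small, medium, large
result = model.transcribe("audio.mp3")  # placeholder path
print(result["text"])
```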

#### Audio Generation

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | SEE-2-SOUND | Zero-Shot Spatial Environment-to-Spatial Sound. | arXiv | |
| 2024-05 | Make-An-Audio 3 | Transforming Text into Audio via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |

[^ Back to Contents ^](#table-of-contents)

### Image

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Depth Anything V2 | An improved monocular depth estimation foundation model. | arXiv | Hugging Face |
| 2024-06 | AutoStudio | Crafting Consistent Subjects in Multi-turn Interactive Image Generation. | arXiv | |
| 2024-06 | MimicBrush | Zero-shot Image Editing with Reference Imitation. | arXiv | Hugging Face |
| 2024-06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. | arXiv | Hugging Face |
| 2024-05 | Omost | A project that converts an LLM's coding capability into image generation (more precisely, image composing) capability. | | Hugging Face |
| 2024-05 | Hunyuan-DiT | A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding (see the diffusers sketch after this table). | arXiv | Hugging Face |
| 2023-10 | DALL·E 3 | DALL·E is an AI system that can create realistic images and art from a natural-language description. | | API |
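
Several image entries ship with `diffusers` integrations. As one example, a text-to-image sketch for Hunyuan-DiT: the `HunyuanDiTPipeline` class and the repo id are taken from the diffusers integration announced around the model's release, so verify both against the current model card before use.

```python
# Minimal sketch: text-to-image with Hunyuan-DiT through diffusers.
# Pipeline class and repo id follow the diffusers integration; verify on the model card.
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
).to("cuda")

image = pipe(prompt="An astronaut riding a horse, ink painting style").images[0]
image.save("hunyuan_dit_sample.png")
```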

[^ Back to Contents ^](#table-of-contents)

### Video

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | Video-MME | The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. | | |
| 2024-05 | MotionLLM | Understanding Human Behaviors from Human Motions and Videos. | arXiv | |
| 2024-05 | Vidu | A Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models. | arXiv | |
| 2024-02 | Sora | An AI model that can create realistic and imaginative scenes from text instructions. | Technical Report | |
| 2023-11 | Pika | An idea-to-video platform that sets your creativity in motion. | | |
| 2023-03 | Runway | An applied AI research company shaping the next era of art, entertainment and human creativity. | | |

[^ Back to Contents ^](#table-of-contents)

### Music

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-04 | Udio | An AI music generator. | | Website |
| 2023-12 | Suno | Suno is building a future where anyone can make great music. | | Website |
| 2023-12 | Soundry AI | Generative AI tools including text-to-sound and infinite sample packs. | | Website |
| 2023-12 | Sonauto | An AI music editor that turns prompts, lyrics, or melodies into full songs in any style. | | Website |

[^ Back to Contents ^](#table-of-contents)

### 3D

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Unique3D | High-Quality and Efficient 3D Mesh Generation from a Single Image. | arXiv | Hugging Face |
| 2024-06 | DreamGaussian4D | Generative 4D Gaussian Splatting. | arXiv | Hugging Face |
| 2024-03 | GaussianCube | A Structured and Explicit Radiance Representation for 3D Generative Modeling. | arXiv | Hugging Face |
| 2024-03 | TripoSR | Fast 3D Object Reconstruction from a Single Image. | arXiv | Hugging Face |

[^ Back to Contents ^](#table-of-contents)