AI Multimodal Timeline
Here we track the latest multimodal AI models, covering multimodal foundation models, LLMs, agents, audio, image, video, music, and 3D content. 🔥
Table of Contents
- Multimodal Model
- LLM
- Agent
- Audio
- Image
- Video
- Music
- 3D

Project List
Multimodal Model
| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Cambrian-1 | A Fully Open, Vision-Centric Exploration of Multimodal LLMs. | arXiv | Hugging Face |
| 2024-06 | MINT-1T | Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. | arXiv | |
| 2024-06 | OmniTokenizer | A Joint Image-Video Tokenizer for Visual Generation. | arXiv | Website |
| 2024-06 | ml-4m | A framework for training any-to-any multimodal foundation models. | arXiv | Website |
| 2024-06 | VideoLLaMA 2 | Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. | arXiv | Hugging Face |
| 2024-05 | ManyICL | Many-Shot In-Context Learning in Multimodal Foundation Models. | arXiv | |
| 2024-05 | Contrastive ALignment (CAL) | Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment. | arXiv | |
| 2024-05 | Groma | Grounded Multimodal Large Language Model with Localized Visual Tokenization. | arXiv | Hugging Face |
| 2024-05 | CogVLM2 | GPT4V-level open-source multi-modal model based on Llama3-8B. | | Hugging Face |
| 2024-05 | Chameleon | Mixed-Modal Early-Fusion Foundation Models. | arXiv | |
| 2024-05 | Lumina-T2X | Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |
| 2024-05 | MiniCPM-Llama3-V 2.5 | The latest model in the MiniCPM-V series, built on SigLip-400M and Llama3-8B-Instruct with 8B parameters in total. | | Hugging Face |
| 2024-05 | Gemini | Build with state-of-the-art generative models and tools to make AI helpful for everyone. | | API |
| 2024-05 | GPT-4o | GPT-4o ("o" for "omni") is a step towards much more natural human-computer interaction: it accepts any combination of text, audio, image, and video as input and generates any combination of text, audio, and image outputs. | | API |
| 2024-04 | MyGO | Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion. | arXiv | |
| 2024-04 | InternLM-XComposer2 | A vision-language large model (VLLM) excelling in free-form text-image composition and comprehension. | arXiv | Hugging Face |
| 2024-01 | MMVP | Exploring the Visual Shortcomings of Multimodal LLMs. | arXiv | |
| 2023-12 | V* | Guided Visual Search as a Core Mechanism in Multimodal LLMs. | arXiv | |
| 2023-12 | Tokenize Anything | Tokenize Anything via Prompting. | arXiv | Hugging Face |
| 2023-11 | ShareGPT4V | Improving Large Multi-Modal Models with Better Captions. | arXiv | Hugging Face |
| 2023-11 | Video-LLaVA | Learning United Visual Representation by Alignment Before Projection. | arXiv | Hugging Face |
| 2023-10 | LanguageBind | Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. | arXiv | Hugging Face |
| 2023-07 | Emu | Generative Multimodal Models from BAAI. | arXiv | Hugging Face |
| 2023-05 | ImageBind | One Embedding Space To Bind Them All. | arXiv | Website |
| 2022-11 | EVA | Visual Representation Fantasies from BAAI. | arXiv | Hugging Face |
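Several of the hosted models above (GPT-4o, Gemini) are reached through a chat-style API that accepts mixed text and image inputs. As an illustration, here is a minimal sketch of assembling one multimodal user message in the OpenAI chat-completions content-parts format; the prompt and image URL are placeholders, and the network call itself is left out so the sketch runs offline.

```python
def build_multimodal_message(prompt: str, image_url: str) -> dict:
    """Build a single user message mixing text and an image,
    following the OpenAI chat-completions content-parts format."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

# Placeholder inputs; with the `openai` SDK installed you would pass
# messages=[message] to client.chat.completions.create(model="gpt-4o", ...).
message = build_multimodal_message(
    "What is in this picture?", "https://example.com/cat.png"
)
print(message["role"])          # user
print(len(message["content"]))  # 2
```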
^ Back to Contents ^
LLM
| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Claude 3.5 Sonnet | Anthropic's Claude 3.5 Sonnet model. | | API |
| 2024-06 | Nemotron-4 | Nemotron-4-340B-Instruct is an LLM that can be used in a synthetic data generation pipeline to create training data for building other LLMs. | arXiv | Hugging Face |
| 2024-04 | Llama 3 | Meta Llama 3 is the next generation of Meta's state-of-the-art open-source large language model. | | Hugging Face |
| 2024-03 | Claude 3 | Talk with Claude, an AI assistant from Anthropic. | | API |
| 2024-03 | Grok-1 | The weights and architecture of xAI's 314-billion-parameter Mixture-of-Experts model, Grok-1. | | Hugging Face |
| 2023-09 | Baichuan 2 | A series of large language models developed by Baichuan Intelligent Technology. | | Hugging Face |
| 2023-07 | GPT-4 | GPT-4 is OpenAI's most advanced system, producing safer and more useful responses. | | API |
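Nemotron-4-340B-Instruct above is positioned as a generator for synthetic training data. A toy sketch of that idea follows, with the model call replaced by a stub so it runs offline; the prompt templates and the `fake_llm` stub are illustrative assumptions, not NVIDIA's actual pipeline.

```python
from typing import Callable

def generate_synthetic_pairs(
    topics: list[str],
    llm: Callable[[str], str],
) -> list[dict]:
    """For each topic, ask the (stubbed) instruct model for a question
    and an answer, and collect the results as training records."""
    records = []
    for topic in topics:
        question = llm(f"Write one exam question about {topic}.")
        answer = llm(f"Answer concisely: {question}")
        records.append({"topic": topic, "question": question, "answer": answer})
    return records

# Stub standing in for a real model call (e.g. Nemotron-4 behind an API).
def fake_llm(prompt: str) -> str:
    return f"[model output for: {prompt[:30]}...]"

data = generate_synthetic_pairs(["gravity", "recursion"], fake_llm)
print(len(data))  # 2
```

In a real pipeline the generated records would then be filtered (e.g. by a reward model) before being used as training data.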
^ Back to Contents ^
Agent
| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Mixture of Agents (MoA) | Mixture-of-Agents Enhances Large Language Model Capabilities. | arXiv | |
| 2024-06 | Buffer of Thoughts | Thought-Augmented Reasoning with Large Language Models. | arXiv | |
| 2024-06 | Translation Agent | Agentic translation using a reflection workflow. | | |
| 2024-06 | Atomic Agents | A framework designed to be modular, extensible, and easy to use. | | |
| 2024-05 | Pipecat | Open-source framework for voice and multimodal conversational AI. | | |
| 2024-02 | V-IRL | Grounding Virtual Intelligence in Real Life. | arXiv | |
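The Mixture-of-Agents entry above has several proposer models draft answers and an aggregator synthesize them. A toy sketch of that control flow with stub agents; the lambdas and the draft-joining format are illustrative assumptions, not the paper's models or prompts.

```python
def mixture_of_agents(prompt, proposers, aggregator):
    """Collect one draft per proposer, then let the aggregator
    synthesize the drafts into a final answer (all are callables)."""
    drafts = [agent(prompt) for agent in proposers]
    combined = prompt + "\n\nDraft answers:\n" + "\n".join(
        f"{i + 1}. {d}" for i, d in enumerate(drafts)
    )
    return aggregator(combined)

# Stub agents standing in for real LLM endpoints.
proposers = [lambda p: "Paris.", lambda p: "The capital of France is Paris."]
aggregator = lambda p: "Paris"  # a real aggregator would be another LLM call
print(mixture_of_agents("What is the capital of France?", proposers, aggregator))
# Paris
```

The paper stacks several such layers, feeding each layer's aggregated output to the next as context.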
^ Back to Contents ^
Audio
Audio/Text-to-Speech
| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | ChatTTS | A text-to-speech model designed specifically for dialogue scenarios such as LLM assistants. | | |
| 2023-06 | StyleTTS 2 | Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. | arXiv | Hugging Face |
Audio/Automatic Speech Recognition
Audio/Audio Generation
^ Back to Contents ^
Image
| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Depth Anything V2 | Monocular depth estimation with Depth Anything V2. | arXiv | Hugging Face |
| 2024-06 | AutoStudio | Crafting Consistent Subjects in Multi-turn Interactive Image Generation. | arXiv | |
| 2024-06 | MimicBrush | Zero-shot Image Editing with Reference Imitation. | arXiv | Hugging Face |
| 2024-06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. | arXiv | Hugging Face |
| 2024-05 | Omost | A project that converts an LLM's coding capability into image generation (more precisely, image composing) capability. | | Hugging Face |
| 2024-05 | Hunyuan-DiT | A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. | arXiv | Hugging Face |
| 2023-10 | DALL·E 3 | An AI system that can create realistic images and art from a natural-language description. | | API |
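DALL·E 3 above is reached through a hosted image-generation endpoint. As a hedged sketch, here is a helper that assembles request parameters in the shape the OpenAI images API expects (`client.images.generate(**params)` with the `openai` SDK); the prompt is a placeholder, the size whitelist reflects the documented DALL·E 3 options, and the network call itself is omitted so the example runs offline.

```python
def build_image_request(prompt: str, size: str = "1024x1024") -> dict:
    """Assemble keyword arguments for an image-generation call,
    validating the size against the options DALL-E 3 supports."""
    allowed = {"1024x1024", "1792x1024", "1024x1792"}
    if size not in allowed:
        raise ValueError(f"unsupported size: {size}")
    # DALL-E 3 only generates one image per request, hence n=1.
    return {"model": "dall-e-3", "prompt": prompt, "size": size, "n": 1}

params = build_image_request("a watercolor fox in the snow")
print(params["model"])  # dall-e-3
```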
^ Back to Contents ^
Video
| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | Video-MME | The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. | | |
| 2024-05 | MotionLLM | Understanding Human Behaviors from Human Motions and Videos. | arXiv | |
| 2024-05 | Vidu | A Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models. | arXiv | |
| 2024-02 | Sora | An AI model that can create realistic and imaginative scenes from text instructions. | Technical Report | |
| 2023-11 | Pika | An idea-to-video platform that sets your creativity in motion. | | |
| 2023-03 | Runway | An applied AI research company shaping the next era of art, entertainment and human creativity. | | |
^ Back to Contents ^
Music
| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-04 | Udio | AI music generator. | | Website |
| 2023-12 | Suno | Suno is building a future where anyone can make great music. | | Website |
| 2023-12 | Soundry AI | Generative AI tools including text-to-sound and infinite sample packs. | | Website |
| 2023-12 | Sonauto | An AI music editor that turns prompts, lyrics, or melodies into full songs in any style. | | Website |
^ Back to Contents ^
3D
^ Back to Contents ^