
# AI Multimodal Timeline


Here we track the latest multimodal AI models, covering multimodal foundation models, LLMs, agents, audio, image, video, music, and 3D content. 🔥

## Table of Contents

- [Project List](#project-list)
  - [Multimodal Model](#multimodal-model)
  - [LLM](#llm)
  - [Agent](#agent)
  - [Audio](#audio)
  - [Image](#image)
  - [Video](#video)
  - [Music](#music)
  - [3D](#3d)

## Project List

### Multimodal Model

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Cambrian-1 | A Fully Open, Vision-Centric Exploration of Multimodal LLMs. | arXiv | Hugging Face |
| 2024-06 | MINT-1T | Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. | arXiv | |
| 2024-06 | OmniTokenizer | A Joint Image-Video Tokenizer for Visual Generation. | arXiv | Website |
| 2024-06 | ml-4m | A framework for training any-to-any multimodal foundation models. | arXiv | Website |
| 2024-06 | VideoLLaMA 2 | Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. | arXiv | Hugging Face |
| 2024-05 | ManyICL | Many-Shot In-Context Learning in Multimodal Foundation Models. | arXiv | |
| 2024-05 | Contrastive ALignment (CAL) | Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment. | arXiv | |
| 2024-05 | Groma | Grounded Multimodal Large Language Model with Localized Visual Tokenization. | arXiv | Hugging Face |
| 2024-05 | CogVLM2 | GPT4V-level open-source multi-modal model based on Llama3-8B. | | Hugging Face |
| 2024-05 | Chameleon | Mixed-Modal Early-Fusion Foundation Models. | arXiv | |
| 2024-05 | Lumina-T2X | Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |
| 2024-05 | MiniCPM-Llama3-V 2.5 | The latest model in the MiniCPM-V series, built on SigLip-400M and Llama3-8B-Instruct with 8B parameters in total. | | Hugging Face |
| 2024-05 | Gemini | Build with state-of-the-art generative models and tools to make AI helpful for everyone. | | API |
| 2024-05 | GPT-4o | GPT-4o ("o" for "omni") is a step towards much more natural human-computer interaction: it accepts any combination of text, audio, image, and video as input and generates any combination of text, audio, and image outputs (see the API sketch after this table). | | API |
| 2024-04 | MyGO | Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion. | arXiv | |
| 2024-04 | InternLM-XComposer2 | A vision-language large model (VLLM) excelling in free-form text-image composition and comprehension. | arXiv | Hugging Face |
| 2024-01 | MMVP | Exploring the Visual Shortcomings of Multimodal LLMs. | arXiv | |
| 2023-12 | V* | Guided Visual Search as a Core Mechanism in Multimodal LLMs. | arXiv | |
| 2023-12 | Tokenize Anything | Tokenize Anything via Prompting. | arXiv | Hugging Face |
| 2023-11 | ShareGPT4V | Improving Large Multi-Modal Models with Better Captions. | arXiv | Hugging Face |
| 2023-11 | Video-LLaVA | Learning United Visual Representation by Alignment Before Projection. | arXiv | Hugging Face |
| 2023-10 | LanguageBind | Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. | arXiv | Hugging Face |
| 2023-07 | Emu | Generative Multimodal Models from BAAI. | arXiv | Hugging Face |
| 2023-05 | ImageBind | One Embedding Space To Bind Them All. | arXiv | Website |
| 2022-11 | EVA | Visual Representation Fantasies from BAAI. | arXiv | Hugging Face |
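
The API-served entries above expose chat-style endpoints that take mixed text and image input. As a minimal sketch, here is a GPT-4o call with the official `openai` Python client (v1.x); the model name and message format follow OpenAI's documented Chat Completions API at the time of writing, and the image URL is a placeholder.

```python
# Minimal sketch: text + image input to GPT-4o via the OpenAI Chat Completions API.
# Assumes the `openai` package (v1.x) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    # Placeholder URL; any publicly reachable image works.
                    "image_url": {"url": "https://example.com/cat.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```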

[^ Back to Contents ^](#table-of-contents)

### LLM

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Claude 3.5 Sonnet | Anthropic's Claude 3.5 Sonnet model. | | API |
| 2024-06 | Nemotron-4 | Nemotron-4-340B-Instruct is a large language model (LLM) that can be used in a synthetic data generation pipeline to create training data for building LLMs. | arXiv | Hugging Face |
| 2024-04 | Llama 3 | Meta Llama 3 is the next generation of Meta's state-of-the-art open-source large language model (see the loading sketch after this table). | | Hugging Face |
| 2024-03 | Claude 3 | Talk with Claude, an AI assistant from Anthropic. | | API |
| 2024-03 | Grok-1 | The weights and architecture of xAI's 314-billion-parameter Mixture-of-Experts model, Grok-1. | | Hugging Face |
| 2023-09 | Baichuan 2 | A series of large language models developed by Baichuan Intelligent Technology. | | Hugging Face |
| 2023-07 | GPT-4 | GPT-4 is OpenAI's most advanced system, producing safer and more useful responses. | | API |
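
For the open-weight entries, Hugging Face `transformers` is the common loading path. Below is a minimal chat sketch for Llama 3 8B Instruct; the repo id and chat-template usage follow the model card (the repo is gated, so Meta's license must be accepted on the Hub first).

```python
# Minimal sketch: chat with Llama 3 8B Instruct via Hugging Face transformers.
# Assumes `transformers` and `torch` installed and the gated repo license accepted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize what a multimodal model is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```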

[^ Back to Contents ^](#table-of-contents)

### Agent

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Mixture of Agents (MoA) | Mixture-of-Agents Enhances Large Language Model Capabilities (see the sketch after this table). | arXiv | |
| 2024-06 | Buffer of Thoughts | Thought-Augmented Reasoning with Large Language Models. | arXiv | |
| 2024-06 | Translation Agent | Agentic translation using a reflection workflow. | | |
| 2024-06 | Atomic Agents | A framework designed to be modular, extensible, and easy to use. | | |
| 2024-05 | Pipecat | Open-source framework for voice and multimodal conversational AI. | | |
| 2024-02 | V-IRL | Grounding Virtual Intelligence in Real Life. | arXiv | |
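
The Mixture-of-Agents entry is easy to prototype: several "proposer" models answer a question independently, then an "aggregator" model synthesizes their answers. The sketch below is a simplification of the paper's multi-layer design down to one proposer layer; it uses the `openai` client against any OpenAI-compatible endpoint, and the model names are placeholders, not real identifiers.

```python
# Minimal sketch of the Mixture-of-Agents (MoA) idea: one proposer layer + an aggregator.
# Model names are placeholders; any OpenAI-compatible chat endpoint works.
from openai import OpenAI

client = OpenAI()

PROPOSERS = ["model-a", "model-b", "model-c"]  # hypothetical model names
AGGREGATOR = "model-d"                         # hypothetical model name

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def mixture_of_agents(question: str) -> str:
    # Layer 1: each proposer answers the question independently.
    proposals = [ask(m, question) for m in PROPOSERS]
    # Layer 2: the aggregator sees all proposals as auxiliary context and synthesizes.
    aggregate_prompt = (
        "You are given several candidate answers to a question. "
        "Synthesize them into one accurate, well-reasoned answer.\n\n"
        f"Question: {question}\n\n"
        + "\n\n".join(f"Candidate {i + 1}: {p}" for i, p in enumerate(proposals))
    )
    return ask(AGGREGATOR, aggregate_prompt)

print(mixture_of_agents("Why is the sky blue?"))
```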

[^ Back to Contents ^](#table-of-contents)

### Audio

#### Text-to-Speech

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | ChatTTS | A text-to-speech model designed specifically for dialogue scenarios such as LLM assistants. | | |
| 2023-06 | StyleTTS 2 | Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. | arXiv | Hugging Face |

#### Automatic Speech Recognition

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | TeleSpeech-ASR | A large speech model for multi-dialect ASR. | | Hugging Face |
| 2022-12 | Whisper | Whisper is a general-purpose speech recognition model (see the open-source usage sketch after this table). | arXiv | API |
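
Besides the hosted API, Whisper also ships as the open-source `openai-whisper` package. A minimal transcription sketch, assuming the package is installed and ffmpeg is on the PATH (the audio path is a placeholder):

```python
# Minimal sketch: transcribe an audio file with the open-source `openai-whisper` package.
# Assumes `pip install openai-whisper` and ffmpeg available on the PATH.
import whisper

model = whisper.load_model("base")      # sizes: tiny, base, small, medium, large
result = model.transcribe("audio.mp3")  # placeholder path
print(result["text"])
```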

#### Audio Generation

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | SEE-2-SOUND | Zero-Shot Spatial Environment-to-Spatial Sound. | arXiv | |
| 2024-05 | Make-An-Audio 3 | Transforming Text into Audio via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |

[^ Back to Contents ^](#table-of-contents)

### Image

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Depth Anything V2 | An improved monocular depth estimation foundation model. | arXiv | Hugging Face |
| 2024-06 | AutoStudio | Crafting Consistent Subjects in Multi-turn Interactive Image Generation. | arXiv | |
| 2024-06 | MimicBrush | Zero-shot Image Editing with Reference Imitation. | arXiv | Hugging Face |
| 2024-06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. | arXiv | Hugging Face |
| 2024-05 | Omost | A project that converts an LLM's coding capability into image generation (more precisely, image composing) capability. | | Hugging Face |
| 2024-05 | Hunyuan-DiT | A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding (see the diffusers sketch after this table). | arXiv | Hugging Face |
| 2023-10 | DALL·E 3 | DALL·E is an AI system that can create realistic images and art from a natural-language description. | | API |
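
Several image entries ship with `diffusers` integrations. As one example, a text-to-image sketch for Hunyuan-DiT: the `HunyuanDiTPipeline` class and the repo id are taken from the diffusers integration announced around the model's release, so verify both against the current model card before use.

```python
# Minimal sketch: text-to-image with Hunyuan-DiT through diffusers.
# Pipeline class and repo id follow the diffusers integration; verify on the model card.
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
).to("cuda")

image = pipe(prompt="An astronaut riding a horse, ink painting style").images[0]
image.save("hunyuan_dit_sample.png")
```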

[^ Back to Contents ^](#table-of-contents)

### Video

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | Video-MME | The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. | | |
| 2024-05 | MotionLLM | Understanding Human Behaviors from Human Motions and Videos. | arXiv | |
| 2024-05 | Vidu | A Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models. | arXiv | |
| 2024-02 | Sora | An AI model that can create realistic and imaginative scenes from text instructions. | Technical Report | |
| 2023-11 | Pika | An idea-to-video platform that sets your creativity in motion. | | |
| 2023-03 | Runway | An applied AI research company shaping the next era of art, entertainment and human creativity. | | |

[^ Back to Contents ^](#table-of-contents)

### Music

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-04 | Udio | An AI music generator. | | Website |
| 2023-12 | Suno | Suno is building a future where anyone can make great music. | | Website |
| 2023-12 | Soundry AI | Generative AI tools including text-to-sound and infinite sample packs. | | Website |
| 2023-12 | Sonauto | An AI music editor that turns prompts, lyrics, or melodies into full songs in any style. | | Website |

[^ Back to Contents ^](#table-of-contents)

### 3D

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Unique3D | High-Quality and Efficient 3D Mesh Generation from a Single Image. | arXiv | Hugging Face |
| 2024-06 | DreamGaussian4D | Generative 4D Gaussian Splatting. | arXiv | Hugging Face |
| 2024-03 | GaussianCube | A Structured and Explicit Radiance Representation for 3D Generative Modeling. | arXiv | Hugging Face |
| 2024-03 | TripoSR | Fast 3D Object Reconstruction from a Single Image. | arXiv | Hugging Face |

[^ Back to Contents ^](#table-of-contents)