Vision-Language Models

A vision-language model learns a joint representation of images and text.
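
The text does not say how the joint representation is built or used. One common pattern, assumed here purely for illustration, is to place image and text embeddings in a shared vector space and compare them by cosine similarity. In the minimal sketch below, the embeddings are hand-written toy values rather than the output of any trained model, and the names `image_embeddings`, `text_embeddings`, and `best_caption` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings in a shared 3-dimensional space.
# A real model would produce these with learned image and text encoders.
image_embeddings = {
    "photo_of_dog": [0.9, 0.1, 0.0],
    "photo_of_car": [0.0, 0.2, 0.9],
}
text_embeddings = {
    "a dog": [0.8, 0.2, 0.1],
    "a car": [0.1, 0.1, 0.95],
}


def best_caption(image_name):
    """Return the text whose embedding is closest to the image embedding."""
    image_vec = image_embeddings[image_name]
    return max(
        text_embeddings,
        key=lambda text: cosine_similarity(image_vec, text_embeddings[text]),
    )


print(best_caption("photo_of_dog"))
print(best_caption("photo_of_car"))
```

Because both modalities live in the same space, the same similarity function serves image-to-text and text-to-image matching.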

Audio-Visual Learning

Audio-visual learning studies models that jointly process sound and visual information. The goal is to learn representations that combine what is seen with what is heard.

Unified Foundation Models

A unified foundation model is a neural network trained across many modalities, tasks, and domains using a shared architecture and shared representations.

Retrieval Systems

A retrieval system finds relevant information from an external memory source.
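
As a rough illustration of the definition above, the sketch below treats a Python list of strings as the external memory and scores relevance by word overlap with the query. Both choices are assumptions made for brevity: practical systems typically use an index or learned embeddings, and the names `tokenize`, `retrieve`, and `memory` are hypothetical.

```python
def tokenize(text):
    """Lowercase the text and split on whitespace into a set of words."""
    return set(text.lower().split())


def retrieve(query, memory, k=2):
    """Return up to k memory entries sharing the most words with the query.

    Entries with no overlapping words are treated as irrelevant and dropped.
    """
    query_tokens = tokenize(query)
    scored = []
    for doc in memory:
        overlap = len(query_tokens & tokenize(doc))
        if overlap > 0:
            scored.append((overlap, doc))
    # Highest overlap first; Python's stable sort keeps memory order on ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Hypothetical external memory: a small in-process list of documents.
memory = [
    "cats sleep most of the day",
    "the train to Paris leaves at noon",
    "dogs and cats can live together",
]

print(retrieve("when does the train leave", memory, k=1))
```

Word overlap is a deliberately crude relevance signal: it counts "the" as a match and misses "leave" versus "leaves", which is one reason real systems prefer weighted or embedding-based scoring.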

Long-Horizon Agents

A long-horizon agent is a model-driven system that pursues goals over many steps. It observes the environment, chooses actions, records intermediate state, uses tools, and adjusts its plan as new information arrives.
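
The observe, choose, act, and record cycle described above can be sketched as a simple loop. This is a toy under stated assumptions, not a real agent: the environment is a single integer, the goal is a target value, the two tools (`increment`, `double`) are hypothetical, and a hand-written rule stands in for the model that would normally choose actions.

```python
def run_agent(start, goal, max_steps=20):
    """Drive an integer state toward `goal` using a tiny set of tools."""
    # Hypothetical tools the agent may call to change the environment.
    tools = {
        "increment": lambda x: x + 1,
        "double": lambda x: x * 2,
    }
    state = start
    history = []  # recorded intermediate state, one entry per step taken

    for step in range(max_steps):
        observation = state  # observe the environment
        if observation == goal:
            break  # goal reached, stop acting

        # Choose an action from the latest observation. Re-deciding at every
        # step is how this sketch "adjusts its plan" as new information arrives.
        if 0 < observation * 2 <= goal:
            action = "double"
        else:
            action = "increment"

        state = tools[action](observation)  # use the chosen tool
        history.append((step, observation, action, state))  # record the step

    return state, history


final_state, history = run_agent(start=1, goal=10)
for step, before, action, after in history:
    print(step, before, action, after)
```

The `max_steps` bound matters in a sketch like this: an agent whose rule never reaches the goal (for example, a start value above the goal) would otherwise loop forever.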