A

  • ABC-Dataset: A collection of one million Computer-Aided Design (CAD) models for research on geometric deep learning methods and applications
  • ActivityNet 200: A Large-Scale Video Benchmark for Human Activity Understanding
  • ADI17 (Arabic Dialect Identification): 3,000 hours of Arabic dialect speech data from 17 countries of the Arab world, collected from YouTube
  • AirbusShipDetection: Dataset of RGB satellite images with associated bounding boxes for ships (CSV), from a past Kaggle competition
  • AISHELL-4: A free Mandarin multi-channel meeting speech corpus
  • AliMeeting: A free Mandarin multi-channel meeting speech corpus, provided by Alibaba Group
  • ArabicSpeech: Phonetic texts and sounds for the Arabic language
  • Argoverse2: Next Generation Datasets for Self-Driving Perception and Forecasting
  • AudioSet: An audio event dataset consisting of over 2M human-annotated 10-second video clips
  • AudioCaps: A large-scale dataset of about 46K audio clips paired with human-written captions, collected via crowdsourcing on the AudioSet dataset

B

  • BDD: A Diverse Driving Dataset for Heterogeneous Multitask Learning
  • BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics
  • BOP: Benchmark for 6D Object Pose Estimation
  • BridgeV2: Robotic manipulation behaviors designed to facilitate research in scalable robot learning

C

  • c4: A colossal, cleaned version of Common Crawl's web crawl corpus
  • Caltech-256: Caltech-256 image set for designing machine vision systems, with applications to science, consumer products, entertainment, manufacturing and defense
  • CAMELS Sims IllustrisTNG LH 0-499: Subset of the Cosmology and Astrophysics with MachinE Learning Simulations dataset
  • CAMELYON: Whole-slide images (WSI) of hematoxylin and eosin (H&E) stained lymph node sections
  • CC-100: Monolingual data for 100+ languages from Web Crawl Data
  • CHARADES: Unstructured video activity recognition and common sense reasoning for daily human activities
  • CIFAR-10: Corpus of 60,000 32×32 colour images in 10 classes (see the loading sketch after this list)
  • ClimSim_high-res: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate simulators.
  • ClimSim_low-res: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate simulators.
  • Clotho_v2: Audio captioning dataset with 6974 audio samples, each with five captions
  • CN-Celeb: A large-scale Chinese speaker recognition dataset collected “in the wild”
  • CO3D: Common Objects in 3D (CO3D) is a dataset designed for learning category-specific 3D reconstruction and new-view synthesis using multi-view images of common object categories
  • COCO: Common Objects in Context (COCO) dataset for object detection, segmentation, and captioning
  • COCO-Stuff: Augmented version of the COCO dataset with pixel-level stuff annotations for scene understanding tasks
  • coco_minitrain_25k: COCO minitrain is a curated mini training set (25K images ≈ 20% of train2017) for COCO.
  • COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
  • Collages: Image dataset for a binary classification task (based on MNIST and CIFAR)
  • Conceptual Captions: Dataset consisting of ~3.3M images representing a wide variety of styles annotated with captions
  • Condensed Movies: A large-scale video dataset, featuring clips from movies with detailed captions
  • Common Crawl: A corpus of web crawl data composed of billions of web pages
  • Common Voice (4, 6.1, 7.0, 8.0, 10.0, 15.0, 16.1): An open source, multi-language dataset of voices that anyone can use to train speech-enabled applications
  • criteo1tb: Feature values and click feedback for millions of display ads
  • CRVD (Captured Raw Video Denoising): 55 groups of noisy-clean videos with ISO values ranging from 1600 to 25600
  • CSTR-VCTK: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)
  • CVSS: A massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English
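
A minimal loading sketch for an entry such as CIFAR-10 above, using torchvision (an assumption; any framework works). The "./data" root is a placeholder path, and download=False should be used where the dataset is already provisioned locally:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Convert images to tensors and normalize with commonly used
# per-channel CIFAR-10 statistics.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

# "./data" is a placeholder path; set download=False if a local
# copy of the dataset already exists there.
train_set = datasets.CIFAR10(root="./data", train=True,
                             download=True, transform=transform)
loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([128, 3, 32, 32])
```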

D

  • DeepLesion: Dataset of CT images to help the scientific community improve detection accuracy of lesions
  • DiDeMo (Distinct Describable Moments): One of the largest and most diverse datasets for the temporal localization of events in videos given natural language descriptions
  • DIV2K: Collection of 1,000 high-resolution 2K images
  • domain_net: Dataset of common objects in six different domains (Clipart, Infograph, Painting, Quickdraw, Real and Sketch)
  • DROID: Distributed RObot Interaction Dataset, an in-the-wild robot manipulation dataset
  • DynamicEarthNet: Daily Multi-Spectral Satellite Dataset for Semantic Change Segmentation

E

  • EPIC-KITCHENS: Extended dataset in first-person (egocentric) vision, multi-faceted non-scripted recordings in native environments
  • EPIC_KITCHENS_2018: Dataset in first-person (egocentric) vision, multi-faceted non-scripted recordings in native environments
  • Europarl-ST: A Multilingual Speech Translation Corpus of paired audio-text samples constructed using the debates carried out in the European Parliament

F

  • fastText: Dataset of pre-trained word vectors for text representations and text classifiers
  • FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
  • FLEURS: An n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark
  • Flickr2K: Large collection of high-resolution 2K images
  • Flickr8kAudio: 40,000 spoken captions of 8,000 natural images
  • Flickr30k: 30k Image Caption Corpus (31783 images).
  • FlickrFace: Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces

G

  • GigaSpeech: A multi-domain English speech recognition corpus
  • GLDv2: Google Landmarks Dataset v2, 5 million annotated images with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments.
  • GLUE: The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
  • Google Scanned Objects: Dataset of common household objects that have been 3D scanned for use in robotic simulation and synthetic perception research
  • GOT-10k: Generic Object Tracking Benchmark: A large, high-diversity, one-shot database for generic object tracking in the wild
  • GranDf_HA_images: GranD-f Grounded Conversation Generation high-quality human-annotated dataset
  • GTA5 Dataset: Synthetic images from the open-world video game Grand Theft Auto 5

H

  • HDR+: Dataset consisting of 3640 bursts (made up of 28461 images in total) captured using a variety of mobile cameras
  • HO-3D (v2): Dataset with 3D pose annotations for hand and object under severe occlusions from each other
  • howto100m: Dataset of narrated videos (instructional videos)
  • howto100m_s3d_features: Pre-trained S3D features for HowTo100M dataset
  • HuggingFace: Hundreds of Hugging Face datasets, not listed here (see the loading sketch after this list)
  • Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding
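
Since the catalog also exposes hundreds of Hugging Face datasets, here is a minimal access sketch using the datasets library; "imdb" is only an illustrative dataset name, and the streaming variant avoids downloading a large corpus up front:

```python
from datasets import load_dataset

# "imdb" is an illustrative name; substitute any dataset from the
# Hugging Face Hub (or a local path to a provisioned copy).
ds = load_dataset("imdb", split="train")
print(ds[0])        # first example as a dict of fields
print(ds.features)  # dataset schema

# Streaming variant for very large corpora: examples are fetched
# lazily instead of being downloaded and cached in full.
stream = load_dataset("imdb", split="train", streaming=True)
print(next(iter(stream)))
```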

I

  • IGN: Departmental orthophotograph datasets (5 m and 50 cm resolution) of the French National Institute of Geographic and Forest Information
  • IIIT-HWS: Synthetic handwritten word images
  • imagenet: Image database for visual object recognition (see the loading sketch after this list)
  • ImageNet21K (Winter 2021 release): Extended version of ImageNet
  • ImageNet-C: 5 levels of severity for 19 different corruption types
  • IndicSUPERB - Kathbath: 1,684 hours of labelled speech data across 12 Indian languages
  • IWSLT2022: 17 hours of Tamasheq audio data aligned to French translations, and unlabeled raw audio data in 5 languages spoken in Niger: French (116 hours), Fulfulde (114 hours), Hausa (105 hours), Tamasheq (234 hours) and Zarma (100 hours)
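
For the imagenet entry above, torchvision ships a ready-made wrapper; the sketch below assumes the ILSVRC2012 archives are already present under the hypothetical path /datasets/imagenet, since torchvision does not download ImageNet itself:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Typical ImageNet training-time preprocessing.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "/datasets/imagenet" is a placeholder root; torchvision expects
# the official ILSVRC2012 archives (or an extracted copy) there.
train_set = datasets.ImageNet(root="/datasets/imagenet",
                              split="train", transform=transform)
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)
```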

J

  • JTubeSpeech: Corpus of Japanese speech collected from YouTube

K

  • KAIST Multispectral Pedestrian Detection: 95k manually annotated color-thermal pairs taken from a vehicle
  • kinetics: DeepMind Kinetics human action video dataset
  • KITTI: Dataset captured from a station wagon for use in mobile robotics and autonomous driving research

L

  • LAION-400M: The world’s largest openly available image-text-pair dataset, with 400 million samples
  • LaSOT (Large-scale Single Object Tracking): 1,550 manually annotated sequences with more than 3.87 million frames
  • Libri-Light: A collection of spoken English audio suitable for training speech recognition systems under limited or no supervision
  • LibriCSS: A real recorded dataset derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones
  • LibriMix: An open-source corpus for speech source separation, with mixtures generated from LibriSpeech utterances and WHAM! noise
  • LibriSpeechAsrCorpus: Large-scale (1000 hours) corpus of read English speech (see the loading sketch after this list)
  • LibriTTS: Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus
  • LSUN: Image dataset for visual recognition
  • LVIS: Dataset for long tail instance segmentation with annotations for over 1000 object categories in 164k images
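
A minimal torchaudio sketch for the LibriSpeechAsrCorpus entry above; "./data" is a placeholder path and the url argument selects the LibriSpeech subset:

```python
import torchaudio

# "./data" is a placeholder path; set download=False if the subset
# is already provisioned locally.
dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="train-clean-100", download=True)

# Each item is (waveform, sample_rate, transcript,
#               speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = dataset[0]
print(waveform.shape, sample_rate, transcript)
```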

M

  • MAESTRO: A dataset composed of about 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.
  • MAGICDATA Mandarin Chinese Conversational Speech Corpus: 180 hours of rich annotated Mandarin spontaneous conversational speech data
  • Mapillary: Street-level imagery and map data from all over the world
  • MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare
  • MegaScenes: An extensive collection of structure-from-motion reconstructions and internet images
  • MeteoNet: Meteorological dataset by METEO FRANCE, the French national meteorological service
  • MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset
  • MInDS-14: Multilingual dataset for intent detection from spoken data
  • MIRFLICKR: Dataset with 1M tagged images collected from the social photography site Flickr
  • ML-SUPERB: Data for the MultiLingual Speech processing Universal PERformance Benchmark
  • MMF: Modular framework for vision and language multimodal research from Facebook AI Research
  • MNIST: Database of handwritten digits
  • MOT: Multiple Object Tracking
  • MOTSynth-MOTS-CVPR22 (Multiple Object Tracking): Large-scale synthetic dataset for pedestrian detection, segmentation, and tracking in urban scenarios, created by exploiting the highly photorealistic video game Grand Theft Auto V
  • MSDWILD: Dataset designed for multi-modal speaker diarization and lip-speech synchronization in the wild
  • MSR-VTT (MSR Video to Text): Large-scale video benchmark for video understanding, especially the task of translating video to text
  • MTG-Jamendo: An open dataset for music auto-tagging, built using music available on Jamendo
  • MultilingualLibriSpeech: A large multilingual corpus derived from LibriVox audiobooks
  • MultilingualTEDx: A multilingual corpus of TEDx talks for speech recognition and translation
  • Multi-Object Datasets: Multi-object scenes with ground-truth segmentation masks; some include generative factors detailing object features
  • MultiShapeNet: Videos of novel scenes created by LFN, PixelNerf, and SRT on the Multi-ShapeNet dataset
  • MuS2: A Benchmark for Sentinel-2 Multi-Image Super-Resolution
  • MUSAN: Dataset of music, speech, and noise
  • MUSDB18: Dataset of 150 full-length music tracks (~10 h total) of different genres, along with their isolated drums, bass, vocals and other stems
  • MuST-C: A multilingual speech translation corpus comprising several hundred hours of audio recordings from English TED Talks

N

  • Narratives: fMRI dataset of participants listening to spoken stories.
  • NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations
  • NOTSOFAR-1 Recorded Meeting Dataset: a collection of 237 meetings recorded across 30 conference rooms with 4-8 attendees, featuring a total of 35 unique speakers
  • Nuscenes LiDAR-seg: Each point of each lidar pointcloud belonging to a keyframe in the nuScenes dataset is annotated with one of 32 possible semantic labels

O

  • Objaverse: Massive Dataset with 800K+ Annotated 3D Objects
  • Objectron (https://www.objectron.dev/): Collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point-clouds and characterization of the planar surfaces in the surrounding environment
  • Objects365: Large-scale object detection dataset with 365 object categories over 600K training images
  • office_home: Images from four different domains (Art, Clipart, Product and Real-World)
  • ogbg: The ogbg-molpcba molecular property prediction dataset, where each graph represents a molecule, with atoms as nodes and chemical bonds as edges
  • ONCE: Large-scale autonomous driving dataset with 2D&3D object annotations
  • OnePose: Dataset with multiple video scans of the same object put in different locations
  • OpenImagesV5: Open-source image dataset with annotated bounding boxes, object segmentations and visual relationships
  • OpenNEURO: A free and open platform for sharing MRI, MEG, EEG, iEEG, ECoG, ASL, and PET data
  • OSCAR: Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
  • OWTB: Open-world tracking benchmark

P

  • PACS: Image dataset for domain generalization (Photo, Art painting, Cartoon, and Sketch domains)
  • PASCAL-Part: Additional annotations for PASCAL VOC 2010 providing segmentation masks for each body part of the object
  • Le Petit Prince: A multilingual fMRI corpus using ecological stimuli
  • Places365-Standard: Image dataset for visual understanding tasks
  • Places Audio Captions (English & Hindi): Approximately 400,000 English and 100,000 Hindi spoken captions for natural images drawn from the Places 205 image dataset
  • PodcastFillers: 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts
  • PROBA_V: Satellite data from 74 hand-selected regions around the globe at different points in time
  • pubmed: Baseline set of MEDLINE/PubMed citation records in XML

Q

  • QVHighlights: Dataset for moment retrieval and highlight detection

R

  • RCSB: Protein Data Bank (PDB) archive of 3D structure data for large biological molecules (proteins, DNA, and RNA)
  • RealEstate10K-360P: A large dataset of camera poses corresponding to 10 million frames derived from about 80,000 video clips, gathered from about 10,000 YouTube videos. This dataset has been built on “low resolution” videos (360p).
  • Recipe1M(+): A large-scale, structured corpus of over one million cooking recipes and 13 million food images
  • RedPajama-V2: An open dataset with 30 trillion tokens for training large language models
  • RoboCasa: A large-scale simulation framework for training generally capable robots to perform everyday tasks

S

  • S2L3A_France_2019: 13 Sentinel-2 L3A mosaics covering France in 2019
  • SA-V: 51K videos and 643K masklet annotations for general-purpose object segmentation models from open world videos
  • SBCSAE (Santa Barbara Corpus of Spoken American English): 60 roughly 20-minute stereo recordings of naturally-occurring spoken interactions recorded across the United States
  • SegmentAnything-1B: 11M images and 1.1B mask annotations for semantic segmentation
  • Semantic KITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences
  • SemanticKITTI-C: An evaluation benchmark heading toward robust and reliable 3D semantic segmentation in autonomous driving
  • SEN2VENµS: An open dataset for the super-resolution of Sentinel-2 images, leveraging simultaneous acquisitions with the VENµS satellite
  • ShapeNet: A richly-annotated, large-scale dataset of 3D shapes
  • ShapeStacks: Simulation-based dataset composed of a variety of elementary geometric primitives richly annotated regarding semantics and structural stability
  • SHREC19: A dataset of isometric and non-isometric non-rigid shapes with texture-based ground-truth correspondences
  • SHREC20: A dataset of non-isometric non-rigid shapes with texture-based ground-truth correspondences
  • Shrutilipi: Labelled ASR corpus obtained from All India Radio news bulletins for 12 Indian languages
  • SmolLM-Corpus: Curated collection of high-quality educational and synthetic data designed for training small language models.
  • snapshot-safari-2024-expansion: 4,029,374 images from 15 camera trapping projects in the Snapshot Safari program
  • snapshot-serengeti: 2.65M sequences of camera trap images, totaling 7.1M images, from seasons one through eleven of the Snapshot Serengeti project
  • SpokenCOCO: Approximately 600k recordings of human speakers reading the MSCOCO image captions out loud in English
  • SpokenObjectNet: 50k English spoken audio captions for the images in the ObjectNet dataset
  • STL-10: Image recognition dataset for developing unsupervised feature learning, deep learning and self-taught learning algorithms (see the loading sketch after this list)
  • SUNRGBD: Dataset captured by four different sensors containing 10k densely annotated RGB-D images
  • Surreal Dataset of Meshes: 230,000 human meshes with a large variety of realistic poses and body shapes
  • svo: Subject-Verb-Object (SVO) triples from the NELL (Never-Ending Language Learning) project
  • SynthTriplets18M: Large-scale synthetic dataset for training and evaluating composed image retrieval (CIR) models
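
Because STL-10 above targets unsupervised and self-taught learning, its torchvision wrapper exposes an unlabeled split alongside the labeled ones; a minimal sketch, with "./data" again a placeholder path:

```python
from torchvision import datasets, transforms

# STL-10 offers 'train', 'test', 'unlabeled' and 'train+unlabeled'
# splits; the unlabeled split (100,000 96×96 images) is intended
# for unsupervised pre-training.
unlabeled = datasets.STL10(root="./data", split="unlabeled",
                           download=True, transform=transforms.ToTensor())
labeled = datasets.STL10(root="./data", split="train",
                         download=True, transform=transforms.ToTensor())
print(len(unlabeled), len(labeled))  # 100000 5000
```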

T

  • TCGA WSI: Whole Slide Images of The Cancer Genome Atlas
  • TCIA_LIDC-IDRI_20200921: A completed reference database of lung nodules on CT scans (LIDC-IDRI)
  • TEDLIUM: Corpus dedicated to speech recognition in English
  • Thumos14: Challenge dataset for action recognition with a large number of classes
  • TikTok_dataset: Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos
  • TikTok_Raw_videos: Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos
  • tng-project: Results and data from IllustrisTNG simulations of cosmological galaxy formation
  • TOPKIDS: A collection of 3D shapes undergoing within-class deformations
  • TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild

U

  • UCF101: Action videos dataset for action recognition

V

  • V2X-Sim-2: A Comprehensive Synthetic Multi-agent Perception Dataset
  • VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
  • VGGSound: An audio-visual correspondence dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube
  • VideoCC: Dataset containing (video, caption) pairs for training video-text machine learning models
  • Vimeo-90K septuplet: 91,701 7-frame sequences with fixed resolution 448×256, extracted from 39K selected video clips from Vimeo-90K
  • Virtual KITTI 2: A more photo-realistic and better-featured version of the original Virtual KITTI dataset; it exploits recent improvements of the Unity game engine and provides new data such as stereo images and scene flow
  • Visual Genome: Dataset to connect structured image concepts to language
  • ViTT (Video Timeline Tags): Collection of videos annotated with timelines where each video is divided into segments, and each segment is labelled with a short free-text description
  • VOST: Video Object Segmentation under Transformations
  • VoxCeleb1-2: Audio-visual dataset of human speech
  • VoxLingua107: Speech dataset for training spoken language identification models, containing data extracted from YouTube videos and labeled according to the language of the video title and description, for 107 languages
  • VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation
  • VoxTube: A multilingual speaker recognition dataset collected from the CC BY 4.0 YouTube videos

W

  • WebVid-2M: Large-scale dataset of short videos (2M+) with textual descriptions sourced from the web
  • WebVid-10M: Large-scale dataset of short videos (10M+) with textual descriptions sourced from the web.
  • WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition (original format + WAV 16khz mono format available)
  • Wikimedia Dumps: A subset of the dumps from the Wikimedia wikis, currently available: frwiki-20220901 and wikidatawiki-entities-20220912
  • WIT: Large multimodal multilingual dataset composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages
  • wmt: A machine translation dataset composed from a collection of various sources
  • WorldStrat: Nearly 10,000 km² of free high-resolution and matched low-resolution satellite imagery

X

Y

  • YCB-Videos: A large-scale video dataset for 6D object pose estimation.
  • YFCC100M: Multimedia collection containing the metadata of around 99.2 million photos and 0.8 million videos from Flickr
  • YouTube-VOS: Dataset for the 4th Large-scale Video Object Segmentation Challenge, Workshop in conjunction with CVPR 2022
  • YT-Temporal-180M: Large dataset of 6 million videos covering a wide range of topics
  • YT-Temporal-1B: Dataset of 20 million videos, spanning over a billion frames

Z

— List updated on March 18th 2025 —