A

  • ABC-Dataset: A collection of one million Computer-Aided Design (CAD) models for research on geometric deep learning methods and applications
  • ActivityNet 200: A Large-Scale Video Benchmark for Human Activity Understanding
  • ADI17 (Arabic Dialect Identification): 3,000 hours of Arabic dialect speech data from 17 countries of the Arab world, collected from YouTube
  • AirbusShipDetection: Dataset of satellite images (RGB) + associated bounding boxes for ships (CSV), from a past Kaggle competition
  • AISHELL-4: A free Mandarin multi-channel meeting speech corpus
  • AliMeeting: A free Mandarin multi-channel meeting speech corpus, provided by Alibaba Group
  • ArabicSpeech: Phonetic texts and sounds for the Arabic language
  • Argoverse2: Next Generation Datasets for Self-Driving Perception and Forecasting
  • AudioCaps: A large-scale dataset of about 46K audio clips paired with human-written captions, collected via crowdsourcing on the AudioSet dataset
  • AudioSet: An audio event dataset consisting of over 2M human-annotated 10-second video clips

B

  • BDD: A Diverse Driving Dataset for Heterogeneous Multitask Learning
  • BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics
  • BOP: Benchmark for 6D Object Pose Estimation

C

  • c4: A colossal, cleaned version of Common Crawl's web crawl corpus
  • Caltech-256: Caltech-256 image set for designing machine vision systems, with applications to science, consumer products, entertainment, manufacturing and defense
  • CAMELS Sims IllustrisTNG LH 0-499: Subset of the Cosmology and Astrophysics with MachinE Learning Simulations dataset
  • CAMELYON: Whole-slide images (WSI) of hematoxylin and eosin (H&E) stained lymph node sections
  • CC-100: Monolingual data for 100+ languages from Web Crawl Data
  • CHARADES: Unstructured video activity recognition and common sense reasoning for daily human activities
  • CIFAR-10: Corpus of 60,000 32×32 colour images in 10 classes (see the loading sketch after this list)
  • ClimSim_low-res: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate simulators
  • Clotho_v2: Audio captioning dataset with 6974 audio samples, each with five captions
  • CN-Celeb: A large-scale Chinese speaker recognition dataset collected “in the wild”
  • CO3D: Common Objects in 3D (CO3D) is a dataset designed for learning category-specific 3D reconstruction and new-view synthesis using multi-view images of common object categories
  • COCO: Common Objects in Context (COCO) dataset for object detection, segmentation, and captioning
  • COCO-Stuff: Augmented version of the COCO dataset with pixel-level stuff annotations for scene understanding tasks
  • coco_minitrain_25k: Curated mini training set for COCO (25K images, about 20% of train2017)
  • COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
  • Collages: Image dataset for a binary classification task (based on MNIST and CIFAR)
  • Common Crawl: A corpus of web crawl data composed of billions of web pages
  • Common Voice (4, 6.1, 7.0, 8.0, 10.0, 15.0, 16.1): An open source, multi-language dataset of voices that anyone can use to train speech-enabled applications
  • Conceptual Captions: Dataset consisting of ~3.3M images representing a wide variety of styles annotated with captions
  • Condensed Movies: A large-scale video dataset, featuring clips from movies with detailed captions
  • criteo1tb: Feature values and click feedback for millions of display ads
  • CRVD (Captured Raw Video Denoising): 55 groups of noisy-clean videos with ISO values ranging from 1600 to 25600
  • CVSS: A massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English
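
As a quick illustration for the CIFAR-10 entry above, here is a minimal sketch of loading the dataset with torchvision; the library call is standard, but the local root path is a hypothetical placeholder.

```python
# Minimal sketch: loading CIFAR-10 with torchvision.
# Assumes torchvision is installed; "./data" is a hypothetical local path.
from torchvision import datasets, transforms

cifar10 = datasets.CIFAR10(
    root="./data",                    # hypothetical download/cache directory
    train=True,                       # training split (use train=False for the test split)
    download=True,                    # fetch the archive if not already present
    transform=transforms.ToTensor(),  # HWC uint8 -> CHW float in [0, 1]
)
image, label = cifar10[0]
print(image.shape, label)             # torch.Size([3, 32, 32]) and a class index in 0..9
```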

D

E

  • EPIC-KITCHENS: Extended dataset in first-person (egocentric) vision, multi-faceted non-scripted recordings in native environments
  • EPIC_KITCHENS_2018: Dataset in first-person (egocentric) vision, multi-faceted non-scripted recordings in native environments
  • Europarl-ST: A Multilingual Speech Translation Corpus of paired audio-text samples constructed using the debates carried out in the European Parliament

F

  • fastText: Dataset of pre-trained word vectors for text representations and text classifiers
  • FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
  • FLEURS: An n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark
  • Flickr2K: Large collection of high-resolution (2K) images
  • Flickr8kAudio: 40,000 spoken captions of 8,000 natural images
  • Flickr30k: 30K image caption corpus (31,783 images)
  • FlickrFace: Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces

G

  • GigaSpeech: A multi-domain English speech recognition corpus
  • GLDv2: Google Landmarks Dataset v2, 5 million annotated images with labels representing human-made and natural landmarks; usable for landmark recognition and retrieval experiments
  • GLUE: The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems
  • Google Scanned Objects: Dataset of common household objects that have been 3D scanned for use in robotic simulation and synthetic perception research
  • GOT-10k: Generic Object Tracking Benchmark: A large, high-diversity, one-shot database for generic object tracking in the wild
  • GTA5 Dataset: Synthetic images from the open-world video game Grand Theft Auto 5

H

  • HDR+: Dataset consisting of 3640 bursts (made up of 28461 images in total) captured using a variety of mobile cameras
  • HO-3D (v2): Dataset with 3D pose annotations for hand and object under severe occlusions from each other
  • howto100m: Dataset of narrated videos (instructional videos)
  • howto100m_s3d_features: Pre-trained S3D features for HowTo100M dataset
  • HuggingFace: Hundreds of Hugging Face datasets, not listed here (see the loading sketch after this list)
  • Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding
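
For the Hugging Face entry above, a minimal sketch of pulling one of those datasets with the `datasets` library; the dataset name and config shown are only examples.

```python
# Minimal sketch: loading a Hugging Face dataset with the `datasets` library.
# "glue"/"mrpc" is just an example dataset/config pair.
from datasets import load_dataset

ds = load_dataset("glue", "mrpc", split="train")
print(ds[0])         # dict of named columns for the first example
print(ds.features)   # column schema of the dataset
```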

I

  • IGN: Departmental orthophotograph datasets (5 m and 50 cm resolution) from the French National Institute of Geographic and Forest Information
  • IIIT-HWS: Synthetic handwritten word images
  • imagenet: Image database for visual object recognition
  • ImageNet21K (Winter 2021 release): Extended version of ImageNet
  • ImageNet-C: ImageNet images corrupted with 19 different corruption types at 5 levels of severity
  • IndicSUPERB - Kathbath: 1,684 hours of labelled speech data across 12 Indian languages
  • IWSLT2022: 17 hours of Tamasheq audio data aligned to French translations, plus unlabeled raw audio data in 5 languages spoken in Niger: French (116 hours), Fulfulde (114 hours), Hausa (105 hours), Tamasheq (234 hours) and Zarma (100 hours)

J

  • JTubeSpeech: Corpus of Japanese speech collected from YouTube

K

  • KAIST Multispectral Pedestrian Detection: 95k manually annotated color-thermal pairs taken from a vehicle
  • kinetics: DeepMind Kinetics human action video dataset
  • KITTI: Dataset captured from a station wagon for use in mobile robotics and autonomous driving research

L

  • LAION-400M: The world’s largest openly available image-text-pair dataset, with 400 million samples
  • LaSOT (Large-scale Single Object Tracking): 1,550 manually annotated sequences with more than 3.87 million frames
  • Libri-Light: A collection of spoken English audio suitable for training speech recognition systems under limited or no supervision
  • LibriMix: Open source dataset for source separation in noisy environments, derived from LibriSpeech signals (clean subset) and WHAM noise
  • LibriSpeechAsrCorpus: Large-scale (1000 hours) corpus of read English speech
  • LSUN: Image dataset for visual recognition
  • LVIS: Dataset for long tail instance segmentation with annotations for over 1000 object categories in 164k images

M

  • MAESTRO: A dataset composed of about 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms
  • MAGICDATA Mandarin Chinese Conversational Speech Corpus: 180 hours of rich annotated Mandarin spontaneous conversational speech data
  • Mapillary: Street-level imagery and map data from all over the world
  • MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare
  • MeteoNet: Meteorological dataset by METEO FRANCE, the French national meteorological service
  • MInDS-14: Multilingual dataset for intent detection from spoken data
  • MIRFLICKR: Dataset with 1M tagged images collected from the social photography site Flickr
  • ML-SUPERB: Data for the MultiLingual Speech processing Universal PERformance Benchmark
  • MMF: Modular framework for vision and language multimodal research from Facebook AI Research
  • MNIST: Database of handwritten digits
  • MOT: Multiple Object Tracking
  • MOTSynth-MOTS-CVPR22 (Multiple Object Tracking): Large-scale synthetic dataset for pedestrian detection, segmentation, and tracking in urban scenarios created by exploiting the highly photorealistic video game Grand Theft Auto V
  • MSDWILD: Dataset designed for multi-modal speaker diarization and lip-speech synchronization in the wild
  • MSR-VTT (MSR Video to Text): Large-scale video benchmark for video understanding, especially the task of translating video to text
  • MultilingualLibriSpeech: A large multilingual corpus derived from LibriVox audiobooks
  • MultilingualTEDx: A multilingual corpus of TEDx talks for speech recognition and translation
  • Multi-Object Datasets: Multi-object scenes with ground-truth segmentation masks; some include generative factors detailing object features
  • MultiShapeNet: Videos of novel scenes created by LFN, PixelNeRF, and SRT on the Multi-ShapeNet dataset
  • MuS2: A Benchmark for Sentinel-2 Multi-Image Super-Resolution
  • MUSAN: Dataset of music, speech, and noise
  • MuST-C: A multilingual speech translation corpus comprising several hundred hours of audio recordings from English TED Talks

N

  • Narratives: fMRI dataset of participants listening to spoken stories
  • NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations
  • Nuscenes LiDAR-seg: Every point of each lidar pointcloud belonging to a keyframe of the nuScenes dataset is annotated with one of 32 possible semantic labels

O

  • Objaverse: Massive Dataset with 800K+ Annotated 3D Objects
  • Objectron: Collection of short, object-centric video clips, accompanied by AR session metadata that includes camera poses, sparse point-clouds and characterization of the planar surfaces in the surrounding environment
  • Objects365: Large-scale object detection dataset with 365 object categories over 600K training images
  • office_home: Images from four different domains (Art, Clipart, Product and Real-World)
  • ogbg: The ogbg-molpcba molecular property prediction dataset, where each graph represents a molecule, nodes are atoms, and edges are chemical bonds (see the loading sketch after this list)
  • ONCE: Large-scale autonomous driving dataset with 2D&3D object annotations
  • OnePose: Dataset with multiple video scans of the same object put in different locations
  • OpenImagesV5: Open-source image dataset with annotated bounding boxes, object segmentations and visual relationships
  • OpenNEURO: A free and open platform for sharing MRI, MEG, EEG, iEEG, ECoG, ASL, and PET data
  • OSCAR: Open Super-large Crawled Aggregated coRpus, a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture
  • OWTB: Open-world tracking benchmark
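
For the ogbg entry above, a minimal sketch of loading ogbg-molpcba through the OGB package, assuming the PyTorch Geometric backend is installed; the root path is a hypothetical placeholder.

```python
# Minimal sketch: loading ogbg-molpcba with OGB's PyTorch Geometric loader.
from ogb.graphproppred import PygGraphPropPredDataset

dataset = PygGraphPropPredDataset(name="ogbg-molpcba", root="./data")  # hypothetical path
split_idx = dataset.get_idx_split()  # standard train/valid/test scaffold split
graph = dataset[0]                   # one molecule as a PyG Data object
print(graph.num_nodes, graph.edge_index.shape, graph.y.shape)
```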

P

  • Le Petit Prince: A multilingual fMRI corpus using ecological stimuli
  • PACS: image dataset for domain generalization (Photo, Art painting, Cartoon, and Sketch domains)
  • Places365-Standard: Image dataset for visual understanding tasks
  • Places Audio Captions (English & Hindi): approximately 400,000 English and 100,000 Hindi spoken captions for natural images drawn from the Places 205 image dataset
  • PodcastFillers: 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts
  • PROBA_V: Satellite data from 74 hand-selected regions around the globe at different points in time
  • pubmed: Baseline set of MEDLINE/PubMed citation records in XML (see the parsing sketch after this list)
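
For the pubmed entry above, a minimal sketch of iterating over citation records in one baseline file using only the standard library; the filename is a hypothetical example and the element names follow the usual MEDLINE/PubMed baseline schema.

```python
# Minimal sketch: reading MEDLINE/PubMed baseline citation records.
# The filename is a hypothetical example of the gzipped XML baseline files.
import gzip
import xml.etree.ElementTree as ET

with gzip.open("pubmed24n0001.xml.gz", "rt", encoding="utf-8") as f:
    root = ET.parse(f).getroot()  # loads the whole file; fine for a sketch

for article in root.iter("PubmedArticle"):
    pmid = article.findtext(".//PMID")
    title = article.findtext(".//ArticleTitle")
    print(pmid, title)
```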

Q

  • QVHighlights: Dataset for moment retrieval and highlight detection

R

  • Recipe1M(+): A large-scale, structured corpus of over one million cooking recipes and 13 million food images
  • RedPajama-V2: An open dataset with 30 trillion tokens for training large language models

S

  • S2L3A_France_2019: 13 Sentinel-2 L3A mosaics covering France in 2019
  • SegmentAnything-1B: 11M images and 1.1B mask annotations for image segmentation
  • Semantic KITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences
  • SemanticKITTI-C: An evaluation benchmark heading toward robust and reliable 3D semantic segmentation in autonomous driving
  • SEN2VENµS: An open dataset for the super-resolution of Sentinel-2 images, leveraging simultaneous acquisitions with the VENµS satellite
  • ShapeNet: A richly-annotated, large-scale dataset of 3D shapes
  • ShapeStacks: Simulation-based dataset composed of a variety of elementary geometric primitives richly annotated regarding semantics and structural stability
  • SHREC19: A dataset of isometric and non-isometric non-rigid shapes with texture-based ground-truth correspondences
  • SHREC20: A dataset of non-isometric non-rigid shapes with texture-based ground-truth correspondences
  • Shrutilipi: Labelled ASR corpus obtained from All India Radio news bulletins for 12 Indian languages
  • SpokenCOCO: Approximately 600k recordings of human speakers reading the MSCOCO image captions out loud in English
  • SpokenObjectNet: 50k English spoken audio captions for the images in the ObjectNet dataset
  • STL-10: Image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms
  • SUNRGBD: Dataset captured by four different sensors containing 10K densely annotated RGB-D images
  • Surreal Dataset of Meshes: 230,000 human meshes with a large variety of realistic poses and body shapes
  • svo: Subject-Verb-Object (SVO) triples from the NELL (Never-Ending Language Learning) project

T

  • TCGA WSI: Whole Slide Images of The Cancer Genome Atlas
  • TCIA_LIDC-IDRI_20200921: A completed reference database of lung nodules on CT scans (LIDC-IDRI)
  • TEDLIUM: Corpus dedicated to speech recognition in English
  • Thumos14: Challenge dataset for action recognition with a large number of classes
  • TikTok_dataset: Dataset from “Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos”
  • TikTok_Raw_videos: Raw videos from “Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos”
  • tng-project: Results and data from IllustrisTNG simulations of cosmological galaxy formation
  • TOPKIDS: A collection of 3D shapes undergoing within-class deformations
  • TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild

U

  • UCF101: Action videos dataset for action recognition

V

  • V2X-Sim-2: A Comprehensive Synthetic Multi-agent Perception Dataset
  • VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
  • VGGSound: An audio-visual correspondent dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube
  • VideoCC: Dataset containing (video, caption) pairs for training video-text machine learning models
  • Vimeo-90K septuplet: 91,701 7-frame sequences at fixed 448×256 resolution, extracted from 39K selected video clips from Vimeo-90K
  • Visual Genome: Dataset connecting structured image concepts to language
  • ViTT (Video Timeline Tags): Collection of videos annotated with timelines where each video is divided into segments, and each segment is labelled with a short free-text description
  • VOST: Video Object Segmentation under Transformations
  • VoxCeleb1-2: Audio-visual dataset of human speech
  • VoxLingua107: Speech dataset for training spoken language identification models, containing data extracted from YouTube videos and labeled according to the language of the video title and description, for 107 languages
  • VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation
  • VoxTube: A multilingual speaker recognition dataset collected from CC BY 4.0 YouTube videos

W

  • WebVid-2M: Large-scale dataset of short videos (2M+) with textual descriptions sourced from the web
  • WebVid-10M: Large-scale dataset of short videos (10M+) with textual descriptions sourced from the web
  • WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition (original format + WAV 16khz mono format available)
  • WIT: Large multimodal multilingual dataset composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages
  • Wikimedia Dumps: A subset of the dumps from the Wikimedia wikis, currently available: frwiki-20220901 and wikidatawiki-entities-20220912
  • wmt: A machine translation dataset composed from a collection of various sources
  • WorldStrat: Nearly 10,000 km² of free high-resolution and matched low-resolution satellite imagery

X

Y

  • YCB-Videos: A large-scale video dataset for 6D object pose estimation
  • YFCC100M: Multimedia collection containing the metadata of around 99.2 million photos and 0.8 million videos from Flickr
  • YouTube-VOS: Dataset for the 4th Large-scale Video Object Segmentation Challenge, Workshop in conjunction with CVPR 2022
  • YT-Temporal-180M: Large and diverse dataset of 6 million videos covering a wide range of topics
  • YT-Temporal-1B: Dataset of 20 million videos, spanning over a billion frames

Z

— List updated on February 26th 2024 —