A
- ABC-Dataset: A collection of one million Computer-Aided Design (CAD) models for research of geometric deep learning methods and applications
- ActivityNet 200: A Large-Scale Video Benchmark for Human Activity Understanding
- ADI17 (Arabic Dialect Identification): 3,000 hours of Arabic dialect speech data from 17 countries in the Arab world, collected from YouTube
- AirbusShipDetection: Dataset of satellite images (RGB) + associated bounding boxes for ships (csv), from a past Kaggle competition
- AISHELL-4: A free Mandarin multi-channel meeting speech corpus
- AliMeeting: A free Mandarin multi-channel meeting speech corpus, provided by Alibaba Group
- ArabicSpeech: Phonetic texts and sounds for the Arabic language
- Argoverse2: Next Generation Datasets for Self-Driving Perception and Forecasting
- AudioCaps: A large-scale dataset of about 46K audio clips paired with human-written text captions, collected via crowdsourcing on the AudioSet dataset
- AudioSet: An audio event dataset consisting of over 2M human-annotated 10-second video clips
B
C
- c4: A colossal, cleaned version of Common Crawl's web crawl corpus
- Caltech-256: Caltech-256 image set for designing machine vision systems, with applications to science, consumer products, entertainment, manufacturing and defense
- CAMELS Sims IllustrisTNG LH 0-499: Subset of the Cosmology and Astrophysics with MachinE Learning Simulations dataset
- CAMELYON: Whole-slide images (WSI) of hematoxylin and eosin (H&E) stained lymph node sections
- CC-100: Monolingual data for 100+ languages from Web Crawl Data
- CHARADES: Unstructured video activity recognition and common sense reasoning for daily human activities
- CIFAR-10: Corpus of 60,000 32×32 colour images in 10 classes
- ClimSim_low-res: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate simulators.
- Clotho_v2: Audio captioning dataset with 6974 audio samples, each with five captions
- CN-Celeb: A large-scale Chinese speaker recognition dataset collected “in the wild”
- CO3D: Common Objects in 3D (CO3D) is a dataset designed for learning category-specific 3D reconstruction and new-view synthesis using multi-view images of common object categories
- COCO: Common Objects in Context (COCO) dataset for object detection, segmentation, and captioning
- COCO-Stuff: Augmented version of the COCO dataset with pixel-level stuff annotations for scene understanding tasks
- coco_minitrain_25k: COCO minitrain is a curated mini training set (25K images ≈ 20% of train2017) for COCO
- COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
- Collages: Image dataset for a binary classification task (based on MNIST and CIFAR)
- Common Crawl: A corpus of web crawl data composed of billions of web pages
- Common Voice (4, 6.1, 7.0, 8.0, 10.0, 15.0, 16.1): An open source, multi-language dataset of voices that anyone can use to train speech-enabled applications
- Conceptual Captions: Dataset consisting of ~3.3M images representing a wide variety of styles, annotated with captions
- Condensed Movies: A large-scale video dataset, featuring clips from movies with detailed captions
- criteo1tb: Feature values and click feedback for millions of display ads
- CRVD (Captured Raw Video Denoising): 55 groups of noisy-clean videos with ISO values ranging from 1600 to 25600
- CVSS: A massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English
D
- DeepLesion: Dataset of CT images to help the scientific community improve detection accuracy of lesions
- DiDeMo (Distinct Describable Moments): One of the largest and most diverse datasets for the temporal localization of events in videos given natural language descriptions
- DIV2K: Collection of 1,000 2K high-resolution images
- domain_net: Dataset of common objects in six different domains (Clipart, Infograph, Painting, Quickdraw, Real and Sketch)
- DynamicEarthNet: Daily Multi-Spectral Satellite Dataset for Semantic Change Segmentation
E
- EPIC-KITCHENS: Extended dataset in first-person (egocentric) vision, multi-faceted non-scripted recordings in native environments
- EPIC_KITCHENS_2018: Dataset in first-person (egocentric) vision, multi-faceted non-scripted recordings in native environments
- Europarl-ST: A Multilingual Speech Translation Corpus of paired audio-text samples constructed using the debates carried out in the European Parliament
F
- fastText: Dataset of pre-trained word vectors for text representations and text classifiers
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
- FLEURS: An n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark
- Flickr2K: Large collection of 2K high-resolution images
- Flickr8kAudio: 40,000 spoken captions of 8,000 natural images
- Flickr30k: 30k image caption corpus (31,783 images)
- FlickrFace: Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces
G
- GigaSpeech: A multi-domain English speech recognition corpus
- GLDv2: Google Landmarks Dataset v2: 5 million annotated images with labels representing human-made and natural landmarks, usable for landmark recognition and retrieval experiments
- GLUE: The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems
- Google Scanned Objects: Dataset of common household objects that have been 3D scanned for use in robotic simulation and synthetic perception research
- GOT-10k: Generic Object Tracking Benchmark: A large, high-diversity, one-shot database for generic object tracking in the wild
- GTA5 Dataset: Synthetic images from the open-world video game Grand Theft Auto 5
H
- HDR+: Dataset consisting of 3640 bursts (made up of 28461 images in total) captured using a variety of mobile cameras
- HO-3D (v2): Dataset with 3D pose annotations for hand and object under severe occlusions from each other
- howto100m: Dataset of narrated videos (instructional videos)
- howto100m_s3d_features: Pre-trained S3D features for HowTo100M dataset
- HuggingFace: Hundreds of Hugging Face datasets, not listed here
- Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding
I
- IGN: Departmental orthophotograph datasets (5m and 50cm resolution) from the French National Institute of Geographic and Forest Information
- IIIT-HWS: Synthetic handwritten word images
- imagenet: Image database for visual object recognition
- ImageNet21K (Winter 2021 release): Extended version of ImageNet
- ImageNet-C: 5 levels of severity for 19 different corruption types
- IndicSUPERB - Kathbath: 1,684 hours of labelled speech data across 12 Indian languages
- IWSLT2022: 17 hours of Tamasheq audio data aligned to French translations, plus unlabeled raw audio data in 5 languages spoken in Niger: French (116 hours), Fulfulde (114 hours), Hausa (105 hours), Tamasheq (234 hours) and Zarma (100 hours)
J
- JTubeSpeech: Corpus of Japanese speech collected from YouTube
K
- KAIST Multispectral Pedestrian Detection: 95k manually annotated color-thermal pairs taken from a vehicle
- kinetics: DeepMind Kinetics human action video dataset
- KITTI: Dataset captured from a station wagon for use in mobile robotics and autonomous driving research
L
- LAION-400M: The world’s largest openly available image-text-pair dataset, with 400 million samples
- LaSOT (Large-scale Single Object Tracking): 1,550 manually annotated sequences with more than 3.87 million frames
- Libri-Light: A collection of spoken English audio suitable for training speech recognition systems under limited or no supervision
- LibriMix: Open source dataset for source separation in noisy environments, derived from LibriSpeech signals (clean subset) and WHAM noise
- LibriSpeechAsrCorpus: Large-scale (1000 hours) corpus of read English speech
- LSUN: Image dataset for visual recognition
- LVIS: Dataset for long tail instance segmentation with annotations for over 1000 object categories in 164k images
M
- MAESTRO: A dataset composed of about 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.
- MAGICDATA Mandarin Chinese Conversational Speech Corpus: 180 hours of rich annotated Mandarin spontaneous conversational speech data
- Mapillary: Access street-level imagery and map data from all over the world.
- MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare
- MeteoNet: Meteorological dataset by METEO FRANCE, the French national meteorological service
- MInDS-14: Multilingual dataset for intent detection from spoken data
- MIRFLICKR: Dataset with 1M tagged images collected from the social photography site Flickr
- ML-SUPERB: Data for the MultiLingual Speech processing Universal PERformance Benchmark
- MMF: Modular framework for vision and language multimodal research from Facebook AI Research
- MNIST: Database of handwritten digits
- MOT: Multiple Object Tracking
- MOTSynth-MOTS-CVPR22 (Multiple Object Tracking): Large-scale synthetic dataset for pedestrian detection, segmentation, and tracking in urban scenarios created by exploiting the highly photorealistic video game Grand Theft Auto V
- MSDWILD: Dataset designed for multi-modal speaker diarization and lip-speech synchronization in the wild
- MSR-VTT (MSR Video to Text): Large-scale video benchmark for video understanding, especially the task of translating video to text
- MultilingualLibriSpeech: A large multilingual corpus derived from LibriVox audiobooks
- MultilingualTEDx: A multilingual corpus of TEDx talks for speech recognition and translation
- Multi-Object Datasets: Multi-object scenes with ground-truth segmentation masks; some include generative factors detailing object features
- MultiShapeNet: Videos of novel scenes generated by LFN, PixelNeRF, and SRT on the Multi-ShapeNet dataset
- MuS2: A Benchmark for Sentinel-2 Multi-Image Super-Resolution.
- MUSAN: Dataset of music, speech, and noise
- MuST-C: A multilingual speech translation corpus comprising several hundred hours of audio recordings from English TED Talks
N
- Narratives: fMRI dataset of participants listening to spoken stories
- NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations
- Nuscenes LiDAR-seg: Each point of every lidar point cloud belonging to a keyframe in the nuScenes dataset is annotated with one of 32 possible semantic labels
O
- Objaverse: Massive dataset with 800K+ annotated 3D objects
- Objectron: Collection of short, object-centric video clips, accompanied by AR session metadata that includes camera poses, sparse point clouds and characterization of the planar surfaces in the surrounding environment
- Objects365: Large-scale object detection dataset with 365 object categories over 600K training images
- office_home: Images from four different domains (Art, Clipart, Product and Real-World)
- ogbg: The ogbg-molpcba molecular property prediction dataset; each graph represents a molecule, where nodes are atoms and edges are chemical bonds
- ONCE: Large-scale autonomous driving dataset with 2D&3D object annotations
- OnePose: Dataset with multiple video scans of the same object put in different locations
- OpenImagesV5: Open-source image dataset with annotated bounding boxes, object segmentations and visual relationships
- OpenNEURO: A free and open platform for sharing MRI, MEG, EEG, iEEG, ECoG, ASL, and PET data
- OSCAR: Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
- OWTB: Open-world tracking benchmark
P
- Le Petit Prince: A multilingual fMRI corpus using ecological stimuli
- PACS: image dataset for domain generalization (Photo, Art painting, Cartoon, and Sketch domains)
- Places365-Standard: Image dataset for visual understanding tasks
- Places Audio Captions (English & Hindi): approximately 400,000 English and 100,000 Hindi spoken captions for natural images drawn from the Places 205 image dataset
- PodcastFillers: 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts
- PROBA_V: Satellite data from 74 hand-selected regions around the globe at different points in time
- pubmed: Baseline set of MEDLINE/PubMed citation records in XML
Q
- QVHighlights: Dataset for moment retrieval and highlight detections
R
- Recipe1M(+): A large-scale, structured corpus of over one million cooking recipes and 13 million food images
- RedPajama-V2: An open dataset with 30 trillion tokens for training large language models
S
- S2L3A_France_2019: 13 Sentinel-2 L3A mosaics covering France in 2019
- SegmentAnything-1B: 11M images and 1.1B mask annotations for semantic segmentation
- Semantic KITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences
- SemanticKITTI-C: An evaluation benchmark heading toward robust and reliable 3D semantic segmentation in autonomous driving
- SEN2VENµS: An open dataset for the super-resolution of Sentinel-2 images, leveraging simultaneous acquisitions with the VENµS satellite
- ShapeNet: A richly-annotated, large-scale dataset of 3D shapes
- ShapeStacks: Simulation-based dataset composed of a variety of elementary geometric primitives richly annotated regarding semantics and structural stability
- SHREC19: A dataset of isometric and non-isometric non-rigid shapes with texture-based ground-truth correspondences
- SHREC20: A dataset of non-isometric non-rigid shapes with texture-based ground-truth correspondences
- Shrutilipi: Labelled ASR corpus obtained from All India Radio news bulletins for 12 Indian languages
- SpokenCOCO: Approximately 600k recordings of human speakers reading the MSCOCO image captions out loud in English
- SpokenObjectNet: 50k English spoken audio captions for the images in the ObjectNet dataset
- STL-10: Image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms
- SUNRGBD: Dataset captured by four different sensors containing 10k densely annotated RGB-D images
- Surreal Dataset of Meshes: 230,000 human meshes with a large variety of realistic poses and body shapes
- svo: Subject-Verb-Object (SVO) triples from the NELL (Never-Ending Language Learning) project
T
- TCGA WSI: Whole Slide Images of The Cancer Genome Atlas
- TCIA_LIDC-IDRI_20200921: A completed reference database of lung nodules on CT scans (LIDC-IDRI)
- TEDLIUM: Corpus dedicated to speech recognition in English
- Thumos14: THUMOS Challenge: Action Recognition with a Large Number of Classes
- TikTok_dataset: Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos
- TikTok_Raw_videos: Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos
- tng-project: Results and data from IllustrisTNG simulations of cosmological galaxy formation
- TOPKIDS: A collection of 3D shapes undergoing within-class deformations
- TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild
U
- UCF101: Action videos dataset for action recognition
V
- V2X-Sim-2: A Comprehensive Synthetic Multi-agent Perception Dataset
- VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
- VGGSound: An audio-visual correspondence dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube
- VideoCC: Dataset containing (video, caption) pairs for training video-text machine learning models
- Vimeo-90K septuplet: 91,701 7-frame sequences at fixed resolution 448×256, extracted from 39K selected video clips from Vimeo
- Visual Genome: Dataset to connect structured image concepts to language
- ViTT (Video Timeline Tags): Collection of videos annotated with timelines where each video is divided into segments, and each segment is labelled with a short free-text description
- VOST: Video Object Segmentation under Transformations
- VoxCeleb1-2: Audio-visual dataset of human speech
- VoxLingua107: Speech dataset for training spoken language identification models, containing data extracted from YouTube videos and labeled according to the language of the video title and description, for 107 languages
- VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation
- VoxTube: A multilingual speaker recognition dataset collected from the CC BY 4.0 YouTube videos
W
- WebVid-2M: Large-scale dataset of short videos (2M+) with textual descriptions sourced from the web
- WebVid-10M: Large-scale dataset of short videos (10M+) with textual descriptions sourced from the web.
- WenetSpeech: A 10,000+ Hours Multi-domain Mandarin Corpus for Speech Recognition (original format + WAV 16 kHz mono format available)
- WIT: Large multimodal multilingual dataset composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages
- Wikimedia Dumps: A subset of the dumps from the Wikimedia wikis, currently available: frwiki-20220901 and wikidatawiki-entities-20220912
- wmt: A machine translation dataset composed from a collection of various sources
- WorldStrat: Nearly 10,000 km² of free high-resolution and matched low-resolution satellite imagery
X
Y
- YCB-Videos: A large-scale video dataset for 6D object pose estimation.
- YFCC100M: Multimedia collection containing the metadata of around 99.2 million photos and 0.8 million videos from Flickr
- YouTube-VOS: Dataset for the 4th Large-scale Video Object Segmentation Challenge, Workshop in conjunction with CVPR 2022
- YT-Temporal-180M: Large dataset of 6 million videos covering diverse topics
- YT-Temporal-1B: Dataset of 20 million videos, spanning over a billion frames
Z
— List updated on February 26th 2024 —