A

  • ABC-Dataset: A collection of one million Computer-Aided Design (CAD) models for research on geometric deep learning methods and applications
  • ActivityNet 200: A Large-Scale Video Benchmark for Human Activity Understanding
  • ADI17 (Arabic Dialect Identification): 3,000 hours of Arabic dialect speech data from 17 countries of the Arab world, collected from YouTube
  • AirbusShipDetection: Dataset of RGB satellite images with associated bounding boxes for ships (CSV), from a past Kaggle competition
  • AISHELL-4: A free Mandarin multi-channel meeting speech corpus
  • AliMeeting: A free Mandarin multi-channel meeting speech corpus, provided by Alibaba Group
  • ArabicSpeech: Phonetic texts and sounds for the Arabic language
  • Argoverse2: Next Generation Datasets for Self-Driving Perception and Forecasting
  • AudioSet: An audio event dataset consisting of over 2M human-annotated 10-second video clips
  • AudioCaps: A large-scale dataset of about 46K audio clips paired with human-written captions, collected via crowdsourcing on the AudioSet dataset

B

  • BDD: A Diverse Driving Dataset for Heterogeneous Multitask Learning
  • BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics
  • BOP: Benchmark for 6D Object Pose Estimation
  • BridgeV2: Robotic manipulation behaviors designed to facilitate research in scalable robot learning

C

  • c4: A colossal, cleaned version of Common Crawl's web crawl corpus
  • Caltech-256: Caltech-256 image set for designing machine vision systems, with applications to science, consumer products, entertainment, manufacturing and defense
  • CAMELS Sims IllustrisTNG LH 0-499: Subset of the Cosmology and Astrophysics with MachinE Learning Simulations dataset
  • CAMELYON: Whole-slide images (WSI) of hematoxylin and eosin (H&E) stained lymph node sections
  • CC-100: Monolingual data for 100+ languages from Web Crawl Data
  • CHARADES: Unstructured video activity recognition and common sense reasoning for daily human activities
  • CIFAR-10: Corpus of 60,000 32×32 colour images in 10 classes (see the loading sketch after this list)
  • ClimSim_high-res: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate simulators.
  • ClimSim_low-res: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate simulators.
  • Clotho_v2: Audio captioning dataset with 6974 audio samples, each with five captions
  • CN-Celeb: A large-scale Chinese speaker recognition dataset collected “in the wild”
  • CO3D: Common Objects in 3D (CO3D) is a dataset designed for learning category-specific 3D reconstruction and new-view synthesis using multi-view images of common object categories
  • COCO: Common Objects in Context (COCO) dataset for object detection, segmentation, and captioning
  • COCO-Stuff: Augmented version of the COCO dataset with pixel-level stuff annotations for scene understanding tasks
  • coco_minitrain_25k: COCO minitrain is a curated mini training set (25K images ≈ 20% of train2017) for COCO.
  • COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
  • Collages: Image dataset for a binary classification task (based on MNIST and CIFAR)
  • Conceptual Captions: Dataset consisting of ~3.3M images representing a wide variety of styles annotated with captions
  • Condensed Movies: A large-scale video dataset, featuring clips from movies with detailed captions
  • Common Crawl: A corpus of web crawl data composed of billions of web pages
  • Common Voice (4, 6.1, 7.0, 8.0, 10.0, 15.0, 16.1): An open source, multi-language dataset of voices that anyone can use to train speech-enabled applications
  • criteo1tb: Feature values and click feedback for millions of display ads
  • CRVD (Captured Raw Video Denoising): 55 groups of noisy-clean videos with ISO values ranging from 1600 to 25600
  • CSTR-VCTK: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)
  • CVSS: A massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English
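
A minimal loading sketch for an entry such as CIFAR-10 above, using torchvision (an assumption; any framework works). The "./data" root is a placeholder path, and download=False should be used where the dataset is already provisioned locally:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Convert images to tensors and normalize with commonly used
# per-channel CIFAR-10 statistics.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

# "./data" is a placeholder path; set download=False if a local
# copy of the dataset already exists there.
train_set = datasets.CIFAR10(root="./data", train=True,
                             download=True, transform=transform)
loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([128, 3, 32, 32])
```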

D

  • DeepLesion: Dataset of CT images to help the scientific community improve detection accuracy of lesions
  • DiDeMo (Distinct Describable Moments): One of the largest and most diverse datasets for the temporal localization of events in videos given natural language descriptions
  • DIV2K: Collection of 1,000 high-resolution 2K images
  • domain_net: Dataset of common objects in six different domains (Clipart, Infograph, Painting, Quickdraw, Real and Sketch)
  • DROID: Distributed RObot Interaction Dataset, an in-the-wild robot manipulation dataset
  • DynamicEarthNet: Daily Multi-Spectral Satellite Dataset for Semantic Change Segmentation

E

  • EPIC-KITCHENS: Extended dataset in first-person (egocentric) vision, multi-faceted non-scripted recordings in native environments
  • EPIC_KITCHENS_2018: Dataset in first-person (egocentric) vision, multi-faceted non-scripted recordings in native environments
  • Europarl-ST: A Multilingual Speech Translation Corpus of paired audio-text samples constructed using the debates carried out in the European Parliament

F

  • fastText: Dataset of pre-trained word vectors for text representations and text classifiers
  • FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
  • FLEURS: An n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark
  • Flickr2K: Large collection of high-resolution 2K images
  • Flickr8kAudio: 40,000 spoken captions of 8,000 natural images
  • Flickr30k: 30k Image Caption Corpus (31783 images).
  • FlickrFace: Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces

G

  • GigaSpeech: A multi-domain English speech recognition corpus
  • GLDv2: Google Landmarks Dataset v2, 5 million annotated images with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments.
  • GLUE: The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
  • Google Scanned Objects: Dataset of common household objects that have been 3D scanned for use in robotic simulation and synthetic perception research
  • GOT-10k: Generic Object Tracking Benchmark: A large, high-diversity, one-shot database for generic object tracking in the wild
  • GranDf_HA_images: GranD-f Grounded Conversation Generation high-quality human-annotated dataset
  • GTA5 Dataset: Synthetic images from the open-world video game Grand Theft Auto 5

H

  • HDR+: Dataset consisting of 3640 bursts (made up of 28461 images in total) captured using a variety of mobile cameras
  • HO-3D (v2): Dataset with 3D pose annotations for hand and object under severe occlusions from each other
  • howto100m: Dataset of narrated videos (instructional videos)
  • howto100m_s3d_features: Pre-trained S3D features for HowTo100M dataset
  • HuggingFace: Hundreds of Hugging Face datasets, not listed here (see the loading sketch after this list)
  • Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding
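
Since the catalog also exposes hundreds of Hugging Face datasets, here is a minimal access sketch using the datasets library; "imdb" is only an illustrative dataset name, and the streaming variant avoids downloading a large corpus up front:

```python
from datasets import load_dataset

# "imdb" is an illustrative name; substitute any dataset from the
# Hugging Face Hub (or a local path to a provisioned copy).
ds = load_dataset("imdb", split="train")
print(ds[0])        # first example as a dict of fields
print(ds.features)  # dataset schema

# Streaming variant for very large corpora: examples are fetched
# lazily instead of being downloaded and cached in full.
stream = load_dataset("imdb", split="train", streaming=True)
print(next(iter(stream)))
```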

I

  • IGN: Departmental orthophotograph datasets (5 m and 50 cm resolution) of the French National Institute of Geographic and Forest Information
  • IIIT-HWS: Synthetic handwritten word images
  • imagenet: Image database for visual object recognition (see the loading sketch after this list)
  • ImageNet21K (Winter 2021 release): Extended version of ImageNet
  • ImageNet-C: 5 levels of severity for 19 different corruption types
  • IndicSUPERB - Kathbath: 1,684 hours of labelled speech data across 12 Indian languages
  • IWSLT2022: 17 hours of Tamasheq audio data aligned to French translations, and unlabeled raw audio data in 5 languages spoken in Niger: French (116 hours), Fulfulde (114 hours), Hausa (105 hours), Tamasheq (234 hours) and Zarma (100 hours)
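
For the imagenet entry above, torchvision ships a ready-made wrapper; the sketch below assumes the ILSVRC2012 archives are already present under the hypothetical path /datasets/imagenet, since torchvision does not download ImageNet itself:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Typical ImageNet training-time preprocessing.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "/datasets/imagenet" is a placeholder root; torchvision expects
# the official ILSVRC2012 archives (or an extracted copy) there.
train_set = datasets.ImageNet(root="/datasets/imagenet",
                              split="train", transform=transform)
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)
```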

J

  • JTubeSpeech: Corpus of Japanese speech collected from YouTube

K

  • KAIST Multispectral Pedestrian Detection: 95k manually annotated color-thermal pairs taken from a vehicle
  • kinetics: DeepMind Kinetics human action video dataset
  • KITTI: Dataset captured from a station wagon for use in mobile robotics and autonomous driving research

L

  • LAION-400M: The world’s largest openly available image-text-pair dataset, with 400 million samples
  • LaSOT (Large-scale Single Object Tracking): 1,550 manually annotated sequences with more than 3.87 million frames
  • Libri-Light: A collection of spoken English audio suitable for training speech recognition systems under limited or no supervision
  • LibriCSS: A real recorded dataset derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones
  • LibriMix: An open-source corpus for speech source separation, with mixtures generated from LibriSpeech utterances and WHAM! noise
  • LibriSpeechAsrCorpus: Large-scale (1000 hours) corpus of read English speech (see the loading sketch after this list)
  • LibriTTS: Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus
  • LSUN: Image dataset for visual recognition
  • LVIS: Dataset for long tail instance segmentation with annotations for over 1000 object categories in 164k images
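
A minimal torchaudio sketch for the LibriSpeechAsrCorpus entry above; "./data" is a placeholder path and the url argument selects the LibriSpeech subset:

```python
import torchaudio

# "./data" is a placeholder path; set download=False if the subset
# is already provisioned locally.
dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="train-clean-100", download=True)

# Each item is (waveform, sample_rate, transcript,
#               speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = dataset[0]
print(waveform.shape, sample_rate, transcript)
```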

M

  • MAESTRO: A dataset composed of about 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.
  • MAGICDATA Mandarin Chinese Conversational Speech Corpus: 180 hours of rich annotated Mandarin spontaneous conversational speech data
  • Mapillary: Street-level imagery and map data from all over the world
  • MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare
  • MegaScenes: An extensive collection of structure-from-motion reconstructions and internet images
  • MeteoNet: Meteorological dataset by METEO FRANCE, the French national meteorological service
  • MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset
  • MInDS-14: Multilingual dataset for intent detection from spoken data
  • MIRFLICKR: Dataset with 1M tagged images collected from the social photography site Flickr
  • ML-SUPERB: Data for the MultiLingual Speech processing Universal PERformance Benchmark
  • MMF: Modular framework for vision and language multimodal research from Facebook AI Research
  • MNIST: Database of handwritten digits
  • MOT: Multiple Object Tracking
  • MOTSynth-MOTS-CVPR22 (Multiple Object Tracking): Large-scale synthetic dataset for pedestrian detection, segmentation, and tracking in urban scenarios, created by exploiting the highly photorealistic video game Grand Theft Auto V
  • MSDWILD: Dataset designed for multi-modal speaker diarization and lip-speech synchronization in the wild
  • MSR-VTT (MSR Video to Text): Large-scale video benchmark for video understanding, especially the task of translating video to text
  • MTG-Jamendo: An open dataset for music auto-tagging, built using music available on Jamendo
  • MultilingualLibriSpeech: A large multilingual corpus derived from LibriVox audiobooks
  • MultilingualTEDx: A multilingual corpus of TEDx talks for speech recognition and translation
  • Multi-Object Datasets: Multi-object scenes with ground-truth segmentation masks; some include generative factors detailing object features
  • MultiShapeNet: Videos of novel scenes created by LFN, PixelNerf, and SRT on the Multi-ShapeNet dataset
  • MuS2: A Benchmark for Sentinel-2 Multi-Image Super-Resolution
  • MUSAN: Dataset of music, speech, and noise
  • MUSDB18: Dataset of 150 full-length music tracks (~10 h total) of different genres, along with their isolated drums, bass, vocals and other stems
  • MuST-C: A multilingual speech translation corpus comprising several hundred hours of audio recordings from English TED Talks

N

  • Narratives: fMRI dataset of participants listening to spoken stories.
  • NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations
  • NOTSOFAR-1 Recorded Meeting Dataset: a collection of 237 meetings recorded across 30 conference rooms with 4-8 attendees, featuring a total of 35 unique speakers
  • Nuscenes LiDAR-seg: Each point of each lidar pointcloud belonging to a keyframe in the nuScenes dataset is annotated with one of 32 possible semantic labels

O

  • Objaverse: Massive Dataset with 800K+ Annotated 3D Objects
  • Objectron (https://www.objectron.dev/): Collection of short, object-centric video clips, which are accompanied by AR session metadata that includes camera poses, sparse point-clouds and characterization of the planar surfaces in the surrounding environment
  • Objects365: Large-scale object detection dataset with 365 object categories over 600K training images
  • office_home: Images from four different domains (Art, Clipart, Product and Real-World)
  • ogbg: The ogbg-molpcba molecular property prediction dataset, where each graph represents a molecule, with atoms as nodes and chemical bonds as edges
  • ONCE: Large-scale autonomous driving dataset with 2D&3D object annotations
  • OnePose: Dataset with multiple video scans of the same object put in different locations
  • OpenImagesV5: Open-source image dataset with annotated bounding boxes, object segmentations and visual relationships
  • OpenNEURO: A free and open platform for sharing MRI, MEG, EEG, iEEG, ECoG, ASL, and PET data
  • OSCAR: Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
  • OWTB: Open-world tracking benchmark

P

  • PACS: Image dataset for domain generalization (Photo, Art painting, Cartoon, and Sketch domains)
  • PASCAL-Part: Additional annotations for PASCAL VOC 2010 providing segmentation masks for each body part of the object
  • Le Petit Prince: A multilingual fMRI corpus using ecological stimuli
  • Places365-Standard: Image dataset for visual understanding tasks
  • Places Audio Captions (English & Hindi): Approximately 400,000 English and 100,000 Hindi spoken captions for natural images drawn from the Places 205 image dataset
  • PodcastFillers: 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts
  • PROBA_V: Satellite data from 74 hand-selected regions around the globe at different points in time
  • pubmed: Baseline set of MEDLINE/PubMed citation records in XML

Q

  • QVHighlights: Dataset for moment retrieval and highlight detection

R

  • RCSB: Protein Data Bank (PDB) archive of 3D structure data for large biological molecules (proteins, DNA, and RNA)
  • RealEstate10K-360P: A large dataset of camera poses corresponding to 10 million frames derived from about 80,000 video clips, gathered from about 10,000 YouTube videos. This dataset has been built on “low resolution” videos (360p).
  • Recipe1M(+): A large-scale, structured corpus of over one million cooking recipes and 13 million food images
  • RedPajama-V2: An open dataset with 30 trillion tokens for training large language models
  • RoboCasa: A large-scale simulation framework for training generally capable robots to perform everyday tasks

S

  • S2L3A_France_2019: 13 Sentinel-2 L3A mosaics covering France in 2019
  • SA-V: 51K videos and 643K masklet annotations for general-purpose object segmentation models from open world videos
  • SBCSAE (Santa Barbara Corpus of Spoken American English): 60 roughly 20-minute stereo recordings of naturally-occurring spoken interactions recorded across the United States
  • SegmentAnything-1B: 11M images and 1.1B mask annotations for semantic segmentation
  • Semantic KITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences
  • SemanticKITTI-C: An evaluation benchmark heading toward robust and reliable 3D semantic segmentation in autonomous driving
  • SEN2VENµS: An open dataset for the super-resolution of Sentinel-2 images, leveraging simultaneous acquisitions with the VENµS satellite
  • ShapeNet: A richly-annotated, large-scale dataset of 3D shapes
  • ShapeStacks: Simulation-based dataset composed of a variety of elementary geometric primitives richly annotated regarding semantics and structural stability
  • SHREC19: A dataset of isometric and non-isometric non-rigid shapes with texture-based ground-truth correspondences
  • SHREC20: A dataset of non-isometric non-rigid shapes with texture-based ground-truth correspondences
  • Shrutilipi: Labelled ASR corpus obtained from All India Radio news bulletins for 12 Indian languages
  • SmolLM-Corpus: Curated collection of high-quality educational and synthetic data designed for training small language models.
  • snapshot-safari-2024-expansion: 4,029,374 images from 15 camera trapping projects in the Snapshot Safari program
  • snapshot-serengeti: 2.65M sequences of camera trap images, totaling 7.1M images, from seasons one through eleven of the Snapshot Serengeti project
  • SpokenCOCO: Approximately 600k recordings of human speakers reading the MSCOCO image captions out loud in English
  • SpokenObjectNet: 50k English spoken audio captions for the images in the ObjectNet dataset
  • STL-10: Image recognition dataset for developing unsupervised feature learning, deep learning and self-taught learning algorithms (see the loading sketch after this list)
  • SUNRGBD: Dataset captured by four different sensors containing 10k densely annotated RGB-D images
  • Surreal Dataset of Meshes: 230,000 human meshes with a large variety of realistic poses and body shapes
  • svo: Subject-Verb-Object (SVO) triples from the NELL (Never-Ending Language Learning) project
  • SynthTriplets18M: Large-scale synthetic dataset for training and evaluating composed image retrieval (CIR) models
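
Because STL-10 above targets unsupervised and self-taught learning, its torchvision wrapper exposes an unlabeled split alongside the labeled ones; a minimal sketch, with "./data" again a placeholder path:

```python
from torchvision import datasets, transforms

# STL-10 offers 'train', 'test', 'unlabeled' and 'train+unlabeled'
# splits; the unlabeled split (100,000 96×96 images) is intended
# for unsupervised pre-training.
unlabeled = datasets.STL10(root="./data", split="unlabeled",
                           download=True, transform=transforms.ToTensor())
labeled = datasets.STL10(root="./data", split="train",
                         download=True, transform=transforms.ToTensor())
print(len(unlabeled), len(labeled))  # 100000 5000
```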

T

  • TCGA WSI: Whole Slide Images of The Cancer Genome Atlas
  • TCIA_LIDC-IDRI_20200921: A completed reference database of lung nodules on CT scans (LIDC-IDRI)
  • TEDLIUM: Corpus dedicated to speech recognition in English
  • Thumos14: Challenge dataset for action recognition with a large number of classes
  • TikTok_dataset: Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos
  • TikTok_Raw_videos: Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos
  • tng-project: Results and data from IllustrisTNG simulations of cosmological galaxy formation
  • TOPKIDS: A collection of 3D shapes undergoing within-class deformations
  • TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild

U

  • UCF101: Action videos dataset for action recognition

V

  • V2X-Sim-2: A Comprehensive Synthetic Multi-agent Perception Dataset
  • VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
  • VGGSound: An audio-visual correspondence dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube
  • VideoCC: Dataset containing (video, caption) pairs for training video-text machine learning models
  • Vimeo-90K septuplet: 91,701 7-frame sequences with fixed resolution 448×256, extracted from 39K selected video clips from Vimeo-90K
  • Virtual KITTI 2: A more photo-realistic and better-featured version of the original Virtual KITTI dataset; it exploits recent improvements of the Unity game engine and provides new data such as stereo images and scene flow
  • Visual Genome: Dataset to connect structured image concepts to language
  • ViTT (Video Timeline Tags): Collection of videos annotated with timelines where each video is divided into segments, and each segment is labelled with a short free-text description
  • VOST: Video Object Segmentation under Transformations
  • VoxCeleb1-2: Audio-visual dataset of human speech
  • VoxLingua107: Speech dataset for training spoken language identification models, containing data extracted from YouTube videos and labeled according to the language of the video title and description, for 107 languages
  • VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation
  • VoxTube: A multilingual speaker recognition dataset collected from the CC BY 4.0 YouTube videos

W

  • WebVid-2M: Large-scale dataset of short videos (2M+) with textual descriptions sourced from the web
  • WebVid-10M: Large-scale dataset of short videos (10M+) with textual descriptions sourced from the web.
  • WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition (original format + WAV 16khz mono format available)
  • Wikimedia Dumps: A subset of the dumps from the Wikimedia wikis, currently available: frwiki-20220901 and wikidatawiki-entities-20220912
  • WIT: Large multimodal multilingual dataset composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages
  • wmt: A machine translation dataset composed from a collection of various sources
  • WorldStrat: Nearly 10,000 km² of free high-resolution and matched low-resolution satellite imagery

X

Y

  • YCB-Videos: A large-scale video dataset for 6D object pose estimation.
  • YFCC100M: Multimedia collection containing the metadata of around 99.2 million photos and 0.8 million videos from Flickr
  • YouTube-VOS: Dataset for the 4th Large-scale Video Object Segmentation Challenge, Workshop in conjunction with CVPR 2022
  • YT-Temporal-180M: Large dataset of 6 million videos covering a wide range of topics
  • YT-Temporal-1B: Dataset of 20 million videos, spanning over a billion frames

Z

— List updated on March 18th 2025 —