
Datasets

Data Formats

To guarantee interoperability, we focus on a limited set of data formats. We aim to support all kinds of data, but for now only tabular data in the ARFF format is fully supported. Support for a much wider range of formats is in progress.

See the ARFF definition. Also check that attribute definitions do not mix spaces and tabs and do not include end-of-line comments.
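The two attribute checks mentioned here can be automated before uploading a file. The following is a minimal sketch, not part of any OpenML tooling: the function name `check_arff_attributes` is hypothetical, and it assumes that `%` starts a comment and that quoted attribute names should be ignored when looking for whitespace or `%` characters.

```python
import re

# Matches single- or double-quoted strings so their contents are ignored.
_QUOTED = re.compile(r"'(?:[^'\\]|\\.)*'|\"(?:[^\"\\]|\\.)*\"")


def check_arff_attributes(text):
    """Return (line_number, message) pairs for suspicious @attribute lines.

    Hypothetical helper for illustration only.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@data"):
            break  # attribute declarations end where the data section begins
        if not lowered.startswith("@attribute"):
            continue
        # Drop quoted names so spaces or '%' inside them are not flagged.
        unquoted = _QUOTED.sub("", stripped)
        if " " in unquoted and "\t" in unquoted:
            problems.append((number, "mixes spaces and tabs"))
        if "%" in unquoted:
            problems.append((number, "has an end-of-line comment"))
    return problems


sample = (
    "@relation demo\n"
    "@attribute a numeric\n"
    "@attribute b\tnumeric % note\n"
    "@attribute\t'c d'\t{x,y}\n"
    "@data\n"
    "1,2,x\n"
)

for line_number, message in check_arff_attributes(sample):
    print(f"line {line_number}: attribute definition {message}")
```

In this sample only the third line is reported, once for each problem. A check like this only catches the two issues named above; it does not validate the file against the full ARFF definition.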

Data repositories

The following public repositories offer additional machine learning datasets. Their data formats vary widely, so using them requires manual selection, parsing, and metadata extraction.

A collection of sources made by different users

Machine learning dataset repositories (mostly already in OpenML)

MS Open datasets:

APIs (mostly defunct):

Time series data:

Deep learning datasets (mostly image data)

Extreme classification:

MLData (down)

AutoWEKA datasets:

Kaggle public datasets

RAMP Challenge datasets

Wolfram data repository

Data.world

Figshare (needs digging, lots of Excel files)

KDNuggets list of data sets (meta-list, lots of stuff here):

Benchmark Data Sets for Highly Imbalanced Binary Classification

Feature Selection Challenge Datasets

BigML's list of 1000+ data sources

Massive list from Data Science Central.

R packages (also see https://github.com/openml/openml-r/issues/185)

UTwente Activity recognition datasets:

Vanderbilt:

Quandl

Microarray data:

Medical data:

Nature.com Scientific data repositories list