

Data Formats

OpenML aims to achieve full data interoperability, meaning that you can load all datasets in a uniform way (a 'universal data loader'). This requires that all datasets are stored in the same data format (or a set of interoperable formats), or at least have a version stored in such a format. After an intensive study, which you can read on our blog, we settled on the Parquet format.

This means that all OpenML datasets can be retrieved in the Parquet format, and they are also stored on our servers in this format. Often you will not even notice this, as the OpenML clients can automatically convert the data into your preferred data structures, which can then be fed directly into machine learning workflows. For example:

import openml
dataset = openml.datasets.get_dataset("Fashion-MNIST")     # Returns the dataset meta-data 
X, y, _, _ = dataset.get_data(dataset_format="dataframe",  # Downloads the data; X is a pandas DataFrame, y a pandas Series
                target=dataset.default_target_attribute)

from sklearn.ensemble import GradientBoostingClassifier         # Using a sklearn model as an example
model = GradientBoostingClassifier(n_estimators=10).fit(X, y)   # Set hyperparameters and train the model 

To guarantee interoperability, we focus on a limited set of data formats.

Tabular data

OpenML has historically focused on tabular data and has extensive support for all kinds of tabular datasets. As explained above, we store all data in the Parquet format. You can upload data from many different data structures, such as Pandas dataframes and R dataframes, after which it will be converted and stored as Parquet. You can also upload datasets as CSV or ARFF files, and we aim to allow direct Parquet uploads soon.
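For example, the Python client can publish a Pandas dataframe directly. The following is a minimal sketch: the exact set of required metadata arguments can vary between openml-python versions, and the field values used here (name, creator, licence, and so on) are placeholders, so check the client documentation before uploading.

import openml
import pandas as pd

# Requires an OpenML account and API key (configured via openml.config.apikey).

# A small tabular dataset as a pandas DataFrame (toy example).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
    "species": pd.Categorical(["setosa", "setosa", "virginica"]),
})

# Describe the dataset; attributes="auto" infers the column types from the dataframe.
my_dataset = openml.datasets.create_dataset(
    name="toy-iris-sample",                       # placeholder name
    description="Three iris-like rows, used only as an upload example.",
    creator="Jane Doe",                           # placeholder
    contributor=None,
    collection_date="2024",
    language="English",
    licence="Public",
    attributes="auto",
    data=df,
    default_target_attribute="species",
    ignore_attribute=None,
    citation="No citation; toy example.",
)

# Publishing uploads the data; OpenML converts and stores it in Parquet server-side.
my_dataset.publish()
print(f"Dataset available at {my_dataset.openml_url}")

While experimenting, you can point the client at the OpenML test server with openml.config.start_using_configuration_for_example(), so that toy uploads do not end up on the production server.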

ARFF legacy

At the moment, some aspects of OpenML still have a dependency on the ARFF format. This dependency will be fully phased out in favor of Parquet.

Image data

OpenML generally supports other data types by requiring a 'header table': a table (stored in Parquet) that lists all data instances together with additional meta-data (e.g. classes, bounding boxes, ...) and references to the data files, such as images (e.g. JPGs), stored in separate folders. See our blog post for details. We will provide more detailed guidelines here as soon as possible.
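To illustrate the idea, the sketch below builds such a header table with pandas and stores it as Parquet alongside an image folder. The column names (image_path, label, the bbox_* columns) and the folder layout are illustrative assumptions, not an official OpenML schema; see the blog post for the current conventions.

import pandas as pd

# One row per instance: a relative reference to the image file plus per-instance meta-data.
# The column names below are illustrative only, not an official OpenML schema.
header = pd.DataFrame({
    "image_path": ["images/000001.jpg", "images/000002.jpg"],   # files kept in a separate folder
    "label": pd.Categorical(["cat", "dog"]),
    # Optional bounding boxes as pixel coordinates (x_min, y_min, x_max, y_max).
    "bbox_x_min": [12, 40],
    "bbox_y_min": [8, 22],
    "bbox_x_max": [180, 210],
    "bbox_y_max": [150, 198],
})

# The header table itself is stored in Parquet (requires pyarrow or fastparquet).
header.to_parquet("header.parquet", index=False)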

Data repositories

This is a list of public dataset repositories that offer additional useful machine learning datasets. These have widely varying data formats, so they require manual selection, parsing and meta-data extraction.

A collection of sources made by different users
Machine learning dataset repositories (mostly already in OpenML)
MS Open datasets
APIs (mostly defunct)
Time series / Geo data
Deep learning datasets (mostly image data)
Extreme classification
MLData (down)
AutoWEKA datasets
Kaggle public datasets
RAMP Challenge datasets
Wolfram data repository
Data.world
Figshare (needs digging, lots of Excel files)
KDNuggets list of data sets (meta-list, lots of stuff here)
Benchmark Data Sets for Highly Imbalanced Binary Classification
Feature Selection Challenge Datasets
BigML's list of 1000+ data sources
Massive list from Data Science Central
R packages (also see https://github.com/openml/openml-r/issues/185)
UTwente Activity recognition datasets
Vanderbilt
Quandl
Microarray data
Medical data
Nature.com Scientific data repositories list