Datasets

Data Formats

OpenML aims to achieve full data interoperability, meaning that you can load all datasets in a uniform way (a 'universal dataloader'). This requires that all datasets are stored in the same data format (or a set of interoperable formats), or at least that a version of each dataset is stored in that format. After an intensive study, which you can read about on our blog, we settled on the Parquet format.

This means that all OpenML datasets can be retrieved in the Parquet format, which is also how they are stored on our servers. You will often not notice this, because the OpenML clients automatically convert the data into your preferred data structures, ready to be fed directly into machine learning workflows. For example:

import openml
dataset = openml.datasets.get_dataset("Fashion-MNIST")     # Returns the dataset meta-data 
X, y, _, _ = dataset.get_data(dataset_format="dataframe",  # Downloads the data and returns a Pandas dataframe
                target=dataset.default_target_attribute)

from sklearn.ensemble import GradientBoostingClassifier         # Using a sklearn model as an example
model = GradientBoostingClassifier(n_estimators=10).fit(X, y)   # Set hyperparameters and train the model 

To guarantee interoperability, we focus on a limited set of data formats.

Tabular data

OpenML has historically focussed on tabular data, and has extensive support for all kinds of tabular data. As explained above, we store all data in the Parquet format. You can upload data from many different data structures, such as Pandas dataframes and R dataframes, after which they will be converted and stored in Parquet. You can also upload datasets as CSV files or ARFF files, and we aim to allow direct Parquet uploads soon.
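To give a feel for what the conversion to Parquet involves, here is a minimal local sketch using Pandas. It is not OpenML's upload code: the helper names `to_parquet_ready` and `round_trip`, the example columns, and the choice to mark string columns as categoricals are all assumptions made for illustration.

```python
import pandas as pd


def to_parquet_ready(df, categorical_columns):
    """Return a copy of df with the named columns stored as categoricals."""
    return df.astype({col: "category" for col in categorical_columns})


def round_trip(df, path):
    """Write df to a Parquet file and read it back.

    Needs a Parquet engine such as pyarrow or fastparquet to be installed.
    """
    df.to_parquet(path, index=False)
    return pd.read_parquet(path)


# Hypothetical toy data, only to show the dtypes before writing.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
})
ready = to_parquet_ready(df, ["species"])
print(ready.dtypes)
```

With a Parquet engine installed, calling `round_trip(ready, "iris_sample.parquet")` writes the table to disk and loads it again, which is a quick way to check that column types survive the conversion.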

ARFF legacy

At the moment, some aspects of OpenML still depend on the ARFF format. This dependency will be fully phased out in favor of Parquet.

Image data

OpenML generally supports other data types by requiring a 'header table': a table (stored in Parquet) that lists all data instances with additional meta-data (e.g. classes, bounding boxes, ...) and references to data files, such as images (e.g. JPGs), stored in separate folders. See our blog post for details. We will provide more detailed guidelines here as soon as possible.
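As a rough illustration of the header table idea, the sketch below builds one row per image from a folder tree. The layout (one sub-folder per class), the column names `file_path` and `class`, and the helper names are assumptions for this example, not an OpenML specification.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_header_rows(root):
    """List every image under root/<class_name>/ as one header-table row."""
    root = Path(root)
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            rows.append({
                # Reference to the data file, relative to the dataset root.
                "file_path": path.relative_to(root).as_posix(),
                # Assumed convention: the parent folder name is the class.
                "class": path.parent.name,
            })
    return rows


def write_header_table(rows, parquet_path):
    """Store the rows as a Parquet table (needs pandas and a Parquet engine)."""
    import pandas as pd

    pd.DataFrame(rows).to_parquet(parquet_path, index=False)


# Demo on a throwaway folder containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("cat/001.jpg", "cat/002.jpg", "dog/001.jpg"):
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    demo_rows = build_header_rows(tmp)

print(demo_rows)
```

Further meta-data, such as bounding boxes, could be added as extra keys in each row before calling `write_header_table`.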

Data repositories

This is a list of public dataset repositories that offer additional useful machine learning datasets. These have widely varying data formats, so they require manual selection, parsing and meta-data extraction.

A collection of sources made by different users