
Client development

Building clients

You can access OpenML datasets, pipelines, benchmarks, and much more through a range of client APIs. Well-developed clients exist in Python, R, Java, and several other languages. Please see their documentation (in the other tabs) for more guidance on how to contribute to them.

If you want to develop your own client (e.g. for a new language), please check out the following resources:

  • REST API: all endpoints to GET, POST, or DELETE resources (a call sketch follows this list)
  • Metadata Standard: how we describe datasets and all other OpenML resources
  • Minimal standards (below) for client configuration and caching, to make client behavior more uniform across languages
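
For instance, a new client can start from plain HTTP calls. The sketch below assumes the JSON variant of the v1 REST API and uses dataset 2 purely for illustration; the REST API reference above is the authoritative list of endpoints:

```python
import requests

# Base URL of the JSON variant of the OpenML REST API (v1).
BASE = "https://www.openml.org/api/v1/json"

def get_dataset_description(dataset_id, api_key=None):
    """GET the description of a single dataset (a minimal sketch)."""
    params = {"api_key": api_key} if api_key else {}
    response = requests.get(f"{BASE}/data/{dataset_id}", params=params)
    response.raise_for_status()
    # The JSON variant wraps the description in a 'data_set_description' object.
    return response.json()["data_set_description"]

description = get_dataset_description(2)
print(description["name"])
```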

Integrating tools

If you want to integrate OpenML into machine learning and data science tools, it is often easier to build on one of the existing clients, which can usually be used as-is or extended. For instance, see how to extend the Python API to integrate OpenML into Python tools.
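
For example, a Python tool that only needs to fetch data can use the existing openml package directly rather than re-implementing REST calls and caching (a minimal sketch; dataset 2 is illustrative):

```python
import openml

# Fetch a dataset through the existing Python client.
dataset = openml.datasets.get_dataset(2)

# get_data returns the features, the target, a categorical-column mask,
# and the attribute names.
X, y, categorical, names = dataset.get_data(
    target=dataset.default_target_attribute
)
print(dataset.name, X.shape)
```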

Minimal standards

Configuration file

The configuration file resides in a directory .openml in the user's home directory and is called config. It consists of key = value pairs separated by newlines. The following keys are defined (a sample file and a parsing sketch follow the list):

  • apikey:
    • required to access the server
  • server:
    • default: http://www.openml.org
  • verbosity:
    • 0: normal output
    • 1: info output
    • 2: debug output
  • cachedir:
    • if not given, defaults to a cache subdirectory of the system's temporary directory (e.g. file.path(tempdir(), "cache") in R)
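
A minimal example file (the API key shown is an illustrative placeholder):

```
apikey = 0123456789abcdef0123456789abcdef
server = http://www.openml.org
verbosity = 1
cachedir = /home/alice/.openml/cache
```

Parsing the file is straightforward. The sketch below (a hypothetical Python helper, not part of any client) skips blank lines and splits each remaining line on the first =:

```python
import os

def read_openml_config(path=None):
    """Read the key = value pairs from ~/.openml/config (a minimal sketch)."""
    path = path or os.path.join(os.path.expanduser("~"), ".openml", "config")
    config = {"server": "http://www.openml.org", "verbosity": "0"}  # documented defaults
    if os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                config[key.strip()] = value.strip()
    return config
```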

Caching

Cache invalidation

All parts of an entity that affect experiments are immutable. The dataset and task entities carry a status flag that tells the user whether they can safely be used.
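
A client can consult this flag before reusing cached entities. A hypothetical helper (the status values in_preparation, active, and deactivated follow OpenML's dataset states):

```python
def is_safe_to_use(description):
    """Return True if a cached dataset or task can safely be used.

    Entities move through states such as in_preparation, active, and
    deactivated; only 'active' entities are safe to build experiments on.
    """
    return description.get("status") == "active"
```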

File structure

Caching should be implemented for

  • datasets
  • tasks
  • splits
  • predictions

and further entities may follow in the future. The cache directory $cache should be specified by the user when invoking the API. The structure of the cache directory should be as follows (a path-building sketch follows the list):

  • One directory for each entity type:
    • $cache/datasets
    • $cache/tasks
    • $cache/runs
  • For every dataset there is a separate directory whose name is the dataset ID, e.g. $cache/datasets/2 for the dataset with OpenML ID 2.
    • The dataset file should be called dataset.pq or dataset.arff
    • Every other file should be named after the API call that was used to obtain it. The XML returned by invoking openml.data.qualities should therefore be called qualities.xml.
  • For every task there is a separate directory whose name is the task ID, e.g. $cache/tasks/1
    • The task file should be called task.xml.
    • The splits accompanying a task are stored in a file datasplits.arff.
  • For every run there is a separate directory whose name is the run ID, e.g. $cache/runs/1
    • The predictions should be called predictions.arff.
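
Put together, every cache path can be derived from the entity type, the entity ID, and a file name. A hypothetical path helper (Python; cache_dir is the user-supplied $cache directory):

```python
import os

def cache_path(cache_dir, entity_type, entity_id, filename):
    """Build a path such as $cache/datasets/2/dataset.arff (a sketch)."""
    return os.path.join(cache_dir, entity_type, str(entity_id), filename)

# Examples matching the layout above:
cache_path("/home/alice/.openml/cache", "datasets", 2, "qualities.xml")
# -> /home/alice/.openml/cache/datasets/2/qualities.xml
cache_path("/home/alice/.openml/cache", "tasks", 1, "datasplits.arff")
# -> /home/alice/.openml/cache/tasks/1/datasplits.arff
cache_path("/home/alice/.openml/cache", "runs", 1, "predictions.arff")
# -> /home/alice/.openml/cache/runs/1/predictions.arff
```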