Client development
Building clients
You can access OpenML datasets, pipelines, benchmarks, and much more through a range of client APIs. Well-developed clients exist in Python, R, Java, and several other languages. Please see their documentation (in the other tabs) for guidance on how to contribute to them.
If you want to develop your own client (e.g. for a new language), please check out the following resources:
- REST API: all endpoints to GET, POST, or DELETE resources
- Metadata Standard: how we describe datasets and all other OpenML resources
- Minimal standards (below) for client configuration and caching, to make client behavior more consistent across languages.
Integrating tools
If you want to integrate OpenML into machine learning and data science tools, it is often easier to build on one of the existing clients, which can usually be used as-is or extended. For instance, see how to extend the Python API to integrate OpenML into Python tools.
Minimal standards
Configuration file
The configuration file resides in a directory .openml in the user's home directory and is called config. It consists of key = value pairs separated by newlines. The following keys are defined (an example file follows the list):
- apikey:
    - required to access the server
- server:
    - default: http://www.openml.org
- verbosity:
    - 0: normal output
    - 1: info output
    - 2: debug output
- cachedir:
    - if not given, will default to file.path(tempdir(), "cache")
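For illustration, a complete config file following this format could look as follows (the API key and paths are placeholders, not real values):

```
apikey = 0123456789abcdef0123456789abcdef
server = http://www.openml.org
verbosity = 0
cachedir = /home/jane/.openml/cache
```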
Caching
Cache invalidation
All parts of an entity that affect experiments are immutable. The dataset and task entities carry a status flag which tells the user whether they can safely be used; a sketch of such a status check follows.
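For illustration, a client could consult this flag before trusting cached files. The sketch below fetches a dataset description from the server and reads its status field; the helper name is our own, and the endpoint path and XML namespace should be verified against the REST API reference linked above:

```python
import xml.etree.ElementTree as ET

import requests

# XML namespace used by OpenML API responses (assumption; check the REST API docs).
OPENML_NS = "{http://openml.org/openml}"

def dataset_status(dataset_id, server="http://www.openml.org"):
    """Fetch the dataset description and return its status flag.

    Typical values are 'active', 'deactivated', and 'in_preparation';
    only 'active' entities can be used safely.
    """
    response = requests.get(f"{server}/api/v1/xml/data/{dataset_id}")
    response.raise_for_status()
    root = ET.fromstring(response.text)
    return root.findtext(f"{OPENML_NS}status")

if dataset_status(2) != "active":
    print("Dataset 2 is not active; do not rely on cached copies of it.")
```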
File structure
Caching should be implemented for
- datasets
- tasks
- splits
- predictions

and further entities might follow in the future. The cache directory $cache should be specified by the user when invoking the API. The structure in the cache directory should be as follows (a sketch that maps entities to these paths is given after the list):
- One directory for each of the following entity types:
    - $cache/datasets
    - $cache/tasks
    - $cache/runs
- For every dataset there is an extra directory whose name is the dataset ID, e.g. $cache/datasets/2 for the dataset with OpenML ID 2.
    - The dataset itself should be called dataset.pq or dataset.arff.
    - Every other file should be named after the API call that was used to obtain it. The XML returned by invoking openml.data.qualities should therefore be called qualities.xml.
- For every task there is an extra directory whose name is the task ID, e.g. $cache/tasks/1.
    - The task file should be called task.xml.
    - The splits accompanying a task are stored in a file called datasplits.arff.
- For every run there is an extra directory whose name is the run ID, e.g. $cache/runs/1.
    - The predictions should be called predictions.arff.
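To make this layout concrete, here is a minimal sketch in Python of how a client might map an entity to its location in the cache. The helper name, signature, and lookup table are illustrative choices, not part of the standard:

```python
from pathlib import Path

# Maps each cached entity type to its subdirectory and canonical file name.
# Splits live inside the directory of the task they accompany.
CACHE_LAYOUT = {
    "dataset": ("datasets", "dataset.arff"),  # or dataset.pq
    "task": ("tasks", "task.xml"),
    "splits": ("tasks", "datasplits.arff"),
    "predictions": ("runs", "predictions.arff"),
}

def cache_path(cache_dir, entity, entity_id):
    """Return the cache location of an entity, e.g.
    cache_path("~/.openml/cache", "dataset", 2)
    -> ~/.openml/cache/datasets/2/dataset.arff
    """
    subdir, filename = CACHE_LAYOUT[entity]
    return Path(cache_dir).expanduser() / subdir / str(entity_id) / filename
```

For example, cache_path(cachedir, "predictions", 1) yields $cache/runs/1/predictions.arff, matching the structure above.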