OpenML concepts
OpenML operates on a number of core concepts that are important to understand:
Datasets
Datasets are self-contained. Tabular datasets consist of a number of rows (instances) and columns (features), including their data types. Other modalities (e.g. images) are included via paths to files stored within the same folder.
Datasets are uniformly formatted (S3 buckets with Parquet tables, JSON metadata, and media files), and the APIs (e.g. in Python) auto-convert and auto-load them in your desired format in a single line of code.
Example: The Iris dataset or the Plankton dataset
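For instance, with the openml Python package you can fetch a dataset and its metadata in a few lines. A minimal sketch, assuming dataset ID 61 (the ID listed for Iris on openml.org):

```python
import openml

# Download (and locally cache) the Iris dataset; the ID is an assumption.
dataset = openml.datasets.get_dataset(61)

# Load the data as a dataframe, split off the default target column.
X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute
)
print(X.shape, y.name)
```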
Tasks
A task consists of a dataset, together with a machine learning task to perform, such as classification or clustering, and an evaluation method. For
supervised tasks, this also specifies the target column in the data.
Example: Classifying different iris species from the other attributes, evaluated using 10-fold cross-validation.
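Tasks can be fetched by ID just like datasets. A sketch assuming task ID 59, which is commonly cited as supervised classification on Iris with 10-fold cross-validation; treat the ID as an assumption:

```python
import openml

task = openml.tasks.get_task(59)  # ID assumed, see above
print(task.task_type)    # e.g. "Supervised Classification"
print(task.target_name)  # the target column defined by the task
```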
Flows
A flow identifies a particular machine learning algorithm (a pipeline or untrained model) from a particular library or framework, such as scikit-learn, PyTorch, or mlr. It contains details about the structure of the model/pipeline, its dependencies (e.g. the library and its version), and a list of settable hyperparameters. In short, it is a serialized description of the algorithm that in many cases can also be deserialized to reinstantiate exactly the same algorithm in a particular library.
Example: scikit-learn's RandomForest or a simple TensorFlow model
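As a sketch of how a flow is derived from library code, the openml-python scikit-learn extension can convert an estimator into a flow description; the exact flow contents depend on your installed library versions:

```python
from sklearn.ensemble import RandomForestClassifier
from openml.extensions.sklearn import SklearnExtension

# Convert a (still untrained) scikit-learn estimator into a flow,
# capturing its structure, dependencies, and hyperparameters.
clf = RandomForestClassifier(n_estimators=100)
flow = SklearnExtension().model_to_flow(clf)
print(flow.name)  # includes the library and class path
```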
Runs
A run is an experiment: it evaluates a particular flow (pipeline/model) with particular hyperparameter settings on a particular task. Depending on the task, it will include certain results, such as model evaluations (e.g. accuracies), model predictions, and other output files (e.g. the trained model).
Example: Classifying Gamma rays with scikit-learn's RandomForest
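Tying the concepts together, a run can be produced locally and optionally published back to OpenML. A minimal sketch, again assuming task ID 59 for illustration:

```python
import openml
from sklearn.ensemble import RandomForestClassifier

task = openml.tasks.get_task(59)  # ID assumed, as above
clf = RandomForestClassifier()

# Trains and evaluates the model locally according to the task's
# evaluation method, producing predictions per fold.
run = openml.runs.run_model_on_task(clf, task)

# run.publish()  # uploading results requires an OpenML account and API key
```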