OpenML is an integration platform for carrying out and comparing machine learning solutions across a broad collection of public datasets and software platforms.
    julia> using OpenML # or using MLJ

    julia> using DataFrames

    julia> OpenML.list_tags()
    300-element Vector{Any}:
     "study_41"
     "uci"
     "study_34"
     "study_37"
     "mythbusting_1"
     "OpenML-CC18"
     "study_99"
     "artificial"
     "BNG"
     "study_16"
     ⋮
     "Earth Science"
     "Social Media"
     "Meteorology"
     "Geography"
     "Language"
     "Computational Universe"
     "History"
     "Culture"
     "Sociology"
Listing all datasets with the "OpenML100" tag which also have n instances and p features, where 100 < n < 1000 and 1 < p < 10:
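A minimal sketch of such a query, assuming `OpenML.list_datasets` accepts the `tag`, `filter` and `output_format` keywords described in its docstring, and that `number_instances` and `number_features` are the relevant data-quality names. The variable names `quality_filter` and `ds` are illustrative only, and the call needs network access to the OpenML server.

```julia
using OpenML      # or `using MLJ`
using DataFrames

# Quality filter for 100 < n < 1000 instances and 1 < p < 10 features.
# The quality names are assumptions; check them against the OpenML server.
quality_filter = "number_instances/100..1000/number_features/1..10"

# Query the OpenML server (network access required) and return the
# listing as a DataFrame instead of the default output format.
ds = OpenML.list_datasets(
    tag = "OpenML100",
    filter = quality_filter,
    output_format = DataFrame,
)
```

A dataset from the resulting listing can then be described and loaded by its integer ID.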
    julia> OpenML.describe_dataset(15)
      Author: Dr. William H. Wolberg, University of Wisconsin
      Source: UCI (https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)), University of Wisconsin (http://pages.cs.wisc.edu/~olvi/uwmp/cancer.html) - 1995
      Please cite: See below, plus UCI (https://archive.ics.uci.edu/ml/citation_policy.html)

      Breast Cancer Wisconsin (Original) Data Set. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The target feature records the prognosis (malignant or benign). Original data available here (ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/)

      Current dataset was adapted to ARFF format from the UCI version. Sample code ID's were removed.

      ! Note that there is also a related Breast Cancer Wisconsin (Diagnosis) Data Set with a different set of features, better known as wdbc (https://www.openml.org/d/1510).

      Relevant Papers
      –––––––––––––––

      W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.

      O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.

      Citation request
      ––––––––––––––––

      This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. If you publish results when using this database, then please include this information in your acknowledgments. Also, please cite one or more of:

        1. O.L. Mangasarian and W.H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

        2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.

        3. O.L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

        4. K.P. Bennett & O.L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

    julia> table = OpenML.load(15)
    Tables.DictColumnTable with 699 rows, 10 columns, and schema:
     :Clump_Thickness        Float64
     :Cell_Size_Uniformity   Float64
     :Cell_Shape_Uniformity  Float64
     :Marginal_Adhesion      Float64
     :Single_Epi_Cell_Size   Float64
     :Bare_Nuclei            Union{Missing, Float64}
     :Bland_Chromatin        Float64
     :Normal_Nucleoli        Float64
     :Mitoses                Float64
     :Class                  CategoricalArrays.CategoricalValue{String, UInt32}
Lists all active OpenML datasets if tag = nothing (the default). To list only datasets with a given tag, choose one of the tags returned by list_tags(). An alternative output_format, e.g. DataFrame, can be chosen if the DataFrames package is loaded.
A filter is a string of <data quality>/<range> or <data quality>/<value> pairs, concatenated using /, such as "number_instances/100..1000/number_features/1..10", where a range is written as <lower>..<upper>.
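For illustration, a filter string of this form can be assembled from quality/value pairs rather than typed by hand. The helper `build_filter` below is hypothetical (it is not part of OpenML.jl), and the quality names `number_features` and `number_instances` are assumed to be valid on the OpenML server.

```julia
# Hypothetical helper, not part of OpenML.jl: join `quality => value` pairs
# into the "<data quality>/<range or value>" form, separated by "/".
function build_filter(specs::Pair...)
    return join(("$(quality)/$(value)" for (quality, value) in specs), "/")
end

# An exact value for one quality and a range (written "lower..upper") for another.
filter_string = build_filter(
    "number_features" => 10,
    "number_instances" => "500..10000",
)

println(filter_string)
```

The resulting string could then be passed as the `filter` keyword of `OpenML.list_datasets`, optionally together with `tag` and `output_format`.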
    using OpenML
    using DataFrames
    table = OpenML.load(61)
    df = DataFrame(table) # transform to a DataFrame
    using ScientificTypes
    df2 = coerce(df, autotype(df)) # coerce to automatically detected scientific types
    peek_table = OpenML.load(61, maxbytes = 1024) # load only the first 1024 bytes of the table