I had been familiar with R, which I used mostly for data analysis and plots. I never really thought of R as a machine learning tool. That was always Python territory for me: scikit-learn, pandas, loading CSVs manually, waiting for downloads. It was a whole process.
Then I came across this GSoC project, mlr3hf, and got genuinely interested. R for machine learning? I had never pictured the two together, and it made me very curious.
So I just started. Opened RStudio, installed the packages and started typing.
Here I learnt that the mlr3oml package connects directly to OpenML and pulls datasets through an API. I was eager to find out how fast and clean it actually was compared to what I was used to.
install.packages("mlr3")
install.packages("mlr3oml")
Honestly this felt very familiar. In Python you do
pip install, here you do install.packages().
Same idea, different syntax. The packages installed without issues
(there was an Rtools warning, but it still worked fine).
Then loading them:
library(mlr3)
library(mlr3oml)
Again, very similar to Python's import. I had worked a lot in Python, so it clicked immediately.
I got to know that otsk() fetches tasks from OpenML. I didn't read too much into it; I just ran it with task ID 61 to see what would happen.
t1 <- otsk(61)
t1
<OMLTask:61>
* Type: Learning Curve
* Data: anneal (id: 1; dim: 898x39)
* Estimation: crossvalidation (id: 14; repeats: 10, folds: 10)
Okay, interesting. It pulled something. I noticed it said “Learning Curve” as the type, but I didn’t think much of it at first. I just moved on and tried to get the target column name.
t1$target_names
Error: Unsupported task type 'Learning Curve'
That stopped me. I expected a column name, got an error instead. So I
went back and looked at what “Learning Curve” actually means as a task
type — and it made sense. A Learning Curve task doesn’t predict one
thing, it measures how model accuracy changes as training data grows.
There is no single target column. So $target_names has
nothing to return.
This was my first real learning moment: otsk() doesn’t only return classification tasks, so I need to check the task type before assuming anything.
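The check I should have done can be sketched like this. I’m assuming here that the OMLTask object exposes the type through a $task_type field matching the “Type:” line in the printed output; treat that field name as my guess, not documented fact:

```r
library(mlr3oml)

t1 <- otsk(61)  # the Learning Curve task from above

# Guard on the task type before assuming a classification layout.
# ($task_type is assumed to mirror the "Type:" line in the print output.)
if (identical(t1$task_type, "Supervised Classification")) {
  print(t1$target_names)
} else {
  message("Skipping task 61: type is '", t1$task_type, "'")
}
```

With that guard, the “Unsupported task type” error never fires; the non-classification task is just skipped.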
I switched to task ID 59 — iris. A classic. I knew this one would be classification.
t2 <- otsk(59)
t2
<OMLTask:59>
* Type: Supervised Classification
* Data: iris (id: 61; dim: 150x5)
* Target: class
* Estimation: crossvalidation (id: 1; repeats: 1, folds: 10)
This time the type was Supervised Classification, and now $target_names and $feature_names worked perfectly.
t2$target_names # "class"
t2$feature_names # sepallength, sepalwidth, petallength, petalwidth
One thing I noticed here: otsk() is fast. It only fetches metadata at this point, not the actual data. It is like getting a book’s description without opening the book. In Python, when I used to load datasets, everything would download upfront and take a lot of time; even the inbuilt ones like iris took a moment. Here it was almost instant, which I really appreciated.
Now I wanted to see the actual rows of data. I thought
$data() would work on the OMLTask object directly.
head(t2$data())
Error in head(t2$data()) : attempt to apply non-function
Tried a few variations:
as.data.frame(t2$data()) # same error
t2$data(rows = 1:6) # same error
None of them worked. I was confused for a moment. Then I realized that t2 at this point is still just an OMLTask object (metadata only). To actually get the data, I needed to convert it into a proper mlr3 task first using as_task().
iris_mlr3 <- as_task(t2)
This is when it actually downloaded the ARFF file from OpenML. And
after this — $data() worked perfectly.
iris_mlr3$data()[1:6, ]
class sepallength sepalwidth petallength petalwidth
1: Iris-setosa 5.1 3.5 1.4 0.2
2: Iris-setosa 4.9 3.0 1.4 0.2
3: Iris-setosa 4.7 3.2 1.3 0.2
This is the moment it clicked for me. There are two separate steps:
otsk() → fetches metadata only
as_task() → fetches the actual data and creates the mlr3 task
One more thing I noticed when I looked at the output. Below each
column name, R shows the datatype: <fctr>,
<num>, <int>. Python doesn’t show
this by default when you print a dataframe. I actually liked this — you
immediately know what type each column is without running a separate
dtype check.
I wanted to try something much larger and messier. I picked the bank-marketing dataset — task ID 14965.
t3 <- otsk(14965)
bank_mlr3 <- as_task(t3)
print(bank_mlr3)
<TaskClassif> (45211x17)
* Target: Class
* Target classes: 1 (positive class, 88%), 2 (12%)
* Features (16):
fct (9): V2, V3, V4, V5, V7, V8, V9, V11, V16
int (7): V1, V6, V10, V12, V13, V14, V15
45,211 rows. I was honestly surprised at how fast it loaded. In Python, loading a 45k-row dataset from a URL involves downloading it, reading it into pandas, and checking dtypes separately. Here it took a few seconds and everything was already structured as a proper task.
The class imbalance immediately caught my eye: 88% class 1, only 12% class 2. I verified it:
table(bank_mlr3$data()$Class)
# 1 2
# 39922 5289
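To see those counts as shares rather than raw totals, base R’s prop.table() does the division in one step. Using the counts from the table above so the snippet stands alone:

```r
# Class counts taken from table(bank_mlr3$data()$Class) above
counts <- c(`1` = 39922, `2` = 5289)

# prop.table() divides each entry by the total (45,211)
round(prop.table(counts), 3)
#     1     2
# 0.883 0.117
```

That reproduces the 88% / 12% split printed in the task summary.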
In Python I would do df.isnull().sum(). In R:
sum(is.na(iris_mlr3$data())) # 0
sum(is.na(bank_mlr3$data())) # 0
is.na() is exactly like isnull() in pandas.
Returns TRUE/FALSE for each cell, then sum() counts the
TRUEs. Same logic, different syntax.
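One small extension: sum(is.na(...)) gives a single total, while pandas’ df.isnull().sum() reports per column. The R equivalent is colSums() on the same logical matrix. A toy data frame (made up for illustration) shows the idea:

```r
# Toy data frame with two missing values
df <- data.frame(
  age    = c(25, NA, 40),
  income = c(50000, 60000, NA)
)

sum(is.na(df))      # total NAs: 2
colSums(is.na(df))  # per column: age = 1, income = 1
```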
The two mistakes I made taught me the most important thing about otsk(): it has a two-step design on purpose.
Step 1: otsk() fetches metadata, which lets you check the task type, features, and target without downloading anything heavy.
Step 2: as_task() only then downloads the actual data.
This is smart design. If you are browsing 50 datasets looking for the right one, you don’t want to download all 50. You check metadata first, then download only what you need.
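That browsing workflow could be sketched like this. The task IDs are just the ones used earlier in this post, and I’m again assuming the $task_type field mirrors the printed “Type:” line:

```r
library(mlr3oml)
library(mlr3)

candidate_ids <- c(61, 59, 14965)  # IDs from earlier in this post

for (id in candidate_ids) {
  t <- otsk(id)  # cheap: metadata only, no data download yet
  if (identical(t$task_type, "Supervised Classification")) {
    cat("Task", id, "-> target:", t$target_names, "\n")
    # as_task(t) would trigger the actual download, only for the keepers
  }
}
```

Only the tasks that pass the metadata check would ever hit the expensive as_task() step.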
Coming from doing ML only in Python, R felt surprisingly approachable. The package system, the syntax, the way data is structured: someone who knows Python can get comfortable here quickly. And for ML workflows specifically, mlr3 felt cleaner and faster than I expected.