When I saw the test was labelled Hard, I hesitated for a second. Hard usually means complicated: long, with lots of things that can go wrong. But I have noticed that once you understand the basics of a tool, nothing stays hard for long. So I started.
The task was to download a dataset from HuggingFace and convert it to
an mlr3 Task using only R — no Python, no reticulate. My first thought
was that HuggingFace is a Python thing. I have used it in Python where
you just call load_dataset() and it works. In R there is no
such shortcut. You have to talk to the HuggingFace API directly. That is
actually what made this interesting.
library(httr)
library(jsonlite)
library(mlr3)
httr handles HTTP requests — it lets R talk to any
website or API. jsonlite reads JSON responses. Both felt
very familiar coming from Python. pip install becomes
install.packages(), import becomes
library(). Same idea, different syntax.
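For completeness, the one-time setup is a single call; all three packages are on CRAN, and loading them per session is what the library() calls above do.

```r
# One-time setup: install all three packages from CRAN.
install.packages(c("httr", "jsonlite", "mlr3"))
```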
I tried fetching the Titanic dataset first since it is the most common classification example everyone starts with.
url <- "https://huggingface.co/api/datasets/titanic"
response <- GET(url)
status_code(response)
[1] 401
401 means not authorized. HuggingFace requires a login token for some
datasets and I did not have one set up. Rather than going through
authentication I looked for a fully public dataset instead. This small
bump taught me something important — not all HuggingFace datasets are
freely accessible without a token. When building mlr3hf,
the function will need to handle this, either by asking for a token or
clearly telling the user which datasets require one.
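As a sketch of that idea, here is one way such a helper could react to a 401: retry with a token read from the environment, and fail with a clear message otherwise. The HF_TOKEN variable name is my own assumption, but the Authorization: Bearer header is how the HuggingFace API expects tokens to be sent.

```r
library(httr)

# Hedged sketch: fetch dataset metadata, retrying with a token on 401.
# HF_TOKEN is a hypothetical environment variable name chosen here.
fetch_dataset_info <- function(dataset_id) {
  url <- paste0("https://huggingface.co/api/datasets/", dataset_id)
  response <- GET(url)
  if (status_code(response) == 401) {
    token <- Sys.getenv("HF_TOKEN")
    if (!nzchar(token)) {
      stop("Dataset '", dataset_id, "' requires authentication. ",
           "Set the HF_TOKEN environment variable to a HuggingFace token.")
    }
    # Retry the same request with the token attached.
    response <- GET(url, add_headers(Authorization = paste("Bearer", token)))
  }
  response
}
```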
I switched to scikit-learn/iris which is fully public
and needs no token. Here I knocked on HuggingFace’s door and asked —
tell me about this dataset.
url <- "https://huggingface.co/api/datasets/scikit-learn/iris"
response <- GET(url)
status_code(response)
[1] 200
200 means success. HuggingFace replied with a JSON response — a text file containing all the dataset’s information. I read it like this.
content <- content(response, "text", encoding = "UTF-8")
info <- fromJSON(content)
names(info)
[1] "_id" "id" "author" "sha"
[5] "lastModified" "private" "gated" "disabled"
[9] "tags" "description" "downloads" "likes"
[13] "cardData" "siblings" "createdAt" "usedStorage"
This is the metadata — same idea as otsk() in the Easy
test. You get information about the dataset before downloading anything.
The key field here is siblings which lists all the actual
files inside the dataset repository.
info$siblings
rfilename
1 .gitattributes
2 Iris.csv
3 README.md
4 database.sqlite
There it is — Iris.csv. I asked HuggingFace what files
are inside this dataset. This is the moment I understood why building a
proper htsk() function matters. Right now I am manually
reading through siblings to find the data file. A good function would do
this automatically: detect which file holds the actual data, handle
formats beyond CSV, and return a clean Task without the user having to
dig through API responses every time.
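That detection step could start as something very small. A minimal sketch, assuming the tabular data ships as a CSV (a real htsk() would also need to handle Parquet, JSON Lines, and multi-file datasets):

```r
# Hedged sketch: pick the data file out of a metadata listing shaped
# like the `info` object above, i.e. with a siblings$rfilename column.
find_data_file <- function(info) {
  files <- info$siblings$rfilename
  csv_files <- files[grepl("\\.csv$", files, ignore.case = TRUE)]
  if (length(csv_files) == 0) {
    stop("No CSV file found in this dataset repository.")
  }
  csv_files[1]
}

# With the iris metadata this would return "Iris.csv".
```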
Now I downloaded just that one file — the actual data.
csv_url <- "https://huggingface.co/datasets/scikit-learn/iris/resolve/main/Iris.csv"
csv_response <- GET(csv_url)
csv_text <- content(csv_response, "text", encoding = "UTF-8")
iris_hf <- read.csv(text = csv_text)
head(iris_hf)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
1 1 5.1 3.5 1.4 0.2 Iris-setosa
2 2 4.9 3.0 1.4 0.2 Iris-setosa
3 3 4.7 3.2 1.3 0.2 Iris-setosa
4 4 4.6 3.1 1.5 0.2 Iris-setosa
5 5 5.0 3.6 1.4 0.2 Iris-setosa
6 6 5.4 3.9 1.7 0.4 Iris-setosa
That was all it took: one API call and read.csv(). I noticed an
Id column, which is just a row number and not a real
feature, so I removed it before creating the task; keeping it would
only feed the model meaningless information.
I cleaned the data and handed it to mlr3.
target = "Species" tells mlr3 which column to predict;
every other column automatically becomes a feature.
iris_hf$Id <- NULL
task_hf <- as_task_classif(iris_hf, target = "Species", id = "iris_huggingface")
print(task_hf)
── <TaskClassif> (150x5) ──────────────────────────────────────────
• Target: Species
• Target classes: Iris-setosa (33%), Iris-versicolor (33%), Iris-virginica (33%)
• Properties: multiclass
• Features (4):
• dbl (4): PetalLengthCm, PetalWidthCm, SepalLengthCm, SepalWidthCm
A HuggingFace dataset, downloaded in pure R, converted to an mlr3
Task. The whole process — knock on the API, find the file, download it,
clean it, hand it to mlr3 — is exactly what htsk() should
automate. Once I understood each step manually, I could clearly see what
the function needs to do.
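Stitched together, those steps suggest a shape for the function. This is a hedged sketch of what htsk() could look like, not its real implementation; the download URL follows the resolve/main scheme used above, and the CSV-only file detection is the same simplification as before.

```r
library(httr)
library(jsonlite)
library(mlr3)

# Hedged sketch of a possible htsk(): metadata -> file -> data -> Task.
htsk <- function(dataset_id, target, data_file = NULL, id = dataset_id) {
  base <- "https://huggingface.co/"
  # 1. Knock on the API: fetch metadata, including the file listing.
  meta_response <- GET(paste0(base, "api/datasets/", dataset_id))
  meta <- fromJSON(content(meta_response, "text", encoding = "UTF-8"))
  # 2. Find the data file (here: the first CSV among the siblings).
  if (is.null(data_file)) {
    files <- meta$siblings$rfilename
    data_file <- files[grepl("\\.csv$", files, ignore.case = TRUE)][1]
    if (is.na(data_file)) stop("No CSV file found in '", dataset_id, "'.")
  }
  # 3. Download and parse that one file.
  csv_url <- paste0(base, "datasets/", dataset_id, "/resolve/main/", data_file)
  data <- read.csv(text = content(GET(csv_url), "text", encoding = "UTF-8"))
  # 4. Hand it to mlr3.
  as_task_classif(data, target = target, id = id)
}

# task <- htsk("scikit-learn/iris", target = "Species")
```

A real version would still need cleaning hooks (the Id column in Iris.csv would slip straight through here) and the token handling that the 401 earlier made necessary.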
The hard part was not the code. It was understanding the steps well enough to see what a function needs to do. Once I had that, it was not hard at all.