When I saw the test was labelled Hard, I hesitated for a second. Hard usually means complicated: long, with lots of things that can go wrong. But I have noticed that once you understand the basics of a tool, nothing stays hard for long. So I started.
The task was to download a dataset from HuggingFace and convert it to
an mlr3 Task using only R — no Python, no reticulate. My first thought
was that HuggingFace is a Python thing. I have used it in Python where
you just call load_dataset() and it works. In R there is no
such shortcut. You have to talk to the HuggingFace API directly. That is
actually what made this interesting.
library(httr)
library(jsonlite)
library(mlr3)
httr handles HTTP requests — it lets R talk to any
website or API. jsonlite reads JSON responses. Both felt
very familiar coming from Python. pip install becomes
install.packages(), import becomes
library(). Same idea, different syntax.
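For completeness, the one-time setup is a single call; all three packages are on CRAN, and loading them per session is what the library() calls above do.

```r
# One-time setup: install all three packages from CRAN.
install.packages(c("httr", "jsonlite", "mlr3"))
```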
I tried fetching the Titanic dataset first since it is the most common classification example everyone starts with.
url <- "https://huggingface.co/api/datasets/titanic"
response <- GET(url)
status_code(response)
[1] 401
401 means not authorized. HuggingFace requires a login token for some
datasets and I did not have one set up. Rather than going through
authentication I looked for a fully public dataset instead. This small
bump taught me something important — not all HuggingFace datasets are
freely accessible without a token. When building mlr3hf,
the function will need to handle this, either by asking for a token or
clearly telling the user which datasets require one.
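As a sketch of that idea, here is one way such a helper could react to a 401: retry with a token read from the environment, and fail with a clear message otherwise. The HF_TOKEN variable name is my own assumption, but the Authorization: Bearer header is how the HuggingFace API expects tokens to be sent.

```r
library(httr)

# Hedged sketch: fetch dataset metadata, retrying with a token on 401.
# HF_TOKEN is a hypothetical environment variable name chosen here.
fetch_dataset_info <- function(dataset_id) {
  url <- paste0("https://huggingface.co/api/datasets/", dataset_id)
  response <- GET(url)
  if (status_code(response) == 401) {
    token <- Sys.getenv("HF_TOKEN")
    if (!nzchar(token)) {
      stop("Dataset '", dataset_id, "' requires authentication. ",
           "Set the HF_TOKEN environment variable to a HuggingFace token.")
    }
    # Retry the same request with the token attached.
    response <- GET(url, add_headers(Authorization = paste("Bearer", token)))
  }
  response
}
```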
I switched to scikit-learn/iris which is fully public
and needs no token. Here I knocked on HuggingFace’s door and asked —
tell me about this dataset.
url <- "https://huggingface.co/api/datasets/scikit-learn/iris"
response <- GET(url)
status_code(response)
[1] 200
200 means success. HuggingFace replied with a JSON response — a text file containing all the dataset’s information. I read it like this.
content <- content(response, "text", encoding = "UTF-8")
info <- fromJSON(content)
names(info)
[1] "_id" "id" "author" "sha"
[5] "lastModified" "private" "gated" "disabled"
[9] "tags" "description" "downloads" "likes"
[13] "cardData" "siblings" "createdAt" "usedStorage"
This is the metadata — same idea as otsk() in the Easy
test. You get information about the dataset before downloading anything.
The key field here is siblings which lists all the actual
files inside the dataset repository.
info$siblings
rfilename
1 .gitattributes
2 Iris.csv
3 README.md
4 database.sqlite
There it is — Iris.csv. I asked HuggingFace what files
are inside this dataset. This is the moment I understood why building a
proper htsk() function matters. Right now I am manually
reading through siblings to find the data file. A good function would do
this automatically: detect which file holds the actual data, handle
formats beyond CSV, and return a clean Task without the user having to
dig through API responses every time.
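That detection step could start as something very small. A minimal sketch, assuming the tabular data ships as a CSV (a real htsk() would also need to handle Parquet, JSON Lines, and multi-file datasets):

```r
# Hedged sketch: pick the data file out of a metadata listing shaped
# like the `info` object above, i.e. with a siblings$rfilename column.
find_data_file <- function(info) {
  files <- info$siblings$rfilename
  csv_files <- files[grepl("\\.csv$", files, ignore.case = TRUE)]
  if (length(csv_files) == 0) {
    stop("No CSV file found in this dataset repository.")
  }
  csv_files[1]
}

# With the iris metadata this would return "Iris.csv".
```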
Now I downloaded just that one file — the actual data.
csv_url <- "https://huggingface.co/datasets/scikit-learn/iris/resolve/main/Iris.csv"
csv_response <- GET(csv_url)
csv_text <- content(csv_response, "text", encoding = "UTF-8")
iris_hf <- read.csv(text = csv_text)
head(iris_hf)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
1 1 5.1 3.5 1.4 0.2 Iris-setosa
2 2 4.9 3.0 1.4 0.2 Iris-setosa
3 3 4.7 3.2 1.3 0.2 Iris-setosa
4 4 4.6 3.1 1.5 0.2 Iris-setosa
5 5 5.0 3.6 1.4 0.2 Iris-setosa
6 6 5.4 3.9 1.7 0.4 Iris-setosa
That was all it took: one API call and read.csv(). I noticed an
Id column, which is just a row number and not a real
feature, so I removed it before creating the task; keeping it would
only feed the model meaningless information.
I cleaned the data and handed it to mlr3.
target = "Species" tells mlr3 which column to predict;
every other column automatically becomes a feature.
iris_hf$Id <- NULL
task_hf <- as_task_classif(iris_hf, target = "Species", id = "iris_huggingface")
print(task_hf)
── <TaskClassif> (150x5) ──────────────────────────────────────────
• Target: Species
• Target classes: Iris-setosa (33%), Iris-versicolor (33%), Iris-virginica (33%)
• Properties: multiclass
• Features (4):
• dbl (4): PetalLengthCm, PetalWidthCm, SepalLengthCm, SepalWidthCm
A HuggingFace dataset, downloaded in pure R, converted to an mlr3
Task. The whole process — knock on the API, find the file, download it,
clean it, hand it to mlr3 — is exactly what htsk() should
automate. Once I understood each step manually, I could clearly see what
the function needs to do.
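Stitched together, those steps suggest a shape for the function. This is a hedged sketch of what htsk() could look like, not its real implementation; the download URL follows the resolve/main scheme used above, and the CSV-only file detection is the same simplification as before.

```r
library(httr)
library(jsonlite)
library(mlr3)

# Hedged sketch of a possible htsk(): metadata -> file -> data -> Task.
htsk <- function(dataset_id, target, data_file = NULL, id = dataset_id) {
  base <- "https://huggingface.co/"
  # 1. Knock on the API: fetch metadata, including the file listing.
  meta_response <- GET(paste0(base, "api/datasets/", dataset_id))
  meta <- fromJSON(content(meta_response, "text", encoding = "UTF-8"))
  # 2. Find the data file (here: the first CSV among the siblings).
  if (is.null(data_file)) {
    files <- meta$siblings$rfilename
    data_file <- files[grepl("\\.csv$", files, ignore.case = TRUE)][1]
    if (is.na(data_file)) stop("No CSV file found in '", dataset_id, "'.")
  }
  # 3. Download and parse that one file.
  csv_url <- paste0(base, "datasets/", dataset_id, "/resolve/main/", data_file)
  data <- read.csv(text = content(GET(csv_url), "text", encoding = "UTF-8"))
  # 4. Hand it to mlr3.
  as_task_classif(data, target = target, id = id)
}

# task <- htsk("scikit-learn/iris", target = "Species")
```

A real version would still need cleaning hooks (the Id column in Iris.csv would slip straight through here) and the token handling that the 401 earlier made necessary.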
The hard part was not the code. It was understanding the steps well enough to see what a function needs to do. Once I had that, it was not hard at all.