Lab 8

Cheese, Gromit!
Author

Visruth and Matt

Published

May 27, 2025

Modified

June 6, 2025

Goal: Scrape information from https://www.cheese.com to obtain a dataset of characteristics about different cheeses, and gain deeper insight into your coding process. 🪤

Part 1: Locate and examine the robots.txt file for this website. Summarize what you learn from it.

The robots.txt file contains only two lines: "User-agent: *" and "Sitemap: https://www.cheese.com/sitemap.xml". Because no Disallow rules are listed, the site places no restrictions on which pages crawlers may visit. Presumably this is because it is a relatively low-traffic website.
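As a quick sketch of how one might verify this from R: readLines() simply fetches and prints the file, and the robotstxt package (if installed) can check whether a given path is allowed. Both calls below are illustrative rather than part of the original workflow.

# Fetch and print the robots.txt file directly
readLines("https://www.cheese.com/robots.txt")

# Optionally, check whether a specific path is allowed for all crawlers
# (assumes the robotstxt package is installed)
robotstxt::paths_allowed("https://www.cheese.com/alphabetical/")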

Part 2: Learn about the html_attr() function from rvest. Describe how this function works with a small example.

The html_attr() function retrieves the value of a named attribute (such as href, class, or src) from each element in a set of selected HTML elements, returning NA for elements that lack that attribute.
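As a minimal sketch with made-up HTML (not from cheese.com), html_attr() pulls the named attribute from each selected element:

library(rvest)

minimal_html('<a href="/gouda/" class="cheese-link">Gouda</a>') |>
  html_elements("a") |>
  html_attr("href")
# returns "/gouda/"

Applied to cheese.com's alphabetical listing, the same function pulls the class attribute from each cheese's image element: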

library(tidyverse)
library(rvest)

read_html("https://www.cheese.com/alphabetical/?per_page=100") |>
  html_nodes(".product-img a img") |>
  html_attr("class")
  [1] "image-exists"  "image-missing" "image-exists"  "image-exists" 
  [5] "image-exists"  "image-exists"  "image-missing" "image-exists" 
  [9] "image-exists"  "image-exists"  "image-exists"  "image-missing"
 [13] "image-missing" "image-exists"  "image-missing" "image-missing"
 [17] "image-exists"  "image-missing" "image-exists"  "image-missing"
 [21] "image-missing" "image-missing" "image-missing" "image-exists" 
 [25] "image-exists"  "image-exists"  "image-missing" "image-missing"
 [29] "image-missing" "image-missing" "image-missing" "image-missing"
 [33] "image-exists"  "image-missing" "image-missing" "image-exists" 
 [37] "image-missing" "image-missing" "image-missing" "image-exists" 
 [41] "image-exists"  "image-exists"  "image-exists"  "image-exists" 
 [45] "image-exists"  "image-exists"  "image-exists"  "image-missing"
 [49] "image-missing" "image-missing" "image-missing" "image-missing"
 [53] "image-missing" "image-missing" "image-missing" "image-exists" 
 [57] "image-missing" "image-missing" "image-exists"  "image-missing"
 [61] "image-missing" "image-missing" "image-missing" "image-missing"
 [65] "image-exists"  "image-missing" "image-exists"  "image-exists" 
 [69] "image-missing" "image-missing" "image-missing" "image-exists" 
 [73] "image-missing" "image-missing" "image-missing" "image-exists" 
 [77] "image-exists"  "image-exists"  "image-missing" "image-missing"
 [81] "image-exists"  "image-exists"  "image-missing" "image-missing"
 [85] "image-exists"  "image-exists"  "image-missing" "image-missing"
 [89] "image-exists"  "image-missing" "image-missing" "image-exists" 
 [93] "image-exists"  "image-missing" "image-missing" "image-missing"
 [97] "image-exists"  "image-missing" "image-missing" "image-missing"

This small example shows how html_attr() can pull the class attribute of each cheese's image element, which encodes whether an image exists for that cheese.

Part 3: (Do this alongside Part 4 below.) I used ChatGPT to start the process of scraping cheese information with the following prompt:

Write R code using the rvest package that allows me to scrape cheese information from cheese.com.

Fully document your process of checking this code. Record any observations you make about where ChatGPT is useful / not useful.

# Load required libraries
library(rvest)
library(dplyr)

# Define the URL
url <- "https://www.cheese.com/alphabetical"

# Read the HTML content from the webpage
webpage <- read_html(url)

# Extract the cheese names and URLs
cheese_data <- webpage |>
  html_nodes(".cheese-item") |>
  html_nodes("a") |>
  html_attr("href") |>
  paste0("https://cheese.com", .)

cheese_names <- webpage |>
  html_nodes(".cheese-item h3") |>
  html_text()

# Create a data frame to store the results
cheese_df <- data.frame(
  Name = cheese_names,
  URL = cheese_data,
  stringsAsFactors = FALSE
)

# Print the data frame
print(cheese_df)

The overall structure of the response is nominally useful, but the LLM appears to be guessing: it cannot access the live page, so the selectors it invents (such as .cheese-item) do not match the site's actual HTML, and it mixes the base pipe |> with the magrittr dot placeholder, which errors when run.

The code is easy to read and transparent for a reader to follow, but it is not very efficient; its verbosity is communicative, though ultimately unnecessary.

Part 4: Obtain the following information for all cheeses in the database: cheese name, URL, and whether the cheese has an image available.

To be kind to the website owners, please add a 1 second pause between page queries. (Note that you can view 100 cheeses at a time.)

parse_page <- function(url, delay = 1) {
  Sys.sleep(delay)

  # cheese product listings
  cheeses <- read_html(url) |>
    html_nodes(".product-item")

  # temp object that's used twice
  temp_cheese_data <- cheeses |>
    html_nodes("h3 a")

  # building tibble
  tibble(
    cheese = temp_cheese_data |> html_text(),
    url = temp_cheese_data |>
      html_attr("href") |>
      (\(slug) glue::glue("https://www.cheese.com{slug}"))(),
    has_image = cheeses |>
      html_nodes(".product-img a img") |>
      html_attr("class") |>
      str_detect("image-exists")
  )
}
# rather than guessing where the listing ends, find the number of pages to parse automatically from the pagination links
num_pages <- "https://www.cheese.com/alphabetical/?per_page=100" |>
  read_html() |>
  html_nodes(".pagination a") |>
  html_text() |>
  last() |>
  as.numeric()
# NOTE: this seems like a simple pattern so there might already be a function which does this

cheeses <- paste0(
  "https://www.cheese.com/alphabetical/?per_page=100&page=",
  1:num_pages
) |>
  map_df(parse_page)
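For reference, purrr (1.0 and later) documents map_df() as superseded in favor of map() followed by list_rbind(); a roughly equivalent sketch is below (the .progress argument is optional and assumes purrr >= 1.0):

# Equivalent iteration with purrr >= 1.0 (sketch, not run here)
page_urls <- paste0(
  "https://www.cheese.com/alphabetical/?per_page=100&page=",
  seq_len(num_pages)
)

cheeses <- page_urls |>
  map(parse_page, .progress = TRUE) |>
  list_rbind()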
library(gt)

cheeses |>
  sample_n(10) |>
  gt() |>
  tab_header(
    title = "10 Random Cheeses"
  ) |>
  cols_label(cheese = "Cheese", url = "URL", has_image = "Image Available?")
10 Random Cheeses

| Cheese | URL | Image Available? |
| --- | --- | --- |
| Yarra Valley White Savourine | https://www.cheese.com/yarra-valley-white-savourine/ | FALSE |
| Truffle Camemvert | https://www.cheese.com/vegan-truffle-camemvert-cheese/ | TRUE |
| Drovers Road | https://www.cheese.com/old-drovers-road/ | TRUE |
| Lincolnshire Poacher | https://www.cheese.com/lincolnshire-poacher/ | TRUE |
| Beenleigh Blue | https://www.cheese.com/beenleigh-blue/ | TRUE |
| Saint Nectaire | https://www.cheese.com/saint-nectaire/ | TRUE |
| Crowdie | https://www.cheese.com/crowdie/ | FALSE |
| Caciotta Al Tartufo | https://www.cheese.com/caciotta-al-tartufo/ | TRUE |
| Reserva - Caramelised Onion & Rioja Cheddar Cheese - Waxed Truckle 200g | https://www.cheese.com/reserva-caramelised-onion-rioja-cheddar-cheese-waxed-truckle-200g/ | TRUE |
| Isle of Man Cracked Peppercorn | https://www.cheese.com/isle-of-man-cracked-peppercorn/ | FALSE |

Part 5: When you go to a particular cheese’s page (like gouda), you’ll see more detailed information about the cheese. For just 10 of the cheeses in the database, obtain the following detailed information: milk, country of origin, family, type, and flavour.

(Just 10 to avoid overtaxing the website! Continue adding a 1 second pause between page queries.)

cheese_list <- c(
  "https://www.cheese.com/paneer/",
  "https://www.cheese.com/chevre-en-marinade/",
  "https://www.cheese.com/chevre-log/",
  "https://www.cheese.com/goat-gouda/",
  "https://www.cheese.com/gotcha-gouda/",
  "https://www.cheese.com/gouda/",
  "https://www.cheese.com/camembert-des-camarades/",
  "https://www.cheese.com/camembert-de-portneuf/",
  "https://www.cheese.com/camembert-de-normandie/",
  "https://www.cheese.com/camembert/"
)

# parses a cheese.com url by returning pertinent information about provided cheese
parse_cheese <- function(url, delay = 1) {
  Sys.sleep(delay)

  raw_info <- read_html(url) |>
    html_nodes(".panel-body ul li") |>
    html_text(trim = TRUE)

  # Extract the fields that follow a "Field: value" pattern
  # (the "Made from ..." milk line is handled separately below)
  raw_info[grepl(":", raw_info)] |>
    str_split_fixed(":", 2) |>
    as_tibble() |>
    rename(field = V1, value = V2) |>
    mutate(
      field = str_trim(tolower(field)),
      value = str_trim(value)
    ) |>
    filter(field %in% c("country of origin", "family", "type", "flavour")) |>
    pivot_wider(
      names_from = field,
      values_from = value
    ) |>
    mutate(
      Milk = str_remove(raw_info[grepl("^Made from", raw_info)], "^Made from "),
      .before = `country of origin`
    )
}
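To make the colon-splitting step concrete, here is a small sketch using made-up strings in the same "Field: value" shape the site uses (the strings are illustrative, not scraped):

library(stringr)
library(tibble)

raw_info <- c(
  "Made from pasteurized cow's milk",   # no colon, so dropped here; handled separately in parse_cheese()
  "Country of origin: Netherlands",
  "Type: semi-hard, artisan"
)

# keep only the "Field: value" lines and split each on its first colon
split <- str_split_fixed(raw_info[grepl(":", raw_info)], ":", 2)

tibble(
  field = str_trim(tolower(split[, 1])),
  value = str_trim(split[, 2])
)
# expected: a 2-row tibble with fields "country of origin" and "type"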

cheese_details <- map_dfr(cheese_list, parse_cheese)
cheese_details |>
  mutate(
    Cheese = str_to_title(
      basename(cheese_list) |> str_replace_all("-", " ")
    ),
    .before = Milk
  ) |>
  gt() |>
  cols_label(
    `country of origin` = "Country",
    family = "Family",
    type = "Type",
    flavour = "Flavour"
  ) |>
  tab_header(
    title = "Details on 10 Cheeses"
  )
Details on 10 Cheeses

| Cheese | Milk | Country | Family | Type | Flavour |
| --- | --- | --- | --- | --- | --- |
| Paneer | pasteurized cow's or water buffalo's milk | Bangladesh and India | Cottage | fresh firm | milky |
| Chevre En Marinade | pasteurized goat's milk | United States | NA | semi-soft, artisan | garlicky, herbaceous, spicy |
| Chevre Log | pasteurized goat's milk | France | NA | fresh soft | citrusy |
| Goat Gouda | pasteurized goat's milk | Netherlands | Gouda | semi-hard, artisan | tangy |
| Gotcha Gouda | cow's milk | United States | Gouda | semi-hard, artisan | sweet, tangy |
| Gouda | pasteurized or unpasteurized cow's, goat's or sheep's milk | Netherlands | Gouda | hard | full-flavored |
| Camembert Des Camarades | pasteurized cow's milk | Canada | Camembert | soft, soft-ripened | buttery, nutty, subtle, sweet |
| Camembert De Portneuf | pasteurized cow's milk | Canada | Camembert | soft, soft-ripened | buttery, creamy |
| Camembert De Normandie | unpasteurized cow's milk | France | Camembert | soft, soft-ripened | creamy |
| Camembert | cow's milk | France | Camembert | soft, artisan | sweet |

Part 6: Evaluate the code that you wrote in terms of efficiency. To what extent do your function(s) adhere to the principles for writing good functions? To what extent are your functions efficient? To what extent is your iteration of these functions efficient?

The code is efficient: few intermediate objects are saved, and pipelines are used extensively toward that goal. Leveraging Quarto's design, we further bolster efficiency by selectively caching the output of a few chunks, trading space for time: scraping takes a while because of the number of pages involved, and since the target pages appear to be static, caching lets us spend some memory to speed up execution. Of course, the code still has to run once, but because we cache, we save the two scraped objects and avoid scraping the site again on subsequent renders.
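As a sketch of the chunk-level caching described above (the chunk label is illustrative; cache is the Quarto/knitr execution option we have in mind):

#| label: scrape-listing-pages
#| cache: true

# With caching on, this chunk re-runs only when its code changes;
# otherwise the saved cheeses object is reused instead of re-scraping.
cheeses <- paste0(
  "https://www.cheese.com/alphabetical/?per_page=100&page=",
  1:num_pages
) |>
  map_df(parse_page)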