Goal: Scrape information from https://www.cheese.com to obtain a dataset of characteristics about different cheeses, and gain deeper insight into your coding process. 🪤
Part 1: Locate and examine the robots.txt file for this website. Summarize what you learn from it.
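The robots.txt file has only two lines: "User-agent: *" and "Sitemap: https://www.cheese.com/sitemap.xml". The first line applies to all crawlers, and because no Disallow rules follow it, the site places no restrictions on scraping. The second line points crawlers to the sitemap. Presumably the site is unrestricted because it is relatively low-traffic.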
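As a minimal sketch of the reasoning above, the two lines can be checked programmatically for Disallow rules. The file contents are typed in by hand here rather than downloaded, and the variable names are illustrative only.

```r
# The two lines reported for https://www.cheese.com/robots.txt, entered by hand
robots <- c(
  "User-agent: *",
  "Sitemap: https://www.cheese.com/sitemap.xml"
)

# Any "Disallow:" line would signal a restricted path.
# (A Disallow line with an empty value also means "no restriction";
# this simple check does not distinguish that case.)
disallow_rules <- grep("^\\s*Disallow\\s*:", robots, ignore.case = TRUE, value = TRUE)
has_restrictions <- length(disallow_rules) > 0

# Pull out the sitemap URL, if one is listed
sitemap_lines <- grep("^\\s*Sitemap\\s*:", robots, ignore.case = TRUE, value = TRUE)
sitemap <- sub("^\\s*Sitemap\\s*:\\s*", "", sitemap_lines, ignore.case = TRUE)

has_restrictions
sitemap
```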
Part 2: Learn about the html_attr() function from rvest. Describe how this function works with a small example.
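html_attr() extracts the value of a single named attribute (such as href, src, or class) from each HTML element in a selection and returns the values as a character vector.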
library(tidyverse)
library(rvest)

read_html("https://www.cheese.com/alphabetical/?per_page=100") |>
  html_nodes(".product-img a img") |>
  html_attr("class")
This small example retrieves the class attribute of each cheese's image, which indicates whether an image exists for that cheese.
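To see the behaviour of html_attr() without querying the website, here is a small offline sketch. The HTML snippet and the class names in it are made up for illustration and are not copied from cheese.com. It also shows that an element lacking the requested attribute yields NA.

```r
library(rvest)

# Hypothetical markup standing in for two entries on a listing page
page <- minimal_html('
  <div class="product-img"><a href="/gouda/"><img class="image-exists" src="gouda.jpg"></a></div>
  <div class="product-img"><a href="/paneer/"><img class="image-missing"></a></div>
')

imgs <- page |> html_elements(".product-img a img")

# One value per selected element, returned as a character vector
img_class <- html_attr(imgs, "class")

# The second <img> has no src attribute, so its value is NA
img_src <- html_attr(imgs, "src")

img_class
img_src
```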
Part 3: (Do this alongside Part 4 below.) I used ChatGPT to start the process of scraping cheese information with the following prompt:
Write R code using the rvest package that allows me to scrape cheese information from cheese.com.
Fully document your process of checking this code. Record any observations you make about where ChatGPT is useful / not useful.
# Load required libraries
library(rvest)
library(dplyr)

# Define the URL
url <- "https://www.cheese.com/alphabetical"

# Read the HTML content from the webpage
webpage <- read_html(url)

# Extract the cheese names and URLs
cheese_data <- webpage |>
  html_nodes(".cheese-item") |>
  html_nodes("a") |>
  html_attr("href") |>
  paste0("https://cheese.com", .)

cheese_names <- webpage |>
  html_nodes(".cheese-item h3") |>
  html_text()

# Create a data frame to store the results
cheese_df <- data.frame(
  Name = cheese_names,
  URL = cheese_data,
  stringsAsFactors = FALSE
)

# Print the data frame
print(cheese_df)
The overall structure is a useful starting point, but the LLM cannot access the page and appears to be guessing at its structure, so selectors such as .cheese-item have to be checked against the actual HTML.
The code is easy to read and transparent, but it isn't very efficient. The verbosity is communicative, though ultimately unnecessary.
Part 4: Obtain the following information for all cheeses in the database:
# can't run this on last page--will find number of pages to parse automatically.
num_pages <- "https://www.cheese.com/alphabetical/?per_page=100" |>
  read_html() |>
  html_nodes(".pagination a") |>
  html_text() |>
  last() |>
  as.numeric()

# NOTE: this seems like a simple pattern so there might already be a function which does this
cheeses <- paste0("https://www.cheese.com/alphabetical/?per_page=100&page=", 1:num_pages) |>
  map_df(parse_page)
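The page-count step can be tried offline as follows. This is a sketch under assumptions: the pagination markup is invented, including the trailing "Next" link, and the real page may look different. Taking the largest numeric link text, rather than the last link, is a swapped-in variation that still works if the final link is not a number.

```r
library(rvest)

# Hypothetical pagination block (not copied from cheese.com)
page <- minimal_html('
  <ul class="pagination">
    <li><a href="?page=1">1</a></li>
    <li><a href="?page=2">2</a></li>
    <li><a href="?page=21">21</a></li>
    <li><a href="?page=2">Next</a></li>
  </ul>
')

link_text <- page |>
  html_elements(".pagination a") |>
  html_text(trim = TRUE)

# Non-numeric labels such as "Next" become NA and are ignored by max()
num_pages <- max(suppressWarnings(as.numeric(link_text)), na.rm = TRUE)

# Build one URL per listing page, following the pattern used in the code above
page_urls <- paste0(
  "https://www.cheese.com/alphabetical/?per_page=100&page=",
  seq_len(num_pages)
)

num_pages
head(page_urls, 2)
```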
Part 5: When you go to a particular cheese’s page (like gouda), you’ll see more detailed information about the cheese. For just 10 of the cheeses in the database, obtain the following detailed information:
milk information
country of origin
family
type
flavour
(Just 10 to avoid overtaxing the website! Continue adding a 1 second pause between page queries.)
cheese_list <- c(
  "https://www.cheese.com/paneer/",
  "https://www.cheese.com/chevre-en-marinade/",
  "https://www.cheese.com/chevre-log/",
  "https://www.cheese.com/goat-gouda/",
  "https://www.cheese.com/gotcha-gouda/",
  "https://www.cheese.com/gouda/",
  "https://www.cheese.com/camembert-des-camarades/",
  "https://www.cheese.com/camembert-de-portneuf/",
  "https://www.cheese.com/camembert-de-normandie/",
  "https://www.cheese.com/camembert/"
)

# parses a cheese.com url by returning pertinent information about provided cheese
parse_cheese <- function(url, delay = 1) {
  Sys.sleep(delay)

  raw_info <- read_html(url) |>
    html_nodes(".panel-body ul li") |>
    html_text(trim = TRUE)

  # Extract remaining fields that use a colon
  raw_info[grepl(":", raw_info)] |>
    str_split_fixed(":", 2) |>
    as_tibble() |>
    rename(field = V1, value = V2) |>
    mutate(
      field = str_trim(tolower(field)),
      value = str_trim(value)
    ) |>
    filter(field %in% c("country of origin", "family", "type", "flavour")) |>
    pivot_wider(
      names_from = field,
      values_from = value
    ) |>
    mutate(
      Milk = str_remove(raw_info[grepl("^Made from", raw_info)], "^Made from "),
      .before = `country of origin`
    )
}

cheese_details <- map_dfr(cheese_list, parse_cheese)
|  | Milk | country of origin | family | type | flavour |
|---|---|---|---|---|---|
|  | pasteurized or unpasteurized cow's, goat's or sheep's milk | Netherlands | Gouda | hard | full-flavored |
| Camembert Des Camarades | pasteurized cow's milk | Canada | Camembert | soft, soft-ripened | buttery, nutty, subtle, sweet |
| Camembert De Portneuf | pasteurized cow's milk | Canada | Camembert | soft, soft-ripened | buttery, creamy |
| Camembert De Normandie | unpasteurized cow's milk | France | Camembert | soft, soft-ripened | creamy |
| Camembert | cow's milk | France | Camembert | soft, artisan | sweet |
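The field-splitting logic inside parse_cheese() can be exercised without touching the website. In the sketch below, the raw_info strings are an assumption about what the scraped list items look like, assembled from the values in the Camembert De Normandie row. The "Colour" entry is invented to show the filter at work. Building the tibble from named columns is a small variation that avoids relying on the automatic V1/V2 names.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hypothetical scraped list items for a single cheese
raw_info <- c(
  "Made from unpasteurized cow's milk",
  "Country of origin: France",
  "Family: Camembert",
  "Type: soft, soft-ripened",
  "Flavour: creamy",
  "Colour: white"
)

# Split "field: value" items at the first colon only
parts <- str_split_fixed(raw_info[grepl(":", raw_info)], ":", 2)

details <- tibble(
  field = str_trim(tolower(parts[, 1])),
  value = str_trim(parts[, 2])
) |>
  # Keep only the requested fields, then spread them into one row
  filter(field %in% c("country of origin", "family", "type", "flavour")) |>
  pivot_wider(names_from = field, values_from = value) |>
  # The milk line has no colon, so it is handled separately
  mutate(
    Milk = str_remove(raw_info[grepl("^Made from", raw_info)], "^Made from "),
    .before = `country of origin`
  )

details
```

One design note: if a page had no "Made from" item, the Milk expression would have length zero and mutate() would fail, so a production version would probably need a fallback to NA for that case.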
Part 6: Evaluate the code that you wrote in terms of efficiency. To what extent do your function(s) adhere to the principles for writing good functions? To what extent are your functions efficient? To what extent is your iteration of these functions efficient?
The code is efficient. Few intermediate objects are saved, and pipelines are used extensively to keep it that way. We also take advantage of Quarto's design by selectively caching the output of a few cells, trading storage space for time: scraping is slow because of the number of pages involved, so we elected to cache those outputs. The code still has to run once, but since the target pages seem to be static, spending some space lets later renders skip the scraping step. Because we are caching, we save two objects so that the site does not need to be scraped again.