In my recent posts on R, I discussed the validation of sitemap content and website crawling operations. In this article, I will consolidate my previous posts and re-examine the topic from a different perspective.
A sitemap (site map), as I mentioned in my earlier article titled Google Search Console Site Map, can include all internal site links, as well as specific content types and/or contexts such as images and videos.
In the example below, different sitemaps will be created based on the definitions found in the internal site links; I'm clarifying this up front to avoid any confusion.
Rcrawler-Based Website Crawling
In my previous article titled R-based Website Crawling and Data Extraction, I provided a detailed explanation of the Rcrawler package and performed an example crawling operation[^2]. In this article, I will again use the Rcrawler package and crawl a similar website. Subsequently, I will generate sitemaps using the URL data I have collected. For the example process, I based my approach on the article titled “How to Create an XML Sitemap with R”, shared by doparank[^1].
In the example process, I will mainly use the Rcrawler, dplyr, and stringr packages.
library(Rcrawler)
library(dplyr)
library(stringr)
CustomXPaths <- c("//link[@rel='canonical']/@href",
                  "//meta[@name='robots']/@content")
CustomLabels <- c("link_canonical",
                  "meta_robots")
CustomXPaths specifies the page elements we want to extract during the crawl, and CustomLabels assigns a heading to each of them. If your page types are appended to the body tag or served as metadata via a CMS, including those elements in CustomXPaths will be highly beneficial. Otherwise, you'll have to parse the URLs based on the patterns they contain. If the URL itself does not carry a definition that distinguishes content types (such as post, page, listing, product, etc.), the process becomes somewhat more complex.
https://domain.com/product/product-name
https://domain.com/post/post-name
https://domain.com/product-name
https://domain.com/post-name
As shown in the examples above, the first two URL patterns will be significantly easier to parse, since the content type is declared in the first path segment; a quick sketch of this follows below, after which we can specify our working directory and begin the crawl process.
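As a minimal illustration (the example URLs below are hypothetical and the classification rules are my own, assuming the stringr and dplyr packages loaded above), the prefixed patterns can be classified directly from the path:
example_urls <- c("https://domain.com/product/product-name",
                  "https://domain.com/post/post-name",
                  "https://domain.com/product-name")
# Classify by the path prefix; URLs without a type prefix need additional rules
case_when(str_detect(example_urls, "/product/") ~ "Products",
          str_detect(example_urls, "/post/") ~ "Posts",
          TRUE ~ "Unknown")
The same idea is applied more thoroughly in the classification step further below, using the URLs collected by the crawl.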
setwd("~/Desktop")
Rcrawler(Website = "https://domain.com/",
         ExtractXpathPat = CustomXPaths,
         PatternsNames = CustomLabels)
saveRDS(DATA, file="DATA.rds")
saveRDS(INDEX, file="INDEX.rds")
URL Classification and Sitemap Generation Process
The outcome of the crawl process will vary depending on the number of pages present on the website. Once this process is complete, you can save the resulting DATA and INDEX lists for future reuse.
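In a later session, the saved objects can be loaded back without re-crawling; the file names below match the saveRDS calls above.
DATA <- readRDS("DATA.rds")
INDEX <- readRDS("INDEX.rds")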
Now we can combine these two tables, adjust their data types, and proceed to the next step.
mergedCrawl <- cbind(INDEX, data.frame(do.call(rbind, DATA)))
mergedCrawl$Id <- as.integer(mergedCrawl$Id)
Indexable_pages <- mergedCrawl %>%
  mutate(Canonical_Indexability = ifelse(Url == link_canonical | is.na(link_canonical), TRUE, FALSE)) %>%
  mutate(Indexation = ifelse(grepl("NOINDEX|noindex", meta_robots), FALSE, TRUE)) %>%
  filter(Canonical_Indexability == TRUE & Indexation == TRUE)
With the above process, we created two new columns: Canonical_Indexability and Indexation. The first compares the Url and link_canonical fields to verify that the page is self-canonical (or has no canonical definition at all), and the second checks the meta robots content for a noindex directive. Having both columns set to TRUE is sufficient for the URL to appear in the sitemap.
Now we can parse URLs based on their content[^4].
Sitemaps <- Indexable_pages %>%
  filter(`Http Resp` == '200' & `Content Type` == 'text/html') %>%
  select(Url) %>%
  mutate(Content_type =
           ifelse(str_detect(Url, "/category|tag/"), "Taxonomy",
           ifelse(str_detect(Url, "/list|listings/"), "Listing",
           ifelse(str_detect(Url, "/shop|contact/"), "Pages",
           ifelse(str_detect(Url, "/locations"), "Locations",
           ifelse(str_detect(Url, "/product"), "Products", "Posts")))))) %>%
  group_by(Content_type) %>%
  unique() %>%
  arrange(Content_type)
Sitemaps$Content_type <- as.factor(Sitemaps$Content_type)
At the end of this process, we have created an additional column named Content_type containing the content types we have defined. We can specify the data type of this column as factor.
Our next step is to separate the pages based on the Content_type we have identified.
Sitemap_taxonomy <- Sitemaps %>% filter(Content_type == "Taxonomy")
Sitemap_listing <- Sitemaps %>% filter(Content_type == "Listing")
Sitemap_pages <- Sitemaps %>% filter(Content_type == "Pages")
Sitemap_locations <- Sitemaps %>% filter(Content_type == "Locations")
Sitemap_products <- Sitemaps %>% filter(Content_type == "Products")
Sitemap_posts <- Sitemaps %>% filter(Content_type == "Posts")
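As an alternative to repeating filter() for each content type, the same separation can be done in a single step; this split() variant is my own addition for illustration, not part of the original flow.
# A named list with one data frame per content type, e.g. Sitemap_list$Posts
Sitemap_list <- split(Sitemaps, Sitemaps$Content_type)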
Let’s take a look at how many URLs belong to each content type.
Sitemaps %>%
  group_by(Content_type) %>%
  summarise(no_rows = n())
After this stage, various approaches can be taken. If you wish to re-query the relevant URLs and check header contents such as last-modified, you can issue requests using packages like httr[^5] or curl, and proceed accordingly. Alternatively, you can simply assign any desired historical lastmod value without performing any additional requests.
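For example, here is a minimal sketch of the header-check idea with httr (the get_lastmod helper is hypothetical; not every server returns a Last-Modified header, so a fallback value is needed, and the strptime call assumes an English locale for the weekday abbreviation):
library(httr)
# Return a URL's Last-Modified header as YYYY-MM-DD, falling back to today's date
get_lastmod <- function(url) {
  resp <- HEAD(url)
  lm <- headers(resp)[["last-modified"]]
  if (is.null(lm)) return(format(Sys.Date(), "%Y-%m-%d"))
  # HTTP dates look like "Tue, 15 Nov 1994 08:12:31 GMT"
  format(as.Date(strptime(lm, format = "%a, %d %b %Y %H:%M:%S", tz = "UTC")), "%Y-%m-%d")
}
The createSitemap function below follows a similar idea, but it issues a full GET request and uses the response's Date header rather than Last-Modified.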
createSitemap <- function(links = list(), fileName = format(Sys.time(), "%y-%m-%d_%H-%M-%S")) {
  require(whisker)
  require(httr)
  cat("Please wait...", "\n")
  cat("Total Link: ", length(links), "\n")
  template <- '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{{#links}}
<url>
<loc>{{{loc}}}</loc>
<lastmod>{{{lastmod}}}</lastmod>
<changefreq>{{{changefreq}}}</changefreq>
<priority>{{{priority}}}</priority>
</url>
{{/links}}
</urlset>'
  map_links <- function(url) {
    tmp <- GET(url)
    # https://www.stat.berkeley.edu/~s133/dates.html
    date <- format(as.Date(strptime(tmp$headers$date, format = '%a, %d %b %Y %H:%M:%S', tz = "UTC")), "%Y-%m-%d")
    sys_date <- format(Sys.time(), "%Y-%m-%d")
    list(loc = url,
         lastmod = ifelse(!is.na(date), date, sys_date),
         changefreq = "monthly",
         priority = "0.8")
  }
  links <- lapply(links, map_links)
  cat(whisker.render(template, data = list(links = links)), file = paste(fileName, ".xml", sep = ""))
}
As can be seen, the createSitemap function utilizes various functions from the httr and whisker[^3] packages; you may also refer to my previous article titled Usage of Mustache Template System in R for more details on whisker. Each URL is requested via GET, the placeholder values within the XML template are populated with the collected data using whisker.render, and the output is then saved to a file via cat.
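To generate one file per content type, the function can be called with the URL columns produced earlier; the file names here are arbitrary, and each call issues one request per URL, so this can take a while on large sites.
createSitemap(links = as.list(Sitemap_posts$Url), fileName = "sitemap-posts")
createSitemap(links = as.list(Sitemap_products$Url), fileName = "sitemap-products")
createSitemap(links = as.list(Sitemap_taxonomy$Url), fileName = "sitemap-taxonomy")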
You can skip the GET step and directly print the URLs.
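A minimal sketch of that simpler variant (createSitemapStatic is a hypothetical name; it assigns the current date as lastmod for every URL instead of issuing requests):
createSitemapStatic <- function(links = list(), fileName = "sitemap") {
  require(whisker)
  template <- '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{{#links}}
 <url>
  <loc>{{{loc}}}</loc>
  <lastmod>{{{lastmod}}}</lastmod>
 </url>
{{/links}}
</urlset>'
  # No HTTP requests: every URL gets today's date as its lastmod value
  links <- lapply(links, function(url) list(loc = url, lastmod = format(Sys.Date(), "%Y-%m-%d")))
  cat(whisker.render(template, data = list(links = links)), file = paste0(fileName, ".xml"))
}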