One of the key issues when integrating websites, applications, and services is determining which pages contain the relevant tracking codes and which do not. Particularly when the work is spread across different teams and multiple services, page-level checks can become complex.
Almost all services publish pages through domain names and/or subdomains. Page layouts may also include code added at the page level or globally, affecting all pages. On top of that, once tag management tools such as Google Tag Manager are involved, monitoring the existing codes can become a separate process in itself.
Here is an example workflow:
[mermaid]
flowchart TD
  1(Typeform)
  2[Alt Domain 1]
  3[Alt Domain 2]
  4(EventBrite)
  5(ClickFunnel)
  6[Blog]
  7[Domain]
  8[(Google Analytics)]
  9[(Facebook Pixel)]
  10[[Google Tag Manager]]
  1 & 4 & 5 & 6 & 7 -.-> 8 & 9
  2 & 3 --> 10
  10 --> 8 & 9
[/mermaid]
Services can be managed through the subdomains or custom domains they provide. Most of these services support Google Analytics and Facebook Pixel directly. Let's assume that Google Tag Manager is set up directly for the subdomains. In this case, several options are available for monitoring the existing setup:
- Manually reviewing the source code of each page,
- Monitoring network traffic on all pages,
- Using the Facebook Pixel Helper and/or Google Tag Assistant extensions,
- Checking cookies,
- Monitoring the pages tracked by analytics services,
- Inspecting JavaScript variables,
- Third-party services and applications (GA Checker, etc.)
JavaScript variables and cookies can be inspected and manipulated via the browser console. In this article, however, I will focus on how the process can be automated.
Facebook Pixel is loaded on a page through the fbq function and its cookie, and Pixel events (page view, lead, etc.) are fired through fbq as well. Google Analytics, on the other hand, relies on the _ga cookie and the gaData object. These objects allow us to access the property data currently active on the page.
// Facebook
Object.keys(_fbq.instance.pixelsByID)
// Google Analytics
Object.keys(gaData)
We'll use these details shortly. First, let's return to the services. If page management and testing are already organized around a list, you can use that list directly. Otherwise, unfortunately, you'll have to spend time manually compiling the pages to be crawled. For this example, I'll assume that Google Analytics tracks all pages, so I'll pull the page data either via the API or through the export options it provides.
Since I'm retrieving the pages via Google Analytics for the code checks, I don't need an additional Object.keys(gaData) check. However, if you're working from an external list or sitemap files, you can certainly include the Google Analytics check as well (a sketch follows the crawler code below).
library(readxl)

# Read the page list exported from Google Analytics (sheet 2 of the export)
PageURL <- read_xlsx(path = '20201001-20210407.xlsx', sheet = 2)
checkList <- PageURL$Page
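If the export contains page paths rather than full URLs (Google Analytics reports paths unless cross-domain tracking prepends the hostname), the list may need a quick normalization before crawling. A minimal sketch, where "domain.com" is an assumed placeholder for the actual hostname:

# Sketch: deduplicate and prefix bare paths; "domain.com" is an assumed hostname
checkList <- unique(checkList)
checkList <- ifelse(startsWith(checkList, "/"),
                    paste0("domain.com", checkList),
                    checkList)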
RSelenium and JavaScript Operations
The next step is to inspect these pages one by one. For this purpose, I will be using the RSelenium package [1]. RSelenium is a comprehensive testing and automation package that makes Selenium's capabilities accessible from R. Selenium itself is a free, open-source tool developed for testing web applications, supporting various platforms and browsers [2].
The following example also demonstrates the fundamental functions of Selenium.
library(RSelenium)
library(stringr)

# Start a Selenium server and a Firefox client on port 4567
driver <- rsDriver(port = 4567L, browser = c("firefox"), version = 'latest', verbose = TRUE, check = TRUE)
remote_driver <- driver[["client"]]
remote_driver$open()

# Visit a URL and return the page's final URL plus any Facebook Pixel IDs
getFBIDs <- function(url){
  url <- if(str_detect(url, "https://")) url else paste0("https://", url)
  remote_driver$navigate(url)
  remote_driver$executeScript("
    var pageURL = window.location.href;
    var fbqID = (typeof fbq === 'function')
      ? Object.keys(_fbq.instance.pixelsByID)
      : false;
    return { pageURL, fbqID }
  ")
}

Pixels <- lapply(checkList, getFBIDs)
crawledData <- as.data.frame(do.call(rbind, Pixels))

# Close the browser first, then stop the Selenium server
remote_driver$close()
driver[["server"]]$stop()
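As mentioned earlier, if your page list comes from a sitemap or another external source rather than Google Analytics, the same script can also return the Google Analytics property IDs on the page, mirroring the fbq check. A minimal sketch; getTrackingIDs is a hypothetical variant of getFBIDs introduced here for illustration:

# Sketch: also check gaData; getTrackingIDs is hypothetical, not part of the original code
getTrackingIDs <- function(url){
  url <- if(str_detect(url, "https://")) url else paste0("https://", url)
  remote_driver$navigate(url)
  remote_driver$executeScript("
    var pageURL = window.location.href;
    var fbqID = (typeof fbq === 'function')
      ? Object.keys(_fbq.instance.pixelsByID)
      : false;
    var gaID = (typeof gaData === 'object')
      ? Object.keys(gaData)
      : false;
    return { pageURL, fbqID, gaID }
  ")
}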
Let's walk through the main block. First, we select a browser and define a port with rsDriver. All subsequent commands will be executed through this browser. When the command runs, the browser driver matching the specified version details is downloaded and configured. Once this completes, the browser becomes active and starts waiting for navigation commands [3] [4].
If you inspect the driver value, you can see the client and server components associated with Selenium. Since the subsequent operations go through the client, we store it as remote_driver.
$server
PROCESS ‘filed1931848b908.sh’, running, pid 72794.
Using remote_driver$navigate(url) to send a URL to the browser is enough to load the page and prepare it for subsequent commands. The operation runs in the same tab. Selenium also supports creating and monitoring multiple tabs [5]. However, since I have not yet been able to implement this functionality properly, I am skipping it for now. I will add a note to this text once I have a solution.
In addition to URL operations, many other actions can be performed based on Selenium (e.g., reading values, clicking, scrolling, form filling, etc.). For now, we will focus on JavaScript operations and continue with an example using executeScript.
remote_driver$executeScript("alert('Hello World!')")
Inside executeScript, we can run plain JavaScript as well as commands specific to the selected browser and apply them to the page [6]. However, if we need to retrieve data about the result of the operation, we must use a return statement.
remote_driver$executeScript("
var pageURL = window.location.href;
var fbqID = (typeof fbq === 'function')
? Object.keys(_fbq.instance.pixelsByID)
: false;
return { pageURL, fbqID }
")
The code snippet above returns the Facebook Pixel IDs present on the current page along with the page's URL. We capture the page URL because some of the listed pages no longer work and/or redirect to different pages. Of course, it would be even better to verify the status of the relevant URLs beforehand using httr and/or RCurl.
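A minimal sketch of such a pre-check with httr, keeping only URLs that respond with HTTP 200; checkURL is a hypothetical helper, and since httr follows redirects by default, resp$url also reveals where a page ends up:

library(httr)

# Sketch: return each page's final status and resolved URL (NA on connection errors)
checkURL <- function(url){
  url <- if(grepl("^https?://", url)) url else paste0("https://", url)
  resp <- tryCatch(HEAD(url, timeout(10)), error = function(e) NULL)
  if(is.null(resp)) return(data.frame(url = url, status = NA, finalURL = NA))
  data.frame(url = url, status = status_code(resp), finalURL = resp$url)
}

urlStatus <- do.call(rbind, lapply(checkList, checkURL))
checkList <- urlStatus$url[which(urlStatus$status == 200)]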
In cross-domain operations, domain names will also be included in Google Analytics page views (unless otherwise specified).
If one or more Facebook Pixel codes are present on a page, each Pixel ID is returned to us as a list element associated with the relevant URL. Converting that list into a data frame then makes the output more readable.
As a result, we end up with something similar to the following:
fbqID pageURL
1 100023405678901, 111112334456778, 123123123123123 https://www.clickfunnels.com/?aff_sub=domain_redirect&utm_campaign=domain_redirect
2 123123123123123 https://domain.com/
3 123123123123123 https://blog.domain.com/page-v-1
As can be seen, the first row indicates that checkList[1] is no longer in use: the Pixel IDs we receive there belong to ClickFunnels, so these rows must be cleaned up before the next step; a sketch of that cleanup follows.
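A minimal cleanup sketch, assuming every row whose final URL left our own domains should be dropped; "domain.com" again stands in for the real hostname(s):

# Sketch: drop rows that redirected off our domains (e.g. the ClickFunnels redirect)
crawledData <- crawledData[grepl("domain\\.com", unlist(crawledData$pageURL)), ]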
You can find the complete code on the RSelenium-JavaScript-Variable-Check.R page.
Footnotes
1. RSelenium: R Bindings for 'Selenium WebDriver'
2. RSelenium Basics
3. Anthony Aigboje Akhonokhue. (2020). Web scraping using RSelenium in R/Rstudio
4. Pascal Schmidt. (2019). RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium
5. Selenium multiple tabs at once
6. Caner Başat. (2018). Selenium JavascriptExecutor Kullanımı [Using Selenium's JavascriptExecutor]