Python Chrome Bookmark HTML Scraping


Ceyhun Enki Aksan, Entrepreneur, Maker

As discussed in Web Scraping: Extracting Text from Infinite-Scroll Pages, I will be sharing articles and examples on data scraping over time. This article covers extracting and reorganizing content from Chrome Bookmarks (Favorites) that are automatically generated and need editing.

The main goal of the example is to extract and reorganize the URLs in an exported bookmarks file. Further steps, such as checking each URL's status or writing a new bookmark file, can be added if needed. We begin with the basic format that the code works with.

Google Chrome Bookmarks

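Google Chrome uses the following structure when importing and exporting bookmarks (Favorites) [1]. In this structure, each DT > H3 element names a folder, the DL element that follows it holds that folder's contents, and each DT > A element is a single bookmark. The <p> tags after each DL carry no content of their own.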

<DL><p>
    <DT><H3 ADD_DATE="..." LAST_MODIFIED="..." PERSONAL_TOOLBAR_FOLDER="true">Bookmarks Bar</H3>
    <DL><p>
        <DT><H3 ADD_DATE="..." LAST_MODIFIED="...">Training</H3>
        <DL><p>
            <DT><A HREF="..." ADD_DATE="..." ICON="...">...</A>
            <DT><A HREF="..." ADD_DATE="..." ICON="...">...</A>
            <DT><A HREF="..." ADD_DATE="..." ICON="...">...</A>
        </DL><p>
    </DL><p>
</DL><p>
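
To make the nesting concrete, here is a minimal sketch of walking this structure with Python's standard html.parser module. The class name BookmarkParser, the row layout, and the example.com URLs are assumptions made for illustration; this is not necessarily how the repository's code parses the file.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (folder_path, title, url) rows from a bookmark export."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished bookmark rows
        self._folders = []    # names of the folders currently open
        self._pending = None  # folder name seen in <H3>, waiting for its <DL>
        self._owns = []       # per open <DL>: True if it belongs to a folder
        self._tag = None      # "h3" or "a" while its text is being read
        self._text = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "dl":
            owns_folder = self._pending is not None
            if owns_folder:
                self._folders.append(self._pending)
                self._pending = None
            self._owns.append(owns_folder)
        elif tag in ("h3", "a"):
            self._tag, self._text = tag, []
            self._href = dict(attrs).get("href") if tag == "a" else None

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "dl" and self._owns:
            if self._owns.pop():
                self._folders.pop()  # leaving a folder
        elif tag == self._tag:
            text = "".join(self._text).strip()
            if tag == "h3":
                self._pending = text
            else:
                self.rows.append(("/".join(self._folders), text, self._href))
            self._tag = None


# Placeholder input in the same shape as the export shown above.
SAMPLE = """<DL><p>
    <DT><H3 PERSONAL_TOOLBAR_FOLDER="true">Bookmarks Bar</H3>
    <DL><p>
        <DT><H3>Training</H3>
        <DL><p>
            <DT><A HREF="https://example.com/a">First</A>
            <DT><A HREF="https://example.com/b">Second</A>
        </DL><p>
    </DL><p>
</DL><p>"""

parser = BookmarkParser()
parser.feed(SAMPLE)
for row in parser.rows:
    print(row)
```

Each row keeps the folder path alongside the link, so the folder hierarchy survives the flattening into a table.
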
Note

The Data Scraping repository is a collection of subdirectories, each containing a separate scraping project. For more information about the projects, see the repository's main page.

The bookmarks.html file located within the repository can be examined as an example.

The google-chrome-bookmarks project in the Data Scraping repository reads the exported HTML file, restructures its content, and saves the result as CSV. During restructuring, the exclusion_domains parameter lists the domain names to exclude, while separate_domains lists the domain names whose links are collected into separate CSV files.
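
The role of the two domain parameters can be sketched as a small routing function. The helper names matches_domain and route_link are hypothetical, and the rule that a domain also matches its subdomains is an assumption; the repository may compare domains differently.

```python
from urllib.parse import urlparse


def matches_domain(url, domain):
    """True if the URL's host is `domain` or one of its subdomains (assumed rule)."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def route_link(url, exclusion_domains, separate_domains, default="bookmark"):
    """Return the output file stem for `url`, or None if the link is excluded."""
    if any(matches_domain(url, d) for d in exclusion_domains):
        return None
    for domain, stem in separate_domains.items():
        if matches_domain(url, domain):
            return stem
    return default


# Example values taken from the scrape_html call in this article.
excluded = ["facebook.com", "twitter.com"]
separate = {"imdb.com": "imdb_links", "tr.pinterest.com": "pinterest_links"}
print(route_link("https://www.imdb.com/chart/top/", excluded, separate))
print(route_link("https://m.facebook.com/some-page", excluded, separate))
```

Checking exclusions first means an excluded domain never reaches a separate file, even if it is listed in both parameters.
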

Below is an example usage of scrape_html.

scrape_html(
    input_path="/Users/user/Desktop/Data-Scraping/google-chrome-bookmarks/bookmarks.html",
    output_dir="/Users/user/Desktop",
    filename="bookmark",
    extension="csv",
    strip_char="...",
    max_text_length=50,
    exclusion_domains=["facebook.com", "twitter.com"],
    separate_domains={
        "eksisozluk.com": "eksisozluk_links",
        "etsy.com": "etsy_links",
        "tr.pinterest.com": "pinterest_links",
        "imdb.com": "imdb_links",
    }
)

When run, scrape_html checks output_dir and then applies the operations specified by its parameters.
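
As a rough sketch of this output step, the function below groups rows by target file and writes one CSV per group. The name write_groups, the column headers, and the choice to create a missing directory rather than raise an error are all assumptions; the text does not say what scrape_html does when output_dir is absent.

```python
import csv
import os
from collections import defaultdict


def write_groups(rows, output_dir, extension="csv"):
    """Write (stem, title, url) rows to one CSV file per stem; return the paths."""
    os.makedirs(output_dir, exist_ok=True)  # assumption: create the directory if missing
    groups = defaultdict(list)
    for stem, title, url in rows:
        groups[stem].append((title, url))

    paths = []
    for stem, items in sorted(groups.items()):
        path = os.path.join(output_dir, f"{stem}.{extension}")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["title", "url"])  # hypothetical column names
            writer.writerows(items)
        paths.append(path)
    return paths
```

Opening the files with newline="" follows the csv module's guidance and avoids blank lines between rows on Windows.
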

.
├── bookmark.csv
├── eksisozluk_links.csv
├── etsy_links.csv
├── github_links.csv
├── imdb_links.csv
├── medium_links.csv
├── pinterest_links.csv
└── youtube_links.csv

1 directory, 8 files

The resulting output directory will resemble the structure above.
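
The optional URL status check mentioned at the start of the article could look roughly like the sketch below, which uses only the standard library. The function name url_status, the timeout value, and the injectable opener argument are assumptions for illustration and are not part of scrape_html as described here.

```python
import urllib.error
import urllib.request


def url_status(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for `url`, or None if the request fails."""
    try:
        # HEAD asks for headers only, so the page body is not downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status, e.g. 404
    except (urllib.error.URLError, OSError, ValueError):
        return None  # unreachable host, timeout, or malformed URL
```

Some servers reject HEAD requests, so a status other than 200 is a hint that a bookmark needs a closer look rather than proof that the link is dead.
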

Footnotes

  1. Chrome bookmarks. Wikipedia