rogerchang1108 / cambridge-dictionary-web-scraper

Jupyter Notebook 100.00%
cambridge-dictionary nlp python scraper unix-command

cambridge-dictionary-web-scraper's Introduction

Preprocessing for NLP using BeautifulSoup4 and UNIX Commands

In this project, we employ the BeautifulSoup4 package in Python Jupyter Notebook to scrape data from the Cambridge Dictionary website. By analyzing the HTML structure of the website, we extract relevant URLs using UNIX commands guided by formal language principles. Subsequently, we refine and organize the scraped data to construct a custom dictionary.

Note: Because scraping every URL would take a long time, we work with only the first 1000 URLs.

Scraped Data Structure

word_dict:

  • Contains 1000 elements with <headword> as the key and an individual dictionary as the value.

Individual dictionary:

  • Contains two elements:
    • "Headword" as the key with <headword> as the value.
    • "ENTRY" as the key with entry_array as the value.

entry_array:

  • Stores one or multiple entry_array_dict objects.

entry_array_dict:

  • Contains two elements:
    • "POS" as the key with pos_array as the value.
    • "POS-BODY" as the key with sense_body_array as the value.

pos_array:

  • Stores one or multiple <pos> elements for the current entry.

sense_body_array:

  • Stores one or multiple dictionaries, potentially containing def_block_array_dict and phrase_block_array_dict.

def_block_array_dict:

  • Contains three elements:
    • "DEFINITION-ENG" as the key with <definition> as the value.
    • "DEFINITION-CHI" as the key with <定義> (definition in Chinese) as the value.
    • "EXAMPLE-SENTS" as the key with examp_array as the value.

examp_array:

  • Stores one or multiple examp_array_dict objects for the current definition block.

examp_array_dict:

  • Contains two elements:
    • "SENT" as the key with <sentence> as the value.
    • "SENT-CHT" as the key with <例句> (example sentence in Chinese) as the value.

phrase_block_array_dict:

  • Contains two elements:
    • "PHRASE" as the key with <phrase> as the value.
    • "PHRASE-BODY" as the key with phrase_def_block_array as the value.

phrase_def_block_array:

  • Stores one or multiple phrase_def_block_array_dict objects.

phrase_def_block_array_dict:

  • Contains two elements:
    • "PHRASE-DEFINITION-ENG" as the key with <phrase definition> as the value.
    • "PHRASE-DEFINITION-CHI" as the key with <片語定義> (phrase definition in Chinese) as the value.

Take "accident" as an Example

[Image: scraped entry for "accident"]
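As a rough illustration, the nesting described in the previous section can be written out as a Python literal. This is only a sketch: the values in angle brackets are placeholders rather than scraped text, the "noun" part of speech is filled in for illustration, and treating the definition block and the phrase block as sibling elements of sense_body_array is an assumption about how the notebook arranges them.

```python
# Placeholder values in angle brackets stand in for scraped text;
# only the key names and the nesting follow the README's description.
word_dict = {
    "accident": {
        "Headword": "accident",
        "ENTRY": [  # entry_array
            {
                "POS": ["noun"],  # pos_array (illustrative value)
                "POS-BODY": [  # sense_body_array
                    {  # def_block_array_dict
                        "DEFINITION-ENG": "<definition>",
                        "DEFINITION-CHI": "<定義>",
                        "EXAMPLE-SENTS": [  # examp_array
                            {"SENT": "<sentence>", "SENT-CHT": "<例句>"},
                        ],
                    },
                    {  # phrase_block_array_dict
                        "PHRASE": "<phrase>",
                        "PHRASE-BODY": [  # phrase_def_block_array
                            {
                                "PHRASE-DEFINITION-ENG": "<phrase definition>",
                                "PHRASE-DEFINITION-CHI": "<片語定義>",
                            },
                        ],
                    },
                ],
            },
        ],
    },
}

# Walking down to the first example sentence of the first sense.
first_sense = word_dict["accident"]["ENTRY"][0]["POS-BODY"][0]
first_sentence = first_sense["EXAMPLE-SENTS"][0]["SENT"]
print(first_sentence)
```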

Scraping the Page:

  • Import modules, initialize the URL list, and read the URLs.

  • Create word_dict and loop over the URLs.

  • Set up the HTTP request and its header information.
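A minimal sketch of these setup steps follows. The README does not name the HTTP library or show the headers the notebook sends, so the use of the standard-library `urllib.request`, the header values, and the helper names `load_urls` and `build_request` are all assumptions made for illustration.

```python
from urllib.request import Request

# Hypothetical header values; the notebook's real headers are not shown here.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; dictionary-scraper-sketch)",
    "Accept-Language": "en-US,en;q=0.9",
}
MAX_URLS = 1000  # the README limits the run to the first 1000 URLs


def load_urls(lines, limit=MAX_URLS):
    """Return up to `limit` non-empty, stripped URLs from an iterable of lines."""
    urls = [line.strip() for line in lines if line.strip()]
    return urls[:limit]


def build_request(url):
    """Attach the header information to a request object without sending it."""
    return Request(url, headers=HEADERS)


# In practice the lines would come from the file of extracted URLs.
urls = load_urls(["https://example.com/a\n", "\n", "https://example.com/b\n"])
word_dict = {}
requests_to_send = [build_request(url) for url in urls]
```

Each prepared request could then be passed to `urllib.request.urlopen` inside the loop; the sketch stops short of sending anything.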

  • Fetch the web page.

  • Confirm the existence of "di-body" to determine if the URL request was successful.

  • Depending on which dictionary directory the URL belongs to, select the matching "headword" class to fetch the actual headword or phrase.
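The check for "di-body" and the headword lookup could look roughly like the sketch below. The HTML is an invented, simplified stand-in for a fetched page, and the `HEADWORD_CLASS` mapping (including the "phrase-title" class) is a hypothetical example of choosing a class per directory, not the notebook's actual table.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a fetched page; real pages carry far more markup.
SAMPLE_HTML = """
<div class="di-body">
  <span class="headword">accident</span>
</div>
"""

# Hypothetical mapping from directory kind to the class holding the headword.
HEADWORD_CLASS = {"word": "headword", "phrase": "phrase-title"}


def extract_headword(html, kind="word"):
    """Return the headword text, or None when "di-body" is missing."""
    soup = BeautifulSoup(html, "html.parser")
    di_body = soup.find(class_="di-body")
    if di_body is None:  # the request did not return a dictionary page
        return None
    tag = di_body.find(class_=HEADWORD_CLASS[kind])
    return tag.get_text(strip=True) if tag is not None else None


found = extract_headword(SAMPLE_HTML)
missing = extract_headword("<html><body>blocked</body></html>")
print(found, missing)
```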

  • First, check the page structure: some pages have no "entry-body". If it is absent, fall back to the "di-body" found earlier as the root for subsequent retrieval.

  • Within each "entry", fetch the part of speech ("pos"); if absent, write "N/A".

  • Within each "entry", follow the webpage structure to fetch definitions and related phrase items under "sense-body".
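The fallback and the "N/A" default might be sketched as follows. The markup is invented and much flatter than the real site, the function name `parse_entries` is hypothetical, and storing the raw text of each "sense-body" under "POS-BODY" is a placeholder for the fuller parsing described in the later steps.

```python
from bs4 import BeautifulSoup

# Invented, simplified markup; real pages are far more deeply nested.
SAMPLE_HTML = """
<div class="di-body">
  <div class="entry-body">
    <div class="entry">
      <span class="pos">noun</span>
      <div class="sense-body">first sense</div>
    </div>
    <div class="entry">
      <div class="sense-body">second sense</div>
    </div>
  </div>
</div>
"""


def parse_entries(html):
    """Return one {"POS", "POS-BODY"} dict per entry found on the page."""
    soup = BeautifulSoup(html, "html.parser")
    di_body = soup.find(class_="di-body")
    if di_body is None:
        return []
    # Fall back to "di-body" when the page has no "entry-body".
    entry_body = di_body.find(class_="entry-body")
    root = entry_body if entry_body is not None else di_body
    # If no explicit "entry" exists, treat the root itself as the only entry.
    entries = root.find_all(class_="entry") or [root]
    result = []
    for entry in entries:
        pos_array = [p.get_text(strip=True) for p in entry.find_all(class_="pos")]
        senses = [s.get_text(strip=True) for s in entry.find_all(class_="sense-body")]
        result.append({"POS": pos_array or ["N/A"], "POS-BODY": senses})
    return result


entry_array = parse_entries(SAMPLE_HTML)
print(entry_array)
```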

  • Within the "sense-body", follow the webpage structure to fetch English and Chinese definitions under "def-block". If absent, write "N/A".

  • Within "def-block", follow the webpage structure to fetch English and Chinese example sentences under "examp" and "dexamp". If absent, write "N/A".
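One way to sketch the definition and example extraction is shown below. Only "def-block", "examp" and "dexamp" come from the README; the inner class names "def", "trans" and "eg", the sample text, and the choice to emit a single "N/A" pair when no example exists are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample markup with placeholder text.
SAMPLE_HTML = """
<div class="sense-body">
  <div class="def-block">
    <div class="def">sample English definition</div>
    <span class="trans">範例定義</span>
    <div class="examp dexamp">
      <span class="eg">Sample sentence.</span>
      <span class="trans">範例句子。</span>
    </div>
  </div>
</div>
"""


def text_or_na(tag):
    """Return the tag's stripped text, or "N/A" when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else "N/A"


def parse_def_block(block):
    """Build a def_block_array_dict from one "def-block" element."""
    examples = [
        {"SENT": text_or_na(ex.find(class_="eg")),
         "SENT-CHT": text_or_na(ex.find(class_="trans"))}
        for ex in block.find_all(class_="examp")
    ]
    return {
        "DEFINITION-ENG": text_or_na(block.find(class_="def")),
        # recursive=False keeps an example's translation from being
        # mistaken for the definition's translation.
        "DEFINITION-CHI": text_or_na(block.find(class_="trans", recursive=False)),
        "EXAMPLE-SENTS": examples or [{"SENT": "N/A", "SENT-CHT": "N/A"}],
    }


block = BeautifulSoup(SAMPLE_HTML, "html.parser").find(class_="def-block")
parsed = parse_def_block(block)
print(parsed)
```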

  • Check if "phrase-block" exists within the "sense-body".

  • If present, fetch the "phrase". Within "phrase-block", follow the webpage structure to fetch English and Chinese definitions of the phrase under "def-block". If there's no Chinese definition, write "N/A".

  • If absent, skip the "phrase" section.
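The phrase handling can be sketched in the same style. As before, the markup is invented, and the inner class names "phrase", "def" and "trans" as well as the helper names are assumptions; only "phrase-block", "sense-body" and "def-block" are taken from the README.

```python
from bs4 import BeautifulSoup

# Invented sample markup: one phrase with an English definition only.
SAMPLE_HTML = """
<div class="sense-body">
  <div class="phrase-block">
    <span class="phrase">by accident</span>
    <div class="def-block">
      <div class="def">sample phrase definition</div>
    </div>
  </div>
</div>
"""


def text_or_na(tag):
    """Return the tag's stripped text, or "N/A" when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else "N/A"


def parse_phrase_blocks(sense_body):
    """Return a list of phrase_block_array_dict; empty when no phrase exists."""
    phrases = []
    for block in sense_body.find_all(class_="phrase-block"):
        definitions = [
            {"PHRASE-DEFINITION-ENG": text_or_na(d.find(class_="def")),
             "PHRASE-DEFINITION-CHI": text_or_na(d.find(class_="trans"))}
            for d in block.find_all(class_="def-block")
        ]
        phrases.append({
            "PHRASE": text_or_na(block.find(class_="phrase")),
            "PHRASE-BODY": definitions,
        })
    return phrases


sense_body = BeautifulSoup(SAMPLE_HTML, "html.parser").find(class_="sense-body")
phrase_items = parse_phrase_blocks(sense_body)
print(phrase_items)
```

Returning an empty list when no "phrase-block" is present lets the caller skip the phrase section without a separate existence check.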

  • Save as JSON file: The JSON string is generated using json.dumps and then written to a file.
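A small sketch of the save step is given below. The helper name `save_word_dict`, the file name, and the `ensure_ascii=False` and `indent=2` arguments are assumptions; `ensure_ascii=False` is chosen here so that the Chinese text stays readable in the file instead of being written as `\u` escapes.

```python
import json
import tempfile
from pathlib import Path


def save_word_dict(word_dict, path):
    """Serialize word_dict with json.dumps and write the string to `path`."""
    text = json.dumps(word_dict, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


# Placeholder content; a temporary directory keeps the demo free of side effects.
sample = {"accident": {"Headword": "accident", "ENTRY": [{"POS": ["noun"], "POS-BODY": []}]}}
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "word_dict.json"
    save_word_dict(sample, out_path)
    restored = json.loads(out_path.read_text(encoding="utf-8"))

print(restored == sample)
```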

  • Retry Mechanism: If a request fails (e.g., due to a network issue), the program will attempt a maximum of 3 retries.

  • Handling Request Failures: If data still cannot be retrieved after 3 retries, the code uses the URL itself as the headword and adds it to word_dict together with the message "URL request fail".
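The retry and failure handling could be sketched as below. Several details are assumptions: "3 retries" is read as one initial attempt plus up to three more, network errors are caught as `OSError` (the base class of the errors raised by `urllib`), the `fetch` callable is injected so the sketch can be exercised without a network, and the exact shape of the failure record in word_dict is a guess.

```python
import time

MAX_RETRIES = 3
FAIL_MESSAGE = "URL request fail"


def fetch_with_retry(url, fetch, retries=MAX_RETRIES, delay=0.0):
    """Try fetch(url) once, then retry up to `retries` times on OSError."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt < retries:
                time.sleep(delay)  # optional pause before the next attempt
    return None


def fetch_or_record_failure(word_dict, url, fetch):
    """Return the page, or record the URL as a failed headword in word_dict."""
    page = fetch_with_retry(url, fetch)
    if page is None:
        # Assumed record shape: the URL stands in for the headword.
        word_dict[url] = {"Headword": url, "ENTRY": FAIL_MESSAGE}
    return page


calls = []


def always_down(url):
    """Stand-in fetcher that simulates a persistent network failure."""
    calls.append(url)
    raise OSError("simulated network failure")


word_dict = {}
fetch_or_record_failure(word_dict, "https://example.com/missing", always_down)
print(len(calls), word_dict)
```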
