
munchlex's Introduction

MunchLex

comments from the original author: "A simple HTML Lexer/Parser to wrap my head around lexers and parsers, the foundations of a compiler/interpreter. Since I love webscraping thought I'd give it a try!"

Introduction and Project Idea

Multi-Threaded Web Scraper in C

Munchlex is a multi-threaded web scraper written in C that revolves around the idea of a lexer/parser. The project started from the need to improve the performance of a Python-based web scraper. The original Python scrapers, though functional, were sequential in nature: each page was parsed line by line, which was time consuming. Hence Munchlex, and the idea of a multi-threaded web scraper, was born.

Munchlex: Overview

The multi-threaded web scraper is designed to take advantage of concurrent execution by utilizing multiple threads, a departure from traditional sequential web scrapers written in languages like Python. Using multiple threads aims to enhance the speed and efficiency of the scraper, making it capable of handling large-scale data extraction and processing.

Features

  • Multi-threaded: harnesses the parallel processing capabilities of multiple threads to scrape data more efficiently and reduce the time taken to scrape it
  • Fast: the multi-threaded web scraper is designed to be fast and efficient.
  • Performance optimization: the scraper addresses the limitations of sequential web scrapers written in interpreted languages, ensuring a faster and more efficient data extraction process
  • Scalability: thanks to multi-threading, the scraper is capable of handling large-scale data extraction and processing easily and efficiently

How to use Munchlex

This section contains the steps needed to get your own copy of Munchlex:

  • First, fork this repository using the Fork button in the top right corner of the page to get your own copy of the repository. This lets you actively experiment with the web scraper's code.

  • Second, once you have made the fork, clone the repository to your local machine using the following command:

git clone <paste URL of the forked repo here>
  • Third, once you have cloned the repository, navigate to its directory and run the following command to compile the code:
make all

This compiles and links the project into an executable. You can then enter the test directory and run the shell script there to test the web scraper.

  • Note: the web scraper has not yet been tested natively on Windows systems. The use of a Linux-based system is recommended.
  • Note: to run the scraper on a Windows system, you can use the Windows Subsystem for Linux (WSL2).

Working of the Project

Basic idea of the working

The web scraper works by using multiple threads to scrape data from web pages. The function 'munchLex' processes each line of the input file, tokenizes it, and constructs a tree of tokens that represents the document structure. The resulting tree is also printed to the log file. In other words, the function performs lexical analysis of the page.

In detail working

  • When the user begins execution, the main driver program 'main.c' orchestrates the scraping process using a thread pool.
  • main parses the command-line arguments to gather the number of threads, the files to scrape, and optional flags such as "Daemon" mode (still in development).
  • A thread pool with the specified number of threads is then created to parallelize the process.
  • For each file, a thread from the pool is assigned to execute the 'munchLex' function, the main parser and lexer of the scraper. It performs lexical analysis of the page, tokenizes the input file, and ends by constructing a tree structure that represents the document.
  • The 'lexer' function identifies tokens such as tags and text, and a tree structure is generated to represent the hierarchical structure of the document. The tree is also stored in a log file.
  • NOTE: it is recommended to run the scraper yourself to see in more detail how the tree structure is generated and how the lexical analysis is performed.

Future Work

  • The scraper is still in development and many features remain to be implemented.
  • Daemon mode is still in development and will be implemented in the future.
  • The scraper has yet to be tested on Windows systems; this will be done in the future to ensure a smooth experience for Windows users.

Conclusion

The multi-threaded web scraper is designed to be fast and efficient, and to handle large-scale data extraction and processing with ease.

Proudly brought to you by the Munchlex team and the open source community ❤️

munchlex's People

Contributors

goncalomark, jainil2004


munchlex's Issues

Migrate from semaphore to a thread pool strategy

Currently a POSIX semaphore controls how many threads run in parallel. Although this works, it still incurs some overhead because many threads are spawned; a thread pool executor would instead distribute work per thread, with the threads spawned at a fixed size at startup.

Create CI/CD pipeline

Using GitHub Actions, shall create a pipeline for compilation, automated tests(?), and build.

Output logging of the parsed syntax tree to a logging file

Get rid of the printf's used for logging; I want files so I can visually infer whether the tree is actually well parsed. Simple examples were tested with printf's on the terminal, but heavier files are coming and stdout won't make for nice reading.

Add UTF-8 Support

Currently this Lexer/Parser only works with ASCII characters; it needs to be refactored to work with UTF-8 encoding.
