Crawler

A small web crawler for collecting Kurdish text from the web.

It has these commands:

  • Crawl: crawls web pages, extracts the Kurdish text from them, and saves it to a folder on disk.
  • Normalize: converts the text collected by the Crawl command to standard Unicode text.
  • WordList: builds a wordlist from the normalized text produced by the Normalize command.

How to use

Crawl

./crawler.exe crawl -url <url> -output <output> [-delay <delay>] [-pages <pages>]

Parameters:

  • url: The absolute URL for the site you want to crawl.
  • output: The folder in which to save the crawled pages. The crawler also saves a $Stats.txt file that contains the crawling stats.
  • delay: Number of milliseconds to wait between crawling two pages. Default value is 1000.
  • pages: Maximum number of pages to crawl. Default value is 250.

Examples:

./crawler.exe crawl -url https://ckb.wikipedia.org -output ./Data
./crawler.exe crawl -url https://www.google.iq/ -output D:/CrawledPages/ -delay 250 -pages 1000
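
To illustrate how the delay and pages parameters bound a crawl, here is a minimal Python sketch. It is not the crawler's actual implementation: the `crawl` function, its breadth-first order, and the injectable `fetch` and `sleep` callables are all assumptions made for illustration. `fetch` is a hypothetical function that returns the page text and the links found on the page.

```python
import time
from collections import deque
from typing import Callable, Dict, Iterable, Tuple


def crawl(start_url: str,
          fetch: Callable[[str], Tuple[str, Iterable[str]]],
          delay_ms: int = 1000,
          max_pages: int = 250,
          sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Breadth-first crawl that stops after max_pages pages and
    waits delay_ms milliseconds between two consecutive fetches."""
    queue = deque([start_url])
    seen = {start_url}          # URLs already queued, to avoid revisiting
    pages: Dict[str, str] = {}  # url -> extracted text

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if pages:
            # Wait between two pages, but not before the first one.
            sleep(delay_ms / 1000)
        text, links = fetch(url)
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` and `sleep` as arguments keeps the loop testable without network access; a real crawler would also need to restrict links to the starting site and handle failed requests.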

Normalize

./crawler.exe normalize -inputdir <inputdirectory> -outdir <outputdirectory> 

Parameters:

  • inputdirectory: Path to the folder that contains the text collected from the website.
  • outputdirectory: The folder in which the normalized files are saved. Files with a size of 0 are discarded.

Examples:

./crawler.exe normalize -inputdir ./myInputFolder -outdir ./myOutputFolder
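
As a rough illustration of what a normalization pass over a folder could look like, here is a minimal Python sketch. It is an assumption, not the tool's actual logic: the character mapping (Arabic Kaf and Yeh replaced with the forms commonly used in Kurdish text), the use of Unicode NFC, the `*.txt` file pattern, and the function names are all hypothetical. Files whose normalized text is empty are skipped, mirroring the rule that zero-size files are discarded.

```python
import unicodedata
from pathlib import Path

# Assumed mapping, for illustration only: Arabic letters that are
# commonly replaced with their Kurdish counterparts.
CHAR_MAP = str.maketrans({
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
})


def normalize_text(text: str) -> str:
    """Apply Unicode NFC composition, then the character mapping."""
    return unicodedata.normalize("NFC", text).translate(CHAR_MAP)


def normalize_dir(input_dir: str, out_dir: str) -> int:
    """Normalize every .txt file in input_dir into out_dir.
    Returns the number of files written; empty results are skipped."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for src in sorted(Path(input_dir).glob("*.txt")):
        normalized = normalize_text(src.read_text(encoding="utf-8"))
        if not normalized:
            continue  # discard files that would have a size of 0
        (out / src.name).write_text(normalized, encoding="utf-8")
        written += 1
    return written
```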

Wordlist

./crawler.exe wordlist -inputdir <inputdirectory> -outfile <outputfile> 

Parameters:

  • inputdirectory: Path to the folder that contains the normalized text.
  • outputfile: The output file that will contain the wordlist created from the normalized text.

Examples:

./crawler.exe wordlist -inputdir ./NormalizedFolderData -outfile WORDLIST.txt
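
The idea of building a wordlist from a folder of normalized text can be sketched in Python as follows. This is a hedged illustration rather than the tool's real algorithm: the tokenization rule (runs of Unicode letters), the `*.txt` file pattern, the one-word-per-line output, and the `build_wordlist` name are assumptions.

```python
import re
from pathlib import Path
from typing import List

# Assumed tokenization: a word is a run of Unicode letters
# (word characters excluding digits and underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def build_wordlist(input_dir: str, out_file: str) -> List[str]:
    """Collect the unique words from every .txt file in input_dir
    and write them to out_file, one word per line, sorted by code point."""
    words = set()
    for src in Path(input_dir).glob("*.txt"):
        words.update(WORD_RE.findall(src.read_text(encoding="utf-8")))
    ordered = sorted(words)
    Path(out_file).write_text("\n".join(ordered), encoding="utf-8")
    return ordered
```

A set keeps each word once regardless of how often it occurs; if frequencies matter, `collections.Counter` would be the natural replacement.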

Contributors

aramrafeq, mhmd-azeez
