
webcrawling_oa's Introduction

Sayari Data Task

Task

The Secretary of State of North Dakota provides a business search web app that allows users to search for businesses by name. Your task:

  1. Play around with the site and figure out how to query companies by name.
    • Hint: Your browser's dev tools are good for this.
  2. Download information for all active companies whose names start with the letter "X" (e.g., Xtreme Xteriors LLC) including their Commercial Registered Agent, Registered Agent, and/or Owners. Save the crawled data in the file format of your choice.
    • Hint: scrapy is a suitable web-crawling framework.
  3. Create and plot a graph of the companies, registered agents, and owners.
    • Hint: NetworkX is a suitable graph library that plays nice with matplotlib.
    • Hint: You may consider names as sufficiently unique to identify each node in the graph.
    • Hint: An example plot output is included below.

Work Done

  1. The APIs used by the website were identified with the Network tab in the browser's dev tools. The site uses two APIs: one for business search and one for business details.

  2. Both APIs were crawled with a Scrapy spider. The crawled data is stored in businesssearch.json.

  3. Based on the crawled data from step 2:

    • We constructed a graph that links each company name to its entities: Commercial Registered Agent, Registered Agent, and Owners.
    • The plotted network is a subgraph of connected components, which shows companies that share common entities. The graph can be found here.

    Note: Node labels are omitted from the graph to keep the plot readable. To show them, set the with_labels parameter of networkx.draw() to True.
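
The entity linking and connected-component filtering described in step 3 can be sketched without any dependencies. This is a minimal illustration, not the repository's actual code: the record field names (`company`, `registered_agent`, `commercial_registered_agent`, `owners`), the sample records, and the rule "keep components that contain more than one company" are all assumptions, since the real schema of businesssearch.json is not shown here.

```python
from collections import defaultdict

# Hypothetical records; field names and values are assumptions, not the
# real schema or contents of businesssearch.json.
records = [
    {"company": "Xtreme Xteriors LLC", "registered_agent": "Jane Doe", "owners": []},
    {"company": "X-Cel Builders Inc", "registered_agent": "Jane Doe", "owners": ["John Roe"]},
    {"company": "Xpress Hauling LLC", "commercial_registered_agent": "Acme Agents Inc", "owners": []},
]


def build_adjacency(records):
    """Link each company to its agents and owners in an undirected graph."""
    adjacency = defaultdict(set)
    companies = set()
    for record in records:
        company = record["company"]
        companies.add(company)
        adjacency[company]  # make sure companies without entities are still nodes
        entities = []
        for key in ("commercial_registered_agent", "registered_agent"):
            if record.get(key):
                entities.append(record[key])
        entities.extend(record.get("owners", []))
        for entity in entities:
            # Names are treated as unique node identifiers, as the task allows.
            adjacency[company].add(entity)
            adjacency[entity].add(company)
    return adjacency, companies


def connected_components(adjacency):
    """Return the connected components as a list of node sets (iterative DFS)."""
    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def shared_entity_components(records):
    """Keep only components in which two or more companies are connected."""
    adjacency, companies = build_adjacency(records)
    return [c for c in connected_components(adjacency) if len(c & companies) > 1]


components = shared_entity_components(records)
```

With NetworkX, the same grouping is available through `networkx.connected_components(G)` on an undirected graph, and the selected nodes can be plotted with `networkx.draw(G, with_labels=True)` when labels are wanted.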
