Comments (6)

zo0o0ot commented on September 26, 2024

I copied the scraper here. It still needs significant changes: the other page looked wildly different, even though the content was the same.

from colly-draft-prospects.

zo0o0ot commented on September 26, 2024

Some changes have been made, but the scraper is still in a debug state. The web page has two tables, and we only need data from the second one, but we're getting data from both. I'm sure it has something to do with how c.OnHTML("tr td:nth-of-type(4)", ...) and the similar collectors work, presumably because that selector is not scoped to a particular table. I don't have much experience with the DOM, so I'm going to need to do some more troubleshooting.
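To make the "second table only" idea concrete, here is a minimal sketch that deliberately avoids colly and uses only the standard library, so it can run anywhere. It assumes a well-formed snippet with no nested tables; the function name fourthCells and the sample markup are hypothetical and not taken from the scraper. In colly itself, the analogous move would probably be to prefix the selector with something that identifies the second table, but that is an untested assumption about the real page.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// fourthCells returns the text of the 4th <td> of every row inside the
// tableIndex-th (1-based) <table> of a well-formed document.
// Assumption: tables are not nested.
func fourthCells(doc string, tableIndex int) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var out []string
	table, td := 0, 0 // tables seen so far, cells seen in the current row
	inTarget := false // true while inside a wanted cell
	var text strings.Builder
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch t.Name.Local {
			case "table":
				table++
			case "tr":
				td = 0
			case "td":
				td++
				if table == tableIndex && td == 4 {
					inTarget = true
					text.Reset()
				}
			}
		case xml.CharData:
			if inTarget {
				text.Write(t)
			}
		case xml.EndElement:
			if t.Name.Local == "td" && inTarget {
				out = append(out, strings.TrimSpace(text.String()))
				inTarget = false
			}
		}
	}
}

func main() {
	// Hypothetical page: the first table is noise, the second holds the data.
	doc := `<body>
<table><tr><td>a</td><td>b</td><td>c</td><td>skip me</td></tr></table>
<table><tr><td>1</td><td>2</td><td>3</td><td>keep me</td></tr></table>
</body>`
	cells, err := fourthCells(doc, 2)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(cells)
}
```

The key point is that the cell counter alone cannot tell the tables apart; some table-level condition has to be part of the match.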

zo0o0ot commented on September 26, 2024

#2 is a great start. It looks like OnHTML(...) uses GoQuery, so I'll need to read up on that.

Link to GoQuery:
https://github.com/PuerkitoBio/goquery

zo0o0ot commented on September 26, 2024

#3 uses the html element directly in the scraper, without goquery, since it is not yet clear whether goquery is needed to complete this task.

This article shows an implementation that uses colly both with goquery and with html.
https://benjamincongdon.me/blog/2018/03/01/Scraping-the-Web-in-Golang-with-Colly-and-Goquery/

zo0o0ot commented on September 26, 2024

#4 takes the data obtained in statements like c.OnHTML("tr td:nth-of-type(4)", ...) and puts it into slices. Rather than trying to filter the data as it goes into each slice, we can assume that the only garbage data is at the start of the slice, so if we look for the category header, we may be able to just write all the data that follows it to the CSV.
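The header-based trimming described above could look roughly like this. It is only a sketch under assumptions: the helper name trimBeforeHeader, the header strings, and the sample values are all hypothetical, and the real slices would be filled by the collectors rather than hard-coded.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
)

// trimBeforeHeader returns the elements that follow the first occurrence
// of header. It returns nil when the header is missing, so the caller can
// tell that nothing trustworthy was found.
func trimBeforeHeader(cells []string, header string) []string {
	for i, c := range cells {
		if c == header {
			return cells[i+1:]
		}
	}
	return nil
}

func main() {
	// Hypothetical scraped columns: leading entries come from the first
	// table, then the category header, then the data we want.
	names := []string{"junk", "Player", "Player A", "Player B"}
	schools := []string{"junk", "junk", "School", "School A", "School B"}

	names = trimBeforeHeader(names, "Player")
	schools = trimBeforeHeader(schools, "School")

	// Only pair up as many rows as both columns can supply.
	n := len(names)
	if len(schools) < n {
		n = len(schools)
	}

	w := csv.NewWriter(os.Stdout)
	for i := 0; i < n; i++ {
		if err := w.Write([]string{names[i], schools[i]}); err != nil {
			fmt.Fprintln(os.Stderr, "write error:", err)
			return
		}
	}
	w.Flush()
	if err := w.Error(); err != nil {
		fmt.Fprintln(os.Stderr, "flush error:", err)
	}
}
```

This relies on the assumption stated in the comment: garbage only appears before the header. If the first table also contained the same header text, the trim would start too early.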

zo0o0ot commented on September 26, 2024

This is fixed by #5.
