Comments (8)

raphCode commented on June 12, 2024

Maybe we can add an option to output URL filtering information to stdout or a file, e.g., whether the include or exclude regex matches?
I think this would make it more transparent what suckit is doing.
I also plan to implement functionality to rewrite the local URLs that could profit from this debug feature.
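
A minimal sketch of what such a filtering log could look like, not suckit's actual code. `FilterDecision`, `decide` and `log_line` are hypothetical names, the two closures stand in for the include and exclude regex matches, and the rule "include must match, exclude must not" is an assumption about how the two filters combine.

```rust
use std::fmt;

/// Why a URL was kept or skipped. Hypothetical type, not part of suckit.
#[derive(Debug, PartialEq)]
enum FilterDecision {
    Accepted,
    NotIncluded,
    Excluded,
}

impl fmt::Display for FilterDecision {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let text = match self {
            FilterDecision::Accepted => "accepted",
            FilterDecision::NotIncluded => "skipped (include pattern did not match)",
            FilterDecision::Excluded => "skipped (exclude pattern matched)",
        };
        f.write_str(text)
    }
}

/// Decide whether `url` passes the filters.
/// `include` and `exclude` are stand-ins for the regex matches (assumption).
fn decide(
    url: &str,
    include: impl Fn(&str) -> bool,
    exclude: impl Fn(&str) -> bool,
) -> FilterDecision {
    if !include(url) {
        FilterDecision::NotIncluded
    } else if exclude(url) {
        FilterDecision::Excluded
    } else {
        FilterDecision::Accepted
    }
}

/// Format one log line; the caller decides whether it goes to stdout or a file.
fn log_line(url: &str, decision: &FilterDecision) -> String {
    format!("{}\t{}", decision, url)
}
```

Keeping the decision separate from the formatting means the same result can drive both the scraper and the debug output.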

from suckit.

Skallwar commented on June 12, 2024

This looks correct. The best way to know is by testing it, and I would love to see the result of such a test. If you can build this directory tree, just serve it using a web server and try to run suckit against localhost.

Skallwar commented on June 12, 2024

@mr-bo-jangles Did it work?

Skallwar commented on June 12, 2024

Maybe we can add an option to output URL filtering information to stdout or a file

Good idea

I also plan to implement functionality to rewrite the local URLs that could profit from this debug feature.

What do you mean?

raphCode commented on June 12, 2024

What do you mean?

To download a phpBB forum, I added a hack that rewrites some URLs, namely by removing the ?sid=<hash> parameter. Otherwise the same pages get downloaded over and over again with different sid hashes.
If you want to take a look:
https://github.com/raphCode/suckit/blob/fusornet_hack/src/scraper.rs#L191

I originally planned to flesh this out into a dedicated feature / command line option, but in the end I didn't. I had already achieved my goal, and I could not figure out a way to do it properly.
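
For illustration, removing a single query parameter could look roughly like this. It is a sketch using plain string handling, not the code from the linked branch; `strip_query_param` is a hypothetical helper, and it assumes parameters are separated by `&` and does no percent-decoding.

```rust
/// Remove every occurrence of the query parameter `name` from `url`.
/// Hypothetical helper: plain string handling, no percent-decoding.
fn strip_query_param(url: &str, name: &str) -> String {
    // Split off the fragment first so it can be re-attached untouched.
    let (without_fragment, fragment) = match url.split_once('#') {
        Some((head, frag)) => (head, Some(frag)),
        None => (url, None),
    };

    let mut result = match without_fragment.split_once('?') {
        // No query string at all: nothing to remove.
        None => without_fragment.to_string(),
        Some((base, query)) => {
            // Keep every non-empty `key=value` pair whose key is not `name`.
            let kept: Vec<&str> = query
                .split('&')
                .filter(|pair| !pair.is_empty())
                .filter(|pair| pair.split('=').next() != Some(name))
                .collect();
            if kept.is_empty() {
                base.to_string()
            } else {
                format!("{}?{}", base, kept.join("&"))
            }
        }
    };

    if let Some(frag) = fragment {
        result.push('#');
        result.push_str(frag);
    }
    result
}
```

With a made-up host, `strip_query_param("https://forum.example/viewtopic.php?t=42&sid=abc123", "sid")` returns `https://forum.example/viewtopic.php?t=42`.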

Skallwar commented on June 12, 2024

The problem with removing parameters such as ?sid is that they might change the content of the requested page. If you remove them, two links that are identical except for their parameters will share a single page downloaded by suckit, when they should have two different pages.

raphCode commented on June 12, 2024

In general you are correct, but in the specific case of phpBB the content is always the same, no matter the ?sid parameter value.
One solution would be to just ignore all links with this parameter, as suggested here, but this may create a swath of broken links. Instead, I removed the parameter from the URL and collapsed all links into their "canonical" form without the session id parameter.

I actually just found a different solution, namely to send session cookies, which avoids ?sid parameters getting appended to links in the first place.
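
The canonical-form approach mentioned above can be sketched as a visited set keyed by the canonical URL. This is an illustration only: `Visited` and `first_visit` are hypothetical names, and the canonicalization step is passed in as a closure because it is site-specific (for phpBB, dropping the sid parameter).

```rust
use std::collections::HashSet;

/// Tracks which pages were already scheduled, keyed by canonical URL.
/// Hypothetical type, not part of suckit.
struct Visited {
    seen: HashSet<String>,
}

impl Visited {
    fn new() -> Self {
        Visited { seen: HashSet::new() }
    }

    /// Returns true only the first time a canonical form is seen.
    /// `canonicalize` is supplied by the caller (assumption: it maps all
    /// session-specific variants of a link to the same string).
    fn first_visit(&mut self, url: &str, canonicalize: impl Fn(&str) -> String) -> bool {
        // `HashSet::insert` returns false if the value was already present.
        self.seen.insert(canonicalize(url))
    }
}
```

Two links that differ only in the removed parameter then count as one page, which is the desired effect for phpBB and exactly the risk on sites where the parameter changes the content.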

Skallwar commented on June 12, 2024

We could imagine a solution where you would have a list of tuples, each with a regex and a list of arguments to remove:

Vec<(regex, Vec<parameter>)>

But it might be really costly
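
A rough sketch of that rule list, to make the idea concrete. To stay dependency-free, `pattern` is a plain substring standing in for the regex, which is a simplification of the proposal; `RewriteRule` and `apply_rules` are hypothetical names, and URL fragments are not handled.

```rust
/// One rewrite rule: URLs matching `pattern` get the listed query parameters removed.
/// `pattern` is a plain substring here, standing in for the regex (assumption).
struct RewriteRule {
    pattern: String,
    params: Vec<String>,
}

/// Apply every matching rule to `url` and return the rewritten URL.
fn apply_rules(url: &str, rules: &[RewriteRule]) -> String {
    // Collect the parameters to drop from every rule whose pattern matches.
    let to_drop: Vec<&str> = rules
        .iter()
        .filter(|rule| url.contains(rule.pattern.as_str()))
        .flat_map(|rule| rule.params.iter().map(String::as_str))
        .collect();

    // Nothing to do if no rule matched or the URL has no query string.
    let (base, query) = match url.split_once('?') {
        Some(parts) if !to_drop.is_empty() => parts,
        _ => return url.to_string(),
    };

    // Keep every non-empty pair whose key is not in the drop list.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("");
            !to_drop.contains(&key)
        })
        .collect();

    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}
```

Every discovered URL is checked against every rule, so the per-link cost grows with the number of rules.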
