Comments (8)
Maybe we can add an option to output URL filtering information to stdout or a file, e.g., whether the include or exclude regex matched a given URL?
I think this would make it more transparent what suckit is doing.
I also plan to implement functionality to rewrite local URLs, which could benefit from this debug feature.
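To make the idea concrete, here is a minimal sketch of what such a filter log could look like. It is not suckit's actual code: `FilterDecision` and `filter_url` are hypothetical names, and plain substring matching stands in for the include/exclude regexes so that the example needs only the standard library.

```rust
use std::io::{self, Write};

/// Outcome of running one URL through the include/exclude filters.
/// (Hypothetical type, not part of suckit.)
#[derive(Debug, PartialEq)]
enum FilterDecision {
    Accepted,
    NotIncluded,
    Excluded,
}

/// Decide whether `url` passes the filters and write one log line to `log`.
/// ASSUMPTION: `include` and `exclude` are plain substrings here, standing in
/// for the real include/exclude regexes. An empty `exclude` excludes nothing.
fn filter_url<W: Write>(
    url: &str,
    include: &str,
    exclude: &str,
    log: &mut W,
) -> io::Result<FilterDecision> {
    let decision = if !url.contains(include) {
        FilterDecision::NotIncluded
    } else if !exclude.is_empty() && url.contains(exclude) {
        FilterDecision::Excluded
    } else {
        FilterDecision::Accepted
    };
    // One tab-separated line per URL: the decision, then the URL itself.
    writeln!(log, "{:?}\t{}", decision, url)?;
    Ok(decision)
}

fn main() -> io::Result<()> {
    // The log target is any `Write`, so stdout and a file both work.
    let mut out = io::stdout();
    let urls = [
        "https://example.com/forum/index.php",
        "https://example.com/forum/logout.php",
        "https://other.example/",
    ];
    for url in urls {
        filter_url(url, "example.com/forum", "logout", &mut out)?;
    }
    Ok(())
}
```

Taking the log target as a generic `Write` keeps the "stdout or a file" choice out of the filtering logic.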
from suckit.
This looks correct. The best way to know is by testing it, and I would love to see the result of such a test. If you can build this directory tree, just serve it with a web server and try running suckit against localhost.
from suckit.
@mr-bo-jangles Did it work?
from suckit.
Maybe we can add an option to output URL filtering information to stdout or a file
Good idea
I also plan to implement functionality to rewrite the local URLs that could profit from this debug feature.
What do you mean?
from suckit.
What do you mean?
To download a phpBB forum, I added a hack to rewrite some URLs, namely to remove the ?sid=<hash> parameter. Otherwise the same pages get downloaded over and over again with different sid hashes.
If you want to take a look:
https://github.com/raphCode/suckit/blob/fusornet_hack/src/scraper.rs#L191
I originally planned to flesh this out into a dedicated feature / command line option, but eventually didn't. I already achieved my goal and I could not figure out a way to do it properly.
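For readers who do not want to dig through the branch, the idea can be sketched roughly as follows. This is not the code from the linked hack: `strip_query_param` is a hypothetical helper, the URLs are made up, and the string splitting is a simplification that a real implementation would probably replace with a proper URL parser.

```rust
use std::collections::HashSet;

/// Remove every query parameter named `param` from `url`.
/// (Hypothetical helper for illustration; purely string-based.)
fn strip_query_param(url: &str, param: &str) -> String {
    // Split off the fragment first so it can be re-attached unchanged.
    let (without_fragment, fragment) = match url.split_once('#') {
        Some((head, frag)) => (head, Some(frag)),
        None => (url, None),
    };
    // No query string at all: nothing to strip.
    let (base, query) = match without_fragment.split_once('?') {
        Some((base, query)) => (base, query),
        None => return url.to_string(),
    };
    // Keep every `name=value` pair whose name differs from `param`.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| pair.split('=').next() != Some(param))
        .collect();
    let mut result = base.to_string();
    if !kept.is_empty() {
        result.push('?');
        result.push_str(&kept.join("&"));
    }
    if let Some(frag) = fragment {
        result.push('#');
        result.push_str(frag);
    }
    result
}

fn main() {
    // Two links to the same topic that differ only in the session id.
    let links = [
        "https://forum.example/viewtopic.php?t=42&sid=aaa111",
        "https://forum.example/viewtopic.php?t=42&sid=bbb222",
    ];
    // After stripping `sid`, both collapse into a single entry in the set.
    let visited: HashSet<String> = links
        .iter()
        .map(|link| strip_query_param(link, "sid"))
        .collect();
    println!("{} unique page(s): {:?}", visited.len(), visited);
}
```

With the parameter removed before the "already visited" check, both links map to the same page and it is fetched only once.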
from suckit.
The problem with removing parameters such as ?sid is that they might change the content of the requested page. If you remove them, two links that are identical except for their parameters will share a single page downloaded by suckit, while they should point to two different pages.
from suckit.
In general you are correct, but in the specific case of phpBB the content is always the same, no matter the ?sid parameter value.
One solution would be to just ignore all links with this parameter, as suggested here, but this may create a swath of broken links. Instead, I removed the parameter from the URL and collapsed all links into their "canonical" form without the session id parameter.
I actually just found a different solution, namely to send session cookies, which avoids ?sid parameters getting appended to links in the first place.
from suckit.
We could imagine a solution where you would have a list of tuples, each with a regex and a list of parameters to remove from URLs matching it:
Vec<(regex, Vec<parameter>)>
But it might be really costly, since every URL would have to be checked against every regex.
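A rough sketch of how such a rule list could be applied is shown below. This is only an illustration of the proposal, not an existing suckit feature: `Rule` and `apply_rules` are hypothetical names, and a plain substring stands in for the regex so that the example compiles with the standard library alone.

```rust
/// One rewrite rule: a URL pattern plus the query parameters to drop when
/// the pattern matches.
/// ASSUMPTION: the pattern is a plain substring here; the proposal uses a regex.
type Rule = (String, Vec<String>);

/// Apply every matching rule to `url` and return the rewritten URL.
/// Fragments (`#...`) are not handled in this sketch.
fn apply_rules(url: &str, rules: &[Rule]) -> String {
    let (base, query) = match url.split_once('?') {
        Some(parts) => parts,
        None => return url.to_string(),
    };
    // Every rule is tested against every URL, which is where the cost
    // mentioned above comes from.
    let to_drop: Vec<&str> = rules
        .iter()
        .filter(|(pattern, _)| url.contains(pattern.as_str()))
        .flat_map(|(_, params)| params.iter().map(String::as_str))
        .collect();
    // Keep only the pairs whose parameter name is not scheduled for removal.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| {
            let name = pair.split('=').next().unwrap_or("");
            !to_drop.contains(&name)
        })
        .collect();
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}

fn main() {
    let rules: Vec<Rule> = vec![(
        "viewtopic.php".to_string(),
        vec!["sid".to_string()],
    )];
    // Matches the rule, so `sid` is dropped.
    println!("{}", apply_rules("https://forum.example/viewtopic.php?t=1&sid=ff", &rules));
    // Does not match the rule, so the URL is left untouched.
    println!("{}", apply_rules("https://forum.example/other.php?sid=ff", &rules));
}
```

Scoping each parameter list to a pattern addresses the earlier concern: parameters are only dropped on pages where they are known not to change the content.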
from suckit.