Comments (8)
How much of a delay would be a good amount and what kind of difference would be enough to not trigger a block?
I'm relatively new to Rust, but I'd like to give this a try, so if there is anything else I should know, please share it.
Edit: Would it be enough to generate a random number from, say, 50 to 100, and add it to the base SLEEP_DURATION here?
from suckit.
@SpyrosRoum the delay you linked to only happens when the queue is empty. If that sleep happens too much, then the application will just end since it means we ran out of links to visit. However, it's possible that the queue is empty for a little while, while some other threads are repopulating it. So we set the inactive threads to sleep for a little bit (SLEEP_DURATION) so that they're not running for nothing.
We need to add another type of delay for when the queue is still populated, so that we don't exceed the website's rate limits. Do you want to work on this? I can assign you to it and you can ask all the questions you need here :)
Thanks a lot for giving it a try!
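The idle sleep described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual worker loop: it assumes a standard-library channel as the queue, and `MAX_EMPTY_POLLS`, `run_worker`, and the concrete `SLEEP_DURATION` value are hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};
use std::thread;
use std::time::Duration;

// Hypothetical values; the real constants in the project may differ.
const SLEEP_DURATION: Duration = Duration::from_millis(10);
const MAX_EMPTY_POLLS: u32 = 5;

/// Drains `queue`, sleeping briefly whenever it is empty and giving up
/// after `MAX_EMPTY_POLLS` consecutive empty polls.
/// Returns the URLs that were handled.
fn run_worker(queue: &Receiver<String>) -> Vec<String> {
    let mut handled = Vec::new();
    let mut empty_polls = 0;
    while empty_polls < MAX_EMPTY_POLLS {
        match queue.try_recv() {
            Ok(url) => {
                empty_polls = 0; // the queue is alive again
                handled.push(url); // stand-in for downloading the page
            }
            Err(TryRecvError::Empty) => {
                // Other threads may still be repopulating the queue,
                // so wait a little instead of spinning.
                empty_polls += 1;
                thread::sleep(SLEEP_DURATION);
            }
            Err(TryRecvError::Disconnected) => break,
        }
    }
    handled
}

fn main() {
    let (tx, rx) = channel();
    tx.send("http://books.toscrape.com".to_string()).unwrap();
    let handled = run_worker(&rx);
    println!("handled {} url(s)", handled.len());
}
```

In this sketch the worker stops once the queue has stayed empty for several polls in a row, which mirrors the idea that too many idle sleeps mean there are no links left to visit.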
Ohh you are right, I didn't pay much attention at all (I just did a ctrl + f for sleep_duration xd)
So we would want to use a random delay only if we successfully got something from the queue. Which means we would sleep after we handled the url in the Ok arm of that same match. Right?
I would agree to be assigned to it, but what I am thinking of feels too easy to not have been done already by one of you guys, so I may have the wrong idea of what I am getting myself into here.
If it is just adding a random delay after handling the message, then sure, I can do that.
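A rough sketch of that idea, sleeping for a random time only in the Ok arm, could look like this. It is an assumption-laden illustration: `random_u64`, `random_delay`, and `handle_message` are hypothetical names, the message type is simplified to `Result<String, ()>`, and the standard library's `RandomState` is used only as a dependency-free stand-in for a real generator such as the rand crate.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::thread;
use std::time::Duration;

/// Std-only stand-in for a real random number generator:
/// each `RandomState` is created with fresh keys.
fn random_u64() -> u64 {
    RandomState::new().build_hasher().finish()
}

/// Returns `base` plus a random extra between zero and `max_extra`
/// (inclusive, in millisecond steps).
fn random_delay(base: Duration, max_extra: Duration) -> Duration {
    let max_extra_ms = max_extra.as_millis() as u64;
    let extra_ms = random_u64() % (max_extra_ms + 1);
    base + Duration::from_millis(extra_ms)
}

/// Handles one queue result and sleeps only when a URL was received.
/// Returns the pause that was taken, if any.
fn handle_message(
    message: Result<String, ()>,
    base: Duration,
    max_extra: Duration,
) -> Option<Duration> {
    match message {
        Ok(url) => {
            println!("downloading {}", url); // stand-in for the real work
            let pause = random_delay(base, max_extra);
            thread::sleep(pause);
            Some(pause)
        }
        // Empty queue: no random delay here; the idle sleep applies instead.
        Err(()) => None,
    }
}

fn main() {
    let base = Duration::from_millis(20);
    let max_extra = Duration::from_millis(50);
    let message = Ok("http://books.toscrape.com".to_string());
    let pause = handle_message(message, base, max_extra);
    println!("paused for {:?}", pause);
}
```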
That sounds like a good idea! The reason we haven't done it yet is that there were other, more important features to implement, and we're also quite busy. It's marked as a good first issue, so it is one of the easy ones, no worries :) Some issues require a bit more understanding of other parts of the code, but this one is fairly simple. I'll assign you :)
I have zero idea about what a good delay would be haha. I'm sure there are some articles related to scraping that can help.
Using a lib is perfectly fine if you can't find the required function in the standard lib!
And regarding the process: you should fork, make your changes, and then open a pull request against our master :) We'll review it then. Thanks again!
Amazing, I'll get right to it
Now back to my original question: do you have any idea what a good base value would be, and how much each sleep should deviate from it?
This will inevitably slow down the whole thing, so I guess we want the delay to be as low as possible.
Also, is using the rand lib acceptable?
Edit: Oh, also, since I am kinda new to GitHub too, should I fork, create a new branch, and then open a pull request against your master from my feature branch?
Alright, I have some good and some bad news
After looking around for a bit, some places suggested a delay of 10 - 20 seconds while some others said about 5 seconds should do it.
So I added a base of 2 seconds and an extra random number from 0 to 5. So in total the delay would be between 2 - 7 seconds
This means running suckit http://books.toscrape.com -j 8 (built for debug) for one minute downloaded 103 pages (which is very similar to the number you got from httrack)
Oh, and the good news is that it's working with a random delay now
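As a rough sanity check on that number, the arithmetic can be sketched like this. The estimate assumes that -j 8 means eight worker threads, that the extra delay is uniform between 0 and 5 seconds (so the mean total delay is 4.5 seconds), and that download time is negligible compared to the sleep; `expected_pages` is a hypothetical helper.

```rust
/// Rough estimate of pages fetched in `window_secs` when every worker
/// sleeps a uniformly random time in `[min_delay, max_delay]` seconds
/// after each page and the download itself is treated as instantaneous.
fn expected_pages(workers: u32, window_secs: f64, min_delay: f64, max_delay: f64) -> f64 {
    let mean_delay = (min_delay + max_delay) / 2.0;
    f64::from(workers) * window_secs / mean_delay
}

fn main() {
    // 8 workers, one minute, delay between 2 and 7 seconds.
    let estimate = expected_pages(8, 60.0, 2.0, 7.0);
    println!("about {:.0} pages per minute", estimate); // prints "about 107 pages per minute"
}
```

Under those assumptions the estimate of roughly 107 pages is close to the 103 pages observed, which suggests the sleep, not the download, dominates the run time.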
Thanks for the work, you rock :)
Some websites don't need such a feature to be scraped. The goal here is to add an option for websites that require this timing limitation.
With this option enabled, yes, SuckIT will be slow, but otherwise the performance will be the same as before.
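One way to keep the delay opt-in is to model it as an optional setting, so that nothing changes when it is absent. The sketch below is only an illustration of that design: `DelaySetting` and `pause_for` are hypothetical names, not the project's API, and the random fraction is passed in by the caller so the function stays deterministic.

```rust
use std::time::Duration;

/// Hypothetical setting; `None` at the call site means the feature is off.
struct DelaySetting {
    base: Duration,
    max_extra: Duration,
}

/// Returns how long to pause after a download.
/// `fraction` is a random value in `[0, 1]` supplied by the caller.
fn pause_for(setting: Option<&DelaySetting>, fraction: f64) -> Duration {
    match setting {
        // Option disabled: no pause, same performance as before.
        None => Duration::ZERO,
        // Option enabled: base delay plus a random share of the extra.
        Some(s) => s.base + s.max_extra.mul_f64(fraction.clamp(0.0, 1.0)),
    }
}

fn main() {
    let setting = DelaySetting {
        base: Duration::from_secs(2),
        max_extra: Duration::from_secs(5),
    };
    println!("disabled: {:?}", pause_for(None, 0.5));
    println!("enabled:  {:?}", pause_for(Some(&setting), 0.5));
}
```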