Comments (9)
Oh strange, it didn't look like that when I was testing this a few days ago; at the time, that page was loading normally in a browser.
In any case, it does appear to be working with that new link as I'm seeing a lot of results pour in now. Sorry for the false alarm.
from bathyscaphe.
Hello there! That's an error on my part.
The scheduler is removing fragments and query parameters from the URL. Removing the query parameters is a mistake, since they may affect the page content. This cleanup should definitely be removed.
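To illustrate the difference between the two kinds of cleanup, here is a minimal Go sketch. It is not the scheduler's actual code: the helper names `stripFragment` and `stripQueryAndFragment` and the host `example.onion` are made up for the example. Dropping the fragment is harmless because it is never sent to the server, while dropping the query string yields a different resource.

```go
package main

import (
	"fmt"
	"net/url"
)

// stripFragment is a hypothetical helper: it removes only the #fragment,
// which is never sent to the server, and leaves the query string untouched.
func stripFragment(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	u.RawFragment = ""
	return u.String(), nil
}

// stripQueryAndFragment is a hypothetical helper showing the over-eager
// cleanup: it also drops the query string, so a search URL collapses to
// the bare script path and no longer identifies the same page.
func stripQueryAndFragment(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.ForceQuery = false
	u.Fragment = ""
	u.RawFragment = ""
	return u.String(), nil
}

func main() {
	// Placeholder URL shaped like the search URL from this issue.
	raw := "http://example.onion/search.cgi?s=DRP&q=irc&cmd=Search%21#results"

	kept, err := stripFragment(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	dropped, err := stripQueryAndFragment(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}

	fmt.Println("fragment removed:          ", kept)
	fmt.Println("query and fragment removed:", dropped)
}
```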
from bathyscaphe.
I was wrong, this cleanup was removed months ago.
Do you mind sharing the scheduler logs so that we can investigate?
from bathyscaphe.
Are there any other logs? All I have are these:
api_1 | time="2020-09-03T05:18:51Z" level=debug msg="Successfully published URL: http://xmh57jrzrnw6insl.onion/4a1f6b371c/search.cgi?s=DRP&q=irc&cmd=Search%21"
scheduler_1 | time="2020-09-03T05:18:51Z" level=debug msg="Processing URL: http://xmh57jrzrnw6insl.onion/4a1f6b371c/search.cgi?s=DRP&q=irc&cmd=Search%21"
Nothing ever hits the crawler.
from bathyscaphe.
This looks strange. Have you tried running ./scripts/log.sh scheduler?
from bathyscaphe.
ya, that's where this one comes from:
scheduler_1 | time="2020-09-03T05:18:51Z" level=debug msg="Processing URL: http://xmh57jrzrnw6insl.onion/4a1f6b371c/search.cgi?s=DRP&q=irc&cmd=Search%21"
from bathyscaphe.
I need to check the DOM of the page; maybe there's something in there that's causing trouble for the crawler.
from bathyscaphe.
$ curl --socks5-hostname localhost:9050 'http://xmh57jrzrnw6insl.onion/4a1f6b371c/search.cgi?s=DRP&q=irc&cmd=Search%21'
[...]
<h2>This domain has been migrated to Onion version 3.</h2>
[...]
From now on, to access <b>Torch: Tor Search Engine</b> service you must use this Onion domain name:<br><br>
<a href="http://xmh57jrknzkhv6y3ls3ubitzfqnkrwxhopf5aygthi7d6rplyvk3noyd.onion"><h3>xmh57jrknzkhv6y3ls3ubitzfqnkrwxhopf5aygthi7d6rplyvk3noyd.onion</h3></a><br><br>
[...]
Looks like these links are no longer valid. The server is returning an HTTP 404 (with no redirection), so there's no way for the crawler to find the page. Maybe try crawling again with the updated Torch URL?
from bathyscaphe.
No problem at all!
BTW I'm a bit curious: what exactly made you give Trandoshan a try? Personal project? Company? Student? Just for fun?
I'm happy to know that people are still interested in it :)
If you're up for it, let's talk about this over whatever communication channel you prefer.
The full list of my contact channels is available here: https://creekorful.dev
from bathyscaphe.