Comments (6)
For now, I will try to update our sitemap.xml file and see if this solves the problem once and for all.
Since this problem occurs at random, I think we will need to go a few months without it recurring before we can be sure that it has been fixed.
I will let you know if this issue happens again after having updated the sitemap.xml file.
Thanks a lot for your support :)
from docsearch-scraper.
Hey @FilippoRezzonico,
Can you confirm you are using the latest version of the docsearch-scraper docker image?
The only case in which an index can be unavailable is at the end of a successful crawl: while the crawler runs, it stores records in a _tmp index, and it renames that index to the production name at the end of the crawl.
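To make the temporary-index flow described above concrete, here is a minimal in-memory sketch. It is only a toy model: `FakeSearchBackend`, `save_records`, `move_index`, and `crawl_and_publish` are hypothetical names invented for illustration, not the scraper's actual code or the Algolia client API.

```python
# Toy in-memory model of the "write to a temporary index, then rename" pattern.
# FakeSearchBackend and its methods are hypothetical; they are not part of
# docsearch-scraper or of the Algolia API client.


class FakeSearchBackend:
    def __init__(self):
        self.indices = {}  # index name -> list of records

    def save_records(self, index_name, records):
        self.indices.setdefault(index_name, []).extend(records)

    def move_index(self, source, destination):
        # Replace the destination with the source in a single step.
        self.indices[destination] = self.indices.pop(source)


def crawl_and_publish(backend, index_name, records):
    tmp_name = f"{index_name}_tmp"
    backend.indices.pop(tmp_name, None)  # start from an empty temporary index
    backend.save_records(tmp_name, records)
    # Production keeps serving the old records until this final swap.
    backend.move_index(tmp_name, index_name)


backend = FakeSearchBackend()
backend.indices["docs"] = [{"url": "/old"}]
crawl_and_publish(backend, "docs", [{"url": "/a"}, {"url": "/b"}])
print(sorted(backend.indices))       # ['docs']
print(len(backend.indices["docs"]))  # 2
```

In this model the production index holds exactly what the latest crawl found, so a crawl that reaches fewer pages would also leave fewer records after the swap.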
from docsearch-scraper.
Hi @shortcuts,
Yes, we are currently using the latest version of the docker image.
We only run the scraper when we release a new version of our documentation, whereas this problem seems to happen randomly. Moreover, releasing our documentation (which launches the docsearch-scraper image) seems to restore the correct functioning of our search bar.
Today, about 3 days after the last occurrence, our search bar is not working again. We noticed the following anomalies:
- The number of records suddenly decreased from 23.8K to 20.4K
- Some records for our top searches seem to have been removed
- Some of our top searches also appeared among the searches without results (it seems they were filtered out before being returned)
Do you think that these sudden changes can actually provide us some hints about the problem that we are facing?
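For reference, the size of the drop reported above can be quantified with a small helper. This is only a sketch: `relative_drop`, `is_suspicious`, and the 10% threshold are illustrative assumptions, not part of docsearch-scraper or any Algolia tooling.

```python
def relative_drop(previous, current):
    """Return the fractional decrease from previous to current (0.0 if none)."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return max(0.0, (previous - current) / previous)


def is_suspicious(previous, current, threshold=0.10):
    # The threshold is an arbitrary illustrative value, not a recommended setting.
    return relative_drop(previous, current) > threshold


drop = relative_drop(23_800, 20_400)
print(f"{drop:.1%}")                  # 14.3%
print(is_suspicious(23_800, 20_400))  # True
```

A check like this could run after each crawl, or on a schedule, to flag the moment the record count changes.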
from docsearch-scraper.
Today, after about 3 days from the last time, our search bar is not working again.
Could you confirm that the index is deleted when the search stops working? If that is the case, I suggest you contact [email protected] (include the link to this issue so you don't have to re-explain it), and they will be able to tell you why your index was deleted.
With our scraper, we always keep the production index up and don't perform delete operations.
We noticed the following anomalies:
If the index is not deleted but only the search does not work, it might be related to some inconsistencies during the crawl. Could you please provide a gist with your config file so I can try it?
- Do you have client-side rendered content? Make sure to use the js_render option and add some delay if needed using js_wait
- If you don't have a sitemap yet, make sure to check our tips for a good search section
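As a rough illustration of the first tip, the sketch below adds the two options to a config before it is written out as JSON. The helper name `enable_js_rendering`, the 2-second wait, and the placeholder index name and URL are assumptions for the example; only the `js_render` and `js_wait` keys come from the scraper's config format.

```python
import json


def enable_js_rendering(config, wait_seconds=2):
    """Return a copy of the config with client-side rendering turned on."""
    updated = dict(config)
    updated["js_render"] = True        # render pages in a browser before extracting
    updated["js_wait"] = wait_seconds  # seconds to wait for the page to load
    return updated


# Placeholder values; replace with your own index name and start URL.
base = {"index_name": "example-docs", "start_urls": ["https://example.com/docs"]}
print(json.dumps(enable_js_rendering(base), indent=2))
```

The original dictionary is left untouched, so the same base config can be crawled with and without rendering to compare hit counts.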
Hope this gives you hints :D
from docsearch-scraper.
Actually, our indexes are not deleted but it seems that their number of records gets reduced after some time. So I think that, as you said, it could be caused by some issues during the crawling.
Here is the config file that we are currently using:
{
  "index_name": "mia-platform-docs",
  "start_urls": [
    "https://docs.mia-platform.eu"
  ],
  "stop_urls": [
    "/$"
  ],
  "selectors": {
    "text": "article p, article li",
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5",
    "lvl6": "article h6",
    "lvl0": {
      "selector": ".menu__link--sublist.menu__link--active",
      "global": true,
      "default_value": "Documentation"
    }
  },
  "sitemap_urls": [
    "https://docs.mia-platform.eu/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "min_indexed_level": 0,
  "conversation_id": [
    "1280385092"
  ],
  "nb_hits": 12708
}
Do you see any possible problem with it?
- Both of the links for js_render and js_wait that you provided in the previous message return a 404 landing page. Have they been moved elsewhere?
- We wait exactly 1 minute after releasing our documentation before launching the Algolia scraper. Do you think that we should increase this amount of time?
- We currently have a sitemap.xml file in our documentation project, but I am pretty sure that it is not up to date, and it may contain pages that no longer exist. Do you think that this could cause the issue?
from docsearch-scraper.
We currently have a sitemap.xml file in our documentation project, but I am pretty sure that it is not updated and it is possible that it contains pages that are not present anymore. Do you think that this could cause the issue?
As the issue is mostly inconsistencies between crawls, it might have an impact, yes.
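One way to check whether a sitemap lists pages that no longer exist is sketched below. It is a minimal example under stated assumptions: `sitemap_urls` and `stale_urls` are hypothetical helpers, the sample sitemap uses placeholder URLs, and the liveness check is injected as a function so the example runs offline (in practice it could issue an HTTP request per URL).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Extract every <loc> value from a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def stale_urls(urls, is_live):
    """Return the URLs for which the injected is_live check fails."""
    return [url for url in urls if not is_live(url)]


# Placeholder sitemap; a real one would be read from the sitemap.xml file.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/removed-page</loc></url>
</urlset>"""

# Stand-in for a real HTTP check: only these pages are considered live.
live_pages = {"https://example.com/docs/intro"}

print(stale_urls(sitemap_urls(SAMPLE), lambda url: url in live_pages))
# ['https://example.com/docs/removed-page']
```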
Both the links of js_render and js_wait that you provided me in the previous message return a 404 landing page, have they been moved elsewhere?
The new doc has been deployed since then, links are now at js_render, js_wait, sorry! :D
On my side, I had 33740 hits without client-side rendering, and 32574 with it.
We wait exactly 1 minute after releasing our documentation to launch the algolia scraper, do you think that we should increase this amount of time?
There's no caching on our side so it shouldn't have an impact.
from docsearch-scraper.