Algolia index records reduction after an undefined amount of time (docsearch-scraper, open issue)

FilippoRezzonico commented on June 16, 2024

Algolia index records reduction after an undefined amount of time

from docsearch-scraper.

Comments (6)

FilippoRezzonico commented on June 16, 2024

For now, I will update our sitemap.xml file and see whether that solves the problem once and for all.
Since the problem occurs at random, we will probably need a few months without a recurrence before we can be sure it is fixed.
I will let you know if the issue happens again after the sitemap.xml update.
Thanks a lot for your support :)

shortcuts commented on June 16, 2024

Hey @FilippoRezzonico,

Can you confirm you are using the latest version of the docsearch-scraper docker image?

The only time an index can be unavailable is at the end of a successful crawl: while the crawler runs, it stores records in a _tmp index, then renames that index to the production name when the crawl finishes.
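
A toy sketch of the "_tmp index, then rename" flow described above may help picture it. Everything here is hypothetical: FakeSearchBackend and its methods are an in-memory stand-in, not the Algolia API or the scraper's actual code.

```python
# Toy in-memory model of the "write to _tmp, then rename" pattern.
# FakeSearchBackend and its methods are hypothetical, not the Algolia API.

class FakeSearchBackend:
    def __init__(self):
        self.indices = {}  # index name -> list of records

    def save_records(self, index_name, records):
        self.indices.setdefault(index_name, []).extend(records)

    def move_index(self, source, destination):
        # Replace the destination with the source in a single step.
        self.indices[destination] = self.indices.pop(source)


def crawl_and_publish(backend, index_name, crawled_records):
    tmp_name = index_name + "_tmp"
    backend.indices.pop(tmp_name, None)  # start from an empty temporary index
    backend.save_records(tmp_name, crawled_records)
    # Production keeps serving the old records until this final swap.
    backend.move_index(tmp_name, index_name)


backend = FakeSearchBackend()
backend.indices["docs"] = [{"url": "/old"}]
crawl_and_publish(backend, "docs", [{"url": "/a"}, {"url": "/b"}])
print(len(backend.indices["docs"]))  # record count after the swap
```

In this model, the production index after a crawl holds exactly what that crawl collected, so a crawl that reaches fewer pages would produce a smaller index.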

FilippoRezzonico commented on June 16, 2024

Hi @shortcuts,
Yes, we are currently using the latest version of the docker image.
We only run the scraper when we release a new version of our documentation, whereas these problems seem to happen at random. Moreover, releasing our documentation (which launches the docsearch-scraper image) seems to restore the search bar to working order.
Today, about 3 days after the last occurrence, our search bar stopped working again. We noticed the following anomalies:

The number of records suddenly dropped from 23.8K to 20.4K
Some records for our top searches seem to have been removed
Some of our top searches also appeared among the searches without results (as if the records had been filtered out before being returned)

Do you think these sudden changes could give us some hints about the problem we are facing?
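
To put a number on the drop reported above, here is a minimal sketch that computes the relative loss between two record counts. The function names and the 10% alert threshold are illustrative assumptions, not part of the scraper or of Algolia.

```python
def relative_drop(previous_count, current_count):
    """Return the fraction of records lost between two snapshots (0.0 if none)."""
    if previous_count <= 0:
        return 0.0
    return max(0.0, (previous_count - current_count) / previous_count)


def should_alert(previous_count, current_count, threshold=0.10):
    # threshold is an arbitrary illustrative value, not a recommended setting
    return relative_drop(previous_count, current_count) >= threshold


# Counts reported in this thread: 23.8K -> 20.4K
drop = relative_drop(23_800, 20_400)
print(f"{drop:.1%}")  # 14.3%
```

A check like this, run on a schedule against the index's record count, could help pin down when the reduction happens relative to the last crawl.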

shortcuts commented on June 16, 2024

Today, about 3 days after the last occurrence, our search bar stopped working again.

Could you confirm that the index is deleted when the search stops working? If so, I suggest you contact [email protected] (include a link to this issue so you don't have to explain it again); they will be able to tell you why your index was deleted.

With our scraper, we always keep the production index up and never perform delete operations.

We noticed the following anomalies:

If the index is not deleted and only the search stops working, it might be related to inconsistencies during the crawl. Could you please share a gist with your config file so I can try it?

Do you have client-side rendered content? If so, make sure to use the js_render option and, if needed, add a delay with js_wait.
If you don't have a sitemap yet, make sure to check our "tips for a good search" section.
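
As a rough sketch of the first tip, the snippet below merges the two options into a config before it is written out. It assumes js_render is a boolean and js_wait is a delay in seconds; the 2-second value and the helper name with_js_rendering are made up for illustration.

```python
import json

# Fragment of a scraper config with client-side rendering enabled.
# Assumption: js_render is a boolean and js_wait is a number of seconds.
# The 2-second wait is an illustrative value, not a recommendation.
config_overrides = {
    "js_render": True,  # render pages in a browser before extracting content
    "js_wait": 2,       # extra time to let client-side JavaScript finish
}


def with_js_rendering(config, overrides=config_overrides):
    """Return a copy of config with the JS rendering options merged in."""
    merged = dict(config)
    merged.update(overrides)
    return merged


base = {
    "index_name": "mia-platform-docs",
    "start_urls": ["https://docs.mia-platform.eu"],
}
print(json.dumps(with_js_rendering(base), indent=2))
```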

Hope this gives you hints :D

FilippoRezzonico commented on June 16, 2024

Actually, our indexes are not deleted, but their number of records seems to drop after some time. So I think that, as you said, it could be caused by issues during crawling.
Here is the config file that we are currently using:

{
  "index_name": "mia-platform-docs",
  "start_urls": [
    "https://docs.mia-platform.eu"
  ],
  "stop_urls": [
    "/$"
  ],
  "selectors": {
    "text": "article p, article li",
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5",
    "lvl6": "article h6",
    "lvl0": {
      "selector": ".menu__link--sublist.menu__link--active",
      "global": true,
      "default_value": "Documentation"
    }
  },
  "sitemap_urls": [
    "https://docs.mia-platform.eu/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "min_indexed_level": 0,
  "conversation_id": [
    "1280385092"
  ],
  "nb_hits": 12708
}

Do you see any possible problem with it?
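
One thing that can be checked locally is which URLs the stop_urls entry "/$" would exclude. The sketch below assumes each stop_urls entry is a regular expression searched against the full URL; the example paths are made up for illustration.

```python
import re

stop_urls = ["/$"]  # taken from the config above


def is_stopped(url, patterns=stop_urls):
    # Assumption: each pattern is a regex searched anywhere in the URL.
    return any(re.search(p, url) for p in patterns)


# Example URLs; the paths are hypothetical.
for url in [
    "https://docs.mia-platform.eu/",
    "https://docs.mia-platform.eu/docs/overview",
    "https://docs.mia-platform.eu/docs/overview/",
]:
    print(url, is_stopped(url))
```

Under that assumption, any URL ending in a trailing slash is skipped, so it may be worth checking whether the sitemap and the internal links use trailing slashes consistently.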

Both links you provided for js_render and js_wait in the previous message lead to a 404 page. Have they been moved elsewhere?
We wait exactly 1 minute after releasing our documentation before launching the Algolia scraper. Do you think we should increase this delay?
We currently have a sitemap.xml file in our documentation project, but I am pretty sure it is out of date and may list pages that no longer exist. Do you think this could cause the issue?
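
A minimal sketch of how a stale sitemap could be detected: parse the `<loc>` entries and report those that no longer resolve. The is_live callback is a placeholder (in practice it might issue an HTTP request); here it is a simple set lookup, and the docs.example.com URLs are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def stale_urls(sitemap_xml, is_live):
    """Return sitemap URLs for which is_live(url) is False."""
    return [url for url in sitemap_urls(sitemap_xml) if not is_live(url)]


# Hypothetical sitemap content and set of pages that still exist.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/a</loc></url>
  <url><loc>https://docs.example.com/removed</loc></url>
</urlset>"""
live_pages = {"https://docs.example.com/a"}

print(stale_urls(sample, lambda url: url in live_pages))
# ['https://docs.example.com/removed']
```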

shortcuts commented on June 16, 2024

We currently have a sitemap.xml file in our documentation project, but I am pretty sure it is out of date and may list pages that no longer exist. Do you think this could cause the issue?

As the issue is mostly inconsistencies between crawls, it might have an impact, yes.

Both links you provided for js_render and js_wait in the previous message lead to a 404 page. Have they been moved elsewhere?

The new docs have been deployed since then; the links are now at js_render and js_wait, sorry! :D

On my side, I had 33740 hits without client-side rendering, and 32574 with it.

We wait exactly 1 minute after releasing our documentation before launching the Algolia scraper. Do you think we should increase this delay?

There's no caching on our side so it shouldn't have an impact.
