
Comments (6)

FilippoRezzonico commented on June 16, 2024

For now, I will try to update our sitemap.xml file and see if this solves the problem once and for all.
Since this problem occurs randomly, I think we will need to go some months without it happening again before we can be sure that it has been fixed.
I will let you know if the issue happens again after we have updated the sitemap.xml file.
Thanks a lot for your support :)

from docsearch-scraper.

shortcuts commented on June 16, 2024

Hey @FilippoRezzonico,

Can you confirm you are using the latest version of the docsearch-scraper docker image?

The only case in which an index can be unavailable is at the end of a successful crawl: when the crawler runs, it stores records in a _tmp index and renames that index to the production name at the end of the crawl.
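The temporary-index-then-rename flow described above can be sketched as follows. This is a toy illustration, not the scraper's code: `InMemoryIndexStore`, `save_records`, `move_index`, and `crawl_and_publish` are hypothetical names, and the store is a plain dictionary rather than a real search backend.

```python
class InMemoryIndexStore:
    """Toy stand-in for a search backend (hypothetical, not the Algolia API)."""

    def __init__(self):
        self.indices = {}

    def save_records(self, index_name, records):
        # Append records to the named index, creating it if needed.
        self.indices.setdefault(index_name, []).extend(records)

    def move_index(self, source, destination):
        # Replace the destination with the source in a single step.
        self.indices[destination] = self.indices.pop(source)


def crawl_and_publish(store, index_name, crawled_records):
    """Write a crawl to '<index_name>_tmp', then swap it into production."""
    tmp_name = index_name + "_tmp"
    store.indices.pop(tmp_name, None)  # start from an empty temporary index
    store.save_records(tmp_name, crawled_records)
    store.move_index(tmp_name, index_name)
    return len(store.indices[index_name])


store = InMemoryIndexStore()
store.save_records("docs", [{"url": "/old-page"}])
published = crawl_and_publish(store, "docs", [{"url": "/a"}, {"url": "/b"}])
print(published)
```

In this sketch the production index stays readable until the final swap, and its content afterwards is exactly what the latest crawl collected, so a crawl that reaches fewer pages would publish fewer records.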


FilippoRezzonico commented on June 16, 2024

Hi @shortcuts,
Yes, we are currently using the latest version of the Docker image.
We only run the scraper when we release a new version of our documentation, whereas this problem seems to happen randomly. Moreover, releasing our documentation (which launches the docsearch-scraper image) seems to restore the correct functioning of our search bar.
Today, after about 3 days from the last time, our search bar is not working again. We noticed the following anomalies:

  • The number of records suddenly decreased from 23.8K to 20.4K
    (screenshot: monitoring)
  • Some records of our top searches seem to have been removed
    (screenshot: template_record)
  • Some of our top searches also appear among the searches without results (it seems they were filtered out before being returned)
    (screenshot: advanced)

Do you think that these sudden changes can give us some hints about the problem that we are facing?
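To put the reported drop in numbers, here is a minimal sketch that computes the relative decrease between two record counts and flags it above a threshold. The function names and the 10% threshold are assumptions chosen for illustration; they are not part of docsearch-scraper or Algolia.

```python
def record_drop_ratio(previous_count, current_count):
    """Return the fraction of records lost between two measurements."""
    if previous_count <= 0:
        raise ValueError("previous_count must be positive")
    return (previous_count - current_count) / previous_count


def is_suspicious_drop(previous_count, current_count, threshold=0.10):
    """Flag a drop larger than `threshold` (an arbitrary, assumed limit)."""
    return record_drop_ratio(previous_count, current_count) > threshold


# Counts reported in this thread: 23.8K records before, 20.4K after.
ratio = record_drop_ratio(23_800, 20_400)
print(f"{ratio:.1%}")  # prints "14.3%"
print(is_suspicious_drop(23_800, 20_400))  # prints "True"
```

A check like this could run after each crawl, comparing the new record count with the previous one before anyone notices missing search results.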


shortcuts commented on June 16, 2024

> Today, after about 3 days from the last time, our search bar is not working again.

Could you confirm whether the index is deleted when the search stops working? If that is the case, I suggest you contact [email protected] (and include the link to this issue so you don't have to re-explain it); they will be able to tell you why your index was deleted.

With our scraper, we always keep the production index up and don't perform delete operations.

> We noticed the following anomalies:

If the index is not deleted but only the search does not work, it might be related to some inconsistencies during the crawl. Could you please provide a gist with your config file so I can try it?

  • Do you have client-side rendered content? Make sure to use the js_render option and add some delay if needed using js_wait
  • If you don't have a sitemap yet, make sure to check our tips for a good search section
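As a rough illustration of the first tip, the sketch below adds the js_render and js_wait options to a scraper config and prints the resulting JSON. The base config values, the helper name `with_js_rendering`, and the 2-second wait are placeholders assumed for the example; the right delay depends on how long the site takes to render.

```python
import json


def with_js_rendering(config, wait_seconds):
    """Return a copy of `config` with client-side rendering enabled."""
    updated = dict(config)
    updated["js_render"] = True        # render pages in a browser before scraping
    updated["js_wait"] = wait_seconds  # delay before reading the page (assumed value)
    return updated


# Placeholder config, for illustration only.
base = {
    "index_name": "example-docs",
    "start_urls": ["https://example.com/docs"],
}

config = with_js_rendering(base, wait_seconds=2)
print(json.dumps(config, indent=2))
```

The original dictionary is left untouched, so the same base config can be crawled with and without rendering to compare the number of hits.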

Hope this gives you hints :D


FilippoRezzonico commented on June 16, 2024

Actually, our indexes are not deleted, but their number of records seems to drop after some time. So I think that, as you said, it could be caused by some issues during the crawl.
Here is the config file that we are currently using:

{
  "index_name": "mia-platform-docs",
  "start_urls": [
    "https://docs.mia-platform.eu"
  ],
  "stop_urls": [
    "/$"
  ],
  "selectors": {
    "text": "article p, article li",
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5",
    "lvl6": "article h6",
    "lvl0": {
      "selector": ".menu__link--sublist.menu__link--active",
      "global": true,
      "default_value": "Documentation"
    }
  },
  "sitemap_urls": [
    "https://docs.mia-platform.eu/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "min_indexed_level": 0,
  "conversation_id": [
    "1280385092"
  ],
  "nb_hits": 12708
}

Do you see any possible problem with it?

  • Both the links of js_render and js_wait that you provided me in the previous message return a 404 landing page, have they been moved elsewhere?
  • We wait exactly 1 minute after releasing our documentation to launch the algolia scraper, do you think that we should increase this amount of time?
  • We currently have a sitemap.xml file in our documentation project, but I am pretty sure that it is not updated and it is possible that it contains pages that are not present anymore. Do you think that this could cause the issue?


shortcuts commented on June 16, 2024

> We currently have a sitemap.xml file in our documentation project, but I am pretty sure that it is not updated and it is possible that it contains pages that are not present anymore. Do you think that this could cause the issue?

As the issue is mostly inconsistencies between crawls, it might have an impact, yes.
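One way to check whether a sitemap lists pages that no longer exist is sketched below, using only the Python standard library. The helper names are hypothetical, the sample sitemap and URLs are invented for the example, and the sketch assumes a plain `urlset` sitemap (a sitemap index file would need an extra step).

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Extract every <loc> URL from a urlset sitemap given as a string."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def page_exists(url, timeout=10):
    """Return False when the server answers with an HTTP error such as 404."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except urllib.error.HTTPError:
        return False


def stale_urls(urls, exists=page_exists):
    """Return the URLs for which the `exists` check fails."""
    return [url for url in urls if not exists(url)]


# Invented sample; a real run would read the site's own sitemap.xml.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/docs/a</loc></url>"
    "<url><loc>https://example.com/docs/b</loc></url>"
    "</urlset>"
)
urls = sitemap_urls(sample)
# A fake check is injected here so the example runs without network access.
print(stale_urls(urls, exists=lambda url: url.endswith("/a")))
```

Passing the existence check as a parameter keeps the parsing logic testable offline; calling `stale_urls(urls)` without it would issue real HEAD requests.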

> Both the links of js_render and js_wait that you provided me in the previous message return a 404 landing page, have they been moved elsewhere?

The new docs have been deployed since then, so the links are now at js_render and js_wait, sorry! :D

On my side, I had 33740 hits without client-side rendering, and 32574 with it.

> We wait exactly 1 minute after releasing our documentation to launch the algolia scraper, do you think that we should increase this amount of time?

There's no caching on our side so it shouldn't have an impact.
