Hello Algolia Devs, I tried to add search function to <a href="https

Crawler does not seem to work on websites that use shadowDOM about docsearch-scraper HOT 3 OPEN

mantou132 commented on June 24, 2024

Crawler does not seem to work on websites that use shadowDOM

from docsearch-scraper.

Comments (3)

shortcuts commented on June 24, 2024

Hi,

Indeed, it doesn't seem possible to access the DOM via query selectors when the content sits behind an open shadow root. I don't know much about shadow DOM, but it might be possible to make it work.

As long as you can select something with a query selector from the console, our scraper will be able to get it, so you will be able to use DocSearch!


mantou132 commented on June 24, 2024

The content of a shadow DOM cannot be selected through a CSS selector or XPath.

To select shadow DOM content in a CSS-selector-like way, the selector syntax needs to be extended, for example by using >> (from an outdated specification) to mark each shadow DOM boundary: gem-book >> gem-book-sidebar >> gem-active-link. When evaluating such a selector, replace each >> with a shadowRoot lookup, for example:

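// Resolve a '>>'-separated selector; each '>>' crosses one shadow DOM boundary.
'body gem-book >> gem-book-sidebar >> gem-active-link >> a[href]'.split('>>').reduce(
  (p, c, index, arr) => {
    const isLastSelector = index === arr.length - 1;
    // Keep the matched elements for the last part, otherwise descend into their shadow roots.
    return p
      .flatMap((e) => [...e.querySelectorAll(c)].map((ce) => (isLastSelector ? ce : ce.shadowRoot)))
      .filter(Boolean); // skip elements that have no open shadow root
  },
  [document],
);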

This example runs in the browser; if Selenium is used, there should be a similar API.
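For the Selenium side, the same traversal might look roughly like the Python sketch below. It is not taken from docsearch-scraper: `split_shadow_selector` and `query_shadow_all` are hypothetical helper names, and the code only assumes that the root object offers `find_elements(by, value)` and that matched elements expose a `shadow_root` attribute which raises when there is no shadow root, which is how Selenium 4 behaves as far as I know.

```python
# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR,
# spelled out so this sketch does not need Selenium to be importable.
CSS = "css selector"


def split_shadow_selector(selector):
    """Split 'a >> b >> c' into ['a', 'b', 'c'], trimming whitespace."""
    return [part.strip() for part in selector.split(">>") if part.strip()]


def query_shadow_all(root, selector):
    """Resolve a '>>'-separated selector, crossing one shadow boundary per '>>'.

    `root` is any object with find_elements(by, value), for example a
    Selenium WebDriver (assumption: not verified against the scraper).
    """
    parts = split_shadow_selector(selector)
    scopes = [root]
    for index, part in enumerate(parts):
        # Collect matches for this part in every current scope.
        matches = [el for scope in scopes for el in scope.find_elements(CSS, part)]
        if index == len(parts) - 1:
            return matches
        # Not the last part: descend into each match's shadow root.
        scopes = []
        for el in matches:
            try:
                scopes.append(el.shadow_root)
            except Exception:
                # Element has no shadow root; skip it.
                continue
    return []
```

With a real driver, a call such as `query_shadow_all(driver, "body gem-book >> gem-book-sidebar >> gem-active-link >> a[href]")` would be the counterpart of the browser snippet, though this has not been tried against the scraper itself.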


mantou132 commented on June 24, 2024

Hi, I looked at the source code today and found that only a small change is needed to support shadow DOM.

https://github.com/algolia/docsearch-scraper/blob/master/scraper/src/custom_downloader_middleware.py#L31

The custom downloader can be used to pull the entire DOM, including shadow roots:

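# pseudocode
# getInnerHTML() is the serialization call described in the article linked below;
# shadow roots are only included when includeShadowRoots is set.
driver.execute_script("return document.documentElement.getInnerHTML({includeShadowRoots: true});")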

https://web.dev/declarative-shadow-dom/

This returns a result like:

<head>...</head>
<body>
<gem-book>
<template shadowroot="open">
... content
</template>
</gem-book>
</body>
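As a rough sketch of how this could be wrapped for a downloader, the helper below loads a page and returns the serialized DOM. The names `fetch_serialized_dom` and `SERIALIZE_DOM_SCRIPT` are hypothetical and do not come from docsearch-scraper; the only assumptions about `driver` are the `get(url)` and `execute_script(script)` methods that a Selenium WebDriver provides. Since `getInnerHTML()` is not available in every browser, the script falls back to plain `innerHTML`, which leaves shadow roots out.

```python
# JavaScript run inside the page. Selenium wraps the script in a function
# body, so a top-level `return` is allowed here.
SERIALIZE_DOM_SCRIPT = """
const root = document.documentElement;
if (typeof root.getInnerHTML === 'function') {
  // Serialize shadow roots as <template> elements.
  return root.getInnerHTML({includeShadowRoots: true});
}
// Fallback: regular serialization, shadow roots are NOT included.
return root.innerHTML;
"""


def fetch_serialized_dom(driver, url):
    """Load `url` and return the page HTML, with shadow roots when supported.

    `driver` is assumed to behave like a Selenium WebDriver; this is an
    illustration, not the scraper's actual middleware code.
    """
    driver.get(url)
    return driver.execute_script(SERIALIZE_DOM_SCRIPT)
```

The returned string could then stand in for the page source that the middleware passes on to the scraper, assuming the rest of the pipeline accepts it unchanged.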

~~Next, we only need to remove all <template> tags (keeping their content), perhaps with a regular expression.~~
