Comments (3)

shortcuts commented on June 20, 2024

Hi,

It does indeed seem that the DOM inside an open shadow root cannot be reached with query selectors. I don't know much about Shadow DOM, but it might be possible to make it work.

As long as you can select something with querySelector from the console, our scraper will be able to get it, so you will be able to use DocSearch!

from docsearch-scraper.

mantou132 commented on June 20, 2024

Shadow DOM content cannot be selected with a plain CSS selector or XPath.

To select Shadow DOM content in a CSS-selector-like way, the selector syntax has to be extended, for example by using >> (from an outdated specification) to mark each Shadow DOM boundary: gem-book >> gem-book-sidebar >> gem-active-link. When evaluating such a selector, replace each >> with a shadowRoot access, for example:

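// Treat ">>" as a Shadow DOM boundary and resolve the selector one segment at a time.
'body gem-book >> gem-book-sidebar >> gem-active-link >> a[href]'.split('>>').reduce(
  (roots, selector, index, arr) => {
    const isLastSelector = index === arr.length - 1;
    return roots.flatMap((root) =>
      [...root.querySelectorAll(selector)]
        // For intermediate segments, descend into each match's shadow root.
        .map((el) => (isLastSelector ? el : el.shadowRoot))
        // Skip elements without an open shadow root (shadowRoot is null).
        .filter(Boolean),
    );
  },
  [document],
);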

[Screenshot: Screen Shot 2021-05-18 at 4 58 11 PM]

This is an example for use in the browser; with Selenium, there should be a similar API.
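For Selenium, a rough Python equivalent of the idea might look like the sketch below. It assumes Selenium 4, where elements expose a shadow_root property and both the driver and a shadow root accept find_elements(by, value). The helper name select_through_shadow is hypothetical and not part of docsearch-scraper. The string "css selector" is the value of Selenium's By.CSS_SELECTOR, written out so the sketch has no imports.

```python
CSS = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def select_through_shadow(root, selector, boundary=">>"):
    """Resolve a selector whose segments are separated by shadow boundaries.

    Assumption: `root` is a Selenium WebDriver (or any object with
    find_elements(by, value)) and matched elements expose `shadow_root`.
    """
    segments = [part.strip() for part in selector.split(boundary)]
    scopes = [root]
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        next_scopes = []
        for scope in scopes:
            for element in scope.find_elements(CSS, segment):
                if is_last:
                    # Final segment: collect the matched elements themselves.
                    next_scopes.append(element)
                    continue
                try:
                    # Intermediate segment: continue inside the shadow root.
                    next_scopes.append(element.shadow_root)
                except Exception:
                    # No open shadow root on this element; skip it.
                    pass
        scopes = next_scopes
    return scopes
```

With a real driver, this would be called as select_through_shadow(driver, "body gem-book >> gem-book-sidebar >> gem-active-link >> a[href]").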

mantou132 commented on June 20, 2024

Hi, I looked at the source code today and found that only a small change is needed to support Shadow DOM.

https://github.com/algolia/docsearch-scraper/blob/master/scraper/src/custom_downloader_middleware.py#L31

A custom downloader can be used to pull the whole DOM, shadow roots included:

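# pseudocode: `driver` is the Selenium WebDriver used by the downloader middleware.
# getInnerHTML() is the Chrome API described in the article linked below;
# shadow roots are only serialized when includeShadowRoots is set.
driver.execute_script("return document.documentElement.getInnerHTML({includeShadowRoots: true});")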

https://web.dev/declarative-shadow-dom/

This returns something like:

<head>...</head>
<body>
<gem-book>
<template shadowroot="open">
... content
</template>
</gem-book>
</body>

Next, we only need to remove all the <template> tags (keeping their content), perhaps with a regular expression.
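A minimal sketch of that last step in Python, assuming the serialized HTML is already available as a string. The helper name unwrap_templates is hypothetical. The regular expression removes every opening and closing <template> tag and leaves the content in between, so it would also unwrap ordinary <template> elements, and it would break on an attribute value containing ">". An HTML parser would be more robust for such pages.

```python
import re

# Matches <template ...> and </template>, but not the content between them.
TEMPLATE_TAG = re.compile(r"</?template\b[^>]*>", re.IGNORECASE)


def unwrap_templates(html: str) -> str:
    """Drop all <template> tags from `html` while keeping their content."""
    return TEMPLATE_TAG.sub("", html)


if __name__ == "__main__":
    page = (
        "<body><gem-book>"
        '<template shadowroot="open"><p>content</p></template>'
        "</gem-book></body>"
    )
    print(unwrap_templates(page))
```

After this step, the former shadow content sits directly inside its host element, so ordinary CSS selectors can reach it.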
