
docsearch-scraper's Introduction

DocSearch scraper

DEPRECATED

This repository is no longer maintained, in favor of our new infrastructure.

You can still run your own solution, but we won't add any new features to the DocSearch Scraper.

Summary

If you're looking for a way to add DocSearch to your site, the easiest solution is to apply to DocSearch. If you want to run the scraper yourself, you're in the right place.

Installation and Usage

Please check the dedicated documentation to see how you can install and run DocSearch yourself.

This project supports Python 3.6+



docsearch-scraper's Issues

Allow not searching in the url

We always add the url,anchor field to attributesToIndex, and many documentation sites have URLs that follow the site hierarchy.

This means that if I have a page at /foobar/security.html, it will always be returned by a search for foobar, even if foobar does not appear anywhere in the security.html page.

I'm wondering if searching in url,anchor should be kept as a default or moved behind an opt-in config.
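In the meantime, a config could presumably work around this through the custom_settings block shown in later configs on this page, overriding attributesToIndex so that url,anchor is left out. The attribute names below are illustrative, not the scraper's actual record schema:

{
  "custom_settings": {
    "attributesToIndex": ["lvl0", "lvl1", "lvl2", "text"]
  }
}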

List of pending documentations

I'll aggregate here the list of pending documentation requests from HelpScout, grouped by the missing feature needed to complete them. Feel free to pick them up or add new ones.

Not in English

Needs JavaScript parsing

Other

hull.io's selectors are possibly wrong

It seems none of the Help pages on the website (those starting with https://www.hull.io/help) are handled by the current selectors, i.e. they do not produce any records. Possibly because of .documentation-body.

Travis

Only create the dev image.

Allowing global selectors

The selectors we can currently specify are only scoped selectors. This means that, for any given lvlX element, in order to find its lvl{X-1} and lvl{X-2} parents, we have to walk up the DOM tree.

It works well for correctly crafted HTML, but fails as soon as the hierarchy does not follow the HTML DOM order. This often happens when the hierarchy is in a sidebar while the content is in the main part of the page. With our current scoped selectors, we cannot build such a hierarchy.

We need to allow global selectors that, instead of walking up the DOM tree, will simply try to select an element anywhere in the page.

A possible API would be to allow passing an object for each selector instead of a string.

{
  "selectors": {
    "lvl0": {
      "selector": "#accordion a.category-title:not(.collapsed)",
      "global": true
    },
    "lvl1": ".content h2",
    "lvl2": ".content h3",
    "lvl3": ".content h4",
    "lvl4": ".content h5",
    "text": ".content p"
  }
}

We keep backward compatibility by converting plain strings into objects with global: false and a selector key.

Checklist

  • Should still allow the string-based syntax
  • ...as well as an object with global: false as the default
  • If global: true, use the result of the first match of the selector
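To make the intent concrete, here is a minimal sketch of the normalization and lookup, assuming lxml's cssselect API; the function names are hypothetical, not the scraper's actual internals:

def normalize_selector(value):
    # Convert the legacy string syntax to the object syntax.
    if isinstance(value, str):
        return {"selector": value, "global": False}
    return {"global": False, **value}

def resolve(dom, selector_config):
    # A global selector takes the first match anywhere in the page,
    # instead of walking up the DOM tree from the current element.
    matches = dom.cssselect(selector_config["selector"])
    if selector_config["global"]:
        return matches[:1]
    return matches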

Do not build any records with lvl0

On some websites, like minutedock.com, even though the documentation is well organized into main categories (lvl0), each section/article has its own page.


As a consequence, the record reflecting the lvl0 is created on every single page.


I definitely want that lvl0, to group results and display the dropdown menu nicely; but I don't want the lvl0-only records. So far I've only hit this issue with lvl0.

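For reference, the min_indexed_level option that appears in one of the configs further down this page looks like the relevant knob here: assuming it skips records whose deepest matched level is below the threshold, setting it to 1 would drop the lvl0-only records while keeping lvl0 as a grouping attribute on deeper ones.

{
  "min_indexed_level": 1
}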

Reduce Docker container build time

The Docker container build time has increased too much since we added the Splash instance to it. I'm currently splitting the original Dockerfile in two, in order to have a base image with all dependencies preinstalled.

Allowing XPath for global selectors

On some documentation sites (see http://doc.craft.ai/tutorials/doc/1/index.html or any GitBook doc), the hierarchy is in the sidebar and not in the main markup.

We can usually fix that with global selectors (see #32), but sometimes the selector is too complex to be handled by CSS alone, and we need to resort to XPath.


Here, we first need to grab the li.active, go up to its closest li parent, then down to the first a. This cannot be achieved with CSS, but can be done with //li[@class="chapter active done"]/../../a in XPath.

Checklist

  • Allow passing XPath instead of CSS selectors
  • Limit it to global selectors? (throw an error if not)
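As a sketch, dispatching between CSS and XPath could be as simple as the following, using lxml (already in the stack, per the tracebacks on this page); the leading-slash heuristic is an assumption, not the implemented behavior:

from lxml import html

def select_global(dom, selector):
    # Treat selectors starting with '/' as XPath, anything else as CSS.
    if selector.startswith("/"):
        return dom.xpath(selector)
    return dom.cssselect(selector)

dom = html.fromstring(open("page.html").read())
links = select_global(dom, '//li[@class="chapter active done"]/../../a')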

Allow bypassing the lvl0

For documentation sites that have only one or two levels, the default display adds too many constraints. We usually do not need the lvl0, and would rather start the hierarchy at lvl1 and display only two columns, without the horizontal bars.

Using default values for lvl0, we could easily set it to a hardcoded "Documentation" string, and then hide the bar in the front end using CSS.

But this would leave us with a lvl0 of Documentation that is searchable, so if a user searches for "Documentation", all the records would match.

So we need a way to completely bypass the lvl0. The proposal is to allow a value of None for the lvl0, which will trigger a specific behavior:

  • lvl0 will be hardcoded to Documentation for all records
  • lvl0 will not be added to the ranking

In a second step, we will add a removeTitle option to docsearch.js that will automatically hide the horizontal bar. In the meantime, these defaults will simply group all the results under the same header while disallowing searching in it.

Checklist

  • Allow passing None to selectors.lvl0
  • This is the same as an empty selector plus a default_value of "Documentation"
  • Removes lvl0 from all ranking
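Under this proposal, a two-level documentation could then be configured along these lines (JSON null standing in for Python's None; the selectors are illustrative):

{
  "selectors": {
    "lvl0": null,
    "lvl1": ".content h2",
    "text": ".content p"
  }
}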

TypeError: expected string or buffer

I'm getting this error on a number of different configs (for example, I received it on lodash and chef).

TypeError: expected string or buffer
https://docs.chef.io/nodes.html
2015-12-31 15:48:04 [scrapy] ERROR: Spider error processing <GET https://docs.chef.io/nodes.html> (referer: https://docs.chef.io/)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 67, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/Users/dustin/Documents/code/documentation-scrapper/src/documentation_spider.py", line 51, in callback
    records = self.strategy.get_records_from_response(response)
  File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 26, in get_records_from_response
    records = self.get_records_from_dom()
  File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 51, in get_records_from_dom
    nodes_per_level[level] = self.cssselect(level_selector)
  File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/abstract_strategy.py", line 84, in cssselect
    return CSSSelector(selector)(self.dom)
  File "/usr/local/lib/python2.7/site-packages/lxml/cssselect.py", line 94, in __init__
    path = translator.css_to_xpath(css)
  File "/usr/local/lib/python2.7/site-packages/cssselect/xpath.py", line 192, in css_to_xpath
    for selector in parse(css))
  File "/usr/local/lib/python2.7/site-packages/cssselect/parser.py", line 341, in parse
    match = _el_re.match(css)
TypeError: expected string or buffer
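The last frames pass the selector straight to lxml's CSSSelector, and cssselect's parser calls re.match on it, so anything that is not a plain string (e.g. a missing level yielding None, or an object-style selector) raises exactly this error on Python 2. A minimal reproduction under that assumption:

from lxml.cssselect import CSSSelector

CSSSelector("h1")   # fine
CSSSelector(None)   # TypeError: expected string or buffer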

Use a requirements.txt file

Instead of listing

pip install scrapy
pip install algoliasearch
pip install selenium
pip install tldextract
pip install pyperclip

we should just rely on pip install -r requirements.txt, which is the standard way to handle dependencies with pip.
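The corresponding requirements.txt would then simply list the packages above:

scrapy
algoliasearch
selenium
tldextract
pyperclip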

@ercolanelli-leo do you mind taking care of it?

Implement a task for email deployment

@maxiloc @ElPicador What do you expect this command to be able to do:

  • ./docsearch deploy:emails <config_name>... to prompt for those particular configurations?
  • ./docsearch deploy:emails to do it only for new/changed configurations, without the other deployment options?

The first one seems closer to what you'd like, but I just want to be sure.

Cannot deploy the new configurations

I was not able to deploy the new places configuration; can you provide guidance on how to do so?

The script I used (deployer/deploy) failed with:

Traceback (most recent call last):
  File "/tmp/docsearch_deploy/scraper/deployer/src/index.py", line 6, in <module>
    import fetchers
  File "/tmp/docsearch_deploy/scraper/deployer/src/fetchers.py", line 5, in <module>
    import helpers
  File "/tmp/docsearch_deploy/scraper/deployer/src/helpers.py", line 1, in <module>
    import requests
ImportError: No module named requests

A's inside h*

<h2>
  Ajax Errors
  <a class="header-anchor" id="ajax-errors" href="#ajax-errors"></a>
</h2>

With these tags, it does not get the id of the a.
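A possible fix, sketched with lxml (hypothetical helper, not the scraper's actual code): when the heading itself carries no id, fall back to the id (or name) of a nested a:

from lxml import html

def get_anchor(heading):
    # Prefer the heading's own id, then look at nested <a> tags.
    if heading.get("id"):
        return heading.get("id")
    for a in heading.cssselect("a"):
        anchor = a.get("id") or a.get("name")
        if anchor:
            return anchor
    return None

h2 = html.fromstring('<h2>Ajax Errors<a class="header-anchor" id="ajax-errors" href="#ajax-errors"></a></h2>')
print(get_anchor(h2))  # ajax-errors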

Playground: Make the config editable

We currently have a playground.html page that lets us test our configs. apiKey, indexName, etc. must be filled in manually, and one should take care not to commit the file with those changes.

Instead, we should follow the way the generator works, and make the config options directly editable through input fields, updated in real time.

Useless id anchor added to url

On the reindex.io documentation (https://www.reindex.io/docs/), all the scraped URLs have a #react-mount anchor appended.

This is because none of the elements on the page have an anchor, except the main wrapper, which has id='react-mount'.

Apart from asking the owner to add correct anchors to each part of the hierarchy, do you see a way to prevent the crawler from adding this id?

Variables with fixed values are not filtered

If you have a config like:

{
  "url": "http://doc.akka.io/docs/akka/(?P<language>.*?)",
  "variables": {
    "language": ["scala", "java"]
  }
}

Then, if a URL like http://doc.akka.io/docs/akka/ruby exists, ruby is added to the list of possible values of language, even though it is not in the configured list.
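A sketch of the expected behavior (names hypothetical): when a variable has a fixed list of values, a captured value outside that list should be rejected rather than added:

import re

URL_PATTERN = r"http://doc\.akka\.io/docs/akka/(?P<language>.+)"
VARIABLES = {"language": ["scala", "java"]}

def extract_variables(url):
    match = re.match(URL_PATTERN, url)
    if not match:
        return None
    values = match.groupdict()
    for name, value in values.items():
        if name in VARIABLES and value not in VARIABLES[name]:
            return None  # reject instead of extending the allowed list
    return values

print(extract_variables("http://doc.akka.io/docs/akka/scala"))  # {'language': 'scala'}
print(extract_variables("http://doc.akka.io/docs/akka/ruby"))   # None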

Do not copy configs

This might be a silly question, but why do we COPY configs /root/configs in the Dockerfile?

We pass the config with -e CONFIG when starting docker run anyway, so I'm not sure why the file needs to be in the image.

TypeError: list indices must be integers, not str

Got this on Chef, Lodash, and a config I was using for Go.

2015-12-31 15:49:37 [scrapy] ERROR: Spider error processing <GET https://docs.chef.io/resources.html> (referer: https://docs.chef.io/)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 67, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/Users/dustin/Documents/code/documentation-scrapper/src/documentation_spider.py", line 51, in callback
    records = self.strategy.get_records_from_response(response)
  File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 25, in get_records_from_response
    records = self.get_records_from_dom()
  File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 47, in get_records_from_dom
    level_selector = self.config.selectors[level]
TypeError: list indices must be integers, not str
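The failing frame indexes self.config.selectors with a level name, so this error suggests the selectors config was parsed as a JSON array rather than an object. If that is indeed the cause, the config should use the object form seen elsewhere on this page, e.g.:

"selectors": {
  "lvl0": "h1",
  "text": "p"
}

rather than an array such as "selectors": ["h1", "p"].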

Allow the removing of some unwanted characters in selector results

Some documentation sites add a sign to the markup of their titles. It is only displayed on mouse-over, but it is present in the HTML and thus gets indexed.


We need a way to prevent those characters from being indexed. The previous version of the scraper used the strip_chars option in the config to specify a blacklist of characters to remove from the selector results. I think we should re-enable that feature, and maybe even extend it to a blacklist of words.

Proposal

{
    [...]
    "strip_chars": ""
}

As suggested by @redox below, we should also allow it at the selector level:

"selectors": {
  "lvl0": {
    "selector": "h1",
    "strip_chars": [""]
  }
}

Checklist

  • Allow both a single string and an array of strings
  • Make sure that the values defined in the array are stripped from the selector results
  • Allow it at the selector level, overriding the default root value
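A sketch of the stripping step itself (hypothetical helper; the pilcrow below is only an example of the kind of hover-only anchor character involved):

def strip_values(text, strip_chars):
    # Accept a single string or an array of strings, per the checklist.
    if isinstance(strip_chars, str):
        strip_chars = [strip_chars]
    for value in strip_chars:
        text = text.replace(value, "")
    return text.strip()

print(strip_values("Ajax Errors ¶", ["¶"]))  # "Ajax Errors"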

Variables do not work with single-page documentation

Try this:

{
  "index_name": "cardconnect",
  "start_urls": [
    {
      "url": "https://developer.cardconnect.com/(?P<project>.*?)/",
      "variables": {
        "project": ["cardconnect-api", "copilot-api", "hosted-payment-api"]
      }
    }
  ],
  "stop_urls": [
  ],
  "selectors_exclude": [
  ],
  "selectors": {
    "lvl0": "main h1",
    "lvl1": "main h2",
    "lvl2": "main h3",
    "lvl3": "main h4",
    "lvl4": "main h5",
    "text": "main p"
  }
}

The doc is crawled, but the pushed objects do not contain the project attribute.

Allowing default_value for selectors

It would be useful to be able to specify a default value for a selector when it does not match anything on the page.

We will take advantage of the "new" selector syntax allowing objects instead of plain strings: we'll provide a default_value key that will be used when the specified selector does not match anything.

By also allowing empty selectors, this will let us "hardcode" a hierarchy.

"selectors": {
    "lvl0": {
        "selector": ".menu a.active-trail",
        "default_value": "Homepage"
    }
  }

Questions

  1. Should the default value be used only when the selector does not match anything, or also when it matches but has no content?

Checklist

  • Set the default value if no match is found
  • Works with empty selectors as well as undefined selectors
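A sketch of the fallback logic (hypothetical function, covering the open question above by also treating empty matches as misses):

def apply_selector(dom, config):
    # Return the matched texts, or default_value when nothing non-empty matches.
    selector = config.get("selector")
    matches = dom.cssselect(selector) if selector else []
    texts = [m.text_content().strip() for m in matches]
    texts = [t for t in texts if t]
    if not texts and "default_value" in config:
        return [config["default_value"]]
    return texts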

Ability to omit a selector

Currently, if your configuration doesn't specify a lvlX selector, it fails with a bad-selector exception.

Start url as regex

We should handle start URLs as regexes, not just plain strings. This would allow us to tag and page-rank pages much more easily.
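For illustration, a regex start URL carrying per-URL metadata could look like this (the tags and page_rank keys follow the direction hinted at above and should be treated as hypothetical here):

{
  "start_urls": [
    {
      "url": "https://example.com/docs/(?P<section>.*?)/",
      "tags": ["docs"],
      "page_rank": 1
    }
  ]
}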

Error running docsearch

I am getting this error when I try to run docsearch to scrape my site:

12:27:47 (master) docsearch-scraper: ./docsearch run ../../docsearch/config.json
Traceback (most recent call last):
  File "scraper/src/index.py", line 5, in <module>
    from config_loader import ConfigLoader
  File "/Users/z001n2j/temp/git/docsearch-scraper/scraper/src/config_loader.py", line 26, in <module>
    from .strategies.abstract_strategy import AbstractStrategy
ValueError: Attempted relative import in non-package

Any idea what could be causing it?

With "js_render" it indexes the same page multiple times

With this config, it fetches URLs containing ../, and so indexes the same page multiple times.
Example output:

https://docs.barricade.io/../../using-barricade/
Pushed 199 records
https://docs.barricade.io/../changelog
Pushed 2 records
https://docs.barricade.io/../changelog/new-barricade-site-launched
Pushed 3 records
https://docs.barricade.io/hc/
Pushed 0 records
https://docs.barricade.io/../../getting-started/
Pushed 58 records
https://docs.barricade.io/../../../
Pushed 11 records

Configuration:

{
  "index_name": "barricade",
  "start_urls": [
    "https://docs.barricade.io"
  ],
  "stop_urls": [
  ],
  "selectors_exclude": [
  ],
  "selectors": {
    "lvl0": ".main h1",
    "lvl1": ".main h2",
    "lvl2": ".main h3",
    "lvl3": ".main h4",
    "lvl4": ".main h5",
    "text": ".main p"
  },
  "custom_settings": {},
  "js_render": true,
  "min_indexed_level": 1
}
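One plausible fix (a sketch, not necessarily how the scraper resolves it) is to normalize every discovered URL against the page it was found on before deduplicating and indexing, e.g. with Python's standard library:

from urllib.parse import urljoin

base = "https://docs.barricade.io/hc/some-page"
print(urljoin(base, "../../using-barricade/"))
# https://docs.barricade.io/using-barricade/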

Milestone v2

Here is the checklist of things to be done before releasing DocSearch v2. The v2 will officially be merged into the docsearch.js repo, making the project public. We will also update docsearch.js to v2.0 and keep the two projects in sync.

Improvements to the scraper

  • Allowing global selectors #32
  • Allowing XPath for global selectors #31
  • Allow the removing of some unwanted characters in selector results #34
  • Allowing default_value for selectors #12
  • Allow bypassing the lvl0 #38
  • Add nice test coverage badge
  • Update the customers waiting #35

Improvements to the playground

  • Make the config editable #36
  • Quick switch between configs #37

Improvements to the documentation

  • Update the readme with all options
  • Add one or two tutorial examples for common docs (GitBook, etc)
  • Generate the doc in a Jekyll website

Final merge

  • Synchronize scraper version and javascript version
  • Add scraper documentation webpage to the website
  • Add a removeTitle option to docsearch.js that automatically hides the horizontal bar (to be used with the lvl0: None scraper option)

Playground: Quick switch between configs

We could add a dropdown menu to the playground.html file that allows users to quickly switch from one config to another. This would let us easily test all the configs of customers currently live in production and check that everything works correctly.
