algolia / docsearch-scraper
DocSearch - Scraper
Home Page: https://docsearch.algolia.com/
License: Other
We could add a dropdown menu to the playground.html file that allows users to quickly switch from one config to another. This would let us easily test all the customer configs currently live in production and check that everything is working correctly.
This might be a silly question, but why do we COPY configs /root/configs in the Dockerfile?
We pass the config with -e CONFIG when starting docker run anyway, so I'm not sure why the file needs to be in the image.
On some documentation (see http://doc.craft.ai/tutorials/doc/1/index.html or any GitBook doc), the hierarchy is in the sidebar and not in the main markup.
We can usually fix that with global selectors (see #32), but sometimes the selector is too complex to be handled by CSS alone, and we need to resort to XPath.
Here, we need to first grab the li.active, go up to its closest li parent, then down to the first a. This cannot be achieved through CSS, but can be achieved with //li[@class="chapter active done"]/../../a in XPath.
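A minimal sketch of evaluating that XPath with lxml, on a simplified GitBook-like sidebar (the HTML below is illustrative):

from lxml import html

doc = html.fromstring("""
<ul class="summary">
  <li>
    <a href="/intro">Introduction</a>
    <ul>
      <li class="chapter active done"><a href="/intro/setup">Setup</a></li>
    </ul>
  </li>
</ul>
""")

# grab the active chapter, go up two levels, then down to the parent link
for link in doc.xpath('//li[@class="chapter active done"]/../../a'):
    print(link.get("href"))  # -> /intro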
global selectors? (throw an error if not)
We always add the url,anchor field to the attributesToIndex, and a lot of documentations have URLs that follow the website hierarchy.
It means that if I have a page in /foobar/security.html, it will always be returned on a search on foobar, even if foobar is not present in the security.html page at all.
I'm wondering if searching in the url,anchor should be kept as a default or moved to an opt-in config.
It would be useful to be able to specify a default value for a selector when that selector does not match anything on the page.
We will take advantage of the "new" selector syntax allowing objects instead of plain strings. We'll provide a default_value key that will be used when the specified selector does not match anything.
By also allowing empty selectors, this will let us "hardcode" a hierarchy.
"selectors": {
"lvl0": {
"selector": ".menu a.active-trail",
"default_value": "Homepage"
}
}
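A minimal sketch of how such a default_value could be applied, assuming selectors are normalized to objects and using lxml (get_level_content is hypothetical):

from lxml import html

def get_level_content(dom, selector_config):
    # an empty selector means "always use the default_value"
    selector = selector_config.get("selector")
    matches = dom.cssselect(selector) if selector else []
    if matches:
        return matches[0].text_content().strip()
    return selector_config.get("default_value")

dom = html.fromstring("<div class='menu'><a href='/'>Home</a></div>")
print(get_level_content(dom, {"selector": ".menu a.active-trail",
                              "default_value": "Homepage"}))  # -> "Homepage"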
Put the new default template in the docsearch CSS.
Make the generator Cloudflare-able.
On the reindex.io documentation (https://www.reindex.io/docs/), all the scraped URLs have a #react-mount anchor appended.
This is because none of the elements of the page have an anchor, except the main wrapper, which has an id='react-mount'.
Apart from asking the owner to add correct anchors to each part of the hierarchy, would you see a way to prevent the crawler from adding this id?
If you have a config like:
{
"url": "http://doc.akka.io/docs/akka/(?P<language>.*?)",
"variables": {
"language": ["scala", "java"]
}
}
And if a URL like http://doc.akka.io/docs/akka/ruby exists, ruby is added to the list of possible values of language.
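A minimal sketch of how the variable extraction could flag unexpected values instead of silently adding them (extract_variables is hypothetical, and the pattern is slightly tightened for illustration):

import re

config_url = r"http://doc.akka.io/docs/akka/(?P<language>[^/]+)"
allowed = {"language": ["scala", "java"]}

def extract_variables(url):
    match = re.match(config_url, url)
    if match is None:
        return None
    variables = match.groupdict()
    for name, value in variables.items():
        if value not in allowed.get(name, []):
            # this is where "ruby" currently sneaks into the list of values
            print("unexpected value for %s: %s" % (name, value))
    return variables

extract_variables("http://doc.akka.io/docs/akka/ruby")  # prints the warning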
The Docker container build time has increased too much since we've added the Splash instance in it. I'm currently splitting the original Dockerfile in two in order to have a base image which has all dependencies preinstalled.
For documentations that have only 1 or 2 levels, the default display adds too many constraints. We usually do not need the lvl0, and would want to start the hierarchy at the lvl1, and only display two columns without the horizontal bars.
Using default values for lvl0, we could easily set it to a hardcoded "Documentation" string, and then in the front end hide the bar using CSS.
But this would leave us with a lvl0 of Documentation that is searchable, so if a user searches for "Documentation", all the records would match.
So we need to provide a way to completely bypass the lvl0. The proposal is to allow a value of None on the lvl0, which will trigger a specific behavior:
- lvl0 will be hardcoded to Documentation for all records
- lvl0 will not be added to the ranking
In a second step, we will add a removeTitle option on docsearch.js that will automatically hide the horizontal bar. In the meantime, those default options will simply group all the results under the same header, while disallowing searching in it.
- Allow passing None to selectors.lvl0
- Remove lvl0 from all ranking
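A minimal sketch of the intended behavior, assuming the config loader maps JSON null to Python None (get_lvl0 is hypothetical):

def get_lvl0(selectors, dom):
    # lvl0: None -> hardcode the level; ranking exclusion happens elsewhere
    if selectors.get("lvl0") is None:
        return "Documentation"
    matches = dom.cssselect(selectors["lvl0"])
    return matches[0].text_content().strip() if matches else None

print(get_lvl0({"lvl0": None}, dom=None))  # -> "Documentation"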
Passing a config without the selectors_exclude key should set it to an empty array by default, and not throw an error.
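A minimal sketch of the expected default (load_selectors_exclude is hypothetical):

def load_selectors_exclude(config):
    # default to an empty list instead of raising a KeyError
    return config.get("selectors_exclude", [])

print(load_selectors_exclude({"index_name": "foo"}))  # -> []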
Here is the checklist of things to be done before releasing DocSearch v2. The v2 will officially be merged into the docsearch.js repo, thus making the project public. We will also update docsearch.js to v2.0 and keep the two projects in sync.
- removeTitle option on docsearch.js that automatically hides the horizontal bar (to be used with the lvl0: None scraper option)

The selectors we can currently specify are only scoped selectors. It means that, for any given lvlX element, in order to get its lvl{X-1} and lvl{X-2} parents, we have to walk up the DOM tree.
It works well for correctly crafted HTML, but fails as soon as the hierarchy is not in the HTML DOM order. This happens often when the hierarchy is in a sidebar, while the content is in the main part of the page. With our current scoped selectors, we cannot build such a hierarchy.
We need to allow global selectors that, instead of looking up the DOM tree, will simply try to select an element anywhere in the page.
A possible API would be to allow passing objects instead of strings for each selector.
{
"selectors": {
"lvl0": {
"selector": "#accordion a.category-title:not(.collapsed)",
"global": true
},
"lvl1": ".content h2",
"lvl2": ".content h3",
"lvl3": ".content h4",
"lvl4": ".content h5",
"text": ".content p"
  }
}
We keep backward compatibility by converting simple strings to objects with a selector key and global: false.
- global: false is the default
- global: true will use the result of the first match of the selector
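A minimal sketch of that conversion (normalize_selectors is hypothetical):

def normalize_selectors(selectors):
    # plain strings become objects with global: false for backward compatibility
    normalized = {}
    for name, value in selectors.items():
        if isinstance(value, str):
            value = {"selector": value, "global": False}
        normalized[name] = value
    return normalized

print(normalize_selectors({"lvl1": ".content h2"}))
# -> {'lvl1': {'selector': '.content h2', 'global': False}}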
I am getting this error when I try to run docsearch to scrape my site:
12:27:47 (master) docsearch-scraper: ./docsearch run ../../docsearch/config.json
Traceback (most recent call last):
File "scraper/src/index.py", line 5, in <module>
from config_loader import ConfigLoader
File "/Users/z001n2j/temp/git/docsearch-scraper/scraper/src/config_loader.py", line 26, in <module>
from .strategies.abstract_strategy import AbstractStrategy
ValueError: Attempted relative import in non-package
Any idea what could be causing it?
only create the dev image
Some documentations are generated client-side with JS (e.g. http://docs.prezly.com/, https://gns3.com/support/docs/quick-start-guide-for-windows-us).
It would be nice to be able to parse them.
With this config, it fetches URLs with ../ in them, and so indexes the same page multiple times.
Example of output:
https://docs.barricade.io/../../using-barricade/
Pushed 199 records
https://docs.barricade.io/../changelog
Pushed 2 records
https://docs.barricade.io/../changelog/new-barricade-site-launched
Pushed 3 records
https://docs.barricade.io/hc/
Pushed 0 records
https://docs.barricade.io/../../getting-started/
Pushed 58 records
https://docs.barricade.io/../../../
Pushed 11 records
Configuration:
{
"index_name": "barricade",
"start_urls": [
"https://docs.barricade.io"
],
"stop_urls": [
],
"selectors_exclude": [
],
"selectors": {
"lvl0": ".main h1",
"lvl1": ".main h2",
"lvl2": ".main h3",
"lvl3": ".main h4",
"lvl4": ".main h5",
"text": ".main p"
},
"custom_settings": {},
"js_render": true,
"min_indexed_level": 1
}
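A minimal sketch of a possible normalization step before indexing (normalize_url is hypothetical; Python 2's urljoin does not collapse excess ../ segments, which may explain the output above):

import posixpath
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    # collapse "/../" segments left in the path, preserving a trailing slash
    parts = urlparse(url)
    path = posixpath.normpath(parts.path)
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunparse(parts._replace(path=path))

print(normalize_url("https://docs.barricade.io/../../using-barricade/"))
# -> https://docs.barricade.io/using-barricade/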
<h2>
Ajax Errors
<a class="header-anchor" id="ajax-errors" href="#ajax-errors"></a>
</h2>
With these tags, it does not pick up the id of the a.
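A minimal sketch of a possible fallback, using lxml to look for an id on a child a when the heading itself has none (get_anchor is hypothetical):

from lxml import html

fragment = html.fromstring(
    '<h2>Ajax Errors'
    '<a class="header-anchor" id="ajax-errors" href="#ajax-errors"></a></h2>'
)

def get_anchor(node):
    # prefer the node's own id/name, then fall back to a child anchor's id
    anchor = node.get("id") or node.get("name")
    if anchor:
        return anchor
    child = node.find(".//a[@id]")
    return child.get("id") if child is not None else None

print(get_anchor(fragment))  # -> "ajax-errors"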
We should add an explanation of what min_indexed_level does, as per #8.
I was not able to deploy the new places configuration; can you provide guidance on how to do so?
The script I used (deployer/deploy) failed with:
Traceback (most recent call last):
File "/tmp/docsearch_deploy/scraper/deployer/src/index.py", line 6, in <module>
import fetchers
File "/tmp/docsearch_deploy/scraper/deployer/src/fetchers.py", line 5, in <module>
import helpers
File "/tmp/docsearch_deploy/scraper/deployer/src/helpers.py", line 1, in <module>
import requests
ImportError: No module named requests
Try this:
{
"index_name": "cardconnect",
"start_urls": [
{
"url": "https://developer.cardconnect.com/(?P<project>.*?)/",
"variables": {
"project": ["cardconnect-api", "copilot-api", "hosted-payment-api"]
}
}
],
"stop_urls": [
],
"selectors_exclude": [
],
"selectors": {
"lvl0": "main h1",
"lvl1": "main h2",
"lvl2": "main h3",
"lvl3": "main h4",
"lvl4": "main h5",
"text": "main p"
}
}
The doc is crawled, but the objects do not contain the project attribute.
Some documentations add a ¶ sign in the markup of their titles. It is only displayed on mouse-over, but it is present in the HTML, so it gets indexed.
We need a way to prevent those chars from being indexed. The previous version of the scraper used the strip_chars option in the config to specify a blacklist of chars that should be removed from the selector results. I think we should re-enable that feature, and maybe even extend it to a blacklist of words.
{
[...]
"strip_chars": "¶"
}
As suggested by @redox below, we could also allow it at the selector level:
"selectors": {
"lvl0": {
"selector": "h1",
"strip_chars": ["¶"]
}
}
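A minimal sketch of how the option could be applied to selector results (apply_strip_chars is hypothetical):

def apply_strip_chars(text, strip_chars="¶"):
    # strip_chars can be a string or a list of single characters
    for char in strip_chars:
        text = text.replace(char, "")
    return text.strip()

print(apply_strip_chars("Ajax Errors ¶"))         # -> "Ajax Errors"
print(apply_strip_chars("Ajax Errors ¶", ["¶"]))  # -> "Ajax Errors"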
Got this on Chef, Lodash, and one I was using for Go.
2015-12-31 15:49:37 [scrapy] ERROR: Spider error processing <GET https://docs.chef.io/resources.html> (referer: https://docs.chef.io/)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 67, in _parse_response
cb_res = callback(response, **cb_kwargs) or ()
File "/Users/dustin/Documents/code/documentation-scrapper/src/documentation_spider.py", line 51, in callback
records = self.strategy.get_records_from_response(response)
File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 25, in get_records_from_response
records = self.get_records_from_dom()
File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 47, in get_records_from_dom
level_selector = self.config.selectors[level]
TypeError: list indices must be integers, not str
We currently have a playground.html page that lets us test our configs. apiKey, indexName, etc. must be filled in manually, and one should take care not to commit the file with those changes.
Instead, we should follow the way the generator works and make the config options directly editable through input fields, updated in real time.
Currently if your configuration doesn't specify a lvlX selector, it will fail with a bad selector exception.
For example, on this website: https://www.getpostman.com/docs/consuming_api_documentation
The DOM for lvl0 is after the DOM for lvl1, as seen here:
I think the issue is here and here, as we iterate on the nodes in the order they are found in the DOM, and not relative to their lvl.
@maxiloc @ElPicador What do you expect this command to be able to do?
./docsearch deploy:emails <config_name>... to prompt for those particular configurations?
Or ./docsearch deploy:emails to do it only for new/changed configurations, without the other deployment options?
The first one seems more like what you would like, but just to be sure.
We should handle start URL objects and not simple strings. This would allow us to tag and page-rank way more easily.
On some websites like minutedock.com, even if the documentation is well organized into main categories (lvl0), each section/article has its own page.
As a consequence, the record reflecting the lvl0 is created on every single page.
I definitely want that lvl0, to group by and display the dropdown menu nicely; but not the lvl0 records. I only had the issue with lvl0 for now.
If you forget the trailing / in a start_url with a regexp, you get objects without the facet.
Maybe add a warning?
Or automatically add a trailing /?
Instead of listing
pip install scrapy
pip install algoliasearch
pip install selenium
pip install tldextract
pip install pyperclip
we should just rely on pip install -r requirements.txt, which is the regular way to handle dependencies with pip.
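The corresponding requirements.txt (versions left unpinned here; note the package is pyperclip, not piperclip) would simply be:

scrapy
algoliasearch
selenium
tldextract
pyperclip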
@ercolanelli-leo do you mind taking care of it?
I really think that's too confusing, I would remove it.
All containers are built with the same name because of the optional argument here: https://github.com/algolia/docsearch-scraper/blob/master/cli/src/commands/abstract_command.py#L21
Is it to be expected?
I'll aggregate here the list of pending documentation requests from HelpScout, ordered by the missing feature needed to complete them. Feel free to check them out or add new ones.
It seems all the Help pages from the website (starting with https://www.hull.io/help) are not handled by the current selectors, i.e. they do not produce any record. Possibly because of .documentation-body.
If you specify "js_wait": 10, it takes 0.5 as the wait time.
But it works if you use "js_wait": 10.0.
I think the issue is here: https://github.com/algolia/documentation-scrapper/blob/master/src/config_loader.py#L132
We only check whether the value is a float, not whether it is a number (int or float).
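A minimal sketch of the fix, assuming a default of 0.5 as the issue suggests: accept both ints and floats instead of floats alone (parse_js_wait is hypothetical).

def parse_js_wait(config, default=0.5):
    # accept both "js_wait": 10 and "js_wait": 10.0
    value = config.get("js_wait", default)
    return value if isinstance(value, (int, float)) else default

print(parse_js_wait({"js_wait": 10}))    # -> 10 (no longer falls back to 0.5)
print(parse_js_wait({"js_wait": 10.0}))  # -> 10.0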
I'm getting this error on a number of different configs (for example, I received it on lodash and chef).
TypeError: expected string or buffer
https://docs.chef.io/nodes.html
2015-12-31 15:48:04 [scrapy] ERROR: Spider error processing <GET https://docs.chef.io/nodes.html> (referer: https://docs.chef.io/)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 67, in _parse_response
cb_res = callback(response, **cb_kwargs) or ()
File "/Users/dustin/Documents/code/documentation-scrapper/src/documentation_spider.py", line 51, in callback
records = self.strategy.get_records_from_response(response)
File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 26, in get_records_from_response
records = self.get_records_from_dom()
File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 51, in get_records_from_dom
nodes_per_level[level] = self.cssselect(level_selector)
File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/abstract_strategy.py", line 84, in cssselect
return CSSSelector(selector)(self.dom)
File "/usr/local/lib/python2.7/site-packages/lxml/cssselect.py", line 94, in __init__
path = translator.css_to_xpath(css)
File "/usr/local/lib/python2.7/site-packages/cssselect/xpath.py", line 192, in css_to_xpath
for selector in parse(css))
File "/usr/local/lib/python2.7/site-packages/cssselect/parser.py", line 341, in parse
match = _el_re.match(css)
TypeError: expected string or buffer