algolia / docsearch-scraper
DocSearch - Scraper
Home Page: https://docsearch.algolia.com/
License: Other
We could add a dropdown menu to the playground.html file that allows users to quickly switch from one config to another. This would let us easily test all the customer configs currently live in production and check that everything is working correctly.
This might be a silly question, but why do we COPY configs /root/configs in the Dockerfile?
We pass the config with -e CONFIG when starting docker run anyway, so I'm not sure why the file needs to be in the image.
On some documentation (see http://doc.craft.ai/tutorials/doc/1/index.html or any GitBook doc), the hierarchy is in the sidebar and not in the main markup.
We can usually fix that with global selectors (see #32), but sometimes the selector is too complex to be handled by CSS alone, and we need to resort to XPath.
Here, we need to first grab the li.active, go up to its closest li parent, then down to the first a. This cannot be achieved through CSS, but can be achieved with //li[@class="chapter active done"]/../../a in XPath.
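A minimal sketch of evaluating that XPath with lxml, on a simplified GitBook-like sidebar (the HTML below is illustrative):

from lxml import html

doc = html.fromstring("""
<ul class="summary">
  <li>
    <a href="/intro">Introduction</a>
    <ul>
      <li class="chapter active done"><a href="/intro/setup">Setup</a></li>
    </ul>
  </li>
</ul>
""")

# grab the active chapter, go up two levels, then down to the parent link
for link in doc.xpath('//li[@class="chapter active done"]/../../a'):
    print(link.get("href"))  # -> /intro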
global selectors? (throw an error if not)
We always add the url,anchor field to the attributesToIndex, and a lot of documentations have URLs that follow the website hierarchy.
It means that if I have a page in /foobar/security.html, it will always be returned on a search on foobar, even if foobar is not present in the security.html page at all.
I'm wondering if searching in the url,anchor should be kept as a default or moved to an opt-in config.
It would be useful to be able to specify a default value for a selector when that selector does not match anything on the page.
We will take advantage of the "new" selector syntax allowing objects instead of plain strings. We'll provide a default_value key that will be used when the specified selector does not match anything.
By also allowing empty selectors, this will let us "hardcode" a hierarchy.
"selectors": {
"lvl0": {
"selector": ".menu a.active-trail",
"default_value": "Homepage"
}
}
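A minimal sketch of how such a default_value could be applied, assuming selectors are normalized to objects and using lxml (get_level_content is hypothetical):

from lxml import html

def get_level_content(dom, selector_config):
    # an empty selector means "always use the default_value"
    selector = selector_config.get("selector")
    matches = dom.cssselect(selector) if selector else []
    if matches:
        return matches[0].text_content().strip()
    return selector_config.get("default_value")

dom = html.fromstring("<div class='menu'><a href='/'>Home</a></div>")
print(get_level_content(dom, {"selector": ".menu a.active-trail",
                              "default_value": "Homepage"}))  # -> "Homepage"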
Put the new default template in the docsearch CSS.
Make the generator Cloudflare-able.
On the reindex.io documentation (https://www.reindex.io/docs/), all the scraped URLs have a #react-mount anchor appended.
This is because none of the elements of the page have an anchor, except the main wrapper, which has an id='react-mount'.
Apart from asking the owner to add correct anchors to each part of the hierarchy, would you see a way to prevent the crawler from adding this id?
If you have a config like:
{
"url": "http://doc.akka.io/docs/akka/(?P<language>.*?)",
"variables": {
"language": ["scala", "java"]
}
}
And if a URL like http://doc.akka.io/docs/akka/ruby exists, ruby is added to the list of possible values of language.
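A minimal sketch of how the variable extraction could flag unexpected values instead of silently adding them (extract_variables is hypothetical, and the pattern is slightly tightened for illustration):

import re

config_url = r"http://doc.akka.io/docs/akka/(?P<language>[^/]+)"
allowed = {"language": ["scala", "java"]}

def extract_variables(url):
    match = re.match(config_url, url)
    if match is None:
        return None
    variables = match.groupdict()
    for name, value in variables.items():
        if value not in allowed.get(name, []):
            # this is where "ruby" currently sneaks into the list of values
            print("unexpected value for %s: %s" % (name, value))
    return variables

extract_variables("http://doc.akka.io/docs/akka/ruby")  # prints the warning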
The Docker container build time has increased too much since we've added the Splash instance in it. I'm currently splitting the original Dockerfile in two in order to have a base image which has all dependencies preinstalled.
For documentations that have only 1 or 2 levels, the default display adds too many constraints. We usually do not need the lvl0, and would want to start the hierarchy at the lvl1, and only display two columns without the horizontal bars.
Using default values for lvl0, we could easily set it to a hardcoded "Documentation" string, and then in the front end hide the bar using CSS.
But this would leave us with a lvl0 of Documentation that is searchable, so if a user searches for "Documentation", all the records would match.
So we need to provide a way to completely bypass the lvl0. The proposal is to allow a value of None on the lvl0, which will trigger a specific behavior:
- lvl0 will be hardcoded to Documentation for all records
- lvl0 will not be added to the ranking
In a second step, we will add a removeTitle option on docsearch.js that will automatically hide the horizontal bar. In the meantime, those default options will simply group all the results under the same header, while disallowing searching in it.
- Allow passing None to selectors.lvl0
- Remove lvl0 from all ranking
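A minimal sketch of the intended behavior, assuming the config loader maps JSON null to Python None (get_lvl0 is hypothetical):

def get_lvl0(selectors, dom):
    # lvl0: None -> hardcode the level; ranking exclusion happens elsewhere
    if selectors.get("lvl0") is None:
        return "Documentation"
    matches = dom.cssselect(selectors["lvl0"])
    return matches[0].text_content().strip() if matches else None

print(get_lvl0({"lvl0": None}, dom=None))  # -> "Documentation"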
Passing a config without the selectors_exclude key should set it to an empty array by default, and not throw an error.
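A minimal sketch of the expected default (load_selectors_exclude is hypothetical):

def load_selectors_exclude(config):
    # default to an empty list instead of raising a KeyError
    return config.get("selectors_exclude", [])

print(load_selectors_exclude({"index_name": "foo"}))  # -> []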
Here is the checklist of things to be done before releasing DocSearch v2. The v2 will officially be merged into the docsearch.js repo, thus making the project public. We will also update docsearch.js to v2.0 and keep the two projects in sync.
- removeTitle option on docsearch.js that automatically hides the horizontal bar (to be used with the lvl0: None scraper option)

The selectors we can currently specify are only scoped selectors. It means that, for any given lvlX element, in order to get its lvl{X-1} and lvl{X-2} parents, we have to walk up the DOM tree.
It works well for correctly crafted HTML, but fails as soon as the hierarchy is not in the HTML DOM order. This happens often when the hierarchy is in a sidebar, while the content is in the main part of the page. With our current scoped selectors, we cannot build such a hierarchy.
We need to allow global selectors that, instead of looking up the DOM tree, will simply try to select an element anywhere in the page.
A possible API would be to allow passing objects instead of strings for each selector.
{
"selectors": {
"lvl0": {
"selector": "#accordion a.category-title:not(.collapsed)",
"global": true
},
"lvl1": ".content h2",
"lvl2": ".content h3",
"lvl3": ".content h4",
"lvl4": ".content h5",
"text": ".content p"
  }
}
We keep backward compatibility by converting simple strings to objects with a selector key and global: false.
- global: false is the default
- global: true will use the result of the first match of the selector
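A minimal sketch of that conversion (normalize_selectors is hypothetical):

def normalize_selectors(selectors):
    # plain strings become objects with global: false for backward compatibility
    normalized = {}
    for name, value in selectors.items():
        if isinstance(value, str):
            value = {"selector": value, "global": False}
        normalized[name] = value
    return normalized

print(normalize_selectors({"lvl1": ".content h2"}))
# -> {'lvl1': {'selector': '.content h2', 'global': False}}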
I am getting this error when I try to run docsearch to scrape my site:
12:27:47 (master) docsearch-scraper: ./docsearch run ../../docsearch/config.json
Traceback (most recent call last):
File "scraper/src/index.py", line 5, in <module>
from config_loader import ConfigLoader
File "/Users/z001n2j/temp/git/docsearch-scraper/scraper/src/config_loader.py", line 26, in <module>
from .strategies.abstract_strategy import AbstractStrategy
ValueError: Attempted relative import in non-package
Any idea what could be causing it?
only create the dev image
Some documentations are generated client-side with JS (e.g. http://docs.prezly.com/, https://gns3.com/support/docs/quick-start-guide-for-windows-us).
It would be nice to be able to parse them.
With this config, it fetches URLs with ../ in them, and so indexes the same page multiple times.
Example of output:
https://docs.barricade.io/../../using-barricade/
Pushed 199 records
https://docs.barricade.io/../changelog
Pushed 2 records
https://docs.barricade.io/../changelog/new-barricade-site-launched
Pushed 3 records
https://docs.barricade.io/hc/
Pushed 0 records
https://docs.barricade.io/../../getting-started/
Pushed 58 records
https://docs.barricade.io/../../../
Pushed 11 records
Configuration:
{
"index_name": "barricade",
"start_urls": [
"https://docs.barricade.io"
],
"stop_urls": [
],
"selectors_exclude": [
],
"selectors": {
"lvl0": ".main h1",
"lvl1": ".main h2",
"lvl2": ".main h3",
"lvl3": ".main h4",
"lvl4": ".main h5",
"text": ".main p"
},
"custom_settings": {},
"js_render": true,
"min_indexed_level": 1
}
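A minimal sketch of a possible normalization step before indexing (normalize_url is hypothetical; Python 2's urljoin does not collapse excess ../ segments, which may explain the output above):

import posixpath
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    # collapse "/../" segments left in the path, preserving a trailing slash
    parts = urlparse(url)
    path = posixpath.normpath(parts.path)
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunparse(parts._replace(path=path))

print(normalize_url("https://docs.barricade.io/../../using-barricade/"))
# -> https://docs.barricade.io/using-barricade/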
<h2>
Ajax Errors
<a class="header-anchor" id="ajax-errors" href="#ajax-errors"></a>
</h2>
With these tags, it does not pick up the id of the a.
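A minimal sketch of a possible fallback, using lxml to look for an id on a child a when the heading itself has none (get_anchor is hypothetical):

from lxml import html

fragment = html.fromstring(
    '<h2>Ajax Errors'
    '<a class="header-anchor" id="ajax-errors" href="#ajax-errors"></a></h2>'
)

def get_anchor(node):
    # prefer the node's own id/name, then fall back to a child anchor's id
    anchor = node.get("id") or node.get("name")
    if anchor:
        return anchor
    child = node.find(".//a[@id]")
    return child.get("id") if child is not None else None

print(get_anchor(fragment))  # -> "ajax-errors"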
We should add an explanation of what min_indexed_level does, as per #8.
I was not able to deploy the new places configuration; can you provide guidance on how to do so?
The script I used (deployer/deploy) failed with:
Traceback (most recent call last):
File "/tmp/docsearch_deploy/scraper/deployer/src/index.py", line 6, in <module>
import fetchers
File "/tmp/docsearch_deploy/scraper/deployer/src/fetchers.py", line 5, in <module>
import helpers
File "/tmp/docsearch_deploy/scraper/deployer/src/helpers.py", line 1, in <module>
import requests
ImportError: No module named requests
Try this:
{
"index_name": "cardconnect",
"start_urls": [
{
"url": "https://developer.cardconnect.com/(?P<project>.*?)/",
"variables": {
"project": ["cardconnect-api", "copilot-api", "hosted-payment-api"]
}
}
],
"stop_urls": [
],
"selectors_exclude": [
],
"selectors": {
"lvl0": "main h1",
"lvl1": "main h2",
"lvl2": "main h3",
"lvl3": "main h4",
"lvl4": "main h5",
"text": "main p"
}
}
The doc is crawled, but the objects do not contain the project attribute.
Some documentations add a ¶ sign in the markup of their titles. It is only displayed on mouse-over, but it is present in the HTML, so it gets indexed.
We need a way to prevent those chars from being indexed. The previous version of the scraper used the strip_chars option in the config to specify a blacklist of chars that should be removed from the selector results. I think we should re-enable that feature, and maybe even extend it to a blacklist of words.
{
[...]
"strip_chars": "¶"
}
As suggested by @redox below, we could also allow it at the selector level:
"selectors": {
"lvl0": {
"selector": "h1",
"strip_chars": ["¶"]
}
}
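A minimal sketch of how the option could be applied to selector results (apply_strip_chars is hypothetical):

def apply_strip_chars(text, strip_chars="¶"):
    # strip_chars can be a string or a list of single characters
    for char in strip_chars:
        text = text.replace(char, "")
    return text.strip()

print(apply_strip_chars("Ajax Errors ¶"))         # -> "Ajax Errors"
print(apply_strip_chars("Ajax Errors ¶", ["¶"]))  # -> "Ajax Errors"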
Got this on Chef, Lodash, and one I was using for Go.
2015-12-31 15:49:37 [scrapy] ERROR: Spider error processing <GET https://docs.chef.io/resources.html> (referer: https://docs.chef.io/)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 67, in _parse_response
cb_res = callback(response, **cb_kwargs) or ()
File "/Users/dustin/Documents/code/documentation-scrapper/src/documentation_spider.py", line 51, in callback
records = self.strategy.get_records_from_response(response)
File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 25, in get_records_from_response
records = self.get_records_from_dom()
File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 47, in get_records_from_dom
level_selector = self.config.selectors[level]
TypeError: list indices must be integers, not str
We currently have a playground.html page that lets us test our configs. apiKey, indexName, etc. must be filled in manually, and one should take care not to commit the file with those changes.
Instead, we should follow the way the generator works and make the config options directly editable through input fields, updated in real time.
Currently if your configuration doesn't specify a lvlX selector, it will fail with a bad selector exception.
For example, on this website: https://www.getpostman.com/docs/consuming_api_documentation
The DOM for lvl0 is after the DOM for lvl1, as seen here:
I think the issue is here and here, as we iterate on the nodes in the order they are found in the DOM, and not relative to their lvl.
@maxiloc @ElPicador What do you expect this command to be able to do?
./docsearch deploy:emails <config_name>... to prompt for those particular configurations?
Or ./docsearch deploy:emails to do it only for new/changed configurations, without the other deployment options?
The first one seems more like what you would like, but just to be sure.
We should handle start URL objects and not simple strings. This would allow us to tag and page-rank way more easily.
On some websites like minutedock.com, even if the documentation is well organized into main categories (lvl0), each section/article has its own page.
As a consequence, the record reflecting the lvl0 is created on every single page.
I definitely want that lvl0, to group by and display the dropdown menu nicely; but not the lvl0 records. I only had the issue with lvl0 for now.
If you forget the trailing / in a start_url with a regexp, you get objects without the facet.
Maybe add a warning?
Or automatically add a trailing /?
Instead of listing
pip install scrapy
pip install algoliasearch
pip install selenium
pip install tldextract
pip install pyperclip
we should just rely on pip install -r requirements.txt, which is the regular way to handle dependencies with pip.
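The corresponding requirements.txt (versions left unpinned here; note the package is pyperclip, not piperclip) would simply be:

scrapy
algoliasearch
selenium
tldextract
pyperclip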
@ercolanelli-leo do you mind taking care of it?
I really think that's too confusing, I would remove it.
All containers are built with the same name because of the optional argument here: https://github.com/algolia/docsearch-scraper/blob/master/cli/src/commands/abstract_command.py#L21
Is it to be expected?
I'll aggregate here the list of pending documentation requests from HelpScout, ordered by the missing feature needed to complete them. Feel free to check them out or add new ones.
It seems all the Help pages from the website (starting with https://www.hull.io/help) are not handled by the current selectors, i.e. they do not produce any record. Possibly because of .documentation-body.
If you specify "js_wait": 10, it takes 0.5 as the wait time.
But it works if you use "js_wait": 10.0.
I think the issue is here: https://github.com/algolia/documentation-scrapper/blob/master/src/config_loader.py#L132
We only check whether the value is a float, not whether it is a number (int or float).
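A minimal sketch of the fix, assuming a default of 0.5 as the issue suggests: accept both ints and floats instead of floats alone (parse_js_wait is hypothetical).

def parse_js_wait(config, default=0.5):
    # accept both "js_wait": 10 and "js_wait": 10.0
    value = config.get("js_wait", default)
    return value if isinstance(value, (int, float)) else default

print(parse_js_wait({"js_wait": 10}))    # -> 10 (no longer falls back to 0.5)
print(parse_js_wait({"js_wait": 10.0}))  # -> 10.0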
I'm getting this error on a number of different configs (for example, I received it on lodash and chef).
TypeError: expected string or buffer
https://docs.chef.io/nodes.html
2015-12-31 15:48:04 [scrapy] ERROR: Spider error processing <GET https://docs.chef.io/nodes.html> (referer: https://docs.chef.io/)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 67, in _parse_response
cb_res = callback(response, **cb_kwargs) or ()
File "/Users/dustin/Documents/code/documentation-scrapper/src/documentation_spider.py", line 51, in callback
records = self.strategy.get_records_from_response(response)
File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 26, in get_records_from_response
records = self.get_records_from_dom()
File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/default_strategy.py", line 51, in get_records_from_dom
nodes_per_level[level] = self.cssselect(level_selector)
File "/Users/dustin/Documents/code/documentation-scrapper/src/strategies/abstract_strategy.py", line 84, in cssselect
return CSSSelector(selector)(self.dom)
File "/usr/local/lib/python2.7/site-packages/lxml/cssselect.py", line 94, in __init__
path = translator.css_to_xpath(css)
File "/usr/local/lib/python2.7/site-packages/cssselect/xpath.py", line 192, in css_to_xpath
for selector in parse(css))
File "/usr/local/lib/python2.7/site-packages/cssselect/parser.py", line 341, in parse
match = _el_re.match(css)
TypeError: expected string or buffer