Comments (14)
Hello,
Thanks for this project ! It seems like a very nice way to "abstract" what readability is doing.
My initial reaction was to see how I could use cheerio with this since it is the de-factor fast parser / walker I use for readibility-like page analysis.
Have you given more thought on this issue ? I have not yet understood how fathom's structures are implemented. I understand that it needs a selector-like API on a dom-like tree but it seems to also need a selector API on types (?).
cheerio uses https://www.npmjs.com/package/css-select as its selector-API, which implements a subset of the https://github.com/jquery/sizzle selector-API.
Could you explain what is the interface between fathom and the object model that the document needs to meet ? Is fathom what we could call a generic rule-based tree annotator or does it have hard bindings to the DOM as we know it in html ?
from fathom.
Fathom is really quite generic. The only parts that interact with the DOM API are…
- The querySelectorAll call in the dom() LHS. But there's no innate need to use the dom() LHS if you have some other source of data. You'd need to write that source into a new kind of LHS, and Fathom isn't pluggable in that direction atm, but there's no brittle piece of design keeping it from happening. Ultimately, I would like to push Fathom as a more generic data structure annotator and queryer, so this would be a useful thing to make pluggable.
- And any callbacks (for scoring, etc.) that call DOM methods on the nodes passed in
I have actually been thinking about switching Fathom to cheerio for speed. Do you happen to know whether cheerio can act on an arbitrary DOM tree, regardless of its implementation? It looks like it prefers instead to parse text. I wonder why, since css-select seems content to act on DOM-compliant nodes. Is cheerio doing some sort of indexing or caching? I'd like to maintain Fathom's ability to work with native DOM trees or at least wrap them transparently, since one of its major use cases is running within Firefox.
To answer your question, I don't think it would be hard to decouple dom() from querySelectorAll(), but I'd have to do some thinking to decide how best to change the API. One nicety would be to support straight DOM access for certain rules but cheerio wrappings for others and be able to have callbacks for each sort coexist in one ruleset. I'd happily review a proposal, couched as a PR. Thanks for your interest!
from fathom.
I think that cheerio was initially developped as a server-side jquery style html walker. It is mainly a wrapper around htmlparser2 and css-select and that would explain why the main use case was to start with strings.
I understand your requirement for fathom to be able to work with a pre-parsed native DOM tree.
I must admit that I don't know where the speed from cheerio comes from. Is it from the parsing phase, or as css-select boasts, from a different way of doing selector research ?
I would certainly not hardcode cheerio inside fathom because cheerio has its own set of issues. so having a pluggable system is certainly the way to go towards a more generic tree annotator.
I looked at the internals of cheerio. The "string" based approach seems to be deeply coded inside cheerio. For example, cloned doms are serialized + re-parsed. Modifications like appendTo(target)
can append to strings that will be parsed internally. So there seem to be a lot of duck-typing between strings and the cheerio-dom impl.
the cheerio-dom structure corresponds to the htmlparser2 dom structure which is defined by https://github.com/fb55/domhandler
the find
method, which is close to querySelectorAll
, is imply implemented as css-select(selector, elems, options)
I am not sure if cheerio can wrap a native DOM tree. css-select
advertises the fact that it can be used with different adapters
In its default configuration, css-select queries the DOM structure of the domhandler module (also known as htmlparser2 DOM). It uses domutils as its default adapter over the DOM structure. See Options below for details on querying alternative DOM structures.
It seems that there is a https://github.com/nrkn/css-select-browser-adapter that is a css-select
adapter for a native DOM tree, so at least css-select can work both on native DOM and on cheerio's dom.
Nevertheless, I think I need to dig a little more into how fathom uses DOM (directly or via callbacks) to better understand what is used and check what it would mean for rulesets interoperability to have several non fully-compatible DOM-like engines.
Is there a complex real-life example that you know about and that I could look at to see what we are talking about DOM-wise ?
from fathom.
Note: after digging some more into cheerio,
-
I confirm that css-select 1.3.0-rc0 can traverse a "native DOM" tree when using the
css-select-browser-adapter
adapter (I successfully used css-select to traverse a jsdom instance) -
I can't find a way to instanciate cheerio with a jsdom instance and have it transparently hand it over to css-select. Cheerio wants to wrap the DOM context inside a cheerio object and there is currently no code path for a transparent wrapping of a jsdom node tree (that I consider equivalent to a native DOM tree). There may be a way to PR the cheerio code to accepts this for traversal but it seems a lot more complicated for all the cheerio dom tree modifiers.
from fathom.
Let me start with a correction: Fathom's clustering machinery uses the DOM API heavily, including compareDocumentPosition()
, which I've seen go unimplemented in JS DOM implementations.
css-select-browser-adapter looks short and sane. Thanks for test-driving it! Things like clustering would still need a way to walk around the tree, but I see cheerio has affordances for that. I'm not sure it has something like Node.contains(anotherNode)
, but one could certainly write one.
Is there a complex real-life example that you know about and that I could look at to see what we are talking about DOM-wise?
I think this Readability workalike is your best bet for now. Drill into the functions it calls on DOM nodes, like domSort(), clusters(), and inlineTextLength(). Hmm, I suppose we had other DOM-dependent things after all, mostly in the utils module. They're technically optional, but they sure come in handy.
from fathom.
I am now thinking hard about switching to cheerio, since jsdom is irredeemably leaky: jsdom/jsdom#1682. It crashed my attempts at auto-tuning my rule coefficients until I coded around it by preallocating all the DOMs rather than rebuilding them each time through the loop (which is fine and probably more efficient, at least until I have a zillion of them, but doesn't apply to other use cases). But I don't want to do that sort of trick for more general purposes; I'd have to keep a pool of jsdom docs around with all the attendant complexity. I tried switching to jsdoc.env(), rumored to be free of leaks, but it is intrinsically async, returning null and depending on a callback to do anything. This makes my decided non-event-loopy optimization loop hang. And in any case, jsdom.env() leaks faster than ever, even when given a callback that calls window.close().
So I am more motivated than ever to explore this. :-)
from fathom.
htmlparser2, which cheerio uses, is a nice project. It parses quickly. It claims to provide a DOM Level 1 interface, though I'm not seeing it, playing locally with its DomHandler. Next I'm going to have a look at the other contenders on the htmlparser-benchmark list. Do you know anything about any of them?
from fathom.
I don't think htmlparser2 is the silver bullet either. It seems to have its own share of memory leaks according to cheeriojs/cheerio#830.
I didn't know it had DOM Level 1 interface. I haven't tried it. According to https://github.com/fb55/domhandler you need to add a withDomLvl1
option to have DOM Level 1 properties.
Last time I looked at these parsing tools (1 year ago maybe), parse5 was the new kid on the block (in a good sense).
As I understand it, fathom
only needs read-only access to the DOM is that correct ? (selecting elements and traversing the DOM in several ways for clustering) so maybe a simpler "read-only cheerio" based on the adapter pattern that css-select
is advertising could be the way to go for fathom
. It could work with any dom-friendly interface for which and adapter could be written.
Would the API in https://github.com/nrkn/css-select-browser-adapter be sufficient for the traversal needs of fathom ?
I believe that the selectors offered by https://github.com/fb55/css-select would be sufficient compared to the native querySelectorAll that you are using but I must admit that I don't fully understand how they are differing. In my perspective, I would think that querySelectorAll would always be faster because browsers adopted the native querySelectorAll after jQuery introduced this sort of API in js land. But I don't know what is their difference feature-wise.
from fathom.
Regarding Node.contains, here is the implementation in cheerio - https://github.com/cheeriojs/cheerio/blob/765cdaaac56acdaf8779867c6205b7edde25e91f/lib/static.js#L167 . It should be easy enough to code through a css-select adapter.
from fathom.
As I understand it, fathom only needs read-only access to the DOM is that correct?
Yes.
Would the API in https://github.com/nrkn/css-select-browser-adapter be sufficient for the traversal needs of fathom?
I'd have to really dig through util.js to tell. Node.compareDocumentPosition() is one I immediately see is missing.
sufficient compared to the native querySelectorAll that you are using but I must admit that I don't fully understand how they are differing
I don't really care if some silly corner of CSS3 isn't supported by one or the other. As long as we can select on tags and attrs and attr values, it's fine.
from fathom.
you need to add a withDomLvl1 option to have DOM Level 1 properties
No luck:
var htmlparser = require("htmlparser2");
var rawHtml = "<html><body><p>Hello</p><p>Goodbye</p></body></html>";
var handler = new htmlparser.DomHandler(function (error, dom) {
if (error)
console.log('error')
else {
console.log(dom);
debugger;
}
},
{withDomLvl1: true}
);
var parser = new htmlparser.Parser(handler);
parser.write(rawHtml);
parser.done();
> dom.ELEMENT_NODE
undefined
> dom.childNodes
undefined
Even if it's a Document, Document is a subinterface of Node and should have those properties.
from fathom.
The dom level 1 compatibility indeed seems sparse - https://github.com/fb55/domhandler/blob/master/lib/node.js
from fathom.
from fathom.
Best domino port so far: #73 (comment)
from fathom.
Related Issues (20)
- Remove the need to look at Tensorboard for LR HOT 1
- Higher-rez histograms
- `fathom-init`-like tool
- Support CUDA in trainer HOT 1
- Handle whitespace in base64 strings in fathom extract
- Try using window loading events to trigger vectorization instead of `tabs.onUpdated` HOT 1
- Duplicate feature vectors for samples appearing in vector file
- Make Fathom see into iframes HOT 1
- Configuring Vectorizer...Error: Couldn't find trainee ID "TRAINEE" in your rulesets. HOT 11
- Travis CI free usage ends Dec 3; mozilla repos should switch to other CI platforms HOT 1
- Dev tools panel sometimes doesn't draw all its widgets HOT 1
- Evaluate SingleFile as alternative serializer
- Have an Outreachy intern
- Make isVisible 10% faster by avoiding unneeded getComputedStyle() calls HOT 2
- FathomFox doesn’t see elements that are in the iframe
- geckodriver code 64 HOT 2
- Can't find firefox binary
- Check ancestors overflow style instead of descendant's overflow style in IsVisible()
- fathom train fails: 'WebDriver' object has no attribute 'find_element_by_id' HOT 3
- `isVisible` returns `false` for elements scrolled off the top of the screen
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from fathom.