mozilla / fathom

A framework for extracting meaning from web pages

Home Page: http://mozilla.github.io/fathom/

License: Mozilla Public License 2.0

Makefile 1.00% JavaScript 61.85% HTML 4.69% Python 32.47%

fathom's Introduction

Fathom

Fathom is a supervised-learning system for recognizing parts of web pages—pop-ups, address forms, slideshows—or for classifying a page as a whole. A DOM flows in one side, and DOM nodes flow out the other, tagged with types and probabilities that those types are correct. A Prolog-like language makes it straightforward to specify the “smells” that suggest each type, and a neural-net-based trainer determines the optimal contribution of each smell. Finally, the FathomFox web extension lets you collect and label a corpus of web pages for training.

Continue reading at https://mozilla.github.io/fathom/intro.html#why.

fathom's Issues

Characterize OmniWeb's full-text indexing

Did OmniWeb attempt to identify the main content of a page? If so, how? Make an OmniWeb page and write the answer there.

Add code coverage

Still early days, but it'd be nice to see some code coverage numbers and a "shame badge" on the README (especially if we're going to start trying to use this for Activity Stream).

Update Readability algorithm description

Update Readability with the changes Mozilla made to Arc90 Labs' original work.

Undocumented dependency on jsdom 8.5.0

First of all, thanks for this amazing and well thought out library.

From reading the jsdom docs, it appears that an API update in version 10 broke compatibility with earlier versions of jsdom.

Using the latest jsdom (11.2.0), I wasn't able to get the code at https://mozilla.github.io/fathom/using.html#rules-sides-and-flows to work. Moreover, even after tweaking it for the new API, it breaks unexpectedly when querying facts against a BoundRuleset.

I haven't investigated it in detail, but it appears to be breaking in lhs.js at:

const matches = ruleset.doc.querySelectorAll(this.selector);

where jsdom used to expose querySelectorAll directly. For completeness, this is my stacktrace:

Uncaught TypeError: ruleset.doc.querySelectorAll is not a function
    at DomLhs.fnodes (/Users/Ying/Code/sfpc/newspaper-headlines/node_modules/fathom-web/lhs.js:88:37)
    at InwardRule.results (/Users/Ying/Code/sfpc/newspaper-headlines/node_modules/fathom-web/ruleset.js:315:37)
    at BoundRuleset._execute (/Users/Ying/Code/sfpc/newspaper-headlines/node_modules/fathom-web/ruleset.js:215:31)
    at BoundRuleset.get (/Users/Ying/Code/sfpc/newspaper-headlines/node_modules/fathom-web/ruleset.js:129:40)
    at request (/Users/Ying/Code/sfpc/newspaper-headlines/index.js:30:21)
    at ReadStream.<anonymous> (/Users/Ying/Code/sfpc/newspaper-headlines/node_modules/cached-request/lib/cached-request.js:249:15)
    at emitNone (events.js:110:20)
    at ReadStream.emit (events.js:207:7)
    at endReadableNT (_stream_readable.js:1047:12)
    at _combinedTickCallback (internal/process/next_tick.js:102:11)
fnodes @ VM255 lhs.js:88
results @ VM249 ruleset.js:315
_execute @ VM249 ruleset.js:215
get @ VM249 ruleset.js:129
request @ VM60 index.js:30
(anonymous) @ VM63 cached-request.js:249
emitNone @ VM27 events.js:110
emit @ VM27 events.js:207
endReadableNT @ VM52 _stream_readable.js:1047
_combinedTickCallback @ VM39 next_tick.js:102
_tickCallback @ VM39 next_tick.js:161

I'm using node v8.1.3.

I tried to see what the query might look like on latest jsdom, but ran into some issues. I'd be happy to elaborate on it further.

For now, I downgraded jsdom to 8.5.0. If this is documented somewhere already, sorry for the repost. Otherwise, I can make a PR to the README to specify the dependency.

Switch to headless Chrome as test runner

We need a proper renderer in order to take advantage of data like element position.

Reinstate readability tests (probably using brfs)
Reinstate code coverage reporting (I think there's a karma pipeline plugin for that)
Make sure we can still debug (and update the make debugtest target if necessary)
Make sure the tests still run in Travis.

Decide how to support page geometry etc. in tests and tuners

Partway through implementing #18, I noticed that it takes more than simply HTML to reconstitute the geometry of a page. That is, a mere headless browser does not suffice to deliver the additional signal we wanted. Let's choose a solution.

See https://github.com/mozilla/fathom/wiki/PageGeometryCaptureSolutions for a more editable, history-tracking place to do it.

See if there are any off-the-shelf libs out there that "freeze-dry" a page, preserving node geometry without saving every resource in its entirety. [Nope, Swathi couldn't find any.]

Choose a documentation extractor

I'd like to be able to keep API docs in one place—in the source code—but still pull them out into HTML docs that slap more prose and context around them. I would like to avoid the common antipattern where the docs consist of an alphabetical list of routines, leaving the poor newcomer to wander linearly through them, trying to piece together an overview. Add contenders below.

Sphinxy things

Sphinx integration is nice because Sphinx is amazing: it offers freeform docs in reStructuredText, indexing and glossary building, a good markup language with arbitrary nesting of constructs parsed in an unambiguous way, output in numerous formats, and a decade plus of maturity. I also know it like the back of my hand.

https://github.com/HumanBrainProject/jsdoc-sphinx. Use JSDoc in the code.
https://github.com/Nuulogic/sphinx-jsapidoc. Use Sphinx-style docs in the code.

Pure JSDoc

The nice thing about this is that some JS tooling can consume it, like editors and Google's Closure Compiler (which uses type annotations to hint to the optimizer).

Different kinds of mixing than multiplication

In 1.0 and 2.0, all scores on a type are multiplied together. This ticket is a parking place for discussion around allowing other kinds of operations. There are downsides to this, like not fitting certain machine-learning models as well. It could also make us care even more that rules execute in order and not in parallel. However, one approach might be to have rules lay down the component numbers and an aggregate function do the combination operations.
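A minimal sketch of that last approach, in plain JavaScript rather than Fathom's actual API: rules only record component numbers per type, and a caller-chosen aggregate function combines them. The names combineScores, product, and mean are hypothetical.

```javascript
// Hypothetical sketch: rules only record component numbers; a separate
// aggregate function decides how to combine them. Not Fathom's API.
const product = numbers => numbers.reduce((acc, n) => acc * n, 1);
const mean = numbers => numbers.reduce((acc, n) => acc + n, 0) / numbers.length;

function combineScores(components, aggregate = product) {
    // components: Map of type name -> array of numbers laid down by rules
    const combined = new Map();
    for (const [type, numbers] of components) {
        combined.set(type, aggregate(numbers));
    }
    return combined;
}

const components = new Map([['titleish', [2, 3, 0.5]]]);
const multiplied = combineScores(components);      // the 1.0/2.0 behavior
const averaged = combineScores(components, mean);  // an alternative mixing
```

Keeping the aggregate separate from the rules means the combination step does not depend on the order in which rules ran.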

Clustering

Implement clustering on DOM nodes. A cluster of paragraphs might be the body text of an article. A cluster of links might be a nav element.

Customizable clustering coefficients

Clustering has a lot of hard-coded coefficients at the moment: the all-caps ones in fathom.utils.distance(). I picked those based on contrived test cases, not real-world examples, so it is likely that users will want to tweak them. (We may also change the default ones, in which case users would need a way to lock in values they like.) Design a way for callers of clusters() to specify their own coefficient values to customize the behavior of distance measurement.
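One possible shape for this, sketched under assumptions: callers pass an options object whose fields override the defaults. The distance function below is a toy stand-in that compares ancestor tag paths; the coefficient names and default values are hypothetical and are not the ones in fathom.utils.distance().

```javascript
// Hypothetical defaults; the real values live as all-caps constants in
// fathom.utils.distance() and are not reproduced here.
const DEFAULT_COSTS = {differentDepthCost: 2, differentTagCost: 2, sameTagCost: 1};

// Toy distance between two nodes, each given as its ancestor tag path from
// the root, e.g. ['html', 'body', 'div', 'p'].
function distance(pathA, pathB, costs = {}) {
    const {differentDepthCost, differentTagCost, sameTagCost} =
        {...DEFAULT_COSTS, ...costs};
    // Skip the leading part of the path that the two nodes share.
    let shared = 0;
    while (shared < pathA.length && shared < pathB.length &&
           pathA[shared] === pathB[shared]) {
        shared++;
    }
    const restA = pathA.slice(shared);
    const restB = pathB.slice(shared);
    let cost = 0;
    const common = Math.min(restA.length, restB.length);
    for (let i = 0; i < common; i++) {
        cost += restA[i] === restB[i] ? sameTagCost : differentTagCost;
    }
    // Penalize each level by which one node is deeper than the other.
    cost += differentDepthCost * Math.abs(restA.length - restB.length);
    return cost;
}

const a = ['html', 'body', 'div', 'p'];
const b = ['html', 'body', 'ul', 'li', 'a'];
const withDefaults = distance(a, b);
const withOverride = distance(a, b, {differentDepthCost: 10});
```

Because unspecified fields fall back to the defaults, a caller who wants to lock in values can pass all of them explicitly and remain unaffected if the defaults change.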

examples with nested classes, e.g. microformats2

Just saw the lightning talk at #MozAloha - very cool!

Is it possible to write Fathom rules to detect nested class names? Perhaps even with additional CSS Selector constraints?

Would it be possible to add a simple example rule set for parsing data from nested elements with particular class names?

For example, I'd like to write rules like the nested class name rules for microformats2 parsing:

http://microformats.org/wiki/microformats2-parsing#parse_an_element_for_class_microformats

Or at least rules like:
.h-entry .p-name -> title = textContent

Happy to help get something simple working and iterate from there.

Check all TODOs

And do or delete or file them.

Implement and() optimization and changed-types refactor

Refactor the executor to plan not in terms of types added or emitted but in terms of types potentially changed from LHS to RHS. This, described in comments on erikrose's inferrer-generalization branch, is the generalized method which both provides optimal execution plans for the and() combinator (as in situations like and(type('a')) -> type('a')) and still gets the right answer for simple rules that have a single type on the LHS.

utils.js contains a reference to jsdom

utils.js includes a reference to jsdom over here:

https://github.com/mozilla/fathom/blob/master/utils.js#L430

That should be moved into a separate test module that isn't included in the official fathom package, because the reference causes issues when bundling fathom with webpack.

staticDom isn't used by any part of fathom itself, just the test suite.

For now, fathom-webextension works around this by adding an ignore directive:

https://github.com/crankycoder/fathom-webextension/blob/master/src/webpack.config.js#L71

Reduce ranker definition verbosity

This is pretty verbose, certainly too long for an arrow function embedded in a ruleset:

function someRanker(node) {
    return [{scoreMultiplier: 3,
             element: node.element,  // unnecessary, since this is the default
             flavor: 'texty',
             notes: {suspicious: true}}];
}

Once we've written some more test cases, observe what patterns shake out, and add shortcuts. I could see these taking the form of simple factored-up functions or of introspection in score() (for example, detecting iterables vs. plain objects or integer scores vs. objects).
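As one illustration of the introspection idea, here is a sketch of a normalizer that accepts shorthand return values and expands them to the verbose form. The function name normalizeRankerResult and the accepted shorthands are assumptions, not existing Fathom behavior.

```javascript
// Hypothetical normalizer: accepts the shorthand forms a ranker might return
// and expands them to the verbose array-of-objects form.
function normalizeRankerResult(result, node) {
    const expand = item => {
        // A bare number is shorthand for {scoreMultiplier: number}.
        const fact = typeof item === 'number' ? {scoreMultiplier: item} : item;
        // element defaults to the element of the node being ranked.
        return {element: node.element, ...fact};
    };
    // Arrays are expanded item by item; a single number or plain object
    // becomes a one-item array.
    return Array.isArray(result) ? result.map(expand) : [expand(result)];
}

const node = {element: 'fake-element'};  // stand-in for a real node
const fromNumber = normalizeRankerResult(3, node);
const fromObject = normalizeRankerResult({scoreMultiplier: 3, flavor: 'texty'}, node);
```

With something like this in place, a ranker could be as short as an arrow function returning 3.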

Max-score optimization

Once we can surface the min/max of a ranker declaratively, we can have the executor work out the most efficient execution path. This will make rulesets like Jared's title-finding one as efficient as a hard-coded approach. For example, we can execute the highest-possible-scoring rules first and, if they succeed and emit score S, we don't need to execute the rules whose max is lower than S and that emit the same type. Of course, this assumes we care about only the max of a set of rules, so we'll want to bring yankers into the ruleset and make them declarative so the global optimizer can know when such an assumption is valid. Consequently, we'll stop returning the entire knowledgebase and start returning instead a map of yankers to their output. For example (and don't pay attention to the syntax)…

ruleset:
    ...
    type(blah) -> {}
    ...
    yank("maxBlah", max(blah)),
    yank("cf", cluster(foo)),  // You can add more yankers to a ruleset later by just appending them. Later yankers of the same name override previous ones.
    yank("bar", all(bar))

That will emit something like…

{ maxBlah => a fathom node (fnode),
  cf => a bunch of fnodes or maybe a cluster object,
  bar => all fnodes of type "bar" so you can do your own yanking after the fact }
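The skipping logic itself can be sketched independently of the ruleset syntax. In this simplified model, which is an assumption for illustration only, every rule emits the same type, declares a maxScore, and has a run() that returns a score or null.

```javascript
// Hypothetical sketch: each rule declares the maximum score it can emit.
// Rules are stand-ins with a maxScore and a run() returning a score or null.
function bestScore(rules) {
    // Try the rules that could score highest first.
    const ordered = [...rules].sort((a, b) => b.maxScore - a.maxScore);
    let best = null;
    const executed = [];
    for (const rule of ordered) {
        // No remaining rule can beat what we already have, so stop.
        if (best !== null && rule.maxScore <= best) {
            break;
        }
        executed.push(rule.name);
        const score = rule.run();
        if (score !== null && (best === null || score > best)) {
            best = score;
        }
    }
    return {best, executed};
}

const rules = [
    {name: 'urlSlug', maxScore: 10, run: () => 10},
    {name: 'ogTitle', maxScore: 40, run: () => 40},
    {name: 'titleTag', maxScore: 20, run: () => 20},
];
const result = bestScore(rules);
```

Here only ogTitle runs, because once it emits 40 neither of the other rules can exceed it. This shortcut is valid only when the caller wants the max, which is why the yankers need to be declarative.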

Return an array of nodes?

I'm trying to play w/ <meta name="keywords"/> and using open graph tags... Does Fathom support returning multiple matching elements?

For example, can I return all matching meta[property='og:video:tag'] elements?

<meta name="keywords" content="&quot;double dream hands&quot; &quot;john jacobson&quot; music dance fun &quot;sprint guy&quot;">
...
<meta property="og:video:tag" content="double dream hands">
<meta property="og:video:tag" content="john jacobson">
<meta property="og:video:tag" content="music">
<meta property="og:video:tag" content="dance">
<meta property="og:video:tag" content="fun">
<meta property="og:video:tag" content="sprint guy">

Plus http://ogp.me/#no_vertical

Ultimately, I think we want to return an array of keywords (either from meta keywords, or og:article:tag, og:book:tag, or og:video:tag).

The other use case would be potentially returning an array of og:image tags, per http://ogp.me/#array. (Currently ignoring the problem of trying to extract adjacent og:image and optional og:image:width and og:image:height elements):

<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />
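Should the adjacent-tag problem need solving, the grouping rule described at http://ogp.me/#array (structured properties attach to the most recently declared root tag) can be sketched as follows. The function name groupStructured and the [property, content] input format are assumptions; reading the pairs out of a DOM is left out.

```javascript
// Sketch: group Open Graph structured properties into an array of objects.
// Input is [property, content] pairs in document order, as they might be
// read from <meta> elements.
function groupStructured(pairs, root) {
    const items = [];
    for (const [property, content] of pairs) {
        if (property === root) {
            // A root tag such as og:image starts a new item.
            items.push({url: content});
        } else if (property.startsWith(root + ':') && items.length > 0) {
            // og:image:width etc. attach to the most recent root tag.
            items[items.length - 1][property.slice(root.length + 1)] = content;
        }
    }
    return items;
}

const pairs = [
    ['og:image', 'http://example.com/rock.jpg'],
    ['og:image:width', '300'],
    ['og:image:height', '300'],
    ['og:image', 'http://example.com/rock2.jpg'],
    ['og:image', 'http://example.com/rock3.jpg'],
    ['og:image:height', '1000'],
];
const images = groupStructured(pairs, 'og:image');
```

For the markup above this yields three image objects: the first with width and height, the second with only a URL, and the third with only a height.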

Add a compilation pass

Add a compilation pass for rulesets. This will bring several improvements:

Catch errors early: for example, dom() rules that don't assign a type.
Compute rule prereqs only once.
Support and() rules with more complex operands. (Compiling down to simpler rules is much easier than coding it all into AndLhs.)

I see us inserting a CompiledRuleset class chronologically between Ruleset and BoundRuleset. Callers can compile a ruleset manually if they wish, or they can just call Ruleset.against() as now, and the compile will happen implicitly (and then get thrown away).

Design metarules

Fathom would like "metarules": the ability to turn certain rules on or off based on [yet-to-be-determined criteria, likely including URL].

(Until then, you can do it manually by just writing a little JS code to affect which rules get passed into the ruleset. Rules are unordered in Fathom, so you can just toss things in in any order the logic deems convenient. But it would be great to make such imperative ruleset-building unneeded.)
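A rough sketch of that manual approach, with placeholder objects standing in for real Fathom rules. The helper name rulesFor, the rule names, and the hostname test are all hypothetical.

```javascript
// Placeholder rule objects; in real code these would be Fathom rules.
const commonRules = [{name: 'titleFromOgTag'}, {name: 'titleFromTitleTag'}];
const shoppingRules = [{name: 'priceNearBuyButton'}];

// Hypothetical helper: choose rules imperatively based on the page URL.
function rulesFor(url) {
    const rules = [...commonRules];
    // Example criterion only; the issue leaves the real criteria undecided.
    if (new URL(url).hostname.endsWith('shop.example.com')) {
        rules.push(...shoppingRules);
    }
    return rules;  // order is irrelevant, since rules are unordered
}

const forShop = rulesFor('https://shop.example.com/item/1');
const forBlog = rulesFor('https://blog.example.com/post/1');
```

The resulting array would then be handed to the ruleset constructor; metarules would make this imperative step unnecessary.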

Link to readthedocs in readme is wrong

I'm pretty sure the content at https://fathom.readthedocs.io/en/latest/ is unrelated to this library.

Add toString() methods for things so error messages are nice.

Not everything has a toString() yet: rules in particular.

Decide on max/top/maxScore/score/getScore spelling

I'd like to remove the verbose scoreUpTo and getScore and say max (or at least maxScore) and score instead, but I don't want to confuse people. If max(8) on the LHS means "the top 8" and max(8) on the RHS means "score at most 8", that might be confusing.

What about max(8) on the LHS meaning "no score over 8" (which we don't have to implement until we need it) and top(8) on the LHS meaning "the top 8"? The other semantic we'll need is over(8).

under(8) on the rhs might be a nice symmetry with over(5) on the lhs.

Write from-1.x porting guide for rulesets

Implement Readability-style ruleset as a test

This will doubtless challenge and inform our design. A "should-work" sketch is already in test.js.

Finish readme

Document all the shiny new 2.0 stuff. Pull from the outline of my Hawaii talk.

and() combinator

Write an and() combinator for the LHS. This, plus negation, should make us as powerful as Prolog.

Add Travis CI, once it supports Node 6.

We need node.js 6.0 or better for ES6 support.

Potential functions

This could be really useful if you could extract names and proper nouns with a bit of NLP, as well as email addresses and social media accounts. Are any of these items on the road map?

Rete trees for optimizing combinators

If we want to make arbitrary logical expressions on LHSs go fast, this is one way. Rete trees are a sort of generalized trie in which LHS conditions are the trie nodes, getting more specific as you go away from the root. Then fnodes get tagged onto the trie nodes where they match, so, when executing a rule, you don't need to look aside to any indices or anything. It trades memory for speed.

Remove messy design notes comments

…from index.js and anywhere else they persist. Move them into docs or personal design notes.

Consider a .multiplyExistingScoreBy(5) for ML tuning

Obviously, that name is ridiculously unwieldy, but the need stands. We'll want to scale the emitted score of a rule by a constant, automatically, while training (and perhaps in the final, tuned ruleset). Right now, this could be done by interposing a type, but it will probably prove pragmatic to let us do it more concisely. We could either add this complication to the runtime or let the compilation pass interpose a made-up type implicitly.

Training infra for automatic coefficient determination

Cool project!

Have you considered ways to determine weights automatically? Once you have a set of rules, it'd be awesome to be able to learn weights by providing a set of documents with extractions and labelling them as either correct or incorrect. This then becomes a fairly standard supervised learning problem to estimate the weights.
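To make the suggestion concrete, here is a small sketch of one standard way to estimate such weights: logistic regression trained by stochastic gradient descent. It is not Fathom's trainer. The data is made up, and each sample is assumed to be a vector of rule outputs for one candidate extraction plus a 0/1 label.

```javascript
// Sketch: learn one weight per rule by logistic regression, trained with
// per-sample (stochastic) gradient descent.
const sigmoid = z => 1 / (1 + Math.exp(-z));

function predict(weights, bias, features) {
    const z = features.reduce((sum, x, i) => sum + weights[i] * x, bias);
    return sigmoid(z);
}

function train(samples, {learningRate = 0.5, epochs = 2000} = {}) {
    const weights = new Array(samples[0].features.length).fill(0);
    let bias = 0;
    for (let epoch = 0; epoch < epochs; epoch++) {
        for (const {features, label} of samples) {
            // Gradient of the log loss with respect to z is (p - label).
            const error = predict(weights, bias, features) - label;
            features.forEach((x, i) => { weights[i] -= learningRate * error * x; });
            bias -= learningRate * error;
        }
    }
    return {weights, bias};
}

// Made-up data: rule 0 is informative, rule 1 is noise.
const samples = [
    {features: [1, 0], label: 1},
    {features: [1, 1], label: 1},
    {features: [0, 1], label: 0},
    {features: [0, 0], label: 0},
];
const model = train(samples);
```

After training, the weight on rule 0 dominates, reflecting that only it separates correct from incorrect extractions in this toy data.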

Export ZEROISH and ONEISH for rulesets

Over in fathom-trainees there's a recommendation to return not-quite-0 and not-quite-1 values from rules to accommodate the trainer. Fathom should probably export these as MIN_SCORE and MAX_SCORE constants for rulesets to use so that they don't have to copy the values themselves.
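A sketch of how such exported constants might be used by a ruleset. The numeric values below are placeholders chosen for illustration, not necessarily the ones recommended in fathom-trainees, and both helper functions are hypothetical.

```javascript
// Placeholder values for illustration only.
const MIN_SCORE = 0.08;
const MAX_SCORE = 0.9;

// Hypothetical helper: turn a boolean smell into a trainer-friendly score.
const scoreFromBoolean = flag => (flag ? MAX_SCORE : MIN_SCORE);

// Hypothetical helper: keep a computed score inside [MIN_SCORE, MAX_SCORE],
// so it never reaches exactly 0 or 1.
const clampScore = score => Math.min(MAX_SCORE, Math.max(MIN_SCORE, score));
```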

Shop for JS DOM implementations

We use jsdom for our test cases (and, by accident, effectively recommend it for people who need to feed fathom a string rather than a predigested DOM object). But jsdom is pretty slow, and the DOM API itself is not great. cheerio has been requested by one group inside Mozilla. There's also domjs and domino. Write pro/con lists and choose one. Here's a starting set of goals. We can edit it as needed.

Be fast.
Be well-maintained.
Remain able to take in DOMs produced by JS running in a browser. IOW, if we express ranker functions in terms of cheerio's API, we need a way to get from a native DOM object to "a cheerio" so those rankers can still be used.

Switch to ESLint

ESLint is objectively better (and more configurable).
Plus, it found these:

➜  fathom git:(master) ✗ npm run lint

> @ lint /Users/pdehaan/dev/github/fathom
> eslint .


/Users/pdehaan/dev/github/fathom/fathom.js
    6:7   error  'flatMap' is defined but never used       no-unused-vars
   10:7   error  'jsdom' is defined but never used         no-unused-vars
  222:44  error  'isBlock' is not defined                  no-undef
  243:10  error  'paragraphish' is defined but never used  no-unused-vars
  247:24  error  'str' is defined but never used           no-unused-vars

/Users/pdehaan/dev/github/fathom/test/test.js
  20:40  error  'node' is defined but never used  no-unused-vars
  35:30  error  'node' is defined but never used  no-unused-vars
  36:28  error  'node' is defined but never used  no-unused-vars

✖ 8 problems (8 errors, 0 warnings)

And my .eslintrc file looks like this:

env:
  es6: true
  mocha: true
  node: true

extends:
  - eslint:recommended

Write an aggregate function for clustering

Something like type('smoo').biggestCluster() on the LHS which emits the most populous cluster of smoos. We ought to be able to have it tolerate the occasional low-scoring sibling by messing with the stride-node cost coefficient.

And/or we could have one that emits the max-scoring cluster, by some metric.

This isn't gospel; please apply design thought.

Change vocabulary from "scribbles" to "notes".

It's shorter and more accurate.

ANDs in left-hand sides of rules

Perhaps this should wait until we have a concrete use case, but being able to AND types together on the left-hand side of rules would make the language much more powerful. ORs could happen, too, but they're already expressible by simply adding a second rule with the same right-hand side.

Port Jared's rules to Fathom

Jared did a sketch of some extractors for Universal Search but based them on a much simpler, non-scoring, short-circuiting design with mixed rank and yank phases: https://github.com/JaredKerim-Mozilla/fathom/blob/master/fathom.js#L66. Port them to Fathom's current design, and bend our design to make them possible, if necessary. Short-circuiting isn't a blocker and would probably uglify our design. It's likely we can instead make the ranker functions more declarative and make the rule engine smart enough not to execute rules that couldn't possibly win.

Let score() take a function

…if this proves useful.

Need to be able to timeout ruleset execution

It is possible to define rulesets which run very slowly on some webpages.

This currently causes performance problems in Firefox (I'm currently testing on 58).

I think we should be able to mitigate this problem if Fathom supported a way to timeout ruleset execution after X ms.

In my head, I'd like something like a callback system where I can pass in a function that accepts the current runtime in ms of the ruleset. Returning a non-zero value should cause the fathom ruleset to abort early.
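A sketch of what that callback contract could look like. This is a hypothetical executor loop, not Fathom's: each step stands in for one rule execution, and the clock is injectable so the behavior can be shown deterministically.

```javascript
// Hypothetical executor loop: steps stand in for individual rule executions.
// After each step, the callback receives the elapsed time in ms; a non-zero
// return value aborts the run early.
function runWithTimeout(steps, shouldAbort, now = Date.now) {
    const start = now();
    const results = [];
    for (const step of steps) {
        results.push(step());
        if (shouldAbort(now() - start) !== 0) {
            return {completed: false, results};
        }
    }
    return {completed: true, results};
}

// Usage with a fake clock: each call to the clock advances time by 30 ms.
let fakeTime = 0;
const fakeClock = () => (fakeTime += 30);
const steps = [() => 'a', () => 'b', () => 'c'];
const outcome = runWithTimeout(steps, elapsed => (elapsed > 50 ? 1 : 0), fakeClock);
```

Note that a synchronous check like this can only abort between steps; a single slow rule would still run to completion before the callback is consulted.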

Use for ebooks

Is it possible to use this for EPUB files? The EPUB format is a subset of XHTML, zipped and with a bit of metadata thrown in. I am trying to strip out unnecessary content (e.g. copyright pages, content pages) when converting ebooks to audiobooks, where such pages are redundant. Is this a suitable use case?

Rename "node" to something better.

"Node" already means "DOM node". Use another word for the node proxies Fathom scribbles its bookkeeping information on.

Add not() and or() combinators

Update Fathom docs to acknowledge FathomFox

There's really no need to write code against the Corpus Framework anymore. Training in-browser gives better performance and stronger sources of signal with less setup.

Move labeling UI into Inspector sidebar pane

Once https://bugzilla.mozilla.org/show_bug.cgi?id=1398734 is done, we'll be able to move our labeling UI from its current home in a separate Fathom devtools panel to a subpane of the Inspector panel. This will greatly speed the labeling experience, relieving people of pogo-sticking between the two panes.

Rename "scoreMultiplier" to "score" for brevity

Characterize Chrome's distraction-free-browsing mode

How does Chrome identify what's a distraction and what's not? Summarize on a new ChromeDistractionFreeBrowsing page.

Let out('smoo') be spelled as just 'smoo'

…for symmetry with facts.get('smoo').

when()

Takes a predicate that further restricts which nodes are selected.

Example:
type('priceish').when(fnode => fnode.element.tagName.length > 5)

Maintain an array of predicates in the higher-level Lhs class. The when() function in Lhs adds a predicate to the array. The Lhs subclasses check that all predicates in the array are satisfied before returning an element.
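The design described here could look roughly like the following. Class and method names mirror the issue text but are otherwise hypothetical, and the fnode objects are simplified stand-ins.

```javascript
// Hypothetical sketch of the base class holding the predicate array.
class Lhs {
    constructor() {
        this.predicates = [];
    }

    // Add a predicate and return this, so calls can be chained.
    when(predicate) {
        this.predicates.push(predicate);
        return this;
    }

    // Subclasses call this to drop fnodes that fail any predicate.
    satisfiesAll(fnode) {
        return this.predicates.every(predicate => predicate(fnode));
    }
}

class TypeLhs extends Lhs {
    constructor(type) {
        super();
        this.type = type;
    }

    // candidates: fnode stand-ins of the form {types: Set, element: {tagName}}
    fnodes(candidates) {
        return candidates.filter(
            fnode => fnode.types.has(this.type) && this.satisfiesAll(fnode));
    }
}

const candidates = [
    {types: new Set(['priceish']), element: {tagName: 'SPAN'}},
    {types: new Set(['priceish']), element: {tagName: 'SECTION'}},
    {types: new Set(['other']), element: {tagName: 'ARTICLE'}},
];
const selected = new TypeLhs('priceish')
    .when(fnode => fnode.element.tagName.length > 5)
    .fnodes(candidates);
```

Only the SECTION candidate survives: SPAN fails the predicate, and ARTICLE has the wrong type.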