glennjones / microformat-node
Microformats parser for node.js
Home Page: http://glennjones.net/tools/microformats/
License: MIT License
microformats.parseUrl('http://glennjones.net/about', options, function(err, data){
if (err) throw (err);
console.log(data);
});
results in:
TypeError: Cannot read property '0' of null
at getName (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:78:36)
at getLCName (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:84:14)
at parse (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:160:12)
at compileUnsafe (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/lib/compile.js:26:9)
at select (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/index.js:17:43)
at CSSselect (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/index.js:40:9)
at exports.find (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/lib/api/traversing.js:7:21)
at new module.exports (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/lib/cheerio.js:83:18)
at initialize (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/lib/static.js:19:12)
at Object.Parser.apppendInclude (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/lib/parser.js:1446:13)
Digging around a bit to see what the cause is. Parsing other URLs seems to work though.
Parsing the URL "https://www.zalando.nl/nike-performance-trainingsbroek-blackcool-grey-n1241e0em-q11.html" causes the library to hang. It does not crash, but rather hangs for a very long time at 100% cpu-core consumption. Memory does not increase significantly.
Code to reproduce the problem:
const fetch = require('node-fetch');
const mf = require('microformat-node');
const fetchUrl = url => fetch(url)
.then(res => res.text())
.then(html => mf.getAsync({html}));
fetchUrl('https://www.zalando.nl/nike-performance-trainingsbroek-blackcool-grey-n1241e0em-q11.html')
microformat-node version: 2.0.1
node version: v8.9.4 (also tested with v6.11.5 with same problem)
Could you please give me some pointers on where to start looking in the code for a possible cause of this problem? Thanks!
Current parseDom operates on jQuery-like objects.
The following code crashes with a TypeError: Cannot read property 'replace' of null:
var Microformats = require('microformat-node');
Microformats.get({
html: '<div class="include item"></div>'
}, function (err, data) {
console.log(err, data);
});
It also fails on this input:
'<div class="include"></div><div class="item"></div>'
Narrowed-down sample from http://www.advodata.be/; also happening on www.netflow.ro.
In the following example the parser uses the textContent for u-url and u-uid. Based on the u-* parsing rules from http://microformats.org/wiki/microformats2-parsing, I was expecting to get the href attribute, as rule 1 says: if a.u-x[href] or area.u-x[href] or link.u-x[href], then get the href attribute.
var chai = require('chai'),
assert = chai.assert,
helper = require('../test/helper.js');
describe('h-entry', function() {
var htmlFragment = "<li class=\"h-entry hentry\">\r\n <p class=\"p-name entry-title e-content entry-content\">\r\n Google the company *already* effectively rebranded, into Alphabet_Inc.\r\n <\/p>\r\n <span class=\"footer\"><a href=\"2019\/156\/t2\/\" class=\"dt-published published dt-updated updated u-url u-uid\"><time class=\"value\" datetime=\"11:56-0700\">11:56<\/time> on <time class=\"value\">2019-06-05<\/time><\/a><\/span>\r\n<\/li>";
var found = helper.parseHTML(htmlFragment,'http://example.com/');
var expected = "/2019/156/t2/";
it('u-url', function(){
var url = found.items[0].properties.url[0];
assert.equal(url, expected);
});
it('u-uid', function(){
var uid = found.items[0].properties.uid[0];
assert.equal(uid, expected);
});
});
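The expectation in this report can be sketched as follows. This is a minimal illustration of rule 1 only, assuming the element is passed as a plain object with tagName, attribs and text fields (a made-up shape, not the parser's internal representation). Resolving the href against the base URL is also an assumption here, based on the spec going on to normalise the value to an absolute URL.

```javascript
// Hypothetical helper, not microformat-node's implementation.
// `el` is a plain object: { tagName, attribs, text }.
function uPropertyValue(el, baseUrl) {
  var tag = String(el.tagName || '').toLowerCase();
  var href = el.attribs && el.attribs.href;

  // Rule 1: a.u-x[href], area.u-x[href] or link.u-x[href] -> use href.
  if (['a', 'area', 'link'].indexOf(tag) !== -1 && typeof href === 'string') {
    return new URL(href, baseUrl).href;
  }

  // Simplified fallback: the real rules check several other cases first.
  return String(el.text || '').trim();
}
```

The point of the sketch is the ordering: the href check runs before any text-based fallback, so nested time elements inside the link do not affect the u-* value.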
Not sending a PR for this one because I neither know exactly what's causing it nor think it's a big enough deal to be worth spending a lot of time on debugging, but I'm making a note here anyhow so that it's documented.
In the checkmention project by @kbsriram there's an XSS test that parses differently now compared to version 0.3.x: https://github.com/kbsriram/checkmention/blob/master/src/WEB-INF/checks/xss
More specifically, version 2.0.0 of this library now ignores the final </script> in the <<SCRIPT>alert("XSS4");//<</SCRIPT> code in there, so all of the remaining e-content is treated as content of the script tag and dropped from the text. This didn't happen before.
This means that the text output of parsing that XSS file is now:
Clicking this\nshould not cause an alert.\nThis div\nshould not alert.\nTry clicking this link\n<script>alert(\"encoded-xss\")</script>\nand this too.\nMouse over this\nshould not cause an alert. This broken\n should not throw an alert.\n<
Whereas before it was:
Clicking this\nshould not cause an alert.\nThis div\nshould not alert.\nTry clicking this link\n<script>alert(\"encoded-xss\")</script>\nand this too.\nMouse over this\nshould not cause an alert. This broken\n should not throw an alert.\n<alert(\"XSS4\");//\n\nNeither should .\nPlease look at the Owasp XSS prevention cheat sheet for more information.\n\n\nThis note was created on\n\n%%nice_time
When looking at the same content directly in the browser I would say that the former handling was more correct than the current one, but I'm not sure what has changed. Cheerio is still the same, so it must be something outside of Cheerio's control?
But – as I said in the beginning – this feels like an edge case that's not really worth spending a lot of time on fixing if it isn't an indicator of something bigger which I don't think it is.
Using truncated representations of dates for birth date is often good practice as noted in the vcard spec http://microformats.org/wiki/h-card#dt-bday
"--12-28"
Apart from citing parecki's birthday from the public h-card (send him much gifts) I'll look into it now for a fix [problem is in the intermediate step we would have to use the year 9999 and replace it afterwards, one reason why I am using a format like in my nlp module ].
Feel free to look into the following PR ...
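For reference, the truncated form can be recognised directly, without substituting a placeholder year. The sketch below only illustrates the "--MM-DD" format mentioned above; it is not the approach taken in the PR, and the function name is made up.

```javascript
// Illustration only: recognises the truncated "--MM-DD" form.
// It does not check days against month length (e.g. --02-31 passes).
var TRUNCATED_DATE = /^--(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$/;

function parseTruncatedDate(value) {
  var match = TRUNCATED_DATE.exec(String(value));
  if (!match) { return null; }
  return { month: Number(match[1]), day: Number(match[2]) };
}
```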
i.e. class="h-entry h-note" is returned as type: ["h-entry"] but should be returned as type: ["h-entry","h-note"]
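A minimal sketch of the expected behaviour, assuming root types are simply every h-* token in the class attribute. The pattern is a simplification of the naming rules in the parsing spec, and rootTypes is a made-up name.

```javascript
// Hypothetical helper: collects every h-* class name, in document order.
// The pattern is deliberately loose compared with the spec's naming rules.
function rootTypes(className) {
  return String(className || '')
    .split(/\s+/)
    .filter(function (token) {
      return /^h-[a-z0-9]+(?:-[a-z0-9]+)*$/.test(token);
    });
}
```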
This project hasn't been maintained in a number of years.
There is an up-to-date library available here: https://github.com/microformats/microformats-parser, that works both with node.js and in the browser. Also supports TypeScript.
Hi Glenn, awesome module!
The checkLimits function in lib/cache.js is automatically called when requiring microformat-node, even when the Parser doesn't use caching. Personally, this has prevented an app from closing (due to the setTimeout still running) and has also kept tests (from a module that I'm working on) from exiting.
If possible, it would be great if the cache didn't automatically start until its first use. I'm just knocking up a quick feature branch now that does this and will push it up shortly so that it can be discussed further.
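One way the lazy start could look, as a sketch rather than the actual feature branch: the timer is created on the first write, and unref() is called on it so that a pending timer does not keep the process alive. createLazyCache and its methods are hypothetical names, and the eviction logic is left out.

```javascript
// Hypothetical cache, not the code in lib/cache.js.
function createLazyCache(intervalMs) {
  var store = new Map();
  var timer = null;

  function checkLimits() {
    // Eviction of old or excess entries would go here.
  }

  function ensureTimer() {
    if (timer === null) {
      timer = setInterval(checkLimits, intervalMs);
      // unref() lets the process exit even while the timer is pending.
      timer.unref();
    }
  }

  return {
    set: function (key, value) { ensureTimer(); store.set(key, value); },
    get: function (key) { return store.get(key); },
    isTimerRunning: function () { return timer !== null; },
    stop: function () {
      if (timer !== null) { clearInterval(timer); timer = null; }
    }
  };
}
```

With this shape, merely requiring the module schedules nothing, which is the behaviour asked for above.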
This could either be an issue with the parser or an unforeseen issue in the spec.
The example below is not parsing correctly. I would expect the entry "name" to be the empty string. Adding any non-whitespace text to the e-content causes it to revert to expected behavior.
<!DOCTYPE html>
<html lang="en">
<head>
</head>
<body>
<div class="h-entry">
<a href="http://this.site/photo" class="u-url"></a>
<div class="e-content p-name"><img src="photo.jpg" class="u-photo"/></div>
Some extraneous text
<div class="h-cite">
<a href="http://someother.site/like" class="u-url"></a>
<a href="http://this.site/photo" class="u-like-of"></a>
<div class="e-content p-name">liked this</div>
</div>
</div>
</body>
</html>
{ items:
[ { type: [ 'h-entry' ],
properties:
{ url: [ 'http://this.site/photo' ],
content: [ { value: '', html: '<img src="photo.jpg" class="u-photo" />' } ],
photo: [ 'photo.jpg' ],
name: [ 'Some extraneous text\r\n\r\n \r\n \r\n \r\n liked this' ] },
children:
[ { value: 'liked this',
type: [ 'h-cite' ],
properties:
{ url: [ 'http://someother.site/like' ],
'like-of': [ 'http://this.site/photo' ],
content: [ { value: 'liked this', html: 'liked this' } ],
name: [ 'liked this' ] } } ] } ],
rels: {},
'rel-urls': {} }
Given the following HTML stored as a variable body in JS (taken from the h-entry example):
<article class="h-entry">
<h1 class="p-name">Microformats are amazing</h1>
<p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time>
<p class="p-summary">In which I extoll the virtues of using microformats.</p>
<div class="e-content">
<p>Blah blah blah</p>
</div>
</article>
and the following parsing call (or substituting the parseUrl that returns body as the previous HTML):
microformats.parseHtml(body, {'filters': ['h-entry']});
The result in items in the resultant data object will be an empty h-entry object like the following:
{
"items": [{
"type": ["h-entry"],
"properties": {}
}],
"rels":{}
}
Converting the article element into a div element results in a successful, fully-parsed h-entry object like so:
{
"items": [
{
"type": [
"h-entry"
],
"properties": {
"author": [
{
"type": [
"h-card"
],
"properties": {
"name": [
"W. Developer"
],
"url": [
"http:\/\/example.com"
]
},
"value": "W. Developer"
}
],
"name": [
"Microformats are amazing"
],
"summary": [
"In which I extoll the virtues of using microformats."
],
"published": [
"2013-06-13 12:00:00"
],
"content": [
{
"html": "\n <p>Blah blah blah<\/p>\n ",
"value": "Blah blah blah"
}
]
}
}
],
"rels": {}
}
I haven't had time to investigate the problem fully yet, but I believe that the source of the problem might be in the use of cheerio and its support for HTML5 elements. If it isn't, then it's something directly in microformat-node.
Running the following:
var uf = require('microformat-node');
uf.parseUrl('http://www.limitedtoendodontics.com/', {}, console.log);
I get following error:
SyntaxError: empty sub-selector
at parse (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:178:9)
at compileUnsafe (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/lib/compile.js:26:9)
at select (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/index.js:17:43)
at CSSselect (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/index.js:40:9)
at exports.find (/var/www/piccolo/node_modules/cheerio/lib/api/traversing.js:13:21)
at Object.Parser.appendInclude (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1462:28)
at Object.Parser.addAttributeIncludes (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1413:11)
at Object.Parser.addIncludes (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1388:8)
at Object.Parser.get (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:278:10)
at Object.parseDom (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:155:17
I am getting build errors with Travis-CI due to an intermittent 403 response from this api.github URL during npm install, even though it seems to install fine locally:
"microformat-shiv": "https://api.github.com/repos/glennjones/microformat-shiv/tarball/"
Would you be opposed to using the owner/repo convention outlined here?
"microformat-shiv": "glennjones/microformat-shiv"
I just wanted to try it out and ran the parseUrl-promise example. However, it always failed with "TypeError: Cannot call method 'then' of undefined".
After checking the source code I saw that parseUrl does not return a promise. The return is actually simply commented out (5546247). Is there a reason for that, or did it happen by accident? If it is intentional, the example should probably be removed.
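For illustration, a function can serve both calling styles by always returning a promise and invoking the callback only when one is supplied. This is a generic sketch, not the library's code: withPromise is a made-up name and parseWithCallback stands in for any function with a Node-style (url, options, callback) signature.

```javascript
// Hypothetical wrapper; `parseWithCallback` is any function that takes
// (url, options, callback) and calls back with (err, data).
function withPromise(parseWithCallback) {
  return function (url, options, callback) {
    var promise = new Promise(function (resolve, reject) {
      parseWithCallback(url, options, function (err, data) {
        if (err) { reject(err); } else { resolve(data); }
      });
    });

    // Keep the callback style working for existing callers.
    if (typeof callback === 'function') {
      promise.then(function (data) { callback(null, data); }, callback);
    }
    return promise;
  };
}
```

On Node versions that ship util.promisify, that built-in covers the promise half of this without a hand-written wrapper.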
Testing this library:
var uf = require("microformat-node");
uf.parseUrl('http://microformats.org/2014/06', {}, function (err, result) {
console.log(JSON.stringify(result.items, null, 2));
});
This results in the following:
[
...,
{
"type": [
"h-card"
],
"properties": {
"url": [
"http://tantek.com/"
],
"name": [
"Tantek"
]
}
}
]
while viewing the source shows
<address class="vcard"><a class="url fn" href="http://tantek.com/">Tantek</a></address>
It seems the wrong type is detected?
This library uses the abandoned package "ent", which uses the deprecated Node punycode module.
Temporary solution: https://www.npmjs.com/package/ent-replace
The library crashes on the following markup with a TypeError: Cannot set property '0' of undefined:
var Microformats = require('microformat-node');
Microformats.get({
html: '<div class="hentry"><div class="dt-">0AM<div class="dt-">x</div></div></div>'
}, function (err, data) {
console.log(err, data);
});
narrowed down sample from: http://lbpm.com/
also happening on www.nesbitts.com, www.browsbyfay.com, decijisajam.rs, phytocarestetica.com
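A defensive sketch, assuming the crash is triggered by the empty property name in the "dt-" class (that cause is a guess from the sample, not confirmed). The idea is to ignore prefix-only tokens before any property handling; propertyNames is a hypothetical helper.

```javascript
// Hypothetical helper; assumes the empty name in "dt-" is what trips the parser.
// Only tokens with a known prefix AND a non-empty name are kept.
var PROPERTY_CLASS = /^(p|u|dt|e)-([a-z0-9]+(?:-[a-z0-9]+)*)$/;

function propertyNames(className) {
  return String(className || '')
    .split(/\s+/)
    .map(function (token) { return PROPERTY_CLASS.exec(token); })
    .filter(Boolean)
    .map(function (m) { return { prefix: m[1], name: m[2] }; });
}
```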
I get an error when running the following:
var uf = require('microformat-node');
uf.parseUrl('http://www.markgordondentistry.com/', {}, console.log);
TypeError: Cannot read property '0' of null
at getName (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:76:36)
at parse (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:129:13)
at compileUnsafe (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/lib/compile.js:26:9)
at select (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/index.js:17:43)
at CSSselect (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/index.js:40:9)
at exports.find (/var/www/piccolo/node_modules/cheerio/lib/api/traversing.js:13:21)
at Object.Parser.appendInclude (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1462:28)
at Object.Parser.addAttributeIncludes (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1413:11)
at Object.Parser.addIncludes (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1388:8)
at Object.Parser.get (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:278:10)
It seems this page has an itemref='Mark J. Gordon, DDS' in it somewhere, which makes the engine choke.
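If the crash comes from turning the raw attribute value into a CSS selector (an assumption based on the stack trace ending in the selector parser), one option is to split the value into tokens and look each one up with a quoted attribute selector, which stays valid for tokens containing "." or ",". The helper name is made up.

```javascript
// Hypothetical helper, not part of microformat-node.
// Turns an itemref attribute value into attribute selectors that stay
// valid even when a token contains characters such as "." or ",".
function itemrefToSelectors(itemref) {
  return String(itemref || '')
    .split(/\s+/)
    .filter(Boolean)
    .map(function (id) {
      // Escape backslashes and double quotes for use inside a quoted string.
      var escaped = id.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
      return '[id="' + escaped + '"]';
    });
}
```

Tokens that match no element would then simply yield an empty result instead of a selector syntax error.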
The likely cause is that test/testWriter.js was removed in c44e1da. It means that bin/microformat-node refuses to start now, though.
Not sure what function it played, so I can't provide an easy PR to fix it, unfortunately.
I was playing around a bit trying to see if I could get this library working with React Native and got Cheerio and HTMLParser2 working, but for some reason didn't get this module working and that's probably due to the Request module.
When I use microformat-node I often do it in a context where I already have URL fetching available. In e.g. https://github.com/voxpelli/webpage-webmentions I use request myself, but in some newer projects I've started experimenting with other libraries like node-fetch.
I also often do the fetching through something like fetch-politely to ensure proper rate-limiting and robots.txt compliance.
All in all, I rarely use the fetching that this library provides; I just use it for parsing the HTML, either directly or by sending it a cheerio object. So this module including the entire request library is often a bit redundant for my use cases, and since one can't exclude a dependency it makes it hard to e.g. do a slim build for something like React Native, where size and complexity might matter more than they do on server-side node.js.
So my proposal would be to either just remove the URL fetching altogether or to split it up so there's a module that's focused on just the parsing and another that also provides helpers like parseUrl().
What are your thoughts?
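A rough sketch of what such a split could look like, with the fetcher injected by the caller so the parsing side carries no HTTP dependency. createParser and both of its parameters are hypothetical; nothing here reflects the module's current layout.

```javascript
// Hypothetical factory; `parseHtml` and `fetchHtml` are supplied by the caller.
// `fetchHtml(url)` may return the HTML string or a promise for it.
function createParser(parseHtml, fetchHtml) {
  return {
    parseHtml: parseHtml,
    parseUrl: function (url) {
      if (typeof fetchHtml !== 'function') {
        return Promise.reject(new Error('No fetcher configured'));
      }
      return Promise.resolve(fetchHtml(url)).then(parseHtml);
    }
  };
}
```

A caller already using request, node-fetch or fetch-politely would pass its own fetcher, while a slim build would pass none and use parseHtml only.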
HTML entities are indistinguishable from actual tags in the "html" part of e- properties. Additionally, value and p- properties may be corrupted when using textFormat: whitespacetrimmed.
Test case:
<div class="h-entry">
<div class="p-name e-content">x&lt;y AT&amp;T &lt;b&gt;NotBold&lt;/b&gt; <b>Bold</b></div>
</div>
Output (textFormat: 'whitespacetrimmed'):
{
"items": [{
"type": ["h-entry"],
"properties": {
"name": ["xNotBold Bold"],
"content": [{
"value": "xNotBold Bold",
"html": "x<y AT&T <b>NotBold</b> <b>Bold</b>"
}]
}
}],
"rels": {},
"rel-urls": {}
}
Output (textFormat: 'normalised'):
{
"items": [{
"type": ["h-entry"],
"properties": {
"name": ["x<y AT&T <b>NotBold</b> Bold"],
"content": [{
"value": "x<y AT&T <b>NotBold</b> Bold",
"html": "x<y AT&T <b>NotBold</b> <b>Bold</b>"
}]
}
}],
"rels": {},
"rel-urls": {}
}
Expected for both cases:
{
"items": [{
"type": ["h-entry"],
"properties": {
"name": ["x<y AT&T <b>NotBold</b> Bold"],
"content": [{
"value": "x<y AT&T <b>NotBold</b> Bold",
"html": "x&lt;y AT&amp;T &lt;b&gt;NotBold&lt;/b&gt; <b>Bold</b>"
}]
}
}],
"rels": {},
"rel-urls": {}
}
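The distinction this report relies on is that text taken from the DOM has to be re-escaped when it is written back out as HTML, otherwise an entity-encoded "<b>" becomes indistinguishable from a real tag. A minimal sketch of that escaping step follows; escapeHtml is a generic helper, not a function from this library.

```javascript
// Generic helper: escapes the characters that would otherwise be read
// as markup when a text node is serialised into an HTML string.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.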
I recently added mf1 markup to my review posts in order to appear as Google Rich Snippets. This has had the unfortunate side effect of confusing the crap out of mf2 parsers. Right now, the python parser is the only one that gets it right.
Note there are 5 empty objects before the real h-review. Additionally, the p-item property ended up confused. It should be an h-product, but instead is a weird mix of h-item (where did that come from?) and the h-product appears as the url property.
Trying to run the following code:
var uf = require('microformat-node');
uf.parseUrl('http://www.boemiadigital.com/', {}, console.log);
results in an error:
TypeError: Cannot call method 'toString' of undefined
at Object.Parser.impliedRules (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/lib/parser.js:636:70)
at null.<anonymous> (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/lib/parser.js:754:14)
at exports.each (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/node_modules/cheerio/lib/api/traversing.js:125:24)
at Object.Parser.walkChildren (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/lib/parser.js:655:24)
at null.<anonymous> (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/lib/parser.js:722:13)
at exports.each (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/node_modules/cheerio/lib/api/traversing.js:125:24)
at Object.Parser.walkChildren (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/lib/parser.js:655:24)
at null.<anonymous> (/Users/janpotoms/Woorank/code/piccolo/node_modules/mJans-MacBook-Pro:piccolo janpotoms$