glennjones / microformat-node

Microformats parser for node.js

Home Page: http://glennjones.net/tools/microformats/

License: MIT License

JavaScript 96.38% CSS 1.76% Makefile 0.01% HTML 1.85%

microformat-node's Issues

parseUrl example results in an error

microformats.parseUrl('http://glennjones.net/about', options, function(err, data){
  if (err) throw (err);
  console.log(data);
});

results in:

TypeError: Cannot read property '0' of null
    at getName (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:78:36)
    at getLCName (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:84:14)
    at parse (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:160:12)
    at compileUnsafe (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/lib/compile.js:26:9)
    at select (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/index.js:17:43)
    at CSSselect (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/node_modules/CSSselect/index.js:40:9)
    at exports.find (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/lib/api/traversing.js:7:21)
    at new module.exports (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/lib/cheerio.js:83:18)
    at initialize (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/node_modules/cheerio/lib/static.js:19:12)
    at Object.Parser.apppendInclude (/Users/bret/Documents/Git-Clones/iwc-log-feed/node_modules/microformat-node/lib/parser.js:1446:13)

Digging around a bit to see what the cause is. Parsing other URLs seems to work though.

Some URLs cause the library to "hang"

Parsing the URL "https://www.zalando.nl/nike-performance-trainingsbroek-blackcool-grey-n1241e0em-q11.html" causes the library to hang. It does not crash, but rather hangs for a very long time at 100% cpu-core consumption. Memory does not increase significantly.

Code to reproduce the problem:

const fetch = require('node-fetch');
const mf = require('microformat-node');

const fetchUrl = url => fetch(url)
    .then(res  => res.text())
    .then(html => mf.getAsync({html}));

fetchUrl('https://www.zalando.nl/nike-performance-trainingsbroek-blackcool-grey-n1241e0em-q11.html')

microformat-node version: 2.0.1
node version: v8.9.4 (also tested with v6.11.5 with same problem)

Could you please give me some pointers on where to start looking in the code for a possible cause of this problem? Thanks!

parseDom (feature request)

Currently, parseDom operates on jQuery-like objects.

Are you planning to implement a parser for the real DOM API (getElementsByClassName and so on, as implemented by jsdom, for example)?
If not, can I volunteer? (There is no sense in starting if you would reject the pull request.)
If yes, what should the function names be? (Either we rename the old one, or we somehow mangle the new one.)

crash on certain input

The following code crashes with a TypeError: Cannot read property 'replace' of null:

var Microformats = require('microformat-node');
Microformats.get({
  html: '<div class="include item"></div>'
}, function (err, data) {
   console.log(err, data);
});

It also fails on this input:

'<div class="include"></div><div class="item"></div>'

This sample was narrowed down from http://www.advodata.be/

It also happens on www.netflow.ro

href ignored for u-* properties

In the following example the parser uses the textContent for u-url and u-uid. Based on the u-* parsing rules from http://microformats.org/wiki/microformats2-parsing, I was expecting to get the href attribute. Rule 1 reads: if a.u-x[href] or area.u-x[href] or link.u-x[href], then get the href attribute.

var chai = require('chai'),
   assert = chai.assert,
   helper = require('../test/helper.js');


describe('h-entry', function() {
   var htmlFragment = "<li class=\"h-entry hentry\">\r\n  <p class=\"p-name entry-title e-content entry-content\">\r\n    Google the company *already* effectively rebranded, into Alphabet_Inc.\r\n  <\/p>\r\n  <span class=\"footer\"><a href=\"2019\/156\/t2\/\" class=\"dt-published published dt-updated updated u-url u-uid\"><time class=\"value\" datetime=\"11:56-0700\">11:56<\/time> on <time class=\"value\">2019-06-05<\/time><\/a><\/span>\r\n<\/li>";
   var found = helper.parseHTML(htmlFragment,'http://example.com/');
   var expected = "/2019/156/t2/";

   it('u-url', function(){
      var url = found.items[0].properties.url[0];
      assert.equal(url, expected);
   });

   it('u-uid', function(){
      var uid = found.items[0].properties.uid[0];
      assert.equal(uid, expected);
   });
});
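
For reference, rule 1 quoted above can be sketched in isolation. This is a minimal illustration, not the library's code: `uPropertyValue` and the plain-object element shape (`tagName`, `attribs`, `text`) are assumptions made for the example, the fallback is simplified to trimmed text, and URL resolution against the base URL is left out.

```javascript
// Sketch of u-* rule 1 only: prefer the href attribute on <a>, <area>, <link>.
// `el` is a hypothetical plain object, not a cheerio or DOM node.
function uPropertyValue(el) {
  const tagsWithHref = ['a', 'area', 'link'];
  const href = el.attribs && el.attribs.href;
  if (tagsWithHref.includes(el.tagName) && typeof href === 'string') {
    return href; // rule 1: the href attribute wins over the text content
  }
  // Simplified fallback; the spec lists further rules before plain text.
  return (el.text || '').trim();
}

const link = {
  tagName: 'a',
  attribs: { href: '2019/156/t2/', class: 'u-url u-uid' },
  text: '11:56 on 2019-06-05'
};
console.log(uPropertyValue(link)); // prints "2019/156/t2/"
```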

Regression in text parsing of weird XSS:ish content?

I'm not sending a PR for this one because I neither know exactly what's causing it nor think it's a big enough deal to be worth a lot of debugging time, but I'm making a note here anyhow so that it's documented.

In the checkmention project by @kbsriram there's an XSS test that parses differently now compared to version 0.3.x: https://github.com/kbsriram/checkmention/blob/master/src/WEB-INF/checks/xss

More specifically – the 2.0.0 version of this library is now ignoring the final </script> in the <<SCRIPT>alert("XSS4");//<</SCRIPT> code in there and thus makes all of the remaining e-content be treated as content of the script tag and thus drops all of that content from the text. This didn't happen before.

This means that the text output of parsing that XSS file is now:

Clicking this\nshould not cause an alert.\nThis div\nshould not alert.\nTry clicking this link\n<script>alert(\"encoded-xss\")</script>\nand this too.\nMouse over this\nshould not cause an alert. This broken\n should not throw an alert.\n<

When it before was:

Clicking this\nshould not cause an alert.\nThis div\nshould not alert.\nTry clicking this link\n<script>alert(\"encoded-xss\")</script>\nand this too.\nMouse over this\nshould not cause an alert. This broken\n should not throw an alert.\n<alert(\"XSS4\");//\n\nNeither should .\nPlease look at the Owasp XSS prevention cheat sheet for more information.\n\n\nThis note was created on\n\n%%nice_time

When looking at the same content directly in the browser, I would say that the former handling was more correct than the current one, but I'm not sure what has changed. Cheerio is still the same, so it must be something outside of Cheerio's control?

But – as I said in the beginning – this feels like an edge case that's not really worth spending a lot of time on fixing if it isn't an indicator of something bigger which I don't think it is.

Truncated Dates (bday / without years) are not parsed

Using truncated representations of dates for a birth date is often good practice, as noted in the vcard spec: http://microformats.org/wiki/h-card#dt-bday

"--12-28"
That example is parecki's birthday, cited from his public h-card (send him many gifts). I'll look into a fix now. The problem is that in the intermediate step we would have to use the year 9999 and replace it afterwards, which is one reason why I am using a format like the one in my nlp module.

Feel free to look into the following PR ...
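
As an illustration of the truncated form discussed above, here is a small sketch that reads a "--MM-DD" value without inventing a placeholder year. The function name `parseTruncatedDate` and the returned object shape are assumptions made for this example; this is not how the library or the linked PR handles it.

```javascript
// Sketch: parse a year-less "--MM-DD" date into its parts.
// Returns null for anything that is not a plausible truncated date.
function parseTruncatedDate(value) {
  const match = /^--(\d{2})-(\d{2})$/.exec(value);
  if (!match) return null;
  const month = Number(match[1]);
  const day = Number(match[2]);
  // Coarse range check only; it does not validate days per month.
  if (month < 1 || month > 12 || day < 1 || day > 31) return null;
  return { month, day };
}

console.log(parseTruncatedDate('--12-28')); // { month: 12, day: 28 }
```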

The topmost level of h-* does not record more than one type

i.e. class="h-entry h-note" is returned as type: ["h-entry"] but should return type: ["h-entry","h-note"]
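
The expected behaviour can be sketched as collecting every h-* class name from the class attribute. `rootTypes` is a hypothetical helper and the pattern is a simplified reading of the root class name syntax, not the parser's actual code.

```javascript
// Sketch: collect all root microformat types from a class attribute,
// keeping them in source order.
function rootTypes(classAttr) {
  return classAttr
    .split(/\s+/)
    .filter(name => /^h-[a-z0-9]+(-[a-z0-9]+)*$/.test(name));
}

console.log(rootTypes('h-entry h-note')); // [ 'h-entry', 'h-note' ]
```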

Alternative node/JS microformats parser

This project hasn't been maintained in a number of years.

There is an up-to-date library available here: https://github.com/microformats/microformats-parser, that works both with node.js and in the browser. Also supports TypeScript.

Cache starts a setTimeout even when unused

Hi Glenn, awesome module!

The checkLimits function in lib/cache.js is automatically called when requiring microformat-node, even when the Parser doesn't use caching. Personally, this has prevented an app from closing (due to the setTimeout still running) and has also prevented tests (from a module that I'm working on) from exiting.

If possible, it would be great if the cache didn't automatically start until its first use. I'm just knocking up a quick feature branch now that does this and will push it up shortly so that it can be discussed further.
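
A rough sketch of the idea, under stated assumptions: `createLazyCache` and its methods are hypothetical names, the pruning logic is a placeholder, and `setInterval` is used here in place of the repeated setTimeout mentioned above. The timer only starts on first write, and `unref()` keeps it from holding the process open.

```javascript
// Hypothetical stand-in for a cache module; names are illustrative only.
function createLazyCache(checkIntervalMs) {
  const store = new Map();
  let timer = null;

  function checkLimits() {
    // Placeholder for whatever pruning the real cache performs.
  }

  function ensureTimer() {
    if (timer === null) {
      timer = setInterval(checkLimits, checkIntervalMs);
      timer.unref(); // a pending timer no longer keeps the process alive
    }
  }

  return {
    set(key, value) { ensureTimer(); store.set(key, value); },
    get(key) { return store.get(key); },
    isTimerRunning() { return timer !== null; },
    stop() { if (timer !== null) { clearInterval(timer); timer = null; } }
  };
}

const cache = createLazyCache(60000);
console.log(cache.isTimerRunning()); // false
cache.set('http://example.com/', { items: [] });
console.log(cache.isTimerRunning()); // true
cache.stop();
```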

When using two h-* you get duplicate properties

This could either be an issue with the parser or an unforeseen issue in the spec.

CSS selector no longer works

p-name breaks on empty text

The example below is not parsing correctly. I would expect the entry "name" to be the empty string. Adding any non-whitespace text to the e-content causes it to revert to expected behavior.

<!DOCTYPE html>
<html lang="en">
<head>
</head>
<body>
    <div class="h-entry">
        <a href="http://this.site/photo" class="u-url"></a>
        <div class="e-content p-name"><img src="photo.jpg" class="u-photo"/></div>

        Some extraneous text

        <div class="h-cite">
            <a href="http://someother.site/like" class="u-url"></a>
            <a href="http://this.site/photo" class="u-like-of"></a>
            <div class="e-content p-name">liked this</div>
        </div>
    </div>
</body>
</html>

{ items: 
   [ { type: [ 'h-entry' ],
       properties: 
        { url: [ 'http://this.site/photo' ],
          content: [ { value: '', html: '<img src="photo.jpg" class="u-photo" />' } ],
          photo: [ 'photo.jpg' ],
          name: [ 'Some extraneous text\r\n\r\n        \r\n            \r\n            \r\n            liked this' ] },
       children: 
        [ { value: 'liked this',
            type: [ 'h-cite' ],
            properties: 
             { url: [ 'http://someother.site/like' ],
               'like-of': [ 'http://this.site/photo' ],
               content: [ { value: 'liked this', html: 'liked this' } ],
               name: [ 'liked this' ] } } ] } ],
  rels: {},
  'rel-urls': {} }

Parsing an h-entry with a root of <article> results in an empty h-entry

Given the following HTML stored as a variable body in JS (taken from the h-entry example):

<article class="h-entry">
  <h1 class="p-name">Microformats are amazing</h1>
  <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
     on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time>
  <p class="p-summary">In which I extoll the virtues of using microformats.</p>
  <div class="e-content">
    <p>Blah blah blah</p>
  </div>
</article>

and the following parsing call (or substituting the parseUrl that returns body as the previous HTML):

microformats.parseHtml(body, {'filters': ['h-entry']});

The result in items in the resultant data object will be an empty h-entry object like the following:

{
  "items": [{
    "type": ["h-entry"],
    "properties": {}
  }],
  "rels":{}
}

Converting the article element into a div element results in a successful, fully-parsed h-entry object like so:

{
    "items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "author": [
                    {
                        "type": [
                            "h-card"
                        ],
                        "properties": {
                            "name": [
                                "W. Developer"
                            ],
                            "url": [
                                "http:\/\/example.com"
                            ]
                        },
                        "value": "W. Developer"
                    }
                ],
                "name": [
                    "Microformats are amazing"
                ],
                "summary": [
                    "In which I extoll the virtues of using microformats."
                ],
                "published": [
                    "2013-06-13 12:00:00"
                ],
                "content": [
                    {
                        "html": "&#xD;\n    <p>Blah blah blah<\/p>&#xD;\n  ",
                        "value": "Blah blah blah"
                    }
                ]
            }
        }
    ],
    "rels": {}
}

I haven't had time to investigate the problem fully yet, but I believe that the source of the problem might be in the use of cheerio and its support for HTML5 elements. If it isn't, then it's something directly in microformat-node.

Error when parsing 'http://www.limitedtoendodontics.com'

Running the following:

var uf = require('microformat-node');
uf.parseUrl('http://www.limitedtoendodontics.com/', {}, console.log);

I get the following error:

SyntaxError: empty sub-selector
  at parse (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:178:9)
  at compileUnsafe (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/lib/compile.js:26:9)
  at select (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/index.js:17:43)
  at CSSselect (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/index.js:40:9)
  at exports.find (/var/www/piccolo/node_modules/cheerio/lib/api/traversing.js:13:21)
  at Object.Parser.appendInclude (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1462:28)
  at Object.Parser.addAttributeIncludes (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1413:11)
  at Object.Parser.addIncludes (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1388:8)
  at Object.Parser.get (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:278:10)
  at Object.parseDom (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:155:17

Update microformat-shiv dependency reference

I am getting build errors with Travis-CI due to an intermittent 403 response from this api.github URL during npm install, even though it seems to install fine locally:

"microformat-shiv": "https://api.github.com/repos/glennjones/microformat-shiv/tarball/"

Would you be opposed to using the owner/repo convention outlined here?

"microformat-shiv": "glennjones/microformat-shiv"

parseUrl promise not working

I just wanted to try it out and ran the parseUrl-promise example. However, it always failed with "TypeError: Cannot call method 'then' of undefined".

After checking the source code, I saw that "parseUrl" does not return a promise. The return is actually simply commented out (5546247). Is there a reason for that, or did it happen by accident? If not, the example should probably be removed.
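
Until the function returns a promise itself, a callback-style call can be wrapped on the caller's side. The sketch below uses Node's `util.promisify`; `parseUrlStub` is a hypothetical stand-in with the same (url, options, callback) shape as the calls shown in these issues, so nothing here depends on the library.

```javascript
const { promisify } = require('util');

// Hypothetical stand-in for a callback-style parseUrl(url, options, callback).
function parseUrlStub(url, options, callback) {
  setImmediate(() => callback(null, { items: [], rels: {}, url }));
}

// promisify expects the callback as the last argument, called as (err, data).
const parseUrlAsync = promisify(parseUrlStub);

parseUrlAsync('http://example.com/', {})
  .then(data => console.log(data.items.length))
  .catch(err => console.error(err));
```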

Wrong type detected?

Testing this library:

var uf  = require("microformat-node");
uf.parseUrl('http://microformats.org/2014/06', {}, function (err, result) {
  console.log(JSON.stringify(result.items, null, 2));
});

This results in the following:

[
  ...,
  {
    "type": [
      "h-card"
    ],
    "properties": {
      "url": [
        "http://tantek.com/"
      ],
      "name": [
        "Tantek"
      ]
    }
  }
]

while viewing the source shows

<address class="vcard"><a class="url fn" href="http://tantek.com/">Tantek</a></address>

It seems the wrong type is detected?

Dependency "ent" uses deprecated Node punycode module

This library uses the abandoned package "ent", which uses the deprecated Node punycode module.

Temporary solution: https://www.npmjs.com/package/ent-replace

crash on certain markup

The library crashes on the following markup with a TypeError: Cannot set property '0' of undefined:

var Microformats = require('microformat-node');
Microformats.get({
  html: '<div class="hentry"><div class="dt-">0AM<div class="dt-">x</div></div></div>'
}, function (err, data) {
   console.log(err, data);
});

This sample was narrowed down from: http://lbpm.com/

It also happens on www.nesbitts.com, www.browsbyfay.com, decijisajam.rs, phytocarestetica.com
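
The crashing markup uses the class "dt-" with no property name after the prefix. One way to guard against that, sketched here as an assumption rather than the library's actual fix, is to accept only class names that have a non-empty name after the p-, u-, dt- or e- prefix. `propertyNames` is a hypothetical helper.

```javascript
// Sketch: extract property prefixes and names from a class attribute,
// skipping bare prefixes such as "dt-" that carry no property name.
function propertyNames(classAttr) {
  const names = [];
  for (const cls of classAttr.split(/\s+/)) {
    const match = /^(p|u|dt|e)-([a-z0-9]+(?:-[a-z0-9]+)*)$/.exec(cls);
    if (match) names.push({ prefix: match[1], name: match[2] });
  }
  return names;
}

console.log(propertyNames('dt- p-name')); // [ { prefix: 'p', name: 'name' } ]
```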

Error when parsing www.markgordondentistry.com

I get an error when running the following:

var uf = require('microformat-node');
uf.parseUrl('http://www.markgordondentistry.com/', {}, console.log);

TypeError: Cannot read property '0' of null
  at getName (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:76:36)
  at parse (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/node_modules/CSSwhat/index.js:129:13)
  at compileUnsafe (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/lib/compile.js:26:9)
  at select (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/index.js:17:43)
  at CSSselect (/var/www/piccolo/node_modules/cheerio/node_modules/CSSselect/index.js:40:9)
  at exports.find (/var/www/piccolo/node_modules/cheerio/lib/api/traversing.js:13:21)
  at Object.Parser.appendInclude (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1462:28)
  at Object.Parser.addAttributeIncludes (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1413:11)
  at Object.Parser.addIncludes (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:1388:8)
  at Object.Parser.get (/var/www/piccolo/node_modules/microformat-node/lib/parser.js:278:10)

It seems this page has an itemref='Mark J. Gordon, DDS' in it somewhere, which makes the engine choke.
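
A possible guard, sketched under the assumption (inferred only from the stack trace) that the itemref tokens end up in "#id" CSS selectors: split the attribute on whitespace and keep only tokens that form a simple identifier. `safeIdSelectors` is a hypothetical helper, and the identifier pattern is deliberately stricter than what HTML allows for ids.

```javascript
// Sketch: turn an itemref value into "#id" selectors, dropping tokens
// such as "J." or "Gordon," that would not form a valid simple selector.
function safeIdSelectors(itemref) {
  const simpleId = /^[A-Za-z_][A-Za-z0-9_-]*$/;
  return itemref
    .split(/\s+/)
    .filter(token => simpleId.test(token))
    .map(token => '#' + token);
}

console.log(safeIdSelectors('Mark J. Gordon, DDS')); // [ '#Mark', '#DDS' ]
```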

Can't find '../test/testWriter.js' from bin/microformat-node

The likely cause is that test/testWriter.js was removed in: c44e1da

This means that bin/microformat-node now refuses to start.

I'm not sure what role it played, though, so unfortunately I can't provide an easy PR to fix it.

Remove or separate out parseUrl() to slim down library?

I was playing around a bit, trying to see if I could get this library working with React Native, and got Cheerio and HTMLParser2 working, but for some reason I couldn't get this module working, probably because of the Request module.

When I use microformat-node I often do it in a context where I already have URL fetching available. In e.g. https://github.com/voxpelli/webpage-webmentions I use request myself, but in some newer projects I've started experimenting with other libraries like node-fetch.

I also often do the fetching through something like fetch-politely to ensure proper rate-limiting and robots.txt compliance.

All in all, I rarely use the fetching that this library provides; I just use it for parsing the HTML, either directly or by sending it a cheerio object. So for my use cases, having this module include the entire request library is often a bit redundant, and since one can't exclude a dependency, it makes it hard to e.g. do a slim build for something like React Native, where size and complexity might matter more than they do on server-side node.js.

So – my proposal would be to either just remove the URL-fetching altogether or to split it up so there's a module that's focused on just the parsing and another that also provides helpers like parseUrl().

What are your thoughts?

HTML entity handling

HTML entities are indistinguishable from actual tags in the "html" part of e- properties. Additionally, value and p- properties may be corrupted when using textFormat: whitespacetrimmed.

Test case:

<div class="h-entry">
    <div class="p-name e-content">x&lt;y AT&amp;T &lt;b&gt;NotBold&lt;/b&gt; <b>Bold</b></div>
</div>

Output (textFormat: 'whitespacetrimmed'):

{
    "items": [{
        "type": ["h-entry"],
        "properties": {
            "name": ["xNotBold Bold"],
            "content": [{
                "value": "xNotBold Bold",
                "html": "x<y AT&T <b>NotBold</b> <b>Bold</b>"
            }]
        }
    }],
    "rels": {},
    "rel-urls": {}
}

Output (textFormat: 'normalised'):

{
    "items": [{
        "type": ["h-entry"],
        "properties": {
            "name": ["x<y AT&T <b>NotBold</b> Bold"],
            "content": [{
                "value": "x<y AT&T <b>NotBold</b> Bold",
                "html": "x<y AT&T <b>NotBold</b> <b>Bold</b>"
            }]
        }
    }],
    "rels": {},
    "rel-urls": {}
}

Expected for both cases:

{
    "items": [{
        "type": ["h-entry"],
        "properties": {
            "name": ["x<y AT&T <b>NotBold</b> Bold"],
            "content": [{
                "value": "x<y AT&T <b>NotBold</b> Bold",
                "html": "x&lt;y AT&amp;T &lt;b&gt;NotBold&lt;/b&gt; <b>Bold</b>"
            }]
        }
    }],
    "rels": {},
    "rel-urls": {}
}
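
The expected "html" value above keeps text-level entities escaped while real tags stay as markup. A minimal sketch of the escaping step for a single text node follows; `escapeHtmlText` is a hypothetical helper, and this says nothing about where in the library such a step would belong.

```javascript
// Sketch: re-escape the characters that would otherwise turn decoded text
// back into markup. The ampersand must be replaced first.
function escapeHtmlText(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

console.log(escapeHtmlText('x<y AT&T <b>NotBold</b> '));
// prints "x&lt;y AT&amp;T &lt;b&gt;NotBold&lt;/b&gt; "
```

Serialising the text node this way and appending the untouched `<b>Bold</b>` element yields the expected string from the test case.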

Backcompat parsing conflicts

I recently added mf1 markup to my review posts in order to appear as Google Rich Snippets. This has had the unfortunate side effect of confusing the crap out of mf2 parsers. Right now, the python parser is the only one that gets it right.

Original post: https://aaronparecki.com/2016/12/15/16/dropvox
node parsed
python parsed

Note there are 5 empty objects before the real h-review. Additionally the p-item property ended up confused. It should be an h-product, but instead is a weird mix of h-item (where did that come from) and the h-product appears as the url property.

Error when parsing http://www.boemiadigital.com/

Trying to run the following code:

var uf = require('microformat-node');
uf.parseUrl('http://www.boemiadigital.com/', {}, console.log);

results in an error:

TypeError: Cannot call method 'toString' of undefined
    at Object.Parser.impliedRules (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/lib/parser.js:636:70)
    at null.<anonymous> (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/lib/parser.js:754:14)
    at exports.each (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/node_modules/cheerio/lib/api/traversing.js:125:24)
    at Object.Parser.walkChildren (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/lib/parser.js:655:24)
    at null.<anonymous> (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/lib/parser.js:722:13)
    at exports.each (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/node_modules/cheerio/lib/api/traversing.js:125:24)
    at Object.Parser.walkChildren (/Users/janpotoms/Woorank/code/piccolo/node_modules/microformat-node/lib/parser.js:655:24)
    at null.<anonymous> (/Users/janpotoms/Woorank/code/piccolo/node_modules/mJans-MacBook-Pro:piccolo janpotoms$
