hickory's Introduction

Hickory

Hickory parses HTML into Clojure data structures, so you can analyze, transform, and output back to HTML. HTML can be parsed into hiccup vectors, or into a map-based DOM-like format very similar to that used by clojure.xml. It can be used from both Clojure and Clojurescript.

Usage

Parsing

To start, you will want to process your HTML into a parsed representation. Once the HTML is in this form, it can be converted to either Hiccup or Hickory format for further processing. There are two parsing functions, parse and parse-fragment. Both take a string containing HTML and return the parser objects representing the document. (It happens that these parser objects are Jsoup Documents and Nodes, but I do not consider this to be an aspect worth preserving if a change in parser should become necessary).

The first function, parse, expects an entire HTML document and parses it using an HTML5 parser (Jsoup on Clojure and the browser's DOM parser in Clojurescript), which will fix up the HTML as much as it can into a well-formed document. The second function, parse-fragment, expects some smaller fragment of HTML that does not make up a full document, and thus returns a list of parsed fragments, each of which must be processed individually into Hiccup or Hickory format. For example, if parse-fragment is given "<p><br>" as input, the two elements have no common parent, so it must simply give you the list of nodes that it parsed.

These parsed objects can be turned into either Hiccup vector trees or Hickory DOM maps using the functions as-hiccup or as-hickory.

Here's a usage example.

user=> (use 'hickory.core)
nil
user=> (def parsed-doc (parse "<a href=\"foo\">foo</a>"))
#'user/parsed-doc
user=> (as-hiccup parsed-doc)
([:html {} [:head {}] [:body {} [:a {:href "foo"} "foo"]]])
user=> (as-hickory parsed-doc)
{:type :document, :content [{:type :element, :attrs nil, :tag :html, :content [{:type :element, :attrs nil, :tag :head, :content nil} {:type :element, :attrs nil, :tag :body, :content [{:type :element, :attrs {:href "foo"}, :tag :a, :content ["foo"]}]}]}]}
user=> (def parsed-frag (parse-fragment "<a href=\"foo\">foo</a> <a href=\"bar\">bar</a>"))
#'user/parsed-frag
user=> (as-hiccup parsed-frag)
IllegalArgumentException No implementation of method: :as-hiccup of protocol: #'hickory.core/HiccupRepresentable found for class: clojure.lang.PersistentVector  clojure.core/-cache-protocol-fn (core_deftype.clj:495)

user=> (map as-hiccup parsed-frag)
([:a {:href "foo"} "foo"] " " [:a {:href "bar"} "bar"])
user=> (map as-hickory parsed-frag)
({:type :element, :attrs {:href "foo"}, :tag :a, :content ["foo"]} " " {:type :element, :attrs {:href "bar"}, :tag :a, :content ["bar"]})

In the example above, you can see an HTML document that is parsed once and then converted to both Hiccup and Hickory formats. Similarly, a fragment is parsed, but it cannot be passed directly to as-hiccup (or as-hickory); those functions must be called on each element in the list instead.
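
As a hedged sketch of handling fragments, the helper below converts each parsed fragment node to Hickory format and renders the pieces back into one string. The name fragment->html is hypothetical and not part of Hickory; the sketch assumes JVM Clojure with Hickory on the classpath.

```clojure
(require '[hickory.core :as h]
         '[hickory.render :refer [hickory-to-html]])

;; Hypothetical helper, not part of Hickory: parse a fragment, convert each
;; top-level node individually, and join the rendered HTML back together.
(defn fragment->html
  [html]
  (->> (h/parse-fragment html)
       (map h/as-hickory)
       (map hickory-to-html)
       (apply str)))

(fragment->html "<a href=\"foo\">foo</a> <a href=\"bar\">bar</a>")
```

Mapping over the fragment list keeps text nodes (such as the single space between the two links) as plain strings, which hickory-to-html also accepts.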

The namespace hickory.zip provides zippers for both Hiccup and Hickory formatted data, with the functions hiccup-zip and hickory-zip. Using zippers, you can easily traverse the trees in any order you desire, make edits, and get the resulting tree back. Here is an example of that.

user=> (use 'hickory.zip)
nil
user=> (require '[clojure.zip :as zip])
nil
user=> (require '[hickory.render :refer [hickory-to-html]])
nil
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>"))) zip/node)
([:html {} [:head {}] [:body {} [:a {:href "foo"} "bar" [:br {}]]]])
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>"))) zip/next zip/node)
[:html {} [:head {}] [:body {} [:a {:href "foo"} "bar" [:br {}]]]]
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>"))) zip/next zip/next zip/node)
[:head {}]
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>")))
           zip/next zip/next
           (zip/replace [:head {:id "a"}])
           zip/node)
[:head {:id "a"}]
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>")))
           zip/next zip/next
           (zip/replace [:head {:id "a"}])
           zip/root)
([:html {} [:head {:id "a"}] [:body {} [:a {:href "foo"} "bar" [:br {}]]]])
user=> (-> (hickory-zip (as-hickory (parse "<a href=foo>bar<br></a>")))
           zip/next zip/next
           (zip/replace {:type :element :tag :head :attrs {:id "a"} :content nil})
           zip/root)
{:type :document, :content [{:type :element, :attrs nil, :tag :html, :content [{:content nil, :type :element, :attrs {:id "a"}, :tag :head} {:type :element, :attrs nil, :tag :body, :content [{:type :element, :attrs {:href "foo"}, :tag :a, :content ["bar" {:type :element, :attrs nil, :tag :br, :content nil}]}]}]}]}
user=> (hickory-to-html *1)
"<html><head id=\"a\"></head><body><a href=\"foo\">bar<br></a></body></html>"

In this example, we can see a basic document being parsed into Hiccup form. Then, using zippers, the HEAD element is navigated to, and then replaced with one that has an id of "a". The final tree, including the modification, is also shown using zip/root. Then the same modification is made using Hickory forms and zippers. Finally, the modified Hickory version is printed back to HTML using the hickory-to-html function.

Selectors

Hickory also comes with a set of CSS-style selectors that operate on hickory-format data in the hickory.select namespace. These selectors do not exactly mirror the selectors in CSS, and are often more powerful. There is no version of these selectors for hiccup-format data, at this point.

A selector is simply a function that takes a zipper loc from a hickory HTML tree data structure as its only argument. The selector will return its argument if the selector applies to it, and nil otherwise. Writing useful selectors can often be involved, so most of the hickory.select namespace is actually made up of selector combinators: functions that return useful selector functions by specializing them to the data given as arguments, or by combining together multiple selectors. For example, if we wanted to figure out the dates of the next Formula 1 race weekend, we could do something like this:

user=> (use 'hickory.core)
nil
user=> (require '[hickory.select :as s])
nil
user=> (require '[clj-http.client :as client])
nil
user=> (require '[clojure.string :as string])
nil
user=> (def site-htree (-> (client/get "http://formula1.com/default.html") :body parse as-hickory))
#'user/site-htree
user=> (-> (s/select (s/child (s/class "subCalender") ; sic
                              (s/tag :div)
                              (s/id :raceDates)
                              s/first-child
                              (s/tag :b))
                     site-htree)
           first :content first string/trim)
"10, 11, 12 May 2013"

In this example, we get the contents of the homepage and use select to give us any nodes that satisfy the criteria laid out by the selectors. The selector in this example is overly precise in order to illustrate more selectors than we need; we could have gotten by just selecting the contents of the P and then B tags inside the element with id "raceDates".
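
Because a selector is just a function from a zipper loc to that loc or nil, a hand-written one can be mixed with the built-in combinators. The following is a minimal sketch under that assumption; no-attrs and sample-tree are hypothetical names, the HTML is made up for illustration, and JVM Clojure with Hickory on the classpath is assumed.

```clojure
(require '[clojure.zip :as zip]
         '[hickory.core :as h]
         '[hickory.select :as s])

;; Hypothetical custom selector: returns the loc when it points at an element
;; node with no attributes, and nil otherwise.
(defn no-attrs
  [hzip-loc]
  (let [node (zip/node hzip-loc)]
    (when (and (= :element (:type node))
               (empty? (:attrs node)))
      hzip-loc)))

(def sample-tree
  (-> "<p id=\"x\">a</p><p>b</p>" h/parse h/as-hickory))

;; Combine the custom selector with a built-in one: P elements without attributes.
(s/select (s/and (s/tag :p) no-attrs) sample-tree)
```

Only the second P element should be selected, since the first carries an id attribute.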

Using the selectors allows you to search large HTML documents for nodes of interest with a relatively small amount of code. There are many selectors available in the hickory.select namespace, including:

  • node-type: Give this function a keyword or string that names the contents of the :type field in a hickory node, and it gives you a selector that will select nodes of that type. Example: (node-type :comment)
  • tag: Give this function a keyword or string that names the contents of the :tag field in a hickory node, and it gives you a selector that will select nodes with that tag. Example: (tag :div)
  • attr: Give this function a keyword or string that names an attribute in the :attrs map of a hickory node, and it gives you a selector that will select nodes whose :attrs map contains that key. Give a single-argument function as an additional argument, and the resulting selector function will additionally require the value of that key to be such that the function given as the last argument returns true. Example: (attr :id #(.startsWith % "foo"))
  • id: Give this function a keyword or string that names the :id attribute in the :attrs map and it will return a selector function that selects nodes that have that id (this comparison is case-insensitive). Example: (id :raceDates)
  • class: Give this function a keyword or string that names a class that the node should have in the :class attribute in the :attrs map, and it will return a function that selects nodes that have the given class somewhere in their class string. Example: (class :foo)
  • any: This selector takes no arguments, do not invoke it; returns any node that is an element, similarly to CSS's '*' selector.
  • element: This selector is equivalent to the any selector; this alternate name can make it clearer when the intention is to exclude non-element nodes from consideration.
  • root: This selector takes no arguments and should not be invoked; simply returns the root node (the HTML element).
  • n-moves-until: This selector returns a selector function that selects its argument if that argument is some distance from a boundary. The first two arguments, n and c define the counting: it only selects nodes whose distance can be written in the form nk+c for some natural number k. The distance and boundary are defined by the number of times the zipper-movement function in the third argument is applied before the boundary function in the last argument is true. See doc string for details.
  • nth-of-type: This selector returns a selector function that selects its argument if that argument is the (nk+c)'th child of the given tag type of some parent node for some natural k. Optionally, instead of the n and c arguments, the keywords :odd and :even can be given.
  • nth-last-of-type: Just like nth-of-type but counts backwards from the last sibling.
  • nth-child: This selector returns a selector function that selects its argument if that argument is the (nk+c)'th child of its parent node for some natural k. Instead of the n and c arguments, the keywords :odd and :even can be given.
  • nth-last-child: Just like nth-child but counts backwards from the last sibling.
  • first-child: Takes no arguments, do not invoke it; equivalent to (nth-child 1).
  • last-child: Takes no arguments, do not invoke it; equivalent to (nth-last-child 1).
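
A few of the selectors listed above can be tried on a small inline document. This is only an illustrative sketch: the HTML, nav-tree, and the first-texts helper are made up for the example, and it assumes JVM Clojure with Hickory on the classpath.

```clojure
(require '[clojure.string :as string]
         '[hickory.core :as h]
         '[hickory.select :as s])

(def nav-tree
  (-> "<ul id=\"nav\"><li class=\"item\">one</li><li class=\"item active\">two</li><li>three</li></ul>"
      h/parse
      h/as-hickory))

;; Hypothetical helper: first child (here, the text) of every selected node.
(defn first-texts
  [selector]
  (mapv (comp first :content) (s/select selector nav-tree)))

(first-texts (s/class "item"))                                  ; nodes with class "item"
(first-texts (s/attr :class #(string/ends-with? % "active")))   ; predicate on an attribute value
(first-texts (s/and (s/tag :li) (s/nth-child 2)))               ; second LI of its parent
(first-texts (s/and (s/tag :li) s/last-child))                  ; last-child is used without invoking it
(mapv :tag (s/select (s/id :nav) nav-tree))                     ; lookup by id
```

Note that first-child and last-child are passed as values, while tag, class, attr, id and nth-child are called to produce a selector.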

There are also selector combinators, which take as argument some number of other selectors, and return a new selector that combines them into one larger selector. An example of this is the child selector in the example above. Here's a list of some selector combinators in the package (see the API Documentation for the full list):

  • and: Takes any number of selectors, and returns a selector that only selects nodes for which all of the argument selectors are true.
  • or: Takes any number of selectors, and returns a selector that only selects nodes for which at least one of the argument selectors is true.
  • not: Takes a single selector as argument and returns a selector that only selects nodes that its argument selector does not.
  • el-not: Takes a single selector as argument and returns a selector that only selects element nodes that its argument selector does not.
  • child: Takes any number of selectors as arguments and returns a selector that returns true when the zipper location given as the argument is at the end of a chain of direct child relationships specified by the selectors given as arguments.
  • descendant: Takes any number of selectors as arguments and returns a selector that returns true when the zipper location given as the argument is at the end of a chain of descendant relationships specified by the selectors given as arguments.

We can illustrate the selector combinators by continuing the Formula 1 example above. We suspect, to our dismay, that Sebastian Vettel is leading the championship for the fourth year in a row.

user=> (-> (s/select (s/descendant (s/class "subModule")
                                   (s/class "standings")
                                   (s/and (s/tag :tr)
                                          s/first-child)
                                   (s/and (s/tag :td)
                                          (s/nth-child 2))
                                   (s/tag :a))
                     site-htree)
           first :content first string/trim)
"Sebastian Vettel"

Our fears are confirmed: Sebastian Vettel is well on his way to a fourth consecutive championship. If you were to inspect the page by hand (as of around May 2013, at least), you would see that unlike the child selector we used in the example above, the descendant selector allows the argument selectors to skip stages in the tree; we've left out some elements in this descendant relationship. The first table row in the driver standings table is selected with the and, tag and first-child selectors, and then the second td element is chosen, which is the element that has the driver's name inside an A element (the first td element has the driver's standing). All of this is dependent on the exact layout of the HTML in the site we are examining, of course, but it should give an idea of how you can combine selectors to reach into a specific node of an HTML document very easily.
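
Since the Formula 1 page from 2013 cannot be reproduced, the difference between child and descendant can be sketched on a made-up document instead. The HTML, layout-tree, and the p-texts helper are assumptions for illustration; JVM Clojure with Hickory on the classpath is assumed.

```clojure
(require '[hickory.core :as h]
         '[hickory.select :as s])

(def layout-tree
  (-> "<div class=\"a\"><section><p>deep</p></section><p>shallow</p></div>"
      h/parse
      h/as-hickory))

;; Hypothetical helper: text of every node matched by the selector.
(defn p-texts
  [selector]
  (mapv (comp first :content) (s/select selector layout-tree)))

;; child: every step must be a direct parent/child relationship.
(p-texts (s/child (s/class "a") (s/tag :p)))

;; descendant: intermediate elements (here, SECTION) may be skipped.
(p-texts (s/descendant (s/class "a") (s/tag :p)))

;; child again, this time spelling out the intermediate SECTION step.
(p-texts (s/child (s/class "a") (s/tag :section) (s/tag :p)))
```

The first call should find only the P directly under the DIV, the second both P elements in document order, and the third only the nested one.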

Finally, it's worth noting that the select function itself returns the hickory nodes it finds. This is most useful for analyzing the contents of nodes. However, sometimes you may wish to examine the area around a node once you've found it. For this, you can use the select-locs function, which returns a sequence of hickory zipper locs instead of the nodes themselves. This will allow you to navigate around the document tree using the zipper functions in clojure.zip.

If you wish to go further and actually modify the document tree using zipper functions, you should not use select-locs. The problem is that it returns a collection of zipper locs, but once you modify one, the others are out of date and do not see the changes (just as with any other persistent data structure in Clojure), so holding on to them is useless and possibly confusing. Instead, you should use the select-next-loc function to walk through the document tree manually, moving through the locs that satisfy the selector function one by one, which will allow you to make modifications as you go. As with modifying any data structure as you traverse it, you must still be careful that your code does not add the thing it is selecting for, or it could get caught in an infinite loop.

For more specialized selection needs, it should be possible to write custom selection functions that use the selectors and zipper functions without too much work. The functions discussed in this section are very short and simple, so you can use them as a guide.
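
The two approaches can be sketched as follows. The HTML, link-tree, and the mark-links function are hypothetical names chosen for this example, and the rel="nofollow" edit is just an arbitrary modification; JVM Clojure with Hickory on the classpath is assumed.

```clojure
(require '[clojure.zip :as zip]
         '[hickory.core :as h]
         '[hickory.select :as s]
         '[hickory.zip :as hzip])

(def link-tree
  (-> "<a href=\"x\">one</a><p>mid</p><a href=\"y\">two</a>"
      h/parse
      h/as-hickory))

;; Read-only inspection: select-locs lets us look around each match,
;; for example at the tag of its parent.
(mapv (comp :tag zip/node zip/up)
      (s/select-locs (s/tag :a) link-tree))

;; Editing: walk with select-next-loc, edit the match, then step past it with
;; zip/next so the same (now edited) node is not selected again.
(defn mark-links
  [tree]
  (loop [loc (hzip/hickory-zip tree)]
    (if-let [found (s/select-next-loc (s/tag :a) loc)]
      (recur (zip/next (zip/edit found assoc-in [:attrs :rel] "nofollow")))
      (zip/root loc))))

(s/select (s/tag :a) (mark-links link-tree))
```

The loop keeps the most recent loc around so that zip/root can still rebuild the edited tree once select-next-loc returns nil.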

The doc strings for the functions in the hickory.select namespace provide more details on most of these functions.

For more details, see the API Documentation.

Hickory format

Why two formats? As the examples above show, Hiccup is very convenient to use for writing HTML. It has a compact syntax, with CSS-like shortcuts for specifying classes and ids. It also allows parts of the vector to be skipped if they are not important.

It's a little bit harder to process data in Hiccup format. First of all, each form has to be checked for the presence of the attribute map, and the traversal adjusted accordingly. Raw Hiccup vectors might also have information about class and id in one of two different places. Finally, not every piece of an HTML document can be expressed in Hiccup without resorting to writing HTML in strings. For example, if you want to put a doctype or comment on your document, it has to be done as a string in your Hiccup form containing "<!DOCTYPE html>" or "<!--stuff-->".
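
To make the extra bookkeeping concrete, here is a rough sketch in plain Clojure of what code that consumes raw Hiccup has to do before it can treat every element uniformly. normalize-element is a hypothetical helper, not part of Hickory or Hiccup, and its tag-shortcut parsing is simplified.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper: turn a raw Hiccup element into [tag attrs children],
;; handling an optional attribute map and the :tag#id.class shortcuts.
(defn normalize-element
  [[tag & more]]
  (let [[_ tag-name id classes] (re-matches #"([^#.]+)(?:#([^.]+))?(?:\.(.+))?"
                                            (name tag))
        has-attrs?     (map? (first more))
        attrs          (if has-attrs? (first more) {})
        children       (if has-attrs? (rest more) more)
        shortcut-class (when classes (string/replace classes "." " "))
        merged-class   (string/join " " (remove nil? [shortcut-class (:class attrs)]))]
    [(keyword tag-name)
     (cond-> attrs
       id                 (assoc :id id)
       (seq merged-class) (assoc :class merged-class))
     (vec children)]))

(normalize-element [:div#main.a.b "hi"])
(normalize-element [:p {:class "x"} "a" "b"])
(normalize-element [:br])
```

In Hickory format none of this is needed, because :tag, :attrs and :content always live under the same keys.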

The Hickory format is another data format intended to allow a roundtrip from HTML as text, into a data structure that is easy to process and modify, and back into equivalent (but not identical, in general) HTML. Because it can express all parts of an HTML document in a parsed form, it is easier to search and modify the structure of the document.

A Hickory node is either a map or a string. If it is a map, it will have some subset of the following four keys, depending on the :type:

  • :type - This will be one of :comment, :document, :document-type, :element
  • :tag - A node's tag (for example, :img). This will only be present for nodes of type :element.
  • :attrs - A node's attributes, as a map of keywords to values (for example, {:href "/a"}). This will only be present for nodes of type :element.
  • :content - A node's child nodes, in a vector. Only :comment, :document, and :element nodes have children.

Text and CDATA nodes are represented as strings.

This is almost the exact same structure used by clojure.xml, the only difference being the addition of the :type field. Having this field allows us to process nodes that clojure.xml leaves out of the parsed data, like doctype and comments.
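
Because Hickory nodes are plain maps and strings, they can be processed with ordinary Clojure functions. The sketch below uses a hand-written Hickory-format value (doc is made up for the example) and two hypothetical helpers, node-seq and text-of, built on tree-seq from clojure.core.

```clojure
;; A small hand-written document in Hickory format.
(def doc
  {:type :document
   :content
   [{:type :element :tag :html :attrs nil
     :content
     [{:type :element :tag :body :attrs nil
       :content
       [{:type :comment :content [" note "]}
        {:type :element :tag :p :attrs {:class "x"}
         :content ["Hello, "
                   {:type :element :tag :b :attrs nil :content ["world"]}]}]}]}]})

;; Hypothetical helper: every node (maps and strings) in depth-first order.
(defn node-seq
  [node]
  (tree-seq map? :content node))

;; Hypothetical helper: concatenated text, skipping the contents of comments.
(defn text-of
  [node]
  (->> (tree-seq #(and (map? %) (not= :comment (:type %))) :content node)
       (filter string?)
       (apply str)))

(frequencies (keep :type (node-seq doc)))
(text-of doc)
```

The :type key is what lets text-of tell a comment apart from an element without inspecting its children.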

Obtaining

To get hickory, add

[org.clj-commons/hickory "0.7.3"]

to your project.clj, or an equivalent entry for your Maven-compatible build tool.

ClojureScript support

Hickory works for all web browsers IE9+ (you can find a workaround for IE9 here).

Nodejs support

To parse markup on Nodejs, Hickory requires a Node DOM implementation. Several are available from npm. Install the npm package or use lein-npm. Here are some alternatives:

  • jsdom - Caution: this will not work if you're using figwheel

    (set! js/document (.jsdom (cljs.nodejs/require "jsdom")))
  • xmldom

    (set! js/DOMParser (.-DOMParser (cljs.nodejs/require "xmldom")))

Changes

  • Version 0.7.1. Thanks to Matt Grimm for adding the up-pred zipper function.

  • Version 0.7.0. Thanks to Ricardo J. Méndez for the following updates.

    • Removed dependency on cljx, since it was deprecated in June 2015.
    • Converted all files and conditionals to cljc.
    • Moved tests to cljs.test with doo, since cemerick.test was deprecated over a year ago.
    • Updated Clojure and ClojureScript dependencies to avoid conflicts.
    • Updated JSoup to 1.9.2, which should bring improved parsing performance.
  • Released version 0.6.0.

    • Updated JSoup to version 1.8.3. This version of JSoup contains bug fixes, but slightly changes the way it handles HTML: some parses and output might have different case than before. HTML is still case-insensitive, of course, but Hickory minor version has been increased just in case. API and semantics are otherwise unchanged.
  • Released version 0.5.4.

    • Fixed project dependencies so ClojureScript is moved to a dev-dependency.
  • Released version 0.5.3.

    • Minor bug fix to accommodate ClojureScript's new type hinting support.
  • Released version 0.5.2.

    • Updates the Clojurescript version to use the latest version of Clojurescript (0.0-1934).
  • Released version 0.5.1.

    • Added has-child and has-descendant selectors. Be careful with has-descendant, as it must do a full subtree search on each node, which is not fast.
  • Released version 0.5.0.

    • Now works in Clojurescript as well, huge thanks to Julien Eluard for doing the heavy lifting on this.
    • Reorganized parts of the API into more granular namespaces for better organization.
    • Added functions to convert between Hiccup and Hickory format; note that this conversion is not always exact or roundtripable, and can cause a full HTML reparse.
    • Added new selector, element-child, which selects element nodes that are the child of another element node.
    • Numerous bug fixes and improvements.
  • Released version 0.4.1, which adds a number of new selectors and selector combinators, including find-in-text, precede-adjacent, follow-adjacent, precede and follow.

  • Released version 0.4.0. Adds the hickory.select namespace with many helpful functions for searching through hickory-format HTML documents for specific nodes.

  • Released version 0.3.0. Provides a more helpful error message when hickory-to-html has an error. Now requires Clojure 1.4.

  • Released version 0.2.3. Fixes a bug where hickory-to-html was not html-escaping the values of tag attributes.

  • Released version 0.2.2. Fixes a bug where hickory-to-html was improperly html-escaping the contents of script/style tags.

  • Released version 0.2.1. This version fixes bugs:

    • hickory-to-html now properly escapes text nodes
    • text nodes will now preserve whitespace correctly
  • Released version 0.2.0. This version adds a second parsed data format, explained above. To support this, the API for parse and parse-fragment has been changed to allow their return values to be passed to functions as-hiccup or as-hickory to determine the final format. Also added are zippers for both Hiccup and Hickory formats.

License

Copyright © 2012 David Santiago

Distributed under the Eclipse Public License, the same as Clojure.

hickory's People

Contributors

blx, borkdude, danielcompton, davidsantiago, ieure, jeluard, joseph-alley, masztal, njordhov, port19x, raynes, reefersleep, ricardojmendez, seancorfield, slipset, sw1nn, tkocmathla, viebel, viesti, vitobasso

hickory's Issues

Question: Regarding Selector expectations

This may be just another learning curve, but:

I have a hickory format parsed "www.yahoo.com" page (_r0). I am attempting to retrieve the "description" meta information in the head section of the page.

The following works

(-> (s/select (s/child (s/tag :head) (s/attr :name #(= % "description"))) _r0))

=> [{:type :element, :attrs {:name "description", :content "A new welcome to Yahoo. The new Yahoo experience makes it easier to discover the news and information that you care about most. It's the web ordered for you."}, :tag :meta, :content nil}]

However, this does not work, even though it seems intuitive. Disclaimer: I may just be using the wrong selector types, but:

(-> (s/select (s/child (s/tag :head) (s/tag :meta) (s/attr :name #(= % "description"))) _r0))
=> []

Thanks
Frank

Unary tag parsing

A unary span tag is being interpreted as extending to the end of the surrounding p.

(defn round-trip [text]
  (->
    text
    hickory.core/parse-fragment
    first
    hickory.core/as-hickory
    hickory.render/hickory-to-html))
(round-trip "<p>this<span/> is a test.</p>")
"<p>this<span> is a test.</span></p>"

It will also extend outside of a surrounding span element.

(round-trip "<p><span>this<span /> is a </span>test.</p>")
"<p><span>this<span> is a </span>test.</span></p>"

Is there a way to avoid this behavior? I need to clean up html which may include empty elements.

Parsing HTML unescapes all escaped entities

me.raynes.laser> (hickory/as-hickory (first (hickory/parse-fragment "<p>&amp;</p>")))
{:type :element, :attrs nil, :tag :p, :content ["&"]}

This shouldn't happen, right?

`find-in-text` selector does not seem to work when it's the last argument of a `child` selector

I don't know if it's a bug or if I missed something, but find-in-text selector does not seem to work when it's the last argument of a child selector:

(let [tree (-> "<div><div><span>a</span></div></div>" h/parse h/as-hickory)]
      (hs/select (hs/child 
                  (hs/tag :div)
                  (hs/find-in-text #"a"))
               tree)) ; => []

But it works when adding a dummy and:

(let [tree (-> "<div><div><span class=\"go\">a</span></div></div>" h/parse h/as-hickory)]
      (hs/select (hs/child 
                  (hs/tag :div)
                  (hs/and hs/any (hs/find-in-text #"a")))
               tree)) ; => works

Invalid output by as-hiccup

(let [data "<root><link>abc</link><link>def</link></root>"]
	(h/as-hiccup (h/parse data)))

gives me

([:html {} [:head {}] [:body {} [:root {} [:link {}] "abc" [:link {}] "def"]]])

when I would expect

([:html {} [:head {}] [:body {} [:root {} [:link {} "abc"] [:link {} "def"]]]])

Utility function to remove whitespace from hiccup forms

Hi!

This seems like a fairly reasonable utility feature that I've needed in more than one place now.

(def ^:private whitespace?
  "Is this a string, and does it consist of only whitespace?"
  (every-pred string? (partial re-matches #"\s*")))

(defn ^:private remove-whitespace
  "Walk a given Hiccup form and remove all pure whitespace."
  [row]
  (walk/prewalk
   (fn [form]
     (if (vector? form)
       (into [] (remove whitespace? form))
       form))
   row))

(walk is [clojure.walk :as walk].)

If you'd like it in Hiccup, I'm more than happy to add it (with unit tests, of course.)

Parsing throws on long markdown files

I'll be updating this issue as I tease it out, but I think something in the following chain of functions fails when the source markdown is too long:

  1. md-to-html-string
  2. parse
  3. as-hiccup

Trial code

(ns foo.data
  (:require [clojure.java.io :as io]
            [hickory.core :refer [as-hiccup parse]]
            [markdown.core :refer [md-to-html-string]]))

(defmacro raw-foo-html []
  (md-to-html-string (slurp (io/resource "foo.md"))))

(defmacro foo-data []
  (vec (as-hiccup (parse (raw-foo-html)))))

(defmacro foo-body []
  (rest (rest (nth (first (foo-data)) 3))))

Log

lein do clean, figwheel
Figwheel: Cutting some fruit, just a sec ...
Figwheel: Validating the configuration found in project.clj
Figwheel: Configuration Valid :)
Figwheel: Starting server at http://0.0.0.0:3449
Figwheel: Watching build - app
Figwheel: Cleaning build - app
Compiling "target/cljsbuild/public/js/app.js" from ["src/cljs" "src/cljc" "env/dev/cljs"]...
Failed to compile "target/cljsbuild/public/js/app.js" in 15.751 seconds.
----  Could not Analyze  src/cljs/foo/core.cljs  ----

  java.lang.ClassFormatError: Unknown constant tag 99 in class file foo/data$foo_data, compiling:(foo/data.clj:10:1)

----  Analysis Error : Please see src/cljs/foo/core.cljs  ----
---- Initial Figwheel ClojureScript Compilation Failed ----
We need a successful initial build for Figwheel to connect correctly.

I noticed it when I tried to double the length of an existing markdown file that was parsing successfully. If I take out a random section from the first half of the file, the parsing works again.

I can't show the exact document I found this in, but I'll try to duplicate it with lorem ipsum.

Parent

Is there a way to do something like jQuery's .parent()?

Node.js support revisited

Pull request #33 allegedly solved the compatibility issues with Node.js. The Readme, however, still states that Hickory “won't work out of the box on node”. I've been reading the heck out of anything related to the issue here, but haven't found any reproducible steps to make hickory load and parse on Node.js.

In case there is currently no sensible way to use hickory on Node.js, maybe there is a perspective to tap into posthtml-parser, a Node.js library that outputs something that is almost hickory format. I've elaborated on this solution to HTML-parsing on Node.js here (SO)

I'd be willing to attempt such an integration myself, but would want to make sure that there is indeed no proper solution as of this point (Also, this would be my very first pull request, so I'm a bit daunted and quite unsure if I could actually get it done).

Hickory does not correctly parse <noscript> tags in <head>

Hickory doesn't parse noscript tags correctly when they are in the head, but does when they are in the body. noscript should be supported in both.

(require '[hickory.core :as hick]
         '[hickory.render :as hr])

(-> "
<html>
 <body>
  <noscript>Ceçi n'est pas de JavaScript</noscript>
 </body>
</html>"
    hick/parse
    hick/as-hickory
    hr/hickory-to-html)

;;=>
'"<html><head></head><body>\n  <noscript>Ceçi n'est pas de JavaScript</noscript>\n \n</body></html>"


(-> "
<html>
 <head>
  <noscript>Ceçi n'est pas de JavaScript</noscript>
 </head>
</html>"
    hick/parse
    hick/as-hickory
    hr/hickory-to-html)

;;=>
'"<html><head>\n  <noscript></noscript></head><body>Ceçi n'est pas de JavaScript\n \n</body></html>"

Unexpected NullPointerException ?

I encounter some NPE when executing some select:

user => (def hick (h/as-hickory (first (h/parse-fragment "<a><img href=\"\"/></a>"))))
#'user/hick

user => (s/select (s/child (s/follow-adjacent (s/tag :a) (s/tag :img))) hick)

NullPointerException   clojure.zip/node (zip.clj:67)

user => (s/select (s/child (s/follow-adjacent (s/tag :nonexistent) (s/tag :img))) hick)

NullPointerException   clojure.zip/node (zip.clj:67)

user => (s/select (s/child (s/follow-adjacent (s/tag :a) (s/tag :nonexistent))) hick)
[]

user => (s/select (s/child s/first-child) hick)

NullPointerException   clojure.zip/node (zip.clj:67)

I am not sure all those selections make sense but they probably should not trigger exceptions?

DOM 4 deprecating node type for attributes

The ATTRIBUTE_NODE constant has been deprecated in DOM4. It should no longer be assumed that an attribute has a nodeType.

The changes to the Attribute interface are particularly relevant for the Clojurescript implementation of Hickory. They can be accommodated by not recursively mapping over attributes and instead converting attributes in place when building elements.

Make HTML escaping configurable

I'm currently using as-hiccup to transform parsed HTML into a vector I can save and later render with Reagent. This works almost flawlessly except for the fact that as-hiccup escapes any TextNodes it comes across, meaning things like &amp; show up as literal strings when I pipe it down to the browser.

Maybe a keyword argument is the way to go? Not sure how that works with protocols.

Any interest in CSS-style selectors?

Great library, thanks for writing and publishing it!

I wrote up a macro for my own usage that allows me to use CSS-style selectors for simple scalar selectors (i.e. div.foo, span#bar, or even div.foo:nth-child(3) but not compound things like div tr td) that lets me do this:

(css-selector :td.comment)

instead of this:

(s/and (s/tag "td") (s/class "comment"))

Is there any interest in merging something like that into the project? I can put together a PR if so.

Documentation and expectations for HTML entities

I've noticed the following behavior:

lein try hickory
(use 'hickory.core)
(parse "<body>&nbsp;</body>")
; #<Document <html>
; <head></head>
; <body>
;  &nbsp;
; </body>
; </html>>

This is what I would expect; namely, the HTML entity &nbsp; is preserved.

But when I run as-hickory, the HTML entity is converted. This seems incorrect to me:

(pprint (as-hickory (parse "<body>&nbsp;</body>")))
{:type :document,
 :content
 [{:type :element,
   :attrs nil,
   :tag :html,
   :content
   [{:type :element, :attrs nil, :tag :head, :content nil}
    {:type :element, :attrs nil, :tag :body, :content [" "]}]}]}
nil

Perhaps I am misunderstanding something here. My goal is to allow HTML entities to make their way all the way to my Web server output. Currently, I lose HTML entities that I want to preserve in my result.

Case sensitive tags/attributes

Hi,

I'm working on a project where we use hickory as part of our client interface to a REST backend application. The backend provides XML that contains case-sensitive tags that refer to Java classes.

Hickory has a hardcoded conversion of attribute names and tags to lower-case-keyword. Is it possible to make this more transparent and leave that job to the end user, who could then use a zipper or walk to change the keys?

Tnx,
Robert

StackOverflowError when parsing many unclosed tags in series

Hickory throws a StackOverflowError when parsing a document with a long series of unclosed HTML tags.

(-> (clojure.string/join "\n" (repeat 2048 "<font>abc"))
    hickory.core/parse       ; this works
    hickory.core/as-hickory) ; this also fails with as-hiccup

The stack trace repeats this section over and over:

              clojure.core/map/fn       core.clj: 2726
                hickory.core/fn/G       core.clj:   21
                  hickory.core/fn       core.clj:  106
                clojure.core/into       core.clj: 6771
              clojure.core/reduce       core.clj: 6704
      clojure.core.protocols/fn/G  protocols.clj:   13
        clojure.core.protocols/fn  protocols.clj:   75
clojure.core.protocols/seq-reduce  protocols.clj:   31
      clojure.core.protocols/fn/G  protocols.clj:   19
        clojure.core.protocols/fn  protocols.clj:  168
                clojure.core/next       core.clj:   64

I'm using clojure 1.9.0-alpha14 and hickory 0.7.1.

Question: Regarding Evaluation

I am a veritable 'sophomore' in Clojure however:

I've defined a 'mini-grammar' that I want to translate into hickory selector directives. Here is a simple first statement:

(def _yf2 (map #(edn/read-string {:readers {'tag (fn [rec] (s/tag (identity rec)))}} %) (list "(#tag :head)" "(#tag :title)")))

I want to take the results of this and feed it into something like this:
(-> (s/select (s/children _yf2) data))

But am hard-pressed on getting it to work. Any help would be appreciated. I would have posted this somewhere like a support group but I see no evidence of one.

Unexpected conversion result

I did the following:

(map as-hiccup (parse-fragment "<p><div>test</div></p>"))

And got:

([:p {}] [:div {} "test"] [:p {}])

Note that the p tags are independent empty tags, but this isn't a correct conversion I don't think. Am I missing something? Or is this a bug?

Incidentally, if I do:

(map as-hiccup (parse-fragment "<div><div>test</div></div>"))

The result seems as it should be:

([:div {} [:div {} "test"]])

"Obtaining" section of README.md has 0.5.1, should be 0.5.2

If you actually put version 0.5.1 in your project.clj deps, your examples (such as "(as-hiccup parsed-doc)") don't work.

Easiest fix ever, though, right? :)

[Edit: I'm no longer sure the examples didn't work... I may have been testing with parsed-frag when I meant to use parsed-doc. In any case, the version in the "Obtaining" section should be updated.]

not selector behavior

When using the not selector I get some results I am not sure are expected.

  user=> (def h (as-hickory (first (parse-fragment "<div>test</div>"))))
  user=> (s/select (s/tag :div) h)
  [{:type :element, :attrs nil, :tag :div, :content ["test"]}]
  user=> (s/select (s/not (s/tag :div)) h)
  ["test"]
  user=> (s/select (s/tag :p) h)
  []
  user=> (s/select (s/not (s/tag :p)) h)
  [{:type :element, :attrs nil, :tag :div, :content ["test"]} "test"]

The content element seems to be preserved although I am not expecting this.
Is this the expected behavior?

Close pull requests that won't be merged

As a suggestion, having pull requests open for years that have neither been merged nor closed gives the impression that the project might be abandoned, and could discourage future pull requests (fwiw, it almost stopped me).

I'd recommend reviewing the older ones and deciding if you want to keep them open, particularly if they haven't had any activity in a while.

Counterparts of child and descendant for first element in the chain

I have asked for this in #23 but I am afraid has-child doesn't cut it, so I am re-iterating it as a separate issue.

I would like to be able to get the first element in a chain of descendants. So if elements are td > tr > input the following would give me the td element:

(hs/parent 
   (hs/tag :td)
   (hs/tag :tr)
   (hs/tag :input))

Similarly hs/ancestor should be the counterpart of hs/descendant.

In this regard the api for same level siblings is complete as given by follow <> precede and follow-adjacent <> precede-adjacent.

I know about has-child, but the above simple snippet becomes:

(hs/and
  (hs/tag :td)
  (hs/and
    (hs/tag :tr)
    (hs/has-child 
      (hs/tag :input))))

which even in this simple case looks pretty daunting.

Explicit namespace support for as-hickory for XML

Right now, when you parse some XML (using e.g. the open PR, or just DOMParser), and then as-hiccup it, you turn something like this:

<?xml version="1.0"?>
<soap:Envelope
    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">

    <soap:Body
        xmlns:m="http://www.example.org/stock">
        <m:GetStockPrice>
            <m:StockName>RAX</m:StockName>
        </m:GetStockPrice>
    </soap:Body>

</soap:Envelope>

... and turns it into:

bubbles.xml-test> (h/as-hiccup (bx/parse-xml soap-request))
([:soap:envelope
  {:xmlns:soap "http://www.w3.org/2001/12/soap-envelope",
   :soap:encodingstyle "http://www.w3.org/2001/12/soap-encoding"}
  "\n\n    "
  [:soap:body
   {:xmlns:m "http://www.example.org/stock"}
   "\n        "
   [:m:getstockprice {} "\n            " [:m:stockname {} "RAX"] "\n        "]
   "\n    "]
  "\n\n"])

... or with as-hickory:

bubbles.xml-test> (h/as-hickory (bx/parse-xml soap-request))
{:type :document,
 :content
 [{:type :element,
   :attrs
   {:xmlns:soap "http://www.w3.org/2001/12/soap-envelope",
    :soap:encodingstyle "http://www.w3.org/2001/12/soap-encoding"},
   :tag :soap:envelope,
   :content
   ["\n\n    "
    {:type :element,
     :attrs {:xmlns:m "http://www.example.org/stock"},
     :tag :soap:body,
     :content
     ["\n        "
      {:type :element,
       :attrs nil,
       :tag :m:getstockprice,
       :content
       ["\n            "
        {:type :element, :attrs nil, :tag :m:stockname, :content ["RAX"]}
        "\n        "]}
      "\n    "]}
    "\n\n"]}]}

You'll notice that in both hiccup and hickory mode, the tags are something like :m:getstockprice. That's unfortunate, because if the XML namespaces change name, an identical document will parse quite differently. I'm not sure how to "fix" this though.

svg tag viewBox becomes viewbox

In HiccupRepresentable and HickoryRepresentable if the node type is Attribute then lower-case-keyword is applied.
However, from

https://developer.mozilla.org/en-US/docs/Web/SVG/Attribute

some svg tags have uppercase characters like viewBox (there are others). This causes a warning in both Chrome and Firefox under re-frame (originating from react) but is not in itself an error in either browser. ie, an svg image with viewbox does not cause a warning.

Example warning:
react-dom.inc.js:82 Warning: Invalid DOM property viewbox. Did you mean viewBox?

re-frame 1.2.0 cljs 1.10.516

Select all between two things?

I struggle to figure out how to select all elements between two elements such as:

<ul class="pagination">
   <li class="prevnext"><a href="#" onclick="return false;" class="disablelink">&lt;</a></li>
   <li class="current"><a href="/pc-game-pass/games">1</a></li>
   <li><a href="/pc-game-pass/games?page=2" rel="nofollow">2</a></li>
   <li class="h-ellip"><span>...</span></li>
   <li><a href="/pc-game-pass/games?page=3" rel="nofollow">3</a></li>
   <li class="l"><a href="/pc-game-pass/games?page=4" rel="nofollow">4</a></li>
   <li class="prevnext"><a href="/pc-game-pass/games?page=2" rel="nofollow">&gt;</a></li>
</ul>

And I want to select all li elements between the one of class current and the one of class l using hickory selectors, so that I get back:

   <li><a href="/pc-game-pass/games?page=2" rel="nofollow">2</a></li>
   <li class="h-ellip"><span>...</span></li>
   <li><a href="/pc-game-pass/games?page=3" rel="nofollow">3</a></li>

How do you do that?

Thank You
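
One possible approach, sketched here in plain Clojure rather than with a dedicated selector: once the li elements are available as a sequence of Hickory maps (for example the element entries of the ul's :content), the range can be cut out with drop-while and take-while. The helper names has-class? and between are hypothetical and not part of hickory, and the items vector is hand-written sample data shaped like as-hickory output.

```clojure
(require '[clojure.string :as str])

(defn has-class?
  "True if the Hickory node's class attribute contains cls as a whole word.
  Hypothetical helper, not part of hickory."
  [node cls]
  (boolean (some #{cls} (str/split (get-in node [:attrs :class] "") #"\s+"))))

(defn between
  "Returns the nodes strictly between the first node matching start?
  and the next node matching end?. Hypothetical helper."
  [start? end? nodes]
  (->> nodes
       (drop-while (complement start?)) ; skip everything before the start marker
       rest                             ; drop the start marker itself
       (take-while (complement end?)))) ; stop just before the end marker

;; Hand-written sample data mirroring the pagination list above.
(def items
  [{:type :element :tag :li :attrs {:class "prevnext"} :content ["<"]}
   {:type :element :tag :li :attrs {:class "current"} :content ["1"]}
   {:type :element :tag :li :attrs nil :content ["2"]}
   {:type :element :tag :li :attrs {:class "h-ellip"} :content ["..."]}
   {:type :element :tag :li :attrs nil :content ["3"]}
   {:type :element :tag :li :attrs {:class "l"} :content ["4"]}
   {:type :element :tag :li :attrs {:class "prevnext"} :content [">"]}])

(map :content (between #(has-class? % "current") #(has-class? % "l") items))
;; => (["2"] ["..."] ["3"])
```

With real parsed input, the :content of the ul also holds whitespace strings between the li elements, so it is probably worth running (filter map? ...) over it before calling between.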

React-compatible :style attribute

React wants a map instead of string.
Invariant Violation: The style prop expects a mapping from style properties to values, not a string. For example, style={{marginRight: spacing + 'em'}} when using JSX.

It would be nice to have support for this
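
Until something like this exists in the library, a workaround could be to convert the style string after parsing. The sketch below is a minimal, naive version: style-str->map is a hypothetical helper, not a hickory function, and it assumes declarations are separated by ";" with no semicolons inside values (so something like a data: URL would break it). It keeps kebab-case keys, which wrappers such as Reagent are generally expected to translate; plain React would need camelCase keys instead.

```clojure
(require '[clojure.string :as str])

(defn style-str->map
  "Turns a CSS declaration string into a map of keyword -> string.
  Hypothetical helper; naive about semicolons inside values."
  [s]
  (into {}
        (for [decl (str/split s #";")
              ;; limit of 2 keeps any further colons inside the value
              :let [[k v] (str/split decl #":" 2)]
              :when (and v (not (str/blank? k)))]
          [(keyword (str/trim k)) (str/trim v)])))

(style-str->map "margin-right: 2em; color: red;")
;; => {:margin-right "2em", :color "red"}
```

It could then be applied to the :style entry of each attribute map, for example with clojure.walk/postwalk over the hiccup tree.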

Graal doesn't compile for 0.7.2+

I assumed hickory / jsoup were categorically incompatible with graalvm until very recently, prompting me to open an issue and pr to document it.
clj-easy/graalvm-clojure#61
clj-easy/graalvm-clojure#62

By luck I defined 0.7.1 as the dependency for my hickory hello world application, which made it compile.
This also held true for the significantly more complex youtube client bbyt, which initially prompted the research into graalvm clojure.
Neither of the two compile to graal with 0.7.2 or 0.7.3

For now I'm just opening the issue to bring it to your attention.
I might do a git bisect to check where the breakage occurred in 0.7.1...Release-0.7.2
Not quite sure how I'd do it. Probably via local deps or sth.

Reusing Selectors

David,

Been a user for a while. Moving into something else that produces hiccup vector tree format (instaparse) and am reviewing your selector library thinking I could re-use most of it.

My question is: It seems that the non-usable aspect would be limited to the 'tag' and 'attr' selectors, but am I blinded by desire and missing some subtle aspect which would render the whole idea moot?

Thanks in advance,
Frank

event handlers

It seems I cannot attach event handlers

(-> {:type :element, :attrs {:onclick (fn [] (prn "click"))}, :tag :h1, :content [" hi "]}
hickory-to-html )

onclick it seems must be a string. what if i want it to be a function?

Duplicated text content due to misuse of the wholeText property

Both hickory.core/as-hiccup and hickory.core/as-hickory use DOM's wholeText property to extract the text value of a dom node. However, instead of just returning the text content of a node, this property concatenates all Text nodes logically adjacent to the node.

This may lead to unexpected results, particularly when a parsed document is modified before converting it into hiccup or hickory. Transpiling a mozilla example of using wholeText:

(def doc (hickory.core/parse "<p>Thru-hiking is great! <strong>No insipid election coverage</strong> However, <a href=\"http://en.wikipedia.org/wiki/Absentee_ballot\">casting a ballot</a> is tricky.</p>"))
(def para (.item (.getElementsByTagName doc "p") 0))
(.removeChild para (.item (.-childNodes para) 1))

After the removal of the strong element from the paragraph (as-hiccup para) now returns:

[:p {} "Thru-hiking is great!  However, " "Thru-hiking is great!  However, " 
  [:a {:href "http://en.wikipedia.org/wiki/Absentee_ballot"} "casting a ballot"] " is tricky."]

Notice the duplicate text caused by wholeText concatenating adjacent text nodes for the two text nodes remaining after removing the originally interjecting strong element.

A fix is to call goog.dom/getRawTextContent in place of the wholeText property accessors in hickory.core.

:viewBox property in svg not correctly parsed to hiccup.

Totally enjoying the simplicity of using this library. I am trying to convert a SVG html string to hiccup but I noticed that the :viewBox property in svg's is not correctly parsed. The output is :viewbox instead of :viewBox which is not correctly translated by the browser hence the resulting svg is not displayed correctly. I am using the parse-fragment function although I noticed the issue also occurs in parse. The rest of the string is parsed correctly as far as I can tell. This is not a big issue since clojure has clojure.walk/postwalk which can easily fix this on the resulting hiccup but I guess it would be better to be implemented in the library. If this issue is not worked on soon, can I look into it later when I get time and make a PR?
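
For reference, the postwalk workaround mentioned above might look roughly like this. It is only a sketch: svg-attr-case and restore-svg-attr-case are hypothetical names, and the lookup table lists just a few camelCase SVG attributes and would need extending for real documents.

```clojure
(require '[clojure.walk :as walk]
         '[clojure.set :as set])

;; Hypothetical lookup table: lower-cased keyword -> camelCase SVG attribute.
;; Only a few entries are listed here.
(def svg-attr-case
  {:viewbox             :viewBox
   :preserveaspectratio :preserveAspectRatio
   :gradientunits       :gradientUnits})

(defn restore-svg-attr-case
  "Walks a hiccup tree and renames known lower-cased keys in every map it finds."
  [hiccup]
  (walk/postwalk
    (fn [x]
      (if (map? x)
        (set/rename-keys x svg-attr-case) ; only keys present in the table change
        x))
    hiccup))

(restore-svg-attr-case
  [:svg {:viewbox "0 0 10 10"} [:circle {:cx "5" :cy "5" :r "4"}]])
;; => [:svg {:viewBox "0 0 10 10"} [:circle {:cx "5", :cy "5", :r "4"}]]
```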

Inlined javascript and such is corrupted

I noticed that as a consequence of escaping all text nodes, script tags that contain Javascript are corrupted by the escaping. This didn't occur to me until my google analytics snippet got escaped. I think this means the jsoup issue is more important now.

(as-hiccup (parse-fragment ...)) doesn't work in clojurescript with shadow-cljs

Example:

(as-hiccup (parse-fragment "hello world"))

Results in error:

Error: No matching clause: 
    at eval (core.cljs:71)
    at Object.hickory$core$as_hiccup [as as_hiccup] (core.cljs:13)

Parse works OK, e.g.

(as-hiccup (parse "hello world"))
=> ([:html {} [:head {}] [:body {} "hello world"]])

And parse-fragment itself doesn't throw an error, it's only thrown when I call as-hiccup or as-hickory on the result of the parse.

(parse-fragment "hello world")
=> #object[NodeList [object NodeList]]

Here are some more results in case they help debug:

(parse "hello world")
=> #object[HTMLDocument [object HTMLDocument]]
(aget (parse "hello world") "body")
=> #object[HTMLBodyElement [object HTMLBodyElement]]
(aget (parse "hello world") "body" "childNodes")
=> #object[NodeList [object NodeList]]
(aget (parse "hello world") "body" "childNodes" "nodeType")
=> nil

After bundling hickory in an uberjar, it throws an exception when running that .jar file

I created a Clojure webapp and used hickory for scraping web pages. I used io.github.clojure/tools.build {:git/tag "v0.9.1" :git/sha "27ff8a4"} for creating an uberjar of the app. I build it using clj -T:build uber, my deps.edn looking like this:

{:paths ["src/clj" "src/dev"]
 :deps {ring/ring {:mvn/version "1.4.0"}
        http-kit/http-kit {:mvn/version "2.5.3"}
        com.taoensso/timbre {:mvn/version "5.2.1"}
        metosin/reitit {:mvn/version "0.5.17"}
        metosin/ring-http-response {:mvn/version "0.9.3"}
        org.clj-commons/hickory {:mvn/version "0.7.3"}
        hiccup/hiccup {:mvn/version "1.0.5"}
        clojure.java-time/clojure.java-time {:mvn/version "1.2.0"}
        org.clojure/core.async {:mvn/version "1.6.673"}
        com.draines/postal {:mvn/version "2.0.5"}}
 :aliases {:build {:extra-paths ["src/build"]
                   :extra-deps {io.github.clojure/tools.build {:git/tag "v0.9.1" :git/sha "27ff8a4"}
                                org.clj-commons/hickory {:mvn/version "0.7.3"}}
                   :ns-default uberjar}
           :dev {:main-opts ["-m" "gajbe.server"]}}}

Then when I tried running the app using the java -jar target/gajbe.jar command, I encountered this exception:

Exception in thread "async-dispatch-1" java.lang.NoClassDefFoundError: hickory/core/HickoryRepresentable
        at gajbe.rasclanjivaci.ProcesorBeogradskiOglasi.izvuci_oglase(rasclanjivaci.clj:97)
        at gajbe.rasclanjivaci$fn__24183$G__24153__24185.invoke(rasclanjivaci.clj:10)
        at gajbe.rasclanjivaci$fn__24183$G__24152__24188.invoke(rasclanjivaci.clj:10)
        at clojure.core$map$fn__5935.invoke(core.clj:2770)
        at clojure.lang.LazySeq.sval(LazySeq.java:42)
        at clojure.lang.LazySeq.seq(LazySeq.java:51)
        at clojure.lang.RT.seq(RT.java:535)
        at clojure.core$seq__5467.invokeStatic(core.clj:139)
        at clojure.core$apply.invokeStatic(core.clj:662)
        at clojure.core$mapcat.invokeStatic(core.clj:2800)
        at clojure.core$mapcat.doInvoke(core.clj:2800)
        at clojure.lang.RestFn.invoke(RestFn.java:423)
        at gajbe.rasclanjivaci$dohvati_oglase.invokeStatic(rasclanjivaci.clj:117)
        at gajbe.rasclanjivaci$dohvati_oglase.invoke(rasclanjivaci.clj:115)
        at gajbe.poslovi$pokreni_obavestavaca$fn__24511$state_machine__21095__auto____24512$fn__24514.invoke(poslovi.clj:12)
        at gajbe.poslovi$pokreni_obavestavaca$fn__24511$state_machine__21095__auto____24512.invoke(poslovi.clj:12)
        at clojure.core.async.impl.runtime$run_state_machine.invokeStatic(runtime.clj:62)
        at clojure.core.async.impl.runtime$run_state_machine.invoke(runtime.clj:61)
        at clojure.core.async.impl.runtime$run_state_machine_wrapped.invokeStatic(runtime.clj:66)
        at clojure.core.async.impl.runtime$run_state_machine_wrapped.invoke(runtime.clj:64)
        at gajbe.poslovi$pokreni_obavestavaca$fn__24511.invoke(poslovi.clj:12)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at clojure.core.async.impl.concurrent$counted_thread_factory$reify__15124$fn__15125.invoke(concurrent.clj:29)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.base/java.lang.Thread.run(Thread.java:829)

Caused by: java.lang.ClassNotFoundException: hickory.core.HickoryRepresentable
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        ... 27 more

It has nothing to do with the thread being async, because it happens with sync as well. I attach the uberjar and the only file in which I use the library.

uberjar: gajbe.jar.zip
Clojure ns using hickory:

(ns gajbe.rasclanjivaci
  (:require [clojure.string :as str]
            [hickory.core :as h]
            [hickory.select :as hs]
            [gajbe.urlovi :as url]
            [gajbe.util :refer [-m]]
            [java-time.api :as jt])
  (:import (java.time ZoneId)))

(defprotocol IzvuciOglase
  (dohvati-stranicu [this] "prihvata spisak svih urlova i vraća html stranicu")
  (izvuci-oglase [this] "uzima stranicu i vraca kolekciju mapa, od kojih je svaka jedan oglas")
  (obradi-oglas [this oglas] "uzima hickory podatke jednog oglasa i vraca njegove elemente"))

(defn datum-string->instant [^String datum ^String format]
  (let [formater (jt/formatter format)
        local-date-time (.atStartOfDay (jt/local-date formater datum))]
    (jt/instant (jt/zoned-date-time local-date-time (ZoneId/systemDefault)))))

(defn- relativni-u-apsolutni
  [^String datum]
  (condp re-matches datum
    #"[Dd]anas" (jt/instant)
    #"[Jj]u[a-zA-Z\u00C0-\u024F]e" (jt/minus (jt/instant) (jt/days 1))
    #"pre nedelju dana" (jt/minus (jt/instant) (jt/weeks 1))
    #"pre ([0-9]+) nedelj[a-z]" :>> (fn [[_ broj-nedelja]]
                                      (jt/minus (jt/instant) (jt/weeks (read-string broj-nedelja))))
    #"pre ([0-9]+) dana" :>> (fn [[_ broj-dana]]
                               (jt/minus (jt/instant) (jt/days (read-string broj-dana))))
    #"([0-9]+) dan[a-z]?[\s]+pre" :>> (fn [[_ broj-dana]]
                                        (jt/minus (jt/instant) (jt/days (read-string broj-dana))))
    #"([0-9]+) sat[a-z]?[\s]+pre" :>> (fn [[_ broj-sati]]
                                        (jt/minus (jt/instant) (jt/hours (read-string broj-sati))))
    #"([0-9]+) minut[a-z]?[\s]+pre" :>> (fn [[_ broj-minuta]]
                                          (jt/minus (jt/instant) (jt/minutes (read-string broj-minuta))))
    #"([a-zA-Z]+) ([0-9]+), ([0-9]+)" :>> (fn [[_ mesec dan godina]] ;; e.g. Mar 21, 2023
                                            (datum-string->instant (str/join "/" [godina mesec (inc (read-string dan))])
                                              "yyyy/MMM/d"))))

(comment
  (relativni-u-apsolutni "8 sati pre"))

(deftype ProcesorKP [urlovi imena-domena]
  IzvuciOglase
  (dohvati-stranicu [_this]
    (slurp (first (:KP urlovi))))
  (izvuci-oglase [this]
    (let [oglasi (hs/select (hs/tag :article) (h/as-hickory (h/parse (dohvati-stranicu this))))]
      (map (partial obradi-oglas this) oglasi)))
  (obradi-oglas [_this oglas]
    (let [[{{link-oglasa :href} :attrs}] (hs/select (hs/class :Link_link__J4Qd8) oglas)
          link-oglasa (str (:KP imena-domena) link-oglasa)
          [{[naslov] :content}] (hs/select (hs/class :AdItem_name__RhGAZ) oglas)
          [{[opis] :content}] (hs/select (hs/child (hs/class :AdItem_adTextHolder__Fmra9) (hs/tag :p)) oglas)
          [{[cena] :content}] (hs/select (hs/class :AdItem_price__jUgxi) oglas)
          [{{link-fotografije :src} :attrs}] (hs/select (hs/child (hs/class :AdItem_imageHolder__LZaKa) (hs/tag :img))
                                               oglas)
          [{[mesto] :content}] (hs/select (hs/child (hs/class :AdItem_originAndPromoLocation__HgtYj) (hs/tag :p)) oglas)
          datum (relativni-u-apsolutni
                  (first (:content (last
                                     (hs/select (hs/child (hs/class :AdItem_postedStatus__swUhG)
                                                  (hs/tag :p)) oglas)))))
          kp-obnovljen? (some? (first (:content (first (hs/select (hs/child (hs/class :AdItem_postedStatus__swUhG)
                                                                    (hs/tag :a)) oglas)))))
          izvor :KP]
      (-m link-oglasa naslov opis cena link-fotografije mesto datum kp-obnovljen? izvor))))

(deftype ProcesorHaloOglasi [urlovi imena-domena]
  IzvuciOglase
  (dohvati-stranicu [_this]
    (slurp (first (:halo-oglasi urlovi))))
  (izvuci-oglase [this]
    (let [oglasi (hs/select (hs/and (hs/class :product-item) (hs/el-not (hs/class :banner-list)))
                   (h/as-hickory (h/parse (dohvati-stranicu this))))]
      (map (partial obradi-oglas this) oglasi)))
  (obradi-oglas [_this oglas]
    (let [[{[naslov] :content {link-oglasa :href} :attrs}] (hs/select (hs/child (hs/class :product-title) (hs/tag :a))
                                                             oglas)
          link-oglasa (str (:halo-oglasi imena-domena) link-oglasa)
          [{[{[cena] :content}] :content}] (hs/select (hs/attr :data-value) oglas)
          [{{link-fotografije :src} :attrs}] (hs/select (hs/descendant (hs/tag :figure) (hs/tag :img)) oglas)
          mesto (str/join "/" (map (comp first :content)
                                (:content (first (hs/select (hs/class :subtitle-places) oglas)))))
          [tip kvadratura broj-soba] (map (comp first :content)
                                       (hs/select (hs/descendant (hs/class :product-features) (hs/class :value-wrapper))
                                         oglas))
          [{[datum] :content}] (hs/select (hs/class :publish-date) oglas)
          datum (datum-string->instant datum "dd.MM.yyyy.")
          izvor :halo-oglasi]
      (-m link-oglasa naslov cena link-fotografije mesto tip kvadratura broj-soba datum izvor))))

(deftype ProcesorBeogradskiOglasi [urlovi imena-domena]
  IzvuciOglase
  (dohvati-stranicu [_this]
    (slurp (first (:beogradski-oglasi urlovi))))
  (izvuci-oglase [this]
    (let [oglasi (hs/select (hs/class :classified) (h/as-hickory (h/parse (dohvati-stranicu this))))]
      (map (partial obradi-oglas this) oglasi)))
  (obradi-oglas [_this oglas]
    (let [[{[naslov] :content}] (hs/select (hs/child (hs/class :title) (hs/tag :h3)) oglas)
          [{{link-oglasa :href} :attrs}] (hs/select (hs/child (hs/class :classified) (hs/tag :a)) oglas)
          link-oglasa (str (:beogradski-oglasi imena-domena) link-oglasa)
          [{[opis] :content}] (hs/select (hs/child (hs/class :fbac) (hs/tag :p)) oglas)
          [{[cena] :content}] (hs/select (hs/class :sl-price) oglas)
          cena (str/trim cena)
          [{{link-fotografije :src} :attrs}] (hs/select (hs/class :cpic) oglas)
          [{[mesto] :content}] (hs/select (hs/class :sl-loc) oglas)
          [{[datum] :content}] (hs/select (hs/child (hs/class :fbac) (hs/class :small-light)) oglas)
          datum (relativni-u-apsolutni (str/trim (second (str/split datum #" "))))
          kp-obnovljen? (some? (first (:content (first (hs/select (hs/child (hs/class :AdItem_postedStatus__swUhG)
                                                                    (hs/tag :a)) oglas)))))
          izvor :beogradski-oglasi]
      (-m link-oglasa naslov opis cena link-fotografije mesto datum kp-obnovljen? izvor))))

(defn dohvati-oglase []
  (sort-by :datum jt/after?
    (mapcat izvuci-oglase
      [(->ProcesorBeogradskiOglasi url/urlovi-oglasa url/imena-domena)
       (->ProcesorHaloOglasi url/urlovi-oglasa url/imena-domena)
       (->ProcesorKP url/urlovi-oglasa url/imena-domena)])))

(comment
  (mapcat izvuci-oglase [(->ProcesorBeogradskiOglasi url/urlovi-oglasa) (->ProcesorHaloOglasi url/urlovi-oglasa)
                         (->ProcesorKP url/urlovi-oglasa url/imena-domena)]))

Let me know if you need any other info.

Node.js Support

Not sure if this is a good idea...

I'm trying to get a nodejs test-runner working for a node-webkit project, however hickory relies on DOM APIs not present in node.js (works fine in the real app).

Alternatively, I can try to find a way to use node-webkit as a test runner.

Actual Error:

  return Node[[cljs.core.str(a), cljs.core.str("_NODE")].join("")];
         ^
ReferenceError: Node is not defined
    at hickory.core.node_type (/home/gary/my-project/target/testing.js:31175:10)
    at Object.<anonymous> (/home/gary/my-project/target/testing.js:31177:49)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:902:3

Use of aget breaks under advanced compilation

Hickory uses aget for example in

(defn node-type
  [type]
  (aget goog.dom.NodeType type))

which, with current ClojureScript, breaks under advanced compilation, since property names are munged.

Under advanced compilation:

(println "hickory goog.dom.NodeType" goog.dom.NodeType)
=>
hickory goog.dom.NodeType #js {:Fi 1, :zi 2, :Ki 3, :Ai 4, :Hi 5, :Gi 6, :Ji 7, :Bi 8, :Ci 9, :Ei 10, :Di 11, :Ii 12}

when not using advanced compilation:

(println "hickory goog.dom.NodeType" goog.dom.NodeType)
=>
hickory goog.dom.NodeType #js {:ELEMENT 1, :ATTRIBUTE 2, :TEXT 3, :CDATA_SECTION 4, :ENTITY_REFERENCE 5, :ENTITY 6, :PROCESSING_INSTRUCTION 7, :COMMENT 8, :DOCUMENT 9, :DOCUMENT_TYPE 10, :DOCUMENT_FRAGMENT 11, :NOTATION 12}

Should probably use just .- property access instead of aget. aget probably has worked with older ClojureScript, but doesn't work any longer.

very odd interactive issue

When I use hickory in the repl (in clojure, not cljs) I get this:

=> (require '[hickory.core :as hickory])
=> (doc hickory/parse)
-------------------------
hickory.core/parse
([s])
  Parse an entire HTML document into hiccup vectors.

And it does return hiccup. I expected it to return a JSoup document and need as-hiccup to convert it.

And indeed, when I run it outside of the repl that's what happens because code I write depending on the above fails.

I can't find "into hiccup vectors" in the hickory code base even.

It's super weird or I'm more tired than I thought I was.

Help!
