
jsonframe-cheerio's Introduction

This repository is deprecated. Use it at your own risk!



jsonframe

simple multi-level scraper json input/output

npm package: jsonframe-cheerio, a Cheerio plugin

2.0.5x features

JSON syntax: input JSON, output the same structured JSON filled with the scraped data

Simple patterns: simple inline selectors, extractors, filters and parsers.

Reliable & fast: used in production within crawlers

See the full changelog

Example

let cheerio = require('cheerio')
let $ = cheerio.load(`
	<body>
		<h1>I love jsonframe!</h1>
		<span itemprop="email"> Email: [email protected]  </span>
	</body>`)

let jsonframe = require('jsonframe-cheerio')
jsonframe($) // initializing the plugin

let frame = {
	"title": "h1", // this is an inline selector
	"email": "span[itemprop=email] < email" // output an extracted email
}

console.log( $('body').scrape(frame, { string: true } ))
/*=>
{
	"title": "I love jsonframe!",
	"email": "[email protected]"
}
*/

Use

Install the plugin to your Node.js app through NPM

npm i jsonframe-cheerio --save

API

Loading

Start by loading Cheerio.

let cheerio = require('cheerio')
let $ = cheerio.load("HTML DOM to load") // See Cheerio API

Then load the jsonframe plugin.

let jsonframe = require('jsonframe-cheerio') // require from npm package
jsonframe($) // apply the plugin to the current Cheerio instance

Scraper

Once the plugin is loaded, you first have to define the frame of your data.

Let's take the following HTML example:

<html>
<head></head>
<body>
    <h2>Pricing</h2>
		<img class="picture" src="somepath/to/image.png">
		<a class="mainLink" href="some/url/to/somewhere">A Link</a>
		<span class="date"> We are the 04/02/2017</span>
		<div class="popup"><span>Some inner content</span></div>
    <ul id="pricing" class="menu">
        <li class="item">
            <span class="planName">Hacker</span>
            <span class="planPrice" price="0">Free</span>
            <a href="/hacker"> <img src="./img/hacker.png"> </a>
        </li>
        <li class="item">
            <span class="planName">Pro</span>
            <span class="planPrice" price="39.00">$39</span>
            <a href="/pro"> <img src="./img/pro.png"> </a>
        </li>
    </ul>
	<div id="contact">
		<span itemprop="usaphone">Phone USA: (912) 148-456</span>
		<span itemprop="frphone">Phone FR: +332 38 30 37 90</span>
		<span itemprop="email">Email: [email protected]</span>
	</div>
</body>
</html>

$( selector ).scrape( frame , {options})

selector is defined in Cheerio's documentation

frame is a JSON or JavaScript object

{options} are detailed later in their own section

let frame = {
	"title": "h2" // CSS selector
}

We then pass the frame to the function:

let result = $('body').scrape(frame, { string: true })
console.log( result )
//=> {"title": "Pricing"}

Frame

Inline Selector

The most common form: declare it inline by giving only the property name and, as its value, the selector.

...
let frame = { "title": "h2" }

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
	{ "title": "Pricing" }
*/
...

New : Inline attribute / extractor / parser

You can now declare everything inline. When combining them, always use them in the following order: @ (attribute), < (extractor), || (parse). Filters (|), as in the Filter examples, go after the extractor.

See the examples for each of them below.
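
As a rough illustration of that ordering, the sketch below splits an inline value into its parts. splitInline is a hypothetical helper written for this README, not the plugin's actual parser, and it assumes the order selector, @, <, |, ||.

```javascript
// Hypothetical helper, NOT the plugin's own parser: it only illustrates
// the assumed order selector @ attribute < extractor | filters || regex.
function splitInline(expr) {
  const parts = { selector: '', attribute: null, extractor: null, filters: [], parse: null }
  let rest = expr

  // Cut the regex off first: it may itself contain "|", "<" or "@".
  const parseAt = rest.indexOf('||')
  if (parseAt !== -1) {
    parts.parse = rest.slice(parseAt + 2).trim()
    rest = rest.slice(0, parseAt)
  }

  // Filters come next, separated by spaces.
  const filterAt = rest.indexOf('|')
  if (filterAt !== -1) {
    parts.filters = rest.slice(filterAt + 1).trim().split(/\s+/).filter(Boolean)
    rest = rest.slice(0, filterAt)
  }

  // Then the extractor(s) after "<".
  const extractAt = rest.indexOf('<')
  if (extractAt !== -1) {
    parts.extractor = rest.slice(extractAt + 1).trim()
    rest = rest.slice(0, extractAt)
  }

  // Finally the attribute after the last "@".
  const attrAt = rest.lastIndexOf('@')
  if (attrAt !== -1) {
    parts.attribute = rest.slice(attrAt + 1).trim()
    rest = rest.slice(0, attrAt)
  }

  parts.selector = rest.trim()
  return parts
}

console.log(splitInline(".planName:contains('Pro') + span@price"))
console.log(splitInline('[itemprop=email] < email | uppercase'))
console.log(splitInline('.date || \\d{1,2}/\\d{1,2}/\\d{2,4}'))
```

The sketch ignores edge cases such as an @ or < inside a quoted attribute value in the selector.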

Attribute

_a: "attributeName" allows you to retrieve any attribute data
@ inside the selector _s allows you to do it inline

...
let frame = {
	"proPrice": ".planName:contains('Pro') + span@price"
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
	{ "proPrice": "39.00" }
*/
...

Extractor

< inside the selector _s allows you to apply an extractor inline

It currently supports email (also mail), telephone (also phone), date, fullName (or firstName, lastName, initials, suffix, salutation) and html (to get the inner HTML). By default (no declaration), you get the inner text.

...
let frame = {
	"email": "[itemprop=email] < email",
	"frphone": "[itemprop=frphone] < phone"
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
	{
		"email": "[email protected]",
		"frphone": "33238303790"
	}
*/
...
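
To give an idea of what the email and phone extractors return, the sketch below applies the email regex quoted in the 2.0.44 changelog entry, and a digits-only reduction, to plain strings. Both functions are illustrative stand-ins, not the plugin's code; the digits-only rule is an assumption, and the email address is a made-up placeholder.

```javascript
// Email regex as quoted in the 2.0.44 changelog entry.
const EMAIL_RE = /([a-zA-Z0-9._-]{0,30}@[a-zA-Z0-9._-]{0,15}\.[a-zA-Z0-9._-]{0,15})/gmi

// Illustrative stand-in for "< email": first address found, or null.
function extractEmail(text) {
  const found = text.match(EMAIL_RE)
  return found ? found[0] : null
}

// Illustrative stand-in for "< phone": keep digits only (assumed behavior,
// chosen because it reproduces the "frphone" output shown above).
function extractPhone(text) {
  const digits = text.replace(/\D/g, '')
  return digits.length > 0 ? digits : null
}

console.log(extractEmail('Email: jane.doe@example.com')) // placeholder address
console.log(extractPhone('Phone FR: +332 38 30 37 90'))  // "33238303790"
```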

Filter

| inside the selector _s allows you to apply a filter inline

It currently supports trim (remove spaces at beginning and end), lowercase or lcase, uppercase or ucase, capitalize or cap, words or w, noescapchar or nec, compact or cmp and number or nb.

...
let frame = {
	"email1": "[itemprop=email] < email | uppercase",
	"email2": "[itemprop=email] < email | capitalize"
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
	{
		"email1": "[email protected]",
		"email2": "EXAMPLE GOOGLE NET"
	}
*/
...
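
Several filters can be chained by separating them with spaces (see the 2.0.3 changelog entry). The sketch below models that chaining on plain strings. It is an assumption-laden illustration, not the plugin's implementation: it covers only the filters whose meaning is unambiguous and assumes they are applied from left to right.

```javascript
// Illustrative filter table: only filters with an unambiguous meaning.
const FILTERS = new Map([
  ['trim', (s) => s.trim()],
  ['lowercase', (s) => s.toLowerCase()],
  ['lcase', (s) => s.toLowerCase()],
  ['uppercase', (s) => s.toUpperCase()],
  ['ucase', (s) => s.toUpperCase()]
])

// Apply space-separated filters from left to right (assumed order).
function applyFilters(value, filterList) {
  return filterList
    .split(/\s+/)
    .filter(Boolean)
    .reduce((current, name) => {
      const fn = FILTERS.get(name)
      if (!fn) throw new Error('Unknown filter: ' + name)
      return fn(current)
    }, value)
}

console.log(applyFilters('  Hacker ', 'trim uppercase')) // "HACKER"
```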

Parse / Regex

|| inside the selector _s allows you to use a regex inline
_p: /regex/ allows you to extract data based on a regular expression

...
let frame = {
	"date": ".date || \\d{1,2}/\\d{1,2}/\\d{2,4}"
}

// or use the longer version for proper regex entry

let frame = {
	"date": {
		_s: ".date",
		_p: /\d{1,2}\/\d{1,2}\/\d{2,4}/ // n[n]/n[n]/nn[nn] format here
	}
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
	{
		"date": "04/02/2017"
	}
*/
...
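
The difference between the two notations is only in the escaping, which can be checked in plain JavaScript without the plugin. The snippet below applies both forms of the pattern to the text of the .date span from the sample HTML; the variable names are made up for the example.

```javascript
const text = ' We are the 04/02/2017' // text of the .date span in the sample HTML

// Long form: a regex literal, with "/" escaped.
const literal = /\d{1,2}\/\d{1,2}\/\d{2,4}/

// Inline form: the pattern travels as a string, so every backslash is doubled.
const fromString = new RegExp('\\d{1,2}/\\d{1,2}/\\d{2,4}')

console.log(text.match(literal)[0])    // "04/02/2017"
console.log(text.match(fromString)[0]) // "04/02/2017"
```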

List / Array

_d: [{ }] allows you to get an array / list of data
_d: ["selector"] retrieves a list based on the selector between the quotes.
_d: ["firstSelector", "secondSelector"] works too and merges the results into one array

You can shorten it further by listing right from the property, as follows: "selectorName": [".selector"], which returns an array of strings

...
let frame = {
	"pricing": {
		_s: "#pricing .item",
		_d: [{
			"name": ".planName",
			"price": ".planPrice"
		}]
	}
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
	{
		"pricing": [
			{
				"name": "Hacker",
				"price": "Free"
			},
			{
				"name": "Pro",
				"price": "$39"
			}
		]
	}
*/

// Or a shorter way which works for simple string arrays

let frame = {
	"pricingNames": ["#pricing .item .planName"]
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
	{
		"pricingNames": ["Hacker", "Pro"]
	}
*/
...

Grouped

"_g": { _s: "", _d: {} } allows you to group some data selectors under a parent selector without naming the parent. You can also extend the group property name to add some meaning or simply to have several groups at the same level.
The group property name must be _g or _group, optionally followed by _ and whatever string you want.
ex: _g_head : {} or _g_body : {}

...
let frame = {
	_g: {
		_s: "#pricing .item",
		_d: {
			"name": ".planName",
			"price": ".planPrice"
		}
	},
	_g_second: {
		_s: "#pricing .item",
		_d: {
			"secondName": ".planName",
			"secondPrice": ".planPrice"
		}
	}
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
	{
		"name": "Hacker",
		"price": "Free",
		"secondName": "Hacker",
		"secondPrice": "Free"
	}
*/
...
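
The effect on the output shape can be modeled in plain JavaScript: the children of every group key are lifted into the parent object. liftGroups below is a hypothetical helper that works on already-scraped data purely to show that shape; the plugin itself resolves groups while reading the frame.

```javascript
// Hypothetical helper: lifts the children of "_g", "_group", "_g_xxx" or
// "_group_xxx" keys into the parent object, as the output above suggests.
const GROUP_KEY = /^_(g|group)(_.*)?$/

function liftGroups(data) {
  const out = {}
  for (const [key, value] of Object.entries(data)) {
    if (GROUP_KEY.test(key) && value !== null && typeof value === 'object') {
      Object.assign(out, value) // children take the place of the group property
    } else {
      out[key] = value
    }
  }
  return out
}

console.log(liftGroups({
  _g: { name: 'Hacker', price: 'Free' },
  _g_second: { secondName: 'Hacker', secondPrice: 'Free' }
}))
```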

Nested

"parent": { _s: "parentSelector", _d: {} } allows you to segment your data by setting a parent section from which the child data will be scraped.

You can also use "parent": { } when you only want to nest data into objects without setting a parent selector.

...
let frame = {
	"pricing": {
		_s: "#pricing .item",
		_d: {
			"name": ".planName",
			"price": ".planPrice"
		}
	}
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
	{
		"pricing":{
			"name": "Hacker",
			"price": "Free"
		}
	}
*/
...

Note that, without the array form (_d: [{ }]), only the first element matched by #pricing .item is returned.

Example

See how you can properly structure your data, ready for the output!

...
let frame = {
	"pricing": {
		_s: "#pricing .item",
		_d: [{
			"name": ".planName",
			"price": ".planPrice @ price",
			"image": {
				"url": "img @ src",
				"link": "a @ href"
			}
		}]
	}
}

let result = $('body').scrape(frame, { string: true })
console.log( result )

/* output =>
	{
		"pricing":[
			{
				"name": "Hacker",
				"price": "0",
				"image": {
					"url": "./img/hacker.png",
					"link": "/hacker"
				}
			},
			{
				"name": "Pro",
				"price": "39.00",
				"image": {
					"url": "./img/pro.png",
					"link": "/pro"
				}
			}
		]
	}
*/
...


Options

...
let frame = {
	"proPrice": {
		_s: ".planName:contains('Pro') + span",
		_a: "price"
	}
}

let result = $('body')
	.scrape(frame, {
			timestats: true, // default: false
			string: true // default: false
		})
console.log(result)

/* output =>
	{
		"proPrice": {
			"value":"39.00",
			"_timestats": "1" // ms
		}

	}
*/
...
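
With string: true the result is a stringified output (see the 2.0.2 changelog entry), so it has to be parsed before individual properties can be read. The snippet below uses the string shown in the Scraper section as a stand-in for a real scrape result.

```javascript
// Stringified result as shown in the Scraper section (string: true).
const result = '{"title": "Pricing"}'

// Parse it back to read individual properties.
const data = JSON.parse(result)
console.log(data.title) // "Pricing"

// Pretty-print it for logs.
console.log(JSON.stringify(data, null, 2))
```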

Tests

One shot tests

npm run test

Watch tests on updates

npm run test-watch

Changelog

⚠ Careful if you've been using jsonframe since version 1.x.x: some things changed to make it more flexible, faster to use (inline parameters) and more meaningful in its syntax.

2.0.52 (28/02/2017)

  • Update the email regex
  • Update the website regex
  • Fix array into array results
  • Improving script efficiency getting data from node(s)
  • Fix date extractor when no date to extract

2.0.51 (27/02/2017)

  • Fix a fatal error (argh) which was just a typo about the new chained extractors

2.0.50 (27/02/2017)

  • Extractors chaining is now possible. For ex: .selector < html email would work

2.0.49 (27/02/2017)

  • Fixing issue when attribute doesn't exist (@ attributeName)
  • Improving array of object management (still need to find a way to avoid empty objects)

2.0.48 (27/02/2017)

  • Add Filter Split(char) to split string based on character (default to whitespace)
  • Add Extractor numbers or nb (return potentially an array)
  • Update Filter numbers or nb (simply filter the string to output only numbers)
  • Add Filter between(string1&&string2) to filter data by starting and finishing string
  • Add Filter before(string) to get data before a string
  • Add Filter after(string) to get data after a string
  • Add array support to Filter left(nb) and right(nb) (slice the array elements)
  • Add Filter fromto(startNb,endNb) to either slice an array or a string from index to index
  • Add Filter get(nb) to extract either an array item or a character from a string
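
These filters are only named here, so the sketch below is a guess at what before, after and between do on a string, written as plain functions. The names mirror the changelog, but the bodies are hypothetical re-implementations; in particular, leaving the text unchanged when a marker is missing is an assumption.

```javascript
// Hypothetical re-implementations; a missing marker leaves the text
// unchanged here, which may differ from what the plugin does.
function before(text, marker) {
  const at = text.indexOf(marker)
  return at === -1 ? text : text.slice(0, at)
}

function after(text, marker) {
  const at = text.indexOf(marker)
  return at === -1 ? text : text.slice(at + marker.length)
}

function between(text, start, end) {
  return before(after(text, start), end)
}

const phoneLine = 'Phone FR: +332 38 30 37 90'
console.log(after(phoneLine, ': '))                        // "+332 38 30 37 90"
console.log(before(phoneLine, ':'))                        // "Phone FR"
console.log(between('We are the 04/02/2017', 'the ', '/')) // "04"
```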

2.0.46 (26/02/2017)

  • Rebuild of the Unstructured scraper with breaks (_b) - Works like a charm now!

2.0.45 (25/02/2017)

  • Fix weird fullName parsing in some cases
  • Update Handle of Regex - Is now able to capture a group with a regex

2.0.44 (24/02/2017)

  • Inline array for extractors like "mails": [".parentSelector < email"]
  • Adds french words: prenom and nom to humanname extractor
  • Add filters: right(number), left(number)
  • Set a stricter regex for email extractor /([a-zA-Z0-9._-]{0,30}@[a-zA-Z0-9._-]{0,15}\.[a-zA-Z0-9._-]{0,15})/gmi

2.0.3 (23/02/2017)

  • Possibility to scrape unstructured data with breaks (_b). More about this soooon in the readme.
  • New filters: words or w, noescapchar or nec and compact or cmp
  • Multi-filters is available now. Ex: .selector | words compact. Simply separated by spaces.
  • Disabling google libphonenumber for now

2.0.2 (15/02/2017)

  • String option to get a stringified output right away
  • Multi-groups possibility at same level (several _g wouldn't work as same property name) in frame like _g_head and _g_body for example
  • Joined arrays/lists with ["firstlist.selector", "secondlist.selector", "..."] when inline
  • Better handling of img node - automatic src attribute is output (if nothing else set)

2.0.1 (14/02/2017)

  • Fixed the non-passing tests and added all the new ones for 2.x.x updates
  • Refactoring the way data is processed for future multiple occurrences

2.0.0 (12/02/2017)

  • ⚠ Changing Type for Extractor with shortcode < instead of |
  • ⚠ filters with the shortcode |
  • Inline parameters support for "attribute", "extractor" and "parse"
  • Simple string arrays from inline selector
  • Group property to group data selectors without naming the group (children take the place of the group property "_g" or "_group" )

1.1.1 (05/02/2017)

  • Short & functional parameters ( _s, _t, _a) instead of "selector", "extractor", "attr". The idea is to easily differentiate the names of retrieved data from functional parameters.
  • Automatic handler for img selected element (automatically retrieve the img src link)
  • _parent_ selector to target the parent content
  • A regex parser with the functional parameter parse: _p (_parse works too)
  • Extractor _t: "html" feature to get back inner html of a selector
  • Added timestats to measure time spent on each node via .scrape(frame, {timestats: true})
  • Refactoring of the whole code to make it easier to evolve (DRY)
  • Update of the tests cases accordingly

1.0.0 (27/01/2017)

  • Stable version release with basic features

Contributing

Feel free to follow the procedure to make it even more awesome!

  1. Create an issue so we get the discussion started
  2. Fork it!
  3. Create your feature branch: git checkout -b my-new-feature
  4. Commit your changes: git commit -am 'Add some feature'
  5. Push to the branch: git push origin my-new-feature
  6. Submit a pull request :D

License

Gabin Desserprit - datascraper.pro
Released under MIT License


jsonframe-cheerio's Issues

How to get an attribute of a parent item and fetch its children content data too.

Hi,
Jsonframe is really a good idea, thank you for sharing it.
I'm actually trying to scrape some content but i'm facing a problem: i can't manage to take an attribute of an item with its inner data too.
Sample html:

<div class="parents-container">
    <div class="parent" data-foo="a">
        <span class="child">...</span>
        <span class="child">...</span>
        ...
    </div>
    <div class="parent" data-foo="b">
        <span class="child">...</span>
        <span class="child">...</span>
         ...
    </div>
    <div class="parent" data-foo="c">
         <span class="child">...</span>
         <span class="child">...</span>
          ...
   </div>
</div>

This is the model i can use for this (i did not test it):

var data = {
     parents: {
        _s: ".parent",
        _d: [{
               foo: ?????? //how to get data-foo attribute value of the current ".parent" element?
               children: {
                   _s: ".child",
                   _d: [{ ... }] //child data
               }
              }]
     }
}

This should return an object with a property named "parents" that corresponds to an array of objects. Each object in the array, should represent a parent item.
Like this:

{ parents: [{ foo: "...", children: [{ child data }, {child data} ... ]}, {foo: "...", children: [{...}] }, {foo: "", children: [{}], ... }

How should i write my model in order to include the data-foo attribute ?

Thanks

Code typo in docs

In this section in the docs, the example for "link": "a @href" should have a space like the other examples: "link": "a @ href". It seems small, but it screwed me up for a few minutes.

Best,
Ryan

Pls Enrich docs

Please enrich the documentation, there are many filters, extractors and other functionalities that are in the changelog but not in the sections. So we need to read through the whole changelog to find amazing filters like between(string&&string) and others...
I could help with the docs if you want it.

Thanks for this amazing repo

Issues retrieving list

I am having issues with retrieving a list from a set of a tags, I get an empty array.

  • Here is the result in a log:
    episodes: { episodes: [ {}, {} ] }

  • Here is the code to get the list:

          let $ = cheerio.load(
            '<div class="row"> <div class="col s12 m12 l8 content-left"> <div class="content-list z-depth-1"> <h5> CATEGORY: King of Mask Singer <div class="sort input-field" data-link="http://kshowonline.com/category/141/king-of-mask-singer/1" data-sort="1"> <div class="select-wrapper"><span class="caret">▼</span><input type="text" class="select-dropdown" readonly="true" data-activates="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" value="Date added (newest)" data-cip-id="cIPJQ342845639"><ul id="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" class="dropdown-content select-dropdown "><li class=""><span>Date added (newest)</span></li><li class=""><span>Date added (oldest)</span></li><li class=""><span>Name (A-Z)</span></li><li class=""><span>Name (Z-A)</span></li></ul><select class="initialized"> <option value="1" selected="selected">Date added (newest)</option> <option value="2">Date added (oldest)</option> <option value="3">Name (A-Z)</option> <option value="4">Name (Z-A)</option> </select></div></div></h5> <a href="http://kshowonline.com/kshow/8052-[engsub]-king-of-mask-singer-ep.139" title="King of Mask Singer Ep.139"> <div class="thumbnail"> <div class="video-container center-align"> <div class="img-cover"> <img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.139"> </div></div><div class="caption"> King of Mask Singer Ep.139 </div></div></a> <a href="http://kshowonline.com/kshow/8021-[engsub]-king-of-mask-singer-ep.138" title="King of Mask Singer Ep.138"> <div class="thumbnail"> <div class="video-container center-align"> <div class="img-cover"> <img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.138"> </div></div><div class="caption"> King of Mask Singer Ep.138 </div></div></a> </div></div></div>'
          );
          let frame = {
            episodes: {
              _s: ".content-list.z-depth-1 a",
              _d: [
                {
                  url: "a @ href",
                  title: "a @ title"
                }
              ]
            }
          };
          jsonframe($);

          let result = $(".col.s12.m12.l8.content-left").scrape(frame, {
            string: true
          });
          console.log("episode results: ", result);
  • Here is the formatted sample html:
<div class="row">
	<div class="col s12 m12 l8 content-left">
    	<div class="content-list z-depth-1">
        	<h5> CATEGORY: King of Mask Singer 
            	<div class="sort input-field" data-link="http://kshowonline.com/category/141/king-of-mask-singer/1" data-sort="1">
                	<div class="select-wrapper">
                    	<span class="caret">▼</span>
                        <input type="text" class="select-dropdown" readonly="true" data-activates="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" value="Date added (newest)" data-cip-id="cIPJQ342845639">
                        <ul id="select-options-1b022f21-4a87-3fa8-f2be-3a078ea7f4a1" class="dropdown-content select-dropdown ">
                          <li class=""><span>Date added (newest)</span></li>
                          <li class=""><span>Date added (oldest)</span></li>
                          <li class=""><span>Name (A-Z)</span></li>
                          <li class=""><span>Name (Z-A)</span></li>
                        </ul>
                        <select class="initialized">
                          <option value="1" selected="selected">Date added (newest)</option>
                          <option value="2">Date added (oldest)</option>
                          <option value="3">Name (A-Z)</option>
                          <option value="4">Name (Z-A)</option>
                        </select>
                    </div>
                </div>
            </h5>
           	<a href="http://kshowonline.com/kshow/8052-[engsub]-king-of-mask-singer-ep.139" title="King of Mask Singer Ep.139"> 
            	<div class="thumbnail">
                	<div class="video-container center-align">
                    	<div class="img-cover">
                        	<img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.139">
                        </div>
                    </div>
                 	<div class="caption"> King of Mask Singer Ep.139 </div>
                </div>
            </a>
            <a href="http://kshowonline.com/kshow/8021-[engsub]-king-of-mask-singer-ep.138" title="King of Mask Singer Ep.138">
            	<div class="thumbnail">
                	<div class="video-container center-align">
                    	<div class="img-cover">
                        	<img src="https://c1.staticflickr.com/1/320/18368860329_b2b17d3fb4_n.jpg" alt="King of Mask Singer Ep.138">
                        </div>
                    </div>
                    <div class="caption"> King of Mask Singer Ep.138 </div>
                </div>
            </a>
        </div>
    </div>
</div>

Compatibility with browser and jQuery

An error is thrown when jsonframe-cheerio is run in the browser with the latest version of jQuery (v3.3.1): Uncaught TypeError: Cannot read property 'toLowerCase' of undefined.

The solution I found was to wrap this line in a try/catch block:

  if (!res.extractor && !res.attribute && $(node).find(res.selector)['0'] && $(node).find(res.selector)['0'].name.toLowerCase() === 'img') {
    res.attribute = 'src';
  }

Specifically, $(node).find(res.selector)['0'] was defined, but the property name in $(node).find(res.selector)['0'].name was not defined, which caused the error.
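
A guard that avoids the try/catch could read the tag name from whichever property the node exposes. This is a sketch, not a tested patch: tagNameOf is a hypothetical helper, and it relies on Cheerio nodes carrying name while browser DOM elements carry nodeName and tagName.

```javascript
// Hypothetical helper: lowercase tag name of a Cheerio node or DOM element,
// or an empty string when nothing usable is there.
function tagNameOf(node) {
  if (!node) return ''
  // DOM properties first: on a browser element, "name" can be an HTML attribute.
  const raw = node.nodeName || node.tagName || node.name || ''
  return String(raw).toLowerCase()
}

console.log(tagNameOf({ name: 'img' }))     // Cheerio-style node  -> "img"
console.log(tagNameOf({ nodeName: 'IMG' })) // browser DOM element -> "img"
console.log(tagNameOf(undefined))           // nothing matched     -> ""
```

The condition quoted above would then compare tagNameOf($(node).find(res.selector)['0']) with 'img' instead of calling toLowerCase() on a possibly undefined name.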

Also, I made sure to go through all the samples and make sure they work. The fix in #2 works for that case. In addition, these changes also needed to be made:

https://github.com/gahabeen/jsonframe-cheerio#extractor
- "email": "[itemprop=email] < phone",
+ "email": "[itemprop=email] < mail",

https://github.com/gahabeen/jsonframe-cheerio#filter
-	"email1": "[itemprop=email] < phone | uppercase",
-	"email2": "[itemprop=email] < phone | capitalize"
+	"email1": "[itemprop=email] < mail | uppercase",
+	"email2": "[itemprop=email] < mail | capitalize"

Lastly, the timestats option was inoperable. The timestats variable was not passed in the third object argument in getDataFromNodes(…), and was defaulted to false.
gTime, while defined at the beginning, was undefined when it was used:

    if (result['_value']) {
      result['_timestat'] = timeSpent(gTime); // gTime = undefined
    }

I did not find a fix for timestats.

How to get text without nested children's texts

How can I get just "This is some text"? and not "This is some textFirst span textSecond span text"?

<li id="listItem">
    This is some text
    <span id="firstSpan">First span text</span>
    <span id="secondSpan">Second span text</span>
</li>

Example:

let cheerio = require('cheerio');
let $ = cheerio.load(`
<li id="listItem">
    This is some text
    <span id="firstSpan">First span text</span>
    <span id="secondSpan">Second span text</span>
</li>`)

let jsonframe = require('jsonframe-cheerio')
jsonframe($)

let frame = {"text": "li#listItem"}
console.log( $('body').scrape(frame, { string: true } ))
// {
//   "text": "This is some text First span text Second span text"
// }
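
The frame syntax documented above does not show a way to exclude child text. Outside the frame, one option is to keep only the element's direct text nodes. The sketch below does that on a hand-built node tree shaped like the nodes Cheerio exposes (type, data, children); ownText is a hypothetical helper, not part of the plugin.

```javascript
// Hypothetical helper: concatenate only the direct text children of a node.
function ownText(node) {
  return (node.children || [])
    .filter((child) => child.type === 'text')
    .map((child) => child.data)
    .join('')
    .trim()
}

// Hand-built stand-in for the parsed <li id="listItem"> element.
const listItem = {
  type: 'tag',
  name: 'li',
  children: [
    { type: 'text', data: '\n    This is some text\n    ' },
    { type: 'tag', name: 'span', children: [{ type: 'text', data: 'First span text' }] },
    { type: 'text', data: '\n    ' },
    { type: 'tag', name: 'span', children: [{ type: 'text', data: 'Second span text' }] },
    { type: 'text', data: '\n' }
  ]
}

console.log(ownText(listItem)) // "This is some text"
```

With a loaded Cheerio instance, the same helper should accept the raw node, for example ownText($('#listItem')[0]).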
