x-ray's Introduction

x-ray

var Xray = require('x-ray')
var x = Xray()

x('https://blog.ycombinator.com/', '.post', [
  {
    title: 'h1 a',
    link: '.article-title@href'
  }
])
  .paginate('.nav-previous a@href')
  .limit(3)
  .write('results.json')

Installation

npm install x-ray

Features

  • Flexible schema: Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.

  • Composable: The API is entirely composable, giving you great flexibility in how you scrape each page.

  • Pagination support: Paginate through websites, scraping each page. X-ray also supports a request delay and a pagination limit. Scraped pages can be streamed to a file, so if there's an error on one page, you won't lose what you've already scraped.

  • Crawler support: Start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages.

  • Responsible: X-ray has support for concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly (see the combined sketch after this list).

  • Pluggable drivers: Swap in different scrapers depending on your needs. Currently supports HTTP and PhantomJS drivers. In the future, I'd like to see a Tor driver for requesting pages through the Tor network.
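
A minimal sketch combining the responsibility controls above. The values are illustrative, and each method is documented individually in the API section below:

var Xray = require('x-ray')

var x = Xray()
x.concurrency(2)     // at most two requests in flight at once
x.throttle(10, '1s') // at most ten requests per second
x.delay('1s', '5s')  // wait one to five seconds between requests
x.timeout(30000)     // give up on a request after 30 seconds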

Selector API

xray(url, selector)(fn)

Scrape the url for the given selector, returning the result in the callback fn. The selector takes an enhanced jQuery-like string that can also select on attributes. The syntax for selecting an attribute is selector@attribute. If you do not supply an attribute, the selector defaults to the element's innerText.

Here are a few examples:

  • Scrape a single tag
xray('http://google.com', 'title')(function(err, title) {
  console.log(title) // Google
})
  • Scrape a single class
xray('http://reddit.com', '.content')(fn)
  • Scrape an attribute
xray('http://techcrunch.com', 'img.logo@src')(fn)
  • Scrape innerHTML
xray('http://news.ycombinator.com', 'body@html')(fn)

xray(url, scope, selector)

You can also supply a scope to each selector. In jQuery, this would look like $(scope).find(selector).
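
For example, a minimal sketch (the URL and selectors are illustrative):

var Xray = require('x-ray')
var x = Xray()

// 'h2' is resolved relative to the '.post' scope,
// roughly $('.post').find('h2') in jQuery
x('http://example.com', '.post', 'h2')(function(err, heading) {
  console.log(heading) // text of the first h2 inside the first .post
})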

xray(html, scope, selector)

Instead of a url, you can also supply raw HTML and all the same semantics apply.

var html = '<body><h2>Pear</h2></body>'
x(html, 'body', 'h2')(function(err, header) {
  header // => Pear
})

API

xray.driver(driver)

Specify a driver to make requests through. A sketch follows the list. Available drivers include:

  • request - A simple driver built around request. Use this to set headers, cookies or http methods.
  • phantom - A high-level browser automation library. Use this to render pages, when elements need to be interacted with, or when elements are created dynamically with JavaScript (e.g. Ajax calls).
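
A sketch of swapping in the PhantomJS driver. The x-ray-phantom package name and its zero-argument constructor are assumptions here; consult the driver's own docs for the options it accepts:

var Xray = require('x-ray')
var phantom = require('x-ray-phantom') // driver package name assumed

var x = Xray()
x.driver(phantom()) // route all requests through PhantomJS

x('http://example.com', 'title')(function(err, title) {
  console.log(title) // title as rendered after client-side JavaScript runs
})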

xray.stream()

Returns a Readable Stream of the data. This makes it easy to build APIs around x-ray. Here's an example with Express:

var app = require('express')()
var x = require('x-ray')()

app.get('/', function(req, res) {
  var stream = x('http://google.com', 'title').stream()
  stream.pipe(res)
})

xray.write([path])

Stream the results to a path.

If no path is provided, then the behavior is the same as .stream().
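
A sketch of both forms. The second one assumes that a path-less .write(), like .stream(), returns a readable stream you can pipe anywhere:

var fs = require('fs')
var Xray = require('x-ray')
var x = Xray()

// with a path: x-ray manages the file stream for you
x('http://google.com', 'title').write('title.json')

// without a path: pipe the returned stream yourself
x('http://google.com', 'title')
  .write()
  .pipe(fs.createWriteStream('title-copy.json'))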

xray.then(cb)

Constructs a Promise and invokes its then function with the callback cb. Be sure to call then() as the last step of the method chain, since the other methods are not promisified.

x('https://dribbble.com', 'li.group', [
  {
    title: '.dribbble-img strong',
    image: '.dribbble-img [data-src]@data-src'
  }
])
  .paginate('.next_page@href')
  .limit(3)
  .then(function(res) {
    console.log(res[0]) // prints first result
  })
  .catch(function(err) {
    console.log(err) // handle error in promise
  })

xray.paginate(selector)

Select a url from a selector and visit that page.

xray.limit(n)

Limit the amount of pagination to n requests.

xray.abort(validator)

Abort pagination if the validator function returns true. The validator function receives two arguments (a sketch follows the list):

  • result: The scrape result object for the current page.
  • nextUrl: The URL of the next page to scrape.
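
A sketch of a validator. The URL and selectors are illustrative, and result is assumed to be an array because an array-of-objects selector is used:

var Xray = require('x-ray')
var x = Xray()

x('http://example.com/articles', '.post', [{ title: 'h1' }])
  .paginate('.next@href')
  .abort(function(result, nextUrl) {
    // stop once a page yields no results
    if (result.length === 0) return true
    // or once pagination would leave the site
    if (nextUrl.indexOf('example.com') === -1) return true
    return false
  })
  .write('results.json')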

xray.delay(from, [to])

Delay the next request between from and to milliseconds. If only from is specified, delay exactly from milliseconds.

var x = Xray().delay('1s', '10s')

xray.concurrency(n)

Set the request concurrency to n. Defaults to Infinity.

var x = Xray().concurrency(2)

xray.throttle(n, ms)

Throttle the requests to n requests per ms milliseconds.

var x = Xray().throttle(2, '1s')

xray.timeout(ms)

Specify a timeout of ms milliseconds for each request.

var x = Xray().timeout(30)

Collections

X-ray also has support for selecting collections of tags. While x('ul', 'li') will only select the first list item in an unordered list, x('ul', ['li']) will select all of them.

Additionally, X-ray supports "collections of collections" allowing you to smartly select all list items in all lists with a command like this: x(['ul'], ['li']).
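
A side-by-side sketch of the three forms, following the README's own notation (the URL and fn are illustrative):

var Xray = require('x-ray')
var x = Xray()

// first <li> of the first <ul> only
x('http://example.com', 'ul', 'li')(fn)

// every <li> in the first <ul>
x('http://example.com', 'ul', ['li'])(fn)

// every <li> across every <ul> on the page
x('http://example.com', ['ul'], ['li'])(fn)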

Composition

X-ray becomes more powerful when you start composing instances together. Here are a few possibilities:

Crawling to another site

var Xray = require('x-ray')
var x = Xray()

x('http://google.com', {
  main: 'title',
  image: x('#gbar a@href', 'title') // follow link to google images
})(function(err, obj) {
  /*
  {
    main: 'Google',
    image: 'Google Images'
  }
*/
})

Scoping a selection

var Xray = require('x-ray')
var x = Xray()

x('http://mat.io', {
  title: 'title',
  items: x('.item', [
    {
      title: '.item-content h2',
      description: '.item-content section'
    }
  ])
})(function(err, obj) {
  /*
  {
    title: 'mat.io',
    items: [
      {
        title: 'The 100 Best Children\'s Books of All Time',
        description: 'Relive your childhood with TIME\'s list...'
      }
    ]
  }
*/
})

Filters

Filters can be specified when creating a new Xray instance. To apply filters to a value, append them to the selector using |.

var Xray = require('x-ray')
var x = Xray({
  filters: {
    trim: function(value) {
      return typeof value === 'string' ? value.trim() : value
    },
    reverse: function(value) {
      return typeof value === 'string'
        ? value
            .split('')
            .reverse()
            .join('')
        : value
    },
    slice: function(value, start, end) {
      return typeof value === 'string' ? value.slice(start, end) : value
    }
  }
})

x('http://mat.io', {
  title: 'title | trim | reverse | slice:2,3'
})(function(err, obj) {
  /*
  {
    title: 'oi'
  }
*/
})

Examples

In the Wild

  • Levered Returns: Uses x-ray to pull together financial data from various unstructured sources around the web.

Backers

Support us with a monthly donation and help us continue our activities. [Become a backer]

Sponsors

Become a sponsor and get your logo on our website and on our README on GitHub with a link to your site. [Become a sponsor]

License

MIT

x-ray's Issues

first example with a[href] weird values

Hi,

I just tried your first example on github.com/stars/matthewmueller.

It seems getting the a[href] value returns weird values starting with null//null/:

xray('github.com/stars/matthewmueller')
  .select([{
    $root: '.repo-list-item',
    link: '.repo-list-name a[href]'
  }])
  .run(function(err, data) {
    console.log(data)
  })

output:

[ { link: 'null//null/lapwinglabs/static' },
  { link: 'null//null/EvanMarkHopkins/duo-jsx' },
  { link: 'null//null/google/bazel' },
  ...

Example using regular expressions

Hi, I am wondering if you could help me figure out whether regexes are supported and, if so, how to use them. So far I've tried with this URL http://www.finanzas.df.gob.mx/sma/detallePlaca.php?placa=183YTP:

var xray = require('..');
var url = 'http://www.finanzas.df.gob.mx/sma/detallePlaca.php?placa=183YTP';
xray(url)
  .select([{
      $root: '#tablaDatos',
      folio: new RegExp(/./)
  }])
  .run(function(err, infraccion) {
    console.log(infraccion);
  });

also

var xray = require('..');
var url = 'http://www.finanzas.df.gob.mx/sma/detallePlaca.php?placa=183YTP';
xray(url)
  .select([{
      $root: '#tablaDatos',
      folio: /./
  }])
  .run(function(err, infraccion) {
    console.log(infraccion);
  });

But still getting nothing :(

Cheerio seems more useful

This sure sounded cool, but when it comes to practical use, I find Cheerio to be much more effective (thanks for Cheerio!). Unless I'm missing something, I don't see a (non-verbose) way to do something as simple as cleaning a returned string, for example:

        x(options.raw, {
            main: '.page-title h1'
        })(function(err, obj) {
            // obj.main = "\n     Scraped String      "
        });

With Cheerio, you can simply call .text().trim() on the returned cheerio object. Is there a way to accomplish the same thing using the collection format above...preferably at the property level rather than in the callback? Thx

write function and stream

First of all, thanks for your x-ray library, it rocks!!

For a project I'm using the xray.write() function, which returns a WritableStream, because I need to parse a lot of web pages and return a very big JSON array (up to a million entries); using xray.run() exhausted memory quickly.
For this I want to use Koa.js to retrieve all the data when I hit a specific URL, something like this:

var xray = require('x-ray');
var match = {
    paginates: 'ul#paging li:nth-last-child(2) a[href]',
    links: [ '.list-lbc a:not(.alertsLink)[href]' ]
};

// this function return a very big json array...to the out.json file
var getLinks = function *getLinks(url) {
   return xray(url)
      .select(match.links)
      .paginate(match.paginates)
      .write('out.json');
};

Do you know how to pipe or redirect the writable stream to this.body of Koa.js?

// something like this, i need to hit the '/links' url of a dedicated server to retrieve 
// this very big json array.

app.get('/links', function *() {
   this.body = yield getLinks(url);
});

Thanks in advance.

Select innerHTML

As far as I can see, it is impossible to get the innerHTML of a selected node. Any thoughts on how to implement this feature?

Issue handling URL from XML

I've been digging on this for a while, but I'm pretty stumped right now.

<item>
    <title>Greece's Varoufakis says QE to fuel unsustainable equity rally</title>
    <link>http://feeds.reuters.com/~r/reuters/businessNews/~3/EaR-N8x3VzU/story01.htm</link>
    <description>CERNOBBIO, Italy (Reuters) - The European Central Bank's bond purchases will create an unsustainable stock market rally and are unlikely to boost euro zone investments...</description>
</item>

With .select(['item title']) I'm able to get all of the titles. With .select(['item description']) I'm able to get all of the descriptions. But with .select(['item link']) I only get an array of empty strings back. The number of empty strings equals the number of items in the page.

I'm going to keep digging in, but I think I already have keyboard marks on my forehead. =|

I've already tried this using a $root with a link: 'link' attribute, but it's the same result.

This is the specific URL I'm scraping: http://feeds.reuters.com/reuters/businessNews?format=xml

HTML Driver

I don't know if there's an easier way to use an html string instead of a URL for the source of scraping, but I was able to get it working by creating the driver/plugin below.

/node_modules/x-ray-html/index.js

/**
* Module Dependencies
*/

// Any module dependencies go here

/**
* Export the default `driver`
*/

module.exports = driver;

/**
* Initialize the html
* `driver` to support
* html strings instead of
* just a url
* @param {Object} opts
* @return {Function} plugin
*/

function driver(opts) {

    return function plugin(xray) {

        xray.request = function request(htmlString, fn) {
            return fn(null, htmlString);
        };

        return xray;
    }
}

Usage

var xrayHtml = require('x-ray-html');

xray('<p class="title"><strong>It Works!</strong></p>')
    .use(xrayHtml())
    .select('.title')
    .run(function(err, extraction) {

        // Return on error
        if (err) return console.log(err);

        // Log extraction results to console
        console.log(extraction);
    });

no result when a prepare function returns a number with value 0

When a prepare function returns a number with a value of 0, the field will be entirely absent.

Consider this example:

<div class="servers">
  <div class="server"> <h1>server 1</h1> <span class="users">10</span> </div>
  <div class="server"> <h1>server 2</h1> <span class="users">20</span> </div>
  <div class="server"> <h1>server 3</h1> <span class="users">0</span> </div> 
  <div class="server"> <h1>server 4</h1> <span class="users">15</span> </div>
</div>
xray('http://localhost:3000')
  .prepare('parseInt', parseInt)
  .select([{
    $root: '.server',
    name: 'h1',
    users: '.users | parseInt'
  }]);

After running this the result is:

[ { "name": "server 1", "users": 10 },
  { "name": "server 2", "users": 20 },
  { "name": "server 3" },
  { "name": "server 4", "users": 15 } ]

I would expect the result for server 3 to include "users": 0.

Any idea what might be going on?

formatters should have an `end` function

Formatter currently just takes a single function with which you can convert JSON objects into RSS, XML, HTML, etc.

It should also take an end function that's responsible for gluing the array together.

xray('http://google.com')
  .format(rss())
  .run(function(err, rss) {
    // rss is a feed, not an array of items
  }) 

Support Proxy

How do I scrape from behind a proxy (with login and password)?
Thanks,
AMi44

Saving data to mongodb finished early

I am trying to scrape some URLs and then save them to MongoDB, but it finishes early; if I just write out.json, it doesn't end until everything is scraped. As you can see below, after run() I save the data to MongoDB in a for loop. Why does it finish early? The pagination is correct, because when I check out.json it contains all the data I need.

app.get('/crawlAll', function(req, res){
    xray('http://jandan.net/pic')
      .select([{
        $root: 'div.row',
        img: 'div.text p img[src]'
      }])
      .paginate('.cp-pagenavi a:last-child[href]')
      .limit(5)
      // .write('out.json');
        .run(function(err, imgs){
            res.send(200, imgs);
            for (var i = 0; i < imgs.length; i++) {
                var img = imgs[i].img;
                var newImg = Img({
                    url: img
                });
                newImg.save(function(err, success){
                    if (success) {
                        res.send(200);
                    } else {
                        res.send(400, err);
                    }
                });
            };
        });
});

Callback already called error.

Edit: this is probably related to this PR: #53

This might not be a problem with x-ray per se, but I could not figure out what is going on.
async throws the following error:

new Error("Callback was already called.")

It looks like x-ray is calling the end function twice, as the innerCallback is getting invoked twice.

var async = require('async');
var Xray = require('x-ray');

var scrapingModel = [{
        url: "https://dribbble.com",
        selector: "li.group",
        paginator: ".next_page@href",
        model: {
            title: '.dribbble-img strong',
            image: '.dribbble-img [data-src]@data-src'
        }
    },
    {
        url: "https://github.com/lapwinglabs/x-ray/watchers",
        selector: ".follow-list-item",
        paginator: ".next_page@href",
        model: {
            fullName: '.vcard-username'
        }
    }];

async.each([1,2,3], function(item, outerCallback){
    async.each(scrapingModel, function(item, innerCallback) {
        var x = Xray();

        x(item.url, item.selector, [item.model])
            .paginate(item.paginator)
            .limit(3)
        (function(err, items) {
            innerCallback();
        });
    }, outerCallback); // signal the outer loop once this model set finishes
});

Fix scraping without a $root

Given the following page HTML...

<a href="page1.html">Page 1</a>
<a href="page2.html">Page 2</a>

... how would I extract the link text and hrefs?

xray('lots-of-links.com')
  .select([{
    $root: 'a',
    text: ??,
    href: ??
  }])
  .run(function (error, links) {
    // I would like an array on links
    // [
    //   { 
    //     text: 'Page 1',
    //     href: 'page1.html'
    //   },
    //   {
    //     ...
    //   }
    // ]
  });

I don't know how to reference the $root node. :(

Named attributes

Hoping I'm just missing the syntax on this one. I'm trying to scrape a poorly structured site and have to rely on the "align" attribute of some table "td" tags to figure out what is what. In jQuery, I could select
$('td[align="left"]') or $('td[align="right"]')
to get the correct element. Does anything like that exist in x-ray? I see if I do something like
tdAlign: ['td[align]']
I get an array of all of those "align" attributes. Is there a way to feed to another function to continue scraping children of elements when the attribute is "left" or "right"? Thanks.

get innerHTML of selected element

Hi
I need to get the innerHTML of a selected element on a page. Can x-ray do this? How? I didn't find anything in the documentation.

Thanks for your attention.

CSV

I LOVE this project more than all others, second only to cheerio, which I love the best.

Is there a simple way to export to CSV? JSON can get too big for huge datasets.
Thanks!

An href referencing the root of a site produces null//null/rest/of/the/url

This link in the HTML:

<div class="promo_img">
    <a href="/product/studley-moms_night_out_bouquet/display">
        <img id="productImageImg_3" 
        src="http://blah/blah/not/important" alt="Studley's Mom's Night Out Bouquet">
    </a>
</div>

Selected with this:

  link: '.promo_img a[href]'

Gives this:

    "link": "null//null/product/studley-moms_night_out_bouquet/display"

I'm not sure whether this is an x-ray or x-ray-select issue.

1.0.0

This issue will be used to track the changes that are coming to X-ray.

High-level goals

More spider, less worm

Right now x-ray can paginate, but it's very limited. It's meant for things like paging through Google results, Dribbble results, etc, where the templates are the same but with different data.

We want to allow x-ray to crawl multiple different pages simultaneously and join all the data together seamlessly.

Do not limit the power of powerful drivers

Right now the current x-ray severely limits the capabilities of drivers like Segment's phantom driver, nightmare. You can run Nightmare up front, but after you start paginating you cannot run any more Nightmare scripts. Nightmare scripts should just work in X-ray, or at least be easy to add.

Thoughts

Xray instance per page

Each page should initialize a new x-ray instance. X-ray instances are like nodes in a graph. There will need to be some high-level manager to deal with things like concurrency, rate-limiting, and throttling.


Interested in hearing your thoughts. Now's the time to chime in with your use cases, wishes and ideas.

Paginate appears to cause multiple calls to composed function

Here's a partial example of the code I'm using (it's a fairly standard shopping page we are scraping for their benefit)

var x = new xray();
x(url, 'ul.product-grid', ['a@href'])
    .paginate('.next-page a@href')
    (function(err, array) {
        console.log('--- BEGIN ARRAY LIST ---');
        console.log(array);
        console.log('--- END ARRAY LIST ---');
        // ... launch the other scraper that we use for page manipulation
        // (I'd love to use the phantom-xray driver but I've not had any luck
        // with it, and the lack of .select (for select dropdowns) is a
        // dealbreaker right now)
    });

What I end up seeing is my begin log message, the first list of array hrefs, and the end message... then it processes the first couple of items (which happen to be skipped), and when it hits the first item using the other scraper I see the Begin Array List repeated but empty (there are items on the 2nd page).

I'm not using the x-ray-phantom driver so I presume it's using the standard HTML driver, the other scraper uses Phantom. I can work with what I'm sure are bulk issues with the other package but I'm trying to determine both how paginate calls the composed function at the end (does it make 1 call per page?) and if this is the case, what is the proper way to call a function once the entire pagination process is complete (and my array of items is done)? I had the version 1 approach down pretty well, and given the newness of version 2 it's hard to separate "working as intended" from "possibly a bug". Any information you have would be very welcome.

(BTW side note: if you ever were able to put together a page interaction example involving the x-ray-phantom driver and clicking/selecting [assuming it's even a use you intend] it would be a great thing. Appreciate your work on this. Thank you)

EDIT: Caught a bad selector in paginate, however it didn't make a difference.

allowing to select n-th occurence

I need the following horrible selector to parse some late nineties html: .nestedTable[1] tr[0] td[1]. This doesn't seem to work currently. Is that correct?

If there's a need (beyond my own) I could try to get this in. It shouldn't be too hard, since I'm pretty sure cheerio already supports this, which we can piggy-back on.

Any pointers appreciated.

JSON support

It should also work with JSON endpoints, maybe using JSONmask or something similar for selection.

Support non UTF-8

It seems that x-ray does not support non-UTF-8 encodings yet.

It would be useful to be able to send options to superagent.

.filter and .map methods ?

Hi,

What about .filter and .map methods to manipulate the resulting data?

Your current example

xray('http://mat.io')
  .select(['.Header-list-item a'])
  .run(function(err, array) {
    console.log(array)
    // array is [ 'Github', 'Twitter', 'Lapwing', 'Email' ]
  });

With filter

xray('http://mat.io')
  .select(['.Header-list-item a'])
  .filter(function(e) {
    return e.length < 7
  })
  .run(function(err, array) {
    console.log(array)
    // array is [ 'Github', 'Email' ]
  });

With map

xray('http://mat.io')
  .select(['.Header-list-item a'])
  .map(function(e) {
    return e + '!'
  })
  .run(function(err, array) {
    console.log(array)
    // array is [ 'Github!', 'Twitter!', 'Lapwing!', 'Email!' ]
  });

Useful to manipulate data just before .write('out.json')

It can probably be done with .use(plugin), but I think this kind of method would be helpful built in.

What do you think?

Send user-agent?

Is it possible to send a user-agent header with the xray request? The site I'm looking at responds with an empty document unless it gets one.

Add promise support

import Xray from 'x-ray';

let scraper = Xray();

scraper(url, {
    items: scraper('.scope', [
        {
            link: '.link@href'
        }
    ])
})
.then(
     result => {
        // All done
     },
     error => {
        // There're errors
     }
);

Get attr value with named attr

I'm unsuccessfully trying to extract meta tags from a page. For instance, the following returns no description. Any idea what is wrong with my selector?

var url = "http://techcrunch.com";
var selection = {
  $root: 'head',
  title: 'title',
  description: 'meta[name="description"][content]'
};

xray(url)
.select(selection)
.run(function(err, res) {
  console.log(err, res);
});

Process Multiple URLs

I had the need to scrape multiple URLs and simply output the combined json result with express when complete.
I originally asked Matt about this and he requested that I open an issue in case anyone else needed an example. After some more testing, I came up with an example using the async npm module. I'm sure there are many other ways, but this seems to work fine. Hope it helps someone.

app.post('/process-urls', function (req, res) {
    var urls = [
      {
        url: 'http://google.com',
        selectors: {
          title: 'title'
        }
      },
      {
        url: 'http://github.com',
        selectors: {
          title: 'title'
        }
      }
    ];

    var sendResponse = function (result) {
      res.json(result);
    };

    process(urls, sendResponse);
  });

  var process = function (urls, sendResponseCallback) {
    var asyncCallback = function (err, obj) {
      return obj;
    };

    async.parallel([
      x(urls[0].url, urls[0].selectors)(asyncCallback),
      x(urls[1].url, urls[1].selectors)(asyncCallback)
    ], function (err, results) {
      if (!err) {
        sendResponseCallback(results);
      }
    });
  };

Way to escape the $root-scope?

I'm using something like:

.select([{
  $root: '.oJobTile',
  id: ".oSaveLink [data-jobid]",
  title: '.oRowTitle a',
  link: '.oRowTitle a[href]',
}])

And I would like to include a property that's outside of the $root scope.
E.g.:

.select([{
  $root: '.oJobTile',
  id: ".oSaveLink [data-jobid]",
  title: '.oRowTitle a',
  link: '.oRowTitle a[href]',
  curpage: 'body .curpage' // where body .curpage is outside of the $root scope
}])

Is something like this possible?

Browserify support?

I'm getting an error after browserifying a script that uses x-ray, and it seems the superagent module is at fault here.
Are there plans for browserify support in x-ray?

Error:

var superagent = Superagent.agent(opts);
                              ^
TypeError: Object function request(method, url) {
  // callback
  if ('function' == typeof url) {
    return new Request('GET', method).end(url);
  }
  // url first
  if (1 == arguments.length) {
    return new Request('GET', method);
  }

  return new Request(method, url);
} has no method 'agent'
    at driver (/home/ec2-user/yara/snowBro.js:3912:31)

A way to get $root attributes

I'd like to extract the [href] of the root element (item) in a property called link:

$root: '.item',
link: '[href]'

What is the correct syntax?

empty URL in some instance composition on a collection killing rest of the capture

I am using the composition of instances to go to other pages and pick up data from there for a collection, something like this:

var Xray = require('x-ray');
var x = Xray();

x('http://example.com', '.abc', [{
  main: 'title',
  image: x('.pqr a@href', {
    key1: '.val1',
    key2: '.val2',
    key3: '.val3'
  }), // follow link to google images
}])(function(err, obj) {
  console.log(err, obj)
})

I am picking the URLs from a certain element with the given selector .pqr a@href, but some of those URL values are empty, and when the x() function is called with an empty URL it gives the error:
[Error: is not a URL]

Because of this, I am not able to get the captured values for the rest of the URLs for which .pqr a@href is not empty but a valid URL. I am not able to find a way to avoid the x() instance calls on empty URLs.

One possible solution: a call to x() with an empty URL could just quietly die instead of throwing an error, which kills the rest of the instance calls.

I would highly appreciate it if someone could help me with this. Thanks.

Example not working for me

The provided example, github-stars.js, is not working for me. I did:

  • git clone https://github.com/lapwinglabs/x-ray.git
  • npm update
  • node examples/github-stars.js

What I get is an error:

module.js:340
    throw err;
          ^
Error: Cannot find module 'array'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/home/xyz/Desktop/x-ray/examples/github-stars.js:6:13)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)

Maybe it's my config, node version or something else related to my environment's setup?

Should work with german umlauts

In my resulting console or JSON file output, all umlauts like öäüß are destroyed.
The site is encoded with charset=iso-8859-1.

passing in an external Cheerio instance to x-ray

Is it possible to pass in an external cheerio instance, e.g. $?
Use case: I'm authenticating using Request + Cheerio (following this tutorial), since authentication using x-ray is not that easy (1).

Now I've done var $ = cheerio.load(body);, leaving me with the correct cheerio context after logging in. It would be great if I could easily pass this cheerio context to x-ray so I could leverage x-ray.select etc.

Is that possible?

1): I've seen this issue with the recommendation to use PhantomJS and try to authenticate through Nightmare. I didn't have much luck going in that direction. Besides, it would be great not to have to resort to this, for performance reasons, having to build PhantomJS on a system just for authentication, etc.

Unable to store values as Integer

Hello,
By default X-ray outputs data only as strings. Certain values, like location coordinates that must be numeric, get saved with quotes in the output JSON file. Parsing on the console shows 19.00 & 72.00; however, in the file they get saved as strings:

"loc": {
  "type": "Point",
  "coordinates": [
    "19.00",
    "72.00"
  ]
},

Select element by ID

I've tried selecting elements by class using .class.

How can I select elements that have no class but only an id?

Support pagination based on function

What if there is no link (and therefore no CSS selector) to the next page, but I know how to get to it?

e.g. ?page=1, ?page=2

Could I use something like .paginate(function(num) { return '?page=' + num; })?

can't run tests on the current master branch

I am trying to run the tests on the master branch and I get the following:

x-ray git:(master) ✗ make test
./x-ray/test/x-ray.js:103
  it('should be yieldable', function *() {
                                     ^
SyntaxError: Unexpected token *
    at exports.runInThisContext (vm.js:73:16)
    at Module._compile (module.js:443:25)
    at Object.Module._extensions..js (module.js:478:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Module.require (module.js:365:17)
    at require (module.js:384:17)
    at ./x-ray/node_modules/mocha/lib/mocha.js:192:27
    at Array.forEach (native)
    at Mocha.loadFiles (./x-ray/node_modules/mocha/lib/mocha.js:189:14)
    at Mocha.run (./x-ray/node_modules/mocha/lib/mocha.js:422:31)
    at Object.<anonymous> (./x-ray/node_modules/mocha/bin/_mocha:398:16)
    at Module._compile (module.js:460:26)
    at Object.Module._extensions..js (module.js:478:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:501:10)
    at startup (node.js:129:16)
    at node.js:814:3
make: *** [test] Error 1

Authentication?

How about logging in to get cookies before scraping? Or at least sending cookies?

Support request headers (for login/cookies)

Is there a way to send a session_id as a cookie, or use this within a callback from a Request.post?

I'm trying to scrape a site after POSTing form data using Request; x-ray would make crawling the resulting links/pages a lot easier!
