arboleya / snapshooter
Simple crawler for Single Page Applications
It would be nice if apps didn't need to set up anything in order to be crawled. This would be a kick-ass, must-have feature.
When testing snapshooter with a local address, I got an error after a timeout occurred.
• ERROR http://hems.local:11235/store-locator took too long to render, skipping
/usr/local/lib/node_modules/snapshooter/src/core/crawler.coffee:111
do @ph.exit
^
TypeError: Cannot call method 'exit' of undefined
at Crawler.module.exports.Crawler.error (/usr/local/lib/node_modules/snapshooter/src/core/crawler.coffee:111:7)
at module.exports.Crawler.keep_on_checking (/usr/local/lib/node_modules/snapshooter/src/core/crawler.coffee:94:17)
at Proto.apply (/usr/local/lib/node_modules/snapshooter/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:123:13)
at Proto.handle (/usr/local/lib/node_modules/snapshooter/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:99:19)
at D.dnode.handle (/usr/local/lib/node_modules/snapshooter/node_modules/phantom/node_modules/dnode/lib/dnode.js:140:21)
at D.dnode.write (/usr/local/lib/node_modules/snapshooter/node_modules/phantom/node_modules/dnode/lib/dnode.js:128:22)
at SockJSConnection.ondata (stream.js:38:26)
at SockJSConnection.EventEmitter.emit (events.js:88:17)
at Session.didMessage (/usr/local/lib/node_modules/snapshooter/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/transport.js:207:25)
at WebSocketReceiver.didMessage (/usr/local/lib/node_modules/snapshooter/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/trans-websocket.js:109:40)
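The trace above shows `exit` being called on an undefined PhantomJS handle inside the error path, so the timeout itself ends up crashing the process. A minimal sketch of a guard, assuming the handle may not be set yet when the error handler runs; `safeExit` and its `log` parameter are hypothetical names, not part of snapshooter:

```javascript
// Hypothetical helper: `ph` stands for the PhantomJS handle that the
// trace reports as undefined. Nothing here is snapshooter's own API.
function safeExit(ph, log = console.error) {
  if (ph && typeof ph.exit === "function") {
    ph.exit(); // handle exists, shut it down normally
    return true;
  }
  // Handle was never created (or is already gone): report and move on
  log("PhantomJS handle is not available, nothing to exit");
  return false;
}
```

With a guard like this the "took too long to render, skipping" path could skip the page without throwing a second error.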
Besides the existing signature:
snapshooter [http://your_url] [output_folder]
It'd be nice to be able to crawl local websites without the HTTP protocol:
snapshooter [local_html_file] [output_folder]
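One way to accept both forms is to turn a local path into a `file://` URL before handing it to the renderer. This is only a sketch, assuming the renderer can open `file://` URLs; `toCrawlTarget` is a hypothetical name, while `path.resolve` and `url.pathToFileURL` are standard Node APIs:

```javascript
const path = require("path");
const { pathToFileURL } = require("url");

// Hypothetical helper, not part of snapshooter: accept a URL or a local path.
function toCrawlTarget(input) {
  // Anything that already starts with "scheme://" is passed through untouched
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(input)) return input;
  // Otherwise treat it as a file path, resolved against the current directory
  return pathToFileURL(path.resolve(input)).href;
}
```

For example, `toCrawlTarget("site/index.html")` yields a `file://` URL pointing at that file under the current working directory.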
Is this working?
It'd be great to remove phantomjs-node and run Snapshooter right from PhantomJS to reduce runtime errors and problems, and hopefully gain some performance as well.
So when trying to make a server to render a local theoricus app:
snapshooter -i http://hems.local:11235 -s -P 3000 -o snapshooting
I get the following error:
/usr/local/lib/node_modules/snapshooter/src/core/shoot.coffee:90
first_url = first_url.replace //index.\w+$/m, ''
^
TypeError: Cannot call method 'replace' of undefined
at new Shoot (/usr/local/lib/node_modules/snapshooter/src/core/shoot.coffee:90:16)
at Snapshooter.module.exports.Snapshooter.shoot (/usr/local/lib/node_modules/snapshooter/src/snapshooter.coffee:91:8)
at module.exports.Snapshooter.init (/usr/local/lib/node_modules/snapshooter/src/snapshooter.coffee:78:15)
at ReadStream.module.exports.Snapshooter.prompt (/usr/local/lib/node_modules/snapshooter/src/snapshooter.coffee:130:8)
at ReadStream.g (events.js:185:14)
at ReadStream.EventEmitter.emit (events.js:88:17)
at TTY.onread (net.js:396:14)
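The trace suggests the start URL is undefined by the time `replace` runs. A small sketch of failing early with a readable message, assuming the intent of that line is to strip a trailing index file from the URL; `stripIndexFile` is a hypothetical name and the pattern is a guess at what the original regex is meant to do:

```javascript
// Hypothetical helper: validate the start URL before touching it.
function stripIndexFile(firstUrl) {
  if (typeof firstUrl !== "string" || firstUrl.length === 0) {
    throw new TypeError("No start URL was provided");
  }
  // Drop a trailing "index.<ext>" so the URL points at the directory
  return firstUrl.replace(/index\.\w+$/, "");
}
```

This would not fix whatever leaves the URL unset when `-i` is used, but it would turn the crash into an error that names the real problem.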
Maybe I'm doing something wrong, but when I execute the following code:
snapshooter http://bedhead.dev/ www/
I get the following error:
- initializing...
> http://bedhead.dev/
- scanning links - http://bedhead.dev/
/Users/LMotta/Desktop/snapshooter/src/shoot.coffee:121
filename = (reg.exec(url))[1];
^
TypeError: Cannot read property '1' of null
at Shoot.module.exports.Shoot.save_page (/Users/LMotta/Desktop/snapshooter/src/shoot.coffee:121:33)
at Shoot.module.exports.Shoot.after_render (/Users/LMotta/Desktop/snapshooter/src/shoot.coffee:66:14)
at module.exports.Shoot.get (/Users/LMotta/Desktop/snapshooter/src/shoot.coffee:56:15)
at Object.module.exports.Crawler.keep_on_checking [as cb] (/Users/LMotta/Desktop/snapshooter/src/crawler.coffee:48:18)
at Socket.module.exports.create.io.sockets.on.socket.on.id (/Users/LMotta/Desktop/snapshooter/node_modules/node-phantom/node-phantom.js:156:19)
at Socket.EventEmitter.emit [as $emit] (events.js:88:17)
at SocketNamespace.handlePacket (/Users/LMotta/Desktop/snapshooter/node_modules/node-phantom/node_modules/socket.io/lib/namespace.js:335:22)
at Manager.onClientMessage (/Users/LMotta/Desktop/snapshooter/node_modules/node-phantom/node_modules/socket.io/lib/manager.js:488:38)
at WebSocket.Transport.onMessage (/Users/LMotta/Desktop/snapshooter/node_modules/node-phantom/node_modules/socket.io/lib/transport.js:387:20)
at Parser.<anonymous> (/Users/LMotta/Desktop/snapshooter/node_modules/node-phantom/node_modules/socket.io/lib/transports/websocket/default.js:36:10)
Any ideas of what that could be?
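The report does not show what `reg` is, but `exec` returning `null` means the URL did not match it, which is plausible for a root URL such as `http://bedhead.dev/` that has no path segment to use as a filename. A sketch of a null-safe version, where the pattern, the function name and the `"index"` fallback are all assumptions:

```javascript
// Assumed pattern: capture the last non-empty path segment.
const LAST_SEGMENT = /\/([^\/]+)\/?$/;

// Hypothetical helper: derive an output filename from a page URL.
function filenameFromUrl(url, fallback = "index") {
  const pathname = new URL(url).pathname;
  const match = LAST_SEGMENT.exec(pathname);
  // exec() returns null when nothing matches, so check before indexing
  return match ? match[1] : fallback;
}
```

Here the root URL falls back to `index` instead of throwing, while `http://bedhead.dev/about` maps to `about`.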
PhantomJS doesn't support audio/video tags, which inevitably adds some complexity to the code if you want full indexing: your page must not render any audio/video tags in indexing mode for Phantom to be able to index it properly.
One alternative is to use Selenium automation instead of PhantomJS. There are successful cases of running Selenium headlessly with Firefox and Xvfb on *nix systems.
It sounds worth trying as a way to provide full indexing on a real browser, without any incompatibilities.
Add the ability to use snapshooter as a library, from another library.
Useful for integrating with other libraries under the hood.
snapshooter www.domain.com output_folder
snapshooter http://www.domain.com output_folder
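Presumably both invocations above should behave the same. A sketch of normalizing the input, assuming `http://` is the right default when no scheme is given; `withProtocol` is a hypothetical name:

```javascript
// Hypothetical helper: prepend a default scheme when the input has none.
function withProtocol(input, defaultProtocol = "http://") {
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(input);
  return hasScheme ? input : defaultProtocol + input;
}
```

So `withProtocol("www.domain.com")` and `withProtocol("http://www.domain.com")` both give `http://www.domain.com`.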
The Phantom instance can be reused to increase performance.
In basic tests this increased speed by 60%, which should help when crawling large websites recursively.
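The idea can be sketched as starting the browser once and rendering every URL through it, rather than spawning one instance per page. `createBrowser`, `render` and `exit` below are hypothetical stand-ins for whatever bridge is used; they are not snapshooter's or PhantomJS's actual API:

```javascript
// Hypothetical: createBrowser() resolves to an object exposing
// render(url) and exit(). Both names are assumptions for illustration.
async function crawlAll(urls, createBrowser) {
  const browser = await createBrowser(); // started once, shared by every URL
  const pages = [];
  try {
    for (const url of urls) {
      pages.push(await browser.render(url)); // same instance for each page
    }
  } finally {
    await browser.exit(); // always release the single instance
  }
  return pages;
}
```

The `finally` block keeps the shared instance from leaking if one page fails to render.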