node-crawlers introduction, from rupertqin

Introduction to node-crawlers

Crawler examples

luowang
Crawls music from the Luowang (落网) site and downloads it to local disk.

baidu_img
Downloads images from Baidu by keyword.

one
Crawls the daily picture and the One Q&A from the One site and stores them in the LeanCloud backend.

sujin
Crawls good articles from the Sujin (素锦) site and stores them in the LeanCloud backend.

douban_book
Crawls the Douban Books Top 250.

lagou
Crawls a fairly large number of job listings from Lagou and stores them in a NoSQL database.

zhihu
Crawls the featured answers for a specific ID on Zhihu and stores them in the LeanCloud backend.

manong
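Crawls the article list from manong.io, parses each article with the readability module to get its title and content, and stores them in the LeanCloud backend. Because manong.io only collects articles from other sites and is just a list of URLs, the readability module is used to extract fairly clean, useful text. It does not succeed 100% of the time, but the accuracy is above 99.5%. It does not work, however, on pages loaded through Ajax requests, such as Zhihu Zhuanlan (知乎专栏).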

manong_psql_html_only
Crawls the article list from manong.io without the readability module and stores the pages in a local PostgreSQL (pg) database for later processing.

Requirements

Node.js version >= 7.0

About

I came across the project wuchangfeng/Crawler, which is written in Python. Since I am more familiar with Node, I wanted to try writing it in Node.

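The biggest difference between writing a crawler in Python and in Node is that one is synchronous by nature and the other is asynchronous. Here I used Node to mimic a synchronous coding style, because some sites have anti-crawling measures and asynchronous code sends requests very quickly.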
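To make the difference concrete, here is a minimal sketch that is not taken from the repository. `fakeFetch` is a hypothetical stand-in for a real HTTP request (it only waits on a timer), and `crawlAllAtOnce` and `crawlOneByOne` are illustrative names. Each function reports the largest number of requests that were in flight at the same moment.

```javascript
// Hypothetical stand-in for a real HTTP request: resolves after `ms` milliseconds.
// `stats` tracks how many fake requests are in flight at the same time.
function fakeFetch(url, stats, ms = 10) {
  stats.inFlight += 1;
  stats.maxInFlight = Math.max(stats.maxInFlight, stats.inFlight);
  return new Promise((resolve) => {
    setTimeout(() => {
      stats.inFlight -= 1;
      resolve(`body of ${url}`);
    }, ms);
  });
}

// Asynchronous style: every request starts immediately.
async function crawlAllAtOnce(urls) {
  const stats = { inFlight: 0, maxInFlight: 0 };
  await Promise.all(urls.map((url) => fakeFetch(url, stats)));
  return stats.maxInFlight;
}

// Synchronous-looking style: the next request starts only after the previous one finishes.
async function crawlOneByOne(urls) {
  const stats = { inFlight: 0, maxInFlight: 0 };
  for (const url of urls) {
    await fakeFetch(url, stats);
  }
  return stats.maxInFlight;
}
```

With three URLs, `crawlAllAtOnce` resolves to 3 and `crawlOneByOne` resolves to 1, which is why the one-at-a-time style is gentler on a site that watches for bursts of requests.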

Most of the code is a two-level loop like this, none of it complicated:

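const async = require('async')

// Run the outer items one at a time, in order
async.eachOfSeries(arr, function(item, idx, callback) {
    // For each outer item, run the inner items one at a time
    async.eachOfSeries(arr2, function(item2, idx2, callback2) {
        // crawl one page here, then call callback2() to move on
        callback2()
    }, callback) // inner loop done: continue with the next outer item
})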

manong_psql_html_only drops async.eachOfSeries and uses async/await throughout.
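As a rough illustration of what a two-level sequential loop can look like with async/await, here is a sketch under stated assumptions. It is not the code from manong_psql_html_only: `crawlInOrder`, `handle`, and the optional `pauseMs` delay between requests are all hypothetical names introduced for this example.

```javascript
// Resolve after `ms` milliseconds; used to pause between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit every (page, item) pair strictly in order.
// `handle` is a hypothetical async function that crawls one item.
// `pauseMs` is an assumed optional delay, not something the original project documents.
async function crawlInOrder(pages, items, handle, pauseMs = 0) {
  const visited = [];
  for (const page of pages) {      // outer loop
    for (const item of items) {    // inner loop
      await handle(page, item);    // wait for this item before starting the next
      visited.push(`${page}/${item}`);
      if (pauseMs > 0) await sleep(pauseMs);
    }
  }
  return visited;
}
```

Calling `crawlInOrder([1, 2], ['a', 'b'], async () => {})` resolves to `['1/a', '1/b', '2/a', '2/b']`. Plain `for...of` loops with `await` keep the ordering guarantee of `async.eachOfSeries` without the nested callbacks.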
