app-basket

Author: HouJP_NSD


An app crawler built on the Scrapy framework. It collects basic information about applications for the Android platform.

Data

The data crawled so far:

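| Source | Records | Location |
| --- | --- | --- |
| Wandoujia (豌豆荚) | 504,518 | data/wandoujia_\d+.tsv |
| Baidu Mobile Assistant (百度手机助手) | 12,308 | data/baidu.tsv |
| 360 Mobile Assistant (360手机助手) | 273,766 | data/wandoujia_\d+.tsv |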
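
As a rough illustration of how the file-name patterns in the table can be used, the following sketch lists the files in a data directory that match `wandoujia_\d+.tsv` and counts their rows. The helper names `find_data_files` and `count_rows` are hypothetical, the dot before `tsv` is escaped here so that it matches literally, and the row count assumes one record per line with no header row, which the README does not state.

```python
import re
from pathlib import Path

# Pattern taken from the table above, with the dot escaped (an assumption
# about the intended meaning). The constant and helper names are hypothetical.
WANDOUJIA_PATTERN = re.compile(r"wandoujia_\d+\.tsv")


def find_data_files(data_dir, pattern=WANDOUJIA_PATTERN):
    """Return the files in data_dir whose full name matches pattern, sorted by path."""
    return sorted(
        p for p in Path(data_dir).iterdir()
        if p.is_file() and pattern.fullmatch(p.name)
    )


def count_rows(paths):
    """Count non-empty lines across the given files.

    Assumes one record per line and no header row.
    """
    total = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            total += sum(1 for line in handle if line.strip())
    return total
```

Calling `count_rows(find_data_files("data"))` from the project directory would then give a total that can be compared with the record count in the table.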

Fields

Each crawled record currently contains the following fields:

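| Field | Type | Description | Sample |
| --- | --- | --- | --- |
| channel | String | Crawl channel | 豌豆荚/百度/360 |
| crawl_time | Long | Crawl time | 11231230 |
| crawl_url | String | Crawled URL | http://www.wandoujia.com/apps/com.sdu.didi.psnger |
| name | String | App name | 滴滴出行 |
| size | Long | App size (bytes) | 56 |
| update_time | Long | Update time | 10231231 |
| category | String | Category | 交通导航-打车 |
| tag | String | Tags | 休闲-模拟-像素-驾驶-生活应用-上瘾-日常出行-男性 |
| version | String | Version | 4.4.4 |
| system | String | Required OS version | Android 4.0.3 以上 |
| source | String | Publisher | 北京小桔科技有限公司 |
| install_count | Int | Number of installs/downloads | 27480000 |
| like_count | Int | Number of likes | 4421 |
| comment_count | Int | Number of comments | 3424 |
| comment_best_count | Int | Number of positive reviews | 342 |
| comment_good_count | Int | Number of neutral reviews | 314 |
| comment_bad_count | Int | Number of negative reviews | 312 |
| editor_comment | String | Editor's comment | 用滴滴叫出租车,都市畅行无阻。滴滴一下,美好出行! |
| desc_info | String | Description | (omitted) |
| score | Int | Rating (out of 100) | 80 |
| feature | String | Features | 官方版-安全-优质-MTC认证 |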
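
To make the types in the field list concrete, here is a minimal sketch that reads tab-separated records and converts the Long and Int fields to Python integers. It assumes the first line of the file is a header row carrying the field names, which the README does not confirm; `NUMERIC_FIELDS`, `convert_record`, and `read_records` are hypothetical names, and the sample row only reuses values from the table.

```python
import csv
import io

# Fields the table lists as Long or Int; every other field stays a string.
NUMERIC_FIELDS = {
    "crawl_time", "size", "update_time", "install_count", "like_count",
    "comment_count", "comment_best_count", "comment_good_count",
    "comment_bad_count", "score",
}


def convert_record(raw):
    """Return a copy of one record (a dict of strings) with numeric fields as int.

    Empty or missing numeric values become None instead of raising.
    """
    record = dict(raw)
    for field in NUMERIC_FIELDS:
        value = record.get(field)
        record[field] = int(value) if value not in (None, "") else None
    return record


def read_records(handle):
    """Yield typed records from a tab-separated stream.

    Assumption: the first line is a header row with the field names.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for raw in reader:
        yield convert_record(raw)


# In-memory sample using values from the field table.
sample = io.StringIO(
    "name\tversion\tscore\tinstall_count\n"
    "滴滴出行\t4.4.4\t80\t27480000\n"
)
records = list(read_records(sample))
```

If the real files have no header row, the same conversion could be applied after passing an explicit `fieldnames` list to `csv.DictReader`, in whatever column order the spiders actually write.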

From the project directory, run:

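# Crawl app information from Wandoujia (豌豆荚)
scrapy crawl wandoujia

# Crawl app information from Baidu Mobile Assistant (百度手机助手)
scrapy crawl baidu

# Crawl app information from 360 Mobile Assistant (360手机助手)
scrapy crawl sanliuling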

When a crawl finishes, the data is saved to:

app-basket/data/app-data.tsv

Notes:

  • All spiders write to the same output file name, so rename the file after each spider finishes. Otherwise the next crawl will overwrite it.
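
The rename step described in the note can be automated. The sketch below runs one spider and then moves `data/app-data.tsv` to a spider-specific name. The function name `crawl_and_archive` and the `<spider>.tsv` naming scheme are assumptions made for illustration, not part of the project; the `run` parameter exists only so the command runner can be replaced when testing.

```python
import subprocess
from pathlib import Path


def crawl_and_archive(spider, project_dir=".", run=subprocess.run):
    """Run `scrapy crawl <spider>` and rename the shared output file.

    The target name data/<spider>.tsv is a hypothetical convention.
    Raises FileExistsError before crawling if the target already exists,
    so an earlier result is never overwritten.
    """
    project = Path(project_dir)
    output = project / "data" / "app-data.tsv"
    target = output.with_name(f"{spider}.tsv")
    if target.exists():
        raise FileExistsError(f"{target} already exists; move it first")

    # Same command as in the usage section, executed inside the project directory.
    run(["scrapy", "crawl", spider], cwd=project, check=True)

    output.rename(target)
    return target
```

Calling `crawl_and_archive("baidu")` from the project directory would leave the result in `data/baidu.tsv`, and the next spider can then write `data/app-data.tsv` again without clobbering it.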

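Changelog

  • 2016-09-11
    • Finished HTML parsing for the Baidu Mobile Assistant (百度手机助手) site
    • Finished HTML parsing for the 360 Mobile Assistant (360手机助手) site
    • Open issues
      • The Scrapy spiders need support for content loaded dynamically by JavaScript. For example, the comment counts on 360 Mobile Assistant are loaded dynamically by JavaScript.
  • 2016-09-05
    • Finished HTML parsing for the Wandoujia (豌豆荚) site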

Contributors

houjp, tanyokwok
