xianyunyh / spider_job


393.0 12.0 123.0 2.31 MB

License: MIT License

HTML 14.66% JavaScript 5.60% Shell 7.70% TypeScript 38.67% Dockerfile 0.32% CSS 0.27% EJS 32.79%

boss spider koa2 nodejs puppeteer

spider_job's Introduction

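Crawler project

Disclaimer

This software is intended for academic research only. However, news about crawler developers facing lawsuits and violations appears frequently in mainland **.

Users must comply with the laws and regulations of their own jurisdiction. Users bear sole responsibility for any consequences of illegal or non-compliant use.

This project is a crawler I built mainly to study the positions posted on recruitment websites and the requirements they list. The crawler is built on `nodejs` and the `puppeteer` framework, and stores the crawled data in `mysql`. The server side uses `nodejs` and `koajs` to provide a `web` `ui` that displays the data.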


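├─web backend service
├─spider crawler (nodejs + puppeteer)
│  ├─src/spider        crawler implementation
│  │  ├─zhipin.ts      Zhipin (直聘) crawler
├─word.json generated JSON of English technical terms
├─word.py generates the English word segmentation
├─stop.txt stop-word list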
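
The listing above mentions a word-segmentation step (`word.py`, `stop.txt`, `word.json`) without showing it. As a rough illustration of what such a step might do, here is a minimal TypeScript sketch that extracts English technical terms from job descriptions, drops stop words, and counts what remains. The function names `tokenize` and `countTerms`, the token pattern, and the sample data are all assumptions made for this example; they are not taken from `word.py`.

```typescript
// Hypothetical sketch: count English technical terms in job descriptions.
// The tokenizing rule is an assumption and is not taken from word.py.

/** Extract lowercase English tokens such as "node.js", "c++" or "mysql". */
function tokenize(text: string): string[] {
  // A letter followed by letters, digits, '+', '#', '.' or '-' (covers "c++", "c#", "node.js").
  const matches = text.toLowerCase().match(/[a-z][a-z0-9+#.\-]*/g) ?? [];
  // Trim trailing '.' or '-' left over from sentence ends, e.g. "mysql." -> "mysql".
  return matches.map((t) => t.replace(/[.\-]+$/, "")).filter((t) => t.length > 0);
}

/** Count every token that is not in the stop-word set. */
function countTerms(descriptions: string[], stopWords: Set<string>): Map<string, number> {
  const counts = new Map<string, number>();
  for (const description of descriptions) {
    for (const token of tokenize(description)) {
      if (stopWords.has(token)) continue;
      counts.set(token, (counts.get(token) ?? 0) + 1);
    }
  }
  return counts;
}

// Sample input: mixed Chinese and English text, as job postings often are.
const sampleCounts = countTerms(
  ["熟悉 Node.js 和 MySQL. Good at C++ and the Redis", "Node.js, Koa"],
  new Set(["good", "at", "and", "the"]),
);
```

A `Map` is used instead of a plain object so that tokens such as "constructor" cannot collide with inherited object properties.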

Backend service

The backend service is written with koajs. It provides the API and displays the data.

Open web/server/config/index.ts and fill in your own database settings.
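
The contents of `web/server/config/index.ts` are not shown here, so the following is only a sketch of what a MySQL configuration of this kind could look like. The `DatabaseConfig` interface, the `loadDatabaseConfig` helper, the `DB_*` variable names, and the default values are all hypothetical; check the actual file for the field names it expects.

```typescript
// Hypothetical shape; the real web/server/config/index.ts may use different field names.
interface DatabaseConfig {
  host: string;
  port: number;
  user: string;
  password: string;
  database: string;
}

/** Build a config from environment-style key/value pairs, falling back to local defaults. */
function loadDatabaseConfig(env: Record<string, string | undefined>): DatabaseConfig {
  const port = Number(env.DB_PORT ?? "3306"); // 3306 is the default MySQL port
  if (!Number.isInteger(port) || port <= 0 || port > 65535) {
    throw new Error(`Invalid DB_PORT: ${env.DB_PORT}`);
  }
  return {
    host: env.DB_HOST ?? "127.0.0.1",
    port,
    user: env.DB_USER ?? "root",
    password: env.DB_PASSWORD ?? "",
    database: env.DB_NAME ?? "spider_job", // placeholder database name
  };
}
```

In a Node.js process this could be called as `loadDatabaseConfig(process.env)`, which keeps credentials out of the committed file.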

cd web
npm install --registry https://registry.npmmirror.com/
# start the service
npm run dev

Run the crawler

Install Nodejs first.

Chrome or Edge must be installed locally.

Open spider/src/index.ts

Change executablePath to the path of your local browser

const options: PuppeteerLaunchOptions = {
  // launch the browser in headless mode
  headless: 'new',
  // path to the browser executable
  executablePath: 'C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe'
}
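
Because `executablePath` differs from machine to machine, one option is to pick it from a list of candidate paths instead of hard-coding one. The sketch below is only an illustration: `pickBrowserPath` and `BROWSER_CANDIDATES` are hypothetical names, and the listed paths are common default install locations that may not match your system.

```typescript
// Hypothetical helper: choose the first installed browser from a list of common locations.
// The candidate paths are typical defaults (an assumption), not guaranteed to exist.
type Platform = "win32" | "darwin" | "linux";

const BROWSER_CANDIDATES: Record<Platform, string[]> = {
  win32: [
    "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
    "C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe",
  ],
  darwin: [
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    "/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge",
  ],
  linux: ["/usr/bin/google-chrome", "/usr/bin/microsoft-edge"],
};

/**
 * Return the first candidate for which `exists` reports true, or undefined if none is found.
 * The existence check is injected so the function stays pure and easy to test.
 */
function pickBrowserPath(
  platform: Platform,
  exists: (path: string) => boolean,
): string | undefined {
  return BROWSER_CANDIDATES[platform].find(exists);
}
```

In the crawler this could be wired up with Node's `fs.existsSync` as the `exists` argument and `process.platform` as the platform, with the result assigned to `executablePath`.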

cd spider
npm install --registry https://registry.npmmirror.com/ --ignore-scripts # skip downloading chromium
# run the service
npm run dev
# compile
npm run build

spider_job's People

Forkers

greatxj iyueshang guoyu07 sushengbuhuo yunshu2009 mrlantin liuenguang berwinsky duantianhen2014 lywzx guijianshi jingqianwei color4 alexmaxmiao saoinformaticsteam cacppuccino buster2004 onetreegrow sunjoker landeqi shaolei-zuo hhy5277 huan0808 bobwufall littlebaibaibai wusir2001 longtan01 xuediaobest bamshk queryfish jasonsimhone mrfiona xiaonian0430 llzhi001 bugupdater okpiaoxuefeng98 githubdcheng tapate nobertotang annihilater jdzz112 zbird1988 yavana yin000shi qingdou-33 ludvikwoo ywwgithub deguangchow lin-zone sxhylkl runll zi-cheng chenhaox shiminghang-sudo czybuaa ggqshr yinwuli mrying 18635657419 dengbinhero wuqundong520 a11266897 lxngoddess5321 pig3three shizhibin willylee007 lovessures winson1107 dwave oldliumark zhongjiao43 l0ngc ron-joan zhangzhicheng117 quincy-chang longliveping wangjake2019 mercis wangdepin jucyjun wtn-gavin yufeng518 mypfjj striver619 eviliclufas liuzhaobo1999 bykiss555 tangyin88 wooodhead daddybooks a4471174 ccsourcecode benjeming petitchouchinois leegang huboyan188 queenofbugs yiyepiaolingren nealdc itutopia

spider_job's Issues

bug

  File "E:\Anaconda3\envs\pachong\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 24, in <module>
    from scrapy.core.downloader.handlers.http11 import TunnelError
  File "E:\Anaconda3\envs\pachong\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 26, in <module>
    from scrapy.core.downloader.webclient import _parse
  File "E:\Anaconda3\envs\pachong\lib\site-packages\scrapy\core\downloader\webclient.py", line 4, in <module>
    from twisted.web.client import HTTPClientFactory
ImportError: cannot import name 'HTTPClientFactory' from 'twisted.web.client' (unknown location)


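Add a city filter.

Wrong install command

The documentation gives the following install command:
pip install -f requirements.txt

However, that command fails to run.
Should it be changed to: pip install -r requirements.txt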