
spider_job's Introduction

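Crawler project

Disclaimer

This software is for academic research only. Be aware that news of crawler developers facing lawsuits and compliance violations appears frequently in mainland **.

Users must comply with the laws and regulations of their own jurisdiction. Users bear sole responsibility for any consequences of illegal or non-compliant use.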

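This is a crawler project I built mainly to study the positions posted on recruitment websites and their requirements. The crawler is built on `nodejs` and the `puppeteer` framework, and stores the scraped data in `mysql`. The server side uses `nodejs` and `koajs` to provide a `web` `ui` that displays the data.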
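The storage step of that flow can be sketched as follows. This is a minimal illustration, not the project's actual code: the `JobPosting` fields, the `jobs` table name, and the `buildInsert` helper are all hypothetical, since the README does not show the real schema. The only technique assumed is a parameterized `INSERT` with `?` placeholders, which MySQL client libraries for Node.js commonly accept.

```typescript
// Hypothetical shape of one scraped posting; the real columns are not shown in this README.
interface JobPosting {
  title: string;
  company: string;
  salary: string;
  description: string;
}

// Quote an identifier with backticks, as MySQL expects for table and column names.
const quote = (name: string): string => "`" + name + "`";

// Build a parameterized INSERT so scraped text is passed as values
// and never spliced into the SQL string itself.
function buildInsert(
  table: string,
  row: JobPosting
): { sql: string; values: string[] } {
  const columns = Object.keys(row) as (keyof JobPosting)[];
  const columnList = columns.map((c) => quote(c)).join(", ");
  const placeholders = columns.map(() => "?").join(", ");
  return {
    sql: `INSERT INTO ${quote(table)} (${columnList}) VALUES (${placeholders})`,
    values: columns.map((c) => row[c]),
  };
}
```

The returned `sql` and `values` pair would then be handed to whichever MySQL client the project uses.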
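  • Project directory structure

├─web          backend service
├─spider       crawler (nodejs, puppeteer)
│  ├─src/spider        crawler implementation
│  │  ├─zhipin.ts      crawler for 直聘 (zhipin)
├─word.json    generated JSON of English technical terms
├─word.py      generates the English word segmentation
├─stop.txt     stop-word list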
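The role of `word.py`, `stop.txt`, and `word.json` can be illustrated with a short sketch. The real `word.py` is not shown in this README, so the following is only an assumption about the general idea, written in TypeScript to match the rest of the project: pull English terms out of mixed Chinese/English job descriptions, drop stop words, and count what remains. The function name `countEnglishTerms` and the token pattern are hypothetical.

```typescript
// Count English technical terms in job descriptions, skipping stop words.
function countEnglishTerms(
  texts: string[],
  stopWords: Set<string>
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const text of texts) {
    // A term starts with a letter and may contain digits, '+', '#', '.', or '-',
    // so names such as "c++", "c#", and "node.js" stay in one piece.
    const tokens = text.toLowerCase().match(/[a-z][a-z0-9+#.-]*/g) ?? [];
    for (const raw of tokens) {
      // Drop trailing '.' or '-' left over from sentence punctuation.
      const token = raw.replace(/[.-]+$/, "");
      // Single letters are skipped as noise (this also drops a bare "c").
      if (token.length < 2 || stopWords.has(token)) continue;
      counts.set(token, (counts.get(token) ?? 0) + 1);
    }
  }
  return counts;
}
```

The stop-word set would be loaded from a file such as `stop.txt`, one word per line, and the resulting counts could be serialized to a JSON file such as `word.json`.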

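Backend service

The backend service is written with koajs. It exposes the API endpoints and displays the scraped data.

Open web/server/config/index.ts and change the database settings to match your own database.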
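For orientation, a database configuration of this kind might look like the sketch below. The actual contents of web/server/config/index.ts are not shown in this README, so every name here (`DbConfig`, `loadDbConfig`, the `DB_*` variables, and the default values) is an assumption chosen for illustration.

```typescript
// Hypothetical shape of the database settings; the real config file may differ.
interface DbConfig {
  host: string;
  port: number;
  user: string;
  password: string;
  database: string;
}

// Read settings from an environment-like object, falling back to local defaults.
function loadDbConfig(env: Record<string, string | undefined>): DbConfig {
  const port = Number(env.DB_PORT ?? "3306"); // 3306 is the default MySQL port
  if (!Number.isInteger(port) || port <= 0 || port > 65535) {
    throw new Error(`Invalid DB_PORT: ${env.DB_PORT}`);
  }
  return {
    host: env.DB_HOST ?? "127.0.0.1",
    port,
    user: env.DB_USER ?? "root",
    password: env.DB_PASSWORD ?? "",
    database: env.DB_NAME ?? "spider_job", // assumed database name
  };
}
```

In a Node.js process this could be called as `loadDbConfig(process.env)`, which keeps credentials out of the repository.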

cd web
npm install --registry https://registry.npmmirror.com/
# start the service
npm run dev

Running the crawler

  • Install Nodejs

  • Chrome or Edge must be installed locally

    Open spider/src/index.ts

    Change executablePath to the path of your local browser

    const options: PuppeteerLaunchOptions  = {
      // run the browser in headless mode
      headless: 'new',
      // path to the browser executable
      executablePath: 'C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe'
    }

    cd spider
    npm install --registry https://registry.npmmirror.com/  --ignore-scripts # skip downloading chromium
    # run the crawler
    npm run dev
    # build
    npm run build
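    Instead of hard-coding one path, the browser location could be picked per platform. The sketch below is an illustration only: `resolveBrowserPath` and `CANDIDATE_PATHS` are hypothetical names, and the listed paths are common default install locations that may not match your machine.

```typescript
// Common default install locations (assumptions; adjust for your machine).
const CANDIDATE_PATHS: Record<string, string[]> = {
  win32: [
    "C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe",
    "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
  ],
  darwin: ["/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"],
  linux: ["/usr/bin/google-chrome", "/usr/bin/microsoft-edge"],
};

// Return the first candidate for the platform that the `exists` check accepts.
// The check is injected so the function stays free of file-system access.
function resolveBrowserPath(
  platform: string,
  exists: (path: string) => boolean
): string {
  const candidates = CANDIDATE_PATHS[platform] ?? [];
  const found = candidates.find((p) => exists(p));
  if (found === undefined) {
    throw new Error(`No browser found for platform "${platform}"`);
  }
  return found;
}
```

    In spider/src/index.ts this might be used as `executablePath: resolveBrowserPath(process.platform, fs.existsSync)`, after importing Node's `fs` module.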

spider_job's People

Contributors

dependabot[bot], xianyunyh

spider_job's Issues

bug

File "E:\Anaconda3\envs\pachong\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 24, in
from scrapy.core.downloader.handlers.http11 import TunnelError
File "E:\Anaconda3\envs\pachong\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 26, in
from scrapy.core.downloader.webclient import _parse
File "E:\Anaconda3\envs\pachong\lib\site-packages\scrapy\core\downloader\webclient.py", line 4, in
from twisted.web.client import HTTPClientFactory
ImportError: cannot import name 'HTTPClientFactory' from 'twisted.web.client' (unknown location)

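HELP: no response after running scrapy crawl boss; it exits after half a minute without reporting an error

Hello, I installed all the libraries plus redis and Mongodb as the repository requires, but running scrapy crawl boss does nothing. Note: the printed "Action" is debug output I added in the class. I added the same output in the def parse method, but it is never printed, which means def parse is not being executed. I don't know why, and there is no data in Mongodb either.
(screenshot)

I also noticed that Settings contains the proxy HTTP_PROXY = 'http://127.0.0.1:8123/'. Which IP proxy is this, and how should I configure it on my side? Any advice would be much appreciated!

Does this still work?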

Incorrect install command

The documentation gives the following install command:
pip install -f requirements.txt

However, this command does not work.
Shouldn't it be: pip install -r requirements.txt
