
howie6879 commented on July 28, 2024

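Status of fetching via Sogou:

  • Progress:
    • 2021-12-22: the playwright-based data-fetching script is essentially complete
  • Issues:
    • Fetching has to go through a browser driven by playwright. Both approaches below may still trigger Sogou's blocking; Issue #31 was added for this
      • playwright drives a headless browser with a UA set; tested and workable
      • [Used by default] Try a plain crawler to see how readily the CAPTCHA limit is triggered; workable
    • [Resolved by backing up] The fetched official-account article links expire after a while. As long as a link is opened inside the WeChat ecosystem, even an expired link redirects automatically to the correct one, so there is no problem when the distributor is a WeChat official account. For other distributors the links still stay valid for a fairly long time, so the impact should be acceptable
      • Back up the target content; see Issue #20 for details
  • Plan: 2021-12-22: fetch the latest WeChat articles by having playwright drive a headless browser with a UA set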
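The headless-browser-plus-UA approach could look roughly like the sketch below. It is not the project's actual script: the Sogou search URL and its `type`/`query` parameters, the UA string, and the helper names `build_search_url` and `fetch_search_html` are all assumptions made for illustration. Only the playwright calls themselves (`sync_playwright`, `chromium.launch`, `new_context(user_agent=...)`, `page.goto`, `page.content`) are real library API.

```python
from urllib.parse import urlencode

# Assumed endpoint and UA string; neither is taken from the project.
SOGOU_WEIXIN_SEARCH = "https://weixin.sogou.com/weixin"
DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
)


def build_search_url(account_name: str) -> str:
    """Build a Sogou WeChat search URL (hypothetical helper).

    type=1 is assumed here to select the official-account search.
    """
    return f"{SOGOU_WEIXIN_SEARCH}?{urlencode({'type': 1, 'query': account_name})}"


def fetch_search_html(
    account_name: str,
    user_agent: str = DEFAULT_UA,
    timeout_ms: int = 30_000,
) -> str:
    """Open the search page in headless Chromium with a custom UA
    and return the rendered HTML (hypothetical helper)."""
    # Imported lazily so the URL helper works without playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            # The UA is set on the browser context, so every page in it uses it.
            context = browser.new_context(user_agent=user_agent)
            page = context.new_page()
            page.goto(build_search_url(account_name), timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

Calling `fetch_search_html("howie_locker")` would return the page HTML for later parsing. Whether Sogou serves a CAPTCHA page instead depends on its blocking rules, which this sketch does not detect or handle.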

Data format:

{
    "doc_author": "howie6879",
    "doc_content": "",
    "doc_ts": 1639702080,
    "doc_date": "2021-12-17 08:48",
    "doc_des": "本周推荐游戏程序员的读书笔记,致敬。",
    "doc_id": "bd998b9c43ba2d91fd6be9f833ecb634",
    "doc_image": "http://mmbiz.qpic.cn/mmbiz_jpg/YRBRJvZXcIVBtU4gtNsZrRQtDLDS725uEGsCGXHbq7GzfDK2KumHOSKkA6TiaWLia1co96EzPqHRoiac7w7wtqlkg/0?wx_fmt=jpeg",
    "doc_keywords": [],
    "doc_link": "https://mp.weixin.qq.com/s?src=11&timestamp=1640227638&ver=3513&signature=KSf-sAynN5L4LZlsLccoZvT7BT2C6BOcinT77piilqyZnDkcBAy8xpN5o1E8XIKNlBei5CiWNuWJ7e8OzqzyvsY6Fr-aF60Sc6mXJLExQrCNDgGf1V-F8LmOuyCxPVZv&new=1",
    "doc_name": "我的周刊(第018期)",
    "doc_source": "2c_wechat",
    "doc_source_account_intro": "编程、兴趣、生活",
    "doc_source_account_nick": "howie_locker",
    "doc_source_meta_list": [
        "howie_locker",
        "编程、兴趣、生活"
    ],
    "doc_source_name": "老胡的储物柜",
    "doc_type": "article"
}
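Two fields in this record relate to time, and a small sketch can make the relationship concrete. In the sample above, `doc_ts` matches `doc_date` when rendered in UTC+8, and `doc_link` carries a `timestamp` query parameter. Treating that parameter as the moment the link was generated is an assumption, not something the source confirms, and the helper names below are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlparse

# UTC+8; in the sample record doc_date equals doc_ts rendered in this offset.
CST = timezone(timedelta(hours=8))


def ts_to_doc_date(doc_ts: int) -> str:
    """Render a Unix timestamp in the doc_date layout (YYYY-MM-DD HH:MM)."""
    return datetime.fromtimestamp(doc_ts, tz=CST).strftime("%Y-%m-%d %H:%M")


def link_timestamp(doc_link: str) -> Optional[int]:
    """Return the `timestamp` query parameter of a link, or None if absent."""
    values = parse_qs(urlparse(doc_link).query).get("timestamp")
    return int(values[0]) if values else None


def link_age_seconds(doc_link: str, now_ts: int) -> Optional[int]:
    """Seconds elapsed since the link's timestamp.

    Assumption: the timestamp marks when the link was generated.
    """
    ts = link_timestamp(doc_link)
    return None if ts is None else now_ts - ts


print(ts_to_doc_date(1639702080))  # 2021-12-17 08:48
```

A distributor outside WeChat could use something like `link_age_seconds` to decide when to fall back to the backed-up copy, though the actual expiry window is not stated in the source and would have to be measured.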

from liuli.
