duty-machine / duty-machine-action

A GitHub Action that scrapes web articles and saves them as GitHub issues

Dockerfile 0.63% JavaScript 99.37%

duty-machine-action's Introduction

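About the temporary suspension of this repository

This account was suspended because submitted content violated GitHub's rules on pornographic content, and it was reinstated only recently. Until there is a plan for next steps, this repository will stop automatic submission. Users who need the service should set up their own instance with duty-machine-action.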

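What is this?

This is a service, built on GitHub, for backing up web articles. Because GitHub is very important to the software industry inside the Great Firewall, it is unlikely to be blocked lightly. Articles on domestic platforms that are prone to being censored ("harmonized"), as well as articles on foreign platforms that are already blocked, can therefore be made accessible inside the firewall through GitHub.

With this project you can anonymously create, online, a censorship-resistant copy of a web article, for example "British Embassy in China: Corrective Statement on the Hong Kong Issue".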

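Which websites are currently supported?

Currently supported sites: WeChat Official Accounts Platform, Weibo posts and Weibo articles, Zhihu answers and columns, Douban diaries and group topics, Shimo Docs, bilibili columns, AcFun articles, ao3, matters, telegraph, chinadigitaltimes, and rfa.

Articles from foreign media can usually be found directly in the https://github.com/duty-machine/news project, so scraping foreign news sites is not planned for now.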

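How do I submit a page to be scraped?

We provide an anonymous online submission form: https://archives.duty-machine.now.sh/ . Paste the URL into the form and click submit. After about a minute, the bot will post the article content as a follow-up comment at the current URL.

You can also view all submitted scrape requests, as well as the list of articles that were scraped successfully.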

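What are the risks of using this service?

Our submission form is open source, and the source code that is actually running is open to scrutiny; see the duty-machine-form project. We do not store or leak your IP address or submission details.

Your identity could still be exposed, however. In theory, China is able to monitor access from inside the firewall to the submission form's domain. If they know you visited this domain and see a new scrape request created at the same time, your identity could be linked to the scraped content. Although this risk is small, we recommend accessing the form and submitting through a proxy such as a VPN wherever possible, for maximum anonymity.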

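In support of Terminus2049 (端点星)

This project was inspired by the Terminus2049 project. Its aim is to let people back up web articles without logging in to GitHub, meeting the backup needs of ordinary users.

Just as this project was being born, I learned that Terminus2049 volunteers Chen Mei (陈玫) and Cai Wei (蔡伟), together with Cai Wei's girlfriend Xiao Tang (小唐), had been detained by the authorities, and I feel obliged to call attention to their case. After 54 days of secret detention, Chen Mei and Cai Wei were formally arrested on the charge of picking quarrels and provoking trouble.

Citing Article 35 of the Constitution of the People's Republic of China, "Citizens of the People's Republic of China enjoy freedom of speech, of the press, of assembly, of association, of procession and of demonstration", we urge the authorities to immediately stop violating the Constitution and infringing on citizens' freedoms.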

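Other

This project is built with duty-machine-action; you can use it to set up your own instance.

When sharing github.com links, you can use the https://git.io/ short-link service to make censorship harder.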

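Related links

Collection of public-interest and social projects: https://github.com/NodeBE4/impact/

Backup of reporting on the Wuhan outbreak: https://github.com/lestweforget/wuhan2019

Timelines of social issues such as the COVID-19 pandemic and Hong Kong's anti-extradition bill movement: https://github.com/chinatimeline/chinatimeline.github.io

Read foreign news without circumventing the firewall: https://github.com/duty-machine/news

Contact

Please email [email protected] .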

duty-machine-action's Issues

Has the account been deleted?

Scraping Juejin articles fails

I added juejin.js, but debugging locally keeps failing, and I don't know why document.querySelector('h1.article-title') is null.

npm run test-website juejin

> [email protected] test-website
> node test.js test-website "juejin"

null
/Users/fakeyanss/project/duty-machine-action/websites/juejin.js:21
    let title = document.querySelector('h1.article-title').textContent
                                                          ^

TypeError: Cannot read properties of null (reading 'textContent')
    at Object.process (/Users/fakeyanss/project/duty-machine-action/websites/juejin.js:21:59)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async fetchArticle (/Users/fakeyanss/project/duty-machine-action/src/fetchArticle.js:22:19)
    at async /Users/fakeyanss/project/duty-machine-action/test.js:25:21

Node.js v17.4.0

Here is juejin.js:

let { URL } = require('url')
let fetch = require('node-fetch')
let { JSDOM } = require('jsdom')

module.exports = {
  test(url) {
    let parsed = new URL(url)
    return parsed.hostname == 'juejin.cn'
  },

  async process(url) {
    let res = await fetch(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:74.0) Gecko/20100101 Firefox/74.0'
      }
    })
    let html = await res.text()
    let document = new JSDOM(html).window.document

    console.log(document.querySelector('h1.article-title'))
    let title = document.querySelector('h1.article-title').textContent
    let author = document.querySelector('.name').textContent
    let content = document.querySelector('.markdown-body')

    return {
      title,
      author,
      dom: content
    }

  },

  samples: [
    'https://juejin.cn/post/6844903975678902279'
  ]
}
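
A note on diagnosing the failure above: the TypeError only says that the selector matched nothing in the HTML that was fetched. One possible cause, which is an assumption to verify and not a confirmed diagnosis, is that the title is rendered by client-side script and is absent from the server response. The sketch below uses a hypothetical helper, requireText (not part of duty-machine-action), that reports which selector failed instead of crashing on null. A fake document stands in for the JSDOM one so the snippet runs on its own.

```javascript
// Hypothetical helper (not part of duty-machine-action): look up an element
// and fail with a descriptive error instead of a TypeError on null.
function requireText(document, selector) {
  const node = document.querySelector(selector)
  if (node === null) {
    throw new Error(`selector "${selector}" matched nothing in the fetched HTML`)
  }
  return node.textContent.trim()
}

// Stand-in for a parsed page: only '.name' exists in this fake document.
const fakeDocument = {
  querySelector(selector) {
    return selector === '.name' ? { textContent: ' someone ' } : null
  }
}

console.log(requireText(fakeDocument, '.name')) // prints "someone"
```

In juejin.js the same helper could wrap the title and author lookups, and logging the first few hundred characters of html next to the error would show whether the article markup is in the response at all.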

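GIF images cannot be displayed

duty-machine/duty-machine#547

Image mirroring is done automatically by GitHub: it fetches images from external sites and stores them under githubusercontent.com. In this case it looks like GitHub refuses to accept image files that are too large.

One possible fix is to store the images in the repository.

Request: add scraping support for bilibili columns, AcFun columns, and ao3 (Archive of Our Own) articles

Scraping archive.today

When I use node-fetch to fetch archive.today pages, I am always asked to solve a CAPTCHA, and I don't know why. This happens even when I use the same headers as Chrome. The headers below were copied directly from Chrome: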

let res = await fetch(url, {
      "headers": {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-language": "en",
        "cache-control": "no-cache",
        "pragma": "no-cache",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
      },
      "referrerPolicy": "strict-origin-when-cross-origin",
      "body": null,
      "method": "GET",
      "mode": "cors"
})

My initial guess is that node-fetch sends other headers that give it away, but I'm not sure how to check.
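
One way to check which headers a client really sends is to point it at a throwaway local server that records them. The sketch below is a minimal version using only Node's built-in http module. captureRequestHeaders and plainGet are hypothetical names introduced here. The demo uses http.get so that it runs without dependencies; passing a function that calls node-fetch instead would show that library's actual headers. Note that this only reveals HTTP headers; whether archive.today's CAPTCHA depends on headers at all is not established here.

```javascript
const http = require('http')

// Start a throwaway local server, let `sendRequest` hit it once, and resolve
// with the headers the server actually received.
function captureRequestHeaders(sendRequest) {
  return new Promise((resolve, reject) => {
    const server = http.createServer((req, res) => {
      // Ask the client to drop the connection so the server can shut down promptly.
      res.setHeader('Connection', 'close')
      res.end('ok')
      server.close()
      resolve(req.headers)
    })
    server.on('error', reject)
    // Port 0 asks the OS for any free port.
    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address()
      Promise.resolve(sendRequest(`http://127.0.0.1:${port}/`)).catch(reject)
    })
  })
}

// Demo client using Node's built-in http.get. To inspect node-fetch, pass
// something like `url => fetch(url, { headers })` instead.
function plainGet(url) {
  return new Promise((resolve, reject) => {
    http.get(url, { headers: { 'accept-language': 'en' } }, res => {
      res.resume()
      res.on('end', resolve)
    }).on('error', reject)
  })
}

captureRequestHeaders(plainGet).then(headers => console.log(headers))
```

Comparing the printed object with the headers Chrome shows in its network panel would reveal any extra or missing entries.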

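Optimize performTasks

Scenario: when several unscraped issues exist at the same time, each issue ends up with N comments once scraping finishes. For example, #3009, #3010, #3011, and #3012 each get 4 comments.

Cause: perform.js#L22-L26 lists all newly submitted issues at once, and every run iterates over all of them.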

duty-machine-action/perform.js

Lines 22 to 26 in 8875ceb

let { data } = await octokit.issues.listForRepo({
  owner: OWNER,
  repo: REPO,
  state: 'open'
})

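Proposed fix: use a label to track scrape status. Before calling fetchArticle, skip tasks that are already being scraped, and add an "in progress" label to tasks that have not been scraped yet. For reference only.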
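
The label-based proposal could look roughly like the sketch below. The label name 'fetching' and the helpers selectPendingIssues and claimIssue are hypothetical and do not appear in perform.js. The call to octokit.issues.addLabels assumes a version of @octokit/rest in which the parameter is named issue_number.

```javascript
// Hypothetical label name; the proposal does not specify one.
const IN_PROGRESS_LABEL = 'fetching'

// Keep only issues that do not yet carry the in-progress label.
// The REST API returns labels as objects with a `name` field; plain strings
// are handled too, in case a payload uses them.
function selectPendingIssues(issues, label = IN_PROGRESS_LABEL) {
  return issues.filter(issue =>
    !(issue.labels || []).some(l => (typeof l === 'string' ? l : l.name) === label)
  )
}

// Mark an issue before scraping so that overlapping runs skip it.
// `octokit`, `owner`, and `repo` are supplied by the caller.
async function claimIssue(octokit, owner, repo, issue, label = IN_PROGRESS_LABEL) {
  await octokit.issues.addLabels({
    owner,
    repo,
    issue_number: issue.number,
    labels: [label]
  })
}

// Example with stand-in data shaped like the listForRepo response.
const sampleIssues = [
  { number: 3009, labels: [{ name: 'fetching' }] },
  { number: 3010, labels: [] },
  { number: 3011, labels: [{ name: 'other' }] }
]
console.log(selectPendingIssues(sampleIssues).map(issue => issue.number)) // prints [ 3010, 3011 ]
```

Two runs that list issues at nearly the same moment can still both see an unlabeled issue, so this narrows the duplicate-comment window without eliminating it.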

duty-machine-action/perform.js

Line 33 in 8875ceb

let articleData = await fetchArticle(issue.body)

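Consider storing archives in the repository

The current approach archives web pages as issues. When the repo is deleted, the issues disappear with it.

I therefore suggest archiving pages into the repository itself, which also makes forking and cloning easier.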
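
As a rough illustration of the suggestion above, the sketch below prepares one file per archived article. Everything in it is an assumption: the archives/ directory, the file naming, the article.markdown field, and the helper name buildArchiveFile are all hypothetical. The content is base64-encoded because GitHub's repository contents API expects file content in that encoding; the upload call itself is left out.

```javascript
// Hypothetical helper: turn a scraped article into the path, commit message,
// and base64 content needed to store it as a file in the repository.
function buildArchiveFile(issueNumber, article) {
  const body = `# ${article.title}\n\nAuthor: ${article.author}\n\n${article.markdown}\n`
  return {
    path: `archives/${issueNumber}.md`,
    message: `Archive article from issue #${issueNumber}`,
    content: Buffer.from(body, 'utf8').toString('base64')
  }
}

// Stand-in article data for demonstration.
const file = buildArchiveFile(3010, {
  title: 'Example title',
  author: 'someone',
  markdown: 'Article text.'
})
console.log(file.path) // prints "archives/3010.md"
```

Keeping issues as the submission channel and files as the durable copy would preserve the anonymous workflow while making clones and forks carry the archive.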

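Use cases for labels

I have long wanted a backup tool like this, one that can back up WeChat articles that exist now or may disappear. Anyone with many saved articles knows how bad WeChat's tags are. And when a contact deletes you as a friend, whatever you saved from the archived chat history vanishes too. The search and favorites features are truly awful, yet the platform still has plenty of good articles, hence the need to back up WeChat articles. Once many articles are backed up, labels are needed to identify them quickly. Issue search can find them, but labels would be better. Since the author plans to support Markdown export, that is even better: it becomes a portable, categorized online knowledge base, and with the search, comments, and labels that issues already provide, it would be just about perfect.

Should there be an option to capture pages as jpg/mht?

Similar to weixin-archive-action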
