zulip / zulip-archive
Generate a static HTML archive of messages in any combination of streams in a Zulip organization.
License: MIT License
Ideally this tool could run as a GitHub Action without the need to manage a personal access token (since scopes on personal tokens are coarse grained, and best practices keep them short lived).
Relevant bits:
Since Git(Hub) changed the default branch name to main,
zulip-archive should use that when committing new changes, I believe here and in subsequent lines.
Even better would be to allow the user to specify a preferred branch name.
The current escaping for URLs isn't strong enough. #34 fixed escaping ?
which was causing broken links. But there are other bad symbols coming through.
Here's a URL which works but is malformed: https://leanprover-community.github.io/archive/stream/218272-FoMM-/-Lean-Together-2020/index.html (notice it's in a subdirectory of /stream/218272-FoMM-
). The Zulip URL for this stream is better, https://leanprover.zulipchat.com/#narrow/stream/218272-FoMM-.2F.20Lean.20Together.202020
I haven't noticed other malformed links yet but I expect to find some. Our users like to make annoying stream/topic names. The Zulip escaping seems to be 100% reliable for us, even with the annoying names, so maybe that should be duplicated here?
One note, URL changes are disruptive, so this probably shouldn't be done incrementally. If anything is changing it should all change at once.
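A minimal sketch of the kind of escaping being suggested, modeled on the `.2F` / `.20` pattern visible in the Zulip URL above. The function names `encode_component` and `stream_slug` are hypothetical, and this is not necessarily the exact routine Zulip uses; it only illustrates percent-encoding everything (including `/`) and then replacing `%` with `.`.

```python
from urllib.parse import quote


def encode_component(name: str) -> str:
    """Percent-encode every reserved character, then swap '%' for '.'."""
    # safe="" makes quote() escape '/' too, so a stream or topic name
    # can never introduce an extra path segment.
    escaped = quote(name, safe="")
    # A literal '.' is escaped first; otherwise it would be ambiguous
    # with the '.' that stands in for '%'.
    return escaped.replace(".", "%2E").replace("%", ".")


def stream_slug(stream_id: int, stream_name: str) -> str:
    """Build a single path segment of the form <id>-<escaped name>."""
    return f"{stream_id}-{encode_component(stream_name)}"


if __name__ == "__main__":
    print(stream_slug(218272, "FoMM / Lean Together 2020"))
```

Because the output never contains `/`, `?`, or `%`, each stream maps to exactly one directory name.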
Currently it just prints the result of each command instead of the command that is being run, e.g. https://github.com/refeed/zarchive-refeed/runs/7280779778?check_suite_focus=true
I think it would be better to print the commands that are being run for easier debugging.
My current idea is to add set -x
in the beginning of entrypoint.sh.
xml-sitemap-writer
was added as a dependency last August by 980675c , and it is not listed as a dependency in the current instructions.md.
Note: Along with @Zimmi48 we maintain a GitLab CI script to produce Zulip archives, similar to the GitHub Action, and this is how I noticed the issue. https://gitlab.com/gasche/zulip-archive-gitlab-ci.
I am trying to fix the issue in a more robust way by simply calling pip
on the requirements.txt
file. Maybe instructions.md could document this more robust workflow?
Currently, the stream list is sorted by the number of topics. This suits Lean's use case, but other projects may prefer a different sorting method, e.g. Rust prefers sorting alphabetically.
This can be solved dynamically by adding js code to the index.html output.
See #2 (comment) for context of the discussion.
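As an alternative to client-side JS, the ordering could also be a build-time option. The following is only a sketch under assumptions: `StreamSummary`, `sort_streams`, and the `order` values are hypothetical names, not part of zulip-archive.

```python
from typing import List, NamedTuple


class StreamSummary(NamedTuple):
    # Hypothetical record; the real tool stores stream data differently.
    name: str
    topic_count: int


def sort_streams(streams: List[StreamSummary], order: str = "topics") -> List[StreamSummary]:
    """Return streams sorted by topic count (descending) or by name."""
    if order == "topics":
        # Ties on topic count fall back to a case-insensitive name sort.
        return sorted(streams, key=lambda s: (-s.topic_count, s.name.casefold()))
    if order == "alphabetical":
        return sorted(streams, key=lambda s: s.name.casefold())
    raise ValueError(f"unknown sort order: {order!r}")


if __name__ == "__main__":
    demo = [
        StreamSummary("general", 5),
        StreamSummary("Announce", 2),
        StreamSummary("t-libs", 9),
    ]
    print([s.name for s in sort_streams(demo, "alphabetical")])
```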
The topic pages get stored at /<stream-id>-<stream-name>/topic/<topic-name>.html
. The topic index pages get stored at /stream/<stream-id>-<stream-name>/index.md
. (Notice the stream/
at the beginning.) The topic pages are permalinked with a stream/
at the beginning, so things work after Jekyll builds the full site, but this seems needlessly complicated.
Example: https://github.com/leanprover-community/archive (Ignore the top level directories without -
after the IDs. Those are there to redirect from the old URL scheme.)
Line 36 in 7c772a6
Prod API at https://docs.github.com/en/rest/reference/pages#create-a-github-pages-site looks compatible with the old preview
Hi,
When using ./archive.py -i, old topics that have been resolved are shown as a new topic containing only the bot's resolve message.
It would be nice if, in this case, it updated the old topic instead.
E.g.
permalink: archive/213222general
with a stream id of 213222
and stream name of general
may cause collision with a stream id of 21322
and stream name of 2general
.
See https://chat.zulip.org/#narrow/stream/127-integrations/topic/zulip-archive/near/804118 for further discussion.
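The collision described above can be reproduced in a few lines. The two helper functions are hypothetical stand-ins for however the permalink is actually built; the point is only that concatenating the ID and name without a separator is ambiguous, while a `-` after the all-digit ID is not.

```python
def permalink_without_separator(stream_id: int, stream_name: str) -> str:
    # Mirrors the problematic scheme: ID and name are simply concatenated.
    return f"archive/{stream_id}{stream_name}"


def permalink_with_separator(stream_id: int, stream_name: str) -> str:
    # The ID consists only of digits, so the first '-' unambiguously ends it.
    return f"archive/{stream_id}-{stream_name}"


if __name__ == "__main__":
    a = (213222, "general")
    b = (21322, "2general")
    print(permalink_without_separator(*a), permalink_without_separator(*b))
    print(permalink_with_separator(*a), permalink_with_separator(*b))
```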
The name "zulip2.png" is a bit confusing, it would be better to rename it to zulip.png.
Similar to zulip/zulip-mobile#4610
There should be a link to open code block in playground if code playground is configured for the block's language.
https://github.com/tisonkun/zulip-archive/runs/4997354269?check_suite_focus=true
2022-01-30T15:05:17.2351750Z ##[group]Run zulip/zulip-archive@master
2022-01-30T15:05:17.2352844Z with:
2022-01-30T15:05:17.2353274Z zulip_organization_url: ***
2022-01-30T15:05:17.2353593Z zulip_bot_email: ***
2022-01-30T15:05:17.2353890Z zulip_bot_key: ***
2022-01-30T15:05:17.2354295Z github_personal_access_token: ***
2022-01-30T15:05:17.2354529Z delete_history: true
2022-01-30T15:05:17.2354755Z archive_branch: gh-pages
2022-01-30T15:05:17.2354991Z ##[endgroup]
2022-01-30T15:05:17.2600562Z ##[command]/usr/bin/docker run --name eaf078a84fb3749efb334b15627788c66_eca44b --label 84217e --workdir /github/workspace --rm -e INPUT_ZULIP_ORGANIZATION_URL -e INPUT_ZULIP_BOT_EMAIL -e INPUT_ZULIP_BOT_KEY -e INPUT_GITHUB_PERSONAL_ACCESS_TOKEN -e INPUT_DELETE_HISTORY -e INPUT_ARCHIVE_BRANCH -e HOME -e GITHUB_JOB -e GITHUB_REF -e GITHUB_SHA -e GITHUB_REPOSITORY -e GITHUB_REPOSITORY_OWNER -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RETENTION_DAYS -e GITHUB_RUN_ATTEMPT -e GITHUB_ACTOR -e GITHUB_WORKFLOW -e GITHUB_HEAD_REF -e GITHUB_BASE_REF -e GITHUB_EVENT_NAME -e GITHUB_SERVER_URL -e GITHUB_API_URL -e GITHUB_GRAPHQL_URL -e GITHUB_REF_NAME -e GITHUB_REF_PROTECTED -e GITHUB_REF_TYPE -e GITHUB_WORKSPACE -e GITHUB_ACTION -e GITHUB_EVENT_PATH -e GITHUB_ACTION_REPOSITORY -e GITHUB_ACTION_REF -e GITHUB_PATH -e GITHUB_ENV -e RUNNER_OS -e RUNNER_ARCH -e RUNNER_NAME -e RUNNER_TOOL_CACHE -e RUNNER_TEMP -e RUNNER_WORKSPACE -e ACTIONS_RUNTIME_URL -e ACTIONS_RUNTIME_TOKEN -e ACTIONS_CACHE_URL -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/work/zulip-archive/zulip-archive":"/github/workspace" 84217e:af078a84fb3749efb334b15627788c66 "***" "***" "***" "***" "true" "gh-pages"
2022-01-30T15:05:17.5575442Z % Total % Received % Xferd Average Speed Time Time Time Current
2022-01-30T15:05:17.5576823Z Dload Upload Total Spent Left Speed
2022-01-30T15:05:17.5582860Z
2022-01-30T15:05:17.7092965Z 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
2022-01-30T15:05:17.7103161Z 100 2548k 100 2548k 0 0 16.3M 0 --:--:-- --:--:-- --:--:-- 16.3M
2022-01-30T15:05:17.7541038Z ERROR: This script does not work on Python 3.6 The minimum supported Python version is 3.7. Please use https://bootstrap.pypa.io/pip/3.6/get-pip.py instead.
Since we mention Jekyll and GitHub Pages as a way to host this, it's probably worth writing out the precise steps for doing so in the documentation, since that could significantly reduce the barriers to setting this up.
@robertylewis FYI in case you can provide notes on how you did this.
We've been using the GH Action version of the script for a couple of months now, updating once an hour. The repo it pushes to is now 3.5 GB, and the checkout step of the action takes 7 minutes. Storing the full history of the html/md files is usually not necessary. It would be great if the action had an option to force push, so that the repo size doesn't explode.
Since last night, the archive build in JuliaCommunity/zulip-archive has been failing due to the following error:
Run zulip/zulip-archive@7c772a65b1dd1540d2bb64ee6f0b2860ccaf32c0
/usr/bin/docker run --name bcf09c6e0c2241d25423eb0bae2c81957d3c6_e63042 --label 2bcf09 --workdir /github/workspace --rm -e INPUT_ZULIP_ORGANIZATION_URL -e INPUT_ZULIP_BOT_EMAIL -e INPUT_ZULIP_BOT_KEY -e INPUT_GITHUB_PERSONAL_ACCESS_TOKEN -e INPUT_DELETE_HISTORY -e INPUT_ARCHIVE_BRANCH -e HOME -e GITHUB_JOB -e GITHUB_REF -e GITHUB_SHA -e GITHUB_REPOSITORY -e GITHUB_REPOSITORY_OWNER -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RETENTION_DAYS -e GITHUB_RUN_ATTEMPT -e GITHUB_ACTOR -e GITHUB_WORKFLOW -e GITHUB_HEAD_REF -e GITHUB_BASE_REF -e GITHUB_EVENT_NAME -e GITHUB_SERVER_URL -e GITHUB_API_URL -e GITHUB_GRAPHQL_URL -e GITHUB_REF_NAME -e GITHUB_REF_PROTECTED -e GITHUB_REF_TYPE -e GITHUB_WORKSPACE -e GITHUB_ACTION -e GITHUB_EVENT_PATH -e GITHUB_ACTION_REPOSITORY -e GITHUB_ACTION_REF -e GITHUB_PATH -e GITHUB_ENV -e GITHUB_STEP_SUMMARY -e RUNNER_OS -e RUNNER_ARCH -e RUNNER_NAME -e RUNNER_TOOL_CACHE -e RUNNER_TEMP -e RUNNER_WORKSPACE -e ACTIONS_RUNTIME_URL -e ACTIONS_RUNTIME_TOKEN -e ACTIONS_CACHE_URL -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/work/zulip-archive/zulip-archive":"/github/workspace" 2bcf09:c6e0c2241d25423eb0bae2c81957d3c6 "***" "***" "***" "***" "true" "master"
fatal: unsafe repository ('/github/workspace' is owned by someone else)
To add an exception for this directory, call:
git config --global --add safe.directory /github/workspace
I'm not sure where this error occurs or where that git command would have to go.
Current runner version: '2.289.2'
Operating System
Virtual Environment
Environment: ubuntu-20.04
Version: 20220405.4
Included Software: https://github.com/actions/virtual-environments/blob/ubuntu20/20220405.4/images/linux/Ubuntu2004-Readme.md
Image Release: https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20220405.4
Virtual Environment Provisioner
1.0.0.0-main-20220325-1
GITHUB_TOKEN Permissions
Actions: write
Checks: write
Contents: write
Deployments: write
Discussions: write
Issues: write
Metadata: read
Packages: write
Pages: write
PullRequests: write
RepositoryProjects: write
SecurityEvents: write
Statuses: write
Secret source: Actions
Prepare workflow directory
Prepare all required actions
Getting action download info
Download action repository 'actions/checkout@5a4ac9002d0be2fb38bd78e4b4dbde5606d7042f' (SHA:5a4ac9002d0be2fb38bd78e4b4dbde5606d7042f)
Download action repository 'zulip/zulip-archive@7c772a65b1dd1540d2bb64ee6f0b2860ccaf32c0' (SHA:7c772a65b1dd1540d2bb64ee6f0b2860ccaf32c0)
I think it would be a good idea to run it in a GitHub Actions workflow.
Hello! I've tried to run the code, but I get the error in the title, which prevents the program from finishing (at least I don't get any archive when following the link). I'm no specialist, but I followed the instructions precisely and looked in various places for how to fix it myself, without success. Any help would be appreciated! Thanks :)
At the moment, images from threads are displayed by using the original Zulip URL which results in a 401 if the server is not public (e.g. here).
Does someone know a workaround for this? As last resort I might consider downloading them during text scraping.
From #2 (comment) :
% include
is emitted into the markdown files which isn't Markdown -- I think we should avoid doing so in the official tool, since it presumes you're using something like Jekyll or a similar markdown renderer which is a superset of markdown.
I've started seeing consistent GitHub Actions failures recently:
error: RPC failed; HTTP 408 curl 18 HTTP/2 stream 7 was reset
send-pack: unexpected disconnect while reading sideband packet
fatal: the remote end hung up unexpectedly
Full log is available here.
It is an archive of a living Zulip organization, so it is constantly growing in size. I suspected the size simply got too big and overwriting the whole repo started to hit some kind of a limit. Therefore I tried to change action configuration to delete_history: false
and the next run was a success.
I am filing this issue just to confirm that there is a size limit for an archive that can be published with GitHub Actions. Maybe there is a fix, maybe it simply needs to be documented.
From
Lines 244 to 271 in 7c772a6
... it looks like on topic pages the custom header HTML isn't used, but custom footer HTML is used. This means
div
and the custom footer closes them -- since the tag will be closed without having been opened on topic pages. It would be good to support some basic string interpolation in the custom header to allow topics to appear (or, if this is out of scope, only apply the hard-coded HTML if no custom HTML header has been passed in).
When populate.py encounters an empty messages list around line 149, this leads to an index-out-of-bounds error on line 151: last_message = messages[-1]
A simple fix is:
index 9e93e839d23..9951af3cae2 100644
--- b/lib/populate.py
+++ a/lib/populate.py
@@ -147,6 +147,8 @@ def populate_all(
}
messages = request_all(client, request)
+ if not messages:
+ continue
topic_count = len(messages)
last_message = messages[-1]
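The guard in the diff above can be sketched in isolation as follows. `summarize_topics` and the shape of its input are assumptions made for illustration; only the `if not messages: continue` check and the `messages[-1]` access come from the report.

```python
from typing import Dict, List


def summarize_topics(topic_messages: Dict[str, List[dict]]) -> Dict[str, dict]:
    """Summarize each topic, skipping topics that returned no messages."""
    summaries = {}
    for topic, messages in topic_messages.items():
        if not messages:
            # Without this guard, messages[-1] raises IndexError.
            continue
        last_message = messages[-1]
        summaries[topic] = {
            "size": len(messages),
            "latest_date": last_message["timestamp"],
        }
    return summaries


if __name__ == "__main__":
    data = {
        "empty topic": [],
        "hello": [{"timestamp": 1}, {"timestamp": 2}],
    }
    print(summarize_topics(data))
```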
Emoji reactions may be as important as messages as they often represent a short reply. Would be nice to include them into archive.
On https://zulip-archive.rust-lang.org/219381tlibs/index.html, the ’
character in the Panic message from `catch_unwind`’s `dyn Any`
topic renders in Firefox instead as â€™
. That page’s HTML does not have a <meta charset>
tag, so the encoding defaults to windows-1252.
On https://zulip-archive.rust-lang.org/219381tlibs/07261PanicmessagefromcatchunwindsdynAny.html which does have <meta charset="utf-8">
, the ’
character renders correctly.
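The effect can be reproduced outside the browser: the UTF-8 bytes of the right single quotation mark, read as windows-1252, come out as three unrelated characters. The helper `add_meta_charset` is a hypothetical sketch of the fix (declaring the encoding on index pages as well), not code from the tool.

```python
RIGHT_QUOTE = "\u2019"  # the ’ character from the topic title

# UTF-8 encodes it as three bytes ...
utf8_bytes = RIGHT_QUOTE.encode("utf-8")
# ... which a browser defaulting to windows-1252 decodes one byte at a time.
misread = utf8_bytes.decode("cp1252")


def add_meta_charset(html: str) -> str:
    """Insert a UTF-8 charset declaration after <head> if none is present."""
    if "<meta charset" in html.lower():
        return html
    return html.replace("<head>", '<head>\n<meta charset="utf-8">', 1)


if __name__ == "__main__":
    print(utf8_bytes, misread)
    print(add_meta_charset("<html><head><title>t-libs</title></head></html>"))
```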
I noticed https://zulip-archive.rust-lang.org/ doesn't have any CSS; it seems like it'd be worth providing a basic stylesheet (maybe just borrowed from Lean) to provide a nicer feel.
Hello! I was wondering if there was a way to not share the user's GitHub Personal Access Token with other people when using zulip-archive inside an organization. That is, to make a zulip-archive instance for an organization, one needs to create a PAT for a member with relevant permissions and store it in a secret accessible from within the organization i.e. an organization repository secret or an organization-wide secret. In both cases, other members of the organization can access the generated PAT, which (with the repo
and workflow
permissions) could be used to make changes to all of the owner's repositories. More details on the scopes are here.
I remembered that I used actions-gh-pages for some other GH page with static files (generated with Zola) and it only needed the generic secrets.GITHUB_TOKEN
as described here.
on:
  push:
    branches:
      - source  # Default branch
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Setup
        uses: actions/checkout@v2
      - name: Build
        run: |
          VERSION="v0.11.0"
          curl -L https://github.com/getzola/zola/releases/download/$VERSION/zola-$VERSION-x86_64-unknown-linux-gnu.tar.gz > zola.tar.gz
          tar -xzf zola.tar.gz
          ./zola build
      - name: Deploy
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./public
          publish_branch: master  # Deploying branch (not default)
The entrypoint.sh
in zulip-archive seems to do all the work that could be delegated to actions-gh-pages
and specifically uses the PAT as the token. Could this be substituted with a token either specific to the organization (i.e. the OAuth App) or the secrets.GITHUB_TOKEN
as described above? One potential issue with using the latter is some additional steps on first run, described here.
Is there a reason for which the PAT was preferred? Apologies if I missed something obvious here and thanks for reading!
https://www.jcchouinard.com/create-xml-sitemap-with-python/ has an example of doing it via jinja2 for templating the xml content.
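The linked article uses jinja2 templates; as a dependency-free alternative, the same XML can be built with the standard library. This is only a sketch: `build_sitemap` is a hypothetical name, the example URLs are placeholders, and it uses `xml.etree.ElementTree` rather than the jinja2 approach from the link.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: Iterable[Tuple[str, Optional[str]]]) -> str:
    """Return sitemap XML for (location, last-modified date or None) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as '&' in text automatically.
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([
        ("https://example.com/archive/index.html", "2022-01-30"),
        ("https://example.com/archive/stream/1-general/index.html", None),
    ]))
```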
The archive copy of streams that are web-public should have a prominent link to the web-public view, as that will lead to the best experience for viewing such streams.
Related issue: #80
#73 reported an issue with pip not installing because of the Python version being outdated.
https://github.com/zulip/zulip-archive/pull/74/files should be an instant fix for this.
The Zulip docker ships with Python 3.6.9. The long term fix here would be to upgrade the docker image to one that comes with a newer Python version.
I think choosing one of the images from https://hub.docker.com/_/python should most likely be good enough.
Once you verify that the action works successfully, update the Dockerfile of zulip-action to point to the correct image.
https://github.com/zulip/zulip-archive/blob/master/Dockerfile#L1
From what I remember from 2 years ago, you most likely don't need to make any custom modifications to the image. Choosing one of the Python images should be good enough. But if a custom modification is required, get the docker image pushed to the Zulip Docker account, and then update the Dockerfile.
in populate.py, safe_request, rsp['result']
isn't defined (at least for me). So instead of having an error about an invalid key (I was using the wrong email) I got errors about rsp
.
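One way to surface the real error is to read the response defensively before indexing into it. This is a sketch under assumptions: `check_response` is a hypothetical helper, and it assumes the response is a dict that normally carries `result` and `msg` keys; it is not the body of the actual `safe_request`.

```python
def check_response(rsp: dict) -> dict:
    """Return rsp if it reports success; otherwise raise with the server's message."""
    # .get() avoids a KeyError when the response has no 'result' key at all.
    result = rsp.get("result")
    if result != "success":
        msg = rsp.get("msg", "no error message in response")
        raise RuntimeError(f"Zulip API request failed ({result!r}): {msg}")
    return rsp


if __name__ == "__main__":
    print(check_response({"result": "success", "messages": []}))
    try:
        check_response({"msg": "Invalid API key"})
    except RuntimeError as exc:
        print(exc)
```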
It seems like it'd be worth having at least "last message" datetimes available in the top-level pages.
When attempting to add a Secret github_personal_access_token
the error Failed to add secret. Name is invalid.
is returned. Has GitHub perhaps changed something recently that makes it impossible to use a secret with this kind of name now?
page_head_html
and page_footer_html
will be introduced in #62. But they have yet to be exposed to the GHA user. See the discussion in #62 (comment).
I'm finally moving the leanprover archive to your new setup, since we're refactoring the website. But I'm having some issues with the GitHub Action setup.
You can see a failing run here:
https://github.com/leanprover-community/archive2/runs/615909057?check_suite_focus=true
It's odd, because I had a successful run with an identical configuration on another repo. One difference is that GH Pages was already enabled at the start of the successful build. This shows up in the log of the successful build:
Successfully installed pyyaml-5.2
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
{
100 156 100 143 100 13 1833 166 --:--:-- --:--:-- --:--:-- 2000
"message": "GitHub Pages is already enabled.",
"documentation_url": "https://developer.github.com/v3/repos/pages/#enable-a-pages-site"
}
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 251 100 251 0 0 3217 0 --:--:-- --:--:-- --:--:-- 3217
where the failing one errors.
I'm not familiar enough with Actions infrastructure to diagnose what's going wrong. If it's expected that Pages is enabled at the start, maybe this could go in the instructions?
My archive's homepage has a list of streams. When I click a stream I see the stream name (as a link to itself) and a list of topics. When I click a topic, the topic name is added to the top of the page as a link to itself. But at no point is a link back to the 'homepage' (list of streams) displayed anywhere...
Can you please add a link to the 'homepage'/list of streams to the top of each page?
This is my first time using GH Pages, so I was interested to see the theme chooser in my repo's settings. Especially after discovering that the archive pages were very bland indeed...
Unfortunately it doesn't seem to make any difference choosing a theme from the settings. I cannot make my archive look nice. Does support for the theme chooser need to be added to this project, or am I just missing something?
See my comment here: 01c1ac5#commitcomment-38731915
There are relative CSS links on every topic page that (1) are redirected off-site by the base href change, and (2) wouldn't work anyway unless a copy of the CSS file appears in every subdirectory.
The CSS file that this links to only has a single line. Maybe this should all just be deleted?
@robertylewis I just merged a series of commits, inspired by some work the rust-lang folks did in integrating this, that moves a bunch of the configuration settings to live in the zuliprc
file. Docs are present in the README.md. Can you test it out and make sure it didn't break anything for Lean?
Also you may want to look at the Rust folks changes: https://github.com/zulip/zulip_archive/commits/rust-remaining to see if any of those you'd like for Lean as well. Here are my notes to them on next steps:
One question I had for you:
ex. https://juliacommunity.github.io/zulip-archive/stream/274208-helpdesk-(published)/topic/.60eachline(.3A.3AIOBuffer).60.20for.20other.20delimiters.html (source at https://github.com/JuliaCommunity/zulip-archive/blob/master/274208-helpdesk-(published)/topic/.60eachline(.3A.3AIOBuffer).60.20for.20other.20delimiters.html)
I speculate this has something to do with the leading .
(generated in this case by the post-URLencoding % -> . replacement). Does Jekyll have some special logic for this that is tripping things up?
README.md, Step 3, correctly states that user should store a single secret zuliprc.
However, in step 6 the old way of using three secrets for the URL, bot email, and API key is still lingering, causing new users like me all sorts of confusion.
As per the docs, the encoding=
arg has been removed in 3.9.
https://github.com/psf/black.
And then
I can archive it locally or in a private repository, but those streams are also valuable to be archived.
Now that the Zulip server supports web-public streams, it seems like many organizations using this feature will want specifically the web-public streams in their organization to be included in their archive.
Note that because the native web public streams feature doesn't support search engine indexing, and adding that will be a big project, this will become the main use case for zulip-archive in open organizations.
This project will also continue to be useful for non-public organizations that are shutting down to create an HTML archive for posterity -- so while I think this would be a good default, we want it to remain easy to specify directly in the tool which streams to archive.