zulip-archive's Issues

Allow use of GITHUB_TOKEN when running in GH Actions

Ideally this tool could run as a GitHub Action without the need to manage a personal access token (since scopes on personal tokens are coarse grained, and best practices keep them short lived).

Relevant bits:

Switch 'master' to 'main'

Since Git(Hub) changed the default branch name to main, zulip-archive should use that when committing new changes, I believe here and in subsequent lines.

Even better would be to allow the user to specify a preferred branch name.

Strengthen URL escaping

The current escaping for URLs isn't strong enough. #34 fixed escaping ? which was causing broken links. But there are other bad symbols coming through.

Here's a URL which works but is malformed: https://leanprover-community.github.io/archive/stream/218272-FoMM-/-Lean-Together-2020/index.html (notice it's in a subdirectory of /stream/218272-FoMM-). The Zulip URL for this stream is better: https://leanprover.zulipchat.com/#narrow/stream/218272-FoMM-.2F.20Lean.20Together.202020

I haven't noticed other malformed links yet but I expect to find some. Our users like to make annoying stream/topic names. The Zulip escaping seems to be 100% reliable for us, even with the annoying names, so maybe that should be duplicated here?

One note, URL changes are disruptive, so this probably shouldn't be done incrementally. If anything is changing it should all change at once.
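For reference, here is a minimal sketch of the dot-style escaping visible in the Zulip URL above, where `/` becomes `.2F` and a space becomes `.20`. It approximates the scheme and is not copied from Zulip's code; `escape_component` is a hypothetical name, and Zulip's own rules may differ in details (for example in how spaces in stream names are handled).

```python
from urllib.parse import quote


def escape_component(name: str) -> str:
    """Percent-encode a stream or topic name, then use '.' in place of '%'."""
    # safe="" makes quote() encode '/' and every other reserved character.
    encoded = quote(name, safe="")
    # Encode literal dots first so they cannot be confused with escapes.
    return encoded.replace(".", "%2E").replace("%", ".")


print(escape_component("FoMM / Lean Together 2020"))
```

Because every reserved character is escaped, the result never contains a `/` or `?`, so a name like this one cannot create an unintended subdirectory.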

instructions.md: dependency list is now incorrect

xml-sitemap-writer was added as a dependency last August by 980675c, and it is not listed as a dependency in the current instructions.md.

Note: Along with @Zimmi48 we maintain a GitLab CI script to produce Zulip archives, similar to the GitHub Actions setup, and this is how I noticed the issue. https://gitlab.com/gasche/zulip-archive-gitlab-ci.

I am trying to fix the issue in a more robust way by simply calling pip on the requirements.txt file. Maybe instructions.md could document this more robust workflow?

Allow multiple ways to sort the stream list

Currently, the stream list is sorted by the number of topics. This is the use case of Lean, but other projects may prefer a different sorting method, e.g. Rust prefers sorting alphabetically.

This can be solved dynamically by adding js code to the index.html output.

See #2 (comment) for context of the discussion.
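As a static alternative to the JS approach, the order could also be chosen when the index is generated. The sketch below is only an illustration: `sort_streams`, the `order` values and the name-to-topic-count mapping are hypothetical and do not reflect the tool's actual data structures.

```python
from typing import Dict, List


def sort_streams(topic_counts: Dict[str, int], order: str = "topics") -> List[str]:
    """Return stream names in the requested order.

    topic_counts maps stream name -> number of topics (hypothetical shape).
    """
    if order == "topics":
        # Most topics first; ties are broken alphabetically for stable output.
        return sorted(topic_counts, key=lambda name: (-topic_counts[name], name.lower()))
    if order == "alphabetical":
        # Case-insensitive so "Announce" and "announce" sort together.
        return sorted(topic_counts, key=str.lower)
    raise ValueError(f"unknown sort order: {order!r}")


counts = {"general": 5, "Announce": 2, "lean4": 9}
print(sort_streams(counts))
print(sort_streams(counts, order="alphabetical"))
```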

Directory structure is inconsistent

The topic pages get stored at /<stream-id>-<stream-name>/topic/<topic-name>.html. The topic index pages get stored at /stream/<stream-id>-<stream-name>/index.md. (Notice the stream/ at the beginning.) The topic pages are permalinked with a stream/ at the beginning, so things work after Jekyll builds the full site, but this seems needlessly complicated.

Example: https://github.com/leanprover-community/archive (Ignore the top level directories without - after the IDs. Those are there to redirect from the old URL scheme.)

Fail to archive due to Python version too low

https://github.com/tisonkun/zulip-archive/runs/4997354269?check_suite_focus=true

2022-01-30T15:05:17.2351750Z ##[group]Run zulip/zulip-archive@master
2022-01-30T15:05:17.2352844Z with:
2022-01-30T15:05:17.2353274Z   zulip_organization_url: ***
2022-01-30T15:05:17.2353593Z   zulip_bot_email: ***
2022-01-30T15:05:17.2353890Z   zulip_bot_key: ***
2022-01-30T15:05:17.2354295Z   github_personal_access_token: ***
2022-01-30T15:05:17.2354529Z   delete_history: true
2022-01-30T15:05:17.2354755Z   archive_branch: gh-pages
2022-01-30T15:05:17.2354991Z ##[endgroup]
2022-01-30T15:05:17.2600562Z ##[command]/usr/bin/docker run --name eaf078a84fb3749efb334b15627788c66_eca44b --label 84217e --workdir /github/workspace --rm -e INPUT_ZULIP_ORGANIZATION_URL -e INPUT_ZULIP_BOT_EMAIL -e INPUT_ZULIP_BOT_KEY -e INPUT_GITHUB_PERSONAL_ACCESS_TOKEN -e INPUT_DELETE_HISTORY -e INPUT_ARCHIVE_BRANCH -e HOME -e GITHUB_JOB -e GITHUB_REF -e GITHUB_SHA -e GITHUB_REPOSITORY -e GITHUB_REPOSITORY_OWNER -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RETENTION_DAYS -e GITHUB_RUN_ATTEMPT -e GITHUB_ACTOR -e GITHUB_WORKFLOW -e GITHUB_HEAD_REF -e GITHUB_BASE_REF -e GITHUB_EVENT_NAME -e GITHUB_SERVER_URL -e GITHUB_API_URL -e GITHUB_GRAPHQL_URL -e GITHUB_REF_NAME -e GITHUB_REF_PROTECTED -e GITHUB_REF_TYPE -e GITHUB_WORKSPACE -e GITHUB_ACTION -e GITHUB_EVENT_PATH -e GITHUB_ACTION_REPOSITORY -e GITHUB_ACTION_REF -e GITHUB_PATH -e GITHUB_ENV -e RUNNER_OS -e RUNNER_ARCH -e RUNNER_NAME -e RUNNER_TOOL_CACHE -e RUNNER_TEMP -e RUNNER_WORKSPACE -e ACTIONS_RUNTIME_URL -e ACTIONS_RUNTIME_TOKEN -e ACTIONS_CACHE_URL -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/work/zulip-archive/zulip-archive":"/github/workspace" 84217e:af078a84fb3749efb334b15627788c66  "***" "***" "***" "***" "true" "gh-pages"
2022-01-30T15:05:17.5575442Z   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2022-01-30T15:05:17.5576823Z                                  Dload  Upload   Total   Spent    Left  Speed
2022-01-30T15:05:17.5582860Z 
2022-01-30T15:05:17.7092965Z   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
2022-01-30T15:05:17.7103161Z 100 2548k  100 2548k    0     0  16.3M      0 --:--:-- --:--:-- --:--:-- 16.3M
2022-01-30T15:05:17.7541038Z ERROR: This script does not work on Python 3.6 The minimum supported Python version is 3.7. Please use https://bootstrap.pypa.io/pip/3.6/get-pip.py instead.

Option to push updates without preserving history

We've been using the GH Action version of the script for a couple of months now, updating once an hour. The repo it pushes to is now 3.5 GB, and the checkout step of the action takes 7 minutes. Storing the full history of the html/md files is usually not necessary. It would be great if the action had an option to force push, so that the repo size doesn't explode.

Action fails due to fatal "unsafe repository" error

Since last night, the archive build in JuliaCommunity/zulip-archive has been failing due to the following error:

Run zulip/zulip-archive@7c772a65b1dd1540d2bb64ee6f0b2860ccaf32c0
/usr/bin/docker run --name bcf09c6e0c2241d25423eb0bae2c81957d3c6_e63042 --label 2bcf09 --workdir /github/workspace --rm -e INPUT_ZULIP_ORGANIZATION_URL -e INPUT_ZULIP_BOT_EMAIL -e INPUT_ZULIP_BOT_KEY -e INPUT_GITHUB_PERSONAL_ACCESS_TOKEN -e INPUT_DELETE_HISTORY -e INPUT_ARCHIVE_BRANCH -e HOME -e GITHUB_JOB -e GITHUB_REF -e GITHUB_SHA -e GITHUB_REPOSITORY -e GITHUB_REPOSITORY_OWNER -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RETENTION_DAYS -e GITHUB_RUN_ATTEMPT -e GITHUB_ACTOR -e GITHUB_WORKFLOW -e GITHUB_HEAD_REF -e GITHUB_BASE_REF -e GITHUB_EVENT_NAME -e GITHUB_SERVER_URL -e GITHUB_API_URL -e GITHUB_GRAPHQL_URL -e GITHUB_REF_NAME -e GITHUB_REF_PROTECTED -e GITHUB_REF_TYPE -e GITHUB_WORKSPACE -e GITHUB_ACTION -e GITHUB_EVENT_PATH -e GITHUB_ACTION_REPOSITORY -e GITHUB_ACTION_REF -e GITHUB_PATH -e GITHUB_ENV -e GITHUB_STEP_SUMMARY -e RUNNER_OS -e RUNNER_ARCH -e RUNNER_NAME -e RUNNER_TOOL_CACHE -e RUNNER_TEMP -e RUNNER_WORKSPACE -e ACTIONS_RUNTIME_URL -e ACTIONS_RUNTIME_TOKEN -e ACTIONS_CACHE_URL -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/work/zulip-archive/zulip-archive":"/github/workspace" 2bcf09:c6e0c2241d25423eb0bae2c81957d3c6  "***" "***" "***" "***" "true" "master"
fatal: unsafe repository ('/github/workspace' is owned by someone else)
To add an exception for this directory, call:

	git config --global --add safe.directory /github/workspace

(Example build)

I'm not sure where this error occurs or where that git command would have to go.

Runner info
Current runner version: '2.289.2'
Operating System
Virtual Environment
  Environment: ubuntu-20.04
  Version: 20220405.4
  Included Software: https://github.com/actions/virtual-environments/blob/ubuntu20/20220405.4/images/linux/Ubuntu2004-Readme.md
  Image Release: https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20220405.4
Virtual Environment Provisioner
  1.0.0.0-main-20220325-1
GITHUB_TOKEN Permissions
  Actions: write
  Checks: write
  Contents: write
  Deployments: write
  Discussions: write
  Issues: write
  Metadata: read
  Packages: write
  Pages: write
  PullRequests: write
  RepositoryProjects: write
  SecurityEvents: write
  Statuses: write
Secret source: Actions
Prepare workflow directory
Prepare all required actions
Getting action download info
Download action repository 'actions/checkout@5a4ac9002d0be2fb38bd78e4b4dbde5606d7042f' (SHA:5a4ac9002d0be2fb38bd78e4b4dbde5606d7042f)
Download action repository 'zulip/zulip-archive@7c772a65b1dd1540d2bb64ee6f0b2860ccaf32c0' (SHA:7c772a65b1dd1540d2bb64ee6f0b2860ccaf32c0)

Showing images

At the moment, images from threads are displayed by using the original Zulip URL which results in a 401 if the server is not public (e.g. here).

Does someone know a workaround for this? As a last resort I might consider downloading them during text scraping.
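A first step toward downloading images during scraping would be finding which image URLs a message references. This is a rough sketch using only the standard library. It assumes uploaded images are served from a `/user_uploads/` path on the Zulip server; `UploadCollector` and `find_uploaded_images` are hypothetical names, and the authenticated download itself is left out.

```python
from html.parser import HTMLParser
from typing import List


class UploadCollector(HTMLParser):
    """Collect <img> sources that point at the /user_uploads/ path."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        src = dict(attrs).get("src") or ""
        if "/user_uploads/" in src:
            self.sources.append(src)


def find_uploaded_images(message_html: str) -> List[str]:
    """Return the upload URLs referenced by one rendered message."""
    parser = UploadCollector()
    parser.feed(message_html)
    return parser.sources
```

Each returned URL could then be fetched with the bot's credentials, stored next to the archive, and the `src` rewritten to the local copy.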

GitHub Actions started to fail, probably due to `delete_history: true` and growing archive size

I've started seeing consistent GitHub Actions failures recently:

error: RPC failed; HTTP 408 curl 18 HTTP/2 stream 7 was reset
send-pack: unexpected disconnect while reading sideband packet
fatal: the remote end hung up unexpectedly

Full log is available here.

It is an archive of a living Zulip organization, so it is constantly growing in size. I suspected the size simply got too big and overwriting the whole repo started to hit some kind of a limit. Therefore I tried to change action configuration to delete_history: false and the next run was a success.

I am filing this issue just to confirm that there is a size limit for an archive that can be published with GitHub Actions. Maybe there is a fix, maybe it simply needs to be documented.

Custom header/footer fails on topic pages?

From

# We use a topic-specific title instead of `page_head_html` to improve
# search engine indexing.
outfile.write(
    to_topic_page_head_html(
        html.escape(topic_name) + " · " + html.escape(stream_name) + " · " + title
    )
)
outfile.write(topic_links)
outfile.write(
    f'\n<head><link href="{html.escape(site_url)}/style.css" rel="stylesheet"></head>\n'
)
for msg in messages:
    msg_html = format_message_html(
        site_url,
        html_root,
        zulip_url,
        zulip_icon_url,
        stream_name,
        stream_id,
        topic_name,
        msg,
    )
    outfile.write(msg_html)
    outfile.write("\n\n")
outfile.write(date_footer_html)
outfile.write(page_footer_html)

... it looks like on topic pages the custom header HTML isn't used, but custom footer HTML is used. This means

  1. Broken tags if the custom header opens things like a div and the custom footer closes them -- since the tag will be closed without having been opened on topic pages
  2. Custom styling is lost on topic pages

Would be good to support some basic string interpolation in the custom header to allow topics to appear (or if this is out of scope: only apply the hard-coded HTML if no custom HTML header has been passed in).
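One way the string interpolation could look, as a sketch only: the standard library's `string.Template` fills `$topic`, `$stream` and `$title` placeholders in the custom header. The placeholder names and `render_custom_head` are hypothetical and are not part of zulip-archive. Unlike the excerpt above, this sketch also escapes the title.

```python
import html
from string import Template


def render_custom_head(
    custom_head_html: str, topic_name: str, stream_name: str, title: str
) -> str:
    """Fill $topic, $stream and $title placeholders in a custom header."""
    values = {
        "topic": html.escape(topic_name),
        "stream": html.escape(stream_name),
        "title": html.escape(title),
    }
    # safe_substitute leaves unknown placeholders and stray '$' untouched,
    # so a header written without placeholders passes through unchanged.
    return Template(custom_head_html).safe_substitute(values)


head = "<title>$topic · $stream · $title</title><div class='wrap'>"
print(render_custom_head(head, "Panic message", "t-libs", "Archive"))
```

With this, the same custom header could be written on topic pages too, so a div opened in the header is always matched by the footer that closes it.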

populate.py: crash on empty message list

When populate.py encounters an empty messages list around line 149, this leads to an index-out-of-bounds error on line 151, at last_message = messages[-1].

A simple fix is:

index 9e93e839d23..9951af3cae2 100644
--- a/lib/populate.py
+++ b/lib/populate.py
@@ -147,6 +147,8 @@ def populate_all(
             }
 
             messages = request_all(client, request)
+            if not messages:
+                continue
 
             topic_count = len(messages)
             last_message = messages[-1]

Non-ASCII topics render with mojibake in a stream’s topic list

On https://zulip-archive.rust-lang.org/219381tlibs/index.html, the ’ character in the Panic message from `catch_unwind`’s `dyn Any` topic renders in Firefox as â€™ instead. That page’s HTML does not have a <meta charset> tag, so the encoding defaults to windows-1252.

On https://zulip-archive.rust-lang.org/219381tlibs/07261PanicmessagefromcatchunwindsdynAny.html which does have <meta charset="utf-8">, the character renders correctly.
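The effect can be reproduced in a few lines of Python. This is purely illustrative and is not code from the archive tool: it encodes the curly apostrophe as UTF-8 and then decodes those bytes as windows-1252, which is what a browser does when no charset is declared and it falls back to that encoding.

```python
original = "\u2019"  # RIGHT SINGLE QUOTATION MARK, as in "catch_unwind’s"

# The page is written out as UTF-8 bytes...
utf8_bytes = original.encode("utf-8")

# ...but a browser falling back to windows-1252 decodes each byte separately.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)
print(misread)
```

Declaring the encoding in the stream index pages, as the topic pages already do with `<meta charset="utf-8">`, would avoid the fallback.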

Is the Personal Access Token needed?

Hello! I was wondering if there was a way to not share the user's GitHub Personal Access Token with other people when using zulip-archive inside an organization. That is, to make a zulip-archive instance for an organization, one needs to create a PAT for a member with relevant permissions and store it in a secret accessible from within the organization i.e. an organization repository secret or an organization-wide secret. In both cases, other members of the organization can access the generated PAT, which (with the repo and workflow permissions) could be used to make changes to all of the owner's repositories. More details on the scopes are here.

I remembered that I used actions-gh-pages for some other GH page with static files (generated with Zola) and it only needed the generic secrets.GITHUB_TOKEN as described here.

on:
  push:
    branches:
      - source # Default branch
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - name: Setup
      uses: actions/checkout@v2
    - name: Build
      run: |
            VERSION="v0.11.0"
            curl -L https://github.com/getzola/zola/releases/download/$VERSION/zola-$VERSION-x86_64-unknown-linux-gnu.tar.gz > zola.tar.gz
            tar -xzf zola.tar.gz
            ./zola build
    - name: Deploy
      uses: peaceiris/actions-gh-pages@v3
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: ./public
        publish_branch: master # Deploying branch (not default)

The entrypoint.sh in zulip-archive seems to do all the work that could be delegated to actions-gh-pages and specifically uses the PAT as the token. Could this be substituted with a token either specific to the organization (i.e. the OAuth App) or the secrets.GITHUB_TOKEN as described above? One potential issue with using the latter is some additional steps on first run, described here.
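To make the idea concrete, here is a sketch of how a push URL could be built from the workflow token instead of a PAT. It is written in Python for illustration, whereas the entrypoint is a shell script, and both function names are hypothetical. It assumes the workflow passes `secrets.GITHUB_TOKEN` into the environment as `GITHUB_TOKEN` and that the token has write permission on repository contents; whether the rest of the action works with that token is untested here.

```python
import os


def authenticated_remote(repository: str, token: str) -> str:
    """Build an HTTPS push URL for an `owner/name` repository using a token."""
    # "x-access-token" is the username GitHub accepts for token-based pushes.
    return f"https://x-access-token:{token}@github.com/{repository}.git"


def remote_from_environment(env=os.environ) -> str:
    """Read the repository and token from the Actions environment."""
    # GITHUB_REPOSITORY is set by the runner; GITHUB_TOKEN must be passed in
    # explicitly by the workflow (assumption of this sketch).
    return authenticated_remote(env["GITHUB_REPOSITORY"], env["GITHUB_TOKEN"])
```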

Is there a reason for which the PAT was preferred? Apologies if I missed something obvious here and thanks for reading!

Link to web-public streams view

The archive copy of streams that are web-public should have a prominent link to the web-public view, as that will lead to the best experience for viewing such streams.

Related issue: #80

Update docker image of Zulip action to an image that comes with a newer Python version

#73 reported an issue with pip not installing because of the Python version being outdated.

https://github.com/zulip/zulip-archive/pull/74/files should be an instant fix for this.

The Zulip docker ships with Python 3.6.9. The long term fix here would be to upgrade the docker image to one that comes with a newer Python version.

I think choosing one of the images from https://hub.docker.com/_/python should most likely be good enough.

And once you verify that the action works successfully, update the Dockerfile of zulip-action to point to the correct image.
https://github.com/zulip/zulip-archive/blob/master/Dockerfile#L1

From what I remember from 2 years ago, you most likely don't need to do any custom modifications to the image. Choosing one of the Python images should be good enough. But if a custom modification is required, get the Docker image pushed to the Zulip Docker account, and then update the Dockerfile.

error in error handling

in populate.py, safe_request, rsp['result'] isn't defined (at least for me). So instead of having an error about an invalid key (I was using the wrong email) I got errors about rsp.
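A defensive version might look like the sketch below. It is only a guess at the shape of the helper: the real `safe_request` in populate.py may have a different signature and retry logic. The point is just to use `.get()` so that a response without a `result` key still produces a readable error.

```python
def safe_request(cmd, *args, **kwargs):
    """Call a Zulip client method and raise a readable error on failure."""
    rsp = cmd(*args, **kwargs)
    # .get() avoids a KeyError when the response has no "result" field.
    if not isinstance(rsp, dict) or rsp.get("result") != "success":
        msg = rsp.get("msg", rsp) if isinstance(rsp, dict) else rsp
        raise RuntimeError(f"Zulip API request failed: {msg}")
    return rsp
```

With this shape, a wrong email or API key surfaces as the server's own message rather than as an error about `rsp`.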

Can't create github_personal_access_token

When attempting to add a Secret github_personal_access_token the error Failed to add secret. Name is invalid. is returned. Has GitHub perhaps changed something recently that makes it impossible to use a secret with this kind of name now?

Failure with GH Actions

I'm finally moving the leanprover archive to your new setup, since we're refactoring the website. But I'm having some issues with the GitHub Action setup.

You can see a failing run here:

https://github.com/leanprover-community/archive2/runs/615909057?check_suite_focus=true

It's odd, because I had a successful run with an identical configuration on another repo. One difference is that GH Pages was already enabled at the start of the successful build. This shows up in the log of the successful build:

Successfully installed pyyaml-5.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
{
100   156  100   143  100    13   1833    166 --:--:-- --:--:-- --:--:--  2000
  "message": "GitHub Pages is already enabled.",
  "documentation_url": "https://developer.github.com/v3/repos/pages/#enable-a-pages-site"
}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   251  100   251    0     0   3217      0 --:--:-- --:--:-- --:--:--  3217

where the failing one errors.

I'm not familiar enough with Actions infrastructure to diagnose what's going wrong. If it's expected that Pages is enabled at the start, maybe this could go in the instructions?

Add link back to 'homepage'

My archive's homepage has a list of streams. When I click a stream I see the stream name (as a link to itself) and a list of topics. When I click a topic, the topic name is added to the top of the page as a link to itself. But at no point is a link back to the 'homepage' (list of streams) displayed anywhere...

Can you please add a link to the 'homepage'/list of streams to the top of each page?

Support GH Pages theme chooser

This is my first time using GH Pages, so I was interested to see the theme chooser in my repo's settings. Especially after discovering that the archive pages were very bland indeed...

Unfortunately it doesn't seem to make any difference choosing a theme from the settings. I cannot make my archive look nice. Does support for the theme chooser need to be added to this project, or am I just missing something?

Broken css link

See my comment here: 01c1ac5#commitcomment-38731915

There are relative CSS links on every topic page that (1) are redirected off-site by the base href change, and (2) wouldn't work anyway unless a copy of the CSS file appears in every subdirectory.

The CSS file that this links to only has a single line. Maybe this should all just be deleted?

retest this project with the Lean data set

@robertylewis I just merged a series of commits, inspired by some work the rust-lang folks did in integrating this, that moves a bunch of the configuration settings to live in the zuliprc file. Docs are present in the README.md. Can you test it out and make sure it didn't break anything for Lean?

Also you may want to look at the Rust folks' changes: https://github.com/zulip/zulip_archive/commits/rust-remaining to see if there are any you'd like for Lean as well. Here are my notes to them on next steps:

  • The "move CSS to separate file" commit we should just rebase to the bottom and merge.
  • For sort order, was the previous sort order "by last update"? I think both are potentially useful, so I wouldn't want to unconditionally apply that change. Ideally we'd just include something simple like sortable.js that allows the user to pick, and maybe have a config option for which is the default. So I'd probably tweak the commit to be an option for the default view and merge it, and then we can open an issue for adding JS sorting.
  • For the "Reformat" commit, we may need to split it. I suspect some of the styling changes may just be better for everyone. I don't fully understand the "last_updated" change; I think the intent of the previous model was to avoid needing to rewrite all the HTML/markdown when run in incremental mode, and I'm not sure if your restructuring loses that. Maybe it makes sense to open issues/small PRs for each independent change so we can see what the lean folks' original author thinks?

One question I had for you:

  • Would it make sense to move the code for pushing this to a GitHub Pages site to a separate script? The Rust folks removed that functionality in their branch I think mostly to declutter things, and I think I agree it'd be cleaner if that was a separate script in this repo that just pushed the archive.

404 on certain pages even when html view exists

ex. https://juliacommunity.github.io/zulip-archive/stream/274208-helpdesk-(published)/topic/.60eachline(.3A.3AIOBuffer).60.20for.20other.20delimiters.html (source at https://github.com/JuliaCommunity/zulip-archive/blob/master/274208-helpdesk-(published)/topic/.60eachline(.3A.3AIOBuffer).60.20for.20other.20delimiters.html)

I speculate this has something to do with the leading . (generated in this case by the post-URLencoding % -> . replacement). Does Jekyll have some special logic for this that is tripping things up?
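If the leading `.` is indeed the cause, a check like the following could flag affected paths before publishing. It rests on an assumption worth verifying against the Jekyll version GitHub Pages runs: that Jekyll skips files and directories whose names start with `.`, `_`, `#` or `~` unless they are explicitly included. The function name is hypothetical.

```python
# Prefixes Jekyll is assumed to skip by default (verify for your version).
SKIPPED_PREFIXES = (".", "_", "#", "~")


def may_be_skipped_by_jekyll(path: str) -> bool:
    """True if any component of the path starts with a skipped prefix."""
    parts = [part for part in path.split("/") if part]
    return any(part.startswith(SKIPPED_PREFIXES) for part in parts)


print(may_be_skipped_by_jekyll(
    "274208-helpdesk-(published)/topic/.60eachline(.3A.3AIOBuffer).60.20for.20other.20delimiters.html"
))
```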

README.md: inconsistent secrets handling

README.md, Step 3, correctly states that user should store a single secret zuliprc.

However, in step 6 the old way of using three secrets for URL, bot email and API key is still lingering, causing new users like me all sorts of confusion.

Add support for targeting specifically streams with the `is_web_public` flag

Now that the Zulip server supports web-public streams, it seems like many organizations using this feature will want specifically the web-public streams in their organization to be included in their archive.

Note that because the native web public streams feature doesn't support search engine indexing, and adding that will be a big project, this will become the main use case for zulip-archive in open organizations.

This project will also continue to be useful for non-public organizations that are shutting down to create an HTML archive for posterity -- so while I think this would be a good default, we want it to remain easy to specify directly in the tool which streams to archive.
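A sketch of the selection logic this describes: default to web-public streams, but let an explicit list override that. It assumes each stream is a dict with `name` and `is_web_public` keys, as in the stream objects returned by the Zulip API; `select_streams` and its parameters are hypothetical names.

```python
from typing import Any, Dict, Iterable, List, Optional


def select_streams(
    streams: Iterable[Dict[str, Any]],
    explicit_names: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Pick the streams to archive.

    With explicit_names, archive exactly those streams; otherwise default
    to the streams flagged is_web_public.
    """
    if explicit_names is not None:
        wanted = set(explicit_names)
        return [s for s in streams if s["name"] in wanted]
    # Treat a missing flag as "not web-public" (e.g. on older servers).
    return [s for s in streams if s.get("is_web_public", False)]
```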
