webrecorder / browsertrix Goto Github PK

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

Home Page: https://browsertrix.com

License: GNU Affero General Public License v3.0

Dockerfile 0.07% Python 43.33% Shell 0.48% JavaScript 2.36% TypeScript 52.87% CSS 0.72% EJS 0.07% Jinja 0.10%

archiving cloud warc web-archive web-archiving webrecorder wacz kubernetes

browsertrix's Introduction

Browsertrix is an open-source cloud-native high-fidelity browser-based crawling service designed to make web archiving easier and more accessible for everyone.

The service provides an API and UI for scheduling crawls and viewing results, and managing all aspects of crawling process. This system provides the orchestration and management around crawling, while the actual crawling is performed using Browsertrix Crawler containers, which are launched for each crawl.

See browsertrix.com for a feature overview and information about Browsertrix hosting.

Documentation

The full docs for using, deploying, and developing Browsertrix are available at: https://docs.browsertrix.com

Deployment

The latest deployment documentation is available at: https://docs.browsertrix.com/deploy

The docs cover deploying Browsertrix in different environments using Kubernetes, from a single-node setup to scalable clusters in the cloud.

Previously, Browsertrix also supported Docker Compose and podman-based deployment. This has been deprecated due to the complexity of maintaining feature parity across different setups, and with various Kubernetes deployment options being available and easy to deploy, even on a single machine.

Making deployment of Browsertrix as easy as possible remains a key goal, and we welcome suggestions for how we can further improve our Kubernetes deployment options.

If you are looking to just try running a single crawl, you may want to try Browsertrix Crawler first to test out the crawling capabilities.

Development Status

Browsertrix is currently in a beta, though the system and backend API is fairly stable, we are working on many additional features.

Additional developer documentation is available at https://docs.browsertrix.com/develop

Please see the GitHub issues and this GitHub Project for our current project plan and tasks.

License

Browsertrix is made available under the AGPLv3 License.

Documentation is made available under the Creative Commons Attribution 4.0 International License

browsertrix's People

Contributors

Stargazers

Watchers

browsertrix's Issues

Add a way to configure superuser

Probably via char value/env var, with password need to be set on first use?

GET crawl config by ID returns archive instead of crawl config

The /archives/{aid}/crawlconfigs/{cid} endpoint seems to return the parent archive instead of of the crawl config. Tested in the docs UI.

Example curl request:

curl -X 'GET' \
  'https://btrix-dev.webrecorder.net/api/archives/0146a76e-b4fe-498d-a6df-0e8be8858dd1/crawlconfigs/a9ba1884-b9a3-4ec5-9210-f8dd8501cde9' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoiZTY3NmY2MWMtMjUxZi00ZDU2LWEwNzYtNTNmZTM1YzM4YmE0IiwiYXVkIjpbImZhc3RhcGktdXNlcnM6YXV0aCJdLCJleHAiOjE2NDI5MDIxNzB9.K4EOY1NftfDO9fH9Ti8sudR3FRrBCXS5_c72YikEz4Y'

Returns:

{
  "id": "0146a76e-b4fe-498d-a6df-0e8be8858dd1",
  "name": "sua dev's Archive",
  "users": {
    "e676f61c-251f-4d56-a076-53fe35c38ba4": 40
  },
  "storage": {
    "name": "default",
    "path": "0146a76e-b4fe-498d-a6df-0e8be8858dd1/"
  },
  "usage": {
    "2022-01": 3670
  }
}

Register for account in UI

Add sign up page
Add verification page
Show verification status in UI

Minify Lit/html in prod build

Lit elements/HTML markup is not currently minified in the prod build.

See https://lit.dev/docs/tools/production/#preparing-code-for-production

terser/terser#951

Set up Lit remote build

Add hash to js files/check other prod optimizations #46
CI/CD (GH actions?)
Cloud hosting for dev site

Crawls API

Running Crawls

List Running Crawls
Stop Running Crawl w/ Save
Stop Running Crawl and Discard

Crawl Configs

Update Existing Config
Run Now from Existing Config

Set up linter & formatter

Linter
Formatter

Crawl config detail view enhancements

Show created by user name instead of ID
- API returns a user object with id and `name (@ikreymer)
- Update frontend (@SuaYoo )
Show created at date (@ikreymer and @SuaYoo)
Make entire template card clickable (@SuaYoo)
Rename template when duplicating to [name] Copy or [name] 2` (@SuaYoo)
Change runNow UI from switch to checkbox (@SuaYoo)
(Needs investigation) Show schedule in browser local time (@SuaYoo)
(Needs discussion) Limit JSON shown only to user specified fields

Investigate showing invitation metadata

Backend: Make archive info available, either via public API endpoint or JWT in invitation URL @ikreymer
Frontend: Show invite details @SuaYoo #68

Backend: use UUID for external doc references

Use UUID instead of str for _id for all doc tables (except crawls, which get unique string id)

Will allow joining by username and adding more consistent typing.

Use `/api/settings` to configure initial settings

Switch to using this api to determine if open registration is enabled, and refresh token lifetime

Create Terms of Service

There should be a ToS to present to users who are signing up, from self registration or invite. TBD actual details of ToS and whether it needs to be explicitly or implicitly accepted.

Set up Typescript

Create crawl config enhancements

Update section order to be Basic (name) -> Pages (seeds) -> Scheduling
Default time to current hour instead of midnight
Show toast linking to new crawl

Password manager doesn't recognize username/password input

Password managers don't seem to play nice with web components in general. There doesn't seem to be a clear solution besides moving to light DOM, which is possible with LiteElement but not with shoelace components.

Crawl config validation

Server-side validation for POST/PATCH
Client-side validation for both form and JSON editor

Registering a user-email that is already registered shows incorrect information.

When registering with an e-mail that is already registered, the backend returns 400 with {detail: "REGISTER_USER_ALREADY_EXISTS"}.

The frontend should display this error message and perhaps offer a link to login page and/or forgot password page?

(Currently, it attempts to login with the new credentials, which also errors out, but then still displays 'Successfully signed up' message.)

Frontend production build

Check frontend prod build settings:

Hash/chunked assets
Minification #55
Source maps
Run in docker (add dockerignore) #60

Invite users as superadmin

Superadmins can invite user to join platform

Backend: Create endpoint for generating invite token @ikreymer
Frontend: Create UI for inviting users @SuaYoo

Handle expired JWT

JWT expires in 60 minutes by default, handle if expired

Also, refresh tokens are possibility: https://fastapi-users.github.io/fastapi-users/configuration/authentication/jwt/#tip-refresh (cc @ikreymer)

Crawl List + Detail View UI

Crawl List UI should include:

For finished crawls:

Number of files crawled (but not the file names, details only)
Duration of crawl and total size of all the files, maybe 10MB (2 Files)
- #116

For running crawls:

Option to watch crawl, or only on detail page?
Crawl stats?
Option to cancel (stop and discard) or stop (stop and keep what was crawled), or only on detail?
- #109
- #108
Sorting by: Start, End State, Config id
Filter by: config id?
- #109

Detail View - Finished Crawl:

Includes list of files and links to download each file (needs backend support)
Link to replay - shown inline in page? (needs backend support)

Detail View - Running Crawl

Includes option to watch - will show an inline watch iframe? (need backend support)

Refetch crawl templates on tab change

The crawl template list isn't refreshed after changing tabs.

Re-fetch data when visible
Refactor tab URL (see crawl detail example)

Ensure email sending is working with proper config

Needed for #27 and #30 to be fully functional!

Update archive view layout

Remove unnecessary side bar
Adjust archives/archive navigation

Mockup: https://app.mockplus.com/run/rp/TgN5xi_FPJvV/b-f_40JtJ3_z?cps=expand&rps=expand&nav=1&ha=1&la=1&fc=0&out=0&rt=1

Get JWT expiration from decoded token

The login JWT should be decoded on the frontend to get the expiration time.

Add display name to user data model

Include name property for users, settable upon registration
Use name property when returning list of users that are part of an archive

Allow archive admins to invite users

Add ability for a super-admin to invite select users to register.

Show archive members
Show invite form with email in archive
Sign up with accept invitation

Lit UI/CSS library investigation

Decide on:

UI framework (Shoelace? Daisy UI?)
Utility CSS library (Tailwind?)

Authentication not persisting

The frontend is logged out after a while, probably the hour token expiration

Crawl Configurations UI

The Crawl configuration UI will include a way to create new crawl configurations, list existing crawl configurations, delete crawl configurations.

Create Crawl Configurations
List Existing Crawl Configurations
Delete Crawl Configurations

Pausing / Scaling Crawls

Support increasing/decreasing number of pods running on a crawl.
Requires:

Generate Crawl ID separately, not based on job/docker container id
Use Shared Redis for Crawl, instead of local one in Browsertrix Crawler (for how long??)
Crawl doc supports multiple file entries instead of just one
Decide which approach to take, 1 or 2:

1. Scale via Pause and Restart

To scale:

job stopped gracefully, WACZ written
crawl doc set to 'partial_complete', files added for each completed pod.
restart with same crawl id, shared redis state.
final pod adds final WACZ, sets crawl state to 'complete'

Pros:

Maintain one job per crawl at a time.
K8s takes care of parallelism, works with cron job.

Cons:

Scaling up or down requires stopping job, restarting with more pods.
harder to support via Docker only

2. Add more jobs / remove jobs to scale

To scale up:

New bob added with crawl id of existing Redis state.

To scale down:

One or more existing jobs stopped (graceful stop)
crawl doc updated with new WACZ and 'partial_complete'

Pros:

Scaling up and down without any interruption
Can be implemented in similar way w/o K8S

Cons:

multiple jobs per crawl
unclear how to handle cronjobs

Configure API path in frontend

API can be accessed via local docker container or remote URL

`btrix-log-in` update warning

Fix source of warning on login page: Element btrix-log-in scheduled an update (generally because a property was set) after an update completed, causing a new update to be scheduled. This is inefficient and should be avoided unless the next update can only be scheduled as a side effect of the previous update. See https://lit.dev/msg/change-in-update for more information.

Create Crawl Configuration/Template/Definition UI

This screen will produce a JSON that is then passed to the crawl config creation API endpoint.

The format includes a top-level dictionary with a Browsertrix Cloud-specific options, and a config dictionary, which
corresponds to the Browsertrix Crawler config.

The format is:

{
  "schedule": "",
  "runNow": false,
  "colls": [],
  "crawlTimeout": 0,
  "parallel": 1,
  "config": {...}
}

The key properties to include are:

run now, a checkbox to start a crawl instantly.
schedule, a way to specify a schedule in cron-style format. (but can be simpler, eg. a day time, and an option like daily, weekly, monthly, etc..)
Time Limit in seconds (mostly will be helpful for testing, though not strictly required)

The actual crawl configuration, the config property, can be what is passed to browsertrix-crawler can be either a:

Advanced view where json can be pasted
Simplified view that includes a subset of properties, maybe starting with:
seed list, containing:
- URL
- Scope Type (page, page-spa, prefix, host, any)
  other properties:
- limit, total number of pages to crawl

For the seed list, the input might be:

Text area with one URL per line + scope type, which then get added to the list. This would be to support pasting in a bunch of URLs with a specified scope.

The supported properties in the 'simplified view' will likely continue to evolve, but also have the advanced view for pasting a custom config.

Storage Refactor

Allow user to reset password

"Forgot password" reset
Logged in user page reset

Set up frontend localization

Accept invite endpoint return 400 on success

To reproduce (on main):

Run frontend app with yarn start-dev, log in
Choose an archive and click "Members"
Click "Add Member" and finish invite an email you can access
Log out
Find invite in your inbox, replace remote URL in link with http://localhost:9870 and visit
Complete sign up
Click "Accept Invite". You should see an "Invalid invite" error. However, if you go to http://localhost:9870/archives, you'll see the invite that you were added to.

[Low Pri] When showing last crawl, also show indicator if it succeeded/failed

Could show a ✔️ or an ✖️ depending on if the last crawl succeeded or failed.
Will need to add a lastCrawlStatus to the crawlconfig in addition to id and date.
(Low priority for now)

Additional API fields for crawls and crawlconfigs

Return crawl template name in /crawls and /crawls/:id response data to show in UI
Return user name in /crawls, /crawls/:id and /crawlconfigs/:id
Return created at date in /crawlconfigs and /crawlconfigs/:id
For crawl list, return fileSize and fileCount for each crawl instead of files array (drop files field)
Flatten crawls list to single crawls list instead of running and finished

e.g.

type Crawl = {
  id: string;
  user: string;
  // user: { id: string; name: string; }
  // or
  // userName: string;
  aid: string;
  cid: string;
  // crawlconfig: { id: string; name: string; }
  // or
  // crawlconfigName: string;
};

(Low priority) Invites generate additional verification email

Not a high priority, but would be better UX to not generate a verification email for users that sign up from an invite.

To reproduce (on main):

Run frontend app with yarn start-dev, log in
Choose an archive and click "Members"
Click "Add Member" and finish invite an email you can access
Log out
Find invite in your inbox, replace remote URL in link with http://localhost:9870 and visit
Complete sign up. After a few minutes you'll get a verification email

Upgrade Shoelace 2.0.0-beta.61 to latest

Major breaking changes:

Use native form instead of sl-form https://github.com/shoelace-style/shoelace/blob/next/docs/resources/changelog.md#200-beta64
Update variant attribute https://github.com/shoelace-style/shoelace/blob/next/docs/resources/changelog.md#200-beta63

Plus:

Update icons trash -> trash3

See https://github.com/shoelace-style/shoelace/blob/next/docs/resources/changelog.md

Unable to open links in new tab with cmd+click

The current navLink utility method prevents opening links in a new tab by combining the click with a keypress (command in mac)

User verification stuck on spinner

Now that verification emails are being sent, it seems the verification is getting stuck somewhere.

Repro Steps

Register new account, receive registration e-mail with verification token
Load http://localhost:9870/verify?token=<token>
Observe the POST request is successful 200, but page stuck on spinner.
Refreshing the page returns 'Something is wrong' when the backend returns a 400 with {detail: "VERIFY_USER_ALREADY_VERIFIED"} - this should be also be a recognized error message.

Expected:

The page displays a message that the address has been verified.
If already verified, display appropriate error message

Question: what should happen if user is not logged in? I guess this should not log in the user, only validate the user, right?
It looks like the backend does not requires the auth token to be passed for verification, so would get verified no matter what, even if not logged in..

Show sign up confirmation message

Per Discord convo, add notification on sign up success as first pass at onboarding flow. Show message like "Welcome to Browsertrix Cloud. A confirmation email has been sent to the e-mail address you specified"

[Product Design] Decide how/if users should be able to create multiple archives

Currently, users either start with one archive/organization when they join, or are invited to an existing archive/organization.

The API currently allows for multiple archive creation per user, but not yet the UI..

Should decide if:

Users can create an unlimited number of new archives? Or only certain users?
Users get one archive that they 'own', but can be members of as many other archives that they're invited to?
Different options for different user roles (may be too complicated)

Current Schedule
Option to Edit Schedule
- #103
Display time of last finished crawl with link to the crawl, if any
- #97
Run Now option to start a new crawl, if allowed (see below)
- #94
Option to Duplicate a Crawl Config, create a new config with seed list of previous one (for later).
- #99

The crawl config detail view can show:

Schedule editing
- #103
Seed List (read only) and/or raw JSON (for both standard and advanced configs?)
- #97

Backend:

Need check to see if last crawl is currently running, prevent starting new crawl
API should include a bool to indicate if can start a new crawl.

Some questions:

Do we want to show the raw JSON, even for standard-created configs, and/or only for advanced?
Do we need to track if it was a 'standard' created via UI or custom config created through raw JSON, eg. for duplication? Or can we detect this heuristically (eg. if json has more properties than supported by standard config, use the JSON view)

Add 'Include External Links' checkbox

An extra checkbox to the crawl configuration simple view below the scope type, which, if checked would set "extraHops": 1 in the config, otherwise "extraHops": 0.

Maybe should call it: Include all links to external pages ?

Also, perhaps should rename scope type to Crawl Scope