archivesunleashed / archivesunleashed.org
This repository powers the project website.
Home Page: https://archivesunleashed.org
License: MIT License
We'll need to update the text to reflect the new derivative, and the new screenshot too. We can get this screenshot from the revised Cloud docs.
Updating the get involved page of the website to include the PDF of the latest newsletter release.
Update GWU datathon page with schedule. This will follow format of other datathon pages, having a clickable schedule image that opens to downloadable/printable pdf.
Opening ticket to address adding in PDF links for Washington Datathon final project slides.
Tasks:
With AUT PR #332 we now have the ability to read data directly in from S3. This should be (briefly) documented for the next release.
As per AUT PR #335, we'll now have consistent behaviour between ARCs and WARCs, and HTTP headers will always be present by default. We should better document the RemoveHttpHeader command and build it into more of the documentation.
Now that our project has an ever-growing publications list, we should have a page to showcase them!
Opening up an issue to add in floor plan image of Gelman library to give directions for participants.
In #77 we are updating documentation to reflect new functionality in 0.18.0. This ticket can contain information on binary extraction that needs to be added to the DataFrame section.
New release is done, so we should update the docs where necessary. Once that is done, feel free to promote it.
Update the Washington datathon page to include travel information for participants, info including:
Using branch: gw-travel
PR and deployment will be issued right before participants are notified (Dec. 19).
Right now "Hands on With The Archives Unleashed Toolkit" lives in a markdown file in a personal repo.
As per @ruebot's comment here we should move it over to the website (or somewhere in the org, but I think the website probably makes most sense). That way we'd also make sure to keep things updated as the docs evolve.
Create a section on the Toronto datathon page to display team participants, bios (if provided), and link to presentation slides. Need to insert a clickable image that links to a PDF of slide deck. This will be used to promote the collaboration coming out of the event and to highlight examples of projects for future datathon participants.
Wireframing has already been started locally
Include additional screenshots and an explanation of what users should see inside the example directory folder.
Pulling from @ruebot's explanation, we can also add instructions for:
cat part-00000 or cat part-00001. You can also cat part-00000 part-00001 > data.txt
This issue is prompted by a question from one of our DC participants while running through the instructions.
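The commands above can be sketched end to end as follows. This is a minimal illustration rather than the project's documented walkthrough: the temporary directory and the two sample records are hypothetical stand-ins for a real output directory, which would already contain its own part files.

```shell
# Hypothetical stand-in for a job's output directory (assumption for illustration).
outdir="$(mktemp -d)"
printf '(20170801,example.com,3)\n' > "$outdir/part-00000"
printf '(20170802,example.org,5)\n' > "$outdir/part-00001"

# Inspect a single part file.
cat "$outdir/part-00000"

# Combine every part file, in name order, into one file.
cat "$outdir"/part-* > "$outdir/data.txt"

# Count the records in the combined file.
wc -l < "$outdir/data.txt"
```

Using the `part-*` glob instead of naming each file keeps the command the same however many part files the job wrote.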
Information will need to be provided for participants to plan their travel arrangements. Items to be included:
Create page to display information for upcoming datathon at George Washington University
Upload newsletters to the website via the get involved page.
Create a new header specifically for past issues of newsletters.
Right now, only subscribers get the newsletter. This will allow us to share newsletters after the campaigns have been sent out, and let individuals access them at any time.
We've recently opened a YouTube channel and will be uploading tutorials and other AUT project-related video content. This is a reminder to update the website to include the YouTube link (under the get in touch section in the left panel).
We may want to schedule the PR to go live around the same time as the AIT blog post that introduces the video and YouTube channel to a wider audience.
YouTube Channel: https://www.youtube.com/channel/UC4Sq0Xi6UWhYK2VbmAzFhAw
While updating the Vancouver event page, I noticed that when I launch the Hugo serve command, it brings me to the archivesunleashed.org main page. However, navigating to any other page to view changes results in a 404 error.
Checked:
Will try to see if there are any other paths that might not be syncing correctly.
Attached documentation for reference:
Update Vancouver datathon page to include the schedule.
Will mimic format from Toronto page:
Now that we have PySpark running, we need to document.
I've created a shell on a branch (aut/pyspark.md) which, if you're serving hugo locally, will appear at http://localhost:1313/aut/pyspark/. Throughout the doc are code blocks that need to be filled in.
Right now, let's focus on the core web archiving stuff, and leave Twitter analysis until later.
From Slack, @ruebot noted that if you run
import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/image_test/*", sc).extractImageDetailsDF();
val res = df.select($"bytes").orderBy(desc("bytes")).saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/images/")
Then in the images directory you get a bunch of files with the preceding -.
We should refine and then document this example so that it is, say,
val res = df.select($"bytes").orderBy(desc("bytes")).saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/images/geo")
where each image would then be geo-HASH.gif etc.
Opening an issue just so I don't forget to do this.
Right now our home page is a bit information heavy:
A suggestion from an advisory board member was to have an introductory video or image on the homepage. Maybe we could replace the current image with a playable video, the short introductory one that @SamFritz is working on? i.e.
Update information and institution links for Archives Unleashed advisory board members.
Based on some feedback.
For the next datathon, let's create a cheat sheet of common things. In the past, we've had this in Slack but would be better placed in the website too.
Some example topics: ssh into the server, chmod, etc.
Update documentation on AUK page to reflect changes in screenshots and any additional content.
Place-holder ticket for updating aut documentation for archivesunleashed/aut#289 when the next release happens.
Branch should not be merged until a new release is cut with Jimmy's new refactoring.
If I push up a couple branches implementing some new themes, would y'all entertain changing it? I kinda don't like the one we use now.
Looking at:
Similar to this issue, I'd like to do a renaming on the site.
Interesting use case at the datathon where they wanted to work with the raw HTML to help find data using specific tags. Makes sense to me! I will add to the documentation.
Thanks to @obrienben for the suggestion.
Testing with:
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/mnt/vol1/data_sets/ubc-wildfires-2017/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, r.getContentString))
.saveAsTextFile("/mnt/vol1/derivative_data/ubc-wildfires-2017/plain-html")
Will see how it works with the team and, if it meets their needs, will add it to our docs.
Privacy policy was written to cover all archives unleashed domains. The policy will be added to the archivesunleashed.org website under the About page.
Reorder of About menu to look like:
Right now all our docs are domain-to-domain links, we need to document url-to-url links.
We should move these Docker install instructions which are sitting in one of my random repos to the ArchivesUnleashed.org site to support this lesson.
I'll probably create a new page to do this.
We should put our code of conduct somewhere on the website.
On the about page? On the get involved page? Thoughts?
We need to document DataFrames in the documentation. Discussed with @ruebot and I think we will document DF alongside RDDs in the existing docs.
I will draw from:
Updating Vancouver datathon page to include information about final projects.
Based on some preliminary datathon feedback, NER was a common rabbit hole. I should frame it a bit better in the docs and explain how long it can take.
It wasn't working. I don't want to have misleading scripts up in advance of the hackathon, so I removed it for now until I can get it working again.
Previous script was:
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("example.arc.gz", sc)
.keepValidPages()
.flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
.filter(r => r._1 != "" && r._2 != "")
.countItems()
.filter(r => r._2 > 5)
links.saveAsTextFile("full-links-all/")
Noticing a few areas of the website that need minor updates/adjustments for consistency, style, and syntax.
Wanted to create one ticket that addresses multiple pages/areas
The summer newsletter will be released on August 6, 2019.
After the release, we will need to convert it to PDF and list it on the get involved page under the newsletter subscription.
Place-holder ticket for updating aut documentation for archivesunleashed/aut#292 when the next release happens.
Thanks for catching this @SamFritz!
As part of #298, @ruebot discovered that the problem of image extraction scale was resolved in part by a more recent version of Spark as well as configuration settings. To do a large-scale image extraction job, some additional flags might be required when running spark-shell.
From #298:
Yeah, we might add a cautionary note to this section about file systems, and flags. I can help flesh that out when the time comes.
-@ruebot
Our website might be intimidating to new users. Would it make sense to have a getting started section on the website?
An advisory board member suggested that such a section could perhaps have:
Could put it here? Thoughts? I think it is a good idea but am happy to talk further, as we also want to avoid cluttering our website too much...
A very simple walk through from borg cube to nice layout.
Sam is on the team so let's get her on the page!