
archivesunleashed.org's People

Contributors

ianmilligan1, imgbot[bot], ruebot, samfritz

Forkers

michaelnashed

archivesunleashed.org's Issues

Add Archives Unleashed Jupyter Notebooks to the website

We should highlight these notebooks on our website, via a short page here:

[Screenshot, 2019-04-17]

It'll be "Archives Unleashed Jupyter Notebooks" and will have a brief description of them. This would give us the benefit of something solid to link to beyond the medium post (so we can update it more readily).

GWU Schedule

Update the GWU datathon page with the schedule. This will follow the format of the other datathon pages: a clickable schedule image that opens a downloadable/printable PDF.

Washington Datathon Final Projects

Opening ticket to address adding in PDF links for Washington Datathon final project slides.

Tasks:

  • Team Project section
  • List team members
  • Project image as placeholder
  • Upload project slides and link to PDF
  • List Washington datathon under Events > past datathons

Document S3 Access

With AUT PR #332 we now have the ability to read data directly in from S3. This should be (briefly) documented for the next release.

Publications page

Now that our project has an ever-growing publications list, we should have a page to showcase them!

Document Binary Object Extraction

In #77 we are updating documentation to reflect new functionality in 0.18.0. This ticket can contain information on binary extraction that needs to be added to the DataFrame section.

Update Washington Travel Information

Update the Washington datathon page (using branch: gw-travel) to include travel information for participants, including:

  • accommodation
  • airport
  • transportation options to airports
  • local transportation options
  • local attractions

PR and deployment will be issued right before participants are notified (Dec. 19).

Toronto Datathon Final Projects

Create a section on the Toronto datathon page to display team participants, bios (if provided), and a link to the presentation slides. We need to insert a clickable image that links to a PDF of the slide deck. This will be used to promote the collaboration coming out of the event and to highlight examples of projects for future datathon participants.

Wireframing has already been started locally.

Update lesson page

Include additional screenshots and an explanation of what users should see inside the example directory folder.

  • image: what the :data folder looks like (with the additional files _SUCCESS, part-00000, part-00001)
  • image: what opening part-00001 looks like in a text editor

Drawing on @ruebot's explanation, we can also add instructions for viewing the output:

cat part-00000 or cat part-00001. You can also combine them with cat part-00000 part-00001 > data.txt

This issue was prompted by a question from one of our DC participants while running through the instructions.
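
The `cat` instructions above can be sketched as follows. The directory and the sample lines are placeholders created only so the example runs on its own; in practice the part files would already exist in the job's output directory.

```shell
# Create a stand-in for a Spark output directory (hypothetical sample data).
outdir="$(mktemp -d)"
printf 'line from part 0\n' > "$outdir/part-00000"
printf 'line from part 1\n' > "$outdir/part-00001"
touch "$outdir/_SUCCESS"   # empty marker file written when a job finishes

# Look at a single part file.
cat "$outdir/part-00000"

# Combine every part file, in name order, into one text file.
cat "$outdir"/part-* > "$outdir/data.txt"

# Count the lines in the combined file.
wc -l < "$outdir/data.txt"
```

Using the `part-*` pattern instead of listing each file by name keeps the command the same regardless of how many part files a job produces.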

Vancouver Travel Information

Information will need to be provided for participants to plan their travel arrangements. Items to be included:

  • accommodation
  • airport
  • downtown transit
  • local attractions
  • create a points-of-interest specific map (using google maps)
  • schedule: create PDF schedule in canva, then upload as image and link to downloadable PDF

Create GWU page

Create page to display information for upcoming datathon at George Washington University

Archives Unleashed Newsletters - PDF access

Upload newsletters to the website via the get involved page.
Create a new header specifically for past issues of newsletters.

Right now, only subscribers get the newsletter. This will allow us to share newsletters after the campaigns have been sent out, and let individuals access them at any time.

Add YouTube Channel Link

We've recently opened a YouTube channel and will be uploading tutorials and other AUT project-related video content. This is a reminder to update the website to include the YouTube link (under the get in touch section in the left panel).

We may want to schedule the PR to go live around the same time as the AIT blog post that introduces the video and YouTube channel to a wider audience.

YouTube Channel: https://www.youtube.com/channel/UC4Sq0Xi6UWhYK2VbmAzFhAw

Bash - Hugo Serve - results in 404

While updating the Vancouver event page, I noticed that when I run the hugo serve command, it brings me to the archivesunleashed.org main page. However, navigating to any other page to view changes results in a 404 error.

Checked:

  • on correct branch
  • homebrew is up-to-date (1.6.14)

Will try to see if there are any other paths that might not be syncing correctly.

Attached documentation for reference:

Bash-Hugo serve- results in 404.pdf

[Two screenshots, 2018-07-12]

Vancouver Schedule

Update Vancouver datathon page to include the schedule.

Will mimic format from Toronto page:

  • hyperlink to open as PDF document
  • image of the schedule to be displayed in a table format (day 1 in column 1, day 2 in column 2) and clickable to open as PDF

PySpark Documentation

Now that we have PySpark running, we need to document.

I've created a shell on a branch (aut/pyspark.md) which, if you're serving Hugo locally, will appear at http://localhost:1313/aut/pyspark/. Throughout the doc are code blocks that need to be filled in.

Right now, let's focus on the core web archiving stuff, and leave Twitter analysis until later.

Fix image documentation to make prefix clear

On Slack, @ruebot noted that if you run

import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/image_test/*", sc).extractImageDetailsDF();
val res = df.select($"bytes").orderBy(desc("bytes")).saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/images/")

then in the images directory you get a bunch of files whose names begin with a -.

We should refine and then document this example so that it reads, say,

val res = df.select($"bytes").orderBy(desc("bytes")).saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/images/geo")

where each image would then be geo-HASH.gif etc.

Opening an issue just so I don't forget to do this.
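
The naming behaviour described above can be illustrated with plain string concatenation. This sketch assumes, based only on the two cases reported in this issue, that the output name is built as prefix + "-" + hash + extension. The `name_for` helper, the literal `HASH`, and the `gif` extension are placeholders for illustration, not AUT's actual implementation.

```shell
# Assumption: output name = <prefix>-<hash>.<ext>, inferred from the issue text.
name_for() {
  prefix="$1"
  printf '%s-%s.%s\n' "$prefix" "HASH" "gif"
}

# Prefix ends with "/": the file lands in the directory with a leading "-".
name_for "/tuna1/scratch/nruest/geocites/images/"

# Prefix ends with "geo": the file is named geo-HASH.gif.
name_for "/tuna1/scratch/nruest/geocites/images/geo"
```

Under that assumption, the second argument behaves as a file name prefix rather than a directory, which is the point the documentation should make clear.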

Feature Introductory Video/etc. on Home Page

Right now our home page is a bit information heavy:

[Screenshot, 2019-05-01]

A suggestion from an advisory board member was to have an introductory video or image on the homepage. Maybe we could replace the current image with a playable video, the short introductory one that @SamFritz is working on? i.e.

[Screenshot, 2019-05-01]

Create Datathon Cheat Sheet

Based on some feedback.

For the next datathon, let's create a cheat sheet of common tasks. In the past, we've shared this in Slack, but it would be better to have it on the website too.

Some example topics:

  • how to find your terminal;
  • how to ssh into the server, chmod, etc.;
  • how to launch AUT;
  • standard dataset techniques & examples;
  • how to copy files to/from a virtual machine.
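
A few of the topics above could appear on the cheat sheet along these lines. This is only a sketch: the user name, host name, and file names are hypothetical placeholders, and the remote commands are shown as comments because they need a real server to run against.

```shell
# Remote commands (templates only; user and host are hypothetical placeholders):
#   ssh user@datathon.example.org                           # log in to the server
#   scp user@datathon.example.org:derivatives/links.txt .   # copy a file from the VM
#   scp notes.txt user@datathon.example.org:                # copy a file to the VM

# Local example: make a script executable with chmod.
workdir="$(mktemp -d)"
script="$workdir/hello.sh"
printf '#!/bin/sh\necho "hello from the datathon"\n' > "$script"
chmod u+x "$script"   # add execute permission for the file's owner
ls -l "$script"       # the permissions column now starts with -rwx
```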

Use Case: Document Keeping HTML Tags in Text Output

Interesting use case at the datathon where they wanted to work with the raw HTML to help find data using specific tags. Makes sense to me! I will add to the documentation.

Thanks to @obrienben for the suggestion.

Testing with:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/mnt/vol1/data_sets/ubc-wildfires-2017/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, r.getContentString))
  .saveAsTextFile("/mnt/vol1/derivative_data/ubc-wildfires-2017/plain-html")

Will see how it works with the team and if this meets needs, will add to our docs.
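
Once the raw HTML has been saved as text, specific tags can be pulled out with standard command-line tools. The sketch below uses a made-up sample record in place of real output, and `<title>` is just one example of a tag a team might search for.

```shell
# Stand-in for one part file of the saved output (sample record is made up).
outdir="$(mktemp -d)"
cat > "$outdir/part-00000" <<'EOF'
(20170801,example.com,http://example.com/,<html><head><title>Wildfire update</title></head><body><h1>Evacuation notice</h1></body></html>)
EOF

# Collect only the <title> elements found in the part files.
# -o prints just the matching text; -h omits the file name prefix.
titles="$(grep -oh '<title>[^<]*</title>' "$outdir"/part-*)"
echo "$titles"
```

Because `grep` works line by line, a tag whose content spans several lines in the original page would need a different approach.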

Add Privacy Policy

The privacy policy was written to cover all Archives Unleashed domains. The policy will be added to the archivesunleashed.org website under the About page.

Reorder the About menu as follows:

  • Project Team
  • Advisory Board
  • Funding
  • Code of Conduct
  • Privacy Policy

Update Vancouver Page Projects

Updating Vancouver datathon page to include information about final projects.

  • Team name
  • Team members + affiliation
  • Screenshot of project
  • Attached project presentation (pdf)

More Cautionary Notes on NER

Based on some preliminary datathon feedback, NER was a common rabbit hole. I should frame it a bit better in the docs and explain how long it can take.

Raw URL Link Structure script

It wasn't working. I don't want to have misleading scripts up in advance of the hackathon, so I removed it for now until I can get it working again.

Previous script was:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .filter(r => r._1 != "" && r._2 != "")
  .countItems()
  .filter(r => r._2 > 5)

links.saveAsTextFile("full-links-all/") 

Update Website Content

I've noticed a few areas of the website that need minor updates/adjustments for consistency, style, and syntax.

I wanted to create one ticket that addresses multiple pages/areas.

Add summer newsletter to website

The summer newsletter will be released on August 6, 2019.
After the release, we will need to convert it to PDF and list it on the get involved page under the newsletter subscription.

Update AUT documentation for image extraction job execution

As part of #298, @ruebot discovered that the problem of image extraction scale was resolved in part by a more recent version of Spark as well as configuration settings. To do a large-scale image extraction job, some additional flags might be required when running spark-shell.

From #298:

Yeah, we might add a cautionary note to this section about file systems, and flags. I can help flesh that out when the time comes.
-@ruebot

Add a Beginner Section

Our website might be intimidating to new users. Would it make sense to have a getting started section on the website?

An advisory board member suggested that such a section could perhaps have:

  • videos (some of the intro ones that Sam is putting together?)
  • a few key screenshots
  • Notebooks (making clear that they are beginner)
  • And "maybe a quick note on what skills (basic and/or advanced) are required to use the service: it could help scholars who want to start a project and hire people who have the skills to help them (ex. what to look for, what to mention in a job offer when you want to hire a skilled student or professional to help out) + you could advertise here the datathons and other events you organize as great opportunities to send students to get trained!"

Could put it here? Thoughts? I think it is a good idea but am happy to talk further, as we also want to avoid cluttering our website too much...

[Screenshot, 2019-05-01]
