fromthepage's Introduction

FromThePage is an open-source tool that allows volunteers to collaborate to transcribe handwritten documents.

Features

  • Wiki-style Editing: Users add or edit transcriptions using simple, wiki-style syntax on one side of the screen while viewing a scanned image of the manuscript page on the other side.
  • Version Control: Changes to each page transcription are recorded and may be viewed to follow the edit history of a page.
  • Wikilinks: Subjects mentioned within the document are indexed via simple wikilinks within the transcription. Users can annotate subjects with full subject articles.
  • Presentation: Readers can view transcriptions in a multi-page format or alongside page images. They can also read all the pages that mention a subject.
  • Automatic Markup: FromThePage can suggest wikilinks to editors by mining previously edited transcriptions. This helps ensure editorial consistency and vastly reduces the amount of effort involved in markup.
  • Internet Archive integration: FromThePage can be pointed at manuscripts hosted on Archive.org. It will import the page structure and any printed page titles into its native format for transcription, while serving page images from the Internet Archive.

License

FromThePage is currently issued under the Affero GPL. This license remains controversial, however, so we are trying to preserve the option to dual-license the code.

Platform

FromThePage has been run successfully under both Linux and Windows. It currently requires Ruby on Rails version 6.0.3.2 and the RMagick, hpricot, will_paginate, and OAI gems.

Installation

Detailed installation instructions are available in the wiki, including a link to a Docker file.

If you install FromThePage, please join the low-volume FromThePage Google Group so we can keep you informed of bug fixes and new releases.

fromthepage's Issues

User can adjust size of transcription window

I have encountered several cases where the transcription window was not wide enough to enter JG's entire line. See S3 Page 7. As I entered some of the lines, they wrapped, which makes it a little harder to read when checking the work. However, when I hit save, the wrapping disappeared. On this and similar pages, by the time I reach the last couple of lines of the original text, I have scrolled so far that I cannot see the transcription window and have to either jot down the transcription or scroll, enter a few words, scroll down, back up, etc.

reject possible duplicates

Possible duplicates/combine -- need a way to reject a possible dupe and not show it again. (So it doesn't clutter up the subject pages.)

Curly brace extending over several lines

In S3 Page 26, JG lists on three lines records of three quail. He then drew a curly brace from the first to the last and wrote a long note that started on the first line, went to the last, and then went on to the next couple lines. I indicated this with a single { after each of the birds and added a note in the text and in the notes themselves.

Use OpenLibrary persistent URLs

The code built around determining URLs for IA images is now obsolete thanks to the new redirection APIs built into OL. As Mike Lichtenberg writes:

FWIW, IA added a way to access their images that doesn’t rely on knowing that the base address was “ia700609.us.archive.org/6/items/mcz13103363v2” at one time but might be different now. See https://openlibrary.org/dev/docs/bookurls.

Using these addresses, this…

http://www.archive.org/download/mcz13103363v2/page/n2_w840

… redirects to this…

http://ia600609.us.archive.org/BookReader/BookReaderImages.php?id=mcz13103363v2&itemPath=/6/items/mcz13103363v2&server=ia600609.us.archive.org&page=n2_w840

… and returns the same image as this…

http://ia600609.us.archive.org/BookReader/BookReaderImages.php?zip=/6/items/mcz13103363v2/mcz13103363v2_jp2.zip&file=mcz13103363v2_jp2/mcz13103363v2_0003.jp2&scale=2

The beauty is you don’t need to know those ugly server names and file paths (and don’t need to double-check them every time).
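
A minimal Ruby sketch of building such an address, assuming the `download/<identifier>/page/n<index>_w<width>` pattern in the example above generalizes to other items. The helper name and its parameters are hypothetical and are not taken from the FromThePage code base.

```ruby
# Hypothetical helper (not FromThePage code): builds the redirecting
# page-image address from an Internet Archive identifier, a page index
# and a pixel width, following the pattern quoted above.
def ia_page_image_url(identifier, page_index, width)
  "http://www.archive.org/download/#{identifier}/page/n#{page_index}_w#{width}"
end

puts ia_page_image_url("mcz13103363v2", 2, 840)
```

With an address like this, the importer would not need to store the server name or item path at all, since the redirect resolves them at request time.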

Grinnell used brackets

On the page S3 Page 31, JG used brackets twice. E.g., [Elliot's]. Will this affect the annotation process?

I keep thinking that I have encountered the last of his quirks and then he surprises me again.

Error importing internet archive works with spaces in filenames

Using the new uploader, Internet Archive users can upload files containing spaces in the image file names. FromThePage still expects filenames in IA to conform to the old, persnickety file format, so an import will blow up on the third step during XML file parsing.

Logfiles:

Processing IaController#ia_book_form (for 70.112.88.81 at 2014-10-06 22:14:47) [GET]
  Parameters: {"ol"=>"d_ia_import"}
Rendering template within layouts/application
Rendering ia/ia_book_form
Completed in 18ms (View: 5, DB: 4) | 200 OK [http://beta.fromthepage.com/ia/ia_book_form?ol=d_ia_import]


Processing IaController#confirm_import (for 70.112.88.81 at 2014-10-06 22:14:54) [POST]
  Parameters: {"commit"=>"Next", "authenticity_token"=>"MRUo4XaibGPztbe4xpDkP2/2O1+6M7NnglFvfkjFUxg=", "detail_url"=>"https://archive.org/details/Doc3617Pp312"}
Rendering template within layouts/application
Rendering ia/confirm_import
Completed in 33ms (View: 9, DB: 15) | 200 OK [http://beta.fromthepage.com/ia/confirm_import]


Processing IaController#import_work (for 70.112.88.81 at 2014-10-06 22:15:02) [POST]
  Parameters: {"commit"=>"Next", "authenticity_token"=>"MRUo4XaibGPztbe4xpDkP2/2O1+6M7NnglFvfkjFUxg=", "detail_url"=>"https://archive.org/details/Doc3617Pp312"}

URI::InvalidURIError (bad URI(is not URI?): http://ia802305.us.archive.org/10/items/Doc3617Pp312/slave ledger doc 3617 pp3-12_scandata.xml):
  /usr/local/lib/ruby/1.8/uri/common.rb:436:in `split'
  /usr/local/lib/ruby/1.8/uri/common.rb:485:in `parse'
  /usr/local/lib/ruby/1.8/open-uri.rb:29:in `open'
  app/controllers/ia_controller.rb:168:in `import_work'

Problem file list (from https://ia902305.us.archive.org/10/items/Doc3617Pp312/ ):

Doc3617Pp312_archive.torrent                       04-Oct-2014 01:02                3891
Doc3617Pp312_files.xml                             04-Oct-2014 01:02                4711
Doc3617Pp312_meta.sqlite                           03-Oct-2014 16:22                9216
Doc3617Pp312_meta.xml                              04-Oct-2014 01:02                 927
slave ledger doc 3617 pp3-12.djvu                  04-Oct-2014 01:01              567506
slave ledger doc 3617 pp3-12.epub                  04-Oct-2014 01:02                3975
slave ledger doc 3617 pp3-12.gif                   04-Oct-2014 00:59              128989
slave ledger doc 3617 pp3-12.pdf                   03-Oct-2014 16:22             8003108
slave ledger doc 3617 pp3-12_abbyy.gz              04-Oct-2014 01:00                2804
slave ledger doc 3617 pp3-12_djvu.txt              04-Oct-2014 01:02                  93
slave ledger doc 3617 pp3-12_djvu.xml              04-Oct-2014 01:00                4651
slave ledger doc 3617 pp3-12_jp2.zip               04-Oct-2014 00:59             4323550
slave ledger doc 3617 pp3-12_scandata.xml          04-Oct-2014 01:01                3205
slave ledger doc 3617 pp3-12_text.pdf              04-Oct-2014 01:02              704575
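
The `URI::InvalidURIError` in the trace above comes from parsing a URL that contains raw spaces. One possible fix, sketched below, is to percent-encode the file name before building the URL. The helper name and parameters are hypothetical; only `ERB::Util.url_encode` and `URI.parse` are real standard-library calls, and this is not the actual code in `ia_controller.rb`.

```ruby
require 'erb'
require 'uri'

# Hypothetical helper (not FromThePage code): percent-encodes only the
# file name, so spaces become %20, then parses the resulting URL.
def ia_file_uri(server, item_path, filename)
  escaped = ERB::Util.url_encode(filename)
  URI.parse("http://#{server}#{item_path}/#{escaped}")
end

uri = ia_file_uri("ia802305.us.archive.org",
                  "/10/items/Doc3617Pp312",
                  "slave ledger doc 3617 pp3-12_scandata.xml")
puts uri
```

`ERB::Util.url_encode` is used here rather than a form encoder because form encoding turns spaces into `+`, which is not what a URL path expects.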

Unicode Support

Dominik Wujastyk has been interested in using FromThePage for Sanskrit manuscripts (see his blog entry on crowdsourced transcription). Quick tests reveal that the current production server running Ruby 1.8 and Rails 2 does not support Unicode in any form. Migrating to Ruby 1.9 is required for this to work, as well as possibly migrating and/or changing the encoding of the backing MySQL database. srl295 might be interested in following this issue.

Create Image Set fails with NoMethodError in Transform#size_form

After clicking the orientation form, the application fails with this error:

NoMethodError in Transform#size_form

Showing /home/benwbrum/dev/products/fromthepage/fromthepage/app/views/transform/size_form.html.erb where line #14 raised:

undefined method `id' for nil:NilClass

Extracted source (around line #14):

<% form_tag({:action => 'size_process'}) do %>
<%= hidden_field_tag('image_set_id', @image_set.id) %>
<%= radio_button_tag('size', 'just_right')%>
<%= label('size', 'just_right', "This is just right") %>

Rails.root: /home/benwbrum/dev/products/fromthepage/fromthepage
Application Trace | Framework Trace | Full Trace

app/views/transform/size_form.html.erb:14:in `block in _app_views_transform_size_form_html_erb___1280253658677194652_24376240'
app/views/transform/size_form.html.erb:13:in `_app_views_transform_size_form_html_erb___1280253658677194652_24376240'

Replace notes with something modern

RJS is used in FromThePage in the 'notes' feature -- itself pulled from an old restful_comments plugin. This needs to be replaced with something modern, perhaps by replacing notes.

Acceptance criteria:
As a logged-in user,

  • Visit a page in a work (such as http://beta.fromthepage.com/display/display_page?ol=d_act_page&page_id=1946 )
  • Click the "Add Note" link
  • Verify a form appears
  • Type in some text
  • Save the note
  • Verify the note is saved.
  • Reload the page, or navigate to the next page and back again
  • Verify the note is displayed
  • Edit the note (n.b. may depend on Issue #30 )
  • Verify that edits to the note are made
  • Delete the note
  • Verify the note is deleted.
  • Reload the page, or navigate to the next page and back again
  • Verify the note is not displayed

Mrs. -- links wrong

Subject linking for "Mrs." links to the wrong subject (a real subject, but not the person in question), maybe the first "Mrs." defined.

.gitignore

In general, it would be nice to not have to modify files under change control to do configuration. Not sure how to do that.

Anyways, here's a .gitignore that reduces the noise a little.

public/images/simple_captcha
public/images/working/dot/*
public/images/working/upload/*
public/.htaccess
log

Devise Error in IA publish to FromThePage

As an owner user, go to the dashboard
Import an Internet Archive book
Use https://archive.org/details/mcz13103363v16 as the URL
Hit "Next" through the duplicate warning
Press "Convert to FromThePage"

This raises something which looks like a Devise error:
NoMethodError in IaController#convert

undefined method `current_user' for #<Class:0x00000003155240>

Rails.root: /home/benwbrum/dev/products/fromthepage/fromthepage
Application Trace | Framework Trace | Full Trace

app/models/page.rb:122:in `create_version'
app/controllers/ia_controller.rb:29:in `block in convert'
app/controllers/ia_controller.rb:23:in `convert'

Could links be added to each page

I don't know how many of the other JG trips would be affected, but for the trip to the San Martir mountains, Chester Lamb was the other half of the party. A link to a cleaned-up copy of his map of collecting stations, as well as to his own notes, could be useful at times. Also, are there records available online of the plants that they collected? A link to these could be useful, especially since JG often used common names.

Feature: PDF generation and Publish-on-demand integration

The world has changed since I last worked on the LaTeX formatters for doing PDF generation. Now I should be able to use the Lulu.com publishing API to generate Publish-on-Demand books from manuscript transcripts and/or facsimiles.

Subjects cannot have double quotes

Apparently putting double quotes in the name of a subject (e.g. [[John "Jack" Coffee Hays]]) returns a nondescript error. We should A) catch the error and make it meaningful, and B) figure out what's wrong with the double quotes within a wikilink.
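
The issue does not say where the failure occurs. One plausible cause, stated here purely as an assumption, is that the subject title is interpolated into an XML attribute without escaping, so the embedded quotes end the attribute early. The sketch below shows escaping in that situation; the `link` element, the `target_title` attribute and both method names are hypothetical.

```ruby
# Characters that must be escaped inside XML text and quoted attributes.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Replaces each special character with its XML entity.
def escape_xml(text)
  text.gsub(/[&<>"]/, XML_ESCAPES)
end

# Hypothetical conversion of [[Title]] wikilinks into XML link elements.
# Element and attribute names are illustrative assumptions only.
def wikilinks_to_xml(source_text)
  source_text.gsub(/\[\[([^\[\]]+)\]\]/) do
    title = Regexp.last_match(1)
    %(<link target_title="#{escape_xml(title)}">#{escape_xml(title)}</link>)
  end
end

puts wikilinks_to_xml('met [[John "Jack" Coffee Hays]] today')
```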

Errors in transcripts should be displayed

If users enter unbalanced HTML tags, the parser will barf, but the end user will only see a "Something went wrong" error. When @lasuprema had an unbalanced bracket, the app was totally unhelpful, although it logged this in the logfile:


Processing TranscribeController#save_transcription (for 128.62.58.21 at 2014-10-09 17:37:18) [POST]
  Parameters: {"authenticity_token"=>"x3o06IDSXusyY02X3kgu28Ic3KvUaJUgtRkF/nWmUPA=", "page_id"=>"3437", "page"=>{"title"=>"", "source_text"=>"the [[Thinking Fellers Union Local 282]]\r\n\r\ninterview by [[susan]] and [[steve]] @ [[Emo's]] in August 1995\r\nsome of the questions are ours; some are from a psychiatrist's questioning of an alleged victim of satanic ritual abuse.\r\n\r\nGW: I want to know if you've had any interesting encounters with the law in Texas.\r\n[[Hugh]]: We got pulled over the first time we came here for speeding. It was kind of a speed trap.\r\n[[Anne]]: It's not all that interesting.\r\nHugh: Yeah. All it was, we came over a hill and [[Paul]], our first drummer, was driving. Was it early in the morning? I think it was.\r\nAnne: We'd driven all night.\r\n[[Brian]]: I thought that was interesting, to me, just because I had been one of the last people driving and I hadn't slept all night, and that whole road on Highway 10, I guess, for many hours was littered with deer and there were deer corpses all over the place. So, by the time this policeman stopped us, I was out of my mind, and he could have easily been anything other than a policeman, too. I couldn't tell what was going on at all.\r\n[[Anne]]: And we looked like we had been up all night, too, really scruffy and unwashed. I'm really surprised he didn't just tell us to follow him to the police station. 'Cause I've heard that happens, or else they try to get money out of you on the spot, or they run you out of town.\r\nGW: Do you have an opinion on whether or not electroconvulsive therapy is good therapeutic practice when used by a licensed psychiatrist?\r\n[[Hugh]]: I've heard that it's not. I've heard that it's a bad thing", "status"=>"incomplete"}, "save"=>"Save"}

REXML::ParseException (# >
/usr/local/lib/ruby/1.8/rexml/parsers/baseparser.rb:330:in `pull'
/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:22:in `parse'
/usr/local/lib/ruby/1.8/rexml/document.rb:227:in `build'
/usr/local/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
/home/fromthepage/fromthepage/releases/20140408151942/app/models/xml_source_processor.rb:161:in `new'
/home/fromthepage/fromthepage/releases/20140408151942/app/models/xml_source_processor.rb:161:in `update_links_and_xml'
/home/fromthepage/fromthepage/releases/20140408151942/app/models/xml_source_processor.rb:74:in `process_source'
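
A minimal sketch of turning the parser failure into a message the transcriber can act on, assuming the REXML library is available (it ships as a bundled gem with recent Ruby versions). The method name and message wording are hypothetical. This is not the code in `xml_source_processor.rb`.

```ruby
require 'rexml/document'

# Hypothetical helper (not FromThePage code): returns nil when the XML
# parses, otherwise a short human-readable description of the problem.
def transcription_parse_error(xml_text)
  REXML::Document.new(xml_text)
  nil
rescue REXML::ParseException => e
  # The first line of the exception text carries the core complaint;
  # later lines hold position details that could be shown as well.
  detail = e.message.lines.first.to_s.strip
  "Your transcription could not be saved because its markup is unbalanced: #{detail}"
end

puts transcription_parse_error("<page><b>unclosed bold</page>")
```

A controller could show the returned string in the flash area instead of the generic "Something went wrong" page.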

Rails 4 - Not able to create image set

After pulling a fresh version of the repo (as of today) and setting up my dev environment, I am not able to 'create an image set'. I am able to click the 'Create an image set' link. I enter the path (local) to a group of images and then click 'next' and I get this error:

unable to open image `2013-03-14_19-07-37_165.jpg': @ error/blob.c/OpenBlob/2587
Extracted source (around line #108):

# set default image data
orig = Magick::ImageList.new(sample_image.original_file)
self.original_width = orig.columns
self.original_height = orig.rows

Rails.root: /home/johnmlocklear/railsApps/fromthepage
Application Trace | Framework Trace | Full Trace

app/models/image_set.rb:108:in `new'
app/models/image_set.rb:108:in `process_sample_image'
app/models/image_set.rb:62:in `directory_setup'
app/controllers/transform_controller.rb:337:in `process_source_directory'
app/controllers/transform_controller.rb:87:in `directory_process'

Request

Parameters:

{"utf8"=>"✓",
"authenticity_token"=>"BNrCnS/aQ6H8IMBghfveM4MI2tEoBHqrrrEp9bujTQc=",
"directory"=>"/home/johnmlocklear/Pictures/art",
"commit"=>"Next"}

UI issues Editing transcription conventions

  1.  Initially, I was able to edit the transcription conventions, but after two or three times accessing the edit box, it no longer works. When I click in that area, it appears to open the editing box, but then as soon as I click again to actually enter text, it disappears and instead the “Permissions” editing box opens. 
    
  2.  Along the same lines, I notice that you can format the transcription conventions using html tags, but the tags disappear if you re-open the edit box. Is there another way I should be formatting text?
    

Data too long for column exception processing number location

Clicking on number location in the image set processor results in the following error:

ActiveRecord::StatementInvalid in TransformController#number_location_process
Mysql2::Error: Data too long for column 'action' at row 1: INSERT INTO interactions (action, browser, created_on, ip_address, params, session_id, status, user_id) VALUES ('number_location_process', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0', '2014-09-22 11:20:53', '127.0.0.1', '{"utf8"=>"✓", "authenticity_token"=>"nYVfiVsKMgh7RqL5RpO+uAtmtORkgAi5Zhvs07o33Oo=", "image_set_id"=>"110", "coordinate.x"=>"1...', 106, 'incomplete', 2)
Extracted source (around line #133):

      @interaction.page_id = @page.id
    end
    @interaction.save
  end

  def complete_interaction

Access Control Lists/Private Transcription Projects

It's great that FTP is being used for transcribing important documents. However, it could also be useful in situations where the content is not 'world readable'. Consider access control lists to restrict access to projects.

This could be a plugin model, where the access control is provided by other code. (In fact, that's where integrating fromthepage with other code might come in.)

Internet Archive Import should check for derive task status

If a user uploads a PDF to the Internet Archive, a stub entry is created there while the derive task is enqueued. Once the derive task is done, the work is ready to be imported into FromThePage.

Currently, FromThePage does not issue any error or warning if users attempt to import stub Internet Archive works into FromThePage. We should check for IA status and ask the user to wait and try again (with advice about knowing when their work is ready), rather than barfing when we can't find derived files.
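
A minimal sketch of such a check over an item's file listing. It assumes, without confirmation from the source, that a finished derive can be recognized by the presence of `_scandata.xml` and `_jp2.zip` files, as seen in the listing of a derived item elsewhere in these issues. All names below are hypothetical.

```ruby
# Assumed markers of a completed derive task (not confirmed by the source).
REQUIRED_DERIVED_SUFFIXES = %w[_scandata.xml _jp2.zip].freeze

# Returns the required suffixes that no file in the listing ends with.
def missing_derived_files(filenames)
  REQUIRED_DERIVED_SUFFIXES.reject do |suffix|
    filenames.any? { |name| name.end_with?(suffix) }
  end
end

# Returns nil when the item looks ready, otherwise a message for the user.
def import_readiness_message(filenames)
  missing = missing_derived_files(filenames)
  return nil if missing.empty?

  "This Internet Archive item is still being processed " \
    "(missing: #{missing.join(', ')}). Please wait and try again."
end

stub_listing = ["Doc_meta.xml", "Doc_files.xml", "scan.pdf"]
puts import_readiness_message(stub_listing)
```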

How to handle dittos

In Grinnell 1925 S3 Page 7, he uses dittos for several bird names as well as collection location descriptions. I chose to insert the actual text instead of the ditto marks. How many transcription rules does this violate :) ? My thinking is that using the marks would make the text harder to mine and Grinnell's meaning was patently clear.

Create New Image Set error

When clicking "Create Image Set" from the dashboard, (visiting http://localhost:3000/transform) the user gets the following error:

NoMethodError in Transform#index

Showing /home/benwbrum/dev/products/fromthepage/fromthepage/app/views/transform/directory_form.html.erb where line #1 raised:

undefined method `allow_forgery_protection' for {}:Hash

Extracted source (around line #1):

1: <%= form_tag( {:action => 'directory_process'} ) do %>
2:
3: <% if flash['error'] %>
4:

<%= flash['error'] %>

Rails.root: /home/benwbrum/dev/products/fromthepage/fromthepage
Application Trace | Framework Trace | Full Trace

app/views/transform/directory_form.html.erb:1:in `_app_views_transform_directory_form_html_erb__1749423035688235413_42378720'
app/controllers/transform_controller.rb:20:in `index'

Convert hpricot to nokogiri

New versions of Rails and Ruby appear to break the old hpricot library entirely.

We should convert this to nokogiri.

Grinnell used plus or minus

On Grinnell's S3 Page 46, he lists a number of birds that he saw but for which he did not take careful notes, so he marks some as +- with the + superimposed over the -. I rendered this as, for example,
Great Auk (1000 +-).
NHW

Works should not be transcribable without a collection.

It's possible for works to exist without a collection, but only in an intermediate state. We should force owners to add works to a collection so that they do not run into errors when they try to transcribe works in an invalid state.

Handling page headings that vary

In the earlier Grinnell notes that I have seen, the page heading appeared to always have the elements collector, locality, date, and page number. However, in the late summer survey in Mexico that he did with Lamb, the locality notes are sometimes more complex and can include latitude and elevation. E.g.:
Collector: Grinnell - 1925
Date: September 27
Location: San Jose, 2500 ft. Lat. 31 degrees (altitude according to our aneroid)
Page Number: 2550

My question is whether to render the transcription as I did above or whether to break up the location to aid future parsing. E.g.
Location: San Jose
Elevation: 2500 ft.
Latitude: Lat. 31 degrees
Instrument: (altitude according to our aneroid)

Pages without images handled gracelessly

Currently the system allows users to create works and associate images for only a few of their pages. This can lead to exceptions being thrown, for example
ActionView::TemplateError (private method `sub' called for nil:NilClass) on line #23 of app/views/shared/_zoom_div.rhtml or ActionView::TemplateError (private method `sub' called for nil:NilClass) on line #34 of app/views/display/_multi_page.rhtml

We should do something other than barf if we're missing images for a page.
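
A minimal sketch of a nil guard, assuming the failing templates call `sub` on an image path that is `nil` for pages without images. The struct, the placeholder path and the specific `sub` rewrite are all hypothetical stand-ins for whatever the real model and templates do.

```ruby
# Hypothetical stand-in for a page record; not the real Page model.
SketchPage = Struct.new(:title, :base_image)

# Hypothetical placeholder shown when a page has no image.
PLACEHOLDER_IMAGE = "/images/missing_page.png"

# Returns a displayable image path, falling back to the placeholder
# instead of calling sub on nil.
def display_image_path(page)
  image = page.base_image
  return PLACEHOLDER_IMAGE if image.nil? || image.empty?

  # Illustrative rewrite only; the real templates may derive paths differently.
  image.sub(/\.jpg\z/, "_display.jpg")
end

puts display_image_path(SketchPage.new("Page 1", nil))
puts display_image_path(SketchPage.new("Page 2", "/images/working/p2.jpg"))
```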

BookReader does not work when a doctype is added to the layout

BookReader does not work when a doctype such as <!DOCTYPE html> is added to the HTML document or layout.

I don't know why that is, but the library won't display anything when a doctype is added to the document. I tried many doctypes and it never works with one.

BookReader will only work in quirks mode (without a doctype), which is odd.

IA support for non-JP2

In order to handle the Graves diaries correctly, we need to be able to import books from IA with multiple filetypes as the originals.

Write a readme

Having the Rails one show up when you first get to the "fromthepage" project is not so good!
