documentcloud's Introduction

This is the repository for the legacy DocumentCloud site. For the current repository, see:

https://github.com/muckrock/documentcloud

______                                      _   _____ _                 _
|  _  \                                    | | /  __ \ |               | |
| | | |___   ___ _   _ _ __ ___   ___ _ __ | |_| /  \/ | ___  _   _  __| |
| | | / _ \ / __| | | | '_ ` _ \ / _ \ '_ \| __| |   | |/ _ \| | | |/ _` |
| |/ / (_) | (__| |_| | | | | | |  __/ | | | |_| \__/\ | (_) | |_| | (_| |
|___/ \___/ \___|\__,_|_| |_| |_|\___|_| |_|\__|\____/_|\___/ \__,_|\__,_|

DocumentCloud is a catalog of primary source documents and a tool for annotating, organizing and publishing them on the web. Documents are contributed by journalists, researchers and archivists.

This codebase contains the entirety of DocumentCloud.org, and pulls together the rest of our open-source projects: Docsplit is used to extract data from incoming documents; that work is parallelized across CloudCrowd; data on the client-side is modeled by Backbone.js, which depends on Underscore.js for all of its abilities; Jammit concatenates and compresses the dozens of CSS and JS files into a single asset package; the NYTimes' Document Viewer displays the documents, while Pixel Ping records the traffic.

If you find a security issue while browsing the source, please email [email protected] to inform us of the problem.

Code contributed to this project is provided under the MIT license (see the LICENSE file). Some components of the project are subject to their own licenses as indicated (see /vendor and /public/javascripts/vendor directories).

documentcloud's People

Contributors

adamhooper, adler, aronpilhofer, cometman, creationix, dannguyen, davidlemayian, esthervillars, freedmand, hackshacker, ivarvong, jashkenas, kant, knowtheory, lgrandestaff, mitchelljkotler, nathanstitt, nloadholtes, p-j, reefdog, samanthasunne, samuelclay

documentcloud's Issues

More efficient document data entry

A way to enter metadata at the time of the upload would save us a lot of time.

I'm trying to solve a use case for a mining contracts database where we're making heavy use of document metadata.

We are processing many documents with a similar metadata structure. The batch upload comes in handy. However, we still have to go back in and edit each document, entering metadata from an empty form each time.

Before we scope out what a solution using the API would look like: are there any plans to integrate metadata entry into the upload interface?

HTTP Basic Auth never fails

If I enter an invalid username and password, I don't receive an HTTP 401 response as I should.

Test URL: https://dksjgha:[email protected]/api/projects.json

Expected results: a 401 error

Actual results: a listing of public projects

I understand that an API to list public projects is important. But it's also important to tell users the password is wrong. Maybe there should be a "public:public" user, or something janky like that?
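The behaviour asked for here can be sketched as a small decision function: anonymous requests still get the public listing, but a request that presents bad credentials gets a 401. This is a minimal illustration in Python, not DocumentCloud's actual code; `status_for_request` and the `check_credentials` callback are hypothetical names.

```python
import base64
from typing import Callable, Optional


def status_for_request(
    authorization: Optional[str],
    check_credentials: Callable[[str, str], bool],
) -> int:
    """Return the HTTP status a listing endpoint could answer with.

    No Authorization header -> 200 (anonymous, public listing).
    Valid Basic credentials -> 200 (authenticated listing).
    Anything else           -> 401, so the client learns the login failed.
    """
    if authorization is None:
        return 200
    scheme, _, encoded = authorization.partition(" ")
    if scheme.lower() != "basic":
        return 401
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and undecodable bytes
        return 401
    username, separator, password = decoded.partition(":")
    if not separator or not check_credentials(username, password):
        return 401
    return 200
```

With this shape there is no need for a special "public:public" user: leaving the credentials out is what selects the public listing.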

API: Faceted search?

Any plans to expose search facets through the public API?

I'd be looking for a facets hash that is part of the search result. This hash would contain all metadata key/value pairs present in the result set and their corresponding totals. With such a hash I could build a faceted drill down experience.
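A rough sketch of the facets hash described above, written in Python for illustration. It assumes each document in the result set carries its key/value metadata in a `data` dictionary; `build_facets` is a hypothetical helper, not part of the API.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping


def build_facets(documents: Iterable[Mapping]) -> Dict[str, Dict[str, int]]:
    """Count every metadata key/value pair across a result set."""
    counts = defaultdict(Counter)
    for doc in documents:
        # Documents without metadata simply contribute nothing.
        for key, value in doc.get("data", {}).items():
            counts[key][value] += 1
    return {key: dict(values) for key, values in counts.items()}
```

A client could render each top-level key as a facet group and each value with its total as a drill-down link.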

GET /api/projects.json should allow eliding document_ids

In our project, we need to list projects, but we don't need their document IDs. For large projects, the list of IDs will eat up bandwidth. There should be an option to request just a count of documents, not the actual document IDs.
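One possible shape for such an option, sketched in Python. The `include_document_ids` flag and the `document_count` field are assumptions made for illustration; the real API may name them differently or not offer them at all.

```python
from typing import Any, Dict


def serialize_project(project: Dict[str, Any], include_document_ids: bool = True) -> Dict[str, Any]:
    """Serialize a project, optionally replacing its ID list with a count."""
    result = {key: value for key, value in project.items() if key != "document_ids"}
    document_ids = project.get("document_ids", [])
    if include_document_ids:
        result["document_ids"] = list(document_ids)
    else:
        # Only the size of the list goes over the wire.
        result["document_count"] = len(document_ids)
    return result
```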

Safari goes to first page when switching to fullscreen

Steps to reproduce (or so I'm told -- I don't have Safari)

  1. Browse to https://www.documentcloud.org/documents/437144-100-amesys-eagle-glint-operator-manual-extras.html?sidebar=false
  2. Scroll to page 24
  3. Click the DV-fullscreen button at the bottom-left of the page

Expected results: the document opens at page 24
Actual results: the document opens at page 1

The reason: the document is served with "canonical_url":"http://www.documentcloud.org/documents/437144-100-amesys-eagle-glint-operator-manual-extras.html" in its JSON. document-viewer uses the canonical_url to open the fullscreen window. When Safari opens up http://example.org#hash and that redirects to https://example.org, it drops the hash. (Chrome, for some reason, keeps the hash.)

There are two solutions. Both seem sensible, though only one is strictly necessary to solve this bug:

  1. Make document-viewer build its fullscreen URL with window.location instead of canonical_url.
  2. Ensure each document's canonical_url doesn't produce a 302 response. (In practice: make those links https.)
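The second option amounts to never handing out a URL that needs a redirect. A minimal Python sketch of that idea follows; `fullscreen_url` is a hypothetical helper, and the fragment format used in the usage note is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def fullscreen_url(canonical_url: str, fragment: str) -> str:
    """Force https and attach the fragment, so no redirect can drop it."""
    parts = urlsplit(canonical_url)
    return urlunsplit(("https", parts.netloc, parts.path, parts.query, fragment))
```

For example, `fullscreen_url("http://www.documentcloud.org/documents/437144-x.html", "document/p24")` yields an https URL ending in `#document/p24`, which the browser can open directly.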

Add cache buster parameter to page image urls

Currently, if one redacts a document, the previous cached images will remain in cache, causing confusion for some users.

Adding a hash based on the document's update time would provide a stable cache key, and be updated at the appropriate time.
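A small Python sketch of the idea, assuming the update time is available as a `datetime` and using a hypothetical `ts` query parameter as the cache key.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cache_buster(image_url: str, updated_at: datetime) -> str:
    """Append the document's update time as a query parameter."""
    parts = urlsplit(image_url)
    # Drop any earlier "ts" value so the parameter is never duplicated.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "ts"]
    query.append(("ts", str(int(updated_at.timestamp()))))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The URL stays identical until the document is modified, so caches keep working; after a redaction the timestamp changes and clients fetch the new images.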

Embedding duplicates of a single note

@jsvine has noted that embedding a note twice clobbers the first of the two notes. The first note is clobbered because the notes embed themselves using options stored on the note model, of which there is only one. Possible solutions are as follows:

  • don't embed duplicate notes
  • modify note model & view to explicitly support loading into 2+ divs
  • index note models off of a client side uuid rather than the note's resource id
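The third option could look roughly like the following. This is an illustrative Python sketch rather than the viewer's actual JavaScript; `NoteEmbedRegistry` and its methods are hypothetical names.

```python
import uuid
from typing import Dict, List


class NoteEmbedRegistry:
    """Track each embed of a note separately, keyed by a client-side UUID."""

    def __init__(self) -> None:
        self._embeds: Dict[str, dict] = {}

    def register(self, note_id: int, container: str) -> str:
        """Record one embed and return the key that identifies it."""
        embed_key = uuid.uuid4().hex
        self._embeds[embed_key] = {"note_id": note_id, "container": container}
        return embed_key

    def containers_for(self, note_id: int) -> List[str]:
        """List every container the given note is embedded in."""
        return [e["container"] for e in self._embeds.values() if e["note_id"] == note_id]
```

Because the key is generated per embed, registering the same note twice produces two entries instead of the second overwriting the first.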

IE9 can't log in in an iframe

Steps to reproduce:

  1. Create a page with an iframe that points to a private document.
  2. Open the page in Internet Explorer 9. You'll see a 403 page.
  3. Log in

Expected results: you log in

Actual results: you get a 403 page again

The reason: IE9 rejects cookies in iframes by default, because of some weird standard called P3P.

More info: http://stackoverflow.com/questions/389456/cookie-blocked-not-saved-in-iframe-in-internet-explorer

One possible solution: https://github.com/hoopla/rack-p3p

GemLoadError

Hi guys,

I'm trying to install documentcloud and am having a number of issues with the installation. I got through the installation, but am now getting:

/usr/local/lib/site_ruby/1.8/rubygems/specification.rb:1637:in `raise_if_conflicts': Unable to activate actionpack-2.3.14, because rack-1.4.1 conflicts with rack (~> 1.1.0) (Gem::LoadError)

My Gem List is:

*** LOCAL GEMS ***

actionmailer (2.3.14)
actionpack (2.3.14)
activerecord (2.3.14)
activeresource (2.3.14)
activesupport (2.3.14)
bcrypt-ruby (3.0.1)
builder (3.0.0)
calais (0.0.13)
cloud-crowd (0.6.2)
curb (0.8.0)
daemon_controller (1.0.0)
daemons (1.1.8)
docsplit (0.6.3)
eventmachine (0.12.10)
fastthread (1.0.7)
hpricot (0.8.6)
jammit (0.6.5)
json (1.7.3)
libxml-ruby (2.3.2)
mime-types (1.18)
nokogiri (1.5.4)
open4 (1.3.0)
passenger (3.0.13)
pg (0.13.2)
Platform (0.4.0)
POpen4 (0.1.4)
pr_geohash (1.0.0)
rack (1.4.1, 1.1.3)
rack-protection (1.2.0)
rails (2.3.14)
rake (0.9.2.2)
rdiscount (1.6.8)
rest-client (1.6.7)
right_aws (3.0.4)
right_http_connection (1.3.0)
rsolr (1.0.8)
rubygems-update (1.3.5)
rubyzip (0.9.9)
sanitize (2.0.3)
sinatra (1.3.2, 0.9.6)
sqlite3 (1.3.6)
sqlite3-ruby (1.3.3)
sunspot (1.3.3)
sunspot_rails (1.3.3)
thin (1.3.1)
tilt (1.3.3)
yui-compressor (0.9.6)

My ruby Version is: ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]
Rails: Rails 2.3.14 (can also be seen in the gems list)

I'm raising this as an issue because I followed the documentation as closely as possible. I have done this a number of times as well...

My two cents: Either the documentation needs to be updated, or there is a compatibility issue.

I hope that I'm wrong on both counts and it IS indeed my fault...:)

Multiple copies of the same document

I've noticed that often the same document is uploaded by multiple users, yet each time it still has to be processed. Would it be possible to detect that the document has been previously uploaded and clone the processed document? Also, if we were all working on the same copy, it might be interesting to be able to see other users' annotations.
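Detecting a repeat upload could be done by comparing content digests. The Python sketch below assumes byte-identical files and an in-memory index; `ProcessedIndex` is a hypothetical name, and a real system would persist the digests.

```python
import hashlib
from typing import Dict, Optional


def content_digest(data: bytes) -> str:
    """Fingerprint a file by its bytes, independent of its filename."""
    return hashlib.sha256(data).hexdigest()


class ProcessedIndex:
    """Map a file's digest to the ID of an already-processed document."""

    def __init__(self) -> None:
        self._by_digest: Dict[str, int] = {}

    def find_or_register(self, data: bytes, document_id: int) -> Optional[int]:
        """Return the earlier document's ID if these bytes were seen before."""
        digest = content_digest(data)
        existing = self._by_digest.get(digest)
        if existing is None:
            self._by_digest[digest] = document_id
        return existing
```

Note that this only catches exact duplicates; a re-scanned or re-saved copy of the same document would hash differently.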

Add an ability to skip OCR for documents

Sometimes you just want to display a document, and you don't care about OCR or OpenCalais or whatever. If there were an option to totally skip that stuff, it would be amazing.

Bulk document upload progress idiosyncrasies

These are all minor UI issues relating to selecting many documents (100s to 1000s) and uploading them in a batch.

  • Display of documents "ahead of you in line" never changes
  • No way to tell how many documents remain. The title with "upload X documents" never changes
  • "Email when finished" seems to email after each document is uploaded, not after the batch is done

PDF Properties

The PDF properties of PDF documents should be extracted and shown in the right pane of the document page. Sometimes the properties (especially the Author, dates, and name of the program used to produce the PDF) can be interesting.

And then it should be possible to search on them, to say, "which other PDFs were created by the same author?"

The contents of the property fields are not standardized, and are only circumstantial evidence, but every little bit helps.

Document text retrieved through API gives random 404 errors

Overview retrieves hundreds or thousands of documents through the API (redirecting to S3) when importing a document set. Most of the time this works. But sometimes, for no apparent reason, about half of the docs start coming back with a 404 error. Restarting the import sometimes works.

This seems completely unpredictable. Maybe it has something to do with the load that Overview puts on DocumentCloud. Currently we try to keep up to 4 HTTP requests in flight at all times.

Here's an example of a URL that came back 404. I can provide hundreds on request. They all seem to work if pasted into a browser immediately after the error.

https://www.documentcloud.org/api/documents/409438-problems-utmining.txt

By 404 I mean we get the standard 404 error page, "The page you tried to access could not be found. It may have moved or been deleted. Maybe you'd like to search our catalog of public documents?"

Here's the header from one such response. Maybe the cookie will help you sort it out.

22:50:06.193 WARN WORKER - Unable to retrieve document from https://www.documentcloud.org/api/documents/147466-d208326372.txt. Status Code: 404
Cache-Control:no-cache
Connection:close
Content-Encoding:identity
Content-Type:text/html; charset=utf-8
Server:nginx/1.2.2 + Phusion Passenger 3.0.15 (mod_rails/mod_rack)
Set-Cookie:document_cloud_session=BAh7BjoPc2Vzc2lvbl9pZCIlMmIzNmJjNDhhYjc5NGViMTU4Y2Y2OWM5NmExMzMxYmQ%3D--db45b4eeaff1615836d54710ce90fc66d4f75c9d; path=/; expires=Fri, 12-Apr-2013 02:49:25 GMT; secure; HttpOnly
Status:404
Transfer-Encoding:chunked
Vary:Accept-Encoding
X-Powered-By:Phusion Passenger (mod_rails/mod_rack) 3.0.15

... etc.
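Since the same URLs work when tried again shortly afterwards, a client-side stopgap is to retry a 404 a few times before giving up. The Python sketch below illustrates that workaround only; it does not address the underlying cause. `fetch_with_retry` and the injected `fetch` callable, which returns a (status, body) pair, are hypothetical.

```python
import time
from typing import Callable, Tuple


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> Tuple[int, str]:
    """Call fetch(url) until it stops returning 404 or attempts run out."""
    status, body = fetch(url)
    for attempt in range(1, attempts):
        if status != 404:
            break
        time.sleep(delay_seconds * attempt)  # wait a little longer each time
        status, body = fetch(url)
    return status, body
```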

Bulk delete

It's frequently necessary to delete many files at once when working with a large document set:

  • If an upload of many documents fails partway
  • when there is a new version of the document set
  • as a workaround if you need to change access level (due to #3)

I imagine that one could implement this by deleting an entire project at once. But we also need a way to delete documents that are not in a project. Perhaps the ability to assign a search result set to a project?

Documents are not embed-able or viewable while processing

We'd like to be more responsive to users who are embedding documents by allowing them to embed documents immediately after uploading. This seems reasonable, since the image extraction doesn't take a prohibitive amount of time relative to the text extraction and indexing, and the images are uploaded to s3 as soon as they are extracted.

It's a relatively easy UI change to do so, and I've completed the modifications in commit: nathanstitt@4879670

The resulting embed fails to load while the document is processing because documents_controller returns 403 in the show method. This traces back to Document#accessible checking whether the access is set to PUBLIC, which it is not; it's PENDING.

TL;DR: we need a way to indicate that a document is processing but should still be viewable.

I see three avenues we can take to accomplish this:

  • Add a new access level PROCESSING_PUBLIC (or some such). I find this one distasteful as the change would be very invasive to the codebase, and wouldn't provide any future benefits.

  • Convert the access levels into bitmasks. By doing so a document could have multiple statuses such as PROCESSING & PUBLIC. This would keep us from having to clutter up the documents table with another column, but the downside is the same as the upside: it would be possible for a document to have two statuses. PROCESSING & PUBLIC makes sense; PUBLIC & PRIVATE does not.

  • Remove the processing access level and make it a boolean flag on the document model.
    This one is my favorite. In my mind there are two concerns with the document.

    • The access level, which controls who can view/modify it.
    • Whether the document is being processed.

    A side benefit of this method is that documents would no longer lose their access level when an error occurs during processing.
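The third option, sketched in Python for illustration. The real model is a Rails class; the `Document` dataclass, its field names and the constants below are simplified assumptions.

```python
from dataclasses import dataclass

PRIVATE = "private"
PUBLIC = "public"


@dataclass
class Document:
    access: str = PRIVATE      # who may view or modify the document
    processing: bool = False   # whether background processing is still running

    def publicly_viewable(self) -> bool:
        """Access alone decides visibility; processing no longer hides it."""
        return self.access == PUBLIC

    def finish_processing(self) -> None:
        """Clear the flag; the access level was never overwritten."""
        self.processing = False
```

Keeping the two concerns in separate fields also rules out nonsensical combinations such as PUBLIC & PRIVATE, which the bitmask approach would allow.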

Perhaps someone can think of a better method?

Multiple file upload in page insertion is broken

When multiple files are uploaded for insertion between pages, they are all assigned to be inserted into the same page position.

For instance if you put the insertion point between page 1 and 2 - the uploaded document gets a filename 2.pdf, and is then inserted into page number 2, and the remaining pages are moved down. This works as expected.

However, if multiple documents are uploaded for insertion, they all get the filename 2.pdf and overwrite one another. The insert_pages action is kicked off multiple times and processes the first file fine, but the remaining actions all fail because the uploaded file 2.pdf no longer exists: the first job deletes it.

The easy fix here is to disallow multiple file uploads for page insertions. If we do decide to support them, perhaps we could randomize the file name and pass that to the import action?
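Randomizing the file name could be as simple as the following Python sketch. `insertion_upload_name` is a hypothetical helper; the generated name would still need to be passed along to the import action together with the insertion position.

```python
import uuid
from pathlib import PurePosixPath


def insertion_upload_name(original_filename: str) -> str:
    """Build a collision-free name for one uploaded insertion file."""
    # Keep the extension so downstream tools still recognise the file type.
    suffix = PurePosixPath(original_filename).suffix or ".pdf"
    return f"{uuid.uuid4().hex}{suffix}"
```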

Append documents to a project

Instead of posting all of the document IDs, we should be able to append (or remove!) individual document IDs to a project.

If you have a few hundred documents, you quickly have a POST request that is too unwieldy.

Using Document Data key/value for search shortcuts

Given a list of health-inspection documents, we would like to "tag" them with the violations mentioned inside. Right now the Document Data key/value pairs enforce unique keys, so we can't do something like:

violation: foo
violation: bar
violation: baz

Our next thought was to go with a comma-separated list, e.g.:

violations: foo, bar, baz

But the search appears to return only exact matches, so putting "violations:foo" in the search box does not return the expected document(s). Is it possible to make this work somehow?
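One way the comma-separated approach could be made searchable is to split the stored value before matching. The Python sketch below illustrates the matching rule only; it is not how DocumentCloud's search is implemented, and both function names are hypothetical.

```python
from typing import List


def split_values(raw: str) -> List[str]:
    """Turn "foo, bar, baz" into ["foo", "bar", "baz"]."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def matches(document_data: dict, key: str, wanted: str) -> bool:
    """True if `wanted` is one of the comma-separated values stored under `key`."""
    return wanted in split_values(document_data.get(key, ""))
```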

How do project IDs work?

The API's JSON represents them as numbers, but the browser UI shows a slug of the title after the ID. For example 3095 versus 3095-conrad-murray-trial.

Search embed codes do not escape quotes.

see:

<div id="DC-search-projectid-4478-2012advault-contributedto-freethefiles" class="DC-search-container"></div>
<script src="http://s3.documentcloud.org/embed/loader.js"></script>
<script>
  dc.embed.load('http://www.documentcloud.org/search/embed/', {
    q: "projectid: 4478-2012advault  contributedto: '"freethefiles"'",
    container: "#DC-search-projectid-4478-2012advault-contributedto-freethefiles",
    title: "",
    order: "score",
    per_page: 12,
    search_bar: true,
    organization: 1
  });
</script>
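In the snippet above, the inner double quotes end the `q` string early. A common fix is to serialize the query with a JSON encoder when generating the embed code. The Python sketch below shows the idea; `js_string_literal` is a hypothetical helper and the actual generator is not written in Python.

```python
import json


def js_string_literal(value: str) -> str:
    """Serialize a value as a double-quoted JavaScript string literal."""
    # json.dumps escapes quotes and backslashes; the replace keeps a
    # literal "</script>" in the query from closing the script tag early.
    return json.dumps(value).replace("</", "<\\/")
```

The generated line would then read `q: "projectid: 4478-2012advault contributedto: \"freethefiles\"",`, which parses as a single string.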

Add /organizations/:organization_id/accounts REST endpoint

Currently both user and admin modifications are performed in the /accounts controller.

The admin functions should be removed from that controller and moved to a new one using nested resources on the organization.

This will cleanly separate the roles between an account updating their own information and an administrator updating members of his organization.

Big PDF

I've got several 5,000-page PDFs that I've tried to upload. It's a bit ridiculous, I know. In any case, out of 10 PDFs, DC would only accept one. I've tried re-uploading the other nine and have had no luck getting them into the system.

Need an authenticated method to access text of private documents

The resources.text field of the document JSON returned from GET /api/search indicates the URL where the plain text of the document can be found, currently stored on S3. But the text is not accessible for private documents. Overview needs a way to retrieve the text, with appropriate credentials, perhaps presented over SSL with HTTP basic auth.
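From the client's side, the proposed access might look like this Python sketch. It only builds the request and assumes the server would accept Basic credentials on the text URL, which is exactly what this issue asks for; `authenticated_text_request` is a hypothetical name.

```python
import base64
import urllib.request


def authenticated_text_request(text_url: str, email: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying HTTP Basic credentials."""
    if not text_url.startswith("https://"):
        raise ValueError("refusing to send credentials over a non-SSL URL")
    token = base64.b64encode(f"{email}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(text_url, headers={"Authorization": f"Basic {token}"})
```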

Annotations spanning across page

Feature request from our users. Murphy's Law currently ensures that the most interesting and quotable parts of a document will always wrap over to the following page. Might it be possible to allow cross-page annotations? Thanks

Can't upload bulk documents if any one of them is over 200MB

When trying to use multi-select to upload a batch of documents, if any one of them is over 200MB then the entire upload is cancelled ("please optimize documents before uploading.")

I would expect that the too-large files would be skipped, perhaps with a warning message listing them.

Also, the too-large file was in this case a .zip which (presumably) would not have been uploaded anyway.
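The expected behaviour can be sketched as a partition step that runs before the upload starts. This Python example is illustrative only; `partition_by_size` is a hypothetical helper, and treating 200MB as 200 * 1024 * 1024 bytes is an assumption.

```python
from typing import Dict, List, Tuple

MAX_UPLOAD_BYTES = 200 * 1024 * 1024  # the 200MB limit, assumed to be binary megabytes


def partition_by_size(
    sizes: Dict[str, int], limit: int = MAX_UPLOAD_BYTES
) -> Tuple[List[str], List[str]]:
    """Split filenames into (uploadable, skipped) instead of rejecting the batch."""
    uploadable = [name for name, size in sizes.items() if size <= limit]
    skipped = [name for name, size in sizes.items() if size > limit]
    return uploadable, skipped
```

The `skipped` list is what the warning message would show, while everything in `uploadable` proceeds as usual.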

Project wide document editing

Scott Klein of ProPublica has asked for the ability to modify the properties (e.g. access level) across all of the documents in a project, rather than touching each document individually.

No API to view password-protected documents

We can receive the JSON for a password-protected document through AJAX. However, when we feed that to document-viewer, the URLs DocumentViewer tries to access all return 403 Forbidden.

There's no workaround: we can only show the document by using an iframe.

Potential fixes:

  • An API for generating session tokens. This would either be OAuth or it would be insecure.
  • A public API for generating image URLs, which document-viewer would use. This way we could pass the same password to the image-fetch URLs that we do to the document-JSON URL.
  • Password-free, time-sensitive image URLs in the JSON.
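The last option is commonly built with an HMAC signature over the path and an expiry time. The Python sketch below shows that general technique under stated assumptions: the `expires` and `signature` parameter names, the message format and the server-side secret are all hypothetical, not an existing DocumentCloud feature.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def make_signature(path: str, expires_at: int, secret: bytes) -> str:
    """Sign the path together with its expiry so neither can be altered."""
    message = f"{path}:{expires_at}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_image_url(path: str, expires_at: int, secret: bytes) -> str:
    """Return the path with expiry and signature query parameters attached."""
    signature = make_signature(path, expires_at, secret)
    return f"{path}?{urlencode({'expires': expires_at, 'signature': signature})}"


def is_valid(path: str, expires_at: int, signature: str, secret: bytes, now: int) -> bool:
    """Accept the request only if it is unexpired and the signature matches."""
    if now > expires_at:
        return False
    expected = make_signature(path, expires_at, secret)
    return hmac.compare_digest(expected, signature)
```

The document JSON would carry the signed URLs, so document-viewer could load page images without a password or session until the links expire.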
