The DocumentCloud platform

Home Page: https://www.documentcloud.org

License: MIT License

Ruby 33.45% JavaScript 31.06% CSS 8.95% HTML 18.65% Shell 1.25% XSLT 1.39% EJS 5.24%

documentcloud's Introduction

This is the repository for the legacy DocumentCloud site. Please see the current repository here:

https://github.com/muckrock/documentcloud

______                                      _   _____ _                 _
|  _  \                                    | | /  __ \ |               | |
| | | |___   ___ _   _ _ __ ___   ___ _ __ | |_| /  \/ | ___  _   _  __| |
| | | / _ \ / __| | | | '_ ` _ \ / _ \ '_ \| __| |   | |/ _ \| | | |/ _` |
| |/ / (_) | (__| |_| | | | | | |  __/ | | | |_| \__/\ | (_) | |_| | (_| |
|___/ \___/ \___|\__,_|_| |_| |_|\___|_| |_|\__|\____/_|\___/ \__,_|\__,_|

DocumentCloud is a catalog of primary source documents and a tool for annotating, organizing and publishing them on the web. Documents are contributed by journalists, researchers and archivists.

This codebase contains the entirety of DocumentCloud.org, and pulls together the rest of our open-source projects: Docsplit is used to extract data from incoming documents; that work is parallelized across CloudCrowd; data on the client-side is modeled by Backbone.js, which depends on Underscore.js for all of its abilities; Jammit concatenates and compresses the dozens of CSS and JS files into a single asset package; the NYTimes' Document Viewer displays the documents, while Pixel Ping records the traffic.

If you find a security issue while browsing the source, please email [email protected] to inform us of the problem.

Code contributed to this project is provided under the MIT license (see the LICENSE file). Some components of the project are subject to their own licenses as indicated (see /vendor and /public/javascripts/vendor directories).

documentcloud's Issues

Documents remain in pending state even after processing fails

In the event that a job fails due to a Solr timeout or unavailability, the job/document is not marked as failed in DocumentCloud. This is possibly a CloudCrowd error, but it may also be due to the way that our actions are written.

HTTP Basic Auth never fails

If I enter an invalid username and password, I don't receive an HTTP 401 response as I should.

Test URL: https://dksjgha:[email protected]/api/projects.json

Expected results: a 401 error

Actual results: a listing of public projects

I understand that an API to list public projects is important. But it's also important to tell users the password is wrong. Maybe there should be a "public:public" user, or something janky like that?
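One way to keep the public listing while still rejecting bad passwords is to treat "no credentials" and "wrong credentials" as different cases. The following is only a sketch of that decision, not DocumentCloud's actual code: `statusForPublicListing` and the `isValidLogin` callback are hypothetical names introduced for illustration.

```javascript
// Hypothetical sketch: choose the HTTP status for a request to a public listing.
// `authorizationHeader` is the raw Authorization header (or undefined).
// `isValidLogin` is an assumed callback that checks a username/password pair.
function statusForPublicListing(authorizationHeader, isValidLogin) {
  // No credentials at all: anonymous access to public data is fine.
  if (!authorizationHeader) return 200;

  const match = /^Basic\s+(.+)$/i.exec(authorizationHeader);
  if (!match) return 401; // Unsupported or malformed scheme.

  // Basic credentials are base64("username:password").
  const decoded = Buffer.from(match[1], "base64").toString("utf8");
  const separator = decoded.indexOf(":");
  if (separator === -1) return 401;

  const username = decoded.slice(0, separator);
  const password = decoded.slice(separator + 1);

  // Credentials were supplied, so they must be correct.
  return isValidLogin(username, password) ? 200 : 401;
}
```

With this split, an anonymous request still gets the public projects, and the "public:public" workaround is not needed.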

API: Faceted search?

Any plans to expose search facets through the public API?

I'd be looking for a facets hash that is part of the search result. This hash would contain all metadata key/value pairs present in the result set and their corresponding totals. With such a hash I could build a faceted drill down experience.
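To make the request concrete, here is a rough sketch of the facets hash described above, computed client-side from a result set. It assumes each document exposes its metadata as a flat `data` object of string keys and values. That shape, and the `buildFacets` name, are assumptions for illustration rather than part of the existing API.

```javascript
// Build a facets hash: for every metadata key, count how often each value
// appears in the result set.
// Assumes each document has a flat `data` object (illustrative shape only).
function buildFacets(documents) {
  const facets = {};
  for (const doc of documents) {
    for (const [key, value] of Object.entries(doc.data || {})) {
      facets[key] = facets[key] || {};
      facets[key][value] = (facets[key][value] || 0) + 1;
    }
  }
  return facets;
}

const exampleFacets = buildFacets([
  { id: 1, data: { state: "CA", topic: "water" } },
  { id: 2, data: { state: "CA", topic: "energy" } },
  { id: 3, data: { state: "NY" } },
]);
console.log(JSON.stringify(exampleFacets));
```

A server-side version would ideally count over the whole result set rather than a single page of results, which is why exposing it through the API would be preferable.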

GET /api/projects.json should allow eliding document_ids

In our project, we need to list projects, but we don't need their document IDs. For large projects, the list of IDs will eat up bandwidth. There should be an option to request just a count of documents, not the actual document IDs.

Firefox 22 will block 3rd party cookies by default

This basically means our embed strategy is broken by default on Firefox 22.

If we want users to be able to see authenticated content on other sites, we're going to have to implement an iframe embedding strategy.

Check the discussion here: https://news.ycombinator.com/item?id=5271971

When uploading new documents in a batch, allow any publication status to be set

Currently you can only set the batch to public with the checkbox. What if you want to set a batch of documents to be private to your organization?

Note and Project Embeds should have afterLoad callback

The DocumentViewer has them!

Metadata not exposed in search.json endpoints

Are there any plans to expose metadata on the search.json endpoints?

No metadata in external search endpoint: https://skitch.com/alexbarth/8hib2/untitled-10
Metadata in internal search endpoint: https://skitch.com/alexbarth/8hib8/untitled-9
I see metadata is available in external single doc endpoint, but I'd like to avoid a round trip per doc: https://skitch.com/alexbarth/8hing/untitled-11

Retaining User Preferences

Mike Masnick from Techdirt has requested user-set defaults for the document embed dialogs.

Language must be present on organizations and accounts

Need validations on both.

upload page button still active even if position not selected

It'll pop open the dialog, ask you for a page, and then close the document, but w/o actually firing up a job (since it doesn't have a position to insert the page into).

Safari goes to first page when switching to fullscreen

Steps to reproduce (or so I'm told -- I don't have Safari)

Browse to https://www.documentcloud.org/documents/437144-100-amesys-eagle-glint-operator-manual-extras.html?sidebar=false
Scroll to page 24
Click the DV-fullscreen button at the bottom-left of the page

Expected results: the document opens at page 24
Actual results: the document opens at page 1

The reason: the document is served with "canonical_url":"http://www.documentcloud.org/documents/437144-100-amesys-eagle-glint-operator-manual-extras.html" in its JSON. document-viewer uses the canonical_url to open the fullscreen window. When Safari opens up http://example.org#hash and that redirects to https://example.org, it drops the hash. (Chrome, for some reason, keeps the hash.)

There are two solutions. Both seem sensible, though only one is strictly necessary to solve this bug:

Make document-viewer build its fullscreen URL with window.location instead of canonical_url.
Ensure each document's canonical_url doesn't produce a 302 response. (In practice: make those links https.)
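The first option could look roughly like the sketch below. It is not document-viewer's actual code: `fullscreenUrl` is a hypothetical helper, `currentHref` stands in for `window.location.href`, and the `#document/pN` hash format is an assumption.

```javascript
// Build the fullscreen URL from the page the reader is already on, so the
// scheme (https) is preserved and no redirect can drop the hash.
// `currentHref` stands in for window.location.href.
// The "#document/pN" hash format is an assumption for illustration.
function fullscreenUrl(currentHref, pageNumber) {
  const url = new URL(currentHref);
  url.hash = "document/p" + pageNumber; // keep the reader's position
  return url.toString();
}
```

Because the URL is derived from the current location, it stays on https whenever the reader is already on https, regardless of what `canonical_url` says.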

Add cache buster parameter to page image urls

Currently, if one redacts a document, the previous cached images will remain in cache, causing confusion for some users.

Adding a hash based on the document's update time would provide a stable cache key, and be updated at the appropriate time.
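A minimal sketch of that idea follows. The `ts` parameter name and the `withCacheBuster` helper are assumptions, and the example URL in the usage note is illustrative only.

```javascript
// Append a cache-busting parameter derived from the document's update time.
// The value is stable between edits and changes whenever the document is
// updated (for example, after a redaction).
// The "ts" parameter name is an assumption for illustration.
function withCacheBuster(imageUrl, updatedAt) {
  const url = new URL(imageUrl);
  const stamp = Math.floor(new Date(updatedAt).getTime() / 1000);
  url.searchParams.set("ts", String(stamp));
  return url.toString();
}
```

For example, `withCacheBuster("https://example.org/pages/p1.gif", doc.updated_at)` would yield the same URL until `updated_at` changes, at which point browsers and proxies fetch the new image.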

Sharing documents is located in Analyze dropdown menu

"Share these documents" and "Share this project" are options in the Analyze dropdown menu, rather than the Projects or Publish menus. Not very intuitive.

Embedding duplicates of a single note

@jsvine has noted that embedding a note twice clobbers the first of the two notes. The first note is clobbered because the notes embed themselves using options stored on the note model, of which there is only one. Possible solutions are as follows:

don't embed duplicate notes
modify note model & view to explicitly support loading into 2+ divs
index note models off of a client side uuid rather than the note's resource id
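The third option might be sketched like this. The registry, the `registerEmbed` name, and the option shape are all hypothetical, and the real note embed code is structured differently.

```javascript
// Registry of embedded notes keyed by a client-side uid instead of the
// note's resource id, so two embeds of the same note keep separate options.
// All names here are illustrative, not the real embed code.
let nextEmbedUid = 0;
const noteEmbeds = new Map();

function registerEmbed(noteId, options) {
  const uid = "note-" + noteId + "-" + nextEmbedUid++;
  noteEmbeds.set(uid, { noteId, options });
  return uid;
}
```

Each call returns a distinct key even when `noteId` repeats, so the second embed no longer overwrites the container stored for the first.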

IE9 can't log in in an iframe

Steps to reproduce:

Create a page with an iframe that points to a private document.
Open the page in Internet Explorer 9. You'll see a 403 page.
Log in

Expected results: you log in

Actual results: you get a 403 page again

The reason: IE9 rejects cookies in iframes by default, because of some weird standard called P3P.

More info: http://stackoverflow.com/questions/389456/cookie-blocked-not-saved-in-iframe-in-internet-explorer

One possible solution: https://github.com/hoopla/rack-p3p

GemLoadError

Hi guys,

I'm trying to install documentcloud and am having a number of issues with the installation. I got through the installation, but am now getting:

/usr/local/lib/site_ruby/1.8/rubygems/specification.rb:1637:in `raise_if_conflicts': Unable to activate actionpack-2.3.14, because rack-1.4.1 conflicts with rack (~> 1.1.0) (Gem::LoadError)

My Gem List is:

*** LOCAL GEMS ***

actionmailer (2.3.14)
actionpack (2.3.14)
activerecord (2.3.14)
activeresource (2.3.14)
activesupport (2.3.14)
bcrypt-ruby (3.0.1)
builder (3.0.0)
calais (0.0.13)
cloud-crowd (0.6.2)
curb (0.8.0)
daemon_controller (1.0.0)
daemons (1.1.8)
docsplit (0.6.3)
eventmachine (0.12.10)
fastthread (1.0.7)
hpricot (0.8.6)
jammit (0.6.5)
json (1.7.3)
libxml-ruby (2.3.2)
mime-types (1.18)
nokogiri (1.5.4)
open4 (1.3.0)
passenger (3.0.13)
pg (0.13.2)
Platform (0.4.0)
POpen4 (0.1.4)
pr_geohash (1.0.0)
rack (1.4.1, 1.1.3)
rack-protection (1.2.0)
rails (2.3.14)
rake (0.9.2.2)
rdiscount (1.6.8)
rest-client (1.6.7)
right_aws (3.0.4)
right_http_connection (1.3.0)
rsolr (1.0.8)
rubygems-update (1.3.5)
rubyzip (0.9.9)
sanitize (2.0.3)
sinatra (1.3.2, 0.9.6)
sqlite3 (1.3.6)
sqlite3-ruby (1.3.3)
sunspot (1.3.3)
sunspot_rails (1.3.3)
thin (1.3.1)
tilt (1.3.3)
yui-compressor (0.9.6)

My ruby Version is: ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]
Rails: Rails 2.3.14 (can also be seen in the gems list)

I'm raising this as an issue because I followed the documentation as closely as possible. I have done this a number of times as well...

My two cents: Either the documentation needs to be updated, or there is a compatibility issue.

I hope that I'm wrong on both counts and it IS indeed my fault...:)

Multiple copies of the same document

I've noticed that the same document is often uploaded by multiple users, yet each time it still has to be processed. Would it be possible to detect that the document has been previously uploaded and clone the processed document? It might also be interesting, if we were all working on the same copy, to be able to see other users' annotations.

Add a noConflict function so global dc can be used elsewhere

We had a clash with some siteCatalyst code which was using dc as a global var. It would be great if you could add noConflict like in jQuery, etc.

https://github.com/documentcloud/documentcloud/blob/master/public/note_embed/note_embed.js

Thanks :)
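For reference, the jQuery-style pattern being asked for looks roughly like this. It is a sketch only: `installDc` is a hypothetical wrapper, and `root` stands in for the browser's `window` so the example can run outside a browser.

```javascript
// jQuery-style noConflict: remember whatever owned the global name before
// we took it, and hand it back on request.
// `root` stands in for `window`; `installDc` is a hypothetical wrapper.
function installDc(root) {
  const hadPrevious = Object.prototype.hasOwnProperty.call(root, "dc");
  const previous = root.dc;

  const dc = {
    embed: {},
    // Restore the previous owner of `dc` and return our own object.
    noConflict() {
      if (root.dc === dc) {
        if (hadPrevious) root.dc = previous;
        else delete root.dc;
      }
      return dc;
    },
  };

  root.dc = dc;
  return dc;
}
```

A page that also loads SiteCatalyst could then call `var documentcloud = dc.noConflict();` once and use `documentcloud.embed` from there on.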

Add an ability to skip OCR for documents

Sometimes you just want to display a document, and you don't care about OCR or OpenCalais or whatever. If there were an option to totally skip that stuff, it would be amazing.

Bulk document upload progress idiosyncrasies

These are all minor UI issues relating to selecting many documents (100s to 1000s) and uploading them in a batch.

Display of documents "ahead of you in line" never changes
No way to tell how many documents remaining. Title with "upload X documents" never changes
"Email when finished" seems to email after each document is uploaded, not after the batch is done

Can the API download the "local" version of the document viewer available now by push button?

PDF Properties

The PDF properties of PDF documents should be extracted and shown in the right pane of the document page. Sometimes the properties (especially the Author, dates, and name of the program used to produce the PDF) can be interesting.

And then it should be possible to search on them, to ask, "Which other PDFs were created by the same author?"

The contents of the property fields are not standardized, and are only circumstantial evidence, but every little bit helps.

Document text retrieved through API gives random 404 errors

Overview retrieves hundreds or thousands of documents through the API (redirecting to S3) when importing a document set. Most of the time this works. But sometimes, for no apparent reason, about half of the docs start coming back with a 404 error. Restarting the import sometimes works.

This seems completely unpredictable. Maybe it has something to do with the load that Overview puts on DocumentCloud. Currently we try to keep up to 4 http requests in flight at all times.

Here's an example of a URL that came back 404. I can provide hundreds on request. They all seem to work if pasted into a browser immediately after the error.

https://www.documentcloud.org/api/documents/409438-problems-utmining.txt

By 404 I mean we get the standard 404 error page, "The page you tried to access could not be found. It may have moved or been deleted. Maybe you'd like to search our catalog of public documents?"

Here's the header from one such response. Maybe the cookie will help you sort it out.

22:50:06.193 WARN WORKER - Unable to retrieve document from https://www.documentcloud.org/api/documents/147466-d208326372.txt. Status Code: 404
Cache-Control:no-cache
Connection:close
Content-Encoding:identity
Content-Type:text/html; charset=utf-8
Server:nginx/1.2.2 + Phusion Passenger 3.0.15 (mod_rails/mod_rack)
Set-Cookie:document_cloud_session=BAh7BjoPc2Vzc2lvbl9pZCIlMmIzNmJjNDhhYjc5NGViMTU4Y2Y2OWM5NmExMzMxYmQ%3D--db45b4eeaff1615836d54710ce90fc66d4f75c9d; path=/; expires=Fri, 12-Apr-2013 02:49:25 GMT; secure; HttpOnly
Status:404
Transfer-Encoding:chunked
Vary:Accept-Encoding
X-Powered-By:Phusion Passenger (mod_rails/mod_rack) 3.0.15

... etc.

Bulk delete

It's frequently necessary to delete many files at once when working with a large document set:

when an upload of many documents fails partway
when there is a new version of the document set
as a workaround if you need to change access level (due to #3)

I imagine that one could implement this by deleting an entire project at once. But we would also need a way to delete docs that are not in a project. Perhaps the ability to assign a search result set to a project?

Documents are not embed-able or viewable while processing

We'd like to be more responsive to users who are embedding documents by allowing them to embed documents immediately after uploading. This seems reasonable, since the image extraction doesn't take a prohibitive amount of time relative to the text extraction and indexing, and the images are uploaded to s3 as soon as they are extracted.

It's a relatively easy UI change to do so, and I've completed the modifications in commit: nathanstitt@4879670

The resulting embed fails to load while the document is processing because documents_controller returns a 403 in the show method. This traces back to Document#accessible checking whether the access is set to PUBLIC, which it is not; it's PENDING.

TL;DR: we need to create a way to indicate that a document is processing but should still be viewable.

I see three avenues we can take to accomplish this:

Add a new access level PROCESSING_PUBLIC (or some such). I find this one distasteful as the change would be very invasive to the codebase, and wouldn't provide any future benefits.
Convert the access levels into bitmasks. By doing so, a document could have multiple statuses, such as PROCESSING & PUBLIC. This would keep us from having to clutter up the documents table with another column, but the downside is the same as the upside: it would be possible for a document to have two statuses. PROCESSING & PUBLIC makes sense; PUBLIC & PRIVATE does not.
Remove the processing access level and make it a boolean flag on the document model.
This one is my favorite. In my mind there are two concerns with the document.
- The access level, which controls who can view/modify it.
- Whether the document is being processed.
A side benefit of this method is that documents would no longer lose their access level when an error occurs during processing.
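To illustrate the third option, here is a small sketch with the two concerns stored separately. It is written in JavaScript for brevity and is not the Rails model. The field names (`access`, `processing`, `failed`, `accountId`) and both function names are hypothetical.

```javascript
// Access level and processing state kept as two independent fields.
// Field and function names are illustrative only.
const PUBLIC = "public";
const PRIVATE = "private";

// Who may view the document depends only on the access level.
function isAccessible(doc, viewerAccountId) {
  if (doc.access === PUBLIC) return true;
  return doc.accountId === viewerAccountId;
}

// Finishing (or failing) a job only touches the processing flags,
// so the access level chosen at upload time is never lost.
function finishProcessing(doc, succeeded) {
  return { ...doc, processing: false, failed: !succeeded };
}
```

Under this split, a document uploaded as public is viewable in an embed while `processing` is still true, and a failed job leaves `access` untouched.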

Perhaps someone can think of a better method?

note & search embeds don't contain noscript links

The DocumentCloud document embed generator includes a <noscript> section indicating where a document can be found. The other embed codes probably should as well.

Embed note causes "Operation aborted" error in IE7

Here's an example
http://jsfiddle.net/RtASV/3/

and one in the wild
http://motherjones.com/politics/2011/08/fbi-sting-greatest-hits

IE 7.0.5730.11CO, Win XP

Long query strings for DocumentSet embeds cause caching errors

We statically cache JSON blobs with a filename set by the query, so if the query is too long, the file caching will error out.

We should restrict the length of queries or find a way to uniquely hash filenames for caching purposes.
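The hashing option could be as small as the sketch below, which runs under Node.js. The `cacheFilename` helper and the `.json` suffix are assumptions. The point is only that a digest has a fixed length no matter how long the query is.

```javascript
const { createHash } = require("node:crypto");

// Map an arbitrarily long search query to a fixed-length cache filename.
// SHA-256 yields 64 hex characters regardless of the input length.
// `cacheFilename` and the ".json" suffix are illustrative assumptions.
function cacheFilename(query) {
  const digest = createHash("sha256").update(query, "utf8").digest("hex");
  return digest + ".json";
}
```

The trade-off is that the filenames are no longer human-readable, so debugging the cache would need a separate mapping from digest back to query.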

/api/projects/:id.json is synonym for /api/projects.json

I'd expect /api/projects/:id.json to behave differently from /api/projects.json, but it seems to do the same thing. In particular, it lists lots of projects, and not just one.

Integrate with Tabula

... because it would be lovely.

Ability to add/update/delete sections and annotations via the API and get the permalinks for those

Multiple file upload in page insertion is broken

When multiple files are uploaded for insertion between pages, they are all assigned to be inserted into the same page position.

For instance if you put the insertion point between page 1 and 2 - the uploaded document gets a filename 2.pdf, and is then inserted into page number 2, and the remaining pages are moved down. This works as expected.

However, if multiple documents are uploaded for insertion, they all get the filename 2.pdf and overwrite one another. The insert_pages action is kicked off multiple times and processes the first file fine, but the remaining actions all fail because the uploaded file 2.pdf no longer exists: the first job deletes it.

The easy fix here is to disallow multiple file uploads for page insertions. If we do decide to support them, perhaps we could randomize the file name and pass that to the import action?
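If multiple uploads were supported, the naming could work along these lines. This is a sketch under assumptions: `insertionFilenames` and the `position-batch-index.pdf` layout are hypothetical, not the current upload code.

```javascript
// Give each uploaded file its own name so parallel insert jobs cannot
// overwrite or delete one another's input.
// The "<position>-<batch>-<index>.pdf" layout is an assumption.
function insertionFilenames(position, fileCount) {
  // One random-ish token per batch; the index keeps the upload order.
  const batch = Date.now().toString(36) + Math.random().toString(36).slice(2, 8);
  return Array.from({ length: fileCount }, (_, i) => `${position}-${batch}-${i}.pdf`);
}
```

Each job would then receive its own filename, and the index suffix records the order in which the files should be inserted.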

Printing pages with embedded notes results in unexpected behavior

Pages such as this propublica article don't render for print in a legible manner.

They end up looking something like this: http://cl.ly/0W43353z0V1T1F2a3Z1x

(reported by @kleinmatic )

Append documents to a project

Instead of posting all of the document IDs, we should be able to append (or remove!) individual document IDs to a project.

If you have a few hundred documents, you quickly have a POST request that is too unwieldy.

Using Document Data key/value for search shortcuts

Given a list of health-inspection documents, we would like to "tag" them with the violations mentioned inside. Right now the Document Data key/value pairs enforce unique keys, so we can't do something like:

violation: foo
violation: bar
violation: baz

Our next thought was to go with a comma-separated list, e.g.:

violations: foo, bar, baz

But the search appears to return only exact matches, so putting "violations:foo" in the search box does not return the expected document(s). Is it possible to make this work somehow?
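Until the search supports this, one possible workaround is to fetch a broader result set and filter it client-side. The sketch below assumes each document carries its key/value pairs in a `data` object. `hasDataValue` is a hypothetical helper, not part of the API.

```javascript
// Client-side post-filter: treat a comma-separated Document Data value as a
// list and test for membership.
// Assumes documents expose their key/value pairs as `data` (illustrative).
function hasDataValue(doc, key, wanted) {
  const raw = (doc.data || {})[key] || "";
  return raw
    .split(",")
    .map((part) => part.trim())
    .includes(wanted);
}
```

For example, `docs.filter((d) => hasDataValue(d, "violations", "foo"))` keeps only the documents tagged with `foo`.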

DocumentCloud "top" offset wrong in embeds

If you visit this story and click on the link halfway down that reads "one Los Angeles Times article warned" you will see:

But if you visit the annotation on the DocumentCloud site you will see:

I think you can see the issue.

How do project IDs work?

The API's JSON represents them as numbers, but the browser UI shows a slug of the title after the ID. For example, 3095 versus 3095-conrad-murray-trial.

Filter: "draft annotations"

It would be helpful to filter documents by "draft annotations"

Search embed codes do not escape quotes.

see:

<div id="DC-search-projectid-4478-2012advault-contributedto-freethefiles" class="DC-search-container"></div>
<script src="http://s3.documentcloud.org/embed/loader.js"></script>
<script>
  dc.embed.load('http://www.documentcloud.org/search/embed/', {
    q: "projectid: 4478-2012advault  contributedto: '"freethefiles"'",
    container: "#DC-search-projectid-4478-2012advault-contributedto-freethefiles",
    title: "",
    order: "score",
    per_page: 12,
    search_bar: true,
    organization: 1
  });
</script>
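A generator that serializes the options instead of concatenating strings would avoid the broken `q:` line above. The following is a sketch, not the actual embed generator: `searchEmbedSnippet` is a hypothetical name, and only two of the options are included for brevity.

```javascript
// Generate the dc.embed.load(...) call with properly escaped option values.
// JSON.stringify escapes quotes and backslashes inside the query string.
// `searchEmbedSnippet` is a hypothetical name for illustration.
function searchEmbedSnippet(query, containerId) {
  const options = { q: query, container: "#" + containerId };
  // Escape "<" as well, so a query containing "</script>" cannot end the tag.
  const literal = JSON.stringify(options, null, 2).replace(/</g, "\\u003c");
  return "dc.embed.load('http://www.documentcloud.org/search/embed/', " + literal + ");";
}
```

With the query from this report, the generated line would read `"q": "projectid: 4478-2012advault  contributedto: \"freethefiles\""`, which is valid JavaScript.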

Add /organizations/:organization_id/accounts REST endpoint

Currently both user and admin modifications are performed in the /accounts controller.

The admin functions should be removed from that controller and moved to a new one using nested resources on the organization.

This will cleanly separate the roles between an account updating their own information and an administrator updating members of his organization.

Big PDF

I've got several 5,000-page PDFs that I've tried to upload. It's a bit ridiculous, I know. In any case, out of 10 PDFs, DC would only accept one. I've tried re-uploading the other nine and have had no luck getting them into the system.

Need an authenticated method to access text of private documents

The resources.text field of the document JSON returned from GET /api/search indicates the URL where the plain text of the document can be found, currently stored on S3. But the text is not accessible for private documents. Overview needs a way to retrieve the text, with appropriate credentials, perhaps presented over SSL with HTTP basic auth.

Annotations spanning across page

Feature request from our users. Murphy's Law currently ensures that the most interesting and quotable parts of a document will always wrap over to the following page. Might it be possible to allow cross page annotations? Thanks

Can't upload bulk documents if any one of them is over 200MB

When trying to use multi-select to upload a batch of documents, if any one of them is over 200MB then the entire upload is cancelled ("please optimize documents before uploading.")

I would expect the too-large files to be skipped, perhaps with a warning message listing them.

Also, the too-large file was in this case a .zip which (presumably) would not have been uploaded anyway.
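The expected behavior could be sketched as a simple partition of the selected files. The names below are hypothetical, and treating 200MB as 200 * 1024 * 1024 bytes is an assumption about how the limit is measured.

```javascript
// Split a batch into files that can be uploaded and files to skip with a
// warning, instead of cancelling the whole batch.
// Treating 200MB as 200 * 1024 * 1024 bytes is an assumption.
const MAX_UPLOAD_BYTES = 200 * 1024 * 1024;

function partitionBySize(files, limit = MAX_UPLOAD_BYTES) {
  const accepted = [];
  const skipped = [];
  for (const file of files) {
    (file.size > limit ? skipped : accepted).push(file);
  }
  return { accepted, skipped };
}
```

The uploader would then proceed with `accepted` and show the names in `skipped` in a single warning message.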

Potential fixes:

An API for generating session tokens. This would either be OAuth or it would be insecure.
A public API for generating image URLs, which document-viewer would use. This way we could pass the same password to the image-fetch URLs that we do to the document-JSON URL.
Password-free, time-sensitive image URLs in the JSON.