lookingglass's Introduction

LookingGlass

Search, filter, and browse any set of documents. LookingGlass includes full text search, category filters, and date queries all through a nice search interface with an Elasticsearch backend. LookingGlass also supports customizable themes and flexible document view pages for browsing and embedding a variety of document types.

LookingGlass requires DocManager so that it can interact with Elasticsearch. LookingGlass can be used in combination with Harvester for crawling, parsing, and loading documents and automatically turning them into a searchable archive. However, it also works well as a standalone archiving tool.

Installation

Dependencies

  • DocManager and all of its dependencies
  • ruby 2.4.1
  • rails 5
  • (optionally) Harvester
  • libmagic-dev

Setup Instructions

  1. Install the dependencies
  2. Get LookingGlass
     • Clone repo: git clone --recursive git@github.com:TransparencyToolkit/LookingGlass.git
     • Go into the LookingGlass directory: cd LookingGlass
     • Install the Rubygems LookingGlass uses: bundle install
     • Generate simple form data: rails generate simple_form:install --bootstrap
     • Precompile assets: rake assets:precompile
  3. Run LookingGlass
     • Start DocManager: Follow the instructions on the DocManager repo
     • Configure Project: Edit the file in config/initializers/project_config so that the PROJECT_INDEX value is the name of the index in the DocManager project config that LookingGlass should use
     • Start LookingGlass: Run rails server -p 3001
     • Use LookingGlass: Go to http://0.0.0.0:3001 in your browser
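
As a rough illustration of the Configure Project step, the initializer might contain something like the following. Only the PROJECT_INDEX name comes from the instructions above; the index name "my_archive" is a placeholder, and the real file may contain other settings.

```ruby
# config/initializers/project_config (sketch; the real file may contain more)
# PROJECT_INDEX must match the index name in the DocManager project config.
# "my_archive" is a placeholder, not a real index name.
PROJECT_INDEX = "my_archive"
```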

Features

LookingGlass is a frontend for searchable document archives. Previously, it also included the backend that interacted with Elasticsearch, but this has since been split out into DocManager. The key features are described below.

Display of Document Sets

LookingGlass shows document sets from multiple data sources. It displays a list of documents on the main page. The fields displayed for each document on the index page and the order the documents are displayed in (sorted by date or another numerical field) are customizable in DocManager's data source config files.

Each individual document set is then displayed on its own page for easy reading. The document page includes a sidebar with the document's categorical field and a customizable set of tabs that can display the document text, embed the document itself (which is stored remotely, locally, or on document cloud), offer document downloads, or load links.

Search

LookingGlass allows full text search of document sets using the Elasticsearch backend. It can be used to search documents in most languages. LookingGlass supports searching all fields or individual fields, including a variety of non-text fields like dates. Results are sorted by relevance, with text matching the query highlighted.

Categorical Filters

Many document sets have categorical fields that are common across documents, either in the original data or that can be extracted with a tool like Catalyst. For example, countries mentioned in a document, file format, hashtags, and topic-specific keywords are common types of categories. LookingGlass allows filtering document sets by one or more categories by clicking links on the sidebar to get, say, all the documents that are about a particular country.

The category sidebar also displays the number of documents for each value in each category that matches the current query. This is great for getting an overview of the content in the document set.
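
The per-category counts can be pictured with a small in-memory sketch. This is only an illustration of the idea: in LookingGlass the counts come from the Elasticsearch backend, and the category_counts helper and sample documents below are hypothetical.

```ruby
# Hypothetical helper: count how many documents carry each value of a
# categorical field. A field may hold one value, a list of values, or be absent.
def category_counts(docs, field)
  docs.each_with_object(Hash.new(0)) do |doc, counts|
    Array(doc[field]).each { |value| counts[value] += 1 }
  end
end

# Made-up documents matching some query.
docs = [
  { "country" => ["US", "UK"] },
  { "country" => "US" },
  { "title" => "No country field" }
]

counts = category_counts(docs, "country")
```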

Document View Templates

On both the search results/document index and individual document pages, the way the document is displayed is highly customizable. It is possible to add new templates to display different types of data sources however you want and even thread together multiple documents when needed (in email datasets, for example).

These view templates are defined in app/views/docs/show/tabs/panes (for the document view page) and app/views/docs/index/results/result_templates (for the index/result view). The fields to use as a thread ID and the view templates to use are specified per-source in the DocManager data source config files.

Version Tracking

LookingGlass can be used to track which documents change over time and how. Documents that are changed are specified in categories on the sidebar and the document view page has a tool that allows users to view the exact difference between two documents over time.

The fields used to check if a document has changed are specified per-source in the DocManager data source config files.

Custom Themes

LookingGlass supports custom theming. The color scheme, fonts, logo, text, and links are all entirely customizable.

Some of these settings, like the theme used, project title, and logo are defined in the DocManager project config file. The colors and fonts can then be set by creating a theme.

lookingglass's People

Contributors

ageis, bnvk, iliasbartolini, shidash

lookingglass's Issues

Move "dataset_details.json" to instance config.json

Currently the dataspec file for a given type of data collection source- e.g. for LinkedIn is in: app/dataspec/dataspec-linkedin/dataset_details.json

This file contains values like "Path": "/home/username/Data/processed_version_error" and since this info changes from instance to instance- this should probably be moved to the instance-config.json file!

Only Redirect to Existing Entities

When users are creating an entity, sometimes it redirects before the entity exists and users get an error. The entity still saves correctly though and can be viewed on refresh, it just isn't there yet when users load the page. Check if the entity exists and wait before redirecting.

Add normalized term list input form

Users need a form for entering a normalized term list. This form should output a hash/dictionary of the following format:

{
  "Canonical Name" : [
      "Name Variant 1",
      "Variant 2",
      "var 3"]
}

This is useful for normalizing multiple name variants or grouping documents with topically-related but different keywords. Users should be able to enter the keys they want to show up as the categories, and then for each key enter an unlimited # of terms documents should be checked for. Each of these lists of terms acts as a normal term list that Catalyst can extract, except instead of being tagged with the terms themselves it is tagged with the overall key associated with the terms.
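
A minimal sketch of how such a term list could be applied, assuming simple case-insensitive substring matching. The canonical_tags helper is hypothetical and is not Catalyst's actual extraction logic.

```ruby
# Hypothetical helper: return the canonical names whose variants appear in
# the text. Matching is a case-insensitive substring check.
def canonical_tags(text, term_list)
  downcased = text.downcase
  term_list.select do |_canonical, variants|
    variants.any? { |variant| downcased.include?(variant.downcase) }
  end.keys
end

term_list = {
  "Canonical Name" => ["Name Variant 1", "Variant 2", "var 3"]
}

# The document is tagged with the key, not with the variant that matched.
tags = canonical_tags("This document mentions variant 2 once.", term_list)
```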

Search result formatting where there is long text but not description

In datasets where there is a field of display type Long Text but not of display type Description, the Long Text field shows up to the right and there is a large blank space on the left. When there is no Description type field, this should be single column.

This should work to check if there are description fields/get the num of description fields-
dataspec.field_info.select{|f| f["Display Type"]== "Description"}.length
You may need to use @dataspec or a different var depending on what dataspec is where this is called.
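
The check can be tried against sample data like this. The field_info entries below are made up, and the "Field Name" key is an assumption; only the "Display Type" key and its "Description" and "Long Text" values come from the issue.

```ruby
# Sample field_info in the shape the snippet above implies (hypothetical data).
field_info = [
  { "Field Name" => "title", "Display Type" => "Title" },
  { "Field Name" => "body",  "Display Type" => "Long Text" }
]

# Count the fields whose display type is "Description".
description_count = field_info.count { |f| f["Display Type"] == "Description" }

# Use a single-column layout when there is no Description field.
single_column = description_count.zero?
```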

On Catalyst job list page, remove any non-working buttons

There are some buttons that don't do anything on the Catalyst job list page. We need to remove them for now. Then, separately, we can discuss if we should add functionality described on the buttons (such as rerunning Catalyst jobs).

Make Catalyst search step clearer

The step where users filter the documents on the advanced search form for Catalyst is a bit confusing. The following should be done to make it clearer-

  • Change "Search Name" to "Mining Job Name" or similar
  • Move the Mining Job Name field above the field with the dropdown settings menu
  • Divide the entities and documents/emails into two categories on dropdown (similarly to how fields are divided into text/category/date types on search form)

Add Field Interface

Users should be able to add their own fields to documents. This is already supported on the backend, but there needs to be an interface which collects the following details:

  • Document type to add field on
  • Field name (human readable name)
  • Icon
  • Field type

Implement the idea of "Anomalies"

Basically a way for a human to "tag" one point of data in a given search result item as something of interest or an "anomaly" such as:

  • Job at Defense Contractor
  • Job at Pharmacy
  • Job at Defense Contractor
  • Job in Military
  • Job at Defense Contractor

The thinking here being that once a user (of LookingGlass) flags something as an anomaly, they can then view other anomalies with the goal of perhaps finding patterns between anomalies. In the above case there might be "another person who worked at a similarly odd company in between defense contracting gigs during the same few year period"

Separate search dropdown into "groups"

Currently the search dropdown menu looks like this:

[Screenshot of the search dropdown menu, 2015-04-15]

This could be made more user friendly by grouping the types of search into sections using the <optgroup> tag, so something like:

<select name="ftypes">
  <optgroup label="Date Ranges">
    <option value="doc-date">Document Date</option>
  </optgroup>
  ...
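
One way the grouped markup could be generated is sketched below in plain Ruby. The field list and the grouped_select helper are hypothetical; only "Document Date", "doc-date", "Date Ranges", and the ftypes name come from the example above.

```ruby
require "cgi"

# Hypothetical field list: [label, value, group].
FIELDS = [
  ["Document Date", "doc-date", "Date Ranges"],
  ["Title",         "title",    "Text Fields"],
  ["Full Text",     "text",     "Text Fields"]
].freeze

# Build a <select> whose options are grouped into <optgroup> sections.
def grouped_select(name, fields)
  groups = fields.group_by { |_label, _value, group| group }
  body = groups.map do |group, members|
    options = members.map do |label, value, _group|
      %(    <option value="#{CGI.escapeHTML(value)}">#{CGI.escapeHTML(label)}</option>)
    end
    [%(  <optgroup label="#{CGI.escapeHTML(group)}">), *options, "  </optgroup>"].join("\n")
  end
  [%(<select name="#{CGI.escapeHTML(name)}">), *body, "</select>"].join("\n")
end

html = grouped_select("ftypes", FIELDS)
```

Inside a Rails view, the built-in grouped_options_for_select helper would likely be a better fit than hand-built strings; the sketch only shows the grouping step.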

Fix page <title> with dynamic values

Currently this just shows the same word "Search" no matter which page you are on. The title should be dynamic and reflect the specific page and query.

Add "show/hide" for search result items

This was lost in refactoring but should be added back! It should allow the user to show / hide certain fields in a search result item. The point being- to make it easier to comb through results visually

Simplified Catalyst

We should change our current Catalyst form to be an "advanced" form (much like the advanced search forms that are prevalent) so that we can keep much of the functionality, but default to a simpler form instead. This form should have the following defaults-

Search Settings: Don't allow filtering the documents at all. Just run over documents and emails (not entities). Run over all documents.

Choosing Catalyst Methods: This can be similar to the advanced settings.

Settings: The same defaults as in #75 should be used, but the user should not be able to modify them. That means that in many cases the user will only need to choose what input method they want to use (if there are no settings to choose, such as entity extraction) and nothing else at all on the whole form. In others, where the Catalyst method requires user input (such as term lists), the user will need to be prompted to enter these settings.

The current form should still be available so that the full range of configuration is there, just not as an obvious thing. We can write a tutorial on the configuration options on the advanced form once we aren't changing the UI too much anymore. The simple form should be usable by nontechnical users without instructions, ideally (or with minimal instructions when entering settings on some Catalyst methods).

Bulk set categorical field values

There should be a way to set all documents to have a certain value for a specified categorical field. Both the frontend and backend are needed for this. The user interface needs to collect the following values-

  • Document type (which determines which categorical fields to show)
  • Field to set
  • Value to set to (or if it should be cleared entirely)

Eventually we may want to support bulk setting of documents matching search values or bulk deletion, but for now let's just provide this to set/unset values for all documents.

Asset compilation issue

When running rake assets:precompile in production, I get this error-
Sass::SyntaxError: Undefined variable: "$btn-border-radius-base".
(in /var/www/update/LookingGlass/app/assets/stylesheets/bootstrap-custom.scss:20)
/var/www/update/LookingGlass/vendor/bundle/ruby/2.2.0/gems/bootstrap-sass-3.3.5/assets/stylesheets/bootstrap/_buttons.scss:20:in `button-size'
/var/www/update/LookingGlass/vendor/bundle/ruby/2.2.0/gems/bootstrap-sass-3.3.5/assets/stylesheets/bootstrap/_buttons.scss:20
/var/www/update/LookingGlass/app/assets/stylesheets/bootstrap-custom.scss:17
I recall that we had this issue before. How can this be fixed?

Also, the logo in the upper left isn't showing up- what file/icon should be in the instance spec for that?

Thanks

Need dynamic element IDs for diffing

I made the diffing code work with the data loaded into LookingGlass. However, the ids for the diffing elements (like versions-container) are constant and hardcoded into the Javascript.

This is a problem because, due to the structure of the data where multiple items show per page, sometimes there need to be multiple diffing containers per page. This doesn't work with constant IDs. Is there a way to make them dynamic?

It seems like there are a couple ways this could be approached-

  1. The html for diffing is now in app/views/docs/_changetracker.html.erb. Setting dynamic ids based on the doc ID would be quite easy here. But these need to be matched up in the JS somehow. Perhaps they could have a common class and the current JS could be run for each element in the class, but without the ID hardcoded?
  2. The html for versions-diffing-type, versions-compute, and versions-diff could be rendered only once on the page and the div with the id versions-container could have a class of versions-container instead. Then all elements with this class could be added to the list to be diffed. These elements may require unique IDs themselves that correlate them to the appropriate doc (rather than just using the field name as the ID like they currently do).

The first solution seems preferable visually to me, but of course there may be other options too. What do you think?
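
A sketch of the first approach, assuming the ids are built on the Ruby side from the doc ID and a shared class is kept for the JS to iterate over. The helper names are hypothetical, and the real markup in _changetracker.html.erb may be structured differently.

```ruby
# Shared class that lets the JS find every diffing container on the page.
DIFF_CONTAINER_CLASS = "versions-container"

# Hypothetical helper: derive a unique DOM id from a base name and a doc id.
def diff_element_id(base, doc_id)
  # Replace characters that are awkward in CSS selectors.
  safe_id = doc_id.to_s.gsub(/[^A-Za-z0-9_-]/, "_")
  "#{base}-#{safe_id}"
end

# Hypothetical helper: one container per document, unique id, common class.
def diff_container_tag(doc_id)
  %(<div id="#{diff_element_id(DIFF_CONTAINER_CLASS, doc_id)}" class="#{DIFF_CONTAINER_CLASS}"></div>)
end
```

The JS could then select by class and read each element's id instead of relying on a hardcoded one.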

To duplicate this issue, you should-

  1. Download the new test data on the server from /mnt/disk/processed_test_data
  2. Clone and use the current dataspec-linkedin repo
  3. Change the path in the dataset_details in the dataspec-linkedin folder to point to the location of processed_test_data
  4. Index the data with rails runner as normal
  5. Open the search and look for Aziz Omar

Create UI designs for Timeline view

This search visualization should take into account the following:

  • Display basic Event items
  • Display Document files and distinguish
  • Color coding by categories / tags
  • Sorting by: year, month, week, day
  • Sort by: ascending / descending
  • Switching between timeline / normal view
  • Quick adding of new events

Compatibility with ElasticSearch 2.x

It appears we are currently pegged to ElasticSearch 1.5.2. I tested LookingGlass on 2.3.x, 1.6.x and 1.7.x, and each produced errors.

The application is incompatible because it relies heavily on Facets, which were already deprecated in the version that is being used. In ElasticSearch 2.x, they have been replaced with Aggregations. They have some useful docs on how to migrate from Facets to Aggregations.
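
For the simplest case, a terms facet, the request-side change can be sketched as below. The facets_to_aggs helper and the country field are hypothetical, the facet options are copied across verbatim (some facet options have different names or no equivalent in aggregations), and other facet types are not handled.

```ruby
# Convert the "facets" section of an Elasticsearch 1.x request body into an
# "aggs" section. Only terms facets are handled in this sketch.
def facets_to_aggs(body)
  facets = body.fetch("facets", {})
  aggs = facets.each_with_object({}) do |(name, facet), result|
    raise ArgumentError, "unsupported facet: #{name}" unless facet.key?("terms")
    result[name] = { "terms" => facet["terms"] }
  end
  rest = body.reject { |key, _| key == "facets" }
  aggs.empty? ? rest : rest.merge("aggs" => aggs)
end

old_body = {
  "query"  => { "match_all" => {} },
  "facets" => { "country" => { "terms" => { "field" => "country", "size" => 10 } } }
}
new_body = facets_to_aggs(old_body)
```

The response shape changes too: facet results list term/count pairs, while terms aggregations return buckets with key and doc_count, so any code that reads the counts would need updating as well.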

Since there is lots of code that is affected here, it is quite a task to migrate to the latest ElasticSearch. But hopefully we can get it done eventually.

Catalyst Advanced Defaults

The following defaults should be set on the Catalyst advanced form:

  • Icon: Already set. Keep using default even after picker added
  • Field Name: Already set.
  • Fields to Search: Should default to text/title/description on ArchiveDoc, Body/Subject/Attachment Text on EmailDoc, name/description or similar on the entities. Perhaps the defaults should be set in the dataspec for each document type.
  • Parameters: In some cases (term lists) defaults won't be possible, but it should work with optional defaults, which should be specified in the Catalyst seed data

Dropdown with different options for adding data

Reorganize the topbar on editable LG to have a dropdown with the different ways of adding content (currently a link to the upload form and a link to the entity form). Also try to make the wording clearer.

Set all documents to be published/not published

This is a special case of #83 where the document types (all of them), fields (to_publish), and values (Yes/No) are all preset. We just need "Set all documents to be published" and "Set all documents not to be published" buttons somewhere. This allows users to publish all documents by default and selectively filter out certain documents.

Warnings and deprecations

Might add more if I find them.

Updating jsontableschema
Fetching: jsontableschema-0.2.2.gem (100%)
  WARNING:   The 'jsontableschema' gem has been deprecated and will be replaced by the gem 'tableschema'.
             See: https://github.com/frictionlessdata/tableschema-rb
Post-install message from twitter-bootstrap-rails:
Important: You may need to add a javascript runtime to your Gemfile in order for bootstrap's LESS files to compile to CSS. 

Several of the active{*} or action{*} gems fail on Ruby versions below 2.2.2; that's okay, though, since our production Ruby is now 2.4.1. I'm finding ~/.gemrc and ~/.bundle/config convenient for managing these environments.

Diffing only partially working with real data

Diffing now works if there is just a single item fields doc and the non item fields part that need to be diffed. But if multiple docs on a page need their item fields diffed, the box doesn't show up.

I've sent you an email with details on how to duplicate this on our test data instance. Could you please look into this?

Thanks

PI Dataspec

We need to add icons for the PI dataspecs- both the company and materials dataspecs. These dataspecs can be found here- https://github.com/transparencytoolkit/dataspec-sii. Once these are added to LG, please also update the dataspec icon fields in the dataspec-sii repo with the appropriate images.

Cloudflare settings for Tor visitors

So everything is fine when viewing the ICWATCH site directly, but when visiting via Tor there's a nonspecific error that may be related to the CloudFlare settings. Here's a screenshot:

[Screenshot, 2015-05-08 14:08:45]

Upgrade CSS of app to interface design V1

The current design of the interface (see below) has been decided to be "good enough" for now and thus implementing it in CSS is the next step.

[Image: Interface Design V1]

I'm just opening an issue to track things I'm working on as it's my workflow for all my other projects, and it's more effort to not use Issues

Smoother Document Editing

Make the document editing experience smoother by ensuring that users don't need to click the edit button many times to get to the edit interface.

Number Slider/Input Box for Catalyst annotator settings

Currently some of the Catalyst methods accept numerical input and just have dropdowns of numbers "One" to "Ten" as input. But this often does not make sense. Sometimes the input should be in the hundreds, or other times it may be between 0 and 1. The range that makes sense should be added to the Catalyst seed data and a slider should be used to select #s in that range (with the option for the user to type in the # directly).
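
A sketch of what per-setting ranges might look like and how typed-in values could be kept within them. The setting names, the range values, and the clamp_setting helper are all hypothetical; the real Catalyst seed data format may differ.

```ruby
# Hypothetical seed-data entry: each numeric setting declares its own range.
SETTING_RANGES = {
  "similarity_threshold" => { "min" => 0.0, "max" => 1.0, "step" => 0.05 },
  "max_results"          => { "min" => 1,   "max" => 500, "step" => 1 }
}.freeze

# Clamp a user-typed value into the declared range for the setting.
def clamp_setting(name, value)
  range = SETTING_RANGES.fetch(name)
  [[value, range["min"]].max, range["max"]].min
end
```

The same min, max, and step values could drive the slider, so the slider and the typed input share one definition of what is valid.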

Delete Documents

Ensure delete document button works to remove document from Elasticsearch.

Deprecation warnings

The following deprecation warnings are filling up the log and should be addressed at some point:

DEPRECATION WARNING: Calling URL helpers with string keys controller, action is deprecated. Use symbols instead. (called from prepNewPath at /var/www/sii/LookingGlass/app/helpers/category_link.rb:27)
DEPRECATION WARNING: Calling URL helpers with string keys action, controller is deprecated. Use symbols instead. (called from selected at /var/www/sii/LookingGlass/app/helpers/category_link.rb:38)
DEPRECATION WARNING: Calling URL helpers with string keys action, controller is deprecated. Use symbols instead. (called from getRemoveLink at /var/www/sii/LookingGlass/app/helpers/searched_format.rb:112)
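
One possible fix is to symbolize the option keys before they reach the URL helper. The sketch below is plain Ruby with made-up option values; the symbolize_url_options helper is hypothetical, and the code in category_link.rb and searched_format.rb may build its options differently.

```ruby
# Convert string keys to symbols before handing options to a URL helper.
# Uses map + to_h so it also runs on Rubies older than 2.5.
def symbolize_url_options(options)
  options.map { |key, value| [key.to_sym, value] }.to_h
end

# "docs" and "index" are placeholder values.
opts = symbolize_url_options("controller" => "docs", "action" => "index")
```

Within Rails, ActiveSupport's Hash#symbolize_keys does the same job for plain hashes.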
