I-analyzer

"The great text mining tool that obviates all others." — Julian Gonggrijp

I-analyzer is a web application for exploring corpora (large collections of texts). You can use I-analyzer to find relevant documents, or to make visualisations to understand broader trends in the corpus. The interface is designed to be accessible for users of all skill levels.

I-analyzer is primarily intended for academic research and higher education. We focus on data that is relevant for the humanities, but we are open to datasets that are relevant for other fields.

Contents

This repository contains the source code for the I-analyzer web application, which consists of a Django backend and Angular frontend.

For corpora included in I-analyzer, the backend includes a definition file that specifies how to read the source files, and how this data should be structured and presented in I-analyzer. This repository does not include the source data itself, beyond a few sample files for testing.

Usage

If you are interested in using I-analyzer, the most straightforward way to get started is to make an account at ianalyzer.hum.uu.nl. This server is maintained by the Research Software Lab and contains corpora focused on a variety of fields. We also maintain more specialised collections at PEACE portal and People & Parliament (not publicly accessible).

I-analyzer does not have an "upload data" option (yet!). If you are interested in using I-analyzer as a way to publish your dataset, or to make it easier to search and analyse, you can go about this two ways:

  • Contact us (see below for details) about hosting your dataset on one of our existing servers, or hosting a new server for your project.
  • Self-host I-analyzer. This would allow you to maintain full control over the data and who can access it. I-analyzer is open source software, so you are free to host it yourself, either as-is or with your own modifications. However, feel free to contact us with any questions or issues.

Development

The documentation directory contains documentation for developers. This includes installation instructions to set up an I-analyzer server.

Licence

I-analyzer is shared under an MIT licence. See LICENSE for more information.

Citation

If you wish to cite this repository, please use the metadata provided in our CITATION.cff file.

If you wish to cite material that you accessed through I-analyzer, or you are not sure if you should also be citing this repository, please refer to the citation instructions in the user manual.

Contact

For questions, small feature suggestions, and bug reports, feel free to create an issue. If you don't have a Github account, you can also contact the Centre for Digital Humanities.

If you want to add a new corpus to I-analyzer, or have an idea for a project, please contact the Centre for Digital Humanities rather than making an issue, so we can discuss the possibilities with you.

I-analyzer's Issues

Separate the frontend from the backend

So they can potentially run on different servers. Depends on #3 and #7.

Edit: this includes multiple places where HTML is generated at the backend; at least the admin templates (invoked from ianalyzer.view) and the filter implementations in ianalyzer.filters.

Separate the corpus definitions from the application

We don't want to build up a giant collection of unused corpus definitions within this repository. Instead, the application should only define the abstraction, and a concrete corpus definition should be attached to an application by means of local configuration only (similar to Flask and Django configuration modules). This means that the existing UUDigitalHumanitieslab/timestextminer corpus will be reduced to the existing corpus definition for the Times, which in turn will disappear from this repository. The yet to be written corpus definition for the Dutch banking corpus (#1) will move to a new repository.
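A configuration-driven attachment of corpus definitions could be sketched as follows. This is a hypothetical mechanism, not the actual implementation; the `CORPORA` setting name and `load_corpus_definition` helper are illustrative assumptions.

```python
import importlib

def load_corpus_definition(dotted_path):
    """Resolve a 'module.ClassName' string from local configuration
    into the corpus definition class it names. The corpus module can
    then live in its own repository, outside the application."""
    module_path, class_name = dotted_path.rsplit('.', 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# In a local config module one might write:
#   CORPORA = {'times': 'times_corpus.TimesCorpus'}
# Here we resolve a stdlib class purely to demonstrate the mechanism:
cls = load_corpus_definition('collections.OrderedDict')
```

This mirrors how Flask and Django resolve dotted paths from settings, which is the pattern the issue refers to.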

Edit: depends on #13.

Wondering whether the simple query string query is set to search in all fields or only in the content field.

Default Field

When the query string syntax does not explicitly specify a field to search on, the index.query.default_field setting is used to derive which field to search on. It defaults to the _all field.

If the _all field is disabled and no fields are specified in the request, the simple_query_string query will automatically attempt to determine the existing fields in the index's mapping that are queryable, and perform the search on those fields.
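One way to make the behaviour explicit is to always pass a `fields` list in the query body, so the result no longer depends on `index.query.default_field` or `_all`. The field name `content` below is an assumption about the index mapping, not confirmed by the source:

```python
# Sketch of a simple_query_string body pinned to an explicit field;
# without "fields", Elasticsearch falls back to index.query.default_field.
query = {
    "query": {
        "simple_query_string": {
            "query": "bank + crisis",
            "fields": ["content"],  # hypothetical field name
        }
    }
}
```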

Enhance range picker

From-to in range pickers is currently depicted as from:to, which is not intuitive to most people; from-to might be better.
Also, a slider with low-high is not suited for all clients.
Some would prefer two fillable boxes denoting From and To.

CJK support

Because those languages (almost) never use spaces, they should be handled differently from languages that do. For instance, the highlighter (#40) splits on word boundaries. I'm sure there are also plenty of other interesting things that will explode on impact.

Export corpus definitions as JSON

There is a TODO screaming in ianalyzer.corpora.common.Corpus.json. The intention is that the Python-based corpus definition will remain the single source of truth about corpus structure, but other parts of the application (frontend, special services…) will benefit from this same definition through the JSON interface.

Use migrations

Currently, the application does not have any migrations in place.

Filtering using a range with large numbers is troublesome at the lower end of the range

Using an arbitrarily large number to set the upper bound of the range puts the useful values in only a small piece at the left side of the slider, which is not user-friendly.

Potential solutions:

  • Create a log-based slider such that smaller values occupy more of the slider
  • Preprocess the data to find a reasonable upper bound within some margin of the maximum value
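The log-based slider from the first bullet could be sketched as a mapping between a linear track position in [0, 1] and a value range [lo, hi]. The function names are illustrative, not from the codebase:

```python
import math

def slider_to_value(position, lo, hi):
    """Map a linear slider position in [0, 1] to a value in [lo, hi]
    on a logarithmic scale, so small values get more of the track."""
    return lo * (hi / lo) ** position

def value_to_slider(value, lo, hi):
    """Inverse mapping: where on the track a given value sits."""
    return math.log(value / lo) / math.log(hi / lo)

# With a range of 1..1,000,000, half the track now covers 1..1000
# instead of 1..500,000:
mid = slider_to_value(0.5, 1, 1_000_000)  # -> 1000.0
```

Note this assumes a strictly positive lower bound; a range starting at zero would need an offset before taking the logarithm.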

Remove dead code bodies in api

There are parts of the Python/Flask backend code which are no longer used (elasticsearch (other than indexing), interface). A cleanup to avoid unnecessary imports would be desirable.

Define common testing module (properties)

Currently, different tests need to create these from scratch. Some of this could be shared or turned into a helper. So instead of:

TestBed.configureTestingModule({
    imports: [RouterTestingModule.withRoutes([])],
    providers: [
        SearchService,
        { provide: ApiService, useValue: new ApiServiceMock() },
        { provide: ElasticSearchService, useValue: new ElasticSearchServiceMock() },
        LogService,
        QueryService,
        UserService,
        SessionService
    ]
});

It would be better if something like the following could be done:

TestBed.configureTestingModule(CommonTestingModule.createDef(/* settings */))

Help pages

Conforms to the mockup. Provides context-sensitive information (e.g., corpus selection, search). Ideally, this would also provide corpus-specific info, e.g. what filters can be used, some example queries, etc.
On the left, show a menu to switch between different help topics.

Redesign search history

The user can get to their previous searches by clicking on their name at the top of pages. A mockup of this is still to be designed (out of space for the free Moqups version).
Estimated time: 9.5 hours

Secure access

It should be possible to secure access to the index (a username/password for the Elasticsearch DB is currently simply sent to the client as part of #9), and to allow limiting access to the search interface.

What happens if you leave the password of a new user blank?

The user creation form in the admin panel permits you to leave the password field blank. This is an undocumented feature; it is not clear what happens if you do so. Specifically, will the user be able to log in or not? This needs to be addressed.

Add more unittests

Currently there is one unit test, which means that, in theory, we have a unit test suite. We need to add more tests so that we have something to fall back on when things break. This will be subject to available time.

Redesign the frontend

Currently it depends on Flask-Admin; we need to eliminate this dependency. It would be fine to keep using Bootstrap. In fact, it may be possible to embrace it more and perhaps use Sass as well.

Use LDAP at the application level

The current installation at dutchbanking.hum.uu.nl requires two logins: one with a solis-ID over LDAP at the webserver level, and one with a custom account at the application level. This is annoying for users. We need to change this to a single login with LDAP at the application level, with a fallback option for users without a solis-ID, following the example in Microcontact.

Fix admin module

Currently, if you add a new user, it is not possible to edit that user properly later (i.e., change their download limit): the user field will be empty when editing. Potentially, prefill fields such as the download limit with their default values when creating a user. See also #37, according to which the password field can even be left empty.

Search results preview is largely blank

When doing a search, the user gets a preview of five hits. This preview contains the right fields, but the fields do not always contain data, even though those fields are populated and present in the downloaded CSV file. In the case of the current DutchBanking index, all fields except for the document id are previewed as blank, even though all fields always contain data in the downloaded hits.

While downloading full search results still works, this does not make a good impression on the user.

Improve support for right-to-left languages

Currently, only absolutely minimal support is in place in the search highlighter. I couldn't get the test to work with wildcards or quotation marks, because mixing those symbols causes confusion, at the very least for the developer. This could definitely be improved, maybe by using native symbols from those writing systems or by adding some mode support for switching direction.

Add Elasticsearch settings to User Manual

I wrote a user manual for the simple query string query language that is used for searching. However, I am not sure how certain top-level parameters that influence usage/search results are set, e.g. analyze_wildcard, lenient, minimum_should_match, default_field, etc.
A list can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-simple-query-string-query.html
It would be nice if this could be documented so that I can use it to enhance the user manual.
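For reference, these parameters sit at the top level of the simple_query_string body. The values below are purely illustrative, not what I-analyzer actually configures:

```python
# Hypothetical query body showing where the parameters in question live.
# Values are examples only; I-analyzer's actual settings are the open
# question this issue asks about.
query = {
    "query": {
        "simple_query_string": {
            "query": "bank*",
            "analyze_wildcard": False,   # example: wildcard terms not analyzed
            "lenient": False,            # example: data-type mismatches raise
            "minimum_should_match": "1", # example: at least one clause matches
        }
    }
}
```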

Add CSRF token to API and web UI

This could be done by assigning one at logon. Set and retain it in the ApiService and always add this property to the JSON data before sending.
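The retain-and-merge idea could look like this. The actual ApiService is TypeScript; this is a language-neutral sketch in Python, and the `_csrf` property name is a made-up placeholder:

```python
# Minimal sketch: keep the token handed out at logon and merge it
# into every outgoing JSON payload before sending.
class CsrfSession:
    def __init__(self):
        self.csrf_token = None

    def on_logon(self, token):
        # Token assigned by the server at logon (per the issue text).
        self.csrf_token = token

    def prepare_payload(self, data):
        """Return the JSON-bound dict with the CSRF token attached."""
        return {**data, "_csrf": self.csrf_token}

session = CsrfSession()
session.on_logon("abc123")
payload = session.prepare_payload({"query": "bank"})
```

In practice the token is often sent as a header rather than a body property; either works as long as client and server agree.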

Corpus serialization has side-effects

When trying to serialize the corpus representation more than once, an error is thrown. This can be reproduced in a shell from the (back-end) application root:

> from ianalyzer.corpora.dutchbanking import DutchBanking
> d = DutchBanking()
> print(d.json())

This goes well; now try to serialize again:

> d.json()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\Users\Sheean\src\I-analyzer\ianalyzer\corpora\common.py", line 136, in json
    search_dict = field_dict['search_filter'].__dict__
AttributeError: 'dict' object has no attribute '__dict__'
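A guess at the mechanism, based on the traceback: the first json() call replaces the filter object inside the shared field dict with its __dict__, so the second call finds a plain dict and the __dict__ lookup fails. The class and function names below are illustrative, not the actual ianalyzer code:

```python
class Filter:
    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

# Buggy pattern: serialisation mutates the shared field dict in place.
def json_buggy(field_dict):
    field_dict['search_filter'] = field_dict['search_filter'].__dict__
    return field_dict

# Side-effect-free version: serialise into a fresh structure instead.
def json_fixed(field_dict):
    result = dict(field_dict)
    result['search_filter'] = vars(field_dict['search_filter'])
    return result

fields = {'search_filter': Filter(0, 100)}
json_fixed(fields)
second = json_fixed(fields)   # safe to call repeatedly
```

If this is indeed the cause, the fix in Corpus.json is to build the serialised dict from a copy rather than overwriting the field definition.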
