I-analyzer

"The great text mining tool that obviates all others." — Julian Gonggrijp

I-analyzer is a web application for exploring corpora (large collections of texts). You can use I-analyzer to find relevant documents, or to make visualisations to understand broader trends in the corpus. The interface is designed to be accessible for users of all skill levels.

I-analyzer is primarily intended for academic research and higher education. We focus on data that is relevant for the humanities, but we are open to datasets that are relevant for other fields.

Contents

This repository contains the source code for the I-analyzer web application, which consists of a Django backend and Angular frontend.

For corpora included in I-analyzer, the backend includes a definition file that specifies how to read the source files, and how this data should be structured and presented in I-analyzer. This repository does not include the source data itself, beyond a few sample files for testing.

Usage

If you are interested in using I-analyzer, the most straightforward way to get started is to make an account at ianalyzer.hum.uu.nl. This server is maintained by the Research Software Lab and contains corpora focused on a variety of fields. We also maintain more specialised collections at PEACE portal and People & Parliament (not publicly accessible).

I-analyzer does not have an "upload data" option (yet!). If you are interested in using I-analyzer as a way to publish your dataset, or to make it easier to search and analyse, you can go about this two ways:

  • Contact us (see below for details) about hosting your dataset on one of our existing servers, or hosting a new server for your project.
  • Self-host I-analyzer. This would allow you to maintain full control over the data and who can access it. I-analyzer is open source software, so you are free to host it yourself, either as-is or with your own modifications. However, feel free to contact us with any questions or issues.

Development

The documentation directory contains documentation for developers. This includes installation instructions to set up an I-analyzer server.

Licence

I-analyzer is shared under an MIT licence. See LICENSE for more information.

Citation

If you wish to cite this repository, please use the metadata provided in our CITATION.cff file.

If you wish to cite material that you accessed through I-analyzer, or you are not sure if you should also be citing this repository, please refer to the citation instructions in the user manual.

Contact

For questions, small feature suggestions, and bug reports, feel free to create an issue. If you don't have a Github account, you can also contact the Centre for Digital Humanities.

If you want to add a new corpus to I-analyzer, or have an idea for a project, please contact the Centre for Digital Humanities rather than making an issue, so we can discuss the possibilities with you.

I-analyzer's Issues

Separate the frontend from the backend

So they can potentially run on different servers. Depends on #3 and #7.

Edit: this includes multiple places where HTML is generated at the backend; at least the admin templates (invoked from ianalyzer.view) and the filter implementations in ianalyzer.filters.

Separate the corpus definitions from the application

We don't want to build up a giant collection of unused corpus definitions within this repository. Instead, the application should only define the abstraction, and a concrete corpus definition should be attached to an application by means of local configuration only (similar to Flask and Django configuration modules). This means that the existing UUDigitalHumanitieslab/timestextminer corpus will be reduced to the existing corpus definition for the Times, which in turn will disappear from this repository. The yet to be written corpus definition for the Dutch banking corpus (#1) will move to a new repository.
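A configuration-driven attachment of corpus definitions could be sketched as follows. This is a hypothetical mechanism, not the actual implementation; the `CORPORA` setting name and `load_corpus_definition` helper are illustrative assumptions.

```python
import importlib

def load_corpus_definition(dotted_path):
    """Resolve a 'module.ClassName' string from local configuration
    into the corpus definition class it names. The corpus module can
    then live in its own repository, outside the application."""
    module_path, class_name = dotted_path.rsplit('.', 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# In a local config module one might write:
#   CORPORA = {'times': 'times_corpus.TimesCorpus'}
# Here we resolve a stdlib class purely to demonstrate the mechanism:
cls = load_corpus_definition('collections.OrderedDict')
```

This mirrors how Flask and Django resolve dotted paths from settings, which is the pattern the issue refers to.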

Edit: depends on #13.

Wondering whether the simple query string query is set to search in all fields or only in the content field.

Default Field

When the query string syntax does not explicitly specify a field to search on, the index.query.default_field setting is used to derive which field to search on. It defaults to the _all field.

If the _all field is disabled and no fields are specified in the request, the simple_query_string query will automatically attempt to determine the existing fields in the index's mapping that are queryable, and perform the search on those fields.
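One way to make the behaviour explicit is to always pass a `fields` list in the query body, so the result no longer depends on `index.query.default_field` or `_all`. The field name `content` below is an assumption about the index mapping, not confirmed by the source:

```python
# Sketch of a simple_query_string body pinned to an explicit field;
# without "fields", Elasticsearch falls back to index.query.default_field.
query = {
    "query": {
        "simple_query_string": {
            "query": "bank + crisis",
            "fields": ["content"],  # hypothetical field name
        }
    }
}
```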

Enhance range picker

From-to in range pickers is currently depicted as from:to, which is not intuitive to most people; from-to might be better.
Also, a slider with low-high is not suited for all clients.
Some would prefer two fillable boxes denoting From and To.

CJK support

Because those languages (almost) never use spaces, they should be handled differently from languages that do. For instance, the highlighter (#40) splits on word boundaries. I'm sure there are also plenty of other interesting things that will explode on impact.

Export corpus definitions as JSON

There is a TODO screaming in ianalyzer.corpora.common.Corpus.json. The intention is that the Python-based corpus definition will remain the single source of truth about corpus structure, but other parts of the application (frontend, special services…) will benefit from this same definition through the JSON interface.

Use migrations

Currently, the application does not have any migrations in place.

Filtering using a range with large numbers is troublesome at the lower end of the range

Using an arbitrarily large number to set the upper bound of the range puts the useful values in only a small piece at the left side of the slider, which is not user-friendly.

Potential solutions:

  • Create a log-based slider such that smaller values occupy more of the slider
  • Preprocess the data to find a reasonable upper bound within some margin of the maximum value
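The log-based slider from the first bullet could be sketched as a mapping between a linear track position in [0, 1] and a value range [lo, hi]. The function names are illustrative, not from the codebase:

```python
import math

def slider_to_value(position, lo, hi):
    """Map a linear slider position in [0, 1] to a value in [lo, hi]
    on a logarithmic scale, so small values get more of the track."""
    return lo * (hi / lo) ** position

def value_to_slider(value, lo, hi):
    """Inverse mapping: where on the track a given value sits."""
    return math.log(value / lo) / math.log(hi / lo)

# With a range of 1..1,000,000, half the track now covers 1..1000
# instead of 1..500,000:
mid = slider_to_value(0.5, 1, 1_000_000)  # -> 1000.0
```

Note this assumes a strictly positive lower bound; a range starting at zero would need an offset before taking the logarithm.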

Remove dead code bodies in api

There are parts of the Python/Flask backend code which are no longer used (elasticsearch (other than indexing), interface). A cleanup to avoid unnecessary imports would be desirable.

Define common testing module (properties)

Currently, different tests need to create these from scratch. Some of this could be shared or turned into a helper. So instead of:

TestBed.configureTestingModule({
    imports: [RouterTestingModule.withRoutes([])],
    providers: [
        SearchService,
        { provide: ApiService, useValue: new ApiServiceMock() },
        { provide: ElasticSearchService, useValue: new ElasticSearchServiceMock() },
        LogService,
        QueryService,
        UserService,
        SessionService
    ]
});

It would be better if something like the following could be done:

TestBed.configureTestingModule(CommonTestingModule.createDef(/* settings */))

Help pages

Conforms to the mockup. Provides context-sensitive information (e.g., corpus selection, search). Ideally, this would also provide corpus-specific info, e.g. what filters can be used, some example queries, etc.
On the left, show a menu to switch between different help topics.

Redesign search history

The user can get to their previous searches by clicking on their name at the top of pages. A mockup of this is still to be designed (out of space for the free Moqups version).
Estimated time: 9.5 hours

Secure access

It should be possible to secure access to the index (a username/password for the Elasticsearch DB is currently simply sent to the client as part of #9), and to allow limiting access to the search interface.

What happens if you leave the password of a new user blank?

The user creation form in the admin panel permits you to leave the password field blank. This is an undocumented feature; it is not clear what happens if you do so. Specifically, will the user be able to log in or not? This needs to be addressed.

Add more unittests

Currently there is one unit test, which means that, in theory, we have a unit test suite. We need to add more tests so that we have something to fall back on when things break. This will be subject to available time.

Redesign the frontend

Currently it depends on Flask-Admin; we need to eliminate this dependency. It would be fine to keep using Bootstrap. In fact, it may be possible to embrace it more and perhaps use Sass as well.

Use LDAP at the application level

The current installation at dutchbanking.hum.uu.nl requires two logins: one with a solis-ID over LDAP at the webserver level, and one with a custom account at the application level. This is annoying for users. We need to change this to a single login with LDAP at the application level, with a fallback option for users without a solis-ID, following the example in Microcontact.

Fix admin module

Currently, if you add a new user, it is not possible to edit that user properly later (i.e., change their download limit): the user field will be empty when editing. Potentially, prefill fields such as the download limit with their default values when creating a user. See also #37, according to which the password field can even be left empty.

Search results preview is largely blank

When doing a search, the user gets a preview of five hits. This preview contains the right fields, but the fields do not always contain data, even though those fields are populated and present in the downloaded CSV file. In the case of the current DutchBanking index, all fields except for the document id are previewed as blank, even though all fields always contain data in the downloaded hits.

While downloading full search results still works, this does not make a good impression on the user.

Improve support for right-to-left languages

Currently, only absolutely minimal support is in place in the search highlighter. I couldn't get the test to work with wildcards or quotation marks, because mixing those symbols causes confusion, at the very least for the developer. This could definitely be improved, maybe by using native symbols from those writing systems or by adding some mode support for switching direction.

Add Elasticsearch settings to User Manual

I wrote a user manual for the simple query string query language that is used for searching. However, I am not sure how certain top-level parameters that influence usage/search results are set, e.g. analyze_wildcard, lenient, minimum_should_match, default_field, etc.
A list can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-simple-query-string-query.html
It would be nice if this could be documented so that I can use it to enhance the user manual.
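For reference, these parameters sit at the top level of the simple_query_string body. The values below are purely illustrative, not what I-analyzer actually configures:

```python
# Hypothetical query body showing where the parameters in question live.
# Values are examples only; I-analyzer's actual settings are the open
# question this issue asks about.
query = {
    "query": {
        "simple_query_string": {
            "query": "bank*",
            "analyze_wildcard": False,   # example: wildcard terms not analyzed
            "lenient": False,            # example: data-type mismatches raise
            "minimum_should_match": "1", # example: at least one clause matches
        }
    }
}
```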

Add CSRF token to API and web UI

This could be done by assigning one at logon. Set and retain it in the ApiService and always add this property to the JSON data before sending.
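The retain-and-merge idea could look like this. The actual ApiService is TypeScript; this is a language-neutral sketch in Python, and the `_csrf` property name is a made-up placeholder:

```python
# Minimal sketch: keep the token handed out at logon and merge it
# into every outgoing JSON payload before sending.
class CsrfSession:
    def __init__(self):
        self.csrf_token = None

    def on_logon(self, token):
        # Token assigned by the server at logon (per the issue text).
        self.csrf_token = token

    def prepare_payload(self, data):
        """Return the JSON-bound dict with the CSRF token attached."""
        return {**data, "_csrf": self.csrf_token}

session = CsrfSession()
session.on_logon("abc123")
payload = session.prepare_payload({"query": "bank"})
```

In practice the token is often sent as a header rather than a body property; either works as long as client and server agree.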

Corpus serialization has side-effects

When trying to serialize the corpus representation more than once, an error is thrown. This can be reproduced in a shell from the (back-end) application root:

> from ianalyzer.corpora.dutchbanking import DutchBanking
> d = DutchBanking()
> print(d.json())

This goes well; now try to serialize again:

> d.json()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\Users\Sheean\src\I-analyzer\ianalyzer\corpora\common.py", line 136, in json
    search_dict = field_dict['search_filter'].__dict__
AttributeError: 'dict' object has no attribute '__dict__'
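A guess at the mechanism, based on the traceback: the first json() call replaces the filter object inside the shared field dict with its __dict__, so the second call finds a plain dict and the __dict__ lookup fails. The class and function names below are illustrative, not the actual ianalyzer code:

```python
class Filter:
    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

# Buggy pattern: serialisation mutates the shared field dict in place.
def json_buggy(field_dict):
    field_dict['search_filter'] = field_dict['search_filter'].__dict__
    return field_dict

# Side-effect-free version: serialise into a fresh structure instead.
def json_fixed(field_dict):
    result = dict(field_dict)
    result['search_filter'] = vars(field_dict['search_filter'])
    return result

fields = {'search_filter': Filter(0, 100)}
json_fixed(fields)
second = json_fixed(fields)   # safe to call repeatedly
```

If this is indeed the cause, the fix in Corpus.json is to build the serialised dict from a copy rather than overwriting the field definition.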
