
geo-question-parser's Issues

Eliminate grammar parser/blockly interface overlap

The pressing issues with this part of the pipeline concern robustness, scalability and testing. For the final product, we need a lot of simplification. To organize and document the development, I will track it in the issue tracker here.

Currently, if I understand correctly, the procedure can be roughly sketched as follows (a type-level sketch of the intermediate artifacts follows the list). I will edit as I go along; please comment if I am mistaken.

  1. The question is cleaned. nltk is used to detect and strip adjectives like 'nearest', so that the important nouns can be isolated and recognized in subsequent steps.
  2. Important words in the questions are annotated.
    1. Recognize concepts, amounts and proportions via a pre-defined concept dictionary.
    2. Recognize place names via ELMo-based named entity recognition (NER) from allennlp.
    3. Recognize times, quantities and dates via NER from spaCy.
  3. Extract functional roles based on syntactic structures and connective words, via a grammar implemented in ANTLR. This yields a parse tree.
  4. Convert parse trees into transformations between concept types.
    1. Find input and output concept types by matching questions to categories that are associated with concept transformation models.
    2. The order of concept types is derived from the function of each phrase in which they occur: a subcondition is calculated before its condition, etcetera. A table with the order of each functional part is generated, and these are then combined in a rule-based way (see Algorithm 1 in the paper).
  5. Transform concept types into CCT types via manually constructed rules, based on the concepts/extents/transformations found in previous steps.
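
For reference, the intermediate artifacts of this pipeline can be sketched as types --- a rough sketch only; the names below are mine, not taken from the actual code:

```typescript
// Rough sketch of the pipeline's intermediate artifacts. All names are
// illustrative and do not correspond to identifiers in the actual code.
type Annotation =
  | { kind: 'concept' | 'amount' | 'proportion'; text: string }  // step 2.1
  | { kind: 'place'; text: string }                              // step 2.2
  | { kind: 'time' | 'quantity' | 'date'; text: string };        // step 2.3

interface ParseNode {            // step 3: node in the ANTLR parse tree
  role: string;                  // functional role, e.g. 'condition'
  annotations: Annotation[];
  children: ParseNode[];
}

interface ConceptStep {          // step 4: ordered concept transformation
  inputs: string[];              // input concept types
  output: string;                // output concept type
  order: number;                 // derived per Algorithm 1
}

type CctType = string;           // step 5: CCT type, via manual rules
```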

The issue is that this is rather fragile; it depends (among other things) on:

  • All concepts and entities being annotated properly.
  • Having a complete rule set for converting concept types into CCT types.

We have chosen Blockly to constrain the natural language at the user end, in such a way that the questions that may be presented to the parser are questions that the parser can handle. However, this merely formats the question to reduce the problems of an otherwise unchanged natural language processing pipeline. As discussed in the meeting and elsewhere:

  1. Given that we already know the type of each entity when constructing a query via Blockly instead of freeform text, we will no longer need named entity recognition or question cleaning. This would strip out the nltk, spaCy and allennlp packages, tremendously simplifying the process.
  2. To guarantee robustness, the visual blocks need to be in perfect accordance with the parser. For this, they should be automatically constructed from one common source of truth.
  3. In fact, given that the Blockly-constructed query can output something other than what's written on the blocks, we might even forego the natural language parser completely, in favour of JSON output at the Blockly level (or another format that is easily parsed); see the sketch after this list. This would eliminate even the ANTLR parser, further reducing complexity. The downside is that we would no longer be able to parse freeform text (though that would be impacted by the removal of named entity recognition anyway). We could describe the output with JSON Schema to really pin it down.
  4. To make sure that no regressions are introduced, we should have expected output for every step (that is, not just expected output from the whole pipeline).
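
To make point 3 concrete: for a question like 'which neighbourhoods contain fewer than two parks', the Blockly-level output might look as follows. The shape and field names are hypothetical; the point is only that such output is trivially parseable and can be pinned down with JSON Schema:

```typescript
// Hypothetical Blockly output for one question. Field names and nesting
// are placeholders; only the general idea matters here.
const question = {
  ask: 'which',
  subject: { concept: 'neighbourhood' },
  condition: {
    relation: 'contain',
    object: { concept: 'park' },
    comparison: { operator: '<', amount: 2 },
  },
};

// An equally hypothetical JSON Schema fragment that pins the shape down:
const questionSchema = {
  type: 'object',
  required: ['ask', 'subject'],
  properties: {
    ask: { enum: ['which', 'what', 'how many'] },
    subject: { type: 'object' },
    condition: { type: 'object' },
  },
} as const;
```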

This would make this repository not so much a geo-question-parser as a geo-question-formulator. That is a good thing, because the current code is very complex and closely fitted to the specific questions in the original corpus, which isn't acceptable in a situation where users can pose their own questions.

Note: If we simplify to this extent, it might be nice to use rdflib.js to output a transformation graph directly, but that is for later.
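
For a taste of what that could look like (the vocabulary URIs and node names below are invented for illustration):

```typescript
import * as $rdf from 'rdflib';

// Emit a toy transformation graph directly with rdflib.js. The
// vocabulary and node names are made up; this is only a sketch.
const TF = $rdf.Namespace('https://example.org/transformation#');
const store = $rdf.graph();
store.add(TF('step1'), TF('output'), TF('ObjectAmount'));
store.add(TF('step2'), TF('input'), TF('ObjectAmount'));
```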

The process would thus become (sketched in code after the list):

  1. In blockly, construct JSON that represents a question.
  2. Convert that question into transformations between concept types.
    1. Find input and output concept types by matching questions to transformation categories.
    2. Find concept type ordering.
  3. Transform concept types into CCT types via rules.
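
In code, the whole thing would collapse to a composition of two steps (all signatures illustrative):

```typescript
// The simplified pipeline as function signatures; names are illustrative.
type Question = unknown;       // step 1: JSON constructed in Blockly
type ConceptGraph = unknown;   // step 2: transformations between concept types
type CctGraph = unknown;       // step 3: the same graph, with CCT types

declare function toConceptTransformations(q: Question): ConceptGraph;
declare function toCctTypes(g: ConceptGraph): CctGraph;

const parse = (q: Question): CctGraph =>
  toCctTypes(toConceptTransformations(q));
```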

I'm not sure to what extent we can simplify step 2 further. Depending on how much code is left, it would be nice to port or rewrite it in JavaScript, alongside Blockly, so that we can visualize most things client-side and with minimal moving parts.

Set up testing

Dependencies: #3

At the moment, there are no automated tests for this part of the pipeline. Any improvements made to one part might cause a regression elsewhere (see also #1).

We will need tests at these levels:

  1. A tool that builds blocks from a natural language question, to test whether all corpus questions can still be built as blocks.
  2. Tests to check whether functional roles are correctly extracted from blocks.
  3. Tests to check whether transformation graphs with core concepts are correctly generated from functional roles.
  4. Tests to check whether transformation graphs with CCT types are correctly identified from transformation graphs with core concepts.

I'm not yet familiar with JavaScript's testing ecosystem. Updates will be tracked here.
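
As a placeholder, a level-2 test might look something like the sketch below, using Node's built-in test runner (just one candidate framework; `extractRoles` and the fixture files are hypothetical):

```typescript
import test from 'node:test';
import assert from 'node:assert/strict';
// `extractRoles` and the fixture files are hypothetical; they stand for
// "blocks in, functional roles out" at level 2 of the list above.
import { extractRoles } from '../src/roles';
import blocks from './fixtures/question-001.blocks.json';
import expected from './fixtures/question-001.roles.json';

test('functional roles are extracted from block output', () => {
  assert.deepEqual(extractRoles(blocks), expected);
});
```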

Extract functional roles from question formulator output

As discussed in #1, the issues of constructing a question and of extracting functional roles should be separated. For this, we need to figure out how to replace or adapt the ANTLR parser for the recognition of functional roles.

Issue #5 discusses changing the output of blocks to simplify this step. The issue you're reading now is about taking that output and actually producing the functional roles/transformations.

Ideally, the information needed both to show question blocks and to extract functional roles from their output would be declared in a single grammar file. This would mean that phrases and their functional roles are kept in a single place, and that the procedural code for generating blocks and extracting transformations stays separate from the declarative code of the grammar. Those who edit the grammar could then focus on the important bits. Whether this is feasible remains to be seen; I will need a better understanding of the blocks & parser.
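
As a thought experiment, a single entry in such a grammar file might look like this (everything below is hypothetical):

```typescript
// Hypothetical single-source-of-truth entry: one declarative record
// drives both the generated Blockly block and the role extraction.
interface GrammarEntry {
  id: string;
  phrase: string;                                 // surface text on the block
  role: string;                                   // functional role it contributes
  slots: { name: string; accepts: string[] }[];   // typed holes in the phrase
}

const entries: GrammarEntry[] = [{
  id: 'count_condition',
  phrase: 'with %amount or more %concept',
  role: 'condition',
  slots: [
    { name: 'amount', accepts: ['number'] },
    { name: 'concept', accepts: ['object'] },
  ],
}];
// Procedural code elsewhere would turn `entries` into block definitions
// on the one hand and into a role extractor on the other.
```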

Determine and produce structured question formulator output

As mentioned in #1, we need to get rid of the overlap between the grammar and the Blockly interface.

For this, the blocks should generate not another semi-natural-language question that has to pass through the whole parsing pipeline again, but rather a structure from which functional roles can be derived directly.

For example, the machine learning libraries that were used to recognize entities should be removed in favour of constraining the blocks in such a way that we already know the entity types. This will speed up the process and simplify the environment.

What other information should the blocks' output convey so that functional roles can be recognized? The answer should inform the details of the structure.
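
One hypothetical answer, to start the discussion: every component of the output could carry its entity type (already known from the block) together with its functional role:

```typescript
// Hypothetical structure: entity types come from the blocks themselves,
// so no recognition step is needed downstream.
type Role = 'subject' | 'condition' | 'subcondition' | 'measure';

interface Component {
  role: Role;
  entityType: 'object' | 'field' | 'event' | 'network';  // known a priori
  concept: string;            // e.g. 'park', from the block's dropdown
  children?: Component[];
}
```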

Once we have a better idea of what that structure should look like, we can generate it from blocks; see Blockly's documentation on code generation for more information.

Dynamic blocks for irrelevant grammatical variants

This is a low-priority issue that ties into issue #7. We can use mutators to make dynamic blocks, which would save users from having to explicitly choose between irrelevant grammatical variants.

We could automatically adapt 'fewer than' to 'less than', or automatically add connectives like 'and/or' when stacking relationships.
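
A minimal mutator sketch, assuming the modern Blockly mutator API (`saveExtraState`/`loadExtraState`) and hypothetical block and input names --- it tracks how many clauses are stacked and inserts the connective as a fixed label, so the user never picks it:

```typescript
import * as Blockly from 'blockly';

// A minimal sketch, assuming the saveExtraState/loadExtraState mutator
// API; the mutator and input names are hypothetical.
const RELATION_STACK_MUTATOR = {
  clauseCount_: 1,

  saveExtraState: function (this: any) {
    return { clauseCount: this.clauseCount_ };
  },

  loadExtraState: function (this: any, state: any) {
    this.clauseCount_ = state.clauseCount;
    this.updateShape_();
  },

  // Add one value input per clause; the connective appears as a label,
  // not as something the user has to choose.
  updateShape_: function (this: any) {
    for (let i = 0; i < this.clauseCount_; i++) {
      if (!this.getInput('CLAUSE' + i)) {
        const input = this.appendValueInput('CLAUSE' + i);
        if (i > 0) input.appendField('and');
      }
    }
  },
};

Blockly.Extensions.registerMutator('relation_stack_mutator', RELATION_STACK_MUTATOR);
```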

Merge Blockly interface with this repository

A Blockly interface is used to constrain natural language to a form that can be handled by Haiqi's grammar. There is presently an overlap between the two, which is one of the reasons that maintenance is hard. More information at #1.

For this reason, the interface should be built from a common source, and thus the copy of the interface at https://github.com/HaiqiXu/haiqixu.github.io or https://github.com/quangis/quangis-web should be grafted into this repository.

Structure blocks according to grammar

We claim to be able to extract a lot of information from a question. However, a block whose text field can be left empty, a text field that occurs on its own without contextual information to constrain it, or multiple variants of a block that differ only syntactically: all of these indicate to me that we haven't pinned down exactly what information a block contains, and that we're handwaving away the extraction of that information by pointing to the ANTLR parser.

This is a problem because the parser is hard to verify and test systematically, since it is much less constrained than the blocks.

Also, hiding blocks makes it hard for the user to understand the space of possibilities. We can disable blocks, but I don't think we should hide them. Of course, this is more feasible when the set of blocks is smaller.

That's why I think we should systematize the set of blocks a bit more. This would also help with issue #6.
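
For instance, instead of a free text field, a block could draw its options from the concept dictionary, so that the block itself pins down what it contains (block, field and option names are illustrative):

```typescript
import * as Blockly from 'blockly';

// Illustrative constrained block: the field cannot be empty or free-form
// because its options come from the concept dictionary.
Blockly.defineBlocksWithJsonArray([{
  type: 'concept_object',
  message0: '%1',
  args0: [{
    type: 'field_dropdown',
    name: 'CONCEPT',
    options: [['park', 'park'], ['neighbourhood', 'neighbourhood']],
  }],
  output: 'Concept',
}]);
```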

Connect question elements to CCT operators

It is my understanding that, eventually, the operators of the cct language should inform the queries --- not just the CCT types. For this, semantic markers in the questions should be connected to CCT operators.
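
Concretely, that connection could start as a simple mapping from markers to operators --- with the caveat that the operator names below are placeholders, not necessarily actual operators of the cct language:

```typescript
// Hypothetical mapping from semantic markers to CCT operators; the
// operator names are placeholders, to be replaced by cct's own.
const markerToOperator: Record<string, string> = {
  'how many': 'count',
  'proportion of': 'ratio',
  'nearest': 'nearestFrom',
};
```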

Set up JavaScript tooling

Dependencies: #2

At the moment, plain JavaScript is used with a copy of Blockly to create the interface. This was fine for local testing, but if we are to maintain a version that is exposed to the outside world, we need to streamline the process (the reasons can be reviewed on Blockly's 'get started' page).

The JavaScript tooling ecosystem is fragmented: it's easy to be overwhelmed by the plethora of package managers, build tools, module bundlers, etcetera. I have opted for npm as our package manager and Parcel as our bundler --- this struck me as the easiest option. For now, I don't think we need additional build tools, but I'm open to persuasion towards Gulp, Grunt, Yarn, webpack, Rollup, etcetera. Additionally, we will use TypeScript to protect our sanity.

Note that you can install npm via conda-forge (it ships with the nodejs package). I mention this for the benefit of Windows-using colleagues: since we use Conda elsewhere anyway, it's an easy way to install and keep track of the development environment.
