quangis / geo-question-parser

Extract core concept transformations from geo-analytical questions.
The pressing issues with this part of the pipeline concern robustness, scalability and testing. For the final product, we need a lot of simplifications. To organize and document the development, I will track it in this issue tracker.
Currently, if I understand correctly, the procedure can be roughly sketched as follows. I will edit as I go along; please comment if I am mistaken.
- `nltk` is used to detect and clean adjectives like 'nearest', so that the important nouns can be isolated and recognized in subsequent steps.
- `cct` types are assigned via manually constructed rules, based on the concepts/extents/transformations that were found in previous steps.

The issue is that this is rather fragile, as it depends (among other things) on a number of brittle assumptions.
We have chosen `blockly` to constrain the natural language at the user end, in such a way that the questions presented to the parser are questions that the parser can handle. However, this only formats the question to reduce the problems of an otherwise unchanged natural language processing pipeline. As discussed in the meeting and elsewhere:

- Because questions are constructed with `blockly` instead of freeform text, we will no longer need named entity recognition or question cleaning. This would strip out the `nltk`, `spaCy`, and `allennlp` packages, tremendously simplifying the process.
- Since a `blockly`-constructed query can output something different than what's written on the blocks, we might even forego the natural language parser completely, in favour of JSON output at the blockly level (or another format that is easily parsed). This would eliminate even the ANTLR parser, further reducing complexity. The downside is that we would no longer be able to parse freeform text (though that would be impacted by the removal of named entity recognition anyway). We could describe this output with JSON Schema to really pin it down.

This would make this repository not so much a `geo-question-parser` as a `geo-question-formulator`. This is good, because the code right now is very complex and closely fitted to the specific questions in the original corpus, which isn't acceptable in a situation where users can pose their own questions.
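To make the idea concrete, here is a sketch of what blockly-level JSON output for a question might look like. The structure and field names (`concepts`, `relation`, etc.) are purely illustrative assumptions, not a settled format; the real shape would be pinned down by the JSON Schema mentioned above.

```typescript
// Hypothetical JSON output that a blockly-constructed question might emit,
// instead of a semi-natural-language string. All field names are illustrative.
interface QuestionJSON {
  question: string;                                // text shown on the blocks, for reference
  concepts: { phrase: string; entity: string }[];  // noun phrases with pre-tagged entity kinds
  relation?: string;                               // optional relation between the concepts
}

// Example: one question as the blocks might serialize it.
const example: QuestionJSON = {
  question: "What is the population density of each neighbourhood?",
  concepts: [
    { phrase: "population density", entity: "amount" },
    { phrase: "neighbourhood", entity: "region" },
  ],
};

// A minimal structural check, standing in for full JSON Schema validation.
function isQuestionJSON(x: unknown): x is QuestionJSON {
  const q = x as QuestionJSON;
  return (
    typeof q === "object" && q !== null &&
    typeof q.question === "string" &&
    Array.isArray(q.concepts) &&
    q.concepts.every(c => typeof c.phrase === "string" && typeof c.entity === "string")
  );
}
```

Because the blocks already know the entity kind of every slot, the `entity` tags come for free; no named entity recognition is involved.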
Note: If we simplify to this extent, it might be nice to use `rdflib.js` to output a transformation graph directly, but that is for later.
The process would thus become:

1. With `blockly`, construct JSON that represents a question.
2. Derive `cct` types via rules.

I'm not sure to what extent we can still simplify step 2. Depending on how much code would be left, it would be nice to port/rewrite it in JavaScript, alongside `blockly`, so that we can visualize most things client-side and with minimal moving parts.
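A minimal sketch of what step 2 could look like if ported to the client side: a declarative rule table mapping entity kinds (as tagged by the blocks) to type strings. Both the entity names and the type strings below are hypothetical placeholders, not the actual `cct` vocabulary.

```typescript
// Hypothetical rule table: entity kind (as emitted by the blocks) to a
// cct type string. Both columns are illustrative placeholders only.
const typeRules: Record<string, string> = {
  region: "R(Obj, Reg)",
  amount: "R(Obj, Ratio)",
  distance: "Ratio",
};

// Assign a type to each recognized entity. Unknown entities fail loudly
// rather than being silently skipped, which keeps the step testable.
function assignTypes(entities: string[]): string[] {
  return entities.map(e => {
    const t = typeRules[e];
    if (t === undefined) throw new Error(`no rule for entity '${e}'`);
    return t;
  });
}
```

The point of the sketch is that once the input is structured JSON, this step reduces to a lookup over declarative rules, which is easy to test exhaustively.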
Dependencies: #3
At the moment, there are no automated tests for this part of the pipeline. Any improvements made to one part might cause a regression elsewhere (see also #1).
We will need tests at several levels.
I'm not yet familiar with JavaScript's testing ecosystem. Updates will be tracked here.
As discussed in #1, the issues of constructing a question and of extracting functional roles should be separated. For this, we need to figure out how to replace or adapt the ANTLR parser for the recognition of functional roles.
Issue #5 discusses changing the output of blocks to simplify this step. The issue you're reading now is about taking that output and actually producing the functional roles/transformations.
Ideally, the information needed both to show question blocks and to extract functional roles from their output would be declared in a single grammar file. This is ideal because phrases and their functional roles would be kept in a single place, and the procedural code for generating blocks and extracting transformations would stay separate from the declarative grammar code. That would allow those who edit the grammar to focus on the important bits. It may or may not be possible; I will need a better understanding of the blocks & parser.
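As a sketch of the idea, a single declarative entry could drive both the block definition and the role extraction. Everything here — the entry shape, the `%1`-style slot notation, and the role names — is a hypothetical illustration, not an existing format in this repository.

```typescript
// Hypothetical: one declarative entry per phrase, from which both the
// blockly block definition and the functional-role extraction could be
// generated. Shape, slot notation and role names are illustrative only.
interface GrammarEntry {
  phrase: string;    // text shown on the block, with numbered slots
  role: string;      // functional role the phrase carries
  slots?: string[];  // names of the block's input slots, if any
}

const grammar: GrammarEntry[] = [
  { phrase: "how many %1 are within %2", role: "count", slots: ["object", "region"] },
  { phrase: "nearest %1", role: "proximity", slots: ["object"] },
];

// Role extraction becomes a lookup over the same declarations that
// generate the blocks, so the two can never drift apart.
function roleOf(phrase: string): string | undefined {
  return grammar.find(e => e.phrase === phrase)?.role;
}
```

The design payoff is the single source of truth: adding a phrase to the grammar automatically makes it both displayable and extractable.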
As mentioned in #1, we need to get rid of the overlap between the grammar and the Blockly interface.
For this, the blocks should generate not another semi-natural language question that needs to pass through the whole parsing pipeline again, but rather, a structure from which functional roles are derived directly.
For example, the machine learning libraries that were used to recognize entities should be removed in favour of constraining the blocks in such a way that we already know the entity types. This will speed up the process and simplify the environment.
What other information should be conveyed by the block's output to be able to recognize functional roles? That should inform the details of the structure.
Once we have a better idea of what that structure should look like, we can generate it from blocks; see this page for more information.
This is a low-priority issue that ties into issue #7. We can use mutators to make dynamic blocks, which would let us avoid making users explicitly choose between irrelevant grammatical variants. For example, we could automatically adapt 'fewer than' to 'less than', or automatically add connectives like 'and/or' when stacking relationships.
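The normalization itself is trivial once it lives in code rather than in duplicate blocks. A sketch, with a purely illustrative variant table:

```typescript
// Illustrative table of grammatical variants that all map to one
// canonical phrase, so the user never has to choose between them.
const variants: Record<string, string> = {
  "fewer than": "less than",
  "lower than": "less than",
};

// Canonicalize a phrase; phrases without a listed variant pass through.
function canonical(phrase: string): string {
  return variants[phrase] ?? phrase;
}

// When relationships are stacked, join them with an explicit connective
// instead of asking the user to place an 'and'/'or' block themselves.
function stack(relations: string[], connective: "and" | "or" = "and"): string {
  return relations.map(canonical).join(` ${connective} `);
}
```

Mutators would then only need to decide *which* canonical block to render; the grammatical variation never reaches the parser at all.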
A Blockly interface is used to constrain natural language to a form that can be handled by Haiqi's grammar. There is presently an overlap between the two, which is (one of the...) reasons that maintenance is hard. More information at #1.
For this reason, the interface should be built from a common source, and thus the copy of the interface at https://github.com/HaiqiXu/haiqixu.github.io or https://github.com/quangis/quangis-web should be grafted into this repository.
We claim to be able to extract a lot of information from a question. However, several patterns indicate to me that we haven't pinned down exactly what information is contained in a block, and that we're handwaving away the extraction of that information by pointing to the ANTLR parser: a block where a text field can be left empty; a text field that occurs on its own, without contextual information to constrain it; or multiple variants of a block that carry only syntactical differences.
This is a problem because the parser is hard to verify and test systematically, since it is much less constrained than the blocks.
Also, hiding blocks makes it hard for the user to understand the space of possibilities. We can disable blocks, but I don't think we should hide them. Of course, this is more feasible when the set of blocks is smaller.
That's why I think we should systematize the set of blocks a bit more. This would also help with issue #6.
Eventually, it is my understanding that the operators of the `cct` language should inform the queries --- not just the CCT types. For this, semantic markers in the questions should be connected to CCT operators.
Dependencies: #2
At the moment, plain JavaScript is used with a copy of `blockly` to create the interface. This was fine for local testing, but if we are to maintain a version that is exposed to the outside world, we need to streamline the process (reasons can be reviewed on Blockly's 'get started' page).

The JavaScript tooling ecosystem is fragmented: it's easy to be overwhelmed by the plethora of package managers, build tools, module bundlers, etcetera. I have opted for NPM as our package manager and Parcel as our bundler, since this struck me as the easiest combination. For now, I don't think we need additional build tools, but I'm open to persuasion towards Gulp, Grunt, Yarn, Webpack, Rollup, etcetera. Additionally, we will use TypeScript to protect our sanity.
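For reference, a minimal `package.json` along these lines might look as follows. The project name, script names and version ranges are illustrative assumptions, not this repository's actual configuration:

```json
{
  "name": "geo-question-formulator",
  "private": true,
  "scripts": {
    "start": "parcel src/index.html",
    "build": "parcel build src/index.html"
  },
  "dependencies": {
    "blockly": "^8.0.0"
  },
  "devDependencies": {
    "parcel": "^2.0.0",
    "typescript": "^4.0.0"
  }
}
```

Parcel needs no separate configuration for TypeScript, which is part of why it struck me as the path of least resistance.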
Note that you can install `npm` via `conda-forge`. I mention this for the benefit of Windows-using colleagues: since we use Conda elsewhere anyway, it's an easy way to install and keep track of the development environment.
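Concretely (assuming the conda-forge `nodejs` package, which bundles `npm`):

```shell
# Install Node.js (which includes npm) from conda-forge
# into the currently active conda environment.
conda install -c conda-forge nodejs
```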