
cognitiveserviceslanguageutilities's Introduction

Repository Contents

CLUtils

CLUtils is a CLI tool that provides core functionality to simplify working with the Cognitive Language Services.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

cognitiveserviceslanguageutilities's People

Contributors

a-noyass, magrefaat, microsoftopensource, mshaban-msft, mshaban93, nawanas


Forkers

abolfathi

cognitiveserviceslanguageutilities's Issues

Optimize azure function

Currently the Azure function accepts more than one document/text per request, but the MiniSDK for Custom Text only sends one document per analyze call.
We need to enable the SDK to submit multiple documents in a single job, and reflect that in the Azure function's response.
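A minimal sketch of batching documents into analyze jobs. The payload shape (an `analysisInput` object with a `documents` list) and the batch size are illustrative assumptions, not the MiniSDK's actual contract:

```python
from typing import Any, Dict, List

def build_batch_payloads(documents: List[str], batch_size: int = 25) -> List[Dict[str, Any]]:
    """Group raw texts into analyze-call payloads, several documents per job.

    NOTE: the payload shape below is an assumption for illustration;
    the real MiniSDK / Azure function contract may differ.
    """
    payloads = []
    for start in range(0, len(documents), batch_size):
        chunk = documents[start:start + batch_size]
        payloads.append({
            "analysisInput": {
                "documents": [
                    # Ids are positional here; a real client would carry
                    # caller-supplied document ids through instead.
                    {"id": str(start + i), "text": text}
                    for i, text in enumerate(chunk)
                ]
            }
        })
    return payloads
```

Each payload then maps to one job, so the Azure function response can return one result list per batch.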

normalize match score

When the fuzzy-matching client returns a result:
Current behavior (cosine distance): the lower the score, the more similar the match (e.g. infinity → no match, 0 → exact match).
Intended behavior: a score from 0 to 1, where 1 means an exact match and 0 means no match.
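One simple monotonic mapping from a distance-style score onto the intended [0, 1] range is `1 / (1 + d)`; the real client may prefer a different curve, but any strictly decreasing map with f(0) = 1 and f(∞) = 0 would satisfy the intended behavior:

```python
def normalize_score(distance: float) -> float:
    """Map a distance-style score (0 = exact match, larger = worse,
    infinity = no match) onto a similarity in [0, 1], where 1 means
    an exact match and 0 means no match.

    1 / (1 + d) is one simple monotonic mapping; it is an illustrative
    choice, not necessarily the curve the client should ship with.
    """
    if distance == float("inf"):
        return 0.0
    return 1.0 / (1.0 + distance)
```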

Unify Cognitive Search api calls

Currently we use a mix of SDK calls and raw API calls to reach the Cognitive Search services.
We need to unify this: either API calls or SDK calls, not both.

Integrate document parser in indexing pipeline

We need to create a custom skill that parses documents which aren't in text format (PDFs, images, Open XML formats) in order to extract text from them.
We'll use the parser module from the Custom Text CLI tool.

optimize run-time memory usage

In the runtime client, we need to find a way to use the compressed dataset without unrolling/unzipping it.
(For example, with a 10k-record dataset, memory usage was 3 GB before the TF-IDF matrix was compressed.)
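One way to avoid fully unzipping the dataset is to stream archive members directly; a sketch using Python's stdlib `zipfile` (the on-disk dataset format is an assumption here):

```python
import io
import zipfile
from typing import Iterator

def stream_lines_from_zip(zip_bytes: bytes, member: str) -> Iterator[str]:
    """Read one member of a zip archive line by line, without extracting
    it to disk or decompressing the whole file into memory at once.

    NOTE: assumes the dataset is a line-oriented text member inside a
    zip archive; the real dataset layout may differ.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as handle:
            for raw in handle:
                yield raw.decode("utf-8").rstrip("\n")
```

Pairing streaming with a sparse representation of the TF-IDF matrix (storing only nonzero entries) is the usual way to keep the in-memory footprint small.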

load tests

Test with a large number of documents.
This may cause memory issues, since we load documents into memory and process them in parallel.

default matching threshold

Scenario 1:

  • user input:
  • threshold: 0.8
  • matched entities: 131

Scenario 2:

  • user input: a sentence that includes [Ministry of Defence, Weapons, Lightweight Missile Attack Systems Project Team (LMAS PT)] as an entity name
  • threshold: 0.8
  • matched entities: 65

Returning that many matches doesn't look like valid behavior, and inspecting the data shows many of them aren't correct. We need to set a default threshold (e.g. 0.9) to eliminate these outliers.

This tends to be more apparent with longer entities (as in the example above): there are more combinations of tokens that produce an acceptably high overall similarity, since adding or removing a few words has less impact on the similarity of a long entity.

Should the (default) threshold value be a function of the number of words somehow?
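One possible answer, sketched below as a hypothetical heuristic (none of these constants are shipped defaults): tighten the threshold slightly per extra token, since longer entities score deceptively high on partial matches:

```python
def default_threshold(entity: str, base: float = 0.9,
                      per_token: float = 0.01, cap: float = 0.99) -> float:
    """Hypothetical length-aware default threshold.

    ASSUMPTION: base/per_token/cap are illustrative values, not tuned
    defaults. The idea is: start at `base` and demand a slightly higher
    similarity for each additional token, capped below 1.0.
    """
    n_tokens = len(entity.split())
    return min(cap, base + per_token * max(0, n_tokens - 1))
```

Whether a linear ramp is the right shape (versus, say, one keyed to character length) would need validation against the data from the two scenarios above.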

indexer field names cannot contain spaces

Custom Text entity names can contain spaces and digits, but Cognitive Search index field names can only contain letters and underscores.
This mismatch causes the indexer CLI tool to break without showing any error.
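A sketch of sanitizing entity names into acceptable field names, assuming the "letters and underscores only" rule described above (the helper name and fallback value are hypothetical):

```python
import re

def to_index_field_name(entity_name: str) -> str:
    """Convert a Custom Text entity name (which may contain spaces and
    digits) into a name the Cognitive Search index will accept.

    ASSUMPTION: per the issue, field names may contain only letters and
    underscores, so spaces become underscores and every other disallowed
    character is dropped.
    """
    underscored = entity_name.replace(" ", "_")
    cleaned = re.sub(r"[^A-Za-z_]", "", underscored)
    # Guard against names that sanitize down to nothing.
    return cleaned or "field"
```

The indexer should also surface an error (rather than failing silently) whenever a name had to be rewritten.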

samples for searching sdk

Multi-layered search!
Integrate both QnA Maker and Cognitive Search's search SDK so that clients can use both to search through their documents, akin to a fallback service when one of them doesn't return a result.
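The fallback composition might look like the sketch below, where the two callables stand in for real QnA Maker and Cognitive Search clients (no actual SDK calls are shown):

```python
from typing import Callable, List

def layered_search(query: str,
                   primary: Callable[[str], List[str]],
                   fallback: Callable[[str], List[str]]) -> List[str]:
    """Query the primary backend (e.g. Cognitive Search) and fall back
    to the secondary one (e.g. QnA Maker) when it returns nothing.

    The callables are stand-ins; a real sample would wrap the two SDK
    clients behind this interface.
    """
    results = primary(query)
    if results:
        return results
    return fallback(query)
```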

Remove secrets from Azure function

Currently, we retrieve Custom Text app secrets from inside the Azure function.
Proposed solutions:

  1. use Key Vault
  2. pass the secrets to the Azure function as headers in the CustomSkillset definition
  • this way, the Custom Text secrets will also be added to the configs.json for the CLI tool
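For option 2, the Azure function would read the secret from the request headers. A sketch, with a hypothetical header name:

```python
from typing import Mapping

def get_custom_text_secret(headers: Mapping[str, str],
                           header_name: str = "x-custom-text-key") -> str:
    """Read the Custom Text secret passed to the Azure function as a
    custom skillset header, instead of storing it in the function.

    ASSUMPTION: the header name is illustrative. Lookup is done
    case-insensitively because HTTP header casing is not guaranteed.
    """
    for key, value in headers.items():
        if key.lower() == header_name.lower():
            return value
    raise KeyError(f"missing required header: {header_name}")
```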

Generate app schema automatically

Currently, the Custom Text authoring resource doesn't have an API to retrieve the app schema, so we let the user figure out how to get all the custom entities in the app and provide them in the schema.json file (see the Docs folder). We need to generate the schema automatically.
Proposed solution:
use the 'id_**_labels.json' file in the blob container used by Custom Text.
This file contains all the labeling info; its 'entityNames' section contains the labels/entities.
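A sketch of extracting the entity names from the labels file. The exact JSON shape of the 'entityNames' section (assumed here to be a list of objects with a 'name' field) should be checked against the real file layout:

```python
import json
from typing import List

def extract_entity_names(labels_json: str) -> List[str]:
    """Build the schema's entity list from the labels file stored in the
    Custom Text blob container.

    ASSUMPTION: 'entityNames' is a list of objects each carrying a
    'name' field; adjust to the actual labels-file layout.
    """
    data = json.loads(labels_json)
    return [entry["name"] for entry in data.get("entityNames", [])]
```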

updated fuzzy matching pipeline - approach 1

Steps:

  1. tokenize the input sentence
  • at different levels of tokenization: 1-word tokens, 2-word tokens, ..
  • e.g. for "i want to travel from cairo to new york":
  • 1-word tokens: ["i", "want", "to", ..., "new", "york"]
  • 2-word tokens: ["i want", "to travel", ..., "new york"]
  2. match against the pre-processed dataset

This way we'll be able to get start and end indices for the matched results.
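The tokenization step above, with start/end character indices attached, could be sketched as sliding n-word windows over the sentence:

```python
from typing import List, Tuple

def ngram_spans(sentence: str, n: int) -> List[Tuple[str, int, int]]:
    """Produce n-word tokens together with their start/end character
    indices in the original sentence, so a dataset match can be mapped
    back to a span."""
    words = sentence.split()
    # Recover each word's character offset in the original sentence.
    offsets, cursor = [], 0
    for word in words:
        start = sentence.index(word, cursor)
        offsets.append((start, start + len(word)))
        cursor = start + len(word)
    spans = []
    for i in range(len(words) - n + 1):
        start, end = offsets[i][0], offsets[i + n - 1][1]
        spans.append((sentence[start:end], start, end))
    return spans
```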

Enable logging

Scenario:
when a document fails while the indexer is running, it's possible to find out why using a 'Debug session' in the Cognitive Search portal, but most of our customers won't dig that deep.
We need to enable logging in the indexer CLI tool so it reports these types of issues.

Support Form-Recognizer in the pipeline

In order to support forms for search (whether digital or scanned files), we need to support another AI-enrichment function: Form Recognizer.

updated fuzzy matching pipeline - approach 2

Steps:

  1. fuzzy-match the entire input sentence against the entire dataset
  2. get a list of possible matches
  3. identify the start and end indices of each match:
  • tokenize the input sentence
  • at different levels of tokenization
  • match each level against the result entities
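Step 3 could be sketched as follows, using stdlib `difflib.SequenceMatcher` as a stand-in for the real fuzzy-matching client:

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple

def locate_match(sentence: str, entity: str, max_n: int = 5,
                 threshold: float = 0.8) -> Optional[Tuple[int, int, float]]:
    """After a whole-sentence fuzzy match suggests `entity` is present,
    find its start/end character indices by comparing every n-word token
    (n = 1..max_n) against the entity.

    ASSUMPTION: SequenceMatcher.ratio() is a placeholder for the real
    fuzzy-matching score; the threshold and max_n are illustrative.
    """
    words = sentence.split()
    best = None
    for n in range(1, min(max_n, len(words)) + 1):
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            score = SequenceMatcher(None, candidate.lower(), entity.lower()).ratio()
            if score >= threshold and (best is None or score > best[2]):
                start = sentence.find(candidate)
                best = (start, start + len(candidate), score)
    return best
```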

Investigate indexer behaviour

For edge cases, we need to investigate the indexer's behavior.
For example: what happens if we delete somefile.txt from the data storage and upload a different file with the same name?
What happens to the index in this case?
We need to identify more edge cases, report them, and find solutions.

update documentation/(code)

  1. deploy azure function
  • remove the publish profile so users won't get confused
  2. azure function
  • how to create the Azure function
  • users must update the model id
  3. indexer command
  • explain the parameters
  • configs.json (clarify that it's for indexer.exe)

Samples for search

We need to provide code samples showing how to use the Cognitive Search SearchClient to search the created index.
For now, we'll only provide samples for .NET and Python.
