
cognitiveserviceslanguageutilities's Introduction

Repository Contents

CLUtils

CLUtils is a CLI tool that provides core functionality to simplify working with the Cognitive Language Services.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

cognitiveserviceslanguageutilities's People

Contributors

a-noyass, magrefaat, microsoftopensource, mshaban-msft, mshaban93, nawanas


Forkers

abolfathi

cognitiveserviceslanguageutilities's Issues

Optimize azure function

Currently the Azure function accepts more than one document/text per request, but the MiniSDK for Custom Text only sends one document per analyze call.
We need to enable the SDK to submit multiple documents in a single job, and reflect that in the Azure function's response.
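A minimal sketch of batching documents into analyze jobs. The payload shape (an `analysisInput` object with a `documents` list) and the batch size are illustrative assumptions, not the MiniSDK's actual contract:

```python
from typing import Any, Dict, List

def build_batch_payloads(documents: List[str], batch_size: int = 25) -> List[Dict[str, Any]]:
    """Group raw texts into analyze-call payloads, several documents per job.

    NOTE: the payload shape below is an assumption for illustration;
    the real MiniSDK / Azure function contract may differ.
    """
    payloads = []
    for start in range(0, len(documents), batch_size):
        chunk = documents[start:start + batch_size]
        payloads.append({
            "analysisInput": {
                "documents": [
                    # Ids are positional here; a real client would carry
                    # caller-supplied document ids through instead.
                    {"id": str(start + i), "text": text}
                    for i, text in enumerate(chunk)
                ]
            }
        })
    return payloads
```

Each payload then maps to one job, so the Azure function response can return one result list per batch.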

normalize match score

When the fuzzy-matching client returns a result:
Current behavior (cosine distance): the lower the score, the more similar the match (e.g. infinity → no match, 0 → exact match).
Intended behavior: a score from 0 to 1, where 1 means an exact match and 0 means no match.
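One simple monotonic mapping from a distance-style score onto the intended [0, 1] range is `1 / (1 + d)`; the real client may prefer a different curve, but any strictly decreasing map with f(0) = 1 and f(∞) = 0 would satisfy the intended behavior:

```python
def normalize_score(distance: float) -> float:
    """Map a distance-style score (0 = exact match, larger = worse,
    infinity = no match) onto a similarity in [0, 1], where 1 means
    an exact match and 0 means no match.

    1 / (1 + d) is one simple monotonic mapping; it is an illustrative
    choice, not necessarily the curve the client should ship with.
    """
    if distance == float("inf"):
        return 0.0
    return 1.0 / (1.0 + distance)
```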

Unify Cognitive Search api calls

Currently we use a mix of SDK calls and raw API calls to reach the Cognitive Search services.
We need to unify this: either API calls or SDK calls, not both.

Integrate document parser in indexing pipeline

We need to create a custom skill that parses documents which aren't in text format (PDFs, images, Open XML formats) in order to extract text from them.
We'll use the parser module from the Custom Text CLI tool.

optimize run-time memory usage

In the runtime client, we need to find a way to use the compressed dataset without unrolling/unzipping it.
(For example, with a 10k-record dataset, memory usage was 3 GB before the TF-IDF matrix was compressed.)
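One way to avoid fully unzipping the dataset is to stream archive members directly; a sketch using Python's stdlib `zipfile` (the on-disk dataset format is an assumption here):

```python
import io
import zipfile
from typing import Iterator

def stream_lines_from_zip(zip_bytes: bytes, member: str) -> Iterator[str]:
    """Read one member of a zip archive line by line, without extracting
    it to disk or decompressing the whole file into memory at once.

    NOTE: assumes the dataset is a line-oriented text member inside a
    zip archive; the real dataset layout may differ.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as handle:
            for raw in handle:
                yield raw.decode("utf-8").rstrip("\n")
```

Pairing streaming with a sparse representation of the TF-IDF matrix (storing only nonzero entries) is the usual way to keep the in-memory footprint small.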

load tests

Test with a large number of documents.
This may cause memory issues, since we load documents into memory and process them in parallel.

default matching threshold

Scenario 1:

  • user input:
  • threshold: 0.8
  • matched entities: 131

Scenario 2:

  • user input: a sentence that includes [Ministry of Defence, Weapons, Lightweight Missile Attack Systems Project Team (LMAS PT)] as an entity name
  • threshold: 0.8
  • matched entities: 65

Returning that many matches doesn't look like valid behavior, and inspecting the data shows many of them aren't correct. We need to set a default threshold (e.g. 0.9) to eliminate these outliers.

This tends to be more apparent with longer entities (as in the example above): there are more combinations of tokens that produce an acceptably high overall similarity, since adding or removing a few words has less impact on the similarity of a long entity.

Should the (default) threshold value be a function of the number of words somehow?
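One possible answer, sketched below as a hypothetical heuristic (none of these constants are shipped defaults): tighten the threshold slightly per extra token, since longer entities score deceptively high on partial matches:

```python
def default_threshold(entity: str, base: float = 0.9,
                      per_token: float = 0.01, cap: float = 0.99) -> float:
    """Hypothetical length-aware default threshold.

    ASSUMPTION: base/per_token/cap are illustrative values, not tuned
    defaults. The idea is: start at `base` and demand a slightly higher
    similarity for each additional token, capped below 1.0.
    """
    n_tokens = len(entity.split())
    return min(cap, base + per_token * max(0, n_tokens - 1))
```

Whether a linear ramp is the right shape (versus, say, one keyed to character length) would need validation against the data from the two scenarios above.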

indexer field names cannot contain spaces

Custom Text entity names can contain spaces and digits, but Cognitive Search index field names can only contain letters and underscores.
This mismatch causes the indexer CLI tool to break without showing any error.
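A sketch of sanitizing entity names into acceptable field names, assuming the "letters and underscores only" rule described above (the helper name and fallback value are hypothetical):

```python
import re

def to_index_field_name(entity_name: str) -> str:
    """Convert a Custom Text entity name (which may contain spaces and
    digits) into a name the Cognitive Search index will accept.

    ASSUMPTION: per the issue, field names may contain only letters and
    underscores, so spaces become underscores and every other disallowed
    character is dropped.
    """
    underscored = entity_name.replace(" ", "_")
    cleaned = re.sub(r"[^A-Za-z_]", "", underscored)
    # Guard against names that sanitize down to nothing.
    return cleaned or "field"
```

The indexer should also surface an error (rather than failing silently) whenever a name had to be rewritten.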

samples for searching sdk

Multi-layered search!
Integrate both QnA Maker and Cognitive Search's search SDK so that clients can use both to search through their documents, akin to a fallback service when one of them doesn't return a result.
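The fallback composition might look like the sketch below, where the two callables stand in for real QnA Maker and Cognitive Search clients (no actual SDK calls are shown):

```python
from typing import Callable, List

def layered_search(query: str,
                   primary: Callable[[str], List[str]],
                   fallback: Callable[[str], List[str]]) -> List[str]:
    """Query the primary backend (e.g. Cognitive Search) and fall back
    to the secondary one (e.g. QnA Maker) when it returns nothing.

    The callables are stand-ins; a real sample would wrap the two SDK
    clients behind this interface.
    """
    results = primary(query)
    if results:
        return results
    return fallback(query)
```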

Remove secrets from Azure function

Currently, we retrieve Custom Text app secrets from inside the Azure function.
Proposed solutions:

  1. use Key Vault
  2. pass the secrets to the Azure function as headers in the CustomSkillset definition
  • this way, the Custom Text secrets will also be added to the configs.json for the CLI tool
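For option 2, the Azure function would read the secret from the request headers. A sketch, with a hypothetical header name:

```python
from typing import Mapping

def get_custom_text_secret(headers: Mapping[str, str],
                           header_name: str = "x-custom-text-key") -> str:
    """Read the Custom Text secret passed to the Azure function as a
    custom skillset header, instead of storing it in the function.

    ASSUMPTION: the header name is illustrative. Lookup is done
    case-insensitively because HTTP header casing is not guaranteed.
    """
    for key, value in headers.items():
        if key.lower() == header_name.lower():
            return value
    raise KeyError(f"missing required header: {header_name}")
```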

Generate app schema automatically

Currently, the Custom Text authoring resource doesn't have an API to retrieve the app schema, so we let the user figure out how to get all the custom entities in the app and provide them in the schema.json file (see the Docs folder). We need to generate the schema automatically.
Proposed solution:
use the 'id_**_labels.json' file in the blob container used by Custom Text.
This file contains all the labeling info; its 'entityNames' section contains the labels/entities.
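A sketch of extracting the entity names from the labels file. The exact JSON shape of the 'entityNames' section (assumed here to be a list of objects with a 'name' field) should be checked against the real file layout:

```python
import json
from typing import List

def extract_entity_names(labels_json: str) -> List[str]:
    """Build the schema's entity list from the labels file stored in the
    Custom Text blob container.

    ASSUMPTION: 'entityNames' is a list of objects each carrying a
    'name' field; adjust to the actual labels-file layout.
    """
    data = json.loads(labels_json)
    return [entry["name"] for entry in data.get("entityNames", [])]
```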

updated fuzzy matching pipeline - approach 1

Steps:

  1. tokenize the input sentence
  • at different levels of tokenization: 1-word tokens, 2-word tokens, ..
  • e.g. for "i want to travel from cairo to new york":
  • 1-word tokens: ["i", "want", "to", ..., "new", "york"]
  • 2-word tokens: ["i want", "to travel", ..., "new york"]
  2. match against the pre-processed dataset

This way we'll be able to get start and end indices for the matched results.
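The tokenization step above, with start/end character indices attached, could be sketched as sliding n-word windows over the sentence:

```python
from typing import List, Tuple

def ngram_spans(sentence: str, n: int) -> List[Tuple[str, int, int]]:
    """Produce n-word tokens together with their start/end character
    indices in the original sentence, so a dataset match can be mapped
    back to a span."""
    words = sentence.split()
    # Recover each word's character offset in the original sentence.
    offsets, cursor = [], 0
    for word in words:
        start = sentence.index(word, cursor)
        offsets.append((start, start + len(word)))
        cursor = start + len(word)
    spans = []
    for i in range(len(words) - n + 1):
        start, end = offsets[i][0], offsets[i + n - 1][1]
        spans.append((sentence[start:end], start, end))
    return spans
```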

Enable logging

Scenario:
when a document fails while the indexer is running, it's possible to find out why using a 'Debug session' in the Cognitive Search portal, but most of our customers won't dig that deep.
We need to enable logging in the indexer CLI tool so it reports these types of issues.

Support Form-Recognizer in the pipeline

In order to support forms for search (whether digital or scanned files), we need to support another AI-enrichment function: Form Recognizer.

updated fuzzy matching pipeline - approach 2

Steps:

  1. fuzzy-match the entire input sentence against the entire dataset
  2. get a list of possible matches
  3. identify the start and end indices of each match:
  • tokenize the input sentence
  • at different levels of tokenization
  • match each level against the result entities
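Step 3 could be sketched as follows, using stdlib `difflib.SequenceMatcher` as a stand-in for the real fuzzy-matching client:

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple

def locate_match(sentence: str, entity: str, max_n: int = 5,
                 threshold: float = 0.8) -> Optional[Tuple[int, int, float]]:
    """After a whole-sentence fuzzy match suggests `entity` is present,
    find its start/end character indices by comparing every n-word token
    (n = 1..max_n) against the entity.

    ASSUMPTION: SequenceMatcher.ratio() is a placeholder for the real
    fuzzy-matching score; the threshold and max_n are illustrative.
    """
    words = sentence.split()
    best = None
    for n in range(1, min(max_n, len(words)) + 1):
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            score = SequenceMatcher(None, candidate.lower(), entity.lower()).ratio()
            if score >= threshold and (best is None or score > best[2]):
                start = sentence.find(candidate)
                best = (start, start + len(candidate), score)
    return best
```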

Investigate indexer behaviour

For edge cases, we need to investigate the indexer's behavior.
For example: what happens if we delete somefile.txt from the data storage and upload a different file with the same name?
What happens to the index in this case?
We need to identify more edge cases, report them, and find solutions.

update documentation/(code)

  1. deploy azure function
  • remove the publish profile so users won't get confused
  2. azure function
  • how to create the Azure function
  • users must update the model id
  3. indexer command
  • explain the parameters
  • configs.json (clarify that it's for indexer.exe)

Samples for search

We need to provide code samples showing how to use the Cognitive Search SearchClient to search the created index.
For now, we'll only provide samples for .NET and Python.
