
WA-Testing-Tool

Scripts that run against Watson Assistant for:

  • KFOLD - k-fold cross-validation on the training set,
  • BLIND - evaluating a blind test set, and
  • TEST - testing Watson Assistant against a list of utterances.

In the case of a k-fold cross-validation or a blind test set, the tool will output a precision curve, in addition to per-intent precision and recall rates and a confusion matrix.

Features

  • Easy to set up with a single configuration file.
  • Saves state if the Assistant service goes down in the middle of processing.
  • Able to resume from where it stopped, using modularized scripts.

Prerequisites

  • Python 3.6.4+
  • Mac users: you may need to initialize Python's SSL certificate store by running Install Certificates.command found in /Applications/Python. See more here
  • Git client

Quick Start

Pre-work: cd into the projects folder where you will clone this GitHub repo. After cloning (step 1 below), cd into the WA-Testing-Tool folder.

  1. Install the code: git clone https://github.com/cognitive-catalyst/WA-Testing-Tool.git
  2. Install dependencies: pip3 install --upgrade -r requirements.txt
  3. Set up the parameters in a configuration file (ex: config.ini). Use config.ini.sample to bootstrap your configuration. A minimal example follows this list.
     a. In your terminal, copy the sample config file into a new one: cp config.ini.sample config.ini
     b. Open config.ini in your favorite text editor and fill in your actual credentials: API key, url, and workspace_id (Watson Assistant v1) or environment_id (Watson Assistant v2).
     c. Set the mode and the mode-specific parameters.
  4. Run the process: python3 run.py -c config.ini or python3 run.py -c <path to your config file>
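
For reference, here is a minimal illustrative kfold configuration, assembled from the config excerpts that appear later in this document. Parameter names, section placement (in particular for url), and the placeholder values should be checked against the current config.ini.sample:

[DEFAULT]
mode = kfold
workspace_id = <your workspace id>
fold_num = 5
temporary_file_directory = ./data
out_figure_path = ./data/figure.png
keep_workspace_after_test = no
max_test_rate = 100

[ASSISTANT CREDENTIALS]
username = apikey
password = <your API key>
url = <your Watson Assistant URL>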

Quick Update

If you have already installed this utility use these steps to get the latest code.

  1. Upgrade dependencies pip3 install --upgrade -r requirements.txt
  2. Update to latest code level git pull

Input Files

config.ini - Configuration file for run.py. This is formatted differently for each mode. Review the Examples below to explore the possible modes and how each is configured.

test_input_file.csv - Test set for blind testing and standard test.

For blind test with golden intent used for comparison:

utterance        golden intent
utterance 0      intent 0
utterance 1      intent 0
utterance 2      intent 1

For a standard test, the input must have only one column; otherwise an error is thrown:

utterance
utterance 0
utterance 1
utterance 2
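
As a reference, here is a short Python sketch (standard library only) that writes a blind-test input file in the two-column layout shown above; the file name and rows are illustrative:

    import csv

    # Illustrative rows: each utterance paired with its golden (expected) intent.
    rows = [
        ("utterance 0", "intent 0"),
        ("utterance 1", "intent 0"),
        ("utterance 2", "intent 1"),
    ]

    with open("test_input_file.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["utterance", "golden intent"])  # header row as in the layout above
        writer.writerows(rows)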

Examples

There are a variety of ways to use this tool. Primarily you will execute a k-folds, blind, or standard test.

Core execution modes

Run k-fold cross-validation

Run blind test

Run standard test without ground truth

Extended modes (executed by default)

Generate precision/recall for classification test

Generate confusion matrix for classification test

Compare two different blind test results

Extended modes

Generate description for intents

Generate long-tail classification results

Unit test dialog flows

Run syntax validation patterns on a workspace

Extract and analyze Watson Assistant log data

More examples

Long-form resources available in Article and Video form:

  • Testing a Chatbot with k-folds Cross Validation
    Article: https://medium.com/ibm-watson/testing-a-chatbot-with-k-folds-cross-validation-68dab111a6b
    Video: https://www.youtube.com/watch?v=FrhK68WyOK4
  • Analyze chatbot classifier performance from logs
    Article: https://medium.com/ibm-watson/analyze-chatbot-classifier-performance-from-logs-e9cf2c7ca8fd
    Video: https://www.youtube.com/watch?v=yd89DKyf6hc
  • Improve a chatbot classifier with production data
    Article: https://medium.com/ibm-watson/improve-a-chatbot-classifier-with-production-data-22a437f419b4
    Video: https://www.youtube.com/watch?v=ftFIQtHiQY8

Related projects

Watson Assistant is commonly paired with IBM Speech services to build voice-driven Conversational AI solutions. Check out these tools to assess and tune your speech models!

Testing Natural Language Understanding Classifier

This tool can also be used to test a trained Natural Language Understanding (NLU) Classifier. The configuration is similar to testing Watson Assistant except:

  1. Use the NLU URL in the url parameter (ex: https://api.us-south.natural-language-understanding.watson.cloud.ibm.com)
  2. Specify the <model_id> in the workspace_id parameter in the configuration
  3. Since the NLU classifier does not support downloading training data, the original training data must be provided when running in 'kfold' mode (using the train_input_file parameter)
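
An illustrative sketch of the relevant configuration lines, assuming the same config.ini layout used for Watson Assistant; the exact section placement of url and the placeholder values should be checked against config.ini.sample:

[DEFAULT]
mode = kfold
; the NLU model_id goes in the workspace_id parameter
workspace_id = <model_id>
; NLU cannot export its training data, so supply it explicitly for kfold mode
train_input_file = ./data/train.csv

[ASSISTANT CREDENTIALS]
username = apikey
password = <your API key>
; the NLU endpoint goes in the url parameter
url = https://api.us-south.natural-language-understanding.watson.cloud.ibm.com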

General Caveats and Troubleshooting

  1. Due to differences among service plans, users may need to adjust max_test_rate accordingly to avoid network connection errors.

  2. Users on Lite plans are only able to create 5 workspaces. They should set fold_num=3 in their k-fold configuration file.

  3. In case of interrupted execution, the tool may not be able to clean up the workspaces it creates. In this case you will need to manually delete the extra workspaces.

  4. Workspace ID is not the Skill ID. In the Watson Assistant user interface, the Workspace ID can be found on the Skills tab, clicking the three dots (top-right of skill), and choosing View API Details.

  5. SSL: [CERTIFICATE_VERIFY_FAILED] on Mac means you may need to initialize Python's SSL certificate store by running Install Certificates.command found in /Applications/Python. See more here

  6. "This utility used to work and now it doesn't." Upgrade to latest dependencies with pip3 install --upgrade -r requirements.txt and latest code with git pull.

  7. If you get a Python module loading error, confirm that you are using matching pip and python versions, i.e. pip3 and python3 or pip and python.

  8. Watson Assistant v2 configuration does not support k-folds mode. Watson Assistant v2 is tested "in-place" rather than creating temporary skills for this tool. Actions users may prefer to use Dialog Skill Analysis notebooks - these notebooks have additional capabilities for analyzing Dialog or Action skills.


wa-testing-tool's Issues

Provide confusion matrix

Build a confusion matrix, either per fold or as a summary of all folds.
This helps quickly identify poorly-performing intents and those intents that get confused for each other.
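
A minimal sketch of one way to do this with pandas, assuming a results CSV with golden and predicted intent columns (the file and column names here are placeholders):

    import pandas as pd

    # Hypothetical results file with one row per test utterance.
    results = pd.read_csv("kfold_union_out.csv")

    # Rows = golden intents, columns = predicted intents, cells = counts.
    matrix = pd.crosstab(results["golden intent"], results["predicted intent"])
    matrix.to_csv("confusion_matrix.csv")

    # Off-diagonal cells show which intents get confused for each other.
    print(matrix)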

Improve clarity of config.ini.sample

The configuration file has a variety of required/optional parameters and these parameters are dependent on what mode the tool is run in.

Update the sample configuration file so that it provides guidance without having to go out to the README.

Add Blind testing required config.

Blind testing requires that blind_figure_title exist in the config file. The following needs to be added to the config.ini.sample file:

blind_figure_title =

User files are tracked in git changes

  1. Normal execution modifies the data/workspace_base.json file
  2. If a user has multiple config files, e.g. config.ini.mode1 and config.ini.mode2, these are not covered by .gitignore and show up as changed/untracked files by default

For 1), move the data/workspace_base.json file to data/workspace_base.json.sample, as it appears to exist in Git only to hold a reference to the data/ directory in the first place.
For 2), update .gitignore to cover the pattern described above.

test-out file not being created

When running the Blind Test - the test-out.csv file does not get created. I am attaching my config file here. Any recommendations are welcome. Thank you - Rebecca James

[DEFAULT]
mode = BLIND
workspace_id =
test_input_file = ./data/test.csv
temporary_file_directory = ./data
; previous_blind_out = ./data/previous_blind_out.csv
test_output_path = ./data-test-out.csv
; Figure path for kfold and blind
out_figure_path= ./data/figure.png
keep_workspace_after_test = no
blind_figure_title='title'
; partial_credit_table = ./data/partial-credit-table.csv

[ASSISTANT CREDENTIALS]
username = apikey
password = 

Make the URL API an input parameter for all scripts that use a URL

As reported from a user email:

The problem is that your configuration parameters don't include the URL of the service. This means that the default of https://gateway.watsonplatform.net/assistant/api is used which is fine for all WA instances hosted in US South. However, in other regions the URL of the WA API is different e.g if the instance is based in Germany the URL of the API will be https://gateway-fra.watsonplatform.net/assistant/api and unless you give users a way to specify this they wont be able to use their code.

Solution would be add the URL to the config file, and pass it as an input parameter to all scripts that call the API

Expose the version variable to config file

The API version used in the API calls is hard-coded in utils/__init__.py:

WCS_VERSION = '2018-07-10'

As a workaround, users can edit __init__.py and change the WCS_VERSION variable, but they may forget or not know about it.

Ideally this should be exposed in the config file so that users can match/configure the testing version with the version they use in their application.

Insert a limited retry loop around the WA message API call

Something along the lines of:

def send_message(text, counter=0):
    try:
        # get the response from Watson Assistant and return it
        return call_assistant(text)  # placeholder for the actual message API call
    except Exception:
        if counter < 5:
            # retry up to 5 times on a bad response
            return send_message(text, counter + 1)
        else:
            # give up and surface the actual failure
            raise

This wrapper would go around the call to Watson Assistant to make the tool more robust to occasional bad responses.

jpg export causes problems on some systems

New python installs may see errors like the following:

File "WA-Tool/venv/lib/python3.7/site-packages/matplotlib/backend_bases.py", line 1956, in _get_output_canvas
.format(fmt, ", ".join(sorted(self.get_supported_filetypes()))))
ValueError: Format 'jpg' is not supported (supported formats: eps, pdf, pgf, png, ps, raw, rgba, svg, svgz)

'jpg' support varies by platform; some platforms require installing the additional Pillow library for jpg support.

Half of the visualizations in this tool already use 'png' which does not cause the same issues. Workarounds include:

  • Add 'pillow' as dependency (requirements.txt, pip install)
  • Change any 'jpg' exports to 'png'

k-fold test results should include "golden intent" column

The results from each of the k-fold test runs do not currently include the "golden intent" column; they only have the "predicted intent" column. For ease of analysis, the "golden intent" (the intent that the utterance belongs to in the training data) should also be output as part of the experiment.

Reduce number of required parameters through increased consistency

Several parameters need not be provided by the user - the tool can assume sensible defaults.
Ideally the user is only required to provide connection information for their workspace and the mode to run the tool in, as that is the information the tool truly cannot know otherwise.

This increases the ease of first running the tool.

From the config file

; (Required) Test request rate (maximum number of API calls per second)
max_test_rate = 100

; (Required) All temporary files will be stored here
temporary_file_directory = ./data

; (Required) yes/no on whether to keep(yes) or delete(no) workspaces created by this tool after the testing phase
keep_workspace_after_test = no

; (Required for blind and test) Test output path
test_output_path = ./data/test-out.csv

; (Required for blind and kfold) Output figure path
out_figure_path= ./data/figure.png

; (Required for kfold) Number of folds.  If on LITE plan use 3.
; Each fold creates a workspace (make sure you have enough workspaces available, LITE plans are restricted to 5)
fold_num = 5

; (Required for blind) Title for blind testing output figure
blind_figure_title = 'Blind Test Results'

max_test_rate, temporary_file_directory, and keep_workspace_after_test already have sensible defaults; we need not require the user to provide them.

test_output_path: k-folds already defaults a version of this parameter; the other modes can default to data/blind_out.csv and data/test_out.csv (k-folds should thus be updated to read this parameter). The tool already prints out the location of all files written, so the user will not be surprised by this result.

out_figure_path: A good default would be test_output_path + '.jpg'

blind_figure_title: The default above is fine, we should not require the user to provide it.
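
A minimal sketch of how such defaults could be applied with Python's configparser, assuming the existing [DEFAULT] section layout (parameter names and fallback values are taken from the excerpt above):

    import configparser

    config = configparser.ConfigParser()
    config.read("config.ini")

    # Fall back to sensible defaults when the user omits these parameters.
    max_test_rate = config["DEFAULT"].getint("max_test_rate", fallback=100)
    temp_dir = config["DEFAULT"].get("temporary_file_directory", fallback="./data")
    keep_workspace = config["DEFAULT"].getboolean("keep_workspace_after_test", fallback=False)
    test_output_path = config["DEFAULT"].get("test_output_path", fallback="./data/test_out.csv")
    out_figure_path = config["DEFAULT"].get("out_figure_path", fallback=test_output_path + ".jpg")
    blind_figure_title = config["DEFAULT"].get("blind_figure_title", fallback="Blind Test Results")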

Tool failing with the SSL: [CERTIFICATE_VERIFY_FAILED] message

The tool is failing with the following message while running on MacOS:

    raise ClientConnectorSSLError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorSSLError: Cannot connect to host gateway.watsonplatform.net:443 ssl:None [[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/victorpovar/Desktop/WA-Testing-Tool-master/utils/testConversation.py", line 194, in <module>
    func(ARGS)
  File "/Users/victorpovar/Desktop/WA-Testing-Tool-master/utils/testConversation.py", line 131, in func
    loop.run_until_complete(asyncio.gather(*tasks))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "/Users/victorpovar/Desktop/WA-Testing-Tool-master/utils/testConversation.py", line 72, in fill_df
    'alternate_intents': True}, url, sem)
RETRY
  File "/Users/victorpovar/Desktop/WA-Testing-Tool-master/utils/testConversation.py", line 55, in post
3
    print(response.status)
RETRY
UnboundLocalError: local variable 'response' referenced before assignment

Fix deprecation warning

WA-Testing-Tool/utils/createTestTrainFolds.py:34: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.to_numpy()' or '.array' instead.
  enumerate(kf.split(df.index.get_values())):
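
The warning itself suggests the fix. A small self-contained sketch of the updated call, assuming scikit-learn's KFold as used in createTestTrainFolds.py (the DataFrame contents are illustrative):

    import pandas as pd
    from sklearn.model_selection import KFold

    df = pd.DataFrame({"utterance": ["a", "b", "c"], "intent": ["x", "y", "x"]})
    kf = KFold(n_splits=3)

    # Old, deprecated form: kf.split(df.index.get_values())
    # New form suggested by the FutureWarning:
    for fold, (train_idx, test_idx) in enumerate(kf.split(df.index.to_numpy())):
        pass  # build the train/test folds for this iteration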

Support Token-based Identity and Access Management (IAM) authentication

Per WA release notes, the authentication method will be changing for new services instances created on or after 10/30/18:

"API authentication changes
On 30 October 2018, the US South and Germany regions will transition from using Cloud Foundry to using token-based Identity and Access Management (IAM) authentication. (See Authenticating with IAM tokens for more information.)

The method used to authenticate with IAM service instances is different from the method used to authenticate with Cloud Foundry instances. Existing applications that use Cloud Foundry will continue to work. However, if you migrate a service instance or create a new service instance in a region that uses IAM, you must update the code that handles authentication. All regions are transitioning to IAM, but on a rolling schedule. For more details, see Data centers."
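
A sketch of IAM (API key) authentication with the ibm-watson Python SDK, toward which this repo later moved (see the ibm-watson SDK issue below). The version date is the one hard-coded in utils/__init__.py; the URL and other placeholders are illustrative:

    from ibm_watson import AssistantV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    # IAM authentication uses an API key instead of Cloud Foundry username/password.
    authenticator = IAMAuthenticator("<your API key>")
    assistant = AssistantV1(version="2018-07-10", authenticator=authenticator)
    assistant.set_service_url("https://gateway.watsonplatform.net/assistant/api")

    response = assistant.message(
        workspace_id="<workspace_id>",
        input={"text": "hello"},
        alternate_intents=True,
    ).get_result()
    print(response["intents"])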

implement a partial credit score system

Given an input file for a partial-credit intent scoring mapping that has the following 3 columns, where a golden intent can appear on more than one line:

Golden Intent | Partial Credit Intent | Partial Credit Intent score

Here Golden Intent is an intent from your WA, Partial Credit Intent is another intent in your WA that, when served in place of the Golden Intent, is given partial credit for being correct, and the partial credit (in the range [0,1]) is in the column Partial Credit Intent score.

Change the scoring as follows (a small sketch of the per-utterance rule appears after this list):

  • Add the partial credit mapping file as an optional parameter to run.py
  • Intent Metrics: accept the partial credit table as an optional parameter when calculating TPR and PPV
  • Add a column called score to the Schema of Results, after yes/no for the test output - a numeric value in [0,1] computed from the partial credit table; if there is no partial credit table the values are just 1 and 0.
  • Precision Curve: change correct from being the count of yes to being the sum of the scores (scores are in [0,1])
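
A minimal sketch of the per-utterance scoring rule described above, assuming the mapping file has been loaded into a dict keyed by (golden intent, partial credit intent); the names and values are illustrative:

    # partial_credit maps (golden intent, partial credit intent) -> score in [0, 1].
    partial_credit = {
        ("intent 0", "intent 1"): 0.5,
    }

    def score_prediction(golden, predicted, table=partial_credit):
        """Return 1.0 for an exact match, a partial score from the table, else 0.0."""
        if predicted == golden:
            return 1.0
        return table.get((golden, predicted), 0.0)

    # Example: full credit, partial credit, and no credit.
    print(score_prediction("intent 0", "intent 0"))  # 1.0
    print(score_prediction("intent 0", "intent 1"))  # 0.5
    print(score_prediction("intent 0", "intent 2"))  # 0.0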

Error thrown when previous blind test unavailable

In user testing it was reported that if there was no previous blind test available, the script could not continue even though this should be an optional parameter.

Suggestion: if the string for the previous result is empty, or does not resolve to a valid file, fall back to the default behavior of no previous results, after a friendly message to the terminal.
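
A sketch of the suggested fallback, assuming the parameter arrives as a string read from the config (the function name and message are illustrative):

    import os

    def load_previous_blind_out(path):
        """Return the previous results path, or None with a friendly message if unusable."""
        if not path or not os.path.isfile(path):
            print("No previous blind test results found; continuing without a comparison.")
            return None
        return path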

Need ability to visualize metrics report

The intent metrics report provides a nice summarization in table form; however, a picture is worth 1000 words (how many tables is that?)

A treemap is a logical way to visualize the metrics data:

  • SIZE of box relates to number of samples for that intent
  • COLOR of box relates to the accuracy for that intent

With a visual summary a la a treemap, it becomes visually obvious to focus on the largest, most-red boxes.
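
One possible way to render such a treemap, sketched with plotly express (not currently a dependency of this tool; the metric values and column names are illustrative):

    import pandas as pd
    import plotly.express as px

    # Hypothetical per-intent metrics, e.g. as produced by an intent metrics report.
    metrics = pd.DataFrame({
        "intent": ["intent 0", "intent 1", "intent 2"],
        "samples": [120, 45, 10],
        "accuracy": [0.95, 0.70, 0.40],
    })

    # Box SIZE = number of samples for the intent, box COLOR = accuracy for the intent.
    fig = px.treemap(metrics, path=["intent"], values="samples", color="accuracy")
    fig.show()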

Enhancement: Support Directly referencing an @Entity as an intent example

Firstly - great tool thank you.
The WA classifier now supports using entities within intent training. For instance "Can I get a @PhoneModelName? "
https://console.bluemix.net/docs/services/conversation/intents.html#defining-intents

This is very useful when reusing bot training across multiple bots, for instance "Hello @me", where the intent training can be ported across and @me / @otherpeople can be controlled as entities.

At the moment, when creating test data, the WA-Testing-Tool submits "Hello @me" without replacing the @me with a literal from the entity training. This means that these values almost always fail, even though they would pass if @me were replaced with one of the literal values.

The ask is to download the entity training along with the intent training. When the intent training trains the workspace, @entity should be passed as it is currently. But when the test set is created, @entity should be replaced with a random synonym/literal from the entity training set.
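
A minimal sketch of the test-set substitution step, assuming the entity values have already been downloaded into a dict of entity name to known literals/synonyms (the names and values are illustrative):

    import random
    import re

    # Hypothetical entity training data: entity name -> known literals/synonyms.
    entity_values = {
        "me": ["Fred", "Anna"],
        "PhoneModelName": ["Phone X", "Phone Y"],
    }

    def substitute_entities(utterance, values=entity_values):
        """Replace each @entity reference with a random literal from its training values."""
        def pick(match):
            name = match.group(1)
            return random.choice(values.get(name, [match.group(0)]))
        return re.sub(r"@(\w+)", pick, utterance)

    print(substitute_entities("Can I get a @PhoneModelName?"))
    print(substitute_entities("Hello @me"))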

Ability to extract user utterances from a running workspace

Watson Assistant provides an API to review conversation logs:
https://cloud.ibm.com/apidocs/assistant-v1#list-log-events-in-a-workspace

This log output is easily filtered and scraped for various information.

We encourage chatbot developers to monitor their application in production and to review the way users interact with the bot. One point to monitor is the responses to open-ended intent gathering questions. (This open-ended intent is usually the first input from the user)

The user utterances containing intent responses are useful for testing the performance of a chatbot or improving its training.

Given a list of intent-based utterances it is possible to:

  • Test accuracy: create a "blind" test file by adding a column with a "golden intent" for each utterance, then running WA-Testing-Tool in blind mode.
  • Improve training: inspect utterances for new intents/entities that may need to be added.
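
A sketch of pulling user utterances through the v1 logs API with the ibm-watson Python SDK; the placeholders are illustrative and the response field names follow the API docs linked above:

    from ibm_watson import AssistantV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    assistant = AssistantV1(version="2018-07-10",
                            authenticator=IAMAuthenticator("<your API key>"))
    assistant.set_service_url("<your service url>")

    # List log events for the workspace; a filter parameter can narrow the results.
    response = assistant.list_logs(workspace_id="<workspace_id>", page_limit=100).get_result()

    for event in response["logs"]:
        # Each log event records the user input sent to the workspace.
        text = event.get("request", {}).get("input", {}).get("text", "")
        if text:
            print(text)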

Mark a WA confidence score as a * on the Precision plots

Currently, the default WA confidence score for falling back to "I don't know" is 0.2. Add this threshold value to your config settings as the variable "tau", and indicate on all of the precision test results which point on the curve corresponds to the "tau" confidence score. Mark it with a * or o on the line, clearly enough to distinguish, and put the value of tau in the legend.
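
A sketch of marking the tau point on a matplotlib precision curve; the curve data, axis labels, and file name are illustrative:

    import matplotlib.pyplot as plt

    # Illustrative precision-curve data at a few confidence thresholds.
    thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    precision = [0.80, 0.84, 0.88, 0.91, 0.93, 0.95]

    tau = 0.2  # default WA fallback confidence, read from the config
    tau_index = thresholds.index(tau)

    plt.plot(thresholds, precision, label="precision curve")
    plt.plot(thresholds[tau_index], precision[tau_index], "*", markersize=14,
             label="tau = {}".format(tau))
    plt.xlabel("confidence threshold")
    plt.ylabel("precision")
    plt.legend()
    plt.savefig("precision_curve.png")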

Warn for dialog nodes that dead-end a conversation

If a node is doing a "Wait for user input", it should have at least one of the following:

  • Text output to the user
  • A context action defined
  • A webhook defined

If there is no text output, the end condition should be a jump or a "skip user input".
Else, this node is likely in error.
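
A rough sketch of how such a check could walk a workspace JSON export. The field names used here (dialog_nodes, output, generic, actions, next_step, dialog_node) are assumptions about the v1 export format and should be verified against a real export:

    import json

    # Field names below are assumptions about the v1 workspace export format.
    with open("workspace.json") as f:
        workspace = json.load(f)

    for node in workspace.get("dialog_nodes", []):
        output = node.get("output") or {}
        has_text = bool(output.get("text") or output.get("generic"))
        has_action = bool(node.get("actions"))  # e.g. webhook or context actions
        next_step = (node.get("next_step") or {}).get("behavior", "")

        if not (has_text or has_action or next_step in ("jump_to", "skip_user_input")):
            print("Possible dead-end node: {}".format(node.get("dialog_node")))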

Post-processor to correctly detect long-tail/out-of-scope utterances

Watson Assistant is frequently paired with Watson Discovery Service for handling short-tail (WA) and long-tail (WDS) questions. A common pattern is for WA to handle questions classified above a certain confidence, and to hand low-confidence utterances to WDS. The current k-folds tool does not handle this.

Challenges:

  1. Low confidence (<20%) results should not be considered correct.
    If intent[0].confidence < 0.20 then WA returns "Irrelevant". k-folds/blind should reflect this

  2. Chatbot may route low confidence (ie <50%) to "out of scope" intent
    Especially in case of chatbots using Discovery

This should be implemented as an optional post-processor, like intentmetrics.py and confusionmatrix.py.
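
A minimal sketch of such a post-processor over a results CSV, assuming columns for the predicted intent and its confidence (the column names, file paths, and tau value are placeholders):

    import pandas as pd

    TAU = 0.20  # below this confidence, WA would return "Irrelevant"

    results = pd.read_csv("kfold_union_out.csv")

    # Remap low-confidence predictions before scoring, as described above.
    low_confidence = results["confidence"] < TAU
    results.loc[low_confidence, "predicted intent"] = "Irrelevant"

    results.to_csv("kfold_union_out_postprocessed.csv", index=False)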

Feature Request: Per intent metrics on True Positive/False Positives

A user request was submitted to calculate the following per intent metrics given the results of the k-fold-union or a golden (or blind) set:

  • intent name
  • intent true positive rate = TP/(TP+FN)
  • intent positive predictive value = TP/(TP +FP)
  • intent number of samples (relevant elements)

Using the definitions here

I think we could add this as another sub-directory, like the intent descriptions, that would just contain a simple python script that inputs the "out" file from the k-fold or golden test, and outputs the per-intent metrics above in a csv.
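
A sketch of computing these per-intent metrics from the k-fold or golden "out" file with pandas, using the TP/FN/FP definitions listed above (the file and column names are placeholders):

    import pandas as pd

    results = pd.read_csv("kfold_union_out.csv")

    rows = []
    for intent in sorted(results["golden intent"].unique()):
        tp = ((results["golden intent"] == intent) & (results["predicted intent"] == intent)).sum()
        fn = ((results["golden intent"] == intent) & (results["predicted intent"] != intent)).sum()
        fp = ((results["golden intent"] != intent) & (results["predicted intent"] == intent)).sum()
        rows.append({
            "intent name": intent,
            "true positive rate": tp / (tp + fn) if (tp + fn) else 0.0,
            "positive predictive value": tp / (tp + fp) if (tp + fp) else 0.0,
            "number of samples": int((results["golden intent"] == intent).sum()),
        })

    pd.DataFrame(rows).to_csv("intent_metrics.csv", index=False)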

refactor input to only require workspace ID, and not intent/entity spreadsheets

The input of the testing tool currently accepts the intent and entity spreadsheets that are created from exporting the information from a workspace. For usability, have the configuration only contain the user/password/workspace_id for the Watson Assistant workspace and have the test tool export information from there and create new workspaces as necessary

Also, update the README.md file to reflect this change

Validate dialog syntax for integration with external systems

Watson Assistant workspaces often integrate with other systems:

  • Service Orchestration Engine (SOE) layers are used to coordinate with additional APIs. A syntax pattern exists for directing SOEs how to handle a Watson Assistant response.
  • IBM Voice Gateway is used in voice bots, with Speech to Text and Text to Speech services handling vocalizations and user response transcription. Voice Gateway expects a specific syntax. Additionally, best practices exist, such as setting speech customization parameters on each dialog node where a user response is collected.

Watson Assistant does not natively validate the syntax used to integrate with these systems. We need methodology to validate the syntax and patterns used.

Move to ibm-watson SDK

watson-developer-cloud package functionality should instead use ibm-watson package namespace
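
At the import level, the change would look roughly like this (same AssistantV1 class, new package namespace):

    # Old (watson-developer-cloud package):
    # from watson_developer_cloud import AssistantV1

    # New (ibm-watson package):
    from ibm_watson import AssistantV1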

Add the fold number to the union of the k-fold results

Add another column to the k-fold detailed results that indicates the fold in which that test data was held out for testing. This could be useful in deciding whether a particular failure was caused by bad luck, i.e. all utterances in the vicinity of that utterance were held out in the same fold.
