
WA-Testing-Tool

Scripts that run against Watson Assistant for:

  • KFOLD - k-fold cross-validation on the training set,
  • BLIND - evaluating a blind test set, and
  • TEST - testing Watson Assistant against a list of utterances.

In the case of a k-fold cross-validation or a blind test set, the tool will output a precision curve, in addition to per-intent precision and recall rates and a confusion matrix.

Features

  • Easy to set up with a single configuration file.
  • Saves state if the Assistant service goes down in the middle of processing.
  • Able to resume from where it stopped, using modularized scripts.

Prerequisites

  • Python 3.6.4+
  • Mac users: you may need to initialize Python's SSL certificate store by running Install Certificates.command found in /Applications/Python. See more here
  • Git client

Quick Start

Pre-work: cd into the projects folder where you will clone this GitHub repo. After cloning (step 1 below), cd into the WA-Testing-Tool folder.

  1. Install the code: git clone https://github.com/cognitive-catalyst/WA-Testing-Tool.git
  2. Install dependencies: pip3 install --upgrade -r requirements.txt
  3. Set up the parameters in a configuration file (ex: config.ini). Use config.ini.sample to bootstrap your configuration. A minimal example follows this list.
     a. In your terminal, copy the sample config file into a new one: cp config.ini.sample config.ini
     b. Open config.ini in your favorite text editor and fill in your actual credentials: API key, url, and workspace_id (Watson Assistant v1) or environment_id (Watson Assistant v2).
     c. Set the mode and the mode-specific parameters.
  4. Run the process: python3 run.py -c config.ini or python3 run.py -c <path to your config file>
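
For reference, here is a minimal illustrative kfold configuration, assembled from the config excerpts that appear later in this document. Parameter names, section placement (in particular for url), and the placeholder values should be checked against the current config.ini.sample:

[DEFAULT]
mode = kfold
workspace_id = <your workspace id>
fold_num = 5
temporary_file_directory = ./data
out_figure_path = ./data/figure.png
keep_workspace_after_test = no
max_test_rate = 100

[ASSISTANT CREDENTIALS]
username = apikey
password = <your API key>
url = <your Watson Assistant URL>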

Quick Update

If you have already installed this utility use these steps to get the latest code.

  1. Upgrade dependencies pip3 install --upgrade -r requirements.txt
  2. Update to latest code level git pull

Input Files

config.ini - Configuration file for run.py. This is formatted differently for each mode. Review the Examples below to explore the possible modes and how each is configured.

test_input_file.csv - Test set for blind testing and standard test.

For blind test with golden intent used for comparison:

utterance        golden intent
utterance 0      intent 0
utterance 1      intent 0
utterance 2      intent 1

For a standard test, the input must have only one column; otherwise an error is thrown:

utterance
utterance 0
utterance 1
utterance 2
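
As a reference, here is a short Python sketch (standard library only) that writes a blind-test input file in the two-column layout shown above; the file name and rows are illustrative:

    import csv

    # Illustrative rows: each utterance paired with its golden (expected) intent.
    rows = [
        ("utterance 0", "intent 0"),
        ("utterance 1", "intent 0"),
        ("utterance 2", "intent 1"),
    ]

    with open("test_input_file.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["utterance", "golden intent"])  # header row as in the layout above
        writer.writerows(rows)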

Examples

There are a variety of ways to use this tool. Primarily you will execute a k-folds, blind, or standard test.

Core execution modes

Run k-fold cross-validation

Run blind test

Run standard test without ground truth

Extended modes (executed by default)

Generate precision/recall for classification test

Generate confusion matrix for classification test

Compare two different blind test results

Extended modes

Generate description for intents

Generate long-tail classification results

Unit test dialog flows

Run syntax validation patterns on a workspace

Extract and analyze Watson Assistant log data

More examples

Long-form resources available in Article and Video form:

  • Testing a Chatbot with k-folds Cross Validation
    Article: https://medium.com/ibm-watson/testing-a-chatbot-with-k-folds-cross-validation-68dab111a6b
    Video: https://www.youtube.com/watch?v=FrhK68WyOK4
  • Analyze chatbot classifier performance from logs
    Article: https://medium.com/ibm-watson/analyze-chatbot-classifier-performance-from-logs-e9cf2c7ca8fd
    Video: https://www.youtube.com/watch?v=yd89DKyf6hc
  • Improve a chatbot classifier with production data
    Article: https://medium.com/ibm-watson/improve-a-chatbot-classifier-with-production-data-22a437f419b4
    Video: https://www.youtube.com/watch?v=ftFIQtHiQY8

Related projects

Watson Assistant is commonly paired with IBM Speech services to build voice-driven Conversational AI solutions. Check out these tools to assess and tune your speech models!

Testing Natural Language Understanding Classifier

This tool can also be used to test a trained Natural Language Understanding (NLU) Classifier. The configuration is similar to testing Watson Assistant except:

  1. Use the NLU URL in the url parameter (ex: https://api.us-south.natural-language-understanding.watson.cloud.ibm.com)
  2. Specify the <model_id> in the workspace_id parameter in the configuration
  3. Since the NLU classifier does not support downloading training data, the original training data must be provided when running in 'kfold' mode (using the train_input_file parameter)
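
An illustrative sketch of the relevant configuration lines, assuming the same config.ini layout used for Watson Assistant; the exact section placement of url and the placeholder values should be checked against config.ini.sample:

[DEFAULT]
mode = kfold
; the NLU model_id goes in the workspace_id parameter
workspace_id = <model_id>
; NLU cannot export its training data, so supply it explicitly for kfold mode
train_input_file = ./data/train.csv

[ASSISTANT CREDENTIALS]
username = apikey
password = <your API key>
; the NLU endpoint goes in the url parameter
url = https://api.us-south.natural-language-understanding.watson.cloud.ibm.com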

General Caveats and Troubleshooting

  1. Due to differences among service plans, users may need to adjust max_test_rate accordingly to avoid network connection errors.

  2. Users on Lite plans are only able to create 5 workspaces. They should set fold_num=3 in their k-fold configuration file.

  3. In case of interrupted execution, the tool may not be able to clean up the workspaces it creates. In this case you will need to manually delete the extra workspaces.

  4. Workspace ID is not the Skill ID. In the Watson Assistant user interface, the Workspace ID can be found on the Skills tab, clicking the three dots (top-right of skill), and choosing View API Details.

  5. SSL: [CERTIFICATE_VERIFY_FAILED] on Mac means you may need to initialize Python's SSL certificate store by running Install Certificates.command found in /Applications/Python. See more here

  6. "This utility used to work and now it doesn't." Upgrade to latest dependencies with pip3 install --upgrade -r requirements.txt and latest code with git pull.

  7. If you get a Python module loading error, confirm that you are using matching pip and python versions, i.e. pip3 and python3 or pip and python.

  8. Watson Assistant v2 configuration does not support k-folds mode. Watson Assistant v2 is tested "in-place" rather than creating temporary skills for this tool. Actions users may prefer to use Dialog Skill Analysis notebooks - these notebooks have additional capabilities for analyzing Dialog or Action skills.


wa-testing-tool's Issues

Provide confusion matrix

Build a confusion matrix, either per fold or as a summary of all folds.
This helps quickly identify poorly-performing intents and those intents that get confused for each other.
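
A minimal sketch of one way to do this with pandas, assuming a results CSV with golden and predicted intent columns (the file and column names here are placeholders):

    import pandas as pd

    # Hypothetical results file with one row per test utterance.
    results = pd.read_csv("kfold_union_out.csv")

    # Rows = golden intents, columns = predicted intents, cells = counts.
    matrix = pd.crosstab(results["golden intent"], results["predicted intent"])
    matrix.to_csv("confusion_matrix.csv")

    # Off-diagonal cells show which intents get confused for each other.
    print(matrix)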

Improve clarity of config.ini.sample

The configuration file has a variety of required/optional parameters and these parameters are dependent on what mode the tool is run in.

Update the sample configuration file so that it provides guidance without having to go out to the README.

Add Blind testing required config.

Blind testing requires that blind_figure_title exist in the config file. The following needs to be added to the config.ini.sample file:

blind_figure_title =

User files are tracked in git changes

  1. Normal execution modifies the data/workspace_base.json file
  2. If a user has multiple config files, e.g. config.ini.mode1 and config.ini.mode2, these are not covered by .gitignore and show up as changed/untracked files by default

For 1), move the data/workspace_base.json file to data/workspace_base.json.sample, as it appears to exist in Git only to hold a reference to the data/ directory in the first place.
For 2), update .gitignore to cover the pattern described above.

test-out file not being created

When running the Blind Test - the test-out.csv file does not get created. I am attaching my config file here. Any recommendations are welcome. Thank you - Rebecca James

[DEFAULT]
mode = BLIND
workspace_id =
test_input_file = ./data/test.csv
temporary_file_directory = ./data
; previous_blind_out = ./data/previous_blind_out.csv
test_output_path = ./data-test-out.csv
; Figure path for kfold and blind
out_figure_path= ./data/figure.png
keep_workspace_after_test = no
blind_figure_title='title'
; partial_credit_table = ./data/partial-credit-table.csv

[ASSISTANT CREDENTIALS]
username = apikey
password = 

Make the URL API an input parameter for all scripts that use a URL

As reported from a user email:

The problem is that your configuration parameters don't include the URL of the service. This means that the default of https://gateway.watsonplatform.net/assistant/api is used which is fine for all WA instances hosted in US South. However, in other regions the URL of the WA API is different e.g if the instance is based in Germany the URL of the API will be https://gateway-fra.watsonplatform.net/assistant/api and unless you give users a way to specify this they wont be able to use their code.

Solution would be add the URL to the config file, and pass it as an input parameter to all scripts that call the API

Expose the version variable to config file

The API version used in the API calls is hard-coded in utils/__init__.py:

WCS_VERSION = '2018-07-10'

As a workaround, users can edit __init__.py and change the WCS_VERSION variable, but they may forget or not know about it.

Ideally this should be exposed in the config file so that users can match/configure the testing version with the version they use in their application.

Insert a limited retry loop around the WA message API call

Something along the lines of:

def send_message(text, counter=0):
    try:
        # get the response from Watson Assistant and return it
        return call_assistant(text)  # placeholder for the actual message API call
    except Exception:
        if counter < 5:
            # retry up to 5 times on a bad response
            return send_message(text, counter + 1)
        else:
            # give up and surface the actual failure
            raise

This wrapper would go around the call to Watson Assistant to make the tool more robust to occasional bad responses.

jpg export causes problems on some systems

New python installs may see errors like the following:

File "WA-Tool/venv/lib/python3.7/site-packages/matplotlib/backend_bases.py", line 1956, in _get_output_canvas
.format(fmt, ", ".join(sorted(self.get_supported_filetypes()))))
ValueError: Format 'jpg' is not supported (supported formats: eps, pdf, pgf, png, ps, raw, rgba, svg, svgz)

'jpg' support varies by platform; some platforms require installing the additional Pillow library for jpg support.

Half of the visualizations in this tool already use 'png' which does not cause the same issues. Workarounds include:

  • Add 'pillow' as dependency (requirements.txt, pip install)
  • Change any 'jpg' exports to 'png'

k-fold test results should include "golden intent" column

The results from each of the k-fold test runs do not currently include the "golden intent" column; they only have the "predicted intent" column. For ease of analysis, the "golden intent" (the intent that the utterance belongs to in the training data) should also be output as part of the experiment.

Reduce number of required parameters through increased consistency

Several parameters need not be provided by the user - the tool can assume sensible defaults.
Ideally the user is only required to provide connection information for their workspace and the mode to run the tool in, as that is the information the tool truly cannot know otherwise.

This increases the ease of first running the tool.

From the config file

; (Required) Test request rate (maximum number of API calls per second)
max_test_rate = 100

; (Required) All temporary files will be stored here
temporary_file_directory = ./data

; (Required) yes/no on whether to keep(yes) or delete(no) workspaces created by this tool after the testing phase
keep_workspace_after_test = no

; (Required for blind and test) Test output path
test_output_path = ./data/test-out.csv

; (Required for blind and kfold) Output figure path
out_figure_path= ./data/figure.png

; (Required for kfold) Number of folds.  If on LITE plan use 3.
; Each fold creates a workspace (make sure you have enough workspaces available, LITE plans are restricted to 5)
fold_num = 5

; (Required for blind) Title for blind testing output figure
blind_figure_title = 'Blind Test Results'

max_test_rate, temporary_file_directory, and keep_workspace_after_test already have sensible defaults; we need not require the user to provide them.

test_output_path: k-folds already defaults a version of this parameter; the other modes can default to data/blind_out.csv and data/test_out.csv (k-folds should thus be updated to read this parameter). The tool already prints out the location of all files written, so the user will not be surprised by this result.

out_figure_path: A good default would be test_output_path + '.jpg'

blind_figure_title: The default above is fine, we should not require the user to provide it.
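
A minimal sketch of how such defaults could be applied with Python's configparser, assuming the existing [DEFAULT] section layout (parameter names and fallback values are taken from the excerpt above):

    import configparser

    config = configparser.ConfigParser()
    config.read("config.ini")

    # Fall back to sensible defaults when the user omits these parameters.
    max_test_rate = config["DEFAULT"].getint("max_test_rate", fallback=100)
    temp_dir = config["DEFAULT"].get("temporary_file_directory", fallback="./data")
    keep_workspace = config["DEFAULT"].getboolean("keep_workspace_after_test", fallback=False)
    test_output_path = config["DEFAULT"].get("test_output_path", fallback="./data/test_out.csv")
    out_figure_path = config["DEFAULT"].get("out_figure_path", fallback=test_output_path + ".jpg")
    blind_figure_title = config["DEFAULT"].get("blind_figure_title", fallback="Blind Test Results")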

Tool failing with the SSL: [CERTIFICATE_VERIFY_FAILED] message

The tool is failing with the following message while running on MacOS:

    raise ClientConnectorSSLError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorSSLError: Cannot connect to host gateway.watsonplatform.net:443 ssl:None [[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/victorpovar/Desktop/WA-Testing-Tool-master/utils/testConversation.py", line 194, in <module>
    func(ARGS)
  File "/Users/victorpovar/Desktop/WA-Testing-Tool-master/utils/testConversation.py", line 131, in func
    loop.run_until_complete(asyncio.gather(*tasks))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "/Users/victorpovar/Desktop/WA-Testing-Tool-master/utils/testConversation.py", line 72, in fill_df
    'alternate_intents': True}, url, sem)
RETRY
  File "/Users/victorpovar/Desktop/WA-Testing-Tool-master/utils/testConversation.py", line 55, in post
3
    print(response.status)
RETRY
UnboundLocalError: local variable 'response' referenced before assignment

Fix deprecation warning

WA-Testing-Tool/utils/createTestTrainFolds.py:34: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.to_numpy()' or '.array' instead.
  enumerate(kf.split(df.index.get_values())):
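
The warning itself suggests the fix. A small self-contained sketch of the updated call, assuming scikit-learn's KFold as used in createTestTrainFolds.py (the DataFrame contents are illustrative):

    import pandas as pd
    from sklearn.model_selection import KFold

    df = pd.DataFrame({"utterance": ["a", "b", "c"], "intent": ["x", "y", "x"]})
    kf = KFold(n_splits=3)

    # Old, deprecated form: kf.split(df.index.get_values())
    # New form suggested by the FutureWarning:
    for fold, (train_idx, test_idx) in enumerate(kf.split(df.index.to_numpy())):
        pass  # build the train/test folds for this iteration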

Support Token-based Identity and Access Management (IAM) authentication

Per WA release notes, the authentication method will be changing for new services instances created on or after 10/30/18:

"API authentication changes
On 30 October 2018, the US South and Germany regions will transition from using Cloud Foundry to using token-based Identity and Access Management (IAM) authentication. (See Authenticating with IAM tokens for more information.)

The method used to authenticate with IAM service instances is different from the method used to authenticate with Cloud Foundry instances. Existing applications that use Cloud Foundry will continue to work. However, if you migrate a service instance or create a new service instance in a region that uses IAM, you must update the code that handles authentication. All regions are transitioning to IAM, but on a rolling schedule. For more details, see Data centers."
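
A sketch of IAM (API key) authentication with the ibm-watson Python SDK, toward which this repo later moved (see the ibm-watson SDK issue below). The version date is the one hard-coded in utils/__init__.py; the URL and other placeholders are illustrative:

    from ibm_watson import AssistantV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    # IAM authentication uses an API key instead of Cloud Foundry username/password.
    authenticator = IAMAuthenticator("<your API key>")
    assistant = AssistantV1(version="2018-07-10", authenticator=authenticator)
    assistant.set_service_url("https://gateway.watsonplatform.net/assistant/api")

    response = assistant.message(
        workspace_id="<workspace_id>",
        input={"text": "hello"},
        alternate_intents=True,
    ).get_result()
    print(response["intents"])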

implement a partial credit score system

Given an input file for a partial-credit intent scoring mapping that has the following 3 columns, where a golden intent can appear on more than one line:

Golden Intent | Partial Credit Intent | Partial Credit Intent score

Here Golden Intent is an intent from your WA, Partial Credit Intent is another intent in your WA that, when served in place of the Golden Intent, is given partial credit for being correct, and the partial credit (in the range [0,1]) is in the column Partial Credit Intent score.

Change the scoring as follows (a small sketch of the per-utterance rule appears after this list):

  • Add the partial credit mapping file as an optional parameter to run.py
  • Intent Metrics: accept the partial credit table as an optional parameter when calculating TPR and PPV
  • Add a column called score to the Schema of Results, after yes/no for the test output - a numeric value in [0,1] computed from the partial credit table; if there is no partial credit table the values are just 1 and 0.
  • Precision Curve: change correct from being the count of yes to being the sum of the scores (scores are in [0,1])
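
A minimal sketch of the per-utterance scoring rule described above, assuming the mapping file has been loaded into a dict keyed by (golden intent, partial credit intent); the names and values are illustrative:

    # partial_credit maps (golden intent, partial credit intent) -> score in [0, 1].
    partial_credit = {
        ("intent 0", "intent 1"): 0.5,
    }

    def score_prediction(golden, predicted, table=partial_credit):
        """Return 1.0 for an exact match, a partial score from the table, else 0.0."""
        if predicted == golden:
            return 1.0
        return table.get((golden, predicted), 0.0)

    # Example: full credit, partial credit, and no credit.
    print(score_prediction("intent 0", "intent 0"))  # 1.0
    print(score_prediction("intent 0", "intent 1"))  # 0.5
    print(score_prediction("intent 0", "intent 2"))  # 0.0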

Error thrown when previous blind test unavailable

In user testing it was reported that if there was no previous blind test available, the script could not continue even though this should be an optional parameter.

Suggestion: if the string for the previous result is empty, or does not resolve to a valid file, fall back to the default behavior of no previous results, after a friendly message to the terminal.
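
A sketch of the suggested fallback, assuming the parameter arrives as a string read from the config (the function name and message are illustrative):

    import os

    def load_previous_blind_out(path):
        """Return the previous results path, or None with a friendly message if unusable."""
        if not path or not os.path.isfile(path):
            print("No previous blind test results found; continuing without a comparison.")
            return None
        return path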

Need ability to visualize metrics report

The intent metrics report provides a nice summarization in table form; however, a picture is worth 1000 words (how many tables is that?)

A treemap is a logical way to visualize the metrics data:

  • SIZE of box relates to number of samples for that intent
  • COLOR of box relates to the accuracy for that intent

With a visual summary a la a treemap, it becomes visually obvious to focus on the largest, most-red boxes.
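
One possible way to render such a treemap, sketched with plotly express (not currently a dependency of this tool; the metric values and column names are illustrative):

    import pandas as pd
    import plotly.express as px

    # Hypothetical per-intent metrics, e.g. as produced by an intent metrics report.
    metrics = pd.DataFrame({
        "intent": ["intent 0", "intent 1", "intent 2"],
        "samples": [120, 45, 10],
        "accuracy": [0.95, 0.70, 0.40],
    })

    # Box SIZE = number of samples for the intent, box COLOR = accuracy for the intent.
    fig = px.treemap(metrics, path=["intent"], values="samples", color="accuracy")
    fig.show()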

Enhancement: Support Directly referencing an @Entity as an intent example

Firstly - great tool thank you.
The WA classifier now supports using entities within intent training. For instance "Can I get a @PhoneModelName? "
https://console.bluemix.net/docs/services/conversation/intents.html#defining-intents

This is very useful when reusing bot training across multiple bots, for instance "Hello @me", where the intent training can be ported across and @me / @otherpeople can be controlled as entities.

At the moment, when creating test data, the WA-Testing-Tool submits "Hello @me" without replacing the @me with a literal from the entity training. This means that these values almost always fail, even though they would pass if @me were replaced with one of the literal values.

The ask is to download the entity training along with the intent training. When the intent training trains the workspace, @entity should be passed as it is currently. But when the test set is created, @entity should be replaced with a random synonym/literal from the entity training set.
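
A minimal sketch of the test-set substitution step, assuming the entity values have already been downloaded into a dict of entity name to known literals/synonyms (the names and values are illustrative):

    import random
    import re

    # Hypothetical entity training data: entity name -> known literals/synonyms.
    entity_values = {
        "me": ["Fred", "Anna"],
        "PhoneModelName": ["Phone X", "Phone Y"],
    }

    def substitute_entities(utterance, values=entity_values):
        """Replace each @entity reference with a random literal from its training values."""
        def pick(match):
            name = match.group(1)
            return random.choice(values.get(name, [match.group(0)]))
        return re.sub(r"@(\w+)", pick, utterance)

    print(substitute_entities("Can I get a @PhoneModelName?"))
    print(substitute_entities("Hello @me"))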

Ability to extract user utterances from a running workspace

Watson Assistant provides an API to review conversation logs:
https://cloud.ibm.com/apidocs/assistant-v1#list-log-events-in-a-workspace

This log output is easily filtered and scraped for various information.

We encourage chatbot developers to monitor their application in production and to review the way users interact with the bot. One point to monitor is the responses to open-ended intent gathering questions. (This open-ended intent is usually the first input from the user)

The user utterances containing intent responses are useful for testing the performance of a chatbot or improving its training.

Given a list of intent-based utterances it is possible to:

  • Test accuracy: create a "blind" test file by adding a column with a "golden intent" for each utterance, then running WA-Testing-Tool in blind mode.
  • Improve training: inspect utterances for new intents/entities that may need to be added.
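
A sketch of pulling user utterances through the v1 logs API with the ibm-watson Python SDK; the placeholders are illustrative and the response field names follow the API docs linked above:

    from ibm_watson import AssistantV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    assistant = AssistantV1(version="2018-07-10",
                            authenticator=IAMAuthenticator("<your API key>"))
    assistant.set_service_url("<your service url>")

    # List log events for the workspace; a filter parameter can narrow the results.
    response = assistant.list_logs(workspace_id="<workspace_id>", page_limit=100).get_result()

    for event in response["logs"]:
        # Each log event records the user input sent to the workspace.
        text = event.get("request", {}).get("input", {}).get("text", "")
        if text:
            print(text)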

Mark a WA confidence score as a * on the Precision plots

Currently, the default WA confidence score for falling back to "I don't know" is 0.2. Add this threshold value to your config settings as the variable "tau", and indicate on all of the precision test results which point on the curve corresponds to the "tau" confidence score. Mark it with a * or o on the line, clearly enough to distinguish, and put the value of tau in the legend.
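
A sketch of marking the tau point on a matplotlib precision curve; the curve data, axis labels, and file name are illustrative:

    import matplotlib.pyplot as plt

    # Illustrative precision-curve data at a few confidence thresholds.
    thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    precision = [0.80, 0.84, 0.88, 0.91, 0.93, 0.95]

    tau = 0.2  # default WA fallback confidence, read from the config
    tau_index = thresholds.index(tau)

    plt.plot(thresholds, precision, label="precision curve")
    plt.plot(thresholds[tau_index], precision[tau_index], "*", markersize=14,
             label="tau = {}".format(tau))
    plt.xlabel("confidence threshold")
    plt.ylabel("precision")
    plt.legend()
    plt.savefig("precision_curve.png")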

Warn for dialog nodes that dead-end a conversation

If a node is doing a "Wait for user input", it should have at least one of the following:

  • Text output to the user
  • A context action defined
  • A webhook defined

If there is no text output, the end condition should be a jump or a "skip user input".
Else, this node is likely in error.
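
A rough sketch of how such a check could walk a workspace JSON export. The field names used here (dialog_nodes, output, generic, actions, next_step, dialog_node) are assumptions about the v1 export format and should be verified against a real export:

    import json

    # Field names below are assumptions about the v1 workspace export format.
    with open("workspace.json") as f:
        workspace = json.load(f)

    for node in workspace.get("dialog_nodes", []):
        output = node.get("output") or {}
        has_text = bool(output.get("text") or output.get("generic"))
        has_action = bool(node.get("actions"))  # e.g. webhook or context actions
        next_step = (node.get("next_step") or {}).get("behavior", "")

        if not (has_text or has_action or next_step in ("jump_to", "skip_user_input")):
            print("Possible dead-end node: {}".format(node.get("dialog_node")))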

Post-processor to correctly detect long-tail/out-of-scope utterances

Watson Assistant is frequently paired with Watson Discovery Service for handling short-tail (WA) and long-tail (WDS) questions. A common pattern is for WA to handle questions classified above a certain confidence, and to hand low-confidence utterances to WDS. The current k-folds tool does not handle this.

Challenges:

  1. Low confidence (<20%) results should not be considered correct.
    If intent[0].confidence < 0.20 then WA returns "Irrelevant". k-folds/blind should reflect this

  2. Chatbot may route low confidence (ie <50%) to "out of scope" intent
    Especially in case of chatbots using Discovery

This should be implemented as an optional post-processor, like intentmetrics.py and confusionmatrix.py.
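
A minimal sketch of such a post-processor over a results CSV, assuming columns for the predicted intent and its confidence (the column names, file paths, and tau value are placeholders):

    import pandas as pd

    TAU = 0.20  # below this confidence, WA would return "Irrelevant"

    results = pd.read_csv("kfold_union_out.csv")

    # Remap low-confidence predictions before scoring, as described above.
    low_confidence = results["confidence"] < TAU
    results.loc[low_confidence, "predicted intent"] = "Irrelevant"

    results.to_csv("kfold_union_out_postprocessed.csv", index=False)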

Feature Request: Per intent metrics on True Positive/False Positives

A user request was submitted to calculate the following per intent metrics given the results of the k-fold-union or a golden (or blind) set:

  • intent name
  • intent true positive rate = TP/(TP+FN)
  • intent positive predictive value = TP/(TP +FP)
  • intent number of samples (relevant elements)

Using the definitions here

I think we could add this as another sub-directory, like the intent descriptions, that would just contain a simple python script that inputs the "out" file from the k-fold or golden test, and outputs the per-intent metrics above in a csv.
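
A sketch of computing these per-intent metrics from the k-fold or golden "out" file with pandas, using the TP/FN/FP definitions listed above (the file and column names are placeholders):

    import pandas as pd

    results = pd.read_csv("kfold_union_out.csv")

    rows = []
    for intent in sorted(results["golden intent"].unique()):
        tp = ((results["golden intent"] == intent) & (results["predicted intent"] == intent)).sum()
        fn = ((results["golden intent"] == intent) & (results["predicted intent"] != intent)).sum()
        fp = ((results["golden intent"] != intent) & (results["predicted intent"] == intent)).sum()
        rows.append({
            "intent name": intent,
            "true positive rate": tp / (tp + fn) if (tp + fn) else 0.0,
            "positive predictive value": tp / (tp + fp) if (tp + fp) else 0.0,
            "number of samples": int((results["golden intent"] == intent).sum()),
        })

    pd.DataFrame(rows).to_csv("intent_metrics.csv", index=False)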

refactor input to only require workspace ID, and not intent/entity spreadsheets

The input of the testing tool currently accepts the intent and entity spreadsheets that are created from exporting the information from a workspace. For usability, have the configuration only contain the user/password/workspace_id for the Watson Assistant workspace and have the test tool export information from there and create new workspaces as necessary

Also, update the README.md file to reflect this change

Validate dialog syntax for integration with external systems

Watson Assistant workspaces often integrate with other systems:

  • Service Orchestration Engine (SOE) layers are used to coordinate with additional APIs. A syntax pattern exists for directing SOEs how to handle a Watson Assistant response.
  • IBM Voice Gateway is used in voice bots, with Speech to Text and Text to Speech services handling vocalizations and user response transcription. Voice Gateway expects a specific syntax. Additionally, best practices exist, such as setting speech customization parameters on each dialog node where a user response is collected.

Watson Assistant does not natively validate the syntax used to integrate with these systems. We need methodology to validate the syntax and patterns used.

Move to ibm-watson SDK

watson-developer-cloud package functionality should instead use ibm-watson package namespace
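
At the import level, the change would look roughly like this (same AssistantV1 class, new package namespace):

    # Old (watson-developer-cloud package):
    # from watson_developer_cloud import AssistantV1

    # New (ibm-watson package):
    from ibm_watson import AssistantV1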

Add the fold number to the union of the k-fold results

Add another column to the k-fold detailed results that indicates the fold in which that test data was held out for testing. This could be useful in deciding whether a particular failure was caused by bad luck, i.e. all utterances in the vicinity of that utterance were held out in the same fold.
