ibm / watson-document-co-relation

Correlate text content across documents using Watson NLU, Python NLTK and Watson Studio.

Home Page: https://developer.ibm.com/patterns/watson-document-correlation/

License: Apache License 2.0

Jupyter Notebook 100.00%
jupyter-notebook natural-language-processing text-correlation document-correlation ibm-data-science-experience nlu watson-nlu ibmcode

watson-document-co-relation's Introduction

Correlation of text content across documents using Watson Natural Language Understanding, Python NLTK and IBM Data Science Experience

Data Science Experience is now Watson Studio. Although some images in this code pattern may show the service as Data Science Experience, the steps and processes will still work.

In this code pattern we will use Jupyter notebooks in IBM Data Science Experience (Watson Studio) to correlate text content across documents with the Python NLTK toolkit and IBM Watson Natural Language Understanding. The correlation algorithm is driven by an input configuration JSON that contains the rules and grammar for building the relations. The configuration JSON document can be modified to obtain better correlation results between text content across documents.

When the reader has completed this code pattern, they will understand how to:

  • Create and run a Jupyter notebook in Watson Studio.
  • Use Object Storage to access data and configuration files.
  • Use the IBM Watson Natural Language Understanding API to extract metadata from documents in Jupyter notebooks.
  • Extract and format unstructured data using simplified Python functions.
  • Use a configuration file to specify the co-reference and relations grammar.
  • Store the processed output JSON in Object Storage.
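
The flow described above can be sketched end to end in plain Python. This is an illustrative outline only: the function names and stub logic below are hypothetical stand-ins for the notebook's Watson NLU calls and rule processing, not the pattern's actual code.

```python
import json

def extract_metadata(text):
    """Hypothetical stand-in for the Watson NLU call that returns metadata."""
    tokens = text.split()
    return {"entities": tokens[:3], "keywords": tokens[-3:]}

def correlate(meta_1, meta_2, config):
    """Relate two documents' metadata using the configured rules (stub)."""
    shared = set(meta_1["keywords"]) & set(meta_2["keywords"])
    return {"shared_keywords": sorted(shared), "rules": config["rules"]}

config = {"rules": ["co-reference", "relations"]}       # illustrative config
result = correlate(extract_metadata("watson analyzes unstructured text"),
                   extract_metadata("nltk tokenizes unstructured text"),
                   config)
output_json = json.dumps(result)   # in the pattern, stored in Object Storage
```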

The intended audience for this code pattern is developers who want to learn a method for correlation of text content across documents. The distinguishing factor of this code pattern is that it allows a configurable mechanism of text correlation.

Included components

  • IBM Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.

  • IBM Cloud Object Storage: An IBM Cloud service that provides an unstructured cloud data store to build and deliver cost effective apps and services with high reliability and fast speed to market.

  • Watson Natural Language Understanding: An IBM Cloud service that analyzes text to extract metadata from content, such as concepts, entities, keywords, categories, sentiment, emotion, relations, and semantic roles.

Featured technologies

  • Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Watch the Video

Steps

Follow these steps to set up and run this code pattern. The steps are described in detail below.

  1. Sign up for Watson Studio
  2. Create IBM Cloud services
  3. Create the notebook
  4. Add the data and configuration file
  5. Update the notebook with service credentials
  6. Run the notebook
  7. Analyze the results

1. Sign up for Watson Studio

Sign up for IBM's Watson Studio. When you create a project in Watson Studio, a free-tier Object Storage service is created in your IBM Cloud account. Take note of your service names, as you will need to select them in the following steps.

Note: When creating your Object Storage service, select the Free storage type to avoid paying an upgrade fee.

2. Create IBM Cloud services

Create the Watson Natural Language Understanding service in IBM Cloud and name it wdc-NLU-service.

3. Create the notebook

4. Add the data and configuration file

Add the data and configuration to the notebook

  • From the My Projects > Default page, use Find and Add Data (look for the 10/01 icon) and its Files tab.
  • Click browse and navigate to data/sample_text_1.txt in this repo.
  • Click browse and navigate to data/sample_text_2.txt in this repo.
  • Click browse and navigate to configuration/sample_config.txt in this repo.

Note: It is possible to use your own data and configuration files. If you use a configuration file from your computer, make sure to conform to the JSON structure given in configuration/sample_config.txt.

Fix-up file names for your own data and configuration files

If you use your own data and configuration files, you will need to update the variables that refer to the data and configuration files in the Jupyter Notebook.

In the notebook, update the global variables in the cell following 2.3 Global Variables section.

Replace sampleTextFileName1 and sampleTextFileName2 with the names of your data files, and sampleConfigFileName with the name of your configuration file.
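
A sketch of that cell, using the sample file names from this pattern (swap in your own names if you uploaded different files):

```python
# Global variables for the input files; these are the sample names
# shipped with the code pattern. Replace them with your own file names
# if you uploaded different data or configuration files.
sampleTextFileName1 = "sample_text_1.txt"
sampleTextFileName2 = "sample_text_2.txt"
sampleConfigFileName = "sample_config.txt"
```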

5. Update the notebook with service credentials

Add the Watson Natural Language Understanding credentials to the notebook

Select the cell below 2.1 Add your service credentials from IBM Cloud for the Watson services section in the notebook to update the credentials for Watson Natural Language Understanding.

Open the Watson Natural Language Understanding service in your IBM Cloud Dashboard and click on your service, which you should have named wdc-NLU-service.

Once the service is open click the Service Credentials menu on the left.

In the Service Credentials view that opens, select the credentials you would like to use in the notebook from the KEY NAME column. Click View credentials and copy the username and password key values that appear in JSON format.

Update the username and password key values in the cell below 2.1 Add your service credentials from IBM Cloud for the Watson services section.
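
The credentials cell might look like the sketch below; the dictionary name is an assumption, and the values are placeholders to be replaced with the username and password copied from your wdc-NLU-service credentials.

```python
# Hypothetical shape of the 2.1 credentials cell; the variable name is
# illustrative. Replace the placeholder values with the username and
# password copied from your service's Service Credentials page.
natural_language_understanding_credentials = {
    "username": "<your-NLU-username>",
    "password": "<your-NLU-password>",
}
```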

Add the Object Storage credentials to the notebook

  • Select the cell below 2.2 Add your service credentials for Object Storage section in the notebook to update the credentials for Object Store.

  • Delete the contents of the cell

  • Use Find and Add Data (look for the 10/01 icon) and its Files tab. You should see the file names uploaded earlier. Make sure your active cell is the empty one below 2.2 Add...

  • Select Insert to code (below your sample_text_1.txt).

  • Click Insert Credentials from drop down menu.

  • Make sure the credentials are saved as credentials_1.

6. Run the notebook

When a notebook is executed, each code cell in the notebook runs in order, from top to bottom.

IMPORTANT: The first time you run your notebook, you will need to install the necessary packages in section 1.1 and then Restart the kernel.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:

  • Blank: the cell has never been executed.
  • A number: the relative order in which the cell was executed.
  • A *: the cell is currently executing.

There are several ways to execute the code cells in your notebook:

  • One cell at a time.
    • Select the cell, and then press the Play button in the toolbar.
  • Batch mode, in sequential order.
    • From the Cell menu bar, there are several options available. For example, you can Run All cells in your notebook, or you can Run All Below, which starts executing from the first cell under the currently selected cell and then continues through all cells that follow.
  • At a scheduled time.
    • Press the Schedule button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.

7. Analyze the results

After running each cell of the notebook under Correlate text, the results are displayed.

The document similarity score is computed using the cosine distance function in the NLTK module. The document similarity results can be enhanced by adding to the stop words or text tags. Words added to the stop words are ignored during comparison. Word tags from the Watson text classifier, or any custom tags added, are taken into account in the comparison.
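
As a rough illustration of the idea (not the notebook's actual implementation), cosine similarity over term-frequency vectors can be computed in plain Python; the tiny stop-word set below stands in for NLTK's English stop-word list.

```python
import math
from collections import Counter

# A tiny stand-in for NLTK's English stop-word list.
STOP_WORDS = {"the", "this", "that", "a", "an", "is", "of", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def cosine_similarity(text_a, text_b):
    """Cosine similarity of the two texts' term-frequency vectors."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(vec_a[w] * vec_b[w] for w in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

score = cosine_similarity("Watson extracts entities from text",
                          "Watson extracts keywords from documents")
# 3 shared terms / (sqrt(5) * sqrt(5)) = 0.6
```

Adding words to STOP_WORDS removes them from both vectors, which is why extending the stop-word list changes the similarity score.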

The configuration JSON controls the way the text is correlated. The correlation involves two aspects: co-referencing and relation determination. The configuration JSON contains the rules and grammar for co-referencing and determining relations. The output from Watson Natural Language Understanding and the Python NLTK toolkit is processed based on these rules and grammar to produce the correlation of content across documents.

We can modify the configuration JSON to add more rules and grammar for co-referencing and determining relations. The text content correlation results can be enhanced without changes to the code.
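
For illustration, a configuration of this kind might be shaped as follows; the keys here are hypothetical, and the real structure is the one given in configuration/sample_config.txt.

```python
import json

# Illustrative shape of a correlation configuration; every key below is
# hypothetical -- see configuration/sample_config.txt for the real structure.
sample_config = {
    "coreference": {
        "pronouns": ["he", "she", "it", "they"],
    },
    "relations": {
        "grammar": ["NP: {<DT>?<JJ>*<NN>}"],   # an NLTK chunk-grammar rule
    },
    "stop_words": ["The", "This", "That"],
}

config_text = json.dumps(sample_config, indent=2)
```

Because the rules live in data rather than code, tuning the correlation means editing this file and re-running the notebook.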

In section 6. Visualize correlated text of the notebook, we can see the correlations between the text in the two sample documents that we provided. The output is the output from Watson Natural Language Understanding, augmented with the relationships extracted using the rules methodology explained in this pattern.

In addition, the similarity between the two sample texts that we provided is computed in notebook section 5. Correlate text. The similarity score between the two sample texts is 0.790569415042.

Other scenarios and use cases for which a solution can be built using the above methodology

See USECASES.md.

Related links

Mine insights from software development artifacts

Get insights on personal finance data

Troubleshooting

See DEBUGGING.md.

License

This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.

Apache Software License (ASL) FAQ

watson-document-co-relation's People

Contributors

balajikadambi, dolph, imgbot[bot], kant, ljbennett62, markstur, neha-setia, scottdangelo, stevemart


watson-document-co-relation's Issues

Error when executing notebook

Getting the error - TypeError: <watson_developer_cloud.natural_language_understanding.features.v1.Entities object at 0x7f15027c2dd0> is not JSON serializable

Missing version in cell 2.1

The notebook cell for 2.1 does not contain a version, nor does it specify how or which version to add.
I believe that is why my notebook fails with missing data in cell #5:


error in 4.1 section

KeyErrorTraceback (most recent call last)
in ()
----> 1 auth_url = credentials_1['auth_url'] + "/v3"
      2 container = credentials_1['container']
      3 IBM_Objectstorage_Connection = swiftclient.Connection(
      4     key=credentials_1['password'], authurl=auth_url, auth_version='3', os_options={
      5         "project_id": credentials_1['project_id'], "user_id": credentials_1['user_id'], "region_name": credentials_1['region']})

KeyError: 'auth_url'

Add feature to find text similarity score

In addition to extracting entities and correlating them to relate two text contents, a feature that computes text similarity using well-known algorithms such as cosine similarity would help show the extent to which two text contents are similar.

Cleanup repo

We have been cleaning up our repositories in light of changes and to take care of technical debt.
Please have a look at these items and apply them as needed to this repo.
Please create a pull request for these changes.

  • README (Sign-up name in checklist below)
  • Replace "Bluemix" with "IBM Cloud"
  • Change "Journey" to "Code Pattern"
  • Ensure there are steps after "Flow"
  • Cleanup old READMEs to align with new template
  • Double check links section syncs up with IBMCode site
  • Add blog? No
  • Add video? Yup
  • Remove "With Watson" blocks
  • Add "Learn More" sections: https://gist.github.com/dolph/8d5f1c34eb6f46f275563dbe9abdef14

Template for Code Pattern Readme:
https://github.ibm.com/developer-journeys/journey-docs/blob/master/_content/resources/templates/readme_template.md

Example commit:
IBM/watson-assistant-slots-intro@f406290

storage setup

After signing up for DSX, the Spark and Object Storage services are created, but there is no option to set up Object Storage with the Swift API.

Error in 2.3

# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials_1 = {
    'IBM_API_KEY_ID': 'SwrwslFTRrPQZhBRcfxEMUppNrtpGiEBvaWOMtQUr6iO',
    'IAM_SERVICE_ID': 'iam-ServiceId-0e2b1a61-1eef-4c13-9a50-a765df76f36b',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.ng.bluemix.net/oidc/token',
    'BUCKET': 'default2232ab05ca4324c5daa253e07d89dc38a',
    'FILE': 'sample_text_1.txt'
}

# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials_2 = {
    'IBM_API_KEY_ID': 'SwrwslFTRrPQZhBRcfxEMUppNrtpGiEBvaWOMtQUr6iO',
    'IAM_SERVICE_ID': 'iam-ServiceId-0e2b1a61-1eef-4c13-9a50-a765df76f36b',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.ng.bluemix.net/oidc/token',
    'BUCKET': 'default2232ab05ca4324c5daa253e07d89dc38a',
    'FILE': 'sample_text_2.txt'
}

The above cell executes properly, but I am facing issues when running the next cell. Kindly provide a solution as early as possible.

# Specify file names for sample text and configuration files
sampleTextFileName1 = "sample_text_1.txt"
sampleTextFileName2 = "sample_text_2.txt"
sampleConfigFileName = "sample_config.txt"

# Maintain tagged text and plain text map
tagTextMap = {}

# Stop words
stopWords = stopwords.words('english')

# Additional words to be ignored
stopWords.extend(["The","This","That",".","!","?"])

The error I am getting:

LookupError Traceback (most recent call last)
/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/nltk/corpus/util.py in __load(self)
79 except LookupError as e:
---> 80 try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
81 except LookupError: raise e

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/nltk/data.py in find(resource_name, paths)
672 resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
--> 673 raise LookupError(resource_not_found)
674

LookupError:


Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('stopwords')

Searched in:
- '/home/dsxuser/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/opt/conda/envs/DSX-Python35/nltk_data'
- '/opt/conda/envs/DSX-Python35/lib/nltk_data'


During handling of the above exception, another exception occurred:

LookupError Traceback (most recent call last)
in ()
8
9 # Stop words
---> 10 stopWords = stopwords.words('english')
11 # Additional words to be ignored
12 stopWords.extend(["The","This","That",".","!","?"])

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/nltk/corpus/util.py in getattr(self, attr)
114 raise AttributeError("LazyCorpusLoader object has no attribute 'bases'")
115
--> 116 self.__load()
117 # This looks circular, but its not, since __load() changes our
118 # class to something new:

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/nltk/corpus/util.py in __load(self)
79 except LookupError as e:
80 try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
---> 81 except LookupError: raise e
82
83 # Load the corpus.

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/nltk/corpus/util.py in __load(self)
76 else:
77 try:
---> 78 root = nltk.data.find('{}/{}'.format(self.subdir, self.__name))
79 except LookupError as e:
80 try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/nltk/data.py in find(resource_name, paths)
671 sep = '*' * 70
672 resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
--> 673 raise LookupError(resource_not_found)
674
675

LookupError:


Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('stopwords')

Searched in:
- '/home/dsxuser/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/opt/conda/envs/DSX-Python35/nltk_data'
- '/opt/conda/envs/DSX-Python35/lib/nltk_data'


Error under item 5.Correlate text

Hi Balaji,
Thanks for your earlier response. I converted the PDF files into .txt files and ran the code, but it is still throwing an error under item 5, Correlate text. The error has to do with sampleConfigFileName. Kindly explain what the sample config file name is, and which .txt file I have to provide there.
Also, please specify what this error is. Is there any other .txt file that I need to provide under sampleConfigFileName? If yes, then which?
The error says:
ClientException: Object GET failed: https://dal.objectstorage.open.softlayer.com/v1/AUTH_feddc56178904edabf6d7bf96b773f5f/DefaultProjectrishank1312gmailcomc3di/sample_config.txt 404 Not Found [first 60 chars of response]

Not Found

The resource could not be found.
Regards
Rishank
