
spark-nlp-workshop's Introduction

John Snow Labs: State-of-the-art NLP in Python

The John Snow Labs library provides a simple & unified Python API for delivering enterprise-grade natural language processing solutions:

  1. 15,000+ free NLP models in 250+ languages, usable in one line of code. Production-grade, scalable, trainable, and 100% open source.
  2. Open-source libraries for Responsible AI (NLP Test), Explainable AI (NLP Display), and No-Code AI (NLP Lab).
  3. 1,000+ healthcare NLP models and 1,000+ legal & finance NLP models with a John Snow Labs license subscription.

Homepage: https://www.johnsnowlabs.com/

Docs & Demos: https://nlp.johnsnowlabs.com/

Features

Powered by John Snow Labs Enterprise-Grade Ecosystem:

  • 🚀 Spark-NLP : State of the art NLP at scale!
  • 🤖 NLU : 1 line of code to conquer NLP!
  • 🕶 Visual NLP : Empower your NLP with a set of eyes!
  • 💊 Healthcare NLP : Heal the world with NLP!
  • ⚖ Legal NLP : Bring justice with NLP!
  • 💲 Finance NLP : Understand Financial Markets with NLP!
  • 🎨 NLP-Display : Visualize and Explain NLP!
  • 📊 NLP-Test : Deliver Reliable, Safe and Effective Models!
  • 🔬 NLP-Lab : No-Code Tool to Annotate & Train new Models!

Installation

! pip install johnsnowlabs

from johnsnowlabs import nlp
nlp.load('emotion').predict('Wow that was easy!')

See the documentation for more details.

Usage

These are examples of getting things done with one line of code. See the General Concepts Documentation for building custom pipelines.

# Example of Named Entity Recognition
nlp.load('ner').predict("Dr. John Snow is a British physician born in 1813")

Returns :

entities entities_class entities_confidence
John Snow PERSON 0.9746
British NORP 0.9928
1813 DATE 0.5841
# Example of Question Answering 
nlp.load('answer_question').predict("What is the capital of France")

Returns :

text answer
What is the capital of France Paris
# Example of Sentiment classification
nlp.load('sentiment').predict("Well this was easy!")

Returns :

text sentiment_class sentiment_confidence
Well this was easy! pos 0.999901
nlp.load('ner').viz('Bill goes to New York')

Returns:
(ner_viz_opensource: rendered NER visualization)

For a full overview see the 1-liners Reference and the Workshop.
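For cases where a single 1-liner is not flexible enough, a pipeline can also be assembled explicitly with the classic sparknlp API that the johnsnowlabs package wraps. A minimal sketch; the annotators and column names below are illustrative, not a prescribed setup:

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Turn raw text into a document annotation, then tokenize and normalize it
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normal")

pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer])

df = spark.createDataFrame([["Wow that was easy!"]], ["text"])
pipeline.fit(df).transform(df).select("normal.result").show(truncate=False)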

Use Licensed Products

To use John Snow Labs' paid products like Healthcare NLP, Visual NLP, Legal NLP, or Finance NLP, get a license key and then call nlp.install() to use it:

! pip install johnsnowlabs
# Install paid libraries via a browser login to connect to your account
from johnsnowlabs import nlp
nlp.install()
# Start a licensed session
nlp.start()
nlp.load('en.med_ner.oncology_wip').predict("Woman is on  chemotherapy, carboplatin 300 mg/m2.")

Usage

These are examples of getting things done with one line of code. See the General Concepts Documentation for building custom pipelines.

# visualize entity resolution ICD-10-CM codes 
nlp.load('en.resolve.icd10cm.augmented') \
    .viz('Patient with history of prior tobacco use, nausea, nose bleeding and chronic renal insufficiency.')

returns:
ner_viz_opensource

# Temporal Relationship Extraction & Visualization
nlp.load('relation.temporal_events')\
    .viz('The patient developed cancer after a mercury poisoning in 1999 ')

returns: relationv_viz

Helpful Resources

Take a look at the official Johnsnowlabs page: https://nlp.johnsnowlabs.com for user documentation and examples.

Resources and descriptions:

  • General Concepts : General concepts in the Johnsnowlabs library
  • Overview of 1-liners : Most commonly used models and their results
  • Overview of 1-liners for healthcare : Most commonly used healthcare models and their results
  • Overview of all 1-liner Notebooks : 100+ tutorials on how to use the 1-liners on text datasets for various problems and from various sources like Twitter, Chinese news, crypto news headlines, airline traffic communication, product review classifier training, and more
  • Connect with us on Slack : Problems, questions or suggestions? We have a very active and helpful community of 2,000+ AI enthusiasts putting Johnsnowlabs products to good use
  • Discussion Forum : Want a more in-depth discussion with the community? Post a thread in our discussion forum
  • Github Issues : Report a bug
  • Custom Installation : Custom installations, Air-Gap mode and other alternatives
  • The nlp.load(<Model>) function : Load any model or pipeline in one line of code
  • The nlp.load(<Model>).predict(data) function : Predict on strings, lists of strings, Numpy arrays, Pandas, Modin and Spark DataFrames (see the sketch after this list)
  • The nlp.load(<train.Model>).fit(data) function : Train a text classifier for 2-class, N-class or multi-N-class problems, Named-Entity-Recognition, or Part-of-Speech tagging
  • The nlp.load(<Model>).viz(data) function : Visualize the results of Word Embedding Similarity Matrix, Named Entity Recognizers, Dependency Trees & Parts of Speech, Entity Resolution, Entity Linking or Entity Status Assertion
  • The nlp.load(<Model>).viz_streamlit(data) function : Display an interactive GUI which lets you explore and test every model and feature in the Johnsnowlabs 1-liner repertoire in one click
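As referenced in the predict(data) entry above, a hedged example of predicting over a pandas DataFrame; the text column name is an assumption about the input layout:

import pandas as pd
from johnsnowlabs import nlp

# A small pandas DataFrame with a 'text' column (assumed column name)
df = pd.DataFrame({"text": ["I love this library", "This is terrible"]})

# predict() also accepts plain strings, lists of strings, Numpy arrays,
# Modin and Spark DataFrames; the result comes back as a pandas DataFrame
predictions = nlp.load("sentiment").predict(df)
print(predictions)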

License

This library is licensed under the Apache 2.0 license. John Snow Labs' paid products are subject to this End User License Agreement.
By calling nlp.install() to add them to your environment, you agree to its terms and conditions.

spark-nlp-workshop's People

Contributors

ahmet-mesut, ahmetemintek, akrztrk, albertoandreottiatgmail, arshaannazir, aydinmyilmaz, bunyamin-polat, c-k-loan, cabir40, damla-gurbaz, dcecchini, dependabot[bot], diatrambitas, digaari, egenc, gadde5300, galiph, gokhanturer, hashamulhaq, hsaglamlar, josejuanmartinez, kshitizgit, luca-martial, mary-sci, maziyarpanahi, meryem1425, muhammetsnts, murat-gunay, prikshit7766, vkocaman


spark-nlp-workshop's Issues

sparknlp.eval module not found

ModuleNotFoundError: No module named 'sparknlp.eval'

Description

I have sparknlp set up and I don't have any issues using other modules in sparknlp. Using from sparknlp.eval import * gives me an error.

Steps to Reproduce

  1. pip install sparknlp==2.5.4
  2. from sparknlp.eval import *

AnalysisException: Reference 'bert-embedding' is ambiguous, could be: bert-embedding, bert-embedding.

AnalysisException Traceback (most recent call last)
in ()
----> 1 predictions = ner_model_bert.transform(test_data)

5 frames
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise

AnalysisException: Reference 'bert-embedding' is ambiguous, could be: bert-embedding, bert-embedding.
I am getting this error when I am trying NER using BertEmbeddings with pyspark==3.1.1 and spark-nlp==3.0.2. My code was working fine with the previous version of pyspark. Can you please help me out?
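Not an official fix, but the ambiguity usually means the input DataFrame already carries a column with the same name the pipeline is about to produce. A workaround sketch that drops the pre-existing column before transforming (the column name is taken from the error message above):

# Drop the pre-existing 'bert-embedding' column so the pipeline output is the
# only column with that name
if 'bert-embedding' in test_data.columns:
    test_data = test_data.drop('bert-embedding')

predictions = ner_model_bert.transform(test_data)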

IllegalArgumentException: 'requirement failed: License Key not set please set environment variable JSL_OCR_LICENSE or property jsl.sparkocr.settings.license!'

License Key not set

The above exception occurs when I play with tutorials/Certification_Trainings/Healthcare/5.Spark_OCR.ipynb

  1. start running this notebook on colab
  2. uploading my licence key
  3. Spark NLP and Spark OCR are able to run correctly
  4. It fails at this cell: result = pipeline().transform(pdf_example_df).cache()

Py4JJavaError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:

7 frames
/usr/local/lib/python3.7/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:

Py4JJavaError: An error occurred while calling None.com.johnsnowlabs.ocr.transformers.PdfToImage.
: java.lang.IllegalArgumentException: requirement failed: License Key not set please set environment variable JSL_OCR_LICENSE or property jsl.sparkocr.settings.license!
at scala.Predef$.require(Predef.scala:224)
at com.johnsnowlabs.license.LicenseValidator$.checkLicense(LicenseValidator.scala:44)
at com.johnsnowlabs.license.LicenseValidator$.isValidLicense$lzycompute(LicenseValidator.scala:23)
at com.johnsnowlabs.license.LicenseValidator$.isValidLicense(LicenseValidator.scala:23)
at com.johnsnowlabs.license.Licensed$class.$init$(Licensed.scala:4)
at com.johnsnowlabs.ocr.transformers.PdfToImage.(PdfToImage.scala:39)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

During handling of the above exception, another exception occurred:

IllegalArgumentException Traceback (most recent call last)
in ()
----> 1 result = pipeline().transform(pdf_example_df)#.cache()

in pipeline()
2
3 # Transforrm PDF document to images per page
----> 4 pdf_to_image = PdfToImage() .setInputCol("content") .setOutputCol("image")
5
6 # Run OCR

/usr/local/lib/python3.7/dist-packages/pyspark/init.py in wrapper(self, *args, **kwargs)
108 raise TypeError("Method %s forces keyword arguments." % func.name)
109 self._input_kwargs = kwargs
--> 110 return func(self, **kwargs)
111 return wrapper
112

/root/.local/lib/python3.7/site-packages/sparkocr/transformers/pdf/pdf_to_image.py in init(self)
69 """
70 super(PdfToImage, self).init()
---> 71 self._java_obj = self._new_java_obj("com.johnsnowlabs.ocr.transformers.PdfToImage", self.uid)
72 self._setDefault(outputCol='image')
73

/usr/local/lib/python3.7/dist-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
65 java_obj = getattr(java_obj, name)
66 java_args = [_py2java(sc, arg) for arg in args]
---> 67 return java_obj(*java_args)
68
69 @staticmethod

/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in call(self, *args)
1523 answer = self._gateway_client.send_command(command)
1524 return_value = get_return_value(
-> 1525 answer, self._gateway_client, None, self._fqn)
1526
1527 for temp_arg in temp_args:

/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco

IllegalArgumentException: 'requirement failed: License Key not set please set environment variable JSL_OCR_LICENSE or property jsl.sparkocr.settings.license!'
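The exception itself names the missing configuration. A minimal workaround sketch, assuming the Spark OCR license string is available in a license_key variable (a placeholder for the key from your John Snow Labs account); it must be set before the Spark OCR session is created:

import os

# Placeholder: license_key holds the Spark OCR license string from your account
os.environ['JSL_OCR_LICENSE'] = license_key

# Recreate the Spark OCR session after setting the variable so the JVM-side
# license check can pick it up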

Your Environment

  • Spark-NLP version: 2.7.4
  • Apache Spark version: 2.4.4
  • Operating System and version: Colab
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.):

Valid Certification Path Error in "How to use Light Pipelines"

Running spark-nlp lab "1- How to use Light Pipelines ".
Downloaded latest container April 5 2019.
Execute line:
pipeline = PretrainedPipeline('explain_document_ml', lang='en')

Fails with certificate error:
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline.
: com.amazonaws.AmazonClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:454)
etc...

Your Environment

sudo docker pull johnsnowlabs/spark-nlp-workshop
Using default tag: latest
latest: Pulling from johnsnowlabs/spark-nlp-workshop
Digest: sha256:d681c309eb52bb5af42171485bca246d1c311608d37b75773a2af57979be7368
Status: Image is up to date for johnsnowlabs/spark-nlp-workshop:latest

  • Operating System and version: CentOS Linux release 7.6.1810 (Core)
  • Deployment: Docker

Cannot import com.johnsnowlabs.nlp.annotators.ner.NerConverterInternal;

Description

I tried to run the Java code NerConverterInternalFiltererExample, but I couldn't import com.johnsnowlabs.nlp.annotators.ner.NerConverterInternal or create a NerConverterInternal instance.

Steps to Reproduce

Your Environment

  • Spark-NLP version: 3.1.2
  • Apache Spark version: 3.1.2
  • Operating System and version:
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Azure Databricks Spark Job.

Error using PretrainedPipeline: AmazonS3Exception: AccessDenied

Description

Hello, I'm trying the example "running pre-trained models" from spark-nlp workshop, but it looks like I don't have access to the pre-trained models.

I'm using PretrainedPipeline("pipeline_basic") to annotate a String, but I get the following exception:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2CD94D6F322010B4, AWS Error Code: AccessDenied, AWS Error Message: Access Denied, S3 Extended Request ID: 5QJUS9/TG7MW8YBtzLnbNTP/lKjbQp5LW85K9w60OiNNYFHh0mREPdJsurpomDNnmreS5LDWTHk=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:984)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:66)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:77)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.download(S3ResourceDownloader.scala:89)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadResource(ResourceDownloader.scala:101)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:133)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:128)
at com.johnsnowlabs.nlp.pretrained.PretrainedPipeline.model$lzycompute(PretrainedPipeline.scala:14)
at com.johnsnowlabs.nlp.pretrained.PretrainedPipeline.model(PretrainedPipeline.scala:13)
at com.johnsnowlabs.nlp.pretrained.PretrainedPipeline.annotate(PretrainedPipeline.scala:19)
at App.getPretrainedModels(App.java:65)
at App.main(App.java:37)

Do I have to provide AWS API keys? If yes, how do I do that?

Your Environment

  • Spark-NLP version: 2.0.1
  • Apache Spark version: 2.3.1
  • Operating System and version: Windows 7
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Scala

dl-ner.ipynb incorrect start and download with `pipeline_fast_dl`

PipelineModel with stages does not load

Steps to Reproduce

  1. Pull and run the docker
  2. Run notebook https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/model-downloader/dl-ner.ipynb
  3. Try to launch the cell №3. You will get an exception on spark.createDataFrame saying that spark is not available. There is a quick fix: in cell №2 you need to change sparknlp.start() to spark = sparknlp.start() (see the sketch just below this list). After that you can proceed forward.
  4. Then try to launch the cell №4. You will get an exception saying that the resource failed to download. Therefore, pipeline_fast_dl will not be initialized.
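The quick fix from step 3, as a short sketch:

import sparknlp

# Keep a reference to the SparkSession so later cells can call
# spark.createDataFrame(...)
spark = sparknlp.start()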

Your Environment

  • Spark-NLP version: 2.0.3
  • Apache Spark version: 2.4.1
  • Operating System and version: The latest Docker
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): I have pulled the latest docker as described on the main page.

Examples in quickstart use outdated OcrHelper

Using guide of https://nlp.johnsnowlabs.com/docs/en/quickstart, most steps work, except when using OCR to read PDFs

Description

  1. adding the coordinates to spark-shell complains about missing javax.media.jai. It seems that the artifact is missing on Maven Central. To get this working, I had to add a repository:
    bin/spark-shell --packages JohnSnowLabs:spark-nlp:2.0.4,com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.8 --repositories https://repo.spring.io/plugins-release
  2. the OcrHelper class does not have a companion object (any more?).
    val data = OcrHelper.createDataset(spark, "/pdfs/", "text", "metadata") does not work.
    Instead, you need to instantiate a new object, and also createDataset does not exist as mentioned above. See https://github.com/JohnSnowLabs/spark-nlp/blob/master/ocr/src/main/scala/com/johnsnowlabs/nlp/util/io/OcrHelper.scala

Steps to Reproduce

Your Environment

Ubuntu linux, spark 2.4.3

  • Spark-NLP version:
  • Apache Spark version:
  • Operating System and version:
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.):

CoNLL reader function setting document = sentence

The CoNLL().readDataset() is not working as expected. The document is equal to the sentence and is not being built by the -DOCSTART- -X- -X- O flag.

I am not sure if this issue will affect the training of a NERDL model. However, it makes it impossible to refer back to a specific document (not sentence) where the entity is detected. To reproduce the example you can go through the example notebook provided here : https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/dl-ner/ner_dl.ipynb

and inspect the columns after reading the CoNLL data.

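A minimal sketch for inspecting the document and sentence columns after reading a CoNLL file (eng.train is a placeholder path):

from sparknlp.training import CoNLL

# One row per sentence is produced; compare the document and sentence
# annotations to check whether -DOCSTART- boundaries were honoured
training_data = CoNLL().readDataset(spark, 'eng.train')
training_data.selectExpr("document.result", "sentence.result").show(5, truncate=80)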

Your Environment

  • Spark-NLP version: 3.1.2
  • Apache Spark version: 3.1.2
  • Operating System and version: MacOS 11.4
  • Deployment (local Jupyter notebook):

Java example is not runnable - no pom.xml or build.gradle, missing imports and more

Description

The Java example only contains one file. It starts with a DocumentAssembler instance creation, but without any import of the DocumentAssembler class. There is also an import of the EmbeddingHelper class, which is not available in the "com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.2" dependency.

The folder doesn't contain any build.gradle file or pom.xml, which makes the process much harder (and makes me wonder if the class was even tested).
I'd like to get an explanation of how to run this because the documentation also doesn't contain any Java references.

Missing column in Deep Learning NER example

Description

In ner_bert.ipynb, only the "sentence" column is set as input to BertEmbeddings. The "token" column is also required.

Steps to Reproduce

Run the notebook up to the bert.transform cell. With the current BertEmbeddings cell

bert = BertEmbeddings.pretrained() \
    .setInputCols(["sentence"]) \
    .setOutputCol("bert") \
    .setCaseSensitive(False)

the training will fail. Changing the cell to

bert = BertEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert") \
    .setCaseSensitive(False)

allows the training to finish.

Your Environment

  • Spark-NLP version: 2.4.5
  • Apache Spark version: 2.4.5
  • Operating System and version: Mint Linux, compatible with Ubuntu 18.04.
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Jupyter

Sentiment Analysis, misclassification

Pipeline from Sentiment_rb.ipynb misclassifies obvious sentences

Steps to Reproduce

  1. Open https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/annotation/english/dictionary-sentiment
  2. Load the pipeline analyze_sentiment_ml
  3. Try to annotate Harry Potter is a good movie. You will see that sentiment is positive. That's correct.
  4. Try to annotate Harry Potter is a bad movie. You will see that sentiment is still positive. That's a mistake.
  5. Also, try to annotate Harry Potter. The model will classify it as negative :)

Your Environment

  • Spark-NLP version: 2.0.3
  • Apache Spark version: 2.4.1
  • Operating System and version: Docker
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): I have tried your actual Docker container.

Py4JJavaError : An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel. : java.lang.UnsupportedOperationException: empty collection

Hi,

I am facing this error when I am trying to load the Clinical Word Embedding Model. The same code used to run earlier with no errors in the same environment.

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl

word_embeddings = WordEmbeddingsModel.pretrained('embeddings_healthcare', 'en', 'clinical/models') \
    .setInputCols(['sentence', 'token']) \
    .setOutputCol('embeddings')

PySpark Version - 2.4.4
SparkNLP Version - 2.6.4
Java Version - "1.8.0_45"

Here are the complete details of the Error -

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]

Py4JJavaError Traceback (most recent call last)
in
15 # .setOutputCol('embeddings')
16
---> 17 word_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models')
18 .setInputCols(['sentence', 'token'])
19 .setOutputCol('embeddings')

~/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)
1745 def pretrained(name="glove_100d", lang="en", remote_loc=None):
1746 from sparknlp.pretrained import ResourceDownloader
-> 1747 return ResourceDownloader.downloadModel(WordEmbeddingsModel, name, lang, remote_loc)
1748
1749 @staticmethod

~/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
39 t1.start()
40 try:
---> 41 j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
42 finally:
43 stop_threads = True

~/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/sparknlp/internal.py in init(self, reader, name, language, remote_loc, validator)
174 class _DownloadModel(ExtendedJavaWrapper):
175 def init(self, reader, name, language, remote_loc, validator):
--> 176 super(_DownloadModel, self).init("com.johnsnowlabs.nlp.pretrained."+validator+".downloadModel", reader, name, language, remote_loc)
177
178

~/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/sparknlp/internal.py in init(self, java_obj, *args)
127 super(ExtendedJavaWrapper, self).init(java_obj)
128 self.sc = SparkContext._active_spark_context
--> 129 self._java_obj = self.new_java_obj(java_obj, *args)
130 self.java_obj = self._java_obj
131

~/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
137
138 def new_java_obj(self, java_class, *args):
--> 139 return self._new_java_obj(java_class, *args)
140
141 def new_java_array(self, pylist, java_class):

~/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
65 java_obj = getattr(java_obj, name)
66 java_args = [_py2java(sc, arg) for arg in args]
---> 67 return java_obj(*java_args)
68
69 @staticmethod

~/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/py4j/java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:

~/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

~/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1380)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.first(RDD.scala:1377)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:615)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:493)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:361)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:355)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:469)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)


Fix: com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel. : java.lang.OutOfMemoryError: Java heap space while running BERT Embedding

Fix OOM on Java Heap space while running BERT Embedding on pyspark

Question:

  1. What is machine configuration for running BERT Embedding?
  2. How to setup Java Heap space?
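Regarding question 2: driver heap space is fixed when the Spark session is created, not afterwards. A minimal sketch of building the session with more memory (the values are placeholders to tune against the 15 GB VM; the spark-nlp package coordinate matches Spark 2.4.x / Scala 2.11):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("spark-nlp-bert") \
    .master("local[*]") \
    .config("spark.driver.memory", "10g") \
    .config("spark.driver.maxResultSize", "2g") \
    .config("spark.kryoserializer.buffer.max", "1g") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.4") \
    .getOrCreate()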

Steps to Reproduce

Code:

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.embeddings import *
from pyspark.ml import Pipeline

spark = sparknlp.start()
data = [
  ("New York is the greatest city in the world", 0),
  ("The beauty of Paris is vast", 1),
  ("The Centre Pompidou is in Paris", 1)
]
df = spark.createDataFrame(data, ["text","label"])
document_assembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"])\
  .setOutputCol("token")
word_embeddings = BertEmbeddings.pretrained('bert_base_cased', 'en')\
  .setInputCols(["document", "token"])\
  .setOutputCol("embeddings")
bert_pipeline = Pipeline().setStages(
  [
    document_assembler,
    tokenizer,
    word_embeddings
  ]
)
df_bert = bert_pipeline.fit(df).transform(df)
display(df_bert)

Error Log

Approximate size to download 389.2 MB
Download done! Loading the resource.
[ — ]2020-08-11 03:43:45.487324: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-08-11 03:43:45.493870: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2020-08-11 03:43:45.494190: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f3b55fa2960 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-11 03:43:45.494235: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[ \ ]20/08/11 03:43:47 WARN MemoryStore: Not enough space to cache broadcast_5 in memory! (computed 417.4 MB so far)
20/08/11 03:43:47 WARN BlockManager: Persisting block broadcast_5 to disk instead.
[ / ]20/08/11 03:47:03 WARN BlockManager: Block broadcast_5 could not be removed as it was not found on disk or in memory
[OK!]
Traceback (most recent call last):
File "", line 2, in
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/sparknlp/annotator.py", line 1846, in pretrained
return ResourceDownloader.downloadModel(BertEmbeddings, name, lang, remote_loc)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/sparknlp/pretrained.py", line 41, in downloadModel
j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/sparknlp/internal.py", line 176, in init
super(_DownloadModel, self).init("com.johnsnowlabs.nlp.pretrained."+validator+".downloadModel", reader, name, language, remote_loc)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/sparknlp/internal.py", line 129, in init
self._java_obj = self.new_java_obj(java_obj, *args)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/sparknlp/internal.py", line 139, in new_java_obj
return self._new_java_obj(java_class, *args)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.OutOfMemoryError: Java heap space
at java.nio.file.Files.read(Files.java:3099)
at java.nio.file.Files.readAllBytes(Files.java:3158)
at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper.writeObject(TensorflowWrapper.scala:173)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:140)
at org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:174)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$7.apply(BlockManager.scala:1174)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$7.apply(BlockManager.scala:1172)
at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1172)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:914)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1481)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:123)
at org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)

Environment

  • Spark-NLP version: 2.5.4
  • Apache Spark version: 2.4.4
  • Java version : openjdk version "1.8.0_265"
  • Operating System and version: Ubuntu 18.04 (Google VM)
  • VM Machine: 4 CPU, 15 GB RAM, 30 GB SSD
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Jupyter

Wrong interpretation for Language Detection

I ran language detection on a sample text mixed with English and French.

Steps to Reproduce

Run from this example jupyter notebook https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb

Sample text:

Today is the anniversary of the publication of Robert Frost's iconic poem "Stopping by Woods on a Snowy Evening," a fact that spurred the Literary Hub office into a long conversation about their favorite poems, the most iconic poems written in English, and which poems we should all have already read (or at least be reading next). Turns out, despite frequent (false) claims that poetry is dead and/or irrelevant and/or boring, there are plenty of poems that have sunk deep into our collective consciousness as cultural icons.Demain, dès l'aube, à l'heure où blanchit la campagne,Je partirai. Vois-tu, je sais que tu m'attends.J'irai par la forêt, j'irai par la montagne.Je ne puis demeurer loin de toi plus longtemps

Result:

|result|
+------+
|  [en]|
+------+

Environment

  • Spark-NLP version: 2.5.4
  • Apache Spark version: 2.4.4
  • Operating System and version: Ubuntu 18.04 (Google VM)
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.):Jupyter

Sentiment_rb.ipynb has issue with downloading pipeline

PretrainedPipeline("movies_sentiment_analysis")

Steps to Reproduce

  1. Pull and run the docker
  2. Run notebook https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/dictionary-sentiment/sentiment_rb.ipynb
  3. Try to launch the cell #3. You will get the exception saying that resource failed to download.

Looks like some dependency inside docker is missing. Could you please check?

Your Environment

  • Spark-NLP version: 2.0.3
  • Apache Spark version: 2.4.1
  • Operating System and version: The latest Docker
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): I have pulled the latest docker as described on the main page.

Wrong reference in output result. spell not found in explain_document_ml.ipynb

Description

[content['spell'] for content in result]

should be replaced to the:

[content['checked'] for content in result]

or it can be related to a wrong version of the library in the Docker image

same problem in:
/annotation/english/dictionary-sentiment/sentiment_rb.ipynb

Steps to Reproduce

Run /annotation/english/explain-document-ml/explain_document_ml.ipynb

Your Environment

  • Spark-NLP version: 2.0.4
  • Apache Spark version: 2.4.0
  • Operating System and version: Ubuntu 18.04 and Docker
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Docker + Jupyter

Exception: Java gateway process exited before sending its port number

Description

Notebook:
import sparknlp
spark = sparknlp.start()

leads to the error "Exception: Java gateway process exited before sending its port number"

Best guess is JAVA_HOME is not set in the docker environment (see jupyter/notebook#743)
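If the JAVA_HOME guess is right, a hedged workaround is to point the variable at a Java 8 installation before starting Spark; the path below is a typical Debian/Ubuntu location and may differ inside the image:

import os

# Hypothetical path; adjust to wherever Java 8 is installed in the container
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'

import sparknlp
spark = sparknlp.start()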

Steps to Reproduce

  1. Just use the docker image and start the notebook; run the cell

Your Environment

Provided docker setup mentioned in this repo

  • Spark-NLP version:
  • Apache Spark version:
  • Operating System and version:
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.):

Wrong or missing inputCols annotators in NerDLModel

Error using the NerDLPipeline example; it is missing one transformer.

Description

It is missing a WordEmbeddingsModel stage, and some input columns are wrong.

Steps to Reproduce

Run NerDLPipeline example.

Proposed fix

Sorry for not doing a pull request, it's kinda late 😄.

[...]

  val normalizer = new Normalizer()
    .setInputCols("token")
    .setOutputCol("normal")

  val wordEmbeddings = WordEmbeddingsModel.pretrained()
    .setInputCols("document", "token")
    .setOutputCol("word_embeddings")

  val ner = NerDLModel.pretrained()
    .setInputCols("document", "normal", "word_embeddings")
    .setOutputCol("ner")

[...]

  val pipeline = new Pipeline().setStages(Array(
    document,
    token,
    wordEmbeddings,
    normalizer,
    ner,
    nerConverter,
    finisher))

[...]
}

TextMatcher with spark dataframe entities

How do I use a spark dataframe as entities for a TextMatcher?

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import ReadAs

entity_extractor = TextMatcher() \
    .setInputCols(["description"])\
    .setOutputCol("locations")\
    .setCaseSensitive(False)\
    .setEntities(locations_df.select(F.col('location').alias('values')), ReadAs.SPARK_DATASET)

display(entity_extractor.fit(df).transform(df))
py4j.Py4JException: Method fromJava([class org.apache.spark.sql.Dataset, class java.lang.String, class java.util.HashMap]) does not exist
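Not an official answer, but one workaround sketch while passing a Spark DataFrame directly fails: collect the phrases and feed setEntities a plain-text file instead (assumes local/single-machine mode and that the phrase list fits in driver memory; paths are placeholders):

from sparknlp.common import ReadAs

# Collect the entity phrases from the DataFrame and write one phrase per line
phrases = [row['location'] for row in locations_df.select('location').collect()]
with open('/tmp/location_entities.txt', 'w') as f:
    f.write('\n'.join(phrases))

# Input columns kept as in the question above
entity_extractor = TextMatcher() \
    .setInputCols(["description"]) \
    .setOutputCol("locations") \
    .setCaseSensitive(False) \
    .setEntities("/tmp/location_entities.txt", ReadAs.TEXT)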

Named Entity Labels using the RegexMatcher annotator

I am trying to train a NERDL model. I am able to assemble my data using the DocumentAssembler -> SentenceDetector -> Tokenizer annotators. I need to generate my label column, which in my case will be a binary label 'software tool' and 'Other'. I am using the RegexMatcher to detect my labeled software tools, but I am unsure about how to generate the 'Other' class. Also I am not sure the RegexMatcher will work because the result from the annotator is not a named entity, rather it is a chunk. I looked through the documentation and could not find a labeler annotator for NERDL.

Any help/suggestions are appreciated.

No module found sparknlp.dataset. Docker version mismatch.

ModuleNotFoundError: No module named 'sparknlp.dataset'

Description

Error on running example /jupyter/training/english/crf-ner/ner_benchmark.ipynb
Same for:

/training/english/crf-ner/ner.ipynb
/training/english/dl-ner/ner_benchmark.ipynb

Steps to Reproduce

Run /jupyter/training/english/crf-ner/ner_benchmark.ipynb from docker image
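For reference, in later spark-nlp 2.x releases the CoNLL reader is exposed from sparknlp.training rather than sparknlp.dataset; a hedged sketch of the replacement import:

# Replaces the old 'from sparknlp.dataset import CoNLL' used in the notebook
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, 'eng.train')  # placeholder path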

Your Environment

  • Spark-NLP version: 2.0.4
  • Apache Spark version: 2.4.0
  • Operating System and version: Ubuntu 18.04 and Docker
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Docker + Jupyter


AWS 403 on Pretrained Pipeline "analyze_sentiment_ml"

Description

I get a 403 response from the downloader, only recently on the analyze_sentiment_ml Pretrained Pipeline.

Steps to Reproduce

import sparknlp
spark = sparknlp.start()
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('analyze_sentiment_ml', 'en')
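Not confirmed, but retired pipeline names do return 403s from the download bucket; if the artifact was renamed, loading the currently published sentiment pipeline may work (the name analyze_sentiment is an assumption to verify against the Models Hub):

from sparknlp.pretrained import PretrainedPipeline

# Assumed current name; check the Models Hub listing if this also fails
pipeline = PretrainedPipeline('analyze_sentiment', 'en')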

Your Environment

  • Spark-NLP version: 2.01
  • Apache Spark version: pyspark 2.4.3
  • Operating System and version: Arch
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): pip

Trace

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/renwickt/.local/lib/python3.7/site-packages/sparknlp/pretrained.py", line 30, in __init__
    self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
  File "/home/renwickt/.local/lib/python3.7/site-packages/sparknlp/pretrained.py", line 18, in downloadPipeline
    j_obj = _internal._DownloadPipeline(name, language, remote_loc).apply()
  File "/home/renwickt/.local/lib/python3.7/site-packages/sparknlp/internal.py", line 65, in __init__
    self._java_obj = self._new_java_obj(self._java_obj, name, language, remote_loc)
  File "/home/renwickt/.local/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 67, in _new_java_obj
    return java_obj(*java_args)
  File "/home/renwickt/.local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/renwickt/.local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/renwickt/.local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 847CBA657B7672C1, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: 6GMSNXWGTzHG9pVMDS8h59OrM/kU9CqRC88VTh7CLXF+j9H/uscIUcSAuGZNNe2kGasFkDIzjJc=
	at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
	at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader$S3ClientWrapper.doesObjectExist(S3ResourceDownloader.scala:183)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader$$anonfun$download$1.apply(S3ResourceDownloader.scala:94)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader$$anonfun$download$1.apply(S3ResourceDownloader.scala:91)
	at scala.Option.flatMap(Option.scala:171)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.download(S3ResourceDownloader.scala:90)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadResource(ResourceDownloader.scala:101)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:133)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:128)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadPipeline(ResourceDownloader.scala:197)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Can we use spark-nlp OCR to get HOCR output along with bold, table, etc. tags, and is it purely distributed across multiple nodes?

Hi,
Is it possible to have table extraction through the spark-nlp (OCR) module, and can it be used for bold text and table detection? I have been going through the source code and running some examples, but I couldn't find any attribute that could help me detect a table and its data.

Is it even possible, or is it under development? I do know it's possible using OpenCV + Tesseract, but that would only be UDF-based Spark and wouldn't be completely distributed.

Will JohnSnowLabs be adding Table Detection etc. in the upcoming future?
Will those jobs be distributed across multiple nodes? (the OCR job)

I do know that the NLP side is distributed; I just wanted to know about the OCR, since it will be used as a UDF (in the local spark-nlp code). Kindly let me know if I am wrong. A UDF is multi-core based, but can it run on slave nodes, or does it just run on the master node?

Thanks.

py4j.protocol.Py4JJavaError: An error occurred while calling o451.showString. :

Hello! The new release sounds awesome! So I've tried upgrading everything (spark-nlp-jsl 3.0.0, spark-nlp 3.0.1, apache spark 3.1.1) and I've run into a bit of a problem. I am getting this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o451.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 24) (computername executor driver): java.net.SocketException: Connection reset

The example for OCR does not match the version of spark-nlp in the Docker (spark-nlp==2.0.3)

Function spark.start_with_ocr() does not exist in the spark-nlp==2.0.3

Hello, thanks for the awesome repository. I am trying to proceed with the example from "explain-document-dl" and I get 2 blocking issues.

Steps to Reproduce

  1. Pull and run the docker.

  2. Run notebook https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL%20with%20OCR.ipynb

  3. Try to launch the cell #1. You will get the exception saying that sparknlp.start_with_ocr() method does not exist.

  4. If you change it to simple sparknlp.start() you can proceed forward, but then on the cell #4 you will get another exception about OcrHelper(). Looks like in new version of spark-nlp OcrHelper() is static, and previously it was an instance method.

  5. Also, I have found that it is possible to launch OCR in new version with this call: sparknlp.start(include_ocr=True). However, after rather some time it still crashes and does not work.

Your Environment

  • Spark-NLP version: 2.0.3 and 2.0.1
  • Apache Spark version: 2.4.1
  • Operating System and version: 3 configurations - Ubuntu 16.04, Ubuntu 18.04 and Docker
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): I have tried your Docker container, it installed spark-nlp==2.0.3. Also, I manually installed on my own host machine with Ubuntu 16.04 spark-nlp==2.0.1 and tried 2.0.3 as well. And on Ubuntu 18.04 both versions of spark-nlp. On these 3 configurations it does not work.

Error while downloading the sentence embeddings

use = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:61)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:90)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:89)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.collection.Iterator$$anon$14.next(Iterator.scala:542)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at scala.collection.AbstractIterator.toList(Iterator.scala:1334)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:92)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:84)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:70)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:394)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:479)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Error when running notebooks: Answer from Java side is empty

When I tried to run the notebook jupyter/annotation/english/explain-document-dl/Explain Document DL.ipynb with the updated docker image (as instructed in #21), I failed to run this cell:

pipeline = PretrainedPipeline('explain_document_dl')

The error messages in the warning box (the red box) read:

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

The following error messages are:

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<timed exec> in <module>

/usr/local/lib/python3.6/dist-packages/sparknlp/pretrained.py in __init__(self, name, lang, remote_loc)
     28 
     29     def __init__(self, name, lang='en', remote_loc=None):
---> 30         self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
     31         self.light_model = LightPipeline(self.model)
     32 

/usr/local/lib/python3.6/dist-packages/sparknlp/pretrained.py in downloadPipeline(name, language, remote_loc)
     16     @staticmethod
     17     def downloadPipeline(name, language, remote_loc=None):
---> 18         j_obj = _internal._DownloadPipeline(name, language, remote_loc).apply()
     19         jmodel = JavaModel(j_obj)
     20         return jmodel

/usr/local/lib/python3.6/dist-packages/sparknlp/internal.py in __init__(self, name, language, remote_loc)
     63     def __init__(self, name, language, remote_loc):
     64         super(_DownloadPipeline, self).__init__("com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline")
---> 65         self._java_obj = self._new_java_obj(self._java_obj, name, language, remote_loc)
     66 
     67 

/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
     65             java_obj = getattr(java_obj, name)
     66         java_args = [_py2java(sc, arg) for arg in args]
---> 67         return java_obj(*java_args)
     68 
     69     @staticmethod

/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/lib/python3.6/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    334             raise Py4JError(
    335                 "An error occurred while calling {0}{1}{2}".
--> 336                 format(target_id, ".", name))
    337     else:
    338         type = answer[1]

Py4JError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline

Any clue to fix that? Thanks!

Sentence similarity with SparkNLP only works on Google DataProc with ONE sentence, FAILS when multiple sentences are provided

Deployed the following Colab Python code (see link below) to DataProc on Google Cloud. It only works when input_list is an array with one item; when input_list has two items, the PySpark job dies with the following error on the line "for r in result.collect()" in the get_similarity method below:

java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:745)
input_list=["no error"]                 <---- works
input_list=["this", "throws EOF error"] <---- does not work

link to colab for sentence similarity using spark-nlp:
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTENCE_SIMILARITY.ipynb#scrollTo=6E0Y5wtunFi4

import numpy as np
import pandas as pd

# `spark` and `light_pipeline` are created earlier in the notebook
def get_similarity(input_list):
    df = spark.createDataFrame(pd.DataFrame({'text': input_list}))
    result = light_pipeline.transform(df)
    embeddings = []
    for r in result.collect():
        embeddings.append(r.sentence_embeddings[0].embeddings)
    embeddings_matrix = np.array(embeddings)
    return np.matmul(embeddings_matrix, embeddings_matrix.transpose())
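
For concreteness, the two calls behave as follows on DataProc (a minimal sketch using the same inputs listed above; outputs omitted):

# works: a single sentence
print(get_similarity(["no error"]))

# dies with "Premature EOF from inputStream": two sentences
print(get_similarity(["this", "throws EOF error"]))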

I've tried changing "dfs.datanode.max.transfer.threads" to 8192 in the Hadoop cluster config, and still no luck:

hadoop_config.set('dfs.datanode.max.transfer.threads', "8192")

How can I get this code working when input_list has multiple items in the array?

Showstopper: new Dockerfile has Java 11, which is incompatible

The new Dockerfile with Java 11 does not allow launching any example.

Steps to Reproduce

  1. Try to launch any notebook that loads a pretrained pipeline or model, either in the strata dir or in the jupyter/annotation dir.

  2. After some time you will get the exception "Unsupported class file major version 55".

  3. This is due to the Java 11 version in the new Docker image. More about this here: https://stackoverflow.com/questions/53583199/pyspark-error-unsupported-class-file-major-version-55 (a quick version check is sketched below).
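
A quick way to confirm which Java version the container actually exposes (a small sketch, independent of the notebooks; the version string shown in the comment is just an example):

import subprocess

# `java -version` prints to stderr, e.g. something like 'openjdk version "11.0.x"' in the new image
result = subprocess.run(["java", "-version"], stderr=subprocess.PIPE, universal_newlines=True)
print(result.stderr)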

Even notebooks that I successfully launched inside Docker a few days ago can no longer be launched.
I cloned the repository locally on my host machine and installed Java 8 there. It works fine locally but does not work in Docker at all.
This is probably a consequence of the Dockerfile update in #37.

Your Environment

  • Spark-NLP version: 2.0.3
  • Apache Spark version: 2.4.1
  • Operating System and version: 2 configurations - Ubuntu 18.04(local) and Docker
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Docker container with Java 11 does not work with any example. My host config with Java 8 and cloned repository works ok.

Provided Tutorial notebook doesn't run

Description

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb

This error occurs while running it in Colab:

Py4JJavaError: An error occurred while calling None.com.johnsnowlabs.nlp.DocumentAssembler.
: java.lang.NoSuchMethodError: org.apache.spark.ml.util.MLWritable.$init$(Lorg/apache/spark/ml/util/MLWritable;)V
	at com.johnsnowlabs.nlp.DocumentAssembler.<init>(DocumentAssembler.scala:13)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Steps to Reproduce
Run this Colab notebook:

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb

using the versions specified in the install cell.

BertEmbeddings Error

In "3.NER_with_BERT.ipynb"

When I run the following code with the setPoolingLayer attribute:

bert_annotator = BertEmbeddings.pretrained('bert_base_cased', 'en') \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("bert")\
 .setCaseSensitive(False)\
 .setPoolingLayer(0)

it gives me this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-7-af673ae4eec4> in <module>()
----> 1 bert_annotator = BertEmbeddings.pretrained('bert_base_cased', 'en')  .setInputCols(["sentence",'token']) .setOutputCol("bert") .setCaseSensitive(False) .setPoolingLayer(0)

AttributeError: 'BertEmbeddings' object has no attribute 'setPoolingLayer'

Without the setPoolingLayer attribute it works (see the sketch below).
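
For reference, the call that does work in this environment is the same pretrained model simply without the pooling-layer setter (a minimal sketch, assuming the sentence and token columns exist upstream in the pipeline):

bert_annotator = BertEmbeddings.pretrained('bert_base_cased', 'en') \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert") \
    .setCaseSensitive(False)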

Kindly suggest a solution: if I want to use the last layer via setPoolingLayer, how should it work?

Thanks

Your Environment

  • Spark-NLP version: 2.6.2
  • Apache Spark version: 2.4.7
  • Operating System and version: Ubuntu 20.04
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.):Jupyter conda

Error when running notebooks: Answer from Java side is empty

Hello Team,

bert_model = BertEmbeddings.pretrained('bert_base_cased', 'en') \
    .setInputCols(["sentence", "token"]).setOutputCol("bert") \
    .setCaseSensitive(False).setPoolingLayer(0)

df_bert_train = bert_model.transform(sparkNLP_transformed_full_train)

nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "bert"]) \
    .setLabelColumn("label").setOutputCol("ner") \
    .setMaxEpochs(1).setRandomSeed(0).setVerbose(1) \
    .setValidationSplit(0.2).setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True).setIncludeConfidence(True)

# The code above runs successfully and I can see the BERT embeddings added to the
# training data. However, the line below gives an error:
ner_tag_model_final = nerTagger.fit(df_bert_train)

I am trying to create a NER DL model and I am successful in creating the pipeline.

However, when I feed the training data to fit() the model, I receive the following error:

Exception happened during processing of request from ('127.0.0.1', 47346)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "/usr/lib/python3.6/socketserver.py", line 320, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib/python3.6/socketserver.py", line 351, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python3.6/socketserver.py", line 364, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python3.6/socketserver.py", line 724, in init
self.handle()
File "/usr/local/lib/python3.6/dist-packages/pyspark/accumulators.py", line 269, in handle
poll(accum_updates)
File "/usr/local/lib/python3.6/dist-packages/pyspark/accumulators.py", line 241, in poll
if func():
File "/usr/local/lib/python3.6/dist-packages/pyspark/accumulators.py", line 245, in accum_updates
num_updates = read_int(self.rfile)
File "/usr/local/lib/python3.6/dist-packages/pyspark/serializers.py", line 717, in read_int
raise EOFError
EOFError


Py4JError Traceback (most recent call last)
in ()
----> 1 ner_tag_model_final = nerTagger.fit(df_bert_train)

5 frames
/usr/local/lib/python3.6/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
334 raise Py4JError(
335 "An error occurred while calling {0}{1}{2}".
--> 336 format(target_id, ".", name))
337 else:
338 type = answer[1]

Py4JError: An error occurred while calling o860.fit

Docker commands not opening Jupyter notebooks at localhost:8888

Description

Steps to Reproduce

  1. docker pull johnsnowlabs/spark-nlp-workshop
  2. docker run -it --rm -p 8888:8888 -p 4040:4040 johnsnowlabs/spark-nlp-workshop
  3. Open http://localhost:8888/?token=LOOK_INSIDE_YOUR_CONSOLE

Your Environment

  • Spark-NLP version: 1.7.2
  • Apache Spark version: 2.3.2
  • Operating System and version: WIndows 10
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Docker

I hope the versions are irrelevant since I am using the workshop Docker image, so it should start the Jupyter notebooks, but it is not starting any.

Unable to run the Clinical Entity Resolver from a Jupyter notebook

I have seen multiple versions of the Clinical Entity Resolver notebook when I clone the repo into my virtual machine, and I run through them from a Jupyter notebook.

spark-nlp-healthcare

In the GitHub version -> IllegalArgumentException: "requirement failed: Wrong or missing inputCols annotators in NerConverterInternal_37f79d519de6"

In the Jupyter notebook version -> AnalysisException: "cannot resolve 'ner_token.metadata'"

Code snippet for the GitHub version:

posology_ner = NerDLModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter1 = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

chunk_merge = ChunkMergeApproach() \
    .setInputCols("ner_chunk", "ner_chunk") \
    .setOutputCol("merged_chunk") \
    .setReplaceDictResource("replace_dict.csv", "TEXT", {"delimiter": ","})

iob_tagger = IOBTagger() \
    .setInputCols("token", "merged_chunk") \
    .setOutputCol("merged_ner")

ner_converter2 = NerConverterInternal() \
    .setInputCols(["sentence", "token", "merged_ner"]) \
    .setOutputCol("greedy_chunk") \
    .setPreservePosition(False) \
    .setWhiteList(['DRUG'])

posology_rx = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    stopwords,
    word_embeddings,
    posology_ner,
    ner_converter1,
    chunk_merge,
    iob_tagger,
    ner_converter2,
    chunk_embeddings,
    rxnorm_resolver1
])

model_rxnorm = posology_rx.fit(data_ner)
output = model_rxnorm.transform(data_ner)

output.select(F.explode(F.arrays_zip("greedy_chunk.result", "greedy_chunk.metadata",
                                     "rxnorm_resolution.result", "rxnorm_resolution.metadata")).alias("rxnorm_result")) \
    .select(F.expr("rxnorm_result['0']").alias("chunk"),
            F.expr("rxnorm_result['1'].entity").alias("entity"),
            F.expr("rxnorm_result['3'].all_k_resolutions").alias("target_text"),
            F.expr("rxnorm_result['2']").alias("code"),
            F.expr("rxnorm_result['3'].confidence").alias("confidence")) \
    .show(truncate=100)

Error Log:
--------------------------------------------------------------------------- Py4JJavaError Traceback (most recent call last) ~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw) 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: ~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 format(target_id, ".", name), value) 329 else: Py4JJavaError: An error occurred while calling o4989.transform. : java.lang.IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in NerConverterInternal_37f79d519de6. Current inputCols: sentence,token,merged_ner. Dataset's columns: (column_name=text,is_nlp_annotator=false) (column_name=document,is_nlp_annotator=true,type=document) (column_name=sentence,is_nlp_annotator=true,type=document) (column_name=raw_token,is_nlp_annotator=true,type=token) (column_name=token,is_nlp_annotator=true,type=token) (column_name=embeddings,is_nlp_annotator=true,type=word_embeddings) (column_name=ner,is_nlp_annotator=true,type=named_entity) (column_name=ner_chunk,is_nlp_annotator=true,type=chunk) (column_name=merged_chunk,is_nlp_annotator=true,type=chunk) (column_name=merged_ner,is_nlp_annotator=true,type=chunk). Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: document, token, named_entity at scala.Predef$.require(Predef.scala:224) at com.johnsnowlabs.nlp.AnnotatorModel._transform(AnnotatorModel.scala:43) at com.johnsnowlabs.nlp.AnnotatorModel.transform(AnnotatorModel.scala:79) at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) During handling of the above exception, another exception occurred: IllegalArgumentException Traceback (most recent call last) <ipython-input-36-c40be2771ed1> in <module> 36 model_rxnorm = posology_rx.fit(data_ner) 37 ---> 38 output = model_rxnorm.transform(data_ner) 39 40 output.select(F.explode(F.arrays_zip("greedy_chunk.result","greedy_chunk.metadata","rxnorm_resolution.result","rxnorm_resolution.metadata")).alias("rxnorm_result")) \ ~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/ml/base.py in transform(self, dataset, params) 171 return self.copy(params)._transform(dataset) 172 else: --> 173 return self._transform(dataset) 174 else: 175 raise ValueError("Params must be a param map but got %s." 
% type(params)) ~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/ml/pipeline.py in _transform(self, dataset) 260 def _transform(self, dataset): 261 for t in self.stages: --> 262 dataset = t.transform(dataset) 263 return dataset 264 ~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/ml/base.py in transform(self, dataset, params) 171 return self.copy(params)._transform(dataset) 172 else: --> 173 return self._transform(dataset) 174 else: 175 raise ValueError("Params must be a param map but got %s." % type(params)) ~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _transform(self, dataset) 310 def _transform(self, dataset): 311 self._transfer_params_to_java() --> 312 return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx) 313 314 ~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args) 1255 answer = self.gateway_client.send_command(command) 1256 return_value = get_return_value( -> 1257 answer, self.gateway_client, self.target_id, self.name) 1258 1259 for temp_arg in temp_args: ~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw) 77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace) 78 if s.startswith('java.lang.IllegalArgumentException: '): ---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) 80 raise 81 return deco IllegalArgumentException: "requirement failed: Wrong or missing inputCols annotators in NerConverterInternal_37f79d519de6.\n\nCurrent inputCols: sentence,token,merged_ner. Dataset's columns:\n(column_name=text,is_nlp_annotator=false)\n(column_name=document,is_nlp_annotator=true,type=document)\n(column_name=sentence,is_nlp_annotator=true,type=document)\n(column_name=raw_token,is_nlp_annotator=true,type=token)\n(column_name=token,is_nlp_annotator=true,type=token)\n(column_name=embeddings,is_nlp_annotator=true,type=word_embeddings)\n(column_name=ner,is_nlp_annotator=true,type=named_entity)\n(column_name=ner_chunk,is_nlp_annotator=true,type=chunk)\n(column_name=merged_chunk,is_nlp_annotator=true,type=chunk)\n(column_name=merged_ner,is_nlp_annotator=true,type=chunk).\nMake sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: document, token, named_entity"

Code snippet for the Jupyter notebook version:

Persisting temporarily to keep DAG size and resource usage low (Ensemble Resolvers are resource intensive):

pipelineModelFull = pipelineFull.fit(data)
output = pipelineModelFull.transform(data)
output.write.mode("overwrite").save("temp")
output = spark.read.load("temp")

Error Log:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:

~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:

Py4JJavaError: An error occurred while calling o6703.transform.
: org.apache.spark.sql.AnalysisException: cannot resolve 'ner_token.metadata' given input columns: [chunk_token_jsl, ner_jsl, chunk_drug, chunk_jsl, chunk_embs_jsl, text_feed, ner_drug, embeddings, token, document, chunk_token_drug, chunk_embs_drug, sentence, doc_id]; line 1 pos 0;
'Project ['ner_token.metadata]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, chunk_embs_jsl#4559, chunk_embs_drug#4586, chunk_token_jsl#4600, UDF(array(chunk_drug#4534)) AS chunk_token_drug#4615]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, chunk_embs_jsl#4559, chunk_embs_drug#4586, UDF(array(chunk_jsl#4523)) AS chunk_token_jsl#4600]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, chunk_embs_jsl#4559, chunk_embs_drug#4572 AS chunk_embs_drug#4586]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, chunk_embs_jsl#4559, UDF(array(chunk_drug#4534, embeddings#4494)) AS chunk_embs_drug#4572]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, chunk_embs_jsl#4546 AS chunk_embs_jsl#4559]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, UDF(array(chunk_jsl#4523, embeddings#4494)) AS chunk_embs_jsl#4546]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, UDF(array(sentence#4473, token#4479, ner_drug#4513)) AS chunk_drug#4534]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, UDF(array(sentence#4473, token#4479, ner_jsl#4503)) AS chunk_jsl#4523]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, UDF(array(sentence#4473, token#4479, embeddings#4494)) AS ner_drug#4513]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, UDF(array(sentence#4473, token#4479, embeddings#4494)) AS ner_jsl#4503]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4486 AS embeddings#4494]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, UDF(array(sentence#4473, token#4479)) AS embeddings#4486]
+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, UDF(array(sentence#4473)) AS token#4479]
+- Project [doc_id#4309L, text_feed#4310, document#4468, UDF(array(document#4468)) AS sentence#4473]
+- Project [doc_id#4309L, text_feed#4310, UDF(text_feed#4310) AS document#4468]
+- Project [_1#4305L AS doc_id#4309L, _2#4306 AS text_feed#4310]
+- LogicalRDD [_1#4305L, _2#4306], false
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:111)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:281)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:281)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:280)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$2.apply(QueryPlan.scala:121)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:121)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:108)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:86)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3412)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1340)
at com.johnsnowlabs.nlp.annotators.resolution.EnsembleEntityResolverModel.checkIfTokensHaveChunk(EnsembleEntityResolverModel.scala:116)
at com.johnsnowlabs.nlp.annotators.resolution.EnsembleEntityResolverModel.transform(EnsembleEntityResolverModel.scala:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
During handling of the above exception, another exception occurred:
AnalysisException Traceback (most recent call last)
in
2 pipelineModelFull = pipelineFull.fit(data)
3
----> 4 output = pipelineModelFull.transform(data)
5
6 output.write.mode("overwrite").save("temp")
~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/ml/base.py in transform(self, dataset, params)
171 return self.copy(params)._transform(dataset)
172 else:
--> 173 return self._transform(dataset)
174 else:
175 raise ValueError("Params must be a param map but got %s." % type(params))
~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/ml/pipeline.py in _transform(self, dataset)
260 def _transform(self, dataset):
261 for t in self.stages:
--> 262 dataset = t.transform(dataset)
263 return dataset
264
~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/ml/base.py in transform(self, dataset, params)
171 return self.copy(params)._transform(dataset)
172 else:
--> 173 return self._transform(dataset)
174 else:
175 raise ValueError("Params must be a param map but got %s." % type(params))
~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _transform(self, dataset)
310 def _transform(self, dataset):
311 self._transfer_params_to_java()
--> 312 return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx)
313
314
~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/py4j/java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: "cannot resolve 'ner_token.metadata' given input columns: [chunk_token_jsl, ner_jsl, chunk_drug, chunk_jsl, chunk_embs_jsl, text_feed, ner_drug, embeddings, token, document, chunk_token_drug, chunk_embs_drug, sentence, doc_id]; line 1 pos 0;\n'Project ['ner_token.metadata]\n+- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, chunk_embs_jsl#4559, chunk_embs_drug#4586, chunk_token_jsl#4600, UDF(array(chunk_drug#4534)) AS chunk_token_drug#4615]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, chunk_embs_jsl#4559, chunk_embs_drug#4586, UDF(array(chunk_jsl#4523)) AS chunk_token_jsl#4600]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, chunk_embs_jsl#4559, chunk_embs_drug#4572 AS chunk_embs_drug#4586]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, chunk_embs_jsl#4559, UDF(array(chunk_drug#4534, embeddings#4494)) AS chunk_embs_drug#4572]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, chunk_embs_jsl#4546 AS chunk_embs_jsl#4559]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, chunk_drug#4534, UDF(array(chunk_jsl#4523, embeddings#4494)) AS chunk_embs_jsl#4546]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, chunk_jsl#4523, UDF(array(sentence#4473, token#4479, ner_drug#4513)) AS chunk_drug#4534]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, ner_drug#4513, UDF(array(sentence#4473, token#4479, ner_jsl#4503)) AS chunk_jsl#4523]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, ner_jsl#4503, UDF(array(sentence#4473, token#4479, embeddings#4494)) AS ner_drug#4513]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4494, UDF(array(sentence#4473, token#4479, embeddings#4494)) AS ner_jsl#4503]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, embeddings#4486 AS embeddings#4494]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, token#4479, UDF(array(sentence#4473, token#4479)) AS embeddings#4486]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, sentence#4473, UDF(array(sentence#4473)) AS token#4479]\n +- Project [doc_id#4309L, text_feed#4310, document#4468, UDF(array(document#4468)) AS sentence#4473]\n +- Project [doc_id#4309L, text_feed#4310, UDF(text_feed#4310) AS document#4468]\n +- Project [_1#4305L AS doc_id#4309L, _2#4306 AS text_feed#4310]\n +- LogicalRDD [_1#4305L, _2#4306], false\n"`

Your Environment

  • Spark-NLP version: '2.5.3'
  • Operating System and version: Ubuntu 18.04 (Google VM)
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): jupyter

SparkNLP Jupyter Notebook with TensorFlow for Colab

Hi,

I've found your Colab tutorial examples very useful to better understand the usage of SparkNLP.

I used this tutorial, a talk by Alexander Thomas, and his Databricks-based repo to update his notebook and get it running in the Colab environment, and I thought you might be interested in adding it to your collection of examples here:

Natural Language Understanding at Scale with Spark Native NLP, Spark ML & TensorFlow

I hope you find this a useful contribution.

The way I see it, the code up to cell [29] uses SparkNLP for transformations, whereas the remaining TensorFlow code in the notebook implements text classification using a neural network with (sketched below):

  • 2 hidden layers and
  • 1 output layer
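
A minimal sketch of that architecture in Keras (the embedding size of 512 and the two output classes are assumptions for illustration, not the notebook's exact values):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(512,)),  # hidden layer 1, fed by Spark NLP embeddings
    tf.keras.layers.Dense(128, activation="relu"),                      # hidden layer 2
    tf.keras.layers.Dense(2, activation="softmax"),                     # output layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])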

Are you aware of a resource where I can read up on this particular way of classifying text using TensorFlow (or Keras with TensorFlow)?

Thanks for this great library 👍

explain_document.ipynb has an issue with downloading the pipeline, possibly a typo

PretrainedPipeline('explain_document_ml', lang='en')

Steps to Reproduce

  1. Pull and run the docker
  2. Run notebook https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/explain-document-pipeline/explain_document.ipynb
  3. Try to launch cell No. 4. You will get the same download exception as in #38.
    But here it looks like a simple typo, because in another notebook, when I use PretrainedPipeline('explain_document_dl', lang='en') instead of PretrainedPipeline('explain_document_ml', lang='en'), everything works fine (see the sketch below). Maybe you just need to change ml to dl here. However, it's up to you to decide whether it's a typo or not.
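
For reference, the two calls in question (a minimal sketch; only the pipeline name differs):

from sparknlp.pretrained import PretrainedPipeline

# fails inside this Docker image with the download exception from #38
# pipeline = PretrainedPipeline('explain_document_ml', lang='en')

# works in the same environment
pipeline = PretrainedPipeline('explain_document_dl', lang='en')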

Your Environment

  • Spark-NLP version: 2.0.3
  • Apache Spark version: 2.4.1
  • Operating System and version: The latest Docker
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): I have pulled the latest docker as described on the main page.

Incorrect column name in explain-document-ml

Description

In explain-document-ml.ipynb the code tries to read a column using the name "lemma", but the correct name is "lemmas".

The cell preceding that cell is throwing an exception in the committed code, as seen on the GitHub page for it.

The version shown in the run committed on GitHub is 2.4.2. Could you please try all the examples with the latest version?

Steps to Reproduce

Run the notebook up to the cell

result.select("lemma.result").show(1, False)

which will fail. Changing the cell to

result.select("lemmas.result").show(1, False)

succeeds.

Your Environment

  • Spark-NLP version: 2.4.5
  • Apache Spark version: 2.4.5
  • Operating System and version: Mint Linux, compatible with Ubuntu 18.04.
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Jupyter

Could you guys take a look at the AssertionDLApproach class parameters labelCol and targetCol?

This Java code:

AssertionDLApproach assertionDLApproach = new AssertionDLApproach();
assertionDLApproach.setLabelCol("my_labels");

generates this error (for setLabelCol and setTargetCol):

java.util.NoSuchElementException: Param my_labels does not exist.
	at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
	at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.ml.param.Params$class.getParam(params.scala:728)
	at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42)
	at org.apache.spark.ml.param.Params$class.set(params.scala:744)
	at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:42)
	at com.johnsnowlabs.nlp.annotators.assertion.dl.AssertionDLApproach.setLabelCol(AssertionDLApproach.scala:65)
	at com.idexx.nlp.spark.jsl.ClinicalJslPipelineJobTests.testAssertionDLTraining(ClinicalJslPipelineJobTests.java:105)

Typo in jupyter/training/english/crf-ner/ner.ipynb

Typo in the section "# Download Glove Word Embeddings"

Description

prinnt("Unzipping the files now.") should be replaced with print("Unzipping the files now.").

Steps to Reproduce

Run training/english/crf-ner/ner.ipynb from the latest Docker image.

(Screenshot from 2019-06-25 17-44-57 attached.)

Error downloading pretrained pipeline

I'm running the Jupyter notebook with Docker, but when executing

pipeline = PretrainedPipeline('explain_document_dl')

there is a No such file or directory error. The error messages are included below:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<timed exec> in <module>

/usr/lib/python3.6/site-packages/sparknlp/pretrained.py in __init__(self, name, lang, remote_loc)
     28 
     29     def __init__(self, name, lang='en', remote_loc=None):
---> 30         self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
     31         self.light_model = LightPipeline(self.model)
     32 

/usr/lib/python3.6/site-packages/sparknlp/pretrained.py in downloadPipeline(name, language, remote_loc)
     16     @staticmethod
     17     def downloadPipeline(name, language, remote_loc=None):
---> 18         j_obj = _internal._DownloadPipeline(name, language, remote_loc).apply()
     19         jmodel = JavaModel(j_obj)
     20         return jmodel

/usr/lib/python3.6/site-packages/sparknlp/internal.py in __init__(self, name, language, remote_loc)
     63     def __init__(self, name, language, remote_loc):
     64         super(_DownloadPipeline, self).__init__("com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline")
---> 65         self._java_obj = self._new_java_obj(self._java_obj, name, language, remote_loc)
     66 
     67 

/usr/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
     65             java_obj = getattr(java_obj, name)
     66         java_args = [_py2java(sc, arg) for arg in args]
---> 67         return java_obj(*java_args)
     68 
     69     @staticmethod

/usr/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline.
: java.lang.UnsatisfiedLinkError: /tmp/tensorflow_native_libraries-1553289340774-0/libtensorflow_jni.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/tensorflow_native_libraries-1553289340774-0/libtensorflow_jni.so)
	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
	at java.lang.Runtime.load0(Runtime.java:809)
	at java.lang.System.load(System.java:1086)
	at org.tensorflow.NativeLibrary.load(NativeLibrary.java:101)
	at org.tensorflow.TensorFlow.init(TensorFlow.java:66)
	at org.tensorflow.TensorFlow.<clinit>(TensorFlow.java:70)
	at org.tensorflow.Graph.<clinit>(Graph.java:361)
	at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.readGraph(TensorflowWrapper.scala:98)
	at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.read(TensorflowWrapper.scala:172)
	at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel$class.readTensorflowModel(TensorflowSerializeModel.scala:57)
	at com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel$.readTensorflowModel(NerDLModel.scala:97)
	at com.johnsnowlabs.nlp.annotators.ner.dl.ReadsNERGraph$class.readNerGraph(NerDLModel.scala:84)
	at com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel$.readNerGraph(NerDLModel.scala:97)
	at com.johnsnowlabs.nlp.annotators.ner.dl.ReadsNERGraph$$anonfun$2.apply(NerDLModel.scala:88)
	at com.johnsnowlabs.nlp.annotators.ner.dl.ReadsNERGraph$$anonfun$2.apply(NerDLModel.scala:88)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead$1.apply(ParamsAndFeaturesReadable.scala:31)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead$1.apply(ParamsAndFeaturesReadable.scala:30)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$class.com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead(ParamsAndFeaturesReadable.scala:30)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$read$1.apply(ParamsAndFeaturesReadable.scala:41)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$read$1.apply(ParamsAndFeaturesReadable.scala:41)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:19)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:652)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:274)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:272)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:272)
	at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
	at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:134)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:128)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadPipeline(ResourceDownloader.scala:197)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Are the notebooks supposed to run without errors in Docker? Thanks!
