deepset-ai / haystack-tutorials
Here you can find all the Tutorials for Haystack 📚
Home Page: https://haystack.deepset.ai/tutorials
License: Apache License 2.0
We will need to include 'time to complete' and 'date created' in the frontmatter of each tutorial if we want to display these as tags on the tutorial overview page. I suggest adding these as fields for each tutorial in index.toml
and then adding them to the frontmatter in generate_markdowns.py
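A minimal sketch of how generate_markdowns.py could pick up the new fields. The field names (time_to_complete, date_created) and the entry structure are assumptions for illustration, not the actual script's code:

```python
# Hypothetical sketch: copy the two proposed index.toml fields into the
# frontmatter emitted by generate_markdowns.py. The field names and the
# shape of the index.toml entry are assumptions.
def build_frontmatter(entry: dict) -> dict:
    frontmatter = {"title": entry["title"]}
    # Only include the optional fields when the tutorial's index.toml entry
    # actually defines them.
    for field in ("time_to_complete", "date_created"):
        if field in entry:
            frontmatter[field] = entry[field]
    return frontmatter
```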
Now the steps are complicated, and there are optional paths. Let's simplify it
https://haystack.deepset.ai/tutorials/05_evaluation
Describe the bug
Following tutorial 6 and tutorial 7 in sequence on the same Colab runtime results in a FAISSDocumentStore error due to the presence of the already-created faiss_document_store.db.
Related to this issue deepset-ai/haystack#1903
Error message
FAISSDocumentStore: number of documents present in the SQL database does not match the number of embeddings in FAISS
Expected behavior
Either a note in the notebook mentioning that this might be expected behaviour, and/or additional if/else code that automatically handles this.
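One possible shape for that handling, as a hedged sketch (the database file name comes from the error context above; the helper itself is invented):

```python
import os

def remove_stale_faiss_db(path: str = "faiss_document_store.db") -> None:
    # Hypothetical handling sketch: delete the leftover SQL database from a
    # previous tutorial run before creating a fresh FAISSDocumentStore, so the
    # document count and the (empty) FAISS index start out consistent.
    if os.path.exists(path):
        os.remove(path)

# The notebook could then call remove_stale_faiss_db() right before
# constructing the FAISSDocumentStore.
```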
Additional context
None
To Reproduce
Run Tutorial 6, then copy the relevant bits of code from Tutorial 7 into the same Colab.
FAQ Check
System: Google Colab
Describe the issue
Fine-tuning a Model on Your Own Data
Part 2:
Downloading very small dataset to make tutorial faster (please use a bigger dataset for real use cases)
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
A file download error occurs because squad_small.json could not be downloaded.
To Reproduce
Just run the notebook with Jupyter and the error occurs.
Expected behavior
Should execute without error.
What environment did you try to run the tutorial on?:
Additional context
should be changed to:
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"
fetch_archive_from_http(url=s3_url, output_dir="data/temp")
!cp data/temp/*.json .
Create a tutorial that guides users on different ways to generate their own annotations using the annotation tool, and also in evaluation mode in the Streamlit UI. This might take the form of a blog post or video that should be linked to from the repository.
Describe the issue
Hello. I'm testing the first tutorial as it is with around 5000 text files, some are 1 page some are 15 pages long.
When the answer is getting printed I get this error.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
prediction = pipe.run( query="how many people attended the last concert?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}} )
Inferencing Samples: 100%|████████████████████████████████████████| 46/46 [00:38<00:00, 1.20 Batches/s]
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-26524, -26520) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-6105, -6102) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-24692, -24689) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-32411, -32404) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-27332, -27325) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-32379, -32373) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-30646, -30628) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-30307, -30297) with a span answer.
To Reproduce
Tutorial 1, with 5000 text files, some are 1 page some are 15 pages long.
Expected behavior
with the default Game of Thrones dataset I didn't see this issue, can you please help me fix this? Many thanks.
What environment did you try to run the tutorial on?:
- OS: Ubuntu 20
- Firefox
import haystack
haystack.__version__
'1.11.0rc0'
Additional context
I suspect the problem is not specific to the first notebook, and that is why some unusual content gets printed as the result.
With deepset-ai/haystack#2887, we replaced DPR with EmbeddingRetriever
in Tutorial 06.
Now, we might want to do the same for Tutorial 09 which covers training (or fine-tuning) a DPR Retriever model.
Q1. Should we go ahead with this switch? Any reason keeping DPR might be better?
Alternatively, we could create one for each. I guess depends on which we want to demonstrate plus what we think might be valuable for users.
Only the sentence-encoder variant of EmbeddingRetriever can be trained. Its train method does some data setup and then calls the fit method on SentenceTransformer (from the sentence_transformers package).
Input data format is:
[
    {"question": ..., "pos_doc": ..., "neg_doc": ..., "score": ...},
    ...
]
It uses MarginMSELoss (as part of the GPL procedure).
Q2. If we were to demonstrate its training, which data would be best to use? GPL et al. seem to use MS MARCO, but then we need cross-encoder scores for the score field above, right? So there doesn't seem to be a download-and-use dataset available?
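To make the expected input concrete, here is a minimal hypothetical example of the training-data structure described above. The texts and the score are invented placeholders; in practice the score would come from a cross-encoder, as in GPL:

```python
# Hypothetical example of the EmbeddingRetriever training-data format
# described above. Texts and score are invented placeholders.
training_data = [
    {
        "question": "What is Haystack?",
        "pos_doc": "Haystack is an open-source framework for building search systems.",
        "neg_doc": "Berlin is the capital of Germany.",
        "score": 0.92,  # would normally be a cross-encoder score
    },
]

# With an initialized sentence-encoder EmbeddingRetriever, training would
# then look roughly like: retriever.train(training_data=training_data)
```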
RFC: @brandenchan @vblagoje @agnieszka-m (please loop in anyone else if necessary)
cc: @mkkuemmel
We can have some tutorials around other DocumentStores, but to show the other features, we should start using InMemoryDocumentStore
The Audio tutorial (id=17) is failing in the nightlies. But it's failing because of a different issue than the issue we have on Colab.
However, the audio node is moving out to haystack-extras repo and will be installed via a different package so let's fix this tutorial or the test in conjunction with that node.
@ZanSara I let you make the call on whether we should skip this test for now
We will need help from you and possibly @silvanocerza to fix the tutorial once the package for haystack-extras is ready
Hello π!
Since the xpdf dependency was removed in deepset-ai/haystack#4314, the xpdf installation instructions/commands should be removed from tutorials 8 and 16.
We recently created a German Question Answering Dataset and also a German Dense Passage Retrieval dataset, along with trained models for each.
It would be great to have a tutorial (something along the lines of Tutorial 1) that allows users to start playing around with these models!
Describe the tutorial you would like to see here
At this point in the tutorial we state that we're going to start Elasticsearch, but not why we're doing this. What's the objective of this Index section of the lab?
[x] I've checked the existing tutorials
Describe the issue
With PR #40, the Colab button is no longer visible when looking at the tutorial files on GitHub. In my opinion, we should add them again.
Here is the old view with the button:
https://github.com/deepset-ai/haystack-tutorials/blob/1b592d5791e711489a6d25a4ff8f0a7160b174d1/tutorials/01_Basic_QA_Pipeline.ipynb
And the new one without the button:
https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/01_Basic_QA_Pipeline.ipynb
@TuanaCelik if we don't add the button linking to colab I'd assume that fewer users will try running the code.
As an alternative to adding the button, maybe we could at least add a link?
We have Tutorial 5 on Evaluation, working with QA and passage search eval.
The pipeline.eval is structured to work with other nodes as well, but it would be good to have examples that people can base their work on, e.g. for generative QA or tableQA.
Other methods like FAQ matching, query + doc classifiers, summarization and translation need different labels (I think). We should create dedicated tutorials for those. FAQ matching would be a good next candidate.
Opened issue in error
Describe the tutorial you would like to see here
The PromptNode works with models that have an invocation layer. There is functionality to register an invocation layer. We should have a tutorial that shows you how to do this.
[x] I've checked the existing tutorials
As discussed with @mkkuemmel - Tutorial 14 is quite complicated to follow and explains a few things in parallel that could possibly be made simpler by splitting the tutorial into a few:
Or another idea would be to extend the markdown explanations in the current tutorial.
Describe the issue
Tutorial "Better Retrieval with Embedding Retrieval" gives an error when run with MilvusDocumentStore
---------------------------------------------------------------------------
ContextualVersionConflict Traceback (most recent call last)
<ipython-input-4-b2f9ac6965f8> in <module>
6
7 from haystack.utils import launch_milvus
----> 8 from haystack.document_stores import MilvusDocumentStore
9
10 launch_milvus()
9 frames
/usr/local/lib/python3.9/dist-packages/pkg_resources/__init__.py in resolve(self, requirements, env, installer, replace_conflicting, extras)
798 # Oops, the "best" so far conflicts with a dependency
799 dependent_req = required_by[req]
--> 800 raise VersionConflict(dist, req).with_context(dependent_req)
801
802 # push the new requirements onto the stack
ContextualVersionConflict: (grpcio 1.51.3 (/usr/local/lib/python3.9/dist-packages), Requirement.parse('grpcio<=1.48.0,>=1.47.0'), {'pymilvus'})
To Reproduce
Run "Better Retrieval with Embedding Retrieval"
Expected behavior
No error
What environment did you try to run the tutorial on?:
Additional context
N/A
CsvTextConverter will be included in release 1.13. We should update Tut04 to show how to use it in an indexing pipeline.
Describe the issue
I tried running the tutorial directly from Github https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/16_Document_Classifier_at_Index_Time.ipynb
The tutorial fails with:
To Reproduce
Click here
Expected behavior
Tutorial succeeds.
What environment did you try to run the tutorial on?:
Google Colab
Additional context
Add any other context about the problem here.
Some tutorials (02_finetune_a_model_on_your_data.ipynb etc) are excluded from the tests. What's the reason for this? Can we add them to nightly tests as well?
Formatting of some tutorials is off.
This is a heading issue; it will probably all be fixed by using the correct level of titles. Only h1 and h2 are possible for the ToC.
This issue is better fixed after PR #44 is ready and the new format for tutorials is settled.
Describe the issue
Tutorial: https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/14_Query_Classifier.ipynb
An exception is thrown at lines with keyword_classifier.run(query=query)
...
File ~/miniconda3/envs/haystack/lib/python3.9/site-packages/haystack/pipelines/base.py:529, in Pipeline.run(self, query, file_paths, labels, documents, meta, params, debug)
525 except Exception as e:
526 # The input might be a really large object with thousands of embeddings.
527 # If you really want to see it, raise the log level.
528 logger.debug("Exception while running node '%s' with input %s", node_id, node_input)
--> 529 raise Exception(
530 f"Exception while running node '{node_id}': {e}\nEnable debug logging to see the data that was passed when the pipeline failed."
531 ) from e
532 queue.pop(node_id)
533 #
Exception: Exception while running node 'QueryClassifier': 'GradientBoostingClassifier' object has no attribute '_loss'
What environment did you try to run the tutorial on?:
Update the TableQA tutorial to reflect that the linearized offsets will be deprecated in favor of offsets that specify the row and column indices of the table cell.
TableCell was implemented in this PR deepset-ai/haystack#4616 and will be included in Haystack v1.16.
Describe the issue
Running the tutorial I get the following error:
AttributeError: module 'PIL.Image' has no attribute 'Resampling'
To Reproduce
Run this tutorial in Colab and run the cells in order until you see the error.
Expected behavior
No error :)
What environment did you try to run the tutorial on?:
main
Additional context
Add any other context about the problem here.
Describe the issue
When the generate_markdowns.py script is run, Tutorial 18 gives this warning:
/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/nbformat/__init__.py:92: MissingIDFieldWarning: Code cell is missing an id field, this will become a hard error in future nbformat versions. You may want to use `normalize()` on your notebooks before validations (available since nbformat 5.1.4). Previous versions of nbformat are fixing this issue transparently, and will stop doing so in the future.
validate(nb)
I encountered this only with Tutorial 18.
To Reproduce
Run python scripts/generate_markdowns.py --index index.toml --notebooks tutorials/18_GPL.ipynb --output markdowns/
Expected behavior
Although I couldn't figure out why this happens, this warning might be important. We should check it.
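As a possible direction for a fix, here is a sketch that adds the missing id fields before validation. The helper is invented based on the warning text above, not taken from the actual generate_markdowns.py code:

```python
import uuid

def add_missing_cell_ids(notebook_dict: dict) -> dict:
    # Hypothetical sketch: give every cell an "id" field, which is what the
    # MissingIDFieldWarning above complains about (nbformat requires cell ids
    # from format version 4.5 onwards).
    for cell in notebook_dict.get("cells", []):
        cell.setdefault("id", uuid.uuid4().hex[:8])
    return notebook_dict
```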
What environment did you try to run the tutorial on?:
pip install --upgrade pip and pip install -r requirements.txt are run.
Describe the tutorial you would like to see here
As a follow-up to the tutorial mentioned in #112, I suggest we have a second one where we use PromptNode as a node in a full pipeline, together with a Shaper.
[x] I've checked the existing tutorials
prompt_text becomes prompt in the next release (1.18). The tutorials using PromptTemplate should be updated.
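As an illustration of the rename, a hypothetical helper (not part of Haystack) that picks the right keyword argument name; the 1.18 cut-off is taken from the note above:

```python
# Hypothetical helper illustrating the rename described above: build the
# keyword arguments for PromptTemplate depending on the Haystack version.
# The cut-off (1.18) comes from the note above; the helper itself is invented.
def prompt_template_kwargs(template_string: str, haystack_version: str) -> dict:
    major, minor = (int(part) for part in haystack_version.split(".")[:2])
    key = "prompt" if (major, minor) >= (1, 18) else "prompt_text"
    return {key: template_string}
```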
There is a broken link in tutorial 11 (pipeline tutorial) in the section about the TranslationWrapperPipeline:
translated search (TranslationWrapperPipeline) To find out more about these pipelines, have a look at our documentation
Describe the issue
In tutorial 2, Fine-tuning a Model on Your Own Data, the fetch_archive_from_http tool is called twice with the same output_dir (here and here). This can't work because it only downloads if the folder is empty (see the implementation and warning here).
Moreover, the paths are incorrect when calling augment_squad.py.
I will push a PR with a fix.
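The skip-if-non-empty behavior described above can be sketched like this. It is a simplified stand-in for the check inside fetch_archive_from_http, whose exact implementation details are an assumption here:

```python
import os

def needs_download(output_dir: str) -> bool:
    # Simplified stand-in for the check described above: the archive is only
    # fetched when the output directory is missing or empty, which is why
    # reusing the same output_dir for a second dataset silently skips the
    # second download.
    return not os.path.isdir(output_dir) or not os.listdir(output_dir)
```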
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen:
What environment did you try to run the tutorial on?:
Additional context
Add any other context about the problem here.
Describe the issue
Trying to launch the REST API with MilvusDocumentStore and DPR retrieval, but I'm not able to run the query endpoint.
I'm using a separate script for indexing.
When loading my_pipeline.yml with Python and running pipeline.run(query), it works fine, but when I launch the REST API, the query endpoint sends this:
File "/opt/venv/lib/python3.10/site-packages/rest_api/controller/search.py", line 67, in _process_request
result = pipeline.run(query=request.query, params=params, debug=request.debug)
AttributeError: 'NoneType' object has no attribute 'run'
To Reproduce
pipeline.yml
components:
My docker-compose.yml for the REST API:
version: '3.5'
services:
  haystack-api:
    # build:
    #   context: .
    #   dockerfile: ./haystack.Dockerfile
    image: "deepset/haystack:cpu"
    volumes:
      - ./rest_api/rest_api/pipeline:/opt/pipelines
    ports:
      - 8000:8000
    restart: on-failure
    environment:
      - PIPELINE_YAML_PATH=/opt/pipelines/my_pipeline.yml
      - TOKENIZERS_PARALLELISM=false
      - HAYSTACK_TELEMETRY_ENABLED=false
Additional context
The initialize endpoint sends 200 with True. Maybe I have something missing in my docker-compose or pipeline.yml, but as I said, the pipeline.yml works fine with a Python script.
Hi! The docs look great! I was wondering if you'd be open enough to let us know what documentation framework you're using?
Install one extra dependency in the tutorials with ElasticsearchDocumentStore: pip install farm-haystack[elasticsearch]. Check the Haystack PR for details.
This is also an opportunity to switch to InMemoryDocumentStore for some of them -> Issue #153
Describe the issue
In the FAQ style QA tutorial, in "Init the document store", it says you need to specify the name of "text_field" in elasticsearch, but then in the code sample this field is not listed.
To Reproduce
Open the tutorial and go to section "Initiate the DocumentStore".
Expected behavior
The code sample matches the description.
What environment did you try to run the tutorial on?:
Additional context
Message from Julian: The parameter is optional. It's called content_field and its default value is content. Here is a link to the code.
elasticsearch.py
:param content_field: Name of field that might contain the answer and will therefore be passed to the Reader Model (e.g. "full_text").
Julian Risch
I'd suggest we leave out the bullet point "specify the name of our text_field in Elasticsearch that we want to return as an answer" from the tutorial, and then it's okay.
Describe the tutorial you would like to see here
While we provide tutorials on GPL and DPR training (unsupervised and supervised Retriever training, respectively), there is no tutorial on how to use the train method of the EmbeddingRetriever.
I think for many users, training the EmbeddingRetriever with their annotated data could be the most straightforward way to improve performance (instead of using GPL or switching to DPR).
[x] I've checked the existing tutorials
Create a workflow that generates a link to a Haystack website preview of the changes in the tutorial.
Describe the issue
When using EvaluationResult.load(), you get a SyntaxError:
Traceback (most recent call last):
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-21-bc256748e1c5>", line 1, in <module>
saved_eval_result = EvaluationResult.load("../")
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/haystack/schema.py", line 1631, in load
node_results = {file.stem: pd.read_csv(file, **read_csv_kwargs) for file in csv_files}
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/haystack/schema.py", line 1631, in <dictcomp>
node_results = {file.stem: pd.read_csv(file, **read_csv_kwargs) for file in csv_files}
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 460, in _read
data = parser.read(nrows)
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 1198, in read
ret = self._engine.read(nrows)
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 2157, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1051, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 2139, in pandas._libs.parsers._apply_converter
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/ast.py", line 62, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/ast.py", line 50, in parse
return compile(source, filename, mode, flags,
File "<unknown>", line unknown
^
SyntaxError: unexpected EOF while parsing
To Reproduce
Steps to reproduce the behavior: Run Tutorial 5
Expected behavior
Loading the previously saved evaluation result.
What environment did you try to run the tutorial on?:
We won't store markdowns anymore, and we need to make this clear in Contributing.md.
Describe the tutorial you would like to see here
I will create a second issue for a more advanced tutorial. However, for the PromptNode basics I would suggest we start with:
- PromptTemplates
- PromptTemplate
@agnieszka-m please add any input about how this can be 'task' oriented.
Additional context
I think using the PromptNode in a pipeline is slightly more advanced and could be too distracting to have an intro tutorial that covers this. So I suggest a separate one for that.
[x] I've checked the existing tutorials
With the new S3 bucket https://core-engineering.s3.eu-central-1.amazonaws.com/public/ and its public
folder, we should move and possibly also rename all datasets used in the tutorials.
There are individual copies of some datasets for each tutorial to facilitate telemetry. We need to decide on a naming scheme. I would be okay with a number as a suffix, just like we did until now, but maybe we can come up with an alternative? The downside of the number is that it might not stay in sync with the order of the tutorials on our website and the separation into beginner/intermediate/advanced tutorials.
This is how it's currently done: https://github.com/deepset-ai/haystack/blob/ddeaf2c98c157af1e26c637bcb563c6ea52fdcb7/haystack/telemetry.py#L187
"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip": "1",
What do you think? @brandenchan @bilgeyucel
Changes are needed in Haystack to make sure telemetry continues working. There is an issue for that in Haystack here: deepset-ai/haystack#3634
Problem
Currently only tutorials that can run on CPU are executed nightly. However, we noted that the excluded tutorials are very important, as they execute code areas that are not covered by any test (#2885, deepset-ai/haystack#2881, deepset-ai/haystack#2886).
We should set up self-hosted runners with GPUs that are capable of running such tutorials, to ensure the same level of confidence as the other tutorials already have.
Related:
The Agent tutorial has no "About" section, but the other tutorials do. We should add such a section to the tutorial. Something like:
About us
This Haystack notebook was made with love by deepset in Berlin, Germany. We bring NLP to the industry via open source!
Our focus: industry-specific language models & large-scale QA systems. Some of our other work:
German BERT
GermanQuAD and GermanDPR
Get in touch: Twitter | LinkedIn | Discord | GitHub Discussions | Website
By the way: we're hiring!
Describe the issue
When running tutorial 11, 4, or 15 in a GPU environment, the notebook does not finish running the cell where Elasticsearch is set up. Tutorials 16 and 17 should have the same problem, looking at the code, but I didn't test them.
%%bash
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2
sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch -d
Tutorials 1, 2, 3, 6, 7, 8 worked for me though.
To Reproduce
Run tutorial 11 or 4 on Colab with a GPU environment (the environment probably doesn't make a difference).
Expected behavior
The notebook should successfully set up the Elasticsearch service on Colab and continue execution with the next cell.
What environment did you try to run the tutorial on?:
Additional context
No changes made to the tutorial code.
So that we don't miss any bugs coming with the new release.
might be solved with #197
Describe the tutorial you would like to see here
A tutorial around WhisperTranscriber. An idea could be to transcribe YouTube videos and summarize them.
Haystack has WhisperTranscriber starting from v1.15.
[x] I've checked the existing tutorials
This should be done once this PR deepset-ai/haystack#5028 is merged and we have base images with every release.
Any other ideas @silvanocerza?
@mayankjobanputra @bilgeyucel FYI, we noticed that the title formats are off in this tutorial: https://haystack.deepset.ai/tutorials/19_text_to_image_search_pipeline_with_multimodal_retriever
So I'm creating this issue to fix it. I can create a PR tomorrow and add you as reviewer
In the tutorial 09_DPR_training.ipynb, how can we add evaluation metrics when fine-tuning DPR? Is there any code provided, like the acc/f1/loss in Tutorial 09?
Describe the issue
Let's upgrade to torch 1.13.0 (and to 1.13.1 as soon as it is released) to prepare for speed improvements coming with version 2.0.
@mayankjobanputra tracked changes between 1.13.1 and 2.0. It seems that the changes from there aren't big. If we upgrade to 1.13 now it will hopefully allow us to make the jump to 2.0 quickly.
Additional context
Add any other context about the problem here.