deepset-ai / haystack-tutorials
Here you can find all the Tutorials for Haystack 📚
Home Page: https://haystack.deepset.ai/tutorials
License: Apache License 2.0
We will need to include 'time to complete' and 'date created' in the frontmatter of each tutorial if we want to display these as tags on the tutorial overview page. I suggest adding these as fields for each tutorial in index.toml
and then adding them to the frontmatter in generate_markdowns.py
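A minimal sketch of how generate_markdowns.py could pick up the new fields. The field names (time_to_complete, date_created) and the entry structure are assumptions for illustration, not the actual script's code:

```python
# Hypothetical sketch: copy the two proposed index.toml fields into the
# frontmatter emitted by generate_markdowns.py. The field names and the
# shape of the index.toml entry are assumptions.
def build_frontmatter(entry: dict) -> dict:
    frontmatter = {"title": entry["title"]}
    # Only include the optional fields when the tutorial's index.toml entry
    # actually defines them.
    for field in ("time_to_complete", "date_created"):
        if field in entry:
            frontmatter[field] = entry[field]
    return frontmatter
```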
Now the steps are complicated, and there are optional paths. Let's simplify it
https://haystack.deepset.ai/tutorials/05_evaluation
Describe the bug
Following tutorial 6 and tutorial 7 in sequence on the same Colab runtime results in a FAISSDocumentStore error due to the presence of the already-created faiss_document_store.db.
Related to this issue deepset-ai/haystack#1903
Error message
FAISSDocumentStore: number of documents present in the SQL database does not match the number of embeddings in FAISS
Expected behavior
Either a note in the notebook mentioning that this might be expected behaviour, and/or additional if/else code that automatically handles this.
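One possible shape for that handling, as a hedged sketch (the database file name comes from the error context above; the helper itself is invented):

```python
import os

def remove_stale_faiss_db(path: str = "faiss_document_store.db") -> None:
    # Hypothetical handling sketch: delete the leftover SQL database from a
    # previous tutorial run before creating a fresh FAISSDocumentStore, so the
    # document count and the (empty) FAISS index start out consistent.
    if os.path.exists(path):
        os.remove(path)

# The notebook could then call remove_stale_faiss_db() right before
# constructing the FAISSDocumentStore.
```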
Additional context
None
To Reproduce
Run Tutorial 6, then copy the relevant bits of code from Tutorial 7 into the same Colab.
FAQ Check
System: Google Colab
Describe the issue
Fine-tuning a Model on Your Own Data
Part 2:
Downloading very small dataset to make tutorial faster (please use a bigger dataset for real use cases)
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
A file download error occurs because squad_small.json could not be downloaded.
To Reproduce
Just run the notebook with Jupyter and the error occurs.
Expected behavior
Should execute without error.
What environment did you try to run the tutorial on?:
Additional context
should be changed to:
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"
fetch_archive_from_http(url=s3_url, output_dir="data/temp")
!cp data/temp/*.json .
Create a tutorial that guides users on different ways to generate their own annotations using the annotation tool, and also in evaluation mode in the Streamlit UI. This might take the form of a blog post or video that should be linked to from the repository.
Describe the issue
Hello. I'm testing the first tutorial as it is with around 5000 text files, some are 1 page some are 15 pages long.
When the answer is getting printed I get this error.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
prediction = pipe.run( query="how many people attended the last concert?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}} )
Inferencing Samples: 100%|████████████████████████████████████████| 46/46 [00:38<00:00, 1.20 Batches/s]
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-26524, -26520) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-6105, -6102) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-24692, -24689) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-32411, -32404) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-27332, -27325) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-32379, -32373) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-30646, -30628) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-30307, -30297) with a span answer.
To Reproduce
Tutorial 1, with 5000 text files, some are 1 page some are 15 pages long.
Expected behavior
with the default Game of Thrones dataset I didn't see this issue, can you please help me fix this? Many thanks.
What environment did you try to run the tutorial on?:
- OS: Ubuntu 20
- Firefox
import haystack
haystack.__version__
'1.11.0rc0'
Additional context
I suspect the problem is not specific to the first notebook, and that is why some unusual content gets printed as the result.
With deepset-ai/haystack#2887, we replaced DPR with EmbeddingRetriever
in Tutorial 06.
Now, we might want to do the same for Tutorial 09 which covers training (or fine-tuning) a DPR Retriever model.
Q1. Should we go ahead with this switch? Any reason keeping DPR might be better?
Alternatively, we could create one for each. I guess depends on which we want to demonstrate plus what we think might be valuable for users.
Only the sentence-encoder variant of EmbeddingRetriever can be trained. Its train method does some data setup and then calls the fit method on SentenceTransformer (from the sentence_transformers package).
Input data format is:
[
    {"question": ..., "pos_doc": ..., "neg_doc": ..., "score": ...},
    ...
]
It uses MarginMSELoss (as part of the GPL procedure).
Q2. If we were to demonstrate its training, which data would be best to use? GPL et al. seem to use MS MARCO, but then we need cross-encoder scores for the score field above, right? So there doesn't seem to be a download-and-use dataset available?
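To make the expected input concrete, here is a minimal hypothetical example of the training-data structure described above. The texts and the score are invented placeholders; in practice the score would come from a cross-encoder, as in GPL:

```python
# Hypothetical example of the EmbeddingRetriever training-data format
# described above. Texts and score are invented placeholders.
training_data = [
    {
        "question": "What is Haystack?",
        "pos_doc": "Haystack is an open-source framework for building search systems.",
        "neg_doc": "Berlin is the capital of Germany.",
        "score": 0.92,  # would normally be a cross-encoder score
    },
]

# With an initialized sentence-encoder EmbeddingRetriever, training would
# then look roughly like: retriever.train(training_data=training_data)
```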
RFC: @brandenchan @vblagoje @agnieszka-m (please loop in anyone else if necessary)
cc: @mkkuemmel
We can have some tutorials around other DocumentStores, but to show the other features, we should start using InMemoryDocumentStore
The Audio tutorial (id=17) is failing in the nightlies. But it's failing because of a different issue than the issue we have on Colab.
However, the audio node is moving out to haystack-extras repo and will be installed via a different package so let's fix this tutorial or the test in conjunction with that node.
@ZanSara I let you make the call on whether we should skip this test for now
We will need help from you and possibly @silvanocerza to fix the tutorial once the package for haystack-extras is ready
Hello π!
Since the xpdf dependency was removed in deepset-ai/haystack#4314, the xpdf installation instructions/commands should be removed from tutorials 8 and 16.
We recently created a German Question Answering Dataset and also a German Dense Passage Retrieval dataset, along with trained models for each.
It would be great to have a tutorial (something along the lines of Tutorial 1) that allows users to start playing around with these models!
Describe the tutorial you would like to see here
At this point in the tutorial we state that we're going to start Elasticsearch, but not why we're doing this. What's the objective of this Index section of the lab?
[x] I've checked the existing tutorials
Describe the issue
With PR #40, the Colab button is no longer visible when looking at the tutorial files on GitHub. In my opinion, we should add them again.
Here is the old view with the button:
https://github.com/deepset-ai/haystack-tutorials/blob/1b592d5791e711489a6d25a4ff8f0a7160b174d1/tutorials/01_Basic_QA_Pipeline.ipynb
And the new one without the button:
https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/01_Basic_QA_Pipeline.ipynb
@TuanaCelik if we don't add the button linking to colab I'd assume that fewer users will try running the code.
As an alternative to adding the button, maybe we could at least add a link?
We have Tutorial 5 on Evaluation, working with QA and passage search eval.
The pipeline.eval is structured to work with other nodes as well, but it would be good to have examples that people can base their work on, e.g. for generative QA or tableQA.
Other methods like FAQ matching, query + doc classifiers, summarization and translation need different labels (I think). We should create dedicated tutorials for those. FAQ matching would be a good next candidate.
Opened issue in error
Describe the tutorial you would like to see here
The PromptNode works with models that have an invocation layer. There is functionality to register an invocation layer. We should have a tutorial that shows you how to do this.
[x] I've checked the existing tutorials
As discussed with @mkkuemmel - Tutorial 14 is quite complicated to follow and explains a few things in parallel that could possibly be made simpler by splitting the tutorial into a few:
Or another idea would be to extend the markdown explanations in the current tutorial.
Describe the issue
Tutorial "Better Retrieval with Embedding Retrieval" gives an error when run with MilvusDocumentStore
---------------------------------------------------------------------------
ContextualVersionConflict Traceback (most recent call last)
<ipython-input-4-b2f9ac6965f8> in <module>
6
7 from haystack.utils import launch_milvus
----> 8 from haystack.document_stores import MilvusDocumentStore
9
10 launch_milvus()
9 frames
/usr/local/lib/python3.9/dist-packages/pkg_resources/__init__.py in resolve(self, requirements, env, installer, replace_conflicting, extras)
798 # Oops, the "best" so far conflicts with a dependency
799 dependent_req = required_by[req]
--> 800 raise VersionConflict(dist, req).with_context(dependent_req)
801
802 # push the new requirements onto the stack
ContextualVersionConflict: (grpcio 1.51.3 (/usr/local/lib/python3.9/dist-packages), Requirement.parse('grpcio<=1.48.0,>=1.47.0'), {'pymilvus'})
To Reproduce
Run "Better Retrieval with Embedding Retrieval"
Expected behavior
No error
What environment did you try to run the tutorial on?:
Additional context
N/A
CsvTextConverter will be included in release 1.13. We should update Tut04 to show how to use it in an indexing pipeline.
Describe the issue
I tried running the tutorial directly from Github https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/16_Document_Classifier_at_Index_Time.ipynb
The tutorial fails with:
To Reproduce
Click here
Expected behavior
Tutorial succeeds.
What environment did you try to run the tutorial on?:
Google Colab
Additional context
Add any other context about the problem here.
Some tutorials (02_finetune_a_model_on_your_data.ipynb etc) are excluded from the tests. What's the reason for this? Can we add them to nightly tests as well?
Formatting of some tutorials is off.
This is a heading issue; it will probably all be fixed by using the correct level of titles. Only h1 and h2 are possible for the ToC.
This issue is better fixed after PR #44 is ready and the new format for tutorials is settled.
Describe the issue
Tutorial: https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/14_Query_Classifier.ipynb
An exception is thrown at lines with keyword_classifier.run(query=query)
...
File ~/miniconda3/envs/haystack/lib/python3.9/site-packages/haystack/pipelines/base.py:529, in Pipeline.run(self, query, file_paths, labels, documents, meta, params, debug)
525 except Exception as e:
526 # The input might be a really large object with thousands of embeddings.
527 # If you really want to see it, raise the log level.
528 logger.debug("Exception while running node '%s' with input %s", node_id, node_input)
--> 529 raise Exception(
530 f"Exception while running node '{node_id}': {e}\nEnable debug logging to see the data that was passed when the pipeline failed."
531 ) from e
532 queue.pop(node_id)
533 #
Exception: Exception while running node 'QueryClassifier': 'GradientBoostingClassifier' object has no attribute '_loss'
What environment did you try to run the tutorial on?:
Update the TableQA tutorial to reflect that the linearized offsets will be deprecated in favor of offsets that specify the row and column indices of the table cell.
TableCell was implemented in this PR deepset-ai/haystack#4616 and will be included in Haystack v1.16.
Describe the issue
Running the tutorial I get the following error:
AttributeError: module 'PIL.Image' has no attribute 'Resampling'
To Reproduce
Run this tutorial in Colab and run the cells in order until you see the error.
Expected behavior
No error :)
What environment did you try to run the tutorial on?:
main
Additional context
Add any other context about the problem here.
Describe the issue
When the generate_markdowns.py script is run, Tutorial 18 gives this warning:
/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/nbformat/__init__.py:92: MissingIDFieldWarning: Code cell is missing an id field, this will become a hard error in future nbformat versions. You may want to use `normalize()` on your notebooks before validations (available since nbformat 5.1.4). Previous versions of nbformat are fixing this issue transparently, and will stop doing so in the future.
validate(nb)
I encountered this only with Tutorial 18.
To Reproduce
Run python scripts/generate_markdowns.py --index index.toml --notebooks tutorials/18_GPL.ipynb --output markdowns/
Expected behavior
Although I couldn't figure out why this happens, this warning might be important. We should check it.
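As a possible direction for a fix, here is a sketch that adds the missing id fields before validation. The helper is invented based on the warning text above, not taken from the actual generate_markdowns.py code:

```python
import uuid

def add_missing_cell_ids(notebook_dict: dict) -> dict:
    # Hypothetical sketch: give every cell an "id" field, which is what the
    # MissingIDFieldWarning above complains about (nbformat requires cell ids
    # from format version 4.5 onwards).
    for cell in notebook_dict.get("cells", []):
        cell.setdefault("id", uuid.uuid4().hex[:8])
    return notebook_dict
```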
What environment did you try to run the tutorial on?:
pip install --upgrade pip and pip install -r requirements.txt are run.
Describe the tutorial you would like to see here
As a follow-up to the tutorial mentioned in #112, I suggest we have a second one where we use PromptNode as a node in a full pipeline, together with a Shaper.
[x] I've checked the existing tutorials
prompt_text becomes prompt in the next release (1.18). The tutorials using PromptTemplate should be updated.
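As an illustration of the rename, a hypothetical helper (not part of Haystack) that picks the right keyword argument name; the 1.18 cut-off is taken from the note above:

```python
# Hypothetical helper illustrating the rename described above: build the
# keyword arguments for PromptTemplate depending on the Haystack version.
# The cut-off (1.18) comes from the note above; the helper itself is invented.
def prompt_template_kwargs(template_string: str, haystack_version: str) -> dict:
    major, minor = (int(part) for part in haystack_version.split(".")[:2])
    key = "prompt" if (major, minor) >= (1, 18) else "prompt_text"
    return {key: template_string}
```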
There is a broken link in tutorial 11 (pipeline tutorial) in the section about the TranslationWrapperPipeline:
translated search (TranslationWrapperPipeline) To find out more about these pipelines, have a look at our documentation
Describe the issue
In tutorial 2, Fine-tuning a Model on Your Own Data, the fetch_archive_from_http tool is called twice with the same output_dir (here and here). This can't work because it only downloads if the folder is empty (see the implementation and warning here).
Moreover, the paths are incorrect when calling augment_squad.py.
I will push a PR with a fix.
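The skip-if-non-empty behavior described above can be sketched like this. It is a simplified stand-in for the check inside fetch_archive_from_http, whose exact implementation details are an assumption here:

```python
import os

def needs_download(output_dir: str) -> bool:
    # Simplified stand-in for the check described above: the archive is only
    # fetched when the output directory is missing or empty, which is why
    # reusing the same output_dir for a second dataset silently skips the
    # second download.
    return not os.path.isdir(output_dir) or not os.listdir(output_dir)
```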
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen:
What environment did you try to run the tutorial on?:
Additional context
Add any other context about the problem here.
Describe the issue
Trying to launch the REST API with MilvusDocumentStore and DPR retrieval, but I'm not able to run the query endpoint.
I'm using a separate script for indexing.
When loading my_pipeline.yml with Python and running pipeline.run(query), it works fine, but when I launch the REST API, the query endpoint sends this:
File "/opt/venv/lib/python3.10/site-packages/rest_api/controller/search.py", line 67, in _process_request
result = pipeline.run(query=request.query, params=params, debug=request.debug)
AttributeError: 'NoneType' object has no attribute 'run'
To Reproduce
pipeline.yml
components:
My docker-compose.yml for the REST API:
version: '3.5'
services:
  haystack-api:
    # build:
    #   context: .
    #   dockerfile: ./haystack.Dockerfile
    image: "deepset/haystack:cpu"
    volumes:
      - ./rest_api/rest_api/pipeline:/opt/pipelines
    ports:
      - 8000:8000
    restart: on-failure
    environment:
      - PIPELINE_YAML_PATH=/opt/pipelines/my_pipeline.yml
      - TOKENIZERS_PARALLELISM=false
      - HAYSTACK_TELEMETRY_ENABLED=false
Additional context
The initialize endpoint sends 200 with True. Maybe I have something missing in my docker-compose or pipeline.yml, but as I said, the pipeline.yml works fine with a Python script.
Hi! The docs look great! I was wondering if you'd be open enough to let us know what documentation framework you're using?
Install one extra dependency in the tutorials with ElasticsearchDocumentStore: pip install farm-haystack[elasticsearch]. Check the Haystack PR for details.
This is also an opportunity to switch to InMemoryDocumentStore for some of them -> Issue #153
Describe the issue
In the FAQ style QA tutorial, in "Init the document store", it says you need to specify the name of "text_field" in elasticsearch, but then in the code sample this field is not listed.
To Reproduce
Open the tutorial and go to section "Initiate the DocumentStore".
Expected behavior
The code sample matches the description.
What environment did you try to run the tutorial on?:
Additional context
Message from Julian: The parameter is optional. It's called content_field and its default value is content. Here is a link to the code.
elasticsearch.py
:param content_field: Name of field that might contain the answer and will therefore be passed to the Reader Model (e.g. "full_text").
Julian Risch
I'd suggest we leave out the bullet point "specify the name of our text_field in Elasticsearch that we want to return as an answer" from the tutorial, and then it's okay.
Describe the tutorial you would like to see here
While we provide tutorials on GPL and DPR training (unsupervised and supervised Retriever training, respectively), there is no tutorial on how to use the train method of the EmbeddingRetriever.
I think for many users, training the EmbeddingRetriever with their annotated data could be the most straightforward way to improve performance (instead of using GPL or switching to DPR).
[x] I've checked the existing tutorials
Create a workflow that generates a link to a Haystack website preview of the changes in the tutorial.
Describe the issue
When using EvaluationResult.load(), you get a SyntaxError:
Traceback (most recent call last):
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-21-bc256748e1c5>", line 1, in <module>
saved_eval_result = EvaluationResult.load("../")
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/haystack/schema.py", line 1631, in load
node_results = {file.stem: pd.read_csv(file, **read_csv_kwargs) for file in csv_files}
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/haystack/schema.py", line 1631, in <dictcomp>
node_results = {file.stem: pd.read_csv(file, **read_csv_kwargs) for file in csv_files}
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 460, in _read
data = parser.read(nrows)
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 1198, in read
ret = self._engine.read(nrows)
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 2157, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1051, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 2139, in pandas._libs.parsers._apply_converter
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/ast.py", line 62, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/ast.py", line 50, in parse
return compile(source, filename, mode, flags,
File "<unknown>", line unknown
^
SyntaxError: unexpected EOF while parsing
To Reproduce
Steps to reproduce the behavior: Run Tutorial 5
Expected behavior
Loading the previously saved evaluation result.
What environment did you try to run the tutorial on?:
We won't store markdowns anymore, and we need to make this clear in Contributing.md.
Describe the tutorial you would like to see here
I will create a second issue for a more advanced tutorial. However, for the PromptNode basics I would suggest we start with:
- PromptTemplates
- PromptTemplate
@agnieszka-m please add any input about how this can be 'task' oriented.
Additional context
I think using the PromptNode in a pipeline is slightly more advanced and could be too distracting to have an intro tutorial that covers this. So I suggest a separate one for that.
[x] I've checked the existing tutorials
With the new S3 bucket https://core-engineering.s3.eu-central-1.amazonaws.com/public/ and its public
folder, we should move and possibly also rename all datasets used in the tutorials.
There are individual copies of some datasets for each tutorial to facilitate telemetry. We need to decide on a naming scheme. I would be okay with a number as a suffix, just like we did until now, but maybe we can come up with an alternative? The downside of the number is that it might not stay in sync with the order of the tutorials on our website and the separation into beginner/intermediate/advanced tutorials.
This is how it's currently done: https://github.com/deepset-ai/haystack/blob/ddeaf2c98c157af1e26c637bcb563c6ea52fdcb7/haystack/telemetry.py#L187
"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip": "1",
What do you think? @brandenchan @bilgeyucel
Changes are needed in Haystack to make sure telemetry continues working. There is an issue for that in Haystack here: deepset-ai/haystack#3634
Problem
Currently only tutorials that can run on CPU are executed nightly. However, we noted that the excluded tutorials are very important, as they execute code areas that are not covered by any test (#2885, deepset-ai/haystack#2881, deepset-ai/haystack#2886).
We should set up self-hosted runners with GPUs that are capable of running such tutorials, to ensure the same level of confidence as the other tutorials already have.
Related:
The Agent tutorial has no "About" section, but the other tutorials do. We should add such a section to the tutorial. Something like:
About us
This Haystack notebook was made with love by deepset in Berlin, Germany. We bring NLP to the industry via open source!
Our focus: industry-specific language models & large-scale QA systems. Some of our other work:
German BERT
GermanQuAD and GermanDPR
Get in touch: Twitter | LinkedIn | Discord | GitHub Discussions | Website
By the way: we're hiring!
Describe the issue
When running tutorial 11, 4, or 15 in a GPU environment, the notebook does not finish running the cell where Elasticsearch is set up. Tutorials 16 and 17 should have the same problem, looking at the code, but I didn't test them.
%%bash
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2
sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch -d
Tutorials 1, 2, 3, 6, 7, 8 worked for me though.
To Reproduce
Run tutorial 11 or 4 on Colab with a GPU environment (the environment probably doesn't make a difference).
Expected behavior
The notebook should successfully set up the Elasticsearch service on Colab and continue execution with the next cell.
What environment did you try to run the tutorial on?:
Additional context
No changes made to the tutorial code.
So that we don't miss any bugs coming with the new release.
might be solved with #197
Describe the tutorial you would like to see here
A tutorial around WhisperTranscriber. An idea could be to transcribe YouTube videos and summarize them.
Haystack has WhisperTranscriber starting from v1.15.
[x] I've checked the existing tutorials
This should be done once this PR deepset-ai/haystack#5028 is merged and we have base images with every release.
Any other ideas @silvanocerza?
@mayankjobanputra @bilgeyucel FYI, we noticed that the title formats are off in this tutorial: https://haystack.deepset.ai/tutorials/19_text_to_image_search_pipeline_with_multimodal_retriever
So I'm creating this issue to fix it. I can create a PR tomorrow and add you as reviewer
In the tutorial 09_DPR_training.ipynb, how can we add evaluation metrics when fine-tuning DPR? Is there any code provided, like the acc/f1/loss in Tutorial 09?
Describe the issue
Let's upgrade to torch 1.13.0 (and to 1.13.1 as soon as it is released) to prepare for speed improvements coming with version 2.0.
@mayankjobanputra tracked changes between 1.13.1 and 2.0. It seems that the changes from there aren't big. If we upgrade to 1.13 now it will hopefully allow us to make the jump to 2.0 quickly.
Additional context
Add any other context about the problem here.