This repo provides two search indexes. The first indexes each entire document; the second holds chunks of each document along with a vector field.
The idea is to use BM25 to search across entire documents, and then let a developer restrict the chunk search to the documents that the BM25 search ranks as most appropriate.
Chunking is cool, but cosine similarity will only take you so far.
BM25 takes into account the length of the entire document. Use the first search index to determine the correct documents, then run a filtered hybrid search over the chunks of the documents returned by that initial search.
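The second step can be sketched as building an OData filter from the document ids returned by the first search. This is only a minimal sketch: `build_chunk_filter` is a hypothetical helper, and `parent_id` is an assumed name for the chunk-index field that points back to the source document; the actual field name in this repo's vector index may differ.

```python
def build_chunk_filter(doc_ids, field="parent_id", delimiter="|"):
    """Build an OData filter that limits a chunk search to the given documents.

    `field` is an ASSUMED field name linking a chunk to its parent document.
    """
    ids = [str(doc_id) for doc_id in doc_ids]
    if not ids:
        raise ValueError("the first search returned no document ids")
    if any(delimiter in doc_id for doc_id in ids):
        raise ValueError("a document id contains the delimiter")
    # OData string literals escape a single quote by doubling it.
    joined = delimiter.join(ids).replace("'", "''")
    return f"search.in({field}, '{joined}', '{delimiter}')"


if __name__ == "__main__":
    # Ids as they might come back from the BM25 search on the first index.
    print(build_chunk_filter(["doc1", "doc2"]))
```

The resulting string would be passed as the filter of the hybrid query against the "-vector" index, so only chunks of the selected documents are scored.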
The first index takes the name set in your .env file as "COG_SEARCH_INDEX". The second index uses the same "COG_SEARCH_INDEX" value with "-vector" added to the end of it.
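As a small illustration of that naming rule, the two index names can be derived from the environment like this. `index_names` is a hypothetical helper, not a function from this repo.

```python
import os


def index_names(env=None):
    """Return (document_index, chunk_index) derived from COG_SEARCH_INDEX."""
    env = os.environ if env is None else env
    base = env["COG_SEARCH_INDEX"]
    # The second (chunk/vector) index is the base name plus "-vector".
    return base, f"{base}-vector"


if __name__ == "__main__":
    print(index_names({"COG_SEARCH_INDEX": "myindex"}))
```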
Azure Cognitive Search indexers manage updates/inserts to your index, and now deletes as well (in preview).
The following code will handle:
- Inserts/Updates to your index.
- Deletes are not yet handled for the second index. When a file is deleted from your index, an additional function is required to delete its items from the second index. This will be a function triggered when a delete occurs in blob storage.
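Since the delete function is not implemented, the following is only a sketch of one piece it might need: the request body that the Cognitive Search "index documents" REST operation accepts for deletes. `build_delete_batch` is a hypothetical helper, and `id` is an assumed key field name for the second index. Finding the chunk keys that belong to a deleted blob is left out.

```python
import json


def build_delete_batch(keys, key_field="id"):
    """Build a delete batch for the index-documents REST operation.

    `key_field` is an ASSUMED name for the key field of the "-vector" index.
    """
    return {
        "value": [
            {"@search.action": "delete", key_field: str(key)} for key in keys
        ]
    }


if __name__ == "__main__":
    # Keys of the chunks that belonged to the deleted file (illustrative).
    print(json.dumps(build_delete_batch(["chunk-1", "chunk-2"]), indent=2))
```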
This repo holds 2 items.
- An Azure Function to be leveraged as a custom skill for Cog Search
- A notebook for configuring your Azure Cognitive Search index.
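For orientation, a custom skill receives a JSON body with a `values` array of records and must return one result per `recordId`. The sketch below shows that contract with a simple character-based chunker. It is not the function shipped in this repo: the input name `text`, the output name `chunks`, and the chunk sizes are assumptions, and the embedding call is omitted.

```python
def chunk_text(text, size=1000, overlap=100):
    """Split text into overlapping character windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += size - overlap
    return chunks


def handle_skill_request(body, size=1000, overlap=100):
    """Turn a custom-skill request body into a custom-skill response body."""
    results = []
    for record in body.get("values", []):
        # "text" is an ASSUMED input name; it must match the skill definition.
        text = record.get("data", {}).get("text", "")
        results.append({
            "recordId": record["recordId"],
            "data": {"chunks": chunk_text(text, size, overlap)},
            "errors": [],
            "warnings": [],
        })
    return {"values": results}


if __name__ == "__main__":
    request = {"values": [{"recordId": "1", "data": {"text": "abcdefghij"}}]}
    print(handle_skill_request(request, size=4, overlap=1))
```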
After cloning the repo, create a .env file in your directory. This file will hold the parameters.
To run the notebook, you will need to install the python-dotenv library:
pip install python-dotenv
The .env file should contain the following items:
```
AZURE_OPENAI_ENDPOINT="https://xmm.openai.azure.com/"
AZURE_OPENAI_KEY="XXXXXXXX"
TEXT_EMBEDDING_ENGINE="text-embedding-ada-002"
COG_SEARCH_RESOURCE="mmx-cog-search"
COG_SEARCH_KEY="YYYYYYY"
COG_SEARCH_INDEX="myindex"
#function app configuration
STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=mmxcogstorage;AccountKey=4bKKMUnJjxdYN+DUo3WMJ6Sqm+AStDmU2EA==;EndpointSuffix=core.windows.net"
STORAGE_ACCOUNT="mmxcogstorage"
STORAGE_CONTAINER="test"
STORAGE_KEY="XXXXU2EA=="
COG_SERVICE_KEY="786XXXXXf06f"
DEBUG="1"
functionAppUrlAndKey="https://funcationapp.azurewebsites.net/api/HttpTrigger1?code=1xVGCqCG3Txs0r6Q=="
```
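A quick way to catch a missing entry before running the notebook is to check the loaded environment against the names listed above. This is a minimal sketch: `missing_settings` is a hypothetical helper, and treating every listed name except `DEBUG` as required is an assumption.

```python
import os

# Names taken from the .env listing; which ones are strictly required
# is an ASSUMPTION of this sketch.
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "TEXT_EMBEDDING_ENGINE",
    "COG_SEARCH_RESOURCE",
    "COG_SEARCH_KEY",
    "COG_SEARCH_INDEX",
    "STORAGE_CONNECTION_STRING",
    "STORAGE_ACCOUNT",
    "STORAGE_CONTAINER",
    "STORAGE_KEY",
    "COG_SERVICE_KEY",
    "functionAppUrlAndKey",
)


def missing_settings(env=None):
    """Return the required setting names that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


if __name__ == "__main__":
    print("Missing settings:", missing_settings())
```

In the notebook, this would run after `load_dotenv()` from python-dotenv has populated `os.environ`.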
- Create a storage account if you don't have one already. This will be the storage account that Cognitive Search will target with the indexer.
- Azure Cognitive Search: enable Semantic Config.
- Deploy Azure AI services in the SAME region as Azure Cog Search.
- Deploy an Azure Function App (Python, Linux) in the SAME region as Azure Cog Search.
- In the VS Code editor, right-click "HttpTrigger1" and select Deploy.
- Configure your Azure Function App to connect:
  - Restart your function app.
  - On the Overview page you will see the function "HttpTrigger1". Click on it and go to "Code + Test".
  - Click "Get function URL", then copy and save the URL. You will need it to configure Cognitive Search.
Example:
https://mygithubfuncryh.azurewebsites.net/api/Embeddings?code=6zKqoduc-ezFurl==
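The saved function URL is what a Cognitive Search custom Web API skill points at. The sketch below builds such a skill definition as a plain dictionary. The notebook in this repo may configure the skill differently: `build_web_api_skill` is a hypothetical helper, and the input name `text`, the output name `chunks`, and the source path are assumptions.

```python
import json


def build_web_api_skill(function_url, source="/document/content"):
    """Build a custom Web API skill definition that calls the function URL."""
    return {
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "description": "Calls the Azure Function custom skill",
        "uri": function_url,  # the URL copied from "Get function URL"
        "httpMethod": "POST",
        "context": "/document",
        # ASSUMED input/output names; they must match what the function expects.
        "inputs": [{"name": "text", "source": source}],
        "outputs": [{"name": "chunks", "targetName": "chunks"}],
    }


if __name__ == "__main__":
    url = "https://example.invalid/api/HttpTrigger1?code=PLACEHOLDER"
    print(json.dumps(build_web_api_skill(url), indent=2))
```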
{
  "IsEncrypted": false,
  "Values": {
    "API_BASE": "https://mmx-openai.openai.azure.com/",
    "API_KEY": "XXXXX",
    "API_TYPE": "azure",
    "API_VERSION": "2022-12-01",
    "APPINSIGHTS_INSTRUMENTATIONKEY": "YYYYY",
    "AzureWebJobsFeatureFlags": "EnableWorkerIndexing",
    "AzureWebJobsStorage": "DefaultEndpointsProtocol=https;Ac;AccountKey=udK0239QU4j6cwQ==;EndpointSuffix=core.windows.net",
    "COG_SEARCH_KEY": "FOo5exG5D3bLg1QEfW8",
    "FUNCTIONS_EXTENSION_VERSION": "~4",
    "FUNCTIONS_REQUEST_BODY_SIZE_LIMIT": "360000000",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "INDEX_NAME": "ithelpdeskv6-vector",
    "SERVICE_ENDPOINT": "https://mmx-cog.search.windows.net",
    "TEXT_EMBEDDING_MODEL": "text-embedding-ada-002",
    "STORAGE_ACCOUNT": "storage",
    "STORAGE_ACCOUNT_CONTAINER": "container"
  },
  "ConnectionStrings": {}
}