Comments (4)
@pandu-k multiple highlights from multiple documents would be awesome. For example:
Imagine a query like 'what is the contract number, who are the signatories, and summarise the scope of work'.
In Document 1, the contract number might be on page 1 and the signatories on page 100. The scope of work might then be in Document 2 on page 12.
This would also be really useful for questions such as:
- 'compare these two documents'
- 'what are the similarities and differences between these sections of these documents'
- 'what is the difference between version 1 and version 1.5'
from marqo.
@aryanagarwal9 Maybe I misunderstand, but couldn't a smaller document size get the job done? You can also add overlap to your chunks if you're worried about missing context. For example:
"index_defaults": {
    "treat_urls_and_pointers_as_images": False,
    "model": "hf/all_datasets_v4_MiniLM-L6",
    "normalize_embeddings": True,
    "text_preprocessing": {
        "split_method": "sentence",
        "split_length": 2,
        "split_overlap": 1,
    }
}
See: https://docs.marqo.ai/0.0.18/API-Reference/indexes/#text-preprocessing-object
So if your podcast transcript is 100 "pages", this might become 100 marqo "documents" and within each of these documents there will be n "chunks" (aka facets) where, using the above settings, each chunk would be 2 sentences, with a stride/overlap of 1 sentence between them. We would then get 1 highlight per "page", which maybe is insufficient. But couldn't you just split your pages into something even smaller, such as paragraphs, to achieve the desired result?
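To make the chunking described above concrete, here is a minimal sketch of a sliding window over sentences with `split_length` 2 and `split_overlap` 1. The function `chunk_sentences` is a hypothetical helper written for illustration only; it is not Marqo's implementation, and Marqo may handle edge cases (such as the final partial window) differently.

```python
from typing import List


def chunk_sentences(
    sentences: List[str], split_length: int = 2, split_overlap: int = 1
) -> List[str]:
    """Group sentences into overlapping chunks (hypothetical helper, not Marqo code)."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    # Each new chunk starts this many sentences after the previous one.
    stride = split_length - split_overlap
    chunks: List[str] = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + split_length]))
        # Stop once a window has reached the end of the text.
        if start + split_length >= len(sentences):
            break
    return chunks


page = ["Welcome to the show.", "Today we talk search.", "First, chunking.", "Then, highlights."]
for chunk in chunk_sentences(page):
    print(chunk)
```

With four sentences this yields three chunks, each sharing one sentence with its neighbour, which is why a relevant sentence near a chunk boundary still appears with some surrounding context.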
Hey @jess-lord, I don't think the above solution scales. If you have 30 PDFs with 100 pages each, you now have 3000 documents that will each return a highlight. You then need to find the answer you are looking for among these 3000 highlights using some other method, which defeats the original purpose of finding the actual highlight. If you were using an LLM, your token count and cost to process 3000 sentences per query would be high too (if not exceeding the limit).
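The arithmetic behind this concern can be sketched as follows. The file and page counts come from the comment above; the figure of 40 tokens per highlight is purely an assumption for illustration, and `estimate_highlight_load` is a hypothetical helper, not part of any library.

```python
from typing import Tuple


def estimate_highlight_load(
    num_files: int, pages_per_file: int, tokens_per_highlight: int
) -> Tuple[int, int]:
    """Return (number of documents, total tokens) if every page-document yields one highlight."""
    num_documents = num_files * pages_per_file
    total_tokens = num_documents * tokens_per_highlight
    return num_documents, total_tokens


# 30 PDFs x 100 pages, with an ASSUMED 40 tokens per highlight.
documents, tokens = estimate_highlight_load(30, 100, 40)
print(documents, tokens)
```

Under that assumption, passing every highlight to an LLM would cost on the order of 120,000 tokens per query, which is the scaling problem being described.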
@edmuthiah I was responding to the podcast use case, which I still think this covers because the facets can be retrieved independently of their "parent document". But for your use case (which I too am now bumping into) I agree. The only alternative I can come up with for the moment is tags and weighted queries.