rachitavya.github's Issues
POC for transliteration using bhashini
Link Bhashini translate and S2T API
Create web scraper for extracting data
Train content - Ongoing
Content sheet - https://docs.google.com/spreadsheets/d/12NPm3r4odguaI_3dqEiyJKSzhIsCZXp30BFFrMss1UY/edit?pli=1#gid=1447499902
- Text Content
- Image Content - Context tagging to the heading is done manually
- @Amruth-Vamshi to add chunks
- @aashutosh-samagra to verify
Create instance for staging (Copy of Samagra Bot)
@Amruth-Vamshi kindly detail out the ticket for Karun
Services needed:
- BFF - copy Bhashini branch from AKAI
- FusionAuth
- Userservice
- Transport socket
- Minio
Segregate API Keys for OpenAI
- Share plan
Extracting section headings and chunks from PDFs
Sample pdf here
We need to be able to extract text from it and chunk it into headings and their related chunks.
We have tried 2 different approaches :
PDF Parser -
Overall ambition :
We should be able to parse a PDF so that we get the following structure out of it.
It includes the following key capabilities :
- Ability to process PDFs in multiple languages - English/Odia/Hindi
- Ability to create chunks with headings based on how the PDF is structured. We should be able to recognize which text is a heading and which is content, and then convert that into the structure above.
- Ability to process images and tables and convert them into chunks that can be passed to an LLM to answer questions based on them.
Where are we on this now :
Chunking
Free text chunking :
We are able to chunk free text (unstructured text) here
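A minimal sketch of what free text chunking can look like, assuming a greedy strategy that packs whole sentences into chunks up to a word limit. This is an illustration and not the linked implementation; the function name `chunk_free_text`, the `max_words` parameter and the sentence-splitting regex are all hypothetical. The regex also treats the danda (।) as a sentence end, on the assumption that Hindi and Odia text uses it.

```python
import re


def chunk_free_text(text, max_words=200):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    # Naive sentence split: after ., !, ? or the danda, followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?।])\s+", text) if s.strip()]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if this sentence would push it over the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_words` ends up as its own oversized chunk; the sketch does not split inside sentences.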
Structured pdf chunking
We have looked at 2 approaches for chunking text :
- Using DeepDoc detection to extract the text headings and structure of each page and converting it into a JSON format : here
- Using PyMuPDF to get the boundaries of the text from the PDF and then using that to figure out the headings and the content pieces : here
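To illustrate the PyMuPDF route, here is a rough sketch that reads text spans with `page.get_text("dict")` and groups them under headings. It assumes that headings are set in a larger font than the body text, which is a simplification and not necessarily what the linked code does (it may rely on text boundaries instead). The helper names `iter_spans` and `group_by_heading` are hypothetical.

```python
from collections import Counter


def iter_spans(pdf_path):
    """Yield (text, font_size) for every non-empty text span in the PDF."""
    import fitz  # PyMuPDF; imported lazily so group_by_heading works without it

    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                # Image blocks carry no "lines" key, so default to an empty list.
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            yield text, round(span["size"], 1)
    finally:
        doc.close()


def group_by_heading(spans):
    """Group spans into {"heading", "content"} sections by font size."""
    spans = list(spans)
    if not spans:
        return []

    # Assumption: the font size covering the most characters is body text.
    chars_per_size = Counter()
    for text, size in spans:
        chars_per_size[size] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]

    sections = []
    for text, size in spans:
        if size > body_size:
            # Anything larger than body text starts a new section.
            sections.append({"heading": text, "content": ""})
        else:
            if not sections:
                # Content that appears before the first heading.
                sections.append({"heading": "", "content": ""})
            last = sections[-1]
            last["content"] = (last["content"] + " " + text).strip()
    return sections
```

Typical use would be `group_by_heading(iter_spans("sample.pdf"))`, where the file name is a placeholder. Scanned PDFs without a text layer would yield no spans and need OCR first.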
What is a good chunk
- Should be around 100 to 200 words.
- The text in a chunk should stay on one topic so that the chunk makes semantic sense on its own.
- The topic of a chunk should be distinct from the topics of other chunks.
- Ideally it should cover a small topic in its entirety. It could cover multiple topics but these small topics should not be a part of some other chunk.
For example :
Bad Chunk :
Here is a list of links :
Cab booking : http:/sdjnsdkgj/
Hotel form : http:/sfjgkjnfsgn/
This is a bad chunk because :
- The chunk is small in size
- The links cover multiple topics at once. The cab booking link should be part of the chunk that describes how to book a cab. Similarly, the hotel form link should be part of the chunk on hotel booking.
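The measurable parts of these criteria can be checked mechanically. The sketch below is a hypothetical helper (`check_chunk` is not an existing function in the project) that flags chunks outside the 100 to 200 word range and chunks containing more than one link. The link count is only a hint that a chunk may mix topics, and it is an assumption made here for illustration. Whether a chunk is semantically coherent still needs a human or a model to judge.

```python
import re

# Loose URL pattern; tolerates a single slash after the scheme, as in the example above.
URL_PATTERN = re.compile(r"https?:/+\S+")


def check_chunk(text, min_words=100, max_words=200):
    """Return a list of issues found in a chunk; an empty list means none found."""
    issues = []

    word_count = len(text.split())
    if word_count < min_words:
        issues.append(f"too short ({word_count} words)")
    elif word_count > max_words:
        issues.append(f"too long ({word_count} words)")

    # Several links in one chunk may indicate several topics bundled together.
    link_count = len(URL_PATTERN.findall(text))
    if link_count > 1:
        issues.append(f"multiple links ({link_count})")

    return issues
```

Run on the bad chunk above, this reports both a length issue and a multiple-links issue.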