pdftochat's Introduction

PDFToChat – Chat with your PDFs in seconds.

Chat with your PDFs in seconds. Powered by Together AI and Pinecone.

Tech Stack · Deploy Your Own · Common Errors · Credits · Future Tasks


Tech Stack

Deploy Your Own

You can deploy this template to Vercel or any other host. Note that you'll need to:

See the .example.env for a list of all the required environment variables.

You will also need to prepare your database schema by running npx prisma db push.

MongoDB Atlas

To set up a MongoDB Atlas database as the backing vectorstore, you will need to perform the following steps:

  1. Sign up on their website, then create a database cluster. Find it under the Database sidebar tab.
  2. Create a collection by switching to the Collections tab and creating a blank collection.
  3. Create an index by switching to the Atlas Search tab and clicking Create Search Index.
  4. Make sure you select Atlas Vector Search - JSON Editor, select the appropriate database and collection, and paste the following into the textbox:
{
  "fields": [
    {
      "numDimensions": 768,
      "path": "embedding",
      "similarity": "euclidean",
      "type": "vector"
    },
    {
      "path": "docstore_document_id",
      "type": "filter"
    }
  ]
}

Note that numDimensions is set to 768 to match the embeddings model we're using, and that there is a second index field on docstore_document_id. This allows us to filter on that field later.

You may call the index whatever you wish; just make a note of the name!

  5. Finally, retrieve and set the following environment variables:
NEXT_PUBLIC_VECTORSTORE=mongodb # Set MongoDB Atlas as your vectorstore

MONGODB_ATLAS_URI= # Connection string for your database.
MONGODB_ATLAS_DB_NAME= # The name of your database.
MONGODB_ATLAS_COLLECTION_NAME= # The name of your collection.
MONGODB_ATLAS_INDEX_NAME= # The name of the index you just created.
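
As a minimal sketch of how these settings might fit together, the snippet below reads the four MongoDB variables and builds a raw $vectorSearch aggregation stage that filters on docstore_document_id. The helper names (loadMongoVectorConfig, buildVectorSearchPipeline), the default limit of 4, and the numCandidates heuristic are assumptions made for illustration; the repository may do this through a library wrapper instead. Only the variable names, the "embedding" path, and the filter field come from the steps above.

```typescript
// Hypothetical helpers for illustration; not taken from the repository.

interface MongoVectorConfig {
  uri: string;
  dbName: string;
  collectionName: string;
  indexName: string;
}

type Env = Record<string, string | undefined>;

// Fail early with a readable message if any required variable is unset or empty.
function loadMongoVectorConfig(env: Env): MongoVectorConfig {
  const required = [
    "MONGODB_ATLAS_URI",
    "MONGODB_ATLAS_DB_NAME",
    "MONGODB_ATLAS_COLLECTION_NAME",
    "MONGODB_ATLAS_INDEX_NAME",
  ];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    uri: env["MONGODB_ATLAS_URI"] as string,
    dbName: env["MONGODB_ATLAS_DB_NAME"] as string,
    collectionName: env["MONGODB_ATLAS_COLLECTION_NAME"] as string,
    indexName: env["MONGODB_ATLAS_INDEX_NAME"] as string,
  };
}

interface VectorSearchStage {
  $vectorSearch: {
    index: string;
    path: string;
    queryVector: number[];
    numCandidates: number;
    limit: number;
    filter: { docstore_document_id: { $eq: string } };
  };
}

// Build an aggregation pipeline that searches only the chunks of one document.
// The filter works because docstore_document_id is declared as a "filter"
// field in the index definition.
function buildVectorSearchPipeline(
  indexName: string,
  queryVector: number[],
  documentId: string,
  limit: number = 4, // assumed default, not from the README
): VectorSearchStage[] {
  return [
    {
      $vectorSearch: {
        index: indexName,
        path: "embedding",
        queryVector,
        // numCandidates must be at least `limit`; 10x is an assumed heuristic,
        // capped at the 10000 maximum that Atlas accepts.
        numCandidates: Math.min(limit * 10, 10000),
        limit,
        filter: { docstore_document_id: { $eq: documentId } },
      },
    },
  ];
}
```

In a Node.js process you would typically pass process.env to loadMongoVectorConfig and hand the resulting pipeline to the collection's aggregate call.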

Common errors

  • Check that you've created an .env file that contains valid, working API keys, along with the correct environment and index name.
  • Check that you've set the vector dimensions to 768 and that the index name matches the one specified in your .env file.
  • Check that you've added a credit card on Together AI if you're hitting rate limits on the free tier.
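
The dimension check above can be sketched as a small guard that runs before ingestion. This is an illustrative helper, not code from the repository: the function name checkEmbeddingDimensions and its return convention are assumptions, and only the value 768 comes from this README.

```typescript
// The embeddings model used by this project produces 768-dimensional vectors.
const EXPECTED_DIMENSIONS = 768;

// Hypothetical guard: returns null when everything lines up, otherwise a
// human-readable description of the mismatch.
function checkEmbeddingDimensions(
  embedding: number[],
  indexDimensions: number,
): string | null {
  if (indexDimensions !== EXPECTED_DIMENSIONS) {
    return (
      `Index is configured for ${indexDimensions} dimensions, ` +
      `but the embeddings model produces ${EXPECTED_DIMENSIONS}.`
    );
  }
  if (embedding.length !== indexDimensions) {
    return (
      `Embedding has ${embedding.length} dimensions, ` +
      `but the index expects ${indexDimensions}.`
    );
  }
  return null;
}
```

An index created with any other size, for example 728, would be reported by this guard before the vector store rejects the upsert.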

Credits

  • Youssef for the design of the app
  • Mayo for the original RAG repo and inspiration
  • Jacob for the LangChain help
  • Together AI, Bytescale, Pinecone, and Clerk for sponsoring

Future tasks

These are some future tasks that I have planned. Contributions are welcome!

  • Add a trash icon for folks to delete PDFs from the dashboard and implement delete functionality
  • Try different embedding models like UAE-large-v1 to see if it improves accuracy
  • Explore best practices for auto scrolling based on other chat apps like ChatGPT
  • Do some prompt engineering for Mixtral to make replies as good as possible
  • Protect API routes by making sure users are signed in before executing chats
  • Run an initial benchmark on how accurate chunking / retrieval are
  • Research best practices for chunking and retrieval and play around with them – ideally run benchmarks
  • Try out LangSmith for more observability into how the RAG app runs
  • Add demo video to the homepage to demonstrate functionality more easily
  • Upgrade to Next.js 14 and fix any issues with that
  • Make sources clickable with more info, like Perplexity does
  • Add analytics to track the number of chats & errors
  • Make some changes to the default Tailwind prose styles to decrease padding
  • Add an initial message with sample questions or just add them as bubbles on the page
  • Add an option to get answers as markdown or in regular paragraphs
  • Implement something like SWR to automatically revalidate data
  • Save chats for each user in the Postgres DB so they can get back to them later
  • Bring up a message to direct folks to compress PDFs if they're beyond 10MB
  • Use a self-designed custom uploader
  • Use a session tracking tool to better understand how folks are using the site
  • Add better error handling overall with appropriate toasts when actions fail
  • Add support for images in PDFs with something like Nougat

pdftochat's People

Contributors

ankri, bracesproul, jacoblee93, mayooear, nutlope

pdftochat's Issues

option to choose model or "tier"

great product! congrats!
just an improvement suggestion: ability to choose the model you want to use, or even a "paid" tier with superior models including vision?
awesome launch and oss ftw!

Unable to delete uploaded documents

I am satisfied with the quality of summaries and other Q&As more than any other service.
However, I am concerned that I cannot find a way to delete a document once it's uploaded. Wouldn't that mean documents will just keep accumulating? From the user's perspective, I'm also worried about private information being leaked.

Since I might not know the way to delete a document, I would appreciate your guidance on how to do it.

Pinecone dimensions error

Hello,
I'm running the repo locally but when ingesting a new pdf file I'm getting this error:

PineconeBadRequestError: Vector dimension 768 does not match the dimension of the index 728

I have created a Pinecone index with 728 dimensions, though, and have set the PINECONE_API_KEY and PINECONE_INDEX_NAME environment variables in the .env file.

The error is thrown when running the 'await PineconeStore.fromDocuments' method.

I would appreciate any help or pointers towards the right direction!

Many thanks!

Export encountered errors on following paths

When I run next build I get the error below. Any suggestions?

Export encountered errors on following paths:
/_error: /404
/_error: /500
/_not-found
/dashboard/page: /dashboard
/page: /

PDF ingestion does not seem to complete successfully

I have attempted to load two different PDFs into PDFtoCHAT and neither appears to complete the ingestion process.

After 45 minutes I close the browser tab and reopen it, and the PDF is listed in my list.

I can click on them, but when I attempt to chat with them, the response from PDFtoCHAT indicates it has no knowledge of the PDF document I am attempting to load.

What am I doing wrong? I was part of your Zoom presentation earlier today and it looks like a terrific capability!

Thanks,

Tim
