
Reader

Your LLMs deserve better input.

Reader converts any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/. Get improved output for your agent and RAG systems at no cost.

Feel free to use Reader API in production. It is free, stable and scalable. We are maintaining it actively as one of the core products of Jina AI.


Updates

  • 2024-04-24: You now have more fine-grained control over Reader API using headers, e.g. forwarding cookies, using HTTP proxy.
  • 2024-04-15: Reader now supports image reading! It captions all images at the specified URL and adds Image [idx]: [caption] as an alt tag if they initially lack one. This enables downstream LLMs to use the images in reasoning, summarization, etc.

Usage

Simply prepend https://r.jina.ai/ to any URL. For example, to convert the URL https://en.wikipedia.org/wiki/Artificial_intelligence to an LLM-friendly input, use the following URL:

https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence

All images on that page that lack an alt tag are auto-captioned by a VLM (vision language model) and formatted as ![Image [idx]: [VLM_caption]](img_URL). This should give your downstream text-only LLM just enough hints to include those images in reasoning, selecting, and summarizing.
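The prefix scheme above can be wrapped in a few lines of code. Below is a minimal sketch in Python using only the standard library; the helper names (to_reader_url, fetch_llm_friendly) are illustrative, not part of any official client:

```python
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"

def to_reader_url(url: str) -> str:
    # Prepend the Reader prefix; the target URL is passed through verbatim.
    return READER_PREFIX + url

def fetch_llm_friendly(url: str, timeout: float = 60.0) -> str:
    # Fetch the LLM-friendly (markdown) rendering of the page.
    req = Request(to_reader_url(url), headers={"User-Agent": "curl/8.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

# Performs a live network request, so it is left commented out:
# text = fetch_llm_friendly("https://en.wikipedia.org/wiki/Artificial_intelligence")
```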

Streaming mode

Streaming mode is useful when the standard mode gives you an incomplete result, because in streaming mode the Reader waits a bit longer until the page is stably rendered. Use the Accept header to toggle streaming mode:

curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page

The data arrives as a stream; each subsequent chunk contains more complete information, and the last chunk should be the most complete, final result. Note that this is different from LLMs' text-generation streaming, where each chunk appends new tokens rather than replacing the previous content.

For example, compare the two curl commands below. You can see that the streaming one eventually gives you the complete content, whereas standard mode does not. This is because content loading on this particular site is triggered by JavaScript after the page has fully loaded, so standard mode returns the page "too soon".

curl -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853

Note: -H 'x-no-cache: true' is used only for demonstration purposes to bypass the cache.

Streaming mode is also useful if your downstream LLM/agent system requires immediate content delivery, or needs to process data in chunks to interleave I/O and LLM processing time. This allows for quicker access and more efficient data handling:

Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ---> ... 
                          |                    |                     |
                          v                    |                     |
Your LLM:                 LLM(streamContent1)  |                     |
                                               v                     |
                                               LLM(streamContent2)   |
                                                                     v
                                                                     LLM(streamContent3)

Note that in terms of completeness: ... > streamContent3 > streamContent2 > streamContent1, each subsequent chunk contains more complete information.
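Since each data: event is a progressively more complete rendering, a consumer that only needs the final result can simply keep the last payload seen. A minimal sketch (the SSE lines here are simulated; real payload framing may differ):

```python
def last_complete_chunk(sse_lines):
    # Each subsequent "data:" event is a more complete rendering of the
    # page, so the final result is simply the last payload seen.
    last = None
    for line in sse_lines:
        if line.startswith("data:"):
            last = line[len("data:"):].lstrip()
    return last

# Simulated stream; in practice these lines come from the response to
# a request sent with "Accept: text/event-stream".
events = [
    "data: # Main Page",
    "data: # Main Page ... partial body",
    "data: # Main Page ... full body after JS rendering",
]
print(last_complete_chunk(events))
```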

Using request headers

As you have already seen above, one can control the behavior of the Reader API using request headers. Here is a complete list of supported headers.

  • You can ask the Reader API to forward cookie settings via the x-set-cookie header.
    • Note that requests with cookies will not be cached.
  • You can bypass readability filtering via the x-respond-with header, specifically:
    • x-respond-with: markdown returns markdown without going through readability
    • x-respond-with: html returns documentElement.outerHTML
    • x-respond-with: text returns document.body.innerText
    • x-respond-with: screenshot returns the URL of the webpage's screenshot
  • You can specify a proxy server via the x-proxy-url header.
  • You can bypass the cached page (lifetime 300s) via the x-no-cache header.
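These headers compose naturally, so a small helper can assemble them. This is a sketch based only on the header names listed above; the function itself is hypothetical, not part of any official client:

```python
def reader_headers(respond_with=None, no_cache=False, set_cookie=None, proxy_url=None):
    # Assemble optional Reader API request headers from the list above.
    headers = {}
    if respond_with is not None:
        # One of: markdown, html, text, screenshot
        headers["x-respond-with"] = respond_with
    if no_cache:
        # Bypass the cached page (cache lifetime is 300 s)
        headers["x-no-cache"] = "true"
    if set_cookie is not None:
        # Forward cookies; note that such requests are not cached
        headers["x-set-cookie"] = set_cookie
    if proxy_url is not None:
        headers["x-proxy-url"] = proxy_url
    return headers

# e.g. raw text without readability filtering, bypassing the cache:
# requests.get("https://r.jina.ai/https://example.com",
#              headers=reader_headers(respond_with="text", no_cache=True))
```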

JSON mode (super early beta)

This is still very early, and the result is not yet a truly "useful" JSON: it contains only three fields, url, title and content. Nonetheless, you can use the Accept header to control the output format:

curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
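Given the three documented fields, unpacking the response is straightforward. A sketch that assumes a flat object as described above; the sample payload is fabricated to show the shape:

```python
import json

def parse_reader_json(payload: str):
    # JSON mode currently returns only these three fields.
    doc = json.loads(payload)
    return doc["url"], doc["title"], doc["content"]

sample = '{"url": "https://en.m.wikipedia.org/wiki/Main_Page", "title": "Wikipedia", "content": "..."}'
url, title, content = parse_reader_json(sample)
print(title)  # Wikipedia
```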

Install

You will need the following tools to run the project:

  • Node v18 (The build fails for Node version >18)
  • Firebase CLI (npm install -g firebase-tools)

For the backend, clone the repository, go to the backend/functions directory, and install the npm dependencies:

git clone git@github.com:jina-ai/reader.git
cd reader/backend/functions
npm install

What is thinapps-shared submodule?

You might notice a reference to the thinapps-shared submodule, an internal package we use to share code across our products. It is not open-sourced and isn't integral to Reader's functionality; it mainly helps with decorators, logging, secrets management, etc. Feel free to ignore it for now.

That said, this is the single codebase behind https://r.jina.ai, so every time we commit here, we deploy the new version to https://r.jina.ai.

Having trouble on some websites?

Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.

License

Reader is backed by Jina AI and licensed under Apache-2.0.

reader's People

Contributors

nomagick, hanxiao, chxru, dependabot[bot]


reader's Issues

Timeouts for both Twitter and Reddit happening frequently

Around 50% of the time, this tweet and the few others I tried show a timeout error: https://r.jina.ai/https://twitter.com/EntreEden/status/1780771887624417315.

The response I get is this:

{"name":"TimeoutError",
"domainThrown":true,
"message":"Timed out after 10000 ms while waiting for the WS endpoint URL to appear in stdout!"}

Same with a reddit post I tried: https://www.reddit.com/r/cscareerquestions/comments/ufdyhd/after_4_years_of_working_im_slowly_learning_how/

I am in Canada, all on my home wifi and computer.

If you need help with scaling, I have a $5 Hetzner vps that I can pitch in. If I get guidance, I'm willing to help out with the issue as well.

Published time in JSON mode

Please add "Published Time" to JSON mode. We are investigating how to use the published time to detect updated content downstream and replace vectors when the published time has changed.

Add parameters to request full text (i.e. don't parse with @mozilla/readability)

Great tool, and I would love to make more use of it. However, I scrape a lot of website home pages, and those pages end up having far too much information removed by readability.js. I'd love a parameter that lets me use html-to-text instead. The code to add that is very simple:

import { convert } from 'html-to-text';

const options = {
  wordwrap: false,
  selectors: [
    // Keep link text but drop hrefs and anchor URLs; skip images entirely.
    { selector: 'a', options: { hideLinkHrefIfSameAsText: true, noAnchorUrl: true, ignoreHref: true, linkBrackets: false } },
    { selector: 'img', format: 'skip' },
    { selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
    { selector: 'pre', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
  ],
};

const text = convert(html, options);

That will get you clean, well-structured text from the HTML, though much longer for home pages like https://www.tanium.com. For that home page, the current Reader produces text with 327 tokens (using the GPT-4 tiktoken tokenizer), while html-to-text gives back 2554 tokens. That's far too much information loss for my use case, especially given that much of that information is critical to understanding what the business does.

Finally, while it wouldn't work in JS: given that you're willing to connect a vision model to interpret images, perhaps you'd consider implementing Trafilatura for articles and similar pages, as it slightly outperforms readability.js based on a 2023 analysis from this paper: https://downloads.webis.de/publications/papers/bevendorff_2023b.pdf


While adding a parameter like this conflicts a bit with the convenience of just dropping the URL at the end of https://r.jina.ai/, I think the added information is really critical. It's the difference between me being able to use this for my use case (which I don't think is an extreme edge case) and not being able to use it at all.

URL loses information in the conversion

Converting the URL https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317 with Jina Reader loses several headings (Date and Time, Location, Refund Policy, etc.). The rendered result is:

Title: Brian McLaren | “Wisdom and Courage for a World Falling Apart"

URL Source: https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317

Markdown Content:
Brian McLaren, author, speaker, activist, and public theologian notes that the challenge of living well and maintaining resilience in turbulent times requires new ways of thinking, becoming, and belonging. Facing nations, ecosystems, economies, religions, and other institutions in disarray, we are called to a spiritual transformation in our own lives that will express itself in transformation in our world.

Childcare for ages 2-12 will be provided by reservation through April 15.

Error fetching webpage content via code but successful via browser

When attempting to fetch webpage content via code, I encounter an error consistently. However, I've noticed that the same URL can be successfully accessed via a browser. The error message I'm receiving is as follows:

Error data from response: {
data: null,
path: 'url',
code: 400,
name: 'ParamValidationError',
status: 40001,
message: 'TypeError: Invalid URL',
readableMessage: 'ParamValidationError(url): TypeError: Invalid URL'
}

URL:https://www.trustpilot.com/review/cortexi.io

I've tried troubleshooting this issue, but so far, I haven't been able to pinpoint the exact cause. Any insights or suggestions on how to resolve this would be greatly appreciated. Thank you!

Error when accessing a URL

Accessing the URL produced this error: {"data":null,"cause":{},"code":422,"name":"AssertionFailureError","status":42206,"message":"Failed to goto https://ddjjjc.gov.cn/pages/news.asp?id=4173: Error: net::ERR_CERT_COMMON_NAME_INVALID at https://ddjjjc.gov.cn/pages/news.asp?id=4173","readableMessage":"AssertionFailureError: Failed to goto https://ddjjjc.gov.cn/pages/news.asp?id=4173: Error: net::ERR_CERT_COMMON_NAME_INVALID at https://ddjjjc.gov.cn/pages/news.asp?id=4173"}
I'm making the request with requests; here is the request code:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7'
}
url = "https://r.jina.ai/https://www.ddjjjc.gov.cn/pages/news.asp?id=4173"
response = requests.get(url, headers=headers)
print(response.text)

Pile in reader format

Hi! This looks interesting. I wonder if you could convert the Pile dataset, fetched from its respective URLs, into the Jina Reader format to experiment with LLM pre-training?

Issue: Jina reader fails to parse URLs containing Chinese characters

Description:

We have encountered an issue where the Jina reader fails to parse URLs that contain Chinese characters. This issue is causing our application to throw errors and prevents us from properly extracting content from certain websites.

Steps to Reproduce:

  1. Make a request to the Jina reader API with a URL containing Chinese characters, such as https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB.
  2. Observe that the Jina reader fails to parse the URL and returns an error.

Expected Behavior:

We expect the Jina reader to properly handle URLs containing Chinese characters and successfully parse the corresponding web pages. The reader should be able to decode the URL, retrieve the web page content, and return it as expected.

Actual Behavior:

When a URL containing Chinese characters is passed to the Jina reader, it fails to parse the URL and throws an error. The error message typically indicates that the reader is unable to read properties of undefined, specifically the 'parentNode' property.

Example Error Message:

Failed to fetch https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB: {"code":500,"status":50000,"message":"Cannot read properties of undefined (reading 'parentNode')","name":"TypeError"}

Impact:

This issue prevents our application from properly extracting content from websites that have URLs containing Chinese characters. It limits the functionality of our application and affects the user experience when dealing with such websites.

Potential Causes:

  • The Jina reader may not be properly decoding the URL before making the request, leading to an invalid URL being passed to the underlying parsing logic.
  • The parsing logic within the Jina reader may not be handling URLs with Chinese characters correctly, resulting in the "Cannot read properties of undefined" error.

Workaround:

As a temporary workaround, we have implemented a filtering mechanism in our application to skip URLs that contain Chinese characters. However, this is not an ideal solution as it limits the functionality and coverage of our application.

Request:

We kindly request the Jina team to investigate this issue and provide a fix that allows the Jina reader to properly handle URLs containing Chinese characters. It would be greatly appreciated if you could provide an update on the progress and an estimated timeline for the resolution.

Additional Information:

  • We are currently using the official Jina reader API, not the open-source service.
  • We are in the process of setting up our own service, but we are unsure if the open-source service also has this issue.

Please let us know if you require any further information or if there are any specific details you need to investigate and resolve this issue.

Thank you for your attention to this matter. We look forward to your response and resolution.

Best regards,
Loki.W

Data Retrieval After Initial Page Load for Timely Updates

When using the API, there's a limitation: it only captures data loaded during the initial page load. However, certain data is loaded dynamically after a brief delay, making it inaccessible through the current implementation. Is there any option available for this?

How to start?

npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE package: undefined,
npm WARN EBADENGINE required: { node: '20' },
npm WARN EBADENGINE current: { node: 'v21.7.3', npm: '10.5.0' }
npm WARN EBADENGINE }

up to date, audited 1005 packages in 2s

146 packages are looking for funding
run npm fund for details

4 critical severity vulnerabilities

To address all issues (including breaking changes), run:
npm audit fix --force

Run npm audit for details.

Isn't this incomplete as an open-source project?

There is no local deployment documentation. Nothing is complete, so why does the README even include install commands? Do you actually want to open-source this or not? If you do, you shouldn't release something this unfinished.

Option to toggle the usage of Readability?

As I was experimenting with your API I noticed that it was a bit "too aggressive" on some pages, removing sections that I would want to keep in the final Markdown.

So I looked around both in the project code, as well as setting up an isolated test that only used turndown directly, but finally I found that the "culprit" was @mozilla/readability.

While this seems to do a great job of removing "irrelevant" content before it's converted to Markdown in most cases, I can definitely see how it might be a bit too greedy/aggressive in its cleanup strategy (i.e. not only in my specific case). Since I couldn't find any combination of Readability config options that kept the specific "hero" section on the page I was testing, I'd like to suggest adding the ability to simply enable/disable Readability completely.

As I'm not using the project by hosting the actual code locally or on my own server, but rather just using your public API, the ideal scenario would therefore be if this toggle could even exist as e.g. an extra parameter or alternative API endpoint.

Of course turndown should still be configured to remove things like <script> and <style> when not using Readability (if you don't already explicitly do this), but other than that I really think this alternative parsing option could be a very valuable addition!

npm run build failed because shared files are not found

Error:

$ npm run build

> build
> tsc -p .

src/cloud-functions/crawler.ts:3:79 - error TS2307: Cannot find module '../shared' or its corresponding type declarations.

3 import { CloudHTTPv2, Ctx, Logger, OutputServerEventStream, RPCReflect } from '../shared';
                                                                                ~~~~~~~~~~~

src/db/crawled.ts:2:33 - error TS2307: Cannot find module '../shared/lib/firestore' or its corresponding type declarations.

2 import { FirestoreRecord } from '../shared/lib/firestore';
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~

src/db/crawled.ts:9:21 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

9     static override collectionName = 'crawled';
                      ~~~~~~~~~~~~~~

src/db/crawled.ts:11:14 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

11     override _id!: string;
                ~~~

src/db/crawled.ts:36:21 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

36     static override from(input: any) {
                       ~~~~

src/db/crawled.ts:46:14 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

46     override degradeForFireStore() {
                ~~~~~~~~~~~~~~~~~~~

src/index.ts:6:50 - error TS2307: Cannot find module './shared' or its corresponding type declarations.

6 import { loadModulesDynamically, registry } from './shared';
                                                   ~~~~~~~~~~

src/services/puppeteer.ts:4:24 - error TS2307: Cannot find module '../shared/services/logger' or its corresponding type declarations.

4 import { Logger } from '../shared/services/logger';
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~

src/services/puppeteer.ts:178:43 - error TS2339: Property 'fromFirestoreQuery' does not exist on type 'typeof Crawled'.

178             const cached = (await Crawled.fromFirestoreQuery(Crawled.COLLECTION.where('urlPathDigest', '==', digest).orderBy('createdAt', 'desc').limit(1)))?.[0];
                                              ~~~~~~~~~~~~~~~~~~

src/services/puppeteer.ts:178:70 - error TS2339: Property 'COLLECTION' does not exist on type 'typeof Crawled'.

178             const cached = (await Crawled.fromFirestoreQuery(Crawled.COLLECTION.where('urlPathDigest', '==', digest).orderBy('createdAt', 'desc').limit(1)))?.[0];
                                                                         ~~~~~~~~~~

src/services/puppeteer.ts:232:25 - error TS2339: Property 'save' does not exist on type 'typeof Crawled'.

232                 Crawled.save(
                            ~~~~

src/services/puppeteer.ts:240:26 - error TS7006: Parameter 'err' implicitly has an 'any' type.

240                 ).catch((err) => {
                             ~~~


Found 12 errors in 4 files.

Errors  Files
     1  src/cloud-functions/crawler.ts:3
     5  src/db/crawled.ts:2
     1  src/index.ts:6
     5  src/services/puppeteer.ts:4

To reproduce, go to backend/functions and run npm run build from an account that doesn't have access to thinapps-shared/backend.

It looks like the thinapps-shared backend mostly does logging, monitoring, and caching, but at this point the project doesn't build without it.

Not pulling image links correctly

Testing Jina on our Norwegian websites, we see that images are not pulled correctly:

example URL: https://mikalsenutvikling.no/
Jina URL: https://r.jina.ai/https://mikalsenutvikling.no

First image on the website by Jina is given as ![Image 1: Daglig Leder - André mikalsen](data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%201200%201307'%3E%3C/svg%3E)

But in the HTML you can clearly see there is a src attribute with a .png image that should be the URL given in the Jina Reader version:

src="https://mikalsenutvikling.no/wp-content/uploads/2022/11/Andre-Mikalsen-optimalisert.png"

This has been the same on all sites we have tested.

DOMException 500 Error

@hanxiao Could you please give some advice for this issue?

`https://r.jina.ai/https://www.msn.com/en-us/news/technology/the-best-ai-search-engines-and-tools-you-can-use-to-search-the-web/ar-BB1kbzFL`

{"code":500,"status":50000,"message":"Failed to execute 'setAttribute' on 'Element': '}' is not a valid attribute name.","name":"DOMException"}

Respect robots.txt and identify your system

Recently, some AI companies have given website administrators the option of opting out of AI training by using configuration options in robots.txt.

While this project is for prompting and RAG rather than training, I still think you should provide an option for website users to prevent their websites from becoming ad-hoc databases for or components of AI systems. It seems like you have made your software default to evading detection by using puppeteer's stealth plugin; the user-agent configuration that would allow website owners to identify your project's bots is commented out.

I think this default is deceptive and irresponsible. You should make sure users of your project respect these preferences by incorporating them into the software's defaults. Web administrators may not be inclined to support the additional traffic generated by people using their websites as a component of AI systems.

support docker deployment

Could you please add support for Docker deployment to streamline setting up and running the project?
