mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Home Page: https://firecrawl.dev

License: GNU Affero General Public License v3.0

Dockerfile 0.35% JavaScript 4.74% TypeScript 82.81% Python 7.00% HTML 0.06% CSS 0.29% Go 4.75%
ai crawler data markdown scraper html-to-markdown llm rag scraping web-crawler ai-scraping

firecrawl's Introduction

🔥 Firecrawl

Crawl and convert any website into LLM-ready markdown or structured data. Built by Mendable.ai and the Firecrawl community. Includes powerful scraping, crawling and data extraction capabilities.

This repository is in its early development stages. We are still merging custom modules into the monorepo. It's not yet fully ready for self-hosted deployment, but you can already run it locally.

What is Firecrawl?

Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each. No sitemap required.

Pst. hey, you, join our stargazers :)

How to use it?

We provide an easy-to-use API with our hosted version. You can find the playground and documentation here. You can also self-host the backend if you'd like.

To run it locally, refer to the guide here.

API Key

To use the API, you need to sign up on Firecrawl and get an API key.

Crawling

Used to crawl a URL and all accessible subpages. This submits a crawl job and returns a job ID to check the status of the crawl.

curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai"
    }'

Returns a jobId

{ "jobId": "1234-5678-9101" }

Check Crawl Job

Used to check the status of a crawl job and get its result.

curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Response:

{
  "status": "completed",
  "current": 22,
  "total": 22,
  "data": [
    {
      "content": "Raw Content ",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www.mendable.ai/"
      }
    }
  ]
}

Scraping

Used to scrape a URL and get its content.

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai"
    }'

Response:

{
  "success": true,
  "data": {
    "content": "Raw Content ",
    "markdown": "# Markdown Content",
    "provider": "web-scraper",
    "metadata": {
      "title": "Mendable | AI for CX and Sales",
      "description": "AI for CX and Sales",
      "language": null,
      "sourceURL": "https://www.mendable.ai/"
    }
  }
}

Search (Beta)

Used to search the web, get the most relevant results, scrape each page and return the markdown.

curl -X POST https://api.firecrawl.dev/v0/search \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "query": "firecrawl",
      "pageOptions": {
        "fetchPageContent": true // false for a fast serp api
      }
    }'

Response:

{
  "success": true,
  "data": [
    {
      "url": "https://mendable.ai",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www.mendable.ai/"
      }
    }
  ]
}

Intelligent Extraction (Beta)

Used to extract structured data from scraped pages.

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://www.mendable.ai/",
      "extractorOptions": {
        "mode": "llm-extraction",
        "extractionPrompt": "Based on the information on the page, extract the information from the schema. ",
        "extractionSchema": {
          "type": "object",
          "properties": {
            "company_mission": {
              "type": "string"
            },
            "supports_sso": {
              "type": "boolean"
            },
            "is_open_source": {
              "type": "boolean"
            },
            "is_in_yc": {
              "type": "boolean"
            }
          },
          "required": [
            "company_mission",
            "supports_sso",
            "is_open_source",
            "is_in_yc"
          ]
        }
      }
    }'

Response:

{
  "success": true,
  "data": {
    "content": "Raw Content",
    "metadata": {
      "title": "Mendable",
      "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "robots": "follow, index",
      "ogTitle": "Mendable",
      "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "ogUrl": "https://mendable.ai/",
      "ogImage": "https://mendable.ai/mendable_new_og1.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Mendable",
      "sourceURL": "https://mendable.ai/"
    },
    "llm_extraction": {
      "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
      "supports_sso": true,
      "is_open_source": false,
      "is_in_yc": true
    }
  }
}

Using Python SDK

Installing Python SDK

pip install firecrawl-py

Crawl a website

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})

# Get the markdown
for result in crawl_result:
    print(result['markdown'])

Scraping a URL

To scrape a single URL, use the scrape_url method. It takes the URL as a parameter and returns the scraped data as a dictionary.

url = 'https://example.com'
scraped_data = app.scrape_url(url)

Extracting structured data from a URL

With LLM extraction, you can easily extract structured data from any URL. We support Pydantic schemas to make it even easier. Here is how to use it:

from pydantic import BaseModel, Field
from typing import List

class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

class TopArticlesSchema(BaseModel):
    top: List[ArticleSchema] = Field(..., max_items=5, description="Top 5 stories")

data = app.scrape_url('https://news.ycombinator.com', {
    'extractorOptions': {
        'extractionSchema': TopArticlesSchema.model_json_schema(),
        'mode': 'llm-extraction'
    },
    'pageOptions':{
        'onlyMainContent': True
    }
})
print(data["llm_extraction"])

Search for a query

Performs a web search, retrieves the top results, extracts data from each page, and returns their markdown.

query = 'What is Mendable?'
search_result = app.search(query)

Using the Node SDK

Installation

To install the Firecrawl Node SDK, you can use npm:

npm install @mendable/firecrawl-js

Usage

  1. Get an API key from firecrawl.dev
  2. Set the API key as an environment variable named FIRECRAWL_API_KEY or pass it as a parameter to the FirecrawlApp class.
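The snippets below assume an app instance. A minimal setup sketch, based on the import and constructor shown in the extraction example further down:

import FirecrawlApp from "@mendable/firecrawl-js";

// Pass the key explicitly...
const app = new FirecrawlApp({ apiKey: "fc-YOUR_API_KEY" });

// ...or, per step 2 above, set the FIRECRAWL_API_KEY environment variable instead
// and omit the apiKey field.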

Scraping a URL

To scrape a single URL with error handling, use the scrapeUrl method. It takes the URL as a parameter and returns the scraped data as an object.

try {
  const url = "https://example.com";
  const scrapedData = await app.scrapeUrl(url);
  console.log(scrapedData);
} catch (error) {
  console.error("Error occurred while scraping:", error.message);
}

Crawling a Website

To crawl a website with error handling, use the crawlUrl method. It takes the starting URL and optional parameters as arguments. The params argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.

const crawlUrl = "https://example.com";
const params = {
  crawlerOptions: {
    excludes: ["blog/"],
    includes: [], // leave empty for all pages
    limit: 1000,
  },
  pageOptions: {
    onlyMainContent: true,
  },
};
const waitUntilDone = true;
const timeout = 5;
const crawlResult = await app.crawlUrl(
  crawlUrl,
  params,
  waitUntilDone,
  timeout
);

Checking Crawl Status

To check the status of a crawl job with error handling, use the checkCrawlStatus method. It takes the job ID as a parameter and returns the current status of the crawl job.

const status = await app.checkCrawlStatus(jobId);
console.log(status);

Extracting structured data from a URL

With LLM extraction, you can easily extract structured data from any URL. We support zod schemas to make it even easier. Here is how to use it:

import FirecrawlApp from "@mendable/firecrawl-js";
import { z } from "zod";

const app = new FirecrawlApp({
  apiKey: "fc-YOUR_API_KEY",
});

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe("Top 5 stories on Hacker News"),
});

const scrapeResult = await app.scrapeUrl("https://news.ycombinator.com", {
  extractorOptions: { extractionSchema: schema },
});

console.log(scrapeResult.data["llm_extraction"]);

Search for a query

With the search method, you can search for a query in a search engine and get the top results along with the page content for each result. The method takes the query as a parameter and returns the search results.

const query = "what is mendable?";
const searchResults = await app.search(query, {
  pageOptions: {
    fetchPageContent: true, // Fetch the page content for each search result
  },
});

Contributing

We love contributions! Please read our contributing guide before submitting a pull request.

It is the sole responsibility of the end users to respect websites' policies when scraping, searching and crawling with Firecrawl. Users are advised to adhere to the applicable privacy policies and terms of use of the websites prior to initiating any scraping activities. By default, Firecrawl respects the directives specified in the websites' robots.txt files when crawling. By utilizing Firecrawl, you expressly agree to comply with these conditions.

License Disclaimer

This project is primarily licensed under the GNU Affero General Public License v3.0 (AGPL-3.0), as specified in the LICENSE file in the root directory of this repository. However, certain components of this project are licensed under the MIT License. Refer to the LICENSE files in these specific directories for details.

Please note:

  • The AGPL-3.0 license applies to all parts of the project unless otherwise specified.
  • The SDKs and some UI components are licensed under the MIT License. Refer to the LICENSE files in these specific directories for details.
  • When using or contributing to this project, ensure you comply with the appropriate license terms for the specific component you are working with.

For more details on the licensing of specific components, please refer to the LICENSE files in the respective directories or contact the project maintainers.

firecrawl's People

Contributors

100gle, calebpeffer, chand1012, dependabot[bot], elimisteve, eltociear, ericciarla, george-zakharov, jakobstadlhuber, jhoseph88, kenthsu, keredu, kun432, lakr233, mattjoyce, mattzcarey, mdp, mogery, nickscamara, niublibing, rafaelsideguide, rogerserper, rombru, sanix-darker, simonha9, snippet, szepeviktor, tak-s, tomkosm, tractorjuice

firecrawl's Issues

Unable to run python sdk sample code from README

Traceback (most recent call last):
File "/Users/howardgil/Desktop/Projects/nat-sec-hackathon/firecrawl.py", line 1, in
from firecrawl import FirecrawlApp
File "/Users/howardgil/Desktop/Projects/nat-sec-hackathon/firecrawl.py", line 1, in
from firecrawl import FirecrawlApp
ImportError: cannot import name 'FirecrawlApp' from partially initialized module 'firecrawl' (most likely due to a circular import) (/Users/howardgil/Desktop/Projects/nat-sec-hackathon/firecrawl.py)

Feat: Convert images in pdfs to images that can be accessed by the user

Some customers want to access images inside PDFs on the web. I'm not sure if llama-index supports this by default?

If we can get the images, we may need to start hosting ourselves in S3 too. This is probably a better solution for ALL images, since people should be cleaning out links to images on external URLs because of data exfil problems.

limits include filtered out paths

the crawl limit is applied before the paths are filtered out.

base url: test.com
limit: 2
included links: ["/pages/*"]

links on test.com in order:

[
"/home",
"/imprint",
"/about",
"pages/1",
"pages/2",
"pages/3"
]

expected links to be crawled: ["pages/1","pages/2"]

current links that are crawled: []
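A minimal sketch of the expected order of operations (the function and the pattern handling are illustrative, not the actual implementation):

// Illustrative only: apply the include patterns first, then the limit.
function selectLinksToCrawl(links: string[], includes: string[], limit: number): string[] {
  const matchesInclude = (link: string): boolean =>
    includes.length === 0 ||
    includes.some((pattern) =>
      new RegExp("^/?" + pattern.replace("*", ".*") + "$").test(link)
    );
  return links.filter(matchesInclude).slice(0, limit);
}

// With the example above:
// selectLinksToCrawl(links, ["pages/*"], 2)  =>  ["pages/1", "pages/2"]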

[Feat] Provide more details for 429 Rate limit reached

Other APIs provide details within the 429 response that enables a calculation or even the details of when to retry.
For example:

Groq:

Error: Error code: 429 - {'error': {'message': 'Rate limit reached for model llama3-70b-8192 in organization org_xxxxxxx on tokens per minute (TPM): Limit 7000, Used 0, Requested ~12903. Please try again in 50.597142857s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}
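Something comparable for Firecrawl might look like the body below (purely illustrative; none of these fields exist in the current API):

// Purely illustrative 429 body; field names are suggestions, not the current API.
const exampleRateLimitResponse = {
  error: "Rate limit reached",
  limit: 100,            // requests allowed per window
  remaining: 0,
  retryAfterSeconds: 42, // when it is safe to retry
};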

ModuleNotFoundError: No module named 'firecrawl'

I do pip install firecrawl-py

I cannot run the crawler. I get this when installing the SDK

WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
Requirement already satisfied: firecrawl-py in /opt/homebrew/lib/python3.11/site-packages (0.0.6)
Requirement already satisfied: requests in /opt/homebrew/lib/python3.11/site-packages (from firecrawl-py) (2.31.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (1.26.18)
Requirement already satisfied: certifi>=2017.4.17 in /opt/homebrew/lib/python3.11/site-packages (from requests->firecrawl-py) (2024.2.2)
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'
WARNING: Skipping /opt/homebrew/lib/python3.11/site-packages/packaging-24.0.dist-info due to invalid metadata entry 'name'

After that when I run the file, it gives back:

from firecrawl import FirecrawlApp
ModuleNotFoundError: No module named 'firecrawl'

[Feat] Docker deployment

Could you please add support for docker deployment to streamline setting up and running the project?

Thank you

Get Code for LLM Extract returns bad JSON

I understand LLM Extract is in alpha and Get Code is likely nondeterministic so feel free to ignore. But this is what it gave me:
Screenshot 2024-05-04 at 7 29 40 PM

It was missing a closing apostrophe and some commas.
Screenshot 2024-05-04 at 7 30 00 PM

Fixed JSON looked like this
Screenshot 2024-05-04 at 7 30 51 PM

[Bug] Limit on /search is not deterministic

Right now we limit the search results at the SERP API level by using searchOptions.limit: n.

The problem is that some search results can be social media pages or websites that we don't support, so scraping fails on them. This ends up causing the /search endpoint to return fewer results than expected.

The idea here is that we should search for n + y results, where n is the limit and y is a fixed constant. That way, if some fail, we can use the spare y links and keep trying to get documents for them until we hit the requested limit n.
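A rough sketch of that over-fetch-and-backfill idea (helper names like serpSearch and scrapeUrl are illustrative, not the actual implementation):

// Illustrative sketch: fetch n + y candidates, then backfill failed scrapes.
const OVERFETCH = 5; // the constant "y"

async function searchWithBackfill(query: string, limit: number): Promise<any[]> {
  const candidates: string[] = await serpSearch(query, limit + OVERFETCH); // n + y candidate URLs
  const results: any[] = [];
  for (const url of candidates) {
    if (results.length >= limit) break; // stop once we have n successful scrapes
    try {
      results.push(await scrapeUrl(url)); // the "get documents" step for one URL
    } catch {
      // Unsupported page (e.g. social media): skip it and let a spare candidate fill the slot.
    }
  }
  return results;
}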

Use a standard for metadata

Use a standard for the metadata returned by the API.
Users of the API may add their own metadata, and it could overwrite or conflict with the API's metadata if there is no standard.
Use a common prefix or something similar to identify metadata captured by the API.
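For example (the prefix is purely illustrative), a reserved prefix keeps API-captured fields distinct from user-supplied ones:

// Illustrative only: "fc:" is a made-up prefix, not something the API uses today.
const apiMetadata = { "fc:title": "Mendable", "fc:sourceURL": "https://mendable.ai/" };
const userMetadata = { title: "My own label" };
const merged = { ...userMetadata, ...apiMetadata }; // keys can no longer collide silently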

401 when checking job status

I'm trying to use the example.js you provided in the repo.

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({apiKey: "my api key"});

const crawlResult = await app.crawlUrl('docs.babylonjs.com', {crawlerOptions: {excludes: ['blog/*'], limit: 10}}, false);
console.log(crawlResult)

const jobId = await crawlResult['jobId'];


let job;
while (true) {
    console.log("checking ",app.apiKey);
  job = await app.checkCrawlStatus(jobId);
  if (job.status == 'completed') {
    break;
  }
  console.log(job);
  await new Promise(resolve => setTimeout(resolve, 1000)); // wait 1 second
}

console.log(job.data[0].content);

I get

{ jobId: '678757b8-0d03-4c56-8017-a6b04136ad07' }
checking fc-4a0f64912306448c975701198d28b85e
file:///D:/Web/webscraper/node_modules/@mendable/firecrawl-js/build/index.js:91
throw new Error(error.message);
^

Error: Request failed with status code 401
at FirecrawlApp. (file:///D:/Web/webscraper/node_modules/@mendable/firecrawl-js/build/index.js:91:23)
at Generator.throw ()
at rejected (file:///D:/Web/webscraper/node_modules/@mendable/firecrawl-js/build/index.js:5:65)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

It seems like it starts to crawl but when checking the job, it gets a 401. why might that be?

[Feat] Idempotency key

"Consider adding idempotency feature for our backend POST apis, and allow client to pass an idempotency key to avoid submitting duplicate jobs"

Suggested by @by12380 on Discord.
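A hypothetical sketch of what that could look like from the client side (the Idempotency-Key header is the suggested feature, not something the current API supports):

// Hypothetical: retrying with the same key would return the original job
// instead of submitting a duplicate crawl.
const response = await fetch("https://api.firecrawl.dev/v0/crawl", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: "Bearer YOUR_API_KEY",
    "Idempotency-Key": "client-generated-uuid", // same key => same job
  },
  body: JSON.stringify({ url: "https://mendable.ai" }),
});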

Add a timeout parameter to the api

One thing that would be useful is the ability to set a timeout on these requests - a customer ended up implementing that on their side.

[Feat] Strip non-content tags, headers, footers

The markdown would be much more useful if you stripped headers, footers, and other non-content elements like filters that are not core content (i.e. low value for RAG/context), either using tag- or class-based removal from the HTML, something like Mozilla's Readability, or both! Highly opinionated class-based removal is risky but produces high-value content and less noise.

For example a language selector in a header gets produced and should be stripped:

[Skip to main content](#main-content)

Select LanguageEnglishAfrikaansAlbanianArabicArmenianAzerbaijaniBasqueBelarusianBengaliBosnianBulgarianCatalanCebuanoChinese (Simplified)Chinese (Traditional)CroatianCzechDanishDutchEsperantoEstonianFilipinoFinnishFrenchGalicianGeorgianGermanGreekGujaratiHaitian CreoleHausaHebrewHindiHmongHungarianIcelandicIgboIndonesianIrishItalianJapaneseJavaneseKannadaKhmerKoreanLaoLatinLatvianLithuanianMacedonianMalayMalteseMaoriMarathiMongolianNepaliNorwegianPersianPolishPortuguesePunjabiRomanianRussianSerbianSlovakSlovenianSomaliSpanishSwahiliSwedishTamilTeluguThaiTurkishUkrainianUrduVietnameseWelshYiddishYorubaZulu

Here is a starter list. You should probably test against a couple thousand random pages and use an LLM like Haiku with vision as a judge.

const exclude = [
  'header', '.header', '.top', '.navbar', '#header',
  'footer', '.footer', '.bottom', '#footer',
  '.sidebar', '.side', '.aside', '#sidebar',
  '.modal', '.popup', '#modal', '.overlay',
  '.ad', '.ads', '.advert', '#ad',
  '.lang-selector', '.language', '#language-selector',
  '.social', '.social-media', '.social-links', '#social',
  '.menu', '.navigation', 'nav', '#nav',
  '.breadcrumbs', '#breadcrumbs',
  '.form', 'form', '#search-form',
  'script', 'noscript'
];
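For the Readability route mentioned above, a minimal sketch (assuming the jsdom and @mozilla/readability packages; this is not how Firecrawl currently does it):

import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

// Extract only the main article content from raw HTML, dropping nav/header/footer noise.
function extractMainContent(html: string, url: string): string | null {
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  return article ? article.content : null; // HTML of the main content, or null if none found
}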

[Feat] Be able to pass a timeout param to the endpoints

Enable the user to pass a timeout parameter to both the scrape and the crawl endpoints. If the timeout is exceeded, send the user a clear error message. On the crawl endpoint, return any pages that have already been scraped, but include a message noting that the timeout was exceeded.

If the task is completed within two days, we'll include a $10 tip :)

This is an intro bounty. We are looking for exciting people that will buy in so we can start to ramp up.

[Feat] Add ability/option to transform relative to absolute urls in page

When scraping, and mostly crawling, provide the ability to have all relative urls changed to absolute urls (for further processing or link extraction).

E.g. [The PDF file](/assets/file.pdf) => [The PDF file](https://site.com/assets/file.pdf)

Sample solution, a Markdown post-processor hook:

import re
from urllib.parse import urljoin

def convert_relative_urls(text, base_url):
    # Regex to match Markdown links that don't start with http
    regex = r'\]\((?!http)([^)]+)\)'
    # Function to prepend the base URL to the matched relative URL, handling '../'
    def replace_func(match):
        # Combine the base URL with the relative URL properly handling '../'
        full_url = urljoin(base_url + '/', match.group(1))
        return f"]({full_url})"
    # Replace the matched patterns in the text
    return re.sub(regex, replace_func, text)

# Example usage
markdown_text = "[The PDF file](/assets/file.pdf), [other file](../page/thing.pdf)"
base_url = "https://site.com/subdir/"
converted_text = convert_relative_urls(markdown_text, base_url)

Oops... noticed we are in TypeScript 😝:

function convertRelativeUrls(text: string, baseUrl: string): string {
  const regex = /\]\((?!http)([^)]+)\)/g;
  
  // Function to prepend the base URL to the matched relative URL, handling '../'
  const replaceFunc = (match: string, group1: string): string => {
    // Create a new URL based on the relative path and the base URL
    const fullUrl = new URL(group1, baseUrl).toString();
    return `](${fullUrl})`;
  };

  // Replace the matched patterns in the text
  return text.replace(regex, replaceFunc);
}

// Example usage
const markdownText = "[The PDF file](/assets/file.pdf), [other file](../page/thing.pdf)";
const baseUrl = "https://site.com/subdir/";
const convertedText = convertRelativeUrls(markdownText, baseUrl);

OpenAPI Spec

I saw you have used Mintlify but couldn't find the OpenAPI spec itself. Could you share it?

Remove 'cookies' text when removing headers/footers, etc

Remove any cookie-notice text when removing headers and footers. Many sites in Europe display a cookie acceptance message, and sometimes this is the only text returned.

Sometimes it captures something like:

"Skip to main content\n\nCookies \n------------------------------\n\nWe use some essential cookies to make this service work.\n\nWe\u2019d also like to use analytics cookies so we can understand how you use the service and make improvements.\n\nAccept analytics cookies Reject analytics cookies How we use cookies\n\nYou can change your cookie settings\n at any time.\n\nHide cookie message\n\n"

[Feat] Error handling middleware for the API

When errors occur in deeply nested functions, there isn't a way for us to bubble up custom error messages and codes to the API layer.

Proposal: Create a custom Error type and Middleware that intercepts errors.

Custom error type:

class AppError extends Error {
  public readonly statusCode: number;
  public readonly isOperational: boolean;

  constructor(message: string, statusCode: number, isOperational: boolean = true) {
    super(message);
    this.statusCode = statusCode;
    this.isOperational = isOperational; // Indicates this is a known type of error
    Object.setPrototypeOf(this, new.target.prototype); // restore prototype chain
    Error.captureStackTrace(this, this.constructor);
  }
}

Deep function:

async function someDeepFunction(): Promise<any> {
  try {
    // Some logic that might fail
    if (someConditionNotMet) {
      throw new AppError('Specific error message', 404);
    }
    // more logic
  } catch (error) {
    if (error instanceof AppError) throw error; // preserve the original message and status code
    throw new AppError('Error accessing resource', 500);
  }
}

Then these errors would be intercepted and cleaned for users by a middleware at the express level.
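The Express-level piece could be as simple as the following (a sketch of the proposal, not the actual implementation):

import { Request, Response, NextFunction } from "express";

// Error-handling middleware, registered after all routes via app.use(errorHandler).
function errorHandler(err: Error, req: Request, res: Response, next: NextFunction) {
  if (err instanceof AppError && err.isOperational) {
    // Known, expected error: surface its message and status code to the client.
    return res.status(err.statusCode).json({ error: err.message });
  }
  // Unknown error: log it and return a generic 500 so internals don't leak.
  console.error(err);
  return res.status(500).json({ error: "Internal server error" });
}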

Wrong hyperlinks in readme

Some links in your readme point to the non-existent firecrawl.com domain, in the API Key and How to use it sections.

[Feat] Cancel job route

"Provide an API to cancel jobs, especially for expensive ones. Setting the default limit at 10,000 could potentially break someone’s bank."

Suggested by @by12380 on Discord

[Test] Add integration tests for complex and larger variety of webpages

In tweaking and growing the HTML cleanup and the HTML-to-Markdown conversion, I highly recommend adding integration tests, using either live webpages (to also exercise fetching and dynamic websites) or at least saved HTML pages with complex layouts (and bad HTML, especially for the cleanup).

  • Find a list of pages to use as a test suite, covering a variety of layouts
  • Add the integration tests (a rough sketch of what one could look like follows)
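A rough sketch of what one such test could look like (the test runner and the convertHtmlToMarkdown helper are assumptions, not the project's actual API):

import { readFileSync } from "fs";
// Hypothetical helper: stands in for whatever html-to-md entry point the repo exposes.
import { convertHtmlToMarkdown } from "../src/html-to-md";

describe("html-to-md on saved complex pages", () => {
  it("strips the language selector from a docs-page fixture", () => {
    // Hypothetical saved fixture with a header language selector, like the example above.
    const html = readFileSync("fixtures/docs-page-with-lang-selector.html", "utf-8");
    const markdown = convertHtmlToMarkdown(html);
    expect(markdown).not.toContain("Select Language");
    expect(markdown).toContain("# "); // still produces real content
  });
});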
