
Comments (8)

aduh95 avatar aduh95 commented on June 11, 2024

I wanted to add a file cache to avoid the transformation in the common case where most source files are unchanged since the last invocation.

Here's how I would do it:

import { createReadStream } from 'node:fs';
import { createHash } from 'node:crypto';

export async function resolve(specifier, context, next) {
  const result = await next(specifier, context);
  const url = new URL(result.url);
  if (url.protocol !== 'file:') return result; // for e.g. data: URLs
  // toArray() collects the digest chunks from the hash stream; it returns a Promise
  const hashChunks = await createReadStream(url).pipe(createHash('sha256')).toArray();
  url.searchParams.set(
    import.meta.url, // An almost certainly unique key
    Buffer.concat(hashChunks).toString('base64url')
  );
  return { ...result, url: url.href };
}

By adding the hash to the resolved URL, you are guaranteed per spec that it won't be loaded more than once, so you don't need to implement your own cache.

What is the purpose of your custom loader: scheme? I'm not sure I see why you would need a special scheme.

This leads to my final question: whose job is it to cache?

According to the current ES spec, there can be only one module per URL. IIRC it's also a limitation of V8, trying to load more than one module on the same URL would lead to undefined behavior. For this reason, Node.js has an internal module cache which loaders cannot access but can rely upon.
With the addition of import attributes, this is slightly more complicated (the module cache is now Map<SerializedKey, ModuleNamespace>, where SerializedKey is a serialization of the URL string with the import attributes), but the principle still holds.
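To make the keying concrete, here is a small sketch of what such a serialization could look like. The format is illustrative, not Node.js's internal representation: the point is only that the key combines the URL with a stable (order-independent) serialization of the attributes.

```javascript
// Hypothetical serialization: URL plus sorted import attributes.
function serializeModuleCacheKey(url, attributes = {}) {
  const sorted = Object.keys(attributes)
    .sort()
    .map((name) => `${name}:${attributes[name]}`);
  return `${url}\0${sorted.join(',')}`;
}
```

The same URL imported with different attributes (e.g. `with { type: 'json' }`) then yields a distinct key, and therefore a distinct module instance.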

So loaders are of course free to add an additional cache layer if they see fit, but I'd expect that wouldn't be necessary for most use cases.

from loaders.

laverdet avatar laverdet commented on June 11, 2024

I think you misunderstood. I am talking about a persistent file cache for caching the results of transformations between different runs of nodejs, not an in-memory cache for caching instances of modules within a single process. The motivation is explained clearly in the first few lines of the comment.


According to the current ES spec, there can be only one module per URL. IIRC it's also a limitation of V8, trying to load more than one module on the same URL would lead to undefined behavior. For this reason, Node.js has an internal module cache which loaders cannot access but can rely upon.

I'm sorry but none of this is true.

current ES spec, there can be only one module per URL

The resolution process is not specified by es262 at all. HostLoadImportedModule is host-defined and can be anything. They punted this to other specifications, and rightfully so.

IIRC it's also a limitation of V8

v8 doesn't care at all about the module URL; it's just metadata on a module record. When you invoke Module::InstantiateModule you pass a callback which implements the aforementioned host-defined HostLoadImportedModule operation:
https://github.com/v8/v8/blob/33651e6252eb96ab12cb1a584385c4f7a60493c2/include/v8-script.h#L205-L217

We can verify with my other project isolated-vm which is as close to raw v8 bindings as you can get in nodejs:

const ivm = require('isolated-vm');
void async function() {
    const isolate = new ivm.Isolate();
    for (let ii = 0; ii < 10; ++ii) {
        console.log(await isolate.compileModule('import foo from "foo"; export {};', { filename: 'file:///wow' }));
    }
}();
Module {}
Module {}
Module {}
Module {}
Module {}
Module {}
Module {}
Module {}
Module {}
Module {}

You can also verify with vm which behaves the same way.


aduh95 avatar aduh95 commented on June 11, 2024

I'm sorry but none of this is true.

current ES spec, there can be only one module per URL

The resolution process is not specified by es262 at all.

ecma262 defines a [[LoadedModules]] structure that maps a [[Specifier]] to a [[Module]]. It turns out in Node.js we use the absolute URL returned by the resolve hook as [[Specifier]], not sure if that's required in this spec or if it's taken from another spec.

HostLoadImportedModule is host-defined and can be anything.

Sure but it must be stable: "If this operation is called multiple times with the same (referrer, specifier) pair […] then it must perform FinishLoadingImportedModule(referrer, specifier, payload, result) with the same result each time." But we're getting off topic, a loader doesn't have to comply with the ES spec anyway.

I am talking about a persistent file cache for caching the results of transformations between different runs of nodejs, not an in-memory cache for caching instances of modules within a single process

I completely missed that, sorry for the confusion.


ljharb avatar ljharb commented on June 11, 2024

ecma262 doesn't have any requirements on the specifier except that it's a string; HTML is what requires they be URLs. node is free to make whatever choice it wants here, since it's not a web browser.


GeoffreyBooth avatar GeoffreyBooth commented on June 11, 2024

I proposed 2 ad-hoc solutions here: loader:cache-key resolution specifier, and sourceURLs array on the result of resolve and load.

Could we create a cache based on the resolved URL and a hash (like a shasum) of the source returned by nextLoad? Then maybe the cache wouldn't need to know anything about the other hooks in the chain? It would be the same problem as designing a cache for loading files from disk, where the resolved URL is like the filename and nextLoad is like readFile.


laverdet avatar laverdet commented on June 11, 2024

Could we create a cache based on the resolved URL and a hash (like a shasum) of the source returned by nextLoad?

Ideally you wouldn't need to call nextLoad at all if you don't want to.

Imagine a generalized Babel loader that transforms your source based on the contents of your .babelrc. You invoke nodejs with something like: node --loader babel --loader transform-cache exotic-script.xyz.

Invocation 1 (fresh):

  • transform-cache looks for a cache entry for exotic-script.xyz, finds nothing
  • transform-cache invokes nextLoad
    • babel invokes nextLoad
      • default load invokes fs.readFile
    • babel runs transform, returns result
  • transform-cache saves a cache entry to .cache or wherever. The cache entry includes the source file's mtime, size, and transformed text
  • Script executes, nodejs exits

Invocation 2 (afterward):

  • transform-cache looks for a cache entry for exotic-script.xyz, finds previous entry
  • transform-cache stats the underlying moduleURL and compares its mtime and size. Finding them unchanged, it returns the previously transformed source text
  • Script executes, nodejs exits

What I'm suggesting is a cache scheme which allows us to elide the invocation of nextLoad entirely. With the scheme you suggested, you would need to read the original source text in addition to the cached source text every time.


GeoffreyBooth avatar GeoffreyBooth commented on June 11, 2024

Ideally you wouldn't need to call nextLoad at all if you don't want to.

That's fine. I guess what this illuminates though is that there can be varying goals for creating a cache: avoiding file reads (the goal you cited) or avoiding processing. Like for example if your loader is the one that does the transpilation, you could use the approach I suggested to load transpiled output from cache rather than doing the transpilation work again.


laverdet avatar laverdet commented on June 11, 2024

Yeah I didn't mean to say that either case is more valid than the other. Studying both is great.

Thinking about good "best practices" for caching would really benefit the ecosystem. Right now my intuition is that caching should live in a dedicated loader. If each loader implements its own caching mechanism then you might actually run into very poor performance on first load, because cache misses aren't free.

