
Comments (8)

zh217 commented on June 27, 2024

No. Here is a minimal example that you can test:

# run with usearch-2.9.2 installed

import usearch.index

idx = usearch.index.Index(ndim=1024, metric='ip')
idx.save('index')

# run with usearch-2.12.0 installed

import usearch.index

# will throw an error in usearch-2.12.0
idx = usearch.index.Index.restore('index', view=True)

There's no need to insert anything into the index to trigger the error. My guess is that the metadata written by the old version is messed up.

from usearch.

ashvardanian commented on June 27, 2024

There were no changes in the file format, but the number of checks and assertions grew. Apparently, one of those checks is hurting us here.

Does it also fail if you create an arbitrary index and then call .load, reinitializing it from a different file?


zh217 commented on June 27, 2024

It fails with a different error:

RuntimeError: Key type doesn't match, consider rebuilding

triggered by the following code:

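import usearch.index

idx_path = 'index'  # assumed: a file saved with usearch 2.9.2, as in the first example
idx = usearch.index.Index(ndim=1024, metric='ip')
idx.load(idx_path)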

which runs fine if downgraded to 2.9.2.


ashvardanian commented on June 27, 2024

Interesting. Any chance the file was corrupted somewhere in between?


zh217 commented on June 27, 2024

Update: this breaks in both directions. The old version cannot open indexes created by the new version either.


zh217 commented on June 27, 2024

> There were no changes in the file format, but the number of checks and assertions grew. Apparently, one of those checks is hurting us here.

In fact the file format changed due to a subtle change in code.

Compare:

enum class scalar_kind_t : std::uint8_t {
    unknown_k = 0,
    // Custom:
    b1x8_k = 1,
    u40_k = 2,
    uuid_k = 3,
    // Common:
    f64_k = 10,
    f32_k = 11,
    f16_k = 12,
    f8_k = 13,
    // Common Integral:
    u64_k = 14,
    u32_k = 15,
    u16_k = 16,
    u8_k = 17,
    i64_k = 20,
    i32_k = 21,
    i16_k = 22,
    i8_k = 23,
};

with:

enum class scalar_kind_t : std::uint8_t {
    unknown_k = 0,
    // Custom:
    b1x8_k,
    u40_k,
    uuid_k,
    // Common:
    f64_k,
    f32_k,
    f16_k,
    f8_k,
    // Common Integral:
    u64_k,
    u32_k,
    u16_k,
    u8_k,
    i64_k,
    i32_k,
    i16_k,
    i8_k,
};

so different versions interpret enums in the metadata differently.
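To see concretely which stored values disagree, the implicit numbering can be recomputed and compared with the explicit one. This is a minimal Python sketch that only transcribes the two enum listings above; the names NAMES, IMPLICIT, EXPLICIT and CHANGED are illustrative and not part of usearch.

```python
# Enumerator names in declaration order, identical in both listings above.
NAMES = [
    "unknown_k", "b1x8_k", "u40_k", "uuid_k",
    "f64_k", "f32_k", "f16_k", "f8_k",
    "u64_k", "u32_k", "u16_k", "u8_k",
    "i64_k", "i32_k", "i16_k", "i8_k",
]

# Listing without initializers: C++ assigns 0, 1, 2, ... in declaration order.
IMPLICIT = {name: value for value, name in enumerate(NAMES)}

# Listing with explicit initializers, copied from above.
EXPLICIT = {
    "unknown_k": 0, "b1x8_k": 1, "u40_k": 2, "uuid_k": 3,
    "f64_k": 10, "f32_k": 11, "f16_k": 12, "f8_k": 13,
    "u64_k": 14, "u32_k": 15, "u16_k": 16, "u8_k": 17,
    "i64_k": 20, "i32_k": 21, "i16_k": 22, "i8_k": 23,
}

# Names whose on-disk byte differs between the two numberings.
CHANGED = [name for name in NAMES if IMPLICIT[name] != EXPLICIT[name]]

if __name__ == "__main__":
    for name in CHANGED:
        print(f"{name}: {IMPLICIT[name]} vs {EXPLICIT[name]}")
```

Only the first four enumerators keep their values; everything from f64_k onward is written as a different byte. Some bytes are valid in both numberings but mean different things (10 is u16_k in one and f64_k in the other), so the value alone cannot reveal which numbering wrote it.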

Since the metadata stored on disk also carries version information, we can make the new version of the library open old indexes by mapping the old values to the new ones. There seems to be no easy fix for the reverse direction, however.

As this definitely breaks compatibility between versions (affecting all f16, f32, f64 indices and all languages), this should be marked as a breaking change.


zh217 commented on June 27, 2024

We can localize the damage by changing what is returned by this function:

inline index_dense_metadata_result_t index_dense_metadata_from_path(char const* file_path) noexcept {
    index_dense_metadata_result_t result;
    std::unique_ptr<std::FILE, int (*)(std::FILE*)> file(std::fopen(file_path, "rb"), &std::fclose);
    if (!file)
        return result.failed(std::strerror(errno));

    // Read the header
    std::size_t read = std::fread(result.head_buffer, sizeof(index_dense_head_buffer_t), 1, file.get());
    if (!read)
        return result.failed(std::feof(file.get()) ? "End of file reached!" : std::strerror(errno));

    // Check if the file immediately starts with the index, instead of vectors
    result.config.exclude_vectors = true;
    if (std::memcmp(result.head_buffer, default_magic(), std::strlen(default_magic())) == 0)
        return result;

    if (std::fseek(file.get(), 0L, SEEK_END) != 0)
        return result.failed("Can't infer file size");

    // Compute where the index header would be if the file starts with a matrix
    // of vectors prefixed by two 32-bit or two 64-bit dimensions
    std::size_t const file_size = std::ftell(file.get());
    std::uint32_t dimensions_u32[2]{0};
    std::memcpy(dimensions_u32, result.head_buffer, sizeof(dimensions_u32));
    std::size_t offset_if_u32 = std::size_t(dimensions_u32[0]) * dimensions_u32[1] + sizeof(dimensions_u32);
    std::uint64_t dimensions_u64[2]{0};
    std::memcpy(dimensions_u64, result.head_buffer, sizeof(dimensions_u64));
    std::size_t offset_if_u64 = std::size_t(dimensions_u64[0]) * dimensions_u64[1] + sizeof(dimensions_u64);

    // Check if it starts with 32-bit
    if (offset_if_u32 + sizeof(index_dense_head_buffer_t) < file_size) {
        if (std::fseek(file.get(), static_cast<long>(offset_if_u32), SEEK_SET) != 0)
            return result.failed(std::strerror(errno));
        read = std::fread(result.head_buffer, sizeof(index_dense_head_buffer_t), 1, file.get());
        if (!read)
            return result.failed(std::feof(file.get()) ? "End of file reached!" : std::strerror(errno));

        result.config.exclude_vectors = false;
        result.config.use_64_bit_dimensions = false;
        if (std::memcmp(result.head_buffer, default_magic(), std::strlen(default_magic())) == 0)
            return result;
    }

    // Check if it starts with 64-bit
    if (offset_if_u64 + sizeof(index_dense_head_buffer_t) < file_size) {
        if (std::fseek(file.get(), static_cast<long>(offset_if_u64), SEEK_SET) != 0)
            return result.failed(std::strerror(errno));
        read = std::fread(result.head_buffer, sizeof(index_dense_head_buffer_t), 1, file.get());
        if (!read)
            return result.failed(std::feof(file.get()) ? "End of file reached!" : std::strerror(errno));

        result.config.exclude_vectors = false;
        result.config.use_64_bit_dimensions = true;
        if (std::memcmp(result.head_buffer, default_magic(), std::strlen(default_magic())) == 0)
            return result;
    }

    return result.failed("Not a dense USearch index!");
}
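As I read the function above, it looks for the index header in three places: at the very start of the file, after a vector matrix prefixed by two 32-bit dimensions, or after one prefixed by two 64-bit dimensions. The two candidate offsets can be sketched in Python as follows. The helper name candidate_header_offsets is hypothetical, and the native byte order is an assumption made to mirror the std::memcpy calls.

```python
import struct


def candidate_header_offsets(head: bytes) -> tuple:
    """Return (offset_if_u32, offset_if_u64) for the first 16 bytes of a file.

    Mirrors the arithmetic in the C++ function: the product of the two
    leading dimensions plus the size of the dimensions themselves.
    Python integers do not wrap around the way std::size_t would.
    """
    if len(head) < 16:
        raise ValueError("need at least 16 bytes of the file header")
    # "=" means native byte order with standard sizes: I is 4 bytes, Q is 8.
    dim0_u32, dim1_u32 = struct.unpack_from("=II", head, 0)
    dim0_u64, dim1_u64 = struct.unpack_from("=QQ", head, 0)
    offset_if_u32 = dim0_u32 * dim1_u32 + 2 * 4
    offset_if_u64 = dim0_u64 * dim1_u64 + 2 * 8
    return offset_if_u32, offset_if_u64


if __name__ == "__main__":
    # A prefix declaring dimensions 3 and 4 as 32-bit integers.
    head = struct.pack("=II", 3, 4) + bytes(8)
    print(candidate_header_offsets(head)[0])
```

Every one of these paths ends in the same `return result`, which is why a single post-processing step on the result is attractive.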

Since the result is returned in various places inside the function, maybe it is best to add a method on index_dense_metadata_result_t to "upgrade" its version to the new enum by mutating its headers appropriately.

I can make a pull request for it if that's OK.


ashvardanian commented on June 27, 2024

Good catch @zh217! I think a good solution would be a custom function to convert enum to integer and vice-versa, with respect to the file version. Can you add it in index_plugins?
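A rough sketch of what such a version-aware conversion could look like, written in Python for brevity rather than as the C++ that would actually live in the library. The function names and the implicit_numbering flag are hypothetical. How that flag is derived from the version stored in the file is deliberately left out, because the thread does not say which releases use which numbering.

```python
# Enumerator names in declaration order, as in the enum listings in this thread.
NAMES = [
    "unknown_k", "b1x8_k", "u40_k", "uuid_k",
    "f64_k", "f32_k", "f16_k", "f8_k",
    "u64_k", "u32_k", "u16_k", "u8_k",
    "i64_k", "i32_k", "i16_k", "i8_k",
]
# Values from the listing with explicit initializers, in the same order.
EXPLICIT_VALUES = [0, 1, 2, 3, 10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 23]


def _values(implicit_numbering: bool):
    """Stored values for NAMES under the chosen numbering."""
    return list(range(len(NAMES))) if implicit_numbering else EXPLICIT_VALUES


def kind_from_stored(value: int, implicit_numbering: bool) -> str:
    """Integer read from a file -> enumerator name ("unknown_k" if unrecognized)."""
    return dict(zip(_values(implicit_numbering), NAMES)).get(value, "unknown_k")


def stored_from_kind(name: str, implicit_numbering: bool) -> int:
    """Enumerator name -> integer to write under the chosen numbering."""
    return dict(zip(NAMES, _values(implicit_numbering)))[name]


def translate(value: int, from_implicit: bool, to_implicit: bool) -> int:
    """Re-encode a stored value from one numbering into the other."""
    return stored_from_kind(kind_from_stored(value, from_implicit), to_implicit)


if __name__ == "__main__":
    # The byte for f64_k under implicit numbering, re-encoded explicitly.
    print(translate(4, from_implicit=True, to_implicit=False))
```

Going through the enumerator name keeps both directions symmetric, so the same pair of functions can serve reading and writing.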

