amazon-ion / ion-rust Goto Github PK
View Code? Open in Web Editor NEWRust implementation of Amazon Ion
License: Apache License 2.0
Rust implementation of Amazon Ion
License: Apache License 2.0
#102 illustrates that we should have some kind of CI for AArch64 if we want to ensure that things build/run on this platform. A quick survey on Github Actions doesn't make it clear that it is an option here, but regardless, we should do something for ARM.
Towards #36. We should have some reasonable tests that execute the text write paths for the Ion C bindings.
In other implementations we have either some kind of catalog or resolver API that provides a way for the implementation to call user code to get shared symbol tables. We should have an appropriately similar trait API in ion-rust and expose it to the parser APIs appropriately.
The Cursor
trait is generic over some type D
that implements IonDataSource
.
pub trait Cursor<D: IonDataSource> {
// ...
}
However, defining the trait as Cursor<D>
means that all types that are generic over its implementations must also be generic over D
.
impl<D: IonDataSource, C: Cursor<D>> Reader<D, C> {
// ^-- D is now in Reader's signature ---^
// ...
}
If we make D
an associated type instead...
pub trait Cursor {
type DataSource: IonDataSource;
// ...
}
the type signatures of "downstream" types can be simplified quite a bit:
pub struct Reader<C: Cursor> {
// ...
}
Currently IonCAnnotationsFieldWriter
has a method to support generic composition of annotations and field writing that looks something like:
/// Writes a value within a context of annotations and/or a field name
///
/// Note that it is undefined behavior if a value writing method is **not** called
/// or if a value writing method is called **more than once**. For this reason,
/// most users should prefer the `field()` and/or `annotations()` method on the
/// [`IonCWriter`](./trait.IonCWriter.html) trait as it doesn't have this edge case.
fn write_annotations_and_field<'a, A, F, FN>(
&mut self,
annotations: A,
field: F,
applier: FN,
) -> IonCResult<()>
where
A: Into<Option<&'a [&'a str]>>,
F: Into<Option<&'a str>>,
FN: Fn(&mut Self::AFValueWriter) -> IonCResult<()>;
@zslayton mentioned in #95 that he thought this might be unsafe
, but it is not a memory safety concern. It should be possible to make this API less problematic, but it requires more abstraction/implementation with respect to the host implementation (e.g. only write the fields when actually using the value writer). I didn't think it was worth it to do at this time, because if you use the context object, it handles the usage safety for you.
Currently all of ion-c-sys
higher-level APIs expect fully resolved symbol tokens as &str
, we should also export some higher-level APIs for ION_SYMBOL_TOKEN
.
We are currently limited to i64
integers in the bindings, we should have some idiomatic integration for ION_INT
which is Ion C's arbitrary integer type.
Towards #36. We should have some reasonable tests that execute the binary write paths for the Ion C bindings.
Towards #36. Replace the existing unit tests with some deeper tests using the current higher-level API
Provide bridge APIs for ION_TIMESTAMP
into the idiomatic Rust APIs.
Related to #37.
Based on #106, we really should have CI for this platform. Github Actions doesn't support this directly (only linux target is Ubuntu).
Here is a docker file that demonstrates largely what would work here.
https://gist.github.com/almann/f4ccce9747361e71b2b2e7693b83e792
VarInt
and VarUInt
currently store the value being read in as a 64-bit value. This is likely to be sufficient for almost all use cases, but the Ion spec allows for values of any size. If/when we have demand for it, we should investigate using an implementation of big-integer.
Int
and VarInt
represent the value of the signed integer being read in as an i64
, which cannot represent -0
on its own. Int
and VarInt
are already structs, so we can add a sign
field to these and check for it when necessary.
The first_commit_candidate
branch should be reconciled with master
. It seems to be further along than master
, I suspect we should use it as a first cut for what we want to do in master
, but we should remove this branch (or merge it) once we're done with it.
@zslayton is probably best to comment here, but I think it is good to have an issue to track this.
Annotation APIs and helpers should be added to the Ion C bindings.
Related to #37.
As per https://twitter.com/webbonet/status/1306114634043273223
The C didn't compile with musl for lambda and we didn't really wanted to force the project to compile C.
We should investigate this and make sure there isn't something in Ion C that is broken here (could be an amzn/ion-c issue).
We should expose some of the underlying ION_SYMBOL_TABLE
APIs as higher-level ones in ion-c-sys
.
The BinaryCursor
has some deeply nested data structures. Instead of writing:
let code = self.cursor.value.header.length_code;
and the like, we should add some internal shortcut methods to allow something like this:
let code = self.header().length_code;
On RHEL/CentOS/Amazon Linux the install target on x86-64 installs to lib64
path, on other platforms (including Windows and macOS) this is lib
, we need to add lib64
to the build script.
This can be reproduced with the following docker image with a checkout of the repository.
# Build:
# docker build -t ion-rust-al2 --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) $(pwd)
#
# Run:
# docker run --interactive --tty --rm --mount type=bind,source="$(pwd)",target=/workspace ion-rust-al2
FROM amazonlinux:2
ARG USER_ID
ARG GROUP_ID
ARG WORKSPACE_DIR=/workspace
RUN yum -y install clang llvm-devel git gcc gcc-c++ wget curl tar make
RUN wget --quiet https://github.com/Kitware/CMake/releases/download/v3.18.3/cmake-3.18.3-Linux-x86_64.sh
RUN sh ./cmake-3.18.3-Linux-x86_64.sh --skip-license --prefix=/usr/local
RUN groupadd --gid $GROUP_ID user
RUN useradd -l --uid $USER_ID --gid $GROUP_ID user
RUN mkdir $WORKSPACE_DIR && chown $USER_ID.$GROUP_ID $WORKSPACE_DIR
USER user
WORKDIR $WORKSPACE_DIR
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | /bin/sh -s -- -y
ENV PATH=/home/user/.cargo/bin:$PATH
CMD cargo clean && cargo build
There appears to be an issue with ion_decimal_from_int
that requires a specifically allocated ION_DECIMAL
. Since this is abstracted away from the public API.
Specifically, the problem appears to be ion_decimal.c
L287-L293:
iERR ion_decimal_from_ion_int(ION_DECIMAL *value, decContext *context, ION_INT *p_int) {
ION_DECIMAL_SWITCH (
value,
IONCHECK(ion_int_to_decimal(p_int, quad_value, context)),
IONCHECK(_ion_int_to_decimal_number(p_int, num_value, context))
);
}
As a result, our bindings do manual initialization and use internal Ion C APIs. Once this is fixed upstream, we should eliminate this hack.
Develop a pure Rust implementation for Ion.
The create_header_byte_jump_table
function in binary/header.rs
will parse each possible byte value from 0 to 255 as though it were the one-octet type descriptor found at the beginning of all binary Ion values. The output from parsing each byte is stored in a newly allocated Vec
that can be used as a jump table to avoid re-calculating the meaning of same byte values repeatedly.
This is sub-optimal:
Vec
to be allocated when a fixed-size array would doVec
despite the fact that its contents will always be identical.Ideally, we would store a fixed-length array as a static constant in the header
module. Unlike C, however, the Rust compiler tries very hard to prevent programs from mutating global state. This means that initializing a static array is currently somewhat difficult to do without sacrificing either speed or safety.
Here are some options I've explored:
The lazy_static
crate provides a macro that allows you to declare and initialize a global constant. It works by wrapping your constant in a custom smartpointer type--the first time the pointer is dereferenced, lazy_static
will call a user-defined function to calculate the value of the constant. Every dereference that occurs from then on will receive the already-calculated value.
This is easy to implement, but profiling with valgrind
has shown that the indirection introduced by the smartpointer indirection adds an unacceptable amount of overhead in terms of CPU time. Derefencing the jump table each time you begin parsing a value in a stream with millions of values adds up quickly.
thread_local
copies of the jump table suffer from even worse indirection overhead, rendering them a non-starter.
If/when we add an IonSystem
concept, we may be able to have IonReader
s share a reference to a shared jump table living in the IonSystem
, but this may not be worth the complexity.
Initializing a fixed-size array (i.e. [T]
, not a Vec) is currently surprisingly challenging for non-trivial types. The best methods available only work for types that implement the Copy
trait, and types that implement Default
and where the array is <32 entries long.
We could choose to skirt this problem with unsafe
code, but this has its own problems:
unsafe
code, so we would need to scrutinize it very carefully.mem::uninitialized
(currently deprecated in nightly) and towards mem::MaybeUninit
(currently unstable).Rust is incrementally adding support for const fn
s, functions whose outputs can be calculated entirely at compile time. This is perfect for our use case -- all of the inputs are statically known (u8
values from 0 to 255) and processing them requires no information from the outside world. With this feature, we'll be able to simply write:
const HEADER_JUMP_TABLE: [IonResult<Option<IonValueHeader>>] = create_jump_table();
The compiler will calculate the definition of HEADER_JUMP_TABLE
at build time and then all interested entities can refer to it freely -- no wasted memory, no additional overhead from indirection.
At the time of this writing, const fn
s have severe restrictions on which language features may be used. Functionality like match
statements, the ?
operator, calling non-const
functions, looping, etc. are planned but not yet supported. These restrictions are so limiting that I have not been able to write an array initialization const fn
.
Hi, guys, one of my friends who work for Amazon recommend me ION recently. Then I find this repo.
I am not sure if my understanding is correct or not. Looks like this repo only has a binary ION reader. Do you guys wanna start text format parser after finish binary reader first, or don't have plans to do it?
After reading the master branch, looks like result.rs
, data_source.rs
, and cursor.rs
define traits, I guess you guys wanna make text
in src
and let it use types from types
mod and implemented all traits. If so, I can help to make some code.
Sorry for bothering you guys in issues, I have searched 'help wanted' but found nothing.
Thank you.
Cases around ~35 digits or so, convert to ION_DECIMAL
with a loss of precision, which is likely due to decQuad
limitations, we should probably check the status code and do decNumber
conversion or be more aggressive with usage of decNumber
.
Provide APIs to bridge decQuad
, decNumber
, and ION_DECIMAL
to the more idiomatic types in Rust.
Related to #37.
This is generally applicable to ion-rs
as well, but as we design the traits we want to have for our APIs, we should be consider how Generic Associated Traits (GAT) can be used to make our APIs reasonably flexible while maintaining static polymorphism.
Currently, delegation via our current APIs are difficult because we cannot have an associated type that can be constructed with a per invocation lifetime. This example illustrates this, and shows how we can work around this by the judicious use of unsafe
code, by essentially emulating the reference with pointers we know to be scoped appropriately. We also work around the lack of GAT by simplifying the value writer used in the implementation of IonCAnnotationsFieldWriter
for IonCWriterHandle
by refering to Self
which works, but makes the API potentially easier to misuse.
would be cool ;-)
The BinaryCursor
should be able to save its state and rewind to that position as long as the data source implements io::Seek
.
With #34, using the Ion C bindings is pretty low-level. In particular, one needs to deal with iERR
and also allocate and bind raw pointers. It would be nice if we could expose safer versions of the low-level APIs that integrate with Result
, make working with ION_TYPE
/ION_READER
/ION_WRITER
/etc. easier.
This would not be a replacement for the traits we have in the native Rust implementation, but at least make interacting with IonC in its full capacity easier.
The binary reading types each have some basic unit tests, but we should read the .10n
files from the ion-tests
repo to ensure spec compliance.
Should we be using ~
versions or ^
dependencies (the default)?
Originally posted by @almann in #91 (comment)
We should consider exposing higher-level APIs for ION_CATALOG
.
Much like #48, we should add the writer equivalents.
bigdecimal::BigDecimal
relies on num_bigint::BigInt
to provide the signed coefficient. BigInt
does not support maintaining sign when constructing it (bigint.rs
L2774-2786). This information is lost when converting from ION_DECIMAL
to BigDecimal
, so we should have some kind of API to preserve this if so desired.
One can always use ION_DECIMAL
and IonDecimalPtr
to get at the low-level Ion C APIs, but they shouldn't have to.
For cases that are larger than decQuad
there are issues using ion_decimal_from_ion_int
to convert the coefficient creating a status error of invalid context leading to generating a nan
decNumber
. Talking with @tgregg, I believe this is a bug in Ion C, and needs to be reproduced there and fixed upstream.
From PR #91:
As you know, it would be nice if we didn't have to do the copy here ๐
Originally posted by @almann in #91 (comment)
On a lark, I tried building ion-rs
and ion-c-sys
on Termux. However, I ran into the following build problems:
The problem is that certain Ion C APIs use c_char
pointers, and we hard coded them to i8
pointers which works on x86-64, but is u8
pointers on AArch64. We should use the c_char
type alias as it is defined appropriately on the platform being built.
We should extract the Reader/Writer traits from IonCReaderHandle
and IonCWriterHandle
. This will make generic/dynamic dispatched code against this API possible, to potentially layer things like Ion Hash (assuming we don't just layer it on ion-rs
).
The field writing API has a requirement that the ION_STRING
passed to them must live until a value is written. The trivial API surface makes this contract implicit:
let mut buf = vec![0; 128];
let len = {
let mut writer = IonCWriterHandle::new_buf_mode(buf.as_mut_slice(), WriterMode::Binary)?;
writer.start_container(ION_TYPE_STRUCT)?;
{
// no problem! 'static life for "name"
writer.set_field_name("name")?;
writer.write_string("kumo")?;
}
writer.finish_container()?;
writer.finish()?
}
An example where this breaks is here:
let mut buf = vec![0; 128];
let len = {
let mut writer = IonCWriterHandle::new_buf_mode(buf.as_mut_slice(), WriterMode::Binary)?;
writer.start_container(ION_TYPE_STRUCT)?;
{
{
let s = "name".to_string();
writer.set_field_name(s.as_str())?;
}
writer.write_string("kumo")?;
}
writer.finish_container()?;
writer.finish()?
}
In practice, users are probably not going to shoot themselves in the foot by separating the value writing calls from the field setting, but it would be nice to take away the foot gun.
We need an API that exposes the concept of symbol tokens and tables to users. It is probably worth looking at amazon-ion/ion-go#11 as an example of a modern take on this in another implementation, though of course, we should make this idiomatic to Rust as appropriate.
Here are some example Actions.
Currently the IonCReaderHandle
and IonCWriterHandle
have constructors from buffers in memory (&mut [u8]
). For full generality, we should support the ION_STREAM
interface as well as the file target.
While it is preferable to have pure rust implementation of Ion, it would be nice to be able to leverage (and compare) the amzn/ion-c implementation in Rust. We could add a sub-crate for ion-c-sys
, have the crate build ion-c
through the cmake
crate and leverage the bindgen
crate to expose low-level bindings.
It would be left as an exercise for future work to:
iERR
as a Result
).The IonError
enum's variants currently only offer a text description of the problem that arose, meaning that there's no way to programmatically respond to failures. IonError::IoError
and IonError::DecodingError
should have additional fields containing helpful contextual data. e.g. The std::io::Error
that was encountered or the byte offset where the reader found an illegal value.
I am not sure where the problem is exactly.
Note that
Running tarpaulin with
cargo tarpaulin -v --debug --release --run-types Doctests --out Html --workspace
or
cargo tarpaulin -v --debug --release --run-types Doctests --out Html --workspace -- --test-threads=1
fails on
Doc-tests ion-c-sys
Running `rustdoc --edition=2018 --crate-type lib --test /home/ANT.AMAZON.COM/therapon/Amazon/Work/GitHub/Ion/ts-ion-rust-1/ion-c-sys/src/lib.rs --crate-name ion_c_sys -L dependency=/home/ANT.AMAZON.COM/therapon/Amazon/Work/GitHub/Ion/ts-ion-rust-1/target/release/deps -L dependency=/home/ANT.AMAZON.COM/therapon/Amazon/Work/GitHub/Ion/ts-ion-rust-1/target/release/deps -L native=/home/ANT.AMAZON.COM/therapon/Amazon/Work/GitHub/Ion/ts-ion-rust-1/target/release/build/ion-c-sys-30c937fa6fac9e04/out/lib -L native=/home/ANT.AMAZON.COM/therapon/Amazon/Work/GitHub/Ion/ts-ion-rust-1/target/release/build/libloading-35f0411caee9e2f5/out --test-args --test-threads=1 --extern ion_c_sys=/home/ANT.AMAZON.COM/therapon/Amazon/Work/GitHub/Ion/ts-ion-rust-1/target/release/deps/libion_c_sys-c0e9a51780e178af.rlib --extern paste=/home/ANT.AMAZON.COM/therapon/Amazon/Work/GitHub/Ion/ts-ion-rust-1/target/release/deps/libpaste-07a2659eec7ed2cb.so --extern rstest=/home/ANT.AMAZON.COM/therapon/Amazon/Work/GitHub/Ion/ts-ion-rust-1/target/release/deps/librstest-f059a51b34ab8d79.so -C relocation-model=dynamic-no-pic -C link-dead-code -C debuginfo=2 --cfg=tarpaulin --persist-doctests /home/ANT.AMAZON.COM/therapon/Amazon/Work/GitHub/Ion/ts-ion-rust-1/target/doctests -Z unstable-options`
error: test failed, to rerun pass '--doc'
[ERROR tarpaulin] Building doctests failed
[ERROR tarpaulin] Cargo failed to run! Error: Building doctest failed
[INFO tarpaulin] Serializing tarpaulin debug log to tarpaulin-run-2020-08-10T22:01:17.635435492-07:00.json
Error: "Cargo failed to run! Error: Building doctest failed"
The tarpaulin debug log is not that informative
{"events":[{"ConfigLaunch":"<anonymous>"}]}
The BinaryCursor
currently skips the fractional seconds support when reading a Timestamp
into a DateTime<FixedOffset>
.
As per #34, we should have better testing for Ion C bindings. Running some tests against the test vectors at a minimum, and really refactoring that against what we have in good.rs
. In addition, we should exercise the bindings in a reasonable way in addition to the high-level test vectors.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.