Storing events seems like it'll be simpler than querying, so let's start brainstorming

Sketch out an API for recording events (mozilla/activity-stream-storage-prototype)

Comments (14)

commented on May 22, 2024

I'm using this list for other examples of events we might record.

from activity-stream-storage-prototype.

commented on May 22, 2024

A more structured design might add a store.define() API for every event we'd like to record. define would return an Event struct, which we could pass to set instead of the string name.

For example:

#[derive(Clone, Copy)]
enum Category {
  Page,
  Visit,
  Action,
}

struct Event {
  // ...
}

// First, define the events we expect to store. We might also compose these
// when querying...maybe?
// `fn define(&self, name: &str, category: Category) -> Event`.
let image_count = store.define("images", Category::Page);
let video_count = store.define("videos", Category::Page);

let dwell_time = store.define("dwell-time", Category::Visit);
let background_time = store.define("background-time", Category::Visit);
let referring_query = store.define("referring-search-query", Category::Visit);

let share_action = store.define("share", Category::Action);
let play_action = store.define("play", Category::Action);
let pause_action = store.define("pause", Category::Action);

let recorder = store.record("http://example.com");

// `fn set(&self, event: &Event, value: impl rusqlite::types::ToSql) -> Result<(), Error>`.
// This assumes our event values are dynamically typed, like columns in SQLite,
// though we could also extend `define` to take a type tag, and use our own
// trait instead of `ToSql`.
recorder.set(&image_count, 4).expect("Failed to update image count for page");
recorder.set(&video_count, 14).expect("Failed to update video count for page");

recorder.set(&dwell_time, Duration::minutes(2));
recorder.set(&background_time, Duration::seconds(3));
recorder.set(&referring_query, "lolcats");

recorder.set(&share_action, "send-to-device");
recorder.set(&play_action, "#cat-video-1");
recorder.set(&pause_action, "#cat-video-1");
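
For readers who want to try the shape of this API, here is a minimal in-memory sketch of the `define` / `record` / `set` flow. It is not the prototype's implementation: the `Value` enum (standing in for `ToSql`), the `Store::get` and `Event::category` helpers, the `RefCell`-based storage, and `set` returning nothing instead of a `Result` are all assumptions made for illustration.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Category {
    Page,
    Visit,
    Action,
}

/// Handle returned by `define`; `id` indexes into the store's list of names.
#[derive(Debug)]
pub struct Event {
    id: usize,
    category: Category,
}

impl Event {
    pub fn category(&self) -> Category {
        self.category
    }
}

/// Hypothetical dynamically typed value, standing in for `ToSql`.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Integer(i64),
    Text(String),
    Duration(Duration),
}

#[derive(Default)]
pub struct Store {
    names: RefCell<Vec<String>>,
    // Latest value per (URL, event id); a real store would persist this.
    values: RefCell<HashMap<(String, usize), Value>>,
}

impl Store {
    /// Registers an event name and returns a handle for it.
    pub fn define(&self, name: &str, category: Category) -> Event {
        let mut names = self.names.borrow_mut();
        names.push(name.to_owned());
        Event { id: names.len() - 1, category }
    }

    /// Starts recording events for one URL.
    pub fn record(&self, url: &str) -> Recorder<'_> {
        Recorder { store: self, url: url.to_owned() }
    }

    /// Hypothetical read-back helper, only here so the sketch can be checked.
    pub fn get(&self, url: &str, event: &Event) -> Option<Value> {
        self.values.borrow().get(&(url.to_owned(), event.id)).cloned()
    }

    pub fn name(&self, event: &Event) -> String {
        self.names.borrow()[event.id].clone()
    }
}

pub struct Recorder<'a> {
    store: &'a Store,
    url: String,
}

impl<'a> Recorder<'a> {
    /// Stores (or overwrites) the value of `event` for this recorder's URL.
    pub fn set(&self, event: &Event, value: Value) {
        self.store
            .values
            .borrow_mut()
            .insert((self.url.clone(), event.id), value);
    }
}
```

With this sketch, `store.record("http://example.com").set(&image_count, Value::Integer(4))` followed by `store.get("http://example.com", &image_count)` reads the value back.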

commented on May 22, 2024

From the meeting today:

Start with something even simpler: literally recordPlaceMetadata, recordPlaceAction, queryPlaceMetadata, and queryPlaceAction. These should be trivial; they can even return fake data, and definitely don't need to attach to Places or BrowserDB at this point. Don't worry about a more rigorous schema for events, because it might turn out that's not what Activity Stream wants at all.
Focus on exposing that simple API to the different platforms, and being able to call the methods from an xpcshell test or a system add-on.
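
As a rough illustration of how small that first step could be, the four entry points might look like the sketch below. Everything beyond the four names from the meeting is assumed: the snake_case spellings, the `PlaceMetadata` and `PlaceAction` record shapes, the `bool` return values, and the fake data itself.

```rust
/// Hypothetical record shapes; the thread has not decided what either contains.
#[derive(Clone, Debug, PartialEq)]
pub struct PlaceMetadata {
    pub url: String,
    pub key: String,
    pub value: String,
}

#[derive(Clone, Debug, PartialEq)]
pub struct PlaceAction {
    pub url: String,
    pub action: String,
    pub target: String,
}

/// Pretends to store metadata; nothing is persisted at this stage.
pub fn record_place_metadata(_url: &str, _key: &str, _value: &str) -> bool {
    true
}

/// Pretends to store an action; nothing is persisted at this stage.
pub fn record_place_action(_url: &str, _action: &str, _target: &str) -> bool {
    true
}

/// Returns fake metadata for any URL.
pub fn query_place_metadata(url: &str) -> Vec<PlaceMetadata> {
    vec![PlaceMetadata {
        url: url.to_owned(),
        key: "images".to_owned(),
        value: "4".to_owned(),
    }]
}

/// Returns a fake action for any URL.
pub fn query_place_action(url: &str) -> Vec<PlaceAction> {
    vec![PlaceAction {
        url: url.to_owned(),
        action: "play".to_owned(),
        target: "#cat-video-1".to_owned(),
    }]
}
```

The point of a stub like this is only to have something to call from each platform's test suite; the data model is deliberately left open.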

rnewman commented on May 22, 2024

My concern with this is that you seem to be driving towards making an interface so simple and generic that it doesn't require anyone to think about what it does — what actually is an action or place metadata? Can metadata refer to, or link, more than one place? Does it do that by repeating strings? Does the API implicitly record which device is making the observation, and when? Can I 'negate' or update an observation if the page changes? How do I 'connect the dots' between recorded events? Etc. etc. ad infinitum.

There's value in prototyping the simplest possible thing, simpler than one would ever use, from the standpoint of an end-to-end test: can you build a library that you use in more than one place, exchanging complex structures and enums? And @fluffyemily is already doing that.

But the two hard parts with an "activity stream storage prototype" are modeling the data and exposing the storage to the application in the right way, and doing the simplest possible thing sidesteps those hard problems.

I would have expected that what you're prototyping here would address the second of those two problems, and touch on a very small part of the first: can we define a very very specific API to do a very very specific thing? E.g., a Swift interface that looks like

/// Record the relationship between the device and a page URL at a time.
func recordVisitedURL(url: URL, atTime: Timestamp, byDevice: Device) -> Visit

/// Given a visit, record that the fetched page had a certain title.
func recordTitleForVisit(visit: Visit, title: String)

/// Record that the fetched page embedded a certain video.
func recordEmbeddedVideo(video: Video, forVisit: Visit)

func fetchVideosSince(since: Timestamp) -> [Video]

The example backing storage for that can be arbitrary, but the point is that the API looks a lot like a tight, constrained final API might look like, and the implied data model is realistically rich, even if you don't have to solve the storing and querying yet.
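
On the Rust side, the same interface could be sketched over an arbitrary in-memory backing store, roughly as follows. The field layouts of `Device`, `Video`, and `Visit`, the `u64` timestamp, the `VisitStore` type, and the `title_for_visit` read-back helper are assumptions for illustration; only the four operation names come from the Swift interface above.

```rust
/// Assumed representation: milliseconds since the Unix epoch.
pub type Timestamp = u64;

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Device(pub String);

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Video {
    pub src: String,
}

/// Identifier managed by the library; callers cannot construct one themselves.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Visit(usize);

// Some fields are stored but never read back in this sketch.
#[allow(dead_code)]
struct VisitRow {
    url: String,
    at_time: Timestamp,
    by_device: Device,
    title: Option<String>,
    videos: Vec<Video>,
}

#[derive(Default)]
pub struct VisitStore {
    visits: Vec<VisitRow>,
}

impl VisitStore {
    /// Record the relationship between the device and a page URL at a time.
    pub fn record_visited_url(&mut self, url: &str, at_time: Timestamp, by_device: Device) -> Visit {
        self.visits.push(VisitRow {
            url: url.to_owned(),
            at_time,
            by_device,
            title: None,
            videos: Vec::new(),
        });
        Visit(self.visits.len() - 1)
    }

    /// Given a visit, record that the fetched page had a certain title.
    pub fn record_title_for_visit(&mut self, visit: Visit, title: &str) {
        self.visits[visit.0].title = Some(title.to_owned());
    }

    /// Record that the fetched page embedded a certain video.
    pub fn record_embedded_video(&mut self, video: Video, for_visit: Visit) {
        self.visits[for_visit.0].videos.push(video);
    }

    /// Videos embedded in pages visited at or after `since`.
    pub fn fetch_videos_since(&self, since: Timestamp) -> Vec<Video> {
        self.visits
            .iter()
            .filter(|row| row.at_time >= since)
            .flat_map(|row| row.videos.iter().cloned())
            .collect()
    }

    /// Hypothetical read-back helper, not part of the interface above.
    pub fn title_for_visit(&self, visit: Visit) -> Option<&str> {
        self.visits[visit.0].title.as_deref()
    }
}
```

Because `Visit` has a private field, the only way to get one is from `record_visited_url`, which is one way to keep identifiers under the library's control.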

Am I misunderstanding what you're trying to achieve here, @kitcambridge?

commented on May 22, 2024

My concern with this is that you seem to be driving towards making an interface so simple and generic that it doesn't require anyone to think about what it does — what actually is an action or place metadata?

I think that's a well-founded concern, and I don't have a good answer because I'm not entirely sure what an action or event is. 😄 We've been using "continue watching videos," and the list of signals, for inspiration. Those are only examples, though; we've been discouraged from focusing on anything more specific at this stage, because it's not clear that they'll be useful for Activity Stream.

At this point, the only product requirement we have is "store interesting things about pages, and also sync some or all of them to other platforms, maybe." Without something more concrete than that, I don't think there's any way we can avoid making storage generic.

But the two hard parts with an "activity stream storage prototype" are modeling the data and exposing the storage to the application in the right way, and doing the simplest possible thing sidesteps those hard problems.

True. In order to model the data, we first need to know what kind of data we're going to store...and we haven't reached that point yet. It's not clear what a Visit or a Video would contain. Timestamp? Element ID? Play count? Referrer? Related actions on other pages? Do we query by visit: given this URL, show me all the videos you started watching on this page? By event type: give me the URLs for all pages where you started watching videos? By timestamp, or a combination of properties: give me all pages that are likely to be "interesting" in some way, with "video played" as a strong indicator? How do we represent devices?

Is recordEmbeddedVideo going to be valuable at all: what if Activity Stream finds that's not interesting data to record? How would they evolve the API to record other bits of data? Will they need to change the Rust code and wrappers on multiple platforms, or rely on us to do that?

We hope to answer those kinds of questions by giving Activity Stream something generic as a first cut, and seeing how they use it. What kinds of events will they store? How do they filter and aggregate these events in their queries? Once we have an idea of the specific things they'd like to do, we can pave those cowpaths and develop a specific API. As yet, we don't know what the specific things are.

Does that help clarify, @rnewman?

rnewman commented on May 22, 2024

I think we might have very different perspectives on desirable outcomes here.

I think it's relatively straightforward to model and re-model data as product desires become clearer, and using a generic stringly-typed store only makes that thinking muddier. That the AS team doesn't yet know what they want to save is not an obstacle, it's an opportunity to exercise the hard problems without chasing changing product priorities. Come up with your own specific things, or use the video ontology I designed; it doesn't really matter.

If our desire is to deliver a set of cross-platform, syncable, evolvable, performant stores, we will not be advancing towards our goal if we spend time building a string-based key value store for AS.

We already know how to do key-value data stores, so there's little value in reimplementing LevelDB. We also know how to persist strings in and out of Rust from languages like Swift; Emily has a working iOS application that does all of this.

So the valuable stuff, IMO, is the set of questions you touched on; trying out evolving rich APIs, defining and changing data models, figuring out how results map through to the data model, and imposing concrete examples of data to see how they feel.

I would much rather get early experience with the hard stuff: how to define rich APIs; how to handle identifiers that are managed by the library; how to interact with data transactionally; how to bridge logic between an application and into shared modules; how to manage migrations that involve running application code; how to make these things feel natural to developers.

rfk commented on May 22, 2024

There's value in prototyping the simplest possible thing, simpler than one would ever use,
from the standpoint of an end-to-end test: can you build a library that you use in more than
one place, exchanging complex structures and enums? And @fluffyemily is already doing that.
...
We also know how to persist strings in and out of Rust from languages like Swift;
Emily has a working iOS application that does all of this.

It sounds like part of your concern here is that we're pointlessly duplicating work that you and Emily have already done. That's not the intention. But we do need to collectively get up to speed with that work, and to see what it looks like to use that pattern to expose an API to Firefox on each platform, if we want to learn anything about managing and evolving a shared API using this technique. Don't be surprised to see us starting out small on our first iteration.

rnewman commented on May 22, 2024

I see why you'd get that impression, but less that than you might think! I think it's perfectly fine to do some amount of duplicate work as part of learning.

But I have two doubts with the way this is going.

Firstly, I don't want us to miss an opportunity to explore open questions that we need to explore. This prototype seems like a great place to test and evolve specific opinionated APIs, rather than deferring that decision making to consumers, both to explore how that boundary might shift, and also to touch on crucial concepts of identity etc.

Secondly, there's some risk in whatever you do being seen as representative, or even as a v0.1. After all, this first stab is opinionated in its genericness, just like Weave! I would rather keep the API surface narrow and specific (recordVisitByDeviceAtTime, not recordEvent, or even simpler: recordFirefoxVersion!), abstracting away the concrete data representation. Not only does that let us learn from how those APIs feel, and how they change, but it also gives us the ability to shift storage more easily, and avoids our customers conflating the API with the capabilities of the representation.

You could sum up that fear as: if we deliver a prototype JSON blob soup, we will again be stuck with a JSON blob soup, because everyone will drop back to thinking in blob soups.
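
One way to picture the "narrow API, hidden representation" argument is a trait with a single specific entry point and interchangeable backends. The sketch below is only an assumption about how that could look: the `VisitRecorder` trait, both backends, the `visits_since` query, and `record_sample` are hypothetical names, and only `recordVisitByDeviceAtTime` comes from the comment above.

```rust
use std::collections::BTreeMap;

/// Assumed representation: milliseconds since the Unix epoch.
pub type Timestamp = u64;

/// The narrow surface callers see; nothing here exposes how rows are stored.
pub trait VisitRecorder {
    fn record_visit_by_device_at_time(&mut self, url: &str, device: &str, at: Timestamp);
    fn visits_since(&self, since: Timestamp) -> Vec<String>;
}

/// First backend: an append-only list of (time, url, device) rows.
#[derive(Default)]
pub struct RowBackend {
    rows: Vec<(Timestamp, String, String)>,
}

impl VisitRecorder for RowBackend {
    fn record_visit_by_device_at_time(&mut self, url: &str, device: &str, at: Timestamp) {
        self.rows.push((at, url.to_owned(), device.to_owned()));
    }

    fn visits_since(&self, since: Timestamp) -> Vec<String> {
        self.rows
            .iter()
            .filter(|row| row.0 >= since)
            .map(|row| row.1.clone())
            .collect()
    }
}

/// Second backend: the same data indexed by time.
#[derive(Default)]
pub struct IndexedBackend {
    by_time: BTreeMap<Timestamp, Vec<(String, String)>>,
}

impl VisitRecorder for IndexedBackend {
    fn record_visit_by_device_at_time(&mut self, url: &str, device: &str, at: Timestamp) {
        self.by_time
            .entry(at)
            .or_default()
            .push((url.to_owned(), device.to_owned()));
    }

    fn visits_since(&self, since: Timestamp) -> Vec<String> {
        self.by_time
            .range(since..)
            .flat_map(|(_, visits)| visits.iter().map(|(url, _)| url.clone()))
            .collect()
    }
}

/// Caller code is written once against the trait and works with either backend.
pub fn record_sample(store: &mut dyn VisitRecorder) -> Vec<String> {
    store.record_visit_by_device_at_time("http://example.com/old", "laptop", 10);
    store.record_visit_by_device_at_time("http://example.com/new", "phone", 20);
    store.visits_since(15)
}
```

Swapping `RowBackend` for `IndexedBackend` changes nothing for the caller, which is the property the narrow API is meant to preserve.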

grigoryk commented on May 22, 2024

Apologies for an uninvited comment, but I think this is a point worth making.

I think the task of building/evolving a concrete, opinionated API as a test-bed is being made much more difficult than it should be because that work is, seemingly, to be done in isolation from any real consumers. In the absence of a concrete use case, an interface, or a user experience, it's natural to fall back on the generic thing that's limited in its learning potential.

In this "product vacuum", my suggestion would be to pretend it's not there, and build a simplistic UX layer in addition to the storage/querying APIs. Do the whole thing end-to-end. This is a throwaway prototype after all, and it doesn't seem out of hand to build a little throwaway user experience in order to actually drive the prototype. Questions around "what should this look like" will become obvious, and more interesting questions about "how can we evolve this, at what point, how to reduce friction, what's painful", etc. will all come to the forefront.

Building a good API layer becomes nearly impossible if you don't feel the pain of using/evolving it yourself. And so ISTM that if you want to learn how to do this well, you need to approach it more holistically.

thomcc commented on May 22, 2024

Enormous rambling bug comment ahead. Much of it is expansion on what I said in IRC earlier, although I think my thoughts on that are substantially clearer now.

I did actually try to make this short, but I suspect I didn't get quite enough sleep to be truly ruthless when editing, or really ruthless at all. (The joy of discovering that you have a slightly leaky window during a very heavy rainstorm)

Anyway, some background, since I don't know if everybody has access to the notes for the meeting on Oct. 24 (assuming notes were even taken; I couldn't find them, if they exist). Either way, this should explain what we decided and why we decided it.

Coming out of that meeting, our biggest goal for the initial milestone of the prototype [0] was to validate that we can write a portable Rust library, and have it callable from JavaScript, Swift, and Java code in Firefox desktop, Firefox for iOS, and Fennec respectively -- demonstrated by doing so in the test suites of those codebases, and ideally land said test code in those test suites.

Given that goal, along with the fact that (at least in the present and near future) the Rust FFI seems limited to nothing more complex than passing #[repr(C)] + Copy structs, numbers, and raw pointers ~~to those types~~[1], I made the implementation in #10 a fairly stringly typed API, with the thought that in the worst case, we can just do string copies on the FFI boundary to avoid issues with runtimes that demand control over the allocation of strings (which is almost certainly all three we care about).
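
The "string copies on the FFI boundary" idea can be sketched as below: the caller's string is copied into Rust-owned memory on the way in, and the result is handed back as a Rust-allocated C string that the caller copies and then returns for freeing. The function names `as_describe_event` and `as_string_free` and the `event:` prefix are invented for illustration, not taken from #10. A real build would also mark the functions `#[no_mangle]` (spelled `#[unsafe(no_mangle)]` in the 2024 edition) so the symbol names are stable.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Copies the caller's NUL-terminated string, builds a reply, and returns it
/// as a new Rust-allocated C string. Returns null on null input or if the
/// reply would contain an interior NUL byte.
///
/// Safety: `name` must be null or point to a valid NUL-terminated string.
pub unsafe extern "C" fn as_describe_event(name: *const c_char) -> *mut c_char {
    if name.is_null() {
        return std::ptr::null_mut();
    }
    // Copy into Rust-owned memory so we never hold on to the caller's buffer.
    let name = unsafe { CStr::from_ptr(name) }.to_string_lossy().into_owned();
    match CString::new(format!("event:{}", name)) {
        Ok(reply) => reply.into_raw(),
        Err(_) => std::ptr::null_mut(),
    }
}

/// Releases a string previously returned by `as_describe_event`.
///
/// Safety: `s` must be null or a pointer obtained from `as_describe_event`
/// that has not been freed yet.
pub unsafe extern "C" fn as_string_free(s: *mut c_char) {
    if !s.is_null() {
        drop(unsafe { CString::from_raw(s) });
    }
}
```

The host runtime would copy the returned bytes into its own string type and then call `as_string_free`, so neither side ever frees memory the other allocated.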

Unfortunately, even if we live in a wonderful universe where we have not only a dreamy Rust FFI, but we also know what Activity Stream needs, I think there would still be an argument to be made for doing it generically: The AS team is (AIUI) unlikely to be writing the bulk of their code in Rust, and might not even know the language.

And so if our opinionated API ends up lacking the power to express some feature they need, there is a very real risk that we'd end up being a blocker for them implementing that feature, which is, well, bad. Moreover, ISTM that the further away we are from that wonderful universe, the more likely it is to happen.

That is, the worse the Rust FFI is, the less they'll want to add new features to the API, and also the more guesses we need to make about what AS needs, the more likely we are to guess wrong. (... I think this is basically Conway's Law, or at least the law of some relative of Conway; we'd be making an API generic to avoid the risk that we won't have time or they won't want to add features to that API. Unfortunately, I'm having trouble convincing myself it's not a real concern.)

The last bit of context is that the (only?) explicitly called out non-goal was to implement a complex API which would be made obsolete by the Mentat-like system that shall not be named. The specific example was that we'd like to avoid introducing a new query language/syntax that would be made useless just in time for it to be too painful to be worth removing.

I think the subtext there is that we were/are hoping that Mentat-lite, should it ever come into being, solves many of the issues we're punting on here. Which, well, I honestly don't know how realistic of a hope that is. Certainly the original Mentat seemed interested in expressing the same sorts of things as AS, but it's unclear (to me) how similar whatever we end up with will be to that.

A lot of that is kind of moot though, since I think the real reason the API is so generic and noncommittal is that I don't feel like I understand the Activity Stream problem space well enough to be able to write an opinionated API to help that space out that isn't, erm, bad. Any opinions I have on APIs in that space are very likely either uninformed, dubious, or both.

And, the solution here could be "don't let Thom design the API", but that only works if I'm relatively alone in feeling this way. Unfortunately, given that the sentiment of nearly every Sync meeting where Activity Stream comes up (including the one on the 24th) is something like "AS is too broad/nebulous to pin down", I suspect that that means we don't have enough insight into that problem space to have an idea of what would fit their needs[2].

Which is more or less why doing something trivial and punting the API to them is attractive, but @rnewman's comment about it being "opinionated in its genericness" is pretty spot on. If this is the API, it is likely to be the API forever -- it seems very unlikely that they'd come back to us and suggest something more than slightly different from what we have here.

[0]: To be clear, this is not the biggest goal for the overall project, just for its first step. And that's largely because it's a very hard requirement for the overall project to be viable, and so it's where we started.... That and the fact that it is concrete, clearly defined, we know we need to do it eventually, and it would help us to have done even if the rest of the project is, well, a bit of a flop.

[1]: My thought here was that there's no way to know how controlling various runtimes will be about storing externally allocated data, if they allow it at all -- but it's very unlikely we somehow couldn't copy data into the runtime... But on further thought, we're completely able to pass raw pointers to arbitrarily complex Rust types this way. Either make it an opaque value, or in the case of a particularly unfriendly runtime, use something like a handle table.

[2]: That, or it actually is a big broad nebulous thing that doesn't know what it wants to be. In which case we're hosed anyway, so we might as well assume it's not.
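
The handle-table option from footnote [1] could be sketched like this: Rust keeps the complex values in a table it owns, and only plain integers cross the boundary. The `Recorder` type, the `recorder_*` function names, the numeric event codes, and the convention that handle `0` is never valid are all assumptions made for this sketch. A real export would also need `#[no_mangle]`, and would avoid panicking inside an `extern "C"` function.

```rust
use std::sync::Mutex;

/// An arbitrarily complex Rust value that never crosses the boundary itself.
pub struct Recorder {
    event_codes: Vec<u32>,
}

// Slot `i` holds the value for handle `i + 1`; freed slots become `None`
// and are not reused, so a stale handle can never alias a newer value.
static RECORDERS: Mutex<Vec<Option<Recorder>>> = Mutex::new(Vec::new());

/// Looks up the live value for a handle, if any.
fn slot(table: &mut Vec<Option<Recorder>>, handle: u64) -> Option<&mut Recorder> {
    let index = (handle as usize).checked_sub(1)?;
    table.get_mut(index)?.as_mut()
}

/// Creates a recorder and returns its handle (never 0).
pub extern "C" fn recorder_new() -> u64 {
    let mut table = RECORDERS.lock().unwrap();
    table.push(Some(Recorder { event_codes: Vec::new() }));
    table.len() as u64
}

/// Appends an event code; returns false for an unknown or freed handle.
pub extern "C" fn recorder_record(handle: u64, event_code: u32) -> bool {
    let mut table = RECORDERS.lock().unwrap();
    match slot(&mut table, handle) {
        Some(recorder) => {
            recorder.event_codes.push(event_code);
            true
        }
        None => false,
    }
}

/// Number of recorded events, or -1 for an unknown or freed handle.
pub extern "C" fn recorder_count(handle: u64) -> i64 {
    let mut table = RECORDERS.lock().unwrap();
    slot(&mut table, handle)
        .map(|recorder| recorder.event_codes.len() as i64)
        .unwrap_or(-1)
}

/// Drops the recorder; returns false if the handle was not live.
pub extern "C" fn recorder_free(handle: u64) -> bool {
    let mut table = RECORDERS.lock().unwrap();
    let index = match (handle as usize).checked_sub(1) {
        Some(index) => index,
        None => return false,
    };
    match table.get_mut(index) {
        Some(entry) => entry.take().is_some(),
        None => false,
    }
}
```

Compared with handing out a raw pointer as an opaque value, a stale or bogus handle here fails a lookup instead of dereferencing freed memory, at the cost of a lock and a table lookup per call.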

davismtl commented on May 22, 2024

My thoughts here are:

The big picture with this project is to quickly enable new teams (like AS) to store new data types with little to no help from us.
As per what @grigoryk mentioned, while tackling the AS storage problem is difficult because it's vague, it's actually the most concrete vague problem we have. Yes, that is as confusing as it reads. Ultimately, the end goal is to have a storage solution for NMX (using AS as a guinea pig) so that new apps can pop up and teams can prototype with little to no help from us to get up and running. At least, that's what we've been advised to think about.
I'm OK with the team making a basic prototype, even if it is too generic. I think we have a lot of catching up to do. By no means do I want us to ship this POC. We need to learn a lot!
As per making a key value pair storage, I have no opinion (not my expertise) other than if we want flexibility in the future for NMX, it might be important to note that the keys won't always be URLs. God knows what might come out of NMX.
That being said, while I believe that too generic is possible, I want to caution about also being specific.
IIUC from all the documentation I've read around this project, it's probably preferable to be somewhat generic with the event storage API because most of the magic will be around creating the appropriate views populated from those events in the future.

So I'm confused when I read that you want the POC to be this specific. Functions like this make me think we'd need to regularly update the library, which would make us a blocker.

I would have expected that what you're prototyping here would address the second of those two problems, and touch on a very small part of the first: can we define a very very specific API to do a very very specific thing? E.g., a Swift interface that looks like

/// Record the relationship between the device and a page URL at a time.
func recordVisitedURL(url: URL, atTime: Timestamp, byDevice: Device) -> Visit

/// Given a visit, record that the fetched page had a certain title.
func recordTitleForVisit(visit: Visit, title: String)

/// Record that the fetched page embedded a certain video.
func recordEmbeddedVideo(video: Video, forVisit: Visit)

func fetchVideosSince(since: Timestamp) -> [Video]

I've said it maybe in every other meeting but perhaps the best place to start is with API documentation. How do we want our customers (developers) to interface with us? We can show teams documentation without committing to anything. This guarantees that we don't ship a POC.
(although it doesn't help us learn Rust, or tell us whether we can expose the API on every platform)

Fake documentation would help us answer questions like these:

Can metadata refer to, or link, more than one place? Does it do that by repeating strings? Does the API implicitly record which device is making the observation, and when? Can I 'negate' or update an observation if the page changes? How do I 'connect the dots' between recorded events? Etc. etc. ad infinitum.

rfk commented on May 22, 2024

By no means do I want us to ship this POC

You'll notice that this repo has deliberately been given a long unwieldy name with the word "prototype" in it, specifically to ward off any suggestion that we would actually ship its contents ;-)

I think the task of building/evolving a concrete, opinionated API as a test-bed is being made
much more difficult than it should be because that work is, seemingly, to be done
in isolation from any real consumers.

This is a very good point, and IIUC, part of the goal here is to get to something that we can work on in cooperation with the Activity Stream team in order to make it more concrete. So we need to be clear what we would take to them for comment, and what we'd ask from them for next steps.

Suppose this prototype is wildly successful, what do we want the resulting conversation with Activity Stream to look like? Some options include:

Look, we built this generic Rust storage library that you can use on all three platforms! How about you take it and implement your storage needs on top without any deeper involvement from this team?
Look, we built this activity-stream-shaped API in Rust! Want to hack on it together and see if we can evolve it into something that can ship in Firefox on all three platforms?
Look, we built this activity-stream-shaped API in Rust! How about you folks take over maintenance of it from here, while we hack on the lower-level storage routines that power it?

Maybe we don't know yet, because we don't have enough of an understanding of the landscape.

if we want flexibility in the future for NMX, it might be important to note that the keys
won't always be URLs. God knows what might come out of NMX.

Likewise for this point about NMX. Suppose our efforts here are wildly successful, what does that mean for how new NMX projects approach data storage? Some options include:

They can pick up this Rust library and plug it into their app and use the same API that ActivityStream is using to store their own app-specific data?
They can follow the patterns we've established here to build an app-specific data handling API, that re-uses some lower-level components used by ActivityStream (e.g. libNotMentat)?
Maybe both, to access some shared data in ActivityStream and also store some app-specific data alongside it?

I think part of our job here in the first few iterations, is to try things out for ourselves and see what the right pitch would be to other teams.

rnewman commented on May 22, 2024

So I'm confused when I read that you want the POC to be this specific. Functions like this make me think we'd need to regularly update the library and will make us a blocker.

I understand your confusion!

Thanks, all, for leaving such thorough explanations.

My view of the road we're on is that we'll be storing structured data — your point about things not always being URLs is talking about that. We are already in that place: history visits, for example.

My opposition to shipping simple string-ey key-value systems is that they make structured data hard to get right, and so I'd rather the first exposed API surface either be able to handle syncable structured data, or be very specific and hide the storage implementation as best it can. A key-value store falls between those two stools, and so I'm glad there are no plans to ship it!

I presumed that you don't want to bite off more than you can chew and prototype a structured data system, and so I suggested the alternative.

I think there's also a set of open questions around where API boundaries live — is it better to routinely write new data-handling functionality once (in Rust) or three times? That is, how much code is shared: just the generic storage and sync layer, or that plus the schema and some helpers, or some stable domain code, or large chunks of the domain code?

I'm interested in exploring those questions.

Having the API boundary be lower means exposing richer primitives — entities and attributes and types and queries — and establishing patterns to help app code to get things right.

Having it be higher means exposing more simple entry points, like recordSomeSpecificEvent.

These are tradeoffs, and I expect both to happen at some point.
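
To make the two boundary positions concrete, here is a small sketch in which a generic entity/attribute/value layer (the lower boundary) sits under one specific entry point (the higher boundary). The `FactStore` type, the attribute names `visit/url` and `visit/at`, and `record_visit` are hypothetical; they only illustrate the layering, not any existing library.

```rust
pub type Entity = u64;

#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Text(String),
    Instant(u64),
}

/// Lower boundary: generic primitives that app code composes itself.
#[derive(Default)]
pub struct FactStore {
    next_entity: Entity,
    facts: Vec<(Entity, &'static str, Value)>,
}

impl FactStore {
    /// Allocates a fresh entity identifier.
    pub fn new_entity(&mut self) -> Entity {
        self.next_entity += 1;
        self.next_entity
    }

    /// Asserts that `entity` has `value` for `attribute`.
    pub fn assert_fact(&mut self, entity: Entity, attribute: &'static str, value: Value) {
        self.facts.push((entity, attribute, value));
    }

    /// All values asserted for one attribute of one entity.
    pub fn values(&self, entity: Entity, attribute: &str) -> Vec<&Value> {
        self.facts
            .iter()
            .filter(|fact| fact.0 == entity && fact.1 == attribute)
            .map(|fact| &fact.2)
            .collect()
    }
}

/// Higher boundary: one specific entry point, written once in shared code,
/// that hides which entities and attributes a visit is made of.
pub fn record_visit(store: &mut FactStore, url: &str, at: u64) -> Entity {
    let visit = store.new_entity();
    store.assert_fact(visit, "visit/url", Value::Text(url.to_owned()));
    store.assert_fact(visit, "visit/at", Value::Instant(at));
    visit
}
```

Exposing only `record_visit` keeps platform code small but means new kinds of data need new shared functions; exposing `FactStore` does the reverse.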

rnewman commented on May 22, 2024

Oh, and a brief update: @fluffyemily and I chatted a little yesterday about how to generate interface code for Rust functionality. How easy that is, and how ergonomic the result is (e.g., object references and methods), dictates how feasible it is to quickly build functionality in Rust for cross-platform use, and thus where the boundary is cheapest.
