Data stored in Riak is opaque to Riak. This means that Riak doesn't
know anything about the structure of the data being stored in it.
Whether it be JSON or JPEG, it's all the same to Riak.
In contrast, the application storing the data often understands
the structure of the data. The application may want to tag the data
with attributes that give additional context. For example, it may
tag a picture with who uploaded it and when it was taken.
The Riak Object contains metadata which is a set of key-value pairs.
This lends itself nicely to indexing because Solr expects a document
which is a set of field-value pairs. It's a matter of mapping the
metadata key to a field name.
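As an illustration of that mapping, the following sketch turns a metadata dict into the body of a Solr JSON "add" request. The metadata values are hypothetical; this is not Yokozuna's code.

```python
import json

# Object metadata: a set of key-value pairs (hypothetical values).
metadata = {
    "myapp-user_s": "rzezeski",
    "myapp-where_loc": "Baltimore",
}

# A Solr document is a set of field-value pairs, so each metadata
# key maps directly to a Solr field name; values carry over verbatim.
solr_doc = {field: value for field, value in metadata.items()}

# Body of a hypothetical "add" request to Solr's JSON update handler.
print(json.dumps([solr_doc]))
```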
In Riak the object metadata is an Erlang dict, created directly in
the case of protobuffs or translated from HTTP headers in the case of
HTTP. This feature should work the same regardless of the method of
creation.
API and Implementation Notes
TODO: UTF-8 in field names?
- The object's metadata may contain a `yz-tags` key. The value should
be a CSV specifying the fields to tag. A comma must be used as the
delimiter and whitespace is ignored.
- For HTTP the header `x-yz-tags` may be used. Yokozuna will check
for both fields.
- If a field listed in `yz-tags` doesn't exist it will not be
considered an error but will be silently ignored.
- Tag names should match Solr field names declared in the schema.
- If a tag field name doesn't match any fields defined in the Solr
schema then the result will depend on the schema config. In most
cases I imagine this will cause an error.
- The value of a tag is always passed verbatim to Solr.
- Any metadata key may be tagged.
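The rules above can be sketched as follows. This is an illustrative Python sketch, not Yokozuna's implementation (which is Erlang); the function name and sample metadata are made up.

```python
# Sketch of the tagging rules: parse the yz-tags CSV, ignore
# whitespace, and silently skip any listed field absent from the
# metadata. Values are passed through verbatim.
def extract_tags(metadata):
    # Both the yz-tags key and the HTTP-style x-yz-tags header are checked.
    csv = metadata.get("yz-tags") or metadata.get("x-yz-tags") or ""
    fields = [f.strip() for f in csv.split(",") if f.strip()]
    # Missing fields are not an error; they are simply dropped.
    return {f: metadata[f] for f in fields if f in metadata}

tags = extract_tags({
    "yz-tags": "myapp-user_s, missing_field",
    "myapp-user_s": "rzezeski",
})
# → {"myapp-user_s": "rzezeski"}
```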
Example
The following is an example of HTTP headers conforming to the rules
above. The data is a fictitious picture uploaded by `myapp`.
yz-tags: myapp-where_loc, myapp-user_s, myapp-description_t
myapp-where_loc: Baltimore
myapp-user_s: rzezeski
myapp-description_t: Federal Hill at dusk.
The schema for this index might contain the following entries.
<dynamicField name="*_loc" type="location" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_t" type="text_general" indexed="true" stored="false"/>
This example uses dynamic fields for each tag but a direct mapping
could be used as well.
yz-tags: myapp-description
myapp-description: Federal Hill at dusk.
<dynamicField name="myapp-description" type="text_general" indexed="true" stored="false"/>
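Storing such an object over Riak's HTTP interface might look like the sketch below. The host, bucket, and key are hypothetical; only the header names follow the convention described above, and the request is constructed but not sent.

```python
import urllib.request

# Headers from the tagging example above, plus a content type.
headers = {
    "Content-Type": "image/jpeg",
    "yz-tags": "myapp-where_loc, myapp-user_s, myapp-description_t",
    "myapp-where_loc": "Baltimore",
    "myapp-user_s": "rzezeski",
    "myapp-description_t": "Federal Hill at dusk.",
}

# Hypothetical bucket/key; urllib.request.urlopen(req) would send it.
req = urllib.request.Request(
    "http://localhost:8098/buckets/pics/keys/dusk.jpg",
    data=b"<jpeg bytes>",
    headers=headers,
    method="PUT",
)
```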
Metadata not specific to myapp may be tagged as well. For example,
the standard content-type header could be listed (assuming a
matching field is declared in the schema):

yz-tags: content-type
content-type: image/jpeg
Thought Process
Following is a summary of the thought process for this feature.
Mapping Keys to Fields
The first thing to decide is how to map a metadata key to a Solr
field. A naive approach might look like the 2i interface, which relies
on a prefix and a suffix. The prefix signals that the key should be
indexed, e.g. `x-riak-index`. The suffix indicates the type of the
value, e.g. `_bin`. It's then a matter of deciding what the prefix
should be and how to map the suffix to Solr fields.
In fact, why not use the 2i interface since it is already defined and
supported by Riak clients? There are several problems with this
approach.
- There is only support for binary (`_bin`) and int (`_int`) types.
Solr supports many more types.
- What Solr type should `_bin` map to? It could reasonably map to
different types like `text`, `text_general`, `string`, etc.
- What if `_bin` should be mapped to a different type depending on the
data? What is the convention for specifying the Solr type?
- What happens when leveldb is being used? Should both 2i and Solr
indexing occur? What if the user only wants one of the systems to
index the data? How does the user tell the system what to do?
- The 2i interface is now performing double duty. Any change to its
API needs to take both systems into account.
Yokozuna could have a dedicated prefix/suffix convention but then
there is an additional layer of translation between Yokozuna and Solr.
This adds complexity in the system and causes additional burden to the
users.
Solr already has a mapping from field name to type. This should be
exploited. If a user adds a `summary` tag then there should be a
`summary` field in the Solr schema. If the user adds a `summary_t`
tag then it should match the `*_t` dynamic field. This is obvious and
requires no translation layer. That handles the suffix issue, but
what about the prefix? How does Yokozuna know when a metadata key
is a tag?
A prefix like `tag-` could be used. That's not so bad but it causes
metadata to be polluted with prefixes. Also, what if the user wants
to tag keys like `content-type`, which already exist? Should a
duplicate value be made with a tagged key? Can the prefix be avoided
altogether?
Why not be explicit about which metadata keys are tags? Add a special
metadata key named `yz-tags` containing a CSV of the metadata keys
to tag. On the downside it requires an additional metadata entry, but
on the upside it's explicit about what is being tagged, leaving the
user in full control. It doesn't require remembering a set of rules
about when metadata is tagged. This seems like the best possible
solution.
Tags With CSV
In most cases it is obvious that the tag value should be sent to Solr
verbatim. What about a comma separated value, CSV? Solr has the
notion of multi-valued fields and it might make sense to treat a CSV
as such a field. Once again the problem is discovering what semantics
the user wants. A prefix or suffix could be used. Perhaps a special
syntax for the `yz-tags` value, e.g. `yz-tags: keywords(csv)`, would
indicate a CSV. This could work but once again requires translation
and user education. I think this should be avoided at all costs.
Instead, the tag value should be passed to Solr verbatim in all cases.
If a CSV should be converted into a multi-valued field then the user
should update the Solr config or schema to interpret it as such.
I think this could be achieved by either "poly fields" or a custom
Field Mutating Update Processor.
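What such a processor would do to a verbatim CSV value can be sketched client-side for clarity. This is only an illustration of the transformation; in the proposal above the splitting would live in Solr's config, not in the client.

```python
# Split a verbatim CSV tag value into the list of values a Solr
# multi-valued field would receive. Sample keywords are made up.
def to_multivalued(value):
    return [v.strip() for v in value.split(",") if v.strip()]

values = to_multivalued("sunset, baltimore, federal hill")
# → ["sunset", "baltimore", "federal hill"]
```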
A Note on X- Headers
Notice the examples above are not using the `X-` convention for custom
HTTP headers. The IETF deprecated this style in RFC 6648 because
it causes problems for existing systems when a custom header becomes
part of the standard. Also, using a prefix specific to your
application is sufficient.
A nice feature of the proposed plan above is that the user is in
complete control of the header names. Either style may be used.