
pillar's Introduction

Pillar

Pillar is a REST-based web service written in Go. It provides the following services:

  • Imports external data into the Coral data model
  • Allows CRUD operations on the Coral data model
  • Provides simple queries on the Coral data model

All of the Pillar documentation (including installation instructions) can be found in the Coral Project Documentation.

The Pillar documentation lives on GitHub in the coralproject/docs/docs_dir/pillar repository.

pillar's People

Contributors

aiboyles, alexbyrnes, ardan-bkennedy, buth, gabelula, impronunciable, jde, kgardnr, pablocubico, samshub


Forkers

isabella232

pillar's Issues

More unit and integration tests for Pillar

Pillar is a REST-based web service module. This calls for more unit tests and integration tests: not only should we be able to test code changes within the server, we must also be able to test the various endpoints provided by the server.

We do have both in place, but not enough. This issue is to make sure we expand the tests and establish a common framework so anyone can run them properly.

a) Expand tests within the server module
b) Expand tests within the client module

Code reorg (cleanup) and a few model changes

  • The handler methods can all be consolidated into a single Go file
  • A few duplicated pieces of code in service and model can be consolidated into the model for reuse
  • A few model changes to make it simpler
    • user.user_name should be user.name
    • note.target_type should be note.target
    • action.target_type should be action.target

Build Tag APIs

Challenge: The Trust product requires the ability to apply tags to users.

Concept: Taggable ("Users are Taggable")

Spec

  • Create tags collection to hold all possible tags:
{
  _id: type ObjectId()
  name: type string, // required, not empty, unique
  group: type string // default 'users'
}
  • Expose CRUD endpoints for Tags
    • [POST, PUT, GET, DELETE] /tags/
  • Design reference scheme (similar to Issue #8) to associate tags with users
    • We should have the ability to assign tags to any document in any collection
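
A minimal Go sketch of the Tag model described above (the Go names are assumptions; the fields mirror the spec):

// Tag represents a single entry in the tags collection.
type Tag struct {
    ID    bson.ObjectId `json:"id" bson:"_id"`
    Name  string        `json:"name" bson:"name" validate:"required"`   // required, not empty, unique
    Group string        `json:"group,omitempty" bson:"group,omitempty"` // defaults to "users"
}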

New endpoint needed to create 'Action'

There are two new requirements:
a) Make Action a first-class citizen; in other words, have a separate collection for all actions
b) Provide an endpoint to create Action
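
A rough sketch of what a first-class Action document could look like in its own collection (field names are assumptions, not the final model):

// Action lives in its own actions collection rather than inside a comment.
type Action struct {
    ID       bson.ObjectId `json:"id" bson:"_id"`
    UserID   bson.ObjectId `json:"user_id" bson:"user_id"`
    Target   string        `json:"target" bson:"target"` // e.g. "comments", "users"
    TargetID bson.ObjectId `json:"target_id" bson:"target_id"`
    Type     string        `json:"type" bson:"type"` // e.g. "likes", "flags"
    Date     time.Time     `json:"date" bson:"date"`
}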

Create data randomizer

Challenge: Our demo needs a data set. The best dataset we have, currently, is the WaPo data, which is proprietary. Generating random data will lead to nothing but noise. We need to be able to obscure the data such that:

  • it is not possible to trace users in the obscured set back to users in the live site
  • no real numbers can be found
  • patterns that make the data interesting are preserved

Solution: Write a script that crawls all nytimes collections and:
Users:

  • obscures all user names with "user"+randomNumber
  • inserts the record into the target database

Comments:

  • Throws away comments based on a "double random" method: generate a random number between .2 and .3, then another random number between 0 and 1; if the first is lower than the second, throw the data away.
  • Keeps the full comment text
  • Inserts into the target database
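
A sketch of the "double random" throwaway in Go (on average it keeps roughly 20-30% of comments):

package randomizer

import "math/rand"

// keepComment draws a threshold between .2 and .3 and a second number between
// 0 and 1; if the first is lower than the second, the comment is thrown away.
func keepComment() bool {
    threshold := 0.2 + rand.Float64()*0.1 // random number between .2 and .3
    roll := rand.Float64()                // random number between 0 and 1
    return threshold >= roll              // keep only when the first is not lower than the second
}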

Assets:

  • copy in all the assets as they exist in the WaPo database

Actions:

  • copy in all actions for comments that haven't been thrown away by the randomized throwaway method

Make sure all counts are up to date.

Create an endpoint to capture user activity on the front end

We need to be able to capture data about how our users are using our demo. Create an endpoint that adds documents to cay_user_actions.

{
  _id: ObjectId(),
  time: ISOTime that the packet was received,
  data: contents of the POST payload,
  release: "0.1.0"  // we can eventually make this dynamic
}
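
As a Go struct, the document could look roughly like this (the struct name and field types are assumptions based on the shape above):

// CayUserAction is one captured front-end event in cay_user_actions.
type CayUserAction struct {
    ID      bson.ObjectId `json:"id" bson:"_id"`
    Time    time.Time     `json:"time" bson:"time"`       // when the packet was received
    Data    bson.M        `json:"data" bson:"data"`       // contents of the POST payload
    Release string        `json:"release" bson:"release"` // e.g. "0.1.0"
}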

Add unstructured metadata property to all models

Challenge: We need to account for metadata for all of our entities that varies from client to client.

Solution: Add a metadata property to every struct along with the endpoint code to capture it.

Metadata bson.M `json:"metadata,omitempty" bson:"metadata,omitempty"` // bson.M is map[string]interface{}; JSON/BSON map keys must be strings

This should allow us to capture any kind of data sent to pillar under the metadata attribute.

Make all import endpoints "upsert" (insert or update)

Situation: During a large import process, it's often the case that we need to go over already-inserted data in multiple passes in order to add new fields, etc. This creates a problem when keys are already established. For example, if we import users, assets and comments, but then want to re-import users, we need to drop the users collection. At that point the keys in comments will break and we will need to re-insert comments as well.

Also, in some cases, such as importing users from the comment records, we will intentionally (dumbly) be sending duplicate records. We shouldn't see errors for these if they are already in the db.

Solution: Instead of throwing an error when an existing record is posted to an import endpoint (asset/user/comment and action), the data in the database should be updated, keeping all the _id fields unchanged. The endpoint should return a 200 and a message saying whether the record was updated or inserted.

pseudocode:

entityHandler:
    check source id against database
    entity already exists:
        run update command for all non mongo-id values
        respond with 200 && update message
    entity doesn't exist:
        run code that exists now (insert and key translation)
        respond with 200 && insert message
    reply with 500 only if another error occurs
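
A minimal sketch of that flow, assuming mgo and a source.id field on every importable entity (the function and field names are illustrative, not Pillar's actual code):

package service

import (
    "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

// upsertBySource matches on the original source id rather than the Mongo _id,
// updates all non mongo-id values if the record exists, and inserts it
// otherwise. The returned string feeds the 200 message; an error means 500.
func upsertBySource(c *mgo.Collection, sourceID string, fields bson.M) (string, error) {
    info, err := c.Upsert(bson.M{"source.id": sourceID}, bson.M{"$set": fields})
    if err != nil {
        return "", err
    }
    if info.UpsertedId != nil {
        return "inserted", nil
    }
    return "updated", nil
}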

Technical discovery: Configurable Model

Our 2nd product, Ask, will revolve around the ability to create a custom form that allows an arbitrary datatype to be handled as an "ask". The Ask will be defined by a front-end tool that stores the 'schema' for that ask as a JSON object. The model that powers the Ask API, therefore, must be as configurable as possible based on that JSON object.

Configurable model elements include:

  • Schema
    • Any number of fields
    • Names, descriptions, other labels for fields
    • Field types
    • Required?
    • Default values.
  • Methods for dealing with files
    • Uploads with validation
    • Triggering workflows (aka, resizing/resampling)

Tech Challenge:
Create a configurable model package and a basic set of APIs that allow CRUD on that model.
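
As a starting point for discussion, the stored 'schema' could be modeled along these lines; every name here is an assumption, not a spec:

// FormField describes one configurable field in an Ask form.
type FormField struct {
    Name        string      `json:"name" bson:"name"`
    Description string      `json:"description,omitempty" bson:"description,omitempty"`
    Type        string      `json:"type" bson:"type"` // e.g. "text", "number", "file"
    Required    bool        `json:"required" bson:"required"`
    Default     interface{} `json:"default,omitempty" bson:"default,omitempty"`
}

// Form is the JSON object a front-end tool would store to define an Ask.
type Form struct {
    ID     bson.ObjectId `json:"id" bson:"_id"`
    Name   string        `json:"name" bson:"name"`
    Fields []FormField   `json:"fields" bson:"fields"`
}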

Adopt web package

Let's start using a web package to handle HTTP requests. Wrapping each of our HTTP handlers in importer.go will allow us to:

  • standardize behavior across all Coral services (features, config, etc. need to be identical)
  • centralize headers
  • apply middleware (such as auth)
  • cut down on repetitive code
  • CORS support
  • JSONP support

Using https://github.com/ardanlabs/kit/tree/master/web will allow us to standardize auth, config and logging with Xenia, which will be essential for consistency across the project.

Implement Count and other statistics

Challenge: A lot of the analytics that we want to provide depend on counts, e.g. how many times a user has recommended an article. Calculating these on the fly is not practical.

Solution: Implement counters and lists in documents that are updated upon creates/updates.

Note: Specific counts to be cached to be defined in the Data Model Wiki page.
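
As a sketch of the write path, a cached count could be bumped atomically with $inc when the related document is created (collection and field names below are assumptions, not the final schema):

package service

import (
    "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

// bumpCommentCount increments a cached per-user comment count when a new
// comment is created; "users" and "stats.comments" are illustrative names.
func bumpCommentCount(users *mgo.Collection, userID bson.ObjectId) error {
    return users.UpdateId(userID, bson.M{
        "$inc": bson.M{"stats.comments": 1},
    })
}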

Update and Delete Tag behaviors

We need the ability to update and delete tags. The current upsert and delete functionality will update the master tag list, but does not update or remove the tags that have already been applied to entities.

Functionality to be added:

  • On update scan for all entities that have the tag and update the tag in the subarray
  • On delete scan for all entities that have the tag and remove it from the subarray
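
A rough sketch of both behaviors for a single collection, assuming tags are stored as a plain string array on each entity (names are illustrative):

package service

import (
    "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

// renameTag updates the tag in place in every entity's subarray; the
// positional $ operator targets the matched array element.
func renameTag(c *mgo.Collection, oldName, newName string) error {
    _, err := c.UpdateAll(bson.M{"tags": oldName}, bson.M{"$set": bson.M{"tags.$": newName}})
    return err
}

// removeTag pulls the tag out of every entity's subarray.
func removeTag(c *mgo.Collection, name string) error {
    _, err := c.UpdateAll(bson.M{"tags": name}, bson.M{"$pull": bson.M{"tags": name}})
    return err
}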

Implement tracking and metrics

Challenge: we need to know how people are using our products and how well our products are performing (on an opt-in basis).

Evaluate the two leading candidates for monitoring and metrics:

  • prometheus.io
  • ELK stack

Create endpoint to add indexes to mongo

Challenge: With variations in metadata, we cannot predict at the api level which fields will need to be queried.

Solution: Publish an endpoint with Pillar that creates an index on one or more fields. The endpoint should accept 3 params:

collection: required, string, the name of the collection to index
keys: required, object, a json object to be dropped in the first argument of createIndex()*
options: optional, object, a json object to be dropped in the second argument of createIndex()*

Note: passing keys and options directly into the function call will allow us to take advantage of all of mongodb's indexing features, which are substantial: https://docs.mongodb.org/manual/reference/method/db.collection.createIndex/#db.collection.createIndex
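
A sketch of the request payload the endpoint might accept (struct and field names are assumptions, not a settled API):

type CreateIndexRequest struct {
    Collection string `json:"collection" validate:"required"` // name of the collection to index
    Keys       bson.M `json:"keys" validate:"required"`       // dropped into the first argument of createIndex()
    Options    bson.M `json:"options,omitempty"`              // dropped into the second argument of createIndex()
}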

Consumer: coralproject/sponge#21

Dockerize Pillar Server

Various items to be taken care of:

  • Create a Dockerfile
  • Make sure it has everything needed to make a Pillar Server container

Data migration with new ID (bson.ObjectId) and its impacts

So now that we're going to be using bson.ObjectId as the primary key for most of our first-class citizens, there are some side effects I want to bring to your attention. And yes, we should discuss the remedy as well.

Order

Really the discussion boils down to inserting data in order. In other words, start with the least dependent and go all the way to the most dependent one.

User
Asset
Comment (start with the root of the tree, since each child needs to have a ref to its parent)
Notes
Actions

References

Another challenge is keeping fidelity using the original references. For example, a Comment is associated with a User, which means Fiddler must also pass the ObjectId for that User, and similarly the ObjectId for the parent Comment, if any.

We have two choices:

a) Fiddler finds the ObjectId using the original id from the User or parent Comment
b) Fiddler passes all the original ids as a sub-item (field) and lets Pillar take care of it

I'm proposing that we go with option (b): introduce a sub-JSON, say ref (or whatever), and pass all the original ids.

ref: {
  "parent_id": "sndlfkslfjlsd",
  "user_id": "aljdlkfafjsjf",
  ...
}

This ref field will not be serialized to the DB; it will only be used as a way to IMPORT data.
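
A minimal sketch of option (b): the ref block is accepted on import but never written to Mongo (bson:"-"); the names are illustrative only:

// ImportRef carries the original source ids used only during import.
type ImportRef struct {
    UserID   string `json:"user_id"`
    ParentID string `json:"parent_id,omitempty"`
}

type Comment struct {
    ID       bson.ObjectId `json:"id" bson:"_id"`
    UserID   bson.ObjectId `json:"user_id" bson:"user_id"`
    ParentID bson.ObjectId `json:"parent_id,omitempty" bson:"parent_id,omitempty"`
    Body     string        `json:"body" bson:"body"`
    Ref      ImportRef     `json:"ref,omitempty" bson:"-"` // import-only; Pillar resolves these to ObjectIds
}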

refactor model

separate model package with different file names
separate service package

Coral Data Model

coral-schema

comments

  • id
  • userId
  • assetId
  • parentId
  • children (array)
  • body
  • status
  • dateCreated
  • dateUpdated
  • dateApproved
  • Actions
  • Notes
  • Source: original IDs from external source (publisher)

User

  • ID
  • UserName
  • SourceID: original IDs from external source (publisher)

Asset

Content the comments are on

  • ID
  • URL
  • SourceID: original ID from external source (publisher)

Data - Compute "meta stats subdocuments"

Challenge: Provide a 'meta' level of stats on each stats packet calculated.

Each dimensional breakdown offers a consistent set of values. In order to intelligently work with them (and build front ends to do so), we need to know information about what values we can expect, and how they are formed.

Sample "meta stats" packet:

{
  min: // the minimum value in that dimension
  max: // the maximum
  mean: // the mean
  median: // the median
  stdev: // the standard deviation value
  distribution: [ // a breakdown of the distribution of values in the range between min and max
    ##, // number of elements falling between 0% and 5% of range
    ##, // number of elements falling between 5% and 10% of range
    ...
    ## // number of elements falling between 95% and 100% of range
  ]
}

A meta stats packet must be provided for each field in each dimensional breakdown. For example, we might want to architect it this way:

  user_statistics.meta.comments.all.all.count: {
    // an entire meta stats packet for user_statistics.statistics.comments.all.all.count 
  }

This would allow a client who knows they want to work in a certain dimension to request a meta packet to render the interface for that dimension.
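
A sketch of the packet as a Go struct (field names follow the sample above; the types are assumptions):

// MetaStats describes the values of one field in one dimensional breakdown.
type MetaStats struct {
    Min          float64 `json:"min" bson:"min"`
    Max          float64 `json:"max" bson:"max"`
    Mean         float64 `json:"mean" bson:"mean"`
    Median       float64 `json:"median" bson:"median"`
    Stdev        float64 `json:"stdev" bson:"stdev"`
    Distribution []int   `json:"distribution" bson:"distribution"` // 20 buckets, each 5% of the min-max range
}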

How should we go microservice?

The signs and omens are clear: the time has come to adopt a microservice architecture and start developing our messaging protocol.

This is a discussion thread to track the conversation. Go!

Support CORS pre-flight requests

We are using fetch() on the front-end, which uses a "cors" mode and sends a pre-flight OPTIONS request to ask for available methods on the API side.

Using fetch's "no-cors" mode allows for some POST requests, but when using "no-cors" you can't consume the response body, which is quite crucial. So CORS it is.

I think the gorilla handlers do have support for OPTIONS requests; the docs aren't very clear on how to set it up, but you can see some of the options in the code.
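
For reference, a minimal sketch using github.com/gorilla/handlers (the origins, methods and headers below are placeholders, not the final policy):

package main

import (
    "log"
    "net/http"

    "github.com/gorilla/handlers"
)

func main() {
    mux := http.NewServeMux()
    // ... register Pillar's handlers on mux ...

    // Wrap the whole mux so pre-flight OPTIONS requests are answered with the
    // allowed methods, headers and origins.
    cors := handlers.CORS(
        handlers.AllowedOrigins([]string{"*"}),
        handlers.AllowedMethods([]string{"GET", "POST", "PUT", "DELETE", "OPTIONS"}),
        handlers.AllowedHeaders([]string{"Content-Type"}),
    )
    log.Fatal(http.ListenAndServe(":8080", cors(mux)))
}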

See related PR: #38

Create indexes on mongo collections

Pillar needs to create indexes that prevent table scans for all operations it handles.

db.collection.createIndex() will not override existing indexes, so we can call createIndex each time the server starts without incurring cost:

https://docs.mongodb.org/v3.0/reference/method/db.collection.createIndex/#db.collection.createIndex

Note, indexes should not be created in the background, as we want to ensure that they are in place before the server starts accepting requests.
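
A sketch of what the startup call could look like with mgo; the index keys shown are placeholders, not the final list. EnsureIndex is effectively a no-op when an identical index already exists:

package service

import "gopkg.in/mgo.v2"

// ensureIndexes builds the indexes in the foreground before the server starts
// accepting requests; comments/source.id is just an illustrative example.
func ensureIndexes(db *mgo.Database) error {
    return db.C("comments").EnsureIndex(mgo.Index{
        Key:        []string{"source.id"},
        Unique:     true,
        Background: false,
    })
}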

Use ardanlabs/kits/log

Change from log to "github.com/ardanlabs/kit/log" for logging. Look at shelf or sponge for how to initialize and use it.

Create search_history collection

Challenge: As searches change over time, people will want to be able to see what the effects of those changes are. In anticipation of this, we need to create a history of creates and updates that stores each search state along the way.

Solution: Create a search_history collection and update it whenever a search is created or updated. Documents should look like this:

{
  action: "[create|update]",
  when: date,
  search: { full user group record }
}
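
As a Go struct, a history entry could look roughly like this (names and types are assumptions mirroring the shape above):

// SearchHistory records one create or update of a search.
type SearchHistory struct {
    ID     bson.ObjectId `json:"id" bson:"_id"`
    Action string        `json:"action" bson:"action"` // "create" or "update"
    When   time.Time     `json:"when" bson:"when"`
    Search bson.M        `json:"search" bson:"search"` // the full record as it was saved
}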

Merge backend and service mongo code

Reuse the idea of a backend and create a generic package to merge the duplicated code. Currently we have Mongo code in the service package as well as the mongodb package.

Establish variable naming conventions for Coral Schemas

What?

Since Xenia is a pass-through from the Mongo storage of our data, the field names are carried through. JSON is traditionally camelCase, while our field names are currently PascalCase. Fixing this will lead to fewer typos and, more importantly, expected behavior for users of our software.

MongoDB naming conventions say that field names should be lowercase (camelCase or snake_case).
http://stackoverflow.com/questions/9868323/is-there-a-convention-to-name-collection-in-mongodb
This makes sense if you think about how MongoDB speaks JavaScript on the CLI and is basically storing JSON blobs.

The Google JSON Style Guide says that JSON should be camelCased, in the same naming conventions as JavaScript https://google.github.io/styleguide/jsoncstyleguide.xml#Property_Name_Guidelines

JavaScript naming conventions

How to fix?

Change all field names to lowercase (optionally snake_case if you prefer). If we fix this now, it will be less painful than going back later and updating every instance.

Make Source field consistent in all Collections - Simplify Referential Integrity in Import

Since we're creating bson.ObjectId for all ID fields in our collections, we established a standard approach of identifying/looking up records using the original source fields as strings. This was done to resolve references and keep integrity in the system.

For example the Source field in a Comment looks as follows:

// CommentSource encapsulates all original id from the source system
type CommentSource struct {
    ID       string `json:"id" bson:"id" validate:"required"`
    AssetID  string `json:"asset_id" bson:"asset_id" validate:"required"`
    UserID   string `json:"user_id" bson:"user_id" validate:"required"`
    ParentID string `json:"parent_id" bson:"parent_id"`
}

These fields are used to lookup respective items in their own collection and fix the references in a comment.

However, this is not done consistently in other collections. We should make a conscious effort to keep this consistent for all collections such as Asset, User and Action as well.
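
One possible way to keep it consistent is a single shared source struct (this matches the Source ImportSource field used in the Asset proposal below; which fields are required per collection is still open):

// ImportSource holds the original ids from the source system for any entity.
type ImportSource struct {
    ID       string `json:"id" bson:"id" validate:"required"`
    AssetID  string `json:"asset_id,omitempty" bson:"asset_id,omitempty"`
    UserID   string `json:"user_id,omitempty" bson:"user_id,omitempty"`
    ParentID string `json:"parent_id,omitempty" bson:"parent_id,omitempty"`
}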

Allow notes on Comments and Users

Challenge: The Trust product will allow users to leave notes on comments or users. Most of our source data structures only allow notes on comments. Our schema will need to allow notes to be placed on documents in any collection.

Concept: Notable (aka, "Comments and Users are Notable")

Spec

  • Build CRUD apis for notes
    • [POST, GET, PUT, DELETE] /notes/
  • Implement note counts on comments/users
  • Implement strategy to return notes along with comments/users

3 possible solutions depending on how we end up dealing with relations in mongo:

  • Append notes to a subdocument on the document,
  • Create a separate notes collection. Each document has a field indicating which document the note is on, or
  • Create separate notes collections for notes on each document type, i.e. make user_notes and comment_notes collections for notes on users and comments respectively.
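
A minimal sketch of the separate-collection option (the second bullet above); all names are assumptions:

// Note can be attached to a document in any collection.
type Note struct {
    ID          bson.ObjectId `json:"id" bson:"_id"`
    UserID      bson.ObjectId `json:"user_id" bson:"user_id"`     // author of the note
    Target      string        `json:"target" bson:"target"`       // "comments", "users", ...
    TargetID    bson.ObjectId `json:"target_id" bson:"target_id"` // document the note is on
    Body        string        `json:"body" bson:"body"`
    DateCreated time.Time     `json:"date_created" bson:"date_created"`
}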

Add author(s), section, sub-section to an Asset

Proposal below:

type Author struct {
    ID       string        `json:"id" bson:"_id" validate:"required"`
    Name     string        `json:"name" bson:"name" validate:"required"`
    URL      string        `json:"url,omitempty" bson:"url,omitempty"`
    Twitter  string        `json:"twitter,omitempty" bson:"twitter,omitempty"`
    Facebook string        `json:"facebook,omitempty" bson:"facebook,omitempty"`
}

type Asset struct {
    ID         bson.ObjectId `json:"id" bson:"_id"`
    URL        string        `json:"url" bson:"url" validate:"required"`
    Tags       []string      `json:"tags,omitempty" bson:"tags,omitempty"`
    Authors    []Author      `json:"authors,omitempty" bson:"authors,omitempty"`
    Section    string        `json:"section,omitempty" bson:"section,omitempty"`
    Subsection string        `json:"subsection,omitempty" bson:"subsection,omitempty"`
    Source     ImportSource  `json:"source" bson:"source"`
    Metadata   bson.M        `json:"metadata,omitempty" bson:"metadata,omitempty"`
}

Prevent duplicate actions

Currently, pillar allows users to perform the same action on a single target more than once.

When an action is posted, pillar should check to see if there's already an action matching:

  • the user
  • the target
  • the action type

If this already exists, we should not insert another copy and, instead, respond with an appropriate message.
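
A sketch of the check, assuming mgo and the Action fields sketched earlier on this page (user_id, target, target_id, type); all names are assumptions:

package service

import (
    "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

// actionExists reports whether the same user already performed the same type
// of action on the same target, so the handler can skip the insert and
// respond with an appropriate message instead.
func actionExists(actions *mgo.Collection, userID, targetID bson.ObjectId, target, actionType string) (bool, error) {
    n, err := actions.Find(bson.M{
        "user_id":   userID,
        "target":    target,
        "target_id": targetID,
        "type":      actionType,
    }).Count()
    return n > 0, err
}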
