
blockbuilder-search-index's People

Contributors

enjalot, micahstubbs


blockbuilder-search-index's Issues

store gist data by user

better support for manual exploration

perhaps by storing files in directories by user then by gistID

user/gistID/

load gists data into modern Elasticsearch on cloud VM server

load gists data into modern Elasticsearch. today, modern Elasticsearch means version 6.3.2: https://www.elastic.co/downloads/elasticsearch-oss

do this locally first as an exercise.

  • load gists data into modern Elasticsearch on my local machine

when all goes well locally

  • deploy blockbuilder-graph-search-index + Elasticsearch to a cloud VM server

stretch goal

  • package blockbuilder-graph-search-index + Elasticsearch up as a docker container

collaboration checklist

  • get @enjalot's local blockbuilder search dev environment up and running
  • review PRs, test code locally
  • rinse and repeat
  • stand up some servers
  • deploy new backends: Elasticsearch 6.4.0 and index d3 gist data on servers
  • test with local search frontend and new search backend #55 (comment)
  • deploy indexing server with systemd
  • setup cronjobs for continuous indexing
  • setup elasticsearch with systemd
  • deploy free commercial Elasticsearch 6.4.0 to enable security monitoring features #55 (comment)
  • whitelist IP addresses for servers - firewall rules
  • fix module parsing bug #56
  • test it all
  • update blockbuilder config to point to new servers
  • deploy new blockbuilder & blockbuilder-search code to existing blockbuilder server

Remarks on first install

Micah asked me to mention difficulties on my first install of this script. I noted this:

  • gist-cloner wanted to log its progress to ES, but failed because I had not set up ES yet. The error message gave no clear indication that this was the cause.

  • it wasn't obvious how to start without loading thousands of blocks; an example showing how to download two or three specific users would help. I managed to find coffee gist-meta.coffee data/user1.json '' "user1" but I have no idea how to extend that to a few users.
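one possible reading of that command, sketched here: the output file, empty page token, and username change per user, so a small list could be looped. this is an assumption about how gist-meta.coffee works, not documented behavior:

```javascript
// hypothetical: build one gist-meta.coffee invocation per user,
// copying the shape of the single-user example above
const users = ['user1', 'user2', 'user3'];

const commands = users.map(
  (user) => `coffee gist-meta.coffee data/${user}.json '' "${user}"`
);
```

running those three commands (e.g. via a shell loop) would fetch metadata for just those users.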

index blocks found with clever google searches

h/t @redblobgames for this set of related ideas

we can index the bl.ocks directly, but we probably want to also add the github users to our users csv file and use the existing scripts (gist-cloner.coffee?) to get all of the gists for each new github user
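a sketch of the merge step, assuming the users file is one username per line (mergeUsers is a hypothetical helper, not an existing script):

```javascript
// merge usernames discovered via google searches into the existing
// users list without introducing duplicates, so the existing scripts
// (gist-cloner.coffee etc.) pick up their gists on the next run
function mergeUsers(existingCsv, newUsers) {
  const existing = existingCsv
    .split('\n')
    .map((u) => u.trim())
    .filter(Boolean);
  // Set preserves insertion order and drops duplicates
  return [...new Set([...existing, ...newUsers])].join('\n');
}
```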

add update pipeline script

add a shell script for the pipeline that updates the blocks gist data for a local blockbuilder search instance.

add the script with the commands that @micahstubbs normally uses, rather than the full set of options we list in README.md

add a script that can be run directly, rather than pasted in manually, command by command (as we do with the commands listed in README.md today)
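a rough sketch of what such a script might look like. the command set and order are assumptions pieced together from the scripts mentioned across these issues; the exact flags @micahstubbs uses may differ:

```shell
#!/usr/bin/env bash
# update-pipeline.sh (hypothetical name) -- refresh a local
# blockbuilder search instance end to end
set -euo pipefail

# 1. refresh gist metadata for all known users
coffee gist-meta.coffee

# 2. download/clone the gist contents
coffee gist-cloner.coffee

# 3. (re)index everything into the local Elasticsearch instance
coffee --nodejs --max-old-space-size=8000 elasticsearch.coffee
```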

retrieve latest users data from server

it looks like the deployed blockbuilder search knows about more users than our data directory does.

if we browse over to http://blockbuilder.org/search, we see 25298 blocks (hooray d3 community!)
(screenshot: blockbuilder.org/search showing 25298 blocks)

yet running coffee gist-meta.coffee returns a total of 24095 blocks

(screenshot: terminal output of coffee gist-meta.coffee showing 24095 total blocks)

the difference between these two is 1203 new blocks that the deployed blockbuilder search knows about but that our local script does not. I'm guessing that these are blocks created by new users of the blockbuilder editor.

@enjalot when you have a moment, could you retrieve that latest users csv from the blockbuilder search server and commit it to this repo? (github tells me this user data was updated 3 months ago, so should be straightforward to update again 😄 )

https://github.com/enjalot/blockbuilder-search-index/tree/master/data

the goal is to contribute back the most complete user list that we have so that other d3 example research (like graph search) can benefit from it 🌱

user datastore schema

  • username [string] github username
  • source [string] the place we found the user
  • created [timestamp] inserted into user datastore
  • updated [timestamp] updated in user datastore
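an example record matching the schema above (all values are illustrative):

```javascript
// one user datastore entry, following the proposed schema
const user = {
  username: 'enjalot',           // [string] github username
  source: 'blockbuilder-editor', // [string] the place we found the user
  created: 1502727825000,        // [timestamp] inserted into user datastore
  updated: 1502727825000,        // [timestamp] updated in user datastore
};
```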

clone gists instead of downloading text files

I have an experimental file where I clone entire gists from our list of blocks. This ends up taking about 2x the space of just the text files (the current way we index). That's still only about 4gb, which is rather trivial.

I'd like to do this at the same time we refactor to store the downloaded gist content by user ( #3 )

setup gcp hosted search

setup gcp (google cloud platform) hosted search

💭 reasoning

  1. gcp with dev credits is likely lower-cost than our existing elastic cloud search index instance.
  2. support for the old version of Elasticsearch we currently run (2.4.x) goes away on 8/28, so we have to rewrite the search queries in our blockbuilder search frontend code soon anyway
  3. since we have to re-write search queries anyway, might as well migrate to a lower cost provider while we are at it

📑 tasks

  • write a nodejs client for App Engine Searchable Document Indexes
  • figure out a mapping from gists contents to one or more AppEngine Documents
  • write a script to import gists into AppEngine Documents
  • write some command line search queries to test the mapping / schema
  • write new queries for each search action possible today from the blockbuilder.org/search UI
  • make a branch of the blockbuilder.org/search UI with new queries
  • a/b test GCP AppEngine Search with Elasticsearch
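the "mapping from gists contents" task above could start from something like this sketch. every field name here is a guess; the real document schema is exactly what the task list is meant to settle:

```javascript
// hypothetical mapping from one gist to one flat search document
function gistToDocument(gist) {
  return {
    id: gist.id,
    owner: gist.owner,
    description: gist.description || '',
    // flatten all file contents into a single searchable text field
    body: Object.values(gist.files)
      .map((f) => f.content)
      .join('\n'),
  };
}
```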

skip already indexed files by default

in elasticsearch.coffee

# we may want to check if a document is in ES before trying to write it
# this can help us avoid overloading the server with writes when reindexing
skip = true
offset = 0
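the existence check the comment describes might look like this sketch, assuming the legacy elasticsearch npm client (whose exists() resolves to a boolean). the index/type names are placeholders:

```javascript
// index a document only if ES does not already have it, to avoid
// overloading the server with redundant writes during a reindex
async function indexIfMissing(client, doc) {
  const already = await client.exists({
    index: 'blockbuilder',
    type: 'block',
    id: doc.id,
  });
  if (already) return false; // skipped
  await client.index({
    index: 'blockbuilder',
    type: 'block',
    id: doc.id,
    body: doc,
  });
  return true; // written
}
```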

cloning errors

On my local install, the cloning process sometimes tries to clone a block into {user}/gist.github.com/ instead of {user}/{gist.id}.

I have traced the error to:

cmd = "cd #{userfolder};git clone git@gist.github.com:#{gist.id}"

which is solved by adding an explicit target directory:

cmd = "cd #{userfolder};git clone git@gist.github.com:#{gist.id} #{gist.id}/"

(I don't know why it does this only for some blocks, not all.)
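the fix amounts to never letting git derive the directory name from the remote URL. a sketch in plain JS (cloneCommand is a hypothetical helper mirroring the CoffeeScript line above):

```javascript
// build the clone command with an explicit target directory, so git
// never falls back to deriving one from the host name
function cloneCommand(userfolder, gistId) {
  return `cd ${userfolder};git clone git@gist.github.com:${gistId} ${gistId}/`;
}
```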

search by script tag dependencies

it would be cool to be able to search for blocks by what external libraries they import with script tags.

specifically, I would like to be able to search for blocks that only load d3, so that I can find an example of a technique implemented in pure d3 and javascript, without the overhead of some other charting library.
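one way to sketch this filter: pull script-tag src attributes out of each block's index.html with a regex (an approximation, not a real HTML parser) and keep blocks whose every external script looks like a d3 build. both function names are hypothetical:

```javascript
// collect the src of every <script src="..."> tag in an html string
function scriptSources(html) {
  const re = /<script[^>]*\bsrc=["']([^"']+)["']/g;
  const srcs = [];
  let m;
  while ((m = re.exec(html)) !== null) srcs.push(m[1]);
  return srcs;
}

// true if the block loads at least one script and all of them are d3
// builds (d3.js, d3.min.js, d3.v4.min.js, ...)
function usesOnlyD3(html) {
  const srcs = scriptSources(html);
  return srcs.length > 0 && srcs.every((s) => /\/d3[^/]*\.js/.test(s));
}
```

indexing the scriptSources list per block would also enable the more general "search by dependency" queries described above.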

js heap OOM with 8gb ram specified

so I'm updating my local elasticsearch index, and I use this command from our docs

coffee --nodejs --max-old-space-size=8000 elasticsearch.coffee

oh no, at gist 30179 I see this error message

indexed 30182 7067e1cc1b623959eacda6e34a2f63da
indexed 30181 7acb36eccb6280d95634f3d6f4d8f0f7
indexed 30179 c2acadc0809fcad97e403212333234d8

<--- Last few GCs --->

[12055:0x102801e00]   210399 ms: Mark-sweep 7845.3 (8060.4) -> 7844.9 (8060.9) MB, 247.6 / 0.0 ms  allocation failure GC in old space requested
[12055:0x102801e00]   210742 ms: Mark-sweep 7844.9 (8060.9) -> 7844.9 (8048.9) MB, 343.3 / 0.0 ms  last resort GC in old space requested
[12055:0x102801e00]   210957 ms: Mark-sweep 7844.9 (8048.9) -> 7844.9 (8043.9) MB, 215.4 / 0.0 ms  last resort GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x184b31ca55e9 <JSObject>
    1: toString [buffer.js:~634] [pc=0x30f302b1d0b8](this=0x184b9b53c4c1 <Uint8Array map = 0x184b022da259>,encoding=0x184bef8022d1 <undefined>,start=0x184bef8022d1 <undefined>,end=0x184bef8022d1 <undefined>)
    2: arguments adaptor frame: 0->3
    3: /* anonymous */ [/Users/m/workspace/blockbuilder-search-index/elasticsearch.coffee:98] [bytecode=0x184b2ebd2fa1 offset=19](this=0x184bead866f1 <JS...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 2: node::FatalTryCatch::~FatalTryCatch() [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 3: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 4: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 5: v8::internal::Factory::NewStringFromUtf8(v8::internal::Vector<char const>, v8::internal::PretenureFlag) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 6: v8::String::NewFromUtf8(v8::Isolate*, char const*, v8::NewStringType, int) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 7: node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding, v8::Local<v8::Value>*) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 8: void node::Buffer::(anonymous namespace)::StringSlice<(node::encoding)1>(v8::FunctionCallbackInfo<v8::Value> const&) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 9: 0x30f302248327
10: 0x30f302b1d0b8
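the stack trace points at a Buffer.toString call in elasticsearch.coffee, which suggests all file contents are held in memory at once. one possible mitigation, sketched with placeholder names (indexInBatches, indexOne, batchSize are not from elasticsearch.coffee):

```javascript
// process files in fixed-size batches so only one batch's contents
// is alive at a time, bounding peak heap usage during a reindex
async function indexInBatches(files, indexOne, batchSize = 500) {
  for (let i = 0; i < files.length; i += batchSize) {
    const batch = files.slice(i, i + batchSize);
    // references to this batch are dropped after each await, so the
    // GC can reclaim the contents before the next batch starts
    await Promise.all(batch.map(indexOne));
  }
}
```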
