
blockbuilder-search-index's People

Contributors

enjalot, micahstubbs


blockbuilder-search-index's Issues

store gist data by user

better support for manual exploration

perhaps by storing files in directories by user then by gistID

user/gistID/

load gists data into modern Elasticsearch on cloud VM server

load gists data into modern Elasticsearch. today, modern Elasticsearch means version 6.3.2: https://www.elastic.co/downloads/elasticsearch-oss

do this locally first as an exercise.

  • load gists data into modern Elasticsearch on my local machine

when all goes well locally

  • deploy blockbuilder-graph-search-index + Elasticsearch to a cloud VM server

stretch goal

  • package blockbuilder-graph-search-index + Elasticsearch up as a docker container

collaboration checklist

  • get @enjalot's local blockbuilder search dev environment up and running
  • review PRs, test code locally
  • rinse and repeat
  • stand up some servers
  • deploy new backends: Elasticsearch 6.4.0 and index d3 gist data on servers
  • test with local search frontend and new search backend #55 (comment)
  • deploy indexing server with systemd
  • setup cronjobs for continuous indexing
  • setup elasticsearch with systemd
  • deploy free commercial Elasticsearch 6.4.0 to enable security monitoring features #55 (comment)
  • whitelist IP addresses for servers - firewall rules
  • fix module parsing bug #56
  • test it all
  • update blockbuilder config to point to new servers
  • deploy new blockbuilder & blockbuilder-search code to existing blockbuilder server

Remarks on first install

Micah asked me to mention difficulties on my first install of this script. I noted this:

  • gist-cloner wanted to log its progress to ES, but failed because I had not set up ES yet. The error message gave no clear indication that this was the cause.

  • it wasn't obvious how to start without loading thousands of blocks; an example showing how to download two or three specific users would help. I managed to find coffee gist-meta.coffee data/user1.json '' "user1" but I have no idea how to extend that to a few users.
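one possible reading of that command, sketched here: the output file, empty page token, and username change per user, so a small list could be looped. this is an assumption about how gist-meta.coffee works, not documented behavior:

```javascript
// hypothetical: build one gist-meta.coffee invocation per user,
// copying the shape of the single-user example above
const users = ['user1', 'user2', 'user3'];

const commands = users.map(
  (user) => `coffee gist-meta.coffee data/${user}.json '' "${user}"`
);
```

running those three commands (e.g. via a shell loop) would fetch metadata for just those users.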

index blocks found with clever google searches

h/t @redblobgames for this set of related ideas

we can index the bl.ocks directly, but we probably want to also add the github users to our users csv file and use the existing scripts (gist-cloner.coffee?) to get all of the gists for each new github user
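a sketch of the merge step, assuming the users file is one username per line (mergeUsers is a hypothetical helper, not an existing script):

```javascript
// merge usernames discovered via google searches into the existing
// users list without introducing duplicates, so the existing scripts
// (gist-cloner.coffee etc.) pick up their gists on the next run
function mergeUsers(existingCsv, newUsers) {
  const existing = existingCsv
    .split('\n')
    .map((u) => u.trim())
    .filter(Boolean);
  // Set preserves insertion order and drops duplicates
  return [...new Set([...existing, ...newUsers])].join('\n');
}
```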

add update pipeline script

add a shell script for the pipeline that updates the blocks gist data for a local blockbuilder search instance.

add the script with the commands that @micahstubbs normally uses, rather than the full set of options we list in README.md

add a script that can be run directly, rather than pasted in manually, command by command (as we do with the commands listed in README.md today)
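a rough sketch of what such a script might look like. the command set and order are assumptions pieced together from the scripts mentioned across these issues; the exact flags @micahstubbs uses may differ:

```shell
#!/usr/bin/env bash
# update-pipeline.sh (hypothetical name) -- refresh a local
# blockbuilder search instance end to end
set -euo pipefail

# 1. refresh gist metadata for all known users
coffee gist-meta.coffee

# 2. download/clone the gist contents
coffee gist-cloner.coffee

# 3. (re)index everything into the local Elasticsearch instance
coffee --nodejs --max-old-space-size=8000 elasticsearch.coffee
```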

retrieve latest users data from server

it looks like the deployed blockbuilder search knows about more users than our data directory does.

if we browse over to http://blockbuilder.org/search, we see 25298 blocks (hooray d3 community!)
(screenshot: blockbuilder.org/search showing 25298 blocks)

yet running coffee gist-meta.coffee returns a total of 24095 blocks

(screenshot: terminal output of coffee gist-meta.coffee showing 24095 total blocks)

the difference between these two is 1203 new blocks that the deployed blockbuilder search knows about but that our local script does not. I'm guessing that these are blocks created by new users of the blockbuilder editor.

@enjalot when you have a moment, could you retrieve that latest users csv from the blockbuilder search server and commit it to this repo? (github tells me this user data was updated 3 months ago, so should be straightforward to update again 😄 )

https://github.com/enjalot/blockbuilder-search-index/tree/master/data

the goal is to contribute back the most complete user list that we have so that other d3 example research (like graph search) can benefit from it 🌱

user datastore schema

  • username [string] github username
  • source [string] the place we found the user
  • created [timestamp] inserted into user datastore
  • updated [timestamp] updated in user datastore
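an example record matching the schema above (all values are illustrative):

```javascript
// one user datastore entry, following the proposed schema
const user = {
  username: 'enjalot',           // [string] github username
  source: 'blockbuilder-editor', // [string] the place we found the user
  created: 1502727825000,        // [timestamp] inserted into user datastore
  updated: 1502727825000,        // [timestamp] updated in user datastore
};
```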

clone gists instead of downloading text files

I have an experimental file where I clone entire gists from our list of blocks. This ends up taking about 2x the space of just the text files (the current way we index). That's still only about 4gb, which is rather trivial.

I'd like to do this at the same time we refactor to store the downloaded gist content by user ( #3 )

setup gcp hosted search

setup gcp (google cloud platform) hosted search

💭 reasoning

  1. gcp with dev credits is likely lower-cost than our existing elastic cloud search index instance.
  2. support for the old version of Elasticsearch we currently run (2.4.x) goes away on 8/28, so we have to rewrite the search queries in our blockbuilder search frontend code soon anyway
  3. since we have to re-write search queries anyway, might as well migrate to a lower cost provider while we are at it

📑 tasks

  • write a nodejs client for App Engine Searchable Document Indexes
  • figure out a mapping from gists contents to one or more AppEngine Documents
  • write a script to import gists into AppEngine Documents
  • write some command line search queries to test the mapping / schema
  • write new queries for each search action possible today from the blockbuilder.org/search UI
  • make a branch of the blockbuilder.org/search UI with new queries
  • a/b test GCP AppEngine Search with Elasticsearch
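the "mapping from gists contents" task above could start from something like this sketch. every field name here is a guess; the real document schema is exactly what the task list is meant to settle:

```javascript
// hypothetical mapping from one gist to one flat search document
function gistToDocument(gist) {
  return {
    id: gist.id,
    owner: gist.owner,
    description: gist.description || '',
    // flatten all file contents into a single searchable text field
    body: Object.values(gist.files)
      .map((f) => f.content)
      .join('\n'),
  };
}
```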

skip already indexed files by default

in elasticsearch.coffee

# we may want to check if a document is in ES before trying to write it
# this can help us avoid overloading the server with writes when reindexing
skip = true
offset = 0
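the existence check the comment describes might look like this sketch, assuming the legacy elasticsearch npm client (whose exists() resolves to a boolean). the index/type names are placeholders:

```javascript
// index a document only if ES does not already have it, to avoid
// overloading the server with redundant writes during a reindex
async function indexIfMissing(client, doc) {
  const already = await client.exists({
    index: 'blockbuilder',
    type: 'block',
    id: doc.id,
  });
  if (already) return false; // skipped
  await client.index({
    index: 'blockbuilder',
    type: 'block',
    id: doc.id,
    body: doc,
  });
  return true; // written
}
```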

cloning errors

On my local install, the cloning process sometimes tries to clone a block into {user}/gist.github.com/ instead of {user}/{gist.id}.

I have traced the error to:

cmd = "cd #{userfolder};git clone git@gist.github.com:#{gist.id}"

which is solved by adding an explicit target directory:

cmd = "cd #{userfolder};git clone git@gist.github.com:#{gist.id} #{gist.id}/"

(I don't know why it does this only for some blocks, not all.)
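the fix amounts to never letting git derive the directory name from the remote URL. a sketch in plain JS (cloneCommand is a hypothetical helper mirroring the CoffeeScript line above):

```javascript
// build the clone command with an explicit target directory, so git
// never falls back to deriving one from the host name
function cloneCommand(userfolder, gistId) {
  return `cd ${userfolder};git clone git@gist.github.com:${gistId} ${gistId}/`;
}
```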

search by script tag dependencies

it would be cool to be able to search for blocks by what external libraries they import with script tags.

specifically, I would like to be able to search for blocks that only load d3, so that I can find an example of a technique implemented in pure d3 and javascript, without the overhead of some other charting library.
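one way to sketch this filter: pull script-tag src attributes out of each block's index.html with a regex (an approximation, not a real HTML parser) and keep blocks whose every external script looks like a d3 build. both function names are hypothetical:

```javascript
// collect the src of every <script src="..."> tag in an html string
function scriptSources(html) {
  const re = /<script[^>]*\bsrc=["']([^"']+)["']/g;
  const srcs = [];
  let m;
  while ((m = re.exec(html)) !== null) srcs.push(m[1]);
  return srcs;
}

// true if the block loads at least one script and all of them are d3
// builds (d3.js, d3.min.js, d3.v4.min.js, ...)
function usesOnlyD3(html) {
  const srcs = scriptSources(html);
  return srcs.length > 0 && srcs.every((s) => /\/d3[^/]*\.js/.test(s));
}
```

indexing the scriptSources list per block would also enable the more general "search by dependency" queries described above.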

js heap OOM with 8gb ram specified

so I'm updating my local elasticsearch index, and I use this command from our docs

coffee --nodejs --max-old-space-size=8000 elasticsearch.coffee

oh no, at gist 30179 I see this error message

indexed 30182 7067e1cc1b623959eacda6e34a2f63da
indexed 30181 7acb36eccb6280d95634f3d6f4d8f0f7
indexed 30179 c2acadc0809fcad97e403212333234d8

<--- Last few GCs --->

[12055:0x102801e00]   210399 ms: Mark-sweep 7845.3 (8060.4) -> 7844.9 (8060.9) MB, 247.6 / 0.0 ms  allocation failure GC in old space requested
[12055:0x102801e00]   210742 ms: Mark-sweep 7844.9 (8060.9) -> 7844.9 (8048.9) MB, 343.3 / 0.0 ms  last resort GC in old space requested
[12055:0x102801e00]   210957 ms: Mark-sweep 7844.9 (8048.9) -> 7844.9 (8043.9) MB, 215.4 / 0.0 ms  last resort GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x184b31ca55e9 <JSObject>
    1: toString [buffer.js:~634] [pc=0x30f302b1d0b8](this=0x184b9b53c4c1 <Uint8Array map = 0x184b022da259>,encoding=0x184bef8022d1 <undefined>,start=0x184bef8022d1 <undefined>,end=0x184bef8022d1 <undefined>)
    2: arguments adaptor frame: 0->3
    3: /* anonymous */ [/Users/m/workspace/blockbuilder-search-index/elasticsearch.coffee:98] [bytecode=0x184b2ebd2fa1 offset=19](this=0x184bead866f1 <JS...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 2: node::FatalTryCatch::~FatalTryCatch() [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 3: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 4: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 5: v8::internal::Factory::NewStringFromUtf8(v8::internal::Vector<char const>, v8::internal::PretenureFlag) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 6: v8::String::NewFromUtf8(v8::Isolate*, char const*, v8::NewStringType, int) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 7: node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding, v8::Local<v8::Value>*) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 8: void node::Buffer::(anonymous namespace)::StringSlice<(node::encoding)1>(v8::FunctionCallbackInfo<v8::Value> const&) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 9: 0x30f302248327
10: 0x30f302b1d0b8
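the stack trace points at a Buffer.toString call in elasticsearch.coffee, which suggests all file contents are held in memory at once. one possible mitigation, sketched with placeholder names (indexInBatches, indexOne, batchSize are not from elasticsearch.coffee):

```javascript
// process files in fixed-size batches so only one batch's contents
// is alive at a time, bounding peak heap usage during a reindex
async function indexInBatches(files, indexOne, batchSize = 500) {
  for (let i = 0; i < files.length; i += batchSize) {
    const batch = files.slice(i, i + batchSize);
    // references to this batch are dropped after each await, so the
    // GC can reclaim the contents before the next batch starts
    await Promise.all(batch.map(indexOne));
  }
}
```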
