apollo's People

Contributors

bzz, carlosms, fulaphex, marnovo, r0maink, smacker, vmarkovtsev, zurk

apollo's Issues

Bags not saved in DB

The BagsSaver class in bags.py has not been used since apollo was updated to the refactored ml, so the bags are no longer saved to the DB.

  • If we wish to keep it that way, it would make sense to remove the class and refactor the other files to remove all references to the bags table.
  • If not, we can either modify ml's repos2bow function to add this transformer at the end of the pipeline, or put the logic back here as it was before (see the sketch below).
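
If we do keep it, here is a minimal sketch of what re-attaching the saver could look like, assuming the current sourced.ml Transformer interface (a __call__ that receives an RDD of rows) and the Spark Cassandra connector that apollo already uses; the import path, class and table wiring are illustrative, not the actual implementation:

```python
from sourced.ml.transformers import Transformer  # assumed import path


class BagsSaver(Transformer):
    """Hypothetical end-of-pipeline step writing bag rows back to Cassandra."""

    def __init__(self, keyspace, table, **kwargs):
        super().__init__(**kwargs)
        self.keyspace = keyspace
        self.table = table

    def __call__(self, rows):
        # rows is assumed to be an RDD of Row(document, item, value)-like records
        rows.toDF() \
            .write \
            .format("org.apache.spark.sql.cassandra") \
            .mode("append") \
            .options(table=self.table, keyspace=self.keyspace) \
            .save()
        return rows
```

It could then be linked at the end of the repos2bow pipeline, in the same spirit as the MetadataSaver attached through cache_hook.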

Run apollo on minikube: Answer from Java side is empty

Issue description

I want to run apollo on the k8s staging cluster, so I wanted to test it out locally on minikube first. I used Helm charts to bring up a local Spark cluster, a Scylla DB and bblfshd. I then created an image for apollo, available here, as well as a k8s service so it would connect to ports 7077, 9042 and 9432. After creating the pod I ran the resetdb command and it worked. I cloned the engine repo to get example siva files, which I put in io/siva. Then I tried to run the bags command: Spark launches and registers the job (I checked the logs on the master and worker pods, as well as the UI), and then I got this error:

INFO:engine:Initializing on io/siva
INFO:MetadataSaver:Ignition -> DzhigurdaFiles -> UastExtractor -> Moder -> Cacher -> MetadataSaver
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1062, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 908, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1067, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
  File "/usr/local/bin/apollo", line 11, in <module>
    load_entry_point('apollo', 'console_scripts', 'apollo')()
  File "/packages/apollo/apollo/__main__.py", line 230, in main
    return handler(args)
  File "/packages/apollo/apollo/bags.py", line 94, in source2bags
    cache_hook=lambda: MetadataSaver(args.keyspace, args.tables["meta"]))
  File "/packages/sourced/ml/utils/engine.py", line 147, in wrapped_pause
    return func(cmdline_args, *args, **kwargs)
  File "/packages/sourced/ml/cmd_entries/repos2bow.py", line 35, in repos2bow_entry_template
    uast_extractor.link(cache_hook()).execute()
  File "/packages/sourced/ml/transformers/transformer.py", line 95, in execute
    head = node(head)
  File "/packages/apollo/apollo/bags.py", line 46, in __call__
    rows.toDF() \
  File "/spark/python/pyspark/sql/session.py", line 58, in toDF
    return sparkSession.createDataFrame(self, schema, sampleRatio)
  File "/spark/python/pyspark/sql/session.py", line 582, in createDataFrame
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "/spark/python/pyspark/sql/session.py", line 380, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File "/spark/python/pyspark/sql/session.py", line 351, in _inferSchema
    first = rdd.first()
  File "/spark/python/pyspark/rdd.py", line 1361, in first
    rs = self.take(1)
  File "/spark/python/pyspark/rdd.py", line 1343, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/spark/python/pyspark/context.py", line 992, in runJob
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.5/dist-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1062, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

Steps to Reproduce (for bugs)

  • Set up minikube and Helm
  • Clone the charts repo
  • Create the pods, services, etc.: helm install scylla --name=scylla, helm install spark --name=anything, helm install bblfshd --name=babel, kubectl create -f service.yaml, kubectl run -ti --image=r0maink/apollo apollo-test
  • Open a new tab and log into the Spark master with kubectl exec -it anything-master /bin/bash, then do: export PYSPARK_PYTHON=python3 and export PYSPARK_PYTHON_DRIVER=python3
  • Go back to the previous tab (it should still be logged into the apollo pod) and run apollo resetdb --cassandra scylla:9042
  • Get the siva files: apt update, apt install git, git clone https://github.com/src-d/engine, mkdir io, mkdir io/bags, cp engine/examples/siva_files io/siva

And finally: apollo bags -r io/siva --bow io/bags/bow.asdf --docfreq io/bags/docfreq.asdf -f id -f lit -f uast2seq --uast2seq-seq-len 4 -l Java --min-docfreq 5 --bblfsh babel-bblfshd --cassandra scylla:9042 --persist MEMORY_ONLY -s spark://anything-master:7077

Any ideas?

Problem when using apollo with spark cluster

When trying to run the hash or cmd commands with Spark in cluster mode, we hit the same problem we used to have with ml: the workers do not have the apollo library, and it is not added to the Spark session using addPyFile.

I think we should either modify the way the --dep-zip flag works so that it also adds apollo, or change the logic (a rough addPyFile sketch follows this list):

  • when ml's -s flag is used to specify a master that is not local, we should add ml, engine and all other dependencies; if the call is not made by a command of the ml library itself (e.g. it comes from apollo), that includes apollo and its dependencies.
  • the --dep-zip flag should be used to add ml's dependencies. It will be of no use to us, since our workers use the ml-core image and already have them, but it will be useful for other users.
  • as was pointed out in this issue, I think we should add to the Spark conf by default the flags that clean up after us, because otherwise it ends up taking a lot of memory.
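
For reference, a rough sketch of the addPyFile approach mentioned above, assuming the apollo package can be located on the driver (the helper name is made up):

```python
import os
import shutil
import tempfile

import apollo


def add_apollo_pyfile(session):
    """Zip the installed apollo package and ship it to the executors."""
    pkg_dir = os.path.dirname(apollo.__file__)
    archive = shutil.make_archive(
        os.path.join(tempfile.mkdtemp(), "apollo"), "zip",
        root_dir=os.path.dirname(pkg_dir), base_dir="apollo")
    # executors automatically put the zip on their PYTHONPATH
    session.sparkContext.addPyFile(archive)
```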

Support hashing on CPU

Not everyone has GPUs, and in some cases hashing time may not be a bottleneck, so it would be nice to have a mode that does not require CUDA and friends.

AFAIK, having such an option alongside the high-performance GPU one would require small changes in hasher.py, adding the option of using something like https://github.com/ekzhu/datasketch instead of libMHCUDA.
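
A rough sketch of what the CPU path could look like, assuming the bags reach hasher.py as a scipy.sparse.csr_matrix, as they do for libMHCUDA (the function name and output shape are illustrative):

```python
import numpy as np
from datasketch import WeightedMinHashGenerator
from scipy.sparse import csr_matrix


def weighted_minhash_cpu(batch: csr_matrix, sample_size: int, seed: int = 0) -> np.ndarray:
    """Compute weighted MinHashes on the CPU, one matrix row at a time."""
    gen = WeightedMinHashGenerator(batch.shape[1], sample_size=sample_size, seed=seed)
    hashes = np.zeros((batch.shape[0], sample_size, 2), dtype=np.uint32)
    for i in range(batch.shape[0]):
        weights = np.asarray(batch.getrow(i).todense()).ravel()
        hashes[i] = gen.minhash(weights).hashvalues
    return hashes
```

This will be much slower than libMHCUDA, so it only makes sense as an opt-in fallback.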

[`hash` with a big number of files] Failed to execute: com.datastax.spark.connector.writer.RichBatchStatement

Hi,
the Spark Cassandra connector / Scylla fails when you attempt to run the hash step with a big number of files.
I tried 1M files - it always fails; 300k files is unstable: some experiments can be completed, but after several trials it fails with an error. Before each new run of the hash step I used resetdb, but memory is not released from Scylla (I'm not sure if that's correct behaviour of the DB).

Error log:

```
18/04/10 23:04:51 ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBatchStatement@3ec8da3b
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_QUORUM (1 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:100)
    at com.datastax.driver.core.Responses$Error.asException(Responses.java:122)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:506)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1070)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:993)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:934)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_QUORUM (1 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:59)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:289)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:269)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
    ... 18 more
```
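
Not a confirmed fix, but the first thing I would try is throttling the connector's writes so Scylla can acknowledge each batch before the timeout fires. The option names below come from the DataStax Spark Cassandra connector; whether the hash command forwards --config pairs the same way bags does is an assumption:

```python
from pyspark import SparkConf

# smaller, less concurrent batches give the cluster time to acknowledge writes
conf = SparkConf() \
    .set("spark.cassandra.output.batch.size.rows", "50") \
    .set("spark.cassandra.output.concurrent.writes", "2") \
    .set("spark.cassandra.output.throughput_mb_per_sec", "5")
```

On the apollo command line these would presumably be passed as --config key=value pairs.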

Cassandra timeouts with resetdb

When resetting the Cassandra DB, I often get this error:

root@rom-gpu-dbc68df59-kf6qw:/# apollo resetdb --cassandra cassandra
INFO:cassandra:Connecting to cassandra
DROP KEYSPACE IF EXISTS apollo;
Traceback (most recent call last):
  File "/usr/local/bin/apollo", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.4/dist-packages/apollo/__main__.py", line 228, in main
    return handler(args)
  File "/usr/local/lib/python3.4/dist-packages/apollo/cassandra_utils.py", line 67, in reset_db
    cql("DROP KEYSPACE IF EXISTS %s" % args.keyspace)
  File "/usr/local/lib/python3.4/dist-packages/apollo/cassandra_utils.py", line 64, in cql
    db.execute(cmd)
  File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'10.2.13.74': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.2.13.74

I'm going to look into it when I get time. It's a relatively minor problem, since simply retrying the command will often do the trick, but I would like to correct it if possible.
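
A possible band-aid, assuming the cql() helper in cassandra_utils.py shown in the traceback: raise the driver-side request timeout, since dropping a large keyspace can easily exceed the Python cassandra-driver's 10-second default (the 120 s value below is arbitrary):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra"])  # same contact point as in the command above
db = cluster.connect()
db.default_timeout = 120          # seconds; the driver default is 10


def cql(cmd, timeout=120):
    print(cmd + ";")
    db.execute(cmd, timeout=timeout)  # a per-request override also works


cql("DROP KEYSPACE IF EXISTS apollo")
```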

Problem saving bags when running on small dataset on science 3

I ran apollo's bags command with test siva files I found on science 3. To make it work, I had to change a couple of lines in ml's batch_transform.py and apollo's bags.py. These changes included the ones in this open PR, as well as a few more that had been omitted, all renaming the old model variable to the new docfreq_model, plus commenting out a line referencing quant_model.

The exact command was: docker run -it --rm -v /home/romain/io:/io --link bblfshd --link scylla src-d/apollo-rom bags -r /io/siva --batches /io/bags --docfreq /io/bags/docfreq.asdf -f id -f lit -f uast2seq --uast2seq-seq-len 4 -l Java -s 'local[*]' --min-docfreq 5 --bblfsh bblfshd --cassandra scylla --persist MEMORY_ONLY --config spark.executor.memory=4G --config spark.driver.memory=10G --config spark.driver.maxResultSize=4G

It seemed to work properly (the log says apollo detected 128 documents, with an average bag length of 398.7 and a vocabulary size of 5395) until it started writing the docfreq to /io/bags/docfreq.asdf, when I got this error:

Traceback (most recent call last):
  File "/usr/local/bin/apollo", line 11, in <module>
    load_entry_point('apollo', 'console_scripts', 'apollo')()
  File "/packages/apollo/apollo/__main__.py", line 258, in main
    return handler(args)
  File "/packages/apollo/apollo/bags.py", line 126, in source2bags
    batcher.docfreq_model.save(args.docfreq)
  File "/usr/local/lib/python3.5/dist-packages/modelforge/model.py", line 270, in save
    write_model(self._meta, tree, output)
  File "/usr/local/lib/python3.5/dist-packages/modelforge/model.py", line 409, in write_model
    asdf.AsdfFile(final_tree).write_to(output, all_array_compression=ARRAY_COMPRESSION)
  File "/usr/local/lib/python3.5/dist-packages/asdf/asdf.py", line 890, in write_to
    with generic_io.get_file(fd, mode='w') as fd:
  File "/usr/local/lib/python3.5/dist-packages/asdf/generic_io.py", line 1186, in get_file
    fd = atomicfile.atomic_open(realpath, realmode)
  File "/usr/local/lib/python3.5/dist-packages/asdf/extern/atomicfile.py", line 139, in atomic_open
    delete=False)
  File "/usr/lib/python3.5/tempfile.py", line 688, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/lib/python3.5/tempfile.py", line 399, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/io/bags/.___atomic_writedyrniw3x'

The problem seems to be linked to modelforge's saving process. From the error log it looks like science 3 does not have the latest version of the repo, but I couldn't see how that would affect it when looking at this diff.
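
One thing worth ruling out before blaming modelforge: the traceback dies in tempfile while creating /io/bags/.___atomic_write..., which is exactly what happens when the output directory does not exist inside the container. A defensive sketch, reusing the names from the traceback above (args.docfreq and batcher.docfreq_model) rather than apollo's real code:

```python
import os

# make sure the target directory exists before modelforge's atomic write
os.makedirs(os.path.dirname(args.docfreq) or ".", exist_ok=True)
batcher.docfreq_model.save(args.docfreq)
```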

Apollo could not be run with sourced.ml@develop branch

Right now, if one

git clone https://github.com/src-d/apollo.git; cd apollo
virtualenv -p python3 .venv-py3
source .venv-py3/bin/activate

pip install git+https://github.com/src-d/ml.git@develop
pip3 install -e .

and then runs apollo --help, it results in

$ apollo --help
Traceback (most recent call last):
  File "./src-d/apollo/.venv-py3/bin/apollo", line 11, in <module>
    load_entry_point('apollo', 'console_scripts', 'apollo')()
  File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 572, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2755, in load_entry_point
    return ep.load()
  File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2408, in load
    return self.resolve()
  File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2414, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "./src-d/apollo/apollo/__main__.py", line 12, in <module>
    from apollo.bags import preprocess_source, source2bags
  File "./src-d/apollo/apollo/bags.py", line 8, in <module>
    from sourced.ml.transformers import UastExtractor, Transformer, Cacher, UastDeserializer, Engine, \
ImportError: cannot import name 'Documents2BOW'

From https://github.com/src-d/ml/pull/160/files#diff-9bdc53996c12b1f5ff9117bb4bf0ae23R7 it seems that Repo2WeightedSet was renamed in sourced.ml, but apollo's import of

FieldsSelector, ParquetSaver, Repo2WeightedSet, Repo2DocFreq, Repo2Quant, BagsBatchSaver, BagsBatcher

still uses it.

Problem mounting during install

On macOS the command mount -o bind does not work because it does not exist; however, you can achieve the same result with the method described here.

It might be a good idea to add the link to this git as well as the alternate commands (e.g. mount localhost:/path/to/sourced-engine bundle/engine) to the installation part of the docs.
