tony-framework / tony
TonY is a framework to natively run deep learning frameworks on Apache Hadoop.
Home Page: https://tony-project.ai
License: Other
Currently, we have different handling logic for hdfs_classpath, which we add to the container's localizable resources for the AM and then pass along again to the workers. For src_dir and python_venv, we add them to a tony.zip. One difference between the two is that for src_dir we care about the folder structure, whereas for python_venv we don't, since it is a single zip file.
All of this resource handling could be unified via the new -tony.container.resources flag: we localize every resource listed in that field (comma-delimited).
We pass python_venv all the way from client -> am -> taskExecutor as command-line arguments, which is not necessary. We can always use a relative path and assume the top-level folder is venv.
The plan is to get rid of the logic that creates a tony.zip and use the resource-handling logic for all these scenarios. The -executes and -python_binary_path flags will always take a relative path inside the uploaded artifact.
ls /user/alice/tonyJob
- venv.zip
- src/
- mnist.py
Inside venv.zip
venv.zip
- bin/
- lib/
Example:
java -cp `hadoop classpath`:/path/to/TonY/tony-cli/build/libs/tony-cli-x.x.x-all.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/user/alice/tonyJob/venv.zip \
--src_dir=/user/alice/tonyJob/src \
--executes=mnist_distributed.py \
--python_binary_path=bin/python
Under the hood, we pack src_dir into a SRC.zip, upload it to HDFS, and set tonyConf's tony.container.resources to include it; all containers will then localize the zip, and if SRC.zip exists, we'll unzip it.
tony-final.xml's tony.container.resources will be like:
<property>
<name>
tony.container.resources
</name>
<value>
hdfs://tony_tmp/SRC.zip, hdfs://tony_tmp/venv.zip, hdfs://hdfs_classpath/tony.jar
</value>
</property>
The same applies to python_venv: we upload the venv zip file to HDFS, rename it to venv.zip, and set tonyConf; common logic then localizes it to all containers, where we unzip it.
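For illustration, a minimal sketch of what the unified localization step could look like on the container side, assuming YARN has already localized each file under its base name (the class and helper names here are illustrative, not TonY's actual code):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ResourceLocalizer {
  // Unzip every archive listed in tony.container.resources into the container working dir.
  public static void localize(String containerResources) throws IOException {
    for (String resource : containerResources.split(",")) {
      String name = new File(resource.trim()).getName();
      if (name.endsWith(".zip") && new File(name).exists()) {
        unzip(new File(name), new File("."));
      }
    }
  }

  private static void unzip(File zip, File destDir) throws IOException {
    try (ZipFile zf = new ZipFile(zip)) {
      Enumeration<? extends ZipEntry> entries = zf.entries();
      while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        File out = new File(destDir, entry.getName());
        if (entry.isDirectory()) {
          out.mkdirs();
          continue;
        }
        out.getParentFile().mkdirs();
        try (InputStream in = zf.getInputStream(entry);
             FileOutputStream os = new FileOutputStream(out)) {
          byte[] buf = new byte[8192];
          int n;
          while ((n = in.read(buf)) > 0) {
            os.write(buf, 0, n);
          }
        }
      }
    }
  }
}

With this, SRC.zip and venv.zip need no special-casing; both are just entries in tony.container.resources.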
WIP branch https://github.com/linkedin/TonY/tree/refactor
Ref #74
It would be nice to set up continuous integration for this repo. Travis CI is the most popular tool, per https://blog.github.com/2017-11-07-github-welcomes-all-ci-tools/, and it's free.
It would be good to run the build on every pull request.
An admin of the TonY project (@oliverhu or @chriseppstein) can set it up here: https://travis-ci.org/linkedin/TonY
2018-12-12 19:43:26 ERROR TonyApplicationMaster:935 - [2018-12-12 19:43:25.607]Container [pid=12069,containerID=container_1544604976318_0003_01_000003] is running 22081205248B beyond the 'VIRTUAL' memory limit. Current usage: 979.3 MB of 2 GB physical memory used; 24.8 GB of 4.2 GB virtual memory used. Killing container.
How do I configure the 'VIRTUAL' memory limit in TonY or on YARN? The virtual memory usage of my process seems very large.
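The virtual memory limit comes from standard YARN NodeManager settings rather than from TonY: the limit is the container's physical memory times yarn.nodemanager.vmem-pmem-ratio (default 2.1, which matches the 4.2 GB limit for a 2 GB container in the log above). You can raise the ratio or disable the check entirely in yarn-site.xml:

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <!-- Alternatively, keep the check but allow more virtual memory per unit of physical memory -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>10</value>
</property>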
Hi,
I came across this great work just recently. I had a lot of issues using TensorflowOnSpark and TensorflowOnYARN earlier this year and had given up. I'm wondering how I can make use of this repo on top of my Cloudera Distribution of Hadoop. Any help is appreciated.
Thanks !
Mohammed Ayub
Hello,
I am having the following issue when trying to launch a job:
java.lang.RuntimeException: Failed to get FS delegation token for default FS.
at com.linkedin.tony.TonyClient.getTokens(TonyClient.java:555)
at com.linkedin.tony.TonyClient.run(TonyClient.java:177)
at com.linkedin.tony.TonyClient.start(TonyClient.java:716)
at com.linkedin.tony.TonyClient.start(TonyClient.java:703)
at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:54)
I went to the code and see a call to fs.getDelegationToken(tokenRenewer). However, I don't see such a method in the FileSystem API, so I am not sure what I should do next.
Thanks in advance for the help provided!
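For reference, FileSystem.getDelegationToken(String renewer) does exist on org.apache.hadoop.fs.FileSystem in Hadoop 2.x; it returns a Token<?> (or null when the filesystem issues no tokens, e.g. with security disabled). A minimal sketch of fetching one, assuming a Kerberos-secured cluster (the renewer principal below is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;

public class TokenCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The renewer is typically the RM principal (yarn.resourcemanager.principal); placeholder here.
    Token<?> token = fs.getDelegationToken("rm/_HOST@EXAMPLE.COM");
    if (token == null) {
      // Usually means security is disabled or the default FS does not issue delegation tokens.
      System.out.println("No delegation token issued; is Kerberos enabled on the default FS?");
    } else {
      Credentials credentials = new Credentials();
      credentials.addToken(token.getService(), token);
      System.out.println("Got delegation token for service: " + token.getService());
    }
  }
}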
Currently, the startTHS.sh script uses exec to start the THS, so if the SSH session in which the THS was started dies, the THS process itself dies. We should use nohup so the THS continues running even after the SSH session is closed. (Alternatively, we could run the THS as a background process.)
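A minimal sketch of the change, with $THS_START_CMD standing in for whatever command startTHS.sh actually execs (the script internals are assumed here):

# Before: THS dies when the SSH session dies.
exec "$THS_START_CMD"

# After: THS keeps running once the session closes, with output captured.
nohup "$THS_START_CMD" > ths.log 2>&1 &
echo $! > ths.pid   # save the PID so the server can be stopped later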
Currently, the README.md still mentions the old tony.application.insecure-mode property, which was changed to tony.application.security.enabled in #14.
Currently, MiniTony is only used in our unit tests. We should provide documentation on how other folks can leverage it to iterate on their model code faster, without submitting the code to a remote cluster.
Right now we can only localize one hdfs classpath; we should be able to pull resources from multiple hdfs_classpath locations. The flag should also be renamed to hdfs_resources.
Will be adding support in Google Cloud DataProc for TonY as part of the initialization actions.
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
This issue is to track this effort.
Part of making #93 easier. The user should be able to pass in --resources SCHEMA://PATH/TO/RESOURCES instead of having to add resources through the local file system.
Example:
@Test
public void testTonyResourcesFlag() throws ParseException {
  conf.setBoolean(TonyConfigurationKeys.IS_SINGLE_NODE, false);
  client = new TonyClient(conf);
  client.init(new String[]{
      "--executes", "'/bin/cat log4j.properties'",
      "--hdfs_classpath", "/yarn/libs",
      "--container_env", Constants.SKIP_HADOOP_PATH + "=true",
      "--conf", "tony.worker.resources=/yarn/libs",
      "--conf", "tony.ps.instances=0",
  });
  int exitCode = client.start();
  Assert.assertEquals(exitCode, 0);
}
Support PyTorch.
Tried to build TonY and run the mnist-tensorflow example, but got the error message "ERROR tony.TonyClient: Application failed to complete successfully". There is no clear error in the Hadoop logs. Furthermore, although the TonY build succeeded, I didn't find the tony folder or the tony.xml configuration. Thanks in advance for any help.
My configurations are:
A Hadoop cluster (4 nodes: 1 master and 3 slaves) on Virtualbox.
TonY: 0.1.3
Hadoop: 2.9.1
Tensorflow: 1.9.0
The printed out info:
18/11/05 14:49:59 INFO tony.TonyClient: TonY heartbeat interval [1000]
18/11/05 14:49:59 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
18/11/05 14:49:59 INFO tony.TonyClient: Starting client..
18/11/05 14:49:59 INFO client.RMProxy: Connecting to ResourceManager at /192.168.56.100:8032
18/11/05 14:50:05 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
18/11/05 14:50:05 INFO tony.TonyClient: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --python_binary_path /home/rui/venv/bin/python --python_venv /home/rui/venv.zip --executes /home/rui/TonY/tony-examples/mnist/mnist_distributed.py --hdfs_classpath hdfs://192.168.56.100:9000/user/rui/.tony/1adf67c5-3be7-4245-8a31-3c9204ae84a8 --container_env TONY_CONF_PATH=hdfs://192.168.56.100:9000/user/rui/.tony/application_1541449424539_0001/tony-final.xml --container_env TONY_CONF_TIMESTAMP=1541451005299 --container_env TF_ZIP_LENGTH=102664099 --container_env TF_ZIP_TIMESTAMP=1541451005184 --container_env TF_ZIP_PATH=hdfs://192.168.56.100:9000/user/rui/.tony/application_1541449424539_0001/tf.zip --container_env TONY_CONF_LENGTH=3200 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>{{HADOOP_CONF_DIR}}<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/*<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/lib/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/lib/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/lib/* 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
18/11/05 14:50:05 INFO tony.TonyClient: Submitting YARN application
18/11/05 14:50:05 INFO impl.YarnClientImpl: Submitted application application_1541449424539_0001
18/11/05 14:50:05 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://tf-yarn-master:8088/proxy/application_1541449424539_0001/
18/11/05 14:50:05 INFO tony.TonyClient: ResourceManager web address for application: http://192.168.56.100:8088/cluster/app/application_1541449424539_0001
18/11/05 14:50:11 INFO tony.TonyClient: AM host: tf-yarn-slave3
18/11/05 14:50:11 INFO tony.TonyClient: AM RPC port: 14925
18/11/05 14:50:11 INFO client.RMProxy: Connecting to ResourceManager at /192.168.56.100:8032
18/11/05 14:50:13 INFO tony.TonyClient: Logs for ps 0 at: http://tf-yarn-slave2:8042/node/containerlogs/container_1541449424539_0001_01_000002/rui
18/11/05 14:50:13 INFO tony.TonyClient: Logs for worker 0 at: http://tf-yarn-slave3:8042/node/containerlogs/container_1541449424539_0001_01_000003/rui
18/11/05 14:50:21 INFO tony.TonyClient: Application finished unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring loop : ApplicationId:1
18/11/05 14:50:22 ERROR tony.TonyClient: Application failed to complete successfully
TaskExecutor uses the -python_venv value passed by the user as the path for locating the Python virtual environment, which is wrong: the venv will always be at the root folder, so we should make this a constant.
Is there some feature that is strictly not supported on lower versions of Hadoop? I tried to build TonY on Hadoop 2.6.0 and got the following:
$ ./gradlew build -x test
> Task :tony-core:compileJava
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: /home/xxx/TonY/tony-core/src/main/java/com/linkedin/tony/rpc/impl/ApplicationRpcClient.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Unknown file extension: tony-core/src/main/resources/META-INF/services/org.apache.hadoop.security.SecurityInfo
Unknown file extension: tony-core/src/test/resources/test.tar
Unknown file extension: tony-core/src/test/resources/test.tar.gz
Unknown file extension: tony-core/src/test/resources/test.zip
> Task :tony-history-server:compilePlayBinaryScala
Pruning sources from previous analysis, due to incompatible CompileSetup.
> Task :tony-history-server:compilePlayBinaryTests
Pruning sources from previous analysis, due to incompatible CompileSetup.
Note: /home/xxx/TonY/tony-history-server/test/utils/TestHdfsUtils.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
> Task :tony-history-server:testPlayBinary
controllers.BrowserTest > test FAILED
java.lang.RuntimeException
Caused by: akka.stream.impl.io.ConnectionSourceStage$$anon$2$$anon$1
Caused by: java.net.BindException
12 tests completed, 1 failed
> Task :tony-history-server:testPlayBinary FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':tony-history-server:testPlayBinary'.
> There were failing tests. See the report at: file:///home/xxx/TonY/tony-history-server/build/playBinary/reports/test/index.html
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
* Get more help at https://help.gradle.org
Deprecated Gradle features were used in this build, making it incompatible with Gradle 5.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/4.10.2/userguide/command_line_interface.html#sec:command_line_warnings
BUILD FAILED in 27s
45 actionable tasks: 40 executed, 5 up-to-date
Currently, there are a couple of timeouts involved in worker/parameter server registration:
tony.task.registration-timeout-sec (default 300 sec)
return Utils.pollTillNonNull(() ->
    proxy.registerWorkerSpec(jobName + ":" + taskIndex,
        InetAddress.getLocalHost().getHostName() + ":" + rpcPort), 3, 120);
If there are large container scheduling/start-up delays, jobs can fail because of these. We should remove the timeouts entirely; we then also don't need the tony.task.registration-retry-count property.
Currently, EventHandler.run() spins inside this loop:
while (!isStopped) {
  writeEvent(eventQueue, dataFileWriter);
}
because writeEvent uses poll(), which returns null immediately when the queue is empty.
We should update writeEvent() to use take() instead, and have the stop() method interrupt the event-handler thread (writeEvent() should catch the resulting InterruptedException).
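A minimal sketch of the proposed change (the class shape is simplified and the names follow the issue text, not necessarily the actual source):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EventHandler implements Runnable {
  private final BlockingQueue<Object> eventQueue = new LinkedBlockingQueue<>();
  private volatile boolean isStopped = false;
  private Thread thread;

  public void start() {
    thread = new Thread(this, "event-handler");
    thread.start();
  }

  @Override
  public void run() {
    while (!isStopped) {
      writeEvent();
    }
  }

  private void writeEvent() {
    try {
      Object event = eventQueue.take();   // blocks until an event is available, no busy spin
      // ... serialize the event with dataFileWriter ...
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore the flag; run() will observe isStopped and exit
    }
  }

  public void stop() {
    isStopped = true;
    thread.interrupt();                   // wake the handler if it is blocked in take()
  }
}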
TonY clients and the TonY History Server need to use the same value for the location of the history files. The client needs to tell the TonY AM to write to that location and the history server needs to read from that location.
We should define this location in a tony-site.xml file and expect it to be in the directory pointed to by the TONY_CONF_DIR environment variable, which can be set before running TonY clients or the history server.
See #81 (comment) for more context.
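A sketch of what such a tony-site.xml could contain, reusing the tony.history.location key discussed in a later issue (the path is illustrative):

<configuration>
  <property>
    <name>tony.history.location</name>
    <value>hdfs://namenode:9000/jobs/tony-history</value>
  </property>
</configuration>

Both the client and the history server would then pick it up after something like export TONY_CONF_DIR=/etc/tony/conf.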
TonY should inject build metadata (version number, commit hash, user, date, etc.) into the Configuration created by the TonyClient so that this info is available in the config.xml for TonY applications.
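A minimal sketch of the idea; the key names and BuildMetadata constants are illustrative, and in practice the values would be generated at build time rather than hardcoded:

import org.apache.hadoop.conf.Configuration;

public class BuildMetadata {
  // Placeholder values; a real implementation would generate these during the build.
  static final String VERSION = "0.1.5";
  static final String GIT_COMMIT = "abc1234";

  // Inject build metadata so it shows up in the application's config.xml on HDFS.
  public static void inject(Configuration tonyConf) {
    tonyConf.set("tony.build.version", VERSION);
    tonyConf.set("tony.build.commit", GIT_COMMIT);
    tonyConf.set("tony.build.user", System.getProperty("user.name"));
    tonyConf.set("tony.build.date", java.time.Instant.now().toString());
  }
}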
Unable to run the mnist example in Dataproc.
sudo java -cp `hadoop classpath`:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/usr/local/src/MyJob/venv.zip \
--src_dir=/usr/local/src/TonY/mnist/ \
--executes=/usr/local/src/TonY/mnist/src/mnist_distributed.py \
--conf_file=/usr/local/src/tony.xml \
--python_binary_path=venv/bin/python3.5
18/11/11 08:23:02 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/11/11 08:23:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/11/11 08:23:02 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, null/core-site.xml, null/hdfs-site.xml
Nov 11, 2018 8:23:02 AM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase <clinit>
INFO: GHFS version: hadoop2-1.9.8
18/11/11 08:23:03 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/11/11 08:23:03 INFO cli.ClusterSubmitter: Copying /usr/local/src/MyJob/tony-cli-0.1.5-all.jar to: hdfs://tony-dev-m/user/root/.tony/6665ca2a-fd31-4f61-a947-33f895517302
Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Ljava.lang.String;
at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:60)
hadoop version
Hadoop 2.9.0
Subversion Unknown -r Unknown
Compiled by bigtop on 2018-08-17T12:00Z
Compiled with protoc 2.5.0
From source with checksum f510b6e8bafb2ddfd660aeb7454e7c30
This command was run using /usr/lib/hadoop/hadoop-common-2.9.0.jar
Java version
java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
Command run:
java -cp `hadoop classpath`:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/usr/local/src/MyJob/venv.zip \
--src_dir=/usr/local/src/TonY/mnist/ \
--executes=/usr/local/src/TonY/mnist/src/mnist_distributed.py \
--conf_file=/usr/local/src/tony.xml \
--python_binary_path=venv/bin/python3.5
Directory structure:
.
├── src
│ └── mnist_distributed.py
├── tony-cli-0.1.5-all.jar
├── tony.xml
└── venv.zip
tony.xml contents:
<configuration>
<property>
<name>tony.application.security.enabled</name>
<value>false</value>
</property>
<property>
<name>tony.worker.instances</name>
<value>2</value>
</property>
<property>
<name>tony.worker.memory</name>
<value>15g</value>
</property>
<property>
<name>tony.worker.gpus</name>
<value>0</value>
</property>
<property>
<name>tony.ps.memory</name>
<value>3g</value>
</property>
</configuration>
Currently, history files are retained forever. The retention period should be configurable and TonY Portal should take care of enforcing retention.
As part of retention, we can also clean up in-progress files that are older than the retention period. (These are probably jobs that crashed or encountered other abnormal conditions.)
Internal Jira: LIHADOOP-43855
TonY hangs if a worker is killed due to OOM
Java version:1.8.0_18
tensorflow-gpu: 1.9
Running the following command:
./gradlew build
produces the following error:
com.linkedin.tony.TestTonyE2E.setup FAILED
java.lang.IllegalArgumentException: The value of property bind.address must not be null
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.hadoop.http.HttpServer2.initializeWebServer(HttpServer2.java:585)
at org.apache.hadoop.http.HttpServer2.(HttpServer2.java:537)
at org.apache.hadoop.http.HttpServer2.(HttpServer2.java:117)
at org.apache.hadoop.http.HttpServer2$Builder.build(HttpServer2.java:421)
at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:160)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:869)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:691)
at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937)
at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1308)
at org.apache.hadoop.hdfs.MiniDFSCluster.configureNameService(MiniDFSCluster.java:1077)
at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:952)
at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:884)
at org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:517)
at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:476)
at com.linkedin.minitony.cluster.MiniCluster.start(MiniCluster.java:50)
at com.linkedin.tony.TestTonyE2E.setup(TestTonyE2E.java:34)
What is the cause? I have checked every Hadoop configuration item several times and they are all correct.
Also, when using GPUs, running the following command:
java -cp `hadoop classpath`:/TonY/tony-cli/build/libs/tony-cli-0.1.3-all.jar com.linkedin.tony.cli.LocalSubmitter \
--python_venv=/venv.zip \
--src_dir=/TonY/tony-examples/mnist \
--executes=/TonY/tony-examples/mnist/mnist_distributed.py \
--conf_file=/path/tony-test.xml \
--python_binary_path=venv/bin/python
produces an error that libcublas.so.9.0 cannot be found. CUDA and cuDNN were set up earlier with no problems: TF runs fine, including inside this virtual environment, but running the command above fails. Thanks.
See #62 (comment) for context.
Tried to install TonY on Google Cloud via Dataproc, but tests failed during build.
Setup:
Operating System:
Debian GNU/Linux 8
Attached logs with -debug information.
#yarn node -list
18/11/03 21:56:40 INFO client.RMProxy: Connecting to ResourceManager at tony-m/10.138.0.4:8032
Total Nodes:2
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
tony-w-0.c.dpe.internal:33607 RUNNING tony-w-0.c.dpe.internal:8042 11
tony-w-1.c.dpe.internal:44563 RUNNING tony-w-1.c.dpe.internal:8042 9
#hadoop version
Hadoop 2.8.4
Subversion Unknown -r Unknown
Compiled by bigtop on 2018-08-09T10:27Z
Compiled with protoc 2.5.0
From source with checksum 373fbec5524db42be27f1396ffbd2fc6
This command was run using /usr/lib/hadoop/hadoop-common-2.8.4.jar
[build.log](https://github.com/linkedin/TonY/files/2545362/build.log)
#java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
#echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64
When running sudo ./gradlew build --stacktrace:
> Task :tony-core:test
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testNullAMRpcClient FAILED
java.lang.AssertionError at TestTonyE2E.java:268
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testPSSkewedWorkerTrainingShouldPass FAILED
java.lang.AssertionError at TestTonyE2E.java:110
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testPSWorkerTrainingShouldPass FAILED
java.lang.AssertionError at TestTonyE2E.java:127
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testSingleNodeTrainingShouldPass FAILED
java.lang.AssertionError at TestTonyE2E.java:73
27 tests completed, 4 failed
> Task :tony-core:test FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':tony-core:test'.
> There were failing tests. See the report at: file:///usr/local/src/TonY/tony-core/build/reports/tests/test/index.html
* Try:
Run with --info or --debug option to get more log output. Run with --scan to get full insights.
TestEventHandler.parseEvents() duplicates much of the code in ParserUtils.parseEvents(). Instead of duplicating the code, we should move ParserUtils from tony-history-server into the tony-core module (to avoid a cyclic dependency). For more context, see https://github.com/linkedin/TonY/pull/117/files#r241235693
The MNIST distributed example uses asynchronous training, so the workers should not need to communicate with each other (only the ps), and one worker finishing should not cause the other workers to hang.
Currently, tony-default.xml contains tony.history.location whereas other places use tony.historyFolder. We should standardize on one. Looking at Hadoop configs, all.lowercase.period.separated appears to be the standard naming convention, rather than camelCase in config names.
We should also avoid hardcoding tony.historyFolder in multiple places; instead, define a String constant once and use it everywhere.
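For example, a single definition in TonyConfigurationKeys (the constant name here is illustrative):

public class TonyConfigurationKeys {
  // Single source of truth for the history location key; reference this instead of string literals.
  public static final String TONY_HISTORY_LOCATION = "tony.history.location";
}

Call sites would then use conf.get(TonyConfigurationKeys.TONY_HISTORY_LOCATION) rather than repeating the literal.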
When running Dataproc on Google Cloud, it would be ideal to keep files in a company GCS bucket, private or public.
Since some GCS buckets are not public, it may also be necessary to pass credentials (a JSON file) via a separate parameter.
Code sample:
gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars gs://tony-staging/tony-cli-0.1.5-all.jar -- \
--python_venv=gs://tony-staging/env/tf19.zip \
--src_dir=gs://tony-staging/tony/mnist/src/ \
--executes=gs://tony-staging/tony/mnist/src/mnist_distributed.py \
--conf_file=gs://tony-staging/tony/conf/tony.xml \
--python_binary_path=tf19/bin/python3.5
Related to #74
This is a very common error. Right now the AM just prints the diagnostics with error code -104 (YARN's code for a container killed for exceeding its physical memory limit), which doesn't make sense unless you have memorized what the error codes mean.
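A sketch of friendlier diagnostics; the code-to-meaning mapping below follows YARN's ContainerExitStatus constants:

public class ExitStatusExplainer {
  // Translate common YARN ContainerExitStatus values into human-readable messages.
  public static String explain(int exitStatus) {
    switch (exitStatus) {
      case -100: return "Container released or aborted by the framework (ABORTED)";
      case -101: return "Container failed because local disks failed (DISKS_FAILED)";
      case -102: return "Container preempted by the scheduler (PREEMPTED)";
      case -103: return "Container killed for exceeding its virtual memory limit (KILLED_EXCEEDED_VMEM)";
      case -104: return "Container killed for exceeding its physical memory limit (KILLED_EXCEEDED_PMEM)";
      default:   return "Container exited with status " + exitStatus;
    }
  }
}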
Massage CLUSTER_SPEC, TASK_INDEX, and JOB_NAME into a TF_CONFIG env variable to support the Estimator API.
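For reference, TF_CONFIG is a JSON blob of the form {"cluster": {...}, "task": {"type": ..., "index": ...}}. A minimal sketch of assembling it from the pieces named above (the helper shape is illustrative, not TonY's actual code):

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TfConfigBuilder {
  // Build the TF_CONFIG JSON expected by tf.estimator, e.g.
  // {"cluster": {"ps": ["h1:2222"], "worker": ["h2:2222"]}, "task": {"type": "worker", "index": 0}}
  public static String build(Map<String, List<String>> clusterSpec, String jobName, int taskIndex) {
    StringBuilder sb = new StringBuilder("{\"cluster\": {");
    boolean firstJob = true;
    for (Map.Entry<String, List<String>> entry : clusterSpec.entrySet()) {
      if (!firstJob) {
        sb.append(", ");
      }
      firstJob = false;
      sb.append('"').append(entry.getKey()).append("\": [");
      List<String> hosts = entry.getValue();
      for (int i = 0; i < hosts.size(); i++) {
        if (i > 0) {
          sb.append(", ");
        }
        sb.append('"').append(hosts.get(i)).append('"');
      }
      sb.append(']');
    }
    sb.append("}, \"task\": {\"type\": \"").append(jobName)
        .append("\", \"index\": ").append(taskIndex).append("}}");
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, List<String>> spec = new LinkedHashMap<>();
    spec.put("ps", Arrays.asList("host1:2222"));
    spec.put("worker", Arrays.asList("host2:2222", "host3:2222"));
    System.out.println(build(spec, "worker", 0)); // would be exported as TF_CONFIG for the task
  }
}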
We generate appid-tony.zip and appid-tony-final.xml and leave them on the client; we should clean them up after the job has finished.
2018-11-12 22:03:02 INFO TonyClient:198 - Submitting YARN application
2018-11-12 22:03:03 FATAL TonyClient:776 - Failed to run TonyClient
org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1541916949981_266668 to YARN : Unauthorized connection for super-user: rm/[email protected] from IP xx.xx.xx.xx
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:272)
at com.linkedin.tony.TonyClient.run(TonyClient.java:199)
at com.linkedin.tony.TonyClient.start(TonyClient.java:774)
at com.linkedin.tony.TonyClient.start(TonyClient.java:762)
at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:76)
2018-11-12 22:03:03 ERROR TonyClient:786 - Application failed to complete successfully
If the chief worker fails, we should immediately fail the application (the underlying TF distributed training may hang, so TonY should just fail the application).
Other workers failing is a separate issue. In theory, training can continue if they fail, but this can be made configurable.
When running locally with a Python script that exits 1 (mocking a failed worker), the TonY AM only fails when tony.application.single-node is set to true. We expect that when a non-chief worker fails, the TonY AM should continue training but still report failure, regardless of training mode.
With #13, notebooks are another way to submit a TonY job, and we should include this in the README.
Unable to run PyTorch sample code. Task is stuck in "RUNNING"
Setup
GCP DataProc
Version
Hadoop 2.9.0
Subversion https://bigdataoss-internal.googlesource.com/third_party/apache/hadoop -r e8ce80c37eebb173fc688e7f5686d7df74d182aa
Compiled by bigtop on 2018-10-25T12:56Z
Compiled with protoc 2.5.0
From source with checksum 1eb388d554db8e1cadcab4c1326ee72
This command was run using /usr/lib/hadoop/hadoop-common-2.9.0.jar
ML framework versions
PyTorch 0.4.0
Python 3.5
tony.xml
<configuration>
<property>
<name>tony.application.name</name>
<value>PyTorch</value>
</property>
<property>
<name>tony.application.security.enabled</name>
<value>false</value>
</property>
<property>
<name>tony.worker.instances</name>
<value>2</value>
</property>
<property>
<name>tony.worker.memory</name>
<value>4g</value>
</property>
<property>
<name>tony.worker.gpus</name>
<value>0</value>
</property>
<property>
<name>tony.ps.memory</name>
<value>4g</value>
</property>
<property>
<name>tony.application.framework</name>
<value>pytorch</value>
</property>
</configuration>
#yarn application -list -appStates ALL
18/11/20 07:06:41 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.138.0.2:8032
18/11/20 07:06:42 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.138.0.2:10200
The application state AL is invalid.
The valid application state can be one of the following: ALL,NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING,FINISHED,FAILED,KILLED
(torch04) root@tony-staging-m:/usr/local/src/jobs/PTJob# yarn application -list -appStates ALL
18/11/20 07:06:46 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.138.0.2:8032
18/11/20 07:06:47 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.138.0.2:10200
Total number of applications (application-types: [], states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED] and tags: []):9
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1542587994073_0009 TensorFlowApplication TENSORFLOW root default KILLED KILLED 100% http://tony-staging-m:8188/applicationhistory/app/application_1542587994073_0009
application_1542587994073_0010 TensorFlowApplication TENSORFLOW root default FINISHED FAILED 100% N/A
application_1542587994073_0015 PyTorch TENSORFLOW root default RUNNING UNDEFINED 0% N/A
Logs from:
node/containerlogs/container_1542587994073_0015_01_000002/root
Code fails in:
executor.taskIndex = Integer.parseInt(System.getenv(Constants.TASK_INDEX));
stderr
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/root/filecache/36/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:542)
at java.lang.Integer.parseInt(Integer.java:615)
at com.linkedin.tony.TaskExecutor.main(TaskExecutor.java:109)
stdout
2018-11-20 06:46:37 INFO TaskExecutor:89 - TaskExecutor is running..
2018-11-20 06:46:37 INFO TaskExecutor:83 - Reserved rpcPort: 43073
2018-11-20 06:46:37 INFO TaskExecutor:84 - Reserved tbPort: 37633
2018-11-20 06:46:37 INFO TaskExecutor:85 - Reserved py4j gatewayServerPort: 35571
2018-11-20 06:46:37 INFO TaskExecutor:175 - Task command: venv/torch04/bin/python3.5 /usr/local/src/jobs/PTJob/src/mnist_distributed.py --root /tmp/data/
2018-11-20 06:46:37 INFO Utils:132 - Unzipping tony.zip to destination ./
2018-11-20 06:46:39 INFO TaskExecutor:184 - Setting up Rpc client, connecting to: tony-staging-w-0.c.dpe-cloud-mle.internal:11616
2018-11-20 06:46:39 INFO TaskExecutor:102 - Unpacking Python virtual environment: /usr/local/src/jobs/PTJob/env/torch04.zip
2018-11-20 06:46:39 INFO Utils:132 - Unzipping /usr/local/src/jobs/PTJob/env/torch04.zip to destination venv
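The NumberFormatException above comes from Integer.parseInt(System.getenv(Constants.TASK_INDEX)) receiving null, i.e. TASK_INDEX was never set in the container environment. A hedged sketch of a more defensive parse (the message wording is illustrative):

public class TaskIndexParser {
  // Fail with an actionable message instead of "NumberFormatException: null"
  // when TASK_INDEX is missing from the container environment.
  static int parseTaskIndex() {
    String taskIndexEnv = System.getenv("TASK_INDEX");
    if (taskIndexEnv == null) {
      throw new IllegalStateException(
          "TASK_INDEX is not set in the container environment; "
              + "it should be set by the TonY AM when launching the task.");
    }
    return Integer.parseInt(taskIndexEnv);
  }
}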
With #67, we'll support running TonY with Docker images. The prerequisite is a properly configured cluster that is capable of running YARN applications in Docker containers. Personal experience with that: https://medium.com/@oliver_hu/enable-hadoop-yarn-2-9-1-3-0-3-1-to-launch-application-using-docker-containers-1442a639bb64.
With this change, you should be able to launch your training jobs without zipping the Python virtual env anymore.
To enable TonY to launch your jobs in Docker, set:
<property>
<description>Whether we use Docker containers to launch the tasks</description>
<name>tony.docker.enabled</name>
<value>true</value>
</property>
<property>
<description>The Docker image used to launch the tasks (set this to your own image)</description>
<name>tony.docker.image</name>
<value>oliverhu/hadoop-base</value>
</property>
I am trying to follow the mnist-tensorflow example in tony-examples, but when I run the following command, my containers exit with code 132 and I can't figure out why, which really confuses me. Any ideas?
java version: 1.8.0_181
Hadoop version: 3.1.1
java -cp "`hadoop classpath --glob`:tony/*:tony" \
com.linkedin.tony.cli.ClusterSubmitter \
-executes src/models/mnist_distributed.py \
-python_venv env.zip \
-python_binary_path env/bin/python \
-src_dir src \
-shell_env LD_LIBRARY_PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server
tony.xml
<configuration>
<property>
<name>tony.application.hdfs-conf-path</name>
<value>/home/hadoop/hadoop/etc/hadoop/hdfs-site.xml</value>
</property>
<property>
<name>tony.application.yarn-conf-path</name>
<value>/home/hadoop/hadoop/etc/hadoop/yarn-site.xml</value>
</property>
<property>
<name>tony.application.security.enabled</name>
<value>false</value>
</property>
</configuration>
the console:
2018-10-17 08:27:49,932 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/distribute/tony/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-10-17 08:27:50,132 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, null/core-site.xml, null/hdfs-site.xml
2018-10-17 08:27:50,465 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-10-17 08:27:51,887 INFO cli.ClusterSubmitter: Copying /home/hadoop/distribute/tony/tony-cli-0.1.3-all.jar to: hdfs://localhost:9000/user/hadoop/.tony/ffae84e0-edd3-444a-9148-a25124a3e7bc
2018-10-17 08:27:53,753 INFO tony.TonyClient: TonY heartbeat interval [1000]
2018-10-17 08:27:53,753 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
2018-10-17 08:27:53,790 INFO tony.TonyClient: Starting client..
2018-10-17 08:27:53,796 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-10-17 08:27:54,163 INFO conf.Configuration: resource-types.xml not found
2018-10-17 08:27:54,164 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-10-17 08:28:08,606 INFO tony.TonyClient: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --python_binary_path env/bin/python --python_venv env.zip --executes src/models/mnist_distributed.py --hdfs_classpath hdfs://localhost:9000/user/hadoop/.tony/ffae84e0-edd3-444a-9148-a25124a3e7bc --shell_env LD_LIBRARY_PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server --container_env TONY_CONF_PATH=hdfs://localhost:9000/user/hadoop/.tony/application_1539761891085_0002/tony-final.xml --container_env YARN_CONF_PATH=home/hadoop/hadoop/etc/hadoop/yarn-site.xml --container_env TONY_CONF_TIMESTAMP=1539764888557 --container_env TONY_CONF_LENGTH=3659 --container_env TONY_ZIP_PATH=hdfs://localhost:9000/user/hadoop/.tony/application_1539761891085_0002/tony.zip --container_env TONY_ZIP_LENGTH=154330934 --container_env TONY_ZIP_TIMESTAMP=1539764888067 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>{{HADOOP_CONF_DIR}}<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/*<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/lib/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/lib/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/lib/* --container_env HDFS_CONF_PATH=home/hadoop/hadoop/etc/hadoop/hdfs-site.xml 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
2018-10-17 08:28:08,607 INFO tony.TonyClient: Submitting YARN application
2018-10-17 08:28:08,712 INFO impl.YarnClientImpl: Submitted application application_1539761891085_0002
2018-10-17 08:28:08,718 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://HP-DL580-G7:8088/proxy/application_1539761891085_0002/
2018-10-17 08:28:08,719 INFO tony.TonyClient: ResourceManager web address for application: http://0.0.0.0:8088/cluster/app/application_1539761891085_0002
2018-10-17 08:28:14,764 INFO tony.TonyClient: AM host: HP-DL580-G7
2018-10-17 08:28:14,764 INFO tony.TonyClient: AM RPC port: 13923
2018-10-17 08:28:14,770 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-10-17 08:28:19,025 INFO tony.TonyClient: Logs for ps 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539761891085_0002_01_000002/hadoop
2018-10-17 08:28:19,026 INFO tony.TonyClient: Logs for worker 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539761891085_0002_01_000003/hadoop
2018-10-17 08:28:36,135 INFO tony.TonyClient: Application finished unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring loop : ApplicationId:2
2018-10-17 08:28:36,200 ERROR tony.TonyClient: Application failed to complete successfully
the amstdout.log:
2018-10-17 08:40:07 INFO TonyApplicationMaster:145 - Logs for worker 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539765357563_0002_01_000003/hadoop
2018-10-17 08:40:07 INFO TonyApplicationMaster:909 - Successfully started container container_1539765357563_0002_01_000003
2018-10-17 08:40:08 INFO TonyApplicationMaster:728 - Client requesting TaskUrls!
2018-10-17 08:40:09 INFO TonyApplicationMaster:528 - Completed worker tasks: 0, total worker tasks: 1
2018-10-17 08:40:14 INFO TonyApplicationMaster:528 - Completed worker tasks: 0, total worker tasks: 1
2018-10-17 08:40:18 INFO TonyApplicationMaster:770 - Received cluster spec registration request from task ps:0 with spec: HP-DL580-G7:33439
2018-10-17 08:40:18 INFO TonyApplicationMaster:783 - [ps:0] Received Registration for HB !!
2018-10-17 08:40:18 INFO TonyApplicationMaster:795 - Received registrations from 1 tasks, awaiting registration from 1 tasks.
2018-10-17 08:40:18 INFO TonyApplicationMaster:797 - Awaiting registration from task worker 0 in container_1539765357563_0002_01_000003 on host HP-DL580-G7
2018-10-17 08:40:19 INFO TonyApplicationMaster:770 - Received cluster spec registration request from task worker:0 with spec: HP-DL580-G7:35215
2018-10-17 08:40:19 INFO TonyApplicationMaster:783 - [worker:0] Received Registration for HB !!
2018-10-17 08:40:19 INFO TonyApplicationMaster:789 - All 2 tasks registered.
2018-10-17 08:40:19 INFO TonyApplicationMaster:831 - Got request to update TensorBoard URL: HP-DL580-G7:45163
2018-10-17 08:40:19 WARN TonyApplicationMaster:850 - This Hadoop version doesn't have the YARN-7974 patch, TonY won't register TensorBoard URL withapplication's tracking URL
2018-10-17 08:40:19 INFO TonyApplicationMaster:528 - Completed worker tasks: 0, total worker tasks: 1
2018-10-17 08:40:20 INFO TonyApplicationMaster:811 - Received result registration request with exit code 132 from worker 0
2018-10-17 08:40:21 INFO TonyApplicationMaster:789 - All 2 tasks registered.
2018-10-17 08:40:21 INFO TonyApplicationMaster:944 - Completed containers: 1
2018-10-17 08:40:21 INFO TonyApplicationMaster:947 - ContainerID = container_1539765357563_0002_01_000003, state = COMPLETE, exitStatus = 132
2018-10-17 08:40:21 ERROR TonyApplicationMaster:952 - [2018-10-17 08:40:20.925]Exception from container-launch.
Container id: container_1539765357563_0002_01_000003
Exit code: 132
[2018-10-17 08:40:20.934]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[2018-10-17 08:40:20.935]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-10-17 08:40:21 INFO TonyApplicationMaster:961 - Unregister task [worker:0] from Heartbeat monitor..
2018-10-17 08:40:21 INFO TonyApplicationMaster:966 - Container failed, id = container_1539765357563_0002_01_000003
2018-10-17 08:40:22 INFO TonyApplicationMaster:811 - Received result registration request with exit code 132 from ps 0
2018-10-17 08:40:23 INFO TonyApplicationMaster:944 - Completed containers: 1
2018-10-17 08:40:23 INFO TonyApplicationMaster:947 - ContainerID = container_1539765357563_0002_01_000002, state = COMPLETE, exitStatus = 132
2018-10-17 08:40:23 ERROR TonyApplicationMaster:952 - [2018-10-17 08:40:23.168]Exception from container-launch.
Container id: container_1539765357563_0002_01_000002
Exit code: 132
[2018-10-17 08:40:23.175]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[2018-10-17 08:40:23.177]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-10-17 08:40:23 INFO TonyApplicationMaster:961 - Unregister task [ps:0] from Heartbeat monitor..
2018-10-17 08:40:23 INFO TonyApplicationMaster:966 - Container failed, id = container_1539765357563_0002_01_000002
2018-10-17 08:40:24 INFO TonyApplicationMaster:512 - Completed jobs: 1 total jobs: 1
2018-10-17 08:40:24 INFO TonyApplicationMaster:564 - Total completed worker tasks: 1, total worker tasks: 1
2018-10-17 08:40:24 INFO TonyApplicationMaster:570 - TensorFlow session failed: At least one job task exited with non-zero status, failedCnt=1
2018-10-17 08:40:24 INFO TonyApplicationMaster:335 - Result: false, job failed: true, retry count: 0
2018-10-17 08:40:25 INFO TonyApplicationMaster:837 - Client signals AM to finish application.
2018-10-17 08:40:29 INFO Utils:61 - Poll function finished within 30 seconds
2018-10-17 08:40:29 INFO TonyApplicationMaster:145 - Logs for ps 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539765357563_0002_01_000002/hadoop
2018-10-17 08:40:29 INFO TonyApplicationMaster:145 - Logs for worker 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539765357563_0002_01_000003/hadoop
2018-10-17 08:40:29 INFO TonyApplicationMaster:355 - Application Master failed. exiting
and the worker container:
2018-10-17 08:40:08 INFO TaskExecutor:86 - TaskExecutor is running..
2018-10-17 08:40:08 INFO TaskExecutor:80 - Reserved rpcPort: 35215
2018-10-17 08:40:08 INFO TaskExecutor:81 - Reserved tbPort: 45163
2018-10-17 08:40:08 INFO TaskExecutor:82 - Reserved py4j gatewayServerPort: 35421
2018-10-17 08:40:08 INFO TaskExecutor:178 - Task command: venv/env/bin/python src/models/mnist_distributed.py
2018-10-17 08:40:08 INFO Utils:109 - Unzipping tony.zip to destination ./
2018-10-17 08:40:10 INFO TaskExecutor:184 - Setting up Rpc client, connecting to: HP-DL580-G7:11789
2018-10-17 08:40:10 INFO TaskExecutor:96 - Unpacking Python virtual environment: env.zip
2018-10-17 08:40:10 INFO Utils:109 - Unzipping env.zip to destination venv
2018-10-17 08:40:19 INFO TaskExecutor:107 - Executor is running task worker 0
2018-10-17 08:40:19 INFO TaskExecutor:190 - Application Master address : HP-DL580-G7:11789
2018-10-17 08:40:19 INFO TaskExecutor:193 - ContainerId is: container_1539765357563_0002_01_000003 HostName is: HP-DL580-G7
2018-10-17 08:40:19 INFO TaskExecutor:201 - Connecting to HP-DL580-G7:11789 to register worker spec: worker 0 HP-DL580-G7:35215
2018-10-17 08:40:19 INFO Utils:82 - Poll function finished within 120 seconds
2018-10-17 08:40:19 INFO TaskExecutor:114 - Successfully registered and got cluster spec: {"ps":["HP-DL580-G7:33439"],"worker":["HP-DL580-G7:35215"]}
2018-10-17 08:40:19 INFO TaskExecutor:211 - TensorBoard address : HP-DL580-G7:45163
2018-10-17 08:40:19 INFO Utils:82 - Poll function finished within 60 seconds
2018-10-17 08:40:19 INFO TaskExecutor:214 - Register TensorBoard response: SUCCEEDED
2018-10-17 08:40:19 INFO Utils:210 - Executing command: venv/env/bin/python src/models/mnist_distributed.py
2018-10-17 08:40:20 INFO Utils:82 - Poll function finished within 60 seconds
2018-10-17 08:40:20 INFO TaskExecutor:223 - AM response for result execution run: RECEIVED
2018-10-17 08:40:20 INFO TaskExecutor:148 - Child process exited with exit code 132
Hadoop transitively brings in Jersey classes, which are CDDL-1.0-licensed. Since we do not use the CDDL 1.0 license, we should not include the Jersey classes in the tony-cli-all fat jar.
The mnist_distributed.py example doesn't work with TF 1.11. It complains that FLAGS doesn't have ports or logdir.
To make the THS page for a TonY application more easily accessible, the TonyClient could print a link to it when the application finishes.
Currently, TonY allocates the TensorBoard port on the chief worker (worker 0) and expects the chief worker task to read the port from the environment and launch the TensorBoard process. However, this can often cause the chief worker to fail with OutOfMemoryExceptions. To help mitigate this, TonY should allocate a separate container just for running TensorBoard. This would be more similar to what's generally done when running TensorFlow on Kubernetes: TensorBoard is started in a separate pod from the workers and parameter servers (see the training.yaml example here).
The TensorFlowJob type used to run TonY jobs should inject Azkaban metadata into the TonY configuration so that the Azkaban metadata (exec link, job link, project name, etc.) show up in the config.xml file written to HDFS by the TonY ApplicationMaster.
A job history server for TonY jobs.
Investigate how to spawn containers for JupyterHub's notebooks.
Currently, the Play tests BrowserTest and HomeControllerTest do not run as part of the build. For example, in https://api.travis-ci.org/v3/job/448123064/log.txt, we see
> Task :tony-history-server:testClasses UP-TO-DATE
Skipping task ':tony-history-server:testClasses' as it has no actions.
:tony-history-server:testClasses (Thread[Task worker for ':',5,main]) completed. Took 0.0 secs.
:tony-history-server:test (Thread[Task worker for ':',5,main]) started.
> Task :tony-history-server:test NO-SOURCE
Skipping task ':tony-history-server:test' as it has no source files and no previous output files.
:tony-history-server:test (Thread[Task worker for ':',5,main]) completed. Took 0.002 secs.
The Play tests are run as part of the testPlayBinary task, which the test task does NOT depend on.
Suppose each node in a cluster only has 4 GPUs. If a TonY application requests 5 GPUs per worker (tony.worker.gpus = 5), YARN will give TonY containers with 4 GPUs, but TonY will not start anything in those containers due to an NPE:
TonyApplicationMaster:1013 - Error java.lang.NullPointerException: Task was null! Nothing to schedule.
This comes from ContainerLauncher.run():
TFTask task = session.getMatchingTask(container.getAllocationRequestId());
Preconditions.checkNotNull(task, "Task was null! Nothing to schedule.");
Instead of hanging, TonY should probably either:
Internal JIRA: LIHADOOP-40976
When I run the mnist example, the ps process is still running after the TonY application finishes.
A comment in tensorflow/python/training/server_lib.py says: "This method currently blocks forever."
Should TonY do the cleanup?