
Comments (16)

lixusign commented on May 17, 2024

Hi, is anyone there?

from euler.

lixusign commented on May 17, 2024

Does the pip-installed Euler not support distributed training?

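yangsiran commented on May 17, 2024

@lixusign

  1. The command-line arguments include a --model_dir option. By default it is the ckpt folder under the current local directory (on worker 0), and the model is saved there; it is simply the TensorFlow checkpoint path.
  2. It is currently supported.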
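
Since --model_dir is described above as a TensorFlow checkpoint path, one way to confirm that a model was actually saved is to read the small text file named checkpoint that TensorFlow writes into that directory. The following is a minimal sketch in plain Python under that assumption; the helper name latest_checkpoint_prefix is hypothetical and is not part of Euler. With TensorFlow available, tf.train.latest_checkpoint(model_dir) would typically be used instead.

```python
import os
import re
from typing import Optional

# TensorFlow's checkpoint state file contains a line such as:
#   model_checkpoint_path: "model.ckpt-1000"
_CKPT_LINE = re.compile(r'^model_checkpoint_path:\s*"(.*)"\s*$')


def latest_checkpoint_prefix(model_dir: str) -> Optional[str]:
    """Return the latest checkpoint prefix recorded in model_dir, or None.

    Hypothetical helper for illustration; it only reads the text file
    named "checkpoint" and does not load any model.
    """
    state_file = os.path.join(model_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None
    with open(state_file, encoding="utf-8") as f:
        for line in f:
            match = _CKPT_LINE.match(line.strip())
            if match:
                path = match.group(1)
                # Relative entries are resolved against model_dir.
                if os.path.isabs(path):
                    return path
                return os.path.join(model_dir, path)
    return None


if __name__ == "__main__":
    # "ckpt" is the default directory mentioned in the answer above.
    print(latest_checkpoint_prefix("ckpt"))
```

Because the default directory is local to worker 0, this check would need to run on that machine.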


lixusign commented on May 17, 2024

Hi, if the pip-installed Euler cannot support HDFS, how can it support distributed training?

yangsiran commented on May 17, 2024

@lixusign The 0.1.0 package on PyPI does support HDFS. For newer versions, you need to build from source with the HDFS option enabled.

chengenbao commented on May 17, 2024

lixusign commented on May 17, 2024

Thank you all very much. I will try the 0.1.0 PyPI install first; building from source is a hassle, and it requires specific versions of many system dependencies.

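lixusign commented on May 17, 2024

Hello, below is the log from distributed training of GraphSAGE on PPI. What does it mean when it hangs at this point? I am using the pip-installed Euler 0.1.0 + HDFS 2.9.2 + ZooKeeper + TensorFlow 1.12, currently with 2 workers + 1 ps.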

I0218 11:22:33.431810 19041 graph_builder.cc:84] Thread 98, job size: 0

I0218 11:22:33.432725 19041 graph_builder.cc:84] Thread 99, job size: 0

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

I0218 11:22:33.970232 19075 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_train.id

I0218 11:22:34.135093 19070 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-walks.txt

I0218 11:22:34.300680 19069 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-id_map.json

I0218 11:22:34.424460 19067 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-class_map.json

I0218 11:22:34.482018 19074 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_test.id

I0218 11:22:34.498394 19076 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_val.id

I0218 11:22:36.113641 19073 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_meta.json

I0218 11:22:37.140648 19068 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-feats.npy

I0218 11:22:37.287214 19066 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-G.json

I0218 11:22:37.952857 19072 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_data.json

I0218 11:22:38.380856 19071 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_data.dat

I0218 11:22:38.395088 19041 graph_builder.cc:102] Done: build node sampler

I0218 11:22:38.395136 19041 graph_builder.cc:112] Graph build finish

I0218 11:22:38.395155 19041 graph_service.cc:179] service init finish

I0218 11:22:38.396541 19041 graph_service.cc:131] bound port: xxx:32804

W0218 11:22:38.448446 19041 graph.h:148] global sampler is not ok

I0218 11:22:38.451355 19041 graph_service.cc:146] service start

I0218 11:22:38.463107 19045 zk_server_monitor.cc:238] Online node: 0#ip:32804.

I0218 11:22:38.463485 18999 remote_graph.cc:106] Retrieve meta info success, shard number: 2

I0218 11:22:38.463508 18999 remote_graph.cc:119] Retrieve meta info success, partition number: 1

I0218 11:22:38.463533 18999 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: node_sum_weight, Meta Info: 44906.000000,6514.000000,5524.000000

I0218 11:22:38.463547 18999 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: edge_sum_weight, Meta Info:
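
A log like the one above is easier to reason about once it is summarized. The sketch below is an illustration, not an Euler tool: it assumes the lines follow the glog-style layout seen in the log (severity letter, date, time, thread id, source location, message), and the function name summarize_log is hypothetical. It collects the files reported as loaded, the shard and partition numbers the client retrieved, and any warning or error messages.

```python
import re
from typing import Dict, Iterable

# Layout seen above: "I0218 11:22:38.395136 19041 graph_builder.cc:112] message"
GLOG_LINE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s+(\d+) ([^\]]+)\] (.*)$"
)
LOAD_DONE = re.compile(r"^Load Done: (.+)$")
SHARD_NUM = re.compile(r"shard number: (\d+)")
PARTITION_NUM = re.compile(r"partition number: (\d+)")


def summarize_log(lines: Iterable[str]) -> Dict[str, object]:
    """Summarize glog-style lines; lines in other formats are skipped."""
    summary: Dict[str, object] = {
        "loaded": [],
        "shard_number": None,
        "partition_number": None,
        "problems": [],
    }
    for raw in lines:
        match = GLOG_LINE.match(raw.strip())
        if not match:
            continue  # e.g. the Hadoop "WARN hdfs.DFSClient" lines
        severity, message = match.group(1), match.group(6)
        if severity in "WEF":
            summary["problems"].append(message)
        loaded = LOAD_DONE.match(message)
        if loaded:
            summary["loaded"].append(loaded.group(1))
        shard = SHARD_NUM.search(message)
        if shard:
            summary["shard_number"] = int(shard.group(1))
        partition = PARTITION_NUM.search(message)
        if partition:
            summary["partition_number"] = int(partition.group(1))
    return summary
```

Run over the log above, this would surface the "global sampler is not ok" warning and show that the client retrieved a shard number of 2 alongside a partition number of 1, which are the values to compare against the intended cluster layout.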


chengenbao commented on May 17, 2024

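lixusign commented on May 17, 2024

Hello, here is my situation: I have only one shard and 2 workers, so only one worker loads a shard. Does that not work?
Also, regarding saving the trained model mentioned above: with the setting --model_dir:"/model", will the model be saved to that directory on worker 0?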

chengenbao commented on May 17, 2024

lixusign commented on May 17, 2024

Thank you very much. One more small issue: after training finishes, the ps service does not exit.

chengenbao commented on May 17, 2024

lixusign commented on May 17, 2024

OK, thank you very much. I will look into it first. This issue can be closed.

chengenbao commented on May 17, 2024

lixusign commented on May 17, 2024

ok
