vesoft-inc / nebula-spark-utils Goto Github PK

View Code? Open in Web Editor NEW

23.0 23.0 31.0 1.25 MB

Spark related libraries and tools

Scala 100.00%

nebula-spark-utils's People

Contributors

Stargazers

Watchers

nebula-spark-utils's Issues

[feature] [exchange] add summary report in the end

add summary report in the end, including:

source data count, succeed count and failed count.
total time.

nebula exchange 生成sst文件报key order的错误

具体问题描述可以查看：https://discuss.nebula-graph.com.cn/t/topic/3667
报错信息为：org.rocksdb.RocksDBException: Keys must be added in strict ascending order.
具体报错如下：

org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 8.0 failed 4 times, most recent failure: Lost task 4.3 in stage 8.0 (TID 96, executor 10): org.rocksdb.RocksDBException: Keys must be added in strict ascending order.
	at org.rocksdb.SstFileWriter.put(Native Method)
	at org.rocksdb.SstFileWriter.put(SstFileWriter.java:132)
	at com.vesoft.nebula.exchange.writer.NebulaSSTWriter.write(FileBaseWriter.scala:42)
	at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$process$3$$anonfun$apply$3.apply(EdgeProcessor.scala:245)
	at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$process$3$$anonfun$apply$3.apply(EdgeProcessor.scala:219)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$process$3.apply(EdgeProcessor.scala:219)
	at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$process$3.apply(EdgeProcessor.scala:214)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

update guava version?

v14 is too old!

can we update guava version like exchange1.0? just need replace 'getHostText' to 'getHost'.

exchange 1.0 update guava pr

Nebula Algorithm

Nebula Graph 2.0.0
Deployment type (Single-Host)
Hardware info
Disk in use (HDD)
CPU: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz * 10
Ram: 64 G

I wanna use pagerank algorithm. I use below configuration:

{
  # Spark relation config
  spark: {
    app: {
        name: LPA
        # spark.app.partitionNum
        partitionNum:100
    }
    master:local
  }

  data: {
    # data source. optional of nebula,csv,json
    source: nebula
    # data sink, means the algorithm result will be write into this sink. optional of nebula,csv,text
    sink: nebula
    # if your algorithm needs weight
    hasWeight: false
  }

  # Nebula Graph relation config
  nebula: {
    # algo's data source from Nebula. If data.source is nebula, then this nebula.read config can be valid.
    read: {
        # Nebula metad server address, multiple addresses are split by English comma
        metaAddress: "127.0.0.1:9559"
        # Nebula space
        space: CallConnection
        # Nebula edge types, multiple labels means that data from multiple edges will union together
        labels: ["callTO"]
        # Nebula edge property name for each edge type, this property will be as weight col for algorithm.
        # Make sure the weightCols are corresponding to labels.
        # weightCols: ["start_year"]
    }

    # algo result sink into Nebula. If data.sink is nebula, then this nebula.write config can be valid.
    write:{
        # Nebula graphd server address， multiple addresses are split by English comma
        graphAddress: "127.0.0.1:9669"
        # Nebula metad server address, multiple addresses are split by English comma
        metaAddress: "127.0.0.1:9559,127.0.0.1:9560"
        user:user
        pswd:password
        # Nebula space name
        space:CallConnection
        # Nebula tag name, the algorithm result will be write into this tag
        tag:pagerank
    }
  }

#   local: {
#     # algo's data source from Nebula. If data.source is csv or json, then this local.read can be valid.
#     read:{
#         filePath: "hdfs://127.0.0.1:9000/edge/work_for.csv"
#         # srcId column
#         srcId:"_c0"
#         # dstId column
#         dstId:"_c1"
#         # weight column
#         #weight: "col3"
#         # if csv file has header
#         header: false
#         # csv file's delimiter
#         delimiter:","
#     }

#     # algo result sink into local file. If data.sink is csv or text, then this local.write can be valid.
#     write:{
#         resultPath:/tmp/
#     }
#   }

  algorithm: {
    # the algorithm that you are going to execute，pick one from [pagerank, louvain, connectedcomponent,
    # labelpropagation, shortestpaths, degreestatic, kcore, stronglyconnectedcomponent, trianglecount,
    # betweenness]
    executeAlgo: pagerank

    # PageRank parameter
    pagerank: {
        maxIter: 2
        resetProb: 0.15  # default 0.15
    }

 }
}

after run spark-submit as this

spark-submit --master local --class com.vesoft.nebula.algorithm.Main nebula-algorithm-2.0.0.jar -p classes/Algorithm.conf

after about several minutes and use all CPU but not RAM,

here is the spark UI

spark-submit --master local --class com.vesoft.nebula.algorithm.Main nebula-algorithm-2.0.0.jar -p classes/Algorithm.conf 
21/05/19 06:19:53 WARN Utils: Your hostname, orientserver resolves to a loopback address: 127.0.1.1; using 10.10.1.121 instead (on interface ens33)
21/05/19 06:19:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/05/19 06:20:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (com.vesoft.nebula.algorithm.Main$).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/05/19 06:20:00 INFO SparkContext: Running Spark version 2.4.7
21/05/19 06:20:00 INFO SparkContext: Submitted application: LPA
21/05/19 06:20:00 INFO SecurityManager: Changing view acls to: orient
21/05/19 06:20:00 INFO SecurityManager: Changing modify acls to: orient
21/05/19 06:20:00 INFO SecurityManager: Changing view acls groups to: 
21/05/19 06:20:00 INFO SecurityManager: Changing modify acls groups to: 
21/05/19 06:20:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(orient); groups with view permissions: Set(); users  with modify permissions: Set(orient); groups with modify permissions: Set()
21/05/19 06:20:00 INFO Utils: Successfully started service 'sparkDriver' on port 36273.
21/05/19 06:20:00 INFO SparkEnv: Registering MapOutputTracker
21/05/19 06:20:00 INFO SparkEnv: Registering BlockManagerMaster
21/05/19 06:20:00 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/05/19 06:20:00 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/05/19 06:20:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e15b4c05-c314-463e-9f06-a1fca904f7e7
21/05/19 06:20:00 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
21/05/19 06:20:01 INFO SparkEnv: Registering OutputCommitCoordinator
21/05/19 06:20:01 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/05/19 06:20:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.10.1.121:4040
21/05/19 06:20:01 INFO SparkContext: Added JAR file:/home/orient/NebulaJavaTools/nebula-spark-utils/nebula-algorithm/target/nebula-algorithm-2.0.0.jar at spark://10.10.1.121:36273/jars/nebula-algorithm-2.0.0.jar with timestamp 1621405201288
21/05/19 06:20:01 INFO Executor: Starting executor ID driver on host localhost
21/05/19 06:20:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34449.
21/05/19 06:20:01 INFO NettyBlockTransferService: Server created on 10.10.1.121:34449
21/05/19 06:20:01 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/05/19 06:20:01 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.10.1.121, 34449, None)
21/05/19 06:20:01 INFO BlockManagerMasterEndpoint: Registering block manager 10.10.1.121:34449 with 366.3 MB RAM, BlockManagerId(driver, 10.10.1.121, 34449, None)
21/05/19 06:20:01 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.10.1.121, 34449, None)
21/05/19 06:20:01 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.10.1.121, 34449, None)
21/05/19 06:20:01 INFO ReadNebulaConfig$: NebulaReadConfig={space=CallConnection,label=callTO,returnCols=List(),noColumn=true,partitionNum=10}
21/05/19 06:20:02 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/orient/NebulaJavaTools/nebula-spark-utils/nebula-algorithm/target/spark-warehouse').
21/05/19 06:20:02 INFO SharedState: Warehouse path is 'file:/home/orient/NebulaJavaTools/nebula-spark-utils/nebula-algorithm/target/spark-warehouse'.
21/05/19 06:20:02 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
21/05/19 06:20:02 INFO NebulaDataSource: create reader
21/05/19 06:20:02 INFO NebulaDataSource: options {spacename=CallConnection, nocolumn=true, metaaddress=127.0.0.1:9559, label=callTO, type=edge, connectionretry=2, timeout=6000, executionretry=1, paths=[], limit=1000, returncols=, partitionnumber=10}
21/05/19 06:20:02 INFO NebulaDataSourceEdgeReader: dataset's schema: StructType(StructField(_srcId,StringType,false), StructField(_dstId,StringType,false), StructField(_rank,LongType,false))
21/05/19 06:20:04 INFO NebulaDataSource: create reader
21/05/19 06:20:04 INFO NebulaDataSource: options {spacename=CallConnection, nocolumn=true, metaaddress=127.0.0.1:9559, label=callTO, type=edge, connectionretry=2, timeout=6000, executionretry=1, paths=[], limit=1000, returncols=, partitionnumber=10}
21/05/19 06:20:04 INFO DataSourceV2Strategy: 
Pushing operators to class com.vesoft.nebula.connector.NebulaDataSource
Pushed Filters: 
Post-Scan Filters: 
Output: _srcId#0, _dstId#1, _rank#2L
         
21/05/19 06:20:05 INFO CodeGenerator: Code generated in 205.90032 ms
21/05/19 06:20:05 INFO SparkContext: Starting job: fold at VertexRDDImpl.scala:90
21/05/19 06:20:05 INFO DAGScheduler: Registering RDD 9 (mapPartitions at VertexRDD.scala:356) as input to shuffle 3
21/05/19 06:20:05 INFO DAGScheduler: Registering RDD 27 (mapPartitions at VertexRDDImpl.scala:247) as input to shuffle 1
21/05/19 06:20:05 INFO DAGScheduler: Registering RDD 23 (mapPartitions at VertexRDDImpl.scala:247) as input to shuffle 0
21/05/19 06:20:05 INFO DAGScheduler: Registering RDD 31 (mapPartitions at GraphImpl.scala:208) as input to shuffle 2
21/05/19 06:20:05 INFO DAGScheduler: Got job 0 (fold at VertexRDDImpl.scala:90) with 10 output partitions
21/05/19 06:20:05 INFO DAGScheduler: Final stage: ResultStage 4 (fold at VertexRDDImpl.scala:90)
21/05/19 06:20:05 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0, ShuffleMapStage 3)
21/05/19 06:20:05 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0, ShuffleMapStage 3)
21/05/19 06:20:05 INFO DAGScheduler: Submitting ShuffleMapStage 0 (VertexRDD.createRoutingTables - vid2pid (aggregation) MapPartitionsRDD[9] at mapPartitions at VertexRDD.scala:356), which has no missing parents
21/05/19 06:20:05 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 16.4 KB, free 366.3 MB)
21/05/19 06:20:05 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.6 KB, free 366.3 MB)
21/05/19 06:20:05 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.10.1.121:34449 (size: 7.6 KB, free: 366.3 MB)
21/05/19 06:20:05 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1184
21/05/19 06:20:05 INFO DAGScheduler: Submitting 10 missing tasks from ShuffleMapStage 0 (VertexRDD.createRoutingTables - vid2pid (aggregation) MapPartitionsRDD[9] at mapPartitions at VertexRDD.scala:356) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
21/05/19 06:20:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
21/05/19 06:20:06 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 9761 bytes)
21/05/19 06:20:06 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/05/19 06:20:06 INFO Executor: Fetching spark://10.10.1.121:36273/jars/nebula-algorithm-2.0.0.jar with timestamp 1621405201288
21/05/19 06:20:06 INFO TransportClientFactory: Successfully created connection to /10.10.1.121:36273 after 44 ms (0 ms spent in bootstraps)
21/05/19 06:20:06 INFO Utils: Fetching spark://10.10.1.121:36273/jars/nebula-algorithm-2.0.0.jar to /tmp/spark-17e5e989-b080-4a41-aac3-f9fd56f03752/userFiles-693448ab-9c10-468b-a2a9-c926b9e62800/fetchFileTemp3358155586060883895.tmp
21/05/19 06:20:06 INFO Executor: Adding file:/tmp/spark-17e5e989-b080-4a41-aac3-f9fd56f03752/userFiles-693448ab-9c10-468b-a2a9-c926b9e62800/nebula-algorithm-2.0.0.jar to class loader
21/05/19 06:20:07 INFO NebulaEdgePartitionReader: partition index: 1, scanParts: List(1)
21/05/19 06:20:07 INFO CodeGenerator: Code generated in 23.996116 ms

after some minutes,

just do nothing and all task gone!

Louvain算法只返回了_id和_louvain，社区内部的节点ID没返回

Migrate data from sst files only adds forward edge

Migrate data from sst files only adds forward edge but doesn't add reverse edge.

sst import data error "Keys must be added in strict ascending order"

this is spark job log
driver.log
space sst_test

tag idno

data example
230421197906123305,0,客户0,0,2021/1/1,女,博士,离异,1,black 230421197906123305,1,客户1,1,2021/1/1,女,博士,离异,1,black 230421197906123305,2,客户2,2,2021/1/1,女,博士,离异,1,black 533101196807196696,3,客户3,3,2021/1/1,女,博士,离异,1,black 533101196807196696,4,客户4,4,2021/1/1,女,博士,离异,1,black 220722198306264943,5,客户5,5,2021/1/1,女,博士,离异,1,black 220722198306264943,6,客户6,6,2021/1/1,女,博士,离异,1,black 220722198306264943,7,客户7,7,2021/1/1,女,博士,离异,1,black 310200197802274261,8,客户8,8,2021/1/1,女,博士,离异,1,black 310200197802274261,9,客户9,9,2021/1/1,女,博士,离异,1,black
edge address_id

data example
`贵州省黔东南苗族侗族自治州榕江县镜湖花园22栋177号,230421197906123305,2019-02-20 08:02:20

贵州省黔东南苗族侗族自治州榕江县镜湖花园22栋177号,230421197906123305,2019-01-30 13:26:01
贵州省黔东南苗族侗族自治州榕江县镜湖花园22栋177号,230421197906123305,2019-03-31 03:19:30
四川省泸州市龙马潭区锋尚名居8栋734号,533101196807196696,2019-05-14 22:31:58
四川省泸州市龙马潭区锋尚名居8栋734号,533101196807196696,2020-02-20 06:09:22
河北省邢台市巨鹿县市聚福园16栋228号,220722198306264943,2020-12-11 12:23:53
河北省邢台市巨鹿县市聚福园16栋228号,220722198306264943,2020-01-08 05:19:52
河北省邢台市巨鹿县市聚福园16栋228号,220722198306264943,2020-04-29 09:23:55
浙江省丽水地区青田县尚书苑8栋271号,310200197802274261,2019-09-03 01:40:32
浙江省丽水地区青田县尚书苑8栋271号,310200197802274261,2019-04-23 18:37:49`

application-sst.conf

  # Spark relation config
  spark: {
    app: {
      name: Nebula Exchange 2.5
    }

    master:local

    driver: {
      cores: 1
      maxResultSize: 1G
      memory: 1G
    }

    executor: {
      memory: 4G
      instances: 1
      cores: 10
    }

    cores:{
      max: 48
    }
  }


  # Nebula Graph relation config
  nebula: {
    address:{
      graph:["xxxx:9669"]
      meta:["xxxx:9559"]
    }
    user: root
    pswd: nebula
    space: sst_test

    # parameters for SST import, not required
    path:{
      local: "/tmp"
      remote: "/user/frms/sst"
      hdfs.namenode: "hdfs://xxxx:8020"
    }

    # nebula client connection parameters
    connection {
      timeout: 3000
      retry: 3
    }

    # nebula client execution parameters
    execution {
      retry: 3
    }

    error: {
      # max number of failures, if the number of failures is bigger than max, then exit the application.
      max: 32
      # failed import job will be recorded in output path
      output: /user/frms/error
    }

    # use google's RateLimiter to limit the requests send to NebulaGraph
    rate: {
      # the stable throughput of RateLimiter
      limit: 1024
      # Acquires a permit from RateLimiter, unit: MILLISECONDS
      # if it can't be obtained within the specified timeout, then give up the request.
      timeout: 1000
    }
  }

  # Processing tags
  # There are tag config examples for different dataSources.
  tags: [

    # csv
    {
      name: address
      type: {
        source: csv
        sink: sst
      }
      path: "hdfs://xxxx:8020/user/frms/nebula-data/address.csv"
      # if your csv file has no header, then use _c0,_c1,_c2,.. to indicate fields
      fields: []
      nebula.fields: []
      vertex: {
        field: _c0
        # policy: "hash"
      }
      separator: ","
      header: false
      batch: 2560
      partition: 32
    }



    {
      name: idno
      type: {
        source: csv
        sink: sst
      }
      path: "hdfs://xxxx:8020/user/frms/nebula-data/id.csv"
      fields: [_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9]
      nebula.fields: [a1,a2,a3,a4,a5,a6,a7,a8,a9]
      vertex: {
        field: _c0
        # policy: "hash"
      }
      separator: ","
      header: false
      batch: 2560
      partition: 32
    }


  ]

  # Processing edges
  # There are edge config examples for different dataSources.
  edges: [

    # csv
    {
      name: address_id
      type: {
        source: csv
        sink: sst
      }
      path: "hdfs://xxxx:8020/user/frms/nebula-data/addressid.csv"
      fields: [_c2]
      nebula.fields: [create_time]
      source: _c0
      target: _c1
      # ranking: rank
      separator: ","
      header: false
      batch: 2560
      partition: 32
    }

  ]
}

run command

./spark-2.4.8-bin-hadoop2.6/bin/spark-submit --master yarn-client --class com.vesoft.nebula.exchange.Exchange nebula-exchange-2.5-SNAPSHOT.jar -c application-sst.conf

vertex is null exception

when data source vertex is null, exchange have not a good warning but just a exception like 'java.lang.NullPointException'.when data source is not strict,someone do not know why and how to fix it.

go mod 或许需要打版本?

go get 2.0 时, 报错.

命令以及报错详情:

$ go get -v github.com/vesoft-inc/[email protected]
go get github.com/vesoft-inc/[email protected]: github.com/vesoft-inc/[email protected]: invalid version: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v2

查阅文档可知, go 官方推荐在 v2 及更高版本中, 使用 {path}/v2 的方式, 从而管理多个版本. 参考文档: Go Modules: v2 and Beyond

如何解决

在v2.x.x分支中, 更新 go.mod 为 module github.com/vesoft-inc/nebula-go/v2.
在 master/relase2.x 分支中, 也进行如上更新

如何复现

去掉goproxy: unset GOPROXY
执行 go get go get -v github.com/vesoft-inc/[email protected]

new feature: can exchange support datasource ClickHouse?

希望exchange支持clickhouse数据源
I hope that exchange can support datasource ClickHouse.

add exchange faq

https://discuss.nebula-graph.com.cn/tags

neo4j 导入Nebula Graph 驱动问题

版本：Nebula Exchange 2.0, Nebula Graph 2.0.1 spark2.4
os：Ubuntu 18.04
按照文档说明编写了neo4j_application.conf ，执行命令${SPARK_HOME}/bin/spark-submit --master "local" --class com.vesoft.nebula.exchange.Exchange /root/nebula-spark-utils/nebula-exchange/target/nebula-exchange-2.0.0.jar -c /root/nebula-spark-utils/nebula-exchange/target/classes/neo4j_application.conf 报错误：
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.neo4j.driver.exceptions.ClientException: Database name parameter for selecting database is not supported in Bolt Protocol Version 3. Database name: 'graph.db'
再次参考文档，文档上说：Exchange使用Neo4j Driver 4.0.1实现对Neo4j数据的读取。查找Neo4j官方未找到Neo4j Driver ，但找到neo4j-spark-connector 不知道是不是这个，请确认下，若不是，希望给出Neo4j Driver 4.0.1的地址，

[Question] how could I split the graph into subgraphs according to cluster id?

Hi,

Thanks for this great work!!

I tried some cluster algorithm and generate cluster id for each vertex. Now I need to compute betweenness centrality of on each cluster(because the whole graph is too large). But I do not know how to generate subgraphs according to the cluster ids, and how could I compute betweenness centrality in parallel on each subgraph.

Would you please tell me how could I make this work?

can not find com.vesoft.client.2.0.0-SNAPSHOT jar.

Unable to package Exchange 2.0

According to the readme instructions, once I've cloned nebula-spark-utils repo I should just enter
mvn clean package -Dmaven.test.skip=true -Dgpg.skip -Dmaven.javadoc.skip=true
Nevertheless there's no way to run it and I always get a BUILD FAILURE ERROR:
Here is the output i get

[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for com.vesoft:nebula-exchange:jar:2.0.0
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-surefire-plugin is missing. @ line 98, column 21
[WARNING] 'build.plugins.plugin.version' for net.alchim31.maven:scala-maven-plugin is missing. @ line 173, column 21
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] ---------------------< com.vesoft:nebula-exchange >---------------------
[INFO] Building nebula-exchange 2.0.0
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from SparkPackagesRepo: http://dl.bintray.com/spark-packages/maven/neo4j-contrib/neo4j-spark-connector/2.4.5-M1/neo4j-spark-connector-2.4.5-M1.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7.472 s
[INFO] Finished at: 2021-05-10T23:16:41+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project nebula-exchange: Could not resolve dependencies for project com.vesoft:nebula-exchange:jar:2.0.0: Failed to collect dependencies at neo4j-contrib:neo4j-spark-connector:jar:2.4.5-M1: Failed to read artifact descriptor for neo4j-contrib:neo4j-spark-connector:jar:2.4.5-M1: Could not transfer artifact neo4j-contrib:neo4j-spark-connector:pom:2.4.5-M1 from/to SparkPackagesRepo (http://dl.bintray.com/spark-packages/maven): Authorization failed for http://dl.bintray.com/spark-packages/maven/neo4j-contrib/neo4j-spark-connector/2.4.5-M1/neo4j-spark-connector-2.4.5-M1.pom 403 Forbidden -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

Am I missing something?

Louvain-algo result should be more meaningful

nebula-spark-utils/nebula-algorithm/src/main/scala/com/vesoft/nebula/algorithm/lib/LouvainAlgo.scala

Line 369 in fe314b7

Row(x._1, x._2.cId)

Version: v2.0.0

Debugging with breakpoint, it seems that the louvain result only contains the core vertecies in community, which means we cannot figure out all vertecies in one community.

Here is my test data:

src,dst,weight
1,2,1.0
1,3,5.0
2,1,5.0
2,3,5.0
3,4,1.0
4,5,1.0
4,6,1.0
4,7,1.0
4,8,1.0
5,7,1.0
5,8,1.0
6,8,1.0

params:

    val louvainConfig = new LouvainConfig(5, 3, 0.005)
    val louvainResult = LouvainAlgo.apply(spark, data, louvainConfig, hasWeight = false)

And the result by calling louvainResult.show():

+---+--------+
|_id|_louvain|
+---+--------+
|  4|       4|
|  1|       1|
|  5|       5|
+---+--------+

Apparently, there are more than 3 vertecies.

And I think it's more meaningful to users by changing the code to:

def getCommunities(G: Graph[VertexData, Double]): RDD[Row] = {
    val communities = G.vertices
      .flatMap(x => {
        x._2.innerVertices
          .map(y => {
            Row(y, x._2.cId)
          })
      })
    communities
  }

Nebula Chinese forum related: https://discuss.nebula-graph.com.cn/t/topic/4701/5

    options = OrderedDict()
    options["es.nodes"] = ['127.0.0.1:9200']
    options["es.index.auto.create"] = "true"
    options["es.resource"] = "nebula/docs"

    df.write.format("org.elasticsearch.spark.sql") \
            .options(**options) \
            .save(mode='append')

vesoft-inc / nebula-spark-utils Goto Github PK

nebula-spark-utils's People

Contributors

Stargazers

Watchers

Forkers

nebula-spark-utils's Issues

Recommend Projects

Recommend Topics

Recommend Org