
rscala's People

Contributors

carterj4, dbdahl, floidgilbert, philwalk

rscala's Issues

What's the correct usage of RClient

Hi,
We have a web application based on Play. When concurrent user requests come in, we do the computation by calling R code, and each request should be handled as fast as possible.
I noticed that creating an RClient itself takes about 1.5 seconds:
val R = org.ddahl.rscala.RClient(serializeOutput = true)

I would like to ask whether I am using RClient correctly:

  1. When a request comes in, we create an RClient and do the job with it, so if there are many concurrent requests there will be more than one RClient instance running. I am not familiar with the internals of RClient, so is it OK to use RClient in this way?

  2. If the above usage is OK, can I pool the RClient instances beforehand, so that when a request comes in it can take an RClient from the pool (see the pooling sketch below)? I am afraid that if an RClient sits in the pool for a long time it will become stale, e.g. due to a connection timeout.

  3. I notice that RClient doesn't have a close method. When are the resources that an RClient uses released?
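
For reference, a minimal sketch of the pooling idea in point 2, assuming each RClient is only ever used by one request at a time; whether long-lived clients stay valid is exactly what point 2 asks. The pool type and size here are illustrative, not part of rscala:

import java.util.concurrent.ArrayBlockingQueue
import org.ddahl.rscala.RClient

// Illustrative RClient pool: clients are created up front and each request
// borrows one for the duration of its computation.
object RClientPool {
  private val size = 4
  private val pool = new ArrayBlockingQueue[RClient](size)
  (1 to size).foreach(_ => pool.put(RClient(serializeOutput = true)))

  def withClient[A](f: RClient => A): A = {
    val client = pool.take()     // blocks until a client is free
    try f(client)
    finally pool.put(client)     // return the client to the pool
  }
}

// Usage in a request handler:
// val result = RClientPool.withClient(R => R.evalD0("1 + 1"))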

Rscala RClient() hangs (macOS)

Thanks for this very interesting package.

I am trying to get RClient() working, i.e. calling R from Scala.

The call to RClient() always hangs and never returns. In the code below, "Hello World!" is printed, but not "Goodbye!".

object HelloWorldRscala {

  import org.ddahl.rscala._
  def main(argv: Array[String]): Unit = {
    println("Hello World!")
    val R = RClient()
    println("Goodbye!")
  }
}

I've tried it with sbt run and also by creating a fat jar; same result.

I am using Java 8 on macOS 10.14.5 (Mojave).

I'd be grateful for any tips to get this working :)

eval without type

In version 2 there was an unsafe eval that returned the result along with a string giving its type.
Can it be added back?

Parameter template

This is not so much an issue as my inability to understand how to use the package properly.

I thought I could do something like the following:

    val rStringLength = "function(s) return(nchar(s))"
    val rStrLenFun = R.evalObject(rStringLength)
    val helloLen = R.evalI0("%-", rStrLenFun, "hello")
    println("Length of hello is "+helloLen)

but that doesn't work.

If I replace the "helloLen" line above with the following:

    val helloLen = R.evalI0("%-(\"hello\")", rStrLenFun)

then I get the desired effect. But that doesn't seem to be how it should work.

Am I doing something stupid?
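
In case it helps clarify what I'm after, here is a hedged guess at an equivalent call; it assumes that several %- placeholders are substituted in order and that an argument passed this way arrives as an R string (not verified against the documentation):

// Hedged guess: pass both the function reference and its argument as %- substitutions.
val rStrLenFun = R.evalObject("function(s) nchar(s)")
val helloLen   = R.evalI0("(%-)(%-)", rStrLenFun, "hello")
println(s"Length of hello is $helloLen")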

working with R lists

Hello!
I need to pass an R list from Scala code to R and to read a list back into Scala.
As I understand it, I can pass a list to R via chained calls to RClient.eval and RClient.set(identifier, index, singleBrackets), as mentioned in the comments on the set method, although maybe you can suggest a more efficient method.
But is there a way to get the list contents back into Scala? It looks like R lists as a full-blown type are not supported. A call to RClient.get("myList$names") fails with "Undefined identifier".

P.S. I'm currently on version 2.5.3.
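
One possible stopgap (a minimal sketch, assuming the list elements themselves are atomic vectors and that the typed eval methods shown here are available in 2.5.3) is to fetch each element individually with typed evals rather than get:

// Sketch: fetch list components one at a time with typed evals.
R.eval("""myList <- list(names = c("a", "b"), values = c(1.0, 2.0))""")
val names  = R.evalS1("myList$names")   // Array[String]
val values = R.evalD1("myList$values")  // Array[Double]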

Problem getting predicted values from GAM model

I'm trying to build a GAM model in R and then use it to predict a list of input points from Scala. Here is the Scala code:

val xs = List(1.0, 2.0, 3.0, 4.0, 5.0)
val ys = List(1.0, 2.0, 3.0, 4.0, 5.0)
val predictXs = List(2.0, 2.5, 3.0)

val R: RClient = RClient()
R.xs = xs.toArray
R.ys = ys.toArray
R.eval("require(mgcv)")
R.eval("model = gam(ys ~ xs)")
val formattedPredictXs = predictXs.mkString("c(", ",", ")")
val res = R.evalD1(s"""predict(model, data.frame(xs = ${formattedPredictXs}))""")
println("Resulting predictions:")
res.foreach(println)

When run, I get this error (which is happening on the val res = R.evalD1(s"""predict(model, data.frame(xs = ${formattedPredictXs}))""") line):

Loading required package: mgcv
Loading required package: nlme
This is mgcv 1.8-24. For overview type 'help("mgcv-package")'.
Exception in thread "main" java.lang.RuntimeException: Unsupported data type.
	at org.ddahl.rscala.RClient.getInternal(RClient.scala:581)
	at org.ddahl.rscala.RClient.eval(RClient.scala:140)
	at org.ddahl.rscala.RClient.evalD1(RClient.scala:169)

If I change the model from gam to lm though, everything works perfectly (note the only difference is the R.eval("model = lm(ys ~ xs)") line):

val xs = List(1.0, 2.0, 3.0, 4.0, 5.0)
val ys = List(1.0, 2.0, 3.0, 4.0, 5.0)
val predictXs = List(2.0, 2.5, 3.0)

val R: RClient = RClient()
R.xs = xs.toArray
R.ys = ys.toArray
R.eval("require(mgcv)")
R.eval("model = lm(ys ~ xs)")
val formattedPredictXs = predictXs.mkString("c(", ",", ")")
val res = R.evalD1(s"""predict(model, data.frame(xs = ${formattedPredictXs}))""")
println("Resulting predictions:")
res.foreach(println)

That gives the output:

Loading required package: mgcv
Loading required package: nlme
This is mgcv 1.8-24. For overview type 'help("mgcv-package")'.
Resulting predictions:
2.0000000000000004
2.5000000000000004
3.0000000000000004

So for some reason the output of the lm prediction can be converted to a Scala object, while the output of the gam prediction cannot. If I print the predicted list in R using this code:

R.eval(s"""print(predict(model, data.frame(xs = ${formattedPredictXs})))""")

I get the exact same output for both the lm and the gam prediction:

  1   2   3 
2.0 2.5 3.0

I can't figure out why the lm results can be converted but the gam results cannot, especially given that, when printed in R, they appear to be exactly the same.
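
One workaround that may help (a sketch only; it assumes the failure comes from gam's predict returning a vector carrying a class or extra attributes that rscala cannot map): coerce the result to a bare numeric vector on the R side before it crosses the bridge.

// Sketch: strip class/attributes so rscala sees a plain double vector.
val res = R.evalD1(
  s"""as.numeric(predict(model, data.frame(xs = ${formattedPredictXs})))"""
)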

Problems with using Rscala with Scala 2.10

In Scala 2.11 (using rscala 2.2.2) I can write something like this:

val R = org.ddahl.rscala.RClient()
R.x = 1.0

With Scala 2.10 this gives me a compilation error:

Error:(8, 5) type mismatch;
 found   : R.type (with underlying type org.ddahl.rscala.RClient)
 required: ?{def x: ?}
Note that implicit conversions are not applicable because they are ambiguous:
 both method any2Ensuring in object Predef of type [A](x: A)Ensuring[A]
 and method any2ArrowAssoc in object Predef of type [A](x: A)ArrowAssoc[A]
 are possible conversion functions from R.type to ?{def x: ?}
    R.x = 1.0

Is there a way to make this work, or should I use R.set("x", 1.0)?
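
If the dynamic-style assignment cannot be made to work on 2.10, the explicit form mentioned above avoids the ambiguous implicit conversions entirely; a small sketch, assuming the typed eval methods (evalD0 here) are available in rscala 2.2.2:

// Explicit setter/typed eval, avoiding the R.x = ... dynamic assignment.
val R = org.ddahl.rscala.RClient()
R.set("x", 1.0)
val x = R.evalD0("x")   // read the value back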

When is the message "Scala seems to have died" issued?

Hello David @dbdahl,

We observe the message from the title. Is it normal?
It is issued from the rbyte R function, which is called from the pop function.

Or, in other words, what is the correct way to close the bridge between Scala and R? In our case we create an RClient object in a UDF, do the job, and call quit() on the RClient object, while an instant later the RClient object may become unreachable, since it is no longer in the scope of the function body that ended, and the GC may garbage-collect it.

So is my understanding correct that the R session still exists, and only when it discovers that the RClient has disappeared does this pop function end processing with the message "Scala seems to have died"?
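
For context, here is the lifecycle pattern described above, sketched with an explicit try/finally so that quit() always runs before the client falls out of scope. This is illustrative only; whether the R side can still report "Scala seems to have died" afterwards is exactly the question, and the R expression is a placeholder.

import org.ddahl.rscala.RClient

// Sketch of the UDF lifecycle described above: always quit() before the
// client falls out of scope, even if the R call throws.
def runInR(input: String): String = {
  val r = RClient()
  try {
    r.eval("x <- toupper(%-)", input)   // placeholder R work
    r.getS0("x")
  } finally {
    r.quit()                            // shut down the R side deterministically
  }
}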

ps exceptions

Hi there.

When running tests I get an exception from ps complaining that the -p option does not have an argument.

Environment:
MacOS 10.14.6 Mojave
Rscala: "org.ddahl" %% "rscala" % "3.2.16",
sbt.version=1.3.3
Scala 2.13

I'm calling an R script from Scala. The call works, but under normal circumstances I see the error ps: option requires an argument -- p

scriptName = singlet_gate_fc_wrapper

[1] 0.10000000 0.06559273 0.10000000 0.17368241 4.00000000 4.04467395 4.00000000
[8] 3.92850483

ps: option requires an argument -- p

usage: ps [-AaCcEefhjlMmrSTvwXx] [-O fmt | -o fmt] [-G gid[,gid...]]

      [-g grp[,grp...]] [-u [uid,uid...]]

      [-p pid[,pid...]] [-t tty[,tty...]] [-U user[,user...]]

   ps [-L]

Error in rbyte(socketIn) : Scala seems to have died.

If I run the test from the IntelliJ debugger I don't get this error.

Any idea what I am doing wrong?

RObject is not Serializable and has no public constructor

Hello David @dbdahl,

We are trying to use the rscala library in a Scala project to call R functions. Our application runs on Azure Databricks (so something like Apache Spark). We face cases where some data are prepared on the driver machine and held in R variables in the driver's R session, but we also want to process some other data the Spark way, i.e. compute something on the workers.

Now, the computation we are trying to perform on the workers needs to use some values already computed on the driver in its R session (called from Scala via rscala). So we thought about calling a Spark user-defined function (UDF), which would be supplied with the necessary data captured from the driver in a Scala closure. To achieve this we need to somehow pass the value from the driver to the UDF.

So basically we try to do something like this:

val s = "text"
val testFun: () => String =
  () => {
    val r = RClient()
    r.evalObject("s <- %-", s) // capture s value from outside of testFun
    r.eval("x <- modify(s)") // call some R function to work on s
    val result = r.getS0("x") // finally return computed value x
    r.quit()
    result
  }
val testUDF = udf(testFun)

So far so good; this works on Spark. For now let's ignore the cost of creating the RClient object and closing it (quit()).

The problem is that in the majority of cases the captured value is not as simple as a string; it might be much more complex, like an R list of several elements which in turn contain other lists/elements, and so on.

What we tried to do was capture not only scalar values but the whole complex list. So we want to achieve something like this:

val s = r.evalObject("some_list") // take a complex R list as RObject
val testFun: () => String =
  () => {
    val r = RClient()
    r.evalObject("s <- %-", s) // capture s value from outside of testFun, this time as an RObject
    r.eval("x <- modify(s)") // call some R function to work on s
    val result = r.getS0("x") // finally return computed value x
    r.quit()
    result
  }
val testUDF = udf(testFun)

The problem with the above code snippet is that Spark complains that "RObject is not Serializable" and therefore cannot be distributed to the workers, so Spark raises a runtime error.

So as a remedy we thought about reading the RObject as an Array[Byte], like this:

val s = r.evalObject("some_list").x // get the real bytes, Yes, the x value is public
val testFun: () => String =
  () => {
    val r = RClient()
    r.evalObject("s <- %-", new RObject(s)) // capture s value from outside of testFun, this time as Array[Byte]
    r.eval("x <- modify(s)") // call some R function to work on s
    val result = r.getS0("x") // finally return computed value x
    r.quit()
    result
  }
val testUDF = udf(testFun)

Unfortunately this does not work, nor even compile, because RObject does not have a public constructor.

Do you see any solution to this puzzle?
Is there any reason why RObject is not Serializable and does not have a public constructor?
I guess making RObject Serializable or making its constructor public would solve this problem.

If this cannot be amended, then we have to decompose the complex list into smaller and simpler data structures before passing it, which is a time-consuming process. At the moment this seems to be the only solution, as rscala supports only scalars, Arrays, and Arrays of Arrays, i.e. rather simple data structures, so using RObject would be a nice workaround.
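
One possible detour that avoids RObject entirely, sketched under the assumption that rscala's raw-vector support maps R raw vectors to Array[Byte] (evalR1 and the %- substitution of a byte array here are assumptions we have not verified): serialize the list to bytes in the driver's R session, ship the plain Array[Byte] through the Spark closure, and unserialize it in the worker's R session.

// Driver side: turn the R list into a plain Array[Byte], which Spark can serialize.
val listBytes: Array[Byte] = r.evalR1("serialize(some_list, NULL)")

val testFun: () => String =
  () => {
    val w = RClient()
    w.eval("s <- unserialize(%-)", listBytes) // rebuild the list from raw bytes
    w.eval("x <- modify(s)")                  // call some R function to work on s
    val result = w.getS0("x")                 // finally return the computed value x
    w.quit()
    result
  }
val testUDF = udf(testFun)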

RScala doesn't handle encoding correctly?

Hi @dbdahl ,

I encountered an interesting problem. In the following code, I am using the Chinese words 你来自哪里 to create a vector (by the way, the words mean "where are you from" in English).

  val R = org.ddahl.rscala.RClient(serializeOutput = true)
    val code =
      """
          x <- c("你来自哪里")
         "done"
      """.stripMargin(' ')
    try {
      val ret = R evalS0 code
      println(ret)
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      R.exit()
    }

An exception occurs when running the above code; the error message is:

java.lang.RuntimeException: Error in R evaluation.
	at org.ddahl.rscala.RClient.eval(RClient.scala:112)

But if I use x <- c("你来自哪"), then it works. The failing case uses 5 characters, while the working one uses 4.

I am using UTF-8 and have no idea where the problem lies. @dbdahl, could you please take a look?

NOTE: both cases work in RStudio.

Question on the ping and exit methods of RClient

Hi, @dbdahl ,
I am reading the code of RClient at https://github.com/dbdahl/rscala/blob/master/src/main/scala/org/ddahl/rscala/RClient.scala,

There is an instance of Process, rProcessInstance; may I ask what this process represents?

It looks to me like the ping and exit methods should be reconsidered.
I think a common pattern for using ping and exit may be:

if (!R.ping()) {
  R.exit()
}

If ping returns false and the process has been destroyed, then calling exit would do nothing but throw an error.
I think the code for ping and exit can be improved as follows, i.e. by checking whether rProcessInstance has been destroyed:

def ping(): Boolean = synchronized {
  try {
    if (rProcessInstance != null) {
      out.writeInt(PING)
      out.flush()
      val status = in.readInt()
      status == OK
    } else {
      false
    }
  } catch {
    case _: Throwable =>
      if (rProcessInstance != null) {
        rProcessInstance.destroy()
        rProcessInstance = null
      }
      false
  }
}

def exit() = synchronized {
  try {
    if (rProcessInstance != null) {
      check4GC()
      out.writeInt(SHUTDOWN)
      out.flush()
    }
  } catch {
    case _: Throwable =>
      rProcessInstance.destroy()
      rProcessInstance = null
  }
}

Not able to add extra command arguments to scala call

I can't find a way to add extra command-line arguments to the scala call. It would be useful to add an agentlib option for debugging, or any other Scala or Java argument.
The only command-line argument that can be configured right now is the maximum heap memory.

Is it possible to connect to already instantiated R session?

Hello David @dbdahl,

We are experimenting with the rscala library on Azure Databricks. The processing is organised so that we first perform some initialization on the driver, where we produce some preprocessed data, which is then passed/broadcast to the workers. There we would like to perform another part of the computation using a Spark user-defined function (UDF) that uses RClient and does additional processing in R. Basically it works well, except that each call to the UDF needs to instantiate a new RClient session, which has to be created and initialized, and then we need to push to it the preprocessed data we prepared on the driver.

So, as you may imagine, the processing within the UDF consists of several operations that are the same for every record, before the UDF starts to do the really important work for the given DataFrame record. It turns out that the cost of instantiating an RClient, and then of recreating the preprocessed data within the R session, is significant.

That is where we come to my question: is it possible to somehow connect to an already running R session? For example, imagine that we have, say, 16 tasks on each worker, so there are always up to 16 UDF invocations being processed at the same time. Instead of creating a new RClient for each record, maybe it is possible to create and initialize those 16 R sessions up front (for example, initialize 16 RClient sessions and keep them as a 16-element collection of RClient objects). Such initialization would be performed only once per job and would also include instantiating all the preprocessed data structures.

Then, during the real processing, we would be able to refer to those already created RClients/R sessions. For example, in the UDF, instead of

val r = RClient()
// here the initialization of R variables in R session
// and only here we start the real R processing
r.quit() // finally close the object

we would use something like this:

val r = RClient(some_identifier_of_one_of_those_16_sessions)
// SKIPPED - since we connect to existing session, we skip the initialization
// the real processing
// r.quit() SKIPPED - we could also skip the quit method as we leave the object for further processing

Of course I am aware that we should probably make sure the same RClient object is not used by more than one task at the same time, because it may not be thread safe, but this is why I thought of, say, 16 such objects.

The key question is whether this is possible, i.e. whether we can connect to an already created RClient session.
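
This is not an answer to whether rscala can attach to an externally started R session, but a sketch of a related pattern sometimes used on Spark: a per-executor singleton whose RClient and one-time R initialization are created lazily, once per worker JVM, and then reused by every UDF call on that executor. The initialization script and the modify function are hypothetical placeholders.

import org.ddahl.rscala.RClient

// Hypothetical per-executor singleton: the client and its one-time R-side
// initialization happen lazily, once per worker JVM, and are then reused.
object ExecutorR {
  lazy val client: RClient = {
    val r = RClient()
    r.eval("source('init_preprocessed_data.R')")  // hypothetical one-time setup
    r
  }
}

// In the UDF body, reuse the shared client instead of creating a new one.
// Access is synchronized because a single RClient may not be thread safe.
val testFun: () => String = () => ExecutorR.synchronized {
  ExecutorR.client.eval("x <- modify(s)")
  ExecutorR.client.getS0("x")
}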

Two questions about rscala

With the following code, I have two questions:

object RScala2Test {
  def main(args: Array[String]): Unit = {
    val R = org.ddahl.rscala.RClient()
    R eval """
    print("Hello, I am in Scala")
    """.stripMargin(' ')

    R eval """
    1 / "a"
    """.stripMargin(' ')
  }
}
  1. The first question is that I can't see the output of print("Hello, I am in Scala"). If the R script is very long, it would be very helpful to see what's going on in R via print (see the sketch at the end of this issue).

  2. The second question is that I deliberately wrote the code 1 / "a", which will cause an error; the output is as follows:

��������������з���ֵ����
Exception in thread "main" java.lang.RuntimeException: Error in R evaluation.
at org.ddahl.rscala.RClient.eval(RClient.scala:112)
at org.ddahl.rscala.RClient.eval(RClient.scala:117)
at rscala.RScala2Test$.main(RScala2Test.scala:12)
at rscala.RScala2Test.main(RScala2Test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

There is garbled text in the first line; I am not sure what it says or what causes it.

  1. Is it the error message that shows exactly what's wrong in the R code?
  2. I am in IntelliJ IDEA, which uses UTF-8, and I am not sure what's causing this.
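
Regarding the first question, one thing that may be worth trying (hedged; based only on the serializeOutput flag that appears elsewhere in these issues, which is meant to capture R's console output): construct the client with serializeOutput = true.

// Possibly relevant to question 1: ask rscala to capture R's console output.
val R = org.ddahl.rscala.RClient(serializeOutput = true)
R eval """print("Hello, I am in Scala")"""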

Empty array

When passing results between R and Java, the output type is sometimes difficult to predict. In particular, the following returns an rscala reference when I would expect an empty matrix.

s * 'Array.fill[Array[String]](0)(Array.fill[String](0)(""))'

Calling R code from within Spark(Scala)

I am trying to use R code/libraries in a Spark app written in Scala using rscala. I am getting the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/ddahl/rscala/RClient$

I am building the app using sbt and running it on the local machine using spark-submit. I have also already updated the build.sbt file with the Spark and rscala dependencies:

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.2"
libraryDependencies += "org.ddahl" %% "rscala" % "3.2.6"

I tried an older version of rscala (2.5.0) but got the same error.

Please let me know if I am missing something. Is it possible to use rscala with Spark?
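
For context, a NoClassDefFoundError at runtime usually means the rscala jar is not on Spark's classpath when the job runs, even though the app compiles. Two common remedies, sketched with assumed coordinates (the _2.11 suffix must match the Scala version of your Spark build, and the sbt-assembly plugin is assumed to be configured):

// Option 1 (build.sbt): bundle rscala into a fat jar with sbt-assembly, so
// spark-submit ships it with the application classes. Spark itself is marked
// "provided" because the cluster already supplies it.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.2" % "provided"
libraryDependencies += "org.ddahl" %% "rscala" % "3.2.6"

// Option 2: keep a thin jar and add the dependency at submit time, e.g.
//   spark-submit --packages org.ddahl:rscala_2.11:3.2.6 ...
// (adjust the _2.11 / _2.12 suffix to your Spark build's Scala version).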
