Comments (8)

alg-jmx commented on May 13, 2024

Why doesn't MMLSpark have more examples using Scala? I'm more interested in using Scala with Spark in MMLSpark.

from synapseml.

drdarshan commented on May 13, 2024

Dear Myasuka,

  1. For reading CIFAR-10 in Scala, we have a version of the dataset in a zip file hosted on our CDN. I will send you an example as soon as I have verified that it works.
  2. We would love to have your Scala examples! Could you please let us know how you have them implemented? Is it as a standalone application or does it use the Jupyter Scala kernel? If you could please send us a pointer, we will be happy to work with you to integrate them.

Thank you so much.

drdarshan commented on May 13, 2024

Dear Myasuka,
You can get a zip version of the CIFAR test dataset here: https://mmlspark.azureedge.net/datasets/CIFAR10/test.zip.

Once you copy the zip file to HDFS (or the local file system), you can do something like this in Scala (I just confirmed that it works):

import com.microsoft.ml.spark.Readers.implicits._
val images = spark.readImages("file:///home/mmlspark/test.zip", true, 1.0, true)

images.printSchema()

/* This produces
root
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- type: integer (nullable = true)
 |    |-- bytes: binary (nullable = true)
*/

images.selectExpr("image.width as w", "image.height as h", "image.bytes as b").show()

/* This produces:
+---+---+--------------------+
|  w|  h|                   b|
+---+---+--------------------+
| 32| 32|[5B 65 6D 62 68 6...|
| 32| 32|[01 01 01 01 01 0...|
| 32| 32|[F8 F8 F8 F6 F6 F...|
| 32| 32|[99 98 94 5E 5C 5...|
| 32| 32|[DC DF D7 C5 CE B...|
| 32| 32|[B0 D5 F3 AE D3 F...|
| 32| 32|[6B 38 22 67 36 2...|
| 32| 32|[4C 69 64 53 65 6...|
| 32| 32|[DC A1 70 DC A1 7...|
| 32| 32|[BB CE DF BA CD D...|
| 32| 32|[22 27 29 1D 21 2...|
| 32| 32|[F5 FE FD AE BC B...|
| 32| 32|[C5 BA AA C9 BC A...|
| 32| 32|[95 77 84 95 78 8...|
| 32| 32|[91 B5 C1 93 AC B...|
| 32| 32|[7C 8C A7 72 8B A...|
| 32| 32|[31 39 4E 44 4A 6...|
| 32| 32|[34 3A 3C 28 2E 2...|
| 32| 32|[DA CF CB D9 CD C...|
| 32| 32|[27 AB 9C 25 B1 9...|
+---+---+--------------------+
only showing top 20 rows
*/

Please let me know if this helps. Thank you!

Myasuka commented on May 13, 2024

@drdarshan, thanks for your help.
From your reply, it seems that MMLSpark can only read images in Scala, not the original CIFAR binary format from the official site. Did I misunderstand? In other words, if we download the original binary files, do we first need to convert them into images one by one?

drdarshan commented on May 13, 2024

Hello @Myasuka, yes, you would need to transform the images. I might be able to write a UDF to do this from Scala, since it looks like they have a binary format in addition to pickle and MAT. Please let me know if that would help and I can write one for you. Thanks!
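
For reference, the binary version described on the CIFAR site stores each image as one label byte followed by 3072 pixel bytes (1024 red, then 1024 green, then 1024 blue, each in row-major order). The following is only a minimal Python sketch of that layout, not the UDF mentioned above and not part of MMLSpark; the function name `parse_cifar10_binary` is hypothetical.

```python
from typing import Iterator, List, Tuple

# One record = 1 label byte + 32*32 bytes for each of the 3 colour planes.
PLANE_SIZE = 32 * 32
RECORD_SIZE = 1 + 3 * PLANE_SIZE

Pixel = Tuple[int, int, int]


def parse_cifar10_binary(raw: bytes) -> Iterator[Tuple[int, List[Pixel]]]:
    """Yield (label, pixels) pairs from the raw bytes of a CIFAR-10 binary batch.

    `pixels` is a row-major list of 1024 (r, g, b) tuples.
    Hypothetical helper for illustration only.
    """
    if len(raw) % RECORD_SIZE != 0:
        raise ValueError("input length is not a multiple of the record size")
    for start in range(0, len(raw), RECORD_SIZE):
        record = raw[start:start + RECORD_SIZE]
        label = record[0]
        red = record[1:1 + PLANE_SIZE]
        green = record[1 + PLANE_SIZE:1 + 2 * PLANE_SIZE]
        blue = record[1 + 2 * PLANE_SIZE:]
        # Interleave the three planes into per-pixel (r, g, b) tuples.
        yield label, list(zip(red, green, blue))
```

To use it, one would read a whole batch file in binary mode and pass the bytes to the function; the exact file names inside the binary archive are not stated in this thread, so check the archive contents first.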

Myasuka commented on May 13, 2024

Thank you very much for your kind help. I already use my modified cookie-datasets to read the original CIFAR10 binary format data.

BTW, I think you could also share the transform script, since someone else may want to try MMLSpark with Scala but find it hard to read the original CIFAR10 binary format data without the https://mmlspark.azureedge.net/datasets/CIFAR10/test.zip file you provided.

drdarshan commented on May 13, 2024

Hello @Myasuka, here is a simple Python3 script that extracts the images from the original CIFAR Python dataset and writes them out as PNG images. You can then zip the directory and use spark.readImages to read it. Please let me know if this is sufficient.

You might need to adapt it in order to also get the labels - please let me know if you need more help with this.

Thank you!
Sudarshan

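import os
import pickle
import tarfile

from PIL import Image  # "import PIL" alone does not load the PIL.Image submodule

with tarfile.open("cifar-10-python.tar.gz", "r:gz") as f:
    # Batch members are the archive entries whose names contain "_batch".
    for batch in [p for p in f.getnames() if "_batch" in p]:
        print("Extracting: " + batch)
        os.makedirs(batch, exist_ok=True)
        # Unpickling the batch requires NumPy to be installed.
        images = pickle.load(f.extractfile(batch), encoding="latin1")
        data, filenames = images["data"], images["filenames"]
        for img_data, filename in zip(data, filenames):
            # Each row holds 3072 bytes in channel-major order; convert to height x width x channel.
            img = Image.fromarray(img_data.reshape(3, 32, 32).transpose(1, 2, 0))
            img.save(os.path.join(batch, filename))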
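
As a possible starting point for getting the labels as well: each batch dictionary in the CIFAR-10 Python dataset also carries a "labels" list aligned with "filenames". The sketch below pairs them up and writes a CSV. The helper names `label_rows` and `write_labels_csv` and the two-column CSV layout are assumptions for illustration, not something the script above or MMLSpark prescribes.

```python
import csv
from typing import Dict, List, TextIO, Tuple


def label_rows(batch: Dict, batch_dir: str) -> List[Tuple[str, int]]:
    """Pair each extracted image path with its class label.

    `batch` is a dictionary unpickled from a CIFAR-10 batch; `batch_dir` is
    the directory the images were saved to. Hypothetical helper.
    """
    return [
        (f"{batch_dir}/{name}", int(label))
        for name, label in zip(batch["filenames"], batch["labels"])
    ]


def write_labels_csv(rows: List[Tuple[str, int]], out: TextIO) -> None:
    """Write (path, label) rows to an open text stream as CSV with a header."""
    writer = csv.writer(out)
    writer.writerow(["path", "label"])
    writer.writerows(rows)
```

Inside the extraction loop, one could call `label_rows(images, batch)` after loading each batch and write the result next to the PNG files, then join it with the image DataFrame on the path column.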

drdarshan commented on May 13, 2024

Hi @Myasuka, I'm closing this issue for now. Please reopen it if you are still blocked. Thank you!
