I have a sample codebase in Scala where I use OpenCV and ScalaPy for doing some image classification. Here is the code snippet:
def loadImage(imagePath: String): Image = {
  // 0. Load the image and extract the class label, where the path to the image is assumed to be
  //    /path/to/dataset/{class}/{image}.jpg
  val matrix: Mat = imread(imagePath)
  val label = imagePath.split("/").takeRight(2).head
  // 1. Run the loaded image through the preprocessors, resulting in a feature vector
  //val preProcessedImagesWithLabels = Seq(new ImageResizePreProcessor(appCfg)).map(preProcessor => (preProcessor.preProcess(matrix), label))
  val preProcessedImagesWithLabels = Seq(new ImageResizePreProcessor(appCfg)).map(preProcessor => preProcessor.preProcess(matrix))
  np.asarray(preProcessedImagesWithLabels)
}
However, it fails because it cannot find an implicit converter for NumPy:
[error] /home/joesan/Projects/Private/ml-projects/object-classifier/src/main/scala/com/bigelectrons/animalclassifier/ImageLoader.scala:10:34: not found: type NumPy
[error] val np = py.module("numpy").as[NumPy]
What else is needed in addition to the imports? These are my current dependencies:
"me.shadaj" %% "scalapy-numpy" % "0.1.0",
"me.shadaj" %% "scalapy-core" % "0.5.0",
Try the latest "dev" version of scalapy-numpy: 0.1.0+6-14ca0424.
So change the sbt dependency to:
libraryDependencies += "me.shadaj" %% "scalapy-numpy" % "0.1.0+6-14ca0424"
I tried this script in Ammonite:
import me.shadaj.scalapy.numpy.NumPy
import me.shadaj.scalapy.py
val np = py.module("numpy").as[NumPy]
And it finds NumPy as expected.
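For reference, once the dev dependency resolves, this is roughly what the loader above needs in scope (the import and the np value are taken from the question's own error output, so treat it as a sketch rather than a guaranteed API for every ScalaPy version):
import me.shadaj.scalapy.py
import me.shadaj.scalapy.numpy.NumPy
// Obtain the NumPy facade once and reuse it inside loadImage.
// This is the line that previously failed with "not found: type NumPy"
// when only scalapy-numpy 0.1.0 was on the classpath.
val np = py.module("numpy").as[NumPy]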
I find the following simple example hangs indefinitely for me in the Scala REPL (sbt console):
import org.apache.spark.sql._
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext
val rdd = sc.parallelize(1 to 100000000)
val n = rdd.map(_ + 1).sum
However, the following works just fine:
import org.apache.spark.sql._
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext
val rdd1 = sc.parallelize(1 to 100000000)
val rdd2 = rdd1.map(_ + 1)
val n = rdd2.sum
I'm very confused by this, and was hoping somebody had an explanation... assuming they can reproduce the 'issue'.
This is basically just the example provided on the Almond kernel's Spark documentation page, and it works fine in Jupyter using the Almond kernel. Also, sbt "runMain Main" works just fine for the following:
import org.apache.spark.sql._
object Main extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  val sc = spark.sparkContext
  val rdd = sc.parallelize(1 to 100000000)
  val n = rdd.map(_ + 1).sum
  println(s"\n\nn: $n\n\n")
  spark.stop()
}
For completeness, I'm using a very simple build.sbt file, which looks as follows:
name := """sparktest"""
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.6"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6"
I noticed a bunch of errors like the following when I killed the console:
08:53:36 ERROR Executor:70 - Exception in task 2.0 in stage 0.0 (TID 2): Could not initialize class $line3.$read$$iw$$iw$$iw$$iw$
This led me to:
Lambda in REPL (using object-wrappers) + concurrency = deadlock #9076
It seems that my problem is this same issue and is specific to Scala 2.12. Adding the following line to build.sbt seems to be the accepted workaround:
scalacOptions += "-Ydelambdafy:inline"
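For completeness, the full build.sbt with the workaround applied (same coordinates as above) looks roughly like this:
name := """sparktest"""
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.6"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6"
// Workaround for the Scala 2.12 REPL deadlock (scala/bug#9076):
// force inline lambda classes instead of the default method-based delambdafy.
scalacOptions += "-Ydelambdafy:inline"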
I want to scale data with StandardScaler (from pyspark.mllib.feature import StandardScaler). Right now I can do it by passing the values of an RDD to the transform function, but the problem is that I want to preserve the key. Is there any way to scale my data while preserving its key?
Sample dataset
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,smurf.
Imports
import sys
import os
from collections import OrderedDict
from numpy import array
from math import sqrt
try:
    from pyspark import SparkContext, SparkConf
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.feature import StandardScaler
    from pyspark.statcounter import StatCounter
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)
Portion of code
sc = SparkContext(conf=conf)
raw_data = sc.textFile(data_file)
parsed_data = raw_data.map(Parseline)
Parseline function:
def Parseline(line):
    line_split = line.split(",")
    clean_line_split = [line_split[0]] + line_split[4:-1]
    return (line_split[-1], array([float(x) for x in clean_line_split]))
Not exactly a pretty solution, but you can adjust my answer to the similar Scala question. Let's start with some example data:
import numpy as np
np.random.seed(323)
keys = ["foo"] * 50 + ["bar"] * 50
values = (
    np.vstack([np.repeat(-10, 500), np.repeat(10, 500)]).reshape(100, -1) +
    np.random.rand(100, 10)
)
rdd = sc.parallelize(zip(keys, values))
Unfortunately MultivariateStatisticalSummary is just a wrapper around a JVM model and is not really Python friendly. Luckily, with NumPy arrays we can use the standard StatCounter to compute statistics by key:
from pyspark.statcounter import StatCounter
def compute_stats(rdd):
    return rdd.aggregateByKey(
        StatCounter(), StatCounter.merge, StatCounter.mergeStats
    ).collectAsMap()
Finally we can map to normalize:
def scale(rdd, stats):
    def scale_(kv):
        k, v = kv
        return (v - stats[k].mean()) / stats[k].stdev()
    return rdd.map(scale_)
scaled = scale(rdd, compute_stats(rdd))
scaled.first()
## array([ 1.59879188, -1.66816084, 1.38546532, 1.76122047, 1.48132643,
## 0.01512487, 1.49336769, 0.47765982, -1.04271866, 1.55288814])
I am using Spark 1.6.1 with Scala 2.10.5 built in. I am building a set of permutation models using binary numbers, but I've hit a wall for a few days now. Here is the code (partially taken from zero323's answer to RDD to LabeledPoint conversion):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import scala.util.Random.{setSeed, nextDouble}
setSeed(123)
case class Record(
  target: Double, x1: Double, x2: Double, x3: Double, x4: Double, x5: Double, x6: Double)

val rows = sc.parallelize(
  (1 to 50).map(_ => Record(
    nextDouble, nextDouble, nextDouble, nextDouble, nextDouble, nextDouble, nextDouble)))
val df = sqlContext.createDataFrame(rows)
df.registerTempTable("df")
sqlContext.sql("""
  SELECT ROUND(target, 2) target,
         ROUND(x1, 2) x1,
         ROUND(x2, 2) x2,
         ROUND(x3, 2) x3,
         ROUND(x4, 2) x4,
         ROUND(x5, 2) x5,
         ROUND(x6, 2) x6
  FROM df""").show
Now, I want to build binary numbers with 6 digits; the largest of them equates to 63 in base 10:
val max_base10 = 63
val max_binary = "%06d".format(max_base10.toBinaryString.toLong)
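Just to confirm what this produces: the largest 6-digit binary string is indeed "111111", so max_binary ends up holding that string:
scala> 63.toBinaryString
res0: String = 111111
scala> "%06d".format(63.toBinaryString.toLong)
res1: String = 111111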
After that, I create a variable which will allow the column indexes to be permuted in ascending order:
def binary_permutationsSeg1 = for {
  a <- 1 to max_base10
} yield List(Array(0), "%06d".format(a.toBinaryString.toLong).split("").map(_.toInt)).flatten // Array(0) is added as target is not an explanatory variable
This is just to make sure I get what I'm after:
binary_permutationsSeg1(62)
Now I wish to build my set of permutation models, for which I MUST have an intercept. I've learned the hard way that train() does not build an intercept into its algorithm:
var regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
regression.optimizer.setNumIterations(100)
This bit is not very elegant as I create a double def... Also, I'd like to automatically create multiModel01, multiModel02, etc. in memory, but I'm not able to do it:
def multiModel = for {
  a <- 1 to binary_permutationsSeg1.size - 1
} yield df.rdd.map(r => LabeledPoint(
  r.getDouble(1),
  Vectors.dense(binary_permutationsSeg1(a).zipWithIndex.filter(_._1 == 1).unzip._2.map(r.getDouble(_)).toArray))).cache()

def run_multiModel = for {
  a <- 1 to binary_permutationsSeg1.size - 1
} yield regression.run(multiModel(a))
run_multiModel
When I run the line above, I don't get the 62 models I am after, which leaves me perplexed.
These last two snippets are to do with model evaluation (taken from zero323 again), but I cannot make it so that it would automatically construct valuesAndPreds01, valuesAndPreds02 and so forth... whilst bearing in mind that some permutations will give NaN.
val valuesAndPreds = df.rdd.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
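What I am ultimately trying to get to is something like the sketch below, which fits one model per permutation and collects one training MSE per model (msePerModel is just a placeholder name of mine; this automatic construction is exactly the part I cannot get to work):
val msePerModel = multiModel.zipWithIndex.map { case (data, i) =>
  // Fit a model on this permutation's LabeledPoint RDD, then score it on the same data.
  val model = regression.run(data)
  val valuesAndPreds = data.map(point => (point.label, model.predict(point.features)))
  val mse = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
  (i, mse)
}
msePerModel.foreach { case (i, mse) => println(s"permutation $i training MSE = $mse") }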
Thanks for taking the time to look at this problem and thanks for any suggestions or tips in the right direction.
Regards,
Christian
I'm getting into Spark and I have problems with Vectors:
import org.apache.spark.mllib.linalg.{Vectors, Vector}
The input of my program is a text file which contains the output of an RDD[Vector]:
dataset.txt:
[-0.5069793074881704,-2.368342680619545,-3.401324690974588]
[-0.7346396928543871,-2.3407983487917448,-2.793949129209909]
[-0.9174226561793709,-0.8027635530022152,-1.701699021443242]
[0.510736518683609,-2.7304268743276174,-2.418865539558031]
So, what I try to do is:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
I get an error because it reads [0.510736518683609 as a number.
Is there any way to load the vector stored in the text file directly, without the second line? How can I remove the "[" in the map stage?
I'm really new to Spark, sorry if it's a very obvious question.
Given the input, the simplest thing you can do is use Vectors.parse:
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> Vectors.parse("[-0.50,-2.36,-3.40]")
res14: org.apache.spark.mllib.linalg.Vector = [-0.5,-2.36,-3.4]
It also works with sparse representation:
scala> Vectors.parse("(10,[1,5],[0.5,-1.0])")
res15: org.apache.spark.mllib.linalg.Vector = (10,[1,5],[0.5,-1.0])
Combining it with your data all you need is:
rdd.map(Vectors.parse)
If you expect malformed / empty lines you can wrap it using Try:
import scala.util.Try
rdd.map(line => Try(Vectors.parse(line))).filter(_.isSuccess).map(_.get)
Here is one way to do it:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map { s =>
  val vect = s.replaceAll("\\[", "").replaceAll("\\]", "").split(',').map(_.toDouble)
  Vectors.dense(vect)
}
I've just broken the map into lines for readability purposes.
Note: Remember, it's simply string processing on each line.
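If you prefer, the same cleanup can also be written with stripPrefix/stripSuffix, which only removes the enclosing brackets rather than every occurrence (just an alternative sketch of the same idea):
val data = rdd.map { s =>
  val cleaned = s.trim.stripPrefix("[").stripSuffix("]")
  Vectors.dense(cleaned.split(',').map(_.toDouble))
}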