How to convert my PySpark code into Scala?

I am a Scala beginner. I now have to convert some code I wrote in PySpark to Scala. The code just extracts fields for modeling.
Could someone show me how to write the following code in Scala, or at least point me to where I could find a quick answer? Thanks so much!
Here is my previous code:
records = rawdata.map(lambda x: x.split(","))
data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))
...

def extract_features(record):
    return np.array(map(float, record[2:16]))

def extract_label(record):
    return float(record[16])

It goes like this:
scala> def extract_label(record: Array[String]): Float = { record(16).toFloat }
extract_label: (record: Array[String])Float
scala> def extract_features(record: Array[String]): Array[Float] = { val newArray = new Array[Float](14); for(i <- 2 until 16) newArray(i-2)=record(i).toFloat; newArray;}
extract_features: (record: Array[String])Array[Float]
There may be a more direct way to express the above logic.
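For example, a sketch of the same two helpers written more concisely with slice and map (assuming, as above, that fields 2 through 15 are the features and field 16 is the label):

// Equivalent to the loop above: take fields 2..15 and convert each to Float.
def extract_features(record: Array[String]): Array[Float] =
  record.slice(2, 16).map(_.toFloat)

def extract_label(record: Array[String]): Float =
  record(16).toFloat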
Test:
scala> records.map(x => extract_label(x)).take(5).foreach(println)
4.9
scala> records.map(x => extract_features(x).mkString(",")).take(5).foreach(println)
6.4,2.5,4.5,2.8,4.7,2.5,6.4,8.5,3.5,6.4,2.9,10.5,6.4,2.2
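Putting it together in the same shape as the PySpark version (a sketch only, assuming the RDD-based MLlib API, whose LabeledPoint takes a Double label and a Vector of Doubles):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val records = rawdata.map(_.split(","))
val data = records.map { r =>
  LabeledPoint(r(16).toDouble, Vectors.dense(r.slice(2, 16).map(_.toDouble)))
}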

Related

Rewriting Apache Spark Scala into PySpark

Community, I'm not great with PySpark, but I'm far less familiar with Scala, so I was hoping someone could help me re-write the following Apache Spark Scala code in PySpark.
If you're going to ask what I have done so far to help myself, I'll honestly say very little, as I'm still in the early days of coding.
So, if you can help re-code the following into PySpark, or put me on the right path so that I can re-code it myself, that would be very helpful:
import org.apache.spark.sql.DataFrame

def readParquet(basePath: String): DataFrame = {
  val parquetDf = spark
    .read
    .parquet(basePath)
  return parquetDf
}

def num(df: DataFrame): Int = {
  val numPartitions = df.rdd.getNumPartitions
  return numPartitions
}

def ram(size: Int): Int = {
  val ramMb = size
  return ramMb
}

def target(size: Int): Int = {
  val targetMb = size
  return targetMb
}

def dp(): Int = {
  val defaultParallelism = spark.sparkContext.defaultParallelism
  return defaultParallelism
}

def files(dp: Int, multiplier: Int, ram: Int, target: Int): Int = {
  val maxPartitions = Math.max(dp * multiplier, Math.ceil(ram / target).toInt)
  return maxPartitions
}

def split(df: DataFrame, max: Int): DataFrame = {
  val repartitionDf = df.repartition(max)
  return repartitionDf
}

def writeParquet(df: DataFrame, targetPath: String) {
  return df.write.format("parquet").mode("overwrite").save(targetPath)
}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("spark-repartition-optimizer-app").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 2001) // example
val parquetDf = readParquet("/blogs/source/airlines.parquet/")
val numPartitions = num(parquetDf)
val ramMb = ram(6510) // approx. df cache size
val targetMb = target(128) // approx. partition size (between 50 and 200 mb)
val defaultParallelism = dp()
val maxPartitions = files(defaultParallelism, 2, ramMb, targetMb)
val repartitionDf = split(parquetDf, maxPartitions)
writeParquet(repartitionDf, "/blogs/optimized/airlines.parquet/")
I simply need to re-code the Scala code to PySpark myself.
This was fixed by including the following module in pyspark.
import module
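As a side note for anyone reading the Scala original before porting it: the single-value wrapper functions and return statements add nothing in Scala, and the whole pipeline condenses to roughly the sketch below. Note that ram / target in the original is integer division; the sketch assumes real division was intended, which is an assumption on my part.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-repartition-optimizer-app").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 2001) // example

val parquetDf = spark.read.parquet("/blogs/source/airlines.parquet/")
val ramMb = 6510    // approx. df cache size
val targetMb = 128  // approx. partition size (between 50 and 200 MB)
val maxPartitions = math.max(
  spark.sparkContext.defaultParallelism * 2,
  math.ceil(ramMb.toDouble / targetMb).toInt)

parquetDf.repartition(maxPartitions)
  .write.format("parquet").mode("overwrite").save("/blogs/optimized/airlines.parquet/")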

Group by of list

I have a list with 5 elements
data = List((List(c1),Y), (List(c1),N), (List(c1),N), (List(c1),Y), (List(c1),Y))
And I want to create a list following:
List((List(c1),Y,0.666), (List(c1),N,0.333))
Any tips on the best way to do this?
I am using Scala, if that's any help.
object Grouping {
  def main(args: Array[String]): Unit = {
    val data = List((List("c1"),"Y"), (List("c1"),"N"), (List("c1"),"N"), (List("c1"),"Y"), (List("c1"),"Y"))
    val result = data.groupBy(grp => (grp._1, grp._2))
      .mapValues(count => BigDecimal(count.size.toDouble).setScale(3)./(BigDecimal(data.size.toDouble).setScale(3))
        .setScale(3, BigDecimal.RoundingMode.HALF_UP))
      .map(k => (k._1._1, k._1._2, k._2)).toList
    println("result==" + result)
  }
}
def calculatePercentages(data: List[(List[String], String)]): List[((List[String], String), BigDecimal)] = {
  val (yesRows, noRows) = data.partition(_._2 == "Y")
  List(
    (yesRows(0), (BigDecimal(yesRows.length) / BigDecimal(data.length)).setScale(3, BigDecimal.RoundingMode.HALF_UP)),
    (noRows(0), (BigDecimal(noRows.length) / BigDecimal(data.length)).setScale(3, BigDecimal.RoundingMode.HALF_UP)))
}
scala> calculatePercentages(data)
res30: List[((List[String], String), BigDecimal)] = List(((List(c1),Y),0.600), ((List(c1),N),0.400))
Thank you very much for your support. Your code worked for my first example; however, with more complex data, such as the list below, the result is not what I expected.
List(
(List(c1, a1),Y),
(List(a1),Y),
(List(c1, a1),N),
(List(a1),N),
(List(a1),Y))
and I want the result to be:
List(
(List(c1, a1),Y, 0.5),
(List(c1, a1),N, 0.5),
(List(a1),Y, 0.66),
(List(a1),N, 0.33),
)
I look forward to your support
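For the follow-up case, one way to get these numbers is to group by the List key first and use each group's size as the denominator (a sketch under that assumption; the ordering of the result may differ, and HALF_UP rounding gives 0.667/0.333 rather than 0.66/0.33):

def percentagesPerGroup(
    data: List[(List[String], String)]): List[(List[String], String, BigDecimal)] =
  data
    .groupBy { case (key, _) => key }          // group rows by their List key
    .toList
    .flatMap { case (key, rows) =>
      rows
        .groupBy { case (_, label) => label }  // then by label within each group
        .toList
        .map { case (label, labelled) =>
          val pct = (BigDecimal(labelled.size) / BigDecimal(rows.size))
            .setScale(3, BigDecimal.RoundingMode.HALF_UP)
          (key, label, pct)
        }
    }

With the more complex five-element list above this yields (List(c1, a1),Y,0.500), (List(c1, a1),N,0.500), (List(a1),Y,0.667), (List(a1),N,0.333).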

Problems with spark datasets

When I execute a function (executeStrategy()) inside mapPartitions on a dataset, it returns a result that I can verify in the debugger, but when I call dataset.show() it shows me an empty table, and I do not know why this happens.
This is for a data mining job at my school. I'm using Windows 10, Scala 2.11.12 and Spark 2.2.0, which work without problems.
case class MyState(code: util.ArrayList[Object], evaluation: util.ArrayList[java.lang.Double])

private def executeStrategy(iter: Iterator[Row]): Iterator[(String, MyState)] = {
  val listBest = new util.ArrayList[State]
  Predicate.fuzzyValues = iter.toList
  for (i <- 0 until conf.runNumber) {
    Strategy.executeStrategy(conf.iterByRun, 1, conf.algorithm("algorithm").asInstanceOf[GeneratorType])
    listBest.addAll(Strategy.getStrategy.listBest)
  }
  val result = postMining(listBest)
  result.map(x => (x.getCode.toString, MyState(x.getCode, x.getEvaluation))).iterator
}

def run(sparkSession: SparkSession, n: Int): Unit = {
  import sparkSession.implicits._
  var data0 = conf.dataBase.repartition(n).persist(StorageLevel.MEMORY_AND_DISK_SER)
  var listBest = new util.ArrayList[State]
  implicit def enc1 = Encoders.bean(classOf[(String, MyState)])
  val data1 = data0.mapPartitions(executeStrategy)
  data1.show(3)
}
I expect the dataset to contain the results of processing each partition, which I can see while debugging, but I get an empty dataset.
I have tried an RDD with the same executeStrategy() function and it returns an RDD with the results. What is the problem with the dataset?
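For comparison, here is a minimal, self-contained sketch of the mapPartitions-on-a-Dataset pattern that does display rows with show(). It uses a case class with plain Scala collections and the tuple/case-class encoders from spark.implicits._ instead of Encoders.bean on a tuple; whether that difference is the root cause of the empty table above is an assumption, not a diagnosis.

import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Plain Scala collections instead of java.util.ArrayList so the built-in
// product encoder can serialize the fields.
case class SimpleState(code: Seq[String], evaluation: Seq[Double])

object MapPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("map-partitions-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides encoders for tuples and case classes

    val data0: Dataset[Row] = spark.range(0, 10).toDF("id")

    // mapPartitions returns one iterator per partition; the encoder for the
    // (String, SimpleState) result is picked up from spark.implicits._.
    val data1: Dataset[(String, SimpleState)] = data0.mapPartitions { iter =>
      val rows = iter.toList
      Iterator(("partition", SimpleState(rows.map(_.getLong(0).toString), rows.map(_ => 1.0))))
    }

    data1.show(3, truncate = false)
    spark.stop()
  }
}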

Not able to see RDD contents

I am using Scala to create an RDD, but when I try to see the contents of the RDD I get the result below:
MapPartitionsRDD[25] at map at <console>:96
I want to see the contents of the RDD. How can I do that?
Below is my Scala code:
object WordCount {
  def main(args: Array[String]): Unit = {
    val textfile = sc.textFile("/user/cloudera/xxx/File")
    val word = textfile.filter(x => x.length > 0).map(_.split('|'))
    println(word)
  }
}
println(word) only prints the RDD's toString. You need to apply an action that materializes the contents, e.g. RDD.collect:
object WordCount {
  def main(args: Array[String]): Unit = {
    val textfile = sc.textFile("/user/cloudera/xxx/File")
    val word = textfile.filter(x => x.length > 0).map(_.split('|'))
    word.collect().foreach(println)
  }
}
If you have an Array[Array[T]], you'll need to flatten before using foreach:
word.collect().flatten.foreach(println)
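If the file is large, collecting the whole RDD to the driver may not be practical; a small alternative sketch is to fetch only a few rows with take:

// Bring only the first 10 rows to the driver instead of the whole RDD.
word.take(10).foreach(arr => println(arr.mkString("|")))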

sortByKey in Spark

New to Spark and Scala. Trying to sort a word counting example. My code is based on this simple example.
I want to sort the results alphabetically by key. If I add the key sort to an RDD:
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
then I get a compile error:
error: No implicit view available from java.io.Serializable => Ordered[java.io.Serializable].
[INFO] val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
I don't know what the lack of an implicit view means. Can someone tell me how to fix it? I am running the Cloudera 5 Quickstart VM. I think it bundles Spark version 0.9.
Source of the Scala job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        Array("NO NAME")
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Some (unsorted) output
("INTERNATIONAL EYELETS INC",879)
("SHAQUITA SALLEY",865)
("PAZ DURIGA",791)
("TERESSA ALCARAZ",824)
("MING CHAIX",878)
("JACKSON SHIELDS YEISER",837)
("AUDRY HULLINGER",875)
("GABRIELLE MOLANDS",802)
("TAM TACKER",775)
("HYACINTH VITELA",837)
"No implicit view" means there is no Scala implicit conversion like this one in scope:
implicit def SerializableToOrdered(x: java.io.Serializable) = new Ordered[java.io.Serializable](x) // note: this function doesn't work
The error appears because your function f returns two different types (a String in one branch and an Array[String] in the other), so the inferred key type is their common supertype, java.io.Serializable. sortByKey requires an Ordering (an implicit view to Ordered) for the key type, and there is none for java.io.Serializable. Fix it like this:
object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        "NO NAME"
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Now f always returns a String, so the key type is String, an Ordering[String] is available, and sortByKey compiles.
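As a side note (not part of the original answer), if you are on Spark 1.0 or later, sortBy is an alternative when you want a different sort key or descending order; a small sketch:

// Sort the (name, count) pairs by name in descending order.
val sortedDesc = wordCounts.sortBy(_._1, ascending = false)

// Or sort by the counts instead of the names.
val byCount = wordCounts.sortBy(_._2, ascending = false)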