Rewriting Apache Spark Scala into PySpark

Community, I'm not great with PySpark, and I'm even less familiar with Scala, so I was hoping someone could help me rewrite the following Apache Spark Scala code into PySpark.
If you're going to ask what I have done so far to help myself, I'll honestly say very little, as I'm still in the early days of coding.
So, if you can help re-code the following into PySpark, or put me on the right path so that I can re-code it myself, that would be very helpful.
import org.apache.spark.sql.DataFrame

def readParquet(basePath: String): DataFrame = {
  val parquetDf = spark
    .read
    .parquet(basePath)
  return parquetDf
}

def num(df: DataFrame): Int = {
  val numPartitions = df.rdd.getNumPartitions
  return numPartitions
}

def ram(size: Int): Int = {
  val ramMb = size
  return ramMb
}

def target(size: Int): Int = {
  val targetMb = size
  return targetMb
}

def dp(): Int = {
  val defaultParallelism = spark.sparkContext.defaultParallelism
  return defaultParallelism
}

def files(dp: Int, multiplier: Int, ram: Int, target: Int): Int = {
  val maxPartitions = Math.max(dp * multiplier, Math.ceil(ram / target).toInt)
  return maxPartitions
}

def split(df: DataFrame, max: Int): DataFrame = {
  val repartitionDf = df.repartition(max)
  return repartitionDf
}

def writeParquet(df: DataFrame, targetPath: String) {
  return df.write.format("parquet").mode("overwrite").save(targetPath)
}

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-repartition-optimizer-app").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 2001) // example

val parquetDf = readParquet("/blogs/source/airlines.parquet/")
val numPartitions = num(parquetDf)
val ramMb = ram(6510) // approx. df cache size
val targetMb = target(128) // approx. partition size (between 50 and 200 mb)
val defaultParallelism = dp()
val maxPartitions = files(defaultParallelism, 2, ramMb, targetMb)
val repartitionDf = split(parquetDf, maxPartitions)
writeParquet(repartitionDf, "/blogs/optimized/airlines.parquet/")

I simply need to re-code the Scala code to PySpark myself.

This was fixed by including the following module in PySpark:
import module
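For what it's worth, here is a rough, untested PySpark sketch of the Scala pipeline above. The helper-function and variable names are my own, the paths and sizes are just the example values from the question, and note that Python's / is float division, whereas the Scala ram / target is integer division.

import math
from pyspark.sql import SparkSession

def read_parquet(spark, base_path):
    return spark.read.parquet(base_path)

def num_partitions(df):
    return df.rdd.getNumPartitions()

def default_parallelism(spark):
    return spark.sparkContext.defaultParallelism

def max_partitions(parallelism, multiplier, ram_mb, target_mb):
    # Python's / is float division, so ceil(6510 / 128) gives 51 here,
    # while the Scala ram / target truncates to 50 before the ceil
    return max(parallelism * multiplier, math.ceil(ram_mb / target_mb))

def split(df, n):
    return df.repartition(n)

def write_parquet(df, target_path):
    df.write.format("parquet").mode("overwrite").save(target_path)

spark = SparkSession.builder.appName("spark-repartition-optimizer-app").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 2001)  # example

parquet_df = read_parquet(spark, "/blogs/source/airlines.parquet/")
num_parts = num_partitions(parquet_df)  # not used downstream, kept for parity with the Scala
ram_mb = 6510    # approx. df cache size
target_mb = 128  # approx. partition size (between 50 and 200 MB)
n = max_partitions(default_parallelism(spark), 2, ram_mb, target_mb)
repartition_df = split(parquet_df, n)
write_parquet(repartition_df, "/blogs/optimized/airlines.parquet/")

The trivial ram(size) and target(size) wrappers from the Scala version are folded into plain variables here; otherwise the structure follows the original one-for-one.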

Related

How to implement the loop function already implemented in Python with Scala

I wrote some code in Scala but I am stuck on the loop function.
import math._

object Haversine {
  val R = 6372.8 // radius in km

  def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double) = {
    val dLat = (lat2 - lat1).toRadians
    val dLon = (lon2 - lon1).toRadians
    val a = pow(sin(dLat / 2), 2) + pow(sin(dLon / 2), 2) * cos(lat1.toRadians) * cos(lat2.toRadians)
    val c = 2 * asin(sqrt(a))
    R * c
  }

  def main(args: Array[String]): Unit = {
    println(haversine(36.12, -86.67, 33.94, -118.40))
  }
}
import org.apache.spark.sql.SparkSession
import Haversine.haversine

object Position {
  def main(args: Array[String]): Unit = {
    // create Spark DataFrame with Spark configuration
    val spark = SparkSession.builder().getOrCreate()
    // Read csv with DataFrame
    val file1 = spark.read.csv("file:///home/aaron/Downloads/taxi_gps.txt")
    val file2 = spark.read.csv("file:///home/aaron/Downloads/district.txt")
    // change the names
    val new_file1 = file1.withColumnRenamed("_c0", "id")
      .withColumnRenamed("_c4", "lat")
      .withColumnRenamed("_c5", "lon")
    val new_file2 = file2.withColumnRenamed("_c0", "dis")
      .withColumnRenamed("_1", "lat")
      .withColumnRenamed("_2", "lon")
      .withColumnRenamed("_c3", "r")
    // count
  }
}
I am not familiar with Scala, so this is quite a tough question for me.
I hope you can help me, thanks!
Before implementing the loop, you need to define and implement a method cal_distance outside the main method but in the same class:
def cal_distance(lon: Float, lat: Float, taxiLon: Float, taxiLat: Float): Float = {
  val distance = 0.0f
  // write your geopy scala code here
  distance
}
Your Scala code should then look similar to the snippet below:
new_file2.foreach(row => {
  val district = row.getAs[Float]("dis")
  val lon = row.getAs[Float]("lon")
  val lat = row.getAs[Float]("lat")
  val distance = row.getAs[Float]("r")
  var temp = 0
  new_file1.foreach(taxi => {
    val taxiLon = taxi.getAs[Float]("lon")
    val taxiLat = taxi.getAs[Float]("lat")
    if (cal_distance(lon, lat, taxiLon, taxiLat) <= distance) {
      temp += 1
    }
  })
  println(s"district:${district} temp=${temp}")
})

How to persist the list which we made dynamically from a DataFrame in Scala Spark

def getAnimalName(dataFrame: DataFrame): List[String] = {
  dataFrame.select("animal").
    filter(col("animal").isNotNull && col("animal").notEqual("")).
    rdd.map(r => r.getString(0)).distinct().collect.toList
}
I am basically calling this function twice to get the list for different purposes. I just want to know: is there a way to retain the list in memory, so we don't have to call the same function again and again and only have to generate the list once in Scala Spark?
Try something like the code below; you can also check the performance using the time function.
The explanation is inline in the comments.
import org.apache.spark.rdd
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, functions}

object HandleCachedDF {

  var cachedAnimalDF: rdd.RDD[String] = _

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    val df = spark.read.json("src/main/resources/hugeTest.json") // Load your Dataframe
    val df1 = time[rdd.RDD[String]] {
      getAnimalName(df)
    }
    val resultList = df1.collect().toList
    val df2 = time {
      getAnimalName(df)
    }
    val resultList1 = df2.collect().toList
    println(resultList.equals(resultList1))
  }

  def getAnimalName(dataFrame: DataFrame): rdd.RDD[String] = {
    if (cachedAnimalDF == null) { // Check if this is the first initialization of your dataframe
      cachedAnimalDF = dataFrame.select("animal").
        filter(functions.col("animal").isNotNull && col("animal").notEqual("")).
        rdd.map(r => r.getString(0)).distinct().cache() // Cache your dataframe
    }
    cachedAnimalDF // Return your cached dataframe
  }

  def time[R](block: => R): R = { // Compute the time taken by the function to execute
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0) + "ns")
    result
  }
}
You would have to persist or cache at this point, for example by keeping the transformed RDD in a val:
val animalRdd = dataFrame.select("animal").
  filter(col("animal").isNotNull && col("animal").notEqual("")).
  rdd.map(r => r.getString(0)).distinct().persist()
and then call the function as follows
def getAnimalName(animals: RDD[String]): List[String] = {
  animals.collect.toList
}
as many times as you need without repeating the process.
I hope it helps.

How to implement the Seq.grouped(size:Int): Seq[Seq[A]] for Dataset in Spark

I want to try to implement the def grouped(size: Int): Iterator[Repr] that Seq has, but for Dataset in Spark.
So the input should be ds: Dataset[A], size: Int and the output Seq[Dataset[A]], where each Dataset[A] in the output can't be bigger than size.
How should I proceed? I tried with repartition and mapPartitions but I am not sure where to go from there.
Thank you.
Edit: I found the glom method on RDD, but it produces an RDD[Array[A]]; how do I go from this to the other way around, an Array[RDD[A]]?
Here you go, something like what you want:
/*
{"countries":"pp1"}
{"countries":"pp2"}
{"countries":"pp3"}
{"countries":"pp4"}
{"countries":"pp5"}
{"countries":"pp6"}
{"countries":"pp7"}
*/
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.{SparkConf, SparkContext}

object SparkApp extends App {
  override def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local").set("spark.ui.enabled", "false")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val dataFrame: DataFrame = sqlContext.read.json("/data.json")
    val k = 3
    val windowSpec = Window.partitionBy("grouped").orderBy("countries")
    val newDF = dataFrame.withColumn("grouped", lit("grouping"))
    var latestDF = newDF.withColumn("row", row_number() over windowSpec)
    val totalCount = latestDF.count()
    var lowLimit = 0
    var highLimit = lowLimit + k
    while (lowLimit < totalCount) {
      latestDF.where(s"row <= $highLimit and row > $lowLimit").show(false)
      lowLimit = lowLimit + k
      highLimit = highLimit + k
    }
  }
}
Here is the solution I found, but I am not sure whether it works reliably:
override protected def batch[A](
    input: Dataset[A],
    batchSize: Int
): Seq[Dataset[A]] = {
  val count = input.count()
  // use toDouble so the Long division is not truncated before the ceil
  val partitionQuantity = Math.ceil(count.toDouble / batchSize).toInt
  input.randomSplit(Array.fill(partitionQuantity)(1.0 / partitionQuantity), seed = 0)
}

Problems with spark datasets

When I execute a function in a mapPartitions of a dataset (executeStrategy()), it returns a result that I can check by debugging, but when I use dataset.show() it shows me an empty table, and I do not know why this happens.
This is for a data mining job at my school. I'm using Windows 10, Scala 2.11.12 and Spark 2.2.0, which work without problems.
case class MyState(code: util.ArrayList[Object], evaluation: util.ArrayList[java.lang.Double])

private def executeStrategy(iter: Iterator[Row]): Iterator[(String, MyState)] = {
  val listBest = new util.ArrayList[State]
  Predicate.fuzzyValues = iter.toList
  for (i <- 0 until conf.runNumber) {
    Strategy.executeStrategy(conf.iterByRun, 1, conf.algorithm("algorithm").asInstanceOf[GeneratorType])
    listBest.addAll(Strategy.getStrategy.listBest)
  }
  val result = postMining(listBest)
  result.map(x => (x.getCode.toString, MyState(x.getCode, x.getEvaluation))).iterator
}

def run(sparkSession: SparkSession, n: Int): Unit = {
  import sparkSession.implicits._
  var data0 = conf.dataBase.repartition(n).persist(StorageLevel.MEMORY_AND_DISK_SER)
  var listBest = new util.ArrayList[State]
  implicit def enc1 = Encoders.bean(classOf[(String, MyState)])
  val data1 = data0.mapPartitions(executeStrategy)
  data1.show(3)
}
I expect the dataset to contain the results of processing each partition, which I can see when I debug, but I get an empty dataset.
I have tried an RDD with the same executeStrategy() function and it returns an RDD with the results. What is the problem with the dataset?

Reduce two Scala methods, that only differ in one Object Type

I have the following two methods, using objects from Apache Spark.
def SVMModelScoring(sc: SparkContext, scoringDataset: String, modelFileName: String): RDD[(Double, Double)] = {
  val model = SVMModel.load(sc, modelFileName)
  val scoreAndLabels =
    MLUtils.loadLibSVMFile(sc, scoringDataset).randomSplit(Array(0.1), seed = 11L)(0).map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
  return scoreAndLabels
}

def DecisionTreeScoring(sc: SparkContext, scoringDataset: String, modelFileName: String): RDD[(Double, Double)] = {
  val model = DecisionTreeModel.load(sc, modelFileName)
  val scoreAndLabels =
    MLUtils.loadLibSVMFile(sc, scoringDataset).randomSplit(Array(0.1), seed = 11L)(0).map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
  return scoreAndLabels
}
My previous attempts to merge these functions have resulted in errors surrounding model.predict.
Is there a way I can use model as a parameter that is weakly typed in Scala?
Disclaimer - I've never used Apache Spark.
It looks to me like the only difference between the two methods is the way the model is instantiated. It's unfortunate that the two model instances don't actually share a common trait that provides predict(...) but we can still make this work by pulling out the part that changes - the scorer:
def scoreWith(sc: SparkContext, scoringDataset: String)(scorer: (Vector) => Double): RDD[(Double, Double)] = {
  MLUtils.loadLibSVMFile(sc, scoringDataset).randomSplit(Array(0.1), seed = 11L)(0).map { point =>
    val score = scorer(point.features)
    (score, point.label)
  }
}
Now we can get the previous functionality with:
def svmScorer(sc: SparkContext, scoringDataset: String, modelFileName: String) =
  scoreWith(sc, scoringDataset)(SVMModel.load(sc, modelFileName).predict)

def dtScorer(sc: SparkContext, scoringDataset: String, modelFileName: String) =
  scoreWith(sc, scoringDataset)(DecisionTreeModel.load(sc, modelFileName).predict)