How to implement the loop function already implemented in python with scala - scala

enter image description here
I write some code in scala but i am stuck on the loop function.
import math._
object Haversine {
val R = 6372.8 //radius in km
def haversine(lat1:Double, lon1:Double, lat2:Double, lon2:Double)={
val dLat=(lat2 - lat1).toRadians
val dLon=(lon2 - lon1).toRadians
val a = pow(sin(dLat/2),2) + pow(sin(dLon/2),2) * cos(lat1.toRadians) * cos(lat2.toRadians)
val c = 2 * asin(sqrt(a))
R * c
}
def main(args: Array[String]): Unit = {
println(haversine(36.12, -86.67, 33.94, -118.40))
}
}
import org.apache.spark.sql.SparkSession
import Haversine.haversine
object Position {
def main(args: Array[String]): Unit = {
// create Spark DataFrame with Spark configuration
val spark= SparkSession.builder().getOrCreate()
// Read csv with DataFrame
val file1 = spark.read.csv("file:///home/aaron/Downloads/taxi_gps.txt")
val file2 = spark.read.csv("file:///home/aaron/Downloads/district.txt")
//change the name
val new_file1= file1.withColumnRenamed("_c0","id")
.withColumnRenamed("_c4","lat")
.withColumnRenamed("_c5","lon")
val new_file2= file2.withColumnRenamed("_c0","dis")
.withColumnRenamed("_1","lat")
.withColumnRenamed("_2","lon")
.withColumnRenamed("_c3","r")
//count
}
}
I am not familiar with scala,it is quite a tough question for me.
I hope you guys can help me,thx!

Before implementing you need to define and implement method cal_distance out side main method but in same class
def cal_distance(lon: Float, lat: Float, taxiLon: Float, taxiLat: Float) : Float = {
val distance = 0.0f
// write your geopy scala code here
distance
}
You code should in Scala should be similar to something as below
new_file2.foreach(row => {
val district = row.getAs[Float]("dis")
val lon = row.getAs[Float]("lon")
val lat = row.getAs[Float]("lat")
val distance = row.getAs[Float]("r")
var temp = 0
new_file1.foreach(taxi => {
val taxiLon = taxi.getAs[Float]("lon")
val taxiLat = taxi.getAs[Float]("lat")
if(cal_distance(lon,lat,taxiLon,taxiLat) <= distance) {
temp+=1
}
})
println(s"district:${district} temp=${temp}")
})

Related

Rewriting Apache Spark Scala into PySpark

Community, I'm not familiar with Scala and not so great with PySpark. However, I'm much less familiar with Scala and therefore was hoping if someone could let me know if someone could help me re-write the following Apache Spark Scala to PySpark.
If you're going to ask what I have done so far to help myself, I'm going to honestly say very little, as I'm still in the early days of coding.
So, if you can help re-code the following into PySpark, or put me on the right path so that I can re-code it myself, that would be very helpful
import org.apache.spark.sql.DataFrame
def readParquet(basePath: String): DataFrame = {
val parquetDf = spark
.read
.parquet(basePath)
return parquetDf
}
def num(df: DataFrame): Int = {
val numPartitions = df.rdd.getNumPartitions
return numPartitions
}
def ram(size: Int): Int = {
val ramMb = size
return ramMb
}
def target(size: Int): Int = {
val targetMb = size
return targetMb
}
def dp(): Int = {
val defaultParallelism = spark.sparkContext.defaultParallelism
return defaultParallelism
}
def files(dp: Int, multiplier: Int, ram: Int, target: Int): Int = {
val maxPartitions = Math.max(dp * multiplier, Math.ceil(ram / target).toInt)
return maxPartitions
}
def split(df: DataFrame, max: Int): DataFrame = {
val repartitionDf = df.repartition(max)
return repartitionDf
}
def writeParquet(df: DataFrame, targetPath: String) {
return df.write.format("parquet").mode("overwrite").save(targetPath)
}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("spark-repartition-optimizer-app").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 2001) // example
val parquetDf = readParquet("/blogs/source/airlines.parquet/")
val numPartitions = num(parquetDf)
val ramMb = ram(6510) // approx. df cache size
val targetMb = target(128) // approx. partition size (between 50 and 200 mb)
val defaultParallelism = dp()
val maxPartitions = files(defaultParallelism, 2, ramMb, targetMb)
val repartitionDf = split(parquetDf, maxPartitions)
writeParquet(repartitionDf, "/blogs/optimized/airlines.parquet/")
I simply need to re-code the Scala code to PySpark myself.
This was fixed by including the following module in pyspark.
import module

How to persist the list which we made dynamically from dataFrame in scala spark

def getAnimalName(dataFrame: DataFrame): List[String] = {
dataFrame.select("animal").
filter(col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().collect.toList
}
I am basicaly Calling this function 2 times For getting the list for different purposes . I just want to know is there a way to retain the list in memory and we dont have to call the same function again and again to generate the list and only have to generate the list only one time in scala spark.
Try something as below and you can also check the performance using time func.
Also find the code explanation inline
import org.apache.spark.rdd
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, functions}
object HandleCachedDF {
var cachedAnimalDF : rdd.RDD[String] = _
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
val df = spark.read.json("src/main/resources/hugeTest.json") // Load your Dataframe
val df1 = time[rdd.RDD[String]] {
getAnimalName(df)
}
val resultList = df1.collect().toList
val df2 = time{
getAnimalName(df)
}
val resultList1 = df2.collect().toList
println(resultList.equals(resultList1))
}
def getAnimalName(dataFrame: DataFrame): rdd.RDD[String] = {
if (cachedAnimalDF == null) { // Check if this the first initialization of your dataframe
cachedAnimalDF = dataFrame.select("animal").
filter(functions.col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().cache() // Cache your dataframe
}
cachedAnimalDF // Return your cached dataframe
}
def time[R](block: => R): R = { // COmpute the time taken by function to execute
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
}
You would have to persist or cache at this point
dataFrame.select("animal").
filter(col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().persist
and then call the function as follow
def getAnimalName(dataFrame: DataFrame): List[String] = {
dataFrame.collect.toList
}
as many times as you need it without repeat the process.
I hope it helps.

How to implement the Seq.grouped(size:Int): Seq[Seq[A]] for Dataset in Spark

I want to try to implement the
def grouped(size: Int): Iterator[Repr] that Seq has but for Dataset in Spark.
So the input should be ds: Dataset[A], size: Int and output Seq[Dataset[A]] where each of the Dataset[A] in the output can't be bigger than size.
How should I proceed ? I tried with repartition and mapPartitions but I am not sure where to go from there.
Thank you.
Edit: I found the glom method in RDD but it produce a RDD[Array[A]] how do I go from this to the other way around Array[RDD[A]] ?
here you go, something that you want
/*
{"countries":"pp1"}
{"countries":"pp2"}
{"countries":"pp3"}
{"countries":"pp4"}
{"countries":"pp5"}
{"countries":"pp6"}
{"countries":"pp7"}
*/
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.{SparkConf, SparkContext};
object SparkApp extends App {
override def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Simple Application").setMaster("local").set("spark.ui.enabled", "false")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val dataFrame: DataFrame = sqlContext.read.json("/data.json")
val k = 3
val windowSpec = Window.partitionBy("grouped").orderBy("countries")
val newDF = dataFrame.withColumn("grouped", lit("grouping"))
var latestDF = newDF.withColumn("row", row_number() over windowSpec)
val totalCount = latestDF.count()
var lowLimit = 0
var highLimit = lowLimit + k
while(lowLimit < totalCount){
latestDF.where(s"row <= $highLimit and row > $lowLimit").show(false)
lowLimit = lowLimit + k
highLimit = highLimit + k
}
}
}
Here is the solution I found but I am not sure if that can works reliably:
override protected def batch[A](
input: Dataset[A],
batchSize: Int
): Seq[Dataset[A]] = {
val count = input.count()
val partitionQuantity = Math.ceil(count / batchSize).toInt
input.randomSplit(Array.fill(partitionQuantity)(1.0 / partitionQuantity), seed = 0)
}

Spark job returns a different result on each run

I am working on a scala code which performs Linear Regression on certain datasets. Right now I am using 20 cores and 25 executors and everytime I run a Spark job I get a different result.
The input size of the files are 2GB and 400 MB.However, when I run the job with 20 cores and 1 executor, I get consistent results.
Has anyone experienced such a thing so far?
Please find the code below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SchemaRDD
import org.apache.spark.Partitioner
import org.apache.spark.storage.StorageLevel
object TextProcess{
def main(args: Array[String]){
val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val numExecutors=(conf.get("spark.executor.instances").toInt)
// Read the 2 input files
// First file is either cases / controls
val input1 = sc.textFile(args(0))
// Second file is Gene Expression
val input2 = sc.textFile(args(1))
//collecting header information
val header1=sc.parallelize(input1.take(1))
val header2=sc.parallelize(input2.take(1))
//mapping data without the header information
val map1 = input1.subtract(header1).map(x => (x.split(" ")(0)+x.split(" ")(1), x))
val map2 = input2.subtract(header2).map(x => (x.split(" ")(0)+x.split(" ")(1), x))
//joining data. here is where the order was getting affected.
val joinedMap = map1.join(map2)
//adding the header back to the top of RDD
val x = header1.union(joinedMap.map{case(x,(y,z))=>y})
val y = header2.union(joinedMap.map{case(x,(y,z))=>z})
//removing irrelevant columns
val rddX = x.map(x=>x.split(" ").drop(3)).zipWithIndex.map{case(a,b)=> a.map(x=>b.toString+" "+x.toString)}
val rddY = y.map(x=>x.split(" ").drop(2)).zipWithIndex.map{case(a,b)=> a.map(x=>b.toString+" "+x.toString)}
//transposing and cross joining data. This keeps the identifier at the start
val transposedX = rddX.flatMap(x => x.zipWithIndex.map(x=>x.swap)).reduceByKey((a,b)=> a+":"+b).map{case(a,b)=>b.split(":").sorted}
val transposedY = rddY.flatMap(x => x.zipWithIndex.map(x=>x.swap)).reduceByKey((a,b)=> a+":"+b).map{case(a,b)=>b.split(":").sorted}.persist(StorageLevel.apply(false, true, false, false, numExecutors))
val cleanedX = transposedX.map(x=>x.map(x=>x.slice(x.indexOfSlice(" ")+1,x.length)))
val cleanedY = transposedY.map(x=>x.map(x=>x.slice(x.indexOfSlice(" ")+1,x.length))).persist(StorageLevel.apply(false, true, false, false, numExecutors))
val cartXY = cleanedX.cartesian(cleanedY)
val finalDataSet= cartXY.map{case(a,b)=>a zip b}
//convert to key value pair
val regressiondataset = finalDataSet.map(x=>(x(0),x.drop(1).filter{case(a,b)=> a!="NA" && b!="NA" && a!="null" && b!="null"}.map{case(a,b)=> (a.toDouble, b.toDouble)}))
val linearOutput = regressiondataset.map(s => new LinearRegression(s._1 ,s._2).outputVal)
linearOutput.saveAsTextFile(args(2))
cleanedY.unpersist()
transposedY.unpersist()
}
}
class LinearRegression (val keys: (String, String),val pairs: Array[(Double,Double)]) {
val size = pairs.size
// first pass: read in data, compute xbar and ybar
val sums = pairs.aggregate(new X_X2_Y(0D,0D,0D))(_ + new X_X2_Y(_),_+_)
val bars = (sums.x / size, sums.y / size)
// second pass: compute summary statistics
val sumstats = pairs.foldLeft(new X2_Y2_XY(0D,0D,0D))(_ + new X2_Y2_XY(_, bars))
val beta1 = sumstats.xy / sumstats.x2
val beta0 = bars._2 - (beta1 * bars._1)
val betas = (beta0, beta1)
//println("y = " + ("%4.3f" format beta1) + " * x + " + ("%4.3f" format beta0))
// analyze results
val correlation = pairs.aggregate(new RSS_SSR(0D,0D))(_ + RSS_SSR.build(_, bars, betas),_+_)
val R2 = correlation.ssr / sumstats.y2
val svar = correlation.rss / (size - 2)
val svar1 = svar / sumstats.x2
val svar0 = ( svar / size ) + ( bars._1 * bars._1 * svar1)
val svar0bis = svar * sums.x2 / (size * sumstats.x2)
/* println("R^2 = " + R2)
println("std error of beta_1 = " + Math.sqrt(svar1))
println("std error of beta_0 = " + Math.sqrt(svar0))
println("std error of beta_0 = " + Math.sqrt(svar0bis))
println("SSTO = " + sumstats.y2)
println("SSE = " + correlation.rss)
println("SSR = " + correlation.ssr)*/
def outputVal() = keys._1
+"\t"+keys._2
+"\t"+beta1
+"\t"+beta0
+"\t"+R2
+"\t"+Math.sqrt(svar1)
+"\t"+Math.sqrt(svar0)
+"\t"+sumstats.y2
+"\t"+correlation.rss
+"\t"+correlation.ssr+"\t;
}
object RSS_SSR {
def build(p: (Double,Double), bars: (Double,Double), betas: (Double,Double)): RSS_SSR = {
val fit = (betas._2 * p._1) + betas._1
val rss = (fit-p._2) * (fit-p._2)
val ssr = (fit-bars._2) * (fit-bars._2)
new RSS_SSR(rss, ssr)
}
}
class RSS_SSR(val rss: Double, val ssr: Double) {
def +(p: RSS_SSR): RSS_SSR = new RSS_SSR(rss+p.rss, ssr+p.ssr)
}
class X_X2_Y(val x: Double, val x2: Double, val y: Double) {
def this(p: (Double,Double)) = this(p._1, p._1*p._1, p._2)
def +(p: X_X2_Y): X_X2_Y = new X_X2_Y(x+p.x,x2+p.x2,y+p.y)
}
class X2_Y2_XY(val x2: Double, val y2: Double, val xy: Double) {
def this(p: (Double,Double), bars: (Double,Double)) = this((p._1-bars._1)*(p._1-bars._1), (p._2-bars._2)*(p._2-bars._2),(p._1-bars._1)*(p._2-bars._2))
def +(p: X2_Y2_XY): X2_Y2_XY = new X2_Y2_XY(x2+p.x2,y2+p.y2,xy+p.xy)
}

Chisel: Access to Module Parameters from Tester

How does one access the parameters used to construct a Module from inside the Tester that is testing it?
In the test below I am passing the parameters explicitly both to the Module and to the Tester. I would prefer not to have to pass them to the Tester but instead extract them from the module that was also passed in.
Also I am new to scala/chisel so any tips on bad techniques I'm using would be appreciated :).
import Chisel._
import math.pow
class TestA(dataWidth: Int, arrayLength: Int) extends Module {
val dataType = Bits(INPUT, width = dataWidth)
val arrayType = Vec(gen = dataType, n = arrayLength)
val io = new Bundle {
val i_valid = Bool(INPUT)
val i_data = dataType
val i_array = arrayType
val o_valid = Bool(OUTPUT)
val o_data = dataType.flip
val o_array = arrayType.flip
}
io.o_valid := io.i_valid
io.o_data := io.i_data
io.o_array := io.i_array
}
class TestATests(c: TestA, dataWidth: Int, arrayLength: Int) extends Tester(c) {
val maxData = pow(2, dataWidth).toInt
for (t <- 0 until 16) {
val i_valid = rnd.nextInt(2)
val i_data = rnd.nextInt(maxData)
val i_array = List.fill(arrayLength)(rnd.nextInt(maxData))
poke(c.io.i_valid, i_valid)
poke(c.io.i_data, i_data)
(c.io.i_array, i_array).zipped foreach {
(element,value) => poke(element, value)
}
expect(c.io.o_valid, i_valid)
expect(c.io.o_data, i_data)
(c.io.o_array, i_array).zipped foreach {
(element,value) => poke(element, value)
}
step(1)
}
}
object TestAObject {
def main(args: Array[String]): Unit = {
val tutArgs = args.slice(0, args.length)
val dataWidth = 5
val arrayLength = 6
chiselMainTest(tutArgs, () => Module(
new TestA(dataWidth=dataWidth, arrayLength=arrayLength))){
c => new TestATests(c, dataWidth=dataWidth, arrayLength=arrayLength)
}
}
}
If you make the arguments dataWidth and arrayLength members of TestA you can just reference them. In Scala this can be accomplished by inserting val into the argument list:
class TestA(val dataWidth: Int, val arrayLength: Int) extends Module ...
Then you can reference them from the test as members with c.dataWidth or c.arrayLength