Spark Union inside a loop gives void - scala

I am trying to build an RDD by iteratively unioning another RDD inside a loop, but the result only works if I perform an action on the resulting RDD inside the loop.
var rdd: RDD[Int] = sc.emptyRDD
for (i <- 1 to 5) {
  val rdd1 = sc.parallelize(Array(1))
  rdd = rdd ++ rdd1
}
// rdd.foreach(println) => void

for (i <- 1 to 5) {
  val rdd1 = sc.parallelize(Array(1))
  rdd = rdd ++ rdd1
  rdd.foreach(x => x)
}
// rdd.foreach(println) => (1, 1, 1, 1, 1)
If I create rdd1 outside the loop everything works fine, but not inside it.
Is there a specific lightweight action that solves this problem?

One thing to keep in mind is that when you apply the foreach action to your RDD, the function is applied on each individual worker. Therefore, in the first case, if you check the stdout of each executor, you will find the printed values from rdd. If you want these values to be printed to your console, you can aggregate the elements of the RDD (or a subset of them) at the driver and then apply your function there (e.g. rdd.collect.foreach(println), rdd.take(3).foreach(println), etc.).
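For example, a minimal sketch of the difference (using the rdd built above):
// Runs on the executors: the output goes to each executor's stdout, not to the driver console
rdd.foreach(println)

// Collects the elements to the driver first, then prints them on the driver console
rdd.collect().foreach(println)

// Safer for large RDDs: only bring a small sample back to the driver
rdd.take(3).foreach(println)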

Related

Non Deterministic Behaviour of UNION of RDD in Spark

I'm performing a union operation on 3 RDDs. I'm aware that union doesn't preserve ordering, but in my case the result is quite strange. Can someone explain what's wrong with my code?
I have a DataFrame of rows (myDF) and convert it to an RDD:
val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))
myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
*/
val rowCount = myRdd.count() // Count of Records in myRdd
val header = "name:country:date:nextdate:1" // random header
// Generating the header RDD
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))
// Generating the trailer RDD
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount), 1).map(rec => (3, rec))
// Performing the union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")
Since union doesn't preserve ordering, it should not give the result below:
Output
name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3
Without using any sorting (such as sortByKey(true, 1)), how is it possible to get the above output?
But when I remove the map from headerRdd, myRdd and trailerRdd, the order is:
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
name:country:date:nextdate:1
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3
What is the possible reason for the above behaviour?
In Spark, the elements within a particular partition are unordered, but the partitions themselves are ordered. union simply concatenates the partitions of its inputs in the order they are given, so headerRdd's single partition comes first, followed by myRdd's partitions and then trailerRdd's, which is why the output appears "in order".
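A minimal sketch to see this with a hypothetical three-RDD example (glom() returns the contents of each partition as an array, in partition order; the split of b assumes parallelize's usual even partitioning):
val a = sc.parallelize(Seq("header"), 1)
val b = sc.parallelize(Seq("row1", "row2"), 2)
val c = sc.parallelize(Seq("trailer"), 1)

// glom() collects each partition into an array, preserving partition order
a.union(b).union(c).glom().collect().foreach(p => println(p.mkString(",")))
// Expected output, one line per partition:
// header
// row1
// row2
// trailer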

Getting the number of rows in a Spark dataframe without counting

I am applying many transformations on a Spark DataFrame (filter, groupBy, join). I want to have the number of rows in the DataFrame after each transformation.
I am currently counting the number of rows using count() after each transformation, but this triggers an action every time, which is not really optimal.
I was wondering if there is any way of knowing the number of rows without having to trigger another action than the original job.
You could use an accumulator for each stage and increment it in a map after that stage. Then, after you run your action at the end, you would have a count for all the stages.
import org.apache.spark.sql.functions.{col, lit, max}

val filterCounter = spark.sparkContext.longAccumulator("filter-counter")
val groupByCounter = spark.sparkContext.longAccumulator("group-counter")
val joinCounter = spark.sparkContext.longAccumulator("join-counter")

// note: map on a DataFrame needs an Encoder for Row (e.g. RowEncoder(df.schema)) or a typed Dataset
myDataFrame
  .filter(col("x") === lit(3))
  .map { x =>
    filterCounter.add(1)
    x
  }
  .groupBy(col("x"))
  .agg(max("y"))
  .map { x =>
    groupByCounter.add(1)
    x
  }
  .join(myOtherDataframe, col("x") === col("y"))
  .map { x =>
    joinCounter.add(1)
    x
  }
  .count()

println(s"count for filter = ${filterCounter.value}")
println(s"count for group by = ${groupByCounter.value}")
println(s"count for join = ${joinCounter.value}")
Each operator has a couple of metrics of its own. These metrics are visible in the SQL tab of the Spark UI.
If the UI is not used, we can introspect the query execution object of the DataFrame after execution to access the metrics (internally, accumulators).
Example: df.queryExecution.executedPlan.metrics gives the metrics of the topmost node in the physical plan.
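A minimal sketch of this kind of introspection (assuming a DataFrame df on which an action has already run; the exact metric names depend on the physical operator):
// Run an action first so the metrics are populated
df.count()

// Metrics of the topmost physical operator
df.queryExecution.executedPlan.metrics.foreach { case (name, metric) =>
  println(s"$name = ${metric.value}")
}

// Walk the whole physical plan to see per-node metrics such as "numOutputRows"
df.queryExecution.executedPlan.foreach { node =>
  println(node.nodeName + ": " + node.metrics.map { case (k, m) => s"$k=${m.value}" }.mkString(", "))
}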
Coming back to this question with a bit more experience on Apache Spark, to complement randal's answer.
You can also use a UDF to increment a counter.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, lit, max, udf}
import org.apache.spark.util.LongAccumulator

val filterCounter = spark.sparkContext.longAccumulator("filter-counter")
val groupByCounter = spark.sparkContext.longAccumulator("group-counter")
val joinCounter = spark.sparkContext.longAccumulator("join-counter")

def countUdf(acc: LongAccumulator): UserDefinedFunction = udf { (x: Int) =>
  acc.add(1)
  x
}

myDataFrame
  .filter(col("x") === lit(3))
  .withColumn("x", countUdf(filterCounter)(col("x")))
  .groupBy(col("x"))
  .agg(max("y"))
  .withColumn("x", countUdf(groupByCounter)(col("x")))
  .join(myOtherDataframe, col("x") === col("y"))
  .withColumn("x", countUdf(joinCounter)(col("x")))
  .count()

println(s"count for filter = ${filterCounter.value}")
println(s"count for group by = ${groupByCounter.value}")
println(s"count for join = ${joinCounter.value}")
This should be more efficient because Spark only has to deserialize the column used in the UDF, but it has to be used carefully, as Catalyst can reorder the operations (for example, pushing a filter down before the call to the UDF), which would skew the counts.
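A quick way to check whether Catalyst has moved things around is to inspect the plan before triggering the action. A minimal sketch, assuming the same countUdf, filterCounter and myDataFrame as above:
val instrumented = myDataFrame
  .filter(col("x") === lit(3))
  .withColumn("x", countUdf(filterCounter)(col("x")))

// Look at the analyzed/optimized/physical plans to see where the UDF sits relative to the filter
instrumented.explain(true)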

Can I recursively apply transformations to a Spark dataframe in scala?

Noodling around with Spark, using union to build up a suitably large test dataset. This works OK:
val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
df.union(df).union(df).count()
But I'd like to do something like this:
val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
for (a <- 1 until 10) {
  df = df.union(df)
}
that barfs with error
<console>:27: error: reassignment to val
df = df.union(df)
^
I know this technique would work in Python, but this is my first time using Scala, so I'm unsure of the syntax.
How can I recursively union a dataframe with itself n times?
If you use val, the reference is immutable, which means you can't reassign it. If you change your definition to var df, your code should work.
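For instance, a minimal sketch of the mutable-variable version (note that unioning df with itself doubles the row count on every iteration):
var df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
for (a <- 1 until 10) {
  // each pass doubles the data, so 9 iterations yields 2^9 copies of the original rows
  df = df.union(df)
}
df.count()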
A functional approach without mutable data is:
val df = List(1,2,3,4,5).toDF
val bigDf = ( for (a <- 1 until 10) yield df ) reduce (_ union _)
The for comprehension creates an IndexedSeq of the specified length containing your DataFrame, and reduce unions the first DataFrame with the second, then continues with the result.
Even shorter without the for loop:
val df = List(1,2,3,4,5).toDF
val bigDf = 1 until 10 map (_ => df) reduce (_ union _)
You could also do this with tail recursion using an arbitrary range:
import scala.annotation.tailrec

@tailrec
def bigUnion(rng: Range, df: DataFrame): DataFrame = {
  if (rng.isEmpty) df
  else bigUnion(rng.tail, df.union(df)) // note: each step unions df with itself, doubling the data
}

val resultingBigDF = bigUnion(1.to(10), myDataFrame)
Please note this is untested code based on similar things I have done.
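If the intent is to union the same base DataFrame a fixed number of times (linear growth) rather than doubling at every step, a variant along the same lines could look like the sketch below (bigUnionOf is a hypothetical name):
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame

@tailrec
def bigUnionOf(n: Int, base: DataFrame, acc: DataFrame): DataFrame = {
  if (n <= 0) acc
  else bigUnionOf(n - 1, base, acc.union(base)) // keep the original base fixed, grow the accumulator
}

// 10 copies of myDataFrame in total (the accumulator starts as one copy)
val linearBigDF = bigUnionOf(9, myDataFrame, myDataFrame)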

Finding the union of RDDs which may not exist

I'm trying to get the union of a few RDDs. The RDDs are being read in via SparkContext.textFile, but some may not exist on the file system.
val rdd1 = Try(Repository.fetch(data1Path))
val rdd2 = Try(Repository.fetch(data2Path))
val rdd3 = Try(Repository.fetch(data3Path))
val rdd4 = Try(Repository.fetch(data4Path))
val all = Seq(rdd1, rdd2, rdd3, rdd4)
val union = sc.union(all.map {case Success(r) => r})
val results = union.filter(some-filter-logic).collect
However, due to lazy evaluation, all those Try statements evaluate to Success regardless of whether the files are present, and I end up with a FileNotFoundException when collect is called.
Is there a way around this?
You can run a loop that checks whether each file exists, create the RDDs only for the files that do, and then union them, as in the sketch below.
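A minimal sketch of that approach, assuming the paths live on a Hadoop-compatible file system and using sc.textFile directly in place of Repository.fetch (data1Path etc. as in the question):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val paths = Seq(data1Path, data2Path, data3Path, data4Path)

// keep only the paths that actually exist, then read and union them
val existing = paths.filter(p => fs.exists(new Path(p)))
val union =
  if (existing.isEmpty) sc.emptyRDD[String]
  else sc.union(existing.map(p => sc.textFile(p)))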
OR
you can use the wholeTextFiles API to read all the files present in a directory as (filename, content) pairs:
val rdd = sc.wholeTextFiles(path, minPartitions)
Even if some of the files are empty, it will not cause any issue.

Access joined RDD fields in a readable way

I joined 2 RDDs, and now when I try to access the fields of the new RDD I have to treat them as tuples, which leads to code that is not very readable. I tried using type to create some aliases, but it doesn't work and I still need to access the fields as tuples. Any idea how to make the code more readable?
For example, when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x => x._2._2._5 != "temp")
I would like to use names instead of _2, _5, etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true))))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))

rdd1
  .join(rdd2)
  .map {
    case (name, ((age, isProlonged), payments)) => (name, payments.sum)
  }
  .filter {
    case (name, sum) => sum > 0
  }
  .collect()

res0: Array[(String, Int)] = Array((John,280))
Another option is to use the DataFrame abstraction on top of RDDs and write SQL-style queries against named columns.
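For example, a minimal sketch of the DataFrame route (the column names and sample data here are hypothetical; toDF is available after importing spark.implicits._):
import spark.implicits._
import org.apache.spark.sql.functions.sum

val people = Seq(("John", 28, true), ("Mary", 22, true)).toDF("name", "age", "isProlonged")
val payments = Seq(("John", 100), ("John", 200), ("John", -20)).toDF("name", "amount")

// join and filter by column name instead of tuple positions like _2._2._5
people
  .join(payments, "name")
  .groupBy($"name")
  .agg(sum($"amount").as("total"))
  .filter($"total" > 0)
  .show()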