How to make each partition execute sequentially in a DataFrame with Scala/Spark? - scala

I have a DataFrame and I want the first partition to be executed first, the second partition second, and so on. This is my code, but it does not work. What should I do to make each partition execute sequentially?
import scala.collection.mutable.HashMap

val arr = Array(1, 7, 3, 3, 5, 21, 7, 3, 9, 10)
val df = sc.parallelize(arr, 4).toDF("aa")

// Broadcast an (initially empty) mutable map, intending to use it to signal
// that the previous partition has finished.
val arrbrocast = new HashMap[Int, Double]()
val bro = sc.broadcast(arrbrocast)

val rdd = df.rdd.mapPartitionsWithIndex((partIdx, iter) => {
  var flag = true
  println("----" + bro.value.size)
  // Busy-wait until the previous partition has registered itself.
  while (flag) {
    if (bro.value.contains(partIdx - 1)) {
      flag = false
    }
  }
  bro.value += (partIdx -> 1.0)
  println(bro.value.get(partIdx - 1).get)
  iter
})
rdd.count()
rdd.count()

If you want data to be processed sequentially, don't use Spark. Open the file and read the input stream line by line. Theoretically you can use SparkContext.runJob to process specific partitions, but it is of little use when processing the full dataset.
Also, this is not how broadcast variables work. You should never attempt to modify them while executing tasks.
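That said, a minimal sketch of what driving partitions one at a time with SparkContext.runJob could look like; the processPartition logic below is a made-up placeholder, not the OP's intended per-partition work:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sequential-partitions").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Array(1, 7, 3, 3, 5, 21, 7, 3, 9, 10), 4)

// Placeholder per-partition logic; replace with whatever the partition should do.
val processPartition = (iter: Iterator[Int]) => iter.sum

// runJob blocks until the requested partition finishes, so partition i
// completes before partition i + 1 starts.
val results = (0 until rdd.getNumPartitions).map { i =>
  sc.runJob(rdd, processPartition, Seq(i)).head
}
println(results.mkString(", "))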

Related

print a specific partition of RDD / Dataframe

I have been experimenting with partitions and repartitioning of PySpark RDDs.
I noticed that, when repartitioning a small sample RDD from 2 to 6 partitions, simply a few empty partitions are added.
rdd = sc.parallelize([1,2,3,43,54,678], 2)
rdd.glom().collect()
>>> [[1, 2, 3], [43, 54, 678]]
rdd6 = rdd.repartition(6)
rdd6.glom().collect()
>>> [[], [1, 2, 3], [], [], [], [43, 54, 678]]
Now, I wonder if that also happens in my real data.
It seems I can't use glom() on larger data (df with 192497 rows).
df.rdd.glom().collect()
Because when I try, nothing happens. It makes sense though, the resulting print would be enormous...
SO
I'd like to print each partition to check whether they are empty, or at least the top 20 elements of each partition.
Any ideas?
PS: I found solutions for Spark, but I couldn't get them to work in PySpark...
How to print elements of particular RDD partition in Spark?
By the way: if someone can explain to me why I get those empty partitions in the first place, I'd be all ears...
Or how I can know when to expect this to happen and how to avoid it.
Or does it simply not influence performance if there are empty partitions in a dataset?
Apparently (and surprisingly), rdd.repartition does not spread the existing elements evenly across the new partitions here, so it is no wonder the distribution is unequal. One way to go is to use DataFrame.repartition instead:
rdd = sc.parallelize([1,2,3,43,54,678], 2)
rdd.glom().collect()
>>> [[1, 2, 3], [43, 54, 678]]
rdd6 = rdd.repartition(6)
rdd6.glom().collect()
>>> [[], [1, 2, 3], [], [], [], [43, 54, 678]]
# requires: from pyspark.sql import types as T
rdd6_df = spark.createDataFrame(rdd, T.IntegerType()).repartition(6).rdd
rdd6_df.glom().collect()
[[Row(value=678)],
[Row(value=3)],
[Row(value=2)],
[Row(value=1)],
[Row(value=43)],
[Row(value=54)]]
Concerning the possibility to check whether partitions are empty, I came across a few solutions myself:
(if there aren't that many partitions)
rdd.glom().collect()
>>>nothing happens
rdd.glom().collect()[1]
>>>[1, 2, 3]
Careful though, it will truly print the whole partition. For my data it resulted in a few thousand lines of print, but it worked!
source: How to print elements of particular RDD partition in Spark?
Count the lines in each partition and show the smallest/largest count:
l = df.rdd.mapPartitionsWithIndex(lambda x,it: [(x,sum(1 for _ in it))]).collect()
min(l,key=lambda item:item[1])
>>>(2, 61705)
max(l,key=lambda item:item[1])
>>>(0, 65875)
source: Spark Dataframes: Skewed Partition after Join

Why the local variable value is not visible after iterating RDD? [duplicate]

This question already has an answer here: Spark: Difference between accumulator and local variable (1 answer). Closed 3 years ago.
Hi, I am writing Scala code for Apache Spark.
The value of my local variable "country" is not reflected after the RDD iteration is done.
I am assigning a value to the country variable after checking a condition inside the RDD iteration. While the RDD is being iterated the value is available in the country variable; after control comes out of the loop, the value is lost.
import org.apache.spark.sql.SparkSession
import java.lang.Long

object KPI1 {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "C:\\shivam docs\\hadoop-2.6.5.tar\\hadoop-2.6.5")
    val spark = SparkSession.builder().appName("KPI1").master("local").getOrCreate()
    val textFile = spark.read.textFile("C:\\shivam docs\\HADOOP\\sample data\\wbi.txt").rdd

    val splitData = textFile.map { line =>
      val token = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)")
      (token(0), token(10).replace("\"", "").replace(",", ""))
    }

    // splitData.max()._2;
    var maxele = 0L
    var index = 0
    var country = ""

    splitData.foreach(println)

    for (ele <- splitData) {
      val data = Long.parseLong(ele._2)
      if (maxele < data) {
        maxele = data
        println(maxele)
        country = ele._1
        println(country)
      }
    }

    println("***************************** " + country + maxele)
    spark.close()
  }
}
The country variable should not end up with its default value.
Both for and foreach run as distributed operations. That means the execution happens on more than one executor, and that is why you are getting the default value back on the driver. I'm running my sample code in a single-node cluster with 4 executors, and you can see the execution happened in two different executors (the thread id makes it evident).
Sample
val baseRdd = spark.sparkContext.parallelize(Seq((1, 2), (3, 4)))
for (h <- baseRdd) {
  println("Thread id " + Thread.currentThread().getId)
  println("Value " + h)
}
Output
Thread id 48
Value (1,2)
Thread id 50
Value (3,4)
If you still want to get your expected result, follow either of the options below:
1. Change your Spark configuration to master("local[1]"). This will run your job with a single executor.
2. collect() your splitData before you perform for(ele<-splitData){...} (a sketch follows the note below).
Note: both options are strictly for testing or experimental purposes and will not work well against large datasets.
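A minimal sketch of option 2, reusing the variable names from the question (after collect() the loop runs in the driver JVM, so the assignments remain visible afterwards):

// Option 2: pull the data to the driver first, then iterate locally so that
// the assignments to maxele and country survive the loop.
for (ele <- splitData.collect()) {
  val data = java.lang.Long.parseLong(ele._2)
  if (maxele < data) {
    maxele = data
    country = ele._1
  }
}
println("***************************** " + country + maxele)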
When you're using variables within executors, Spark (YARN/Mesos etc.) creates a new instance of them per executor. This is why you don't see any update to your variable: the updates occur only on the executors, and none is sent back to the driver. If you want to accomplish this, you should use Accumulators:
Both 'maxele' & 'country' should be Accumulators.
You can read about it here and here
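As a rough sketch only (not code from either answer), a custom AccumulatorV2 could carry both pieces of state at once; the MaxByAccumulator name and the usage below are made up for illustration:

import org.apache.spark.util.AccumulatorV2

// Tracks the (value, country) pair with the largest value seen so far.
class MaxByAccumulator extends AccumulatorV2[(Long, String), (Long, String)] {
  private var current: (Long, String) = (Long.MinValue, "")
  override def isZero: Boolean = current._1 == Long.MinValue
  override def copy(): MaxByAccumulator = {
    val acc = new MaxByAccumulator
    acc.current = current
    acc
  }
  override def reset(): Unit = current = (Long.MinValue, "")
  override def add(v: (Long, String)): Unit = if (v._1 > current._1) current = v
  override def merge(other: AccumulatorV2[(Long, String), (Long, String)]): Unit = add(other.value)
  override def value: (Long, String) = current
}

// Usage, assuming splitData: RDD[(String, String)] as in the question.
val maxAcc = new MaxByAccumulator
spark.sparkContext.register(maxAcc, "maxByCountry")
splitData.foreach { case (ctry, amount) =>
  maxAcc.add((java.lang.Long.parseLong(amount), ctry))
}
println("***************************** " + maxAcc.value._2 + maxAcc.value._1)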

Spark flushing Dataframe on show / count

I am trying to print the count of a dataframe, and then first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the dataframe becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is it expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
  val col_name = "colA"
  val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
  val modifiedDFRaw = modified_df.limit(10)
  modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}
val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)
def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
  val strg_name = Option(sval).getOrElse(return None)
  if (change_cnt < 20) {
    change_cnt = change_cnt + 1
    Some(strg_name.multiply(new java.math.BigDecimal("1000")))
  } else {
    Some(strg_name)
  }
}
First of all, a function used as a UserDefinedFunction has to be at least idempotent, and optimally pure. Otherwise the results are simply non-deterministic. While some escape hatch is provided in the latest versions (it is possible to hint to Spark that the function shouldn't be re-executed), that won't help you here.
Moreover, having mutable state (it is not exactly clear what the source of change_cnt is, but it is both written and read in the udf) is simply a no-go: Spark doesn't provide global mutable state.
Overall your code:
Modifies some local copy of some object.
Makes decisions based on such an object.
Unfortunately both components are simply not salvageable. You'll have to go back to the planning phase and rethink your design.
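For contrast, a minimal sketch of a pure UDF: it depends only on its input, so re-evaluating the plan after count() or show() yields the same rows. The scaleUDF name is made up for illustration:

import org.apache.spark.sql.functions.{col, udf}

// Pure function: no external mutable state, so every re-evaluation of the
// plan produces the same output for the same input.
val scaleUDF = udf { sval: java.math.BigDecimal =>
  Option(sval).map(_.multiply(new java.math.BigDecimal("1000")))
}

// e.g. inputDataframe.withColumn("colA", scaleUDF(col("colA")))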
Your DataFrame is a distributed dataset, and because the UDF keeps mutable state in its closure, a count() (or any other action) can re-evaluate the plan and return a different result each time. Read the documentation about RDD closures below; it is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd

spark streaming - use previous calculated dataframe in next iteration

I have a streaming app that takes a DStream, runs an SQL manipulation over it, and dumps the result to a file:
dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}
Now I need to be able to take into account the previous calculation (from an earlier batch) in my calculation (something like the following):
dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  val prev_df = read_prev_calc()
  df.join(prev_df, "id")
    .select("col")
    .filter(prev_df("value").equalTo(1))
    .write.csv("s3://..")
}
Is there a way to keep the calculation result in memory somehow and use it as an input to the next calculation?
Have you tried using the persist() method on a DStream? It will automatically persist every RDD of that DStream in memory.
Note, however, that by default all input data and persisted RDDs generated by DStream transformations are automatically cleared.
Also, DStreams generated by window-based operations are automatically persisted in memory.
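A minimal sketch of that call, assuming dstream is the DStream from the question:

import org.apache.spark.storage.StorageLevel

// Persist every RDD generated by this DStream so a batch's result can be
// reused without recomputation; Spark still clears old RDDs after the
// remember duration, so this is not long-term storage.
dstream.persist(StorageLevel.MEMORY_ONLY)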
For more details, you can check https://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence
https://spark.apache.org/docs/0.7.2/api/streaming/spark/streaming/DStream.html
If you are looking only for one or two previously calculated DataFrames, you should look into Spark Streaming window operations.
The snippet below is from the Spark documentation:
val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)
Or, even simpler: if we want to do a word count over the last 20 seconds of data, every 10 seconds, we have to apply the reduceByKey operation on the pairs DStream of (word, 1) pairs over the last 20 seconds of data. This is done using the operation reduceByKeyAndWindow.
// Reduce last 20 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(20), Seconds(10))
More details and examples at:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations

Iterate through mixed-Type scala Lists

Using Spark 2.1.1, I have an N-row CSV as 'fileInput':
colname datatype elems start end
colA float 10 0 1
colB int 10 0 9
I have successfully made an array of sql.rows ...
val df = spark.read.format("com.databricks.spark.csv").option("header", "true").load(fileInput)
val rowCnt:Int = df.count.toInt
val aryToUse = df.take(rowCnt)
Array[org.apache.spark.sql.Row] = Array([colA,float,10,0,1], [colB,int,10,0,9])
Against those Rows and using my random-value-generator scripts, I have successfully populated an empty ListBuffer[Any] ...
res170: scala.collection.mutable.ListBuffer[Any] = ListBuffer(List(0.24455154, 0.108798146, 0.111522496, 0.44311434, 0.13506883, 0.0655781, 0.8273762, 0.49718297, 0.5322746, 0.8416396), List(1, 9, 3, 4, 2, 3, 8, 7, 4, 6))
Now I have a mixed-type ListBuffer[Any] containing differently typed lists.
How do I iterate through and zip these? [Any] seems to defy mapping/zipping. I need to take the N lists generated by the inputFile's definitions, then save them to a csv file. The final output should be:
ColA, ColB
0.24455154, 1
0.108798146, 9
0.111522496, 3
... etc
The inputFile can then be used to create any number of 'colnames's, of any 'datatype' (I have scripts for that), with each type appearing 1..n times, and with any number of rows (defined as 'elems'). My random-generating scripts customize the values per 'start' & 'end', but those columns are not relevant to this question.
Given a List[List[Any]], you can "zip" all these lists together using transpose, if you don't mind the result being a list-of-lists instead of a list of Tuples:
val result: Seq[List[Any]] = list.transpose
If you then want to write this into a CSV, you can start by mapping each "row" into a comma-separated String:
val rows: Seq[String] = result.map(_.mkString(","))
(note: I'm ignoring the Apache Spark part, which seems completely irrelevant to this question... the "metadata" is loaded via Spark, but then it's collected into an Array so it becomes irrelevant)
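If a plain local file is good enough, a small (assumed) continuation could prepend a header and write the rows out; the output.csv path and the ColA,ColB header are made up for illustration:

import java.io.PrintWriter

// Write the header followed by the comma-separated rows built above.
val header = "ColA,ColB"
val writer = new PrintWriter("output.csv")
try {
  writer.println(header)
  rows.foreach(writer.println)
} finally {
  writer.close()
}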
I think the RDD.zipWithUniqueId() or RDD.zipWithIndex() methods can do what you want.
Please refer to the official documentation for more information. Hope this helps.
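A rough sketch of that idea, assuming sc is the SparkContext and using just the first few values from the question's example: two equally long RDDs can be paired positionally by keying each element on its index and joining.

// Pair two same-length RDDs by position using zipWithIndex + join.
val colA = sc.parallelize(Seq(0.24455154, 0.108798146, 0.111522496))
val colB = sc.parallelize(Seq(1, 9, 3))

val paired = colA.zipWithIndex.map(_.swap)
  .join(colB.zipWithIndex.map(_.swap))
  .sortByKey()
  .values

paired.collect().foreach { case (a, b) => println(s"$a, $b") }
// 0.24455154, 1
// 0.108798146, 9
// 0.111522496, 3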