Is Spark zipWithIndex safe with parallel implementation? (Scala)

If I have a file and I do an RDD zipWithIndex per row,
([row1, id1001, name, address], 0)
([row2, id1001, name, address], 1)
...
([row100000, id1001, name, address], 100000)
Will I get the same indices if I reload the file? Since it runs in parallel, rows may be partitioned differently each time.

RDDs can be sorted, and so do have an order. This order is used to create the index with .zipWithIndex().
Whether you get the same order each time depends on what the previous calls in your program are doing. The docs mention that .groupBy() can destroy order or generate different orderings. There may be other calls that do this as well.
I suppose you could always call .sortBy() before calling .zipWithIndex() if you needed to guarantee a specific ordering.
This is explained in the .zipWithIndex() Scala API docs:
public RDD<scala.Tuple2<T,Object>> zipWithIndex() Zips this RDD with
its element indices. The ordering is first based on the partition
index and then the ordering of items within each partition. So the
first item in the first partition gets index 0, and the last item in
the last partition receives the largest index. This is similar to
Scala's zipWithIndex but it uses Long instead of Int as the index
type. This method needs to trigger a spark job when this RDD contains
more than one partitions.
Note that some RDDs, such as those returned by groupBy(), do not
guarantee order of elements in a partition. The index assigned to each
element is therefore not guaranteed, and may even change if the RDD is
reevaluated. If a fixed ordering is required to guarantee the same
index assignments, you should sort the RDD with sortByKey() or save it
to a file.
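For example, a minimal sketch of sorting before zipping, as suggested above (the file path and the choice of sort key are illustrative):
// Sorting on a stable key before zipWithIndex makes the assigned indices deterministic
// across reloads, since zipWithIndex orders by partition index and position within partitions.
val indexed = sc.textFile("data.csv")
  .map(_.split(","))
  .sortBy(fields => fields(1))  // e.g. sort on the id column
  .zipWithIndex()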

Related

SCALA: How to use collect function to get the latest modified entry from a dataframe?

I have a Scala DataFrame with two columns:
id: String
updated: Timestamp
From this dataframe I just want to get out the latest date, for which I use the following code at the moment:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told is safer to use for such a problem: when my code runs as a job it appears not to aggregate the max over the whole dataset, although it looks perfectly fine when running in a notebook. But I don't understand how collect() should be used.
I found an implementation like the following, but I could not figure out how it should be used:
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps. My question now is: how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!
I'm assuming that you are talking about a Spark DataFrame (not a plain Scala collection).
If you just want the latest date (only that column) you can do:
df.select(max("updated"))
You can see what's inside the DataFrame with df.show(). Since DataFrames are immutable, you need to assign the result of the select to another variable or chain show() after the select().
This will return a dataframe with just one row with the max value in "updated" column.
To answer to your question:
So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps
When you select on a DataFrame, Spark selects data from the whole dataset; there is not a partitioned version and a driver version. Spark shards your data across your cluster, and all the operations that you define are applied to the entire dataset.
My question now is, how is collect() actually supposed to work in such a situation?
The collect operation converts a Spark DataFrame into an array (which is not distributed), and that array lives on the driver node. Bear in mind that if your DataFrame size exceeds the memory available on the driver, you will get an OutOfMemoryError.
In this case if you do:
df.select(max("updated")).collect().head
Your DataFrame (which contains only one row with one column, your date) will be converted to a Scala array. In this case it is safe because select(max()) returns just one row.
Take some time to read more about Spark DataFrames/RDDs and the difference between transformations and actions.
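A minimal sketch of the approach described above, using the question's "updated" column (getTimestamp(0) assumes the column holds a timestamp):
import org.apache.spark.sql.functions.max

// Aggregate to a single-row DataFrame, then pull that one row back to the driver.
val latestRow = df.agg(max("updated")).collect().head  // equivalently .first()
val latest = latestRow.getTimestamp(0)                 // the maximum "updated" value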
It sounds weird. First of all, you don't need to collect the DataFrame to get the last element of a sorted DataFrame. There are many answers on this topic:
How to get the last row from DataFrame?

Select N elements from each partition in spark

Assume I have an RDD. I set the number of partitions of the RDD to 5. I want to select 10 elements from each partition, store them in a variable called var1, and later broadcast var1. How can I achieve this?
Doing that naively would lead to huge data shuffling, so I thought I could use collect. I have to store the selected elements from each partition in a variable. Also consider that this is an iterative problem, and I have to broadcast after a specified number of iterations.
You can try getting the partition number using .mapPartitionsWithIndex, grouping by partition using .groupBy, adding an id with .zipWithIndex, then filtering up to 10 records for each group with .filter, and finally calling .collect.
Apply the take(n) function to each partition of the RDD, which will produce another RDD with n * noOfPartitions items.
val var1 = rdd.mapPartitions(rows => rows.take(10)).collect()
Note: here the collect happens on the resultant RDD, which should be much smaller than the original RDD (provided n is small enough).
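The question also asks about broadcasting the collected elements; a minimal sketch, assuming a SparkContext sc is in scope:
// var1 as collected above: up to 10 elements from each partition, pulled to the driver.
val var1 = rdd.mapPartitions(rows => rows.take(10)).collect()
// Broadcast the (small) array so executors can reuse it across later iterations.
val var1Broadcast = sc.broadcast(var1)
// Inside tasks, read it with var1Broadcast.value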

What is the best option to generate sequence numbers in Spark code (Scala)?

What is the best way to implement ROW_NUMBER (a sequence generator) in a Spark program for billions of records (> 25 billion)?
Sample code:
select patient_id,
department_id,
row_number() over (partition by department_id order by dept_id asc) as Pat_serial_Nbr
from T_patient;
row_number() in the Spark program runs for more than 4 hours and fails for 15 billion records.
I have tried the RDD method zipWithIndex() for 15 billion records (execution took 40 minutes), and it returns the expected results.
public RDD<scala.Tuple2<T,Object>> zipWithIndex()
Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions.
Note that some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition. The index assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
scala> List("X", "Y", "Z").zipWithIndex
res0: List[(String, Int)] = List((X,0), (Y,1), (Z,2))
The same works for a DataFrame: val rdd = df.rdd.zipWithIndex
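For example, a minimal sketch (assuming a SparkSession named spark is in scope; the column name "seq" is illustrative) of turning the zipped RDD back into a DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Append the zipWithIndex index as an extra "seq" column and rebuild the DataFrame.
val zipped = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val schemaWithSeq = StructType(df.schema.fields :+ StructField("seq", LongType, nullable = false))
val dfWithSeq = spark.createDataFrame(zipped, schemaWithSeq)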
Second option:
val df_id = df.withColumn("id", monotonicallyIncreasingId)
Reference: Spark-Monotonically increasing id not working as expected in dataframe?
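As a side note, in Spark 2.0+ the non-deprecated form of that function lives in org.apache.spark.sql.functions; a minimal sketch:
import org.apache.spark.sql.functions.monotonically_increasing_id

// The generated ids are unique and monotonically increasing, but not consecutive,
// so this is not a gap-free sequence number.
val df_id = df.withColumn("id", monotonically_increasing_id())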
Could you please suggest the optimal way to generate sequence numbers in Spark with Scala?

Why is dataset.count causing a shuffle! (spark 2.2)

Here is my situation: the underlying RDD of my DataFrame has 2 partitions.
When I do a df.count, the DAG produced includes a shuffle (exchange) stage.
When I do a df.rdd.count, the DAG produced has no shuffle stage.
Question: count is an action in Spark; the official definition is 'Returns the number of rows in the DataFrame.' Now, when I perform the count on the DataFrame, why does a shuffle occur? And when I do the same on the underlying RDD, no shuffle occurs.
It makes no sense to me why a shuffle would occur at all. I tried to go through the source code of count in the Spark GitHub repository, but it doesn't fully make sense to me. Is the groupBy being supplied to the action the culprit?
PS. df.coalesce(1).count does not cause any shuffle
It seems that the DataFrame's count operation uses groupBy, resulting in a shuffle. Below is the code from https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
/**
 * Returns the number of rows in the Dataset.
 * @group action
 * @since 1.6.0
 */
def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0)
}
Whereas if you look at the RDD's count function, it submits a job that counts each partition's elements, gets those counts back as an array, and then uses .sum to add up the elements of that array.
Code snippet from this link:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
When Spark performs the DataFrame count, it first computes partial counts for every partition and then adds another stage to sum them up. This is particularly good for large DataFrames, where distributing the counting across multiple executors actually improves performance.
The place to verify this is the SQL tab of the Spark UI, which will show a physical plan description like the following:
*HashAggregate(keys=[], functions=[count(1)], output=[count#202L])
+- Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#206L])
In the shuffle stage, the key is empty and the value is the count of the partition; all these (key, value) pairs are shuffled to one single partition. That is, very little data is moved in the shuffle stage.
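You can also see the same two-stage plan without opening the UI; a minimal sketch:
// df.count() is built on a global aggregate over groupBy(); explaining the equivalent
// query prints the partial_count -> Exchange SinglePartition -> final count stages.
df.groupBy().count().explain()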

Efficiently take one value for each key out of a RDD[(key,value)]

My starting point is an RDD[(key, value)] in Scala using Apache Spark. The RDD contains roughly 15 million tuples. Each key has roughly 50 ± 20 values.
Now I'd like to take one value (doesn't matter which one) for each key. My current approach is the following:
HashPartition the RDD by the key. (There is no significant skew)
Group the tuples by key, resulting in RDD[(key, array of values)]
Take the first of each value array
Basically looks like this:
...
candidates
.groupByKey()
.map(c => (c._1, c._2.head))
...
The grouping is the expensive part. It is still fast because there is no network shuffle and candidates is in memory, but can I do it faster?
My idea was to work on the partitions directly, but I'm not sure what I get out of the HashPartition. If I take the first tuple of each partition, I will get every key but maybe multiple tuples for a single key depending on the number of partitions? Or will I miss keys?
Thank you!
How about reduceByKey with a function that returns the first argument? Like this:
candidates.reduceByKey((x, _) => x)
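A small self-contained example of that approach, assuming a SparkContext sc is in scope (the sample data is illustrative):
// Keep an arbitrary single value per key without materializing the full group for each key.
val candidates = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)))
val onePerKey = candidates.reduceByKey((x, _) => x)
onePerKey.collect().foreach(println)  // e.g. (a,1) and (b,3); which value survives is not guaranteed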