Print a specific partition of an RDD / DataFrame - PySpark

I have been experimenting with partitions and repartitioning of PySpark RDDs.
I noticed that, when repartitioning a small sample RDD from 2 to 6 partitions, a few empty partitions are simply added.
rdd = sc.parallelize([1,2,3,43,54,678], 2)
rdd.glom().collect()
>>> [[1, 2, 3], [43, 54, 678]]
rdd6 = rdd.repartition(6)
rdd6.glom().collect()
>>> [[], [1, 2, 3], [], [], [], [43, 54, 678]]
Now, I wonder if that also happens in my real data.
It seems I can't use glom() on larger data (a DataFrame with 192,497 rows).
df.rdd.glom().collect()
When I try, nothing happens, which makes sense: the resulting print would be enormous...
SO
I'd like to print each partition to check whether they are empty, or at least print the top 20 elements of each partition. Any ideas?
PS: I found solutions for Spark, but I couldn't get them to work in PySpark...
How to print elements of particular RDD partition in Spark?
btw: if someone can explain why I get these empty partitions in the first place, I'd be all ears...
Or how I can know when to expect this to happen and how to avoid it.
Or does it simply not influence performance if there are empty partitions in a dataset?

Apparently (and surprisingly), rdd.repartition does not spread elements out individually (in PySpark the shuffle appears to move whole serialized batches of elements rather than single ones), so it's no wonder the distribution is unequal. One way to go is to use DataFrame.repartition:
rdd = sc.parallelize([1,2,3,43,54,678], 2)
rdd.glom().collect()
>>> [[1, 2, 3], [43, 54, 678]]
rdd6 = rdd.repartition(6)
rdd6.glom().collect()
>>> [[], [1, 2, 3], [], [], [], [43, 54, 678]]
from pyspark.sql import types as T  # needed for T.IntegerType()
rdd6_df = spark.createDataFrame(rdd, T.IntegerType()).repartition(6).rdd
rdd6_df.glom().collect()
[[Row(value=678)],
[Row(value=3)],
[Row(value=2)],
[Row(value=1)],
[Row(value=43)],
[Row(value=54)]]

Concerning the possibility to check whether partitions are empty, I came across a few solutions myself:
(if there aren't that many partitions)
rdd.glom().collect()
>>> nothing happens (on large data the output is simply too big)
rdd.glom().collect()[1]
>>>[1, 2, 3]
Careful though: it will truly print the whole partition. For my data, that resulted in a few thousand lines of print, but it worked!
source: How to print elements of particular RDD partition in Spark?
Count the lines in each partition and show the smallest/largest counts:
l = df.rdd.mapPartitionsWithIndex(lambda i, it: [(i, sum(1 for _ in it))]).collect()
min(l,key=lambda item:item[1])
>>>(2, 61705)
max(l,key=lambda item:item[1])
>>>(0, 65875)
source: Spark Dataframes: Skewed Partition after Join
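The asker's original wish, printing only the top 20 elements of each partition, can be handled the same way. A minimal sketch of the per-partition function, in plain Python so it can be tried without a cluster (the `df` in the comment is hypothetical):

```python
from itertools import islice

def head_of_partition(idx, it, n=20):
    # Returns [(partition_index, first n elements)] -- the shape
    # mapPartitionsWithIndex expects back from each partition.
    return [(idx, list(islice(it, n)))]

# On a real DataFrame this would be:
#   df.rdd.mapPartitionsWithIndex(head_of_partition).collect()
# Simulated here on two in-memory "partitions":
parts = [iter([1, 2, 3]), iter([43, 54, 678])]
result = [head_of_partition(i, p) for i, p in enumerate(parts)]
```

On a real RDD, collect() would flatten the per-partition lists into one list of (index, head) pairs, so the output stays small even for many rows.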

Related

How to read csv in pyspark as different types, or map dataset into two different types

Is there a way to map the RDD as follows?
covidRDD = sc.textFile("us-states.csv") \
    .map(lambda x: x.split(","))

# reducing states and cases by key
reducedCOVID = covidRDD.reduceByKey(lambda accum, n: accum + n)
print(reducedCOVID.take(1))
The dataset consists of one column of states and one column of cases. When it's created, it is read as
[[u'Washington', u'1'],...]
Thus, I want to have a column of strings and a column of ints. I am doing a project on RDDs, so I want to avoid using DataFrames... any thoughts?
Thanks!
As the dataset contains key-value pairs, use groupByKey and aggregate the count.
If you have a dataset like [['WH', 10], ['TX', 5], ['WH', 2], ['IL', 5], ['TX', 6]]
The code below gives this output - [('IL', 5), ('TX', 11), ('WH', 12)]
data.groupByKey().map(lambda row: (row[0], sum(row[1]))).collect()
You can also use aggregateByKey with a user-defined function. This method requires three parameters: the start (zero) value, the aggregation function within a partition, and the aggregation function across partitions.
This code also produces the same result as above
def addValues(a, b):
    return a + b

data.aggregateByKey(0, addValues, addValues).collect()
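One detail the question shows but the answer skips: sc.textFile delivers every field as a string, so the cases column must be cast to int before summing, or reduceByKey will concatenate strings. A sketch of that logic in plain Python (a dict standing in for reduceByKey; the sample data is invented):

```python
# In PySpark this would be, roughly:
#   covidRDD.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)
rows = [['WH', '10'], ['TX', '5'], ['WH', '2'], ['IL', '5'], ['TX', '6']]

totals = {}
for state, cases in rows:
    # cast the string count to int, then accumulate per key
    totals[state] = totals.get(state, 0) + int(cases)
```

Without the int() cast, 'WH' would come out as the string '102' instead of 12.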

How to find the MIN and MAX of each column of an RDD using Map/Reduce or any other method

I have read nearly 100 CSV files into one RDD
rdd=sc.textFile("file:///C:/Users\pinjala/Documents/Python Scripts/Files_1/*.csv")
I want to find the min and max for each column in the RDD (nearly 100 columns).
Can someone suggest how I can find the min and max of the different columns of an RDD?
When I use rdd.collect(), I can see the RDD as a list containing the column names in the first element and the values of each column in the rest of the elements.
It would have been better if you had given some sample data.
Anyway, I just simulated it, and here is the code:
new_list = []
list_p = [['John', 19, 1, 9, 20, 68], ['Jack', 3, 2, 5, 12, 99]]  # list of rows
rdd = sc.parallelize(list_p)  # build an RDD
print(rdd.collect())  # [['John', 19, 1, 9, 20, 68], ['Jack', 3, 2, 5, 12, 99]]

for p in list_p:
    header = p[0]
    p.remove(p[0])
    min_p = sc.parallelize(p).min()
    max_p = sc.parallelize(p).max()
    new_list.append("[" + header + "," + str(min_p) + "," + str(max_p) + "]")

print(new_list)  # ['[John,1,68]', '[Jack,2,99]']
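Note that the code above computes the min/max per row (per person), while the question asks for the min/max per column. The per-column version can be sketched in plain Python with a transpose, using the same sample rows (the commented RDD expression is an untested suggestion):

```python
# With a real RDD, roughly (untested sketch):
#   rdd.map(lambda row: row[1:]).reduce(
#       lambda a, b: [min(x, y) for x, y in zip(a, b)])  # and max() likewise
rows = [['John', 19, 1, 9, 20, 68],
        ['Jack', 3, 2, 5, 12, 99]]

cols = list(zip(*[r[1:] for r in rows]))  # transpose the numeric columns
col_min = [min(c) for c in cols]
col_max = [max(c) for c in cols]
```

The reduce-based variant scales to 100 columns without launching one Spark job per row, which the loop above would.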

Obtaining inconsistent results in Spark

Have any Spark experts had this strange experience: obtaining inconsistent map-reduce results using PySpark?
Suppose that midway through a job, I have an RDD
....
rdd = sc.parallelize([(('Alex', item1), 3), (('Joe', item2), 1),...])
My goal is to count how many different users there are, so I do (add here presumably being operator.add):
print (set(rdd.map(lambda x: (x[0][0],1)).reduceByKey(add).collect()))
print (rdd.map(lambda x: (x[0][0],1)).reduceByKey(add).collect())
print (set(rdd.map(lambda x: (x[0][0],1)).reduceByKey(add).map(lambda x: x[0]).collect()))
These three prints should have the same content (though in different formats). For example, the first should be a set like {('Alex', 1), ('John', 10), ('Joe', 2), ...}; the second a list [('Alex', 1), ('John', 10), ('Joe', 2), ...]. The number of items should equal the number of different users. The third should be a set like {'Alex', 'John', 'Joe', ...}.
But instead I got {('Alex', 1), ('John', 2), ('Joe', 3), ...}; the second was a list [('John', 5), ('Joe', 2), ...] ('Alex' is even missing here). The lengths of the set and the list are different.
Unfortunately, I cannot even reproduce the error in a short test script; there I still get the right results. Has anyone met this problem before?
I think I figured it out.
The reason is that if I use the same RDD repeatedly, I need to .cache() it: without caching, the RDD is recomputed for every action, and if anything upstream is non-deterministic, each recomputation can produce different data.
If the RDD becomes
rdd = sc.parallelize([(('Alex', item1), 3), (('Joe', item2), 1),...]).cache()
then the inconsistency problem is solved.
Alternatively, if I prepare the aggregated RDD up front as
aggregated_rdd = rdd.map(lambda x: (x[0][0],1)).reduceByKey(add)
print (set(aggregated_rdd.collect()))
print (aggregated_rdd.collect())
print (set(aggregated_rdd.map(lambda x: x[0]).collect()))
then there are no inconsistency problems either.

How to make each partition execute sequentially in a DataFrame with Scala/Spark?

I have a DataFrame, and I want the first partition to be executed first, the second partition second, and so on. This is my code, but it does not work. What should I do to make each partition execute sequentially?
val arr = Array(1, 7, 3, 3, 5, 21, 7, 3, 9, 10)
var df = sc.parallelize(arr, 4).toDF("aa")
var arrbrocast = new HashMap[Int, Double]()
val bro = m_sparkCtx.broadcast(arrbrocast)
val rdd = df.rdd.mapPartitionsWithIndex((partIdx, iter) => {
  var flag = true
  println("----" + bro.value.size)
  while (flag) {
    if (bro.value.contains(partIdx - 1)) {
      flag = false
    }
  }
  bro.value += (partIdx -> 1.0)
  println(bro.value.get(partIdx - 1).get)
  iter
})
rdd.count()
If you want data to be processed sequentially, don't use Spark: open the file and read the input stream line by line. Theoretically you could use SparkContext.runJob to process specific partitions, but that is useless when processing the full dataset.
Also, this is not how broadcast variables work. You should never attempt to modify them when executing tasks.
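The sequential alternative the answer suggests (read the input stream line by line) can be sketched in a few lines of Python; io.StringIO stands in for a real file handle here:

```python
import io

# Reading an input stream line by line guarantees strict sequential order,
# which Spark partitions deliberately do not.
stream = io.StringIO("1\n7\n3\n3\n5\n")  # stands in for open("data.txt")
processed = []
for line in stream:
    processed.append(int(line))  # each line is handled in order of arrival
```

If you need both Spark and driver-side sequential consumption, RDD.toLocalIterator is worth a look, though it pulls partitions to the driver one at a time.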

Iterate through mixed-type Scala Lists

Using Spark 2.1.1, I have an N-row CSV as 'fileInput':
colname datatype elems start end
colA float 10 0 1
colB int 10 0 9
I have successfully made an array of sql.Rows:
val df = spark.read.format("com.databricks.spark.csv").option("header", "true").load(fileInput)
val rowCnt:Int = df.count.toInt
val aryToUse = df.take(rowCnt)
Array[org.apache.spark.sql.Row] = Array([colA,float,10,0,1], [colB,int,10,0,9])
Against those Rows and using my random-value-generator scripts, I have successfully populated an empty ListBuffer[Any] ...
res170: scala.collection.mutable.ListBuffer[Any] = ListBuffer(List(0.24455154, 0.108798146, 0.111522496, 0.44311434, 0.13506883, 0.0655781, 0.8273762, 0.49718297, 0.5322746, 0.8416396), List(1, 9, 3, 4, 2, 3, 8, 7, 4, 6))
Now I have a mixed-type ListBuffer[Any] holding differently typed lists.
How do I iterate through and zip these? [Any] seems to defy mapping/zipping. I need to take the N lists generated from the inputFile's definitions, then save them to a CSV file. The final output should be:
ColA, ColB
0.24455154, 1
0.108798146, 9
0.111522496, 3
... etc
The inputFile can then be used to create any number of 'colname's, of any 'datatype' (I have scripts for that), with each type appearing 1..n times, and any number of rows (defined as 'elems'). My random-generating scripts customize the values per 'start' & 'end', but those columns are not relevant to this question.
Given a List[List[Any]], you can "zip" all these lists together using transpose, if you don't mind the result being a list of lists instead of a list of tuples:
val result: Seq[List[Any]] = list.transpose
If you then want to write this into a CSV, you can start by mapping each "row" into a comma-separated String:
val rows: Seq[String] = result.map(_.mkString(","))
(note: I'm ignoring the Apache Spark part, which seems completely irrelevant to this question... the "metadata" is loaded via Spark, but then it's collected into an Array so it becomes irrelevant)
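For comparison, the same transpose-then-join idea sketched in Python (values abbreviated from the question's lists):

```python
lists = [[0.24455154, 0.108798146, 0.111522496],
         [1, 9, 3]]

# transpose: the i-th element of every list becomes the i-th row
rows = [list(t) for t in zip(*lists)]

# join each row into a comma-separated CSV line (mkString equivalent)
csv_lines = [",".join(str(v) for v in row) for row in rows]
```

As in the Scala version, the element type is erased to Any/object, but transposing and string-joining never needed the types in the first place.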
I think the RDD.zipWithUniqueId() or RDD.zipWithIndex() methods can do what you want.
Please refer to the official documentation for more information. Hope this helps!