How can I transpose a dataframe in PySpark? - pyspark

I couldn't find any function in PySpark for transposing a dataframe. For example, given:
Cal   Cal2  Cal3
'A'     12    11
'U'     10     9
'O'      5     5
'ER'     6     5
I want to get:
Cal    'A'  'U'  'O'  'ER'
Cal2    12   10    5    6
Cal3    11    9    5    5
In pandas this is very easy: df.T. But I am not sure how to do it in PySpark!

Generation of the sample dataframe
df = spark.createDataFrame([('A' ,12 ,11),('U' ,10 ,9 ),('O' , 5 ,5 ),('ER', 6 ,5 )], ['Cal','Cal2','Cal3'])
Option 1: pyspark.pandas.DataFrame.T
For large dataframes, raising compute.max_rows might be required:
import pyspark.pandas as ps
ps.get_option("compute.max_rows") # 1000
ps.set_option("compute.max_rows", 2000)
(df
.to_pandas_on_spark()
.set_index('Cal')
.T
.reset_index()
.rename(columns={"index":"Cal"})
.to_spark()
.show())
+----+---+---+---+---+
| Cal| A| U| O| ER|
+----+---+---+---+---+
|Cal2| 12| 10| 5| 6|
|Cal3| 11| 9| 5| 5|
+----+---+---+---+---+
Option 2: pyspark, the hard way
import pyspark.sql.functions as F

header_col = 'Cal'
cols_minus_header = df.columns
cols_minus_header.remove(header_col)   # ['Cal2', 'Cal3']

# Pivot on 'Cal' so each of its values becomes a column holding an array of the
# remaining values, then add the original column names as an array column too.
df1 = (df
    .groupBy()
    .pivot('Cal')
    .agg(F.first(F.array(cols_minus_header)))
    .withColumn(header_col, F.array(*map(F.lit, cols_minus_header)))
)
df1.show(truncate=False)
+--------+------+------+-------+------------+
| A| ER| O| U| Cal|
+--------+------+------+-------+------------+
|[12, 11]|[6, 5]|[5, 5]|[10, 9]|[Cal2, Cal3]|
+--------+------+------+-------+------------+
# Zip the arrays element-wise and explode them back into rows with inline().
df2 = df1.select(F.arrays_zip(*df1.columns).alias('az')).selectExpr('inline(az)')
df2.show(truncate = False)
+---+---+---+---+----+
|A |ER |O |U |Cal |
+---+---+---+---+----+
|12 |6 |5 |10 |Cal2|
|11 |5 |5 |9 |Cal3|
+---+---+---+---+----+

You can unpivot the dataframe and then pivot it based on a different column.
from pyspark.sql import functions as F
data = [('A', 12, 11,),
('U', 10, 9,),
('O', 5, 5,),
('ER', 6, 5,), ]
df = spark.createDataFrame(data, ("Cal", "Cal2", "Cal3",))
key_column = "Cal"
unpivot_cols = [x for x in df.columns if x != key_column]
unpivot_col_expr = ", ".join([f"'{c}', {c}" for c in unpivot_cols])
unpivot_expr = f"stack({len(unpivot_cols)}, {unpivot_col_expr}) as (key, value)"
unpivoted_df = df.select(key_column, F.expr(unpivot_expr))
unpivoted_df.groupBy("key").pivot(key_column).agg(F.first("value")).withColumnRenamed("key", key_column).show()
"""
+----+---+---+---+---+
| Cal| A| ER| O| U|
+----+---+---+---+---+
|Cal3| 11| 5| 5| 9|
|Cal2| 12| 6| 5| 10|
+----+---+---+---+---+
"""

As of Spark 3.2.1, PySpark supports the pandas API as well.
If your dataframe is small, you can make use of it.
This method is based on an expensive operation due to the nature of big data. Internally it needs to generate each row for each value, and then group twice - it is a huge operation. To prevent misusage, this method has the ‘compute.max_rows’ default limit of input length, and raises a ValueError.
See the below link for more details -
pyspark.pandas.DataFrame.transpose
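A minimal sketch of that route, using the sample dataframe from the first answer (psdf is just an illustrative variable name; index_col names the index column when converting back to Spark):
psdf = df.to_pandas_on_spark().set_index('Cal')    # pandas-on-Spark DataFrame
psdf.transpose().to_spark(index_col='Cal').show()  # same semantics as pandas df.T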
In addition to the above, you can also use Koalas (available in Databricks), which is similar to pandas but designed for distributed processing; it is available with PySpark from 3.0.0 onwards. Something like the below -
kdf = df.to_koalas()
Transpose_kdf = kdf.transpose()
TransposeDF = Transpose_kdf.to_spark()
Koalas documentation - Databricks Koalas
One thing to note: you need to define partitions to use Koalas efficiently, otherwise there could be serious performance issues.
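For example, a minimal sketch, assuming the koalas package is installed (importing it is what attaches to_koalas() to Spark DataFrames); the partition count of 200 is only illustrative:
import databricks.koalas as ks

kdf = df.repartition(200).to_koalas()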
Another option that PySpark provides natively is the pivot and stack combination. See the documentation below for details -
Stack
Pivot
I'll leave these two for you to explore in detail.
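For reference, a minimal sketch of that stack/pivot route on the sample dataframe above (assumes Spark >= 3.4 for the DataFrame.unpivot method; on older versions the stack() SQL expression from the earlier answer does the same job):
import pyspark.sql.functions as F

(df.unpivot(ids=['Cal'], values=['Cal2', 'Cal3'],
            variableColumnName='key', valueColumnName='value')
   .groupBy('key')
   .pivot('Cal')
   .agg(F.first('value'))
   .withColumnRenamed('key', 'Cal')
   .show())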

Related

Reducing Multiple Actions/Filter optimization with Transformations in Pyspark

I have a master dataset (df) out of which I am trying to create averages based on certain filters.
F1 = df.filter((df.Identifier1>0)).groupBy().avg('Amount')
F2 = df.filter((df.Identifier1>2)).groupBy().avg('Amount')
F3 = df.filter((df.Identifier2<2)).groupBy().avg('Amount')
F4 = df.filter((df.Identifier2<4)).groupBy().avg('Amount')
#Alternatively also tried another way for avg calculation,
F1 = df.filter((df.Identifier1>0)).agg(avg(col('Amount')))
..
After calculating these averages, I am trying to assign them to the records in the master df as two columns, A1 and A2, using the same filters used in the average calculation.
df = df.withColumn("A1", when((col("Identifier1") > 0)), (F1.collect()[0][0]))
….
….
.otherwise(avg(col('Amount')))
df = df.withColumn("A2", when((col("Identifier2") <2 )), (F3.collect()[0][0]))
….
….
.otherwise(avg(col('Amount')))
I am facing two issues:
When one of the averages is null, I get an error while calling collect() or first().
Error:
Unsupported literal type class java.util.ArrayList [null]
As there are multiple actions involved, the process takes over 2 hours.
Any help on the above is welcome.
Make a column that records which of your filter conditions each row satisfies, e.g. given:
+---+--------+-----------+-----------+------+
| ID|Category|Identifier1|Identifier2|Amount|
+---+--------+-----------+-----------+------+
| 12| A| 2| 1| 100|
| 23| B| 7| 8| 500|
| 34| C| 1| 4| 300|
+---+--------+-----------+-----------+------+
from pyspark.sql.functions import array, array_union, avg, col, explode, lit, sum, when

df.withColumn('group', when(df.Identifier1 > 0, array(lit(1))).otherwise(array(lit(None)))) \
  .withColumn('group', when(df.Identifier1 > 2, array_union(col('group'), array(lit(2)))).otherwise(col('group'))) \
  .withColumn('group', when(df.Identifier2 < 2, array_union(col('group'), array(lit(3)))).otherwise(col('group'))) \
  .withColumn('group', when(df.Identifier2 < 4, array_union(col('group'), array(lit(4)))).otherwise(col('group'))) \
  .withColumn('group', explode('group')) \
  .groupBy('group').agg(sum('Amount').alias('sum'), avg('Amount').alias('avg')).show()
+-----+---+-----+
|group|sum| avg|
+-----+---+-----+
| 1|900|300.0|
| 3|100|100.0|
| 4|100|100.0|
| 2|500|500.0|
+-----+---+-----+
and then group by the group column, as the last line above does.
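If the end goal is the A1/A2 columns from the question, a hedged alternative sketch is to compute all the filtered averages in a single aggregation with conditional expressions and broadcast the one-row result back onto the master dataframe (column names follow the question; the alias names, and showing only two of the four averages, are illustrative):
from pyspark.sql import functions as F

# one job computes each filtered average as its own column
avgs = df.agg(
    F.avg(F.when(F.col('Identifier1') > 0, F.col('Amount'))).alias('avg_id1_gt0'),
    F.avg(F.when(F.col('Identifier2') < 2, F.col('Amount'))).alias('avg_id2_lt2'))

# avgs is a single row, so broadcasting it is cheap; a null average simply
# propagates as a null A1/A2 value instead of raising an error on collect()
result = (df.crossJoin(F.broadcast(avgs))
    .withColumn('A1', F.when(F.col('Identifier1') > 0, F.col('avg_id1_gt0')))
    .withColumn('A2', F.when(F.col('Identifier2') < 2, F.col('avg_id2_lt2')))
    .drop('avg_id1_gt0', 'avg_id2_lt2'))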

pyspark equivalent of pandas groupby('col1').col2.head()

I have a Spark DataFrame where, for each set of rows with a given column value (col1), I want to grab a sample of the values in (col2). The number of rows for each possible value of col1 may vary widely, so I'm just looking for a set number, say 10, of each type.
There may be a better way to do this, but the natural approach seemed to be df.groupby('col1').
In pandas, I could do df.groupby('col1').col2.head().
I understand that Spark DataFrames are not pandas DataFrames, but this is a good analogy.
I suppose I could loop over all col1 values as a filter, but that seems terribly icky.
Any thoughts on how to do this? Thanks.
Let me create a sample Spark dataframe with two columns.
df = spark.createDataFrame([[1, 'r1'],
                            [1, 'r2'],
                            [1, 'r2'],
                            [2, 'r1'],
                            [3, 'r1'],
                            [3, 'r2'],
                            [4, 'r1'],
                            [5, 'r1'],
                            [5, 'r2'],
                            [5, 'r1']], schema=['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1| r1|
| 1| r2|
| 1| r2|
| 2| r1|
| 3| r1|
| 3| r2|
| 4| r1|
| 5| r1|
| 5| r2|
| 5| r1|
+----+----+
After grouping by col1 we get a GroupedData object (instead of a Spark DataFrame). You can use aggregate functions like min, max, and average, but getting a head() is a little bit tricky. We need to convert the GroupedData object back to a Spark DataFrame, which can be done using the PySpark collect_list() aggregation function.
from pyspark.sql import functions
df1 = df.groupBy('col1').agg(functions.collect_list("col2"))
df1.show(n=3)
Output is:
+----+------------------+
|col1|collect_list(col2)|
+----+------------------+
| 5| [r1, r2, r1]|
| 1| [r1, r2, r2]|
| 3| [r1, r2]|
+----+------------------+
only showing top 3 rows
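If you only want a fixed number of values per group (say 10, as in the question), one option, sketched here under the assumption that any 10 values will do, is to slice the collected list (F.slice is available from Spark 2.4 onwards):
from pyspark.sql import functions as F

n = 10
(df.groupBy('col1')
   .agg(F.slice(F.collect_list('col2'), 1, n).alias('col2_sample'))
   .show())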

Spark2 Dataframe/RDD process in groups

I have the following table stored in Hive called ExampleData:
+--------+-----+---+
|Site_ID |Time |Age|
+--------+-----+---+
|1       |10:00| 20|
|1       |11:00| 21|
|2       |10:00| 24|
|2       |11:00| 24|
|2       |12:00| 20|
|3       |11:00| 24|
+--------+-----+---+
I need to be able to process the data by Site. Unfortunately partitioning it by Site doesn't work (there are over 100k sites, all with fairly small amounts of data).
For each site, I need to select the Time column and Age column separately, and use them to feed into a function (which ideally I want to run on the executors, not the driver).
I've got a stub of how I think I want it to work, but this solution would only run on the driver, so it's very slow. I need to find a way of writing it so it will run at an executor level:
// fetch a list of distinct sites and return them to the driver
//(if you don't, you won't be able to loop around them as they're not on the executors)
val distinctSites = spark.sql("SELECT site_id FROM ExampleData GROUP BY site_id LIMIT 10")
.collect
val allSiteData = spark.sql("SELECT site_id, time, age FROM ExampleData")
distinctSites.foreach(row => {
  val siteData = allSiteData.filter("site_id = " + row.get(0))
  val times = siteData.select("time").collect()
  val ages = siteData.select("age").collect()
  processTimesAndAges(times, ages)
})
def processTimesAndAges(times: Array[Row], ages: Array[Row]) {
// do some processing
}
I've tried broadcasting the distinctSites across all nodes, but this did not prove fruitful.
This seems such a simple concept and yet I have spent a couple of days looking into this. I'm very new to Scala/Spark, so apologies if this is a ridiculous question!
Any suggestions or tips are greatly appreciated.
The RDD API provides a number of functions which can be used to perform operations in groups, starting with the low-level repartition / repartitionAndSortWithinPartitions and ending with a number of *byKey methods (combineByKey, groupByKey, reduceByKey, etc.).
Example:
// key by Site_ID so each site's rows end up in the same group
rdd.map(tup => (tup._1, tup)).
  groupByKey().
  foreachPartition(iter => doSomeJob(iter))
In the DataFrame API you can use aggregate functions; the GroupedData class provides a number of methods for the most common ones, including count, max, min, mean and sum.
Example:
val df = sc.parallelize(Seq(
(1, 10.3, 10), (1, 11.5, 10),
(2, 12.6, 20), (3, 2.6, 30))
).toDF("Site_ID ", "Time ", "Age")
df.show()
+--------+-----+---+
|Site_ID |Time |Age|
+--------+-----+---+
| 1| 10.3| 10|
| 1| 11.5| 10|
| 2| 12.6| 20|
| 3| 2.6| 30|
+--------+-----+---+
df.groupBy($"Site_ID ").count.show
+--------+-----+
|Site_ID |count|
+--------+-----+
| 1| 2|
| 3| 1|
| 2| 1|
+--------+-----+
Note: as you have mentioned, that solution is very slow. You need to use partitioning, and in your case range partitioning is a good option:
http://dev.sortable.com/spark-repartition/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/

In Spark windowing, how do you fill null when the number of rows selected is less than the window size?

Assume there is a dataframe as follows:
machine_id | value
1| 5
1| 3
1| 4
I want to produce a final dataframe like this
machine_id | value | sum
1| 5|null
1| 3| 8
1| 4| 7
Basically I have to use a window of size two, but for the first row we don't want to sum it with zero; it should just be filled with null. My attempt:
var winSpec = Window.orderBy("machine_id").partitionBy("machine_id").rangeBetween(-1, 0)
df.withColumn("sum", sum("value").over(winSpec))
You can use the lag function: add the value column to lag(value, 1):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq((1, 5), (1, 3), (1, 4)).toDF("machine_id", "value")
val window = Window.partitionBy("machine_id").orderBy("id")

(df.withColumn("id", monotonically_increasing_id)
  .withColumn("sum", $"value" + lag($"value", 1).over(window))
  .drop("id").show())
+----------+-----+----+
|machine_id|value| sum|
+----------+-----+----+
| 1| 5|null|
| 1| 3| 8|
| 1| 4| 7|
+----------+-----+----+
You should be using the rowsBetween API rather than rangeBetween, as below:
import org.apache.spark.sql.functions._
var winSpec = Window.orderBy("machine_id").partitionBy("machine_id").rowsBetween(-1, 0)
df.withColumn("sum", sum("value").over(winSpec))
.withColumn("sum", when($"sum" === $"value", null).otherwise($"sum"))
.show(false)
which should give you your expected result
+----------+-----+----+
|machine_id|value|sum |
+----------+-----+----+
|1 |5 |null|
|1 |3 |8 |
|1 |4 |7 |
+----------+-----+----+
I hope the answer is helpful
If you want a general solution, where n is the window size:
Spark- Scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// n is the window size from the text above
val winSpec = Window.partitionBy("machine_id").orderBy("machine_id").rowsBetween(-n, 0)
val winSpec2 = Window.partitionBy("machine_id").orderBy("machine_id")

df.withColumn("sum", when(row_number().over(winSpec2) < n, lit(null)).otherwise(sum("value").over(winSpec)))
  .show(false)
PySpark
from pyspark.sql.window import Window
from pyspark.sql.functions import when, row_number, sum, lit

# n is the window size from the text above
winSpec = Window.partitionBy("machine_id").orderBy("machine_id").rowsBetween(-n, 0)
winSpec2 = Window.partitionBy("machine_id").orderBy("machine_id")

(df.withColumn("sum", when(row_number().over(winSpec2) < n, lit(None)).otherwise(sum("value").over(winSpec)))
   .show(truncate=False))

Pyspark Dataframe Apply function to two columns

Say I have two PySpark DataFrames df1 and df2.
df1 = 'a'
       1
       2
       5
df2 = 'b'
       3
       6
And I want to find the closest df2['b'] value for each df1['a'], and add the closest values as a new column in df1.
In other words, for each value x in df1['a'], I want to find a y that achieves min(abs(x-y)) over all y in df2['b'] (note: we can assume there is only one y that achieves the minimum distance), and the result would be
'a' 'b'
1 3
2 3
5 6
I tried the following code to create a distance matrix first (before finding the values achieving the minimum distance):
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
def dist(x, y):
    return abs(x - y)
udf_dict = udf(dist, IntegerType())
sql_sc = SQLContext(sc)
udf_dict(df1.a, df2.b)
which gives
Column<PythonUDF#dist(a,b)>
Then I tried
sql_sc.CreateDataFrame(udf_dict(df1.a, df2.b))
which runs forever without giving error/output.
My questions are:
As I'm new to Spark, is my way of constructing the output DataFrame efficient? (My way would be creating a distance matrix for all the a and b values first, and then finding the minimum one.)
What's wrong with the last line of my code, and how do I fix it?
Starting with your second question: you can apply a udf only to an existing dataframe. I think you were looking for something like this:
>>> df1.join(df2).withColumn('distance', udf_dict(df1.a, df2.b)).show()
+---+---+--------+
| a| b|distance|
+---+---+--------+
| 1| 3| 2|
| 1| 6| 5|
| 2| 3| 1|
| 2| 6| 4|
| 5| 3| 2|
| 5| 6| 1|
+---+---+--------+
But there is a more efficient way to compute this distance, by using the built-in abs:
>>> from pyspark.sql.functions import abs
>>> df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
Then you can find matching numbers by calculating:
>>> from pyspark.sql.functions import abs, min
>>> distances = df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
>>> min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
>>> distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()
+---+---+
| a| b|
+---+---+
| 5| 6|
| 1| 3|
| 2| 3|
+---+---+
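A hedged alternative sketch for the same "closest b per a" lookup, using a window function instead of the groupBy + join above (ties, if any, are broken arbitrarily by the ordering):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('a').orderBy('distance')
(df1.crossJoin(df2)
    .withColumn('distance', F.abs(df1.a - df2.b))
    .withColumn('rn', F.row_number().over(w))
    .filter(F.col('rn') == 1)
    .select('a', 'b')
    .show())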