Transpose dataframe - pyspark

I have a dataframe as given below
ID, Code_Num, Code, Code1, Code2, Code3
10, 1, A1005*B1003, A1005, B1003, null
12, 2, A1007*D1008*C1004, A1007, D1008, C1004
I need help transposing the above dataset; the output should look like this:
ID, Code_Num, Code, Code_T
10, 1, A1005*B1003, A1005
10, 1, A1005*B1003, B1003
12, 2, A1007*D1008*C1004, A1007
12, 2, A1007*D1008*C1004, D1008
12, 2, A1007*D1008*C1004, C1004

Step 1: Creating the DataFrame (Code_Num is left out here for brevity; the same approach works with it included).
values = [(10, 'A1005*B1003', 'A1005', 'B1003', None),(12, 'A1007*D1008*C1004', 'A1007', 'D1008', 'C1004')]
df = sqlContext.createDataFrame(values,['ID','Code','Code1','Code2','Code3'])
df.show()
+---+-----------------+-----+-----+-----+
| ID| Code|Code1|Code2|Code3|
+---+-----------------+-----+-----+-----+
| 10| A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+
Step 2: Explode the DataFrame -
from pyspark.sql.functions import array, col, explode, lit, struct

def to_transpose(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

df = to_transpose(df, ["ID", "Code"]).drop('key').withColumnRenamed("val", "Code_T")
df.show()
+---+-----------------+------+
| ID| Code|Code_T|
+---+-----------------+------+
| 10| A1005*B1003| A1005|
| 10| A1005*B1003| B1003|
| 10| A1005*B1003| null|
| 12|A1007*D1008*C1004| A1007|
| 12|A1007*D1008*C1004| D1008|
| 12|A1007*D1008*C1004| C1004|
+---+-----------------+------+
In case you only want non-null values in the Code_T column, just run the statement below:
df = df.where(col('Code_T').isNotNull())
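As a side note (my own sketch, not part of the original answer): since Code already contains the pieces joined by '*', you can also derive Code_T directly with split and explode, without needing the Code1/Code2/Code3 columns at all. This reuses the values list from Step 1:
from pyspark.sql.functions import col, explode, split

# Rebuild the Step 1 DataFrame and explode the '*'-separated Code string into one row per piece
df_alt = sqlContext.createDataFrame(values, ['ID', 'Code', 'Code1', 'Code2', 'Code3'])
df_alt = df_alt.select('ID', 'Code', explode(split(col('Code'), r'\*')).alias('Code_T'))
df_alt.show()
This also avoids the null rows, because only the pieces that actually exist in Code are produced.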

Related

How to write a function that takes a list of column names of a DataFrame, reorders the selected columns to the left, and preserves the unselected columns

I'd like to build a function
def reorderColumns(columnNames: List[String]) = ...
that can be applied to a Spark DataFrame so that the columns specified in columnNames get reordered to the left, and the remaining columns stay to the right (in any order).
Example:
Given a df with the following 5 columns
| A | B | C | D | E
df.reorderColumns(["D","B","A"]) returns a df with columns ordered like so:
| D | B | A | C | E
Try this one:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def reorderColumns(df: DataFrame, columns: Array[String]): DataFrame = {
  val restColumns: Array[String] = df.columns.filterNot(c => columns.contains(c))
  df.select((columns ++ restColumns).map(col): _*)
}
Usage example:
val spark: SparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val df = List((1, 3, 1, 6), (2, 4, 2, 5), (3, 6, 3, 4)).toDF("colA", "colB", "colC", "colD")
reorderColumns(df, Array("colC", "colB")).show
// output:
//+----+----+----+----+
//|colC|colB|colA|colD|
//+----+----+----+----+
//| 1| 3| 1| 6|
//| 2| 4| 2| 5|
//| 3| 6| 3| 4|
//+----+----+----+----+

Looking to subtract every value in a row based on the value of a separate DF

As the title states, I would like to subtract each value of a specific column by the mean of that column.
Here is my code attempt:
val test = moviePairs.agg(avg(col("rating1")).alias("avgX"), avg(col("rating2")).alias("avgY"))
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - test.select("avgX").collect())
  .withColumn("meanDeltaY", col("rating2") - test.select("avgY").collect())
subMean.show()
You can either use Spark's DataFrame functions or a plain SQL query on a DataFrame to aggregate the means of the columns you are focusing on (rating1, rating2). Note that the original attempt fails because test.select("avgX").collect() returns an Array[Row] rather than a number, so it cannot be subtracted from a column; extract the Double from the (single) aggregated row instead.
val moviePairs = spark.createDataFrame(
  Seq(
    ("Moonlight", 7, 8),
    ("Lord Of The Drinks", 10, 1),
    ("The Disaster Artist", 3, 5),
    ("Airplane!", 7, 9),
    ("2001", 5, 1)
  )
).toDF("movie", "rating1", "rating2")
// find the means for each column and isolate the first (and only) row to get their values
val means = moviePairs.agg(avg("rating1"), avg("rating2")).head()
// alternatively, by using a simple SQL query:
// moviePairs.createOrReplaceTempView("movies")
// val means = spark.sql("select AVG(rating1), AVG(rating2) from movies").head()
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - means.getDouble(0))
  .withColumn("meanDeltaY", col("rating2") - means.getDouble(1))
subMean.show()
Output for the test input DataFrame moviePairs (with the good ol' double precision loss which you can manage as seen here):
+-------------------+-------+-------+-------------------+-------------------+
| movie|rating1|rating2| meanDeltaX| meanDeltaY|
+-------------------+-------+-------+-------------------+-------------------+
| Moonlight| 7| 8| 0.5999999999999996| 3.2|
| Lord Of The Drinks| 10| 1| 3.5999999999999996| -3.8|
|The Disaster Artist| 3| 5|-3.4000000000000004|0.20000000000000018|
| Airplane!| 7| 9| 0.5999999999999996| 4.2|
| 2001| 5| 1|-1.4000000000000004| -3.8|
+-------------------+-------+-------+-------------------+-------------------+

spark aggregation with sorted rows that returns a row's value before a condition is met

I have some data (invoice data). Assuming id ~ date and id is what I'm sorting by:
fid, id, due, overdue
0, 1, 5, 0
0, 3, 5, 5
0, 13, 5, 10
0, 14, 5, 0
1, 5, 5, 0
1, 26, 5, 5
1, 27, 5, 10
1, 38, 5, 0
I want to:
0. remove all rows under some arbitrary date-id (id = 20)
1. group_by fid and sort by id within the group
2. (major) aggregate a new column overdue_id that is the id of the row before the first row in the group that has a nonzero value for overdue
3. (minor) fill a row for every fid even if all rows are filtered out by #0
So the output would be (given a default value of null):
fid, overdue_id
0, 1
1, null
because for fid = 0, the first id with a nonzero overdue is id = 3, and I'd like to output the id of the row before that in id-date order, which is id = 1.
I have group_by('fid').withColumn('overdue_id', ...), and want to use functions like agg, min, and when, but am not sure where to go from there as I am very new to the docs.
You can use the following steps to solve this:
import pyspark.sql.functions as F
from pyspark.sql import *

# added fid=2 for the overdue = 0 condition
fid = [0, 1, 2] * 4
fid.sort()
dateId = [1, 3, 13, 14, 5, 26, 27, 28]
dateId.extend(range(90, 95))
due = [5] * 12
overdue = [0, 5, 10, 0] * 2
overdue.extend([0, 0, 0, 0])
data = zip(fid, dateId, due, overdue)
df = spark.createDataFrame(data, schema=["fid", "dateId", "due", "overdue"])

win = Window.partitionBy(df['fid']).orderBy(df['dateId'])
res = df\
    .filter(F.col("dateId") != 20)\
    .withColumn("lag_id", F.lag(F.col("dateId"), 1).over(win))\
    .withColumn("overdue_id", F.when(F.col("overdue") != 0, F.col("lag_id")).otherwise(None))\
    .groupBy("fid")\
    .agg(F.min("overdue_id").alias("min_overdue_id"))
>>> res.show()
+---+--------------+
|fid|min_overdue_id|
+---+--------------+
| 0| 1|
| 1| 5|
| 2| null|
+---+--------------+
You need to use the lag and window functions. Before we begin: why does your example output show null for fid = 1? The first nonzero overdue is at id = 26, so the id before that is 5; shouldn't the answer be 5? Unless you need something else, you can try this.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

tst = sqlContext.createDataFrame([(0, 1, 5, 0), (0, 20, 5, 0), (0, 30, 5, 5), (0, 13, 5, 10), (0, 14, 5, 0), (1, 5, 5, 0), (1, 26, 5, 5), (1, 27, 5, 10), (1, 38, 5, 0)], schema=["fid", "id", "due", "overdue"])
# To filter the data
tst_f = tst.where('id != 20')
# Define the window
w = Window.partitionBy('fid').orderBy('id')
tst_lag = tst_f.withColumn('overdue_id', F.lag('id').over(w))
# Remove rows with 0 overdue
tst_od = tst_lag.where('overdue != 0')
# Find the row before the first nonzero overdue
tst_res = tst_od.groupby('fid').agg(F.first('overdue_id').alias('overdue_id'))
tst_res.show()
+---+----------+
|fid|overdue_id|
+---+----------+
| 0| 1|
| 1| 5|
+---+----------+
If you are wary of relying on the first function (which is non-deterministic without an explicit ordering), or just want to be confident about avoiding such issues, you can try the more expensive option below:
# Create a copy to avoid an ambiguous join, and select the minimum id from the nonzero-overdue rows
tst_min = tst_od.withColumn("dummy", F.lit('dummy')).groupby('fid').agg(F.min('id').alias('id_min'))
# Join this with the dataframe to get the results
tst_join = tst_od.join(tst_min, on=tst_od.id == tst_min.id_min, how='right')
tst_join.show()
+---+---+---+-------+----------+---+------+
|fid| id|due|overdue|overdue_id|fid|id_min|
+---+---+---+-------+----------+---+------+
| 1| 26| 5| 5| 5| 1| 26|
| 0| 13| 5| 10| 1| 0| 13|
+---+---+---+-------+----------+---+------+
This way you can see all the information. You can then extract the relevant columns from this dataframe using the select(), filter() or where() methods.
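For example (a minimal sketch of my own, continuing from tst_join above):
# Keep only the group key and the lagged id; tst_min.fid disambiguates the duplicated fid column
out = tst_join.select(tst_min.fid, 'overdue_id')
out.show()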

substring function returns a Column type instead of a value. Is there a way to fetch a value out of a Column type in pyspark?

I am building a join condition in my pyspark application using the substring function. This function returns a Column type instead of a value.
substring(trim(coalesce(df.col1)), 13, 3) returns
Column<b'substring(trim(coalesce(col1), 13, 3)'>
I tried expr but I still get the same Column-type result:
expr("substring(trim(coalesce(df.col1)),length(trim(coalesce(df.col1))) - 2, 3)")
I want to compare the value coming from the substring to the value of another dataframe column. Both are of string type.
pyspark:
substring(trim(coalesce(df.col1)), length(trim(coalesce(df.col1))) -2, 3) == df2["col2"]
Let's say col1 = 'abcdefghijklmno'.
The expected output of the substring function should be mno based on the above definition.
Creating sample dataframes to join:
list1 = [('ABC','abcdefghijklmno'),('XYZ','abcdefghijklmno'),('DEF','abcdefghijklabc')]
df1=spark.createDataFrame(list1, ['col1', 'col2'])
list2 = [(1,'mno'),(2,'mno'),(3,'abc')]
df2=spark.createDataFrame(list2, ['col1', 'col2'])
import pyspark.sql.functions as f
Creating a substring that reads the last three characters (a negative start position counts from the end of the string):
cond=f.substring(df1['col2'], -3, 3)==df2['col2']
newdf=df1.join(df2,cond)
>>> newdf.show()
+----+---------------+----+----+
|col1| col2|col1|col2|
+----+---------------+----+----+
| ABC|abcdefghijklmno| 1| mno|
| ABC|abcdefghijklmno| 2| mno|
| XYZ|abcdefghijklmno| 1| mno|
| XYZ|abcdefghijklmno| 2| mno|
| DEF|abcdefghijklabc| 3| abc|
+----+---------------+----+----+
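As a side note on the "fetch a value" part of the question (my own addition, not from the original answer): a Column is just an unevaluated expression, so Spark will always print it as Column<...>. To pull an actual Python value out of it you have to run an action such as first() or collect(), for example with the df1 defined above:
# Evaluate the substring expression and materialize the result as a Python value
row = df1.select(f.substring(df1['col2'], -3, 3).alias('last3')).first()
print(row['last3'])  # 'mno'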

How to work around the immutable data frames in Spark/Scala?

I am trying to convert the following pyspark code into Scala. As you know, dataframes in Scala are immutable, which is making it hard for me to convert this code:
pyspark code:
time_frame = ["3m","6m","9m","12m","18m","27m","60m","60m_ab"]
variable_name = ["var1", "var2", "var3"....., "var30"]
train_df = sqlContext.sql("select * from someTable")
for var in variable_name:
    for tf in range(1, len(time_frame)):
        train_df = train_df.withColumn(str(time_frame[tf] + '_' + var), fn.col(str(time_frame[tf] + '_' + var)) + fn.col(str(time_frame[tf - 1] + '_' + var)))
So, as you can see above, the table has columns that are used to derive further columns. However, the immutable nature of the dataframe in Spark/Scala is getting in the way. Can you help me with a workaround?
Here's one approach that first uses a for-comprehension to generate a list of tuples consisting of column name pairs, and then traverses the list using foldLeft to iteratively transform trainDF via withColumn:
import org.apache.spark.sql.functions._
val timeframes: Seq[String] = ???
val variableNames: Seq[String] = ???
val newCols = for {
  vn <- variableNames
  tf <- 1 until timeframes.size
} yield (timeframes(tf) + "_" + vn, timeframes(tf - 1) + "_" + vn)

val trainDF = spark.sql("""select * from some_table""")

val resultDF = newCols.foldLeft(trainDF)( (accDF, cs) =>
  accDF.withColumn(cs._1, col(cs._1) + col(cs._2))
)
To test the above code, simply provide sample input and create table some_table:
val timeframes = Seq("3m", "6m", "9m")
val variableNames = Seq("var1", "var2")
import spark.implicits._

val df = Seq(
  (1, 10, 11, 12, 13, 14, 15),
  (2, 20, 21, 22, 23, 24, 25),
  (3, 30, 31, 32, 33, 34, 35)
).toDF("id", "3m_var1", "6m_var1", "9m_var1", "3m_var2", "6m_var2", "9m_var2")
df.createOrReplaceTempView("some_table")
resultDF should look like the following:
resultDF.show
// +---+-------+-------+-------+-------+-------+-------+
// | id|3m_var1|6m_var1|9m_var1|3m_var2|6m_var2|9m_var2|
// +---+-------+-------+-------+-------+-------+-------+
// | 1| 10| 21| 33| 13| 27| 42|
// | 2| 20| 41| 63| 23| 47| 72|
// | 3| 30| 61| 93| 33| 67| 102|
// +---+-------+-------+-------+-------+-------+-------+