I have a CSV file like below.
PK,key,Value
100,col1,val11
100,col2,val12
100,idx,1
100,icol1,ival11
100,icol3,ival13
100,idx,2
100,icol1,ival21
100,icol2,ival22
101,col1,val21
101,col2,val22
101,idx,1
101,icol1,ival11
101,icol3,ival13
101,idx,3
101,icol1,ival31
101,icol2,ival32
I want to transform this into the following.
PK,idx,key,Value
100,,col1,val11
100,,col2,val12
100,1,idx,1
100,1,icol1,ival11
100,1,icol3,ival13
100,2,idx,2
100,2,icol1,ival21
100,2,icol2,ival22
101,,col1,val21
101,,col2,val22
101,1,idx,1
101,1,icol1,ival11
101,1,icol3,ival13
101,3,idx,3
101,3,icol1,ival31
101,3,icol2,ival32
Basically I want to create the an new column called idx in the output dataframe which will be populated with the same value "n" as that of the row following the key=idx, value="n".
Here is one way using last window function with Spark >= 2.0.0:
import org.apache.spark.sql.functions.{last, when, lit}
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("PK").rowsBetween(Window.unboundedPreceding, 0)
df.withColumn("idx", when($"key" === lit("idx"), $"Value"))
.withColumn("idx", last($"idx", true).over(w))
.orderBy($"PK")
.show
Output:
+---+-----+------+----+
| PK| key| Value| idx|
+---+-----+------+----+
|100| col1| val11|null|
|100| col2| val12|null|
|100| idx| 1| 1|
|100|icol1|ival11| 1|
|100|icol3|ival13| 1|
|100| idx| 2| 2|
|100|icol1|ival21| 2|
|100|icol2|ival22| 2|
|101| col1| val21|null|
|101| col2| val22|null|
|101| idx| 1| 1|
|101|icol1|ival11| 1|
|101|icol3|ival13| 1|
|101| idx| 3| 3|
|101|icol1|ival31| 3|
|101|icol2|ival32| 3|
+---+-----+------+----+
The code first creates a new column called idx which contains the value of Value when key == idx, or null otherwise. Then it retrieves the last observed idx over the defined window.
Scala 2.12 and Spark 2.2.1 here. I have the following code:
myDf.show(5)
myDf.withColumn("rank", myDf("rank") * 10)
myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
When I run this, in the logs I see:
+---------+-----------+----+
|fizz|buzz|rizzrankrid|rank|
+---------+-----------+----+
| 2| 5| 1440370637| 128|
| 2| 5| 2114144780|1352|
| 2| 8| 199559784|3233|
| 2| 5| 1522258372| 895|
| 2| 9| 918480276| 882|
+---------+-----------+----+
And now:
+---------+-----------+-----+
|fizz|buzz|rizzrankrid| rank|
+---------+-----------+-----+
| 2| 5| 1440370637| 1280|
| 2| 5| 2114144780|13520|
| 2| 8| 199559784|32330|
| 2| 5| 1522258372| 8950|
| 2| 9| 918480276| 8820|
+---------+-----------+-----+
So, interesting:
The first withColumn works, transforming each row's rank value by multiplying itself by 10
However the second withColumn fails, which is just adding the current date/time to all rows as a new lastRanOn column
What do I need to do to get the lastRanOn column addition working?
Your example is probably too simple, because modifying rank should also not work.
withColumn does not update DataFrame, it's create a new DataFrame.
So you must do:
// if myDf is a var
myDf.show(5)
myDf = myDf.withColumn("rank", myDf("rank") * 10)
myDf = myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
or for example:
myDf.withColumn("rank", myDf("rank") * 10).withColumn("lastRanOn", current_date()).show(5)
Only then you will have new column added - after reassigning new DataFrame reference
I'm having a hard time framing the following Pyspark dataframe manipulation.
Essentially I am trying to group by category and then pivot/unmelt the subcategories and add new columns.
I've tried a number of ways, but they are very slow and and are not leveraging Spark's parallelism.
Here is my existing (slow, verbose) code:
from pyspark.sql.functions import lit
df = sqlContext.table('Table')
#loop over category
listids = [x.asDict().values()[0] for x in df.select("category").distinct().collect()]
dfArray = [df.where(df.category == x) for x in listids]
for d in dfArray:
#loop over subcategory
listids_sub = [x.asDict().values()[0] for x in d.select("sub_category").distinct().collect()]
dfArraySub = [d.where(d.sub_category == x) for x in listids_sub]
num = 1
for b in dfArraySub:
#renames all columns to append a number
for c in b.columns:
if c not in ['category','sub_category','date']:
column_name = str(c)+'_'+str(num)
b = b.withColumnRenamed(str(c), str(c)+'_'+str(num))
b = b.drop('sub_category')
num += 1
#if no df exists, create one and continually join new columns
try:
all_subs = all_subs.drop('sub_category').join(b.drop('sub_category'), on=['cateogry','date'], how='left')
except:
all_subs = b
#Fixes missing columns on union
try:
try:
diff_columns = list(set(all_cats.columns) - set(all_subs.columns))
for d in diff_columns:
all_subs = all_subs.withColumn(d, lit(None))
all_cats = all_cats.union(all_subs)
except:
diff_columns = list(set(all_subs.columns) - set(all_cats.columns))
for d in diff_columns:
all_cats = all_cats.withColumn(d, lit(None))
all_cats = all_cats.union(all_subs)
except Exception as e:
print e
all_cats = all_subs
But this is very slow. Any guidance would be greatly appreciated!
Your output is not really logical, but we can achieve this result using the pivot function. You need to precise your rules otherwise I can see a lot of cases it may fails.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df.show()
+----------+---------+------------+------------+------------+
| date| category|sub_category|metric_sales|metric_trans|
+----------+---------+------------+------------+------------+
|2018-01-01|furniture| bed| 100| 75|
|2018-01-01|furniture| chair| 110| 85|
|2018-01-01|furniture| shelf| 35| 30|
|2018-02-01|furniture| bed| 55| 50|
|2018-02-01|furniture| chair| 45| 40|
|2018-02-01|furniture| shelf| 10| 15|
|2018-01-01| rug| circle| 2| 5|
|2018-01-01| rug| square| 3| 6|
|2018-02-01| rug| circle| 3| 3|
|2018-02-01| rug| square| 4| 5|
+----------+---------+------------+------------+------------+
df.withColumn("fg", F.row_number().over(Window().partitionBy('date', 'category').orderBy("sub_category"))).groupBy('date', 'category', ).pivot('fg').sum('metric_sales', 'metric_trans').show()
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| date| category|1_sum(CAST(`metric_sales` AS BIGINT))|1_sum(CAST(`metric_trans` AS BIGINT))|2_sum(CAST(`metric_sales` AS BIGINT))|2_sum(CAST(`metric_trans` AS BIGINT))|3_sum(CAST(`metric_sales` AS BIGINT))|3_sum(CAST(`metric_trans` AS BIGINT))|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|2018-02-01| rug| 3| 3| 4| 5| null| null|
|2018-02-01|furniture| 55| 50| 45| 40| 10| 15|
|2018-01-01|furniture| 100| 75| 110| 85| 35| 30|
|2018-01-01| rug| 2| 5| 3| 6| null| null|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
This question already has answers here:
How do I add a new column to a Spark DataFrame (using PySpark)?
(10 answers)
Closed 5 years ago.
I am bit new on pyspark. I have a spark dataframe with about 5 columns and 5 records. I have list of 5 records.
Now I want to add these 5 static records from the list to the existing dataframe using withColumn. I did that, but its not working.
Any suggestions are greatly appreciated.
Below is my sample:
dq_results=[]
for a in range(0,len(dq_results)):
dataFile_df=dataFile_df.withColumn("dq_results",lit(dq_results[a]))
print lit(dq_results[a])
thanks,
Sreeram
dq_results=[]
Create one data frame from list dq_results:
df_list=spark.createDataFrame(dq_results_list,schema=dq_results_col)
Add one column for df_list id (it will be row id)
df_list_id = df_list.withColumn("id", monotonically_increasing_id())
Add one column for dataFile_df id (it will be row id)
dataFile_df= df_list.withColumn("id", monotonically_increasing_id())
Now we can join the both dataframe df_list and dataFile_df.
dataFile_df.join(df_list,"id").show()
So dataFile_df is final data frame
withColumn will add a new Column, but I guess you might want to append Rows instead. Try this:
df1 = spark.createDataFrame([(a, a*2, a+3, a+4, a+5) for a in range(5)], "A B C D E".split(' '))
new_data = [[100 + i*j for i in range(5)] for j in range(5)]
df1.unionAll(spark.createDataFrame(new_data)).show()
+---+---+---+---+---+
| A| B| C| D| E|
+---+---+---+---+---+
| 0| 0| 3| 4| 5|
| 1| 2| 4| 5| 6|
| 2| 4| 5| 6| 7|
| 3| 6| 6| 7| 8|
| 4| 8| 7| 8| 9|
|100|100|100|100|100|
|100|101|102|103|104|
|100|102|104|106|108|
|100|103|106|109|112|
|100|104|108|112|116|
+---+---+---+---+---+
I have the following dataframe:
df.show
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28| 1|
|2017-06-17| 2|
|2017-05-20| 1|
|2017-06-23| 2|
|2017-06-16| 3|
|2017-06-30| 1|
I want to replace the count values by 0, where it is greater than 1, i.e., the resultant dataframe should be:
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28| 1|
|2017-06-17| 0|
|2017-05-20| 1|
|2017-06-23| 0|
|2017-06-16| 0|
|2017-06-30| 1|
I tried the following expression:
df.withColumn("count", when(($"count" > 1), 0)).show
but the output was
+----------+--------+
| createdon| count|
+----------+--------+
|2017-06-28| null|
|2017-06-17| 0|
|2017-05-20| null|
|2017-06-23| 0|
|2017-06-16| 0|
|2017-06-30| null|
I am not able to understand, why for the value 1, null is getting displayed and how to overcome that. Can anyone help me?
You need to chain otherwise after when to specify the values where the conditions don't hold; In your case, it would be count column itself:
df.withColumn("count", when(($"count" > 1), 0).otherwise($"count"))
This can be done using udf function too
def replaceWithZero = udf((col: Int) => if(col > 1) 0 else col) //udf function
df.withColumn("count", replaceWithZero($"count")).show(false) //calling udf function
Note : udf functions should always be the choice only when there is no inbuilt functions as it requires serialization and deserialization of column data.