My data frame looks like -
id date
1 2018-08-23 11:48:22
2 2019-05-03 06:22:01
3 2019-05-13 10:12:15
4 2019-01-22 16:13:29
5 2018-11-27 11:17:19
My expected output is -
id date date1
1 2018-08-23 11:48:22 2018-08
2 2019-05-03 06:22:01 2019-05
3 2019-05-13 10:12:15 2019-05
4 2019-01-22 16:13:29 2019-01
5 2018-11-27 11:17:19 2018-11
How to do it in pyspark?
It looks like you want to drop the day and time parts. You can use the date_format function for that:
>>> df.show()
+---+-------------------+
| id| date|
+---+-------------------+
| 1|2018-08-23 11:48:22|
| 2|2019-05-03 06:22:01|
| 3|2019-05-13 10:12:15|
| 4|2019-01-22 16:13:29|
| 5|2018-11-27 11:17:19|
+---+-------------------+
>>> import pyspark.sql.functions as F
>>>
>>> df.withColumn('date1',F.date_format(F.to_date('date','yyyy-MM-dd HH:mm:ss'),'yyyy-MM')).show()
+---+-------------------+-------+
| id| date| date1|
+---+-------------------+-------+
| 1|2018-08-23 11:48:22|2018-08|
| 2|2019-05-03 06:22:01|2019-05|
| 3|2019-05-13 10:12:15|2019-05|
| 4|2019-01-22 16:13:29|2019-01|
| 5|2018-11-27 11:17:19|2018-11|
+---+-------------------+-------+
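To sanity-check the values date_format should produce, the same truncation can be sketched in plain Python. This is only an illustration and not Spark code: the helper name year_month is hypothetical, and the strptime pattern is the Python counterpart of the 'yyyy-MM-dd HH:mm:ss' pattern used above.

```python
from datetime import datetime


def year_month(ts: str) -> str:
    """Parse 'yyyy-MM-dd HH:mm:ss' text and keep only the year and month."""
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m")


# Two sample rows taken from the question
rows = [(1, "2018-08-23 11:48:22"), (2, "2019-05-03 06:22:01")]
expected = [(row_id, ts, year_month(ts)) for row_id, ts in rows]
print(expected)
```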
You can also do it with to_date followed by substring, for example:
import pyspark.sql.functions as F
import pyspark.sql.types as T
rawData = [(1, "2018-08-23 11:48:22"),
(2, "2019-05-03 06:22:01"),
(3, "2019-05-13 10:12:15")]
df = spark.createDataFrame(rawData).toDF("id","my_date")
df.withColumn("new_my_date",\
F.substring(F.to_date(F.col("my_date")), 1,7))\
.show()
+---+-------------------+-----------+
| id| my_date|new_my_date|
+---+-------------------+-----------+
| 1|2018-08-23 11:48:22| 2018-08|
| 2|2019-05-03 06:22:01| 2019-05|
| 3|2019-05-13 10:12:15| 2019-05|
+---+-------------------+-----------+
import pyspark.sql.functions as F
split_col = F.split(df['date'], '-')
df = df.withColumn('year', split_col.getItem(0)).withColumn('month', split_col.getItem(1))
df = df.select(F.concat(df['year'], F.lit('-'),df['month']).alias('year_month'))
df.show()
+----------+
|year_month|
+----------+
| 2018-08|
| 2019-05|
| 2019-05|
| 2019-01|
| 2018-11|
+----------+
We need to merge multiple rows that share an ID into a single record using PySpark. If a column was updated more than once, we have to keep the value from the most recent update.
Please note that NULL means no update was made to the column in that row.
So, basically, we have to create a single row with the consolidated updates made to the record.
I am looking for an answer similar to "Merge rows in a spark scala Dataframe", but in PySpark.
So, for example, if this is the dataframe:
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update1 | <*no-update*> | 1634228709 |
| 123 | <*no-update*> | 80 | 1634228724 |
| 123 | update2 | <*no-update*> | 1634229000 |
expected output is -
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update2 | 80 | 1634229000 |
Let's say that our input dataframe is:
+---+-------+----+----------+
|id |col1 |col2|updated_at|
+---+-------+----+----------+
|123|null |null|1634228709|
|123|null |80 |1634228724|
|123|update2|90 |1634229000|
|12 |update1|null|1634221233|
|12 |null |80 |1634228333|
|12 |update2|null|1634221220|
+---+-------+----+----------+
What we want is to convert updated_at to TimestampType and then order by id and by updated_at in descending order:
df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
F.col("id"), F.col("updated_at").desc()
)
that gives us:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|12 |null |80 |2021-10-14 18:18:53|
|12 |update1|null|2021-10-14 16:20:33|
|12 |update2|null|2021-10-14 16:20:20|
|123|update2|90 |2021-10-14 18:30:00|
|123|null |80 |2021-10-14 18:25:24|
|123|null |null|2021-10-14 18:25:09|
+---+-------+----+-------------------+
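Now group by id and take the first non-null value in each column (or null if there is none). Note that first depends on row order, which Spark does not strictly guarantee after a shuffle, so on larger multi-partition data the result may not be deterministic: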
exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
df = df.groupBy(F.col("id")).agg(*exp)
And the result is:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|123|update2|90 |2021-10-14 18:30:00|
|12 |update1|80 |2021-10-14 18:18:53|
+---+-------+----+-------------------+
Here's the full example code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType
if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    data = [
        (123, None, None, 1634228709),
        (123, None, 80, 1634228724),
        (123, "update2", 90, 1634229000),
        (12, "update1", None, 1634221233),
        (12, None, 80, 1634228333),
        (12, "update2", None, 1634221220),
    ]
    columns = ["id", "col1", "col2", "updated_at"]
    df = spark.createDataFrame(data, columns)
    # Newest update first within each id
    df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
        F.col("id"), F.col("updated_at").desc()
    )
    # First non-null value of every non-id column within each id
    exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
    df = df.groupBy(F.col("id")).agg(*exp)
    df.show(truncate=False)
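The merge rule used in this answer (sort each id by updated_at descending, then keep the first non-null value per column) can be checked without a cluster. The following is a minimal plain-Python sketch of that rule, not Spark code; merge_latest is a hypothetical helper name, and it assumes the first column is the id and the last column is the update timestamp.

```python
from itertools import groupby
from operator import itemgetter


def merge_latest(rows, columns):
    """Collapse rows per id, keeping the most recent non-None value per column."""
    key, ts = columns[0], columns[-1]
    records = [dict(zip(columns, row)) for row in rows]
    # Sort by id, then newest update first (mirrors the orderBy step)
    records.sort(key=lambda r: (r[key], -r[ts]))
    merged = []
    for _, group in groupby(records, key=itemgetter(key)):
        group = list(group)
        # First non-None value per column (mirrors first(..., ignorenulls=True))
        merged.append(tuple(
            next((r[c] for r in group if r[c] is not None), None)
            for c in columns
        ))
    return merged


columns = ["id", "col1", "col2", "updated_at"]
data = [
    (123, None, None, 1634228709),
    (123, None, 80, 1634228724),
    (123, "update2", 90, 1634229000),
    (12, "update1", None, 1634221233),
    (12, None, 80, 1634228333),
    (12, "update2", None, 1634221220),
]
result = merge_latest(data, columns)
print(result)
```

Because the sort here is explicit and local, the outcome does not depend on partitioning, which makes it a convenient reference when validating the Spark result.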
I want to explode the column in spark scala
reference_month M M+1 M+2
2020-01-01 10 12 10
2020-02-01 10 12 10
The output should be like
reference_month Month reference_date_id
2020-01-01 10 2020-01
2020-01-01 12 2020-02
2020-01-01 10 2020-03
2020-02-01 10 2020-02
2020-02-01 12 2020-03
2020-02-01 10 2020-04
Where reference_date_id = reference_month + x ( where x is derived from m, m+1,m+2).
Is there any way by which we can get the output in this format in spark scala?
You can use the unpivot technique in Apache Spark:
import org.apache.spark.sql.functions.expr
data.select($"reference_month",expr("stack(3,`M`,`M+1`,`M+2`) as (Month )")).show()
You can use the stack function (PySpark example):
import sys
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when,concat_ws,lpad,row_number,sum,col,expr,substring,length
from pyspark.sql.window import Window
schema = StructType([
StructField("reference_month", StringType(), True),\
StructField("M", IntegerType(), True),\
StructField("M+1", IntegerType(), True),\
StructField("M+2", IntegerType(), True)
])
mnt = [("2020-01-01",10,12,10),("2020-02-01",10,12,10)]
df=spark.createDataFrame(mnt,schema)
newdf = df.withColumn("t",col("reference_month").cast("date")).drop("reference_month").withColumnRenamed("t","reference_month")
exp = expr("""stack(3,`M`,`M+1`,`M+2`) as (Values)""")
t = newdf.select("reference_month",exp).withColumn('mnth',substring("reference_month",6,2)).withColumn("newmnth",col("mnth").cast("Integer")).drop('mnth')
windowval = (Window.partitionBy('reference_month').orderBy('reference_month').rowsBetween(-sys.maxsize, 0))
ref_cal=t.withColumn("reference_date_id",row_number().over(windowval)-1)
ref_cal.withColumn('new_dt',concat_ws('-',substring("reference_month",1,4),when(length(col("reference_date_id")+col("newmnth"))<2,lpad(col("reference_date_id")+col("newmnth"),2,'0')).otherwise(col("reference_date_id")+col("newmnth")))).drop("newmnth","reference_date_id").withColumnRenamed("new_dt","reference_date_id").orderBy("reference_month").show()
+---------------+------+-----------------+
|reference_month|Values|reference_date_id|
+---------------+------+-----------------+
| 2020-01-01| 10| 2020-01|
| 2020-01-01| 12| 2020-02|
| 2020-01-01| 10| 2020-03|
| 2020-02-01| 10| 2020-02|
| 2020-02-01| 12| 2020-03|
| 2020-02-01| 10| 2020-04|
+---------------+------+-----------------+
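The reference_date_id in the answer above is built by adding the offset to the month number as text, which appears not to roll over into the next year (December plus one would yield month 13). As a hedged illustration of the intended arithmetic, here is a plain-Python sketch; shift_month and unpivot are hypothetical helper names, and the offsets 0, 1, 2 are assumed to correspond to the M, M+1 and M+2 columns. In Spark itself, add_months from pyspark.sql.functions performs calendar-aware month addition.

```python
from datetime import date


def shift_month(d: date, offset: int) -> str:
    """Return 'yyyy-MM' for d shifted by offset months, rolling over years."""
    index = d.year * 12 + (d.month - 1) + offset
    year, month0 = divmod(index, 12)
    return f"{year:04d}-{month0 + 1:02d}"


def unpivot(rows, offsets=(0, 1, 2)):
    """Turn (reference_month, M, M+1, M+2) rows into one row per value."""
    out = []
    for ref, *values in rows:
        d = date.fromisoformat(ref)
        for offset, value in zip(offsets, values):
            out.append((ref, value, shift_month(d, offset)))
    return out


rows = [("2020-01-01", 10, 12, 10), ("2020-02-01", 10, 12, 10)]
for line in unpivot(rows):
    print(line)
```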
We can create an array from M, M+1 and M+2 and then explode the array to get the required dataframe.
Example:
df.selectExpr("reference_month","array(M,`M+1`,`M+2`)as arr").
selectExpr("reference_month","explode(arr) as Month").show()
+---------------+-----+
|reference_month|Month|
+---------------+-----+
| 202001| 10|
| 202001| 12|
| 202001| 10|
| 202002| 10|
| 202002| 12|
| 202002| 10|
+---------------+-----+
//or
val cols= Seq("M","M+1","M+2")
df.withColumn("arr",array(cols.head,cols.tail:_*)).drop(cols:_*).
selectExpr("reference_month","explode(arr) as Month").show()
I have data in table/Dataframe.
table/dataframe: temptable/temp_df
StoreId,Total_Sales,Date
S1,10000,01-Jan-18
S1,20000,02-Jan-18
S1,25000,03-Jan-18
S1,30000,04-Jan-18
S1,29000,05-Jan-18--> total sales value is decline from previous value(04-jan-18)
S1,28500,06-Jan-18--> total sales value is decline from previous value(05-jan-18)
S1,25500,07-Jan-18--> total sales value is decline from previous value(06-jan-18)(output row)
S1,25500,08-Jan-18--> total sales value is constant from previous value(07-jan-18)
S1,30000,09-Jan-18
S1,29000,10-Jan-18-->same
S1,28000,11-Jan-18-->same
S1,25000,12-Jan-18-->same (output row)
S1,25000,13-Jan-18
S1,30000,14-Jan-18
S1,29000,15-Jan-18
S1,28000,16-Jan-18
So I want the records from the dataframe/table where Total_Sales has declined three consecutive times. If a row has the same Total_Sales as the previous one, it counts as neither a decline nor an increase.
The expected output is:
StoreId,Total_Sales,Date
S1,25500,07-Jan-18
S1,25000,12-Jan-18
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql import Window
sc = SparkSession.builder.appName("example").\
config("spark.driver.memory","1g").\
config("spark.executor.cores",2).\
config("spark.cores.max",4).getOrCreate()
df = sc.read.format("csv").option("header","true").option("delimiter",",").load("storesales.csv")
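# Order by the parsed date; the raw "dd-MMM-yy" strings only sort correctly within a single month
w = Window.partitionBy("StoreID").orderBy(f.to_date("Date", "dd-MMM-yy"))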
df = df.withColumn("oneprev",f.lag("Total_Sales",1).over(w)).withColumn("twoprev",f.lag("Total_Sales",2).over(w))
df = df.withColumn("isdeclining",f.when((df["Total_Sales"].cast("double") < df["oneprev"].cast("double")) & (df["oneprev"].cast("double") < df["twoprev"].cast("double")) ,"declining").otherwise("notdeclining"))
df = df.withColumn("oneprev_isdeclining",f.lag("isdeclining",1).over(w)).withColumn("twoprev_isdeclining",f.lag("isdeclining",2).over(w))
df = df.filter((df["isdeclining"] == "declining") & (df["oneprev_isdeclining"] != "declining") & (df["twoprev_isdeclining"] != "declining")).select(["StoreID","Date","Total_Sales"])
df.show()
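You can combine some of the lines into one, but ideally the Spark SQL optimizer should take care of that. Note that this code flags a row after two consecutive declines (current < previous < the one before), so it returns 06-Jan-18 and 11-Jan-18 rather than the 07-Jan-18 and 12-Jan-18 rows asked for; adding a third lag, f.lag("Total_Sales",3), to the comparison would require three declines.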
Sample input:
+-------+-----------+---------+
|StoreId|Total_Sales| Date|
+-------+-----------+---------+
| S1| 10000|01-Jan-18|
| S1| 20000|02-Jan-18|
| S1| 25000|03-Jan-18|
| S1| 30000|04-Jan-18|
| S1| 29000|05-Jan-18|
| S1| 28500|06-Jan-18|
| S1| 25500|07-Jan-18|
| S1| 25500|08-Jan-18|
| S1| 30000|09-Jan-18|
| S1| 29000|10-Jan-18|
| S1| 28000|11-Jan-18|
| S1| 25000|12-Jan-18|
| S1| 25000|13-Jan-18|
| S1| 30000|14-Jan-18|
| S1| 29000|15-Jan-18|
+-------+-----------+---------+
Output of the code above:
+-------+---------+-----------+
|StoreID| Date|Total_Sales|
+-------+---------+-----------+
| S1|06-Jan-18| 28500|
| S1|11-Jan-18| 28000|
+-------+---------+-----------+
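For comparison, the rule stated in the question (a row qualifies when it completes three strict declines in a row, and an equal value resets the count) can be sketched in plain Python. This is not Spark code; third_decline_rows is a hypothetical helper, and it assumes the rows belong to one store and are already sorted by date.

```python
def third_decline_rows(rows):
    """Return the rows that complete exactly three consecutive strict declines."""
    hits = []
    run = 0        # length of the current streak of strict declines
    prev = None    # Total_Sales of the previous row
    for row in rows:
        sales = row[1]
        if prev is not None and sales < prev:
            run += 1
        else:
            run = 0  # an increase or an equal value breaks the streak
        if run == 3:
            hits.append(row)
        prev = sales
    return hits


sales = [10000, 20000, 25000, 30000, 29000, 28500, 25500, 25500,
         30000, 29000, 28000, 25000, 25000, 30000, 29000, 28000]
rows = [("S1", value, f"{day:02d}-Jan-18") for day, value in enumerate(sales, start=1)]
print(third_decline_rows(rows))
```

On the sample data this yields the 07-Jan-18 and 12-Jan-18 rows from the question's expected output.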
I have a DataFrame that looks like this:
|string_code|prefix_string_code|
|1234 |001234 |
|123 |000123 |
|56789 |056789 |
Basically, what I want is to prepend as many '0' characters as necessary so that the length of prefix_string_code is 6.
What I have tried:
df.withColumn('prefix_string_code', when(length(col('string_code')) < 6, concat(lit('0' * (6 - length(col('string_code')))), col('string_code'))).otherwise(col('string_code')))
It did not work and instead produced the following:
|string_code|prefix_string_code|
|1234 |0.001234 |
|123 |0.000123 |
|56789 |0.056789 |
As you can see, apart from the unwanted decimal form, the padding itself works. How do I do this properly?
Thanks!
You can use the lpad function for this:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql import Row
>>> rdd = sc.parallelize([1234,123,56789,1234567])
>>> data = rdd.map(lambda x: Row(x))
>>> df=spark.createDataFrame(data,['string_code'])
>>> df.show()
+-----------+
|string_code|
+-----------+
| 1234|
| 123|
| 56789|
| 1234567|
+-----------+
>>> df.withColumn('prefix_string_code', F.when(F.length(df['string_code']) < 6 ,F.lpad(df['string_code'],6,'0')).otherwise(df['string_code'])).show()
+-----------+------------------+
|string_code|prefix_string_code|
+-----------+------------------+
| 1234| 001234|
| 123| 000123|
| 56789| 056789|
| 1234567| 1234567|
+-----------+------------------+
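The when(...) guard in the answer matters because Spark's lpad shortens values that are longer than the target length. A small plain-Python sketch of that behaviour follows; lpad and pad_if_short are hypothetical stand-ins written for illustration, not the Spark functions themselves, and a non-empty pad string is assumed.

```python
def lpad(value, length, pad="0"):
    """Mimic lpad: left-pad to `length`, truncating values that are longer."""
    text = str(value)
    if len(text) >= length:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return fill + text


def pad_if_short(value, length=6, pad="0"):
    """Pad only short values, like the when(length < 6, lpad(...)) guard."""
    text = str(value)
    return lpad(text, length, pad) if len(text) < length else text


for code in (1234, 123, 56789, 1234567):
    print(code, lpad(code, 6), pad_if_short(code))
```

Without the guard, 1234567 would be cut down to 123456, which is why the answer keeps the original value in the otherwise branch.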
I read data from a CSV file, but it has no index column.
I want to add a column numbered from 1 to the number of rows.
What should I do? Thanks. (Scala)
With Scala you can use:
import org.apache.spark.sql.functions._
df.withColumn("id", monotonically_increasing_id())
You can refer to this example and the Scala docs.
With Pyspark you can use:
from pyspark.sql.functions import monotonically_increasing_id
df_index = df.select("*").withColumn("id", monotonically_increasing_id())
monotonically_increasing_id - The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
"I want to add a column from 1 to row's number."
Let's say we have the following DF:
+--------+-------------+-------+
| userId | productCode | count |
+--------+-------------+-------+
| 25 | 6001 | 2 |
| 11 | 5001 | 8 |
| 23 | 123 | 5 |
+--------+-------------+-------+
To generate the IDs starting from 1:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.orderBy("count")
val result = df.withColumn("index", row_number().over(w))
This would add an index column ordered by increasing value of count.
+--------+-------------+-------+-------+
| userId | productCode | count | index |
+--------+-------------+-------+-------+
| 25 | 6001 | 2 | 1 |
| 23 | 123 | 5 | 2 |
| 11 | 5001 | 8 | 3 |
+--------+-------------+-------+-------+
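Conceptually, row_number over an ordered window is a sort followed by 1-based numbering. A plain-Python sketch of that idea, using the sample rows above (with_row_number is a hypothetical helper, not part of Spark):

```python
def with_row_number(rows, order_key, start=1):
    """Sort rows by order_key and append a consecutive index to each row."""
    ordered = sorted(rows, key=order_key)
    return [row + (i,) for i, row in enumerate(ordered, start=start)]


# (userId, productCode, count) rows from the example
rows = [(25, 6001, 2), (11, 5001, 8), (23, 123, 5)]
indexed = with_row_number(rows, order_key=lambda r: r[2])
print(indexed)
```

Note that in Spark a window without partitionBy moves all rows to a single partition, so this approach is best suited to data that fits comfortably on one executor.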
How to get a sequential id column id[1, 2, 3, 4...n]:
from pyspark.sql.functions import desc, row_number, monotonically_increasing_id
from pyspark.sql.window import Window
df_with_seq_id = df.withColumn('index_column_name', row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)
Note that row_number() starts at 1, so subtract 1 only if you want a 0-indexed column; drop the "- 1" to get 1, 2, 3, ..., n.
NOTE: The monotonically_increasing_id approaches above do not give a consecutive sequence number, only an increasing id.
A simple way to get consecutive numbers while preserving the row order is zipWithIndex, as shown below.
Sample data.
+-------------------+
| Name|
+-------------------+
| Ram Ghadiyaram|
| Ravichandra|
| ilker|
| nick|
| Naveed|
| Gobinathan SP|
|Sreenivas Venigalla|
| Jackela Kowski|
| Arindam Sengupta|
| Liangpi|
| Omar14|
| anshu kumar|
+-------------------+
package com.example

import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row}

/**
  * DistributedDataIndex : Program to index an RDD with
  */
object DistributedDataIndex extends App with Logging {

  val spark = builder
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()

  import spark.implicits._

  val df = spark.sparkContext.parallelize(
    Seq("Ram Ghadiyaram", "Ravichandra", "ilker", "nick"
      , "Naveed", "Gobinathan SP", "Sreenivas Venigalla", "Jackela Kowski", "Arindam Sengupta", "Liangpi", "Omar14", "anshu kumar"
    )).toDF("Name")
  df.show
  logInfo("addColumnIndex here")
  // Add index now...
  val df1WithIndex = addColumnIndex(df)
    .withColumn("monotonically_increasing_id", monotonically_increasing_id)
  df1WithIndex.show(false)

  /**
    * Add Column Index to dataframe to each row
    */
  def addColumnIndex(df: DataFrame) = {
    spark.sqlContext.createDataFrame(
      df.rdd.zipWithIndex.map {
        case (row, index) => Row.fromSeq(row.toSeq :+ index)
      },
      // Create schema for index column
      StructType(df.schema.fields :+ StructField("index", LongType, false)))
  }
}
Result :
+-------------------+-----+---------------------------+
|Name |index|monotonically_increasing_id|
+-------------------+-----+---------------------------+
|Ram Ghadiyaram |0 |0 |
|Ravichandra |1 |8589934592 |
|ilker |2 |8589934593 |
|nick |3 |17179869184 |
|Naveed |4 |25769803776 |
|Gobinathan SP |5 |25769803777 |
|Sreenivas Venigalla|6 |34359738368 |
|Jackela Kowski |7 |42949672960 |
|Arindam Sengupta |8 |42949672961 |
|Liangpi |9 |51539607552 |
|Omar14 |10 |60129542144 |
|anshu kumar |11 |60129542145 |
+-------------------+-----+---------------------------+
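The jumps in the monotonically_increasing_id column come from how the value is composed: the Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. A small plain-Python sketch that decodes the values from the table (split_id and make_id are hypothetical helper names, and the bit layout is an implementation detail that could change):

```python
ROW_BITS = 33  # lower 33 bits hold the record number within a partition


def split_id(mid: int):
    """Split an id into (partition_id, row_within_partition)."""
    return mid >> ROW_BITS, mid & ((1 << ROW_BITS) - 1)


def make_id(partition_id: int, row: int) -> int:
    """Compose an id from a partition id and a row number within it."""
    return (partition_id << ROW_BITS) | row


for mid in (0, 8589934592, 8589934593, 17179869184, 60129542145):
    print(mid, split_id(mid))
```

This is why the ids are unique and increasing but not consecutive, while zipWithIndex gives 0, 1, 2, ... across partitions.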
As Ram said, zipWithIndex is better than monotonically_increasing_id if you need consecutive row numbers. Try this (PySpark environment):
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType
new_schema = StructType(original_dataframe.schema.fields[:] + [StructField("index", LongType(), False)])
zipped_rdd = original_dataframe.rdd.zipWithIndex()
indexed = (zipped_rdd.map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]])).toDF(new_schema))
where original_dataframe is the dataframe you want to index, and row_with_index is a Row factory listing the original columns plus the new index column, which you can write as
row_with_index = Row(
"calendar_date"
,"year_week_number"
,"year_period_number"
,"realization"
,"index"
)
Here, calendar_date, year_week_number, year_period_number and realization were the columns of my original dataframe. You can replace the names with the names of your columns. index is the name of the new column that holds the row numbers.
If you require a unique sequence number for each row, I have a slightly different approach, where a static column is added and is used to compute the row number using that column.
val srcData = spark.read.option("header","true").csv("/FileStore/sample.csv")
srcData.show(5)
+--------+--------------------+
| Job| Name|
+--------+--------------------+
|Morpheus| HR Specialist|
| Kayla| Lawyer|
| Trisha| Bus Driver|
| Robert|Elementary School...|
| Ober| Judge|
+--------+--------------------+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}
val srcDataModf = srcData.withColumn("sl_no", lit("1"))
val windowSpecRowNum = Window.partitionBy("sl_no").orderBy("sl_no")
srcDataModf.withColumn("row_num", row_number().over(windowSpecRowNum)).drop("sl_no").select("row_num", "Name", "Job").show(5)
+-------+--------------------+--------+
|row_num| Name| Job|
+-------+--------------------+--------+
| 1| HR Specialist|Morpheus|
| 2| Lawyer| Kayla|
| 3| Bus Driver| Trisha|
| 4|Elementary School...| Robert|
| 5| Judge| Ober|
+-------+--------------------+--------+
For SparkR:
(Assuming sdf is some sort of spark data frame)
sdf<- withColumn(sdf, "row_id", SparkR:::monotonically_increasing_id())