Increase the date based on the length of the Name column in a PySpark dataframe

I'm trying to add new columns based on the input Name and Date columns as below:
Input:
+------+-----------+
|Name  |Date       |
+------+-----------+
|PETER |1986-May-29|
+------+-----------+
Expected Output:
+---------+-----------+
|Character|   New_Date|
+---------+-----------+
|        P|1986-May-29|
|        E|1986-May-30|
|        T|1986-May-31|
|        E|1986-Jun-01|
|        R|1986-Jun-02|
+---------+-----------+
df_withchars = df.withColumn("Character", F.explode(F.split('Name','')))\
.filter(F.col('Character') != '')
df_withchars.withColumn('New_Date', (lambda x: F.date_add(x['Date'], 1) for i in range(len(x[0])))).show()
I tried the above code and it throws NameError: name 'x' is not defined.

You can use split to create an array from the column and then posexplode to explode this array.
posexplode is similar to the explode function, but it returns one additional column: the position/index of each item. That means it will give you the numbers 0 to 4 in your particular example. You can add this number to the date using the date_add function.
Before we start, let's import the relevant functions:
from pyspark.sql.types import StructField, StructType, DateType, StringType
from pyspark.sql.functions import split, col, date_add, posexplode
from datetime import date
Then create a data frame:
sample_data = [('PETER', date.today())]
sample_schema = StructType([
    StructField('Name', StringType(), True),
    StructField('Date', DateType(), True),
])
df = spark.createDataFrame(data=sample_data, schema=sample_schema)
The final step is to use the split and posexplode functions to get the index of each character, and then add that index to the date.
df \
.withColumn('SplittedName', split('Name', "(?!$)")) \
.select('Name', 'Date', posexplode('SplittedName')) \
.withColumn('NewDate', date_add('Date', col('pos'))).show()
Note that in the above code I've used the regex pattern (?!$) (a negative lookahead that matches at every position except the end of the string, so no trailing empty element is produced). You can use the empty string "" as well; however, it will return an additional empty item in the resulting SplittedName array, which would require adding another step just to remove this empty, irrelevant item.
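If you do go with "", one possible version of that extra clean-up step (a sketch only; array_remove needs Spark 2.4+, and the column names are the ones from the snippet above):
from pyspark.sql.functions import split, array_remove, posexplode, date_add, col

# Split on '' and drop the trailing empty item before exploding.
df \
    .withColumn('SplittedName', array_remove(split('Name', ''), '')) \
    .select('Name', 'Date', posexplode('SplittedName')) \
    .withColumn('NewDate', date_add('Date', col('pos'))) \
    .show()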
The result contains the original Name and Date columns plus pos, col and NewDate.
EDIT:
To get the resulting date in the desired format, you just need to use date_format (don't forget to import the function first) as follows:
df \
.withColumn('SplittedName', split('Name', "(?!$)")) \
.select('Name', 'Date', posexplode('SplittedName')) \
.withColumn('NewDate', date_format(date_add('Date', col('pos')), "yyyy-MMM-dd")).show()
In the new result, NewDate is formatted as yyyy-MMM-dd, matching the format in the question.
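For reference, a minimal end-to-end sketch against the exact input from the question (where Date is a string rather than a DateType, so it is parsed first; the 'yyyy-MMM-dd' pattern is an assumption based on the sample value, and date_add with a Column argument needs Spark 3.0+):
from pyspark.sql.functions import split, col, posexplode, to_date, date_add, date_format

# Recreate the question's input as plain strings.
df = spark.createDataFrame([('PETER', '1986-May-29')], ['Name', 'Date'])

df \
    .withColumn('Date', to_date('Date', 'yyyy-MMM-dd')) \
    .withColumn('SplittedName', split('Name', '(?!$)')) \
    .select('Date', posexplode('SplittedName').alias('pos', 'Character')) \
    .withColumn('New_Date', date_format(date_add(col('Date'), col('pos')), 'yyyy-MMM-dd')) \
    .select('Character', 'New_Date') \
    .show()
This should print the five Character/New_Date rows from the expected output in the question.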

Related

Pyspark: Identify the arrayType column from the Struct and call udf to convert array to string

I am creating an accelerator that migrates data from a source to a destination. For example, I will pick data from an API and migrate it to CSV. I have faced issues handling ArrayType columns while the data is converted to CSV. I used withColumn with the concat_ws method (i.e., df1 = df.withColumn('films', F.concat_ws(':', F.col("films"))), where films is the ArrayType column) for this conversion and it worked. Now I want this to happen dynamically. I mean, without specifying the column name, is there a way to pick the column names from the struct that have ArrayType and then call the udf?
Thank you for your time!
You can get the types of the columns using df.schema. Depending on the type of each column, you can apply concat_ws or not:
data = [["test1", "test2", [1,2,3], ["a","b","c"]]]
schema= ["col1", "col2", "arr1", "arr2"]
df = spark.createDataFrame(data, schema)
array_cols = [F.concat_ws(":", c.name).alias(c.name) \
for c in df.schema if isinstance(c.dataType, T.ArrayType) ]
other_cols = [F.col(c.name) \
for c in df.schema if not isinstance(c.dataType, T.ArrayType) ]
df = df.select(other_cols + array_cols)
Result:
+-----+-----+-----+-----+
| col1| col2| arr1| arr2|
+-----+-----+-----+-----+
|test1|test2|1:2:3|a:b:c|
+-----+-----+-----+-----+
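Since the original goal was landing the data as CSV, a possible follow-up once the array columns are flattened (just a sketch; the output path and options are placeholders):
# Write the flattened frame out as CSV; path and options are illustrative only.
df.write.mode('overwrite').option('header', 'true').csv('/tmp/output_csv')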

how to pivot/transpose rows of a column into individual columns in spark-scala without using the pivot method

Please check the image below for a reference to my use case.
You can get the same result without using pivot by adding the columns manually, if you know all the names of the new columns:
import org.apache.spark.sql.functions.{col, when}

dataframe
  .withColumn("cheque", when(col("ttype") === "cheque", col("tamt")))
  .withColumn("draft", when(col("ttype") === "draft", col("tamt")))
  .drop("tamt", "ttype")
As this solution does not trigger a shuffle, your processing will be faster than using pivot.
It can be generalized if you don't know the names of the new columns. However, in that case you should benchmark to check whether pivot is more performant:
import org.apache.spark.sql.functions.{col, when}

val newColumnNames = dataframe.select("ttype").distinct.collect().map(_.getString(0))

newColumnNames
  .foldLeft(dataframe)((df, columnName) => {
    df.withColumn(columnName, when(col("ttype") === columnName, col("tamt")))
  })
  .drop("tamt", "ttype")
Use the groupBy, pivot & agg functions. Check the code below; I've added inline comments.
scala> df.show(false)
+----------+------+----+
|tdate     |ttype |tamt|
+----------+------+----+
|2020-10-15|draft |5000|
|2020-10-18|cheque|7000|
+----------+------+----+
scala> df
.groupBy($"tdate") // Grouping data based on tdate column.
.pivot("ttype",Seq("cheque","draft")) // pivot based on ttype and "draft","cheque" are new column name
.agg(first("tamt")) // aggregation by "tamt" column.
.show(false)
+----------+------+-----+
|tdate     |cheque|draft|
+----------+------+-----+
|2020-10-18|7000  |null |
|2020-10-15|null  |5000 |
+----------+------+-----+

Casting a column to string for a PySpark dataframe throws an error

I have a PySpark dataframe with two columns with the following datatypes:
[('area', 'int'), ('customer_play_id', 'int')]
+----+----------------+
|area|customer_play_id|
+----+----------------+
| 100|         8606738|
| 110|         8601843|
| 130|         8602984|
+----+----------------+
I want to cast the column area to str using PySpark commands, but I am getting errors. I tried the following:
str(df['area']) : didn't change the datatype to str
df.area.astype(str) : gave "TypeError: unexpected type: "
df['area'].cast(str) : same error as above
Any help will be appreciated. I want the datatype of area to be string using a PySpark dataframe operation.
You can simply do any of these:
Option1:
df1 = df.select('*',df.area.cast("string"))
select - All the columns you want in df1 should be mentioned in select
Option2:
df1 = df.selectExpr("*","cast(area as string) AS new_area")
selectExpr - All the columns you want in df1 should be mentioned in selectExpr
Option3:
df1 = df.withColumn("new_area", df.area.cast("string"))
withColumn will add new column (additional to existing columns of df)
"*" in select and selectExpr represent all the columns.
Use the withColumn function to change the data type or values of a field in Spark, e.g. as shown below:
import pyspark.sql.functions as F
df = df.withColumn("area",F.col("area").cast("string"))
You can also define a UDF, although a plain cast is usually enough:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# identity UDF that just forces the declared return type
tofloatfunc = udf(lambda x: x, FloatType())

# or simply cast the column directly
changedTypedf = df.withColumn("Column_name", df["Column_name"].cast(FloatType()))

get the distinct elements of an ArrayType column in a spark dataframe

I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are in the form of Array of String:
Id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]
I want to get the list of distinct elements inside each feature column, so the output will be:
distinct_feat1,distinct_feat2
-----------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]
what is the best way to do this in Scala?
You can use collect_set to find the distinct values of the corresponding column after applying the explode function on each column to unnest the array elements in each cell. Suppose your data frame is called df:
import org.apache.spark.sql.functions._

val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
  withColumn("feat2", explode(col("feat2"))).
  agg(collect_set("feat1").alias("distinct_feat1"),
      collect_set("feat2").alias("distinct_feat2"))
distinct_df.show
+--------------------+--------------------+
| distinct_feat1| distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+
distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
WrappedArray(, feat2_1, feat2_2, feat2_3)])
One more solution for Spark 2.4+:
.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))
Beware: if one of the columns is null, the result will be null.
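One way to guard against that (a PySpark sketch, assuming the arrays hold strings and using the same hypothetical column names):
# Coalesce null arrays to an empty array before concatenating (Spark 2.4+).
from pyspark.sql.functions import array, array_distinct, coalesce, col, concat

empty = array().cast("array<string>")   # typed empty array, assuming string elements

df = df.withColumn(
    "distinct",
    array_distinct(concat(coalesce(col("array_col1"), empty),
                          coalesce(col("array_col2"), empty)))
)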
The method provided by Psidom works great; here is a function that does the same, given a DataFrame and a list of fields:
def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce
    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])
And then:
data = array_unique_values(df, my_fields)
data.take(1)
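With the column names from the question, that call would be, for example:
# Using the feature columns from the question.
data = array_unique_values(df, ['feat1', 'feat2'])
data.show(truncate=False)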

sqlContext.createDataframe from Row with Schema. pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>

After spending way too much time figuring out why I get the following error
pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>
while trying to create a dataframe based on Rows and a Schema, I noticed the following:
With a Row inside my RDD called rddRows looking as follows:
Row(a="1", b="2", c=3)
and my dfSchema defined as:
dfSchema = StructType([
    StructField("c", IntegerType(), True),
    StructField("a", StringType(), True),
    StructField("b", StringType(), True)
])
creating a dataframe as follows:
df = sqlContext.createDataFrame(rddRows, dfSchema)
raises the above-mentioned error, because Spark only considers the order of the StructFields in the schema and does not match the names of the StructFields with the names of the Row fields.
In other words, in the above example, I noticed that Spark tries to create a dataframe that would look as follows (if there were no TypeError, e.g. if everything were of type String):
+---+---+---+
| c | b | a |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
Is this really expected, or some sort of bug?
EDIT: the rddRows are created along these lines:
from pyspark.sql import Row

def createRows(dic):
    res = Row(a=dic["a"], b=dic["b"], c=int(dic["c"]))
    return res
rddRows = rddDict.map(createRows)
where rddDict is a parsed JSON file.
The constructor of Row sorts the keys if you provide keyword arguments; take a look at the Row source code. (Note that this applies to Spark 2.x; as of Spark 3.0, Row fields built from keyword arguments are no longer sorted by default.) When I found out about that, I ended up sorting my schema accordingly before applying it to the dataframe:
sorted_fields = sorted(dfSchema.fields, key=lambda x: x.name)
sorted_schema = StructType(fields=sorted_fields)
df = sqlContext.createDataFrame(rddRows, sorted_schema)
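An alternative worth mentioning (not what the answer above does, just a common workaround): skip keyword arguments entirely and emit plain tuples in schema order, so only the schema's order matters:
# Hypothetical alternative: return tuples ordered like dfSchema (c, a, b)
# instead of keyword-argument Rows, then apply the original schema directly.
def createRows(dic):
    return (int(dic["c"]), dic["a"], dic["b"])

rddRows = rddDict.map(createRows)
df = sqlContext.createDataFrame(rddRows, dfSchema)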