pyspark dataframe array of struct to columns - pyspark

I have a dataframe with:
"abc": array [
    "def": struct {
        "id": string,
        "value": string
    }
]
id can be "PHONE", "FAX" or "MAIL".
Here is a sample:
technical_id | column_to_explode
-------------|-------------------------------------------------
1            | [["PHONE", "083665xxxx"], ["FAX", "0325xxxxxx"]]
2            | [["MAIL", "abc#xxx.com"]]
3            | null
Is it possible to transform it to:
technical_id | column_to_explode                                | PHONE      | FAX        | MAIL
-------------|--------------------------------------------------|------------|------------|------------
1            | [["PHONE", "083665xxxx"], ["FAX", "0325xxxxxx"]] | 083665xxxx | 0325xxxxxx | null
2            | [["MAIL", "abc#xxx.com"]]                        | null       | null       | abc#xxx.com
3            | null                                             | null       | null       | null
I'm trying with explode, but it duplicates rows and I would rather avoid that.
Thanks.

You can do a pivot after the explode to ensure unique ID columns. Here's an example.
import pyspark.sql.functions as func

spark.sparkContext.parallelize([([('phone', 'abc'), ('email', 'xyz')], 1), ([('fax', 'klm')], 2)]). \
    toDF(['arr_of_structs', 'id']). \
    selectExpr('*', 'inline(arr_of_structs)'). \
    groupBy('id'). \
    pivot('_1'). \
    agg(func.first('_2')). \
    show(truncate=False)
# +---+-----+----+-----+
# |id |email|fax |phone|
# +---+-----+----+-----+
# |1 |xyz |null|abc |
# |2 |null |klm |null |
# +---+-----+----+-----+
The input dataframe looks like the following:
spark.sparkContext.parallelize([([('phone', 'abc'), ('email', 'xyz')], 1), ([('fax', 'klm')], 2)]). \
    toDF(['arr_of_structs', 'id']). \
    show(truncate=False)
# +----------------------------+---+
# |arr_of_structs |id |
# +----------------------------+---+
# |[{phone, abc}, {email, xyz}]|1 |
# |[{fax, klm}] |2 |
# +----------------------------+---+
# root
# |-- arr_of_structs: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- _1: string (nullable = true)
# | | |-- _2: string (nullable = true)
# |-- id: long (nullable = true)
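A sketch applying the same inline + pivot idea directly to the question's schema. Assumptions not taken from the answer above: an active SparkSession named spark, and that the struct fields inside column_to_explode are named id and value.
import pyspark.sql.functions as func
import pyspark.sql.types as T

# hypothetical reconstruction of the question's sample data
schema = T.StructType([
    T.StructField('technical_id', T.IntegerType()),
    T.StructField('column_to_explode', T.ArrayType(T.StructType([
        T.StructField('id', T.StringType()),
        T.StructField('value', T.StringType())
    ])))
])

input_sdf = spark.createDataFrame(
    [(1, [('PHONE', '083665xxxx'), ('FAX', '0325xxxxxx')]),
     (2, [('MAIL', 'abc#xxx.com')]),
     (3, None)],
    schema
)

# inline_outer keeps rows whose array is null (technical_id = 3);
# pivoting with an explicit value list avoids an extra pass to collect the ids
result = input_sdf. \
    selectExpr('technical_id', 'column_to_explode', 'inline_outer(column_to_explode)'). \
    groupBy('technical_id', 'column_to_explode'). \
    pivot('id', ['PHONE', 'FAX', 'MAIL']). \
    agg(func.first('value'))

result.show(truncate=False)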

Related

Pyspark - Find new, left and existing sales

I have a dataframe of sales per id and date (sample data is shown in the answer below). I need to aggregate the sales for each year band as below. For example, for 2018-2019:
New_sales = sum of all sales of 2019 (the later year in 2018-2019) where the id doesn't exist in 2018 but exists in 2019
Existing_sales = sum of sales of 2018 where the id exists in both 2018 and 2019, minus the sum of those same ids' sales in 2019
Existing_sales = 50+75 (sales of 2018) - (20+50) (sales of 2019) = 125-70 = 55
Left_sales = sum of all sales of 2018 (the earlier year in 2018-2019) where the id exists in 2018 but not in 2019
How do I achieve that?
I've added some more dummy data to include the year 2020 for this example.
# +---+---+----------+
# | id|amt| dt|
# +---+---+----------+
# | 1| 50|2018-12-31|
# | 2|100|2018-12-31|
# | 3| 75|2018-12-31|
# | 1| 20|2019-12-31|
# | 3| 50|2019-12-31|
# | 5| 25|2019-12-31|
# | 1| 70|2020-12-31|
# | 2|150|2020-12-31|
# | 5|125|2020-12-31|
# +---+---+----------+
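For completeness, a minimal sketch (assuming an active SparkSession named spark) that builds this data_sdf from the dummy data shown above:
import pyspark.sql.functions as func

data_sdf = spark.createDataFrame(
    [(1, 50, '2018-12-31'), (2, 100, '2018-12-31'), (3, 75, '2018-12-31'),
     (1, 20, '2019-12-31'), (3, 50, '2019-12-31'), (5, 25, '2019-12-31'),
     (1, 70, '2020-12-31'), (2, 150, '2020-12-31'), (5, 125, '2020-12-31')],
    ['id', 'amt', 'dt']
).withColumn('dt', func.to_date('dt'))  # cast the string dates to DateType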
You can extract the year from the dates and pivot it, with the sales values under each year column. This keeps the calculations manageable by reducing the number of rows: there are only a few distinct years, so the work that follows is cheap.
yr_sales_sdf = data_sdf. \
    fillna(0, subset=['amt']). \
    withColumn('yr', func.year('dt')). \
    groupBy('id'). \
    pivot('yr'). \
    agg(func.first('amt'))
# +---+----+----+----+
# | id|2018|2019|2020|
# +---+----+----+----+
# | 5|null| 25| 125|
# | 1| 50| 20| 70|
# | 3| 75| 50|null|
# | 2| 100|null| 150|
# +---+----+----+----+
We'll need the years as a list, which can be extracted from the pivoted dataframe's columns (the yrs list below). We also need a case-when expression that decides whether an ID is existing, left or new. The sale_cat_cond function below does that (it is pure PySpark, so it is automatically optimized): it takes two consecutive years as columns, checks the condition, and returns the category and its sales as a struct. Structs are very helpful in cases like this, where the same condition has to produce more than one value.
yrs = [k for k in yr_sales_sdf.columns if k[0:2] == '20']

sale_cat_cond = lambda frstCol, scndCol: (
    func.when(func.col(frstCol).isNull() & func.col(scndCol).isNotNull(),
              func.struct(func.lit(frstCol+'-'+scndCol).alias('year_band'),
                          func.lit('new_sales').alias('salecat'),
                          func.col(scndCol).alias('sale')
                          )
              ).
    when(func.col(frstCol).isNotNull() & func.col(scndCol).isNull(),
         func.struct(func.lit(frstCol+'-'+scndCol).alias('year_band'),
                     func.lit('left_sales').alias('salecat'),
                     func.col(frstCol).alias('sale')
                     )
         ).
    otherwise(func.struct(func.lit(frstCol+'-'+scndCol).alias('year_band'),
                          func.lit('existing_sales').alias('salecat'),
                          (func.col(frstCol)-func.col(scndCol)).alias('sale')
                          )
              )
)
Run the sale_cat_cond on all year columns, using a list comprehension, to calculate sales categories and their sales. This creates additional columns for all year bands.
yr_salecat_sdf = yr_sales_sdf. \
    select('*',
           *[sale_cat_cond(yrs[i], yrs[i+1]).alias(yrs[i]+'_'+yrs[i+1]+'_salecat') for i in range(len(yrs) - 1)]
           )
# +---+----+----+----+-------------------------------+---------------------------------+
# |id |2018|2019|2020|2018_2019_salecat |2019_2020_salecat |
# +---+----+----+----+-------------------------------+---------------------------------+
# |5 |null|25 |125 |{2018-2019, new_sales, 25} |{2019-2020, existing_sales, -100}|
# |1 |50 |20 |70 |{2018-2019, existing_sales, 30}|{2019-2020, existing_sales, -50} |
# |3 |75 |50 |null|{2018-2019, existing_sales, 25}|{2019-2020, left_sales, 50} |
# |2 |100 |null|150 |{2018-2019, left_sales, 100} |{2019-2020, new_sales, 150} |
# +---+----+----+----+-------------------------------+---------------------------------+
The only thing left is to pivot-sum the sales categories and their sales. To do this, first collate all the year-band category structs into an array; this makes it easy to explode and pivot per ID (using SQL's inline function).
yr_salecat_sdf. \
    withColumn('salecat_struct_arr',
               func.array(*[k for k in yr_salecat_sdf.columns if '_salecat' in k])
               ). \
    selectExpr('id', 'inline(salecat_struct_arr)'). \
    groupBy('year_band'). \
    pivot('salecat'). \
    agg(func.sum('sale')). \
    show()
# +---------+--------------+----------+---------+
# |year_band|existing_sales|left_sales|new_sales|
# +---------+--------------+----------+---------+
# |2019-2020| -150| 50| 150|
# |2018-2019| 55| 100| 25|
# +---------+--------------+----------+---------+
Additional details - schemas for all dataframes
data_sdf
# root
# |-- id: long (nullable = true)
# |-- amt: long (nullable = true)
# |-- dt: date (nullable = true)
yr_sales_sdf
# root
# |-- id: long (nullable = true)
# |-- 2018: long (nullable = true)
# |-- 2019: long (nullable = true)
# |-- 2020: long (nullable = true)
yr_salecat_sdf
# root
# |-- id: long (nullable = true)
# |-- 2018: long (nullable = true)
# |-- 2019: long (nullable = true)
# |-- 2020: long (nullable = true)
# |-- 2018_2019_salecat: struct (nullable = false)
# | |-- year_band: string (nullable = false)
# | |-- salecat: string (nullable = false)
# | |-- sale: long (nullable = true)
# |-- 2019_2020_salecat: struct (nullable = false)
# | |-- year_band: string (nullable = false)
# | |-- salecat: string (nullable = false)
# | |-- sale: long (nullable = true)
final result
# root
# |-- year_band: string (nullable = false)
# |-- existing_sales: long (nullable = true)
# |-- left_sales: long (nullable = true)
# |-- new_sales: long (nullable = true)
@Sparc, here is the solution. Do let me know if you have questions around this.
--Approach--
You can create two dataframes based on year(date) and then:
do an inner join ---> to find the existing sales
df_2018 left_anti with df_2019 ---> gives left_sales
df_2019 left_anti with df_2018 ---> gives new_sales
Combine these three by union and you get the result.
Solution:-
import pyspark.sql.functions as F

schema = ["id", "date_val", "sales"]
data = [("1", "2018-12-31", "50"),
        ("2", "2018-12-31", "100"),
        ("3", "2018-12-31", "75"),
        ("1", "2019-12-31", "20"),
        ("3", "2019-12-31", "50"),
        ("5", "2019-12-31", "25")]
date_range = ["2018", "2019"]

df = spark.createDataFrame(data, schema)
df = df.withColumn("date_val", F.col("date_val").cast("date"))\
    .withColumn("year", F.year(F.col("date_val")).cast("string"))\
    .withColumn("year_bands", F.lit(date_range[0]+"-"+date_range[1]))

filter_cond_2018 = (F.col("year") == "2018")
df_2018 = df.filter(filter_cond_2018)
df_2019 = df.filter(~filter_cond_2018)

df_left_sales = df_2018.join(df_2019, ["id"], "left_anti")\
    .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Left_Sales"))
df_new_sales = df_2019.join(df_2018, ["id"], "left_anti")\
    .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("New_Sales"))
df_ext_sales_2018 = df_2018.join(df_2019, ["id"], "inner").select(df_2018["*"])\
    .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Existing_Sale_{}".format(date_range[0])))
df_ext_sales_2019 = df_2019.join(df_2018, ["id"], "inner").select(df_2019["*"])\
    .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Existing_Sale_{}".format(date_range[1])))

df_agg = df_left_sales.join(df_new_sales, ["year_bands"])\
    .join(df_ext_sales_2018, ["year_bands"])\
    .join(df_ext_sales_2019, ["year_bands"])
df_agg_fnl = df_agg\
    .withColumn("Existing_Sales", F.col("Existing_Sale_{}".format(date_range[0])) - F.col("Existing_Sale_{}".format(date_range[1])))\
    .select(["year_bands", "Left_Sales", "New_Sales", "Existing_Sales"])
df_agg_fnl.show(10, 0)
Generic Solution :
from functools import reduce
from pyspark.sql import functions as F, DataFrame

schema = ["id", "date_val", "sales"]
data = [("1", "2018-12-31", "50"),
        ("2", "2018-12-31", "100"),
        ("3", "2018-12-31", "75"),
        ("1", "2019-12-31", "20"),
        ("3", "2019-12-31", "50"),
        ("5", "2019-12-31", "25"),
        ("6", "2020-12-31", "25"),
        ("5", "2020-12-31", "10"),
        ("7", "2020-12-31", "25")]

df = spark.createDataFrame(data, schema)
df = df.withColumn("year", F.split(F.col('date_val'), '-').getItem(0))
# sort the distinct years so consecutive pairs line up correctly
year_bands = sorted(df.select("year").distinct().toPandas()["year"].tolist())

def calculate_agg_data(df, start_year, end_year):
    df_start_year = df.filter(F.col("year").isin([start_year]))
    df_end_year = df.filter(F.col("year").isin([end_year]))
    df_left_sales = df_start_year.join(df_end_year, ["id"], "left_anti")\
        .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Left_Sales"))
    df_new_sales = df_end_year.join(df_start_year, ["id"], "left_anti")\
        .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("New_Sales"))
    df_start_year_ext_sales = df_start_year.join(df_end_year, ["id"], "inner").select(df_start_year["*"])\
        .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Existing_Sale_{}".format(start_year)))
    df_end_year_ext_sales = df_end_year.join(df_start_year, ["id"], "inner").select(df_end_year["*"])\
        .groupBy(["year", "year_bands"]).agg(F.sum(F.col("sales")).alias("Existing_Sale_{}".format(end_year)))
    # final agg
    df_agg = df_left_sales.join(df_new_sales, ["year_bands"])\
        .join(df_start_year_ext_sales, ["year_bands"])\
        .join(df_end_year_ext_sales, ["year_bands"])
    df_agg_fnl = df_agg\
        .withColumn("Existing_Sales", F.col("Existing_Sale_{}".format(start_year)) - F.col("Existing_Sale_{}".format(end_year)))\
        .select(["year_bands", "Left_Sales", "New_Sales", "Existing_Sales"])
    return df_agg_fnl

df_lst = []
for index in range(len(year_bands) - 1):
    start_year = year_bands[index]
    end_year = year_bands[index + 1]
    df = df.withColumn("year_bands", F.lit(start_year + "-" + end_year))
    df_flt = df.filter(F.col("year").isin([start_year, end_year]))
    df_agg = calculate_agg_data(df_flt, start_year, end_year)
    df_lst.append(df_agg)

df_fnl = reduce(DataFrame.unionByName, df_lst)
df_fnl.show(10, 0)
Alternative to reduce:
df_fnl = df_lst[0]
for index in range(1, len(df_lst)):
    df_fnl = df_fnl.unionByName(df_lst[index])
df_fnl.show(40, 0)
Based on your previous post, we have a dataframe df which is:
+---+----------+----+------------+---------+---------+---------+---------+
|id |date |year|year_lst |2017-2018|2018-2019|2019-2020|2020-2021|
+---+----------+----+------------+---------+---------+---------+---------+
|1 |31/12/2017|2017|[2017, 2018]|Existing |Left |null |null |
|2 |31/12/2017|2017|[2017] |Left |null |null |null |
|3 |31/12/2017|2017|[2017, 2018]|Existing |Left |null |null |
|1 |31/12/2018|2018|[2017, 2018]|Existing |Left |null |null |
|3 |31/12/2018|2018|[2017, 2018]|Existing |Left |null |null |
|5 |31/12/2018|2018|[2018] |New |Left |null |null |
+---+----------+----+------------+---------+---------+---------+---------+
As you mentioned, you now have a sales column over which you need to do the aggregation. Assuming you call df.columns, your output should look like this:
['id', 'date', 'year', 'sales', 'year_lst', '2017-2018', '2018-2019', '2019-2020', '2020-2021']
To achieve your goal, what you can do is:
from pyspark.sql import functions as func

# loop from the 2017-2018 column onwards (the year-band columns start at index 5)
for idx, year_year in enumerate(df.columns[5:]):
    new_df = df.filter(func.col('year') == year_year.split('-')[0])\
        .select('id',
                'sales',
                func.lit(year_year).alias('year_year'),
                func.col(year_year).alias('status'))
    if idx == 0:
        output = new_df
    else:
        output = output.unionAll(new_df)
What remains is just the aggregation:
output.groupby('year_year').pivot('status').agg(func.sum('sales')).fillna(0).orderBy(func.col('year_year')).show()

Creating hierarchical JSON in Spark

I have a Spark dataframe which I need to write to MongoDB. I want to know how I can write some of the columns of the dataframe as nested/hierarchical JSON in MongoDB.
Let's say the dataframe has 6 columns: col1, col2, ..., col5, col6.
I want col1, col2, col3 at the 1st level and the remaining columns col4 to col6 at the 2nd level, something like this:
{
    "col1": 123,
    "col2": "abc",
    "col3": 45,
    "fields": {
        "col4": "ert",
        "col5": 45,
        "col6": 56
    }
}
How do I achieve this in PySpark?
Use the to_json + struct built-in functions for this case.
Example:
df.show()
#+----+----+----+----+----+----+
#|col1|col2|col3|col4|col5|col6|
#+----+----+----+----+----+----+
#| 123| abc| 45| ert| 45| 56|
#+----+----+----+----+----+----+
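For reference, a minimal sketch of creating this sample dataframe, assuming an active SparkSession named spark and string-typed columns (which is what the JSON output below suggests):
df = spark.createDataFrame(
    [("123", "abc", "45", "ert", "45", "56")],
    ["col1", "col2", "col3", "col4", "col5", "col6"]
)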
from pyspark.sql.functions import *
df.withColumn("jsn",to_json(struct("col1","col2","col3",struct("col4","col5","col6").alias("fields")))).show(10,False)
#+----+----+----+----+----+----+---------------------------------------------------------------------------------------+
#|col1|col2|col3|col4|col5|col6|jsn |
#+----+----+----+----+----+----+---------------------------------------------------------------------------------------+
#|123 |abc |45 |ert |45 |56 |{"col1":"123","col2":"abc","col3":"45","fields":{"col4":"ert","col5":"45","col6":"56"}}|
#+----+----+----+----+----+----+---------------------------------------------------------------------------------------+
cols=df.columns
df.withColumn("jsn",to_json(struct("col1","col2","col3",struct("col4","col5","col6").alias("fields")))).drop(*cols).show(10,False)
#+---------------------------------------------------------------------------------------+
#|jsn |
#+---------------------------------------------------------------------------------------+
#|{"col1":"123","col2":"abc","col3":"45","fields":{"col4":"ert","col5":"45","col6":"56"}}|
#+---------------------------------------------------------------------------------------+
#using toJSON
df.withColumn("jsn",struct("col1","col2","col3",struct("col4","col5","col6").alias("fields"))).drop(*cols).toJSON().collect()
#[u'{"jsn":{"col1":"123","col2":"abc","col3":"45","fields":{"col4":"ert","col5":"45","col6":"56"}}}']
#to write as json file
df.withColumn("jsn",struct("col1","col2","col3",struct("col4","col5","col6").alias("fields"))).\
drop(*cols).\
write.\
format("json").\
save("<path>")
Update:
The jsn column represented as a JSON struct:
df.withColumn("jsn",struct("col1","col2","col3",struct("col4","col5","col6").alias("fields"))).drop(*cols).printSchema()
#root
# |-- jsn: struct (nullable = false)
# | |-- col1: string (nullable = true)
# | |-- col2: string (nullable = true)
# | |-- col3: string (nullable = true)
# | |-- fields: struct (nullable = false)
# | | |-- col4: string (nullable = true)
# | | |-- col5: string (nullable = true)
# | | |-- col6: string (nullable = true)
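To address the MongoDB part of the question: keeping fields as a real struct (rather than a JSON string) is what yields a nested document on write. A hedged sketch follows, assuming the MongoDB Spark Connector 3.x is on the classpath; the format name "mongo" and the "uri" option key are assumptions tied to that connector version, not something shown in the answer above.
from pyspark.sql.functions import struct

nested_df = df.withColumn("fields", struct("col4", "col5", "col6")) \
    .drop("col4", "col5", "col6")

# format and option names below depend on the connector version (assumption)
nested_df.write \
    .format("mongo") \
    .mode("append") \
    .option("uri", "mongodb://<host>/<database>.<collection>") \
    .save()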

How to return ListBuffer as a column from UDF using Spark Scala?

I am trying to use a UDF and return a ListBuffer as a column from the UDF, but I am getting an error.
I have created the DF by executing the code below:
val df = Seq((1,"dept3##rama##kumar","dept3##rama##kumar"), (2,"dept31##rama1##kumar1","dept33##rama3##kumar3")).toDF("id","str1","str2")
df.show()
It shows as below:
+---+--------------------+--------------------+
| id| str1| str2|
+---+--------------------+--------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar|
| 2|dept31##rama1##ku...|dept33##rama3##ku...|
+---+--------------------+--------------------+
As per my requirement, I have to split the above columns based on some inputs, so I have tried a UDF like below:
def appendDelimiterError=udf((id: Int, str1: String, str2: String)=> {
var lit = new ListBuffer[Any]()
if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("#&"){val a=str1.split("#&")}
if(str2.contains("##"){ val b=str2.split("##")}
else if(str2.contains("##"){ val b=str2.split("##") }
else if(str1.contains("##"){val b=str2.split("##")}
var tmp_row = List(a,"test1",b)
lit +=tmp_row
return lit
})
I try to call it by executing the code below:
val df1=df.appendDelimiterError("newcol",appendDelimiterError(df("id"),df("str1"),df("str2"))
I am getting the error "this was a bad call". I want to use a ListBuffer/List to store the result and return it to the calling place.
My expected output would be:
+---+--------------------+------------------------+----------------------------------------------------------------------+
| id| str1| str2 | newcol |
+---+--------------------+------------------------+----------------------------------------------------------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar |ListBuffer(List("dept","rama","kumar"),List("dept3","rama","kumar")) |
| 2|dept31##rama1##kumar1|dept33##rama3##kumar3 | ListBuffer(List("dept31","rama1","kumar1"),List("dept33","rama3","kumar3")) |
+---+--------------------+------------------------+----------------------------------------------------------------------+
How to achieve this?
An alternative, with my own fictional data that you can tailor to your case, and no UDF:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = Seq(
(1, "111##cat##666", "222##fritz##777"),
(2, "AAA##cat##555", "BBB##felix##888"),
(3, "HHH##mouse##yyy", "123##mickey##ZZZ")
).toDF("c0", "c1", "c2")
val df2 = df.withColumn( "c_split", split(col("c1"), ("(##)|(##)|(##)|(##)") ))
.union(df.withColumn("c_split", split(col("c2"), ("(##)|(##)|(##)|(##)") )) )
df2.show(false)
df2.printSchema()
val df3 = df2.groupBy(col("c0")).agg(collect_list(col("c_split")).as("List_of_Data") )
df3.show(false)
df3.printSchema()
This gives the answer, but with no ListBuffer (is one really necessary?), as follows:
+---+---------------+----------------+------------------+
|c0 |c1 |c2 |c_split |
+---+---------------+----------------+------------------+
|1 |111##cat##666 |222##fritz##777 |[111, cat, 666] |
|2 |AAA##cat##555 |BBB##felix##888 |[AAA, cat, 555] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[HHH, mouse, yyy] |
|1 |111##cat##666 |222##fritz##777 |[222, fritz, 777] |
|2 |AAA##cat##555 |BBB##felix##888 |[BBB, felix, 888] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[123, mickey, ZZZ]|
+---+---------------+----------------+------------------+
root
|-- c0: integer (nullable = false)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c_split: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---------------------------------------+
|c0 |List_of_Data |
+---+---------------------------------------+
|1 |[[111, cat, 666], [222, fritz, 777]] |
|3 |[[HHH, mouse, yyy], [123, mickey, ZZZ]]|
|2 |[[AAA, cat, 555], [BBB, felix, 888]] |
+---+---------------------------------------+
root
|-- c0: integer (nullable = false)
|-- List_of_Data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)

Removing duplicate array structs by last item in array struct in Spark Dataframe

So my table looks something like this:
customer_1|place|customer_2|item |count
-------------------------------------------------
a | NY | b |(2010,304,310)| 34
a | NY | b |(2024,201,310)| 21
a | NY | b |(2010,304,312)| 76
c | NY | x |(2010,304,310)| 11
a | NY | b |(453,131,235) | 10
I've tried the following, but it does not eliminate the duplicates, as the former array is still there (as it should be, I need it for the end results).
val df= df_one.withColumn("vs", struct(col("item").getItem(size(col("item"))-1), col("item"), col("count")))
.groupBy(col("customer_1"), col("place"), col("customer_2"))
.agg(max("vs").alias("vs"))
.select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))
I would like to group by the customer_1, place and customer_2 columns and return only the array structs whose last item (-1) is unique, keeping the one with the highest count. Any ideas?
Expected output:
customer_1|place|customer_2|item |count
-------------------------------------------------
a | NY | b |(2010,304,312)| 76
a | NY | b |(2010,304,310)| 34
a | NY | b |(453,131,235) | 10
c | NY | x |(2010,304,310)| 11
Given that the schema of the dataframe is:
root
|-- customer_1: string (nullable = true)
|-- place: string (nullable = true)
|-- customer_2: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- count: string (nullable = true)
You can apply the concat function to create a temp column for checking duplicate rows, as done below:
import org.apache.spark.sql.functions._
df.withColumn("temp", concat($"customer_1",$"place",$"customer_2", $"item"(size($"item")-1)))
.dropDuplicates("temp")
.drop("temp")
You should get the following output:
+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item |count|
+----------+-----+----------+----------------+-----+
|a |NY |b |[2010, 304, 312]|76 |
|c |NY |x |[2010, 304, 310]|11 |
|a |NY |b |[453, 131, 235] |10 |
|a |NY |b |[2010, 304, 310]|34 |
+----------+-----+----------+----------------+-----+
Struct
Given that the schema of the dataframe is:
root
|-- customer_1: string (nullable = true)
|-- place: string (nullable = true)
|-- customer_2: string (nullable = true)
|-- item: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: integer (nullable = false)
| |-- _3: integer (nullable = false)
|-- count: string (nullable = true)
We can still do the same as above, with a slight change to get the third item from the struct:
import org.apache.spark.sql.functions._
df.withColumn("temp", concat($"customer_1",$"place",$"customer_2", $"item._3"))
.dropDuplicates("temp")
.drop("temp")
Hope the answer is helpful

Adding attribute of type Array[long] from existing attribute value in DF

I am using spark 2.0 and have a use case where I need to convert the attribute type of a column from string to Array[long].
Suppose I have a dataframe with schema:
root
|-- unique_id: string (nullable = true)
|-- column2 : string (nullable = true)
DF :
+----------+---------+
|unique_id | column2 |
+----------+---------+
| 1 | 123 |
| 2 | 125 |
+----------+---------+
Now I want to add a new column named "column3" of type Array[long], having the values from "column2", like:
root
|-- unique_id: string (nullable = true)
|-- column2: long (nullable = true)
|-- column3: array (nullable = true)
| |-- element: long (containsNull = true)
new DF :
+----------+---------+---------+
|unique_id | column2 | column3 |
+----------+---------+---------+
| 1 | 123 | [123] |
| 2 | 125 | [125] |
+----------+---------+---------+
Is there a way to achieve this?
You can simply use withColumn and the array function as
df.withColumn("column3", array(df("column2")))
And I also see that you are trying to change column2 from string to long. A simple udf function should do the trick, so the final solution would be:
def changeToLong = udf((str: String) => str.toLong)
val finalDF = df
.withColumn("column2", changeToLong(col("column2")))
.withColumn("column3", array(col("column2")))
You need to import the functions library too:
import org.apache.spark.sql.functions._