GroupBy and concat array columns pyspark

I have this data frame
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]) , (2,[2]),(2,[3])]).toDF(["store", "values"])
+-----+---------+
|store| values|
+-----+---------+
| 1|[1, 2, 3]|
| 1|[4, 5, 6]|
| 2| [2]|
| 2| [3]|
+-----+---------+
and I would like to convert it into the following df:
+-----+------------------+
|store| values |
+-----+------------------+
| 1|[1, 2, 3, 4, 5, 6]|
| 2| [2, 3]|
+-----+------------------+
I did this:
from pyspark.sql import functions as F
df.groupBy("store").agg(F.collect_list("values"))
but the result contains these WrappedArrays:
+-----+----------------------------------------------+
|store|collect_list(values) |
+-----+----------------------------------------------+
|1 |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6)]|
|2 |[WrappedArray(2), WrappedArray(3)] |
+-----+----------------------------------------------+
Is there any way to transform the WrappedArrays into concatenated arrays? Or can I do it differently?

You need a flattening UDF; starting from your own df:
spark.version
# u'2.2.0'
from pyspark.sql import functions as F
import pyspark.sql.types as T
def fudf(val):
    return reduce(lambda x, y: x + y, val)

flattenUdf = F.udf(fudf, T.ArrayType(T.IntegerType()))
df2 = df.groupBy("store").agg(F.collect_list("values"))
df2.show(truncate=False)
# +-----+----------------------------------------------+
# |store| collect_list(values) |
# +-----+----------------------------------------------+
# |1 |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6)]|
# |2 |[WrappedArray(2), WrappedArray(3)] |
# +-----+----------------------------------------------+
df3 = df2.select("store", flattenUdf("collect_list(values)").alias("values"))
df3.show(truncate=False)
# +-----+------------------+
# |store| values |
# +-----+------------------+
# |1 |[1, 2, 3, 4, 5, 6]|
# |2 |[2, 3] |
# +-----+------------------+
UPDATE (after comment):
The above snippet will work only with Python 2. With Python 3, you should modify the UDF as follows:
import functools
def fudf(val):
    return functools.reduce(lambda x, y: x + y, val)
Tested with Spark 2.4.4.

For a simple problem like this, you could also use the explode function. I don't know the performance characteristics versus the selected udf answer though.
from pyspark.sql import functions as F
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]) , (2,[2]),(2,[3])]).toDF(['store', 'values'])
df2 = df.withColumn('values', F.explode('values'))
# +-----+------+
# |store|values|
# +-----+------+
# | 1| 1|
# | 1| 2|
# | 1| 3|
# | 1| 4|
# | 1| 5|
# | 1| 6|
# | 2| 2|
# | 2| 3|
# +-----+------+
df3 = df2.groupBy('store').agg(F.collect_list('values').alias('values'))
# +-----+------------------+
# |store| values |
# +-----+------------------+
# |1 |[4, 5, 6, 1, 2, 3]|
# |2 |[2, 3] |
# +-----+------------------+
Note: you could use F.collect_set() in the aggregation or .drop_duplicates() on df2 to remove duplicate values.
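For example, a minimal sketch of the collect_set variant (assumption: you only need distinct values and do not care about their order, since collect_set gives no ordering guarantee):
df3_distinct = df2.groupBy('store').agg(F.collect_set('values').alias('values'))
df3_distinct.show(truncate=False)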
If you want to maintain ordered values in the collected list, I found the following method in another SO answer:
from pyspark.sql.window import Window
w = Window.partitionBy('store').orderBy('values')
df3 = df2.withColumn('ordered_value_lists', F.collect_list('values').over(w))
# +-----+------+-------------------+
# |store|values|ordered_value_lists|
# +-----+------+-------------------+
# |1 |1 |[1] |
# |1 |2 |[1, 2] |
# |1 |3 |[1, 2, 3] |
# |1 |4 |[1, 2, 3, 4] |
# |1 |5 |[1, 2, 3, 4, 5] |
# |1 |6 |[1, 2, 3, 4, 5, 6] |
# |2 |2 |[2] |
# |2 |3 |[2, 3] |
# +-----+------+-------------------+
df4 = df3.groupBy('store').agg(F.max('ordered_value_lists').alias('values'))
df4.show(truncate=False)
# +-----+------------------+
# |store|values |
# +-----+------------------+
# |1 |[1, 2, 3, 4, 5, 6]|
# |2 |[2, 3] |
# +-----+------------------+
If the values themselves don't determine the order, you can use F.posexplode() and use the 'pos' column in your window function instead of 'values' to determine order. Note: you will also need a higher-level order column to order the original arrays, and then use the position within each array to order its elements.
df = sc.parallelize([(1, [1, 2, 3], 1), (1, [4, 5, 6], 2) , (2, [2], 1),(2, [3], 2)]).toDF(['store', 'values', 'array_order'])
# +-----+---------+-----------+
# |store|values |array_order|
# +-----+---------+-----------+
# |1 |[1, 2, 3]|1 |
# |1 |[4, 5, 6]|2 |
# |2 |[2] |1 |
# |2 |[3] |2 |
# +-----+---------+-----------+
df2 = df.select('*', F.posexplode('values'))
# +-----+---------+-----------+---+---+
# |store|values |array_order|pos|col|
# +-----+---------+-----------+---+---+
# |1 |[1, 2, 3]|1 |0 |1 |
# |1 |[1, 2, 3]|1 |1 |2 |
# |1 |[1, 2, 3]|1 |2 |3 |
# |1 |[4, 5, 6]|2 |0 |4 |
# |1 |[4, 5, 6]|2 |1 |5 |
# |1 |[4, 5, 6]|2 |2 |6 |
# |2 |[2] |1 |0 |2 |
# |2 |[3] |2 |0 |3 |
# +-----+---------+-----------+---+---+
w = Window.partitionBy('store').orderBy('array_order', 'pos')
df3 = df2.withColumn('ordered_value_lists', F.collect_list('col').over(w))
# +-----+---------+-----------+---+---+-------------------+
# |store|values |array_order|pos|col|ordered_value_lists|
# +-----+---------+-----------+---+---+-------------------+
# |1 |[1, 2, 3]|1 |0 |1 |[1] |
# |1 |[1, 2, 3]|1 |1 |2 |[1, 2] |
# |1 |[1, 2, 3]|1 |2 |3 |[1, 2, 3] |
# |1 |[4, 5, 6]|2 |0 |4 |[1, 2, 3, 4] |
# |1 |[4, 5, 6]|2 |1 |5 |[1, 2, 3, 4, 5] |
# |1 |[4, 5, 6]|2 |2 |6 |[1, 2, 3, 4, 5, 6] |
# |2 |[2] |1 |0 |2 |[2] |
# |2 |[3] |2 |0 |3 |[2, 3] |
# +-----+---------+-----------+---+---+-------------------+
df4 = df3.groupBy('store').agg(F.max('ordered_value_lists').alias('values'))
# +-----+------------------+
# |store|values |
# +-----+------------------+
# |1 |[1, 2, 3, 4, 5, 6]|
# |2 |[2, 3] |
# +-----+------------------+
Edit: If you'd like to keep some columns along for the ride and they don't need to be aggregated, you can include them in the groupBy or rejoin them after aggregation (examples below). If they do require aggregation, group by 'store' only and add whatever aggregation function you need on the 'other' column(s) to the .agg() call (a sketch of that case follows the examples).
from pyspark.sql import functions as F
df = sc.parallelize([(1, [1, 2, 3], 'a'), (1, [4, 5, 6], 'a') , (2, [2], 'b'), (2, [3], 'b')]).toDF(['store', 'values', 'other'])
# +-----+---------+-----+
# |store| values|other|
# +-----+---------+-----+
# | 1|[1, 2, 3]| a|
# | 1|[4, 5, 6]| a|
# | 2| [2]| b|
# | 2| [3]| b|
# +-----+---------+-----+
df2 = df.withColumn('values', F.explode('values'))
# +-----+------+-----+
# |store|values|other|
# +-----+------+-----+
# | 1| 1| a|
# | 1| 2| a|
# | 1| 3| a|
# | 1| 4| a|
# | 1| 5| a|
# | 1| 6| a|
# | 2| 2| b|
# | 2| 3| b|
# +-----+------+-----+
df3 = df2.groupBy('store', 'other').agg(F.collect_list('values').alias('values'))
# +-----+-----+------------------+
# |store|other| values|
# +-----+-----+------------------+
# | 1| a|[1, 2, 3, 4, 5, 6]|
# | 2| b| [2, 3]|
# +-----+-----+------------------+
df4 = (
    df.drop('values')
      .join(
          df2.groupBy('store')
             .agg(F.collect_list('values').alias('values')),
          on=['store'], how='inner'
      )
      .drop_duplicates()
)
# +-----+-----+------------------+
# |store|other| values|
# +-----+-----+------------------+
# | 1| a|[1, 2, 3, 4, 5, 6]|
# | 2| b| [2, 3]|
# +-----+-----+------------------+
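Finally, a minimal sketch of the case mentioned above where the 'other' column itself needs aggregating (assumption: keeping one value per store with F.first is acceptable; substitute whatever aggregation you actually need):
df5 = df2.groupBy('store').agg(
    F.collect_list('values').alias('values'),
    F.first('other').alias('other')
)
df5.show(truncate=False)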

Since Spark 2.4, you can use the flatten function and things become a lot easier:
you just have to flatten the collected array after the groupBy.
from pyspark.sql import functions as F

# 1. Create the DF
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])
+-----+---------+
|store| values|
+-----+---------+
| 1|[1, 2, 3]|
| 1|[4, 5, 6]|
| 2| [2]|
| 2| [3]|
+-----+---------+
# 2. Group by store
df = df.groupBy("store").agg(F.collect_list("values"))
+-----+--------------------+
|store|collect_list(values)|
+-----+--------------------+
| 1|[[1, 2, 3], [4, 5...|
| 2| [[2], [3]]|
+-----+--------------------+
# 3. Finally, flatten the array
df = df.withColumn("flatten_array", F.flatten("collect_list(values)"))
+-----+--------------------+------------------+
|store|collect_list(values)| flatten_array|
+-----+--------------------+------------------+
| 1|[[1, 2, 3], [4, 5...|[1, 2, 3, 4, 5, 6]|
| 2| [[2], [3]]| [2, 3]|
+-----+--------------------+------------------+

I would probably do it this way.
>>> df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]) , (2,[2]),(2,[3])]).toDF(["store", "values"])
>>> df.show()
+-----+---------+
|store| values|
+-----+---------+
| 1|[1, 2, 3]|
| 1|[4, 5, 6]|
| 2| [2]|
| 2| [3]|
+-----+---------+
>>> df.rdd.map(lambda r: (r.store, r.values)).reduceByKey(lambda x,y: x + y).toDF(['store','values']).show()
+-----+------------------+
|store| values|
+-----+------------------+
| 1|[1, 2, 3, 4, 5, 6]|
| 2| [2, 3]|
+-----+------------------+

Since PySpark 2.4, you can use the following code (note that array_distinct and array_sort also deduplicate and sort the result, not just concatenate it):
from pyspark.sql.functions import collect_list, array_sort, array_distinct, expr

df = df.groupBy("store").agg(collect_list("values").alias("values"))
df = df.select("store", array_sort(array_distinct(expr("reduce(values, array(), (x, y) -> concat(x, y))"))).alias("values"))

There is a predefined pyspark function, flatten (Spark 2.4+), for exactly this:
import pyspark.sql.functions as f

df = df.groupBy("store").agg(f.flatten(f.collect_list("values")).alias("values"))
See the flatten entry in the pyspark.sql.functions documentation for details.

Related

pyspark: Merge multiple columns into one column but save the original column name

I have a dataframe with many columns such as v1, v2, v3, and so on; only v1 and v2 are shown here. In each row, exactly one of the columns with the prefix v has a real number value, and all the other v columns are null.
I want to merge the columns starting with v into one column, and create a corresponding column cols that shows which original column the value came from. Examples of the original table and the resulting table are shown below.
Note: Original table has about 200 columns from v1 to v200 and over million rows.
original table
+---------+-----+----+-----+
| org| v1| v2|count|
+---------+-----+----+-----+
| Sh| 46|null| 2|
| Sh| 41|null| 1|
| Sh| null| 4| 3|
| Fi| 30|null| 6|
| Fi| null| 4| 2|
| Xf| null| 2| 1|
| Ai| 27|null| 1|
+---------+-----+----+-----+
result table
+---------+-----+-----+-----+
| org| val| cols|count|
+---------+-----+-----+-----+
| Sh| 46| v1| 2|
| Sh| 41| v1| 1|
| Sh| 4| v2| 3|
| Fi| 30| v1| 6|
| Fi| 4| v2| 2|
| Xf| 2| v2| 1|
| Ai| 27| v1| 1|
+---------+-----+-----+-----+
Sample dataframe:
sample_data = (\
("Sh", 46, None, 2), \
("Sh", 46, None, 1), \
("Sh", None, 4, 3), \
("Fi", 30, None, 6), \
("Fi", None, 4, 2), \
("Xf", None, 2, 1), \
("Ai", 27, None, 1), \
)
columns= [ "org", "v1", "v2", "count"]
df = spark.createDataFrame(data = sample_data, schema = columns)
try this:
import pyspark.sql.functions as f
sample_data = (\
("Sh", 46, None, 2), \
("Sh", 46, None, 1), \
("Sh", None, 4, 3), \
("Fi", 30, None, 6), \
("Fi", None, 4, 2), \
("Xf", None, 2, 1), \
("Ai", 27, None, 1), \
)
columns= [ "org", "v1", "v2", "count"]
columns_of_interest_count = 2
df = (
    spark.createDataFrame(data=sample_data, schema=columns)
    .withColumn(
        'mapped_columns',
        f.map_filter(
            f.map_from_arrays(
                f.array([f.lit(f'v{i}') for i in range(1, columns_of_interest_count + 1)]),
                f.array([f.col(f'v{i}') for i in range(1, columns_of_interest_count + 1)])
            ),
            lambda _, value: ~f.isnull(value)
        )
    )
    .select('org', 'count', f.explode(f.col('mapped_columns')).alias('col', 'val'))
)
df.show(truncate= False)
output:
+---+-----+---+---+
|org|count|col|val|
+---+-----+---+---+
|Sh |2 |v1 |46 |
|Sh |1 |v1 |46 |
|Sh |3 |v2 |4 |
|Fi |6 |v1 |30 |
|Fi |2 |v2 |4 |
|Xf |1 |v2 |2 |
|Ai |1 |v1 |27 |
+---+-----+---+---+
You can create an array of structs and filter that array to keep the structs that are non-null.
from pyspark.sql import functions as func

data_sdf. \
    withColumn('v_structs',
               func.array(*[func.struct(func.lit(k).alias('cols'), func.col(k).alias('vals'))
                            for k in data_sdf.columns if k[0].lower() == 'v'])
               ). \
    withColumn('v_having_value_structs',
               func.expr('filter(v_structs, x -> x.vals is not null)')
               ). \
    select(*data_sdf.columns, func.expr('inline(v_having_value_structs)')). \
    show(truncate=False)
# +---+----+----+-----+----+----+
# |org|v1 |v2 |count|cols|vals|
# +---+----+----+-----+----+----+
# |Sh |46 |null|2 |v1 |46 |
# |Sh |46 |null|1 |v1 |46 |
# |Sh |null|4 |3 |v2 |4 |
# |Fi |30 |null|6 |v1 |30 |
# |Fi |null|4 |2 |v2 |4 |
# |Xf |null|2 |1 |v2 |2 |
# |Ai |27 |null|1 |v1 |27 |
# +---+----+----+-----+----+----+
The array of structs would look like this
data_sdf. \
    withColumn('v_structs',
               func.array(*[func.struct(func.lit(k).alias('cols'), func.col(k).alias('vals'))
                            for k in data_sdf.columns if k[0].lower() == 'v'])
               ). \
    withColumn('v_having_value_structs', func.expr('filter(v_structs, x -> x.vals is not null)')). \
    show(truncate=False)
# +---+----+----+-----+----------------------+----------------------+
# |org|v1 |v2 |count|v_structs |v_having_value_structs|
# +---+----+----+-----+----------------------+----------------------+
# |Sh |46 |null|2 |[{v1, 46}, {v2, null}]|[{v1, 46}] |
# |Sh |46 |null|1 |[{v1, 46}, {v2, null}]|[{v1, 46}] |
# |Sh |null|4 |3 |[{v1, null}, {v2, 4}] |[{v2, 4}] |
# |Fi |30 |null|6 |[{v1, 30}, {v2, null}]|[{v1, 30}] |
# |Fi |null|4 |2 |[{v1, null}, {v2, 4}] |[{v2, 4}] |
# |Xf |null|2 |1 |[{v1, null}, {v2, 2}] |[{v2, 2}] |
# |Ai |27 |null|1 |[{v1, 27}, {v2, null}]|[{v1, 27}] |
# +---+----+----+-----+----------------------+----------------------+
Yet another approach.
from functools import reduce
from pyspark.sql import functions as F
vcols = [x for x in df.columns if x.startswith('v')]
df = (df.select('*', F.coalesce(*vcols).alias('val'))
        .select('org', 'val', 'count',
                reduce(lambda p, c: p.when(F.col(c) == F.col('val'), F.lit(c)), vcols, F).alias('cols')))
First, use coalesce to take the first non-null value among the columns of interest.
Then use a chain of when expressions to find which column holds the same value as the one obtained from coalesce.
This part
reduce(lambda p, c: p.when(F.col(c) == F.col('val'), F.lit(c)), vcols, F)
will generate
(F.when(F.col('v1') == F.col('val'), F.lit('v1'))
.when(F.col('v2') == F.col('val'), F.lit('v2'))
...
)
So this gives me the name of the column the valid value comes from.
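For reference, here is a minimal sketch of the same unpivot written with the SQL stack function, starting from the asker's original df (org, v1, v2, count), which scales naturally to many v-columns (assumptions: the v-columns are known up front and the null entries are simply filtered out afterwards; Spark 3.4+ also offers DataFrame.unpivot for this):
from pyspark.sql import functions as F

vcols = [c for c in df.columns if c.startswith('v')]
# stack(n, 'v1', v1, 'v2', v2, ...) emits one (cols, val) row per v-column
stack_expr = 'stack({}, {}) as (cols, val)'.format(
    len(vcols), ', '.join("'{0}', {0}".format(c) for c in vcols))
unpivoted = (df
             .select('org', 'count', F.expr(stack_expr))
             .where(F.col('val').isNotNull()))
unpivoted.show(truncate=False)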

How to find out duplicate values in a row in Pyspark Data frame

My data frame looks like -
val1 val2 val3
1 0 1
4 0 4
3 3 4
4 2 2
My final data frame should be -
val1 val2 val3 dup
1 0 1 1
4 0 4 4
3 3 4 3
4 2 2 2
creating dataframe
a = spark.createDataFrame([
("1", "0", "0","A"),
("1", "0", "2","B"),
("1", "1", "2","C"),
("1", "1", "3","H"),
("1", "2", "2","D"),
("1", "2", "2","E")
], ["val1", "val2", "val3","val4"])
Create an array column, explode it, and get counts:
from pyspark.sql.functions import array, col, explode

df_a = a.withColumn('arr_val', array(col('val1'), col('val2'), col('val3')))
df_b = df_a.withColumn('repeats', explode(col('arr_val'))).\
    groupby(['val1', 'val2', 'val3', 'repeats']).count().\
    filter(col('count') > 1)
df_a
+----+----+----+----+---------+
|val1|val2|val3|val4|arr_val |
+----+----+----+----+---------+
|1 |0 |0 |A |[1, 0, 0]|
|1 |0 |2 |B |[1, 0, 2]|
|1 |1 |2 |C |[1, 1, 2]|
|1 |1 |3 |H |[1, 1, 3]|
|1 |2 |2 |D |[1, 2, 2]|
|1 |2 |2 |E |[1, 2, 2]|
+----+----+----+----+---------+
df_b
+----+----+----+-------+-----+
|val1|val2|val3|repeats|count|
+----+----+----+-------+-----+
| 1| 0| 0| 0| 2|
| 1| 2| 2| 2| 2|
| 1| 1| 3| 1| 2|
| 1| 1| 2| 1| 2|
+----+----+----+-------+-----+
I do feel this is unoptimized; it would be nice if someone could optimize it using something like
expr('filter(arr_val, x -> Count(x) > 1)')
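A hedged sketch of that idea using Spark 2.4+ higher-order functions on the same df_a (assumption: we just want the first value that appears more than once in the row, and null if there is none):
from pyspark.sql import functions as F

df_dup = df_a.withColumn(
    'dup',
    # keep the values that occur more than once in arr_val, then take the first one;
    # the [0] subscript returns null when the filtered array is empty
    F.expr('filter(arr_val, x -> size(filter(arr_val, y -> y = x)) > 1)[0]')
)
df_dup.show(truncate=False)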

groupBy and get count of records for multiple columns in scala

As part of a bigger task, I am facing some issues finding the count of records in each column, grouped by another column. I am not very experienced with manipulating dataframe columns.
I have a Spark dataframe as below.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 |3 |0 |
|050|2021-01-15 |1 |3 |0 |
|050|2021-02-02 |1 |3 |0 |
|051|2021-01-14 |1 |3 |0 |
|051|2021-01-15 |1 |3 |0 |
|051|2021-02-02 |1 |3 |0 |
|051|2021-02-03 |1 |3 |0 |
|052|2021-03-03 |1 |3 |0 |
|052|2021-03-05 |1 |3 |0 |
|052|2021-03-06 |1 |3 |0 |
|052|2021-03-16 |1 |3 |0 |
I am working in Scala with this data frame and trying to get the result shown below.
+---+--------+--------+--------+
|id |signal01|signal02|signal03|
+---+--------+--------+--------+
|050|3 |3 |3 |
|051|4 |4 |4 |
|052|4 |4 |4 |
For each id, the count of records for each signal should be the output.
Also, is there any way to pass a condition to the count, such as counting only signals with value > 0?
I have tried the following, which gets the total count but not grouped by id, which was not what I expected.
val signalColumns = ((Temp01DF.columns.toBuffer) -= ("id","date"))
val Temp02DF = Temp01DF.select(signalColumns.map(c => count(col(c)).alias(c)): _*).show()
+--------+--------+--------+
|signal01|signal02|signal03|
+--------+--------+--------+
|51 |51 |51 |
Is there any way to achieve this in Scala?
You are probably looking for groupBy, agg and count.
You can do something like this:
// define some data
val df = Seq(
("050", 1, 3, 0),
("050", 1, 3, 0),
("050", 1, 3, 0),
("051", 1, 3, 0),
("051", 1, 3, 0),
("051", 1, 3, 0),
("051", 1, 3, 0),
("052", 1, 3, 0),
("052", 1, 3, 0),
("052", 1, 3, 0),
("052", 1, 3, 0)
).toDF("id", "signal01", "signal02", "signal03")
val countColumns = Seq("signal01", "signal02", "signal03").map(c => count("*").as(c))
df.groupBy("id").agg(countColumns.head, countColumns.tail: _*).show
/*
+---+--------+--------+--------+
| id|signal01|signal02|signal03|
+---+--------+--------+--------+
|052| 4| 4| 4|
|051| 4| 4| 4|
|050| 3| 3| 3|
+---+--------+--------+--------+
*/
Instead of counting "*", you can have a predicate:
val countColumns = Seq("signal01", "signal02", "signal03").map(c => count(when(col(c) > 0, 1)).as(c))
df.groupBy("id").agg(countColumns.head, countColumns.tail: _*).show
/*
+---+--------+--------+--------+
| id|signal01|signal02|signal03|
+---+--------+--------+--------+
|052| 4| 4| 0|
|051| 4| 4| 0|
|050| 3| 3| 0|
+---+--------+--------+--------+
*/
A PySpark solution:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(50, 1, 3, 0), (50, 1, 3, 0), (50, 1, 3, 0),
     (51, 1, 3, 0), (51, 1, 3, 0), (51, 1, 3, 0), (51, 1, 3, 0),
     (52, 1, 3, 0), (52, 1, 3, 0), (52, 1, 3, 0), (52, 1, 3, 0)],
    ["col1", "col2", "col3", "col4"])
df.show()
df_grp = df.groupBy("col1").agg(F.count("col2").alias("col2"), F.count("col3").alias("col3"), F.count("col4").alias("col4"))
df_grp.show()
Output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 50| 3| 3| 3|
| 51| 4| 4| 4|
| 52| 4| 4| 4|
+----+----+----+----+
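For the conditional count asked about (value > 0), a hedged PySpark sketch mirroring the Scala predicate above (note col4 is always 0 in this sample data, so its count would come out 0):
from pyspark.sql import functions as F

signal_cols = ["col2", "col3", "col4"]
# count() ignores nulls, and when() without otherwise() yields null for non-matching rows
df_grp_pos = df.groupBy("col1").agg(
    *[F.count(F.when(F.col(c) > 0, 1)).alias(c) for c in signal_cols]
)
df_grp_pos.show()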
For the first part, I found that the required result can be achieved this way:
val signalCount = df.groupBy("id")
.agg(count("signal01"), count("signal02"), count("signal03"))
Make sure you have the spark functions imported:
import org.apache.spark.sql.functions._

Extracting vectors from features - pyspark

Let's say I have the dataframe below:
+---+------+-------+
|id |string|string2|
+---+------+-------+
|1 |foo |hello |
|2 |bar |hellow |
|3 |bar |hellow |
|4 |baz |hello |
+---+------+-------+
Column string contains 3 values [foo,bar,baz] and string2 contains 2 [hello,hellow].
How can I extract vectors for each column in the following way:
If column string contains foo I want to map it to the vector [1,0,0], bar to [0,1,0], and so on. Same for the string2 column (hello -> [1,0], hellow -> [0,1]).
Final dataframe should look something like this:
+---+----------+-----------+
|id |string_vec|string2_vec|
+---+----------+-----------+
|1 |[1,0,0] |[1,0] |
|2 |[0,1,0] |[0,1] |
|3 |[0,1,0] |[0,1] |
|4 |[0,0,1] |[0,1] |
+---+----------+-----------+
Finally I want to combine the _vec columns to:
+---+-----------+
|id |features |
+---+-----------+
|1 |[1,0,0,1,0]|
|2 |[0,1,0,0,1]|
|3 |[0,1,0,0,1]|
|4 |[0,0,1,0,1]|
+---+-----------+
I can do this with a for loop, but it is not efficient. My main problem is the mapping process; I guess for the rest I can use the VectorAssembler.
You can create a simple udf.
Your dataframe:
values = [("foo", "hello"), ("bar", "hellow"),("bar","hellow"), ("baz","hello")]
from pyspark.sql.functions import udf
from pyspark.sql.types import *
df = spark.createDataFrame(values, ["string", "string2"])
df.show()
+------+-------+
|string|string2|
+------+-------+
| foo| hello|
| bar| hellow|
| bar| hellow|
| baz| hello|
+------+-------+
udf:
def encode(string1, string2):
    values = ["foo", "bar", "baz", "hello", "hellow"]
    string_values = [string1, string2]
    return [1 if x in string_values else 0 for x in values]
encode_udf = udf(encode, ArrayType(IntegerType()))
result:
df.withColumn("features", encode_udf("string","string2")).show()
+------+-------+---------------+
|string|string2| features|
+------+-------+---------------+
| foo| hello|[1, 0, 0, 1, 0]|
| bar| hellow|[0, 1, 0, 0, 1]|
| bar| hellow|[0, 1, 0, 0, 1]|
| baz| hello|[0, 0, 1, 1, 0]|
+------+-------+---------------+
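If an actual ML Vector is wanted downstream (the question mentions VectorAssembler), a minimal sketch of the same udf returning a DenseVector instead of a plain array (assumption: the same hard-coded vocabulary as above):
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

def encode_vec(string1, string2):
    # same one-hot logic as above, but wrapped in an ML DenseVector
    values = ["foo", "bar", "baz", "hello", "hellow"]
    string_values = [string1, string2]
    return Vectors.dense([1 if x in string_values else 0 for x in values])

encode_vec_udf = udf(encode_vec, VectorUDT())
df.withColumn("features", encode_vec_udf("string", "string2")).show(truncate=False)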

Spark: explode multiple columns into one

Is it possible to explode multiple columns into one new column in spark? I have a dataframe which looks like this:
userId varA varB
1 [0,2,5] [1,2,9]
desired output:
userId bothVars
1 0
1 2
1 5
1 1
1 2
1 9
What I have tried so far:
val explodedDf = df.withColumn("bothVars", explode($"varA")).drop("varA")
.withColumn("bothVars", explode($"varB")).drop("varB")
which doesn't work. Any suggestions are much appreciated.
You could wrap the two arrays into one and flatten the nested array before exploding it, as shown below:
val df = Seq(
(1, Seq(0, 2, 5), Seq(1, 2, 9)),
(2, Seq(1, 3, 4), Seq(2, 3, 8))
).toDF("userId", "varA", "varB")
df.
select($"userId", explode(flatten(array($"varA", $"varB"))).as("bothVars")).
show
// +------+--------+
// |userId|bothVars|
// +------+--------+
// | 1| 0|
// | 1| 2|
// | 1| 5|
// | 1| 1|
// | 1| 2|
// | 1| 9|
// | 2| 1|
// | 2| 3|
// | 2| 4|
// | 2| 2|
// | 2| 3|
// | 2| 8|
// +------+--------+
Note that flatten is available on Spark 2.4+.
Use array_union and then the explode function. Note that array_union also removes duplicate elements, which is why userId 1 ends up with five values instead of six below.
scala> df.show(false)
+------+---------+---------+
|userId|varA |varB |
+------+---------+---------+
|1 |[0, 2, 5]|[1, 2, 9]|
|2 |[1, 3, 4]|[2, 3, 8]|
+------+---------+---------+
scala> df
.select($"userId",explode(array_union($"varA",$"varB")).as("bothVars"))
.show(false)
+------+--------+
|userId|bothVars|
+------+--------+
|1 |0 |
|1 |2 |
|1 |5 |
|1 |1 |
|1 |9 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |2 |
|2 |8 |
+------+--------+
array_union is available in Spark 2.4+