scala dataframe join columns and split arrays explode spark - scala

I have some coordinates in multiple array columns of a dataframe and want to split them so that the x, y, and z values end up in separate columns, in order: column 1's data first, then column 2's.
for example...
COL 1 | COL2
[[x,y,z],[x,y,z],[x,y,z]...] | [[x,y,z],[x,y,z],[x,y,z]...]
e.g
[[1,1,1],[2,2,2],[3,3,3]...] | [[8,8,8],[9,9,9],[10,10,10]...]
required OUTPUT
COL X | COL Y | COL Z
x,x,x,x,x.... | y,y,y,y,y.... | z,z,z,z,z....
e.g.
1,2,3,..,8,9,10.. | 1,2,3,..,8,9,10.. | 1,2,3,..,8,9,10..
any help appreciated

You can use the array_union function as follows:
df.select(
  array_union($"col1._1", $"col2._1").as("x"),
  array_union($"col1._2", $"col2._2").as("y"),
  array_union($"col1._3", $"col2._3").as("z"))
INPUT
+--------------------------------------------+--------------------------------------------------+
|col1 |col2 |
+--------------------------------------------+--------------------------------------------------+
|[[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]|[[8, 8, 8], [9, 9, 9], [10, 10, 10], [11, 11, 11]]|
+--------------------------------------------+--------------------------------------------------+
OUTPUT
+--------------------------+--------------------------+--------------------------+
|x |y |z |
+--------------------------+--------------------------+--------------------------+
|[1, 2, 3, 4, 8, 9, 10, 11]|[1, 2, 3, 4, 8, 9, 10, 11]|[1, 2, 3, 4, 8, 9, 10, 11]|
+--------------------------+--------------------------+--------------------------+
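Two caveats worth knowing about this answer: array_union removes duplicate values (concat preserves them), and the ._1 accessor only works when the inner elements are structs/tuples; for a true array-of-arrays column you would index with transform instead. Outside Spark, the intended split-then-concatenate logic can be sketched in plain Python like this (not Spark code, just the per-row logic):

```python
# Plain-Python sketch of the split-and-concatenate logic (not Spark code).
# Each column holds a list of [x, y, z] triples; we want one list per axis,
# with column 1's values first, then column 2's.

def split_axes(col1, col2):
    """Return (xs, ys, zs) with col1's coordinates first, then col2's."""
    triples = col1 + col2  # plain concatenation keeps duplicates,
                           # unlike Spark's array_union
    xs = [t[0] for t in triples]
    ys = [t[1] for t in triples]
    zs = [t[2] for t in triples]
    return xs, ys, zs

xs, ys, zs = split_axes([[1, 1, 1], [2, 2, 2], [3, 3, 3]],
                        [[8, 8, 8], [9, 9, 9], [10, 10, 10]])
# xs, ys, zs are each [1, 2, 3, 8, 9, 10]
```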

Related

Looking to get counts of items within ArrayType column without using Explode

NOTE: I'm working with Spark 2.4
Here is my dataset:
df
col
[1,3,1,4]
[1,1,1,2]
I'd like to essentially get a value_counts of the values in the array. The resulting df would look like:
df_upd
col
[{1:2},{3:1},{4:1}]
[{1:3},{2:1}]
I know I can do this by exploding df and then taking a group by but I'm wondering if I can do this without exploding.
Here's a solution using a udf that outputs the result as a MapType. It expects integer values in your arrays and returns integer counts (both easily changed).
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = sc.parallelize([([1, 2, 3, 3, 1],),([4, 5, 6, 4, 5],),([2, 2, 2],),([3, 3],)]).toDF(['arrays'])
df.show()
+---------------+
| arrays|
+---------------+
|[1, 2, 3, 3, 1]|
|[4, 5, 6, 4, 5]|
| [2, 2, 2]|
| [3, 3]|
+---------------+
from collections import Counter

@F.udf(returnType=T.MapType(T.IntegerType(), T.IntegerType(), valueContainsNull=False))
def count_elements(array):
    return dict(Counter(array))

df.withColumn('counts', count_elements(F.col('arrays'))).show(truncate=False)
+---------------+------------------------+
|arrays |counts |
+---------------+------------------------+
|[1, 2, 3, 3, 1]|[1 -> 2, 2 -> 1, 3 -> 2]|
|[4, 5, 6, 4, 5]|[4 -> 2, 5 -> 2, 6 -> 1]|
|[2, 2, 2] |[2 -> 3] |
|[3, 3] |[3 -> 2] |
+---------------+------------------------+
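Outside Spark, the per-row logic the UDF applies is just a Counter over the list; a quick plain-Python check against the first sample row:

```python
# What count_elements computes for a single row, outside Spark:
from collections import Counter

def count_elements(array):
    return dict(Counter(array))

counts = count_elements([1, 2, 3, 3, 1])
# counts == {1: 2, 2: 1, 3: 2}, matching the first row of the output above
```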

How to iterate over a column with list of lists as values and create a new column

I have a dataframe with values like
+---------+---------+---------+---------+----------+----------+----------+
| column1 | column2 | column3 | column4 | column 6 | column 7 | column 8 |
+---------+---------+---------+---------+----------+----------+----------+
| a | b | 1 | c | 2 | 3 | 4 |
| a | b | 4 | z | 5 | 6 | 7 |
| x | y | 1 | c | 2 | 3 | 4 |
| x | y | 4 | z | 5 | 6 | 7 |
+---------+---------+---------+---------+----------+----------+----------+
The rows are then grouped on the basis of column1 and column2 and then aggregated in a new column agg_data
+---------+---------+-----------------------+---------+
| column1 | column2 | agg_data | column4 |
+---------+---------+-----------------------+---------+
| a | b | [[1,2,3,4],[4,5,6,7]] | c |
| x | y | [[1,2,3,4],[4,5,6,7]] | z |
+---------+---------+-----------------------+---------+
The data inside agg_data are actually "Row" objects that were grouped on the basis of column1 and column2 and then aggregated into a single column.
I need to iterate over the values in agg_data, get the "column 7" data from each of the stored rows, concatenate them, and add the result as a new column to the data frame.
Something like this.
+---------+---------+-----------------------+---------+------------+
| column1 | column2 | agg_data | column4 | agg_values |
+---------+---------+-----------------------+---------+------------+
| a | b | [[1,2,3,4],[4,5,6,7]] | c | 3,6 |
| x | y | [[1,2,3,4],[4,5,6,7]] | z | 3,6 |
+---------+---------+-----------------------+---------+------------+
I am new to Scala, hence I don't have much idea how to approach this.
I have tried a few suggestions from Stack Overflow,
like this one answer here, but that didn't work as expected:
it included all the values in the table in a single row.
If you are using Spark 3.0 you can use the transform function as below:
df.withColumn("agg_values", transform($"column3", arr => element_at(arr, -2)))
For Spark 2.4+:
df.withColumn("agg_values", expr("transform(column3, x -> element_at(x, -2))"))
If you want to convert the new array column to a string, you can use concat_ws.
Output:
+-------+-------+------------------------------------------+-------+---------+
|column1|column2|column3 |column4|agg_values|
+-------+-------+------------------------------------------+-------+---------+
|a |b |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|c |[3, 6, 9]|
|x |y |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|z |[3, 6, 9]|
+-------+-------+------------------------------------------+-------+---------+
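The transform expression picks the second-to-last element of each inner array via element_at(x, -2). The per-row logic, sketched in plain Python with the sample data:

```python
# Per-row equivalent of transform(column3, x -> element_at(x, -2)):
# take the second-to-last element of each inner array.
def second_to_last_each(column3):
    return [inner[-2] for inner in column3]

agg_values = second_to_last_each([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]])
# agg_values == [3, 6, 9], matching the output above
```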
PySpark solution, but it can be adapted to Scala as well:
import pyspark.sql.functions as F
from pyspark.sql.types import *
tst = sqlContext.createDataFrame([('a','b',[[1,2,3,4],[4,5,6,7],[7,8,9,0]]),('x','y',[[1,2,3,4],[4,5,6,7],[7,8,9,0]])],schema=['column1','column2','column3'])
tst1 = tst.withColumn("flattened",F.flatten('column3'))
#%% generate positions; there may be better Python ways to do this
pos = []
ini = 2
for x in range(3):  # 3 sub-arrays in this example
    pos += [ini]
    ini = ini + 4
#%%
expr = [F.col('flattened')[x] for x in pos]
tst2 = tst1.withColumn("result", F.array(*expr))
Results:
+-------+-------+--------------------+--------------------+---------+
|column1|column2| column3| flattened| result|
+-------+-------+--------------------+--------------------+---------+
| a| b|[[1, 2, 3, 4], [4...|[1, 2, 3, 4, 4, 5...|[3, 6, 9]|
| x| y|[[1, 2, 3, 4], [4...|[1, 2, 3, 4, 4, 5...|[3, 6, 9]|
+-------+-------+--------------------+--------------------+---------+
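The position loop above builds the indices 2, 6, 10: "column 7" is the third element of each 4-element sub-array, so after flattening it sits at offset 2 within every group of 4. The same index arithmetic, sketched in plain Python:

```python
# Mirror of the flatten-then-index approach: the wanted value sits at
# offset 2 inside every 4-element group of the flattened array.
def pick_positions(column3, offset=2, group_size=4):
    flattened = [v for inner in column3 for v in inner]
    pos = range(offset, len(flattened), group_size)  # 2, 6, 10, ...
    return [flattened[p] for p in pos]

result = pick_positions([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]])
# result == [3, 6, 9], matching the "result" column above
```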
spark>=2.4
val df = spark.sql("select array(array(1,2,3,4),array(4,5,6,7),array(7,8,9,0)) as column3")
df.show(false)
df.printSchema()
/**
* +------------------------------------------+
* |column3 |
* +------------------------------------------+
* |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|
* +------------------------------------------+
*
* root
* |-- column3: array (nullable = false)
* | |-- element: array (containsNull = false)
* | | |-- element: integer (containsNull = false)
*/
df.withColumn("agg_values", expr("TRANSFORM(column3, x -> element_at(x, -2) )"))
.show(false)
/**
* +------------------------------------------+----------+
* |column3 |agg_values|
* +------------------------------------------+----------+
* |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|[3, 6, 9] |
* +------------------------------------------+----------+
*/
// use array_join to get string
df.withColumn("agg_values", expr("TRANSFORM(column3, x -> element_at(x, -2) )"))
.withColumn("agg_values", array_join(col("agg_values"), ", "))
.show(false)
/**
* +------------------------------------------+----------+
* |column3 |agg_values|
* +------------------------------------------+----------+
* |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|3, 6, 9 |
* +------------------------------------------+----------+
*/
Check below code.
scala> df.show(false)
+-------+-------+------------------------------------------+-------+
|column1|column2|column3 |column4|
+-------+-------+------------------------------------------+-------+
|a |b |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|c |
|x |y |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|z |
+-------+-------+------------------------------------------+-------+
UDF
scala> val mkString = udf((data:Seq[Seq[Int]]) => data.map(_.init.last).mkString(","))
Result
scala> df.withColumn("agg_values",mkString($"column3")).show(false)
+-------+-------+------------------------------------------+-------+----------+
|column1|column2|column3 |column4|agg_values|
+-------+-------+------------------------------------------+-------+----------+
|a |b |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|c |3,6,9 |
|x |y |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|z |3,6,9 |
+-------+-------+------------------------------------------+-------+----------+
Without UDF - Spark 2.4+
scala>
df
.withColumn("agg_values",expr("concat_ws(',',flatten(transform(column3, x -> slice(x,-2,1))))"))
.show(false)
+-------+-------+------------------------------------------+-------+----------+
|column1|column2|column3 |column4|agg_values|
+-------+-------+------------------------------------------+-------+----------+
|a |b |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|c |3,6,9 |
|x |y |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|z |3,6,9 |
+-------+-------+------------------------------------------+-------+----------+
scala>
df
.withColumn("agg_values",expr("concat_ws(',',transform(column3, x -> element_at(x,-2)))"))
.show(false)
+-------+-------+------------------------------------------+-------+----------+
|column1|column2|column3 |column4|agg_values|
+-------+-------+------------------------------------------+-------+----------+
|a |b |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|c |3,6,9 |
|x |y |[[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]]|z |3,6,9 |
+-------+-------+------------------------------------------+-------+----------+

How to find sum of arrays in a column which is grouped by another column values in a spark dataframe using scala

I have a dataframe like below
c1 Value
A Array[47,97,33,94,6]
A Array[59,98,24,83,3]
A Array[77,63,93,86,62]
B Array[86,71,72,23,27]
B Array[74,69,72,93,7]
B Array[58,99,90,93,41]
C Array[40,13,85,75,90]
C Array[39,13,33,29,14]
C Array[99,88,57,69,49]
I need an output as below.
c1 Value
A Array[183,258,150,263,71]
B Array[218,239,234,209,75]
C Array[178,114,175,173,153]
Which is nothing but grouping by column c1 and finding the element-wise sum of the arrays in the Value column.
Please help, I couldn't find any way of doing this on Google.
It is not very complicated. As you mention, you can simply group by "c1" and aggregate the values of the array index by index.
Let's first generate some data:
val df = spark.range(6)
  .select('id % 3 as "c1",
    array((1 to 5).map(_ => floor(rand * 10)): _*) as "Value")
df.show()
+---+---------------+
| c1| Value|
+---+---------------+
| 0|[7, 4, 7, 4, 0]|
| 1|[3, 3, 2, 8, 5]|
| 2|[2, 1, 0, 4, 4]|
| 0|[0, 4, 2, 1, 8]|
| 1|[1, 5, 7, 4, 3]|
| 2|[2, 5, 0, 2, 2]|
+---+---------------+
Then we need to iterate over the values of the array so as to aggregate them. It is very similar to the way we created them:
val n = 5 // if you know the size of the array
val n = df.select(size('Value)).first.getAs[Int](0) // If you do not
df
.groupBy("c1")
.agg(array((0 until n).map(i => sum(col("Value").getItem(i))) :_* ) as "Value")
.show()
+---+------------------+
| c1| Value|
+---+------------------+
| 0|[11, 18, 15, 8, 9]|
| 1| [2, 10, 5, 7, 4]|
| 2|[7, 14, 15, 10, 4]|
+---+------------------+
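Outside Spark, the same group-by-and-sum-index-by-index logic can be checked in plain Python against the question's own data:

```python
# Plain-Python check of the grouped element-wise array sum.
from collections import defaultdict

def sum_arrays_by_key(rows):
    """rows: iterable of (key, list) pairs; sums the lists element-wise per key."""
    acc = defaultdict(lambda: None)
    for key, values in rows:
        if acc[key] is None:
            acc[key] = list(values)
        else:
            acc[key] = [a + b for a, b in zip(acc[key], values)]
    return dict(acc)

sums = sum_arrays_by_key([
    ("A", [47, 97, 33, 94, 6]),
    ("A", [59, 98, 24, 83, 3]),
    ("A", [77, 63, 93, 86, 62]),
])
# sums == {"A": [183, 258, 150, 263, 71]}, the expected output for group A
```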

Dropping rows from a spark dataframe based on a condition

I want to drop rows from a Spark dataframe of lists based on a condition: the length of the list in each row.
I have tried converting it into a list of lists and then using a for loop (demonstrated below), but I'm hoping to do it in one statement within Spark, creating a new immutable df from the original df based on this condition.
newList = df2.values.tolist()
finalList = []
for subList in newList:
    if len(subList) < 4:
        finalList.append(subList)
So for instance, if the dataframe is a one column dataframe and the column is named sequences, it looks like:
sequences
____________
[1, 2, 4]
[1, 6, 3]
[9, 1, 4, 6]
I want to drop all rows where the length of the list is more than 3, resulting in:
sequences
____________
[1, 2, 4]
[1, 6, 3]
Here is one approach for Spark >= 1.5 using the built-in size function:
from pyspark.sql import Row
from pyspark.sql.functions import size
df = spark.createDataFrame([Row(a=[9, 3, 4], b=[8,9,10]),Row(a=[7, 2, 6, 4], b=[2,1,5]), Row(a=[7, 2, 4], b=[8,2,1,5]), Row(a=[2, 4], b=[8,2,10,12,20])])
df.where(size(df['a']) <= 3).show()
Output:
+---------+------------------+
| a| b|
+---------+------------------+
|[9, 3, 4]| [8, 9, 10]|
|[7, 2, 4]| [8, 2, 1, 5]|
| [2, 4]|[8, 2, 10, 12, 20]|
+---------+------------------+

How to extract subArray from Array[Array[Int]] column DataFrame

I have a Dataframe like this:
+---------------------------------------------------------------------+
|ARRAY |
+---------------------------------------------------------------------+
|[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6), WrappedArray(7, 8, 9)]|
+---------------------------------------------------------------------+
I use this code to create it:
case class MySchema(arr: Array[Array[Int]])

val df = sc.parallelize(Seq(
    Array(Array(1, 2, 3),
          Array(4, 5, 6),
          Array(7, 8, 9))))
  .map(x => MySchema(x))
  .toDF("ARRAY")
I would like to get a result like this:
+-----------+
|ARRAY | |
+-----------+
|[1, 2, 3] |
|[4, 5, 6] |
|[7, 8, 9] |
+-----------+
Do you have any idea?
I already tried calling a UDF to do a flatMap(x => x) on my array column, but I got an incorrect result:
+---------------------------+
|ARRAY |
+---------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9]|
+---------------------------+
Thank you for your help
You can use explode:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("array")))
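explode turns each element of the outer array into its own row, which is exactly the desired output above. A plain-Python sketch of what it does to the sample row:

```python
# Per-row equivalent of select(explode(col("array"))):
# each element of the outer array becomes its own row.
def explode_rows(rows):
    return [inner for row in rows for inner in row]

exploded = explode_rows([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]])
# exploded == [[1, 2, 3], [4, 5, 6], [7, 8, 9]]: one row per inner array
```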