Spark Bitwise XOR between two DataFrame Columns - scala

I have a Spark DataFrame testDF with the following structure
scala> testDF.printSchema()
root
|-- id: long (nullable = true)
|-- array1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array2: array (nullable = true)
| |-- element: integer (containsNull = true)
Both array1 and array2 are guaranteed to be of the same length.
I want to perform a bitwise XOR between the corresponding elements of array1 and array2 and sum up the results.
Something like:
res = array1[0] ^ array2[0] + array1[1] ^ array2[1] + ...
I know this can be done using a UDF, but I am wondering if there is a native Spark way to do it.
The desired DataFrame structure would look something like:
root
|-- id: long (nullable = true)
|-- array1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array2: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- result: long (nullable = true)
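One option that avoids a UDF, assuming Spark 2.4+ (where the higher-order SQL functions zip_with and aggregate are available), is to do the whole thing in a single SQL expression. This is a sketch rather than a verified answer:

import org.apache.spark.sql.functions.expr

// zip_with XORs the arrays element-wise (^ is bitwise XOR in Spark SQL),
// then aggregate sums the XORed values into a single long, starting from 0L
val resDF = testDF.withColumn(
  "result",
  expr("aggregate(zip_with(array1, array2, (x, y) -> x ^ y), 0L, (acc, v) -> acc + v)")
)

The same expression can also be used from selectExpr or a plain SQL query.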

Related

How to cast column with nested array in pyspark

I have this schema in pyspark:
root
|-- SortedLenders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- LenderID: string (nullable = true)
| | |-- MaxProfit: string (nullable = true)
|-- FilteredOutDecisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ApprovedAmount: integer (nullable = true)
| | |-- Reasons: array (nullable = true)
| | | |-- element: integer (containsNull = true)
How do I cast the FilteredOutDecisions.Reasons column to double? Thank you in advance!
Try this:
import pyspark.sql.functions as f

df = df.withColumn(
    'newFilteredOutDecisions',
    f.expr('transform(FilteredOutDecisions, element -> struct(element.ApprovedAmount as ApprovedAmount, transform(element.Reasons, value -> cast(value as double)) as Reasons))')
)

Add a field that already exists in a PySpark df into a struct field

I have the following df:
sku category price state infos_gerais
33344 mmmma 3.00 SP [{5, 5656655, 5845454}]
33344 mmmma 3.00 MG [{5, 6565767, 5854545}]
33344 mmmma 3.00 RS [{5, 8788787, 4564646}]
The schema of the df follows:
|-- sku: string (nullable = true)
|-- category: string (nullable = true)
|-- price: double (nullable = true)
|-- state: string (nullable = true)
|-- infos_gerais: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- service_type_id: integer (nullable = true)
| | |-- cep_ini: integer (nullable = true)
| | |-- cep_fim: integer (nullable = true)
Note that the only field in the df that doesn't repeat is 'state', so I need to insert this field into the struct 'infos_gerais' and then apply a groupBy. I tried the code below, but it returns an error. Can anyone help me?
df_end = df_end.withColumn(
    "infos_gerais",
    sf.collect_list(
        sf.struct(
            sf.col("infos_gerais.*"),
            sf.col('infos_gerais.state').alias('state')
        )
    )
)
I need the following df output:
sku category price infos_gerais
33344 mmmma 3.00 [{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG},{5, 8788787, 4564646, RS}]
Given that you have an array of structs, you can use transform to process the elements of the array and withField on the structs to add/replace a struct field.
Here's a simple example:
from pyspark.sql import functions as func

data_sdf. \
    withColumn('infos_gerais',
               func.transform('infos_gerais', lambda x: x.withField('state', func.col('state')))
               ). \
    groupBy('sku', 'category', 'price'). \
    agg(func.flatten(func.collect_list('infos_gerais')).alias('infos_gerais')). \
    show(truncate=False)
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |sku |category|price|infos_gerais |
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |33344|mmmma |3.0 |[{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG}, {5, 8788787, 4564646, RS}]|
# +-----+--------+-----+---------------------------------------------------------------------------------+
# root
# |-- sku: string (nullable = true)
# |-- category: string (nullable = true)
# |-- price: double (nullable = true)
# |-- infos_gerais: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- service_type_id: integer (nullable = true)
# | | |-- cep_ini: integer (nullable = true)
# | | |-- cep_fim: integer (nullable = true)
# | | |-- state: string (nullable = true)

Spark 2 converting scala array to WrappedArray

Spark 2 automatically converts a Scala array to a WrappedArray when I pass an array to a function. However, in Spark 1.6 the array is converted to a string like '[a,b,c]'. Here is my code:
val df_date_agg = df
  .groupBy($"a", $"b", $"c")
  .agg(sum($"d").alias("data1"), sum($"e").alias("data2"))
  .groupBy($"a")
  .agg(collect_list(array($"b", $"c", $"data1")).alias("final_data1"),
       collect_list(array($"b", $"c", $"data2")).alias("final_data2"))
When I run the above code on Spark 1.6, I get the below schema:
|-- final_data1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: string (containsNull = true)
but in Spark 2:
|-- final_data1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
How can I change the datatype in Spark 2 to match Spark 1.6?
Since you want a string representation of an array, how about casting the array into a string like this?
val df_date_agg = df
  .groupBy($"a", $"b", $"c")
  .agg(sum($"d").alias("data1"), sum($"e").alias("data2"))
  .groupBy($"a")
  .agg(collect_list(array($"b", $"c", $"data1") cast "string").alias("final_data1"),
       collect_list(array($"b", $"c", $"data2") cast "string").alias("final_data2"))
That might simply be what your old version of Spark was doing; I was not able to verify.

Convert array of vectors to DenseVector

I am running Spark 2.1 with Scala. I am trying to convert an array of vectors into a DenseVector.
Here is my dataframe:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
For example, I need to extract the value of the hashValues column into a DenseVector for id 401310732094.
This can be done with a UDF:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Flatten every vector in the array and merge all of the values into one dense vector
val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
This will overwrite the hashValues column with a new one containing a DenseVector.
Tested with a dataframe with following schema:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
The result is:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)
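If the end goal is to pull the merged vector back for a single id (401310732094 in the question), here is a small follow-up sketch, assuming the ml.linalg Vector type and a long id as in the original schema:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.col

// Collect the single matching row and read the dense vector back on the driver
val vec: Vector = df
  .filter(col("id") === 401310732094L)
  .select("hashValues")
  .head()
  .getAs[Vector](0)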

How to join two dataframes where the key used for joining has a different datatype in each dataframe

I have two dataframes df1 and df2 whose schemas are as follows:
DF1 is of the form:
slotSize2: struct (nullable = true)
| |-- 120x600: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 160x600: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
| |-- 250x250: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 300x250: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
Dataframe df2 has the schema:
root
|-- bidId: array (nullable = true)
| |-- element: string (containsNull = true)
|-- slotSize1: array (nullable = true)
| |-- element: string (containsNull = true)
In dataframe df2 we have the slot size (named slotSize1) in the form of a string, while in dataframe df1 we have a nested form of the slot size, i.e. for every slot size there is a corresponding map.
I want to join the two dataframes df1 and df2 to form a new dataframe df3 with schema (bidId, slotSize, viewMap), where bidId comes from df2, slotSize is of the form 120x600 and is present in both schemas, and viewMap is the nested map corresponding to each slotSize in df1.
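One possible direction, sketched under assumptions (Spark 2.4+; the bidId and slotSize1 arrays in df2 are aligned element-wise; serializing each per-slot struct of df1 to a JSON string is acceptable so that all slot entries share one type), is to unpivot df1's struct of structs into (slotSize, viewMap) rows, explode df2's arrays, and join on the slot size string:

import org.apache.spark.sql.functions._

// Field names of the slotSize2 struct become the join key values (e.g. "120x600")
val slotNames = df1.select("slotSize2.*").columns

// One row per slot size; each nested struct is serialized to JSON so the map values share a type
val df1Long = df1.select(
  explode(
    map(slotNames.flatMap(n => Seq(lit(n), to_json(col(s"slotSize2.`$n`")))): _*)
  ).as(Seq("slotSize", "viewMap"))
)

// Explode df2's parallel arrays into one (bidId, slotSize) pair per row
val df2Long = df2.select(
  explode(arrays_zip(col("bidId"), col("slotSize1"))).as("z")
).select(col("z.bidId").as("bidId"), col("z.slotSize1").as("slotSize"))

val df3 = df2Long.join(df1Long, Seq("slotSize"), "left")

If the per-slot structs need to stay typed rather than be serialized to JSON, the unpivot step would have to handle each slot size's schema explicitly.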