I have a dataframe:
data = [{"category": 'A', "bigram": 'delicious spaghetti', "vector": [0.01, -0.02, 0.03], 'all_vector' : 2},
{"category": 'A', "bigram": 'delicious dinner', "vector": [0.04, 0.05, 0.06], 'all_vector' : 2},
{"category": 'B', "bigram": 'new blog', "vector": [-0.14, -0.15, -0.16], 'all_vector' : 2},
{"category": 'B', "bigram": 'bright sun', "vector": [0.071, -0.09, 0.063], 'all_vector' : 2}
]
sdf = spark.createDataFrame(data)
+----------+-------------------+--------+---------------------+
|all_vector|bigram |category|vector |
+----------+-------------------+--------+---------------------+
|2 |delicious spaghetti|A |[0.01, -0.02, 0.03] |
|2 |delicious dinner |A |[0.04, 0.05, 0.06] |
|2 |new blog |B |[-0.14, -0.15, -0.16]|
|2 |bright sun |B |[0.071, -0.09, 0.063]|
+----------+-------------------+--------+---------------------+
I need to element-wise add the lists in the vector column and divide by the all_vector column (i.e. normalize the vector), grouped by the category column. I wrote some example code, but unfortunately it doesn't work:
#udf_annotator(returnType=ArrayType(FloatType()))
def result_vector(vector, all_vector):
    lst = [sum(x) for x in zip(*vector)] / all_vector
    return lst

sdf_new = sdf\
    .withColumn('norm_vector', result_vector(F.col('vector'), F.col('all_vector')))\
    .withColumn('rank', F.row_number().over(Window.partitionBy('category')))\
    .where(F.col('rank') == 1)
I want the result to look like this:
+----------+-------------------+--------+-----------------------+---------------------+
|all_vector|bigram |category|norm_vector |vector |
+----------+-------------------+--------+-----------------------+---------------------+
|2 |delicious spaghetti|A |[0.05, 0.03, 0.09] |[0.01, -0.02, 0.03] |
|2 |delicious dinner |A |[0.05, 0.03, 0.09] |[0.04, 0.05, 0.06] |
|2 |new blog |B |[-0.069, -0.24, -0.097]|[-0.14, -0.15, -0.16]|
|2 |bright sun |B |[-0.069, -0.24, -0.097]|[0.071, -0.09, 0.063]|
+----------+-------------------+--------+-----------------------+---------------------+
The zip_with function will help you zip two arrays and apply a function element-wise. To use it, we can create an array collection of the arrays in the vector column and fold that collection with the aggregate function; the remaining division by all_vector is sketched after the output below. There might also be other, simpler ways to do this.
from pyspark.sql import functions as func
from pyspark.sql import Window as wd

# data_sdf holds the question's sample data (the category column is named 'cat' here)
data_sdf. \
    withColumn('vector_collection', func.collect_list('vector').over(wd.partitionBy('cat'))). \
    withColumn('ele_wise_sum',
               # fold the collected arrays, adding them element-wise with zip_with
               func.expr('''
                   aggregate(vector_collection,
                             cast(array() as array<double>),
                             (x, y) -> zip_with(x, y, (a, b) -> coalesce(a, 0) + coalesce(b, 0))
                   )
               ''')
               ). \
    show(truncate=False)
# +---+---------------------+----------------------------------------------+-------------------------------------+
# |cat|vector |vector_collection |ele_wise_sum |
# +---+---------------------+----------------------------------------------+-------------------------------------+
# |B |[-0.14, -0.15, -0.16]|[[-0.14, -0.15, -0.16], [0.071, -0.09, 0.063]]|[-0.06900000000000002, -0.24, -0.097]|
# |B |[0.071, -0.09, 0.063]|[[-0.14, -0.15, -0.16], [0.071, -0.09, 0.063]]|[-0.06900000000000002, -0.24, -0.097]|
# |A |[0.01, -0.02, 0.03] |[[0.01, -0.02, 0.03], [0.04, 0.05, 0.06]] |[0.05, 0.030000000000000002, 0.09] |
# |A |[0.04, 0.05, 0.06] |[[0.01, -0.02, 0.03], [0.04, 0.05, 0.06]] |[0.05, 0.030000000000000002, 0.09] |
# +---+---------------------+----------------------------------------------+-------------------------------------+
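The question also asks to divide the element-wise sum by all_vector. A minimal sketch of that remaining step, assuming the frame built above (called sdf_summed here purely for illustration) still carries the question's all_vector column alongside ele_wise_sum:
# sdf_summed is hypothetical: the result of the previous snippet with the
# question's all_vector column retained
sdf_norm = sdf_summed. \
    withColumn('norm_vector', func.expr('transform(ele_wise_sum, x -> x / all_vector)'))

sdf_norm.show(truncate=False)
Every row in a category then carries the same norm_vector, since the sum is computed over the per-category window.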
Related
I have a Dataframe which has the following structure and data
Source:
Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
A, B, 02/01/2021
M, N, 05/01/2021
I want to transform it to the following (for each source row, the first two columns are replicated as-is and the date is incremented until a fixed date, 07/01/2021 in this example):
Result:
1, 2, 01/01/2021
1, 2, 02/01/2021
1, 2, 03/01/2021
1, 2, 04/01/2021
1, 2, 05/01/2021
1, 2, 06/01/2021
1, 2, 07/01/2021
A, B, 02/01/2021
A, B, 03/01/2021
A, B, 04/01/2021
A, B, 05/01/2021
A, B, 06/01/2021
A, B, 07/01/2021
M, N, 05/01/2021
M, N, 06/01/2021
M, N, 07/01/2021
Any idea how this can be achieved in Scala Spark?
I found this link, Replicate Spark Row N-times, but there is no hint there on how a particular column can be incremented during replication.
We can use the sequence function to generate the list of dates in the required range, then explode the resulting array to get the dataframe in the required format.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")

// Sample dataframe
val df = List(("1", "2", "01/01/2021"),
              ("A", "B", "02/01/2021"),
              ("M", "N", "05/01/2021"))
  .toDF("Column1(String)", "Column2(String)", "Date")

df
  // generate every date from the row's Date up to the fixed end date, one row per date
  .withColumn("Date", explode_outer(sequence(to_date('Date, "dd/MM/yyyy"),
                                             to_date(lit("07/01/2021"), "dd/MM/yyyy"))))
  // format the generated dates back into dd/MM/yyyy strings
  .withColumn("Date", date_format('Date, "dd/MM/yyyy"))
  .show(false)
/*
+---------------+---------------+----------+
|Column1(String)|Column2(String)|Date |
+---------------+---------------+----------+
|1 |2 |01/01/2021|
|1 |2 |02/01/2021|
|1 |2 |03/01/2021|
|1 |2 |04/01/2021|
|1 |2 |05/01/2021|
|1 |2 |06/01/2021|
|1 |2 |07/01/2021|
|A |B |02/01/2021|
|A |B |03/01/2021|
|A |B |04/01/2021|
|A |B |05/01/2021|
|A |B |06/01/2021|
|A |B |07/01/2021|
|M |N |05/01/2021|
|M |N |06/01/2021|
|M |N |07/01/2021|
+---------------+---------------+----------+ */
I have a dataframe with the below structure:
ID: string
Amt: long
Col: array
    element: struct
        Seq: int
        Pct: double
        Sh: double
Dataframe output
+----+-------+------------------------------------------+
|ID  |Amt    |col                                       |
+----+-------+------------------------------------------+
|ABC |23077  |[[1, 1.5, 1, 10000], [2, 1.2, 2.5, 40000]]|
+----+-------+------------------------------------------+
I need to do the following calculation:
The last element of the first array stays the same, 10000.
For the next array I need to subtract the previous value from it (40000 - 10000) and get 30000.
Expected output
+----+-------+-------------------------------------------+
|ID  |Amt    |col1                                       |
+----+-------+-------------------------------------------+
|ABC |23077  |[[1, 1.5, 1, 10000], [2, 1.2, 2.5, 30000]] |
+----+-------+-------------------------------------------+
How would I achieve this?
You can use transform and subtract the previous entry's Amt from each element's Amt:
import org.apache.spark.sql.functions.expr

// rebuild each struct, keeping the first Amt as-is and subtracting the previous Amt otherwise
val df2 = df.withColumn(
  "col",
  expr("""
    transform(
      col,
      (x, i) -> struct(
        x.Seq as Seq, x.Pct as Pct, x.Sh as Sh,
        case when i = 0 then x.Amt else x.Amt - col[i-1].Amt end as Amt
      )
    )
  """)
)
df2.show(false)
+-----+---+--------------------------------------------+
|Amt |ID |col |
+-----+---+--------------------------------------------+
|23077|ABC|[[1, 1.5, 1.0, 10000], [2, 1.2, 2.5, 30000]]|
+-----+---+--------------------------------------------+
Source json data
{"ID": "ABC", "Amt": 23077, "col": [{"Seq": 1, "Pct": 1.5, "Sh": 1},{"Seq": 2, "Pct": 1.2, "Sh": 2.5}]}
With the below structure:
ID: string
Amt: long
Col: array
    element: struct
        Seq: int
        Pct: double
        Sh: double
I have a dataframe with the below output:
+----+-------+-----------------------------+
|ID  |Amt    |col                          |
+----+-------+-----------------------------+
|ABC |23077  |[[1, 1.5, 1], [2, 1.2, 2.5]] |
+----+-------+-----------------------------+
I need to add the Amt column to the end of each element in the col array.
+----+-------+-------------------------------------------+
|ID  |Amt    |col1                                       |
+----+-------+-------------------------------------------+
|ABC |23077  |[[1, 1.5, 1, 23077], [2, 1.2, 2.5, 23077]] |
+----+-------+-------------------------------------------+
If your Spark version is >= 2.4, you can use transform to add elements to the struct:
val df2 = df.selectExpr(
  "Amt",
  "ID",
  "transform(col, x -> struct(x.Seq as Seq, x.Pct as Pct, x.Sh as Sh, Amt)) as col1"
)
df2.show(false)
+-----+---+--------------------------------------------+
|Amt |ID |col1 |
+-----+---+--------------------------------------------+
|23077|ABC|[[1, 1.5, 1.0, 23077], [2, 1.2, 2.5, 23077]]|
+-----+---+--------------------------------------------+
For older Spark versions, you can explode the array of structs and reconstruct them:
import org.apache.spark.sql.functions.{col, collect_list, struct}
// inline() explodes the array of structs; collect_list rebuilds it with Amt appended to each struct
val df2 = df.selectExpr("Amt", "ID", "inline(col)")
  .groupBy("ID", "Amt")
  .agg(collect_list(struct(col("Seq"), col("Pct"), col("Sh"), col("Amt"))).as("col1"))
df2.show(false)
+---+-----+--------------------------------------------+
|ID |Amt |col1 |
+---+-----+--------------------------------------------+
|ABC|23077|[[1, 1.5, 1.0, 23077], [2, 1.2, 2.5, 23077]]|
+---+-----+--------------------------------------------+
Here is some sample data:
val df4 = sc.parallelize(List(
  ("A1", 45, "5", 1, 90),
  ("A2", 60, "1", 1, 120),
  ("A6", 30, "9", 1, 450),
  ("A7", 89, "7", 1, 333),
  ("A7", 89, "4", 1, 320),
  ("A2", 60, "5", 1, 22),
  ("A1", 45, "22", 1, 1)
)).toDF("CID", "age", "children", "marketplace_id", "value")
Thanks to @Shu for this piece of code:
val df5 = df4.selectExpr("CID","""to_json(named_struct("id", children)) as item""", "value", "marketplace_id")
+---+-----------+-----+--------------+
|CID|item |value|marketplace_id|
+---+-----------+-----+--------------+
|A1 |{"id":"5"} |90 |1 |
|A2 |{"id":"1"} |120 |1 |
|A6 |{"id":"9"} |450 |1 |
|A7 |{"id":"7"} |333 |1 |
|A7 |{"id":"4"} |320 |1 |
|A2 |{"id":"5"} |22 |1 |
|A1 |{"id":"22"}|1 |1 |
+---+-----------+-----+--------------+
When you do df5.dtypes, you get:
(CID,StringType), (item,StringType), (value,IntegerType), (marketplace_id,IntegerType)
The column item is of string type; is there a way it can be of a json/object type (if that is a thing)?
EDIT 1:
I will describe what I am trying to achieve here; the above two steps remain the same.
val w = Window.partitionBy("CID").orderBy(desc("value"))
val sorted_list = df5.withColumn("item", collect_list("item").over(w)).groupBy("CID").agg(max("item") as "item")
Output:
+---+-------------------------+
|CID|item |
+---+-------------------------+
|A6 |[{"id":"9"}] |
|A2 |[{"id":"1"}, {"id":"5"}] |
|A7 |[{"id":"7"}, {"id":"4"}] |
|A1 |[{"id":"5"}, {"id":"22"}]|
+---+-------------------------+
Now whatever is inside [ ] is a string, which is causing a problem for one of the tools we are using.
Pardon me if this is a basic question; I am new to Scala and Spark.
Store the json data using a struct type; check the code below.
scala> dfa
         .withColumn("item_without_json", struct($"cid".as("id")))
         .withColumn("item_as_json", to_json($"item_without_json"))
         .show(false)
+---+-----------+-----+--------------+-----------------+------------+
|CID|item |value|marketplace_id|item_without_json|item_as_json|
+---+-----------+-----+--------------+-----------------+------------+
|A1 |{"id":"A1"}|90 |1 |[A1] |{"id":"A1"} |
|A2 |{"id":"A2"}|120 |1 |[A2] |{"id":"A2"} |
|A6 |{"id":"A6"}|450 |1 |[A6] |{"id":"A6"} |
|A7 |{"id":"A7"}|333 |1 |[A7] |{"id":"A7"} |
|A7 |{"id":"A7"}|320 |1 |[A7] |{"id":"A7"} |
|A2 |{"id":"A2"}|22 |1 |[A2] |{"id":"A2"} |
|A1 |{"id":"A1"}|1 |1 |[A1] |{"id":"A1"} |
+---+-----------+-----+--------------+-----------------+------------+
Based on your comment, to have the dataset converted to json you would use:
df4
  .select(collect_list(struct($"CID".as("id"))).as("items"))
  .write
  .json(path)
The output will look like:
{"items":[{"id":"A1"},{"id":"A2"},{"id":"A6"},{"id":"A7"}, ...]}
If you need it in memory to pass down to a function, use toJSON instead of write.json(...); a sketch follows.
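For example, a minimal sketch of the in-memory route (the jsonRows name is only for illustration; assumes collect_list and struct are imported from org.apache.spark.sql.functions as above):
// toJSON turns each row into a JSON string, giving an in-memory Dataset[String]
// instead of files on disk; collect() then brings those strings to the driver
val jsonRows: Array[String] = df4
  .select(collect_list(struct($"CID".as("id"))).as("items"))
  .toJSON
  .collect()

jsonRows.foreach(println)   // {"items":[{"id":"A1"},{"id":"A2"}, ...]}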
Suppose I have a dataframe as below:
+----+----------+----+----------------+
|colA| colB|colC| colD|
+----+----------+----+----------------+
| 1|2020-03-24| 21|[0.0, 2.49, 3.1]|
| 1|2020-03-17| 20|[1.0, 2.49, 3.1]|
| 1|2020-03-10| 19|[2.0, 2.49, 3.1]|
| 2|2020-03-24| 21|[0.0, 2.49, 3.1]|
| 2|2020-03-17| 20|[1.0, 2.49, 3.1]|
+----+----------+----+----------------+
I want to collect colD into a single list per row (colE), but only collecting the lists that fall within a range based on colC.
Output
+----+----------+----+----------------+------------------------------------+
|colA|colB      |colC|colD            |colE                                |
+----+----------+----+----------------+------------------------------------+
|1   |2020-03-24|21  |[0.0, 2.49, 3.1]|[[0.0, 2.49, 3.1], [1.0, 2.49, 3.1]]|
|1   |2020-03-17|20  |[1.0, 2.49, 3.1]|[[1.0, 2.49, 3.1], [2.0, 2.49, 3.1]]|
|1   |2020-03-10|19  |[2.0, 2.49, 3.1]|[[2.0, 2.49, 3.1]]                  |
|2   |2020-03-24|21  |[0.0, 2.49, 3.1]|[[0.0, 2.49, 3.1], [1.0, 2.49, 3.1]]|
|2   |2020-03-17|20  |[1.0, 2.49, 3.1]|[[1.0, 2.49, 3.1]]                  |
+----+----------+----+----------------+------------------------------------+
I tried the following, but it gives me this error:
cannot resolve 'RANGE BETWEEN CAST((`colC` - 2) AS STRING) FOLLOWING AND CAST(`colC` AS STRING) FOLLOWING' due to data type mismatch: Window frame lower bound 'cast((colC#575 - 2) as string)' is not a literal.;;
val data = Seq(
  ("1", "2020-03-24", 21, List(0.0, 2.49, 3.1)),
  ("1", "2020-03-17", 20, List(1.0, 2.49, 3.1)),
  ("1", "2020-03-10", 19, List(2.0, 2.49, 3.1)),
  ("2", "2020-03-24", 21, List(0.0, 2.49, 3.1)),
  ("2", "2020-03-17", 20, List(1.0, 2.49, 3.1))
)
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF("colA", "colB", "colC", "colD")
df.show()

val df1 = df
  .withColumn("colE", collect_list("colD").over(Window.partitionBy("colA")
    .orderBy("colB").rangeBetween($"colC" - lit(2), $"colC")))
  .show(false)
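The error says the frame bounds passed to rangeBetween must be literals, not column expressions such as $"colC" - lit(2). A minimal sketch of one way around it, keeping the offsets from the attempted code: order the window by colC itself and give rangeBetween constant offsets relative to the current row's colC (assumes the same imports as above).
// rangeBetween needs literal bounds, so the frame is expressed as constant
// offsets measured on the ordering column (colC here)
val w = Window
  .partitionBy("colA")
  .orderBy("colC")
  .rangeBetween(-2, 0)   // rows whose colC lies in [current colC - 2, current colC]

df.withColumn("colE", collect_list("colD").over(w)).show(false)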