I have a DataFrame in the following format:
id types
--- -------
1 {"BMW":"10000","Skoda":"12345"}
2 {"Honda":"90000","BMW":"11000","Benz":"56000"}
I need to create a new DataFrame like this:
id types value
--- ------ -------
1 BMW 10000
1 Skoda 12345
2 Honda 90000
2 BMW 11000
2 Benz 56000
Use from_json with MapType and explode the resulting map.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
df.withColumn("jsn", from_json(col("types"), MapType(StringType, StringType))).
  select(col("id"), explode(col("jsn"))).
  show()
//+---+-----+-----+
//| id| key|value|
//+---+-----+-----+
//| 1| BMW|10000|
//| 1|Skoda|12345|
//| 2|Honda|90000|
//| 2| BMW|11000|
//| 2| Benz|56000|
//+---+-----+-----+
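If you also want the exploded key column to be named types, to match the expected output above, the two generated columns can be aliased. A small sketch reusing the same df (not part of the original answer):
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
//alias the key/value pair produced by explode as types/value
df.withColumn("jsn", from_json(col("types"), MapType(StringType, StringType))).
  select(col("id"), explode(col("jsn")).as(Seq("types", "value"))).
  show()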
Here is a sample dataset. How can I dynamically delete all rows for an ID (no hardcoded values) when that ID has a row with Status = "Removed"?
Sample Dataset:
ID Status Date Amount
1 New 01/05/20 20
1 Assigned 02/05/20 30
1 In-Progress 02/05/20 50
2 New 02/05/20 30
2 Removed 03/05/20 20
3 In-Progress 09/05/20 50
3 Removed 09/05/20 20
4 New 10/05/20 20
4 Assigned 10/05/20 30
Expected Result:
ID Status Date Amount
1 New 01/05/20 20
1 Assigned 02/05/20 30
1 In-Progress 02/05/20 50
4 New 10/05/20 20
4 Assigned 10/05/20 30
Thanks in Advance.
You can use filter (or where) with a direct comparison, not like, or not rlike to filter out the records whose status is removed.
import org.apache.spark.sql.functions._
//assuming df is the dataframe
//using filter or where; trim removes white space, lower converts to lower case
val df1=df.filter(lower(trim(col("status"))) =!= "removed")
//or filter on the literal "Removed" (this won't match mixed-case values)
val df1=df.filter(col("status") =!= "Removed")
//using not like
val df1=df.filter(!lower(col("status")).like("removed"))
//using not rlike (case-insensitive match)
val df1=df.filter(!col("status").rlike(".*(?i)removed.*"))
The df1 dataframe will now contain the required records.
UPDATE:
From Spark 2.4:
We can use a join or a window function for this case.
val df=Seq((1,"New","01/05/20","20"),(1,"Assigned","02/05/20","30"),(1,"In-Progress","02/05/20","50"),
  (2,"New","02/05/20","30"),(2,"Removed","03/05/20","20"),(3,"In-Progress","09/05/20","50"),
  (3,"Removed","09/05/20","20"),(4,"New","10/05/20","20"),(4,"Assigned","10/05/20","30")).
  toDF("ID","Status","Date","Amount")
import org.apache.spark.sql.expressions._
val df1=df.
groupBy("id").
agg(collect_list(lower(col("Status"))).alias("status_arr"))
//using array_contains function
df.alias("t1").join(df1.alias("t2"),Seq("id"),"inner").
filter(!array_contains(col("status_arr"),"removed")).
drop("status_arr").show()
//without join using window clause
val w=Window.partitionBy("id").orderBy("Status").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("status_arr",collect_list(lower(col("status"))).over(w)).
filter(!array_contains(col("status_arr"),"removed")).
drop("status_arr").
show()
//+---+-----------+--------+------+
//| ID| Status| Date|Amount|
//+---+-----------+--------+------+
//| 1| New|01/05/20| 20|
//| 1| Assigned|02/05/20| 30|
//| 1|In-Progress|02/05/20| 50|
//| 4| New|10/05/20| 20|
//| 4| Assigned|10/05/20| 30|
//+---+-----------+--------+------+
For Spark < 2.4:
val df1=df.groupBy("id").agg(concat_ws("",collect_list(lower(col("Status")))).alias("status_arr"))
df.alias("t1").join(df1.alias("t2"),Seq("id"),"inner").
filter(!col("status_arr").contains("removed")).
drop(col("status_arr")).
show()
//Using window functions
df.withColumn("status_arr",concat_ws("",collect_list(lower(col("status"))).over(w))).
filter(!col("status_arr").contains("removed")).
drop(col("status_arr")).
show(false)
//+---+-----------+--------+------+
//| ID| Status| Date|Amount|
//+---+-----------+--------+------+
//| 1| New|01/05/20| 20|
//| 1| Assigned|02/05/20| 30|
//| 1|In-Progress|02/05/20| 50|
//| 4| New|10/05/20| 20|
//| 4| Assigned|10/05/20| 30|
//+---+-----------+--------+------+
Assuming res0 is your dataset, you could do:
import spark.implicits._
val x = res0.where($"Status" !== "Removed")
x.show()
This will remove the rows whose Status is "Removed", but it won't give what you want to achieve based on what you have posted above.
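To also drop the remaining rows of any ID that has a "Removed" row (which is what the question asks for), one option is a left anti join against the IDs containing that status. A sketch, assuming res0 is the dataset above and spark.implicits._ is in scope:
//IDs that have at least one "Removed" row
val removedIds = res0.filter($"Status" === "Removed").select("ID").distinct()
//keep only the rows whose ID never appears in removedIds
val result = res0.join(removedIds, Seq("ID"), "left_anti")
result.show()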
See my code:
val spark: SparkSession = SparkSession.builder().appName("ReadFiles").master("local[*]").getOrCreate()
import spark.implicits._
val data: DataFrame = spark.read.option("header", "true")
.option("inferschema", "true")
.csv("Resources/atom.csv")
data.show()
Data looks like:
ID Name City newCol1 newCol2
1 Ali lhr null null
2 Ahad swl 1 10
3 Sana khi null null
4 ABC xyz null null
New list of values:
val nums: List[Int] = List(10,20)
I want to add these values where ID=4. So that DataFrame may look like:
ID Name City newCol1 newCol2
1 Ali lhr null null
2 Ahad swl 1 10
3 Sana khi null null
4 ABC xyz 10 20
I wonder if it is possible. Any help will be appreciated. Thanks
It's possible. Use when/otherwise expressions for this case.
import org.apache.spark.sql.functions._
df.withColumn("newCol1", when(col("id") === 4, lit(nums(0))).otherwise(col("newCol1")))
.withColumn("newCol2", when(col("id") === 4, lit(nums(1))).otherwise(col("newCol2")))
.show()
//+---+----+----+-------+-------+
//| ID|Name|City|newCol1|newCol2|
//+---+----+----+-------+-------+
//| 1| Ali| lhr| null| null|
//| 2|Ahad| swl| 1| 10|
//| 3|Sana| khi| null| null|
//| 4| ABC| xyz| 10| 20|
//+---+----+----+-------+-------+
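If more than two columns need to be filled, the same when/otherwise pattern can be folded over (column, value) pairs instead of chaining withColumn calls by hand. A sketch under the question's setup (df, nums and the ID = 4 condition); the helper name newCols is mine:
import org.apache.spark.sql.functions._
val newCols = Seq("newCol1", "newCol2")
//update each column only on the rows where ID is 4, keep the existing value elsewhere
val updated = newCols.zip(nums).foldLeft(df) { case (acc, (name, value)) =>
  acc.withColumn(name, when(col("ID") === 4, lit(value)).otherwise(col(name)))
}
updated.show()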
I have 2 dataframes.
df1 =
dep-code rank
abc 1
bcd 2
df2 =
some cols... dep-code
abc
bcd
abc
I want to add a new column rank to df2 by joining on df1.dep-code = df2.dep-code.
result -
some cols... dep-code rank
abc 1
bcd 2
abc 1
That's a simple join:
df2.join(df1, "dep-code")
With the following inputs:
df1 with the join column and the desired rank column:
+--------+----+
|dep-code|rank|
+--------+----+
| abc| 1|
| bcd| 2|
+--------+----+
df2 with the join column plus an extra one (aColumn):
+----------+--------+
| aColumn|dep-code|
+----------+--------+
| some| abc|
| someother| bcd|
|yetAnother| abc|
+----------+--------+
You'll retrieve the output below:
+--------+----------+----+
|dep-code| aColumn|rank|
+--------+----------+----+
| abc| some| 1|
| abc|yetAnother| 1|
| bcd| someother| 2|
+--------+----------+----+
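If df2 can contain dep-code values that have no match in df1 and you still want to keep those rows (with a null rank), a left outer join does that. A small sketch; the sample data mirrors the tables above, plus an extra unmatched xyz row of my own to show the effect:
import spark.implicits._
val df1 = Seq(("abc", 1), ("bcd", 2)).toDF("dep-code", "rank")
val df2 = Seq(("some", "abc"), ("someother", "bcd"), ("yetAnother", "abc"), ("extra", "xyz")).
  toDF("aColumn", "dep-code")
//"left" keeps every row of df2; rank is null where df1 has no matching dep-code
df2.join(df1, Seq("dep-code"), "left").show()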
I have a dataframe df as mentioned below:
customers product val_id rule_name rule_id priority
1         A       1      ABC       123     1
3         Z       r      ERF       789     2
2         B       X      ABC       123     2
2         B       X      DEF       456     3
1         A       1      DEF       456     2
I want to create a new dataframe df2 which has only unique customer ids. Since the rule_name and rule_id columns differ for the same customer, I want to pick, for each customer, the record with the highest priority (the lowest priority value), so my final outcome should be:
customers product val_id rule_name rule_id priority
1         A       1      ABC       123     1
3         Z       r      ERF       789     2
2         B       X      ABC       123     2
Can anyone please help me achieve this using Spark Scala? Any help will be appreciated.
You basically want to select rows with extreme values in a column. This is a really common issue, so there's even a whole tag greatest-n-per-group. Also see this question SQL Select only rows with Max Value on a Column which has a nice answer.
Here's an example for your specific case.
Note that this could select multiple rows for a customer, if there are multiple rows for that customer with the same (minimum) priority value.
This example is in pyspark, but it should be straightforward to translate to Scala
# pyspark version; requires the functions module
from pyspark.sql import functions as F
# find best priority for each customer. this DF has only two columns.
cusPriDF = df.groupBy("customers").agg( F.min(df["priority"]).alias("priority") )
# now join back to choose only those rows and get all columns back
bestRowsDF = df.join(cusPriDF, on=["customers","priority"], how="inner")
To create df2 you have to first order df by priority and then find unique customers by id. Like this:
val columns = df.schema.map(_.name).filterNot(_ == "customers").map(c => first(c).as(c))
val df2 = df.orderBy("priority").groupBy("customers").agg(columns.head, columns.tail:_*)
df2.show
It would give you the expected output:
+----------+--------+-------+----------+--------+---------+
| customers| product| val_id| rule_name| rule_id| priority|
+----------+--------+-------+----------+--------+---------+
| 1| A| 1| ABC| 123| 1|
| 3| Z| r| ERF| 789| 2|
| 2| B| X| ABC| 123| 2|
+----------+--------+-------+----------+--------+---------+
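Note that Spark does not formally guarantee that the order from orderBy survives the groupBy, so first can in principle pick a different row. A more explicit alternative (my sketch, not from the answer above) is a window with row_number, assuming the same df:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//rank rows within each customer by priority and keep only the best one
val w = Window.partitionBy("customers").orderBy(col("priority"))
val df2 = df.withColumn("rn", row_number().over(w)).
  filter(col("rn") === 1).
  drop("rn")
df2.show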
Corey beat me to it, but here's the Scala version:
val df = Seq(
(1,"A","1","ABC",123,1),
(3,"Z","r","ERF",789,2),
(2,"B","X","ABC",123,2),
(2,"B","X","DEF",456,3),
(1,"A","1","DEF",456,2)).toDF("customers","product","val_id","rule_name","rule_id","priority")
val priorities = df.groupBy("customers").agg( min(df.col("priority")).alias("priority"))
val top_rows = df.join(priorities, Seq("customers","priority"), "inner")
+---------+--------+-------+------+---------+-------+
|customers|priority|product|val_id|rule_name|rule_id|
+---------+--------+-------+------+---------+-------+
| 1| 1| A| 1| ABC| 123|
| 3| 2| Z| r| ERF| 789|
| 2| 2| B| X| ABC| 123|
+---------+--------+-------+------+---------+-------+
You will have to use a min aggregation on the priority column, grouping the dataframe by customers, and then inner join the original dataframe with the aggregated dataframe and select the required columns.
val aggregatedDF = dataframe.groupBy("customers").agg(min("priority").as("priority_1"))
.withColumnRenamed("customers", "customers_1")
val finalDF = dataframe.join(aggregatedDF, dataframe("customers") === aggregatedDF("customers_1") && dataframe("priority") === aggregatedDF("priority_1"))
finalDF.select("customers", "product", "val_id", "rule_name", "rule_id", "priority").show
You should then have the desired result.
My dataframe has 120 columns. Suppose my dataframe has the structure below:
Id value1 value2 value3
a 10 1983 19
a 20 1983 20
a 10 1983 21
b 10 1984 1
b 10 1984 2
We can see here that for id a, value1 has different values (10, 20). I have to find the columns that have different values for a particular id. Is there a statistical or any other approach in Spark to solve this problem?
Expected output
id new_column
a value1,value3
b value3
The following code might be a start of an answer:
val result = log.select("Id","value1","value2","value3").groupBy('Id).agg('Id, countDistinct('value1),countDistinct('value2),countDistinct('value3))
Should do the following:
1)
log.select("Id","value1","value2","value3")
selects the relevant columns (if you want to take all columns, this step is redundant)
2)
groupBy('Id)
groups rows with the same ID
3)
agg('Id, countDistinct('value1),countDistinct('value2),countDistinct('value3))
outputs the ID and the number (count) of distinct values per ID for each of the specified columns
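To get from these per-column distinct counts to the new_column shown in the expected output, one option (a sketch of mine; the helper names valueCols and aggExprs are assumptions) is to keep a column's name only when its count exceeds 1 and join the surviving names with commas; concat_ws skips the nulls produced by when without an otherwise:
import org.apache.spark.sql.functions._
val valueCols = Seq("value1", "value2", "value3")
//count distinct values per Id for every value column
val aggExprs = valueCols.map(c => countDistinct(col(c)).as(c))
val counts = df.groupBy("Id").agg(aggExprs.head, aggExprs.tail: _*)
//keep a column name only when it has more than one distinct value, then join with commas
val result = counts.select(
  col("Id"),
  concat_ws(",", valueCols.map(c => when(col(c) > 1, lit(c))): _*).as("new_column")
)
result.show()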
You can do it in several ways, one of them being the distinct method, which is similar to the SQL behaviour. Another one would be the groupBy method, where you pass the names of the columns you want to group by as parameters (e.g. df.groupBy("Id", "value1")).
Below is an example using the distinct method.
scala> case class Person(name : String, age: Int)
defined class Person
scala> val persons = Seq(Person("test", 10), Person("test", 20), Person("test", 10)).toDF
persons: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> persons.show
+----+---+
|name|age|
+----+---+
|test| 10|
|test| 20|
|test| 10|
+----+---+
scala> persons.select("name", "age").distinct().show
+-----+---+
| name|age|
+-----+---+
| test| 10|
| test| 20|
+-----+---+