Find columns with different values - scala

My dataframe has 120 columns. Suppose it has the structure below:
Id value1 value2 value3
a 10 1983 19
a 20 1983 20
a 10 1983 21
b 10 1984 1
b 10 1984 2
Here we can see that for id a, value1 takes different values (10, 20). I need to find, for each id, the columns that hold more than one distinct value. Is there a statistical or other approach in Spark to solve this problem?
Expected output
id new_column
a value1,value3
b value3

The following code might be a start of an answer:
val result = log.select("Id","value1","value2","value3").groupBy('Id).agg(countDistinct('value1), countDistinct('value2), countDistinct('value3))
It does the following:
1)
log.select("Id","value1","value2","value3")
selects the relevant columns (redundant if you want all columns)
2)
groupBy('Id)
groups rows with the same Id
3)
agg(countDistinct('value1), countDistinct('value2), countDistinct('value3))
outputs the Id (groupBy keeps the grouping column, so it need not be repeated in agg) and the number of distinct values per Id for each column
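To get from the distinct counts to the expected output (one comma-separated list of column names per id), the counts can be turned into names with when and concat_ws. The following is a minimal sketch, not a tuned solution. It assumes a local SparkSession and rebuilds the sample data from the question; the names valueCols, counts and varying are made up for illustration. Note that countDistinct ignores nulls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, countDistinct, lit, when}

// Assumption: a local session, only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("varying-columns").getOrCreate()
import spark.implicits._

// Sample data from the question.
val log = Seq(
  ("a", 10, 1983, 19),
  ("a", 20, 1983, 20),
  ("a", 10, 1983, 21),
  ("b", 10, 1984, 1),
  ("b", 10, 1984, 2)
).toDF("Id", "value1", "value2", "value3")

// Every column except the grouping key, so nothing has to be typed out for 120 columns.
val valueCols = log.columns.filterNot(_ == "Id")

// One distinct count per column, aliased with the column's own name.
val counts = log.groupBy("Id").agg(
  countDistinct(col(valueCols.head)).as(valueCols.head),
  valueCols.tail.map(c => countDistinct(col(c)).as(c)): _*
)

// when(...) without otherwise yields null, and concat_ws skips nulls,
// so only the names of columns with more than one distinct value remain.
val varying = counts.select(
  col("Id"),
  concat_ws(",", valueCols.map(c => when(col(c) > 1, lit(c))): _*).as("new_column")
)

varying.orderBy("Id").show()
```

On this sample data, value3 also varies within each id, so it appears in the list next to value1 for id a.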

You can do it in several ways. One is the distinct method, which behaves like DISTINCT in SQL. Another is the groupBy method, to which you pass the names of the columns you want to group by (e.g. df.groupBy("Id", "value1")).
Below is an example using the distinct method.
scala> case class Person(name : String, age: Int)
defined class Person
scala> val persons = Seq(Person("test", 10), Person("test", 20), Person("test", 10)).toDF
persons: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> persons.show
+----+---+
|name|age|
+----+---+
|test| 10|
|test| 20|
|test| 10|
+----+---+
scala> persons.select("name", "age").distinct().show
+----+---+
|name|age|
+----+---+
|test| 10|
|test| 20|
+----+---+

Related

Create separate columns from array column in Spark Dataframe in Scala when array is large [duplicate]

I have two columns: one of type Integer and one of type linalg.Vector. I can convert linalg.Vector to array. Each array has 32 elements. I want to convert each element in the array to a column. So the input is like :
column1 column2
(3, 5, 25, ...., 12) 3
(2, 7, 15, ...., 10) 4
(1, 10, 12, ..., 35) 2
Output should be:
column1_1 column1_2 column1_3 ......... column1_32 column2
3 5 25 ......... 12 3
2 7 15 ......... 10 4
1 10 12 ......... 35 2
Except that in my case there are 32 elements in the array, which is too many for the method in the question "Convert Array of String column to multiple columns in spark scala".
I tried a few ways and none of it worked. What is the right way to do this?
Thanks a lot.
scala> import org.apache.spark.sql.Column
scala> val df = Seq((Array(3,5,25), 3),(Array(2,7,15),4),(Array(1,10,12),2)).toDF("column1", "column2")
df: org.apache.spark.sql.DataFrame = [column1: array<int>, column2: int]
scala> def getColAtIndex(id:Int): Column = col(s"column1")(id).as(s"column1_${id+1}")
getColAtIndex: (id: Int)org.apache.spark.sql.Column
scala> val columns: IndexedSeq[Column] = (0 to 2).map(getColAtIndex) :+ col("column2") //Here, instead of 2, you can give the value of n
columns: IndexedSeq[org.apache.spark.sql.Column] = Vector(column1[0] AS `column1_1`, column1[1] AS `column1_2`, column1[2] AS `column1_3`, column2)
scala> df.select(columns: _*).show
+---------+---------+---------+-------+
|column1_1|column1_2|column1_3|column2|
+---------+---------+---------+-------+
|        3|        5|       25|      3|
|        2|        7|       15|      4|
|        1|       10|       12|      2|
+---------+---------+---------+-------+
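One option is to write a UserDefinedFunction that extracts a single element from the vector:
// Assumption: the column holds ml vectors; import org.apache.spark.mllib.linalg.Vector for the older type.
import org.apache.spark.ml.linalg.Vector
// Define the plain function first so that the UDF can wrap it.
def getElementFromVector(vec: Vector, idx: Int): Double = vec(idx)
val getElementFromVectorUDF = udf(getElementFromVector(_: Vector, _: Int))
You can then use it like this (a UDF takes Column arguments, so the index is wrapped in lit):
df.select(
getElementFromVectorUDF($"column1", lit(0)) as "column1_0",
...
getElementFromVectorUDF($"column1", lit(n)) as "column1_n"
)
I hope this helps.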
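If the column really is an ml Vector and Spark 3.0 or later is available, a UDF may not be needed: vector_to_array from org.apache.spark.ml.functions converts the vector to an array column, after which the index-based select from the first answer applies. The sketch below rests on those two assumptions; the names vecDf, elementCols and wide are illustrative, and n is 3 here only to keep the sample short (it would be 32 in the question).

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session, only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("vector-to-columns").getOrCreate()
import spark.implicits._

val n = 3 // number of vector elements; 32 in the question

val vecDf = Seq(
  (Vectors.dense(3.0, 5.0, 25.0), 3),
  (Vectors.dense(2.0, 7.0, 15.0), 4),
  (Vectors.dense(1.0, 10.0, 12.0), 2)
).toDF("column1", "column2")

// Convert the vector to array<double>, then pick each element by index.
val asArray = vecDf.withColumn("arr", vector_to_array(col("column1")))
val elementCols = (0 until n).map(i => col("arr")(i).as(s"column1_${i + 1}"))

val wide = asArray.select((elementCols :+ col("column2")): _*)
wide.show()
```

The element columns come out as doubles, because that is what an ml Vector stores.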

Scala: How to combine two data frames?

First Df is:
ID Name ID2 Marks
1 12 1 333
Second Df2 is:
ID Name ID2 Marks
1 3 989
7 98 8 878
I need output is:
ID Name ID2 Marks
1 12 1 333
1 3 989
7 98 8 878
Kindly help!
Use union or unionAll function:
df1.unionAll(df2)
df1.union(df2)
for example:
scala> val a = (1,"12",1,333)
a: (Int, String, Int, Int) = (1,12,1,333)
scala> val b = (1,"",3,989)
b: (Int, String, Int, Int) = (1,,3,989)
scala> val c = (7,"98",8,878)
c: (Int, String, Int, Int) = (7,98,8,878)
scala> import spark.implicits._
import spark.implicits._
scala> val df1 = List(a).toDF("ID","Name","ID2","Marks")
df1: org.apache.spark.sql.DataFrame = [ID: int, Name: string ... 2 more fields]
scala> val df2 = List(b, c).toDF("ID","Name","ID2","Marks")
df2: org.apache.spark.sql.DataFrame = [ID: int, Name: string ... 2 more fields]
scala> df1.show
+---+----+---+-----+
| ID|Name|ID2|Marks|
+---+----+---+-----+
|  1|  12|  1|  333|
+---+----+---+-----+
scala> df2.show
+---+----+---+-----+
| ID|Name|ID2|Marks|
+---+----+---+-----+
|  1|    |  3|  989|
|  7|  98|  8|  878|
+---+----+---+-----+
scala> df1.union(df2).show
+---+----+---+-----+
| ID|Name|ID2|Marks|
+---+----+---+-----+
|  1|  12|  1|  333|
|  1|    |  3|  989|
|  7|  98|  8|  878|
+---+----+---+-----+
A simple union or unionAll should do the trick for you:
Df.union(Df2)
or
Df.unionAll(Df2)
Both keep duplicate rows. As stated in the API documentation for union:
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does
deduplication of elements), use this function followed by a [[distinct]].
Also as standard in SQL, this function resolves columns by position (not by name).
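Because union resolves columns by position, two dataframes with the same columns in a different order would be combined incorrectly. For that case Spark 2.3 and later provide unionByName, which matches columns by name. Below is a small sketch of the difference; it assumes a local SparkSession, and the reordered dataframe is invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("union-by-name").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "12", 1, 333)).toDF("ID", "Name", "ID2", "Marks")

// Same columns as df1, but declared in a different order (hypothetical).
val reordered = Seq((989, 3, "", 1), (878, 8, "98", 7)).toDF("Marks", "ID2", "Name", "ID")

// unionByName lines columns up by name; the result keeps df1's column order.
val combined = df1.unionByName(reordered)
combined.show()
```

A plain df1.union(reordered) would instead pair ID with Marks, Name with ID2, and so on, since only positions are compared.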

Get Unique records in Spark [duplicate]

I have a dataframe df as mentioned below:
customers product val_id rule_name rule_id priority
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
2 B X DEF 456 3
1 A 1 DEF 456 2
I want to create a new dataframe df2 that has only unique customer ids. Because rule_name and rule_id differ for the same customer, I want to pick the record with the highest priority (the lowest priority value) for each customer, so my final outcome should be:
customers product val_id rule_name rule_id priority
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
Can anyone please help me achieve this using Spark Scala? Any help will be appreciated.
You basically want to select the rows with an extreme value in a column. This is a really common problem, so there is even a dedicated tag for it, greatest-n-per-group. Also see the question "SQL Select only rows with Max Value on a Column", which has a nice answer.
Here's an example for your specific case.
Note that this could select multiple rows for a customer, if there are multiple rows for that customer with the same (minimum) priority value.
This example is in PySpark, but it should be straightforward to translate to Scala:
from pyspark.sql import functions as F
# Find the best (lowest) priority for each customer. This DataFrame has only two columns.
cusPriDF = df.groupBy("customers").agg(F.min(df["priority"]).alias("priority"))
# Now join back to keep only those rows and get all columns back.
bestRowsDF = df.join(cusPriDF, on=["customers", "priority"], how="inner")
To create df2, you can order df by priority and then take the first row per customer:
val columns = df.schema.map(_.name).filterNot(_ == "customers").map(name => first(name).as(name))
// Keep the DataFrame in df2; calling .show inside the assignment would store Unit instead.
val df2 = df.orderBy("priority").groupBy("customers").agg(columns.head, columns.tail: _*)
df2.show
Note that Spark does not guarantee that groupBy preserves the preceding orderBy, so first may not always pick the lowest priority. On the sample data the expected output is:
+---------+-------+------+---------+-------+--------+
|customers|product|val_id|rule_name|rule_id|priority|
+---------+-------+------+---------+-------+--------+
|        1|      A|     1|      ABC|    123|       1|
|        3|      Z|     r|      ERF|    789|       2|
|        2|      B|     X|      ABC|    123|       2|
+---------+-------+------+---------+-------+--------+
Corey beat me to it, but here's the Scala version:
val df = Seq(
(1,"A","1","ABC",123,1),
(3,"Z","r","ERF",789,2),
(2,"B","X","ABC",123,2),
(2,"B","X","DEF",456,3),
(1,"A","1","DEF",456,2)).toDF("customers","product","val_id","rule_name","rule_id","priority")
val priorities = df.groupBy("customers").agg( min(df.col("priority")).alias("priority"))
val top_rows = df.join(priorities, Seq("customers","priority"), "inner")
top_rows.show
+---------+--------+-------+------+---------+-------+
|customers|priority|product|val_id|rule_name|rule_id|
+---------+--------+-------+------+---------+-------+
|        1|       1|      A|     1|      ABC|    123|
|        3|       2|      Z|     r|      ERF|    789|
|        2|       2|      B|     X|      ABC|    123|
+---------+--------+-------+------+---------+-------+
You can use a min aggregation on the priority column, grouping the dataframe by customers, then inner join the original dataframe with the aggregated dataframe and select the required columns.
val aggregatedDF = dataframe.groupBy("customers").agg(min("priority").as("priority_1"))
.withColumnRenamed("customers", "customers_1")
val finalDF = dataframe.join(aggregatedDF, dataframe("customers") === aggregatedDF("customers_1") && dataframe("priority") === aggregatedDF("priority_1"))
finalDF.select("customers", "product", "val_id", "rule_name", "rule_id", "priority").show
You should then have the desired result.
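The join-based answers can return several rows for a customer when the best priority is tied. If exactly one row per customer is required, a window with row_number is a common alternative. The sketch below assumes a local SparkSession and uses rule_id as a tie-breaker, which is an arbitrary choice made here for illustration and not something the question specifies.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumption: a local session, only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("first-row-per-group").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "A", "1", "ABC", 123, 1),
  (3, "Z", "r", "ERF", 789, 2),
  (2, "B", "X", "ABC", 123, 2),
  (2, "B", "X", "DEF", 456, 3),
  (1, "A", "1", "DEF", 456, 2)
).toDF("customers", "product", "val_id", "rule_name", "rule_id", "priority")

// Rank rows inside each customer: lowest priority value first,
// then rule_id as a hypothetical tie-breaker to make the choice deterministic.
val byCustomer = Window
  .partitionBy("customers")
  .orderBy(col("priority").asc, col("rule_id").asc)

// Keep only the top-ranked row of every customer.
val df2 = df
  .withColumn("rn", row_number().over(byCustomer))
  .filter(col("rn") === 1)
  .drop("rn")

df2.show()
```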

How to group by gender and join by positions per group?

I have tried numerous approaches to turn the following:
Gender, Age, Value
1, 20, 21
2, 23, 22
1, 26, 23
2, 29, 24
into
Male_Age, Male_Value, Female_Age, Female_Value
20 21 23 22
26 23 29 24
What I need to do is group by gender and, instead of using an aggregate like sum, count or avg, create a List[age] and a List[value]. This should be possible because I am using a Dataset, which allows functional operations.
If the numbers of rows for males and females are not the same, the columns should be filled with nulls.
One approach I tried was to build a new dataframe from the columns of other dataframes, like so:
df
.select(male.select("sex").where('sex === 1).col("sex"),
female.select("sex").where('sex === 2).col("sex"))
However, this bizarrely produces output like so:
sex, sex,
1, 1
2, 2
1, 1
2, 2
I can't see how that is possible.
I also tried using pivot, but it forces me to aggregate after the group by:
df.withColumn("sex2", df.col("sex"))
.groupBy("sex")
.pivot("sex2")
.agg(
mean('value).as("mean"),
stddev('value).as("std. dev") )
.show()
|sex| 1.0_mean| 1.0_std. dev| 2.0_mean| 2.0_std. dev|
|1.0|0.4926065526| 1.8110632697| | |
|2.0| | |0.951250372|1.75060275400785|
The following code does what I need in Oracle SQL, so I reckon it should be possible in Spark SQL too:
drop table mytable;
CREATE TABLE mytable
( gender number(10) NOT NULL,
age number(10) NOT NULL,
value number(10) );
insert into mytable values (1,20,21);
insert into mytable values(2,23,22);
insert into mytable values (1,26,23);
insert into mytable values (2,29,24);
insert into mytable values (1,30,25);
select * from mytable;
SELECT A.VALUE AS MALE,
B.VALUE AS FEMALE
FROM
(select value, rownum RN from mytable where gender = 1) A
FULL OUTER JOIN
(select value, rownum RN from mytable where gender = 2) B
ON A.RN = B.RN
The following should give you the result.
val df = Seq(
(1, 20, 21),
(2, 23, 22),
(1, 26, 23),
(2, 29, 24)
).toDF("Gender", "Age", "Value")
scala> df.show
+------+---+-----+
|Gender|Age|Value|
+------+---+-----+
|     1| 20|   21|
|     2| 23|   22|
|     1| 26|   23|
|     2| 29|   24|
+------+---+-----+
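// Gender 1 = Male
// Gender 2 = Female
import org.apache.spark.sql.expressions.Window
// Note: ordering by the partition column leaves the row order within each gender unspecified,
// so the male/female pairing is arbitrary; order by another column if the pairing matters.
val byGender = Window.partitionBy("gender").orderBy("gender")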
val males = df
.filter("gender = 1")
.select($"age" as "male_age",
$"value" as "male_value",
row_number() over byGender as "RN")
scala> males.show
+--------+----------+---+
|male_age|male_value| RN|
+--------+----------+---+
|      20|        21|  1|
|      26|        23|  2|
+--------+----------+---+
val females = df
.filter("gender = 2")
.select($"age" as "female_age",
$"value" as "female_value",
row_number() over byGender as "RN")
scala> females.show
+----------+------------+---+
|female_age|female_value| RN|
+----------+------------+---+
|        23|          22|  1|
|        29|          24|  2|
+----------+------------+---+
scala> males.join(females, Seq("RN"), "outer").show
+---+--------+----------+----------+------------+
| RN|male_age|male_value|female_age|female_value|
+---+--------+----------+----------+------------+
|  1|      20|        21|        23|          22|
|  2|      26|        23|        29|          24|
+---+--------+----------+----------+------------+
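The question also asks for nulls when the two groups differ in size. The sketch below runs the same row_number and outer join idea on the five rows from the Oracle example (three males, two females). It assumes a local SparkSession and orders each gender by Age so that the pairing is deterministic; that ordering is an assumption made here, not something the question prescribes. The helper numbered and the name paired are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumption: a local session, only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("pair-by-position").getOrCreate()
import spark.implicits._

// The five rows inserted in the Oracle example: three males, two females.
val people = Seq(
  (1, 20, 21),
  (2, 23, 22),
  (1, 26, 23),
  (2, 29, 24),
  (1, 30, 25)
).toDF("Gender", "Age", "Value")

// Number the rows of each gender by ascending Age (assumed ordering).
val byAge = Window.partitionBy("Gender").orderBy("Age")

// Hypothetical helper: rows of one gender with prefixed column names and a position RN.
def numbered(gender: Int, prefix: String): DataFrame =
  people
    .filter(col("Gender") === gender)
    .select(
      col("Age").as(s"${prefix}_age"),
      col("Value").as(s"${prefix}_value"),
      row_number().over(byAge).as("RN"))

// A full outer join on the position keeps unmatched rows and fills the other side with nulls.
val paired = numbered(1, "male")
  .join(numbered(2, "female"), Seq("RN"), "full_outer")
  .orderBy("RN")

paired.show()
```

The third male row has no female counterpart, so its female_age and female_value come out as null.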
Given a DataFrame called df with columns gender, age, and value, you can do this:
df.groupBy($"gender")
.agg(collect_list($"age"), collect_list($"value")).rdd.map { row =>
val ages: Seq[Int] = row.getSeq(1)
val values: Seq[Int] = row.getSeq(2)
// Each row holds a single gender, so head and last are that gender's first and last collected entries.
(row.getInt(0), ages.head, ages.last, values.head, values.last)
}.toDF("gender", "first_age", "last_age", "first_value", "last_value")
This uses the collect_list aggregate function from the very helpful Spark functions library to gather the values you want. (There is also a collect_set.)
After that, I don't know of any higher-level DataFrame functions to expand those array columns into individual columns of their own, so I fall back to the lower-level RDD API our ancestors used. I simply expand everything into a tuple and then turn it back into a DataFrame. Note that this yields one row per gender, not the side-by-side male/female layout from the question, and it only keeps the first and last element of each list. The commenters above mention corner cases I have not addressed; functions like headOption and lastOption might be useful there. But this should be enough to get you moving.

Create a new column based on date checking

I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, that won't work in this case. My initial idea was to use a udf, but I am not sure how to make it work here.
You can combine join and withColumn for this case: first join df1 with df2 on the ID column, then use the when/otherwise syntax to set the check column:
import org.apache.spark.sql.functions.{lit, to_date, when}
val df2_date = df2.withColumn("date", to_date(df2("start_date_time"))).withColumn("check", lit(1)).select($"PK".as("ID"), $"date", $"check")
df1.join(df2_date, Seq("ID"), "left").withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0)).drop("date").show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
|  1|   AAA|    1|
|  2|   BBB|    0|
|  4|   CCC|    0|
+---+------+-----+
Another option is to filter df2 first and then join it back to df1 on the ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
|  1|   AAA|    1|
|  2|   BBB|    0|
|  4|   CCC|    0|
+---+------+-----+
If your date looks like 2016-OCT-11, you can convert it to a java.sql.Date for comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11
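The same conversion can also be sketched with the java.time API, which lets the locale and the case-insensitive matching of the month name be stated explicitly instead of depending on the JVM's default locale. This is an alternative to the SimpleDateFormat snippet, offered under the assumption that the month abbreviations are English; the names formatter, localDate and sqlDate are illustrative.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatterBuilder
import java.util.Locale

// Case-insensitive parsing so that "OCT", "Oct" and "oct" are all accepted.
// Assumption: month abbreviations are English.
val formatter = new DateTimeFormatterBuilder()
  .parseCaseInsensitive()
  .appendPattern("uuuu-MMM-dd")
  .toFormatter(Locale.ENGLISH)

val localDate = LocalDate.parse("2016-OCT-11", formatter)

// java.sql.Date is what Spark maps to its DateType.
val sqlDate = java.sql.Date.valueOf(localDate)
println(sqlDate)
```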