Multiple agg functions using the same Over Window (concat & max) - Scala

I'm a beginner in Spark. Is there any way to apply multiple aggregate functions to two different columns using the same Over window? In my case I want to apply concat and max.
I have a Dataset (DS1) like this.
+-----+-----+-----+--------+
|Col_1|Col_2|Col_3|Col_4   |
+-----+-----+-----+--------+
|1    |aa   |10   |test_1_1|
|1    |bb   |20   |test_1_2|
|2    |cc   |30   |test_2_1|
|2    |dd   |40   |test_2_2|
+-----+-----+-----+--------+
I want to get something like this (DS2):
+-----+-----+-----+-----------------+
|Col_1|Col_2|Col_3|Col_5            |
+-----+-----+-----+-----------------+
|1    |bb   |20   |test_1_2;test_1_1|
|2    |dd   |40   |test_2_2;test_2_1|
+-----+-----+-----+-----------------+
I know how to apply the max function over a window, but how can I add the concatenation to get the dataset DS2?
val partitionColumns = Seq("Col_1")

df.withColumn(
    "max_Col_3",
    max(col("Col_3")) over Window.partitionBy(partitionColumns.map(col): _*)
  )
  .filter(col("max_Col_3").equalTo(col("Col_3")))
  .drop("max_Col_3")

You can use SQL expressions via expr to get col_2, col_3 and col_5 respectively.
val DS2 = DS1.groupBy("col_1").agg(
  expr("max(col_2) as col_2"),
  expr("max(col_3) as col_3"),
  expr("array_join(collect_list(col_4), ';') as col_5")
)
DS2.show()
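If you specifically want to keep everything on the same Over window instead of switching to a groupBy, a minimal sketch along these lines should also work (assumes Spark 2.4+ for array_join; DS1 and the column names are taken from the question):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("Col_1")

val DS2 = DS1
  // max over the window, only used to keep the row holding the largest Col_3
  .withColumn("max_Col_3", max(col("Col_3")).over(w))
  // collect every Col_4 of the partition and join the values with ';'
  .withColumn("Col_5", array_join(collect_list(col("Col_4")).over(w), ";"))
  .filter(col("Col_3") === col("max_Col_3"))
  .drop("max_Col_3", "Col_4")

Be aware that the order of the values inside the collected list is not guaranteed unless you add an orderBy (with an explicit frame) to the window.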

Related

Split values in dataframe into separate dataframe columns - Spark Scala

My requirement is to convert the below dataframe
df.show()
Id | vals
1 | name=John || age=25 || col1 =val1 || col2= val2
2 | name=Joe || age=23 || col1 =val11 || col2= val22
Into
Id | name | age | col1  | col2
1  | John | 25  | val1  | val2
2  | Joe  | 23  | val11 | val22
Please assist me with this.
To generate the wanted result dynamically, here's one approach that uses a mix of split and explode to transform column vals into an ArrayType column of [key, value] pairs (e.g. ["name", "John"]). This is followed by a grouping by id and a pivot on the key to aggregate the value:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "name=John || age=25 || col1 =val1 || col2= val2"),
  (2, "name=Joe || age=23 || col1 =val11 || col2= val22")
).toDF("id", "vals")

df.
  withColumn("flattened", explode(split($"vals", "\\s*\\|\\|\\s*"))).
  withColumn("kv_array", split($"flattened", "\\s*=\\s*")).
  groupBy($"id").pivot($"kv_array"(0)).agg(first($"kv_array"(1))).
  show
// +---+---+-----+-----+----+
// |id |age|col1 |col2 |name|
// +---+---+-----+-----+----+
// |1 |25 |val1 |val2 |John|
// |2 |23 |val11|val22|Joe |
// +---+---+-----+-----+----+
You could use the Spark SQL split function to split your string into an array[String] and then select the columns accordingly. Something like below:
val df1 = df.withColumn("vals", split($"vals", "\\s*\\|\\|\\s*"))  // \s* trims the spaces around the delimiter
  .select($"id",
    split($"vals"(0), "\\s*=\\s*")(1).alias("name"),
    split($"vals"(1), "\\s*=\\s*")(1).alias("age"),
    split($"vals"(2), "\\s*=\\s*")(1).alias("col1"),
    split($"vals"(3), "\\s*=\\s*")(1).alias("col2"))

Spark dataframe Column content modification

I have a dataframe as shown below (df.show()):
+--------+---------+---------+---------+---------+
| Col11 | Col22 | Expend1 | Expend2 | Expend3 |
+--------+---------+---------+---------+---------+
| Value1 | value1 | 123 | 2264 | 56 |
| Value1 | value2 | 124 | 2255 | 23 |
+--------+---------+---------+---------+---------+
Can I transform the above data frame to the below using some SQL?
+--------+---------+-------------+---------------+------------+
| Col11 | Col22 | Expend1 | Expend2 | Expend3 |
+--------+---------+-------------+---------------+------------+
| Value1 | value1 | Expend1:123 | Expend2: 2264 | Expend3:56 |
| Value1 | value2 | Expend1:124 | Expend2: 2255 | Expend3:23 |
+--------+---------+-------------+---------------+------------+
You can use the idea of foldLeft here:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.sparkContext.parallelize(Seq(
  ("Value1", "value1", "123", "2264", "56"),
  ("Value1", "value2", "124", "2255", "23")
)).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")

// List your columns for the operation
val cols = List("Expend1", "Expend2", "Expend3")

val newDF = cols.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, concat(lit(name + ":"), col(name)))
}

newDF.show()
Output:
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+
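The same result can also be produced in a single select instead of chained withColumn calls; a minimal sketch using the same df and cols as above:

val newDF2 = df.select(df.columns.map { c =>
  // prefix only the columns listed in cols, pass the others through unchanged
  if (cols.contains(c)) concat(lit(c + ":"), col(c)).as(c) else col(c)
}: _*)

newDF2.show()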
You can do that using a simple SQL select statement; if you want, you can use a UDF as well.
Ex -> select Col11, Col22, concat('Expend1:', Expend1) as Expend1, .... from table
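If you go the Spark SQL route, a minimal sketch would register the dataframe as a temp view (the name tbl is arbitrary) and use concat:

df.createOrReplaceTempView("tbl")

val out = spark.sql("""
  select Col11, Col22,
         concat('Expend1:', Expend1) as Expend1,
         concat('Expend2:', Expend2) as Expend2,
         concat('Expend3:', Expend3) as Expend3
  from tbl
""")
out.show()

And the UDF-based variant: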
val df = Seq(("Value1", "value1", "123", "2264", "56"), ("Value1", "value2", "124", "2255", "23") ).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")
val cols = df.columns.filter(!_.startsWith("Col")) // It will only fetch other than col% prefix columns
val getCombineData = udf { (colName:String, colvalue:String) => colName + ":"+ colvalue}
var in = df
for (e <- cols) {
in = in.withColumn(e, getCombineData(lit(e), col(e)) )
}
in.show
// results
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+

Update data from two Data Frames Scala-Spark

I have two Data Frames:
DF1:
ID | Col1 | Col2
1  | a    | aa
2  | b    | bb
3  | c    | cc
DF2:
ID | Col1 | Col2
1  | ab   | aa
2  | b    | bba
4  | d    | dd
How can I join these two DFs so that the result is:
Result:
ID | Col1 | Col2
1  | ab   | aa
2  | b    | bba
3  | c    | cc
4  | d    | dd
My code is:
val df = DF1.join(DF2, Seq("ID"), "outer")
  .select(
    $"ID",
    when(DF1("Col1").isNull, lit(0)).otherwise(DF1("Col1")).as("Col1"),
    when(DF1("Col2").isNull, lit(0)).otherwise(DF2("Col2")).as("Col2"))
  .orderBy("ID")
And it works, but I don't want to specify each column, because I have large files.
So, is there any way to update the dataframe (and add records if the second DF has new ones) without specifying each column?
A simple leftanti join of df1 with df2, followed by merging the result into df2, should give your desired output:
df2.union(df1.join(df2, Seq("ID"), "leftanti")).orderBy("ID").show(false)
which should give you
+---+----+----+
|ID |Col1|Col2|
+---+----+----+
|1 |ab |aa |
|2 |b |bba |
|3 |c |cc |
|4 |d |dd |
+---+----+----+
The solution doesn't match the logic you have in your code, but it generates the expected result.
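If you would rather keep a join and have df2's values win whenever both sides have a row for an ID, a sketch that builds a coalesce per column dynamically might look like this (assumes ID is the only key column; not part of the answer above):

import org.apache.spark.sql.functions._

val keyCols = Seq("ID")
val valueCols = df1.columns.filterNot(keyCols.contains)

val merged = df1.as("a").join(df2.as("b"), keyCols, "outer")
  .select((keyCols.map(col) ++
    // prefer df2's value, fall back to df1's when df2 has no row
    valueCols.map(c => coalesce(col(s"b.$c"), col(s"a.$c")).as(c))): _*)
  .orderBy("ID")

merged.show(false)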

Scala/ Spark- Multiply an Integer with each value in a Dataframe Column

I have a sample dataframe
df_that_I_have
+---------+---------+-------+
| country | members | some |
+---------+---------+-------+
| India | 50 | 1 |
+---------+---------+-------+
| Japan | 20 | 3 |
+---------+---------+-------+
| India | 20 | 1 |
+---------+---------+-------+
| Japan | 10 | 3 |
+---------+---------+-------+
and I want a dataframe that looks like this
df_that_I_want
+---------+---------+-------+
| country | members | some |
+---------+---------+-------+
| India | 70 | 10 | // 5 * Sum of "some" for India, i.e. (1 + 1)
+---------+---------+-------+
| Japan | 30 | 30 | // 5 * Sum of "some" for Japan, i.e. (3 + 3)
+---------+---------+-------+
The second dataframe has, per country, the sum of members and the sum of some multiplied by 5.
This is what I'm doing to achieve this
val df_that_I_want = df_that_I_have
  .select(df_that_I_have("country"),
    df_that_I_have.groupBy("country").sum("members"),
    5 * df_that_I_have.groupBy("country").sum("some")) // Problem here
But the compiler does not allow me to do this, because apparently I can't multiply 5 by a column.
How can I multiply an Integer value with the sum of some for each country?
You can try the lit function.
scala> val df_that_I_have = Seq(("India",50,1),("India",20,1),("Japan",20,3),("Japan",10,3)).toDF("Country","Members","Some")
df_that_I_have: org.apache.spark.sql.DataFrame = [Country: string, Members: int, Some: int]
scala> val df1 = df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
df1: org.apache.spark.sql.DataFrame = [country: string, sum(members): bigint, ((sum(some),mode=Complete,isDistinct=false) * 5): bigint]
scala> val df_that_I_want= df1.select($"Country",$"sum(Members)".alias("Members"), $"((sum(Some),mode=Complete,isDistinct=false) * 5)".alias("Some"))
df_that_I_want: org.apache.spark.sql.DataFrame = [Country: string, Members: bigint, Some: bigint]
scala> df_that_I_want.show
+-------+-------+----+
|Country|Members|Some|
+-------+-------+----+
| India| 70| 10|
| Japan| 30| 30|
+-------+-------+----+
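Note that on newer Spark versions the auto-generated aggregate column names differ from the ((sum(some),mode=Complete,isDistinct=false) * 5) shown in that transcript, so it is usually safer to alias inside agg; a minimal sketch:

import org.apache.spark.sql.functions._

val df_that_I_want = df_that_I_have
  .groupBy("country")
  .agg(
    sum("members").alias("members"),
    (sum("some") * lit(5)).alias("some"))

df_that_I_want.show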
Please try this
df_that_I_have.select("country").groupBy("country").agg(sum("members"), sum("some") * lit(5))
df_that_I_have.select("country").groupBy("country").agg(sum("members"), sum("some") * lit(5))
The lit function is used to create a column of a literal value, which is 5 here.
Since you can't multiply by 5 directly, it creates a column containing 5 and multiplies with it.

How to filter a dataframe based on column values (multiple values through an ArrayBuffer) in Scala

In Scala/Spark code I have a DataFrame which contains some rows:
col1 col2
Abc someValue1
xyz someValue2
lmn someValue3
zmn someValue4
pqr someValue5
cda someValue6
And I have a variable of type ArrayBuffer[String] which contains [xyz, pqr, abc].
I want to filter the given dataframe based on these values from the ArrayBuffer, matched against col1.
In SQL it would be like:
select * from tableXyz where col1 in("xyz","pqr","abc");
Assuming you have your dataframe:
val df = sc.parallelize(Seq(
    ("abc", "someValue1"),
    ("xyz", "someValue2"),
    ("lmn", "someValue3"),
    ("zmn", "someValue4"),
    ("pqr", "someValue5"),
    ("cda", "someValue6")))
  .toDF("col1", "col2")
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| lmn|someValue3|
| zmn|someValue4|
| pqr|someValue5|
| cda|someValue6|
+----+----------+
Then you can define a UDF to filter the dataframe based on the array's values:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions.{col, udf}

val array = ArrayBuffer[String]("xyz", "pqr", "abc")
val function: (String => Boolean) = (arg: String) => array.contains(arg)
val udfFiltering = udf(function)

val filtered = df.filter(udfFiltering(col("col1")))
filtered.show()
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| pqr|someValue5|
+----+----------+
Alternatively, you can register your dataframe and SQL-query it through the SQLContext:
var elements = ""
array.foreach { el => elements += "\"" + el + "\"" + "," }
elements = elements.dropRight(1)

val query = "select * from tableXyz where col1 in(" + elements + ")"

df.registerTempTable("tableXyz")
val filtered = sqlContext.sql(query)
filtered.show()
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| pqr|someValue5|
+----+----------+
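For what it's worth, the built-in isin function avoids both the UDF and the manual string building; a minimal sketch, expanding the ArrayBuffer as varargs:

import org.apache.spark.sql.functions.col

val filtered = df.filter(col("col1").isin(array: _*))
filtered.show()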