Spark SQL Map only one column of DataFrame - scala

Sorry for the noob question, I have a dataframe in SparkSQL like this:
id | name | data
----------------
1 | Mary | ABCD
2 | Joey | DOGE
3 | Lane | POOP
4 | Jack | MEGA
5 | Lynn | ARGH
I want to know how to do two things:
1) use a scala function on one or more columns to produce another column
2) use a scala function on one or more columns to replace a column
Examples:
1) Create a new boolean column that tells whether the data starts with A:
id | name | data | startsWithA
------------------------------
1 | Mary | ABCD | true
2 | Joey | DOGE | false
3 | Lane | POOP | false
4 | Jack | MEGA | false
5 | Lynn | ARGH | true
2) Replace the data column with its lowercase counterpart:
id | name | data
----------------
1 | Mary | abcd
2 | Joey | doge
3 | Lane | poop
4 | Jack | mega
5 | Lynn | argh
What is the best way to do this in SparkSQL? I've seen many examples of how to return a single transformed column, but I don't know how to get back a new DataFrame with all the original columns as well.

You can use withColumn to add new column or to replace the existing column
as
val df = Seq(
(1, "Mary", "ABCD"),
(2, "Joey", "DOGE"),
(3, "Lane", "POOP"),
(4, "Jack", "MEGA"),
(5, "Lynn", "ARGH")
).toDF("id", "name", "data")
val resultDF = df.withColumn("startsWithA", $"data".startsWith("A"))
.withColumn("data", lower($"data"))
If you want separate dataframe then
val resultDF1 = df.withColumn("startsWithA", $"data".startsWith("A"))
val resultDF2 = df.withColumn("data", lower($"data"))
withColumn replaces the old column if the same column name is provided and creates a new column if new column name is provided.
Output:
+---+----+----+-----------+
|id |name|data|startsWithA|
+---+----+----+-----------+
|1 |Mary|abcd|true |
|2 |Joey|doge|false |
|3 |Lane|poop|false |
|4 |Jack|mega|false |
|5 |Lynn|argh|true |
+---+----+----+-----------+

Related

Convert result of aggregation into 3 separate fields of col name, aggregate function and value

I have a dataframe in the form
+-----+--------+-------+
| id | label | count |
+-----+--------+-------+
| id1 | label1 | 5 |
| id1 | label1 | 2 |
| id2 | label2 | 3 |
+-----+--------+-------+
and I would like the resulting output to look like
+-----+--------+----------+----------+-------+
| id | label | col_name | agg_func | value |
+-----+--------+----------+----------+-------+
| id1 | label1 | count | avg | 3.5 |
| id1 | label1 | count | sum | 7 |
| id2 | label2 | count | avg | 3 |
| id2 | label2 | count | sum | 3 |
+-----+--------+----------+----------+-------+
First, I created a list of aggregate functions using the code below. I then apply these functions into the original dataframe to get the aggregation results in separate columns.
val f = org.apache.spark.sql.functions
val aggCols = Seq("col_name")
val aggFuncs = Seq("avg", "sum")
val aggOp = for (func <- aggFuncs) yield {
aggCols.map(x => f.getClass.getMethod(func, x.getClass).invoke(f, x).asInstanceOf[Column])
}
val aggOpFlat = aggOp.flatten
df.groupBy("id", "label").agg(aggOpFlat.head, aggOpFlat.tail: _*).na.fill(0)
I get to the format
+-----+--------+---------------+----------------+
| id | label | avg(col_name) | sum(col_name) |
+-----+--------+---------------+----------------+
| id1 | label1 | 3.5 | 7 |
| id2 | label2 | 3 | 3 |
+-----+--------+---------------+----------------+
but cannot think of the logic to get to what I want.
A possible solution might be to wrap all the aggregate values inside a map and then to use explode function.
Something like that (shouldn't be an issue to make it dynamic).
val df = List ( ("id1", "label1", 5), ("id1", "label1", 2), ("id2", "label2", 3)).toDF("id", "label", "count")
df
.groupBy("id", "label")
.agg(avg("count").as("avg"), sum("count").as("sum"))
.withColumn("map", map( lit("avg"), col("avg"), lit("sum"), col("sum")))
.select(col("id"), col("label"), explode(col("map")))
.show

Fast split Spark dataframe by keys in some column and save as different dataframes

I have Spark 2.3 very big dataframe like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AB | 2 | 1 |
| AA | 2 | 3 |
| AC | 1 | 2 |
| AA | 3 | 2 |
| AC | 5 | 3 |
-------------------------
I need to "split" this dataframe by values in col_key column and save each splitted part in separate csv file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AA | 2 | 3 |
| AA | 3 | 2 |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC | 1 | 2 |
| AC | 5 | 3 |
-------------------------
and so far.
Every result dataframe I need to save as different csv file.
Count of keys is not big (20-30) but total count of data is (~200 millions records).
I have the solution where in the loop is selected every part of data and then saved to file:
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach(k => {
val dfi = df.where($"col_key" === lit(k))
SaveDataByKey(dfi, path_to_save)
})
It works correct, but bad issue of this solution is that every selection of data by every key couse full passing through whole dataframe, and it get too many time.
I think must be faster solution, where we pass through dataframe only once and during this put every record to "rigth" result dataframe (or directly to separate file). But I don't know how can to do it :)
May be, someone have ideas about it?
Also I prefer to use Spark's DataFrame API because it provides fastest way of data processing (so using RDD's is not desirable, if possible).
You need to partition by column and save as csv file. Each partition save as one file.
yourDF
.write
.partitionBy("col_key")
.csv("/path/to/save")
Why don't you try this ?

Identifying recurring values a column over a Window (Scala)

I have a data frame with two columns: "ID" and "Amount", each row representing a transaction of a particular ID and the transacted amount. My example uses the following DF:
val df = sc.parallelize(Seq((1, 120),(1, 120),(2, 40),
(2, 50),(1, 30),(2, 120))).toDF("ID","Amount")
I want to create a new column identifying whether said amount is a recurring value, i.e. occurs in any other transaction for the same ID, or not.
I have found a way to do this more generally, i.e. across the entire column "Amount", not taking into account the ID, using the following function:
def recurring_amounts(df: DataFrame, col: String) : DataFrame = {
var df_to_arr = df.select(col).rdd.map(r => r(0).asInstanceOf[Double]).collect()
var arr_to_map = df_to_arr.groupBy(identity).mapValues(_.size)
var map_to_df = arr_to_map.toSeq.toDF(col, "Count")
var df_reformat = map_to_df.withColumn("Amount", $"Amount".cast(DoubleType))
var df_out = df.join(df_reformat, Seq("Amount"))
return df_new
}
val df_output = recurring_amounts(df, "Amount")
This returns:
+---+------+-----+
|ID |Amount|Count|
+---+------+-----+
| 1 | 120 | 3 |
| 1 | 120 | 3 |
| 2 | 40 | 1 |
| 2 | 50 | 1 |
| 1 | 30 | 1 |
| 2 | 120 | 3 |
+---+------+-----+
which I can then use to create my desired binary variable to indicate whether the amount is recurring or not (yes if > 1, no otherwise).
However, my problem is illustrated in this example by the value 120, which is recurring for ID 1 but not for ID 2. My desired output therefore is:
+---+------+-----+
|ID |Amount|Count|
+---+------+-----+
| 1 | 120 | 2 |
| 1 | 120 | 2 |
| 2 | 40 | 1 |
| 2 | 50 | 1 |
| 1 | 30 | 1 |
| 2 | 120 | 1 |
+---+------+-----+
I've been trying to think of a way to apply a function using
.over(Window.partitionBy("ID") but not sure how to go about it. Any hints would be much appreciated.
If you are good in sql, you can write sql query for your Dataframe. The first thing that you need to do is to register your Dataframeas a table in the spark's memory. After that you can write the sql on top of the table. Note that spark is the spark session variable.
val df = sc.parallelize(Seq((1, 120),(1, 120),(2, 40),(2, 50),(1, 30),(2, 120))).toDF("ID","Amount")
df.registerTempTable("transactions")
spark.sql("select *,count(*) over(partition by ID,Amount) as Count from transactions").show()
Please let me know if you have any questions.

Select specific rows from Spark dataframe per grouping

I have a dataframe like that:
+-----------------+------------------+-----------+--------+---+
| conversation_id | message_body | timestamp | sender | |
+-----------------+------------------+-----------+--------+---+
| A | hi | 9:00 | John | |
| A | how are you? | 10:00 | John | |
| A | can we meet? | 10:05 | John | * |
| A | not bad | 10:30 | Steven | * |
| A | great | 10:40 | John | |
| A | yeah, let's meet | 10:35 | Steven | |
| B | Hi | 12:00 | Anna | * |
| B | Hello | 12:05 | Ken | * |
+-----------------+------------------+-----------+--------+---+
For each conversation I would like to have the last message in the first block of the 1st sender and the first message of the 2nd sender. I marked them with an asterisk.
One idea that I had is to assign 0s for the first user and 1s for the second user.
Ideally I would like to have something like that:
+-----------------+---------+------------+--------------+---------+------------+----------+
| conversation_id | sender1 | timestamp1 | message1 | sender2 | timestamp2 | message2 |
+-----------------+---------+------------+--------------+---------+------------+----------+
| A | John | 10:05 | can we meet? | Steven | 10:30 | not bad |
| B | Anna | 12:00 | Hi | Ken | 12:05 | Hello |
+-----------------+---------+------------+--------------+---------+------------+----------+
How could I do that in Spark?
Interesting issues arose.
Amended 10:35 to 10:45
Used leading 0 format e.g. 09:00 instead of 9:00
You will need to use your own data types accordingly, this simply demonstrates the approach needed
Done in DataBricks Notebook
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df =
Seq(("A", "hi", "09:00", "John"), ("A", "how are you?", "10:00", "John"),
("A", "can we meet?", "10:05", "John"), ("A", "not bad", "10:30", "Steven"),
("A", "great", "10:40", "John"), ("A", "yeah, let's meet", "10:45", "Steven"),
("B", "Hi", "12:00", "Anna"), ("B", "Hello", "12:05", "Ken")
).toDF("conversation_id", "message_body", "timestampX", "sender")
// Get rank, 1 is who were initiates conversation, the other values can be used to infer relationships
// Note no #Transient required here with Window
val df2 = df.withColumn("rankC", row_number().over(Window.partitionBy($"conversation_id").orderBy($"timestampX".asc)))
// A value <> 1 is the first message of second Sender.
// The 1 value of this is the last message of first "block"
val df3 = df2.select('conversation_id as "df3_conversation_id", 'sender as "df3_sender", 'rankC as "df3_rank")
// To avoid pipelined renaming issues that occur
val df3a = df3.groupBy("df3_conversation_id", "df3_sender").agg(min("df3_rank") as "rankC2").filter("rankC2 != 1")
//JOIN the values with some smarts. Some odd errors in Spark thru pipe-lining gotten. Need to drop pipelined row(), ranking etc.
val df4 = df3a.join(df2, (df3a("df3_conversation_id") === df2("conversation_id")) && (df3a("rankC2") === df2("rankC") + 1)).drop("rankC").drop("rankC2")
val df4a = df3a.join(df2, (df3a("df3_conversation_id") === df2("conversation_id")) && (df3a("rankC2") === df2("rankC"))).drop("rankC").drop("rankC2")
// The get other missing data, could have all been combined but done in steps for simplicity. Just a simple final JOIN and you ahve the answer.
val df5 = df4.join(df4a, (df4("df3_conversation_id") === df4a("df3_conversation_id")))
df5.show(false)
returns:
Output will not completely format here, run it in REPL to see titles.
|B |Ken |B |Hi |12:00 |Anna |B |Ken |B |Hello |12:05 |Ken |
|A |Steven |A |can we meet?|10:05 |John |A |Steven |A |not bad |10:30 |Steven|
You can further manipulate the data, the heavy lifting is done now. The Catalyst Optimizer has some issues compiling etc. so this is why I worked around in this fashion.

How filter one big dataframe many times(equal to small df‘s row count) by another small dataframe(row by row) ?

I have two spark dataframe,dfA and dfB.
I want to filter dfA by dfB's each row, which means if dfB have 10000 rows, i need to filter dfA 10000 times with 10000 different filter conditions generated by dfB. Then, after each filter i need to collect the filter result as a column in dfB.
dfA dfB
+------+---------+---------+ +-----+-------------+--------------+
| id | value1 | value2 | | id | min_value1 | max_value1 |
+------+---------+---------+ +-----+-------------+--------------+
| 1 | 0 | 4345 | | 1 | 0 | 3 |
| 1 | 1 | 3434 | | 1 | 5 | 9 |
| 1 | 2 | 4676 | | 2 | 1 | 4 |
| 1 | 3 | 3454 | | 2 | 6 | 8 |
| 1 | 4 | 9765 | +-----+-------------+--------------+
| 1 | 5 | 5778 | ....more rows, nearly 10000 rows.
| 1 | 6 | 5674 |
| 1 | 7 | 3456 |
| 1 | 8 | 6590 |
| 1 | 9 | 5461 |
| 1 | 10 | 4656 |
| 2 | 0 | 2324 |
| 2 | 1 | 2343 |
| 2 | 2 | 4946 |
| 2 | 3 | 4353 |
| 2 | 4 | 4354 |
| 2 | 5 | 3234 |
| 2 | 6 | 8695 |
| 2 | 7 | 6587 |
| 2 | 8 | 5688 |
+------+---------+---------+
......more rows,nearly one billons rows
so my expected result is
resultDF
+-----+-------------+--------------+----------------------------+
| id | min_value1 | max_value1 | results |
+-----+-------------+--------------+----------------------------+
| 1 | 0 | 3 | [4345,3434,4676,3454] |
| 1 | 5 | 9 | [5778,5674,3456,6590,5461] |
| 2 | 1 | 4 | [2343,4946,4353,4354] |
| 2 | 6 | 8 | [8695,6587,5688] |
+-----+-------------+--------------+----------------------------+
My stupid solutions is
def tempFunction(id:Int,dfA:DataFrame,dfB:DataFrame): DataFrame ={
val dfa = dfA.filter("id ="+ id)
val dfb = dfB.filter("id ="+ id)
val arr = dfb.groupBy("id")
.agg(collect_list(struct("min_value1","max_value1"))
.collect()
val rangArray = arr(0)(1).asInstanceOf[Seq[Row]] // get range array of id
// initial a resultDF to store each query's results
val min_value1 = rangArray(0).get(0).asInstanceOf[Int]
val max_value1 = rangArray(0).get(1).asInstanceOf[Int]
val s = "value1 between "+min_value1+" and "+ max_value1
var resultDF = dfa.filter(s).groupBy("id")
.agg(collect_list("value1").as("results"),
min("value1").as("min_value1"),
max("value1").as("max_value1"))
for( i <-1 to timePairArr.length-1){
val temp_min_value1 = rangArray(0).get(0).asInstanceOf[Int]
val temp_max_value1 = rangArray(0).get(1).asInstanceOf[Int]
val query = "value1 between "+temp_min_value1+" and "+ temp_max_value1
val tempResultDF = dfa.filter(query).groupBy("id")
.agg(collect_list("value1").as("results"),
min("value1").as("min_value1"),
max("value1").as("max_value1"))
resultDF = resultDF.union(tempResultDF)
}
return resultDF
}
def myFunction():DataFrame = {
val dfA = spark.read.parquet(routeA)
val dfB = spark.read.parquet(routeB)
val idArrays = dfB.select("id").distinct().collect()
// initial result
var resultDF = tempFunction(idArrays(0).get(0).asInstanceOf[Int],dfA,dfB)
//tranverse all id
for(i<-1 to idArrays.length-1){
val tempDF = tempFunction(idArrays(i).get(0).asInstanceOf[Int],dfA,dfB)
resultDF = resultDF.union(tempDF)
}
return resultDF
}
Maybe you don't want to see my brute force code.it's idea is
finalResult = null;
for each id in dfB:
for query condition of this id:
tempResult = query dfA
union tempResult to finalResult
I've tried my algorithms, it cost almost 50 hours.
Does anybody has a more efficient way ? Very thanks.
Assuming that your DFB is small dataset, I am trying to give the below solution.
Try using a Broadcast Join like below
import org.apache.spark.sql.functions.broadcast
dfA.join(broadcast(dfB), col("dfA.id") === col("dfB.id") && col("dfA.value1") >= col("dfB.min_value1") && col("dfA.value1") <= col("dfB.max_value1")).groupBy(col("dfA.id")).agg(collect_list(struct("value2").as("results"));
BroadcastJoin is like a Map Side Join. This will materialize the smaller data to all the mappers. This will improve the performance by omitting the required sort-and-shuffle phase during a reduce step.
Some points i would like you to avoid:
Never use collect(). When a collect operation is issued on a RDD, the dataset is copied to the driver.
If your data is too big you might get memory out of bounds exception.
Try using take() or takeSample() instead.
It is obvious that when two dataframes/datasets are involved in calculation then join should be performed. So join is a must step for you. But when should you join is the important question.
I would suggest to aggregate and reduce rows in dataframes as much as possible before joining as it would reduce shuffling.
In your case you can reduce only dfA as you need exact dfB with a column added from dfA meeting the condition
So you can groupBy id and aggregate dfA so that you get one row of each id, then you can perform the join. And then you can use a udf function for your logic of calculation
comments are provided for clarity and explanation
import org.apache.spark.sql.functions._
//udf function to filter only the collected value2 which has value1 within range of min_value1 and max_value1
def selectRangedValue2Udf = udf((minValue: Int, maxValue: Int, list: Seq[Row])=> list.filter(row => row.getAs[Int]("value1") <= maxValue && row.getAs[Int]("value1") >= minValue).map(_.getAs[Int]("value2")))
dfA.groupBy("id") //grouping by id
.agg(collect_list(struct("value1", "value2")).as("collection")) //collecting all the value1 and value2 as structs
.join(dfB, Seq("id"), "right") //joining both dataframes with id
.select(col("id"), col("min_value1"), col("max_value1"), selectRangedValue2Udf(col("min_value1"), col("max_value1"), col("collection")).as("results")) //calling the udf function defined above
which should give you
+---+----------+----------+------------------------------+
|id |min_value1|max_value1|results |
+---+----------+----------+------------------------------+
|1 |0 |3 |[4345, 3434, 4676, 3454] |
|1 |5 |9 |[5778, 5674, 3456, 6590, 5461]|
|2 |1 |4 |[2343, 4946, 4353, 4354] |
|2 |6 |8 |[8695, 6587, 5688] |
+---+----------+----------+------------------------------+
I hope the answer is helpful