I have a DataFrame df:
+-----+--------+----------+-----+
|count|currency| date|value|
+-----+--------+----------+-----+
| 3| GBP|2021-01-14| 4|
| 102| USD|2021-01-14| 3|
| 234| EUR|2021-01-14| 5|
| 28| GBP|2021-01-16| 5|
| 48| USD|2021-01-16| 7|
| 68| EUR|2021-01-15| 6|
| 20| GBP|2021-01-15| 1|
| 33| EUR|2021-01-17| 2|
| 106| GBP|2021-01-17| 10|
+-----+--------+----------+-----+
I have a separate DataFrame for the exchange rates to USD:
val exchange_rate = spark.read.format("csv").option("header", "true").load("/Users/khan/data/exchange_rate.csv")
exchange_rate.show()
+----------+------------+--------+
|INSERTTIME|EXCAHNGERATE|CURRENCY|
+----------+------------+--------+
|2021-01-14|    0.731422|     GBP|
|2021-01-14|    0.784125|     EUR|
|2021-01-15|    0.701922|     GBP|
|2021-01-15|    0.731422|     EUR|
|2021-01-16|    0.851422|     GBP|
|2021-01-16|    0.721128|     EUR|
|2021-01-17|    0.771621|     GBP|
|2021-01-17|    0.751426|     EUR|
+----------+------------+--------+
I want to convert the GBP and EUR values to USD in df by looking up the rate for the corresponding date in the exchange_rate DataFrame.
My idea: import com.currency_converter.CurrencyConverter from http://xavierguihot.com/currency_converter/#com.currency_converter.CurrencyConverter$
Is there a simpler way to do it?
You can use a correlated subquery (a fancy way of doing a join). min() collapses the subquery result to a single rate, and coalesce falls back to a rate of 1 when no match is found, e.g. for the USD rows:
df.createOrReplaceTempView("df1")
exchange_rate.createOrReplaceTempView("df2")
val result = spark.sql("""
select count, 'USD' as currency, date, value,
value * coalesce(
(select min(df2.EXCAHNGERATE)
from df2
where df1.date = df2.INSERTTIME and df1.currency = df2.CURRENCY),
1 -- use 1 as exchange rate if no exchange rate found
) as converted
from df1
""")
result.show
+-----+--------+----------+-----+---------+
|count|currency| date|value|converted|
+-----+--------+----------+-----+---------+
| 3| USD|2021-01-14| 4| 2.925688|
| 102| USD|2021-01-14| 3| 3.0|
| 234| USD|2021-01-14| 5| 3.920625|
| 28| USD|2021-01-16| 5| 4.25711|
| 48| USD|2021-01-16| 7| 7.0|
| 68| USD|2021-01-15| 6| 4.388532|
| 20| USD|2021-01-15| 1| 0.701922|
| 33| USD|2021-01-17| 2| 1.502852|
| 106| USD|2021-01-17| 10| 7.71621|
+-----+--------+----------+-----+---------+
Alternatively, you can join both DataFrames and operate row by row:
val dfJoin = df.join(exchange_rate, df.col("date") === exchange_rate.col("INSERTTIME") &&
  df.col("currency") === exchange_rate.col("CURRENCY"), "left")
// USD rows have no match in exchange_rate, so coalesce keeps their value unchanged
dfJoin.withColumn("USD", col("value") * coalesce(col("EXCAHNGERATE"), lit(1))).show()
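Since the rate table is small, it can also help to hint a broadcast join so Spark avoids shuffling the large side (a sketch reusing the same DataFrames; broadcast lives in org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.broadcast

val dfJoinBroadcast = df.join(broadcast(exchange_rate),
  df.col("date") === exchange_rate.col("INSERTTIME") &&
    df.col("currency") === exchange_rate.col("CURRENCY"), "left")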
I have a data set that looks like this:
+------------------------+-----+
| timestamp| zone|
+------------------------+-----+
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | B|
| 2019-01-01 01:05:00 | C|
| 2019-01-01 02:05:00 | B|
| 2019-01-01 02:05:00 | B|
+------------------------+-----+
For each hour I need to count which zone had the most rows and end up with a table that looks like this:
+-----+-----+-----+
| hour| zone| max |
+-----+-----+-----+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+-----+-----+-----+
My instructions say that I need to use the Window function along with "group by" to find my max count.
I've tried a few things but I'm not sure if I'm close. Any help would be appreciated.
You can use two consecutive window functions to get your result (note that row_number picks one zone arbitrarily if two zones tie within an hour):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df
  .withColumn("hour", hour($"timestamp"))
  .withColumn("cnt", count("*").over(Window.partitionBy($"hour", $"zone")))
  .withColumn("rnb", row_number().over(Window.partitionBy($"hour").orderBy($"cnt".desc)))
  .where($"rnb" === 1)
  .select($"hour", $"zone", $"cnt".as("max"))
You can also use window functions combined with a group-by on DataFrames.
In your case you can use the rank() over(partition by) window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// first, group by hour and zone and count
val df_group = data_tms
  .select(hour(col("timestamp")).as("hour"), col("zone"))
  .groupBy(col("hour"), col("zone"))
  .agg(count("zone").as("max"))

// second, rank within each hour by max in descending order
val df_rank = df_group
  .select(col("hour"),
    col("zone"),
    col("max"),
    rank().over(Window.partitionBy(col("hour")).orderBy(col("max").desc)).as("rank"))

// finally, keep only the rows with rank = 1
df_rank
  .select(col("hour"),
    col("zone"),
    col("max"))
  .where(col("rank") === 1)
  .orderBy(col("hour"))
  .show()
/*
+----+----+---+
|hour|zone|max|
+----+----+---+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+----+----+---+
*/
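On Spark 3.x you can skip the ranking pass entirely with the max_by aggregate (a sketch using the same data_tms input; like row_number, it picks one zone arbitrarily on ties):
import org.apache.spark.sql.functions._

val result = data_tms
  .groupBy(hour(col("timestamp")).as("hour"), col("zone"))
  .agg(count("zone").as("cnt"))
  .groupBy(col("hour"))
  .agg(expr("max_by(zone, cnt)").as("zone"), max(col("cnt")).as("max"))
  .orderBy(col("hour"))

result.show()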
Is there a way to replace null values in a Spark DataFrame with the next non-null value from a following row? An additional row_count column has been added for window partitioning and ordering. More specifically, I'd like to achieve the following result:
+---------+----+      +---------+----+
|row_count|  id|      |row_count|  id|
+---------+----+      +---------+----+
|        1|null|      |        1| 109|
|        2| 109|      |        2| 109|
|        3|null|      |        3| 108|
|        4|null|      |        4| 108|
|        5| 108|  =>  |        5| 108|
|        6|null|      |        6| 110|
|        7| 110|      |        7| 110|
|        8|null|      |        8|null|
|        9|null|      |        9|null|
|       10|null|      |       10|null|
+---------+----+      +---------+----+
I tried the code below, but it does not give the proper result:
val ss = dataframe.select($"*",
  sum(when(dataframe("id").isNull || dataframe("id") === "", 1).otherwise(0))
    .over(Window.orderBy($"row_count")) as "value")
val window1 = Window.partitionBy($"value").orderBy("id").rowsBetween(0, Long.MaxValue)
val selectList = ss.withColumn("id_fill_from_below", last("id").over(window1))
  .drop($"row_count").drop($"value")
Here is an approach:
1. Filter the non-nulls (dfNonNulls)
2. Filter the nulls (dfNulls)
3. Find the right value for each null id, using a join and a Window function
4. Fill the null DataFrame (dfNullFills)
5. Union dfNonNulls and dfNullFills
data.csv
row_count,id
1,
2,109
3,
4,
5,108
6,
7,110
8,
9,
10,
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

var df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")
var dfNulls = df.filter(
$"id".isNull
).withColumnRenamed(
"row_count","row_count_nulls"
).withColumnRenamed(
"id","id_nulls"
)
val dfNonNulls = df.filter(
$"id".isNotNull
).withColumnRenamed(
"row_count","row_count_values"
).withColumnRenamed(
"id","id_values"
)
dfNulls = dfNulls.join(
dfNonNulls, $"row_count_nulls" lt $"row_count_values","left"
).select(
$"id_nulls",$"id_values",$"row_count_nulls",$"row_count_values"
)
val window = Window.partitionBy("row_count_nulls").orderBy("row_count_values")
val dfNullFills = dfNulls.withColumn(
"rn", row_number.over(window)
).where($"rn" === 1).drop("rn").select(
$"row_count_nulls".alias("row_count"),$"id_values".alias("id"))
dfNullFills.union(dfNonNulls).orderBy($"row_count").show()
which results in:
+---------+----+
|row_count| id|
+---------+----+
| 1| 109|
| 2| 109|
| 3| 108|
| 4| 108|
| 5| 108|
| 6| 110|
| 7| 110|
| 8|null|
| 9|null|
| 10|null|
+---------+----+
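For smaller data there is a shorter alternative (a sketch, not part of the answer above): a single unpartitioned window with first(..., ignoreNulls = true). Note that an unpartitioned window pulls every row into one partition, so it only suits small DataFrames:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// look forward from the current row and take the first non-null id
val w = Window.orderBy("row_count")
  .rowsBetween(Window.currentRow, Window.unboundedFollowing)
df.withColumn("id", first("id", ignoreNulls = true).over(w))
  .orderBy("row_count")
  .show()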
I have a DataFrame called productPrice with columns ID and Price. I want to get the ID with the highest price; if two IDs share the same highest price, I want only the one with the smaller ID number. I use
val highestprice = productPrice.orderBy(asc("ID")).orderBy(desc("price")).limit(1)
But the result I get is not the row with the smaller ID; instead I get the one with a larger ID. I don't know what's wrong with my logic, any idea?
Try this. A second orderBy does not act as a tiebreaker; it simply replaces the previous sort, so your code effectively orders by price only. Sort by both columns in a single call instead:
scala> val df = Seq((4, 30),(2,50),(3,10),(5,30),(1,50),(6,25)).toDF("id","price")
df: org.apache.spark.sql.DataFrame = [id: int, price: int]
scala> df.show
+---+-----+
| id|price|
+---+-----+
| 4| 30|
| 2| 50|
| 3| 10|
| 5| 30|
| 1| 50|
| 6| 25|
+---+-----+
scala> df.sort(desc("price"), asc("id")).show
+---+-----+
| id|price|
+---+-----+
| 1| 50|
| 2| 50|
| 4| 30|
| 5| 30|
| 6| 25|
| 3| 10|
+---+-----+
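Since only the top row is needed, you can follow the sort with limit(1):
scala> df.sort(desc("price"), asc("id")).limit(1).show
+---+-----+
| id|price|
+---+-----+
|  1|   50|
+---+-----+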
Approaching the same problem using Spark SQL:
val df = Seq((4, 30),(2,50),(3,10),(5,30),(1,50),(6,25)).toDF("id","price")
df.createOrReplaceTempView("prices")
--
%sql
SELECT id, price
FROM prices
ORDER BY price DESC, id ASC
LIMIT 1
How can I replace empty values in a column Field1 of DataFrame df?
+------+------+
|Field1|Field2|
+------+------+
|      |    AA|
|    12|    BB|
+------+------+
This command does not produce the expected result:
df.na.fill("Field1",Seq("Anonymous"))
The expected result:
+---------+------+
|   Field1|Field2|
+---------+------+
|Anonymous|    AA|
|       12|    BB|
+---------+------+
You can also try this. It handles blank/empty strings as well as nulls: na.replace first turns empty strings into real nulls, and na.fill then fills those nulls.
df.show()
+------+------+
|Field1|Field2|
+------+------+
| | AA|
| 12| BB|
| 12| null|
+------+------+
df.na.replace(Seq("Field1", "Field2"), Map("" -> null))
  .na.fill("Anonymous", Seq("Field2", "Field1"))
  .show(false)
+---------+---------+
|Field1 |Field2 |
+---------+---------+
|Anonymous|AA |
|12 |BB |
|12 |Anonymous|
+---------+---------+
Fill: "Returns a new DataFrame that replaces null or NaN values in numeric columns with value."
Two things:
1. An empty string is not null or NaN, so you'll have to use a case statement for that.
2. fill silently does nothing when given a text value for a numeric column, as the examples below show.
Failing Null Replace with Fill / Text:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill("Anonymous", Seq("f1")).show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
Working Example - Using Null With All Numbers:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| 1| AA|
| 12| BB|
+---+---+
Failing Example (Empty String instead of Null):
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
Case Statement Fix Example:
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.select(when(col("f1") === "", "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show
+---------+---+
| f1| f2|
+---------+---+
|Anonymous| AA|
| 12| BB|
+---------+---+
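To handle both the empty string and a real null in one pass, a combined sketch (same b as above; assumes f1 is a string column):
scala> b.select(when(col("f1") === "" || col("f1").isNull, "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show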
You can use the code below when the DataFrame has any number of columns.
Note: when writing data to formats like Parquet, a column of null type is not supported; it has to be cast to a concrete type first.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val df = Seq(
  (1, ""),
  (2, "Ram"),
  (3, "Sam"),
  (4, "")
).toDF("ID", "Name")

// add a null-typed column, cast to String so formats like Parquet accept it
val inputDf = df.withColumn("NulType", lit(null).cast(StringType))
//Output
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1| | null|
| 2| Ram| null|
| 3| Sam| null|
| 4| | null|
+---+----+-------+
// replace all blank strings in the DataFrame with real nulls
val colName = inputDf.columns // array of all column names
val data = inputDf.na.replace(colName, Map("" -> null))
data.show()
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1|null| null|
| 2| Ram| null|
| 3| Sam| null|
| 4|null| null|
+---+----+-------+
I have a DataFrame with the data below:
scala> nonFinalExpDF.show
+---+----------+
| ID| DATE|
+---+----------+
| 1| null|
| 2|2016-10-25|
| 2|2016-10-26|
| 2|2016-09-28|
| 3|2016-11-10|
| 3|2016-10-12|
+---+----------+
From this DataFrame I want to get below DataFrame
+---+----------+----------+
| ID| DATE| INDICATOR|
+---+----------+----------+
| 1| null| 1|
| 2|2016-10-25| 0|
| 2|2016-10-26| 1|
| 2|2016-09-28| 0|
| 3|2016-11-10| 1|
| 3|2016-10-12| 0|
+---+----------+----------+
Logic -
For the latest (max) DATE of an ID, the INDICATOR value should be 1, and 0 for the others.
For an ID whose only DATE is null, the INDICATOR should be 1.
Please suggest a simple way to do that.
Try this. Within each id, the row holding the max date is the one whose LEAD is null; the COALESCE guards against a null date in the following row:
df.createOrReplaceTempView("df")
spark.sql("""
  SELECT id, date,
         CAST(LEAD(COALESCE(date, TO_DATE('1900-01-01')), 1)
              OVER (PARTITION BY id ORDER BY date) IS NULL AS INT) AS indicator
  FROM df""")