How to explode a Spark DataFrame - Scala

I exploded a nested schema, but I am not getting the result I want.
Before the explode, the DataFrame looks like this:
df.show()
+----------+----------------------------------------------------------+
|CaseNumber|SourceId                                                  |
+----------+----------------------------------------------------------+
|0         |[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}] |
|1         |[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}] |
+----------+----------------------------------------------------------+
I want it to be like this
+----------+---+----------+
|CaseNumber|Sku|ContractId|
+----------+---+----------+
|0         |1  |22        |
|1         |3  |24        |
+----------+---+----------+

Here is one way, using the built-in get_json_object function:
import org.apache.spark.sql.functions.get_json_object
val df = Seq(
(0, """[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}]"""),
(1, """[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}]"""))
.toDF("CaseNumber", "SourceId")
df.withColumn("sku", get_json_object($"SourceId", "$[0].id").cast("int"))
.withColumn("ContractId", get_json_object($"SourceId", "$[1].id").cast("int"))
.drop("SourceId")
.show
// +----------+---+----------+
// |CaseNumber|sku|ContractId|
// +----------+---+----------+
// | 0| 1| 22|
// | 1| 3| 24|
// +----------+---+----------+
UPDATE
After our discussion we realised that the data in question is of type array<struct<id:string,type:string>>, not a plain string. Here is the solution for that schema:
df.withColumn("sku", $"SourceId".getItem(0).getField("id"))
.withColumn("ContractId", $"SourceId".getItem(1).getField("id"))
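
Since the question asks about explode, here is a minimal sketch of an explode-then-pivot variant that does not rely on the position of each entry inside the array. It assumes the JSON-string form of SourceId from the first snippet; the SparkSession setup, the local master, and the names raw, entrySchema and pivoted are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, first, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("explode-pivot-sketch").getOrCreate()
import spark.implicits._

val raw = Seq(
  (0, """[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}]"""),
  (1, """[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}]""")
).toDF("CaseNumber", "SourceId")

// Schema of the JSON text: an array of {id, type} structs.
val entrySchema = ArrayType(StructType(Seq(
  StructField("id", StringType),
  StructField("type", StringType))))

val pivoted = raw
  // Parse the JSON string and emit one row per array element.
  .select(col("CaseNumber"), explode(from_json(col("SourceId"), entrySchema)).as("entry"))
  // Flatten the struct so the pivot works on plain columns.
  .select(col("CaseNumber"), col("entry.type").as("type"), col("entry.id").as("id"))
  .groupBy("CaseNumber")
  // Listing the pivot values up front avoids an extra pass to discover them.
  .pivot("type", Seq("Sku", "ContractID"))
  .agg(first("id"))
  .orderBy("CaseNumber")

pivoted.show()
```

If the column is already an array of structs, the from_json call can be dropped and explode applied to the column directly.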

Related

Removing alphabets from alphanumeric values present in column of dataframe of spark

The two columns of the DataFrame look like this:
SKU | COMPSKU
PT25M | PT10M
PT3H | PT20M
TH | QR12
S18M | JH
I am using Spark with Scala.
How can I remove all letters and retain only the numbers?
Expected output:
25|10
3|20
0|12
18|0
You could also do it this way.
df.withColumn(
"SKU",
when(regexp_replace(col("SKU"),"[a-zA-Z]","")==="",0
).otherwise(regexp_replace(col("SKU"),"[a-zA-Z]",""))
).withColumn(
"COMPSKU",
when(regexp_replace(col("COMPSKU"),"[a-zA-Z]","")==="", 0
).otherwise(regexp_replace(col("COMPSKU"),"[a-zA-Z]",""))
).show()
/*
+-----+-------+
| SKU|COMPSKU|
+-----+-------+
| 25 | 10 |
| 3 | 20 |
| 0 | 12 |
| 18 | 0 |
+-----+-------+
*/
Try the regexp_replace function, then use a when/otherwise expression to replace empty values with 0.
Example:
df.show()
/*
+-----+-------+
| SKU|COMPSKU|
+-----+-------+
|PT25M| PT10M|
| PT3H| PT20M|
| TH| QR12|
| S18M| JH|
+-----+-------+
*/
df.withColumn("SKU",regexp_replace(col("SKU"),"[a-zA-Z]","")).
withColumn("COMPSKU",regexp_replace(col("COMPSKU"),"[a-zA-Z]","")).
withColumn("SKU",when(length(trim(col("SKU")))===0,lit(0)).otherwise(col("SKU"))).
withColumn("COMPSKU",when(length(trim(col("COMPSKU")))===0,lit(0)).otherwise(col("COMPSKU"))).
show()
/*
+---+-------+
|SKU|COMPSKU|
+---+-------+
| 25| 10|
| 3| 20|
| 0| 12|
| 18| 0|
+---+-------+
*/
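
Both answers repeat the same expression for every column. As a minimal sketch, the logic can be pulled into one helper and folded over a list of column names. The helper name digitsOrZero, the SparkSession setup and the sample DataFrame are assumptions made for illustration; the regular expression is the one used in the answers above.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, regexp_replace, when}

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("digits-or-zero-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("PT25M", "PT10M"),
  ("PT3H", "PT20M"),
  ("TH", "QR12"),
  ("S18M", "JH")
).toDF("SKU", "COMPSKU")

// Hypothetical helper: strip ASCII letters, and fall back to "0" when nothing is left.
def digitsOrZero(c: Column): Column = {
  val digits = regexp_replace(c, "[a-zA-Z]", "")
  when(digits === "", lit("0")).otherwise(digits)
}

// Apply the helper to every listed column; the results stay string-typed.
val cleaned = Seq("SKU", "COMPSKU").foldLeft(df) { (acc, name) =>
  acc.withColumn(name, digitsOrZero(col(name)))
}

cleaned.show()
```

Keeping the values as strings sidesteps any cast of an empty string; a cast to an integer type can be added afterwards if numeric columns are needed.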

Apache Spark calculating column value on the basis of distinct value of columns

I am processing the following tables and I would like to compute a new column (outcome) based on the distinct value of 2 other columns.
| id1 | id2 | outcome
| 1 | 1 | 1
| 1 | 1 | 1
| 1 | 3 | 2
| 2 | 5 | 1
| 3 | 1 | 1
| 3 | 2 | 2
| 3 | 3 | 3
The outcome should start at 1 and increase incrementally, based on the combined value of id1 and id2. Any hints on how this can be accomplished in Scala? row_number doesn't seem to be useful in this case.
The logic here is that for each unique value of id1 we will start numbering the outcome with min(id2) for corresponding id1 being assigned a value of 1.
You could try dense_rank().
With your example:
val df = sqlContext
.read
.option("sep","|")
.option("header", true)
.option("inferSchema",true)
.csv("/home/cloudera/files/tests/ids.csv") // Here we read the .csv files
.cache()
df.show()
df.printSchema()
df.createOrReplaceTempView("table")
sqlContext.sql(
"""
|SELECT id1, id2, DENSE_RANK() OVER(PARTITION BY id1 ORDER BY id2) AS outcome
|FROM table
|""".stripMargin).show()
output
+---+---+-------+
|id1|id2|outcome|
+---+---+-------+
| 2| 5| 1|
| 1| 1| 1|
| 1| 1| 1|
| 1| 3| 2|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
+---+---+-------+
Use a Window function to partition the rows by the first id, then order each partition by the second id.
Then you just need to assign a dense rank (dense_rank) over each Window partition.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
df
.withColumn("outcome", dense_rank().over(Window.partitionBy("id1").orderBy("id2")))
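
To see why row_number does not fit here while dense_rank does, the following sketch puts row_number, rank and dense_rank side by side over the same window. The SparkSession setup and the names ids, w and ranked are assumptions for illustration; the data is the table from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, rank, row_number}

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("ranking-sketch").getOrCreate()
import spark.implicits._

val ids = Seq((1, 1), (1, 1), (1, 3), (2, 5), (3, 1), (3, 2), (3, 3)).toDF("id1", "id2")

// One partition per id1, ordered by id2 inside the partition.
val w = Window.partitionBy("id1").orderBy("id2")

val ranked = ids
  .withColumn("row_number", row_number().over(w)) // ties get different numbers
  .withColumn("rank", rank().over(w))             // ties share a number, then a gap follows
  .withColumn("outcome", dense_rank().over(w))    // ties share a number, no gap
  .orderBy("id1", "id2", "row_number")

ranked.show()
```

For id1 = 1 the three rows get row_number 1, 2, 3 and rank 1, 1, 3, whereas dense_rank gives 1, 1, 2, which is the outcome column the question asks for.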

Spark (scala) dataframes - Check whether strings in column exist in a column of another dataframe

I have a Spark DataFrame, and I wish to check whether each string in a particular column exists in a column of another DataFrame.
I found a similar problem in "Spark (scala) dataframes - Check whether strings in column contain any items from a set",
but I want to check whether the strings in a column exist in a column of another DataFrame, not in a List or a Set as in that question. I don't know how to convert a column to a Set or a List, and I don't know of an "exists" method on a DataFrame.
My data is similar to this
df1:
+---+-----------------+
| id| url |
+---+-----------------+
| 1|google.com |
| 2|facebook.com |
| 3|github.com |
| 4|stackoverflow.com|
+---+-----------------+
df2:
+-----+------------+
| id | urldetail |
+-----+------------+
| 11 |google.com |
| 12 |yahoo.com |
| 13 |facebook.com|
| 14 |twitter.com |
| 15 |youtube.com |
+-----+------------+
Now I am trying to create a third column holding the result of a comparison: whether each string in the $"urldetail" column exists in $"url".
+---+------------+-------------+
| id| urldetail | check |
+---+------------+-------------+
| 11|google.com | 1 |
| 12|yahoo.com | 0 |
| 13|facebook.com| 1 |
| 14|twitter.com | 0 |
| 15|youtube.com | 0 |
+---+------------+-------------+
I want to use a UDF, but I don't know how to check whether a string exists in a column of a DataFrame. Please help!
I have a spark dataframe, and I wish to check whether each string in a
particular column contains any number of words from a pre-defined a
column of another dataframe.
Here is one way, using = or like:
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, _}
object CompareColumns extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder()
.appName(this.getClass.getName)
.config("spark.master", "local").getOrCreate()
import spark.implicits._
val df1 = Seq(
(1, "google.com"),
(2, "facebook.com"),
(3, "github.com"),
(4, "stackoverflow.com")).toDF("id", "url").as("first")
df1.show
val df2 = Seq(
(11, "google.com"),
(12, "yahoo.com"),
(13, "facebook.com"),
(14, "twitter.com")).toDF("id", "url").as("second")
df2.show
val df3 = df2.join(df1, expr("first.url like second.url"), "full_outer").select(
col("first.url")
, col("first.url").contains(col("second.url")).as("check")).filter("url is not null")
df3.na.fill(Map("check" -> false))
.show
}
Result :
+---+-----------------+
| id| url|
+---+-----------------+
| 1| google.com|
| 2| facebook.com|
| 3| github.com|
| 4|stackoverflow.com|
+---+-----------------+
+---+------------+
| id| url|
+---+------------+
| 11| google.com|
| 12| yahoo.com|
| 13|facebook.com|
| 14| twitter.com|
+---+------------+
+-----------------+-----+
| url|check|
+-----------------+-----+
| google.com| true|
| facebook.com| true|
| github.com|false|
|stackoverflow.com|false|
+-----------------+-----+
With a full outer join we can achieve this.
For more details, see my article covering all the joins in my LinkedIn post.
Note: instead of 0 for false and 1 for true, I have used boolean
values here; you can translate them into whatever you want.
UPDATE: if the second DataFrame has more rows,
you can use the following, which won't miss any rows from the second DataFrame:
val df3 = df2.join(df1, expr("first.url like second.url"), "full").select(
col("second.*")
, col("first.url").contains(col("second.url")).as("check"))
.filter("url is not null")
df3.na.fill(Map("check" -> false))
.show
Also, you can try regexp_extract, as shown in the post below:
https://stackoverflow.com/a/53880542/647053
Read in your data and, to be conservative when joining on strings, apply trim to remove surrounding whitespace:
val df = Seq((1, "google.com"), (2, "facebook.com"), (3, "github.com "), (4, "stackoverflow.com"))
.toDF("id", "url")
.select($"id", trim($"url").as("url"))
val df2 = Seq((11, "google.com"), (12, "yahoo.com"), (13, "facebook.com"), (14, "twitter.com"), (15, "youtube.com"))
.toDF("id", "urldetail")
.select($"id", trim($"urldetail").as("urldetail"))
// Left join keeps every row of df; flag is null where no urldetail matched
df.join(df2.withColumn("flag", lit(1)).drop("id"), df("url") === df2("urldetail"), "left_outer")
.withColumn("contains_bool", when($"flag" === 1, true).otherwise(false))
.drop("flag", "urldetail")
.show
+---+-----------------+-------------+
| id| url|contains_bool|
+---+-----------------+-------------+
| 1| google.com| true|
| 2| facebook.com| true|
| 3| github.com| false|
| 4|stackoverflow.com| false|
+---+-----------------+-------------+
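
The answers above flag the rows of the first DataFrame, while the question asks for a 0/1 check column on df2. A minimal sketch of that direction, assuming exact string matches and using the sample data from the question, could look like this. The SparkSession setup and the names knownUrls, known_url and checked are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("url-check-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(
  (1, "google.com"), (2, "facebook.com"), (3, "github.com"), (4, "stackoverflow.com")
).toDF("id", "url")

val df2 = Seq(
  (11, "google.com"), (12, "yahoo.com"), (13, "facebook.com"),
  (14, "twitter.com"), (15, "youtube.com")
).toDF("id", "urldetail")

// Deduplicate the lookup side so the join cannot multiply rows of df2.
val knownUrls = df1.select(col("url").as("known_url")).distinct()

val checked = df2
  // A left outer join keeps every row of df2; known_url is null when there is no match.
  .join(knownUrls, col("urldetail") === col("known_url"), "left_outer")
  .withColumn("check", when(col("known_url").isNotNull, 1).otherwise(0))
  .drop("known_url")
  .orderBy("id")

checked.show()
```

No UDF is needed: the join does the lookup, and when/otherwise turns the presence of a match into 1 or 0.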

Group rows that match sub string in a column using scala

I have a fol df:
Zip | Name | id |
abc | xyz | 1 |
def | wxz | 2 |
abc | wex | 3 |
bcl | rea | 4 |
abc | txc | 5 |
def | rfx | 6 |
abc | abc | 7 |
I need to group all the names that contain 'x' based on same Zip using scala
Desired Output:
Zip | Count |
abc | 3 |
def | 2 |
Any help is highly appreciated
As @Shaido mentioned in the comment above, all you need is filter, groupBy and an aggregation:
import org.apache.spark.sql.functions._
fol.filter(col("Name").contains("x")) //filtering the rows that has x in the Name column
.groupBy("Zip") //grouping by Zip column
.agg(count("Zip").as("Count")) //counting the rows in each groups
.show(false)
and you should have the desired output
+---+-----+
|Zip|Count|
+---+-----+
|abc|3 |
|def|2 |
+---+-----+
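
Filtering first drops any Zip that has no matching name at all (bcl in this data). If those groups should appear with a count of 0, a conditional aggregation is one option. This is a sketch under that assumption, not what the question strictly asks for; the SparkSession setup and the name counts are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("conditional-count-sketch").getOrCreate()
import spark.implicits._

val fol = Seq(
  ("abc", "xyz", 1), ("def", "wxz", 2), ("abc", "wex", 3), ("bcl", "rea", 4),
  ("abc", "txc", 5), ("def", "rfx", 6), ("abc", "abc", 7)
).toDF("Zip", "Name", "id")

val counts = fol
  .groupBy("Zip")
  // Each row contributes 1 when Name contains "x" and 0 otherwise,
  // so groups without a match are kept with a count of 0.
  .agg(sum(when(col("Name").contains("x"), 1).otherwise(0)).as("Count"))
  .orderBy("Zip")

counts.show()
```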
You want to group the data frame below.
+---+----+---+
|zip|name| id|
+---+----+---+
|abc| xyz| 1|
|def| wxz| 2|
|abc| wex| 3|
|bcl| rea| 4|
|abc| txc| 5|
|def| rfx| 6|
|abc| abc| 7|
+---+----+---+
Then filter the rows whose name contains 'x' and call groupBy with the zip column, followed by count, to get the result.
val groupedDf: DataFrame = df.filter(col("name").contains("x")).groupBy("zip").count()
groupedDf.show()
// +---+-----+
// |zip|count|
// +---+-----+
// |abc| 3|
// |def| 2|
// +---+-----+

How to iterate over pairs in a column in Scala

I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's dates in pairs and get the difference in days. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I would like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using Zeppelin notebook
One approach is to use lag(Date) over a Window partition together with a UDF that calculates the difference in days between consecutive rows, and then group the DataFrame to get the average difference in days. Note that Date_d_id is converted to yyyy-MM-dd format so that String ordering within the Window partitions is correct:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(0, "23-07-2017"),
(0, "26-07-2017"),
(0, "01-08-2017"),
(0, "25-08-2017"),
(1, "01-01-2016"),
(1, "04-01-2016"),
(1, "10-01-2016")
).toDF("Store_id", "Date_d_id")
def daysDiff = udf(
(d1: String, d2: String) => {
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS
DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
}
)
val df2 = df.
withColumn( "Date_ymd",
regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
withColumn( "Prior_date_ymd",
lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
withColumn( "Days_diff",
when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
You can use the lag function over a Window to get the previous date, then do some manipulation to get the final DataFrame you require.
First of all, the Date_d_id column needs to be converted to a timestamp format for sorting to work correctly:
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give your dataframe as
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
Then you can apply the lag function over the window and finally get the date difference:
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn("Day_diff", when(lag("Date_d_id", 1).over(windowSpec).isNull, null).otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
now the final step is to use groupBy and aggregation to find the average
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
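Note that avg already skips the rows where Day_diff is null. If you instead want the first row of each store to count as 0, divide the sum by the total number of rows per store:
laggeddf.groupBy("Store_id")
.agg((sum("Day_diff")/count(lit(1))).as("avg_diff"))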
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful
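
As a compact variant of the steps above, the string can be parsed with to_date and the difference taken directly with datediff and lag, without a UDF or a timestamp round trip. This is a sketch assuming the dd-MM-yyyy input format shown in the question; the SparkSession setup and the names stores, byStore, withDiff and avgDiff are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, datediff, lag, to_date}

// Assumption: a local session for experimentation (Zeppelin and spark-shell already provide `spark`).
val spark = SparkSession.builder().master("local[*]").appName("day-diff-sketch").getOrCreate()
import spark.implicits._

val stores = Seq(
  (0, "23-07-2017"), (0, "26-07-2017"), (0, "01-08-2017"), (0, "25-08-2017"),
  (1, "01-01-2016"), (1, "04-01-2016"), (1, "10-01-2016")
).toDF("Store_id", "Date_d_id")

// Order by the parsed date, not by the raw dd-MM-yyyy string.
val byStore = Window.partitionBy("Store_id").orderBy("Date")

val withDiff = stores
  .withColumn("Date", to_date(col("Date_d_id"), "dd-MM-yyyy"))
  // lag returns null on the first row of each store, so Day_diff is null there too.
  .withColumn("Day_diff", datediff(col("Date"), lag(col("Date"), 1).over(byStore)))

// avg ignores nulls, so only real gaps between consecutive dates are averaged.
val avgDiff = withDiff
  .groupBy("Store_id")
  .agg(avg("Day_diff").as("avg_diff"))
  .orderBy("Store_id")

avgDiff.show()
```

With this data the averages come out as 11.0 for store 0 and 4.5 for store 1, matching the avg-based result above.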