DataFrame user-defined function not applied unless I change column name - scala

I want to convert a DataFrame column using an implicit function definition.
I have my own DataFrame wrapper type defined, which adds extra functions:
class NpDataFrame(df: DataFrame) {
  def bytes2String(colName: String): DataFrame = df
    .withColumn(colName + "_tmp", udf((x: Array[Byte]) => bytes2String(x)).apply(col(colName)))
    .drop(colName)
    .withColumnRenamed(colName + "_tmp", colName)
}
Then I define my implicit conversion class:
object NpDataFrameImplicits {
implicit def toNpDataFrame(df: DataFrame): NpDataFrame = new NpDataFrame(df)
}
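For reference, the post never shows the bytes2String(Array[Byte]) helper that the UDF calls; it is assumed to be defined elsewhere and reachable from the class. A minimal sketch of what it presumably looks like, inferred from the UUID-formatted output below, is:

import java.nio.ByteBuffer
import java.util.UUID

// Assumed helper (not shown in the question): interprets the 16-byte array as a UUID.
def bytes2String(bytes: Array[Byte]): String = {
  val bb = ByteBuffer.wrap(bytes)
  new UUID(bb.getLong, bb.getLong).toString
}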
So finally, here is what I do in a small FunSuite unit test:
test("example: call to bytes2String") {
val df: DataFrame = ...
df.select("header.ID").show() // (1)
df.bytes2String("header.ID").withColumnRenamed("header.ID", "id").select("id").show() // (2)
df.bytes2String("header.ID").select("header.ID").show() // (3)
}
Show #1
+-------------------------------------------------+
|ID |
+-------------------------------------------------+
|[62 BF 58 0C 6C 59 48 9C 91 13 7B 97 E7 29 C0 2F]|
|[5C 54 49 07 00 24 40 F4 B3 0E E7 2C 03 B8 06 3C]|
|[5C 3E A2 21 01 D9 4C 1B 80 4E F9 92 1D 4A FE 26]|
|[08 C1 55 89 CE 0D 45 8C 87 0A 4A 04 90 2D 51 56]|
+-------------------------------------------------+
Show #2
+------------------------------------+
|id |
+------------------------------------+
|62bf580c-6c59-489c-9113-7b97e729c02f|
|5c544907-0024-40f4-b30e-e72c03b8063c|
|5c3ea221-01d9-4c1b-804e-f9921d4afe26|
|08c15589-ce0d-458c-870a-4a04902d5156|
+------------------------------------+
Show #3
+-------------------------------------------------+
|ID |
+-------------------------------------------------+
|[62 BF 58 0C 6C 59 48 9C 91 13 7B 97 E7 29 C0 2F]|
|[5C 54 49 07 00 24 40 F4 B3 0E E7 2C 03 B8 06 3C]|
|[5C 3E A2 21 01 D9 4C 1B 80 4E F9 92 1D 4A FE 26]|
|[08 C1 55 89 CE 0D 45 8C 87 0A 4A 04 90 2D 51 56]|
+-------------------------------------------------+
As you can see here, the third show (i.e. without the column renaming) does not work as expected and shows a non-converted ID column. Does anyone know why?
EDIT:
Output of df.select(col("header.ID") as "ID").bytes2String("ID").show():
+------------------------------------+
|ID |
+------------------------------------+
|62bf580c-6c59-489c-9113-7b97e729c02f|
|5c544907-0024-40f4-b30e-e72c03b8063c|
|5c3ea221-01d9-4c1b-804e-f9921d4afe26|
|08c15589-ce0d-458c-870a-4a04902d5156|
+------------------------------------+

Let me explain what is happening with your conversion function using the example below.
First, create a DataFrame:
import org.apache.spark.rdd.RDD

val jsonString: String =
  """{
    | "employee": {
    | "id": 12345,
    | "name": "krishnan"
    | },
    | "_id": 1
    |}""".stripMargin
val jsonRDD: RDD[String] = sc.parallelize(Seq(jsonString, jsonString))
val df: DataFrame = sparkSession.read.json(jsonRDD)
df.printSchema()
Output structure:
root
|-- _id: long (nullable = true)
|-- employee: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
A conversion function similar to yours:
def myConversion(myDf: DataFrame, colName: String): DataFrame = {
  myDf.withColumn(colName + "_tmp", udf((x: Long) => (x + 1).toString).apply(col(colName)))
    .drop(colName)
    .withColumnRenamed(colName + "_tmp", colName)
}
Scenario 1#
Do the conversion for root level field.
myConversion(df, "_id").show()
myConversion(df, "_id").select("_id").show()
Result:
+----------------+---+
| employee|_id|
+----------------+---+
|[12345,krishnan]| 2|
|[12345,krishnan]| 2|
+----------------+---+
+---+
|_id|
+---+
| 2|
| 2|
+---+
Scenario 2# Do the conversion for employee.id. Here, when we use "employee.id" as the column name, the data frame gets a new column literally named "employee.id" added at the root level; the nested field itself is untouched. This is the correct behavior.
myConversion(df, "employee.id").show()
myConversion(df, "employee.id").select("employee.id").show()
Result:
+---+----------------+-----------+
|_id| employee|employee.id|
+---+----------------+-----------+
| 1|[12345,krishnan]| 12346|
| 1|[12345,krishnan]| 12346|
+---+----------------+-----------+
+-----+
| id|
+-----+
|12345|
|12345|
+-----+
Scenario 3# Select the inner field to root level and then perform conversion.
myConversion(df.select("employee.id"), "id").show()
Result:
+-----+
| id|
+-----+
|12346|
|12346|
+-----+
My new conversion function takes a struct-type field, performs the conversion, and stores the result back into a struct-type field. Here we pass the employee field and convert only its id field, but the change is applied to the employee field at the root level.
case class Employee(id: String, name: String)
def myNewConversion(myDf: DataFrame, colName: String): DataFrame = {
  myDf.withColumn(colName + "_tmp", udf((row: Row) => Employee((row.getLong(0) + 1).toString, row.getString(1))).apply(col(colName)))
    .drop(colName)
    .withColumnRenamed(colName + "_tmp", colName)
}
Your scenario 3#, using my conversion function:
myNewConversion(df, "employee").show()
myNewConversion(df, "employee").select("employee.id").show()
Result#
+---+----------------+
|_id| employee|
+---+----------------+
| 1|[12346,krishnan]|
| 1|[12346,krishnan]|
+---+----------------+
+-----+
| id|
+-----+
|12346|
|12346|
+-----+
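Applied back to the original header.ID case, the same struct-based idea would mean converting the whole header struct in place. A rough sketch, assuming header contains only the binary ID field (the post does not show its full schema) and that the bytes2String(Array[Byte]) helper is in scope:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical shape of the converted struct; the real `header` may have more fields.
case class Header(ID: String)

def headerBytes2String(df: DataFrame): DataFrame = {
  val convert = udf((r: Row) => Header(bytes2String(r.getAs[Array[Byte]]("ID"))))
  df.withColumn("header_tmp", convert(col("header")))
    .drop("header")
    .withColumnRenamed("header_tmp", "header")
}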

Related

how to identify digital chars as date from a string column in spark dataframe

I would like to extract the digit characters from a string column of a Spark DataFrame.
e.g.
id val (string)
58 [dttg] 201805_mogtca_onvt
91 20050221_frcas
17 201709 dcsevas
I need:
id a_date year month
58 201805 2018 05
91 20050221 2005 02
17 201709 2017 09
I am trying:
df.withColumn('date', DF.to_date(F.col('val').isdigit() # how to get digital chars ?
You should start by removing all non-numeric characters, through a regexp_replace for instance:
df.withColumn("a_date", regexp_replace($"val", "[^0-9]", ""))
Then, since you seem to have a different time format in each row, the easiest way is to use substrings:
df.withColumn("a_date", regexp_replace($"val", "[^0-9]", ""))
.withColumn("year", substring($"a_date", 0, 4))
.withColumn("month", substring($"a_date", 5, 2))
.drop("val")
INPUT
+---+-------------------------+
|id |val |
+---+-------------------------+
|58 |[dttg] 201805_mogtca_onvt|
|91 |20050221_frcas |
|17 |201709 dcsevas |
+---+-------------------------+
OUTPUT
+---+--------+----+-----+
|id |a_date |year|month|
+---+--------+----+-----+
|58 |201805 |2018|05 |
|91 |20050221|2005|02 |
|17 |201709 |2017|09 |
+---+--------+----+-----+
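If you also need a real date column rather than a digit string (as the to_date attempt in the question suggests), one option is to parse the leading digits; this is a sketch, assuming every row starts with at least yyyyMM and that you are on Spark 2.2+ (for to_date with a format string):

import org.apache.spark.sql.functions.{regexp_replace, substring, to_date}
import spark.implicits._ // assumes `spark` is your SparkSession

// Keep the digits, then parse the leading yyyyMM into a DateType column
// (the day of month defaults to the 1st).
val withDate = df
  .withColumn("a_date", regexp_replace($"val", "[^0-9]", ""))
  .withColumn("parsed_date", to_date(substring($"a_date", 0, 6), "yyyyMM"))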

Full outer join in RDD scala spark

I have the two files below:
file1
0000003 杉山______ 26 F
0000005 崎村______ 50 F
0000007 梶川______ 42 F
file2
0000005 82 79 16 21 80
0000001 46 39 8 5 21
0000004 58 71 20 10 6
0000009 60 89 33 18 6
0000003 30 50 71 36 30
0000007 50 2 33 15 62
Now, I would like to join the rows that have the same value in field 1.
I want something like this:
0000005 崎村______ 50 F 82 79 16 21 80
0000003 杉山______ 26 F 30 50 71 36 30
0000007 梶川______ 42 F 50 2 33 15 62
You can use a DataFrame join instead of an RDD join; that will be easier. You can refer to my sample code below, I hope it helps.
I am assuming your data is in the same format as shown above. If it is in CSV or any other format, you can skip Step 2 and update Step 1 to match the data format. If you need the output in RDD format, you can use Step 5; otherwise ignore it, as mentioned in the code comments.
I have modified the data (A______, B______, C______) just for readability.
//Step1: Loading file1 and file2 into corresponding DataFrames in text format
val df1 = spark.read.format("text").load("<path of file1>")
val df2 = spark.read.format("text").load("<path of file2>")
//Step2: Splitting the single "value" column into multiple columns for the join key
val file1 = df1
  .withColumn("col1", split($"value", " ")(0))
  .withColumn("col2", split($"value", " ")(1))
  .withColumn("col3", split($"value", " ")(2))
  .withColumn("col4", split($"value", " ")(3))
  .select("col1", "col2", "col3", "col4")
/*
+-------+-------+----+----+
|col1 |col2 |col3|col4|
+-------+-------+----+----+
|0000003|A______|26 |F |
|0000005|B______|50 |F |
|0000007|C______|42 |F |
+-------+-------+----+----+
*/
val file2 = df2
  .withColumn("col1", split($"value", " ")(0))
  .withColumn("col2", split($"value", " ")(1))
  .withColumn("col3", split($"value", " ")(2))
  .withColumn("col4", split($"value", " ")(3))
  .withColumn("col5", split($"value", " ")(4))
  .withColumn("col6", split($"value", " ")(5))
  .select("col1", "col2", "col3", "col4", "col5", "col6")
/*
+-------+----+----+----+----+----+
|col1 |col2|col3|col4|col5|col6|
+-------+----+----+----+----+----+
|0000005|82 |79 |16 |21 |80 |
|0000001|46 |39 |8 |5 |21 |
|0000004|58 |71 |20 |10 |6 |
|0000009|60 |89 |33 |18 |6 |
|0000003|30 |50 |71 |36 |30 |
|0000007|50 |2 |33 |15 |62 |
+-------+----+----+----+----+----+
*/
//Step3: alias the DataFrames so columns can be referenced unambiguously after the join
val file01 = file1.as("f1")
val file02 = file2.as("f2")
//Step4: Joining files on Key
file01.join(file02,col("f1.col1") === col("f2.col1"))
/*
+-------+-------+----+----+-------+----+----+----+----+----+
|col1 |col2 |col3|col4|col1 |col2|col3|col4|col5|col6|
+-------+-------+----+----+-------+----+----+----+----+----+
|0000005|B______|50 |F |0000005|82 |79 |16 |21 |80 |
|0000003|A______|26 |F |0000003|30 |50 |71 |36 |30 |
|0000007|C______|42 |F |0000007|50 |2 |33 |15 |62 |
+-------+-------+----+----+-------+----+----+----+----+----+
*/
// Step5: if you want the joined data in RDD format then you can use the command below
file01.join(file02,col("f1.col1") === col("f2.col1")).rdd.collect
/*
Array[org.apache.spark.sql.Row] = Array([0000005,B______,50,F,0000005,82,79,16,21,80], [0000003,A______,26,F,0000003,30,50,71,36,30], [0000007,C______,42,F,0000007,50,2,33,15,62])
*/
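One caveat: the join above is an inner join. Since the question title asks for a full outer join, the join type can be passed explicitly; unmatched keys (0000001, 0000004, 0000009) then show up with nulls on the missing side:

import org.apache.spark.sql.functions.col

// Same key condition, but keeping rows from both sides.
val fullJoined = file01.join(file02, col("f1.col1") === col("f2.col1"), "full_outer")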
I found the solution, here is my code:
val rddPair1 = logData1.map { x =>
  val data = x.split(" ")
  val index = 0
  var value = ""
  val key = data(index)
  for (i <- 0 to data.length - 1) {
    if (i != index) {
      value += data(i) + " "
    }
  }
  (key, value.trim)
}

val rddPair2 = logData2.map { x =>
  val data = x.split(" ")
  val index = 0
  var value = ""
  val key = data(index)
  for (i <- 0 to data.length - 1) {
    if (i != index) {
      value += data(i) + " "
    }
  }
  (key, value.trim)
}

rddPair1.join(rddPair2).collect().foreach { f =>
  println(f._1 + " " + f._2._1 + " " + f._2._2)
}
result:
0000003 杉山______ 26 F 30 50 71 36 30
0000005 崎村______ 50 F 82 79 16 21 80
0000007 梶川______ 42 F 50 2 33 15 62
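If you actually need full-outer semantics at the RDD level (as the title suggests), the pair-RDD API also provides fullOuterJoin, where each side becomes an Option:

// Keys missing from one file yield None on that side.
rddPair1.fullOuterJoin(rddPair2).collect().foreach { case (key, (left, right)) =>
  println(key + " " + left.getOrElse("") + " " + right.getOrElse(""))
}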

How to convert sequential numerical processing of Cassandra table data to parallel in Spark?

We are doing some mathematical modelling on data from a Cassandra table using the spark-cassandra connector, and the execution is currently sequential. How do you parallelize this for faster execution?
I'm new to Spark and I tried a few things, but I'm unable to understand how to use tabular data in the map, groupBy, and reduceBy functions. If someone can explain (with some code snippets) how to parallelize tabular data, it would be really helpful.
import org.apache.spark.sql.{Row, SparkSession}
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
class SparkExample(sparkSession: SparkSession, pathToCsv: String) {
  private val sparkContext = sparkSession.sparkContext
  sparkSession.stop()
  val conf = new SparkConf(true)
    .set("spark.cassandra.connection.host", "127.0.0.1")
    .setAppName("cassandra").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def testExample(): Unit = {
    val KNMI_rdd = sc.cassandraTable("dbks1", "knmi_w")
    val Table_count = KNMI_rdd.count()
    val KNMI_idx = KNMI_rdd.zipWithIndex
    val idx_key = KNMI_idx.map { case (k, v) => (v, k) }
    var i = 0
    var n: Int = Table_count.toInt
    println(Table_count)
    for (i <- 1 to n if i < n) {
      println(i)
      val Row = idx_key.lookup(i)
      println(Row)
      val firstRow = Row(0)
      val yyyy_var = firstRow.get[Int]("yyyy")
      val mm_var = firstRow.get[Double]("mm")
      val dd_var = firstRow.get[Double]("dd")
      val dr_var = firstRow.get[Double]("dr")
      val tg_var = firstRow.get[Double]("tg")
      val ug_var = firstRow.get[Double]("ug")
      val loc_var = firstRow.get[String]("loc")
      val pred_factor = (((0.15461 * tg_var) + (0.8954 * ug_var)) / ((0.0000451 * dr_var) + 0.0004487))
      println(yyyy_var, mm_var, dd_var, loc_var)
      println(pred_factor)
    }
  }
}
//test data
// loc | yyyy | mm | dd | dr | tg | ug
//-----+------+----+----+-----+-----+----
// AMS | 2019 | 1 | 1 | 35 | 5 | 84
// AMS | 2019 | 1 | 2 | 76 | 34 | 74
// AMS | 2019 | 1 | 3 | 46 | 33 | 85
// AMS | 2019 | 1 | 4 | 35 | 1 | 84
// AMS | 2019 | 1 | 5 | 29 | 0 | 93
// AMS | 2019 | 1 | 6 | 32 | 25 | 89
// AMS | 2019 | 1 | 7 | 42 | 23 | 89
// AMS | 2019 | 1 | 8 | 68 | 75 | 92
// AMS | 2019 | 1 | 9 | 98 | 42 | 86
// AMS | 2019 | 1 | 10 | 92 | 12 | 76
// AMS | 2019 | 1 | 11 | 66 | 0 | 71
// AMS | 2019 | 1 | 12 | 90 | 56 | 85
// AMS | 2019 | 1 | 13 | 83 | 139 | 90
Edit 1:
I tried using the map function and I'm able to calculate the mathematical computations, but how do I add the keys defined by WeatherId in front of these values?
case class Weather( loc: String, yyyy: Int, mm: Int, dd: Int,dr: Double, tg: Double, ug: Double)
case class WeatherId(loc: String, yyyy: Int, mm: Int, dd: Int)
val rows = dataset1
  .map(line => Weather(
    line.getAs[String]("loc"),
    line.getAs[Int]("yyyy"),
    line.getAs[Int]("mm"),
    line.getAs[Int]("dd"),
    line.getAs[Double]("dr"),
    line.getAs[Double]("tg"),
    line.getAs[Double]("ug")
  ))

val pred_factor = rows
  .map(x => (((x.dr * betaz) + (x.tg * betay)) + (x.ug) * betaz))
Thanks
TL;DR:
Use a DataFrame/Dataset instead of an RDD.
The argument for DataFrames over RDDs is long, but the short of it is that DataFrames and their typed alternative, Datasets, outperform the low-level RDDs.
With the spark-cassandra connector you can configure the input split size, which dictates the partition size in Spark; more partitions means more parallelism.
val lastdf = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne",
    "spark.cassandra.input.split.size_in_mb" -> "48" // smaller size = more partitions
  ))
  .load()
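To make the modelling itself parallel, the per-row formula from the question can then be expressed as a column expression instead of a driver-side loop. A sketch, assuming lastdf exposes the loc/yyyy/mm/dd/dr/tg/ug columns shown in the test data:

import org.apache.spark.sql.functions.{col, lit}

// Each executor computes pred_factor for its own partition; no per-row lookup() on the driver.
val predicted = lastdf.withColumn(
  "pred_factor",
  (lit(0.15461) * col("tg") + lit(0.8954) * col("ug")) /
    (lit(0.0000451) * col("dr") + lit(0.0004487))
)

predicted.select("loc", "yyyy", "mm", "dd", "pred_factor").show()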

How to get count of group by two columns

Below is myDf:
fi_Sk sec_SK END_DATE
89 42 20160122
89 42 20150330
51 43 20140116
51 43 20130616
82 43 20100608
82 43 20160608
Below is my code:
val count = myDf.withColumn("END_DATE", unix_timestamp(col("END_DATE"), dateFormat))
  .groupBy(col("sec_SK"), col("fi_Sk"))
  .agg(count("sec_SK").as("Visits"), max("END_DATE").as("Recent_Visit"))
  .withColumn("Recent_Visit", from_unixtime(col("Recent_Visit"), dateFormat))
I am getting the visits incorrectly; I need to group by (fi_Sk and sec_SK) for counting visits.
The result should be like below:
fi_Sk sec_SK Visits END_DATE
89 42 2 20160122
51 43 2 20140116
82 43 2 20160608
Currently I am getting:
fi_Sk sec_SK Visits END_DATE
89 42 2 20160122
51 43 2 20140116
groupBy with aggregation would collapse all the rows in a group into one row, but from the expected output it seems that you want to populate the count for each row in the group. A window function is the appropriate solution for you:
import org.apache.spark.sql.expressions.Window
def windowSpec = Window.partitionBy("fi_Sk", "sec_SK")
import org.apache.spark.sql.functions._
df.withColumn("Visits", count("fi_Sk").over(windowSpec))
// .sort("fi_Sk", "END_DATE")
// .show(false)
//
// +-----+------+--------+------+
// |fi_Sk|sec_SK|END_DATE|Visits|
// +-----+------+--------+------+
// |51 |42 |20130616|2 |
// |51 |42 |20140116|2 |
// |89 |44 |20100608|1 |
// |89 |42 |20150330|2 |
// |89 |42 |20160122|2 |
// +-----+------+--------+------+
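If, as in the expected output, you want just one row per (fi_Sk, sec_SK) pair together with the most recent END_DATE, a plain groupBy/agg also works. A sketch, assuming END_DATE stays in the sortable yyyyMMdd string form shown:

import org.apache.spark.sql.functions.{count, max}

// One row per (fi_Sk, sec_SK): the number of visits and the latest END_DATE.
val visits = myDf
  .groupBy("fi_Sk", "sec_SK")
  .agg(count("*").as("Visits"), max("END_DATE").as("Recent_Visit"))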

Spark Dataframe Group by having New Indicator Column

I need to group by "KEY" Column and need to check whether "TYPE_CODE" column has both "PL" and "JL" values , if so then i need to add a Indicator Column as "Y" else "N"
Example :
//Input Values
val values = List(List("66","PL") ,
List("67","JL") , List("67","PL"),List("67","PO"),
List("68","JL"),List("68","PO")).map(x =>(x(0), x(1)))
import spark.implicits._
//created a dataframe
val cmc = values.toDF("KEY","TYPE_CODE")
cmc.show(false)
+---+---------+
|KEY|TYPE_CODE|
+---+---------+
|66 |PL       |
|67 |JL       |
|67 |PL       |
|67 |PO       |
|68 |JL       |
|68 |PO       |
+---+---------+
Expected Output :
For each "KEY", If it has "TYPE_CODE" has both PL & JL then Y
else N
+---+---------+---------+
|KEY|TYPE_CODE|Indicator|
+---+---------+---------+
|66 |PL       |N        |
|67 |JL       |Y        |
|67 |PL       |Y        |
|67 |PO       |Y        |
|68 |JL       |N        |
|68 |PO       |N        |
+---+---------+---------+
For example,
67 has both PL & JL - So "Y"
66 has only PL - So "N"
68 has only JL - So "N"
One option:
1) collect TYPE_CODE as a list;
2) check whether it contains both of the specific strings;
3) then flatten the list back out with explode:
(cmc.groupBy("KEY")
.agg(collect_list("TYPE_CODE").as("TYPE_CODE"))
.withColumn("Indicator",
when(array_contains($"TYPE_CODE", "PL") && array_contains($"TYPE_CODE", "JL"), "Y").otherwise("N"))
.withColumn("TYPE_CODE", explode($"TYPE_CODE"))).show
+---+---------+---------+
|KEY|TYPE_CODE|Indicator|
+---+---------+---------+
| 68| JL| N|
| 68| PO| N|
| 67| JL| Y|
| 67| PL| Y|
| 67| PO| Y|
| 66| PL| N|
+---+---------+---------+
Another option:
Group by KEY and use agg to create two separate indicator columns (one for JL and one for PL), then calculate the combined indicator.
Join the result with the original DataFrame.
Altogether:
val indicators = cmc.groupBy("KEY").agg(
sum(when($"TYPE_CODE" === "PL", 1).otherwise(0)) as "pls",
sum(when($"TYPE_CODE" === "JL", 1).otherwise(0)) as "jls"
).withColumn("Indicator", when($"pls" > 0 && $"jls" > 0, "Y").otherwise("N"))
val result = cmc.join(indicators, "KEY")
.select("KEY", "TYPE_CODE", "Indicator")
This might be slower than #Psidom's answer, but might be safer - collect_list might be problematic if you have a huge number of matches for a specific key (that list would have to be stored in a single worker's memory).
EDIT:
In case the input is known to be unique (i.e. JL / PL would only appear once per key, at most), indicators could be created using simple count aggregation, which is (arguably) easier to read:
val indicators = cmc
.where($"TYPE_CODE".isin("PL", "JL"))
.groupBy("KEY").count()
.withColumn("Indicator", when($"count" === 2, "Y").otherwise("N"))