I put together some custom classes that I would like to use as native datatypes in SparkSQL. I see UDTs were just opened up to the public, but they are quite hard to figure out. Is there any way I'd be able to do this?
Example
case class IPv4(ipAddress: String){
// IPv4 converted to a number
val addrL: Long = IPv4ToLong(ipAddress)
}
// Will read in a bunch of random IPs in the form {"ipAddress": "60.80.39.27"}
val IPv4DF: DataFrame = spark.read.json(path)
IPv4DF.createOrReplaceTempView("IPv4")
spark.sql(
"""SELECT *
FROM IPv4
WHERE ipAddress.addrL > 100000"""
)
You can construct a Dataset and filter using the case class addrL attribute:
case class IPv4(ipAddress: String){
// IPv4 converted to a number
val addrL: Long = IPv4ToLong(ipAddress)
}
val ds = Seq("60.80.39.27").toDF("ipAddress").as[IPv4]
ds.filter(_.addrL > 100000).show
+-----------+
| ipAddress|
+-----------+
|60.80.39.27|
+-----------+
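The IPv4ToLong helper used above isn't shown in the question; a minimal sketch, assuming a plain dotted-quad string and no validation, could be:
// Hypothetical helper (not from the original post): converts "a.b.c.d" to its numeric value
def IPv4ToLong(ipAddress: String): Long =
  ipAddress.split('.').foldLeft(0L)((acc, octet) => acc * 256L + octet.toLong)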
Related
How do I write the code below in a type-safe manner in Spark Scala with the Dataset API?
val schema: StructType = Encoders.product[CaseClass].schema
//read json from a file
val readAsDataSet: Dataset[CaseClass] = sparkSession.read.option("mode", mode).schema(schema).json(path).as[CaseClass]
//below code needs to be written in type safe way:
val someDF= readAsDataSet.withColumn("col1",explode(col("col_to_be_exploded")))
.select(from_unixtime(col("timestamp").divide(1000))
.as("date"), col("col1"))
As someone in the comments said, you can create a Dataset[CaseClass] and do your operations on that. Let's set it up:
import spark.implicits._
case class MyTest (timestamp: Long, col_explode: Seq[String])
val df = Seq(
MyTest(1673850366000L, Seq("some", "strings", "here")),
MyTest(1271850365998L, Seq("pasta", "with", "cream")),
MyTest(611850366000L, Seq("tasty", "food"))
).toDF("timestamp", "col_explode").as[MyTest]
df.show(false)
+-------------+---------------------+
|timestamp |col_explode |
+-------------+---------------------+
|1673850366000|[some, strings, here]|
|1271850365998|[pasta, with, cream] |
|611850366000 |[tasty, food] |
+-------------+---------------------+
Typically, you can do many operations with the map function and plain Scala.
A map function returns the same number of elements as its input has. The explode function that you're using, however, does not return the same number of elements. You can implement this behaviour using the flatMap function.
So, using the Scala language and the flatMap function together, you can do something like this:
import java.time.LocalDateTime
import java.time.ZoneOffset
case class Exploded (datetime: String, exploded: String)
val output = df.flatMap{ case MyTest(timestamp, col_explode) =>
col_explode.map( value => {
val date = LocalDateTime.ofEpochSecond(timestamp/1000, 0, ZoneOffset.UTC).toString
Exploded(date, value)
}
)
}
output.show(false)
+-------------------+--------+
|datetime |exploded|
+-------------------+--------+
|2023-01-16T06:26:06|some |
|2023-01-16T06:26:06|strings |
|2023-01-16T06:26:06|here |
|2010-04-21T11:46:05|pasta |
|2010-04-21T11:46:05|with |
|2010-04-21T11:46:05|cream |
|1989-05-22T14:26:06|tasty |
|1989-05-22T14:26:06|food |
+-------------------+--------+
As you can see, we've created a second case class called Exploded, which we use to type our output dataset. The output dataset has the type org.apache.spark.sql.Dataset[Exploded], so everything is completely type safe.
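Because the result is typed, any further operations stay checked at compile time as well, for example:
// Still a Dataset[Exploded]; the lambda is checked against the case class fields
val onlyPasta = output.filter(_.exploded == "pasta")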
I have seen some variations of this question asked but haven't found exactly what I'm looking for. Here is the question:
I have some report names that I have collected in a dataframe and pivoted. The trouble I am having is with the resilience of report_name. I can't be assured that every 90 days data will be present and that Rpt1, Rpt2, and Rpt3 will all be there. So how do I go about adding a calculation ONLY if the column is present? I have outlined how my code looks right now. It works if all columns are there, but I'd like to future-proof it so that if a report is not present in the 90-day window the pipeline does not error out, but instead just skips the .withColumn addition.
df1 = (reports.alias("r")
  .filter("current_date <= 90")
  .groupBy(uniqueid)
  .pivot(report_name))
/*
Result would be the following columns: uniqueid, Rpt1, Rpt2, Rpt3
+---+----+----+----+
|id |Rpt1|Rpt2|Rpt3|
+---+----+----+----+
|205|72  |36  |12  |
+---+----+----+----+
*/
df2 = (df1.alias("d1")
  .withColumn("new_calc", expr("Rpt2/Rpt3")))
You can catch the error with a Try monad and return the original dataframe if withColumn fails.
import scala.util.Try
val df2 = Try(df1.withColumn("new_calc", expr("Rpt2/Rpt3")))
.getOrElse(df1)
.alias("d1")
You can also define it as a method if you want to reuse it:
import org.apache.spark.sql.{Column, DataFrame}
def withColumnIfExist(df: DataFrame, colName: String, col: Column): DataFrame =
  Try(df.withColumn(colName, col)).getOrElse(df)
val df3 = withColumnIfExist(df1, "new_calc", expr("Rpt2/Rpt3"))
.alias("d1")
And if you need to chain multiple transformations you can use it with transform:
val df4 = df1.alias("d1")
.transform(withColumnIfExist(_, "new_calc", expr("Rpt2/Rpt3")))
.transform(withColumnIfExist(_, "new_calc_2", expr("Rpt1/Rpt2")))
Or you can implement it as an extension method with an implicit class:
implicit class RichDataFrame(df: DataFrame) {
def withColumnIfExist(colName: String, col: Column): DataFrame =
Try(df.withColumn(colName, col)).getOrElse(df)
}
val df5 = df1.alias("d1")
.withColumnIfExist("new_calc", expr("Rpt2/Rpt3"))
.withColumnIfExist("new_calc_2", expr("Rpt1/Rpt2"))
Since withColumn works on any Dataset, if you want withColumnIfExist to work generically for all Datasets, including DataFrames:
implicit class RichDataset[A](ds: Dataset[A]) {
def withColumnIfExist(colName: String, col: Column): DataFrame =
Try(ds.withColumn(colName, col)).getOrElse(ds.toDF)
}
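If you prefer not to rely on catching the analysis exception, a sketch that checks the schema up front is also possible (the requiredCols parameter is an addition for illustration, not from the original question):
implicit class SafeDataset[A](ds: Dataset[A]) {
  // Apply the new column only when every required input column is present
  def withColumnIfColumnsExist(colName: String, requiredCols: Seq[String], col: Column): DataFrame =
    if (requiredCols.forall(ds.columns.contains)) ds.withColumn(colName, col) else ds.toDF
}

val df6 = df1.withColumnIfColumnsExist("new_calc", Seq("Rpt2", "Rpt3"), expr("Rpt2/Rpt3"))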
I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced I need to generate a new CSV with complete details like
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map{student =>
val name = getName(student.StudentId)
val course = getCourse(student.StudentId)
Array(name, course, student.City)
}.mapPartitions{data =>
val stringWriter = new StringWriter();
val csvWriter =new CSVWriter(stringWriter);
csvWriter.writeAll(data.toList)
Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)
//Functions defined to get details
def getName(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.StudentName}
}
def getCourse(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.Course}
}
Problem
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects ?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, it looks like the main issue with your code is the lookup functions: getName and getCourse basically do nothing, because their return type is Unit. Using an if without an else means that for some inputs there is no return value, which makes the entire function return Unit.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())
// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
val details = studentB.value(student.StudentId)
Array(details.StudentName, details.Course, student.City)
}
.map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did
// save as text file
resultStrings.saveAsTextFile(outputPath)
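Note that the snippet above assumes both RDDs already hold parsed records rather than the raw lines produced by sc.textFile. A minimal parsing sketch, with hypothetical Student and StudentDetails case classes (and studentPath / studentDetailsPath as placeholder paths), might look like this:
// Hypothetical case classes mirroring the two CSV files
case class Student(StudentId: String, City: String)
case class StudentDetails(StudentId: String, StudentName: String, Course: String)

// Naive parsing: drop the header line and split on commas (no quoting handled)
val studentRDD = sc.textFile(studentPath)
  .filter(!_.startsWith("StudentId"))
  .map(_.split(",")).map(a => Student(a(0), a(1)))

val studentdetailsRDD = sc.textFile(studentDetailsPath)
  .filter(!_.startsWith("StudentId"))
  .map(_.split(",")).map(a => StudentDetails(a(0), a(1), a(2)))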
Spark has great support for joins and for writing to file. The join takes only one line of code, and so does the write.
Hand-writing that code is error-prone, hard to read and most likely much slower.
val df1 = Seq((101,"NDLS"),
(102,"Mumbai")
).toDF("id", "city")
val df2 = Seq((101,"ABC","C001"),
(102,"XYZ","C002")
).toDF("id", "name", "course")
val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created, containing a single file which is the final result.
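For completeness, a sketch that reads the actual CSV files instead of hard-coding the rows (the file paths and output directory are placeholders):
// Read both files with headers, join on StudentId, and write a single CSV
val students = spark.read.option("header", "true").csv("Student.csv")
val details = spark.read.option("header", "true").csv("StudentDetails.csv")
students.join(details, "StudentId")
  .select("StudentName", "Course", "City")
  .repartition(1)
  .write.csv("output")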
I have records like the ones below. I would like to convert a single record into two records, with the values EXTERNAL and INTERNAL respectively, if the 3rd attribute is All.
Input dataset:
Surender,cts,INTERNAL
Raja,cts,EXTERNAL
Ajay,tcs,All
Expected output:
Surender,cts,INTERNAL
Raja,cts,EXTERNAL
Ajay,tcs,INTERNAL
Ajay,tcs,EXTERNAL
My Spark Code :
case class Customer(name:String,organisation:String,campaign_type:String)
val custRDD = sc.textFile("/user/cloudera/input_files/customer.txt")
val mapRDD = custRDD.map(record => record.split(","))
.map(arr => (arr(0), arr(1), arr(2)))
.map(tuple => {
val name = tuple._1.trim
val organisation = tuple._2.trim
val campaign_type = tuple._3.trim.toUpperCase
Customer(name, organisation, campaign_type)
})
mapRDD.toDF().registerTempTable("customer_processed")
sqlContext.sql("SELECT * FROM customer_processed").show
Could someone help me fix this issue?
Since it's Scala...
If you want to write more idiomatic Scala code (perhaps trading some performance, due to the lack of optimizations, for more idiomatic code), you can use the flatMap operator (the signature below has the implicit parameter removed):
flatMap[U](func: (T) ⇒ TraversableOnce[U]): Dataset[U] Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
NOTE: flatMap is equivalent to the explode function, but you don't have to register a UDF (as in the other answer).
A solution could be as follows:
// I don't care about the names of the columns since we use Scala
// as you did when you tried to write the code
scala> input.show
+--------+---+--------+
| _c0|_c1| _c2|
+--------+---+--------+
|Surender|cts|INTERNAL|
| Raja|cts|EXTERNAL|
| Ajay|tcs| All|
+--------+---+--------+
val result = input.
as[(String, String, String)].
flatMap { case r @ (name, org, campaign) =>
if ("all".equalsIgnoreCase(campaign)) {
Seq("INTERNAL", "EXTERNAL").map { cname =>
(name, org, cname)
}
} else Seq(r)
}
scala> result.show
+--------+---+--------+
| _1| _2| _3|
+--------+---+--------+
|Surender|cts|INTERNAL|
| Raja|cts|EXTERNAL|
| Ajay|tcs|INTERNAL|
| Ajay|tcs|EXTERNAL|
+--------+---+--------+
Comparing the performance of the two queries, i.e. the flatMap-based vs the explode-based query, I think the explode-based one may be slightly faster and better optimized, as some of the code is under Spark's control (using logical operators before they get mapped to their physical counterparts). With flatMap the entire optimization is your responsibility as a Scala developer.
The red-bounded area (in the query-plan screenshot, not included here) corresponds to the flatMap-based code, and the warning signs mark the very expensive DeserializeToObject and SerializeFromObject operators.
What's interesting is the number of Spark jobs per query and their durations. It appears that the explode-based query takes 2 Spark jobs and 200 ms, while the flatMap-based one takes only 1 Spark job and 43 ms.
That surprises me a lot and suggests that the flatMap-based query could be faster (!)
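If you want to inspect these operators on your own data, a quick sketch is to print the extended plan of the flatMap-based query built above:
// Shows the parsed, analyzed, optimized and physical plans, including the
// DeserializeToObject / SerializeFromObject operators mentioned above
result.explain(true)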
You can use a UDF to transform the campaign_type column into a Seq of strings mapping to the campaign types, and then explode:
val campaignType_ : (String => Seq[String]) = {
case s if s == "ALL" => Seq("EXTERNAL", "INTERNAL")
case s => Seq(s)
}
val campaignType = udf(campaignType_)
val df = Seq(("Surender", "cts", "INTERNAL"),
("Raja", "cts", "EXTERNAL"),
("Ajay", "tcs", "ALL"))
.toDF("name", "organisation", "campaign_type")
val step1 = df.withColumn("campaign_type", campaignType($"campaign_type"))
step1.show
// +--------+------------+--------------------+
// | name|organisation| campaign_type|
// +--------+------------+--------------------+
// |Surender| cts| [INTERNAL]|
// | Raja| cts| [EXTERNAL]|
// | Ajay| tcs|[EXTERNAL, INTERNAL]|
// +--------+------------+--------------------+
val step2 = step1.select($"name", $"organisation", explode($"campaign_type"))
step2.show
// +--------+------------+--------+
// | name|organisation| col|
// +--------+------------+--------+
// |Surender| cts|INTERNAL|
// | Raja| cts|EXTERNAL|
// | Ajay| tcs|EXTERNAL|
// | Ajay| tcs|INTERNAL|
// +--------+------------+--------+
EDIT:
You don't actually need a UDF; you can use a when().otherwise expression instead in step1, as follows:
val step1 = df.withColumn("campaign_type",
  when(col("campaign_type") === "ALL", array(lit("EXTERNAL"), lit("INTERNAL")))
    .otherwise(array(col("campaign_type"))))
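Putting it together, a sketch of the full UDF-free pipeline (combining the when/otherwise step with the explode-based select from above):
val noUdfResult = df
  .withColumn("campaign_type",
    when(col("campaign_type") === "ALL", array(lit("EXTERNAL"), lit("INTERNAL")))
      .otherwise(array(col("campaign_type"))))
  .select($"name", $"organisation", explode($"campaign_type").as("campaign_type"))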
Hello, what I want to do is cut a URL so that it is all in a specific format.
At the moment my URLs look like this:
[https://url.com/xxxxxxx/xxxxx/xxxxxx]
I just want to cut everything after the third / and then count my data, so that I have an overview of how many URLs I have in my data.
I hope someone can help me.
User-defined functions (UDFs) are what you need. Assume you have the following input:
case class Data(url: String)
val urls = sqlContext.createDataFrame(Seq(Data("http://google.com/q=dfsdf"), Data("https://fb.com/gsdgsd")))
urls.registerTempTable("urls")
Now you can define UDF that gets only hostname from URL:
def getHost(url: String) = url.split('/')(2) //naive implementation, for example only
sqlContext.udf.register("getHost", getHost _)
And get your data transformed using SQL:
val hosts = sqlContext.sql("select getHost(url) as host from urls")
hosts.show()
Result:
+----------+
| host|
+----------+
|google.com|
| fb.com|
+----------+
If you prefer the Scala DSL, you can use your UDF too:
import org.apache.spark.sql.functions.udf
val getHostUdf = udf(getHost _)
val hostsDsl = urls.select(getHostUdf($"url") as "host")
The result will be exactly the same.
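As an alternative sketch, if you are on a Spark version where the built-in parse_url function is available, you can skip the custom UDF entirely and also get the per-host count the question asks for:
// Built-in alternative: extract the host and count the URLs per host
val hostCounts = sqlContext.sql(
  "select parse_url(url, 'HOST') as host, count(*) as url_count from urls group by parse_url(url, 'HOST')")
hostCounts.show()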