How to fill null values in a dataframe with UUIDs? - Scala

There is a dataframe with null values in one column (not all values are null), and I need to fill those nulls with a UUID. Is there a way to do this?
scala> val df = Seq(("stuff2",null,null), ("stuff2",null,Array("value1","value2")),("stuff3","stuff3",null)).toDF("field","field2","values")
df: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]
scala> df.show
+------+------+----------------+
| field|field2| values|
+------+------+----------------+
|stuff2| null| null|
|stuff2| null|[value1, value2]|
|stuff3|stuff3| null|
+------+------+----------------+
I tried the following, but every null in "field2" gets the same UUID.
scala> val fillDF = df.na.fill(java.util.UUID.randomUUID().toString(), Seq("field2"))
fillDF: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]
scala> fillDF.show
+------+--------------------+----------------+
| field| field2| values|
+------+--------------------+----------------+
|stuff2|d007ffae-9134-4ac...| null|
|stuff2|d007ffae-9134-4ac...|[value1, value2]|
|stuff3| stuff3| null|
+------+--------------------+----------------+
How can I do this, given that there can be more than 1,000,000 rows?

You can do it using a UDF and coalesce, like below.
import org.apache.spark.sql.functions.{coalesce, udf}
// the UDF is evaluated per row, so each null gets its own UUID
val uuidUdf = udf(() => java.util.UUID.randomUUID().toString())
val df2 = df.withColumn("field2", coalesce(df("field2"), uuidUdf()))
df2.show()
You will get a different UUID for each row, like below.
+------+--------------------+----------------+
| field| field2| values|
+------+--------------------+----------------+
|stuff2|fda6bc42-1265-407...| null|
|stuff2|3fa74767-abd7-405...|[value1, value2]|
|stuff3| stuff3| null|
+------+--------------------+----------------+
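Alternatively, on Spark 2.3 or later (an assumption; check your version), you can skip the Scala UDF and use the built-in uuid() SQL function through expr. It also produces a different value per row and works fine for millions of rows:
import org.apache.spark.sql.functions.{coalesce, expr}
// uuid() is evaluated per row, so only the null cells receive a fresh UUID
val df3 = df.withColumn("field2", coalesce(df("field2"), expr("uuid()")))
df3.show()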

You can easily do this by using a UDF; it can be something like this:
import java.util.UUID
import org.apache.spark.sql.functions.udf
def generateUUID(value: String): String = {
  if (Option(value).isDefined) {
    value
  } else {
    UUID.randomUUID().toString
  }
}
val generateUUIDUdf = udf(generateUUID _)
Now apply it to your dataframe accordingly:
df.withColumn("field2", generateUUIDUdf(df("field2"))).show
P.S: The code is not tested but it should work !

Related

Unable to get output for Spark case classes

I am trying to implement the following using Spark 2.4.8 and sbt 1.4.3 in IntelliJ.
code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(id:Int,Name:String,cityId:Long)
case class City(id:Long,Name:String)
val family=Seq(Person(1,"john",11),(2,"MAR",12),(3,"Iweta",10)).toDF
val cities=Seq(City(11,"boston"),(12,"dallas")).toDF
error:
Exception in thread "main" java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable found
at scala.reflect.runtime.JavaMirrors$JavaMirror.typeToJavaClass(JavaMirrors.scala:1300)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:192)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:60)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34)
at usingcaseclass$.main(usingcaseclass.scala:26)
at usingcaseclass.main(usingcaseclass.scala)
case class Salary(depName: String, empNo: Long, salary: Long)
val empsalary = Seq(Salary("sales", 1, 5000), Salary("personnel", 2, 3900)).toDS
empsalary.show(false)
value toDS is not a member of Seq[Salary]
val empsalary = Seq(Salary("sales", 1, 5000), Salary("personnel", 2, 3900)).toDS
Any idea how to prevent this error?
You have defined the Seq the wrong way: mixing case class instances with plain tuples results in a Seq[Product with Serializable], not the Seq[Person] that toDF needs.
The modified lines below should work for you (the same fix applies to cities, as shown after the output).
val family=Seq(Person(1,"john",11),Person(2,"MAR",12),Person(3,"Iweta",10))
family.toDF().show()
+---+-----+------+
| id| Name|cityId|
+---+-----+------+
| 1| john| 11|
| 2| MAR| 12|
| 3|Iweta| 10|
+---+-----+------+
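The same correction applies to the cities Seq from the question; every element must be a City, not a bare tuple. A minimal sketch of that fix:
// wrap every element in the case class so the Seq is Seq[City]
val cities = Seq(City(11, "boston"), City(12, "dallas"))
cities.toDF().show()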

Is it possible to apply when.otherwise functions within agg after a groupBy?

I've recently been trying to apply a default function to aggregated values as they are calculated, so that I don't have to reprocess them afterwards. However, I'm getting the following error.
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
From the following function.
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
And applying it the following way.
val initialDF = Seq(
("a", "b", 1),
("a", "b", null),
("a", null, 0)
).toDF("field1", "field2", "field3")
initialDF
.groupBy("field1", "field2")
.agg(
defaultUDF(functions.count("field3"), lit(0)).as("counter") // exception thrown here
)
Am I trying to do black magic here, or is there something I may be missing?
The issue is in the implementation of your UserDefinedFunction:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
// java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
// at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
// at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
// at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
// at org.apache.spark.sql.functions$.udf(functions.scala:3914)
// ... 65 elided
The error you're getting is basically because Spark cannot map the return type (i.e. Column) of your UserDefinedFunction defaultFunction to a Spark DataType.
Your defaultFunction has to accept and return Scala types that correspond with a Spark DataType. You can find the list of supported Scala types here: https://spark.apache.org/docs/latest/sql-reference.html#data-types
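For illustration only, a UDF with supported types would look something like the sketch below (a hypothetical nullIfEquals helper, not the recommended solution here):
import org.apache.spark.sql.functions.{lit, udf}
// accepts and returns plain Scala types, so Spark can derive the schema (StringType in, StringType out)
val nullIfEquals = udf((s: String, value: String) => if (s == value) null else s)
// example usage on a string column: df.select(nullIfEquals(col("field2"), lit("b")))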
In any case, you don't need a UserDefinedFunction if your function takes Columns and returns a Column. For your use-case, the following code will work:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
Record("a", "b", 1),
Record("a", "b", null),
Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
df
.groupBy("field1", "field2")
.agg(defaultFunction(count("field3"), lit(0)).as("counter"))
.show
// +------+------+-------+
// |field1|field2|counter|
// +------+------+-------+
// | a| b| 1|
// | a| null| 1|
// +------+------+-------+

I am facing an error when I create a dataset in Spark

Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
case class Drug(S_No: int,Name: string,Drug_Name: string,Gender: string,Drug_Value: int)
scala> val ds=spark.read.csv("file:///home/xxx/drug_detail.csv").as[Drug]
Error:
org.apache.spark.sql.AnalysisException: cannot resolve '`S_No`' given input columns: [_c1, _c2, _c3, _c4, _c0];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:110)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
Here is my test data:
1,Brandon Buckner,avil,female,525
2,Veda Hopkins,avil,male,633
3,Zia Underwood,paracetamol,male,980
4,Austin Mayer,paracetamol,female,338
5,Mara Higgins,avil,female,153
6,Sybill Crosby,avil,male,193
7,Tyler Rosales,paracetamol,male,778
8,Ivan Hale,avil,female,454
9,Alika Gilmore,paracetamol,female,833
10,Len Burgess,metacin,male,325
Generate a StructType schema using SQL Encoders, then pass that schema while reading the CSV file, and declare the case class fields as Int and String instead of the lowercase int and string.
Example:
Sample data:
cat drug_detail.csv
1,foo,bar,M,2
2,foo1,bar1,F,3
Spark-shell:
case class Drug(S_No: Int,Name: String,Drug_Name: String,Gender: String,Drug_Value: Int)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Drug].schema
val ds=spark.read.schema(schema).csv("file:///home/xxx/drug_detail.csv").as[Drug]
ds.show()
//+----+----+---------+------+----------+
//|S_No|Name|Drug_Name|Gender|Drug_Value|
//+----+----+---------+------+----------+
//| 1| foo| bar| M| 2|
//| 2|foo1| bar1| F| 3|
//+----+----+---------+------+----------+
Alternatively, if your CSV file contains a header row, include option("header", "true"), e.g.:
val ds=spark.read.option("header", "true").csv("file:///home/xxx/drug_detail.csv").as[Drug]
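If you prefer not to derive the schema from the case class, a hand-written StructType works as well (a sketch, assuming the same file path and column order as above):
import org.apache.spark.sql.types._
// explicit schema matching the Drug case class fields
val manualSchema = StructType(Seq(
  StructField("S_No", IntegerType),
  StructField("Name", StringType),
  StructField("Drug_Name", StringType),
  StructField("Gender", StringType),
  StructField("Drug_Value", IntegerType)))
val ds2 = spark.read.schema(manualSchema).csv("file:///home/xxx/drug_detail.csv").as[Drug]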

Need to flatten a dataframe on the basis of one column in Scala

I have a dataframe with a below schema
root
|-- name: string (nullable = true)
|-- roll: string (nullable = true)
|-- subjectID: string (nullable = true)
The values in the dataframe are as below
+-------------------+---------+--------------------+
| name| roll| SubjectID|
+-------------------+---------+--------------------+
| sam|ta1i3dfk4| xy|av|mm|
| royc|rfhqdbnb3| a|
| alcaly|ta1i3dfk4| xx|zz|
+-------------------+---------+--------------------+
I need to derive the dataframe by flattening SubjectID as below.
Please note: SubjectID is a string as well.
+-------------------+---------+--------------------+
| name| roll| SubjectID|
+-------------------+---------+--------------------+
| sam|ta1i3dfk4| xy|
| sam|ta1i3dfk4| av|
| sam|ta1i3dfk4| mm|
| royc|rfhqdbnb3| a|
| alcaly|ta1i3dfk4| xx|
| alcaly|ta1i3dfk4| zz|
+-------------------+---------+--------------------+
Any suggestions?
You can use the split and explode functions to flatten it.
Example:
import org.apache.spark.sql.functions.{col, explode, split}
val inputDF = Seq(
  ("sam", "ta1i3dfk4", "xy|av|mm"),
  ("royc", "rfhqdbnb3", "a"),
  ("alcaly", "rfhqdbnb3", "xx|zz")
).toDF("name", "roll", "subjectIDs")
// split `subjectIDs` on "|" and explode the resulting array into one row per ID
val resultDF = inputDF
  .withColumn("subjectIDs", split(col("subjectIDs"), "\\|"))
  .withColumn("subjectIDs", explode(col("subjectIDs")))
resultDF.show()
+------+---------+----------+
| name| roll|subjectIDs|
+------+---------+----------+
| sam|ta1i3dfk4| xy|
| sam|ta1i3dfk4| av|
| sam|ta1i3dfk4| mm|
| royc|rfhqdbnb3| a|
|alcaly|rfhqdbnb3| xx|
|alcaly|rfhqdbnb3| zz|
+------+---------+----------+
You can use flatMap on a Dataset. Full executable code:
package main
import org.apache.spark.sql.{Dataset, SparkSession}
object Main extends App {
case class Roll(name: Option[String], roll: Option[String], subjectID: Option[String])
val mySpark = SparkSession
.builder()
.master("local[2]")
.appName("Spark SQL basic example")
.getOrCreate()
import mySpark.implicits._
val inputDF: Dataset[Roll] = Seq(
("sam", "ta1i3dfk4", "xy|av|mm"),
("royc", "rfhqdbnb3", "a"),
("alcaly", "rfhqdbnb3", "xx|zz")
).toDF("name", "roll", "subjectID").as[Roll]
val out: Dataset[Roll] = inputDF.flatMap {
case Roll(n, r, Some(ids)) if ids.nonEmpty =>
ids.split("\\|").map(id => Roll(n, r, Some(id)))
case x => Some(x)
}
out.show()
}
Notes:
you can use split('|') instead of split("\\|")
you can change the default handling if the id must be non-empty (see the sketch below)
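For example, to drop rows with a missing subjectID instead of passing them through unchanged (a sketch under that assumption, reusing Roll and inputDF from the code above):
val onlyWithIds: Dataset[Roll] = inputDF.flatMap {
  case Roll(n, r, Some(ids)) if ids.nonEmpty =>
    ids.split('|').map(id => Roll(n, r, Some(id))).toSeq
  case _ => Seq.empty[Roll] // discard rows without a subjectID
}
onlyWithIds.show()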

Scala Spark - empty map on DataFrame column for map(String, Int)

I am joining two DataFrames that have columns of type Map[String, Int].
I want the merged DF to have an empty map ([]) rather than null in the Map-type columns.
val df = dfmerged.select(
  col("id"),
  coalesce(col("map_1"), lit(null).cast(MapType(StringType, IntegerType))).alias("map_1"),
  coalesce(col("map_2"), lit(Map.empty[String, Int])).alias("map_2"))
For the map_1 column a null is inserted, but I'd like to have an empty map.
map_2 is giving me an error:
java.lang.RuntimeException: Unsupported literal type class
scala.collection.immutable.Map$EmptyMap$ Map()
I've also tried with a UDF, like:
case class myStructMap(x:Map[String, Int])
val emptyMap = udf(() => myStructMap(Map.empty[String, Int]))
also did not work.
when I try something like:
.select( coalesce(col("myMapCol"), lit(map())).alias("brand_viewed_count")...
or
.select(coalesce(col("myMapCol"), lit(map().cast(MapType(LongType, LongType)))).alias("brand_viewed_count")...
I get the error:
cannot resolve 'map()' due to data type mismatch: cannot cast
MapType(NullType,NullType,false) to MapType(LongType,IntType,true);
In Spark 2.2 or later:
import org.apache.spark.sql.functions.typedLit
val df = Seq((1L, null), (2L, Map("foobar" -> 42))).toDF("id", "map")
df.withColumn("map", coalesce($"map", typedLit(Map[String, Int]()))).show
// +---+-----------------+
// | id| map|
// +---+-----------------+
// | 1| Map()|
// | 2|Map(foobar -> 42)|
// +---+-----------------+
Before Spark 2.2:
df.withColumn("map", coalesce($"map", map().cast("map<string,int>"))).show
// +---+-----------------+
// | id| map|
// +---+-----------------+
// | 1| Map()|
// | 2|Map(foobar -> 42)|
// +---+-----------------+
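Applied to the columns from the question (a sketch, assuming Spark 2.2+ and the map_1/map_2 names used above):
import org.apache.spark.sql.functions.{coalesce, col, typedLit}
// default both map columns to an empty Map[String, Int] instead of null
val dfFilled = dfmerged.select(
  col("id"),
  coalesce(col("map_1"), typedLit(Map.empty[String, Int])).alias("map_1"),
  coalesce(col("map_2"), typedLit(Map.empty[String, Int])).alias("map_2"))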