unable to get output for spark case classes - scala

I am trying to implement this using Spark 2.4.8 and sbt version 1.4.3 in IntelliJ.
code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(id:Int,Name:String,cityId:Long)
case class City(id:Long,Name:String)
val family=Seq(Person(1,"john",11),(2,"MAR",12),(3,"Iweta",10)).toDF
val cities=Seq(City(11,"boston"),(12,"dallas")).toDF
error:
Exception in thread "main" java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable found
at scala.reflect.runtime.JavaMirrors$JavaMirror.typeToJavaClass(JavaMirrors.scala:1300)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:192)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:60)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34)
at usingcaseclass$.main(usingcaseclass.scala:26)
at usingcaseclass.main(usingcaseclass.scala)
case class Salary(depName: String, empNo: Long, salary: Long)
val empsalary = Seq(Salary("sales", 1, 5000), Salary("personnel", 2, 3900)).toDS
empsalary.show(false)
value toDS is not a member of Seq[Salary]
val empsalary = Seq(Salary("sales", 1, 5000), Salary("personnel", 2, 3900)).toDS
Any idea how to prevent this error?

You have defined the Seq in the wrong way, which results in a Seq[Product with Serializable] rather than the Seq[T] that toDF works on.
The modified lines below should work for you.
val family=Seq(Person(1,"john",11),Person(2,"MAR",12),Person(3,"Iweta",10))
family.toDF().show()
+---+-----+------+
| id| Name|cityId|
+---+-----+------+
|  1| john|    11|
|  2|  MAR|    12|
|  3|Iweta|    10|
+---+-----+------+
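The cities Seq has the same problem (it mixes a City with a plain tuple); a minimal sketch of the analogous fix:
// build City instances for every element instead of mixing in tuples
val cities = Seq(City(11, "boston"), City(12, "dallas"))
cities.toDF().show()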

Related

Scala Transform DF into RDD - error java.lang.NumberFormatException: For input string: "age"

I'm taking baby steps in Scala and Spark and I'm facing an error that I'm not able to solve.
I'm trying to map a CSV file into a DF, but it is returning an error.
// Adding schema to RDDs - Initialization
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import spark.implicits._
case class Employee(name: String, age: Long)
val employeeDF = spark.sparkContext.textFile("./employee.txt").map(_.split(",")).map(attributes => Employee(attributes(0), attributes(1).trim.toInt)).toDF()
employeeDF.createOrReplaceTempView("employee")
var youngstersDF = spark.sql("SELECT name,age FROM employee WHERE age BETWEEN 18 AND 30")
youngstersDF.map(youngster => "Name: " + youngster(0)).show()
When I try to map the name, it returns the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 1 times, most recent failure: Lost task 0.0 in stage 19.0 (TID 21, 192.168.0.122, executor driver): java.lang.NumberFormatException: For input string: "age"
The file content is:
name,age
John,28
Andrew,36
Clarke,22
Kevin,42
I googled it but had no luck in terms of solutions/answers.
Can someone please help?
With Many Thanks
Xavy
You need to filter the header out of your data while converting to a DataFrame.
Example:
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import spark.implicits._
case class Employee(name: String, age: Long)
val employeeRDD = spark.sparkContext.textFile("./employee.txt")
//storing header string
val header=employeeRDD.first()
//filter out the header from the data
val employeeDF = employeeRDD.filter(r => r!=header).map(_.split(",")).map(attributes => Employee(attributes(0), attributes(1).trim.toInt)).toDF()
employeeDF.createOrReplaceTempView("employee")
sql("select * from employee").show()
//+------+---+
//| name|age|
//+------+---+
//| John| 28|
//|Andrew| 36|
//|Clarke| 22|
//| Kevin| 42|
//+------+---+
Just FYI, you can use spark.read.csv with the header option and pass the schema while reading.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val sch=new StructType().add("name",StringType).add("age",LongType)
val df=spark.read.option("header",true).option("delimiter",",").schema(sch).csv("./employee.txt")
df.show()
//+------+---+
//| name|age|
//+------+---+
//| John| 28|
//|Andrew| 36|
//|Clarke| 22|
//| Kevin| 42|
//+------+---+
I would try this:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
def getType[T: scala.reflect.runtime.universe.TypeTag](obj: T) = scala.reflect.runtime.universe.typeOf[T]
val path = getClass.getResource("/csv/employee.txt").getPath
val ds = spark.read
  .schema(ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType])
  .option("header", true)
  .option("sep", ",")
  .csv(path)
  .as[Employee]
println(getType(ds))
/**
* org.apache.spark.sql.Dataset[com.som.spark.learning.Employee]
*/
ds.show(false)
ds.printSchema()
/**
* +------+---+
* |name |age|
* +------+---+
* |John |28 |
* |Andrew|36 |
* |Clarke|22 |
* |Kevin |42 |
* +------+---+
*
* root
* |-- name: string (nullable = true)
* |-- age: long (nullable = true)
*/
case class Employee(name: String, age: Long)

Is it possible to apply when.otherwise functions within agg after a groupBy?

I've recently been trying to apply a default function to aggregated values as they are calculated, so that I don't have to reprocess them afterwards. As far as I can see, I'm getting the following error.
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
From the following function.
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
def defaultFunction(col: Column, value: Any): Column = {
  when(col.equalTo(value), null).otherwise(col)
}
And applying it the following way.
val initialDF = Seq(
  ("a", "b", 1),
  ("a", "b", null),
  ("a", null, 0)
).toDF("field1", "field2", "field3")
initialDF
  .groupBy("field1", "field2")
  .agg(
    defaultUDF(functions.count("field3"), lit(0)).as("counter") // exception thrown here
  )
Am I trying to do black magic here, or is it something that I may be missing?
The issue is in the implementation of your UserDefinedFunction:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
def defaultFunction(col: Column, value: Any): Column = {
  when(col.equalTo(value), null).otherwise(col)
}
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
// java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
// at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
// at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
// at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
// at org.apache.spark.sql.functions$.udf(functions.scala:3914)
// ... 65 elided
The error you're getting is basically because Spark cannot map the return type (i.e. Column) of your UserDefinedFunction defaultFunction to a Spark DataType.
Your defaultFunction has to accept and return Scala types that correspond with a Spark DataType. You can find the list of supported Scala types here: https://spark.apache.org/docs/latest/sql-reference.html#data-types
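Just for illustration (a sketch, not part of your code): a UDF that Spark can encode uses plain Scala types in its signature, for example a hypothetical zeroToNull:
import org.apache.spark.sql.functions.udf
// hypothetical example: Long in, Option[Long] out -- both map to Spark's LongType
val zeroToNull = udf((n: Long) => if (n == 0L) None else Some(n))
It could then be applied over the aggregate, e.g. .agg(zeroToNull(count("field3")).as("counter")).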
In any case, you don't need a UserDefinedFunction if your function takes Columns and returns a Column. For your use-case, the following code will work:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
  Record("a", "b", 1),
  Record("a", "b", null),
  Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// |     a|     b|     1|
// |     a|     b|  null|
// |     a|  null|     0|
// +------+------+------+
def defaultFunction(col: Column, value: Any): Column = {
  when(col.equalTo(value), null).otherwise(col)
}
df
  .groupBy("field1", "field2")
  .agg(defaultFunction(count("field3"), lit(0)).as("counter"))
  .show
// +------+------+-------+
// |field1|field2|counter|
// +------+------+-------+
// |     a|     b|      1|
// |     a|  null|      1|
// +------+------+-------+

I am facing an error when I create a Dataset in Spark

Error:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
case class Drug(S_No: int,Name: string,Drug_Name: string,Gender: string,Drug_Value: int)
scala> val ds=spark.read.csv("file:///home/xxx/drug_detail.csv").as[Drug]
org.apache.spark.sql.AnalysisException: cannot resolve '`S_No`' given input columns: [_c1, _c2, _c3, _c4, _c0];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:110)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
Here is my test data:
1,Brandon Buckner,avil,female,525
2,Veda Hopkins,avil,male,633
3,Zia Underwood,paracetamol,male,980
4,Austin Mayer,paracetamol,female,338
5,Mara Higgins,avil,female,153
6,Sybill Crosby,avil,male,193
7,Tyler Rosales,paracetamol,male,778
8,Ivan Hale,avil,female,454
9,Alika Gilmore,paracetamol,female,833
10,Len Burgess,metacin,male,325
Generate a StructType schema by using SQL Encoders, then pass the schema while reading the CSV file, and define the types in the case class as Int, String instead of lowercase int, string.
Example:
Sample data:
cat drug_detail.csv
1,foo,bar,M,2
2,foo1,bar1,F,3
Spark-shell:
case class Drug(S_No: Int,Name: String,Drug_Name: String,Gender: String,Drug_Value: Int)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Drug].schema
val ds=spark.read.schema(schema).csv("file:///home/xxx/drug_detail.csv").as[Drug]
ds.show()
//+----+----+---------+------+----------+
//|S_No|Name|Drug_Name|Gender|Drug_Value|
//+----+----+---------+------+----------+
//|   1| foo|      bar|     M|         2|
//|   2|foo1|     bar1|     F|         3|
//+----+----+---------+------+----------+
Use as:
val ds=spark.read.option("header", "true").csv("file:///home/xxx/drug_detail.csv").as[Drug]
If your CSV file contains a header, maybe include option("header","true").
e.g.: spark.read.option("header", "true").csv("...").as[Drug]
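Combining the two suggestions above (and assuming the file actually has a header row), a rough sketch could look like this:
import org.apache.spark.sql.Encoders
case class Drug(S_No: Int, Name: String, Drug_Name: String, Gender: String, Drug_Value: Int)
// schema comes from the case class; the header option skips the first line
val drugSchema = Encoders.product[Drug].schema
val dsWithHeader = spark.read.option("header", "true").schema(drugSchema).csv("file:///home/xxx/drug_detail.csv").as[Drug]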

How to deal with missing value input in a Dataset - Spark/Scala

I have data like this:
8213034705_cst,95,2.927373,jake7870,0,95,117.5,xbox,3
,10,0.18669,parakeet2004,5,1,120,xbox,3
8213060420_gfd,26,0.249757,bluebubbles_1,25,1,120,xbox,3
8213060420_xcv,80,0.59059,sa4741,3,1,120,xbox,3
,75,0.657384,jhnsn2273,51,1,120,xbox,3
I am trying to put "missing value" into the first column where the records are missing (or remove them altogether). I am trying to execute the following code, but it is giving me an error.
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.functions
import java.lang.String
import org.apache.spark.sql.functions.udf
//import spark.implicits._
object DocParser2
{
  case class Auction(auctionid:Option[String], bid:Double, bidtime:Double, bidder:String, bidderrate:Integer, openbid:Double, price:Double, item:String, daystolive:Integer)
  def readint(ip:Option[String]):String = ip match
  {
    case Some(ip) => ip.split("_")(0)
    case None => "missing value"
  }
  def main(args:Array[String]) =
  {
    val spark=SparkSession.builder.appName("DocParser").master("local[*]").getOrCreate()
    import spark.implicits._
    val intUDF = udf(readint _)
    val lines=spark.read.format("csv").option("header","false").option("inferSchema", true).load("data/auction2.csv").toDF("auctionid","bid","bidtime","bidder","bidderrate","openbid","price","item","daystolive")
    val recordsDS=lines.as[Auction]
    recordsDS.printSchema()
    println("splitting auction id into String and Int")
    // recordsDS.withColumn("auctionid_int",java.lang.String.split('auctionid,"_")).show() some error with the split method
    val auctionidcol=recordsDS.col("auctionid")
    recordsDS.withColumn("auctionid_int",intUDF('auctionid)).show()
    spark.stop()
  }
}
but it is throwing the following runtime error:
cannot cast java.lang.String to Scala.option in the line val intUDF = udf(readint _)
Could you help me figure out the error?
Thanks
A UDF never takes an Option as an input; it needs the actual type to be passed. In the case of a String, you can do a null check inside your UDF; for primitive types (Int, Double, etc.), which can't be null, there are other solutions...
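A minimal sketch of that null check, reusing the question's readint logic (assuming the question's recordsDS and spark.implicits._ are in scope):
import org.apache.spark.sql.functions.udf
// take a plain String and handle null inside the UDF instead of Option[String]
val intUDF = udf((ip: String) => if (ip == null) "missing value" else ip.split("_")(0))
recordsDS.withColumn("auctionid_int", intUDF('auctionid)).show()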
You can read the CSV file with spark.read.csv and use na.drop() to remove records that contain missing values; tested on Spark 2.0.2:
val df = spark.read.option("header", "false").option("inferSchema", "true").csv("Path to Csv file")
df.show
+--------------+---+--------+-------------+---+---+-----+----+---+
|           _c0|_c1|     _c2|          _c3|_c4|_c5|  _c6| _c7|_c8|
+--------------+---+--------+-------------+---+---+-----+----+---+
|8213034705_cst| 95|2.927373|     jake7870|  0| 95|117.5|xbox|  3|
|          null| 10| 0.18669| parakeet2004|  5|  1|120.0|xbox|  3|
|8213060420_gfd| 26|0.249757|bluebubbles_1| 25|  1|120.0|xbox|  3|
|8213060420_xcv| 80| 0.59059|       sa4741|  3|  1|120.0|xbox|  3|
|          null| 75|0.657384|    jhnsn2273| 51|  1|120.0|xbox|  3|
+--------------+---+--------+-------------+---+---+-----+----+---+
df.na.drop().show
+--------------+---+--------+-------------+---+---+-----+----+---+
|           _c0|_c1|     _c2|          _c3|_c4|_c5|  _c6| _c7|_c8|
+--------------+---+--------+-------------+---+---+-----+----+---+
|8213034705_cst| 95|2.927373|     jake7870|  0| 95|117.5|xbox|  3|
|8213060420_gfd| 26|0.249757|bluebubbles_1| 25|  1|120.0|xbox|  3|
|8213060420_xcv| 80| 0.59059|       sa4741|  3|  1|120.0|xbox|  3|
+--------------+---+--------+-------------+---+---+-----+----+---+
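If, instead of dropping rows, you want the literal "missing value" in the first column (the other option mentioned in the question), a sketch using na.fill on the same DataFrame:
// replace nulls only in the first column with a placeholder string
df.na.fill("missing value", Seq("_c0")).show()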

How to fill the null values in a DataFrame with a UUID?

There is a DataFrame with null values in one column (not all being null); I need to fill the null values with a UUID. Is there a way?
scala> val df = Seq(("stuff2",null,null), ("stuff2",null,Array("value1","value2")),("stuff3","stuff3",null)).toDF("field","field2","values")
df: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]
scala> df.show
+------+------+----------------+
| field|field2|          values|
+------+------+----------------+
|stuff2|  null|            null|
|stuff2|  null|[value1, value2]|
|stuff3|stuff3|            null|
+------+------+----------------+
I tried this way, but each row of "field2" gets the same UUID.
scala> val fillDF = df.na.fill(java.util.UUID.randomUUID().toString(), Seq("field2"))
fillDF: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]
scala> fillDF.show
+------+--------------------+----------------+
| field|              field2|          values|
+------+--------------------+----------------+
|stuff2|d007ffae-9134-4ac...|            null|
|stuff2|d007ffae-9134-4ac...|[value1, value2]|
|stuff3|              stuff3|            null|
+------+--------------------+----------------+
How can I make this work when there are more than 1,000,000 rows?
You can do it using a UDF and coalesce, like below.
import org.apache.spark.sql.functions.{coalesce, udf}
val arr = udf(() => java.util.UUID.randomUUID().toString())
val df2 = df.withColumn("field2", coalesce(df("field2"), arr()))
df2.show()
You will get a different UUID per row, like below.
+------+--------------------+----------------+
| field|              field2|          values|
+------+--------------------+----------------+
|stuff2|fda6bc42-1265-407...|            null|
|stuff2|3fa74767-abd7-405...|[value1, value2]|
|stuff3|              stuff3|            null|
+------+--------------------+----------------+
You can easily do this by using a UDF; it can be something like this:
import org.apache.spark.sql.functions.udf
def generateUUID(value: String): String = {
  import java.util.UUID
  if (Option(value).isDefined) {
    value
  } else {
    UUID.randomUUID().toString
  }
}
val funcUDF = generateUUID _
val generateUUIDudf = udf(funcUDF)
Now apply it to fillDF accordingly:
fillDF.withColumn("field2", generateUUIDudf(fillDF("field2"))).show
P.S.: The code is not tested, but it should work!