I have data like this:
8213034705_cst,95,2.927373,jake7870,0,95,117.5,xbox,3
,10,0.18669,parakeet2004,5,1,120,xbox,3
8213060420_gfd,26,0.249757,bluebubbles_1,25,1,120,xbox,3
8213060420_xcv,80,0.59059,sa4741,3,1,120,xbox,3
,75,0.657384,jhnsn2273,51,1,120,xbox,3
I am trying to put "missing value" into the first column where the value is missing (or to remove those records altogether). I am executing the following code, but it is giving me an error.
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.functions
import java.lang.String
import org.apache.spark.sql.functions.udf
//import spark.implicits._
object DocParser2
{
case class Auction(auctionid:Option[String], bid:Double, bidtime:Double, bidder:String, bidderrate:Integer, openbid:Double, price:Double, item:String, daystolive:Integer)
def readint(ip:Option[String]):String = ip match
{
case Some(ip) => ip.split("_")(0)
case None => "missing value"
}
def main(args:Array[String]) =
{
val spark=SparkSession.builder.appName("DocParser").master("local[*]").getOrCreate()
import spark.implicits._
val intUDF = udf(readint _)
val lines=spark.read.format("csv").option("header","false").option("inferSchema", true).load("data/auction2.csv").toDF("auctionid","bid","bidtime","bidder","bidderrate","openbid","price","item","daystolive")
val recordsDS=lines.as[Auction]
recordsDS.printSchema()
println("splitting auction id into String and Int")
// recordsDS.withColumn("auctionid_int",java.lang.String.split('auctionid,"_")).show() some error with the split method
val auctionidcol=recordsDS.col("auctionid")
recordsDS.withColumn("auctionid_int",intUDF('auctionid)).show()
spark.stop()
}
}
but it is throwing the following runtime error:
cannot cast java.lang.String to scala.Option in the line val intUDF = udf(readint _)
Could you help me figure out the error?
Thanks
A UDF never takes an Option as an input; it needs the actual type to be passed. In the case of a String you can do a null check inside your UDF; for primitive types (Int, Double, etc.), which can't be null, there are other solutions...
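For example, here is a minimal sketch (not from the original answer) of the String-based version with a null check, assuming auctionid is read as a plain String column and keeping the rest of your code unchanged:
// hypothetical rewrite of readint taking a String instead of Option[String]
def readint(ip: String): String =
  if (ip == null) "missing value" else ip.split("_")(0)
val intUDF = udf(readint _)
recordsDS.withColumn("auctionid_int", intUDF('auctionid)).show()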
You can read the CSV file with spark.read.csv and use na.drop() to remove records that contain missing values (tested on Spark 2.0.2):
val df = spark.read.option("header", "false").option("inferSchema", "true").csv("Path to Csv file")
df.show
+--------------+---+--------+-------------+---+---+-----+----+---+
| _c0|_c1| _c2| _c3|_c4|_c5| _c6| _c7|_c8|
+--------------+---+--------+-------------+---+---+-----+----+---+
|8213034705_cst| 95|2.927373| jake7870| 0| 95|117.5|xbox| 3|
| null| 10| 0.18669| parakeet2004| 5| 1|120.0|xbox| 3|
|8213060420_gfd| 26|0.249757|bluebubbles_1| 25| 1|120.0|xbox| 3|
|8213060420_xcv| 80| 0.59059| sa4741| 3| 1|120.0|xbox| 3|
| null| 75|0.657384| jhnsn2273| 51| 1|120.0|xbox| 3|
+--------------+---+--------+-------------+---+---+-----+----+---+
df.na.drop().show
+--------------+---+--------+-------------+---+---+-----+----+---+
| _c0|_c1| _c2| _c3|_c4|_c5| _c6| _c7|_c8|
+--------------+---+--------+-------------+---+---+-----+----+---+
|8213034705_cst| 95|2.927373| jake7870| 0| 95|117.5|xbox| 3|
|8213060420_gfd| 26|0.249757|bluebubbles_1| 25| 1|120.0|xbox| 3|
|8213060420_xcv| 80| 0.59059| sa4741| 3| 1|120.0|xbox| 3|
+--------------+---+--------+-------------+---+---+-----+----+---+
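If you would rather keep those rows and put a placeholder in the first column instead of dropping them, here is a small sketch (not from the original answer, using the default column name _c0 from the output above):
df.na.fill("missing value", Seq("_c0")).show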
Related
I am trying to implement the following using Spark 2.4.8 and sbt 1.4.3 in IntelliJ.
Code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(id:Int,Name:String,cityId:Long)
case class City(id:Long,Name:String)
val family=Seq(Person(1,"john",11),(2,"MAR",12),(3,"Iweta",10)).toDF
val cities=Seq(City(11,"boston"),(12,"dallas")).toDF
error:
Exception in thread "main" java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable found
at scala.reflect.runtime.JavaMirrors$JavaMirror.typeToJavaClass(JavaMirrors.scala:1300)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:192)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:60)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34)
at usingcaseclass$.main(usingcaseclass.scala:26)
at usingcaseclass.main(usingcaseclass.scala)
case class Salary(depName: String, empNo: Long, salary: Long)
val empsalary = Seq(Salary("sales", 1, 5000), Salary("personnel", 2, 3900)).toDS
empsalary.show(false)
value toDS is not a member of Seq[Salary]
val empsalary = Seq(Salary("sales", 1, 5000), Salary("personnel", 2, 3900)).toDS
Any idea how to prevent this error?
You have defined the Seq in the wrong way: mixing Person(...) with bare tuples results in a Seq[Product with Serializable], not a Seq[Person], on which toDF works.
The modified lines below should work for you.
val family=Seq(Person(1,"john",11),Person(2,"MAR",12),Person(3,"Iweta",10))
family.toDF().show()
+---+-----+------+
| id| Name|cityId|
+---+-----+------+
| 1| john| 11|
| 2| MAR| 12|
| 3|Iweta| 10|
+---+-----+------+
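The cities Seq presumably needs the same change, wrapping every element in the City case class:
val cities = Seq(City(11, "boston"), City(12, "dallas"))
cities.toDF().show()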
I have recently been trying to apply a default function to aggregated values as they are calculated, so that I don't have to reprocess them afterwards. I'm getting the following error:
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
It comes from the following function:
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
And I am applying it in the following way:
val initialDF = Seq(
("a", "b", 1),
("a", "b", null),
("a", null, 0)
).toDF("field1", "field2", "field3")
initialDF
.groupBy("field1", "field2")
.agg(
defaultUDF(functions.count("field3"), lit(0)).as("counter") // exception thrown here
)
Am I trying to do black magic here, or is there something I may be missing?
The issue is in the implementation of your UserDefinedFunction:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
// java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
// at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
// at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
// at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
// at org.apache.spark.sql.functions$.udf(functions.scala:3914)
// ... 65 elided
The error you're getting is basically because Spark cannot map the return type (i.e. Column) of your UserDefinedFunction defaultFunction to a Spark DataType.
Your defaultFunction has to accept and return Scala types that correspond with a Spark DataType. You can find the list of supported Scala types here: https://spark.apache.org/docs/latest/sql-reference.html#data-types
In any case, you don't need a UserDefinedFunction if your function takes Columns and returns a Column. For your use-case, the following code will work:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
Record("a", "b", 1),
Record("a", "b", null),
Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
df
.groupBy("field1", "field2")
.agg(defaultFunction(count("field3"), lit(0)).as("counter"))
.show
// +------+------+-------+
// |field1|field2|counter|
// +------+------+-------+
// | a| b| 1|
// | a| null| 1|
// +------+------+-------+
I am getting the following error with this code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
case class Drug(S_No: int,Name: string,Drug_Name: string,Gender: string,Drug_Value: int)
scala> val ds=spark.read.csv("file:///home/xxx/drug_detail.csv").as[Drug]
org.apache.spark.sql.AnalysisException: cannot resolve '`S_No`' given input columns: [_c1, _c2, _c3, _c4, _c0];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:110)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
Here is my test data:
1,Brandon Buckner,avil,female,525
2,Veda Hopkins,avil,male,633
3,Zia Underwood,paracetamol,male,980
4,Austin Mayer,paracetamol,female,338
5,Mara Higgins,avil,female,153
6,Sybill Crosby,avil,male,193
7,Tyler Rosales,paracetamol,male,778
8,Ivan Hale,avil,female,454
9,Alika Gilmore,paracetamol,female,833
10,Len Burgess,metacin,male,325
Generate a StructType schema using SQL Encoders, then pass the schema while reading the CSV file, and define the types in the case class as Int, String instead of lowercase int, string.
Example:
Sample data:
cat drug_detail.csv
1,foo,bar,M,2
2,foo1,bar1,F,3
Spark-shell:
case class Drug(S_No: Int,Name: String,Drug_Name: String,Gender: String,Drug_Value: Int)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Drug].schema
val ds=spark.read.schema(schema).csv("file:///home/xxx/drug_detail.csv").as[Drug]
ds.show()
//+----+----+---------+------+----------+
//|S_No|Name|Drug_Name|Gender|Drug_Value|
//+----+----+---------+------+----------+
//| 1| foo| bar| M| 2|
//| 2|foo1| bar1| F| 3|
//+----+----+---------+------+----------+
Use it as:
val ds=spark.read.option("header", "true").csv("file:///home/xxx/drug_detail.csv").as[Drug]
If your CSV file contains headers, you may want to include option("header", "true"),
e.g.: spark.read.option("header", "true").csv("...").as[Drug]
We can create a dataframe from a list of Java objects using:
DataFrame df = sqlContext.createDataFrame(list, Example.class);
In case of Java, Spark can infer the schema directly from the class, in this case Example.class.
Is there a way to do the same in case of Scala?
If you use case classes in Scala, this works out of the box:
// define this class outside main method
case class MyCustomObject(id:Long,name:String,age:Int)
import spark.implicits._
val df = Seq(
MyCustomObject(1L,"Peter",34),
MyCustomObject(2L,"John",52)
).toDF()
df.show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Peter| 34|
| 2| John| 52|
+---+-----+---+
If you want to use a non-case class, you need to extend the trait Product and implement its methods yourself.
There is a DataFrame with null values in one column (not all values are null), and I need to fill those null values with a UUID. Is there a way?
scala> val df = Seq(("stuff2",null,null), ("stuff2",null,Array("value1","value2")),("stuff3","stuff3",null)).toDF("field","field2","values")
df: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]
scala> df.show
+------+------+----------------+
| field|field2| values|
+------+------+----------------+
|stuff2| null| null|
|stuff2| null|[value1, value2]|
|stuff3|stuff3| null|
+------+------+----------------+
I tried this way, but every row of "field2" gets the same UUID:
scala> val fillDF = df.na.fill(java.util.UUID.randomUUID().toString(), Seq("field2"))
fillDF: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]
scala> fillDF.show
+------+--------------------+----------------+
| field| field2| values|
+------+--------------------+----------------+
|stuff2|d007ffae-9134-4ac...| null|
|stuff2|d007ffae-9134-4ac...|[value1, value2]|
|stuff3| stuff3| null|
+------+--------------------+----------------+
How can I do this, given that there are more than 1,000,000 rows?
You can do it using a UDF and coalesce, like below.
import org.apache.spark.sql.functions.udf
val arr = udf(() => java.util.UUID.randomUUID().toString())
val df2 = df.withColumn("field2", coalesce(df("field2"), arr()))
df2.show()
You will get different UUIDs, like below:
+------+--------------------+----------------+
| field| field2| values|
+------+--------------------+----------------+
|stuff2|fda6bc42-1265-407...| null|
|stuff2|3fa74767-abd7-405...|[value1, value2]|
|stuff3| stuff3| null|
+------+--------------------+----------------+
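As a side note (not part of the original answer, and requiring Spark 2.3+), you may want to mark the UDF as non-deterministic so the optimizer does not assume it returns the same value for the same input:
val arr = udf(() => java.util.UUID.randomUUID().toString()).asNondeterministic()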
You can easily do this by using a UDF; it can be something like this:
import org.apache.spark.sql.functions.udf

// returns the original value if present, otherwise a fresh UUID
def generateUUID(value: String): String = {
  import java.util.UUID
  if (Option(value).isDefined) value
  else UUID.randomUUID().toString
}
val generateUUIDUDF = udf(generateUUID _)
Now apply it to fillDF accordingly:
fillDF.withColumn("field2", generateUUIDUDF(fillDF("field2"))).show
P.S.: The code is not tested, but it should work!