I am facing an error when I create a Dataset in Spark - Scala

Code and error:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
case class Drug(S_No: int,Name: string,Drug_Name: string,Gender: string,Drug_Value: int)
scala> val ds=spark.read.csv("file:///home/xxx/drug_detail.csv").as[Drug]
org.apache.spark.sql.AnalysisException: cannot resolve '`S_No`' given input columns: [_c1, _c2, _c3, _c4, _c0];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:110)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
Here is my test data:
1,Brandon Buckner,avil,female,525
2,Veda Hopkins,avil,male,633
3,Zia Underwood,paracetamol,male,980
4,Austin Mayer,paracetamol,female,338
5,Mara Higgins,avil,female,153
6,Sybill Crosby,avil,male,193
7,Tyler Rosales,paracetamol,male,778
8,Ivan Hale,avil,female,454
9,Alika Gilmore,paracetamol,female,833
10,Len Burgess,metacin,male,325

Generate a StructType schema using SQL Encoders, pass that schema while reading the CSV file, and define the types in the case class as Int/String instead of the lowercase int/string.
Example:
Sample data:
cat drug_detail.csv
1,foo,bar,M,2
2,foo1,bar1,F,3
Spark-shell:
case class Drug(S_No: Int,Name: String,Drug_Name: String,Gender: String,Drug_Value: Int)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Drug].schema
val ds=spark.read.schema(schema).csv("file:///home/xxx/drug_detail.csv").as[Drug]
ds.show()
//+----+----+---------+------+----------+
//|S_No|Name|Drug_Name|Gender|Drug_Value|
//+----+----+---------+------+----------+
//| 1| foo| bar| M| 2|
//| 2|foo1| bar1| F| 3|
//+----+----+---------+------+----------+
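Equivalently, the schema could be written out by hand as a StructType; a sketch with the same column names and types (nullability may differ slightly from the encoder-derived schema):
import org.apache.spark.sql.types._
// hand-written counterpart of Encoders.product[Drug].schema
val manualSchema = StructType(Seq(
  StructField("S_No", IntegerType),
  StructField("Name", StringType),
  StructField("Drug_Name", StringType),
  StructField("Gender", StringType),
  StructField("Drug_Value", IntegerType)
))
val ds2 = spark.read.schema(manualSchema).csv("file:///home/xxx/drug_detail.csv").as[Drug]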

If your CSV file contains a header row, also include option("header", "true") so the header line is not read as data, e.g.:
val ds = spark.read.option("header", "true").csv("file:///home/xxx/drug_detail.csv").as[Drug]

Related

Unable to get output for Spark case classes

I am trying to implement the following, using Spark 2.4.8 and sbt version 1.4.3 in IntelliJ.
Code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(id:Int,Name:String,cityId:Long)
case class City(id:Long,Name:String)
val family=Seq(Person(1,"john",11),(2,"MAR",12),(3,"Iweta",10)).toDF
val cities=Seq(City(11,"boston"),(12,"dallas")).toDF
Error:
Exception in thread "main" java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable found
at scala.reflect.runtime.JavaMirrors$JavaMirror.typeToJavaClass(JavaMirrors.scala:1300)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:192)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:60)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34)
at usingcaseclass$.main(usingcaseclass.scala:26)
at usingcaseclass.main(usingcaseclass.scala)
case class Salary(depName: String, empNo: Long, salary: Long)
val empsalary = Seq(Salary("sales", 1, 5000), Salary("personnel", 2, 3900)).toDS
empsalary.show(false)
value toDS is not a member of Seq[Salary]
val empsalary = Seq(Salary("sales", 1, 5000), Salary("personnel", 2, 3900)).toDS
Any idea how to prevent this error?
You have defined the Seq the wrong way: mixing Person(...) with bare tuples makes the inferred element type Seq[Product with Serializable] instead of Seq[Person], and toDF does not work on that. The modified lines below should work for you.
val family=Seq(Person(1,"john",11),Person(2,"MAR",12),Person(3,"Iweta",10))
family.toDF().show()
+---+-----+------+
| id| Name|cityId|
+---+-----+------+
| 1| john| 11|
| 2| MAR| 12|
| 3|Iweta| 10|
+---+-----+------+
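The same fix applies to cities; a sketch following the same pattern:
// construct every element with the case class so the Seq is inferred as Seq[City]
val cities = Seq(City(11, "boston"), City(12, "dallas"))
cities.toDF().show()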

Is it possible to apply when.otherwise functions within agg after a groupBy?

I've recently been trying to apply a default function to aggregated values as they are calculated, so that I don't have to reprocess them afterwards. However, I'm getting the following error.
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
It comes from the following function:
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)

def defaultFunction(col: Column, value: Any): Column = {
  when(col.equalTo(value), null).otherwise(col)
}
And I apply it in the following way:
val initialDF = Seq(
  ("a", "b", 1),
  ("a", "b", null),
  ("a", null, 0)
).toDF("field1", "field2", "field3")

initialDF
  .groupBy("field1", "field2")
  .agg(
    defaultUDF(functions.count("field3"), lit(0)).as("counter") // exception thrown here
  )
Am I trying to do black magic here, or is it something I may be missing?
The issue is in the implementation of your UserDefinedFunction:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
def defaultFunction(col: Column, value: Any): Column = {
  when(col.equalTo(value), null).otherwise(col)
}
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
// java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
// at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
// at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
// at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
// at org.apache.spark.sql.functions$.udf(functions.scala:3914)
// ... 65 elided
The error you're getting is basically because Spark cannot map the return type (i.e. Column) of your UserDefinedFunction defaultFunction to a Spark DataType.
Your defaultFunction has to accept and return Scala types that correspond with a Spark DataType. You can find the list of supported Scala types here: https://spark.apache.org/docs/latest/sql-reference.html#data-types
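For illustration only, a hypothetical UDF that satisfies this constraint works on plain Scala/Java values rather than Columns (a sketch, not needed for the solution below):
import org.apache.spark.sql.functions.udf
// maps a (possibly boxed) Long count to null when it is zero or missing
val zeroToNullUDF = udf((n: java.lang.Long) => if (n == null || n == 0L) null else n)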
In any case, you don't need a UserDefinedFunction if your function takes Columns and returns a Column. For your use-case, the following code will work:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
  Record("a", "b", 1),
  Record("a", "b", null),
  Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
def defaultFunction(col: Column, value: Any): Column = {
  when(col.equalTo(value), null).otherwise(col)
}
df
  .groupBy("field1", "field2")
  .agg(defaultFunction(count("field3"), lit(0)).as("counter"))
  .show
// +------+------+-------+
// |field1|field2|counter|
// +------+------+-------+
// | a| b| 1|
// | a| null| 1|
// +------+------+-------+

How to create a dataframe from a list of Scala custom objects

We can create a dataframe from a list of Java objects using:
DataFrame df = sqlContext.createDataFrame(list, Example.class);
In the case of Java, Spark can infer the schema directly from the class, in this case Example.class.
Is there a way to do the same in case of Scala?
If you use case classes in Scala, this works out of the box:
// define this class outside main method
case class MyCustomObject(id:Long,name:String,age:Int)
import spark.implicits._
val df = Seq(
MyCustomObject(1L,"Peter",34),
MyCustomObject(2L,"John",52)
).toDF()
df.show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Peter| 34|
| 2| John| 52|
+---+-----+---+
If you want to use a non-case class, you need to extend the Product trait and implement its abstract methods (canEqual, productArity, productElement) yourself.
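A rough sketch of what that could look like (an assumption based on the Product trait's requirements, not tested here; whether Spark's reflection-based encoder accepts a given non-case class can depend on the Spark version):
// a regular class made Product-compatible by hand; constructor parameters are vals
// so that accessor methods with matching names exist for Spark's reflection
class MyCustomRecord(val id: Long, val name: String, val age: Int)
    extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[MyCustomRecord]
  def productArity: Int = 3
  def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => age
  }
}

import spark.implicits._
val df2 = Seq(new MyCustomRecord(1L, "Peter", 34)).toDF()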

Spark Scala CSV Column names to Lower Case

Please find the code below and let me know how I can change the column names to lower case. I tried withColumnRenamed, but then I have to call it for each column and type out every column name. I just want to apply it across all columns without listing them, because there are too many of them.
Scala version: 2.11
Spark: 2.2
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
import com.datastax
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._

object dataframeset {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val rdd1 = sc.cassandraTable("tdata", "map3")
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)
    val spark1 = org.apache.spark.sql.SparkSession.builder().master("local")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .appName("Spark SQL basic example").getOrCreate()
    val df = spark1.read.format("csv").option("header", "true").option("inferschema", "true")
      .load("/Users/Desktop/del2.csv")
    import spark1.implicits._
    println("\nTop Records are:")
    df.show(1)
    val dfprev1 = df.select(col = "sno", "year", "StateAbbr")
    dfprev1.show(1)
  }
}
Required output:
|sno|year|stateabbr| statedesc|cityname|geographiclevel
All the column names should be in lower case.
Actual output:
Top Records are:
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
|sno|year|StateAbbr| StateDesc|CityName|GeographicLevel|DataSource| category|UniqueID| Measure|Data_Value_Unit|DataValueTypeID| Data_Value_Type|Data_Value|Low_Confidence_Limit|High_Confidence_Limit|Data_Value_Footnote_Symbol|Data_Value_Footnote|PopulationCount|GeoLocation|categoryID|MeasureId|cityFIPS|TractFIPS|Short_Question_Text|
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
| 1|2014| US|United States| null| US| BRFSS|Prevention| 59|Current lack of h...| %| AgeAdjPrv|Age-adjusted prev...| 14.9| 14.6| 15.2| null| null| 308745538| null| PREVENT| ACCESS2| null| null| Health Insurance|
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
only showing top 1 row
+---+----+---------+
|sno|year|StateAbbr|
+---+----+---------+
| 1|2014| US|
+---+----+---------+
only showing top 1 row
Just use toDF:
df.toDF(df.columns map(_.toLowerCase): _*)
Another way to achieve it is to use the foldLeft method:
val myDFcolNames = myDF.columns.toList
val rdoDenormDF = myDFcolNames.foldLeft(myDF)((acc, c) =>
  acc.withColumnRenamed(c, c.toLowerCase()))
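Putting the toDF approach together with the DataFrame from the question, a minimal sketch (assuming the same df as above):
// rename every column to its lower-case form in one pass
val dfLower = df.toDF(df.columns.map(_.toLowerCase): _*)
dfLower.select("sno", "year", "stateabbr").show(1)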

How to deal with missing value input in dataset-Spark/Scala

I have a data like this
8213034705_cst,95,2.927373,jake7870,0,95,117.5,xbox,3
,10,0.18669,parakeet2004,5,1,120,xbox,3
8213060420_gfd,26,0.249757,bluebubbles_1,25,1,120,xbox,3
8213060420_xcv,80,0.59059,sa4741,3,1,120,xbox,3
,75,0.657384,jhnsn2273,51,1,120,xbox,3
I am trying to put "missing value" into the first column where the value is missing (or to remove those records altogether). I am trying to execute the following code, but it gives me an error.
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.functions
import java.lang.String
import org.apache.spark.sql.functions.udf
//import spark.implicits._

object DocParser2 {
  case class Auction(auctionid: Option[String], bid: Double, bidtime: Double, bidder: String, bidderrate: Integer, openbid: Double, price: Double, item: String, daystolive: Integer)

  def readint(ip: Option[String]): String = ip match {
    case Some(ip) => ip.split("_")(0)
    case None => "missing value"
  }

  def main(args: Array[String]) = {
    val spark = SparkSession.builder.appName("DocParser").master("local[*]").getOrCreate()
    import spark.implicits._
    val intUDF = udf(readint _)
    val lines = spark.read.format("csv").option("header", "false").option("inferSchema", true)
      .load("data/auction2.csv")
      .toDF("auctionid", "bid", "bidtime", "bidder", "bidderrate", "openbid", "price", "item", "daystolive")
    val recordsDS = lines.as[Auction]
    recordsDS.printSchema()
    println("splitting auction id into String and Int")
    // recordsDS.withColumn("auctionid_int", java.lang.String.split('auctionid, "_")).show() // some error with the split method
    val auctionidcol = recordsDS.col("auctionid")
    recordsDS.withColumn("auctionid_int", intUDF('auctionid)).show()
    spark.stop()
  }
}
but it throws the following runtime error:
cannot cast java.lang.String to Scala.option in the line val intUDF = udf(readint _)
Could you help me figure out the error? Thanks.
A UDF never takes an Option as an input; it needs the actual type to be passed. In the case of a String you can do a null check inside your UDF; for primitive types (Int, Double, etc.), which can't be null, there are other solutions...
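A sketch of that suggestion, reusing the splitting logic from readint and assuming the same column names as in the question:
import org.apache.spark.sql.functions.udf
// take a plain String, which may arrive as null, instead of Option[String]
val intUDF = udf((ip: String) =>
  if (ip == null) "missing value" else ip.split("_")(0)
)
recordsDS.withColumn("auctionid_int", intUDF($"auctionid")).show()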
You can read the CSV file with spark.read.csv and use na.drop() to remove records that contain missing values (tested on Spark 2.0.2):
val df = spark.read.option("header", "false").option("inferSchema", "true").csv("Path to Csv file")
df.show
+--------------+---+--------+-------------+---+---+-----+----+---+
| _c0|_c1| _c2| _c3|_c4|_c5| _c6| _c7|_c8|
+--------------+---+--------+-------------+---+---+-----+----+---+
|8213034705_cst| 95|2.927373| jake7870| 0| 95|117.5|xbox| 3|
| null| 10| 0.18669| parakeet2004| 5| 1|120.0|xbox| 3|
|8213060420_gfd| 26|0.249757|bluebubbles_1| 25| 1|120.0|xbox| 3|
|8213060420_xcv| 80| 0.59059| sa4741| 3| 1|120.0|xbox| 3|
| null| 75|0.657384| jhnsn2273| 51| 1|120.0|xbox| 3|
+--------------+---+--------+-------------+---+---+-----+----+---+
df.na.drop().show
+--------------+---+--------+-------------+---+---+-----+----+---+
| _c0|_c1| _c2| _c3|_c4|_c5| _c6| _c7|_c8|
+--------------+---+--------+-------------+---+---+-----+----+---+
|8213034705_cst| 95|2.927373| jake7870| 0| 95|117.5|xbox| 3|
|8213060420_gfd| 26|0.249757|bluebubbles_1| 25| 1|120.0|xbox| 3|
|8213060420_xcv| 80| 0.59059| sa4741| 3| 1|120.0|xbox| 3|
+--------------+---+--------+-------------+---+---+-----+----+---+
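If instead you want to keep those rows and put the literal "missing value" into the first column, as described in the question, na.fill on that column is an alternative; a sketch assuming the default _c0 column name shown above:
// replace nulls in the first column with the requested placeholder
df.na.fill("missing value", Seq("_c0")).show()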