Converting from java.util.List to spark dataset - scala

I am still very new to Spark and Scala, but very familiar with Java. I have a Java jar with a function that returns a List (java.util.List) of Integers, and I want to convert it to a Spark Dataset so I can append it to another column and then perform a join. Is there an easy way to do this? I've tried things similar to this code:
val testDSArray : java.util.List[Integer] = new java.util.ArrayList[Integer]()
testDSArray.add(4)
testDSArray.add(7)
testDSArray.add(10)
val testDS : Dataset[Integer] = spark.createDataset(testDSArray, Encoders.INT())
but it gives me a compiler error (cannot resolve overloaded method).

If you look at the type signature you will see that in Scala the encoder is passed in a second (and implicit) parameter list.
You may:
Pass it in another parameter list.
val testDS = spark.createDataset(testDSArray)(Encoders.INT)
Don't pass it, and let Scala's implicit mechanism resolve it.
import spark.implicits._
val testDS = spark.createDataset(testDSArray)
Convert the Java List to a Scala one first.
import collection.JavaConverters._
import spark.implicits._
val testDS = testDSArray.asScala.toDS()
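Since the goal in the question is a join, the third option can be taken one step further. The following is only a sketch: the object name ListToDataset and the helper toIntDataset are hypothetical, and it assumes that unboxing each java.lang.Integer to a Scala Int is acceptable (a null element would fail). The .toSeq call simply makes sure a plain Seq is handed to toDS().

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import scala.collection.JavaConverters._

object ListToDataset {
  // Hypothetical helper: converts a Java list of boxed Integers to a Dataset[Int].
  def toIntDataset(spark: SparkSession, values: java.util.List[Integer]): Dataset[Int] = {
    import spark.implicits._ // brings Encoder[Int] and toDS() into scope
    values.asScala.map(_.intValue).toSeq.toDS()
  }
}
```

Calling ListToDataset.toIntDataset(spark, testDSArray).toDF("id") would then give a single-column DataFrame whose column name (here the made-up name id) can be used as a join key.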

Related

Best practice to define implicit/explicit encoding in dataframe column value extraction without RDD

I am trying to get column data in a collection without RDD map api (doing the pure dataframe way)
object CommonObject{
def doSomething(...){
.......
val releaseDate = tableDF.where(tableDF("item") <=> "releaseDate").select("value").map(r => r.getString(0)).collect.toList.head
}
}
this is all good except Spark 2.3 suggests
No implicits found for parameter evidence$6: Encoder[String]
between map and collect
map(r => r.getString(0))(...).collect
I understand that I need to add
import spark.implicits._
before the processing; however, this requires a SparkSession instance,
which is pretty annoying, especially when there is no SparkSession instance available in the method. As a Spark newbie, how can I nicely resolve the implicit encoder parameter in this context?
You can always add a call to SparkSession.builder.getOrCreate() inside your method. Spark will find the already existing SparkSession and won't create a new one, so there is no performance impact. Then you can import the implicits, which work for all case classes. This is the easiest way to add an encoder. Alternatively, an explicit encoder can be passed using the Encoders class.
val spark = SparkSession.builder
.appName("name")
.master("local[2]")
.getOrCreate()
import spark.implicits._
The other way is to get the SparkSession from the DataFrame via dataframe.sparkSession:
def dummy (df : DataFrame) = {
val spark = df.sparkSession
import spark.implicits._
}
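The explicit-encoder alternative mentioned above could look roughly like the sketch below. It passes Encoders.STRING in the second parameter list of map, so neither import spark.implicits._ nor a session reference is needed. The object and method names are hypothetical; the column names item and value are taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, Encoders}

object ExplicitEncoderExample {
  // Hypothetical helper: returns the first "value" whose "item" equals the given key.
  // Like the original collect.toList.head, it throws if no row matches.
  def firstValue(tableDF: DataFrame, key: String): String =
    tableDF
      .where(tableDF("item") <=> key)
      .select("value")
      .map(r => r.getString(0))(Encoders.STRING) // encoder supplied explicitly
      .head()
}
```

With this shape, the call in the question would become ExplicitEncoderExample.firstValue(tableDF, "releaseDate").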

Fetching a DataFrame into a Case Class results instead in reading a Tuple1

Given a case class :
case class ScoringSummary(MatchMethod: String="",
TP: Double=0,
FP: Double=0,
Precision: Double=0,
Recall: Double=0,
F1: Double=0)
We are writing summary records out as:
summaryDf.write.parquet(path)
Later we (attempt to) read the parquet file into a new dataframe:
implicit val generalRowEncoder: Encoder[ScoringSummary] =
org.apache.spark.sql.Encoders.kryo[ScoringSummary]
val summaryDf = spark.read.parquet(path).as[ScoringSummary]
But this fails - for some reason spark believes the contents of the data were Tuple1 instead of ScoringSummary:
Try to map struct<MatchMethod:string,TP:double,FP:double,Precision:double,
Recall:double,F1:double> to Tuple1,
but failed as the number of fields does not line up.;
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$
.org$apache$spark$sql$catalyst$analysis$Analyzer$
ResolveDeserializer$$fail(Analyzer.scala:2168)
What step / setting is missing/incorrect for the correct translation?
Use import spark.implicits._ instead of registering an Encoder
I had forgotten that it is required to import spark.implicits._. The incorrect approach was to add the Encoder, i.e. do not include the following line:
implicit val generalRowEncoder: Encoder[ScoringSummary] =
org.apache.spark.sql.Encoders.kryo[ScoringSummary] // Do NOT add this Encoder
Here is the error when removing the Encoder line
Error:(59, 113) Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._ Support for serializing
other types will be added in future releases.
val summaryDf = ParquetLoader.loadParquet(sparkEnv,res.state.dfs(ScoringSummaryTag).copy(df=None)).df.get.as[ScoringSummary]
Instead the following code should be added
import spark.implicits._
And then the same code works:
val summaryDf = spark.read.parquet(path).as[ScoringSummary]
As an aside: explicit encoders are not required for case classes or primitive types, because spark.implicits._ supplies them, and the above is a case class. Kryo becomes handy for complex object types.
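One way to see why the Kryo encoder does not fit here is to compare encoder schemas. A Kryo encoder serializes the whole object into a single binary column, while the product encoder for a case class exposes one column per field, so only the latter can line up with the six Parquet columns. A minimal sketch, using a shortened and hypothetical case class Summary in place of ScoringSummary:

```scala
import org.apache.spark.sql.Encoders

// Hypothetical, shortened stand-in for ScoringSummary.
case class Summary(MatchMethod: String = "", TP: Double = 0, FP: Double = 0)

object EncoderSchemas {
  // Kryo: the whole object is stored as one binary column named "value".
  val kryoFields: Seq[String] = Encoders.kryo[Summary].schema.fieldNames.toSeq
  // Product encoder: one column per case class field.
  val productFields: Seq[String] = Encoders.product[Summary].schema.fieldNames.toSeq
}
```

Neither call needs a running SparkSession, so this can be checked quickly in a REPL with Spark on the classpath.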

Not able to write SequenceFile in Scala for Array[NullWritable, ByteWritable]

I have a Byte Array in Scala: val nums = Array[Byte](1,2,3,4,5,6,7,8,9) or you can take any other Byte array.
I want to save it as a sequence file in HDFS. Below is the code, I am writing in scala console.
import org.apache.hadoop.io.compress.GzipCodec
nums.map( x => (NullWritable.get(), new ByteWritable(x))).saveAsSequenceFile("/yourPath", classOf[GzipCodec])
But it gives the following error:
error: values saveAsSequenceFile is not a member of Array[ (org.apache.hadoop.io.NullWritable), (org.apache.hadoop.io.ByteWritable)]
You need to import these classes as well (in the Scala console).
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.ByteWritable
The method saveAsSequenceFile is available on an RDD not on an array. So first you need to lift your array into an RDD and then you will be able to call the method saveAsSequenceFile
val v = sc.parallelize(Array(("owl",3), ("gnu",4), ("dog",1), ("cat",2), ("ant",5)), 2)
v.saveAsSequenceFile("hd_seq_file")
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
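Applied to the byte array from the question, the same idea might look as follows. This is a sketch that assumes an existing SparkContext is passed in; the object and method names are hypothetical. Note that saveAsSequenceFile takes the codec as an Option, so the class is wrapped in Some.

```scala
import org.apache.hadoop.io.{ByteWritable, NullWritable}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.SparkContext

object ByteSequenceFile {
  // Hypothetical helper: writes each byte as a (NullWritable, ByteWritable) record.
  def save(sc: SparkContext, nums: Array[Byte], path: String): Unit =
    sc.parallelize(nums.toSeq)                             // lift the array into an RDD[Byte]
      .map(x => (NullWritable.get(), new ByteWritable(x))) // build the Writable pairs inside the tasks
      .saveAsSequenceFile(path, Some(classOf[GzipCodec]))  // the codec parameter is an Option
}
```

In the Scala console this would be called as ByteSequenceFile.save(sc, nums, "/yourPath"); the target path must not exist yet.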

Not able to apply function to Spark Dataframe Column

I am trying to apply a function to one of my dataframe columns to convert the values. The values in the column look like "20160907" and I need them to be "2016-09-07".
I wrote a function like this:
def convertDate(inDate:String ): String = {
val year = inDate.substring(0,4)
val month = inDate.substring(4,6)
val day = inDate.substring(6,8)
return year+'-'+month+'-'+day
}
And in my spark scala code, I am using this:
def final_Val {
val oneDF = hiveContext.read.orc("/tmp/new_file.txt")
val convertToDate_udf = udf(convertDate _)
val convertedDf = oneDF.withColumn("modifiedDate", convertToDate_udf(col("EXP_DATE")))
convertedDf.show()
}
Surprisingly, in the Spark shell I am able to run this without any error. In Scala IDE I get the below compilation error:
Multiple markers at this line:
not enough arguments for method udf: (implicit evidence$2:
reflect.runtime.universe.TypeTag[String], implicit evidence$3: reflect.runtime.universe.TypeTag[String])org.apache.spark.sql.UserDefinedFunction. Unspecified value parameters evidence$2, evidence$3.
I am using Spark 1.6.2, Scala 2.10.5
Can someone please tell me what I am doing wrong here?
I tried the same code with different functions, like in this post: stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column.
I am not getting any compilation issues with that code, and I am unable to find the issue with mine.
From what I learned in a Spark Summit course, you should use the sql.functions methods as much as possible. Before implementing your own UDF, check whether an existing function in the sql.functions package already does the same work. With the built-in functions Spark can do a lot of optimizations for you, and it does not have to serialize and deserialize your data to and from JVM objects.
To achieve the result you want, I propose this solution (written against a Spark 2.x SparkSession named spark):
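import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
import spark.implicits._ // needed for toDF and the $"..." column syntax
val oneDF = spark.sparkContext.parallelize(Seq("19931001", "19931001")).toDF("EXP_DATE")
val convertedDF = oneDF.withColumn("modifiedDate", from_unixtime(unix_timestamp($"EXP_DATE", "yyyyMMdd"), "yyyy-MM-dd"))
convertedDF.show()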
this gives the following results :
+--------+------------+
|EXP_DATE|modifiedDate|
+--------+------------+
|19931001|  1993-10-01|
|19931001|  1993-10-01|
+--------+------------+
Hope this helps. Best regards
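If the values only need to be re-punctuated rather than parsed as dates, the substring logic of convertDate can also be written with built-in column functions. The following is a sketch with a hypothetical object and method name; note that the substring column function uses 1-based positions, unlike String.substring.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws, substring}

object DateReformat {
  // Hypothetical helper: adds modifiedDate as "yyyy-MM-dd" built from a "yyyyMMdd" string column.
  def withModifiedDate(df: DataFrame, source: String = "EXP_DATE"): DataFrame =
    df.withColumn(
      "modifiedDate",
      concat_ws("-",
        substring(col(source), 1, 4),  // year
        substring(col(source), 5, 2),  // month
        substring(col(source), 7, 2))) // day
}
```

Unlike the unix_timestamp variant, this does not validate that the input is a real date; it only rearranges characters, exactly as the original function did.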

How to load a csv directly into a Spark Dataset?

I have a csv file [1] which I want to load directly into a Dataset. The problem is that I always get errors like
org.apache.spark.sql.AnalysisException: Cannot up cast `probability` from string to float as it may truncate
The type path of the target object is:
- field (class: "scala.Float", name: "probability")
- root class: "TFPredictionFormat"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
Moreover, and specifically for the phrases field (check case class [2]), I get
org.apache.spark.sql.AnalysisException: cannot resolve '`phrases`' due to data type mismatch: cannot cast StringType to ArrayType(StringType,true);
If I define all the fields in my case class [2] as type String then everything works fine but this is not what I want. Is there a simple way to do it [3]?
References
[1] An example row
B017NX63A2,Merrell,"['merrell_for_men', 'merrell_mens_shoes', 'merrel']",merrell_shoes,0.0806054356579781
[2] My code snippet is as follows
import spark.implicits._
val INPUT_TF = "<SOME_URI>/my_file.csv"
final case class TFFormat (
doc_id: String,
brand: String,
phrases: Seq[String],
prediction: String,
probability: Float
)
val ds = sqlContext.read
.option("header", "true")
.option("charset", "UTF8")
.csv(INPUT_TF)
.as[TFFormat]
ds.take(1).map(println)
[3] I have found ways to do it by first defining columns on a DataFrame level and then converting things to a Dataset (like here or here or here), but I am almost sure this is not the way things are supposed to be done. I am also pretty sure that Encoders are probably the answer, but I don't have a clue how.
TL;DR With CSV input, transforming with standard DataFrame operations is the way to go. If you want to avoid that, you should use an input format that is more expressive (Parquet or even JSON).
In general, data to be converted to a statically typed Dataset must already be of the correct type. The most efficient way to achieve that is to provide a schema argument to the csv reader:
val schema: StructType = ???
val ds = spark.read
.option("header", "true")
.schema(schema)
.csv(path)
.as[T]
where schema could be inferred by reflection:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
val schema = ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
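A similar schema can also be obtained through the public Encoders API. The sketch below assumes a hypothetical case class Prediction with atomic fields only; the object name is made up as well.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Hypothetical case class with atomic fields only.
case class Prediction(doc_id: String, probability: Float)

object SchemaFromEncoder {
  // StructType with the fields doc_id (string) and probability (float).
  val schema: StructType = Encoders.product[Prediction].schema
}
```

The resulting value can be passed to .schema(...) on the reader in the same way as the reflection-based one.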
Unfortunately it won't work with your data and class, because the csv reader doesn't support ArrayType (though it would work for atomic types like FloatType), so you have to do it the hard way. A naive solution could be expressed as below:
import org.apache.spark.sql.functions._
val df: DataFrame = ??? // Raw data
df
.withColumn("probability", $"probability".cast("float"))
.withColumn("phrases",
split(regexp_replace($"phrases", "[\\['\\]]", ""), ","))
.as[TFFormat]
but you may need something more sophisticated depending on the content of phrases.
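As one example of something slightly more robust, the split pattern can also swallow the whitespace that follows each comma in the example row, so the array elements do not start with a space. This is a sketch with hypothetical object and method names; it still assumes that the phrases themselves never contain commas, quotes, or brackets.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace, split}

object PhraseParsing {
  // Hypothetical helper: casts probability to float and turns "['a', 'b']" into array("a", "b").
  def typed(df: DataFrame): DataFrame =
    df.withColumn("probability", col("probability").cast("float"))
      .withColumn("phrases",
        // strip [, ' and ] first, then split on a comma plus any trailing whitespace
        split(regexp_replace(col("phrases"), "[\\['\\]]", ""), ",\\s*"))
}
```

With import spark.implicits._ in scope, PhraseParsing.typed(df).as[TFFormat] would then produce the typed Dataset.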