Spark DataFrame not supporting Char datatype - scala

I am creating a Spark DataFrame from a text file, say an Employee file that contains String, Int and Char fields.
I created a class:
case class Emp (
Name: String,
eid: Int,
Age: Int,
Sex: Char,
Sal: Int,
City: String)
Created RDD1 using split, then created RDD2:
val textFileRDD2 = textFileRDD1.map(attributes => Emp(
attributes(0),
attributes(1).toInt,
attributes(2).toInt,
attributes(3).charAt(0),
attributes(4).toInt,
attributes(5)))
And the final DataFrame as:
val finalRDD = textFileRDD2.toDF
When I create the final DataFrame it throws the error:
java.lang.UnsupportedOperationException: No Encoder found for scala.Char
Can anyone explain why this happens and how to resolve it?

Spark SQL doesn't provide Encoders for Char and generic Encoders are not very useful.
You can either use a StringType:
attributes(3).slice(0, 1)
or a ShortType (or a BooleanType or ByteType, if you accept only a binary response):
attributes(3)(0) match {
case 'F' => 1: Short
...
case _ => 0: Short
}
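For illustration, here is a minimal sketch of both options applied to the Emp class from the question, using only the Scala standard library. The helper names parseEmp and sexCode and the sample record are made up for this example; only the 'F' => 1 case and the default 0 come from the answer above.

```scala
// Emp with Sex stored as a one-character String, a type Spark SQL can encode.
case class Emp(Name: String, eid: Int, Age: Int, Sex: String, Sal: Int, City: String)

// Hypothetical helper: turn one already-split record into an Emp.
def parseEmp(attributes: Array[String]): Emp =
  Emp(
    attributes(0),
    attributes(1).toInt,
    attributes(2).toInt,
    attributes(3).slice(0, 1), // first character, kept as a String
    attributes(4).toInt,
    attributes(5))

// Alternative encoding as a Short. Only 'F' => 1 and the default 0 are taken
// from the answer; an empty field also falls through to 0.
def sexCode(raw: String): Short = raw.headOption match {
  case Some('F') => 1
  case _         => 0
}

// Made-up sample record, split the same way as in the question.
val emp = parseEmp("Ann,7,31,F,5000,Pune".split(","))
```

With the field typed as String (or Short), textFileRDD1.map(parseEmp).toDF should no longer hit the missing Char encoder, assuming the usual implicits are imported.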

Related

Transforming RDD[String] to RDD[myclass]

I am trying to transform an RDD[String] into an RDD[Picture] but cannot get it to work. Once I have an RDD[Picture], I want to use the method hasValidCountry to check whether the latitude and longitude in the picture metadata are valid, and then the method hasTags in the Picture class to check whether the user tags are valid. The errors I encounter:
Implicit conversion found: row ⇒ augmentString(row): scala.collection.immutable.StringOps
type mismatch; found : String required: Array[String]
value InterestingPics is not a member of Array[Nothing] possible cause: maybe a semicolon is missing before `value InterestingPics'?
My intention is to select the lines that have a valid country and tags, and transform them into a new RDD[Picture].
ScalaFile1 (I have updated the ScalaFile):
object Part2 {
def main(args: Array[String]): Unit = {
var spark: SparkSession = null
try {
spark = SparkSession.builder().appName("Flickr using dataframes").config("spark.master", "local[*]").getOrCreate()
val originalFlickrMeta: RDD[String] = spark.sparkContext.textFile("flickrSample.txt")
val InterestingPics = originalFlickrMeta.map(row => row.split('\t')).map(field => Picture(field(0).toString())
InterestingPics.collect
InterestingPics.take(5).foreach(println)

This works, as an example:
case class case_for_rdd(c1: Int, c2: String, c3: String)
val rdd_data = spark.sparkContext.textFile("/FileStore/tables/csv01-4.txt")
val rdd = rdd_data.map(row => row.split(',')).map(field => case_for_rdd(field(0).toInt, field(1), field(2)))
rdd.collect
A more complicated example reads a file in which one field holds an array. The array needs its own delimiter, here a pipe:
1,10,100,aa|bb|cc
2,20,200,xxxxxx|yyyyyyyy|z|aaa
Some sample code follows. It uses a List because otherwise the output shows array addresses (that is what those strange strings are), courtesy of smarter people here:
case class case_for_rdd(c1: Int, c2: String, c3: String, a4: List[String])
val rdd_data = spark.sparkContext.textFile("/FileStore/tables/csv03.txt")
val myCaseRdd = rdd_data.map(row => row.split(',')).map(field => case_for_rdd(field(0).toInt, field(1), field(2), (field(3).split("\\|").toList)))
myCaseRdd.collect
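To see the Array versus List difference without a cluster, here is a small standard-library sketch. The names CaseForRdd and parseLine are made up for this example; the sample line is the first one shown above.

```scala
case class CaseForRdd(c1: Int, c2: String, c3: String, a4: List[String])

// Hypothetical helper: parse one comma-separated line whose last field is pipe-delimited.
def parseLine(line: String): CaseForRdd = {
  val field = line.split(',')
  // "\\|" escapes the pipe, which is otherwise the regex alternation operator.
  CaseForRdd(field(0).toInt, field(1), field(2), field(3).split("\\|").toList)
}

val parsed = parseLine("1,10,100,aa|bb|cc")

// An Array prints as a JVM reference, a List prints its elements.
val asArrayText = "aa|bb|cc".split("\\|").toString
val asListText  = "aa|bb|cc".split("\\|").toList.toString
```

The same parseLine could be passed to rdd_data.map in place of the two chained map calls.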
My advice is to use a DataFrame, since the splitting is then easier. Also, once you manipulate the RDD via transformations, the case class is lost. An array column with the DataFrame API has no such issue.
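A rough sketch of the DataFrame route, assuming Spark 2.2 or later (for reading CSV from a Dataset[String]) and a local session. The two sample lines from above are held in memory instead of a file, and the column names are borrowed from the case class.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

val spark = SparkSession.builder().master("local[*]").appName("split-sketch").getOrCreate()
import spark.implicits._

// The two sample lines from above, kept in memory for this sketch.
val lines = Seq("1,10,100,aa|bb|cc", "2,20,200,xxxxxx|yyyyyyyy|z|aaa").toDS()

val df = spark.read.csv(lines)               // columns _c0 .. _c3, all strings
  .toDF("c1", "c2", "c3", "a4")
  .withColumn("c1", col("c1").cast("int"))
  .withColumn("a4", split(col("a4"), "\\|")) // becomes an array<string> column
```

df.printSchema() and df.show(false) then display the array contents directly, with no case class involved.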

I have a solution to my question, thanks to the help of #thebluephantom. Thank you very much.
val InterestingPics = originalFlickrMeta.map(line => (new Picture(line.split("\t")))).filter(f => f.c != null && f.userTags.length > 0)
InterestingPics.collect().foreach(println)

Scala dataset map fails with exception No applicable constructor/method found for zero actual parameters

I have the following case classes
case class FeedbackData (prefix : String, position : Int, click : Boolean,
suggestion: Suggestion,
history : List[RequestHistory],
eventTimestamp: Long)
case class Suggestion (clicks : Long, sources : List[String], ctr : Float)
case class RequestHistory (timestamp: Long, url: String)
I use it to perform a map operation on my dataset
val sqlContext = ss.sqlContext
import sqlContext.implicits._
val input: Dataset[FeedbackData] = ss.read.json("filename").as(Encoders.bean(classOf[FeedbackData]))
input.map(row => transformRow(row))
At runtime I see the exception
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 24, Column 81: failed to compile:
No applicable constructor/method found for zero actual parameters; candidates are: "package.FeedbackData(java.lang.String, int, boolean, package.Suggestion, scala.collection.immutable.List, long)"
What am I doing wrong?

The context is fine here; the issue is with the case class. Scala's Long has to be used instead of Java's long:
case class A(num1 : Long, num2 : Long, num3 : Long)

Inspired by #pasha701, a use case could be:
case class Student(id: Int, name: String)
import spark.implicits._
val df = Seq((1, "james"), (2, "tony")).toDF("id", "name")
df.printSchema()
df.as[Student].rdd.map{
stu=>
stu.id+"\t"+stu.name
}.collect().foreach(println)
output:
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
1 james
2 tony
Reference: https://spark.apache.org/docs/2.4.0/sql-getting-started.html
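On the exception in the question itself: Encoders.bean targets Java beans, which are instantiated through a no-argument constructor, and a case class has none. That would explain the "zero actual parameters" message. Below is a sketch that uses the implicit product encoder instead, assuming Spark 2.2 or later. The JSON record is made up and stands in for the contents of "filename".

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

case class Suggestion(clicks: Long, sources: List[String], ctr: Float)
case class RequestHistory(timestamp: Long, url: String)
case class FeedbackData(prefix: String, position: Int, click: Boolean,
                        suggestion: Suggestion,
                        history: List[RequestHistory],
                        eventTimestamp: Long)

val ss = SparkSession.builder().master("local[*]").appName("encoder-sketch").getOrCreate()
import ss.implicits._

// One made-up record; in the question this data comes from a file.
val sampleJson = Seq(
  """{"prefix":"sp","position":1,"click":true,"suggestion":{"clicks":10,"sources":["a","b"],"ctr":0.5},"history":[{"timestamp":1,"url":"http://example.com"}],"eventTimestamp":2}"""
).toDS()

// Derive the schema from the case class so that position and ctr are read as
// int and float rather than the bigint and double that JSON inference yields.
val schema = Encoders.product[FeedbackData].schema

// .as[FeedbackData] picks up the case class encoder from ss.implicits._
val input: Dataset[FeedbackData] = ss.read.schema(schema).json(sampleJson).as[FeedbackData]
```

For a file, ss.read.schema(schema).json("filename").as[FeedbackData] follows the same pattern.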

Spark SQL: How to convert List[List[Any]] to a DataFrame

val list = List(List(1,"Ankita"),List(2,"Kunal"))
Now I want to convert it into a DataFrame:
val list = List(List(1,"Ankita"),List(2,"Kunal")).toDF("id","name")
but it throws an error:
java.lang.ClassNotFoundException: Scala.any

AFAIK, a List[List[Any]] cannot be converted to a DataFrame directly. It needs to be converted to a list of some concrete type first, for example List[Person]:
case class Person(id: Int, name: String)
val list = List(List(1,"Ankita"),List(2,"Kunal"))
val listDf = list.map(x => Person(x(0).asInstanceOf[Int], x(1).toString)).toDF("id","name")
Another way, per the comment of user8371915, is to create a list of pairs and convert that to a DataFrame:
val listDf = list.map {
case List(id: Int, name: String) => (id, name)
}.toDF("id", "name")

The inner List can be of arbitrary size, so the implicit conversion cannot be applied.
It can be converted if you change it to a List of tuples:
val list = List((1,"Ankita"),(2,"Kunal")).toDF("id","name")
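If the inner lists are not guaranteed to have the expected shape, a pattern match inside map throws a MatchError on the first odd row. A small standard-library sketch using collect instead; the extra one-element row is added here purely to show the effect.

```scala
// The third row is deliberately malformed for this example.
val list: List[List[Any]] = List(List(1, "Ankita"), List(2, "Kunal"), List(3))

// collect applies the partial function only where it matches, so the
// one-element row is skipped rather than raising a MatchError.
val pairs: List[(Int, String)] = list.collect {
  case List(id: Int, name: String) => (id, name)
}
```

pairs is a List of tuples, so pairs.toDF("id", "name") should work once the Spark implicits are in scope.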

How to handle dates in Spark using Scala?

I have a flat file that looks like this:
id,name,desg,tdate
1,Alex,Business Manager,2016-01-01
I am using the Spark Context to read this file as follows.
val myFile = sc.textFile("file.txt")
I want to generate a Spark DataFrame from this file and I am using the following code to do so.
case class Record(id: Int, name: String,desg:String,tdate:String)
val myFile1 = myFile.map(x=>x.split(",")).map {
case Array(id, name,desg,tdate) => Record(id.toInt, name,desg,tdate)
}
myFile1.toDF()
This is giving me a DataFrame with id as int and rest of the columns as String.
I want the last column, tdate, to be cast to a date type.
How can I do that?

You just need to convert the String to a java.sql.Date object. Then your code can simply become:
import java.sql.Date
case class Record(id: Int, name: String,desg:String,tdate:Date)
val myFile1 = myFile.map(x=>x.split(",")).map {
case Array(id, name,desg,tdate) => Record(id.toInt, name,desg,Date.valueOf(tdate))
}
myFile1.toDF()
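One thing to watch: the sample file starts with a header line, and "id".toInt (like Date.valueOf on a malformed date) throws. A standard-library sketch that skips such lines instead; parseRecord is a made-up helper name.

```scala
import java.sql.Date
import scala.util.Try

case class Record(id: Int, name: String, desg: String, tdate: Date)

// Returns None for the header line or any malformed line instead of throwing.
def parseRecord(line: String): Option[Record] = line.split(",") match {
  case Array(id, name, desg, tdate) =>
    Try(Record(id.toInt, name, desg, Date.valueOf(tdate))).toOption
  case _ => None
}

// The two lines from the question's sample file.
val sample = List("id,name,desg,tdate", "1,Alex,Business Manager,2016-01-01")
val records = sample.flatMap(parseRecord)
```

On the RDD, myFile.flatMap(parseRecord).toDF() would be the equivalent call, assuming silently dropping bad lines is acceptable for your data.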

How to convert datatypes in Spark SQL to specific datatypes and get the result as an RDD of a specific class

I am reading a CSV file and need to create an RDDSchema.
I read the file using sqlContext.csvFile:
val testfile = sqlContext.csvFile("file")
testfile.registerTempTable("testtable")
I want to pick some of the fields and return an RDD typed with those fields.
For example: class Test(ID: String, order_date: Date, Name: String, value: Double)
Using sqlContext.sql("Select col1, col2, col3, col4 FROM ..."):
val testfile = sqlContext.sql("Select col1, col2, col3, col4 FROM testtable").collect
testfile.getClass
Class[_ <: Array[org.apache.spark.sql.Row]] = class [Lorg.apache.spark.sql.Row;
So I want to change col1 to a double, col2 to a date, and col3 to a string.
Is there a way to do this in sqlContext.sql, or do I have to run a map function over the result and then turn it back into an RDD?
I tried to do this in one statement and got an error:
val old_rdd : RDD[Test] = sqlContext.sql("SELECT col, col2, col3,col4 FROM testtable").collect.map(t => (t(0) : String ,dateFormat.parse(dateFormat.format(1)),t(2) : String, t(3) : Double))
The issue is that the assignment does not result in an RDD[Test], where Test is a class I defined.
The error says that the map call produces an Array and not an RDD:
found : Array[edu.model.Test]
[error] required: org.apache.spark.rdd.RDD[edu.model.Test]

Let's say you have a case class like this:
case class Test(
ID: String, order_date: java.sql.Date, Name: String, value: Double)
Since you load your data with csvFile using default parameters, it doesn't perform any schema inference and your data is stored as plain strings. Let's assume that there are no other fields:
val df = sc.parallelize(
("ORD1", "2016-01-02", "foo", "2.23") ::
("ORD2", "2016-07-03", "bar", "9.99") :: Nil
).toDF("col1", "col2", "col3", "col4")
Your attempt to use map is wrong for more than one reason:
The function you use annotates individual values with incorrect types. Not only is Row.apply of type Int => Any, but your table also shouldn't contain any Double values.
Since you collect (which doesn't make sense here), you fetch all data to the driver and the result is a local Array, not an RDD.
Finally, even if all the previous issues were resolved, (String, Date, String, Double) is clearly not a Test.
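The first point can be reproduced without Spark. In this sketch a Seq[Any] stands in for a Row (an assumption made only to keep the example self-contained); positional access returns Any in both cases.

```scala
// Stand-in for one org.apache.spark.sql.Row: positional access yields Any.
val row: Seq[Any] = Seq("ORD1", "2016-01-02", "foo", "2.23")

// val id: String = row(0)   // does not compile: found Any, required String

// A type pattern checks the runtime class and binds a typed value.
val id: Option[String] = row(0) match {
  case s: String => Some(s)
  case _         => None
}

// The fourth field is text, so it matches String, not Double, and has to be parsed.
val value: Option[Double] = row(3) match {
  case d: Double => Some(d)
  case s: String => scala.util.Try(s.toDouble).toOption
  case _         => None
}
```

A type ascription such as t(3): Double only tells the compiler what to expect; it does not convert the stored String.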
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
val casted = df.select(
$"col1".alias("ID"),
$"col2".cast("date").alias("order_date"),
$"col3".alias("name"),
$"col4".cast("double").alias("value")
)
// Going through .rdd keeps the result an RDD[Test] on Spark 1.x and 2.x alike.
val tests: RDD[Test] = casted.rdd.map {
case Row(id: String, date: java.sql.Date, name: String, value: Double) =>
Test(id, date, name, value)
}
You can also try the new Dataset API, but it is far from stable:
casted.as[Test].rdd