Scala Spark Dataset change class type

I have a DataFrame that I created with the schema of MyData1, and then I added a column so that the new DataFrame follows the schema of MyData2. Now I want to return the new DataFrame as a Dataset, but I'm getting the following error:
[info] org.apache.spark.sql.AnalysisException: cannot resolve '`hashed`' given input columns: [id, description];
[info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:110)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
Here is my code:
import org.apache.spark.sql.{DataFrame, Dataset}

case class MyData1(id: String, description: String)
case class MyData2(id: String, description: String, hashed: String)

object MyObject {
  def read(arg1: String, arg2: String): Dataset[MyData2] = {
    var df: DataFrame = null
    val obj1 = new Matcher("cbutrer383", "e8f8chsdfd")
    val obj2 = new Matcher("cbutrer383", "g567g4rwew")
    val obj3 = new Matcher("cbutrer383", "567yr45e45")
    df = Seq(obj1, obj2, obj3).toDF("id", "description")
    df.withColumn("hashed", lit("hash"))
    val ds: Dataset[MyData2] = df.as[MyData2]
    ds
  }
}
I know there is probably something wrong in the following line, but I can't figure out what:
val ds: Dataset[MyData2] = df.as[MyData2]
I am a newbie, so I'm probably making a basic mistake. Can anyone help? TIA

You forgot to assign the newly created DataFrame back to df:
df = df.withColumn("hashed", lit("hash"))
The Spark docs for withColumn say:
Returns a new Dataset by adding a column or replacing the existing column that has the same name.

A better version of your read function is below. Try to avoid null assignments and var, and an explicit return statement is not really required:
def read(arg1: String, arg2: String): Dataset[MyData2] = {
  val obj1 = new Matcher("cbutrer383", "e8f8chsdfd")
  val obj2 = new Matcher("cbutrer383", "g567g4rwew")
  val obj3 = new Matcher("cbutrer383", "567yr45e45")
  Seq(obj1, obj2, obj3).toDF("id", "description")
    .withColumn("hashed", lit("hash"))
    .as[MyData2]
}
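For this to compile, lit and the implicit encoders also need to be in scope. A minimal, untested sketch of the full object, assuming a SparkSession named spark is available and reusing the asker's Matcher class:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.lit

object MyObject {
  // assumes a SparkSession named `spark` is passed in; Matcher is the asker's own class
  def read(arg1: String, arg2: String)(implicit spark: SparkSession): Dataset[MyData2] = {
    import spark.implicits._ // brings toDF and the MyData2 encoder into scope

    Seq(new Matcher("cbutrer383", "e8f8chsdfd"),
        new Matcher("cbutrer383", "g567g4rwew"),
        new Matcher("cbutrer383", "567yr45e45"))
      .toDF("id", "description")
      .withColumn("hashed", lit("hash"))
      .as[MyData2]
  }
}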

Related

Spark - scala - How to convert DataFrame to custom object?

Here is a block of code. In the snippet I am reading a multi-line JSON file and converting it into Emp objects.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}

def main(args: Array[String]): Unit = {
  val filePath = Configuration.folderPath + "emp_unformatted.json"
  val sparkConfig = new SparkConf().setMaster("local[2]").setAppName("findEmp")
  val sparkContext = new SparkContext(sparkConfig)
  val sqlContext = new SQLContext(sparkContext)
  val formattedJsonData = sqlContext.read.option("multiline", "true").json(filePath)
  val res = formattedJsonData.rdd.map(empParser)
  for (e <- res.take(2)) println(e.name + " " + e.company + " " + e.about)
}

case class Emp(name: String, company: String, email: String, address: String, about: String)

def empParser(row: Row): Emp = {
  new Emp(row.getAs("name"), row.getAs("company"), row.getAs("email"), row.getAs("address"), row.getAs("about"))
}
My question is whether the "formattedJsonData.rdd.map(empParser)" approach is correct. I am converting to an RDD of Emp objects.
1. Is that the right approach?
2. Suppose I have 100K (1 lakh) or 1M records; would there be any performance issues?
3. Is there a better option for converting a collection of Emp?
If you are using Spark 2, you can use a Dataset, which is type-safe and also provides the performance benefits of DataFrames.
val df = sqlSession.read.option("multiline", "true").json(filePath)
import sqlSession.implicits._
val ds: Dataset[Emp] = df.as[Emp]
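For completeness, a rough sketch of the same idea with Spark 2's SparkSession (which is presumably what sqlSession refers to above); FindEmp is a made-up object name and the file path reuses the asker's Configuration helper:

import org.apache.spark.sql.{Dataset, SparkSession}

case class Emp(name: String, company: String, email: String, address: String, about: String)

object FindEmp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("findEmp")
      .getOrCreate()
    import spark.implicits._

    // read the multi-line JSON and map it straight to a typed Dataset, no hand-written parser needed
    val ds: Dataset[Emp] = spark.read
      .option("multiline", "true")
      .json(Configuration.folderPath + "emp_unformatted.json")
      .as[Emp]

    ds.take(2).foreach(e => println(s"${e.name} ${e.company} ${e.about}"))
  }
}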

How to create NULLable Flink table columns from Scala case classes that contain Option types

I would like to create a DataSet (or DataStream) from a collection of case classes that contain Option values.
In the created table, columns resulting from Option values should contain either NULL or the actual primitive value.
This is what I tried:
import java.sql.Timestamp

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

object OptionExample {
  case class Event(time: Timestamp, id: String, value: Option[Int])

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val tEnv = TableEnvironment.getTableEnvironment(env)

    val data = env.fromCollection(Seq(
      Event(Timestamp.valueOf("2018-01-01 00:01:00"), "a", Some(3)),
      Event(Timestamp.valueOf("2018-01-01 00:03:00"), "a", None),
      Event(Timestamp.valueOf("2018-01-01 00:03:00"), "b", Some(7)),
      Event(Timestamp.valueOf("2018-01-01 00:02:00"), "a", Some(5))
    ))

    val table = tEnv.fromDataSet(data)
    table.printSchema()
    // root
    //  |-- time: Timestamp
    //  |-- id: String
    //  |-- value: Option[Integer]

    val result = table
      .groupBy('id)
      .select('id, 'value.avg as 'averageValue)

    // Print results
    val ds: DataSet[Row] = result.toDataSet
    ds.print()
  }
}
But this causes an Exception in the aggregation part...
org.apache.flink.table.api.ValidationException: Expression avg('value) failed on input check: avg requires numeric types, get Option[Integer] here
...so with this approach Option does not get converted into a numeric type with NULLs as described above.
How can I achieve this with Flink?
(I'm coming from Apache Spark, there Datasets created from case classes with Options have this behaviour. I would like to achieve something similar with Flink)
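One possible workaround, sketched here without having verified it against a specific Flink version, is to hand the Table API plain Row objects with an explicit row type, so that value becomes an ordinary nullable INT column instead of Option[Integer]; this builds on the env, tEnv and data values from the code above:

import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}

// explicit row type: "value" is a plain INT column that may hold null
implicit val rowType: TypeInformation[Row] = Types.ROW_NAMED(
  Array("time", "id", "value"),
  Types.SQL_TIMESTAMP, Types.STRING, Types.INT)

val rows = data.map { e =>
  // Option[Int] -> boxed Integer or null
  Row.of(e.time, e.id, e.value.map(Int.box).orNull)
}
val table = tEnv.fromDataSet(rows, 'time, 'id, 'value)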

Java.lang.IllegalArgumentException: requirement failed: Columns not found in Double

I am working in Spark and I have many CSV files that contain lines; a line looks like this:
2017,16,16,51,1,1,4,-79.6,-101.90,-98.900
It can contain more or fewer fields, depending on the CSV file.
Each file corresponds to a Cassandra table, where I need to insert all the lines the file contains, so what I basically do is get the line, split its elements, and put them in a List[Double]:
sc.stop
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
val nameTable = "artport"
val ligne = "20171,16,165481,51,1,1,4,-79.6000,-101.7000,-98.9000"
val linetoinsert : List[String] = ligne.split(",").toList
var ainserer : Array[Double] = new Array[Double](linetoinsert.length)
for (l <- 0 until linetoinsert.length) yield { ainserer(l) = linetoinsert(l).toDouble }
val liste = ainserer.toList
val rdd = sc.parallelize(liste)
rdd.saveToCassandra("db", nameTable) //db is the name of my keyspace in cassandra
When I run my code I get this error
java.lang.IllegalArgumentException: requirement failed: Columns not found in Double: [collecttime, sbnid, enodebid, rackid, shelfid, slotid, channelid, c373910000, c373910001, c373910002]
at scala.Predef$.require(Predef.scala:224)
at com.datastax.spark.connector.mapper.DefaultColumnMapper.columnMapForWriting(DefaultColumnMapper.scala:108)
at com.datastax.spark.connector.writer.MappedToGettableDataConverter$$anon$1.<init>(MappedToGettableDataConverter.scala:37)
at com.datastax.spark.connector.writer.MappedToGettableDataConverter$.apply(MappedToGettableDataConverter.scala:28)
at com.datastax.spark.connector.writer.DefaultRowWriter.<init>(DefaultRowWriter.scala:17)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$1.rowWriter(DefaultRowWriter.scala:31)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$1.rowWriter(DefaultRowWriter.scala:29)
at com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:382)
at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:35)
... 60 elided
I figured out that the insertion works if my RDD is of type:
rdd: org.apache.spark.rdd.RDD[(Double, Double, Double, Double, Double, Double, Double, Double, Double, Double)]
But the one I get from what I am doing is an org.apache.spark.rdd.RDD[Double].
I can't use a Scala Tuple9, for example, because I don't know the number of elements my list is going to contain before execution. That solution also doesn't fit my problem because sometimes I have more than 100 columns in my CSV, and tuples stop at Tuple22.
Thanks for your help
As #SergGr mentioned, a Cassandra table has a schema with known columns, so you need to map your array to the Cassandra schema before saving to the Cassandra database. You can use a case class for this. Try the following code; I assume each column in the Cassandra table is of type Double.
//create a case class equivalent to your Cassandra table
case class Schema(collecttime: Double,
                  sbnid: Double,
                  enodebid: Double,
                  rackid: Double,
                  shelfid: Double,
                  slotid: Double,
                  channelid: Double,
                  c373910000: Double,
                  c373910001: Double,
                  c373910002: Double)

object test {
  import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
    val sc = new SparkContext(conf)
    val nameTable = "artport"
    val ligne = "20171,16,165481,51,1,1,4,-79.6000,-101.7000,-98.9000"

    //parse the ligne string into the Schema case class
    val schema = parseString(ligne)

    //get RDD[Schema]
    val rdd = sc.parallelize(Seq(schema))

    //now you can save this RDD to cassandra
    rdd.saveToCassandra("db", nameTable)
  }

  //function to parse a string into the Schema case class
  def parseString(s: String): Schema = {
    //get each field from the string array
    val Array(collecttime, sbnid, enodebid, rackid, shelfid, slotid,
              channelid, c373910000, c373910001, c373910002, _*) = s.split(",").map(_.toDouble)

    //map those fields to the Schema class
    Schema(collecttime,
           sbnid,
           enodebid,
           rackid,
           shelfid,
           slotid,
           channelid,
           c373910000,
           c373910001,
           c373910002)
  }
}
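To apply this to whole CSV files rather than a single hard-coded line, the same parser can be mapped over each file's lines; a small sketch (the file path is hypothetical):

//read every line of the CSV, parse it into the case class, and save the whole RDD
val lines = sc.textFile("/path/to/artport.csv") // hypothetical path
lines.map(parseString).saveToCassandra("db", nameTable)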

How to deal with contexts in Spark/Scala when using map()

I'm not very familiar with Scala or Spark, and I'm trying to develop a basic test to understand how DataFrames actually work. My objective is to update myDF based on the values of some records of another table.
Well, on the one hand, I have my App:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object TestApp {
  def main(args: Array[String]) {
    val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(conf)
    implicit val hiveContext: SQLContext = new HiveContext(sc)
    val test: Test = new Test()
    test.test
  }
}
On the other hand, I have my Test class:
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

class Test(implicit sqlContext: SQLContext) extends Serializable {
  val hiveContext: SQLContext = sqlContext
  import hiveContext.implicits._

  def test(): Unit = {
    val myDF = hiveContext.read.table("myDB.Customers").sort($"cod_a", $"start_date".desc)
    myDF.map(myMap).take(1)
  }

  def myMap(row: Row): Row = {
    def _myMap: (String, String) = {
      val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment")
      var target: (String, String) = casoX(investmentDF, row.getAs[String]("cod_a"), row.getAs[String]("cod_p"))
      target
    }

    def casoX(df: DataFrame, codA: String, codP: String)(implicit hiveContext: SQLContext): (String, String) = {
      var rows: Array[Row] = null
      if (codP != null) {
        println(df)
        rows = df.filter($"cod_a" === codA && $"cod_p" === codP).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      } else {
        rows = df.filter($"cod_a" === codA).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      }
      if (rows.length > 0) (row(0).asInstanceOf[String], row(1).asInstanceOf[String]) else null
    }

    val target: (String, String) = _myMap
    Row(row(0), row(1), row(2), row(3), row(4), row(5), row(6), target._1, target._2, row(9))
  }
}
Well, when I execute it, I get a NullPointerException on the instruction val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment"), and more precisely on hiveContext.read.
If I inspect hiveContext in the test function, I can access its SparkContext, and I can load my DataFrame without any problem.
Nevertheless, if I inspect my hiveContext object just before getting the NullPointerException, its sparkContext is null. I suppose this is because sparkContext is not serializable (and as I am inside a map function, I'm losing part of my hiveContext object, am I right?).
Anyway, I don't know exactly what's wrong with my code, or how I should alter it to get my investmentDF without any NullPointerException.
Thanks!
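The asker's guess points in the right direction: the contexts only exist on the driver, so hiveContext cannot be used inside a map function running on executors. The usual way around this is to read both tables on the driver and express the per-row lookup as a join; a rough, untested sketch reusing the asker's column names:

// sketch only: read myDB.Investment once on the driver instead of inside myMap;
// the "highest sales first" selection is left out for brevity
def test(): Unit = {
  val customers = hiveContext.read.table("myDB.Customers")
  val investment = hiveContext.read.table("myDB.Investment")
    .select($"cod_a", $"cod_p", $"cod_t", $"nom_t", $"sales")

  // a left join keeps customers with no matching investment (the null case above)
  val enriched = customers.join(investment, Seq("cod_a", "cod_p"), "left_outer")
  enriched.show(1)
}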

Spark convert RDD to DataFrame - Enumeration is not supported

I have a case class which contains an enumeration field, PersonType. I would like to insert this record into a Hive table.
object PersonType extends Enumeration {
  type PersonType = Value
  val BOSS = Value
  val REGULAR = Value
}

case class Person(firstname: String, lastname: String)
case class Holder(personType: PersonType.Value, person: Person)
And:
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
val item = new Holder(PersonType.REGULAR, new Person("tom", "smith"))
val content: Seq[Holder] = Seq(item)
val data : RDD[Holder] = sc.parallelize(content)
val df = data.toDF()
...
When I try to convert the corresponding RDD to DataFrame, I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException:
Schema for type com.test.PersonType.Value is not supported
...
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:691)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:630)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:94)
I'd like to convert PersonType to a String before inserting into Hive.
Is it possible to extend the implicit conversions to handle PersonType as well?
I tried something like this, but it didn't work:
object PersonTypeConversions {
  implicit def toString(personType: PersonTypeConversions.Value): String = personType.toString()
}
import PersonTypeConversions._
Spark: 1.6.0
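As the exception says, Catalyst has no schema for Enumeration values, and an implicit conversion like the one above is not picked up by toDF. One straightforward workaround is to carry the enumeration as a String in the shape that actually becomes the DataFrame; a small sketch (HolderRow is a made-up name, content and sc are from the code above):

// flat, Catalyst-friendly version of Holder: the enumeration is stored as a String
case class HolderRow(personType: String, firstname: String, lastname: String)

val rows: Seq[HolderRow] = content.map { h =>
  HolderRow(h.personType.toString, h.person.firstname, h.person.lastname)
}
val df = sc.parallelize(rows).toDF()
df.printSchema() // personType is now a plain string column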