Say in my code I have a static construct mapping a Spark RDD element into something:
val btRdd = rdd.map {
  case Row(
    field1: String,
    field2: String,
    field3: String,
    field4: String,
    field5: Double
  ) =>
    // ... build the target value from the five typed fields
}
How can I make this dynamic, so that I can handle an RDD of arbitrary structure in terms of the number and types of its fields?
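One possible direction (a sketch, not a definitive answer): Row exposes its values as a Seq, so you can walk the fields generically instead of pattern-matching a fixed arity. The rdd name and the per-field handling below are assumptions; replace them with whatever your job actually needs.

```scala
import org.apache.spark.sql.Row

// Sketch: handle a Row of any arity by walking its values generically,
// instead of pattern-matching a fixed field list. Assumes `rdd` is an
// RDD[Row], as in the snippet above.
val genericRdd = rdd.map { row: Row =>
  row.toSeq.zipWithIndex.map { case (value, i) =>
    // `value` is Any here; if you need the types, inspect
    // row.schema(i).dataType and branch on it.
    s"field${i + 1}=$value"
  }
}
```

This trades compile-time type safety for flexibility: each value arrives as Any, so any type-specific logic has to dispatch on the schema (or on the runtime class) instead of on the match pattern.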
I am using DSE 5.1.0 (packaged with Spark 2.0.2.6 and Scala 2.11.8), and I am reading a Cassandra table as below.
val sparkSession = ...
val rdd1 = sparkSession.table("keyspace.table")
This table contains a List[String] column, say list1, which I read into rdd1. But when I try to use an encoder, it throws an error.
val myVoEncoder = Encoders.bean(classOf[MyVo])
val dataSet = rdd1.as(myVoEncoder)
I have tried with
scala.collection.mutable.List,
scala.collection.immutable.List,
scala.collection.List,
Seq, and
WrappedArray. All gave the same error as below.
java.lang.UnsupportedOperationException: Cannot infer type for class scala.collection.immutable.List because it is not bean-compliant
MyVo.scala
case class MyVo(
  @BeanProperty var id: String,
  @BeanProperty var duration: Int,
  @BeanProperty var list1: List[String]
) {
  def this() = this("", 0, null)
}
Any help will be appreciated.
You should use Array[String]:
import scala.beans.BeanProperty

case class MyVo(
  @BeanProperty var id: String,
  @BeanProperty var duration: Int,
  @BeanProperty var list1: Array[String]
) {
  def this() = this("", 0, null)
}
although it is important to stress that the more idiomatic approach would be:
import sparkSession.implicits._

case class MyVo(
  id: String,
  duration: Int,
  list1: Seq[String]
)

rdd1.as[MyVo]
I have the following code:
case class Person(name: String, age: Int = 0)
def extractFieldNames[A](implicit m: Manifest[A]): Array[String] =
m.runtimeClass.getDeclaredFields.map(_.getName)
extractFieldNames[Person]
// should return Array("name", "age")
But what if I want to exclude age because of its default parameter? How would I do this?
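One possible approach, as a sketch: it leans on a scalac encoding detail (treat this as an assumption about compiler internals), namely that for each constructor parameter with a default value, the compiler adds an `apply$default$N` method to the companion object, where N is the 1-based parameter position. You can cross-reference those positions against the declared fields. The function name is invented here, and the `Class.forName` lookup assumes Person is a top-level class (nested or REPL-defined classes have mangled names).

```scala
case class Person(name: String, age: Int = 0)

def extractFieldNamesWithoutDefaults[A](implicit m: Manifest[A]): Array[String] = {
  val clazz = m.runtimeClass
  // The companion object's class is the class name with a trailing "$".
  val companion = Class.forName(clazz.getName + "$")
  // Collect the 1-based positions of parameters that carry a default value.
  val defaulted = companion.getMethods.map(_.getName).collect {
    case name if name.startsWith("apply$default$") =>
      name.stripPrefix("apply$default$").toInt
  }.toSet
  // Keep only the fields whose constructor position has no default.
  clazz.getDeclaredFields.map(_.getName).zipWithIndex.collect {
    case (name, i) if !defaulted(i + 1) => name
  }
}

extractFieldNamesWithoutDefaults[Person]
```

One caveat: the JVM spec does not guarantee that getDeclaredFields returns fields in declaration order, although in practice it does for case classes; scala.reflect's runtime mirrors would be the more robust (if heavier) tool for this.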
I'm having an error
error: Cannot prove that (Int, String, String, String, String, Double, String) <:< (T, U).
}.collect.toMap
when executing my application, which contains the following code snippet.
val trains = sparkEnvironment.sc.textFile(dataDirectoryPath + "/trains.csv").map { line =>
  val fields = line.split(",")
  // format: (trainID,trainName,departure,arrival,cost,trainClass)
  (fields(0).toInt, fields(1), fields(2), fields(3), fields(4).toDouble, fields(5))
}.collect.toMap
What could be the cause, and can anyone suggest a solution?
If you want to call toMap on a Seq, you should have a Seq of Tuple2. The ScalaDoc of toMap states:
This method is unavailable unless the elements are members of Tuple2,
each ((T, U)) becoming a key-value pair in the map
So you should do:
val trains = sparkEnvironment.sc.textFile(dataDirectoryPath + "/trains.csv").map { line =>
  val fields = line.split(",")
  // format: (trainID,trainName,departure,arrival,cost,trainClass)
  (fields(0).toInt, // first element of Tuple2 -> "key"
   (fields(1), fields(2), fields(3), fields(4).toDouble, fields(5)) // 2nd element of Tuple2 -> "value"
  )
}.collect.toMap
such that your map statement returns an RDD[(Int, (String, String, String, Double, String))].
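The same rule can be seen without Spark; here is a minimal plain-Scala illustration (the sample data is invented):

```scala
// toMap requires the collection's elements to be pairs (Tuple2).
val lines = Seq("1,TGV,Paris,Lyon,45.0,first", "2,ICE,Berlin,Munich,60.0,second")

val trains = lines.map { line =>
  val f = line.split(",")
  // key -> value builds a Tuple2, so toMap compiles
  f(0).toInt -> (f(1), f(2), f(3), f(4).toDouble, f(5))
}.toMap

// trains(1) == ("TGV", "Paris", "Lyon", 45.0, "first")
```

Had the lambda returned a flat 6-tuple instead, `toMap` would fail with exactly the `<:< (T, U)` error above, because a 6-tuple cannot be proven to be a pair.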
I'm new to Scala and I'm trying to do something like this in a clean way.
I have a method that takes in several optional parameters. I want to create a map and only add items to the map if the optional parameter has a value. Here's a dummy example:
def makeSomething(value1: Option[String], value2: Option[String], value3: Option[String]): SomeClass = {
  val someMap: Map[String, String] =
    value1.map(i => Map("KEY_1_NAME" -> i.toString)).getOrElse(Map())
}
In the case above, we're doing roughly what I want, but only for value1. I would want this done for all of the optional values and have them put into the map. I know I can do something brute-force:
def makeSomething(value1: Option[String], value2: Option[String], value3: Option[String]): SomeClass = {
// create Map
// if value1 has a value, add to map
// if value2 has a value, add to map
// ... etc
}
but I was wondering if Scala has any features that would help me clean this up.
Thanks in advance!
You can create a Map[String, Option[String]] and then use collect to remove empty values and "extract" the present ones from their wrapping Option:
def makeSomething(value1: Option[String], value2: Option[String], value3: Option[String]): SomeClass = {
  val someMap: Map[String, String] =
    Map("KEY1" -> value1, "KEY2" -> value2, "KEY3" -> value3)
      .collect { case (key, Some(value)) => key -> value }
  // ...
}
.collect is one possibility. Alternatively, use the fact that Option is easily convertible to a Seq:
(value1.map("KEY1" -> _) ++
 value2.map("KEY2" -> _) ++
 value3.map("KEY3" -> _)).toMap
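A quick check of both approaches with sample inputs (the values here are invented for illustration):

```scala
val v1 = Some("a")
val v2: Option[String] = None
val v3 = Some("c")

// collect keeps only the entries whose value is defined
val viaCollect: Map[String, String] =
  Map("KEY1" -> v1, "KEY2" -> v2, "KEY3" -> v3)
    .collect { case (key, Some(value)) => key -> value }

// ++ treats each Option as a 0- or 1-element collection
val viaConcat: Map[String, String] =
  (v1.map("KEY1" -> _) ++ v2.map("KEY2" -> _) ++ v3.map("KEY3" -> _)).toMap

// both yield Map("KEY1" -> "a", "KEY3" -> "c")
```

The two versions are equivalent; the collect form keeps all the keys in one literal, while the concatenation form avoids building the intermediate Map[String, Option[String]].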
I have the following condition
if (map.contains(clazz)) {
// .....
}
Where map is defined as Map[Clazz,String]
And Clazz is defined as
case class Clazz(key: String, field1: String, field2: String)
However, the key field alone identifies the object, so the comparison of field1 and field2 is redundant. How to optimize the contains statement?
A simple and straightforward way, instead of overriding equals and hashCode:
Redefine your Clazz with a second parameter list, like below.
case class Clazz(key: String)(field1: String, field2: String)
This way the generated equals and hashCode methods take only key into consideration, and field1 and field2 are ignored for the equality check.
That means key uniquely determines the Clazz instance, which is what you want.
Now the contains check will internally use only key for the equality check.
Scala REPL
scala> case class Clazz(key: String)(field1: String, field2: String)
defined class Clazz
scala> val map = Map(Clazz("foo")("apple", "ball") -> "foo", Clazz("bar")("cat", "bat") -> "bar")
map: Map[Clazz, String] = Map(Clazz(foo) -> foo, Clazz(bar) -> bar)
scala> map contains Clazz("foo")("moo", "zoo")
res2: Boolean = true
scala> map contains Clazz("bar")("moo", "zoo")
res3: Boolean = true
scala> map contains Clazz("boo")("moo", "zoo")
res4: Boolean = false
Another way is to just override equals and hashCode:
case class Clazz(key: String, field1: String, field2: String) {
  override def equals(that: Any) = that match {
    case that: Clazz => that.key.equals(key)
    case _ => false
  }
  override def hashCode = key.hashCode
}
A third way, the least recommended, is to maintain a Map[String, Clazz] from key to Clazz.
Now you can do the contains check like below.
val keyMap = map.map { case (clazz, _) => clazz.key -> clazz }
keyMap.contains(clazz.key)
When the check succeeds, you can get the value with the following code.
map.get(keyMap(clazz.key)) // this gives an Option[String]
You can redefine equals and hashCode:
case class Clazz(key: String, field1: String, field2: String) {
override def equals(that: Any) = that match {
case that: Clazz => that.key.equals(key)
case _ => false
}
override def hashCode = key.hashCode
}
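A quick, self-contained check of how key-only equality behaves in a map lookup (sample values invented):

```scala
// Equality and hashing based on key alone, as in the answer above.
case class Clazz(key: String, field1: String, field2: String) {
  override def equals(that: Any) = that match {
    case that: Clazz => that.key.equals(key)
    case _ => false
  }
  override def hashCode = key.hashCode
}

val map = Map(Clazz("k1", "a", "b") -> "v1")

// contains now consults only the key, so differing fields don't matter
map.contains(Clazz("k1", "x", "y"))  // true
map.contains(Clazz("k2", "a", "b"))  // false
```

Note that, unlike the multi-parameter-list trick, overriding equals and hashCode on a case class breaks the usual case-class contract (e.g. `copy` and pattern matching still see all three fields while equality ignores two of them), so it is worth documenting the override clearly.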