Different objects using updateStateByKey or mapWithState in Scala Spark

I have a DStream whose entries look like this:
(315, Object1(32, 100, 54))
(315, Object1(29, 121, 51))
(315, Object1(62, 89, 39))
(341, Object1(54, 90, 46))
(341, Object1(55, 94, 48))
...
where Object1 has a structure that looks something like:
class Object1(minutes: Int, temperature: Int, pressure: Int)
Now, I would like to create a second object called Object2 where I store aggregated information about entries in Object1. It would have a structure like the following:
class Object2(minMinutes: Int, maxMinutes: Int, avgTemperature: Double, sumPressure: Int)
This second object would be stored in another DStream of (key, Object2) pairs. Printed, it would look like:
(315, Object2(29, 62, 103.3, 144))
(341, Object2(54, 55, 92.0, 94))
What I would like to know is whether I can use updateStateByKey or mapWithState to go from the DStream of Object1 to the DStream of Object2, and if so, how?
Thank you very much!
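A minimal sketch of how mapWithState could drive this aggregation, assuming Object1 and Object2 are case classes as described above and that checkpointing is enabled on the StreamingContext; the Agg helper and updateFunc below are illustrative names, not part of the original code:

import org.apache.spark.streaming.{State, StateSpec}

case class Object1(minutes: Int, temperature: Int, pressure: Int)
case class Object2(minMinutes: Int, maxMinutes: Int, avgTemperature: Double, sumPressure: Int)

// Running per-key state: the aggregate plus a count, so the average can be updated incrementally.
case class Agg(minMin: Int, maxMin: Int, tempSum: Long, count: Long, pressureSum: Int) {
  def toObject2: Object2 = Object2(minMin, maxMin, tempSum.toDouble / count, pressureSum)
}

// Called for each (key, value) record in a batch; State holds the running aggregate for that key.
def updateFunc(key: Int, value: Option[Object1], state: State[Agg]): Option[(Int, Object2)] =
  value.map { o =>
    val next = state.getOption() match {
      case Some(a) => Agg(math.min(a.minMin, o.minutes), math.max(a.maxMin, o.minutes),
                          a.tempSum + o.temperature, a.count + 1, a.pressureSum + o.pressure)
      case None    => Agg(o.minutes, o.minutes, o.temperature, 1L, o.pressure)
    }
    state.update(next)
    (key, next.toObject2)
  }

// input: DStream[(Int, Object1)]; stateful transformations require ssc.checkpoint(...).
// val aggregated: DStream[(Int, Object2)] =
//   input.mapWithState(StateSpec.function(updateFunc _)).flatMap(_.toSeq)

updateStateByKey would work similarly, with an update function of type (Seq[Object1], Option[Agg]) => Option[Agg]; the difference is that it recomputes the state for every key in every batch, whereas mapWithState only touches keys that appear in the current batch.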

Related

Using pureconfig to read tuples encoded as a list or array

I'd like to use the PureConfig library to read a list of tuples. That is, I'd like to read
{
  foo: [
    [1, 1.0, "Good job!"]
    [2, 2.0, "Excellent job!"]
    [3, 0.0, "Please play again."]
  ]
}
as a
case class Wrapper(foo: Vector[(Int, Double, String)])
A PureConfig issue from 2018 is tantalizing:
Think of tuples as containers instead; a pair holds two values with possibly different types and can be loaded as an array from a config, like [A, B]. You'll find plenty of use cases where you want to load such a structure but for some reason don't want to define a new object - for example, you can load a list of histogram counters (a List of (String, Int) entries) encoding it in the config as [[a, 1], [b, 3], [c, 2], ...].
but I haven't been able to figure out how to actually do it.
On the theory that I should walk before running, I've written the following to attempt to read just a single tuple, encoded as a list/array:
implicit val tReader: ConfigReader[(Int, Double, String)] = {
ConfigReader[Int].zip(ConfigReader[Double]).zip(ConfigReader[String]).map(x => (x._1._1, x._1._2, x._2))
}
case class Wrapper(foo: (Int, Double, String))
val mi: Result[Wrapper] =
ConfigSource.fromConfig(ConfigFactory.parseString("""foo: [5, 1.0, "hello" ]""")).load[Wrapper]
It fails when it encounters the list notation: [...]
Suggestions welcome.
The pureconfig-magnolia module does what you need. It's an alternative to the generic module, but unlike that module, it supports tuples. Once you have added it to your build, all you need to do is add the corresponding import and create a reader for your wrapper class.
import pureconfig.module.magnolia.auto.reader.exportReader
case class Wrapper(foo: (Int, Double, String))
implicit val wrapperReader: ConfigReader[Wrapper] =
ConfigReader.exportedReader
val mi: Result[Wrapper] =
ConfigSource
.fromConfig(ConfigFactory.parseString("""foo: [5, 1.0, "hello" ]"""))
.load[Wrapper]
You can find this module's README and tests here: https://github.com/pureconfig/pureconfig/tree/master/modules/magnolia
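For reference, the module might be added to an sbt build roughly like this; the group id follows pureconfig's other modules and the version string is a placeholder, so check the README above for the version matching your pureconfig release:

// build.sbt (version is a placeholder)
libraryDependencies += "com.github.pureconfig" %% "pureconfig-magnolia" % "<same version as pureconfig>"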

How to convert a String to a list of ascii values with map in Scala?

I am taking a course on functional programming at university, and I am trying to solve a problem for an assignment. I have implemented a LinkedList in Scala, and one of the functions I have implemented for that list is map, with the following signature:
override def map(mapFunc: Int => Int): IntList
Now I have to write a function called encode which uses this map function in order to convert a String to a list of ASCII values of each character. For instance:
"abcdef" should return: SinglyLinkedIntList(97, 98, 99, 100, 101, 102)
My attempt so far is:
def encode(s: String): IntList = {
var a = "abcdef".map(_.toByte)
}
But this just converts the string to a standard Scala collection of bytes, not an IntList. I am not sure how to continue from here. Could anyone help me understand how a problem like this could be solved using functional principles? Thanks in advance!
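For what it's worth, the ASCII codes themselves are easy to get with the standard library; building the assignment's own list type is then a matter of folding over the characters. The EmptyIntList and SinglyLinkedIntList(head, tail) names below are assumptions about the assignment's API, not the actual one:

// Standard library only: each Char converts directly to its integer code.
val codes: List[Int] = "abcdef".map(_.toInt).toList  // List(97, 98, 99, 100, 101, 102)

// Hypothetical version for the assignment's list type, assuming a cons-style
// SinglyLinkedIntList(head, tail) constructor and an EmptyIntList object:
// def encode(s: String): IntList =
//   s.foldRight(EmptyIntList: IntList)((c, acc) => SinglyLinkedIntList(c.toInt, acc))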

Spark 2.1.0 UDF Schema type not supported

I am using a data type called Point(x: Double, y: Double). I am trying to use columns _c1 and _c2 as input to Point(), and then create a new column of Point values as follows:
val toPoint = udf{(x: Double, y: Double) => Point(x,y)}
Then I call the function:
val point = data.withColumn("Point", toPoint(wanted("c1"), wanted("c2")))
However, when I declare the udf I get the following error:
java.lang.UnsupportedOperationException: Schema for type com.vividsolutions.jts.geom.Point is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:733)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:729)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$2.apply(ScalaReflection.scala:728)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:728)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:671)
at org.apache.spark.sql.functions$.udf(functions.scala:3084)
... 48 elided
I have properly imported this data type and used it many times before. Now that I try to include it in the schema of my UDF, it isn't recognized. What is the method for including types other than the standard Int, String, Array, etc.?
I am using Spark 2.1.0 on Amazon EMR.
Here some related questions I've referenced:
How to define schema for custom type in Spark SQL?
Spark UDF error - Schema for type Any is not supported
You should define Point as a case class
case class Point(x: Double, y: Double)
or if you wish
case class MyPoint(x:Double,y:Double) extends com.vividsolutions.jts.geom.Point(x,y)
This way the schema is inferred automatically by Spark.
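A minimal sketch with the plain case class, for which Spark derives a struct schema (x: Double, y: Double) automatically; the data frame and column names below follow the question and may differ in your code:

import org.apache.spark.sql.functions.{col, udf}

// Plain case class: Spark can build a struct schema for Product types like this.
case class Point(x: Double, y: Double)

val toPoint = udf { (x: Double, y: Double) => Point(x, y) }
val withPoint = data.withColumn("Point", toPoint(col("_c1"), col("_c2")))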

How to flatten an array and convert it into a string?

I have an array in Scala, e.g. Array("11","112","as2"), and I want to turn it into a string:
"['11','112','as2']"
How can I do it in Scala?
mkString is one way to go. The Scala REPL is great for this sort of thing:
scala> Array("11","112","as2")
res0: Array[String] = Array(11, 112, as2)
scala> "['"+res0.mkString("','")+"']"
res1: String = ['11','112','as2']
But if you are creating JSON, perhaps something from your JSON library would be more appropriate?
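For this exact output, mkString's three-argument form (start, separator, end) also keeps it to a single expression:

// Quote each element, then join with a start, separator, and end string.
Array("11", "112", "as2").map(s => s"'$s'").mkString("[", ",", "]")  // ['11','112','as2']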

How to specify salat DAO model for nested list of mixed type?

I have data coming back from MongoDB that looks like this:
> db.foo.findOne()
[
  {
    "_id" : "some string",
    "bar" : [
      [
        14960265,
        0.5454545454545454
      ],
      [
        30680,
        0.36363636363636365
      ],
      [
        12852625,
        0.09090909090909091
      ]
    ],
  }
]
The bar property contains a list of unknown size, each item of which is a list of length two containing an Int and a Double. In Scala, I would represent this as List[(Int, Double)].
How would I write the model for this structure to use with Salat?
Salat doesn't do tuples, so I tried:
case class FooEntry(a: Int, b: Double)
case class Foo(_id: String, bar: List[FooEntry])
but got:
java.lang.IllegalArgumentException: BasicBSONList can only work with
numeric keys, not: [a]
Also tried:
case class Foo(_id: String, sps: List[Any])
but got:
java.lang.ClassCastException: com.mongodb.BasicDBList cannot be cast
to scala.collection.immutable.List
Obviously, the data could be stored in a better form, with an object instead of the length-two arrays. But given that's what I've got, is there a good way to use Salat to deserialize it? Thanks!
Salat project lead here. No matter what your data structure, you would need to specify a type for the list. Salat doesn't support tuples yet, and while Salat supports polymorphic collections (this requires type hints!), it doesn't support lists of heterogeneous types like yours.
Can you restructure your data so that the array members are not lists but instead
[
{x: 123, y: 123.0},
{x: 456, y: 456.0}
]
Then you could use
case class Bar(x: Long, y: Double)
case class Foo(_id: String, sps: List[Bar])
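With the restructured document, a round trip through Salat's grater might then look like this sketch, assuming the default global context:

import com.novus.salat._
import com.novus.salat.global._

// Serialize a Foo to a DBObject and read it back.
val dbo = grater[Foo].asDBObject(Foo("some string", List(Bar(123, 123.0))))
val foo = grater[Foo].asObject(dbo)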
Alternatively, consider trying to use Miles Sabin's Shapeless project or Alois Cochard's Sherpa project to deserialize your data.