Scala: How to convert a Seq[Array[String]] into Seq[Double]? - scala

I need to split up the data in Seq[Array[String]] type into two Seq[Double] type items.
Sample data : ([4.0|1492168815],[11.0|1491916394],[2.0|1491812028]).
I used
var action1, timestamp1 = seq.map(t =>
(t.split("|"))).flatten.asInstanceOf[Seq[Double]]
but didn't get the results as expected. Looking out for valuable suggestions.

Assuming your input is in format "[double1|double2]",
scala> Seq("[4.0|1492168815]","[11.0|1491916394]","[2.0|1491812028]")
res72: Seq[String] = List([4.0|1492168815], [11.0|1491916394], [2.0|1491812028])
Drop the leading [ and trailing ], then split on \\|. The pipe has to be escaped because | is a metacharacter in regex.
scala> res72.flatMap {_.dropRight(1).drop(1).split("\\|").toList}.map{_.toDouble}
res74: Seq[Double] = List(4.0, 1.492168815E9, 11.0, 1.491916394E9, 2.0, 1.491812028E9)
Or you can do
scala> val actTime = seq.flatMap(t => t.map(x => { val temp = x.split("\\|"); (temp(0), temp(1))}))
actTime: Seq[(String, String)] = List((4.0,1492168815), (11.0,1491916394), (2.0,1491812028))
And to separate them into two Seq[Double] you can do
scala> val action1 = actTime.map(_._1.toDouble)
action1: Seq[Double] = List(4.0, 11.0, 2.0)
scala> val timestamp1 = actTime.map(_._2.toDouble)
timestamp1: Seq[Double] = List(1.492168815E9, 1.491916394E9, 1.491812028E9)
If there could be non-double data in input, you should use Try for safer Double conversion,
scala> Seq("[4.0|1492168815]","[11.0|1491916394]","[2.0|1491812028]", "[abc|abc]")
res75: Seq[String] = List([4.0|1492168815], [11.0|1491916394], [2.0|1491812028], [abc|abc])
scala> import scala.util.Success
import scala.util.Success
scala> import scala.util.Try
import scala.util.Try
scala> res75.flatMap {_.dropRight(1).drop(1).split("\\|").toList}
.map{d => Try(d.toDouble)}
.collect {case Success(x) => x }
res83: Seq[Double] = List(4.0, 1.492168815E9, 11.0, 1.491916394E9, 2.0, 1.491812028E9)
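The pieces above can be combined into one pass that returns the two columns directly. This is only a sketch: it assumes each entry holds exactly two fields separated by |, and the helper names parsePair and splitColumns are made up for illustration. Entries that do not parse are silently skipped.

```scala
import scala.util.Try

// Hypothetical helper: parse one "[a|b]" (or "a|b") entry into a pair of doubles.
def parsePair(s: String): Option[(Double, Double)] =
  s.stripPrefix("[").stripSuffix("]").split("\\|") match {
    case Array(a, t) => Try((a.toDouble, t.toDouble)).toOption // None if not numeric
    case _           => None                                   // wrong number of fields
  }

// Hypothetical helper: flatten the arrays, keep the parsable entries, split into two columns.
def splitColumns(rows: Seq[Array[String]]): (Seq[Double], Seq[Double]) =
  rows.flatMap(_.toSeq).flatMap(s => parsePair(s)).unzip

val sampleRows = Seq(
  Array("[4.0|1492168815]", "[11.0|1491916394]"),
  Array("[2.0|1491812028]", "[abc|abc]")
)
val columns = splitColumns(sampleRows)
// columns._1 holds the actions, columns._2 the timestamps
```

Keeping the pair together until the final unzip means a bad value on either side drops the whole entry, so the two result sequences stay aligned.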

Extract each item in the input list with regular expression groups delimited with [, | and ],
val pat = "\\[(.*)\\|(.*)\\]".r
Hence if we suppose an input such as
val xs = List("[4.0|1492168815]","[11.0|1491916394]","[2.0|1491812028]")
consider
xs.map { v => val pat(a,b) = v; (a.toDouble, b.toLong) }.unzip
where we apply the regex defined in pat onto each item of the list, tuple each group for each item and finally unzip them so that we bisect the tuples into separate collections; viz.
(List(4.0, 11.0, 2.0),List(1492168815, 1491916394, 1491812028))
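Note that val pat(a,b) = v throws a MatchError when an item does not match the pattern. A possible variant, sketched here with made-up names and a made-up malformed entry, uses collect so that non-matching items are skipped instead. It still assumes that whatever does match is numeric.

```scala
// Same regex as above, under a different (illustrative) name.
val pairPattern = "\\[(.*)\\|(.*)\\]".r

// The last entry is deliberately malformed for the sake of the example.
val entries = List("[4.0|1492168815]", "[11.0|1491916394]", "not a pair")

// collect keeps only the items for which the regex extractor matches.
val parsedColumns: (List[Double], List[Long]) =
  entries.collect { case pairPattern(a, b) => (a.toDouble, b.toLong) }.unzip
```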

Related

Flatten a Seq of Maps to Map using Type polymorphism in Scala, Spark UDF

I have the following function that flattens a sequence of maps of string to double. How can I make type string to double generic?
val flattenSeqOfMaps = udf { values: Seq[Map[String, Double]] => values.flatten.toMap }
flattenSeqOfMaps: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(StringType,DoubleType,false),Some(List(ArrayType(MapType(StringType,DoubleType,false),true))))
I need something like,
val flattenSeqOfMaps[S,D] = udf { values: Seq[Map[S, D]] => values.flatten.toMap }
Thanks.
Edit 1:
I'm using spark 2.3. I am aware of higher order functions in spark 2.4
Edit 2: I got a bit closer. What do I need in place of f _ in val flattenSeqOfMaps = udf { f _}. Please compare joinMap type signature and flattenSeqOfMaps type signature below
scala> val joinMap = udf { values: Seq[Map[String, Double]] => values.flatten.toMap }
joinMap: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(StringType,DoubleType,false),Some(List(ArrayType(MapType(StringType,DoubleType,false),true))))
scala> def f[S,D](values: Seq[Map[S, D]]): Map[S,D] = { values.flatten.toMap}
f: [S, D](values: Seq[Map[S,D]])Map[S,D]
scala> val flattenSeqOfMaps = udf { f _}
flattenSeqOfMaps: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(NullType,NullType,true),Some(List(ArrayType(MapType(NullType,NullType,true),true))))
Edit 3: the following code worked for me.
scala> val flattenSeqOfMaps = udf { f[String,Double] _}
flattenSeqOfMaps: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(StringType,DoubleType,false),Some(List(ArrayType(MapType(StringType,DoubleType,false),true))))
While you could define your function as
import scala.reflect.runtime.universe.TypeTag
def flattenSeqOfMaps[S : TypeTag, D: TypeTag] = udf {
values: Seq[Map[S, D]] => values.flatten.toMap
}
and then use specific instances:
val df = Seq(Seq(Map("a" -> 1), Map("b" -> 1))).toDF("val")
val flattenSeqOfMapsStringInt = flattenSeqOfMaps[String, Int]
df.select($"val", flattenSeqOfMapsStringInt($"val") as "val").show
+--------------------+----------------+
|                 val|             val|
+--------------------+----------------+
|[[a -> 1], [b -> 1]]|[a -> 1, b -> 1]|
+--------------------+----------------+
it is also possible to use built-in functions (available from Spark 2.4 onward), without any need for explicit generics:
import org.apache.spark.sql.functions.{expr, flatten, map_from_arrays}
def flattenSeqOfMaps_(col: String) = {
val keys = flatten(expr(s"transform(`$col`, x -> map_keys(x))"))
val values = flatten(expr(s"transform(`$col`, x -> map_values(x))"))
map_from_arrays(keys, values)
}
df.select($"val", flattenSeqOfMaps_("val") as "val").show
+--------------------+----------------+
|                 val|             val|
+--------------------+----------------+
|[[a -> 1], [b -> 1]]|[a -> 1, b -> 1]|
+--------------------+----------------+
The following code worked for me.
scala> def f[S,D](values: Seq[Map[S, D]]): Map[S,D] = { values.flatten.toMap}
f: [S, D](values: Seq[Map[S,D]])Map[S,D]
scala> val flattenSeqOfMaps = udf { f[String,Double] _}
flattenSeqOfMaps: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(StringType,DoubleType,false),Some(List(ArrayType(MapType(StringType,DoubleType,false),true))))
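Independently of Spark, it may help to see what the body values.flatten.toMap does when the same key appears in more than one map. The following plain Scala sketch (no Spark involved; the name flattenMaps is only illustrative) shows that the last occurrence of a key wins.

```scala
// Generic version of the function body used in the UDFs above.
def flattenMaps[K, V](values: Seq[Map[K, V]]): Map[K, V] =
  values.flatten.toMap // flatten yields (key, value) pairs; toMap keeps the last pair per key

val merged = flattenMaps(Seq(Map("a" -> 1.0, "b" -> 2.0), Map("b" -> 3.0)))
// "b" maps to 3.0 because the second map comes later in the sequence
```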

How to flatten a sequence of cats' ValidatedNel values

I need to flatten a sequence of cats.data.ValidatedNel[E, T] values to a single ValidatedNel value:
val results: Seq[cats.data.ValidatedNel[E, T]] = ???
val flattenedResult: cats.data.ValidatedNel[E, T]
I can do it like this:
import cats.std.list._, cats.syntax.cartesian._
results.reduce(_ |@| _ map { case _ => validatedValue })
but I wonder whether a predefined library method exists.
It depends on how you want to combine them (what is validatedValue in your question ?)
import cats.data.{Validated, ValidatedNel}
import cats.implicits._
val validations1 = List(1.validNel[String], 2.valid, 3.valid)
val validations2 = List(1.validNel[String], "kaboom".invalidNel, "boom".invalidNel)
If you want to combine the Ts, you can use Foldable.combineAll which uses a Monoid[T] :
val valSum1 = validations1.combineAll
// Valid(6)
val valSum2 = validations2.combineAll
// Invalid(OneAnd(kaboom,List(boom)))
If you want to get a ValidatedNel[String, List[T]], you can use Traverse.sequence :
val valList1: ValidatedNel[String, List[Int]] = validations1.sequence
// Valid(List(1, 2, 3))
val valList2: ValidatedNel[String, List[Int]] = validations2.sequence
// Invalid(OneAnd(kaboom,List(boom)))
If you don't care about the result, which seems to be the case, you can use Foldable.sequence_.
val result1: ValidatedNel[String, Unit] = validations1.sequence_
// Valid(())
val result2: ValidatedNel[String, Unit] = validations2.sequence_
// Invalid(OneAnd(kaboom,List(boom)))
validations1.sequence_.as(validatedValue) // as(x) is equal to map(_ => x)
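For readers without cats on the classpath, the accumulating behaviour of sequence can be imitated with the standard library. This is a stand-in, not the cats implementation: it swaps ValidatedNel for Either, and the name sequenceAll is made up for the example.

```scala
// Collect every error if there is at least one, otherwise collect every value.
def sequenceAll[E, T](results: Seq[Either[E, T]]): Either[List[E], List[T]] = {
  val errors = results.collect { case Left(e) => e }.toList
  if (errors.nonEmpty) Left(errors)
  else Right(results.collect { case Right(t) => t }.toList)
}

val allValid: Seq[Either[String, Int]]  = Seq(Right(1), Right(2), Right(3))
val someBroken: Seq[Either[String, Int]] = Seq(Right(1), Left("kaboom"), Left("boom"))

val sequencedValid  = sequenceAll(allValid)   // all values, in order
val sequencedBroken = sequenceAll(someBroken) // all errors, in order
```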

Better way to iterate on a collection and find multiple values instead of 1

I have the following use case, in which I am iterating multiple times on the same collection, and every time I find a different item in that collection.
class Foo(policyToData: Map[String, MyClass]){
val milk: Option[MyClass] = policyToData.values.find(_.`type` == Milk)
val meat: Option[MyClass] = policyToData.values.find(_.`type` == Meat)
val bread: Option[MyClass] = policyToData.values.find(_.`type` == Bread)
val other: List[MyClass] = policyToData.values.filter(_.`type` == Other).toList
}
Is there a better way to do it? with one iteration?
If it's a large collection, folding into a map means you only build the collection of interest.
scala> case class C(name: String)
defined class C
scala> val cs = List(C("milk"),C("eggs"),C("meat"))
cs: List[C] = List(C(milk), C(eggs), C(meat))
scala> cs.foldLeft(Map.empty[String,C]) {
| case (m, c @ C("milk" | "meat")) if !m.contains(c.name) => m + (c.name -> c)
| case (m, _) => m }
res5: scala.collection.immutable.Map[String,C] = Map(milk -> C(milk), meat -> C(meat))
then
scala> val milk = res5("milk")
milk: C = C(milk)
scala> val bread = res5.get("bread")
bread: Option[C] = None
The original groupBy solution was deleted because someone commented that it does extra work, but in fact it's a straightforward expression, if creating the intermediate Map of Lists is OK.
scala> cs.groupBy(_.name)
res0: scala.collection.immutable.Map[String,List[C]] = Map(meat -> List(C(meat)), eggs -> List(C(eggs)), milk -> List(C(milk)))
scala> res0.get("milk").map(_.head)
res1: Option[C] = Some(C(milk))
scala> res0.get("bread").map(_.head)
res2: Option[C] = None
or
scala> cs.filter { case C("milk" | "meat") => true case _ => false }.groupBy(_.name)
res4: scala.collection.immutable.Map[String,List[C]] = Map(meat -> List(C(meat)), milk -> List(C(milk)))
groupBy will do it:
val byType = list.groupBy(_.`type`).withDefaultValue(Nil)
val milk = byType(Milk).headOption
val other = byType(Other)
Etc ...
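Applied to the class from the question, the groupBy approach might look roughly like this. The types Kind, Item and Pantry are hypothetical stand-ins for the question's type values, MyClass and Foo, since their definitions were not shown.

```scala
// Hypothetical stand-ins for the types in the question.
sealed trait Kind
case object Milk  extends Kind
case object Meat  extends Kind
case object Bread extends Kind
case object Other extends Kind

final case class Item(name: String, kind: Kind)

final class Pantry(policyToData: Map[String, Item]) {
  // Group once; every lookup below reuses this map instead of rescanning the values.
  private val byKind: Map[Kind, List[Item]] =
    policyToData.values.toList.groupBy(_.kind).withDefaultValue(Nil)

  val milk: Option[Item]  = byKind(Milk).headOption
  val meat: Option[Item]  = byKind(Meat).headOption
  val bread: Option[Item] = byKind(Bread).headOption
  val other: List[Item]   = byKind(Other)
}

val pantry = new Pantry(Map(
  "p1" -> Item("whole milk", Milk),
  "p2" -> Item("napkins", Other),
  "p3" -> Item("soap", Other)
))
```

withDefaultValue(Nil) makes a lookup for a kind that never occurred return an empty list rather than throw.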

Convert Any to Double using asInstanceOf?

Is there a supported way to convert any numeric type to a Double? E.g.
val i = 12345
val f = 1234.5F
val d = 1234.5D
val arr = Array[Any](i,f,d)
val anotherD = arr(0).asInstanceOf[Numeric].toDouble
Naturally the above code is not correct as given - since Numeric requires Type arguments.
scala> val i = 12345
i: Int = 12345
scala> val f = 1234.5F
f: Float = 1234.5
scala> val d = 1234.5D
d: Double = 1234.5
scala> val arr = Array[Any](i,f,d)
arr: Array[Any] = Array(12345, 1234.5, 1234.5)
scala> val anotherD = arr(0).asInstanceOf[Numeric].toDouble
<console>:11: error: type Numeric takes type parameters
val anotherD = arr(0).asInstanceOf[Numeric].toDouble
Now I realize the above may be achieved via match/case , along the following lines:
(a, e) match {
case (a : Double, e : Double) =>
Math.abs(a - e) <= CompareTol
case (a : Float, e : Float) =>
Math.abs(a - e) <= CompareTol
.. etc
But I was wondering if there were a means to more compactly express the operation. This code is within TEST classes and efficiency is not an important criterion. Specifically: reflection calls are OK. Thanks.
I assume you are on the JVM. The java.lang.Number class does what you want through its doubleValue method:
val arr = Array[Number](i,f,d)
val ds = arr.map(_.doubleValue())
This is horrible, and probably not efficient, but it works (on your example) :p
scala> import scala.language.reflectiveCalls
import scala.language.reflectiveCalls
scala> arr.map(_.asInstanceOf[{ def toDouble: Double }].toDouble)
res2: Array[Double] = Array(12345.0, 1234.5, 1234.5)
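If the array has to stay an Array[Any], a type pattern on java.lang.Number avoids both the per-type cases and the reflective call. This is a sketch that assumes the JVM, where numeric values stored as Any are boxed into Number subclasses; the name toDoubleOpt is made up for the example.

```scala
// Returns None for anything that is not a boxed numeric value (e.g. a String).
def toDoubleOpt(x: Any): Option[Double] = x match {
  case n: java.lang.Number => Some(n.doubleValue())
  case _                   => None
}

val mixed = Array[Any](12345, 1234.5F, 1234.5D, "not a number")
val asDoubles: List[Double] = mixed.toList.flatMap(x => toDoubleOpt(x))
```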

get first 2 values in a comma separated string

I am trying to get the first 2 values of a comma separated string in scala. For example
a,b,this is a test
How do I store the values a,b in 2 separate variables?
To keep it easy and clean.
KISS solution:
1. Use split for separation. Then use take, which is defined on all ordered sequences, to get the elements as needed:
scala> val res = "a,b,this is a test" split ',' take 2
res: Array[String] = Array(a, b)
2. Use pattern matching to set the variables:
scala> val Array(x,y) = res
x: String = a
y: String = b
Another solution, using a sequence pattern match in Scala:
Welcome to Scala version 2.11.2 (OpenJDK 64-Bit Server VM, Java 1.7.0_65).
Type in expressions to have them evaluated.
Type :help for more information.
scala> val str = "a,b,this is a test"
str: String = a,b,this is a test
scala> val Array(x, y, _*) = str.split(",")
x: String = a
y: String = b
scala> println(s"x = $x, y = $y")
x = a, y = b
Are you looking for the method split ?
"a,b,this is a test".split(',')
res0: Array[String] = Array(a, b, this is a test)
If you want only the first two values you'll need to do something like:
val splitted = "a,b,this is a test".split(',')
val (first, second) = (splitted(0), splitted(1))
There should be some regex options here.
scala> val s = "a,b,this is a test"
s: String = a,b,this is a test
scala> val r = "[^,]+".r
r: scala.util.matching.Regex = [^,]+
scala> r findAllIn s
res0: scala.util.matching.Regex.MatchIterator = non-empty iterator
scala> .toList
res1: List[String] = List(a, b, this is a test)
scala> .take(2)
res2: List[String] = List(a, b)
scala> val a :: b :: _ = res2
a: String = a
b: String = b
but
scala> val a :: b :: _ = (r findAllIn "a" take 2).toList
scala.MatchError: List(a) (of class scala.collection.immutable.$colon$colon)
... 33 elided
or if you're not sure there is a second item, for instance:
scala> val r2 = "([^,]+)(?:,([^,]*))?".r.unanchored
r2: scala.util.matching.UnanchoredRegex = ([^,]+)(?:,([^,]*))?
scala> val (a,b) = "a" match { case r2(x,y) => (x, Option(y)) }
a: String = a
b: Option[String] = None
scala> val (a,b) = s match { case r2(x,y) => (x, Option(y)) }
a: String = a
b: Option[String] = Some(b)
This is a bit nicer if records are long strings.
Footnote: the Option cases look nicer with a regex interpolator.
If your string is short, you may as well just use String.split and take the first two elements.
val myString = "a,b,this is a test"
val splitString = myString.split(',') // Scala adds a split-by-character method in addition to Java's split-by-regex
val a = splitString(0)
val b = splitString(1)
Another solution would be to use a regex to extract the first two elements. I think it's quite elegant.
val myString = "a,b,this is a test"
val regex = """(.*),(.*),.*""".r // all groups (in parentheses) will be extracted.
val regex(a, b) = myString // a="a", b="b"
Of course, you can tweak the regex to only allow non-empty tokens (or anything else you might need to validate) :
val regex = """(.+),(.+),.+""".r
Note that in my examples I assumed that the string always had at least two tokens. In the first example, you can test the length of the array if needed. The second one will throw a MatchError if the regex doesn't match the string.
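One way to handle the "fewer than two tokens" case without an exception is to pattern match on the split result and return an Option. This is a minimal sketch; the name firstTwo is made up, and the limit of 3 passed to split is just a choice to avoid splitting the rest of the string.

```scala
// Returns the first two comma-separated tokens, or None if there are fewer than two.
def firstTwo(s: String): Option[(String, String)] =
  s.split(",", 3) match { // limit 3: at most two splits, the remainder stays in one piece
    case Array(a, b, _*) => Some((a, b))
    case _               => None
  }

val both    = firstTwo("a,b,this is a test")
val onlyOne = firstTwo("a")
```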
I had originally proposed the following solution. I will leave it because it works and doesn't use any class formally marked as deprecated, but the Javadoc for StringTokenizer mentions that it is a legacy class and should no longer be used.
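import java.util.StringTokenizer
val myString = "a,b,this is a test"
val st = new StringTokenizer(myString, ",") // the string to tokenize comes first, then the delimiters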
val a = st.nextToken()
val b = st.nextToken()
// You could keep calling st.nextToken(), as long as st.hasMoreTokens is true