Problem with outputting map values in Scala

I have the following code snippet:
val map = new LinkedHashMap[String,String]
map.put("City","Dallas")
println(map.get("City"))
This outputs Some(Dallas) instead of just Dallas. What's the problem with my code?
Thank You

Use the apply method: it returns the String directly and throws a NoSuchElementException if the key is not found:
scala> import scala.collection.mutable.LinkedHashMap
import scala.collection.mutable.LinkedHashMap
scala> val map = new LinkedHashMap[String,String]
map: scala.collection.mutable.LinkedHashMap[String,String] = Map()
scala> map.put("City","Dallas")
res2: Option[String] = None
scala> map("City")
res3: String = Dallas
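And if the key is missing, apply throws, as noted above (a sketch continuing the same session; the exact stack trace will vary by Scala version):
scala> map("State")
java.util.NoSuchElementException: key not found: State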

It's not really a problem.
While Java's Map uses null to indicate that a key doesn't have an associated value, Scala's Map[A,B].get returns an Option[B], which can be Some[B] or None, and None plays a similar role to Java's null.
REPL session showing why this is useful:
scala> map.get("State")
res6: Option[String] = None
scala> map.get("State").getOrElse("Texas")
res7: String = Texas
Or the simple but not recommended get:
scala> map.get("City").get
res8: String = Dallas
scala> map.get("State").get
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:262)
Check the Option documentation for more goodies.
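For instance, you can also transform the result with map before falling back to a default (a small sketch in the spirit of the session above; res numbers will differ):
scala> map.get("City").map(_.toUpperCase)
res9: Option[String] = Some(DALLAS)
scala> map.get("State").map(_.toUpperCase).getOrElse("UNKNOWN")
res10: String = UNKNOWN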

There are two more ways you can handle Option results.
You can pattern match them:
scala> map.get("City") match {
| case Some(value) => println(value)
| case _ => println("found nothing")
| }
Dallas
Another neat approach, which appears somewhere in Programming in Scala, is to use foreach to process the result: if the result is a Some, its value is used; otherwise (if it's None), nothing happens:
scala> map.get("City").foreach(println)
Dallas
scala> map.get("Town").foreach(println)

Related

How do I write a Dataset encoder to support mapping a function to an org.apache.spark.sql.Dataset[String] in Scala Spark

Moving from Spark 1.6 to Spark 2.2* has brought the error “error: Unable to find encoder for type stored in a 'Dataset'. Primitive types (Int, String, etc)” when trying to apply a method to a dataset returned from querying a parquet table.
I have oversimplified my code to demonstrate the same error. The code queries a parquet file to return the following datatype:
'org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]'
I apply a function to extract a string and an integer, returning a string. This returns the following datatype: Array[String]
Next, I need to perform extensive manipulations requiring a separate function. In this test function, I try to append a string producing the same error as my detailed example.
I have tried some encoder examples and the use of 'case' but have not come up with a workable solution. Any suggestions/examples would be appreciated.
scala> var d1 = hive.executeQuery(st)
d1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cvdt35_message_id_d: string,
cvdt35_input_timestamp_s: decimal(16,5) ... 2 more fields]
val parseCVDP_parquet = (s: org.apache.spark.sql.Row) => s.getString(2).split("0x")(1) + "," + s.getDecimal(1)
scala> var d2 = d1.map(parseCVDP_parquet)
d2: org.apache.spark.sql.Dataset[String] = [value: string]
scala> d2.take(1)
20/03/25 19:01:08 WARN TaskSetManager: Stage 3 contains a task of very large size (131 KB). The
maximum recommended task size is 100 KB.
res10: Array[String] = Array(ab04006000504304,1522194407.95162)
scala> def dd(s:String){
| s + "some string"
| }
dd: (s: String)Unit
scala> var d3 = d2.map{s=> dd(s) }
<console>:47: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int,
String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support
for serializing other types will be added in future releases.
To distill the problem further, I believe this scenario (though I have not tried all possible solutions to it) can be simplified to the following code:
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String){
| s + "hi"
| }
f: (s: String)Unit
scala> var test2 = test.map{ s => f(s) }
<console>:42: error: Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are
supported by importing spark.implicits._ Support for serializing other types
will be added in future releases.
var test2 = test.map{ s => f(s) }
I have a solution, at least to my simplified problem (below).
I will be testing more...
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String): String = {
| val r = s + "hi"
| return r
| }
f: (s: String)String
scala> var test2 = test.rdd.map{ s => f(s) }
test2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at map at <console>:43
scala> test2.take(1)
res9: Array[String] = Array(just some wordshi)
The first solution does not work on my initial (production) data set; instead it produces the error "org.apache.spark.SparkException: Task not serializable" (interestingly, both are stored as the same data type, org.apache.spark.sql.Dataset[String] = [value: string], which I believe to be related). I included yet another solution for my test data set below that eliminates the initial encoder error and, as shown, actually works on my toy problem, but it does not scale up to the production data set. I am a bit confused as to exactly why my application is sidelined in the move from Spark 1.6 to 2.3, as I didn't have to make any special accommodations to my application for years and have run it successfully for calculations that most likely count in the trillions. Other explorations have included marking my method as Serializable, the @transient keyword, leveraging "org.apache.spark.serializer.KryoSerializer", writing my methods as functions, and changing all vars to vals (following related posts on Stack Overflow).
scala> import spark.implicits._
import spark.implicits._
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String): String = {
| val r = s + "hi"
| return r
| }
f: (s: String)String
scala> var d2 = test.map{s => f(s)}(Encoders.STRING)
d2: org.apache.spark.sql.Dataset[String] = [value: string]
scala> d2.take(1)
res0: Array[String] = Array(just some wordshi)
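Note that the root cause in the simplified example is that def f(s: String){ ... } uses procedure syntax and therefore returns Unit, for which Spark has no encoder. Once f is declared to return String and spark.implicits._ is in scope (it is imported automatically in spark-shell), the explicit Encoders.STRING argument should not even be needed, because the implicit Encoder[String] is resolved for you. A minimal sketch of that variant, reusing the test Dataset and the fixed f from above; output shown is from a comparable session:
scala> var d3 = test.map(s => f(s))
d3: org.apache.spark.sql.Dataset[String] = [value: string]
scala> d3.take(1)
res1: Array[String] = Array(just some wordshi)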

Why does play-json lose precision while reading/parsing?

In the following example (Scala 2.11 and play-json 2.13):
val j ="""{"t":2.2599999999999997868371792719699442386627197265625}"""
println((Json.parse(j) \ "t").as[BigDecimal].compare(BigDecimal("2.2599999999999997868371792719699442386627197265625")))
The output is -1. Shouldn't they be equal? On printing the parsed value, it prints a rounded-off value:
println((Json.parse(j) \ "t").as[BigDecimal]) gives 2.259999999999999786837179271969944
The problem is that by default play-json configures the Jackson parser with the MathContext set to DECIMAL128. You can fix this by setting the play.json.parser.mathContext system property to unlimited. For example, in a Scala REPL that would look like this:
scala> System.setProperty("play.json.parser.mathContext", "unlimited")
res0: String = null
scala> val j ="""{"t":2.2599999999999997868371792719699442386627197265625}"""
j: String = {"t":2.2599999999999997868371792719699442386627197265625}
scala> import play.api.libs.json.Json
import play.api.libs.json.Json
scala> val res = (Json.parse(j) \ "t").as[BigDecimal]
res: BigDecimal = 2.2599999999999997868371792719699442386627197265625
scala> val expected = BigDecimal("2.2599999999999997868371792719699442386627197265625")
expected: scala.math.BigDecimal = 2.2599999999999997868371792719699442386627197265625
scala> res.compare(expected)
res1: Int = 0
Note that setProperty should happen first, before any reference to Json. In normal (non-REPL) use you'd set the property via -D on the command line or in your launch configuration.
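For example, as a plain JVM flag (a sketch; app.jar is just a placeholder for your own artifact):
java -Dplay.json.parser.mathContext=unlimited -jar app.jar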
Alternatively you could use Jawn's play-json parsing support, which just works as expected off the shelf:
scala> val j ="""{"t":2.2599999999999997868371792719699442386627197265625}"""
j: String = {"t":2.2599999999999997868371792719699442386627197265625}
scala> import org.typelevel.jawn.support.play.Parser
import org.typelevel.jawn.support.play.Parser
scala> val res = (Parser.parseFromString(j).get \ "t").as[BigDecimal]
res: BigDecimal = 2.2599999999999997868371792719699442386627197265625
Or for that matter you could switch to circe:
scala> import io.circe.Decoder, io.circe.jawn.decode
import io.circe.Decoder
import io.circe.jawn.decode
scala> decode(j)(Decoder[BigDecimal].prepare(_.downField("t")))
res0: Either[io.circe.Error,BigDecimal] = Right(2.2599999999999997868371792719699442386627197265625)
…which handles a range of number-related corner cases more responsibly than play-json in my view. For example:
scala> val big = "1e2147483648"
big: String = 1e2147483648
scala> io.circe.jawn.parse(big)
res0: Either[io.circe.ParsingFailure,io.circe.Json] = Right(1e2147483648)
scala> play.api.libs.json.Json.parse(big)
java.lang.NumberFormatException
at java.math.BigDecimal.<init>(BigDecimal.java:491)
at java.math.BigDecimal.<init>(BigDecimal.java:824)
at scala.math.BigDecimal$.apply(BigDecimal.scala:287)
at play.api.libs.json.jackson.JsValueDeserializer.parseBigDecimal(JacksonJson.scala:146)
...
But that's out of scope for this question.
To be honest I'm not sure why play-json defaults to DECIMAL128 for the MathContext, but that's a question for the play-json maintainers, and is also out of scope here.

How to head DataFrame with Map[String,Long] column and preserve types?

I have a data frame on which I have applied a filter condition:
val colNames = customerCountDF
.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth)
Out of all the selected rows, I just want the last column of one row.
The last column type is Map[String, Long]. I want all the keys of the map as a List[String].
I tried the following syntax:
val colNames = customerCountDF
.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth)
.head
.getMap(14)
.keySet
.toList
.map(_.toString)
I am using map(_.toString) to convert a List[Nothing] to List[String]. The error that I am getting is:
missing parameter type for expanded function ((x$1) => x$1.toString)
[error] val colNames = customerCountDF.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth).head().getMap(14).keySet.toList.map(_.toString)
The df is as follows:
+-------------+-----+----------+-----------+------------+-------------+--------------------+--------------+--------+----------------+-----------+----------------+-------------+-------------+--------------------+
|division_name| low| call_type|fiscal_year|fiscal_month| region_name|abandon_rate_percent|answered_calls|connects|equiv_week_calls|equiv_weeks|equivalent_calls|num_customers|offered_calls| pv|
+-------------+-----+----------+-----------+------------+-------------+--------------------+--------------+--------+----------------+-----------+----------------+-------------+-------------+--------------------+
| NATIONAL|PHONE|CABLE CARD| 2016| 1|ALL DIVISIONS| 0.02| 10626| 0| 0.0| 0.0| 10649.8| 0| 10864|Map(subscribers_c...|
| NATIONAL|PHONE|CABLE CARD| 2016| 1| CENTRAL| 0.02| 3591| 0| 0.0| 0.0| 3598.6| 0| 3667|Map(subscribers_c...|
+-------------+-----+----------+-----------+------------+-------------+--------------------+--------------+--------+----------------+-----------+----------------+-------------+-------------+--------------------+
One row of just the last column selected is:
[Map(subscribers_connects -> 5521287, disconnects_hsd -> 7992, subscribers_xfinity home -> 6277491, subscribers_bulk units -> 4978892, connects_cdv -> 41464, connects_disconnects -> 16945, connects_hsd -> 32908, disconnects_internet essentials -> 10319, disconnects_disconnects -> 3506, disconnects_video -> 8960, connects_xfinity home -> 43012)]
I'd like to get the keys of the last column as List[String] after applying the filter condition and taking just one row from the data frame.
The type problem is easy to solve: explicitly specify the type parameters at the source, which is getMap(14). Since you know you are expecting a Map of String -> Long key-value pairs, just replace getMap(14) with getMap[String, Long](14).
As for getMap[String, Long](14) returning an empty Map, that has to do with your data: you simply have an empty map at index 14 in the head row.
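Applied to the original pipeline, that would look roughly like this (a sketch; the filter and the column index 14 are taken from the question):
val colNames = customerCountDF
  .filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth)
  .head
  .getMap[String, Long](14)
  .keySet
  .toList
No trailing .map(_.toString) is needed any more, because the keys are already typed as String.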
More Details
In Scala when you create a List[A], Scala infers the type by using the information available.
For example,
// Explicitly provide the type parameter info
scala> val l1: List[Int] = List(1, 2)
// l1: List[Int] = List(1, 2)
// Infer the type parameter by using the arguments passed to List constructor,
scala> val l2 = List(1, 2)
// l2: List[Int] = List(1, 2)
So, what happens when you create an empty list,
// Explicitly provide the type parameter info
scala> val l1: List[Int] = List()
// l1: List[Int] = List()
// Infer the type parameter by using the arguments passed to List constructor,
// but surprise, there are no arguments since you are creating an empty list
scala> val l2 = List()
// l2: List[Nothing] = List()
So, when Scala does not know anything, it chooses the most specific type it can find, which is the bottom type Nothing.
The same thing happens when you call toList on other collection objects: it tries to infer the type parameter from the source object.
scala> val ks1 = Map.empty[Int, Int].keySet
// ks1: scala.collection.immutable.Set[Int] = Set()
scala> val l1 = ks1.toList
// l1: List[Int] = List()
scala> val ks2 = Map.empty.keySet
// ks2: scala.collection.immutable.Set[Nothing] = Set()
scala> val l2 = ks2.toList
// l2: List[Nothing] = List()
Similarly, the getMap(14) that you called on the head Row of the DataFrame infers the type parameters for the Map using the values it gets from the Row at index 14. So, if it does not get anything at that index, the returned map will be the same as Map.empty, which is a Map[Nothing, Nothing].
Which means that your whole expression,
val colNames = customerCountDF.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth).head.getMap(14).keySet.toList.map(_.toString)
is equivalent to,
val colNames = Map.empty.keySet.toList.map(_.toString)
And hence,
scala> val l = List()
// l: List[Nothing] = List()
val colNames = l.map(_.toString)
To summarise the above, any List[Nothing] can only be an empty list.
Now, there are two problems: one is the type problem with List[Nothing], the other is that it is empty.
After the filter you can just select the column and get it as a Map, as below:
first().getAs[Map[String, Long]]("pv").keySet
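Put together with the filter from the question, that might look like this (a sketch):
val colNames = customerCountDF
  .filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth)
  .first()
  .getAs[Map[String, Long]]("pv")
  .keySet
  .toList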
Since you're accessing a single column only (at the 14th position), why not make your developers' lives a bit easier (and help the people who will support your code later)?
Try the following:
val colNames = customerCountDF
.where($"fiscal_year" === maxYear) // Split one long filter into two
.where($"fiscal_month" === maxMnth) // where is a SQL-like alias of filter
.select("pv") // Take just the field you need to work with
.as[Map[String, Long]] // Map it to the proper type
.head // Load just the single field (all others are left aside)
.keySet // That's just plain Scala
I think the above code says what it does in a very clear way (and I think it should be the fastest of the provided solutions, since it loads just the single pv field into a JVM object on the driver).
A workaround to get the final result as a List[String]. Check this out:
scala> val customerCountDF=Seq((2018,12,Map("subscribers_connects" -> 5521287L, "disconnects_hsd" -> 7992L, "subscribers_xfinity home" -> 6277491L, "subscribers_bulk units" -> 4978892L, "connects_cdv" -> 41464L, "connects_disconnects" -> 16945L, "connects_hsd" -> 32908L, "disconnects_internet essentials" -> 10319L, "disconnects_disconnects" -> 3506L, "disconnects_video" -> 8960L, "connects_xfinity home" -> 43012L))).toDF("fiscal_year","fiscal_month","mapc")
customerCountDF: org.apache.spark.sql.DataFrame = [fiscal_year: int, fiscal_month: int ... 1 more field]
scala> val maxYear =2018
maxYear: Int = 2018
scala> val maxMnth = 12
maxMnth: Int = 12
scala> val colNames = customerCountDF.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth).first.getMap(2).keySet.mkString(",").split(",").toList
colNames: List[String] = List(subscribers_connects, disconnects_hsd, subscribers_xfinity home, subscribers_bulk units, connects_cdv, connects_disconnects, connects_hsd, disconnects_internet essentials, disconnects_disconnects, disconnects_video, connects_xfinity home)

How to iterate records in Spark Scala?

I have a variable "myrdd" that is an Avro file with 10 records, loaded through hadoopFile.
When I do
myrdd.first._1.datum.getName()
I can get the name. Problem is, I have 10 records in "myrdd". When I do:
myrdd.map(x => {println(x._1.datum.getName())})
it does not work and prints out a weird object a single time. How can I iterate over all records?
Here is a log from a session using spark-shell with a similar scenario.
Given
scala> persons
res8: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> persons.first
res7: org.apache.spark.sql.Row = [Justin,19]
Your issue looks like this:
scala> persons.map(t => println(t))
res4: org.apache.spark.rdd.RDD[Unit] = MapPartitionsRDD[10]
so map just returns another RDD (the function is not applied immediately; it is applied "lazily" when you actually iterate over the result).
So when you materialize (using collect()) you get a "normal" collection:
scala> persons.collect()
res11: Array[org.apache.spark.sql.Row] = Array([Justin,19])
over which you can map. Note that in this case you have a side effect in the closure passed to map (the println, whose result is Unit):
scala> persons.collect().map(t => println(t))
[Justin,19]
res5: Array[Unit] = Array(())
Same result if collect is applied at the end:
scala> persons.map(t => println(t)).collect()
[Justin,19]
res19: Array[Unit] = Array(())
But if you just want to print the rows, you can simplify it to using foreach:
scala> persons.foreach(t => println(t))
[Justin,19]
As @RohanAletty has pointed out in a comment, this works for a local Spark job. If the job runs in a cluster, collect is required as well:
persons.collect().foreach(t => println(t))
Notes
The same behaviour can be observed in the Iterator class (see the sketch after these notes).
The output of the session above has been reordered.
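A quick way to see the same laziness with a plain Iterator, no Spark involved (a small sketch; the REPL echo of the iterator may differ by Scala version):
scala> val it = Iterator("Justin", "Michael").map(println)
it: Iterator[Unit] = non-empty iterator
scala> it.foreach(_ => ())
Justin
Michael
Nothing is printed until the iterator is actually consumed by foreach.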
Update
As for filtering: the location of collect is "bad" if you apply filters after collect that could have been applied before it.
For example these expressions give the same result:
scala> persons.filter("age > 20").collect().foreach(println)
[Michael,29]
[Andy,30]
scala> persons.collect().filter(r => r.getInt(1) >= 20).foreach(println)
[Michael,29]
[Andy,30]
but the 2nd case is worse, because that filter could have been applied before collect.
The same applies to any type of aggregation as well.
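For example, a count can be pushed before collect in the same way (a sketch against the same persons DataFrame):
persons.filter("age > 20").count()                 // the count runs on the cluster; only a Long comes back
persons.collect().count(r => r.getInt(1) > 20)     // every row is shipped to the driver, then counted locally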

What's the new way to iterate over a Java Map in Scala 2.8.0?

How does scala.collection.JavaConversions supersede the answers given in the Stack Overflow question Iterating over Java collections in Scala (it doesn't work because the "jcl" package is gone) and in Iterating over Map with Scala (it doesn't work for me in a complicated test which I'll try to boil down and post here later)?
The latter is actually a Scala Map question, but I think I need to know both answers in order to iterate over a java.util.Map.
In 2.8, you import scala.collection.JavaConversions._ and use the Java map as a Scala map. Here's an example (in 2.8.0.RC1):
scala> val jmap:java.util.Map[String,String] = new java.util.HashMap[String,String]
jmap: java.util.Map[String,String] = {}
scala> jmap.put("Hi","there")
res0: String = null
scala> jmap.put("So","long")
res1: String = null
scala> jmap.put("Never","mind")
res2: String = null
scala> import scala.collection.JavaConversions._
import scala.collection.JavaConversions._
scala> jmap.foreach(kv => println(kv._1 + " -> " + kv._2))
Hi -> there
Never -> mind
So -> long
scala> jmap.keys.map(_.toUpperCase).foreach(println)
HI
NEVER
SO
If you specifically want a Scala iterator, use jmap.iterator (after the conversions import).
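For example, to walk the entries explicitly through an iterator (a small sketch continuing the same session; the ordering and REPL echo may differ):
scala> val it = jmap.iterator
it: Iterator[(String, String)] = non-empty iterator
scala> it.foreach { case (k, v) => println(k + " -> " + v) }
Hi -> there
Never -> mind
So -> long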