How to convert a map's keys from String to Int? - scala

This is my initial RDD output
scala> results
scala.collection.Map[String,Long] = Map(4.5 -> 1534824, 0.5 -> 239125, 3.0 -> 4291193, 3.5 -> 2200156, 2.0 -> 1430997, 1.5 -> 279252, 4.0 -> 5561926,
rating -> 1, 1.0 -> 680732, 2.5 -> 883398, 5.0 -> 2898660)
I am removing the string key "rating" to keep only numeric keys.
scala> val resultsInt = results.filterKeys(_ != "rating")
resultsInt: scala.collection.Map[String,Long] = Map(4.5 -> 1534824, 0.5 -> 239125, 3.0 -> 4291193, 3.5 -> 2200156, 2.0 -> 1430997, 1.5 -> 279252, 4.0 -> 5561926, 1.0 -> 680732, 2.5 -> 883398, 5.0 -> 2898660)
Sorting the result gives the expected output, but I would like to convert the keys from String to a numeric type before sorting to get consistent output.
scala> val sortedOut2 = resultsInt.toSeq.sortBy(_._1)
sortedOut2: Seq[(String, Long)] = ArrayBuffer((0.5,239125), (1.0,680732), (1.5,279252), (2.0,1430997), (2.5,883398), (3.0,4291193), (3.5,2200156), (4.0,5561926), (4.5,1534824), (5.0,2898660))
I am new to Scala and just started writing my Spark program. Please share some insight on how to convert the keys of the Map.

Based on your sample output, I suppose you meant converting the keys to Double?
val results: scala.collection.Map[String, Long] = Map(
  "4.5" -> 1534824, "0.5" -> 239125, "3.0" -> 4291193, "3.5" -> 2200156,
  "2.0" -> 1430997, "1.5" -> 279252, "4.0" -> 5561926, "rating" -> 1,
  "1.0" -> 680732, "2.5" -> 883398, "5.0" -> 2898660
)

results.filterKeys(_ != "rating").
  map { case (k, v) => (k.toDouble, v) }.
  toSeq.sortBy(_._1)
res1: Seq[(Double, Long)] = ArrayBuffer((0.5,239125), (1.0,680732), (1.5,279252), (2.0,1430997),
(2.5,883398), (3.0,4291193), (3.5,2200156), (4.0,5561926), (4.5,1534824), (5.0,2898660))

To map between different types, you just need to use the map operator in Scala/Spark.
You can check the syntax here:
Convert a Map[String, String] to Map[String, Int] in Scala
The same approach works in both Spark and plain Scala.
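For illustration, a minimal sketch of that general pattern on a made-up map (the sample values are hypothetical):
// Convert both the key and the value type of a Map with a single map call.
val raw: Map[String, String] = Map("1" -> "10", "2" -> "20")
val converted: Map[Int, Long] = raw.map { case (k, v) => (k.toInt, v.toLong) }
// converted: Map(1 -> 10, 2 -> 20)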

Please see Scala - Convert keys from a Map to lower case?
The approach should be similar:
case class row(id: String, value: String)
val rddData = sc.parallelize(Seq(row("1", "hello world"), row("2", "hello there")))
rddData.map { currentRow => (currentRow.id.toInt, currentRow.value) }
// scala> org.apache.spark.rdd.RDD[(Int, String)]
Even if you didn't define a case class for the structure of the RDD and used something like a Tuple2 instead, you can just write
currentRow._1.toInt // instead of currentRow.id.toInt
There are a few ways to convert a String to an Int, so it is worth reading up on the options.
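For illustration, a minimal sketch of a few common ways to do the String-to-Int conversion (the sample value is hypothetical):
val s = "42"
s.toInt                          // throws NumberFormatException on bad input
s.toIntOption                    // Some(42), None on bad input (Scala 2.13+)
scala.util.Try(s.toInt).toOption // Some(42), also works on older Scala versions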
hope this helps! good luck :)

Distilling your RDD into a Map is legal, but it defeats the purpose of using Spark in the first place. If you are operating at scale, your current approach renders the RDD meaningless. If you aren't, then you can just do Scala collection manipulation as you suggest, but then why bother with the overhead of Spark at all?
I would instead operate at the DataFrame level of abstraction and transform that String column into a Double like this:
import org.apache.spark.sql.types.DoubleType
import sparkSession.implicits._

dataFrame
  .select("key", "value")
  .withColumn("key", 'key.cast(DoubleType))
And this is of course assuming that Spark didn't already recognize the key as a Double after setting inferSchema to true on the initial data ingest.
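For completeness, a minimal sketch of such an ingest, assuming a CSV source with a header row (the path is hypothetical):
val dataFrame = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true") // lets Spark infer numeric columns like key as Double
  .csv("/path/to/ratings.csv")   // hypothetical path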

If you are trying to filter out the non-numeric keys, you can just do the following:
import scala.util.{Try, Success, Failure}

(results map { case (k, v) =>
  Try(k.toFloat) match {
    case Success(x) => Some((x, v))
    case Failure(_) => None
  }
}).flatten
res1: Iterable[(Float, Long)] = List((4.5,1534824), (0.5,239125), (3.0,4291193), (3.5,2200156), (2.0,1430997), (1.5,279252), (4.0,5561926), (1.0,680732), (2.5,883398), (5.0,2898660))
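To get the sorted sequence the question asked for, you can then sort by the numeric key, for example:
val sorted = (results map { case (k, v) =>
  Try(k.toFloat).toOption.map(_ -> v)
}).flatten.toSeq.sortBy(_._1)
// Seq[(Float, Long)] ordered by the numeric rating key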

Related

How to read from a csv file to create a scala Map object?

I have a path to a CSV I'd like to read from. This CSV includes three columns: "Topic, Key, Value". I am using Spark to read this file as a CSV. The file looks like the following (lookupFile.csv):
Topic,Key,Value
fruit,aaa,apple
fruit,bbb,orange
animal,ccc,cat
animal,ddd,dog
// I'm reading the file as follows
val lookup = spark.read.option("delimiter", ",").option("header", "true").csv(lookupFile)
I'd like to take what I just read and return a map that has the following properties:
The map uses the topic as a key
The value of this map is a map of the "Key" and "Value" columns
My hope is that I would get a map that looks like the following:
val result = Map("fruit" -> Map("aaa" -> "apple", "bbb" -> "orange"),
"animal" -> Map("ccc" -> "cat", "ddd" -> "dog"))
Any ideas on how I can do this?
scala> val in = spark.read.option("header", true).option("inferSchema", true).csv("""Topic,Key,Value
| fruit,aaa,apple
| fruit,bbb,orange
| animal,ccc,cat
| animal,ddd,dog""".split("\n").toSeq.toDS)
in: org.apache.spark.sql.DataFrame = [Topic: string, Key: string ... 1 more field]
scala> val res = in.groupBy('Topic).agg(map_from_entries(collect_list(struct('Key, 'Value))).as("subMap"))
res: org.apache.spark.sql.DataFrame = [Topic: string, subMap: map<string,string>]
scala> val scalaMap = res.collect.map{
| case org.apache.spark.sql.Row(k : String, v : Map[String, String]) => (k, v)
| }.toMap
<console>:26: warning: non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure
case org.apache.spark.sql.Row(k : String, v : Map[String, String]) => (k, v)
^
scalaMap: scala.collection.immutable.Map[String,Map[String,String]] = Map(animal -> Map(ccc -> cat, ddd -> dog), fruit -> Map(aaa -> apple, bbb -> orange))
Read in your data:
val df1= spark.read.format("csv").option("inferSchema", "true").option("header", "true").load(path)
First put "Key,Value" into an array and group by Topic to get your target separated into a key part and a value part:
val df2 = df1.groupBy("Topic").agg(collect_list(array($"Key", $"Value")).as("arr"))
Now convert to a Dataset:
val ds = df2.as[(String, Seq[Seq[String]])]
Apply logic on the fields to get your map of maps, and collect:
val ds1 = ds.map(x => (x._1, x._2.map(y => (y(0), y(1))).toMap)).collect
Now your data is set up with the Topic as the key and the "Key,Value" pairs as the value, so apply toMap to get your result:
ds1.toMap
Map(animal -> Map(ccc -> cat, ddd -> dog), fruit -> Map(aaa -> apple, bbb -> orange))

Spark - provide extra parameter to udf

I'm trying to create a spark scala udf in order to transform MongoDB objects of the following shape:
Object:
"1": 50.3
"8": 2.4
"117": 1.0
Into Spark ml SparseVector.
The problem is that in order to create a SparseVector, I need one more input parameter - its size.
And in my app I keep the Vector sizes in a separate MongoDB collection.
So, I defined the following UDF function:
val mapToSparseVectorUdf = udf {
  (myMap: Map[String, Double], size: Int) => {
    val vb: VectorBuilder[Double] = new VectorBuilder(length = -1)
    vb.use(myMap.keys.map(key => key.toInt).toArray, myMap.values.toArray, size)
    vb.toSparseVector
  }
}
And I was trying to call it like this:
df.withColumn("VecColumn", mapToSparseVectorUdf(col("MapColumn"), vecSize)).drop("MapColumn")
However, my IDE says "Not applicable" to that udf call.
Is there a way to make this kind of UDF that can take an extra parameter?
UDF functions require columns to be passed as arguments, and the columns passed are converted to primitive data types through serialization and deserialization. That's why UDF functions are expensive.
If vecSize is an integer constant, then you can simply use the built-in lit function:
df.withColumn("VecColumn", mapToSparseVectorUdf(col("MapColumn"), lit(vecSize))).drop("MapColumn")
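For context, here is a minimal self-contained sketch of that approach, using Spark ML's Vectors.sparse instead of the Breeze VectorBuilder from the question (treat this as an illustration rather than a drop-in):
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, lit, udf}

// The UDF takes the vector size as a second column, supplied as a constant via lit.
val mapToSparseVectorUdf = udf { (myMap: Map[String, Double], size: Int) =>
  Vectors.sparse(size, myMap.toSeq.map { case (k, v) => (k.toInt, v) })
}

df.withColumn("VecColumn", mapToSparseVectorUdf(col("MapColumn"), lit(vecSize)))
  .drop("MapColumn")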
This will do it:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

def mapToSparseVectorUdf(vectorSize: Int) = udf[Vector, Map[String, Double]](
  (myMap: Map[String, Double]) => {
    val elements = myMap.toSeq.map { case (index, value) => (index.toInt, value) }
    Vectors.sparse(vectorSize, elements)
  }
)
Usage:
val data = spark.createDataFrame(Seq(
  ("1", Map("1" -> 50.3, "8" -> 2.4)),
  ("2", Map("2" -> 23.5, "3" -> 41.2))
)).toDF("id", "MapColumn")
data.withColumn("VecColumn", mapToSparseVectorUdf(10)($"MapColumn")).show(false)
NOTE:
Consider fixing your MongoDB schema! ;) The size is a member of a SparseVector; I wouldn't separate it from its elements.

How to create a Future of a map of a different type

I am using com.twitter.util.Future, scala 2.11.11
I have this piece of code that I'm trying to convert into a Future[Map[Long, String]]
val simpleMap: Map[Long, Int] = Map(1L -> 2, 2L -> 4)

val keyToNewFutureMap = Future.collect(simpleMap.map {
  case (key, value) =>
    val newFuture = getAFutureFromValue(value)
    key -> newFuture
}.toSeq.toMap)

val keyToFutureMap = Map(1L -> Future.value(1))
val futureMap = Future.collect(keyToFutureMap) // converts into a Future[Map[Long, Int]]

Future.collect(Seq(futureMap, keyToNewFutureMap)) // Stuck here
I'm stuck here. I wanted to use the returned maps from both Futures and generate a new map. The new map will contain unique keys that appear in both futureMap and keyToNewFutureMap.
keyToFutureMap is given in the form of a Map[Long, Future[Option[Int]]], which is why I used a collect to turn it into a Future[Map[Long, Int]]
Any help is most appreciated.
If I understood correctly, you want this:
val newFutureMap = Future.traverse(simpleMap) {
  case (key, value) =>
    getAFutureFromValue(value).map(key -> _)
}.map(_.toMap)
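If you then need to combine the two resulting maps, a sketch using a for-comprehension over both futures could look like this (keeping only keys present in both maps is an assumption about the intended semantics):
// Assumes futureMap and newFutureMap both yield a Map[Long, Int].
val combined = for {
  m1 <- futureMap
  m2 <- newFutureMap
} yield m1.keySet.intersect(m2.keySet).map(k => k -> (m1(k), m2(k))).toMap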

Convert query string to map in scala

I have a query string in this form:
val query = "key1=val1&key2=val2&key3=val3"
I want to create a map with the above key/value pairs. So far I'm doing it like this:
// creating an iterator with 2 values in each group; each group is a key/value pair
val pairs = query.split("&|=").grouped(2)
//inserting the key/value pairs into a map
val map = pairs.map { case Array(k, v) => k -> v }.toMap
Are there any problems with doing it like I do? If so, is there some library I could use to do it?
Here is an approach using the URLEncodedUtils:
import java.net.URI
import org.apache.http.client.utils.URLEncodedUtils
import org.apache.http.{NameValuePair => ApacheNameValuePair}
import scala.collection.JavaConverters._
import scala.collection.immutable.Seq
object GetEncodingTest extends App {
  val url = "?one=1&two=2&three=3&three=3a"
  val params = URLEncodedUtils.parse(new URI(url), "UTF-8")
  val convertedParams: Seq[ApacheNameValuePair] = collection.immutable.Seq(params.asScala: _*)
  val scalaParams: Seq[(String, String)] = convertedParams.map(pair => pair.getName -> pair.getValue)
  val paramsMap: Map[String, String] = scalaParams.toMap
  paramsMap.foreach(println)
}
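Note that the example URL repeats the three key; when the pairs are turned into a Map, the last duplicate wins, e.g.:
Seq("three" -> "3", "three" -> "3a").toMap // Map(three -> 3a)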
Assuming the query string you are working with is as simple as you showed, the use of grouped(2) is a great insight and gives a pretty elegant looking solution.
The next step from where you're at is to use the under-documented Array::toMap method:
val qs = "key=value&foo=bar"
qs.split("&|=") // Array(key, value, foo, bar)
.grouped(2) // <iterator>
.map(a => (a(0), a(1))) // <iterator>
.toMap // Map(key -> value, foo -> bar)
grouped(2) returns an Iterator[Array[String]], which is a little harder to follow because iterators don't display nicely in the Scala console.
Here's the same result, but a bit more step-by-step:
val qs = "key=value&foo=bar"
qs.split("&") // Array(key=value, foo=bar)
.map(kv => (kv.split("=")(0), kv.split("=")(1))) // Array((key,value), (foo,bar))
.toMap // Map(key -> value, foo -> bar)
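If the query string may contain keys without values (e.g. a bare flag), a slightly more defensive sketch of the same idea splits each pair at the first '=' only (the sample input is hypothetical):
val qs = "key=value&flag&foo=bar"
qs.split("&")
  .map(_.split("=", 2))          // split each pair at the first '=' only
  .map {
    case Array(k, v) => k -> v
    case Array(k)    => k -> ""  // bare key with no value
  }
  .toMap                         // Map(key -> value, flag -> "", foo -> bar)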
If you want a more general solution for HTTP query strings, consider using a library for URL parsing.

Scala : How to find types of values inside a scala nested collection

Consider the following variables in Scala:
val nestedCollection_1 = Array(
  "key_1" -> Map("key_11" -> "value_11"),
  "key_2" -> Map("key_22" -> "value_22"))

val nestedCollection_2 = Map(
  "key_3" -> ["key_33", "value_33"],
  "key_4" -> ["key_44" -> "value_44"])
Following are my questions:
1) I want to read the values of the variables nestedCollection_1 and nestedCollection_2 and ensure that the values of the variables are of the format
Array[Map[String, Map[String, String]]]
and
Map[String, Array[String]]
2) Is it possible to get the detailed type of a variable in Scala? i.e. nestedCollection_1.SOME_METHOD should return Array[Map[String, Map[String, String]]] as the type of its values.
I am not sure what exactly you mean. The compiler can ensure the type of any variable if you just annotate the type:
val nestedCollection_2: Map[String, List[String]] = Map(
  "key_3" -> List("key_33", "value_33"),
  "key_4" -> List("key_44", "value_44"))
You can see the type of a variable in the Scala REPL when you define it, or by using Alt + = in IntelliJ IDEA.
scala> val nestedCollection_2 = Map(
| "key_3"-> List("key_33", "value_33"),
| "key_4"-> List("key_44", "value_44"))
nestedCollection_2: scala.collection.immutable.Map[String,List[String]] = Map(key_3 -> List(key_33, value_33), key_4 -> List(key_44, value_44))
Edit
I think I get your question now. Here is how you can get the type as a String:
import scala.reflect.runtime.universe._

def typeAsString[A: TypeTag](elem: A) = {
  typeOf[A].toString
}
Test:
scala> typeAsString(nestedCollection_2)
res0: String = Map[String,scala.List[String]]
scala> typeAsString(nestedCollection_1)
res1: String = scala.Array[(String, scala.collection.immutable.Map[String,String])]