The following Scala code that uses java.util.HashMap (I need to use Java because it's for a Java interface) works fine:
val row1 = new HashMap[String,String](Map("code" -> "B1", "name" -> "Store1"))
val row2 = new HashMap[String,String](Map("code" -> "B2", "name" -> "Store 2"))
val map1 = Array[Object](row1,row2)
Now I'm trying to create map1 dynamically:
val values: Seq[Seq[String]] = ....
val mapx = values.map {
  row => new HashMap[String,String](Map(row.map(col => "someName" -> col))) // <-- error
}
val map1 = Array[Object](mapx)
But I get the following compilation error:
type mismatch;
 found   : Seq[(String, String)]
 required: (?, ?)
How to fix this?
We can simplify your code a bit to isolate the problem:
val mapx = Map(Seq("someKey" -> "someValue"))
This still produces the same error message, so the error isn't actually related to your use of Java HashMaps, but to passing a Seq as an argument to Scala's Map.
The problem is that Map is variadic and expects key-value-pairs as its arguments, not some data structure containing them. In Java a variadic method can also be called with an array instead, without any type of conversion. This isn't true in Scala. In Scala you need to use : _* to explicitly convert a sequence to a list of arguments when calling a variadic method. So this works:
val mapx = Map(mySequence : _*)
Alternatively, you can just use .toMap to create a Map from a sequence of tuples:
val mapx = mySequence.toMap
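Applied back to your original snippet, a minimal sketch could look like this (the field names are just placeholders for illustration, and I'm assuming Scala 2.13's scala.jdk.CollectionConverters for the Scala-to-Java conversion; on older versions use scala.collection.JavaConverters):
import java.util.HashMap
import scala.jdk.CollectionConverters._

// Hypothetical field names, one per column of the inner Seq.
val fieldNames = Seq("code", "name")

val values: Seq[Seq[String]] = Seq(Seq("B1", "Store1"), Seq("B2", "Store 2"))

val mapx = values.map { row =>
  // zip pairs each column with its field name; toMap builds the Scala Map,
  // and asJava adapts it for the java.util.HashMap copy constructor.
  new HashMap[String, String](fieldNames.zip(row).toMap.asJava)
}
val map1 = Array[Object](mapx: _*)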
I'm trying to create a Spark Scala UDF in order to transform MongoDB objects of the following shape:
Object:
"1": 50.3
"8": 2.4
"117": 1.0
into a Spark ML SparseVector.
The problem is that in order to create a SparseVector, I need one more input parameter - its size.
And in my app I keep the Vector sizes in a separate MongoDB collection.
So, I defined the following UDF function:
val mapToSparseVectorUdf = udf {
  (myMap: Map[String, Double], size: Int) => {
    val vb: VectorBuilder[Double] = new VectorBuilder(length = -1)
    vb.use(myMap.keys.map(key => key.toInt).toArray, myMap.values.toArray, size)
    vb.toSparseVector
  }
}
And I was trying to call it like this:
df.withColumn("VecColumn", mapToSparseVectorUdf(col("MapColumn"), vecSize)).drop("MapColumn")
However, my IDE says "Not applicable" to that udf call.
Is there a way to make this kind of UDF that can take an extra parameter?
UDF functions require columns to be passed as arguments, and the passed columns are converted to primitive data types through serialization and deserialization. That's why UDF functions are expensive.
If vecSize is an integer constant, then you can simply use the built-in lit function:
df.withColumn("VecColumn", mapToSparseVectorUdf(col("MapColumn"), lit(vecSize))).drop("MapColumn")
This will do it:
def mapToSparseVectorUdf(vectorSize: Int) = udf[Vector, Map[String, Double]](
  (myMap: Map[String, Double]) => {
    val elements = myMap.toSeq.map { case (index, value) => (index.toInt, value) }
    Vectors.sparse(vectorSize, elements)
  }
)
Usage:
val data = spark.createDataFrame(Seq(
  ("1", Map("1" -> 50.3, "8" -> 2.4)),
  ("2", Map("2" -> 23.5, "3" -> 41.2))
)).toDF("id", "MapColumn")
data.withColumn("VecColumn", mapToSparseVectorUdf(10)($"MapColumn")).show(false)
NOTE:
Consider fixing your MongoDB schema! ;) The size is a member of a SparseVector; I wouldn't separate it from its elements.
I have the following Scala code that generates a Jasper Report using an array of maps as data source (JRMapArrayDataSource). This works fine if all the values have the same type (such as String), but when I try to combine strings and integers in the HashMap I get a compilation error:
val map1 = new HashMap[String,Object](Map("f1"->"aaa1", "f2"-> "aaa2", "f3" -> 1 ))
val map2 = new HashMap[String,Object](Map("f1"->"bbb1", "f2"-> "bbb2", "f3" -> 2 ))
val dataSource = new JRMapArrayDataSource(Array(map1, map2));
val params = new HashMap[String,Object]()
val jasperPrint = JasperFillManager.fillReport("test1.jasper", params, dataSource);
JasperExportManager.exportReportToPdfFile(jasperPrint, "test1.pdf");
On the lines defining map1 and map2 I get the following:
overloaded method constructor HashMap with alternatives:
  (x$1: java.util.Map[_ <: String, _ <: Object])java.util.HashMap[String,Object]
  (x$1: Int)java.util.HashMap[String,Object]
cannot be applied to (scala.collection.immutable.Map[String,Any])
Since I have in the report two string fields (f1 and f2) and one int field (f3) I need to have this combination in the HashMap. Any ideas?
The answer is to use Any:
val map1 = new HashMap[String,Any](Map("f1"->"aaa1", "f2"-> "aaa2", "f3" -> 1 ))
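If you'd rather keep the java.util.Map[String, Object] signature the Java API advertises, an alternative sketch is to box the Int explicitly (the asJava conversion via scala.jdk.CollectionConverters is my assumption; older Scala versions use scala.collection.JavaConverters):
import java.util.HashMap
import scala.jdk.CollectionConverters._

// Int.box yields a java.lang.Integer, which is a subtype of Object.
val map1 = new HashMap[String, Object](
  Map[String, Object]("f1" -> "aaa1", "f2" -> "aaa2", "f3" -> Int.box(1)).asJava)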
Consider the following variables in Scala:
val nestedCollection_1 = Array(
"key_1" -> Map("key_11" -> "value_11"),
"key_2" -> Map("key_22" -> "value_22"))
val nestedCollection_2 = Map(
"key_3"-> ["key_33","value_33"],
"key_4"-> ["key_44"->"value_44"])
Following are my questions:
1) I want to read the values of the variables nestedCollection_1 and nestedCollection_2 and ensure that the values of the variables are of the format
Array[Map[String, Map[String, String]]]
and
Map[String, Array[String]]
2) Is it possible to get the detailed type of a variable in Scala? i.e. nestedCollection_1.SOME_METHOD should return Array[Map[String, Map[String, String]]] as the type of its value
I am not sure exactly what you mean. The compiler can ensure the type of any variable if you just annotate it:
val nestedCollection_2: Map[String, List[String]] = Map(
  "key_3" -> List("key_33", "value_33"),
  "key_4" -> List("key_44", "value_44"))
You can see the type of a variable in the Scala REPL when you define it, or by pressing Alt + = in IntelliJ IDEA.
scala> val nestedCollection_2 = Map(
| "key_3"-> List("key_33", "value_33"),
| "key_4"-> List("key_44", "value_44"))
nestedCollection_2: scala.collection.immutable.Map[String,List[String]] = Map(key_3 -> List(key_33, value_33), key_4 -> List(key_44, value_44))
Edit
I think I get your question now. Here is how you can get the type as a String:
import scala.reflect.runtime.universe._
def typeAsString[A: TypeTag](elem: A) = {
  typeOf[A].toString
}
Test:
scala> typeAsString(nestedCollection_2)
res0: String = Map[String,scala.List[String]]
scala> typeAsString(nestedCollection_1)
res1: String = scala.Array[(String, scala.collection.immutable.Map[String,String])]
I'm trying to transform a dataframe via a function that takes an array as a parameter. My code looks something like this:
def getCategory(categories: Array[String], input: String): String = {
  categories(input.toInt)
}
val myArray = Array("a", "b", "c")
val myCategories = udf(getCategory _)
val df = sqlContext.parquetFile("myfile.parquet")
val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput")))
However, lit doesn't like arrays and this script errors. I tried defining a new partially applied function and then the UDF after that:
val newFunc = getCategory(myArray, _:String)
val myCategories = udf(newFunc)
val df1 = df.withColumn("newCategory", myCategories(col("myInput")))
This doesn't work either, as I get a NullPointerException and it appears myArray is not being recognized. Any ideas on how to pass an array as a parameter to a function used with a dataframe?
On a separate note, any explanation as to why doing something simple like applying a function to a dataframe is so complicated (define the function, redefine it as a UDF, etc.)?
Most likely not the prettiest solution, but you can try something like this (the array is captured in the UDF's closure, so it gets serialized and shipped to the executors along with the function):
def getCategory(categories: Array[String]) = {
  udf((input: String) => categories(input.toInt))
}
df.withColumn("newCategory", getCategory(myArray)(col("myInput")))
You could also try an array of literals (note the UDF takes a Seq[String], since Spark passes array columns to Scala UDFs as a Seq):
val getCategory = udf(
  (input: String, categories: Seq[String]) => categories(input.toInt))
df.withColumn(
  "newCategory", getCategory($"myInput", array(myArray.map(lit(_)): _*)))
On a side note, using a Map instead of an Array is probably a better idea:
def mapCategory(categories: Map[String, String], default: String) = {
  udf((input: String) => categories.getOrElse(input, default))
}
val myMap = Map[String, String]("1" -> "a", "2" -> "b", "3" -> "c")
df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput")))
Since Spark 1.5.0 you can also use an array function:
import org.apache.spark.sql.functions.array
val colArray = array(myArray.map(lit _): _*)
myCategories(colArray, col("myInput"))
See also Spark UDF with varargs
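If you are on Spark 2.2 or later, typedLit is another option; unlike lit it accepts Seq and Map literals directly. A minimal sketch, assuming the UDF is declared with a Seq[String] parameter (Spark hands array columns to Scala UDFs as a Seq):
import org.apache.spark.sql.functions.{col, typedLit, udf}

// Note the Seq[String] parameter rather than Array[String].
val myCategories = udf((categories: Seq[String], input: String) => categories(input.toInt))
val df1 = df.withColumn("newCategory", myCategories(typedLit(myArray.toSeq), col("myInput")))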
I am trying to create a map with a different map inside:
val mymap = Map("name"->"somename",Map(1->2))
I get the following from the compiler:
scala: type mismatch;
found : scala.collection.immutable.Map[Int,Int]
required: (?, ?)
val mymap = Map("name"->"somename",Map(1->2))
^
Why do you expect it to work? You've provided only a key without a value:
val key = Map(1->2)
val mymap = Map("name"->"somename", key)
Perhaps you wanted to combine two maps? This can be done with:
val mymap = Map("name"->"somename") ++ Map(1->2)
// scala.collection.immutable.Map[Any,Any] = Map(name -> somename, 1 -> 2)
A Map consists of key-value pairs (whose type is (?, ?)). You have to assign the nested Map value to a key as well:
val mymap = Map("name"->"somename","othername"->Map(1->2))