How to add to an Immutable map : Scala - scala

I have a ResultSet object returned from Hive using JDBC.
I am trying to store the values in a resultset in a Scala Immutable Map.
How can i add there values to an Immutable map as i am iterating the resultset using while loop
val m : Map[String, String] = null
while ( resultSet.next() ) {
val col = resultSet.getString("col_name")
val data = resultSet.getString("data_type")
m += (col -> data) // This Gives Reassignment error
}

I propose :
Iterator.continually{
val col = resultSet.getString("col_name")
val data = resultSet.getString("data_type")
col->data
}.takeWhile( _ => resultSet.next()).toMap
Instead of thinking "let's init an empty collection and fill it" which is imho the mutable way to think, this proposition rather think in terms of "let's declare how to build a collection with those elements in it and be done" :-)

You might want to use scala.collection.Iterator[A] so that you can create immutable map out of your java resultSet.
val myMap : Map[String, String] = new Iterator[(String, String)] {
override def hasNext = resultSet.next()
override def next() = {
val col = resultSet.getString("col_name")
val data = resultSet.getString("data_type")
col -> data
}
}.toMap
Otherwise you have to use mutable scala.collection.mutable.Map.

Related

Return type to assign to val for RDDs

I am playing around with spark code to know more about shuffling. I wrote the following code to see how are stages formed if there is a if-else statement. I have declared val result so that the result could be assigned to it later in the if statement. But I am not sure about the return type to assign to it.
Is there an abstract class that goes with all the RDDs?
val conf = new SparkConf().setMaster("local").setAppName("spark shuffle")
val sc = new SparkContext(conf)
val d = sc.parallelize(0 until 1000).map(i => (i%1000, i))
val x = d.reduceByKey(_+_)
val count = 1
val result: RDD // What is the correct return type here?
if(count == 1)
{
result= d.rightOuterJoin(x)
result.collect()
}
d is a RDD[(Int, Int)]
Then doing a reduce by key gives the same thing but reduced down
Doing a right outer join then gives you RDD of (Int, (Option[Int], Int)) - ie for each key the L and R value (with the L option being optional if not there)
So doing a collect gives you an array of the same thing
The API documentation is not easy to follow for all these functions, there is a lot of generic types, and a lot of implicit types. I would recommend that you either use an IDE which will hint the types for you, or else use a tool that gives you a console that you can try snippets in.
you can avoid assignment to var (it should be var, not val)
val conf = new SparkConf().setMaster("local").setAppName("spark shuffle")
val sc = new SparkContext(conf)
val d = sc.parallelize(0 until 1000).map(i => (i%1000, i))
val x = d.reduceByKey(_+_)
val count = 1
if (count == 1) {
d.rightOuterJoin(x).collect()
}

Converting Scala mutable arrays to a spark dataframe

I have three mutable arrays defined as:
import scala.collection.mutable.ArrayBuffer
var quartile_1 = ArrayBuffer[Double]()
var quartile_3 = ArrayBuffer[Double]()
var id = ArrayBuffer[String]()
quartile_1 and quartile_3 are information at id level and I am currently computing them as:
def func1(x: org.apache.spark.sql.Row) {
val apQuantile = df_auth_for_qnt.where($"id" === x(0).toString).stat.approxQuantile("tran_amt", Array(0.25, 0.75), 0.001)
quartile_1 += apQuantile(0)
quartile_3 += apQuantile(1)
id += x(0).toString()
}
val cardNumList = df_auth_for_qnt_gb.where($"tran_cnt" > 8).select("card_num_1").collect.foreach(func1)
Is there a better approach than appending them to mutable arrays? My goal is to have the quantile data, id available as a dataframe - so that I can do further joins.
Mutable structures like ArrayBuffer are evil, especially in parallelizable context. Here they can be avoided quite easily.
func1 can return a tuple of (String, Array[Double]), where the first element corresponds to the id (former id buffer) and the second element is the quartiles returned from approxQuantile:
def func1(x: Row): (String, Array[Double]) = {
val cardNum1 = x(0).toString
val quartiles = df_auth_for_qnt.where($"id" === cardNum1).stat.approxQuantile("tran_amt", Array(0.25, 0.75), 0.001)
(cardNum1, quartiles)
}
Now, using functional chaning we can obtain an immutable result structure.
As a DataFrame:
val resultDf = df_auth_for_qnt_gb.where($"tran_cnt" > 8).select("card_num_1").map(func1).toDF("id", "quartiles")
Or as a Map[String, Array[Double]] with same associations as in the tuples returned from func1:
val resultMap = df_auth_for_qnt_gb.where($"tran_cnt" > 8).select("card_num_1").map(func1).collect().toMap

How to create collection of RDDs out of RDD?

I have an RDD[String], wordRDD. I also have a function that creates an RDD[String] from a string/word. I would like to create a new RDD for each string in wordRDD. Here are my attempts:
1) Failed because Spark does not support nested RDDs:
var newRDD = wordRDD.map( word => {
// execute myFunction()
(new MyClass(word)).myFunction()
})
2) Failed (possibly due to scope issue?):
var newRDD = sc.parallelize(new Array[String](0))
val wordArray = wordRDD.collect
for (w <- wordArray){
newRDD = sc.union(newRDD,(new MyClass(w)).myFunction())
}
My ideal result would look like:
// input RDD (wordRDD)
wordRDD: org.apache.spark.rdd.RDD[String] = ('apple','banana','orange'...)
// myFunction behavior
new MyClass('apple').myFunction(): RDD[String] = ('pple','aple'...'appl')
// after executing myFunction() on each word in wordRDD:
newRDD: RDD[String] = ('pple','aple',...,'anana','bnana','baana',...)
I found a relevant question here: Spark when union a lot of RDD throws stack overflow error, but it didn't address my issue.
Use flatMap to get RDD[String] as you desire.
var allWords = wordRDD.flatMap { word =>
(new MyClass(word)).myFunction().collect()
}
You cannot create a RDD from within another RDD.
However, it is possible to rewrite your function myFunction: String => RDD[String], which generates all words from the input where one letter is removed, into another function modifiedFunction: String => Seq[String] such that it can be used from within an RDD. That way, it will also be executed in parallel on your cluster. Having the modifiedFunction you can obtain the final RDD with all words by simply calling wordRDD.flatMap(modifiedFunction).
The crucial point is to use flatMap (to map and flatten the transformations):
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val input = sc.parallelize(Seq("apple", "ananas", "banana"))
// RDD("pple", "aple", ..., "nanas", ..., "anana", "bnana", ...)
val result = input.flatMap(modifiedFunction)
}
def modifiedFunction(word: String): Seq[String] = {
word.indices map {
index => word.substring(0, index) + word.substring(index+1)
}
}

Pass array as an UDF parameter in Spark SQL

I'm trying to transform a dataframe via a function that takes an array as a parameter. My code looks something like this:
def getCategory(categories:Array[String], input:String): String = {
categories(input.toInt)
}
val myArray = Array("a", "b", "c")
val myCategories =udf(getCategory _ )
val df = sqlContext.parquetFile("myfile.parquet)
val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput"))
However, lit doesn't like arrays and this script errors. I tried definining a new partially applied function and then the udf after that :
val newFunc = getCategory(myArray, _:String)
val myCategories = udf(newFunc)
val df1 = df.withColumn("newCategory", myCategories(col("myInput")))
This doesn't work either as I get a nullPointer exception and it appears myArray is not being recognized. Any ideas on how I pass an array as a parameter to a function with a dataframe?
On a separate note, any explanation as to why doing something simple like using a function on a dataframe is so complicated (define function, redefine it as UDF, etc, etc)?
Most likely not the prettiest solution but you can try something like this:
def getCategory(categories: Array[String]) = {
udf((input:String) => categories(input.toInt))
}
df.withColumn("newCategory", getCategory(myArray)(col("myInput")))
You could also try an array of literals:
val getCategory = udf(
(input:String, categories: Array[String]) => categories(input.toInt))
df.withColumn(
"newCategory", getCategory($"myInput", array(myArray.map(lit(_)): _*)))
On a side note using Map instead of Array is probably a better idea:
def mapCategory(categories: Map[String, String], default: String) = {
udf((input:String) => categories.getOrElse(input, default))
}
val myMap = Map[String, String]("1" -> "a", "2" -> "b", "3" -> "c")
df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput")))
Since Spark 1.5.0 you can also use an array function:
import org.apache.spark.sql.functions.array
val colArray = array(myArray map(lit _): _*)
myCategories(lit(colArray), col("myInput"))
See also Spark UDF with varargs

Slick 2.1: Return query results as a map [duplicate]

I have methods in my Play app that query database tables with over hundred columns. I can't define case class for each such query, because it would be just ridiculously big and would have to be changed with each alter of the table on the database.
I'm using this approach, where result of the query looks like this:
Map(columnName1 -> columnVal1, columnName2 -> columnVal2, ...)
Example of the code:
implicit val getListStringResult = GetResult[List[Any]] (
r => (1 to r.numColumns).map(_ => r.nextObject).toList
)
def getSomething(): Map[String, Any] = DB.withSession {
val columns = MTable.getTables(None, None, None, None).list.filter(_.name.name == "myTable").head.getColumns.list.map(_.column)
val result = sql"""SELECT * FROM myTable LIMIT 1""".as[List[Any]].firstOption.map(columns zip _ toMap).get
}
This is not a problem when query only runs on a single database and single table. I need to be able to use multiple tables and databases in my query like this:
def getSomething(): Map[String, Any] = DB.withSession {
//The line below is no longer valid because of multiple tables/databases
val columns = MTable.getTables(None, None, None, None).list.filter(_.name.name == "table1").head.getColumns.list.map(_.column)
val result = sql"""
SELECT *
FROM db1.table1
LEFT JOIN db2.table2 ON db2.table2.col1 = db1.table1.col1
LIMIT 1
""".as[List[Any]].firstOption.map(columns zip _ toMap).get
}
The same approach can no longer be used to retrieve column names. This problem doesn't exist when using something like PHP PDO or Java JDBCTemplate - these retrieve column names without any extra effort needed.
My question is: how do I achieve this with Slick?
import scala.slick.jdbc.{GetResult,PositionedResult}
object ResultMap extends GetResult[Map[String,Any]] {
def apply(pr: PositionedResult) = {
val rs = pr.rs // <- jdbc result set
val md = rs.getMetaData();
val res = (1 to pr.numColumns).map{ i=> md.getColumnName(i) -> rs.getObject(i) }.toMap
pr.nextRow // <- use Slick's advance method to avoid endless loop
res
}
}
val result = sql"select * from ...".as(ResultMap).firstOption
Another variant that produces map with not null columns (keys in lowercase):
private implicit val getMap = GetResult[Map[String, Any]](r => {
val metadata = r.rs.getMetaData
(1 to r.numColumns).flatMap(i => {
val columnName = metadata.getColumnName(i).toLowerCase
val columnValue = r.nextObjectOption
columnValue.map(columnName -> _)
}).toMap
})