Array of Sequences in Scala

I'm trying to read the distinct values column-wise in a data frame and store them in an array of sequences:
def getColumnDistinctValues(df: DataFrame, colNames: String): Unit = {
  val cols: Array[String] = colNames.split(',')
  cols.foreach(println) // print column names
  var colDistValues: Array[Seq[Any]] = null
  for (i <- 0 until cols.length) {
    colDistValues(i) = df.select(cols(i)).distinct.map(x => x.get(0)).collect // read distinct values from each column
  }
}
The assignment to colDistValues(i) doesn't work and always results in a NullPointerException. What is the correct syntax to assign it the distinct values of each column?
Regards

You're trying to access the ith index of a null reference (which you assigned yourself), so of course you get a NullPointerException. You don't need to initialize an Array[T] beforehand; let the returned collection do that for you:
val colDistValues: Array[Array[Any]] =
  cols.map(c => df.select(c).distinct.map(x => x.get(0)).collect)
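For completeness, here is a minimal sketch of the whole method with that fix applied. This is a variation, not the only way to write it: it collects first and then maps over the resulting Array[Row] on the driver, which sidesteps the need for an Encoder when mapping a DataFrame.
def getColumnDistinctValues(df: DataFrame, colNames: String): Array[Array[Any]] = {
  val cols: Array[String] = colNames.split(',')
  // one array of distinct values per requested column
  cols.map(c => df.select(c).distinct.collect.map(_.get(0)))
}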

You are initialising colDistValues to null.
Replace
var colDistValues: Array[Seq[Any]] = null
with
var colDistValues: Array[Seq[Any]] = Array.ofDim[Seq[Any]](cols.length)


Understanding Spark Imputer

I understand how an imputer is supposed to work, but I cannot fully understand the implementation of the Imputer in Spark. I'd like a beginner-level explanation of the following code:
val results = $(strategy) match {
  case Imputer.mean =>
    // Function avg will ignore null automatically.
    // For a column only containing null, avg will return null.
    val row = dataset.select(cols.map(avg): _*).head()
    Array.range(0, $(inputCols).length).map { i =>
      if (row.isNullAt(i)) {
        Double.NaN
      } else {
        row.getDouble(i)
      }
    }
  case Imputer.median =>
    // Function approxQuantile will ignore null automatically.
    // For a column only containing null, approxQuantile will return an empty array.
    dataset.select(cols: _*).stat.approxQuantile($(inputCols), Array(0.5), 0.001)
      .map { array =>
        if (array.isEmpty) {
          Double.NaN
        } else {
          array.head
        }
      }
}
What I understand:
the logic of the two strategies of an imputer, in pseudocode.
basic Scala and Spark DataFrames.
What I do not understand in Imputer.mean:
Why do we have val row here? Why do we have .head()?
How is the missing value imputed in Imputer.mean? I see how the average of each column is calculated, but I do not see how the values get imputed. Is it by row.getDouble(i)?
What is this Array? Where is it declared? Does it have anything to do with val row?
What I do not understand in Imputer.median:
Don't we calculate the median with dataset.select(cols: _*).stat.approxQuantile($(inputCols), Array(0.5), 0.001)? Why do we have an array here? Why do we return array.head?
Why do we have val row here?
Because it is a descriptive variable name: the type of the value is Row.
Why do we have .head()?
Because the query alone returns a DataFrame, and we want the materialized result of the query.
How the missing value is imputed in Imputer.mean?
It is not. The imputation logic is implemented in ImputerModel.transform:
override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  val surrogates = surrogateDF.select($(inputCols).map(col): _*).head().toSeq
  val newCols = $(inputCols).zip($(outputCols)).zip(surrogates).map {
    case ((inputCol, outputCol), surrogate) =>
      val inputType = dataset.schema(inputCol).dataType
      val ic = col(inputCol)
      when(ic.isNull, surrogate)
        .when(ic === $(missingValue), surrogate)
        .otherwise(ic)
        .cast(inputType)
  }
  dataset.withColumns($(outputCols), newCols).toDF()
}
What is this Array
Because Imputer is designed to impute multiple values at once, so you need a collection of results, and Row, although collection-like, is not typed.
Why do we have an array here? Why do we return array.head?
This is explained in detail in the comment you copied into your question:
For a column only containing null, approxQuantile will return an empty array.
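To make the transform step concrete, here is a small hypothetical sketch of the same when/otherwise pattern applied by hand; the column name, the sentinel "missing" value (-1.0), and the surrogate are all made up for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Toy column with a null and a sentinel "missing" value of -1.0
val df = Seq(Some(1.0), None, Some(-1.0), Some(5.0)).toDF("a")

val surrogate = 3.0 // stand-in for the mean/median the Imputer computed
val imputed = df.withColumn("a_imputed",
  when(col("a").isNull, surrogate)
    .when(col("a") === -1.0, surrogate)
    .otherwise(col("a")))

imputed.show() // rows where "a" is null or -1.0 get 3.0 in "a_imputed"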

Variable from a function to get each changed value

There is a dataset in an object. I can change the value, but this way I always get the old map.
var partnerURLs = getPartnerURLs()

def getPartnerURLs(): mutable.Map[String, List[String]] = {
  val sql = "select * from table where status_flag=1 and del_flag=0"
  val resultSet = DBUtils.doMapResult(sql)
  val result = new scala.collection.mutable.HashMap[String, List[String]]()
  resultSet.foreach(rows => {
    val row = rows.asInstanceOf[util.HashMap[String, Any]]
    result += (row.get("uname").toString -> row.get("urls").toString.split(",").toList)
  })
  result
}
If I update the database, partnerURLs doesn't change; it always has the old value.
The method itself is fine and returns the latest values. How can the variable partnerURLs get the latest value?
What am I doing wrong?
var partnerURLs = getPartnerURLs()
This variable is your problem. If it is globally declared in the enclosing class or object, it will be initialized only once!
Also, please avoid using mutable state! Give some respect to Scala and Functional programming!
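One minimal sketch of the usual fix: declare partnerURLs as a def instead of a var, so the lookup runs on every access instead of once at initialization:
// Re-evaluated on every access, so it always reflects the current table contents
def partnerURLs: mutable.Map[String, List[String]] = getPartnerURLs()
Be aware this runs the SQL query on each access; if that is too expensive, an explicit refresh method that reassigns the var is a middle ground.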

Extracting value of columns in spark dataframe

I have a requirement where I need to filter rows from a Spark dataframe: the value of a certain column (say "price") needs to match a value in a Scala map whose key is the value of another column (say "id").
My dataframe contains two columns: id and price.
I need to filter out all the rows where the price does not match the price given in the Scala map.
My code resembles this:
object obj1 {
  // This method returns the price for an item, given its id
  def getPrice(id: String): String = {
    // look the id up in a map and return the price
  }
}

object Main {
  val validIds = Seq[String]("1", "2", "3", "4")
  val filteredDf = baseDataframe.where(baseDataframe("id").in(validIds.map(lit(_)): _*) &&
    baseDataframe("price") === obj1.getPrice(baseDataframe("id").toString()))
  // But this line sends the string "id" to obj1.getPrice()
  // rather than the value of the id column
}
I am not able to pass the value of the id column to the function obj1.getPrice().
Any suggestions on how to achieve this?
Thanks,
You can write a udf to do this:
val checkPrice = (id: String, price: String) =>
  validIds.exists(_ == id) && obj1.getPrice(id) == price
val checkPriceUdf = udf(checkPrice)
baseDataFrame.where(checkPriceUdf($"id", $"price"))
Another solution is to convert the Map of id -> price to a data frame and then do an inner join with baseDataFrame on the id and price columns, as sketched below.
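A rough sketch of that join approach, assuming the id -> price map is small and both columns are strings (the map contents here are made up):
import spark.implicits._

// Hypothetical id -> price lookup map
val priceMap = Map("1" -> "9.99", "2" -> "19.99")
val priceDf = priceMap.toSeq.toDF("id", "price")

// Inner join keeps only rows whose (id, price) pair appears in the map
val filteredDf = baseDataFrame.join(priceDf, Seq("id", "price"))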

How to get comma separated values from List of Maps in Scala?

I have a listMap1 variable of type List[Map[String, String]] and I want all values associated with the key 'k1' as one comma-separated string.
import fiddle.Fiddle, Fiddle.println
import scalajs.js

@js.annotation.JSExport
object ScalaFiddle {
  var m1: Map[String, String] = Map("k1" -> "v1", "k2" -> "vv1")
  var m2: Map[String, String] = Map("k1" -> "v2", "k2" -> "vv2")
  var m3: Map[String, String] = Map("k1" -> "v3", "k2" -> "vv3")
  var listMap1 = List(m1, m2, m3)
  var valList = ?? // need all values associated with k1, like --> v1,v2,v3...
}
A simple approach would be:
listMap1.flatMap(_.get("k1")).mkString(",")
Be warned that this will not work if you're generating CSV data and the associated values contain commas, e.g. Map("k1" -> "some, string").
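With the three maps from the question, a quick check of what this produces:
val listMap1 = List(
  Map("k1" -> "v1", "k2" -> "vv1"),
  Map("k1" -> "v2", "k2" -> "vv2"),
  Map("k1" -> "v3", "k2" -> "vv3"))

listMap1.flatMap(_.get("k1")).mkString(",") // "v1,v2,v3"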
Is this ok?
val r = listMap1.filter(l => l.contains("k1")).map(r => r("k1")).mkString(",")

HBase Table Retrieval Data

If I am trying to retrieve data from an HBase table using this code:
val get = new Get(Bytes.toBytes(extracted.user.rowKey.toString))
val res = table.get(get)
I am not sure whether the val res = table.get(get) line will return a result, since a row with the row key extracted.socialUser.socialUserConnectionId.toString passed to the Get constructor may not exist in the HBase table.
I am trying something like this:
val get = new Get(Bytes.toBytes(extracted.socialUser.socialUserConnectionId.toString))
val res = table.get(get)
if (!res) {
  /* Create the row in the HBase table */
}
But it gives me an error on that if statement: Expression of type Result doesn't convert to type Boolean. Is there a way I can solve this problem?
table.get(get) returns a Result, not a Boolean, which is why !res doesn't compile. The Result is never null: when no matching row exists it is simply empty, so check res.isEmpty instead of !res:
val res = table.get(get)
if (res.isEmpty) {
  // Create the row in the HBase table
}