Spark Key/Value filter Function - scala

I have data in a Key Value pairing. I am trying to apply a filter function to the data that looks like:
def filterNum(x: Int) : Boolean = {
if (decimalArr.contains(x)) return true
else return false
}
My Spark code has:
val numRDD = columnRDD.filter(x => filterNum(x(0)))
but that won't work, and when I pass in the whole element:
val numRDD = columnRDD.filter(x => filterNum(x))
I get the error:
<console>:23: error: type mismatch;
found : (Int, String)
required: Int
val numRDD = columnRDD.filter(x => filterNum(x))
I have also tried other things, like changing the inputs to the function.

This is because RDD.filter passes in the whole key-value tuple, (Int, String), while filterNum expects an Int. The first attempt fails for a different reason: in Scala 2, tuples have no apply method, so x(0) does not compile; tuple elements are accessed with ._1, ._2, and so on.
You could change your filter function to be
def filterNum(x: (Int, String)) : Boolean = {
if (decimalArr.contains(x._1)) return true
else return false
}
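Although, personally, I would write a terser version: contains already returns a Boolean, so you can use the expression directly:
columnRDD.filter(x => decimalArr.contains(x._1))
Or, if you prefer to name the key with a pattern match:
columnRDD.filter { case (key, _) => decimalArr.contains(key) }
Note that the placeholder form decimalArr.contains(_._1) does not work here, because the underscore expands to a function inside contains rather than around it.
Also, avoid return in Scala; the value of the last expression in a method is returned automatically.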
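To try the key-based filter without a cluster, here is a minimal sketch on a plain Scala Seq standing in for the RDD. The contents of decimalArr and the sample pairs are assumptions, since the question does not show them; RDD.filter takes the same kind of predicate.

```scala
object FilterByKeySketch {
  // Assumed sample data; the question shows neither decimalArr nor columnRDD.
  val decimalArr: Array[Int] = Array(1, 3)
  val columns: Seq[(Int, String)] = Seq(1 -> "a", 2 -> "b", 3 -> "c")

  // Explicit lambda: x is the whole (Int, String) pair, x._1 is the key.
  val viaLambda: Seq[(Int, String)] =
    columns.filter(x => decimalArr.contains(x._1))

  // Pattern match: names the key and ignores the value.
  val viaPattern: Seq[(Int, String)] =
    columns.filter { case (key, _) => decimalArr.contains(key) }
}
```

Both forms keep only the pairs whose key appears in decimalArr.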

Related

Scala map with function call results in references to the function instead of results

I have a list of keys for which I want to fetch data. The data is fetched via a function call for each key. I want to end up with a Map of key -> data. Here's what I've tried:
case class MyDataClass(val1: Int, val2: Boolean)
def getData(key: String): MyDataClass = {
// Dummy implementation
MyDataClass(1, true)
}
def getDataMapForKeys(keys: Seq[String]): Map[String, MyDataClass] = {
val dataMap: Map[String, MyDataClass] = keys.map((_, getData(_))).toMap
dataMap
}
This results in a type mismatch error:
type mismatch;
found : scala.collection.immutable.Map[String,String => MyDataClass]
required: Map[String,MyDataClass]
val dataMap: Map[String, MyDataClass] = keys.map((_, getData(_))).toMap
Why is it setting the values in the resulting Map to instances of the getData() function, rather than its result? How do I make it actually CALL the getData() function for each key and put the results as the values in the Map?
The code you wrote is the same as the following statements:
keys.map((_, getData(_)))
keys.map(x => (x, getData(_)))
keys.map(x => (x, y => getData(y)))
This should clarify why you obtain the error.
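As suggested in the comments, stay away from _ except in simple cases with only one occurrence.
The gist of the issue is that in (_, getData(_)) the second underscore does not refer to the key; it turns getData(_) into a function of its own, so each value is a String => MyDataClass. Naming the parameter explicitly fixes this. Note that key -> getData(key) is just another way of writing the tuple (key, getData(key)); it is the named parameter, not the arrow, that makes the difference.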
...
val dataMap: Map[String, MyDataClass] = keys.map(key => (key -> getData(key))).toMap
...
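A minimal, self-contained sketch of the named-parameter version follows. The getData body here is a hypothetical stand-in that derives its result from the key, so that different keys visibly produce different values; the question's real implementation is not shown.

```scala
object MapOfResultsSketch {
  case class MyDataClass(val1: Int, val2: Boolean)

  // Hypothetical stand-in for the real lookup.
  def getData(key: String): MyDataClass =
    MyDataClass(key.length, key.nonEmpty)

  // Naming the parameter makes getData(key) a call, not a new function.
  def getDataMapForKeys(keys: Seq[String]): Map[String, MyDataClass] =
    keys.map(key => key -> getData(key)).toMap
}
```

Writing (key, getData(key)) instead of key -> getData(key) should behave the same, since both build a Tuple2.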

How do I loop over an array in a UDF and return multiple values

I'm new to Scala and UDFs. I would like to write a UDF that accepts 3 parameters from DataFrame columns (one of them an array), loops over the array, parses each item, and returns a case class to be used afterwards. Here is my code, roughly:
case class NewFeatures(dd: Boolean, zz: String)
val resultUdf = udf((arrays: Option[Row], jsonData: String, placement: Int) => {
for (item <- arrays) {
val aa = item.getAs[Long]("aa")
val bb = item.getAs[Long]("bb")
breakable {
if (aa <= 0 || bb <= 0) break
}
val cc = item.getAs[Long]("cc")
val dd = cc > 0
val jsonData = item.getAs[String]("json_data")
val jsonDataObject = JSON.parseFull(jsonData).asInstanceOf[Map[String, Any]]
var zz = jsonDataObject.getOrElse("zz", "").toString
NewFeatures(dd, zz)
}
})
When I run it, I get this exception:
java.lang.UnsupportedOperationException: Schema for type Unit is not supported
How should I modify the above UDF?
First of all, try better names for your variables: in your case, "arrays" is of type Option[Row]. Here, for (item <- arrays) {...} has no yield, so it is basically a call to arrays.foreach(...), which always returns Unit no matter what the last line of the body is. That is why Spark complains that it has no schema for type Unit. To get a value out, use map on the Option (or for ... yield) and give it a function that takes a Row and returns a value of some type (~= signature: def map[V](f: Row => V): Option[V]; what you want in your case: def map(f: Row => NewFeatures): Option[NewFeatures]). Also note that breakable { ... } only wraps the if, so break just leaves that small block and the rest of the loop body still runs.
What you want to do could be written along these lines:
val funcName: (Option[Row], String, Int) => Option[NewFeatures] =
(rowOpt, jsonData, placement) => rowOpt.filter(
/* keep the row only if your break condition does NOT hold */
).map { row => // if passes the filter predicate =>
// fetch data from row, create new instance
}
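The filter-then-map shape can be tried in plain Scala without Spark. In this sketch, Item is a hypothetical case class standing in for the Row, with the fields the question reads from it; the JSON parsing step is left out and zz is assumed to be already extracted.

```scala
object OptionPipelineSketch {
  case class NewFeatures(dd: Boolean, zz: String)

  // Hypothetical stand-in for the Spark Row used in the question.
  case class Item(aa: Long, bb: Long, cc: Long, zz: String)

  val toFeatures: Option[Item] => Option[NewFeatures] =
    itemOpt =>
      itemOpt
        // Negation of the question's break condition (aa <= 0 || bb <= 0).
        .filter(item => item.aa > 0 && item.bb > 0)
        // Only reached when the predicate holds; always yields a NewFeatures.
        .map(item => NewFeatures(item.cc > 0, item.zz))
}
```

Because every path returns an Option[NewFeatures], there is no Unit result left for Spark to reject.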

How can I write a for loop with if/else that has the right return type

I would like to solve a problem.
I wrote this code and I get this output:
found : Unit
[error] required: Boolean
[error] for (data <- category(i)) {
I have to return: (List[String], List[String], List[String])
I chose the filter method to keep the code compact.
I don't understand why it doesn't work. Why does the code return a Unit instead of a Boolean?
I would like this method to return true if at least 1 element of the list starts with x; otherwise the method must return false.
Thanks
def classifiedColumns(columnNames: List[String]): (List[Column], List[Column], List[Column]) = {
val category = List(
List("t01", "t03", "t11", "t1801", "t1803"), // 1
List("t05", "t1805"), // 2
List("t02", "t04", "t06", "t07", "t08", "t09", "t10", "t12", "t13", "t14", "t15", "t16", "t18")) // 3
def get_info(x: String, i: Int, category: List[List[String]]): Boolean = {
for (data <- category(i)) {
if (data.startsWith(x)) true
else false
}
}
(columnNames.filter(x => get_info(x, 1, category)).map(column),
columnNames.filter(x => get_info(x, 2, category)).map(column),
columnNames.filter(x => get_info(x, 3, category)).map(column))
}
classifiedColumns(List("t0201", "t0408", "t0600084"))
Your use of for does not behave as you expect. You're using this for-comprehension:
for (data <- category(i)) {
if (data.startsWith(x)) true
else false
}
This expression "desugars" into (i.e. is shorthand for):
category(i).foreach(data => {
if (data.startsWith(x)) true
else false
})
foreach returns Unit, and therefore the type of this expression (and the type returned from your get_info method, this being the body of the method) is Unit.
It's unclear what you expect to be returned here; I'm assuming you want the method to return true if any of the elements in category(i) start with x. If so, you can implement it as follows:
category(i).exists(_.startsWith(x))
Which returns a Boolean, as expected.
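Here is a small runnable sketch of the exists version, using a shortened category list (an assumption, to keep the example brief) and the question's stated rule that some element must start with x.

```scala
object ExistsSketch {
  // Shortened sample of the question's category list.
  val category: List[List[String]] = List(
    List("t01", "t03"),   // index 0
    List("t05", "t1805")) // index 1

  // true if at least one element of category(i) starts with x
  def getInfo(x: String, i: Int): Boolean =
    category(i).exists(_.startsWith(x))
}
```

Two things may be worth checking against the original code. List indices are zero-based, so a three-element list is addressed with 0, 1 and 2, not 1, 2 and 3. And if the intent is that a column name such as "t0201" matches the prefix "t02", the test would need to be x.startsWith(data) instead.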

Scala function inside a filter loop not working (type mismatch)

I'm new to Scala. I have a function (that works):
object Utiles {
def func(param: String, param2: String): String = {
// Do something
true
}
}
In a different file, I'm using this function successfully, but when I use it inside a filter it gives me an error:
list.filter(value => {
Utiles.func(value.param ,value.param2)
})
the error I'm getting is:
type mismatch;
found : String
required: None.type
Utiles.func(value.param ,value.param2)
Any idea what I'm doing wrong?
You have three issues here (that I can see as the question is currently written):
1. Your func function doesn't compile. You have put the return type of the function as String, yet you are returning a Boolean (true). Either change the return type, or end the function by returning a String.
2. .filter(...) requires you to make something either true or false. This will be fixed if you change the return type of func to be Boolean. If your return type is supposed to be String, you'll need to compare that String to something. Eg:
List("foo", "bar").filter(x => func(x) == "baz")
3. Your type mismatch error is because you seem to be passing a String into your func function where it is expecting a None.type (for some reason).
What I'm getting at, is you have failed to give us a Minimal, Complete, and Verifiable example. I have debugged your code as you have presented it, but I have a strong feeling that you have tried to cut down your real function to a point where your errors (and the function itself) make no sense.
Note that filter takes a predicate:
def filter(p: A => Boolean): List[A]
which means your filter function on List[SomeData] should be SomeData => Boolean.
Example:
scala> def fun(param1: String, param2: String): Boolean = param1 == param2
fun: (param1: String, param2: String)Boolean
scala> List("updupd", "whatwhat").filter(p => fun(p, "updupd"))
res0: List[String] = List(updupd)
I'm not sure how you're able to use func in a different place because the return type is wrong. It should be Boolean:
object Utiles {
def func(param: String, param2: String): Boolean = {
// Do something
true
}
}
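Putting the Boolean return type together with the filter call gives something like the sketch below. Entry is a hypothetical element type, since the question does not show what list contains, and the equality check is only a placeholder for the real logic.

```scala
object FilterPredicateSketch {
  // Hypothetical element type with the two fields the question accesses.
  case class Entry(param: String, param2: String)

  object Utiles {
    // Placeholder logic; what matters is the Boolean return type.
    def func(param: String, param2: String): Boolean = param == param2
  }

  val list: List[Entry] = List(Entry("a", "a"), Entry("a", "b"))

  // The lambda has type Entry => Boolean, which is what filter expects.
  val kept: List[Entry] =
    list.filter(value => Utiles.func(value.param, value.param2))
}
```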

Treat scala function with default parameter type as if it did not have default parameters

Disclaimer: I'm new to Scala.
I want to pass a function with default parameters as if its type did not have default parameters
import scala.util.hashing.MurmurHash3
type Record = Map[String, String]
type Dataset = Seq[Record]
def dropDuplicates(dataset: Dataset, keyF: Record => Any = recordHash) : Dataset = {
// keep only distinct records as defined by the key function
// removed method body for simplicity
return dataset
}
def recordHash(record: Record, attributes: Option[Seq[String]] = None) : Int = {
val values : Seq[String] = attributes
.getOrElse(record.keys.toSeq.sorted)
.map(attr => record(attr))
return MurmurHash3.seqHash(values)
}
Here's the error I'm getting at compile time:
error: type mismatch;
[ant:scalac] found : (Record, Option[Seq[String]]) => Int
[ant:scalac] required: Record => Any
[ant:scalac] def dropDuplicates(dataset: Dataset, keyF: Record => Any = recordHash) : Dataset = {
Intuitively, I think of recordHash as type Record => Int when the default parameter attributes is not provided. Is there a way to treat recordHash as type Record => Int?
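I can't compile your code because some types are missing, but I think this will work:
def dropDuplicates(dataset: Dataset, keyF: Record => Any = recordHash(_)) : Dataset = {
// keep only distinct records as defined by the key function
dataset // placeholder, as in the question (method body removed for simplicity)
}
This works because recordHash(_) is equivalent to x => recordHash(x). Default arguments are only filled in at a call site, and in this lambda recordHash is called with a single argument, so attributes takes its default, None. The resulting function takes a Record (the input x) and returns an Int, which fits Record => Any.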
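For completeness, a self-contained sketch that compiles with the question's type aliases. The body of dropDuplicates is an assumption (the question removed it): it keeps the first record seen for each key.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object DefaultParamSketch {
  type Record = Map[String, String]
  type Dataset = Seq[Record]

  def recordHash(record: Record, attributes: Option[Seq[String]] = None): Int = {
    val values: Seq[String] = attributes
      .getOrElse(record.keys.toSeq.sorted)
      .map(attr => record(attr))
    MurmurHash3.seqHash(values)
  }

  // recordHash(_) is a one-argument lambda, so attributes defaults to None.
  def dropDuplicates(dataset: Dataset, keyF: Record => Any = recordHash(_)): Dataset = {
    // Assumed body: keep the first record for each distinct key.
    val seen = mutable.Set.empty[Any]
    dataset.filter(record => seen.add(keyF(record)))
  }
}
```

A caller can still pass a different key function, for example dropDuplicates(data, r => r("id")), where data and the "id" attribute are illustrative names.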