How to sortBy in Spark using a function? - Scala

For example, I want to sort by the difference of the two values in a tuple. How can I do that in Spark?
I want something like the following:
rdd.sortBy(_._2._1 - _._2._2)

You can't use the underscore more than once: each occurrence is interpreted as a separate argument, but the expected function takes only one. Instead, name the argument and use it twice:
rdd.sortBy(r => r._2._1 - r._2._2)
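For completeness, here is a minimal runnable sketch; the sample data and the local SparkSession setup are assumptions for illustration:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sortByExample").getOrCreate()
val sc = spark.sparkContext

// (key, (a, b)) pairs, sorted ascending by the difference a - b
val rdd = sc.parallelize(Seq(("x", (10, 3)), ("y", (5, 4)), ("z", (8, 8))))
val sorted = rdd.sortBy(r => r._2._1 - r._2._2)

sorted.collect().foreach(println)  // (z,(8,8)), (y,(5,4)), (x,(10,3))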

Related

Filter, Option or flatMap in Spark

I have the following code in Spark:
dsPhs.filter(filter(_))
  .map(convert)
  .coalesce(partitions)
  .filter(additionalFilter.IsValid(_))
In the convert function I build a more complex object, MyObject, so I need to pre-filter the basic objects first. I have 3 options:
1. Make map return Option[MyObject] and filter it in additionalFilter
2. Replace map with flatMap and return an empty array when filtered out
3. Use filter before the map function, to filter out RawObject before converting it to MyObject
Right now I am going with option 3. But maybe 1 or 2 is preferable?
If, for option 2, you mean have convert return an empty array, there's another option: have convert return an Option[MyObject] and use flatMap instead of map (see the sketch after the considerations below). This has the best of options 1 and 2. Without knowing more about your use case, I can't say for sure whether this is better than option 3, but here are some considerations:
Should convert contain input validation logic? If so, consider modifying it to return an Option.
If convert is used, or will be used, in other places, could they benefit from this validation?
As a side note, this might be a good time to consider what convert currently does when passed an invalid argument.
Can you easily change convert and its signature? If not, consider using a filter.
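To make the Option-plus-flatMap approach concrete, here is a minimal sketch; RawObject, MyObject, and the parsing rule are hypothetical stand-ins, and sc is an existing SparkContext:
case class RawObject(raw: String)
case class MyObject(value: Int)

// Hypothetical convert: None signals an invalid input
def convert(r: RawObject): Option[MyObject] =
  scala.util.Try(r.raw.trim.toInt).toOption.map(MyObject)

val raw = sc.parallelize(Seq(RawObject("1"), RawObject("oops"), RawObject("3")))

// flatMap unwraps Some and drops None, combining map and filter in one pass
val converted = raw.flatMap(r => convert(r))  // MyObject(1), MyObject(3)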

Spark (Python) - explain the difference between user defined functions and simple functions

I am a Spark beginner. I am using Python and Spark DataFrames. I just learned about user-defined functions (UDFs), which one has to register first in order to use them.
Question: in what situation do you want to create a udf vs. just a simple (Python) function?
Thank you so much!
Your code will be neater if you use UDFs: udf takes a function and a return type (StringType by default) and creates a column expression, which means you can write nice things like:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

my_function_udf = udf(my_function, DoubleType())
myDf.withColumn("function_output_column", my_function_udf("some_input_column"))
This is just one example of how a UDF lets you treat a function as a column expression. UDFs also make it easy to bring things like lists or maps into your function logic via a closure, which is explained very well here

Spread syntax in function call in Reason

In JavaScript you can use the spread syntax in a function call like this:
console.log(...[1,2,3]);
Is there an equivalent in Reason? I tried the following:
let bound = (number, lower, upper) => {
max(lower, min(upper, number));
};
let parameters = (1,0,20);
bound(...parameters) |> Js.log;
But this gives an unknown syntax error:
There's not. Reason is a statically typed language, and lists are dynamically sized and homogeneous. Spread in a function call would be of very limited use, and it is not at all obvious how it would deal with too few or too many arguments. If you want to pass a list, you should just accept a list and deal with it appropriately, in a separate function if desired.
You could of course use a tuple instead, which is fixed-size and heterogeneous, but I don't see a use case for that either, since you might as well just call the function directly then.
For JavaScript FFI there is, however, the bs.splice attribute, which allows you to apply a variable number of arguments to a JS function using an array. But it needs to be called with an array literal, not just any array.

Should I use partial function for database calls

As per my understanding, partial functions are functions that are defined only for a subset of input values.
So should I use partial functions for DAOs? For example:
getUserById(userId: Long): User
There will always be some userId input that does not exist in the db, so can I say the function is not defined for it, and lift it when I call it?
If yes, where do I stop? Should I use partial functions for all methods that are not defined for some inputs, say for null?
PartialFunction is used when a function is undefined for some elements of its input (the input may be a Seq etc.).
For your case Option is the better choice: it says that the returned data may be absent:
getUserById(userId: Long): Option[User]
I would avoid using partial functions at all, because Scala makes it very easy to call a partial function as though it were a total function. Instead it's better to use a function that returns Option, as Sergey suggests; that way the "partial-ness" is always explicit.
Idiomatic Scala does not use null, so I wouldn't worry about methods that are not defined for null, but it's certainly worth returning Option for methods that are only defined for some of their possible input values. Better still, though, is to only accept suitable types as input. E.g. if you have a function that's only valid for non-empty lists, it should take (scalaz) NonEmptyList as input rather than List.
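A minimal sketch of the two styles side by side; User and the backing Map are hypothetical stand-ins for the database:
case class User(id: Long, name: String)

val db = Map(1L -> User(1L, "Ada"), 2L -> User(2L, "Grace"))

// Partial-function style: callers who skip isDefinedAt risk a MatchError at runtime
val getUserPf: PartialFunction[Long, User] = {
  case id if db.contains(id) => db(id)
}

// Option style: possible absence is visible in the type, so callers must handle it
def getUserById(userId: Long): Option[User] = db.get(userId)

getUserPf.lift(42L)  // None - lift recovers the Option behaviour after the fact
getUserById(1L)      // Some(User(1,Ada))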

Using variable-length arguments in Scala

I know how to define a method with a variable-length argument list:
case class taxonomy(vocabularies: (String, Set[String])*)
and the client code is very clean:
val terms = taxonomy(
  "topics" -> Set("economic", "politic"),
  "tag" -> Set("Libya", "evolution")
)
but I want to know how I can use this case class when I have a single variable (instead of a sequence of individual arguments), like this:
val notFormattedTerms = Map(
  "topics" -> Set("economic", "politic"),
  "tag" -> Set("Libya", "evolution")
)
taxonomy(notFormattedTerms.toSeq: _*)
With : _* you adapt a sequence argument so that it looks as if several individual arguments had been passed to the variable-length method. This adaptation, however, only works for ordinary Seq types and not, as in this case, for a Map. Therefore, you have to call toSeq explicitly first.
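Putting it together as a runnable sketch, reusing the asker's taxonomy class:
case class taxonomy(vocabularies: (String, Set[String])*)

val notFormattedTerms = Map(
  "topics" -> Set("economic", "politic"),
  "tag" -> Set("Libya", "evolution")
)

// A Map is not a Seq, so convert it first, then expand with : _*
val terms = taxonomy(notFormattedTerms.toSeq: _*)

terms.vocabularies.foreach(println)  // (topics,Set(economic, politic)), (tag,Set(Libya, evolution))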