What does a code block mean in a Scala anonymous function in Spark?

I am new to Scala and don't understand what a code block means in an anonymous function. Here is some example code:
def makeIndex(langs: List[String], rdd: RDD[WikipediaArticle]): RDD[(String, Iterable[WikipediaArticle])] = {
  val articles_Languages = rdd.flatMap(article => {
    langs.filter(lang => article.mentionsLanguage(lang))
      .map(lang => (lang, article))
  })
  articles_Languages.groupByKey
}
Does it mean that each WikipediaArticle object is transformed into a list of (lang, article) tuples, which is then flattened, and that calling groupByKey turns the result into an RDD[(String, Iterable[WikipediaArticle])]?
Does it mean that I can write any code inside a {} block as long as the last line of the block returns the object I want? Is that how the example code iterates over langs for each article?

map and flatMap are higher-order functions: they take a function as a parameter, and you can call them in several ways. You can pass a method that you have already defined, an anonymous function inside () if it is a single expression, or an anonymous function inside {} if you need several lines of code.
And yes, you can put whatever you want in there as long as you respect the required signature, meaning that the input and output types have to match it.
In the case of map the signature is A => B, so you can transform your A into anything you want. For flatMap the function has to return a collection of B, which flatMap then flattens.

E.g. for an RDD of Int:
rdd.map(x=> x+1)
The x => x+1 is an anonymous function that map (or some other method) calls on each element.
There is no def with declared input and output types here; in this case Scala infers that the result type is Int.
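To make the "block as function body" point concrete, here is a minimal sketch that uses plain Scala collections instead of Spark, so it runs without a cluster. Article is a hypothetical stand-in for WikipediaArticle, and groupBy plays the role of groupByKey; neither comes from the original code.

```scala
object BlockLambdaExample {
  // Hypothetical stand-in for WikipediaArticle, just enough for the example.
  final case class Article(title: String, text: String) {
    def mentionsLanguage(lang: String): Boolean = text.split(' ').contains(lang)
  }

  val langs: List[String] = List("Scala", "Java")
  val articles: List[Article] = List(
    Article("a1", "Scala and Java on the JVM"),
    Article("a2", "Java only"),
    Article("a3", "nothing relevant")
  )

  // The braces delimit a block; the block's value is its last expression.
  val pairs: List[(String, Article)] = articles.flatMap { article =>
    val mentioned = langs.filter(lang => article.mentionsLanguage(lang)) // intermediate step
    mentioned.map(lang => (lang, article))                               // last expression = result
  }

  // Plain-collection analogue of groupByKey: language -> titles of the articles mentioning it.
  val index: Map[String, List[String]] =
    pairs.groupBy(_._1).map { case (lang, ps) => (lang, ps.map(_._2.title)) }
}
```

Each article yields a (possibly empty) list of pairs, flatMap concatenates those lists, and the grouping step collects the articles per language.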

Related

Scala function currying, call-by-name functions, and generic types

I am a bit new to Scala currying and call-by-name functions, and I am having difficulty understanding the syntax. What is the flow of this function, why does it need to return f(results), and what function is then applied to it?
def withScan[R](table: Table, scan: Scan)(f: (Seq[Result]) => R): R = {
  var resultScanner: ResultScanner = null
  try {
    resultScanner = table.getScanner(scan)
    val it: util.Iterator[Result] = resultScanner.iterator()
    val results: mutable.ArrayBuffer[Result] = ArrayBuffer()
    while (it.hasNext) {
      results += it.next()
    }
    f(results)
  } finally {
    if (resultScanner != null)
      resultScanner.close()
  }
}
Let's look at just the function signature
def withScan[R](table: Table, scan: Scan)(f: (Seq[Result]) => R): R
Firstly, ignore the fancy currying syntax for now, as you can always rewrite a curried function into a normal function by putting all the parameters in one parameter list, i.e.
def withScan[R](table: Table, scan: Scan, f: Seq[Result] => R): R
Secondly, notice that the last parameter is itself a function, and we don't know yet what it does. withScan takes a function that somebody gives it and applies that function to something. Why would someone need this? Resources such as File, DatabaseConnection or Socket have to be opened and closed properly; without help we end up repeating the code that closes them or, even worse, forgetting to close them. So the boring common code is factored out into a convenient function: if you use withScan to access the table, it hands you the results to work on and makes sure the resources are closed properly, so that you can focus on the interesting operation. This is called the "loan pattern".
Now let's go back to the currying syntax. Although currying has other interesting use cases, I believe the reason it is written in this style is that in Scala you can pass a parameter to a function in a curly-brace block, i.e. one can use the function above like this
withScan(myTable, myScan) { results =>
  // do whatever you want with the results
}
This looks just like a built-in control structure such as if-else or a for loop!
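A minimal, self-contained sketch of the loan pattern may help here. It does not use HBase: FakeScanner and withRows are hypothetical names made up for the illustration, and the "resource" only records whether it was closed.

```scala
object LoanPatternExample {
  // Hypothetical resource that records whether close() was called.
  final class FakeScanner(rows: List[String]) extends AutoCloseable {
    var closed: Boolean = false
    def readAll(): List[String] = rows
    override def close(): Unit = { closed = true }
  }

  // "Lends" the rows to f and guarantees that the scanner is closed afterwards,
  // even if f throws.
  def withRows[R](scanner: FakeScanner)(f: List[String] => R): R =
    try f(scanner.readAll())
    finally scanner.close()

  val scanner = new FakeScanner(List("a", "bb", "ccc"))

  // The second parameter list holds a single function, so it can be written as a block.
  val totalLength: Int = withRows(scanner) { rows =>
    rows.map(_.length).sum
  }
}
```

The caller only writes the interesting part (summing the lengths); opening, iterating and closing stay inside withRows, which is the same division of labour as in withScan.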
If I understand it properly, this is a function which takes some Table (probably a db table) and tries to scan it using the scan argument. After the data has been collected with the relevant scanner, the method just maps the collected sequence to an object of type R.
The f function is used for that mapping.
You can use this function like this:
val list: List[Result] = withScan(table, scanner)(results => results.toList)
Or
val data = withScan(table, scanner)(results => ObjectWhichKeepAllData(results))
IMHO, this is not very well-written code, and I feel it would be better to do the mapping outside of this function. Let the client do the mapping (which, BTW, should be done for every single result) and leave only the scanning to that function.
This is an example of a higher-order function: a function which takes another function as a parameter.
The function appears to do the following:
- opens the passed in table with the passed in scanner
- parses the table with an iterator, populating entries in a local ArrayBuffer
- calls a function, passed in by the caller, on the sequence of entries that have been parsed.
The function parameter allows this function to be used to carry out any operation on the scanned information, depending on the function passed in.
The function prototype could equally have been declared:
def withScan[R](table: Table, scan: Scan, f: (Seq[Result]) => R): R = {
The function has been declared with two argument lists; this is an example of currying. This is a benefit when calling the function, as it allows the method to be called with a clearer syntax.
Consider a function that might be passed into this function:
def getHighestValueEntry(results: Seq[Result]): R = {
Without currying, the function would be called like this:
withScan[R](table, scan, results => getHighestValueEntry(results))
With currying the function can be called in a manner that makes the function parameter stand out more clearly. This is helped by the ability in Scala to use curly braces instead of parentheses to surround the arguments to a function, if you are only passing in one argument:
withScan(table, scan) { results =>
  getHighestValueEntry(results)
}
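The two call styles can be compared side by side in a small sketch. applyFlat, applyCurried and highest are hypothetical names used only for this comparison; they work on a Seq[Int] rather than on scan results.

```scala
object CurryingSyntaxExample {
  // Single parameter list: the function is just the last argument.
  def applyFlat[R](values: Seq[Int], f: Seq[Int] => R): R = f(values)

  // Two parameter lists: same behaviour, different call syntax.
  def applyCurried[R](values: Seq[Int])(f: Seq[Int] => R): R = f(values)

  def highest(values: Seq[Int]): Int = values.max

  val viaFlat: Int = applyFlat(Seq(3, 9, 4), (vs: Seq[Int]) => highest(vs))

  val viaCurried: Int = applyCurried(Seq(3, 9, 4)) { vs =>
    highest(vs)
  }

  // A method name can also be passed directly; it is converted to a function value.
  val viaMethodName: Int = applyCurried(Seq(3, 9, 4))(highest)
}
```

All three calls compute the same value; the curried form simply lets the function argument sit in its own brace-delimited block.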

How to call a Scala function with a single parameter and write the parameter first

I have some code that converts the elements of sequences with different functions, like this:
someSequence map converterFunction
However, sometimes I don't have a sequence but a single value that is to be passed to the function. For consistency with the other lines I'd like to write it like this:
someValue applyTo converterFunction
So in a way it is like mapping a single value. Of course I can just call the function with the value, I'm just wondering if it is possible to write it similar to the way I proposed.
I agree with ChrisK that this idea doesn't sound like a good one in terms of code readability, but if you really want it, you can do something like this:
implicit final class MapAnyOp[T](val value: T) extends AnyVal {
  def map[R](f: T => R) = f(value)
}
def convert(v: Int): String = Integer.toHexString(v)
println(List(123, 234, 345) map convert)
println(123 map convert)
You could wrap it in a Seq:
Seq(someValue) map converterFunction
If someValue is a custom type/class, you can define an operator that will do that for you instead of having this explicit wrapping.
someValue.seq map converterFunction
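If the goal is literally the applyTo spelling from the question, the same implicit-class technique can be given that name. This is only a sketch: ApplyToOp is a hypothetical wrapper, and applyTo is not a standard library method. (Scala 2.13 and later also ship a similar pipe method in scala.util.chaining.)

```scala
object ApplyToExample {
  // Hypothetical extension method; the name applyTo is taken from the question.
  implicit final class ApplyToOp[T](private val value: T) extends AnyVal {
    def applyTo[R](f: T => R): R = f(value)
  }

  def convert(v: Int): String = Integer.toHexString(v)

  // Sequence case and single-value case now read the same way.
  val many: List[String] = List(123, 234, 345) map convert
  val one: String = 123 applyTo convert
}
```

Using a distinct name such as applyTo avoids adding a map method to every value, which could be confusing next to the real map on collections.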

Create a Spark function which accepts key, value as arguments and returns an RDD[String]?

I want to create a function which can later be used by three different RDD data sets.
The function takes a key and a value and converts them to a Seq[String]:
def ConvertToMap(value: RDD[(String, (String,String,String,String,String,String))]): Seq[String] = {
  value.collect().toMap.values.toSeq.map(x => x.toString.replace("(","").replace(")",""))
}
When I apply it to one data set it is OK, because that data set has one key with 6 values, for example:
val StatusRDD=ConvertToMap(FilterDataSet("1013").map(x => ((x(5)+x(4)),(x(5),x(4),x(1),x(6),x(7),x(8)))))
But when I tried to apply it to another data set I had to rewrite the function, because the other data set contains 7 values per key. That forces me to write a second function with the same logic but a different name:
def ConvertToMap2(value: RDD[(String,(String,String,String,String,String,String,String))]): Seq[String] = {
  value.collect().toMap.values.toSeq.map(x => x.toString.replace("(","").replace(")",""))
}
val LuldRDD2=ConvertToMap2(FilterDataSet("1041").map(x => ((x(5)+x(4)),(x(5),x(4),x(1),x(6),x(7),x(8),x(9)))))
Is there a way to write one function for both, which accepts 6 or 7 String values with just one key? Or can I extend my function?
TupleX classes inherit from Product, so I would define the function like this:
def convertToSeq(rdd: RDD[(String, Product)]): Seq[String] = {
  // join the tuple fields with commas, matching the output of the toString/replace version
  rdd.values.map(x => x.productIterator.mkString(",")).collect().toSeq
}
Note that TupleX classes have a productIterator that I'm using here to create the string (I found your way somewhat verbose and more difficult to read) and I'm also delaying the collect call until after converting the values, so the map operation is run in parallel.
Finally, I have changed the name of the function, since it converts to a Seq and not a Map.
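The Product idea can be tried without Spark on ordinary collections. The sketch below uses hypothetical helper names (rowToString, convertAll) and assumes that a comma is the separator you want between the fields.

```scala
object ProductRowExample {
  // Works for a tuple of any arity, because every TupleN extends Product.
  // Assumption: fields are joined with a comma.
  def rowToString(row: Product): String = row.productIterator.mkString(",")

  def convertAll(rows: Seq[(String, Product)]): Seq[String] =
    rows.map { case (_, values) => rowToString(values) }

  // One key with 6 values and one key with 7 values go through the same function.
  val six: Seq[(String, Product)] = Seq(("k1", ("a", "b", "c", "d", "e", "f")))
  val seven: Seq[(String, Product)] = Seq(("k2", ("a", "b", "c", "d", "e", "f", "g")))
}
```

Because the parameter type only says Product, the function never needs to know how many fields the tuple has.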
Yep, got the answer: I need to use the data type Any.
def ConvertToMap (value: RDD[(String,Any)]): Seq[String] = {
  value.collect().toMap.values.toSeq.map(x => x.toString.replace("(","").replace(")",""))
}

How should I read this piece of Scala (Play) code?

I am new to Scala, and am learning it by going over some Play code. I have had a good read of the major concepts of Scala and am comfortable with functional programming having done some Haskell and ML.
I am really struggling to read this code, at the level of the syntax and the programming paradigms alone. I understand what the code is supposed to do, but not how it does it because I can't figure out the syntax.
// -- Home page
def index(ref: Option[String]): Action[AnyContent] = Prismic.action(ref) { implicit request =>
  for {
    someDocuments <- ctx.api.forms("everything").ref(ctx.ref).submit()
  } yield {
    Ok(views.html.index(someDocuments))
  }
}
(Prismic is an API separate from Play and is not really relevant here.) How would I describe this function (or is it a method?) to another developer over the phone, in other words, in plain English? For example, for this code:
def add(a: Int, b: Int): Int = a + b
I would say "add is a function which takes two integers, adds them together and returns the result as another integer".
In the Play code above I don't even know how to describe it after getting to "index is a function which takes an Option of a String and returns an Action of type AnyContent by ....."
The bit after the '=', then the curly braces and the '=>' scare me! How do I read them? And is this functional or OO?
Thanks for your assistance
Let's reduce it to this:
def index(ref: Option[String]): Action[AnyContent] = Prismic.action(ref)(function)
That's better, isn't it? index is a function from Option of String to Action of AnyContent (one word), which calls the action method of the object Prismic passing two curried parameters: ref, the parameter that index received, and a function (to be described).
So let's break down the anonymous function:
{ implicit request =>
  for {
    someDocuments <- ctx.api.forms("everything").ref(ctx.ref).submit()
  } yield {
    Ok(views.html.index(someDocuments))
  }
}
First, it uses {} instead of () because Scala allows one to drop () as parameter delimiter if it's a single parameter (there are two parameter lists, but each has a single parameter), and that parameter is enclosed in {}.
So, what about {}? Well, it's an expression that contains declarations and statements, with semi-colon inference on new lines, whose value is that of the last statement. That is, the value of these two expressions is the same, 3:
{ 1; 2; 3 }
{
  1
  2
  3
}
It's a syntactic convention to use {} when passing a function that extends over more than one line, even if, as in this case, that function could have been passed with just parentheses.
The next confusing thing is the implicit request =>. Let's pick something simpler:
x => x * 2
That's pretty easy, right? It takes one parameter, x, and returns x * 2. In our case, it is the same thing: the function takes one parameter, request, and returns this:
for (someDocuments <- somethingSomething())
  yield Ok(views.html.index(someDocuments))
That is, it calls some methods, iterates over the result, and maps those results into a new value. This is a close equivalent to Haskell's do notation. You can rewrite it as below (I'm breaking it down into multiple lines for readability):
ctx
  .api
  .forms("everything")
  .ref(ctx.ref)
  .submit()
  .map(someDocuments => Ok(views.html.index(someDocuments)))
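The equivalence between a single-generator for/yield and a map call can be checked on a type that is easy to run locally. The sketch below uses Option instead of whatever submit() really returns; fetchDocuments and render are hypothetical stand-ins, not Prismic or Play APIs.

```scala
object ForYieldExample {
  // Hypothetical stand-in for a call such as submit() that returns a wrapped value.
  def fetchDocuments(): Option[List[String]] = Some(List("doc1", "doc2"))

  // Hypothetical stand-in for Ok(views.html.index(...)).
  def render(docs: List[String]): String = docs.mkString("Ok(", ", ", ")")

  // for/yield with one generator...
  val viaFor: Option[String] =
    for {
      docs <- fetchDocuments()
    } yield render(docs)

  // ...is the same thing as a single map call.
  val viaMap: Option[String] = fetchDocuments().map(docs => render(docs))
}
```

The same rewriting applies to any type with a map method, which is why the for-comprehension also works on the value returned by submit().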
So, back to our method definition, we have this:
def index(ref: Option[String]): Action[AnyContent] = Prismic.action(ref)(
  implicit request =>
    ctx
      .api
      .forms("everything")
      .ref(ctx.ref)
      .submit()
      .map(someDocuments => Ok(views.html.index(someDocuments)))
)
The only remaining question here is what that implicit is about. Basically, it makes that parameter implicitly available throughout the scope of the function. Presumably, at least one of these method calls requires an implicit parameter, which is properly supplied by request. I could drop the implicit there and pass request explicitly if I knew which of these methods requires it, but since I don't, I'm skipping that.
An alternate way of writing it would be:
def index(ref: Option[String]): Action[AnyContent] = Prismic.action(ref)({
  request =>
    implicit val req = request
    ctx
      .api
      .forms("everything")
      .ref(ctx.ref)
      .submit()
      .map(someDocuments => Ok(views.html.index(someDocuments)))
})
Here I added {} back because I added a declaration to the body of the function, though I decided not to drop the parentheses, which I could have.
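Here is a small runnable sketch of what the implicit on the lambda parameter buys you. Request, greeting and action are hypothetical names, only loosely modelled on the Play code; they are not the real Play or Prismic types.

```scala
object ImplicitParamExample {
  // Hypothetical request type.
  final case class Request(user: String)

  // Needs a Request, but takes it from implicit scope.
  def greeting(prefix: String)(implicit request: Request): String =
    s"$prefix, ${request.user}"

  // Hypothetical stand-in for a method that accepts a handler function.
  def action(handler: Request => String): Request => String = handler

  // `implicit request =>` puts the parameter in implicit scope inside the body.
  val shortForm: Request => String = action { implicit request =>
    greeting("Hello")
  }

  // Equivalent long form: bind the parameter, then re-declare it as an implicit val.
  val longForm: Request => String = action { request =>
    implicit val req: Request = request
    greeting("Hello")
  }
}
```

In both handlers the call greeting("Hello") compiles only because a Request is available implicitly; without the implicit keyword (or the implicit val) the compiler would report a missing implicit parameter.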
Something like this:
index is a function which takes an Option of a String and returns an Action of type AnyContent. It calls the method action, which takes an Option as its first argument and, as its second argument, a function that assumes an implicit value request of type Request is in scope. This function uses a for-comprehension that calls the submit method, which returns an Option or a Future; if its execution is successful, it yields the result Ok(...), which will be wrapped in the Action returned by the action method of Prismic.
Prismic.action is a method that takes 2 groups of arguments (a.k.a. currying).
The first is ref
The second is { implicit request => ...}, a function defined in a block of a code

scala loan pattern, optional function param

I have a loan pattern that applies a function n times where 'i' is the incrementing variable. "Occasionally", I want the function passed in to have access to 'i'....but I don't want to require all functions passed in to require defining a param to accept 'i'. Example below...
def withLoaner = (n:Int) => (op:(Int) => String) => {
  val result = for(i <- 1 to n) yield op(i)
  result.mkString("\n")
}
def bob = (x:Int) => "bob" // don't need access to i. is there a way use () => "bob" instead?
def nums = (x:Int) => x.toString // needs access to i, define i as an input param
println(withLoaner(3)(bob))
println(withLoaner(3)(nums))
def withLoaner(n: Int) = new {
  def apply(op: Int => String) : String = (1 to n).map(op).mkString("\n")
  def apply(op: () => String) : String = apply{i: Int => op()}
}
(not sure how it is related to the loan pattern)
Edit: a little explanation, as requested in a comment.
I'm not sure what you do and don't know about Scala, or what you don't understand in that code, so sorry if I just belabor the obvious.
First, a Scala program consists of traits/classes (and singleton objects) and methods. Everything that is done is done by methods (leaving constructors aside). Functions (as opposed to methods) are instances of (subtypes of) the various FunctionN traits (N being the number of arguments). Each of them has an apply method that is the actual implementation.
If you write
val inc = {i: Int => i + 1}
it is desugared to
val inc = new Function1[Int, Int] {def apply(i: Int) = i + 1}
(defines an anonymous class extending Function1, with given apply method and creating an instance)
So writing a function has rather more weight than a simple method. Also, functions cannot be overloaded (several methods with the same name but different signatures, which is what I did above), nor can they use named arguments or default argument values.
On the other hand, functions are first-class values (they can be passed as arguments and returned as results) while methods are not. Methods are automatically converted to functions when needed; however, there may be some edge cases when doing that. If a method is intended solely to be used as a function value, rather than called as a method, it might be better to write a function.
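A short sketch of the method/function distinction described above, with hypothetical names (doubleMethod, doubleFunction, describe) chosen just for the illustration:

```scala
object MethodVsFunctionExample {
  def doubleMethod(x: Int): Int = x * 2        // a method
  val doubleFunction: Int => Int = x => x * 2  // a function value (a Function1 instance)

  // Methods can be overloaded; a function value cannot.
  def describe(x: Int): String = s"int $x"
  def describe(x: String): String = s"string $x"

  // A method is converted to a function automatically where a function is expected.
  val viaMethod: List[Int] = List(1, 2, 3).map(doubleMethod)
  val viaFunction: List[Int] = List(1, 2, 3).map(doubleFunction)

  // The same conversion, triggered by an expected function type on a val.
  val asFunction: Int => Int = doubleMethod
}
```

Once converted, asFunction is an ordinary value that can be stored or passed around like doubleFunction.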
A function f, with its apply method, is called with f(x) rather than f.apply(x) (which works too), because scala desugars function call notation on a value (value followed by parentheses and 0 or more args) to a call to method apply. f(x) is syntactic sugar for f.apply(x). This works whatever the type of f, it does not need to be one of the FunctionN.
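The point that f(x) is sugar for f.apply(x) for any type, not only FunctionN, can be seen with a tiny example. Greeter is a hypothetical class invented for this sketch.

```scala
object ApplySugarExample {
  // Not a FunctionN: just a class that happens to define apply.
  final class Greeter(greeting: String) {
    def apply(name: String): String = s"$greeting, $name"
  }

  val greet = new Greeter("Hi")
  val sugared: String = greet("Bob")           // rewritten by the compiler to greet.apply("Bob")
  val explicitCall: String = greet.apply("Bob")

  // The Function1 case from the text, written out by hand.
  val inc = new Function1[Int, Int] { def apply(i: Int): Int = i + 1 }
}
```

Both call forms on greet produce the same string, and inc(1) goes through exactly the same apply rewriting.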
What is done in withLoaner is returning an object (of an anonymous type, but one could have defined a class separately and returned an instance of it). The object has two apply methods, one accepting an Int => String, the other a () => String. When you do withLoaner(n)(f) it means withLoaner(n).apply(f). The appropriate apply method is selected if f has the proper type for one of them; otherwise you get a compile error.
Just in case you wonder, withLoaner(n) does not mean withLoaner.apply(n) (or it would never stop, as that could just as well mean withLoaner.apply.apply(n)), because withLoaner is a method, not a value.
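For completeness, here is a sketch of the "define a class separately" variant mentioned above. Loaner is a hypothetical class name; the behaviour is meant to match the anonymous-object version, and a named class avoids relying on a structural type for the result.

```scala
object LoanerExample {
  // Named class instead of the anonymous `new { ... }` object.
  final class Loaner(n: Int) {
    def apply(op: Int => String): String = (1 to n).map(op).mkString("\n")
    // Overload for functions that do not care about the counter.
    def apply(op: () => String): String = apply((_: Int) => op())
  }

  def withLoaner(n: Int): Loaner = new Loaner(n)

  val bob: () => String = () => "bob"        // ignores the counter
  val nums: Int => String = i => i.toString  // uses the counter

  val bobs: String = withLoaner(3)(bob)      // resolves to apply(op: () => String)
  val numbers: String = withLoaner(3)(nums)  // resolves to apply(op: Int => String)
}
```

The overload is chosen from the static type of the argument, so bob no longer needs a dummy Int parameter.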