What does the following code do?
var pair: RDD[(Long,Long)] = sparkContext.parallelize(Array((3L, 4L)))
Does variable "pair" consist of a single element/pair? The types of 3 and 4 is "long" not "int".
It creates an Array[Tuple2[Long,Long]] with one element (3L, 4L) (aka Tuple2(3L,4L)).
Such case could be written (maybe more readable), as Array(3L -> 4L).
Also note it's better not to use mutable type like Array (rather Seq or List).
Let's decompose it:
Array(el1, el2, el3, ...) creates a new Array containing the given elements. In our case, there is a single element in the array: (3L,4L).
Writing (a,b) creates a Tuple2 value containing a and b of types A and B (not necessarily the same types). The Tuple2[A,B] type can also be written as (A,B) for clarity.
In our case, a=3L and b=4L. You might be confused by the L suffix after literals. It is there so that the compiler can differentiate between 3 as an Int and 3L as a Long. Scala also supports other such suffixes, like f or d for explicitly declaring Float or Double literals.
So you're indeed creating an Array[Tuple2[Long,Long]], also expressed as Array[(Long,Long)], which goes into your RDD.
Related
I'm reading a bit about Scala currying here and I don't understand this example very much:
def foldLeft[B](z: B)(op: (B, A) => B): B
What is the [B] in square brackets? Why is it in brackets? The B after the colon is the return type right? What is the type?
It looks like this method has 2 parameter lists: one with a parameter named z and one with a parameter named op which is a function.
op looks like it takes a function (B, A) => B). What does the right side mean? It returns B?
And this is apparently how it is used:
val numbers = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val res = numbers.foldLeft(0)((m, n) => m + n)
print(res) // 55
What is going on? Why wasn't the [B] needed when called?
In Scala documentation that A type (sometimes A1) is often the placeholder for a collection's element type. So if you have...
List('c','q','y').foldLeft( //....etc.
...then A becomes the reference for Char because the list is a List[Char].
The B is a placeholder for a 2nd type that foldLeft will have to deal with. Specifically it is the type of the 1st parameter as well as the type of the foldLeft result. Most of the time you actually don't need to specify it when foldLeft is invoked because the compiler will infer it. So, yeah, it could have been...
numbers.foldLeft[Int](0)((m, n) => m + n)
...but why bother? The compiler knows it's an Int and so does anyone reading it (anyone who knows Scala).
(B, A) => B is a function that takes 2 parameters, one of type B and one of type A, and produces a result of type B.
What is the [B] in square brackets?
A type parameter (that's also known as "generics" if you've seen something like Java before)
Why is it in brackets?
Because type parameters are written in brackets, that's just Scala syntax. Java, C#, Rust, C++ use angle brackets < > for similar purposes, but since arrays in Scala are accessed as arr(idx), and (unlike Haskell or Python) Scala does not use [ ... ] for list comprehensions, the square brackets could be used for type parameters, and there was no need for angular brackets (those are more difficult to parse anyway, especially in a language which allows almost arbitrary names for infix and postfix operators).
The B after the colon is the return type right?
Right.
What is the type?
Ditto. The return type is B.
It looks like this method has 2 parameter lists: one with a parameter named z and one with a parameter named op which is a function.
This method has a type parameter B and two argument lists for value arguments, correct. This is done to simplify the type inference: the type B can be inferred from the first argument z, so it does not have to be repeated when writing down the lambda-expression for op. This wouldn't work if z and op were in the same argument list.
op looks like it takes a function (B, A) => B.
The type of the argument op is (B, A) => B, that is, Function2[B, A, B], a function that takes a B and an A and returns a B.
What does the right side mean? It returns B?
Yes.
What is going on?
m acts as accumulator, n is the current element of the list. The fold starts with integer value 0, and then accumulates from left to right, adding up all numbers. Instead of (m, n) => m + n, you could have written _ + _.
Why wasn't the [B] needed when called?
It was inferred from the type of z. There are many other cases where the generic type cannot be inferred automatically, then you would have to specify the return type explicitly by passing it as an argument in the square brackets.
This is what is called polymorphism. The function can work on multiple types and you sometimes want to give a parameter of what type will be worked with. Basically the B is a type parameter and can either be given explicitly as a type, which would be Int and then it should be given in square brackets or implicitly in parentheses like you did with the 0. Read about polymorphism here Scala polymorphism
in insufficiently-polymorphic
the author says about:
def foo[A](fst: List[A], snd: List[A]): List[A]
There are fewer ways we can implement the function. In particular, we
can’t just hard-code some elements in a list, because we have no
ability to manufacture values of an arbitrary type.
I did not understand this, because also in the [Char] version we had no ability to manufacture values of an arbitrary type we had to have them of type [Char] so why are there less ways to implement this?
In the generic version you know that the output list can only contain some arrangement of the elements contained in fst and snd since there is no way to construct new values of some arbitrary type A. In contrast, if you know the output type is Char you can e.g.
def foo(fst: List[Char], snd: List[Char]) = List('a', 'b', 'c')
In addition you cannot use the values contained in the input lists to make decisions which affect the output, since you don't know what they are. You can do this if you know the input type e.g.
def foo(fst: List[Char], snd: List[Char]) = fst match {
case Nil => snd
case 'a'::fs => snd
case _ => fst
}
I'm assuming the author means, that there's no way to construct a non-empty List a but there's a way to construct a List Char, e.g. by using a String literal. You could just ignore the arguments and just return a hard-coded String.
An example of this would be:
foo :: List Char -> List Char -> List Char
foo a b = "Whatever"
You can't construct a value of an arbitrary type a, but you can construct a value of type Char.
This is a simple case of a property called "parametricity" or "free theorem", which applies to every polymorphic function.
An even simpler example is the following:
fun1 :: Int -> Int
fun2 :: forall a. a -> a
fun1 can be anything: successor, predecessor, square, factorial, etc. This is because it can "read" its input, and act accordingly.
fun2 must be the identity function (or loop forever). This because fun2 receives its input, but it can not examine it in any useful way: since it is of an abstract, unknown type a, no operations can be performed on it. The input is effectively an opaque token. The output of foo2 must be of type a, for which we do not know any construction means -- we can not create a value of type a from nothing. The only option is to take the input a and use it to craft the output a. Hence, fun2 is the identity.
The above parametricity result holds when you have no way to perform tests on the input or the type a. If we, e.g., allowed if x.instanceOf[Int] ..., or if x==null ..., or type casts (in OOP) then we could write fun2 in other ways.
I have a Tuple where I have stored anonymous functions, I want to iterate through them and execute them.
val functions = ((x:Int, y:Int) => x + y, (x:Int, y: Int) => x - y)
// I want to execute the anonymous functions in the Tuple
functions.productIterator.foreach(function => function)
Unfortunately I am not able to do
functions.productIterator.foreach(function => function(1, 2))
OR
functions.productIterator.foreach(_(1, 2))
what is the way out.
Tuples are not meant to be iterated over. The types get lost because each entry in a tuple is able to be a different type and so the type system just assumes Any (thus the Iterator[Any]). So the real suggestion is that if you want to iterate, use a collection like a Seq or Set.
On the other hand, if you know that the tuple contains functions of a particular type, then you can bypass the type checking by casting with asInstanceOf, but this is not recommended because type checking is your friend.
functions.productIterator.map(_.asInstanceOf[(Int,Int)=>Int](1, 2))
// produces `Iterator(3, -1)`
Alternatively, have a look at HLists in Shapeless, which have properties of both tuples and collections.
productIterator on a Tuple returns Iterator[Any] and not Iterator[Function2[Int, Int, Int]] as you're expecting.
We can extract the elements of the tuple into a Seq while preserving type information; hence
Seq(functions._1, functions._2).map(_(1,2))
List(3, -1)
I was trying to find line with maximum words, and i wrote the following lines, to run on spark-shell:
import java.lang.Math
val counts = textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
But since, map is one to one , and flatMap is one to either zero or anything. So i tried replacing map with flatMap, in above code. But its giving error as:
<console>:24: error: type mismatch;
found : Int
required: TraversableOnce[?]
val counts = F1.flatMap(s => s.split(" ").size).reduce((a,b)=> Math.max(a,b))
If anybody could make me understand the reason, it will really be helpful.
flatMap must return an Iterable which is clearly not what you want. You do want a map because you want to map a line to the number of words, so you want a one-to-one function that takes a line and maps it to the number of words (though you could create a collection with one element, being the size of course...).
FlatMap is meant to associate a collection to an input, for instance if you wanted to map a line to all its words you would do:
val words = textFile.flatMap(x => x.split(" "))
and that would return an RDD[String] containing all the words.
In the end, map transforms an RDD of size N into another RDD of size N (e.g. your lines to their length) whereas flatMap transforms an RDD of size N into an RDD of size P (actually an RDD of size N into an RDD of size N made of collections, all these collections are then flattened to produce the RDD of size P).
P.S.: one last word that has nothing to do with your problem, it is more efficient to do (for a string s)
val nbWords = s.split(" ").length
than call .size(). Indeed, the split method returns an array of String and arrays do not have a size method. So when you call .size() you have an implicit conversion from Array[String] to SeqLike[String] which creates new objects. But Array[T] do have a length field so there's no conversion calling length. (It's a detail but I think it's good habit though).
Any use of map can be replaced by flatMap, but the function argument has to be changed to return a single-element List: textFile.flatMap(line => List(line.split(" ").size)). This isn't a good idea: it just makes your code less understandable and less efficient.
After reading Tired of Null Pointer Exceptions? Consider Using Java SE 8's Optional!'s part about why use flatMap() rather than Map(), I have realized the truly reason why flatMap() can not replace map() is that map() is not a special case of flatMap().
It's true that flatMap() means one-to-many, but that's not the only thing flatMap() does. It can also strip outer Stream() if put it simply.
See the definations of map and flatMap:
Stream<R> map(Function<? super T, ? extends R> mapper)
Stream<R> flatMap(Function<? super T, ? extends Stream<? extends R>> mapper)
the only difference is the type of returned value in inner function. What map() returned is "Stream<'what inner function returned'>", while what flatMap() returned is just "what inner function returned".
So you can say that flatMap() can kick outer Stream() away, but map() can't. This is the most difference in my opinion, and also why map() is not just a special case of flatMap().
ps:
If you really want to make one-to-one with flatMap, then you should change it into one-to-List(one). That means you should add an outer Stream() manually which will be stripped by flatMap() later. After that you'll get the same effect as using map().(Certainly, it's clumsy. So don't do like that.)
Here are examples for Java8, but the same as Scala:
use map():
list.stream().map(line -> line.split(" ").length)
deprecated use flatMap():
list.stream().flatMap(line -> Arrays.asList(line.split(" ").length).stream())
I am currently learning Haskell, so here are a beginner's questions:
What is meant by single type in the text below ?
Is single type a special Haskell term ? Does it mean atomic type here ?
Or does it mean that I can never make a list in Haskell in which I can put both 1 and 'c' ?
I was thinking that a type is a set of values.
So I cannot define a type that contains Chars and Ints ?
What about algebraic data types ?
Something like: data IntOrChar = In Int | Ch Char ? (I guess that should work but I am confused what the author meant by that sentence.)
Btw, is that the only way to make a list in Haskell in which I can put both Ints and Chars? Or is there a more tricky way ?
A Scala analogy: in Scala it would be possible to write implicit conversions to a type that represents both Ints and Chars (like IntOrChar) and then it would be possible to put seemlessly Ints and Chars into List[IntOrChar], is that not possible with Haskell ? Do I always have to explicitly wrap every Int or Char into IntOrChar if I want to put them into a list of IntOrChar ?
From Gentle Intro to Haskell:
Haskell also incorporates polymorphic types---types that are
universally quantified in some way over all types. Polymorphic type
expressions essentially describe families of types. For example,
(forall a)[a] is the family of types consisting of, for every type a,
the type of lists of a. Lists of integers (e.g. [1,2,3]), lists of
characters (['a','b','c']), even lists of lists of integers, etc., are
all members of this family. (Note, however, that [2,'b'] is not a
valid example, since there is no single type that contains both 2 and
'b'.)
Short answer.
In Haskell there are no implicit conversions. Also there are no union types - only disjoint unions(which are algebraic data types). So you can only write:
someList :: [IntOrChar]
someList = [In 1, Ch 'c']
Longer and certainly not gentle answer.
Note: This is a technique that's very rarely used. If you need it you're probably overcomplicating your API.
There are however existential types.
{-# LANGUAGE ExistentialQuantification, RankNTypes #-}
class IntOrChar a where
intOrChar :: a -> Either Int Char
instance IntOrChar Int where
intOrChar = Left
instance IntOrChar Char where
intOrChar = Right
data List = Nil
| forall a. (IntOrChar a) => Cons a List
someList :: List
someList = (1 :: Int) `Cons` ('c' `Cons` Nil)
Here I have created a typeclass IntOrChar with only function intOrChar. This way you can convert anything of type forall a. (IntOrChar a) => a to Either Int Char.
And also a special kind of list that uses existential type in its second constructor.
Here type variable a is bound(with forall) at the constructor scope. Therefore every time
you use Cons you can pass anything of type forall a. (IntOrChar a) => a as a first argument. Consequently during a destruction(i.e. pattern matching) the first argument will
still be forall a. (IntOrChar a) => a. The only thing you can do with it is either pass it on or call intOrChar on it and convert it to Either Int Char.
withHead :: (forall a. (IntOrChar a) => a -> b) -> List -> Maybe b
withHead f Nil = Nothing
withHead f (Cons x _) = Just (f x)
intOrCharToString :: (IntOrChar a) => a -> String
intOrCharToString x =
case intOrChar of
Left i -> show i
Right c -> show c
someListHeadString :: Maybe String
someListHeadString = withHead intOrCharToString someList
Again note that you cannot write
{- Wont compile
safeHead :: IntOrChar a => List -> Maybe a
safeHead Nil = Nothing
safeHead (Cons x _) = Just x
-}
-- This will
safeHead2 :: List -> Maybe (Either Int Char)
safeHead2 Nil = Nothing
safeHead2 (Cons x _) = Just (intOrChar x)
safeHead will not work because you want a type of IntOrChar a => Maybe a with a bound at safeHead scope and Just x will have a type of IntOrChar a1 => Maybe a1 with a1 bound at Cons scope.
In Scala there are types that include both Int and Char such as AnyVal and Any, which are both supertypes of Char and Int. In Haskell there is no such hierarchy, and all the basic types are disjoint.
You can create your own union types which describe the concept of 'either an Int or a Char (or you could use the built-in Either type), but there are no implicit conversions in Haskell to transparently convert an Int into an IntOrChar.
You could emulate the concept of 'Any' using existential types:
data AnyBox = forall a. (Show a, Hashable a) => AB a
heteroList :: [AnyBox]
heteroList = [AB (1::Int), AB 'b']
showWithHash :: AnyBox -> String
showWithHash (AB v) = show v ++ " - " ++ (show . hash) v
let strs = map showWithHash heteroList
Be aware that this pattern is discouraged however.
I think that the distinction that is being made here is that your algebraic data type IntOrChar is a "tagged union" - that is, when you have a value of type IntOrChar you will know if it is an Int or a Char.
By comparison consider this anonymous union definition (in C):
typedef union { char c; int i; } intorchar;
If you are given a value of type intorchar you don't know (apriori) which selector is valid. That's why most of the time the union constructor is used in conjunction with a struct to form a tagged-union construction:
typedef struct {
int tag;
union { char c; int i; } intorchar_u
} IntOrChar;
Here the tag field encodes which selector of the union is valid.
The other major use of the union constructor is to overlay two structures to get an efficient mapping between sub-structures. For example, this union is one way to efficiently access the individual bytes of a int (assuming 8-bit chars and 32-bit ints):
union { char b[4]; int i }
Now, to illustrate the main difference between "tagged unions" and "anonymous unions" consider how you go about defining a function on these types.
To define a function on an IntOrChar value (the tagged union) I claim you need to supply two functions - one which takes an Int (in the case that the value is an Int) and one which takes a Char (in case the value is a Char). Since the value is tagged with its type, it knows which of the two functions it should use.
If we let F(a,b) denote the set of functions from type a to type b, we have:
F(IntOrChar,b) = F(Int,b) \times F(Char,b)
where \times denotes the cross product.
As for the anonymous union intorchar, since a value doesn't encode anything bout its type the only functions which can be applied are those which are valid for both Int and Char values, i.e.:
F(intorchar,b) = F(Int,b) \cap F(Char,b)
where \cap denotes intersection.
In Haskell there is only one function (to my knowledge) which can be applied to both integers and chars, namely the identity function. So there's not much you could do with a list like [2, 'b'] in Haskell. In other languages this intersection may not be empty, and then constructions like this make more sense.
To summarize, you can have integers and characters in the same list if you create a tagged-union, and in that case you have to tag each of the values which will make you list look like:
[ I 2, C 'b', ... ]
If you don't tag your values then you are creating something akin to an anonymous union, but since there aren't any (useful) functions which can be applied to both integers and chars there's not really anything you can do with that kind of union.