I have this structure:
Seq[(Int, mutable.Set[List[A]])]
and my aim is to flatten it in order to obtain a structure like List[List[A]].
In particular, my structure is something like this:
val order = List(
  (1, HashSet(List('o'))),
  (4, HashSet(List('i', 'j', 'k', 'l'))),
  (3, HashSet(List('a', 'b', 'c'), List('f', 'g', 'h'))),
  (2, HashSet(List('d', 'e'), List('m', 'n'), List('z', 'x')))
)
and I want to obtain something like:
val order = List(List('o'), List('i', 'j', 'k', 'l'), List('a', 'b', 'c'), ...)
I tried to make a map in this way:
order map {
case (_, Set[List[A]]) => List[A]
}
but it doesn't work. The error is "pattern type is incompatible with expected type".
The same thing also happens with:
case (Seq[(_, Set[List[A]])]) => List[A] and case (Seq[(_, mutable.Set[List[A]])]) => List[A].
I'm pretty sure the solution is to use map (or flatMap), but apparently I'm using it in the wrong way. Does anyone have any suggestions or ideas on how I could do it? Thanks!
How about this?
order.flatMap(_._2)
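For the sample data above, that one-liner already does the job. A minimal sketch (assuming scala.collection.mutable is imported):

import scala.collection.mutable

val order = List(
  (1, mutable.HashSet(List('o'))),
  (4, mutable.HashSet(List('i', 'j', 'k', 'l')))
)

// _._2 selects the Set[List[Char]] of each pair; flatMap concatenates them
val flat: List[List[Char]] = order.flatMap(_._2)
// flat: List(List(o), List(i, j, k, l))

Note that the sets themselves are unordered, so the relative order of the lists coming from the same set is not guaranteed.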
I have an RDD (r2Join1) which holds the following data
(100,(102|1001,201))
(100,(102|1001,200))
(100,(103|1002,201))
(100,(103|1002,200))
(150,(151|1003,204))
I want to transform this to the following
(102, (1001, 201))
(102, (1001, 200))
(103, (1002, 201))
(103, (1002, 200))
(151, (1003, 204))
i.e., I want to transform (k, (v1|v2, v3)) to (v1, (v2, v3)).
I did the following:
val m2 = r2Join1.map({case (k, (v1, v2)) => val p: Array[String] = v1.split("\\|") (p(0).toLong, (p(1).toLong, v2.toLong))})
I get the following error
error: too many arguments for method apply: (i: Int)String in class Array
I am new to Spark & Scala. Please let me know how this error can be resolved.
The code looks like it might be off in other areas, but without the rest I can't be sure. At minimum, this should get you moving: you need either a semicolon after your split, or to put the two separate statements on separate lines.
v1.split("\\|");(p(0).toLong, (p(1).toLong, v2.toLong))
Without the semicolon the compiler is interpreting it as:
v1.split("\\|").apply(p(0).toLong...)
where apply acts as an indexer of the array in this case.
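Putting that together, the fixed map would look something like this (a sketch; it assumes each value in r2Join1 is a pair of strings, as the sample data suggests):

val m2 = r2Join1.map { case (k, (v1, v2)) =>
  // split "102|1001" into Array("102", "1001")
  val p: Array[String] = v1.split("\\|")
  // the first piece becomes the new key, the second piece pairs up with v2
  (p(0).toLong, (p(1).toLong, v2.toLong))
}

With the statements on separate lines, the compiler no longer tries to apply the result of split to the tuple that follows.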
I am looking for example code that implements a nested loop in Spark. I am looking for the following functionality.
Given an RDD data1 = sc.parallelize(range(10)) and another dataset data2 = sc.parallelize(['a', 'b', 'c']), I am looking for something that will pick each 'key' from data2 and append each 'value' from data1 to create a list of key-value pairs that looks, perhaps in internal memory, something like [(a, 0), (a, 1), (a, 2), ..., (c, 8), (c, 9)], and then do a reduce by key using a simple reducer function, say lambda x, y: x + y.
From the logic described above, the expected output is
(a, 45)
(b, 45)
(c, 45)
My attempt
data1 = sc.parallelize(range(10))
data2 = sc.parallelize(['a', 'b', 'c'])
f = lambda x: data2.map(lambda y: (y, x))
data1.map(f).reduceByKey(lambda x, y: x+y)
The obtained error
Exception: It appears that you are attempting to broadcast an RDD or
reference an RDD from an action or transformation. RDD transformations
and actions can only be invoked by the driver, not inside of other
transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x)
is invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
I am a complete newbie at this, so any help is highly appreciated!
OS Information
I am running this on a standalone Spark installation on Linux. Details available if relevant.
Here is a potential solution. I am not too happy with it, though, because it doesn't represent a true for loop.
data1 = sc.parallelize(range(10))
data2 = sc.parallelize(['a', 'b', 'c'])
data2.cartesian(data1).reduceByKey(lambda x, y: x+y).collect()
gives
[('a', 45), ('c', 45), ('b', 45)]
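For what it's worth, the same cartesian-then-reduce pattern carries over directly to Spark's Scala API. A rough equivalent (assuming a SparkContext named sc is in scope):

val data1 = sc.parallelize(0 until 10)
val data2 = sc.parallelize(Seq("a", "b", "c"))

// cartesian pairs every key with every number; reduceByKey then sums per key
data2.cartesian(data1).reduceByKey(_ + _).collect()
// e.g. Array((a,45), (c,45), (b,45)), order not guaranteed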
define ['a', 'b', 'c'], (A, B, C,) ->
I want to write
define
['a', 'b', 'c']
, (A, B, C,) ->
How do I do it without a compiler error?
define(
  ['a', 'b', 'c']
  (A, B, C) ->
    "D"
)
Which compiles to:
define(['a', 'b', 'c'], function(A, B, C) {
  return "D";
});
As a general rule, if you have multiple arguments that you want to be comma separated in the output but line separated in the input, put them at the same indent level.
The parentheses after the define are necessary to tell the compiler that there is a set of things that need to be passed into the function.
The comma after the C in your input was causing an error as well.
First, you have to get rid of the trailing comma in your anonymous function's argument list. Then you have a few options:
define \
  ['a', 'b', 'c']
  (A, B, C) ->
Be careful that the backslash doesn't have anything other than a newline after it. Or you could add parentheses:
define(
  ['a', 'b', 'c']
  (A, B, C) ->
)
but be very careful that you don't leave any space between define and ( or you'll get an accidental comma operator in the compiled JavaScript.
Trying to generate, from a list of chars, a list of unique chars mapped to their frequency - e.g. something like:
List('a','b','a') -> List(('a',2), ('b',1))
So, just mucking around in the console, this works:
val l = List('a', 'b', 'c', 'b', 'c', 'a')
val s = l.toSet
s.map(i => (i, l.filter(x => x == i).size))
but shortening it by just combining the last two lines doesn't?
l.toSet.map(i => (i, l.filter(x => x == i).size))
gives the error "missing parameter type".
Can someone explain why Scala complains about this syntax?
When you say val s = l.toSet, the compiler figures that the only sensible type parameter for toSet is Char, since that's the most specific choice. Then, given that s is a Set[Char], the compiler realizes that the map must be from a Char.
But in the second case, it withholds judgment on what the type of the elements in toSet is. It might be Char, but AnyVal would also work, as would Any. For example, this compiles:
l.toSet.map((i: Any) => (i, l.filter(x => x == i).size))
Normally the rule is that the compiler should pick the most specific value. But since functions are contravariant in their argument, they are most specific when they take an Any as an argument, so the compiler can't decide. There could exist a rule to break the tie ("prefer the early assumption"), but there isn't one implemented. So it asks for your help.
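To see the contravariance point concretely, here is a tiny illustration (the names f and g are mine, just for demonstration):

// Function1 is contravariant in its argument: a function that accepts Any
// can be used anywhere a function that accepts Char is expected
val f: Any => String = _.toString
val g: Char => String = f // compiles, so Any => String is the "more specific" function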
You can provide the type either on the function argument or on the toSet to fix the problem:
l.toSet.map((i: Char) => (i, l.filter(x => x == i).size))
l.toSet[Char].map(i => (i, l.filter(x => x == i).size))
Adding the type [Char] to toSet does the trick.
scala> l.toSet[Char].map(i => (i, l.filter(x => x == i).size))
scala.collection.immutable.Set[(Char, Int)] = Set((a,2), (b,2), (c,2))
Why doesn't this code work:
scala> List('a', 'b', 'c').toSet.subsets.foreach(e => println(e))
<console>:8: error: missing parameter type
List('a', 'b', 'c').toSet.subsets.foreach(e => println(e))
^
But when I split it up, it works fine:
scala> val itr=List('a', 'b', 'c').toSet.subsets
itr: Iterator[scala.collection.immutable.Set[Char]] = non-empty iterator
scala> itr.foreach(e => println(e))
Set()
Set(a)
Set(b)
Set(c)
Set(a, b)
Set(a, c)
Set(b, c)
Set(a, b, c)
And this code is OK as well:
Set('a', 'b', 'c').subsets.foreach(e => println(e))
First, there's a simpler version of the code that has the same issue:
List('a', 'b', 'c').toSet.foreach(e => println(e))
This doesn't work either:
List('a', 'b', 'c').toBuffer.foreach(e => println(e))
However, these work just fine:
List('a', 'b', 'c').toList.foreach(e => println(e))
List('a', 'b', 'c').toSeq.foreach(e => println(e))
List('a', 'b', 'c').toArray.foreach(e => println(e))
If you go take a look at the List class documentation you'll see that the methods that work return some type parameterized with A, whereas methods that don't work return types parameterized with B >: A. The problem is that the Scala compiler can't figure out which B to use! That means it will work if you tell it the type:
List('a', 'b', 'c').toSet[Char].foreach(e => println(e))
Now as for why toSet and toBuffer have that signature, I have no idea...
Lastly, not sure if this is helpful, but this works too:
// I think this works because println can take type Any
List('a', 'b', 'c').toSet.foreach(println)
Update: After poking around the docs a little bit more, I noticed that the method works on all the types with a covariant type parameter, but the ones with an invariant type parameter have the B >: A in the return type. Interestingly, although Array is invariant in Scala, they provide two versions of the method (one with A and one with B >: A), which is why it doesn't have that error.
I also never really answered why breaking the expression into two lines works. When you simply call toSet on its own, the compiler will automatically infer A as B in the type for the resulting Set[B], unless you give it a specific type to pick. This is just how the type inference algorithm works. However, when you throw another unknown type into the mix (i.e. the type of e in your lambda), the inference algorithm chokes and dies: it just can't handle an unknown B >: A and an unknown type of e at the same time.
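To make the two-step inference concrete, here is a small sketch built on the Scala 2 signature def toSet[B >: A]: Set[B]:

// Step 1 on its own line: the compiler commits to B = Char for toSet
val s = List('a', 'b', 'c').toSet // s: Set[Char]

// Step 2: with s's type already fixed, e is known to be Char
s.foreach(e => println(e))

// In a single expression, B and the type of e must be solved together:
// List('a', 'b', 'c').toSet.foreach(e => println(e)) // error: missing parameter type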