How to find all possible combinations between tuples without duplicates in Scala

suppose I have list of tuples:
val a = ListBuffer((1, 5), (6, 7))
Update: elements of a are assumed to be distinct within each Tuple2; in other words, it can be, for example, (1,4) (1,5) but not (1,1) (2,2).
I want to generate all combinations of the elements of ListBuffer a across these two tuples, but without duplication. The result would look like:
ListBuffer((1,5,6), (1,5,7), (6,7,1), (6,7,5))
Update: elements in each result Tuple3 are also distinct, and the tuples themselves are distinct as well: as long as (6,7,1) is present, (1,7,6) should not be in the result.
If, for example, val a = ListBuffer((1, 4), (1, 5)), then the result should be ListBuffer((1,4,5)), in which (1,4,1) and (1,5,1) are discarded.
How can I do that in Scala?
Note: I just gave an example; usually the val a has tens of scala.Tuple2s.

If the individual elements are unique, as you've commented, then you should be able to flatten everything (un-tuple), take the desired combinations, and re-tuple:
val a = collection.mutable.ListBuffer((1, 4), (1, 5))
a.flatMap(t => Seq(t._1, t._2)) // un-tuple
 .distinct                      // no duplicates
 .combinations(3)               // unique sets of 3
 .map { case Seq(x, y, z) => (x, y, z) } // re-tuple
 .toList                        // if you don't want an Iterator
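For reference, a minimal check against the question's original input: combinations emits elements in their order of first appearance, so (1,6,7) and (5,6,7) here are the same combinations the question writes as (6,7,1) and (6,7,5).
val b = collection.mutable.ListBuffer((1, 5), (6, 7))
val result = b.flatMap(t => Seq(t._1, t._2))
  .distinct
  .combinations(3)
  .map { case Seq(x, y, z) => (x, y, z) }
  .toList
// result: List((1,5,6), (1,5,7), (1,6,7), (5,6,7))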

Related

Perform a nested for loop with RDD.map() in Scala

I'm rather new to Spark and Scala and have a Java background. I have done some programming in Haskell, so I'm not completely new to functional programming.
I'm trying to accomplish some form of a nested for-loop. I have an RDD which I want to manipulate based on every two elements in the RDD. The pseudo code (Java-like) would look like this:
// some RDD named rdd is available before this
List list = new ArrayList();
for (int i = 0; i < rdd.length; i++) {
    list.add(rdd.get(i)._1);
    for (int j = 0; j < rdd.length; j++) {
        if (rdd.get(i)._1 == rdd.get(j)._1) {
            list.add(rdd.get(j)._1);
        }
    }
}
// Then now let ._1 of the rdd be this list
My Scala solution (which does not work) looks like this:
val aggregatedTransactions = joinedTransactions.map( f => {
  var list = List[Any](f._2._1)
  val filtered = joinedTransactions.filter(t => f._1 == t._1)
  for (i <- filtered) {
    list ::= i._2._1
  }
  (f._1, list, f._2._2)
})
I'm trying to put item ._2._1 into a list whenever ._1 of both items is equal.
I am aware that I cannot use a filter or map function within another map function. I've read that you could achieve something like this with a join, but I don't see how I could actually get these items into a list or any other structure that can be used as a list.
How do you achieve an effect like this with RDDs?
Assuming your input has the form RDD[(A, (A, B))] for some types A, B, and that the expected result should have the form RDD[A] - not a List (because we want to keep data distributed) - this would seem to do what you need:
rdd.join(rdd.values).keys
Details:
It's hard to understand the exact input and expected output, as the data structure (type) of neither is explicitly stated, and the requirement is not well explained by the code example. So I'll make some assumptions and hope that it will help with your specific case.
For the full example, I'll assume:
Input RDD has type RDD[(Int, (Int, Int))]
Expected output has the form RDD[Int], and would contain a lot of duplicates - if the original RDD has the "key" X multiple times, each match (in ._2._1) would appear once per occurrence of X as a key
If that's the case we're trying to solve - this join would solve it:
// Some sample data, assuming all ints
val rdd = sc.parallelize(Seq(
  (1, (1, 5)),
  (1, (2, 5)),
  (2, (1, 5)),
  (3, (4, 5))
))
// joining the original RDD with an RDD of the "values" -
// so the joined RDD will have "._2._1" as key
// then we get the keys only, because they equal the values anyway
val result: RDD[Int] = rdd.join(rdd.values).keys
// result contains the original keys, repeated once per match in ._2._1
println(result.collect.toList) // List(1, 1, 1, 1, 2)
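If you do want the matching ._2._1 values collected into a list per key (closer to the original pseudo code), one option, assuming the same RDD[(Int, (Int, Int))] shape, is a sketch like this:
// keep only the ._2._1 part of each value, then gather all of them under each key
val grouped: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] =
  rdd.mapValues(_._1).groupByKey()
println(grouped.collect.toList)
// e.g. List((1,CompactBuffer(1, 2)), (2,CompactBuffer(1)), (3,CompactBuffer(4)))
Note that groupByKey shuffles all values for a key to one place, which can be expensive on large data.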

Make tuples from strings in Scala

I have an RDD of Array[String] that looks like this:
mystring= ['thisisastring', 'thisisastring', 'thisisastring', 'thisisastring' ......]
I need to make each element, or each line into a Tuple, which combines a fixed number of items together so that they can be passed around as a whole.
So, essentially, it's like:
(1, 'thisisastring')
(2, 'thisisastring')
(3, 'thisisastring')
So I think I need to use Tuple2, which is Tuple2[Int, String]. Remind me if I'm wrong.
When I did val vertice = Tuple2(1, mystring), I realized that I was just attaching the Int 1 to every line.
So I need a loop iterating through my Array[String], to add 1, 2, and 3, to line 1, line 2 and line 3.
I thought about using while (count < 14900), but count is a val, so I can't update its value on each iteration.
Do you have a better way to do this?
It sounds like you are looking for zipWithIndex.
You don't specify the type you want the resulting RDD to be, but note that zipWithIndex pairs each element with its index (starting at 0), so the tuples come out as (String, Int).
This will give you RDD[(String, Int)]:
rdd.flatMap(_.zipWithIndex)
This will give you RDD[Array[(String, Int)]]:
rdd.map(_.zipWithIndex)
If you want (Int, String) with 1-based numbering as in the question, map over the result and swap: .map { case (s, i) => (i + 1, s) }.
How about using for / yield? Note that mystring is indexed from 0, so shift the index by one:
for (i <- 1 to count) yield (i, mystring(i - 1))
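If what you actually have is a flat RDD[String] (rather than a local Array), Spark's own RDD.zipWithIndex does the numbering in a distributed way. A minimal sketch, assuming a hypothetical lines RDD:
import org.apache.spark.rdd.RDD
val lines: RDD[String] = sc.parallelize(Seq("thisisastring", "thisisastring", "thisisastring"))
// RDD.zipWithIndex pairs each element with a 0-based Long index;
// swap and shift to get the (1, line) shape from the question
val vertices: RDD[(Long, String)] = lines.zipWithIndex().map { case (s, i) => (i + 1, s) }
// (1,thisisastring), (2,thisisastring), (3,thisisastring)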

Why does splitting strings give ArrayIndexOutOfBoundsException in Spark 1.1.0 (works fine in 1.4.0)?

I'm using Spark 1.1.0 and Scala 2.10.4.
I have an input as follows:
100,aviral,Delhi,200,desh
200,ashu,hyd,300,desh
While executing:
sc.textFile(inputFile).keyBy(line => line.split(',')(2))
Spark gives me an ArrayIndexOutOfBoundsException. Why?
Please note that the same code works fine in Spark 1.4.0. Can anyone explain the reason for different behaviour?
It works fine here in Spark 1.4.1 / spark-shell
Define an RDD with some data:
val rdd = sc.parallelize(Array("1,abc,2,xyz,3","4,qwerty,5,abc,4","9,fee,11,fie,13"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:21
Run it through .keyBy()
rdd.keyBy( line => (line.split(','))(2) ).collect()
res4: Array[(String, String)] = Array((2,1,abc,2,xyz,3), (5,4,qwerty,5,abc,4), (11,9,fee,11,fie,13))
Notice that it makes the key from the 3rd element after splitting, but the printing seems odd: at first it doesn't look correctly tupled, but that turns out to be a printing artifact caused by the strings not being quoted. We can test this by picking off the values and seeing if we get the lines back:
rdd.keyBy(line => line.split(',')(2) ).values.collect()
res12: Array[String] = Array(1,abc,2,xyz,3, 4,qwerty,5,abc,4, 9,fee,11,fie,13)
and this looks as expected. Note that there are only 3 elements in the array, and the commas here are within the element strings.
We can also use .map() to make pairs, like so:
rdd.map( line => (line.split(',')(2), line.split(',')) ).collect()
res7: Array[(String, Array[String])] = Array((2,Array(1, abc, 2, xyz, 3)), (5,Array(4, qwerty, 5, abc, 4)), (11,Array(9, fee, 11, fie, 13)))
which is printed as Tuples...
Or to avoid duplicating effort, maybe:
def splitter(s: String): (String, Array[String]) = {
  val parsed = s.split(',')
  (parsed(2), parsed)
}
rdd.map(splitter).collect()
rdd.map(splitter).collect()
res8: Array[(String, Array[String])] = Array((2,Array(1, abc, 2, xyz, 3)), (5,Array(4, qwerty, 5, abc, 4)), (11,Array(9, fee, 11, fie, 13)))
which is a bit easier to read. It also does a bit more of the parsing, since here the line has already been split into its separate values.
The problem is that you have a blank line after the 1st row; splitting it does not return an Array containing the necessary number of columns.
1,abc,2,xyz,3
<empty line - here lies the problem>
4,qwerty,5,abc,4
Remove the empty line.
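A quick illustration of why the blank line blows up: splitting an empty string yields a single-element array, so index 2 is out of bounds.
"".split(",")                  // Array("") — only one element
"".split(",")(2)               // throws ArrayIndexOutOfBoundsException
"1,abc,2,xyz,3".split(",")(2)  // "2"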
Another possibility is that one of the rows does not have enough columns. You can filter out all rows that do not have the required number of columns (be aware of possible data loss, though):
sc.textFile(inputFile)
  .map(_.split(","))
  .filter(_.length == EXPECTED_COLS_NUMBER)
  .keyBy(line => line(2))

What is the difference between partition and groupBy?

I am reading through Twitter's Scala School right now and was looking at the groupBy and partition methods for collections. And I am not exactly sure what the difference between the two methods is.
I did some testing on my own:
scala> List(1, 2, 3, 4, 5, 6).partition(_ % 2 == 0)
res8: (List[Int], List[Int]) = (List(2, 4, 6),List(1, 3, 5))
scala> List(1, 2, 3, 4, 5, 6).groupBy(_ % 2 == 0)
res9: scala.collection.immutable.Map[Boolean,List[Int]] = Map(false -> List(1, 3, 5), true -> List(2, 4, 6))
So does this mean that partition returns a list of two lists and groupBy returns a Map with boolean keys and list values? Both have the same "effect" of splitting a list into two different parts based on a condition. I am not sure why I would use one over the other. So, when would I use partition over groupBy and vice-versa?
groupBy is better suited for lists of more complex objects.
Say, you have a class:
case class Beer(name: String, cityOfBrewery: String)
and a List of beers:
val beers = List(Beer("Bitburger", "Bitburg"), Beer("Frueh", "Cologne") ...)
you can then group beers by cityOfBrewery:
val beersByCity = beers.groupBy(_.cityOfBrewery)
Now you can get yourself a list of all beers brewed in any city you have in your data:
beersByCity("Cologne") // List(Beer("Frueh", "Cologne"), ...)
Neat, isn't it?
And I am not exactly sure what the difference between the two methods is.
The difference is in their signature. partition expects a function A => Boolean while groupBy expects a function A => K.
It appears that in your case the function you apply with groupBy is A => Boolean too, but you won't always want that; sometimes you want to group by a function that doesn't return a Boolean.
For example, if you want to group a List of strings by their length, you need to do it with groupBy.
So, when would I use partition over groupBy and vice-versa?
Use groupBy if the image of the function you apply is not the Boolean set (i.e. f(x) for an input x yields something other than a Boolean). If the function does return a Boolean, you can use either; it's up to you whether you prefer a Map or a (List, List) as output.
partition is for when you need to split a collection in two based on yes/no logic (even/odd numbers, uppercase/lowercase letters, you name it). groupBy has a more general use: producing many groups based on some function. Say you want to split a corpus of words into bins by their first letter (resulting in 26 groups); that is simply not possible with partition. A sketch of that first-letter grouping follows below.
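A minimal sketch of the first-letter example, with a hypothetical words list:
val words = List("apple", "avocado", "banana", "cherry")
// one bin per distinct first letter; partition could only ever give two bins
val bins: Map[Char, List[String]] = words.groupBy(_.head)
// Map(a -> List(apple, avocado), b -> List(banana), c -> List(cherry)) — ordering may vary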

groupBy not behaving as expected

The code below is supposed to sum the values of a list of tuples, but when two or more tuples produce the same sum, the tuple is output only once:
var data = List((1, "1"), (1, "one"))  //> data : List[(Int, java.lang.String)] = List((1,1), (1,one))
data = data.groupBy(_._2).map {
  case (label, vals) => (vals.map(_._1).sum, label)
}.toList.sortBy(_._1).reverse
println(data)                          //> List((1,1))
The output of the above is List((1,1)) when I'm expecting List((1,1), (1,"one")).
Do the groupBy function's parameters need to be tweaked to fix this?
Actually, it does behave as expected. groupBy returns a map. When you map over a map, you construct a new map, where, of course, each key is unique. Here, you'd have the key 1 twice…
You should then call toList before calling map, not after.
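A minimal sketch of that fix, keeping the rest of the pipeline unchanged:
val data = List((1, "1"), (1, "one"))
val summed = data.groupBy(_._2).toList.map {
  case (label, vals) => (vals.map(_._1).sum, label)
}.sortBy(_._1).reverse
// List((1,one), (1,1)) — both tuples survive, since a List allows duplicates
// (ordering within equal sums may vary)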