make tuples scala string - scala

I have an RDD of Array[String] that looks like this:
mystring= ['thisisastring', 'thisisastring', 'thisisastring', 'thisisastring' ......]
I need to make each element, or each line into a Tuple, which combines a fixed number of items together so that they can be passed around as a whole.
So, essentially, it's like:
(1, 'thisisastring')
(2, 'thisisastring')
(3, 'thisisastring')
So I think I need to use Tuple2, which is Tuple2[Int, String]. Remind me if I'm wrong.
When I did this: val vertice = Tuple2(1, mystring), I realized that I was just adding the int 1 to every line.
So I need a loop iterating through my Array[String], to add 1, 2, and 3, to line 1, line 2 and line 3.
I thought about using while(count<14900).
But count is a val, so it is a fixed value; I can't update it on each iteration.
Do you have a better way to do this?

It sounds like you are looking for zipWithIndex.
You don't specify the type you want the resulting RDD to be, but note that zipWithIndex pairs each element with its index, element first, starting at 0.
This will give you RDD[(String, Int)]:
rdd.flatMap(_.zipWithIndex)
And this will give you RDD[Array[(String, Int)]]:
rdd.map(_.zipWithIndex)
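For example, a minimal sketch (assuming sc is an existing SparkContext and the RDD holds your Array[String] as a single element, as described in the question) that produces index-first tuples starting at 1:
val rdd = sc.parallelize(Seq(Array("thisisastring", "thisisastring", "thisisastring")))
val indexed = rdd.flatMap(_.zipWithIndex)           // RDD[(String, Int)], indices start at 0
                 .map { case (s, i) => (i + 1, s) } // swap to (Int, String) and start counting at 1
indexed.collect()  // Array((1,thisisastring), (2,thisisastring), (3,thisisastring))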

How about using for/yield? (Array indices start at 0, so offset the tuple's number by one.)
for (i <- 0 until count) yield Tuple2(i + 1, mystring(i))

Related

how to find all possible combinations between tuples without duplicates scala

suppose I have list of tuples:
val a = ListBuffer((1, 5), (6, 7))
Update: Elements are assumed to be distinct inside each of the Tuple2s; in other words, it can be for example (1,4) (1,5) but not (1,1) (2,2).
I want to generate all combinations of the elements of ListBuffer a across these two tuples, but without duplication. The result will look like:
ListBuffer((1,5,6), (1,5,7), (6,7,1), (6,7,5))
Update: elements in each result Tuple3 are also distinct, and the tuples themselves are distinct as sets: as long as (6,7,1) is present, (1,7,6) should not also appear in the result.
If, for example, val a = ListBuffer((1, 4), (1, 5)), then the result should be ListBuffer((1,4,5)), in which (1,4,1) and (1,5,1) are discarded.
How can I do that in Scala?
Note: this is just an example; usually val a has tens of Tuple2s.
If the individual elements are unique, as you've commented, then you should be able to flatten everything (un-tuple), get the desired combinations(), and re-tuple.
updated
val a = collection.mutable.ListBuffer((1, 4), (1, 5))
a.flatMap(t => Seq(t._1, t._2)) //un-tuple
.distinct //no duplicates
.combinations(3) //unique sets of 3
.map{case Seq(x,y,z) => (x,y,z)} //re-tuple
.toList //if you don't want an iterator
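For the first example in the question (val a = ListBuffer((1, 5), (6, 7))), the same pipeline yields the four expected combinations, each in source order:
val b = collection.mutable.ListBuffer((1, 5), (6, 7))
b.flatMap(t => Seq(t._1, t._2)).distinct.combinations(3).map { case Seq(x, y, z) => (x, y, z) }.toList
// List((1,5,6), (1,5,7), (1,6,7), (5,6,7)), i.e. the same sets as (1,5,6), (1,5,7), (6,7,1), (6,7,5)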

Why does splitting strings give ArrayIndexOutOfBoundsException in Spark 1.1.0 (works fine in 1.4.0)?

I'm using Spark 1.1.0 and Scala 2.10.4.
I have an input as follows:
100,aviral,Delhi,200,desh
200,ashu,hyd,300,desh
While executing:
sc.textFile(inputFile).keyBy(line => line.split(',')(2))
Spark gives me an ArrayIndexOutOfBoundsException. Why?
Please note that the same code works fine in Spark 1.4.0. Can anyone explain the reason for different behaviour?
It works fine here in Spark 1.4.1 / spark-shell
Define an rdd with some data
val rdd = sc.parallelize(Array("1,abc,2,xyz,3","4,qwerty,5,abc,4","9,fee,11,fie,13"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:21
Run it through .keyBy()
rdd.keyBy( line => (line.split(','))(2) ).collect()
res4: Array[(String, String)] = Array((2,1,abc,2,xyz,3), (5,4,qwerty,5,abc,4), (11,9,fee,11,fie,13))
Notice it makes the key from the 3rd element of the split. The printing seems odd: at first it doesn't look correctly tupled, but that turns out to be a printing artifact because the strings are shown without quotes. We can test this by picking off the values and checking that we get each line back:
rdd.keyBy(line => line.split(',')(2) ).values.collect()
res12: Array[String] = Array(1,abc,2,xyz,3, 4,qwerty,5,abc,4, 9,fee,11,fie,13)
and this looks as expected. Note that there are only 3 elements in the array, and the commas here are within the element strings.
We can also use .map() to make pairs, like so:
rdd.map( line => (line.split(',')(2), line.split(',')) ).collect()
res7: Array[(String, Array[String])] = Array((2,Array(1, abc, 2, xyz, 3)), (5,Array(4, qwerty, 5, abc, 4)), (11,Array(9, fee, 11, fie, 13)))
which is printed as Tuples...
Or to avoid duplicating effort, maybe:
def splitter(s:String):(String,Array[String]) = {
val parsed = s.split(',')
(parsed(2), parsed)
}
rdd.map(splitter).collect()
res8: Array[(String, Array[String])] = Array((2,Array(1, abc, 2, xyz, 3)), (5,Array(4, qwerty, 5, abc, 4)), (11,Array(9, fee, 11, fie, 13)))
which is a bit easier to read. It also does slightly more of the parsing, because here we have split each line into its separate values.
The problem is that you have a blank line after the 1st row - splitting it does not return an Array containing the necessary number of columns.
1,abc,2,xyz,3
<empty line - here lies the problem>
4,qwerty,5,abc,4
Remove the empty line.
Another possibility is that one of the rows does not have enough columns. You can filter out all rows that do not have the required number of columns (be aware of possible data loss, though).
sc.textFile(inputFile)
  .map(_.split(","))
  .filter(_.size == EXPECTED_COLS_NUMBER)
  .keyBy(cols => cols(2))
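Alternatively, here is a sketch that keeps the whole line as the value (the way keyBy does) while skipping blank or short rows in a single flatMap; EXPECTED_COLS_NUMBER and inputFile are the same placeholders as above:
sc.textFile(inputFile).flatMap { line =>
  val cols = line.split(",")
  if (cols.length > 2) Seq((cols(2), line)) else Seq.empty  // skip blank lines and rows with too few columns
}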

What is the difference between partition and groupBy?

I am reading through Twitter's Scala School right now and was looking at the groupBy and partition methods for collections. And I am not exactly sure what the difference between the two methods is.
I did some testing on my own:
scala> List(1, 2, 3, 4, 5, 6).partition(_ % 2 == 0)
res8: (List[Int], List[Int]) = (List(2, 4, 6),List(1, 3, 5))
scala> List(1, 2, 3, 4, 5, 6).groupBy(_ % 2 == 0)
res9: scala.collection.immutable.Map[Boolean,List[Int]] = Map(false -> List(1, 3, 5), true -> List(2, 4, 6))
So does this mean that partition returns a list of two lists and groupBy returns a Map with boolean keys and list values? Both have the same "effect" of splitting a list into two different parts based on a condition. I am not sure why I would use one over the other. So, when would I use partition over groupBy and vice-versa?
groupBy is better suited for lists of more complex objects.
Say, you have a class:
case class Beer(name: String, cityOfBrewery: String)
and a List of beers:
val beers = List(Beer("Bitburger", "Bitburg"), Beer("Frueh", "Cologne") ...)
you can then group beers by cityOfBrewery:
val beersByCity = beers.groupBy(_.cityOfBrewery)
Now you can get yourself a list of all beers brewed in any city you have in your data:
beersByCity("Cologne") // List(Beer("Frueh", "Cologne"), ...)
Neat, isn't it?
And I am not exactly sure what the difference between the two methods is.
The difference is in their signature. partition expects a function A => Boolean while groupBy expects a function A => K.
It appears that in your case the function you apply with groupBy is A => Boolean too, but you won't always want to do that; sometimes you want to group by a function that doesn't return a Boolean.
For example, if you want to group a List of strings by their length, you need to do it with groupBy.
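For instance, grouping by length produces one bucket per distinct length:
List("a", "bb", "cc", "ddd").groupBy(_.length)
// Map(1 -> List(a), 2 -> List(bb, cc), 3 -> List(ddd))  (key order may vary)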
So, when would I use partition over groupBy and vice-versa?
Use groupBy if the image of the function you apply is not in the Boolean set (i.e. f(x) for an input x yields something other than a Boolean). If that's not the case, then you can use either; it's up to you whether you prefer a Map or a (List, List) as output.
partition is for when you need to split a collection into two based on yes/no logic (even/odd numbers, uppercase/lowercase letters, you name it). groupBy has a more general use: producing many groups based on some function. Say you want to split a corpus of words into bins depending on their first letter (resulting in up to 26 groups); that is simply not possible with .partition.
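A quick sketch of that last point (words is just a made-up list here):
val words = List("apple", "avocado", "banana", "cherry")
words.groupBy(_.head)           // Map(a -> List(apple, avocado), b -> List(banana), c -> List(cherry)), one bucket per first letter
words.partition(_.head == 'a')  // only ever two buckets: (List(apple, avocado), List(banana, cherry))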

Converting string features to numeric features: algorithm efficiency

I'm converting several columns of strings to numeric features I can use in a LabeledPoint. I'm considering two approaches:
Create a mapping of strings to doubles, iterate through the RDD and lookup each string and assign the appropriate value.
Sort the RDD by the column, iterate through the RDD with a counter, assign each string to the current counter value until the string changes at which time the counter value is incremented and assigned. Since we never see a string twice (thanks to sorting) this will effectively assign a unique value to each string.
In the first approach we must collect the unique values for our map. I'm not sure how long this takes (linear time?). Then we iterate through the list of values and build up a HashMap - linear time and memory. Finally we iterate and lookup each value, N * eC (effective constant time).
In the second approach we sort (n log n time) and then iterate and keep track of a simple counter and a few variables.
What approach is recommended? There are memory, performance, and coding style considerations. The first feels like 2N + eC * N with N * (String, Double) memory and can be written in a functional style. The second is N log N + N but feels imperative. Will Spark need to transfer the static map around? I could see that being a deal breaker.
The second method unfortunately won't work: the reason is that you cannot read from a counter on the workers, you can only increment it. What is even worse, you don't really know when the value changes, because you have no state to remember the previous value. I guess you could use something like mapPartitions and a total-order partitioner; you would have to know that your partitions are processed in order and that the same key can't appear in more than one partition, but this feels really hacky (and I don't know if it would work).
I don't think it's possible to do this in one pass, but you can do it in two. For your first method you can use, for example, a Set accumulator: put all your values in it, number them on the driver, and use the resulting map in a second pass to replace them. The complexity would be 2N (assuming that the number of distinct values << N).
Edit:
import org.apache.spark.{Accumulator, AccumulatorParam}

implicit object SetAcc extends AccumulatorParam[Set[String]] {
  def zero(s: Set[String]) = Set()
  def addInPlace(s1: Set[String], s2: Set[String]) = s1 ++ s2
}

val rdd = sc.parallelize(
  List((1, "a"), (2, "a"), (3, "b"), (4, "a"), (5, "c"), (6, "b"))
)

val acc: Accumulator[Set[String]] = sc.accumulator(Set())
rdd.foreach(p => acc += Set(p._2))
val encoding = acc.value.zipWithIndex.toMap
val result = rdd.map(p => (p._1, encoding(p._2)))
If you feel this dictionary is too big, you can of course broadcast it. If you have too many features and values and you don't want to create so many big accumulators, you can instead just use a reduce function to process them all together and collect the result on the driver. Just my thoughts; I guess you have to try it and see what suits your use case best.
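A sketch of the broadcast variant, reusing encoding and rdd from the snippet above:
val bEncoding = sc.broadcast(encoding)                      // ship the map to the executors once
val encoded = rdd.map(p => (p._1, bEncoding.value(p._2)))   // look values up via the broadcast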
Edit:
In mllib there is a class meant for this purpose: HashingTF. It allows you to translate your data set in one pass. The drawback is that it uses hashing modulo a specified parameter to map objects to feature indices, which can lead to collisions if the parameter is too small.
import org.apache.spark.mllib.feature.HashingTF

val tf = new HashingTF(numFeatures = 10000)
val transformed = data.map(line => tf.transform(line.split("""\s+""")))
Of course you can do the same thing by hand without using the HashingTF class.
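A rough hand-rolled sketch of the same idea, reusing data from above (just an illustration of hashing tokens into a fixed number of buckets and counting them, not necessarily HashingTF's exact implementation):
val numFeatures = 10000
val byHand = data.map { line =>
  line.split("""\s+""")
    .map(w => math.abs(w.hashCode) % numFeatures)                    // hash each token into one of numFeatures buckets
    .groupBy(identity)
    .map { case (bucket, hits) => (bucket, hits.length.toDouble) }   // term frequency per bucket
}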

In Scala, how to get a slice of a list from nth element to the end of the list without knowing the length?

I'm looking for an elegant way to get a slice of a list from element n onwards without having to specify the length of the list. Let's say we have a multiline string which I split into lines and then want to get a list of all lines from line 3 onwards:
string.split("\n").slice(3,X) // But I don't know what X is...
What I'm really interested in here is whether there's a way to get hold of a reference to the list returned by the split call so that its length can be substituted for X at the time of the slice call, kind of like a fancy _ (in which case it would read as slice(3, _.length))? In Python one doesn't need to specify the last element of the slice.
Of course I could solve this by using a temp variable after the split, or creating a helper function with a nice syntax, but I'm just curious.
Just drop the first n elements you don't need:
List(1,2,3,4).drop(2)
res0: List[Int] = List(3, 4)
or in your case:
string.split("\n").drop(2) // use drop(3) if you meant to match slice(3, X) exactly
There is also a paired method, .take(n), that does the opposite thing; you can think of it as .slice(0, n).
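For instance:
List(1, 2, 3, 4).take(2)   // List(1, 2), same result as .slice(0, 2)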
In case you need both parts, use .splitAt:
val (left, right) = List(1,2,3,4).splitAt(2)
left: List[Int] = List(1, 2)
right: List[Int] = List(3, 4)
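Applied to the question (assuming string is the multiline string being split), splitAt gives you both the skipped lines and the remainder in one go:
val (firstThree, rest) = string.split("\n").splitAt(3)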
The right answer is takeRight(n):
"communism is sharing => resource saver".takeRight(3)
//> res0: String = ver
You can use Scala's List method takeRight. This will not throw an exception when the List's length is not enough. Like this:
val t = List(1,2,3,4,5);
t.takeRight(3);
res1: List[Int] = List(3,4,5)
If the list is shorter than the number of elements you want to take, this will not throw an exception:
val t = List(4,5);
t.takeRight(3);
res1: List[Int] = List(4,5)
Get the last 2 elements (note that they come back in reverse order):
List(1,2,3,4,5).reverseIterator.take(2).toList // List(5, 4)