Why does splitting strings give ArrayIndexOutOfBoundsException in Spark 1.1.0 (works fine in 1.4.0)? - scala

I'm using Spark 1.1.0 and Scala 2.10.4.
I have an input as follows:
100,aviral,Delhi,200,desh
200,ashu,hyd,300,desh
While executing:
sc.textFile(inputFile).keyBy(line => line.split(',')(2))
Spark gives me an ArrayIndexOutOfBoundsException. Why?
Please note that the same code works fine in Spark 1.4.0. Can anyone explain the reason for the different behaviour?

It works fine here in Spark 1.4.1 / spark-shell
Define an rdd with some data
val rdd = sc.parallelize(Array("1,abc,2,xyz,3","4,qwerty,5,abc,4","9,fee,11,fie,13"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:21
Run it through .keyBy()
rdd.keyBy( line => (line.split(','))(2) ).collect()
res4: Array[(String, String)] = Array((2,1,abc,2,xyz,3), (5,4,qwerty,5,abc,4), (11,9,fee,11,fie,13))
Notice it makes the key from the 3rd element (index 2) after splitting, but the printed output looks odd. At first it doesn't appear to be correctly tupled, but that turns out to be a printing artifact: the strings are shown without quotes. We can test this by picking off the values and checking whether we get the lines back:
rdd.keyBy(line => line.split(',')(2) ).values.collect()
res12: Array[String] = Array(1,abc,2,xyz,3, 4,qwerty,5,abc,4, 9,fee,11,fie,13)
and this looks as expected. Note that there are only 3 elements in the array, and the commas here are within the element strings.
We can also use .map() to make pairs, like so:
rdd.map( line => (line.split(',')(2), line.split(',')) ).collect()
res7: Array[(String, Array[String])] = Array((2,Array(1, abc, 2, xyz, 3)), (5,Array(4, qwerty, 5, abc, 4)), (11,Array(9, fee, 11, fie, 13)))
which is printed as Tuples...
Or to avoid duplicating effort, maybe:
def splitter(s: String): (String, Array[String]) = {
  val parsed = s.split(',')
  (parsed(2), parsed)
}
rdd.map(splitter).collect()
res8: Array[(String, Array[String])] = Array((2,Array(1, abc, 2, xyz, 3)), (5,Array(4, qwerty, 5, abc, 4)), (11,Array(9, fee, 11, fie, 13)))
which is a bit easier to read. It also does a bit more of the parsing for you, since the line has already been split into its separate values.

The problem is that you have a blank line after the first row; splitting it does not return an array with the necessary number of columns.
1,abc,2,xyz,3
<empty line - here lies the problem>
4,qwerty,5,abc,4
Remove the empty line.
Another possibility is that one of the rows does not have enough columns. You can filter out all rows that do not have the required number of columns (be aware of possible data loss, though).
sc.textFile(inputFile)
  .map(_.split(","))
  .filter(_.size == EXPECTED_COLS_NUMBER)
  .keyBy(arr => arr(2))
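If you'd rather inspect the malformed rows than drop them silently, here's a minimal sketch (variable names are illustrative) that keys on an Option instead of indexing directly:
val keyed = sc.textFile(inputFile)
  .map(_.split(","))
  .keyBy(arr => arr.lift(2))  // Some(thirdColumn) if present, None for short or blank rows
val malformed = keyed.filter(_._1.isEmpty)  // the rows that would have thrown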

Related

How to find all possible combinations between tuples without duplicates in Scala

suppose I have list of tuples:
val a = ListBuffer((1, 5), (6, 7))
Update: elements in a are assumed to be distinct within each Tuple2; in other words, it can be, for example, (1,4) (1,5) but not (1,1) (2,2).
I want to generate all combinations of the elements of ListBuffer a across these two tuples, but without duplication. The result will look like:
ListBuffer[(1,5,6), (1,5,7), (6,7,1), (6,7,5)]
Update: elements within each result Tuple3 are also distinct, and the tuples themselves are distinct as sets: as long as (6,7,1) is present, (1,7,6) should not be in the result.
If, for example, val a = ListBuffer((1, 4), (1, 5)), then the output should be ListBuffer[(1,4,5)], with (1,4,1) and (1,5,1) discarded.
How can I do that in Scala?
Note: this is just an example; usually val a contains tens of Tuple2s.
If the individual elements are unique, as you've commented, then you should be able to flatten everything (un-tuple), get the desired combinations(), and re-tuple.
updated
val a = collection.mutable.ListBuffer((1, 4), (1, 5))
a.flatMap(t => Seq(t._1, t._2)) //un-tuple
.distinct //no duplicates
.combinations(3) //unique sets of 3
.map{case Seq(x,y,z) => (x,y,z)} //re-tuple
.toList //if you don't want an iterator
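Applying the same pipeline to the original input, as a quick sketch (combinations emits elements in first-seen order, so these are the same sets the question asks for, just ordered differently):
val b = collection.mutable.ListBuffer((1, 5), (6, 7))
b.flatMap(t => Seq(t._1, t._2))
  .distinct
  .combinations(3)
  .map{case Seq(x,y,z) => (x,y,z)}
  .toList
// List((1,5,6), (1,5,7), (1,6,7), (5,6,7))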

How do I split a Spark rdd Array[(String, Array[String])]?

I'm practicing doing sorts in the Spark shell. I have an rdd with about 10 columns/variables. I want to sort the whole rdd on the values of column 7.
rdd
org.apache.spark.rdd.RDD[Array[String]] = ...
From what I gather, the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped the rdd so I'd have pairs of column 7 (the string values) and the full original row (an array of strings):
rdd2 = rdd.map(c => (c(7),c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
I then apply sortByKey, still no problem...
rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:
val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])
What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?
What you did with rdd2 = rdd.map(c => (c(7),c)) is map it to a tuple.
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]
exactly as it says :).
Now if you want to split the record, you need to get it from this tuple.
You can map again, taking only the second part of the tuple (which is the Array[String]), like so: rdd3.map(_._2)
But I would strongly suggest trying rdd.sortBy(_(7)) or something of that sort. That way you do not need to bother yourself with tuples at all.
If you want to sort the rdd using the 7th string in the array, you can do it directly with
rdd.sortBy(_(6)) // array starts at 0 not 1
or
rdd.sortBy(arr => arr(6))
That will save you all the hassle of doing multiple transformations. The reason why rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is because that's not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).
To test this, I did the following:
val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))
// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))
Here's the result of sorted_rdd.collect(), sorted by the 3rd string as expected:
Array(Array("asg", "qtw", "hasd"), Array("csg", "dip", "hwd"), Array("ard", "bas", "wer"))
just do this:
val rdd4 = rdd3.map(_._2)
It sounds like you're not yet familiar with Scala, so the following should help you understand more (note foreach rather than map, so the side-effecting printlns actually run):
rdd3.foreach(kv => {
  println(kv._1) // this is the String key
  println(kv._2) // this is the Array[String] value
})
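Putting the pieces together for the original goal of sorting, stripping the keys, and saving, a minimal sketch (the output path is made up):
val sorted = rdd.map(c => (c(7), c))  // key by the column to sort on
  .sortByKey()
  .values                             // back to RDD[Array[String]]
sorted.map(_.mkString(",")).saveAsTextFile("/tmp/sorted-output")  // hypothetical path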

How to generate key-value format using Scala in Spark

I am studying Spark on VirtualBox. I use ./bin/spark-shell to open Spark and use Scala. Now I am confused about the key-value format in Scala.
I have a txt file in home/feng/spark/data, which looks like:
panda 0
pink 3
pirate 3
panda 1
pink 4
I use sc.textFile to get this txt file. If I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7")
Then I can use rdd.collect() to show rdd on the screen:
scala> rdd.collect()
res26: Array[String] = Array(panda 0, pink 3, pirate 3, panda 1, pink 4)
However, if I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7.txt")
where the file doesn't actually have a ".txt" extension. Then when I use rdd.collect(), I get an error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/feng/spark/A.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
......
But I saw other examples. All of them have ".txt" at the end. Is there something wrong with my code or my system?
Another thing is when I tried to do:
scala> val rddd = rdd.map(x => (x.split(" ")(0),x))
rddd: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:29
scala> rddd.collect()
res0: Array[(String, String)] = Array((panda,panda 0), (pink,pink 3), (pirate,pirate 3), (panda,panda 1), (pink,pink 4))
I intended to select the first column of the data and use it as the key. But rddd.collect() doesn't look right, as each word occurs twice. I cannot keep doing the rest of the operations like mapByKey, reduceByKey, or others. Where did I go wrong?
Just as an example, I create a String with your dataset; after this I split it by line and use SparkContext's parallelize method to create an RDD. Notice that after I create the RDD, I use its map method to split the String stored in each record and convert it to a Row.
import org.apache.spark.sql.Row
val text = "panda 0\npink 3\npirate 3\npanda 1\npink 4"
val rdd = sc.parallelize(text.split("\n")).map(x => Row(x.split(" "):_*))
rdd.take(3)
The output from the take method is:
res4: Array[org.apache.spark.sql.Row] = Array([panda,0], [pink,3], [pirate,3])
About your first question: there is no need for files to have any extension, because in this case the file is read as plain text.
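On the second question: the pairs you built are actually fine; the key shows up twice only because you kept the whole line as the value. If what you want are pairs you can reduce, split the value out as well. A sketch, assuming the two-field "word number" layout:
val pairs = rdd.map { line =>
  val fields = line.split(" ")
  (fields(0), fields(1).toInt)
}
pairs.reduceByKey(_ + _).collect()
// e.g. Array((panda,1), (pink,7), (pirate,3)) -- order may vary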

make tuples scala string

I have an RDD of Array[String] that looks like this:
mystring= ['thisisastring', 'thisisastring', 'thisisastring', 'thisisastring' ......]
I need to make each element (each line) into a tuple, which combines a fixed number of items so that they can be passed around as a whole.
So, essentially, it's like:
(1, 'thisisastring')
(2, 'thisisastring')
(3, 'thisisastring')
So I think I need to use Tuple2, which is Tuple2[Int, String]. Correct me if I'm wrong.
When I did this: val vertice = Tuple2(1, mystring)
I realized that I was just pairing the int 1 with every line.
So I need a loop iterating through my Array[String] to add 1, 2, and 3 to line 1, line 2, and line 3.
I thought about using while (count < 14900),
but count is a val, a fixed number; I can't update its value each time.
Do you have a better way to do this?
It sounds like you are looking for zipWithIndex.
You don't specify the type you want the resulting RDD to be, but note that zipWithIndex puts the element first and the index second.
This will give you RDD[(String, Int)]:
rdd.flatMap(_.zipWithIndex)
This will give you RDD[Array[(String, Int)]]:
rdd.map(_.zipWithIndex)
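If mystring is actually an RDD[String] with one line per record, the RDD's own zipWithIndex is probably closer to what the question wants (a sketch; note the index is a Long starting at 0):
val numbered = rdd.zipWithIndex().map { case (line, i) => (i + 1, line) }
// RDD[(Long, String)]: (1,thisisastring), (2,thisisastring), ...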
How about using for & yield:
for (i <- 0 until count) yield (i + 1, mystring(i)) // arrays are 0-indexed; numbering starts at 1
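For a plain Array[String], the collection's own zipWithIndex avoids the index arithmetic entirely (a sketch, numbering from 1):
mystring.zipWithIndex.map { case (s, i) => (i + 1, s) }
// Array((1,thisisastring), (2,thisisastring), ...)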

Spark processing columns in parallel

I've been playing with Spark, and I managed to get it to crunch my data. My data is a flat, tab-delimited text file with 50 columns and about 20 million rows. I have Scala scripts that will process each column.
In terms of parallel processing, I know that RDD operations run on multiple nodes. So every time I process a column, its rows are processed in parallel, but the columns themselves are processed one after another.
A simple example: suppose my data is a 5-column tab-delimited file, each column contains text, and I want to do a word count for each column. I would do:
for (i <- 0 until 5) {
  data.map(_.split("\t", -1)(i)).map((_, 1)).reduceByKey(_ + _)
}
Although each column's operation runs in parallel, the columns themselves are processed sequentially (bad wording, I know. Sorry!). In other words, column 2 is processed after column 1 is done, column 3 after columns 1 and 2, and so on.
My question is: is there any way to process multiple columns at a time? If you know a way, or a tutorial, would you mind sharing it with me?
thank you!!
Suppose the inputs are Seqs. The following can be done to process the columns concurrently. The basic idea is to use the pair (columnIndex, value) as the key.
scala> val rdd = sc.parallelize((1 to 4).map(x=>Seq("x_0", "x_1", "x_2", "x_3")))
rdd: org.apache.spark.rdd.RDD[Seq[String]] = ParallelCollectionRDD[26] at parallelize at <console>:12
scala> val rdd1 = rdd.flatMap{x=>{(0 to x.size - 1).map(idx=>(idx, x(idx)))}}
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = FlatMappedRDD[27] at flatMap at <console>:14
scala> val rdd2 = rdd1.map(x=>(x, 1))
rdd2: org.apache.spark.rdd.RDD[((Int, String), Int)] = MappedRDD[28] at map at <console>:16
scala> val rdd3 = rdd2.reduceByKey(_+_)
rdd3: org.apache.spark.rdd.RDD[((Int, String), Int)] = ShuffledRDD[29] at reduceByKey at <console>:18
scala> rdd3.take(4)
res22: Array[((Int, String), Int)] = Array(((0,x_0),4), ((3,x_3),4), ((2,x_2),4), ((1,x_1),4))
In the example output, ((0,x_0),4) means the first column with key x_0, and its count is 4. You can start from here to process further.
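From here, one convenient next step is to regroup so each column's word counts come back together; a sketch:
val perColumn = rdd3.map { case ((col, word), count) => (col, (word, count)) }
  .groupByKey()
// RDD[(Int, Iterable[(String, Int)])] -- all word counts per column index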
You can try the following code, which uses the Scala parallel collections feature:
(0 until 5).map(index => (index, data)).par.map(x => {
  x._2.map(_.split("\t", -1)(x._1)).map((_, 1)).reduceByKey(_ + _)
})
data is just a reference, so duplicating the reference does not cost much. And an RDD is read-only, so processing it in parallel is safe. The par method uses Scala's parallel collections feature. You can check the parallel jobs on the Spark web UI.
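One caveat: the snippet above only builds transformations, so nothing executes until an action runs. A sketch that fires one job per column concurrently and gathers the counts on the driver (assumes 5 columns; suitable for small results only, since collect brings them to the driver):
val counts = (0 until 5).par.map { i =>
  data.map(_.split("\t", -1)(i))
    .map((_, 1))
    .reduceByKey(_ + _)
    .collect()  // the action; jobs are submitted concurrently from the parallel collection
}.seq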