How to generate key-value format using Scala in Spark - scala

I am studying Spark on VirtualBox. I start Spark with ./bin/spark-shell and use Scala. I am confused about how to produce key-value pairs in Scala.
I have a text file in /home/feng/spark/data, which looks like:
panda 0
pink 3
pirate 3
panda 1
pink 4
I use sc.textFile to get this txt file. If I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7")
Then I can use rdd.collect() to show rdd on the screen:
scala> rdd.collect()
res26: Array[String] = Array(panda 0, pink 3, pirate 3, panda 1, pink 4)
However, if I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7.txt")
that is, with ".txt" appended although the file name has none, then rdd.collect() fails with an error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/feng/spark/A.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
......
But I saw other examples. All of them have ".txt" at the end. Is there something wrong with my code or my system?
Another thing is when I tried to do:
scala> val rddd = rdd.map(x => (x.split(" ")(0),x))
rddd: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:29
scala> rddd.collect()
res0: Array[(String, String)] = Array((panda,panda 0), (pink,pink 3), (pirate,pirate 3), (panda,panda 1), (pink,pink 4))
I intended to select the first column of the data and use it as the key. But the output of rddd.collect() does not look right, because each word occurs twice in its pair. I cannot go on to operations such as reduceByKey. What did I do wrong?

As an example, I create a String holding your dataset, split it into lines, and use SparkContext's parallelize method to create an RDD. Note that after creating the RDD I use its map method to split the String stored in each record and convert it to a Row.
import org.apache.spark.sql.Row
val text = "panda 0\npink 3\npirate 3\npanda 1\npink 4"
val rdd = sc.parallelize(text.split("\n")).map(x => Row(x.split(" "):_*))
rdd.take(3)
The output from the take method is:
res4: Array[org.apache.spark.sql.Row] = Array([panda,0], [pink,3], [pirate,3])
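Regarding your first question: files do not need any extension, because sc.textFile treats them as plain text either way. The path you pass must match the actual file name, so if the file is named rdd4.7, appending ".txt" makes the lookup fail.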
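The pairs in the question repeat the word because the map keeps the whole line x as the value. A minimal sketch of one way to get (word, number) pairs instead is shown below. The helper name toPair is hypothetical, and it assumes every line has exactly two whitespace-separated fields with an integer in the second. Plain Scala collections stand in for the RDD so the snippet runs without Spark; the Spark form is given in the comments.

```scala
// Hypothetical helper: turn a line such as "panda 0" into ("panda", 0).
// Assumes two whitespace-separated fields, the second an integer.
def toPair(line: String): (String, Int) = {
  val fields = line.trim.split("\\s+")
  (fields(0), fields(1).toInt)
}

// The sample data from the question, as a plain collection.
val lines = Seq("panda 0", "pink 3", "pirate 3", "panda 1", "pink 4")
val pairs = lines.map(toPair)

// Plain-collection stand-in for reduceByKey(_ + _): sum the values per key.
val sums = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// In spark-shell the same helper can be applied to the RDD:
//   val kv = sc.textFile("/home/feng/spark/data/rdd4.7").map(toPair)
//   kv.reduceByKey(_ + _).collect()
```

Converting the second field to Int is a design choice for this sketch; keep it as a String if the values are not always numeric.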

Related

Why does Spark increment the RDD ID by 2 instead of 1 when reading in text files?

I noticed something interesting when working with the spark-shell and I'm curious as to why this is happening. I load a text file into Spark using the basic syntax, and then I just simply repeat this command. The output of the REPL is below:
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[3] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[5] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[7] at textFile at <console>:24
I know that the MapPartitionsRDD[X] portion features X as the RDD identifier. However, based upon this SO post on RDD identifiers, I'd expect that the identifier integer increments by one each time a new RDD is created. So why exactly is it incrementing by 2?
My guess is that loading a text file creates an intermediate RDD? Because clearly creating an RDD from parallelize() only increments the RDD counter by 1 (before it was 7):
scala> val arrayrdd = sc.parallelize(Array(3,4,5))
arrayrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
Note: I don't believe the number has anything to do with partitions. If I call partitions.size, I see that my RDD is split into 9 partitions:
scala> myreviews.partitions.size
res2: Int = 9
Because a single method call can create more than one intermediate RDD. This becomes obvious if you check the debug string:
sc.textFile("README.md").toDebugString
String =
(2) README.md MapPartitionsRDD[1] at textFile at <console>:25 []
| README.md HadoopRDD[0] at textFile at <console>:25 []
As you can see, the lineage consists of two RDDs.
The first one is a HadoopRDD, which corresponds to the data import:
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions)
The second one is MapPartitionsRDD and corresponds to the subsequent map which drops keys (offsets) and converts Text to String.
.map(pair => pair._2.toString).setName(path)

Spark scala filter multiple rdd based on string length

I am trying to solve a quiz question, which reads as follows:
Write the missing code in the given program to display the expected output to identify animals that have names with four letters.
Output: Array((4,lion))
Program
val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant","falcon","squid"),2)
val d = c.keyBy(_.length)
I have tried to write the code in the Spark shell but got stuck on the syntax for combining the four RDDs and applying a filter.
How about using the PairRDD lookup method:
b.lookup(4).toArray
// res1: Array[String] = Array(lion)
d.lookup(4).toArray
// res2: Array[String] = Array()
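Note that lookup returns only the values, whereas the expected output Array((4,lion)) contains the key as well. A hedged sketch of an alternative is to combine the two keyed collections and filter on the key. Plain Scala lists stand in for the RDDs here so the snippet runs without Spark, and the name fourLetter is made up for illustration.

```scala
// Same data as the quiz, as plain lists instead of RDDs.
val a = List("dog", "tiger", "lion", "cat", "spider", "eagle")
val c = List("ant", "falcon", "squid")

// keyBy(_.length) on an RDD yields (length, word) pairs; map does the same here.
val b = a.map(w => (w.length, w))
val d = c.map(w => (w.length, w))

// Combine both pair collections and keep only the four-letter names.
val fourLetter = (b ++ d).filter { case (len, _) => len == 4 }

// With the RDDs b and d from the question, the Spark form would be:
//   (b union d).filter(_._1 == 4).collect()
```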

How do I split a Spark rdd Array[(String, Array[String])]?

I'm practicing on doing sorts in the Spark shell. I have an rdd with about 10 columns/variables. I want to sort the whole rdd on the values of column 7.
rdd
org.apache.spark.rdd.RDD[Array[String]] = ...
From what I gather the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped it so I'd have a pair consisting of column7 (string values) and the full original rdd (array of strings)
rdd2 = rdd.map(c => (c(7),c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
I then apply sortByKey, still no problem...
rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:
val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])
What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?
What you did with rdd2 = rdd.map(c => (c(7),c)) is map each record to a tuple:
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]
exactly as the type says.
To split a record, you first need to take it out of this tuple.
You can map again, keeping only the second part of the tuple (the Array[String]), like so: rdd3.map(_._2)
However, I would strongly suggest trying rdd.sortBy(_(7)) or something similar, so that you do not need to deal with tuples at all.
if you want to sort the rdd using the 7th string in the array, you can just do it directly by
rdd.sortBy(_(6)) // array starts at 0 not 1
or
rdd.sortBy(arr => arr(6))
That will save you all the hassle of doing multiple transformations. The reason why rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is because that's not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).
To test this, I did the following:
val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))
// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))
Here's the result of sorted_rdd.collect(), ordered by the third string ("hasd" < "hwd" < "wer"):
Array(Array(asg, qtw, hasd), Array(csg, dip, hwd), Array(ard, bas, wer))
just do this:
val rdd4 = rdd3.map(_._2)
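In case you are not familiar with Scala, the following should help you understand the tuple:
// foreach is an action, so the printlns actually run (map alone is lazy)
rdd3.foreach(kv => {
println(kv._1) // the key: a String
println(kv._2.mkString(",")) // the value: an Array[String], joined for printing
})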

Why does splitting strings give ArrayIndexOutOfBoundsException in Spark 1.1.0 (works fine in 1.4.0)?

I'm using Spark 1.1.0 and Scala 2.10.4.
I have an input as follows:
100,aviral,Delhi,200,desh
200,ashu,hyd,300,desh
While executing:
sc.textFile(inputFile).keyBy(line => line.split(',')(2))
Spark throws an ArrayIndexOutOfBoundsException. Why?
Please note that the same code works fine in Spark 1.4.0. Can anyone explain the reason for different behaviour?
It works fine here in Spark 1.4.1 / spark-shell
Define an rdd with some data
val rdd = sc.parallelize(Array("1,abc,2,xyz,3","4,qwerty,5,abc,4","9,fee,11,fie,13"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:21
Run it through .keyBy()
rdd.keyBy( line => (line.split(','))(2) ).collect()
res4: Array[(String, String)] = Array((2,1,abc,2,xyz,3), (5,4,qwerty,5,abc,4), (11,9,fee,11,fie,13))
Notice that it makes the key from the 3rd element after splitting, but the printed output seems odd. At first it doesn't look correctly tupled, but this turns out to be a printing artifact: the strings are printed without quotes. We can test this by picking off the values and checking that we get the lines back:
rdd.keyBy(line => line.split(',')(2) ).values.collect()
res12: Array[String] = Array(1,abc,2,xyz,3, 4,qwerty,5,abc,4, 9,fee,11,fie,13)
and this looks as expected. Note that there are only 3 elements in the array, and the commas here are within the element strings.
We can also use .map() to make pairs, like so:
rdd.map( line => (line.split(',')(2), line.split(',')) ).collect()
res7: Array[(String, Array[String])] = Array((2,Array(1, abc, 2, xyz, 3)), (5,Array(4, qwerty, 5, abc, 4)), (11,Array(9, fee, 11, fie, 13)))
which is printed as Tuples...
Or to avoid duplicating effort, maybe:
def splitter(s:String):(String,Array[String]) = {
val parsed = s.split(',')
(parsed(2), parsed)
}
rdd.map(splitter).collect()
res8: Array[(String, Array[String])] = Array((2,Array(1, abc, 2, xyz, 3)), (5,Array(4, qwerty, 5, abc, 4)), (11,Array(9, fee, 11, fie, 13)))
which is a bit easier to read. It is also slightly more parsed, because here we have split the line into its separate values.
The problem is you have a blank line after 1st row - splitting it does not return an Array containing necessary number of columns.
1,abc,2,xyz,3
<empty line - here lies the problem>
4,qwerty,5,abc,4
Remove the empty line.
Another possibility is that one of the rows does not have enough columns. You can filter out all rows that do not have the required number of columns (be aware of possible data loss, though).
// EXPECTED_COLS_NUMBER is a placeholder for the column count you expect
sc.textFile(inputFile)
.map(_.split(","))
.filter(_.size == EXPECTED_COLS_NUMBER)
.keyBy(fields => fields(2))
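Both explanations above come down to a line that splits into fewer than three fields. As a sketch of a defensive variant, the hypothetical helper keyByThird below returns an Option, so short or empty lines are dropped instead of throwing. A plain Seq stands in for the RDD, and the blank line and the "1,2" row are made-up test data.

```scala
// Hypothetical helper: Some((key, line)) when the line has a third column, else None.
def keyByThird(line: String): Option[(String, String)] = {
  val fields = line.split(',')
  if (fields.length > 2) Some((fields(2), line)) else None
}

// The two rows from the question plus a blank line and a short row.
val input = Seq("100,aviral,Delhi,200,desh", "", "200,ashu,hyd,300,desh", "1,2")

// flatMap discards the None results, keeping only well-formed rows.
val keyed = input.flatMap(keyByThird)

// In spark-shell the same helper could be used as:
//   sc.textFile(inputFile).flatMap(keyByThird)
```

As with the filter above, rows that do not parse are dropped without notice, so count them separately if data loss matters.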

Spark DataFrame zipWithIndex

I am using a DataFrame to read in .parquet files, but then turning them into an RDD to do my normal processing on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a dataframe to RDD:
:26: error: value zipWithIndex is not a member of
org.apache.spark.sql.Row
Does anyone know how to do this? Essentially, I am trying to get each value together with its column index.
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but getting stuck on the last part as not sure how to do the same of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
One important thing to note is that extracting type information becomes tricky: Row.toSeq returns Seq[Any], and the resulting RDD is RDD[(Any, Int)].
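To illustrate the typing issue, here is a small sketch in which a plain Seq[Any] stands in for the result of Row.toSeq. The values are made up, and pattern matching is just one possible way to recover types.

```scala
// Stand-in for Row.toSeq, which returns Seq[Any]; the values are made up.
val rowValues: Seq[Any] = Seq("alice", 42, 3.5)

// Pair each value with its column index, as r.toSeq.zipWithIndex does.
val indexed: Seq[(Any, Int)] = rowValues.zipWithIndex

// Recover some type information by pattern matching on each value.
val described: Seq[String] = indexed.map {
  case (s: String, i) => s"col $i: String $s"
  case (n: Int, i)    => s"col $i: Int $n"
  case (other, i)     => s"col $i: other $other"
}
```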