load hdfs file into spark context - scala

I am new to Spark/Scala and need to load a file from HDFS into Spark. I have a file in HDFS (/newhdfs/abc.txt), and I can see its contents with hdfs dfs -cat /newhdfs/abc.txt
I ran the following, in order, to load the file into the Spark context:
spark-shell # this opens the Scala REPL
scala> import org.apache.spark._; //Line 1
scala> val conf=new SparkConf().setMaster("local[*]");
scala> val sc = new SparkContext(conf);
scala> val input=sc.textFile("hdfs:///newhdfs/abc.txt"); //Line 4
Once I hit enter on line 4, I get the message below.
input: org.apache.spark.rdd.RDD[String] = hdfs:///newhdfs/abc.txt MapPartitionsRDD[19] at textFile at <console>:27
Is this a fatal error? What do I need to do to solve this?
(Using Spark-2.0.0 and Hadoop 2.7.0)

This is not an error; it just shows the name of the file backing your RDD.
In the Basic docs, there is this example:
scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25
which demonstrates the very same behavior.
How would you expect an error to appear before an action triggers any actual work? Spark is lazy, so nothing has been read yet.
If you want to check that everything is OK, call count on your input RDD: count is an action, so it triggers the actual read of the file and returns the number of elements in the RDD.
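For example, a minimal check in the shell (a sketch, reusing the input RDD defined above) could be:
scala> val lineCount = input.count()   // action: forces the actual HDFS read, so path or permission problems surface here
scala> input.take(3).foreach(println)  // peek at the first few lines to confirm the contents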

Related

Spark Shell allowing to redeclare the same immutable variable [duplicate]

This question already has answers here:
Why is it possible to declare variable with same name in the REPL?
I am working with the Spark shell for Scala and found a strange behaviour in the Spark-shell REPL that is not there if I use an IDE.
I can declare the same immutable variable again and again in the REPL, but the same is not allowed in an IDE.
Here is the code in REPL:
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[5] at textFile at <console>:24
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[7] at textFile at <console>:24
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[9] at textFile at <console>:24
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[11] at textFile at <console>:24
When I try the same thing in the Eclipse IDE, it shows a compile-time error.
Am I missing some configuration for the Spark-shell REPL?
Or is this the expected behaviour?
In your REPL, your code is actually translated as follows:
object SomeName {
  val rdd = sc.textFile("README.md")
}

object Some_Other_Name {
  val rdd = sc.textFile("README.md")
}
Since both rdd vals are defined in separate singletons, there is no name collision between them, and because this happens behind the scenes in the REPL, it feels as if you are reassigning the same val.
In an IDE, all your code is written inside classes or Scala singletons (objects), so redeclaring a val in the same scope (inside the respective class/object) gives you a compile-time error.
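As a minimal sketch of the IDE case (using plain Ints instead of RDDs, purely for illustration), the redeclaration fails because both vals live in the same scope:
object Example {
  val x = 1
  val x = 2   // compile error: x is already defined as value x
}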

Why does Spark increment the RDD ID by 2 instead of 1 when reading in text files?

I noticed something interesting when working with the spark-shell, and I'm curious why it happens. I load a text file into Spark using the basic syntax and then simply repeat the command. The output of the REPL is below:
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[3] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[5] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[7] at textFile at <console>:24
I know that the MapPartitionsRDD[X] portion features X as the RDD identifier. However, based upon this SO post on RDD identifiers, I'd expect that the identifier integer increments by one each time a new RDD is created. So why exactly is it incrementing by 2?
My guess is that loading a text file creates an intermediate RDD, because creating an RDD from parallelize() clearly increments the RDD counter by only 1 (before, it was 7):
scala> val arrayrdd = sc.parallelize(Array(3,4,5))
arrayrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
Note: I don't believe the number has anything to do with partitions. If I call myreviews.partitions.size, I see that my RDD is split into 9 partitions:
scala> myreviews.partitions.size
res2: Int = 9
Because a single method call can create more than one RDD behind the scenes. This becomes obvious if you check the debug string:
sc.textFile("README.md").toDebugString
String =
(2) README.md MapPartitionsRDD[1] at textFile at <console>:25 []
| README.md HadoopRDD[0] at textFile at <console>:25 []
As you can see, the lineage consists of two RDDs.
The first one is a HadoopRDD, which corresponds to the data import:
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions)
The second one is a MapPartitionsRDD, which corresponds to the subsequent map that drops the keys (the byte offsets) and converts each Text to a String:
.map(pair => pair._2.toString).setName(path)
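If you want to verify this yourself, every RDD exposes its id directly (a small sketch; README.md is just an example path, and the exact numbers depend on what the shell has already created):
scala> val a = sc.textFile("README.md")
scala> a.id          // e.g. 1: the HadoopRDD took id 0, the MapPartitionsRDD took id 1
scala> val b = sc.parallelize(Seq(1, 2, 3))
scala> b.id          // e.g. 2: parallelize creates a single RDD, so the id advances by only 1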

unable to create dataframe from sequence file in Spark created by Sqoop

I want to read the orders data, which is stored as a sequence file in Hadoop FS on the Cloudera VM, and create an RDD out of it. Below are my steps:
1) Importing orders data as sequence file:
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba --password cloudera --table orders -m 1 --target-dir /ordersDataSet --as-sequencefile
2) Reading file in spark scala:
Spark 1.6
val sequenceData=sc.sequenceFile("/ordersDataSet",classOf[org.apache.hadoop.io.Text],classOf[org.apache.hadoop.io.Text]).map(rec => rec.toString())
3) When I try to read data from above RDD it throws below error:
Caused by: java.io.IOException: WritableName can't load class: orders
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2108)
... 17 more
Caused by: java.lang.ClassNotFoundException: Class orders not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
... 18 more
I don't know why it says that it can't find orders. Where am I going wrong?
I referred to the code from these two links as well, but no luck:
1) Refer sequence part
2) Refer step no. 8
Sqoop has little to do with it. Here is an example of a more realistic scenario, where saveAsSequenceFile always assumes (k, v) pairs; this may help you:
import org.apache.hadoop.io._
val RDD = sc.parallelize( List( (1, List("A", "B")) , (2, List("B", "C")) , (3, List("C", "D", "E")) ) )
val RDD2 = RDD.map(x => (x._1, x._2.mkString("/")))
RDD2.saveAsSequenceFile("/rushhour/seq-directory/2")
val sequence_data = sc.sequenceFile("/rushhour/seq-directory/*", classOf[IntWritable], classOf[Text])
.map{case (x, y) => (x.get(), y.toString().split("/")(0), y.toString().split("/")(1))}
sequence_data.collect
returns:
res20: Array[(Int, String, String)] = Array((1,A,B), (2,B,C), (3,C,D), (1,A,B), (2,B,C), (3,C,D))
I am not sure if you want an RDD or DF, but converting RDD to DF is of course trivial.
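For example, a minimal sketch of the RDD-to-DataFrame step (the column names are purely illustrative; use import sqlContext.implicits._ on Spark 1.6, or spark.implicits._ on 2.x):
import sqlContext.implicits._
val df = sequence_data.toDF("id", "first", "second")   // tuple RDD to DataFrame
df.show(3)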
I figured out the solution to my own problem. It will be a lengthy answer, but I hope it makes some sense.
1) When I tried to read the data that was imported into HDFS using Sqoop, it gave an error for the following reasons:
A) A sequence file is all about key-value pairs. When the data is imported with Sqoop, it is not in key-value form, which is why reading it throws an error.
B) If you read the first few characters of the file, from which you can figure out the two classes you need to pass while reading a sequence file, you get data like this:
[cloudera#quickstart ~]$ hadoop fs -cat /user/cloudera/problem5/sequence/pa* | head -c 300
SEQ!org.apache.hadoop.io.LongWritableorders�;�M��c�K�����#���-OCLOSED#���PENDING_PAYMENT#���/COMPLETE#���"{CLOSED#���cat: Unable to write to output stream.
Above you can see only one class, i.e. org.apache.hadoop.io.LongWritable, and when I pass that while reading the sequence data it throws the error mentioned in the post:
val sequenceData=sc.sequenceFile("/ordersDataSet",classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.LongWritable]).map(rec => rec.toString())
I don't think point B is the main reason for the error, but I am quite sure point A is the real culprit.
2) Below is how I solved the problem.
I imported the data as an Avro data file to another destination using Sqoop, then created a DataFrame from the Avro file like this:
scala> import com.databricks.spark.avro._;
scala> val avroData=sqlContext.read.avro("path")
Now I created key-value pairs and saved them as a sequence file:
avroData.map(p=>(p(0).toString,(p(0)+"\t"+p(1)+"\t"+p(2)+"\t"+p(3)))).saveAsSequenceFile("/user/cloudera/problem5/sequence")
Now when I read the first few characters of the file written above, it shows the two classes I need when reading the file:
[cloudera#quickstart ~]$ hadoop fs -cat /user/cloudera/problem5/sequence/part-00000 | head -c 300
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text^#%���8P���11 1374735600000 11599 CLOSED&2#2 1374735600000 256 PENDING_PAYMENT!33 1374735600000 12111 COMPLETE44 1374735600000 8827 CLOSED!55 1374735600000 11318 COMPLETE 66 1374cat: Unable to write to output stream.
scala> val sequenceData=sc.sequenceFile("/user/cloudera/problem5/sequence",classOf[org.apache.hadoop.io.Text],classOf[org.apache.hadoop.io.Text]).map(rec=>rec.toString)
sequenceData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at map at <console>:30
Now when I print the data, it displays as below:
scala> sequenceData.take(4).foreach(println)
(1,1 1374735600000 11599 CLOSED)
(2,2 1374735600000 256 PENDING_PAYMENT)
(3,3 1374735600000 12111 COMPLETE)
(4,4 1374735600000 8827 CLOSED)
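If you also need the DataFrame from the question title, one possible continuation (a sketch that re-reads the same pairs instead of their toString form; the column names are just my guesses for the orders layout):
import sqlContext.implicits._
val pairs = sc.sequenceFile("/user/cloudera/problem5/sequence",
  classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.Text])
val ordersDF = pairs
  .map { case (_, v) => v.toString.split("\t") }   // the value was written as a tab-separated line
  .map(a => (a(0), a(1), a(2), a(3)))
  .toDF("order_id", "order_date", "order_customer_id", "order_status")
ordersDF.show(4)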
Last but not least, thank you everyone for your efforts. Cheers!

Loading files in a loop in spark

I have n files in a directory, all with the same .txt extension, and I want to load them in a loop and then make a separate DataFrame for each of them.
I have read this, but in my case all the files have the same extension and I want to iterate over them one by one and make a DataFrame for every file.
I started by counting the files in the directory with the following line of code:
sc.wholeTextFiles("/path/to/dir/*.txt").count()
but I don't know how I should proceed further.
Please guide me.
I am using Spark 2.3 and Scala.
Thanks.
wholeTextFiles returns a pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): rdd.RDD[(String, String)]
You can map over the RDD; the key of each record is the path of the file and the value is the content of the file:
sc.wholeTextFiles("/path/to/dir/*.txt").take(2)
sc.wholeTextFiles("/path/to/dir/*.txt").map((x,y)=> some logic on x and y )
You could use the Hadoop FileSystem API to get the list of files under the directory, then iterate over it and read each file into a different DataFrame.
Something like the below:
import org.apache.hadoop.fs.{FileSystem, Path}

// Hadoop FS (sc1 is the SparkContext)
val hadoop_fs = FileSystem.get(sc1.hadoopConfiguration)
// Get the list of files under the directory
val fs_status = hadoop_fs.listLocatedStatus(new Path(fileFullPath))
while (fs_status.hasNext) {
  val fileStatus = fs_status.next.getPath
  val filepath = fileStatus.toString
  val df = sc1.textFile(filepath)   // an RDD[String] per file; convert to a DataFrame as needed
}
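Since you are on Spark 2.3, another option (a sketch; spark is the usual SparkSession and dirPath is your directory path) is to list the files the same way and read each one with the SparkSession, keeping the resulting DataFrames in a Map keyed by file name:
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val txtFiles = fs.listStatus(new Path(dirPath))
  .map(_.getPath)
  .filter(_.getName.endsWith(".txt"))

val dfByFile = txtFiles.map { p =>
  p.getName -> spark.read.text(p.toString)   // one DataFrame (single "value" column) per file
}.toMap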

How to generate key-value format using Scala in Spark

I am studying Spark in VirtualBox. I use ./bin/spark-shell to open Spark and work in Scala. Now I am confused about the key-value format in Scala.
I have a txt file in /home/feng/spark/data which looks like:
panda 0
pink 3
pirate 3
panda 1
pink 4
I use sc.textFile to get this txt file. If I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7")
Then I can use rdd.collect() to show rdd on the screen:
scala> rdd.collect()
res26: Array[String] = Array(panda 0, pink 3, pirate 3, panda 1, pink 4)
However, if I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7.txt")
which differs only in the ".txt" here. Then when I use rdd.collect(), I get an error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/feng/spark/A.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
......
But I saw other examples. All of them have ".txt" at the end. Is there something wrong with my code or my system?
Another thing is when I tried to do:
scala> val rddd = rdd.map(x => (x.split(" ")(0),x))
rddd: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:29
scala> rddd.collect()
res0: Array[(String, String)] = Array((panda,panda 0), (pink,pink 3), (pirate,pirate 3), (panda,panda 1), (pink,pink 4))
I intended to select the first column of the data and use it as the key. But rddd.collect() does not look right: the words occur twice, so I cannot continue with operations like mapByKey, reduceByKey, or others. Where did I go wrong?
Just as an example, I create a String with your dataset, split it by line, and use SparkContext's parallelize method to create an RDD. Notice that after creating the RDD, I use its map method to split the String stored in each record and convert it to a Row.
import org.apache.spark.sql.Row
val text = "panda 0\npink 3\npirate 3\npanda 1\npink 4"
val rdd = sc.parallelize(text.split("\n")).map(x => Row(x.split(" "):_*))
rdd.take(3)
The output from the take method is:
res4: Array[org.apache.spark.sql.Row] = Array([panda,0], [pink,3], [pirate,3])
About your first question, there is no need for files to have any extension, because in this case the files are read as plain text.
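If the goal is a pair RDD you can feed into reduceByKey, a small sketch (splitting on the single space as in your data and treating the second column as an Int, which is my assumption) would be:
scala> val pairs = rdd.map { line => val a = line.split(" "); (a(0), a(1).toInt) }
scala> pairs.reduceByKey(_ + _).collect()
// e.g. Array((pirate,3), (panda,1), (pink,7)); the order of the array may vary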