unable to create dataframe from sequence file in Spark created by Sqoop - scala

I want to read the orders data, which is stored as a sequence file in HDFS on the Cloudera VM, and create an RDD from it. Below are my steps:
1) Importing orders data as sequence file:
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba --password cloudera --table orders -m 1 --target-dir /ordersDataSet --as-sequencefile
2) Reading file in spark scala:
Spark 1.6
val sequenceData=sc.sequenceFile("/ordersDataSet",classOf[org.apache.hadoop.io.Text],classOf[org.apache.hadoop.io.Text]).map(rec => rec.toString())
3) When I try to read data from the above RDD, it throws the error below:
Caused by: java.io.IOException: WritableName can't load class: orders
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2108)
... 17 more
Caused by: java.lang.ClassNotFoundException: Class orders not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
... 18 more
I don't know why it says that it can't find orders. Where am I going wrong?
I referred to the code from these two links as well, but no luck:
1) Refer to the sequence part
2) Refer to step no. 8

Sqoop has little to do with it. Here is an example of a more realistic scenario, where saveAsSequenceFile always assumes (k, v) pairs; this may help you:
import org.apache.hadoop.io._
val RDD = sc.parallelize( List( (1, List("A", "B")) , (2, List("B", "C")) , (3, List("C", "D", "E")) ) )
val RDD2 = RDD.map(x => (x._1, x._2.mkString("/")))
RDD2.saveAsSequenceFile("/rushhour/seq-directory/2")
val sequence_data = sc.sequenceFile("/rushhour/seq-directory/*", classOf[IntWritable], classOf[Text])
.map{case (x, y) => (x.get(), y.toString().split("/")(0), y.toString().split("/")(1))}
sequence_data.collect
returns:
res20: Array[(Int, String, String)] = Array((1,A,B), (2,B,C), (3,C,D), (1,A,B), (2,B,C), (3,C,D))
I am not sure if you want an RDD or a DF, but converting an RDD to a DF is of course trivial.
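For example, a minimal sketch of that RDD-to-DF step in Spark 1.6 (the column names here are my own invention):
import sqlContext.implicits._

val df = sequence_data.toDF("id", "first", "second")
df.show()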

I figured out the solution to my own problem. It is going to be a lengthy write-up, but I hope it makes some sense.
1) When I tried to read the data that was imported into HDFS using Sqoop, it gave an error for the following reasons:
A) A sequence file is all about key-value pairs. When I import the table with Sqoop, the imported data is not in key-value pairs, which is why reading it throws an error.
B) If you try to read the first few characters, from which you can figure out the two classes that need to be passed as input while reading a sequence file, you get data like this:
[cloudera@quickstart ~]$ hadoop fs -cat /user/cloudera/problem5/sequence/pa* | head -c 300
SEQ!org.apache.hadoop.io.LongWritableorders�;�M��c�K�����#���-OCLOSED#���PENDING_PAYMENT#���/COMPLETE#���"{CLOSED#���cat: Unable to write to output stream.
Above you can see only one class, i.e. org.apache.hadoop.io.LongWritable, and when I pass it while reading the sequence data it throws the error mentioned in the post:
val sequenceData=sc.sequenceFile("/ordersDataSet",classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.LongWritable]).map(rec => rec.toString())
I don't think point B is the main reason for the error, but I am quite sure point A is the real culprit.
2) Below is how I solved my problem.
I imported the data as an Avro data file into another destination using Sqoop. Then I created a dataframe from the Avro data as follows:
scala> import com.databricks.spark.avro._;
scala> val avroData=sqlContext.read.avro("path")
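For reference, the Avro import in that first step would have looked roughly like this (a sketch only; the target directory is an assumed example, --as-avrodatafile is the relevant flag):
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba --password cloudera --table orders -m 1 --target-dir /user/cloudera/problem5/avro --as-avrodatafile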
Now I created key-value pairs and saved them as a sequence file:
avroData.map(p=>(p(0).toString,(p(0)+"\t"+p(1)+"\t"+p(2)+"\t"+p(3)))).saveAsSequenceFile("/user/cloudera/problem5/sequence")
Now when I try to read the first few characters of the file written above, it shows me the two classes that I need to pass while reading the file:
[cloudera@quickstart ~]$ hadoop fs -cat /user/cloudera/problem5/sequence/part-00000 | head -c 300
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text^#%���8P���11 1374735600000 11599 CLOSED&2#2 1374735600000 256 PENDING_PAYMENT!33 1374735600000 12111 COMPLETE44 1374735600000 8827 CLOSED!55 1374735600000 11318 COMPLETE 66 1374cat: Unable to write to output stream.
scala> val sequenceData=sc.sequenceFile("/user/cloudera/problem5/sequence",classOf[org.apache.hadoop.io.Text],classOf[org.apache.hadoop.io.Text]).map(rec=>rec.toString)
sequenceData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at map at <console>:30
Now when I try to print the data, it displays as below:
scala> sequenceData.take(4).foreach(println)
(1,1 1374735600000 11599 CLOSED)
(2,2 1374735600000 256 PENDING_PAYMENT)
(3,3 1374735600000 12111 COMPLETE)
(4,4 1374735600000 8827 CLOSED)
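Since the original goal was a dataframe, here is a minimal sketch that re-reads that sequence file as (Text, Text) pairs and builds one; the column names and types are my assumptions based on the orders layout shown above.
import sqlContext.implicits._

case class Order(orderId: Int, orderDate: Long, customerId: Int, status: String)

val ordersDF = sc
  .sequenceFile("/user/cloudera/problem5/sequence",
    classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.Text])
  .map { case (_, v) =>
    val f = v.toString.split("\t")   // value is "id \t date \t customerId \t status"
    Order(f(0).toInt, f(1).toLong, f(2).toInt, f(3))
  }
  .toDF()

ordersDF.show(4)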
Last but not least, thank you everyone for your much appreciated efforts. Cheers!!

Related

What is "WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)"?

I have the following code:
val data = sc.textFile("log1.txt,log2.txt")
val s = Seq(data)
val par = sc.parallelize(s)
The result that I obtained is as follows:
WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)
par: org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] = ParallelCollectionRDD[2] at parallelize at <console>:28
Question 1
How does a parallel collection work?
Question 2
Can I iterate through them and perform transformation?
Question 3
RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
What does this mean?
(An interesting case indeed)
When in doubt, I always recommend following the types in Scala (after all, the types are why we Scala developers use the language in the first place, aren't they?).
So, let's reveal the types:
scala> val data = sc.textFile("log1.txt,log2.txt")
data: org.apache.spark.rdd.RDD[String] = log1.txt,log2.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val s = Seq(data)
s: Seq[org.apache.spark.rdd.RDD[String]] = List(log1.txt,log2.txt MapPartitionsRDD[1] at textFile at <console>:24)
scala> val par = sc.parallelize(s)
WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)
par: org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] = ParallelCollectionRDD[3] at parallelize at <console>:28
As you were told, org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] is not a supported case in Spark (it was, however, accepted by the Scala compiler, since it matches the signature of the SparkContext.parallelize method... unfortunately).
You don't really need val s = Seq(data), since the records in the two files log1.txt,log2.txt are already "inside" the RDD, and Spark will process all records from both files in a distributed and parallel manner (which I believe is your use case).
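A minimal sketch of working with the records directly, without wrapping the RDD in a Seq (the filter on "ERROR" is just an illustrative transformation, not from your question):
val data = sc.textFile("log1.txt,log2.txt")    // one RDD over the records of both files

val errors = data.filter(_.contains("ERROR"))  // runs over both files, in parallel
errors.count()                                 // an action that triggers the work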
I do think I've answered all three questions; they are all based on false expectations and hence are pretty much alike :)

load hdfs file into spark context

I am new to Spark/Scala and need to load a file from HDFS into Spark. I have a file in HDFS (/newhdfs/abc.txt), and I can see its contents with hdfs dfs -cat /newhdfs/abc.txt
I did the following, in this order, to load the file into the Spark context:
spark-shell #It entered into scala console window
scala> import org.apache.spark._; //Line 1
scala> val conf=new SparkConf().setMaster("local[*]");
scala> val sc = new SparkContext(conf);
scala> val input=sc.textFile("hdfs:///newhdfs/abc.txt"); //Line 4
Once I hit enter on line 4, I get the message below:
input: org.apache.spark.rdd.RDD[String] = hdfs:///newhdfs/abc.txt MapPartitionsRDD[19] at textFile at <console>:27
Is this a fatal error? What do I need to do to solve this?
(Using Spark-2.0.0 and Hadoop 2.7.0)
This is not an error; it just shows the name of the file for your RDD.
In the Basic docs, there is this example:
scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25
which demonstrates the very same behavior.
How would you expect an error to happen without an action triggering the actual work?
If you want to check that everything is OK, do a count on your input RDD: count is an action, so it will trigger the actual read of the file and then count the elements of your RDD.
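For example, sticking to the input RDD from your question:
scala> input.count()   // count is an action: it triggers the actual read, so a wrong path would only fail here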

How to generate key-value format using Scala in Spark

I am studying Spark on VirtualBox. I use ./bin/spark-shell to open Spark and use Scala. Now I am confused about the key-value format in Scala.
I have a txt file in home/feng/spark/data, which looks like:
panda 0
pink 3
pirate 3
panda 1
pink 4
I use sc.textFile to get this txt file. If I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7")
Then I can use rdd.collect() to show rdd on the screen:
scala> rdd.collect()
res26: Array[String] = Array(panda 0, pink 3, pirate 3, panda 1, pink 4)
However, if I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7.txt")
which has ".txt" at the end. Then when I use rdd.collect(), I get an error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/feng/spark/A.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
......
But I saw other examples. All of them have ".txt" at the end. Is there something wrong with my code or my system?
Another thing is when I tried to do:
scala> val rddd = rdd.map(x => (x.split(" ")(0),x))
rddd: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:29
scala> rddd.collect()
res0: Array[(String, String)] = Array((panda,panda 0), (pink,pink 3), (pirate,pirate 3), (panda,panda 1), (pink,pink 4))
I intended to select the first column of the data and use it as the key. But rddd.collect() does not look that way: the words occur twice, which is not right. I cannot continue with the remaining operations like reduceByKey and others. Where did I go wrong?
Just as an example, I create a String with your dataset; after this I split the record by line and use SparkContext's parallelize method to create an RDD. Notice that after I create the RDD, I use its map method to split the String stored in each record and convert it into a Row.
import org.apache.spark.sql.Row
val text = "panda 0\npink 3\npirate 3\npanda 1\npink 4"
val rdd = sc.parallelize(text.split("\n")).map(x => Row(x.split(" "):_*))
rdd.take(3)
The output from the take method is:
res4: Array[org.apache.spark.sql.Row] = Array([panda,0], [pink,3], [pirate,3])
About your first question: there is no need for files to have any extension, because in this case the file is simply read as plain text.
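On the key-value part, here is a small sketch (my own, building on the question's rdd) of producing real (String, Int) pairs so that reduceByKey works:
val pairs = rdd.map { line =>
  val parts = line.split(" ")
  (parts(0), parts(1).toInt)   // key = word, value = the number on that line
}

val totals = pairs.reduceByKey(_ + _)
totals.collect()   // with the sample data: panda -> 1, pink -> 7, pirate -> 3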

Join DStream with dynamic dataset

I am new to Spark Streaming. I need to enrich events coming from a stream with data from a dynamic dataset, and I have a problem creating that dynamic dataset. The dataset should be fed by data coming from a different stream (but this stream will have much lower throughput than the main stream of events). Additionally, the size of the dataset will be approximately 1-3 GB, so using a simple HashMap will not be sufficient (in my opinion).
In the Spark Streaming Programming Guide I found:
val dataset: RDD[String, String] = ...
val windowedStream = stream.window(Seconds(20))...
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
and the explanation: "In fact, you can also dynamically change the dataset you want to join against." This part I don't understand at all: how can an RDD be dynamically changed? Isn't it immutable?
Below you can see my code. The idea is to add every new RDD from myStream to myDataset, but apparently this doesn't work the way I would like it to.
val ssc = new StreamingContext(conf, Seconds(5))
val myDataset: RDD[String] = ssc.sparkContext.emptyRDD[String]
val myStream = ssc.socketTextStream("localhost", 9997)
myStream.foreachRDD(rdd => {myDataset.union(rdd)})
myDataset.foreach(println)
I would appreciate any help or advice.
Regards!
Yes, RDDs are immutable. One issue with your code is that union() returns a new RDD; it does not alter the existing myDataset RDD.
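To make the union visible at all, you would at least need a var and a reassignment, roughly like this (only a sketch: the union lineage grows with every batch, so the reference-swapping approach described below is usually the better fit):
import org.apache.spark.rdd.RDD

var myDataset: RDD[String] = ssc.sparkContext.emptyRDD[String]

myStream.foreachRDD { rdd =>
  // foreachRDD runs on the driver, so re-pointing the driver-side var works
  myDataset = myDataset.union(rdd)
}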
The Programming Guide says the following:
In fact, you can also dynamically change the dataset you want to join
against. The function provided to transform is evaluated every batch
interval and therefore will use the current dataset that dataset
reference points to.
The first sentence might read better as follows:
In fact, you can also dynamically change which dataset you want to join
against.
So we can change the RDD that dataset references, but not the RDD itself. Here's an example of how this could work (using Python):
# Run as follows:
# $ spark-submit ./match_ips_streaming_simple.py 2> err
# In another window run:
# $ nc -lk 9999
# Then enter IP addresses separated by spaces into the nc window
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import time
BATCH_INTERVAL = 2
SLEEP_INTERVAL = 8
sc = SparkContext("local[*]", "IP-Matcher")
ssc = StreamingContext(sc, BATCH_INTERVAL)
ips_rdd = sc.parallelize(set())
lines_ds = ssc.socketTextStream("localhost", 9999)
# split each line into IPs
ips_ds = lines_ds.flatMap(lambda line: line.split(" "))
pairs_ds = ips_ds.map(lambda ip: (ip, 1))
# join with the IPs RDD
matches_ds = pairs_ds.transform(lambda rdd: rdd.join(ips_rdd))
matches_ds.pprint()
ssc.start()
# alternate between two sets of IP addresses for the RDD
IP_FILES = ('ip_file1.txt', 'ip_file2.txt')
file_index = 0
while True:
    with open(IP_FILES[file_index]) as f:
        ips = f.read().splitlines()
    ips_rdd = sc.parallelize(ips).map(lambda ip: (ip, 1))
    print "using", IP_FILES[file_index]
    file_index = (file_index + 1) % len(IP_FILES)
    time.sleep(SLEEP_INTERVAL)
#ssc.awaitTermination()
In the while loop, I change the RDD that ips_rdd references every 8 seconds. The join() transformation will use whatever RDD ips_rdd currently references.
$ cat ip_file1.txt
1.2.3.4
10.20.30.40
$ cat ip_file2.txt
5.6.7.8
50.60.70.80
$ spark-submit ./match_ips_streaming_simple.py 2> err
using ip_file1.txt
-------------------------------------------
Time: 2015-09-09 17:18:20
-------------------------------------------
-------------------------------------------
Time: 2015-09-09 17:18:22
-------------------------------------------
-------------------------------------------
Time: 2015-09-09 17:18:24
-------------------------------------------
('1.2.3.4', (1, 1))
('10.20.30.40', (1, 1))
using ip_file2.txt
-------------------------------------------
Time: 2015-09-09 17:18:26
-------------------------------------------
-------------------------------------------
Time: 2015-09-09 17:18:28
-------------------------------------------
('50.60.70.80', (1, 1))
('5.6.7.8', (1, 1))
...
While the above job is running, in another window:
$ nc -lk 9999
1.2.3.4 50.60.70.80 10.20.30.40 5.6.7.8
<... wait for the other RDD to load ...>
1.2.3.4 50.60.70.80 10.20.30.40 5.6.7.8
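For completeness, here is a rough Scala sketch of the same pattern (assuming an existing StreamingContext ssc; the file name in the last comment is just an example):
import org.apache.spark.rdd.RDD

var ipsRdd: RDD[(String, Int)] = ssc.sparkContext.parallelize(Seq.empty[(String, Int)])

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(ip => (ip, 1))

// transform is re-evaluated every batch interval, so the join always uses
// whatever RDD the ipsRdd variable points to at that moment
val matches = pairs.transform(rdd => rdd.join(ipsRdd))
matches.print()

// from a separate driver-side thread, periodically re-point the reference, e.g.:
// ipsRdd = ssc.sparkContext.textFile("ip_file1.txt").map(ip => (ip, 1))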

Spark processing columns in parallel

I've been playing with Spark, and I managed to get it to crunch my data. My data consists of a flat, delimited text file with 50 columns and about 20 million rows. I have Scala scripts that will process each column.
In terms of parallel processing, I know that RDD operations run on multiple nodes. So every time I process a column, its rows are processed in parallel, but the columns themselves are processed sequentially.
A simple example: my data is a 5-column, tab-delimited text file, each column contains text, and I want to do a word count for each column. I would do:
for (i <- 0 until 5) {
  data.map(_.split("\t", -1)(i)).map((_, 1)).reduceByKey(_ + _)
}
Although each column's operation runs in parallel, the columns themselves are processed sequentially (bad wording, I know. Sorry!). In other words, column 2 is processed after column 1 is done, column 3 is processed after columns 1 and 2 are done, and so on.
My question is: is there any way to process multiple columns at a time? If you know a way, or a tutorial, would you mind sharing it with me?
thank you!!
Suppose the inputs are sequences of column values. The following can be done to process the columns concurrently; the basic idea is to use the pair (column index, value) as the key.
scala> val rdd = sc.parallelize((1 to 4).map(x=>Seq("x_0", "x_1", "x_2", "x_3")))
rdd: org.apache.spark.rdd.RDD[Seq[String]] = ParallelCollectionRDD[26] at parallelize at <console>:12
scala> val rdd1 = rdd.flatMap{x=>{(0 to x.size - 1).map(idx=>(idx, x(idx)))}}
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = FlatMappedRDD[27] at flatMap at <console>:14
scala> val rdd2 = rdd1.map(x=>(x, 1))
rdd2: org.apache.spark.rdd.RDD[((Int, String), Int)] = MappedRDD[28] at map at <console>:16
scala> val rdd3 = rdd2.reduceByKey(_+_)
rdd3: org.apache.spark.rdd.RDD[((Int, String), Int)] = ShuffledRDD[29] at reduceByKey at <console>:18
scala> rdd3.take(4)
res22: Array[((Int, String), Int)] = Array(((0,x_0),4), ((3,x_3),4), ((2,x_2),4), ((1,x_1),4))
In the example output, ((0,x_0),4) means: first column, key x_0, count 4. You can start from here to process further.
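From there, one possible follow-up (my own sketch, not part of the answer): regroup the ((column, word), count) pairs into one word-count map per column index.
val perColumn = rdd3
  .map { case ((col, word), cnt) => (col, (word, cnt)) }
  .groupByKey()
  .mapValues(_.toMap)

perColumn.collect().foreach { case (col, counts) => println(s"column $col: $counts") }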
You can try the following code, which uses the Scala parallel collection feature:
(0 until 5).map(index => (index, data)).par.map { x =>
  // collect() is an action, so each element of the parallel collection submits its own Spark job
  x._2.map(_.split("\t", -1)(x._1)).map((_, 1)).reduceByKey(_ + _).collect()
}
data is just a reference, so duplicating it will not cost too much. And an RDD is read-only, so parallel processing can work. The par method uses the Scala parallel collection feature; you can check the parallel jobs in the Spark web UI.
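Another option worth mentioning (my own suggestion, not from the answer above): submit one Spark job per column from separate Futures; the Spark scheduler runs jobs submitted from different threads concurrently. This assumes data: RDD[String] with 5 tab-separated columns, as in the question.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val jobs = (0 until 5).map { i =>
  Future {
    // each Future submits an independent word-count job for column i
    data.map(_.split("\t", -1)(i)).map((_, 1)).reduceByKey(_ + _).collect()
  }
}

val results = jobs.map(f => Await.result(f, Duration.Inf))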