General purpose Tuple in Hadoop - scala

I'm new to Hadoop, so please don't judge my seemingly simple question too harshly.
The short version: what tuple data type can I use in Hadoop to store 2 longs as a single value in a sequence file?
Moreover, I want to be able to read and process this file with Apache Pig, like A = LOAD '/my/file' AS (a:long, (b:long, c:long)), and with Scala & Spark, like val a = sc.sequenceFile[LongWritable, DesiredTuple]("/my/file", 1).
The full story:
I'm writing a Hadoop job in Java, and I need to output a sequence file that contains 3 long values in each record. I use the first value as a key and group the two other values together as a value in my Reducer.
I tried several variants:
Using org.apache.hadoop.mapreduce.lib.join.TupleWritable
public class MyReducer extends Reducer<...> {
    public void reduce(Context context){
        long a,b,c;
        // ...
        context.write(a, new TupleWritable(
            new LongWritable[]{new LongWritable(b), new LongWritable(c)}));
    }
}
But the javadoc of the TupleWritable class says "This is not a general-purpose tuple type." It seemed OK for a first attempt, but I can't get my tuples back. Look at this simple script in Apache Pig:
A = LOAD '/my/file' USING org.apache.pig.piggybank.storage.SequenceFileLoader()
AS (a:long, (b:long, t:long));
DUMP A;
I got something like this:
(2220,)
(5640,)
(6240,)
...
So what is the Apache Pig way of reading Hadoop's TupleWritable from a sequence file?
Furthermore, I tried changing the output from sequence format to text format: job.setOutputFormatClass(TextOutputFormat.class);
This time I just looked at one of the output files:
> hdfs dfs -cat /my/file/part-r-00000 | head
2220 [,]
5640 [,]
6240 [,]
...
So the next question is: why is there nothing in my TupleWritable value?
After that, I tried org.apache.mahout.cf.taste.hadoop.EntityEntityWritable.
For a sequence file I got the same result as before:
grunt> A = LOAD '/my/file' USING org.apache.pig.piggybank.storage.SequenceFileLoader() AS (a:long, (b:long, c:long));
(2220,)
(5640,)
(6240,)
...
For a text file I got the desired result:
2220 2 15
5640 1 9
6240 0 1
...
And the next question is: how do I read such tuples (EntityEntityWritable), and maybe other custom objects, back from a Hadoop-written sequence file?
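A note on the empty [,] output: if I read the TupleWritable source correctly, the constructor that takes a Writable[] leaves every position marked as "not written", and both write() and toString() skip such positions, which would explain the result above. One possible way around it is a small value type of your own. The following is only a sketch under stated assumptions: the name LongPair and its fields are hypothetical, and the class deliberately does not extend org.apache.hadoop.io.Writable so that it runs without Hadoop on the classpath. In a real job you would add that trait; the two methods below already have the signatures it requires.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{DataInput, DataInputStream, DataOutput, DataOutputStream}

// Hypothetical value type holding two longs (not part of Hadoop).
// In a real job: class LongPair(...) extends org.apache.hadoop.io.Writable
class LongPair(var first: Long, var second: Long) {
  // Hadoop creates Writables reflectively, so a no-arg constructor is needed.
  def this() = this(0L, 0L)

  // Serialize both fields in a fixed order.
  def write(out: DataOutput): Unit = {
    out.writeLong(first)
    out.writeLong(second)
  }

  // Deserialize in exactly the same order as write().
  def readFields(in: DataInput): Unit = {
    first = in.readLong()
    second = in.readLong()
  }

  // TextOutputFormat calls toString, so this decides the text layout.
  override def toString: String = s"$first\t$second"
}

object LongPairDemo {
  // Writes a pair to an in-memory buffer and reads it back into a fresh object.
  def roundTrip(p: LongPair): LongPair = {
    val buffer = new ByteArrayOutputStream()
    p.write(new DataOutputStream(buffer))
    val copy = new LongPair()
    copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray)))
    copy
  }
}
```

With such a class on the classpath of both the job and the reader, the Spark call from the short version would use it in place of DesiredTuple. Whether a given Pig loader can decode a custom class is a separate question that this sketch does not answer.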

Related

Spark write text file without ignoring escape(backslash)

I'm trying to write a Dataset to a text file.
Example:
datasets
  .write
  .text(path)
What I intend to write is "some\text" (a String that the dataset contains).
For Scala to interpret this String correctly, the value has to be written like this:
val text: String = "some\\text"
Of course, when testing in Scala, it prints the correct value ("some\text").
But when I write this dataset with spark.write, it appears to be written as "some\\text".
Reading the internal code, I only found an escape option for CSV writing.
Is there any way to solve this problem?
Thanks
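To pin down where the extra backslash comes from, it may help to first confirm what the String actually contains. The sketch below uses plain Scala with no Spark involved; the object name BackslashDemo is made up for illustration. It shows that the doubled backslash exists only in the source literal, so a second backslash in the output file would have to be introduced by some later step.

```scala
object BackslashDemo {
  // Escaped literal: two characters in the source, one backslash in the value.
  val escaped: String = "some\\text"

  // Triple-quoted literal: no escape processing, the backslash is taken as-is.
  val tripleQuoted: String = """some\text"""

  // Counts the backslash characters that are really present in a value.
  def backslashes(s: String): Int = s.count(_ == '\\')
}
```

Both values are the same 9-character String with a single backslash, which can be checked with BackslashDemo.backslashes before the data is handed to the writer.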

Conditionally map through rows in CSV file in Scala / Spark to produce another CSV file

I am quite new to Scala / Spark and I have been thrown in at the deep end. I have been trying hard for several weeks to find a solution to a seemingly simple problem on Scala 2.11.8, but have been unable to find a good one.
I have a large database in CSV format, close to 150 GB, with plenty of null values, which needs to be reduced and cleaned based on the values of individual columns.
The schema of the original CSV file is as follows:
Column 1: Double
Column 2: Integer
Column 3: Double
Column 4: Double
Column 5: Integer
Column 6: Double
Column 7: Integer
So, I want to conditionally map through all the rows of the CSV file and export the results to another CSV file with the following conditions for each row:
If the value for column 4 is not null, then the values for columns 4, 5, 6 and 7 of that row should be stored as an array called lastValuesOf4to7. (In the dataset if the element in column 4 is not null, then columns 1, 2 and 3 are null and can be ignored)
If the value of column 3 is not null, then the values of columns 1, 2 and 3 and the four elements from the lastValuesOf4to7 array, as described above, should be exported as a new row into another CSV file called condensed.csv. (In the dataset if the element in column 3 is not null, then columns 4, 5, 6 & 7 are null and can be ignored)
So in the end I should get a csv file called condensed.csv, which has 7 columns.
I have tried using the following code in Scala but have not been able to progress further:
import scala.io.Source
object structuringData {
  def main(args: Array[String]) {
    val data = Source.fromFile("/path/to/file.csv")
    var lastValuesOf4to7 = Array("0","0","0","0")
    val lines = data.getLines // Get the lines of the file
    val splitLine = lines.map(s => s.split(',')).toArray // This gives an out of memory error since the original file is huge.
    data.close
  }
}
As you can see from the code above, I have tried to move the data into an array, but I have been unable to progress further since I cannot process each line individually.
I am quite certain that there must be a straightforward solution to processing CSV files in Scala / Spark.
Use the spark-csv package, then use SQL queries to filter the data according to your use case, and export the result at the end.
If you are using Spark 2.0.0, spark-csv is already included in spark-sql; if you are using an older version, add the dependency accordingly.
You can find a link to the spark-csv here.
You can also look at the example here: http://blog.madhukaraphatak.com/analysing-csv-data-in-spark/
Thank you for the response. I managed to create a solution myself using a Bash script. I had to start with a blank condensed.csv file first. My code shows how easy it was to achieve this:
#!/bin/bash
OLDIFS=$IFS
IFS=","
# Last seen values of columns 4 to 7
last1=0
last2=0
last3=0
last4=0
# -r keeps backslashes in the input from being interpreted
while read -r f1 f2 f3 f4 f5 f6 f7
do
    if [[ $f4 != "" ]]
    then
        last1=$f4
        last2=$f5
        last3=$f6
        last4=$f7
    elif [[ $f3 != "" ]]
    then
        echo "$f1,$f2,$f3,$last1,$last2,$last3,$last4" >> path/to/condensed.csv
    fi
done < "$1"
IFS=$OLDIFS
If the script is saved with the name extractcsv.sh then it should be run using the following format:
$ ./extractcsv.sh path/to/original/file.csv
This only goes to confirm my observation that ETL is easier in Bash than in Scala. Thank you for your help, though.
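For comparison, the same line-by-line logic can be sketched in plain Scala without loading the file into memory. This is a minimal sketch, not a tuned implementation: the names CondenseCsv and condense are made up for illustration, and it assumes simple comma-separated fields with no quoting. The key difference from the original attempt is that the iterator is never converted with toArray, so only one line is held at a time.

```scala
object CondenseCsv {
  // Lazily turns raw CSV lines into condensed 7-column rows.
  // Rows with column 4 set update the remembered values of columns 4 to 7;
  // rows with column 3 set are emitted together with those remembered values.
  def condense(lines: Iterator[String]): Iterator[String] = {
    var last = Array("0", "0", "0", "0")
    lines.flatMap { line =>
      // limit -1 keeps trailing empty fields; padTo guards against short lines
      val f = line.split(",", -1).padTo(7, "")
      val row: Option[String] =
        if (f(3).nonEmpty) { last = f.slice(3, 7); None }
        else if (f(2).nonEmpty) Some((f.take(3) ++ last).mkString(","))
        else None
      row.iterator
    }
  }
}
```

For the real file, the input would come from scala.io.Source.fromFile(path).getLines(), and each resulting row would be written to condensed.csv with a java.io.PrintWriter as it is produced.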

Using Custom Hadoop input format for processing binary file in Spark

I have developed a Hadoop-based solution that processes a binary file. It uses the classic Hadoop MR technique. The binary file is about 10 GB and divided into 73 HDFS blocks, and the business logic, written as a map process, operates on each of these 73 blocks. We have developed a customInputFormat and CustomRecordReader in Hadoop that return a key (IntWritable) and a value (BytesWritable) to the map function. The value is nothing but the contents of an HDFS block (binary data). The business logic knows how to read this data.
Now I would like to port this code to Spark. I am a beginner in Spark and could run simple examples (wordcount, pi example). However, I could not find a straightforward example of processing binary files in Spark. I see two solutions for this use case. In the first, I avoid the custom input format and record reader: find an approach in Spark that creates an RDD for those HDFS blocks, and use a map-like method that feeds the HDFS block content to the business logic. If this is not possible, I would like to reuse the custom input format and custom record reader through something such as the Hadoop API, HadoopRDD, etc.
My problem: I do not know whether the first approach is possible. If it is, can anyone please provide some pointers to examples? I tried the second approach but was highly unsuccessful. Here is the code snippet I used:
package org {
  object Driver {
    def myFunc(key : IntWritable, content : BytesWritable):Int = {
      println(key.get())
      println(content.getSize())
      return 1
    }
    def main(args: Array[String]) {
      // create a spark context
      val conf = new SparkConf().setAppName("Dummy").setMaster("spark://<host>:7077")
      val sc = new SparkContext(conf)
      println(sc)
      val rd = sc.newAPIHadoopFile("hdfs:///user/hadoop/myBin.dat", classOf[RandomAccessInputFormat], classOf[IntWritable], classOf[BytesWritable])
      val count = rd.map (x => myFunc(x._1, x._2)).reduce(_+_)
      println("The count is *****************************"+count)
    }
  }
}
Please note that the print statement in the main method prints 73, which is the number of blocks, whereas the print statements inside the map function print 0.
Can someone tell me what I am doing wrong here? I think I am not using the API the right way, but I failed to find any documentation or usage examples.
A couple of problems at a glance. You define myFunc but call func. Your myFunc has no return type, so you can't call collect(). If your myFunc truly doesn't have a return value, you can do foreach instead of map.
collect() pulls the data in an RDD to the driver to allow you to do stuff with it locally (on the driver).
I have made some progress on this issue. I am now using the function below, which does the job:
var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
  classOf[IntWritable],
  classOf[BytesWritable],
  job.getConfiguration()
)
val count = hRDD.mapPartitionsWithInputSplit{ (split, iter) => myfuncPart(split, iter)}.collect()
However, I ended up with another error, the details of which I have posted here:
Issue in accessing HDFS file inside spark map function
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

Spark/Scala read hadoop file

In a pig script I saved a table using PigStorage('|').
I have in the corresponding hadoop folder files like
part-r-00000
etc.
What is the best way to load it in Spark/Scala? In this table I have 3 fields: Int, String, Float.
I tried:
val text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
But then I would need to somehow split each line. Is there a better way to do it?
If I were coding in Python, I would create a DataFrame indexed by the first field, whose columns are the values found in the string field and whose coefficients are the float values. But I need to use Scala in order to use the PCA module, and its DataFrames don't seem that close to Python's.
Thanks for the insight
PigStorage creates a text file without schema information, so you need to do that work yourself, with something like:
val csv = sc.textFile("file") // or the directory where the part files are
val data = csv.map { line =>
  // split takes a regex, so the pipe must be escaped
  val vals = line.split("\\|")
  (vals(0).toInt, vals(1), vals(2).toDouble)
}
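One thing to watch for when splitting PigStorage('|') output: String.split with a String argument treats it as a regular expression, and an unescaped "|" is an alternation of two empty patterns, so it does not split on the pipe character. The sketch below shows one way to parse a line safely in plain Scala; the names PipeParsing and parseLine are made up for illustration, and the field layout (Int, String, Float read as Double) is taken from the question.

```scala
object PipeParsing {
  // Parses one "int|string|float" line into a typed tuple.
  def parseLine(line: String): (Int, String, Double) = {
    // The Char overload of split takes the pipe literally, so no escaping is needed.
    val vals = line.split('|')
    (vals(0).toInt, vals(1), vals(2).toDouble)
  }
}
```

With Spark, the function can be passed straight to the RDD, for example sc.textFile("file").map(PipeParsing.parseLine).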

Spark: RDD.saveAsTextFile when using a pair of (K,Collection[V])

I have a dataset of employees and their leave-records. Every record (of type EmployeeRecord) contains EmpID (of type String) and other fields. I read the records from a file and then transform into PairRDDFunctions:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[(String, Iterable[EmployeeRecord])]. I transform this into PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then I process the records as per the logic of the application. Finally, I get an RDD of type RDD[Iterable[EmployeeRecord]]:
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
then I find that in the file every record begins with an ArrayBuffer(...). What I need is a file with one EmployeeRecord on each line. Is that not possible? Am I missing something?
I have spotted the missing API. It is, well... flatMap! :-)
By using flatMap with identity, I can get rid of the Iterable and 'unpack' the contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
I have also found this post suggesting the same thing. I wish I had seen it a bit earlier.
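The effect of flatMap(identity) can be seen on ordinary Scala collections, which stand in for the RDD here. This is only an illustration: the EmployeeRecord fields and the sample values are hypothetical, since the original class is not shown.

```scala
// Hypothetical record type; the real EmployeeRecord has other fields.
case class EmployeeRecord(empId: String, leaveDays: Int)

object FlattenDemo {
  // Grouped records, shaped like the elements of RDD[Iterable[EmployeeRecord]].
  val grouped: Seq[Iterable[EmployeeRecord]] = Seq(
    Seq(EmployeeRecord("E1", 2), EmployeeRecord("E1", 5)),
    Seq(EmployeeRecord("E2", 1))
  )

  // flatMap(identity) unpacks each group, leaving one record per element.
  val flat: Seq[EmployeeRecord] = grouped.flatMap(identity)

  // What a text writer would emit: one toString per element.
  val lines: Seq[String] = flat.map(_.toString)
}
```

Before flattening, each element is a whole collection, so its toString prints the wrapper (ArrayBuffer(...) in the RDD case). After flattening, each element is a single record, so each line is one EmployeeRecord.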