I have a task to write some sequence files, for example: sequence1, sequence2, sequence3 into one folder.
If I try it like this:
sequence1.saveAsSequenceFile("home/sample1")
sequence2.saveAsSequenceFile("home/sample1")
sequence3.saveAsSequenceFile("home/sample1")
I receive an error on the second line: "directory home/sample1 already exists".
Does anybody know if there is a way to do this?
Just union them
val rddUnion = sequence1.union(sequence2).union(sequence3)
And then write them all together
rddUnion.saveAsSequenceFile("home/sample1")
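With more than a handful of RDDs, chaining .union gets unwieldy, and SparkContext.union accepts a whole sequence at once. The following is only a sketch: the local master, the made-up (Int, String) pairs standing in for sequence1..sequence3, and the names UnionSequenceFiles and unionAll are all assumptions for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object UnionSequenceFiles {
  // Hypothetical helper: merges any number of pair RDDs into one.
  def unionAll(sc: SparkContext, parts: Seq[RDD[(Int, String)]]): RDD[(Int, String)] =
    sc.union(parts)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("union-sequence-files").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Made-up sample data in place of the real sequence1..sequence3.
    val sequence1 = sc.parallelize(Seq(1 -> "a"))
    val sequence2 = sc.parallelize(Seq(2 -> "b"))
    val sequence3 = sc.parallelize(Seq(3 -> "c"))

    val rddUnion = unionAll(sc, Seq(sequence1, sequence2, sequence3))

    // A single write, so the "already exists" check is only hit once.
    // The target directory must still not exist before this call.
    rddUnion.saveAsSequenceFile("home/sample1")

    sc.stop()
  }
}
```

The union keeps the partitions of each input, so the output folder will contain one part file per original partition.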
Related
I'm reading from a path say /json//myfiles_.json
I'm then flattening the JSON using explode. This causes an error since I have some empty files. How do I tell it to ignore empty files or somehow filter them out?
I can detect individual files checking if the head is empty but I need to do this on the collection of files iterated in the dataframe with the use of the wildcard path.
So the answer seems to be that I need to provide a schema explicitly, because it can't infer one from an empty file - as you would expect!
e.g.
val schemadf = sqlContext.read.json(schemapath) //infer schema from file with data or do manually
val schema = schemadf.schema
val raw = sqlContext.read.schema(schema).json(monthfile)
val prep = raw.withColumn("MyArray", explode($"MyArray"))
.select($"ID", $"name", $"CreatedAt")
display(prep)
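The "do manually" option mentioned in the comment could look roughly like the sketch below. The column types are guesses (the post does not say what ID, name, CreatedAt and MyArray contain), the SparkSession entry point assumes Spark 2.x or later rather than the sqlContext used above, and ExplicitSchemaSketch is a hypothetical name.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object ExplicitSchemaSketch {
  // Assumption: every field is a string and MyArray is an array of strings.
  // Adjust the types to match the real data.
  val schema: StructType = StructType(Seq(
    StructField("ID", StringType, nullable = true),
    StructField("name", StringType, nullable = true),
    StructField("CreatedAt", StringType, nullable = true),
    StructField("MyArray", ArrayType(StringType), nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("explicit-schema")
      .master("local[*]") // assumption: local run
      .getOrCreate()

    // The wildcard path is passed in as the first argument.
    val monthfile = args(0)

    // With a fixed schema, empty files simply contribute no rows.
    val raw = spark.read.schema(schema).json(monthfile)
    val prep = raw
      .withColumn("MyArray", explode(col("MyArray")))
      .select(col("ID"), col("name"), col("CreatedAt"))

    prep.show()
    spark.stop()
  }
}
```

Writing the schema by hand also avoids the extra pass over the data that schema inference needs.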
I am new to Spark and Scala and have the following requirement: I need to process all the files under a path that has subdirectories. I guess I need to write some for-loop logic to iterate over all the files.
Below is the example of my case:
src/proj_fldr/dataset1/20170624/file1.txt
src/proj_fldr/dataset1/20170624/file2.txt
src/proj_fldr/dataset1/20170624/file3.txt
src/proj_fldr/dataset1/20170625/file1.txt
src/proj_fldr/dataset1/20170625/file2.txt
src/proj_fldr/dataset1/20170625/file3.txt
src/proj_fldr/dataset1/20170626/file1.txt
src/proj_fldr/dataset1/20170626/file2.txt
src/proj_fldr/dataset1/20170626/file3.txt
src/proj_fldr/dataset2/20170624/file1.txt
src/proj_fldr/dataset2/20170624/file2.txt
src/proj_fldr/dataset2/20170624/file3.txt
src/proj_fldr/dataset2/20170625/file1.txt
src/proj_fldr/dataset2/20170625/file2.txt
src/proj_fldr/dataset2/20170625/file3.txt
src/proj_fldr/dataset2/20170626/file1.txt
src/proj_fldr/dataset2/20170626/file2.txt
src/proj_fldr/dataset2/20170626/file3.txt
I need the code to iterate the files like
In src
loop (proj_fldr
loop(dataset
loop(datefolder
loop(file1 then, file2....))))
Since you have a regular file structure you can use the wildcard * when reading the files. You can do the following to read all the files into a single RDD:
val spark = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.wholeTextFiles("src/*/*/*/*.txt")
The result will be an RDD[(String, String)] with the path and the content in a tuple for each processed file.
To explicitly choose between local and HDFS files, you can prepend "file://" or "hdfs://" to the path.
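Since the path comes back with every file, the dataset and date folder can be recovered from it instead of looping over directories. A minimal sketch in plain Scala, assuming the dataset/date/file layout from the question; PathParts, FileLocation and parse are hypothetical names.

```scala
object PathParts {
  // Hypothetical record describing where a file sits in the layout from the question.
  final case class FileLocation(dataset: String, dateFolder: String, fileName: String)

  // Looks only at the last three path segments, so any prefix
  // (for example a scheme or an absolute directory) is ignored.
  def parse(path: String): Option[FileLocation] =
    path.split('/').filter(_.nonEmpty).takeRight(3) match {
      case Array(dataset, dateFolder, fileName) =>
        Some(FileLocation(dataset, dateFolder, fileName))
      case _ =>
        None // fewer than three segments: not the expected layout
    }
}
```

On the RDD from wholeTextFiles this could be applied as rdd.map { case (path, content) => (PathParts.parse(path), content) }, after which the records can be grouped or filtered by dataset or date.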
I'm new to Hadoop, so please do not judge strictly my seemingly simple question.
The short version: What tuple data type can I use in Hadoop to store 2 longs as a single value in a sequence file?
Moreover, I want to be able to read and process this file with Apache Pig like A = LOAD '/my/file' AS (a:long, (b:long, c:long)) and with Scala & Spark like val a = sc.sequenceFile[LongWritable, DesiredTuple]("/my/file", 1).
The full story:
I'm writing a Hadoop job in Java, and I need to output a sequence file that contains 3 long values in each record. I use the first value as the key and group the two other values together as the value in my Reducer.
I tried several variants:
Using org.apache.hadoop.mapreduce.lib.join.TupleWritable
public class MyReducer extends Reducer<...> {
public void reduce(Context context){
long a,b,c;
// ...
context.write(a, new TupleWritable(
new LongWritable[]{new LongWritable(b), new LongWritable(c)}));
}
}
But the javadoc of the TupleWritable class says "This is not a general-purpose tuple type." It seemed OK for a first attempt, but I can't get my tuples back. Look at a simple script in Apache Pig:
A = LOAD '/my/file' USING org.apache.pig.piggybank.storage.SequenceFileLoader()
AS (a:long, (b:long, t:long));
DUMP A;
I got something like this:
(2220,)
(5640,)
(6240,)
...
So what is the Apache Pig way of reading Hadoop's TupleWritable from a sequence file?
Furthermore, I tried to change sequence format to text format: job.setOutputFormatClass(TextOutputFormat.class);
This time I just looked in one of the output files:
> hdfs dfs -cat /my/file/part-r-00000 | head
2220 [,]
5640 [,]
6240 [,]
...
So the next question is: Why is there nothing in my TupleWritable value?
After that, I tried org.apache.mahout.cf.taste.hadoop.EntityEntityWritable.
For a sequence file I got the same result as before:
grunt> A = LOAD '/my/file' USING org.apache.pig.piggybank.storage.SequenceFileLoader() AS (a:long, (b:long, c:long));
(2220,)
(5640,)
(6240,)
...
For a text file I got the desired result:
2220 2 15
5640 1 9
6240 0 1
...
And the next question is: How do I read such tuples (EntityEntityWritable), and maybe other custom objects, back from a Hadoop-written sequence file?
This question is related to this.
I am processing an S3 folder containing csv.gz files in Spark. Each csv.gz file has a header that contains column names. This has been solved by the above SO link and the solution looks like this:
val rdd = sc.textFile("s3://.../my-s3-path").mapPartitions(_.drop(1))
The problem now is that some of the files seem to have a newline ('\n') at the end (this is an assumption; we are not sure which files). So when converting the RDD to a DataFrame, I get an error. The question now is:
How do I get rid of the last line of each file if it is '\n'?
Why not a simple filter:
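val rdd = sc.textFile("s3...").filter(line => !line.equalsIgnoreCase("\n")).mapPartitions...
Or, since textFile already strips the line terminators and a trailing blank line shows up as an empty string, filter any empty line: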
val rdd = sc.textFile("s3...").filter(line => !line.trim().isEmpty)...
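Both steps (dropping the header and dropping blank lines) can live in one function over the partition iterator. The sketch below uses a plain Scala Iterator as a stand-in for the iterator that mapPartitions hands to its function, and assumes, as the question does, that each partition starts with a header line; CleanPartition and clean are hypothetical names.

```scala
object CleanPartition {
  // Drops the first line (the header), then every line that is
  // empty or contains only whitespace.
  def clean(lines: Iterator[String]): Iterator[String] =
    lines.drop(1).filter(line => line.trim.nonEmpty)
}
```

With Spark this would be passed as sc.textFile("s3...").mapPartitions(CleanPartition.clean), keeping the path elided as in the posts above.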
I have a dataset of employees and their leave-records. Every record (of type EmployeeRecord) contains EmpID (of type String) and other fields. I read the records from a file and then transform into PairRDDFunctions:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[(String, Iterable[EmployeeRecord])]. I transform this into PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then I process the records according to the logic of the application. Finally, I get an RDD of type RDD[Iterable[EmployeeRecord]]:
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
I find that in the file every record begins with an ArrayBuffer(...). What I need is a file with one EmployeeRecord on each line. Is that not possible? Am I missing something?
I have spotted the missing API. It is well...flatMap! :-)
By using flatMap with identity, I can get rid of the Iterable and 'unpack' the contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
I have also found this post suggesting the same thing. I wish I had seen it a bit earlier.
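The effect of flatMap(identity) is easy to check on ordinary Scala collections, which behave the same way as an RDD here. A small sketch, with a made-up EmployeeRecord (the real class is not shown in the post) and a Seq standing in for the RDD; FlattenSketch is a hypothetical name.

```scala
object FlattenSketch {
  // Hypothetical stand-in for the EmployeeRecord from the question.
  final case class EmployeeRecord(empId: String, leaveDays: Int)

  // One element per employee, each holding that employee's records.
  val grouped: Seq[Iterable[EmployeeRecord]] = Seq(
    Iterable(EmployeeRecord("E1", 2), EmployeeRecord("E1", 5)),
    Iterable(EmployeeRecord("E2", 1))
  )

  // One element per record: the inner collections are unpacked.
  val flat: Seq[EmployeeRecord] = grouped.flatMap(identity)
}
```

Because saveAsTextFile writes one line per element using toString, flattening first is what turns one ArrayBuffer(...) line per group into one line per record.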