Gzipping Har Files on HDFS using Spark - scala

I have huge data in hadoop archive .har format. Since, har doesn't include any compression, I am trying to further gzip it in and store in HDFS. The only thing I can get to work without error is :
harFile.coalesce(1, "true")
.saveAsTextFile("hdfs://namenode/archive/GzipOutput", classOf[org.apache.hadoop.io.compress.GzipCodec])
//`coalesce` because Gzip isn't splittable.
But, this doesn't give me the correct results. A Gzipped file is generated but with invalid output ( a single line saying the rdd type etc.)
Any help will be appreciated. I am also open to any other approaches.
Thanks.

A Java code snippet to create a compressed version of an existing HDFS file.
Built in a hurry, in a text editor, from bits and pieces of a Java app I wrote some time ago, hence not tested; some typos and gaps to be expected.
// HDFS API
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileStatus;
// native Hadoop compression libraries
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.io.compress.Lz4Codec;
..............
// Hadoop "Configuration" (and its derivatives for HDFS, HBase etc.) constructors try to auto-magically
// find their config files by searching CLASSPATH for directories, and searching each dir for hard-coded
// name "core-site.xml", plus "hdfs-site.xml" and/or "hbase-site.xml" etc.
// WARNING - if these config files are not found, the "Configuration" reverts to hard-coded defaults without
// any warning, resulting in bizarre error messages later > let's run some explicit controls here
Configuration cnfHadoop = new Configuration() ;
String propDefaultFs =cnfHadoop.get("fs.defaultFS") ;
if (propDefaultFs ==null || ! propDefaultFs.startsWith("hdfs://"))
{ throw new IllegalArgumentException(
"HDFS configuration is missing - no proper \"core-site.xml\" found, please add\n"
+"directory /etc/hadoop/conf/ (or custom dir with custom XML conf files) in CLASSPATH"
) ;
}
/*
// for a Kerberised cluster, either you already have a valid TGT in the default
// ticket cache (via "kinit"), or you have to authenticate by code
UserGroupInformation.setConfiguration(cnfHadoop) ;
UserGroupInformation.loginUserFromKeytab("user#REALM", "/some/path/to/user.keytab") ;
*/
FileSystem fsCluster =FileSystem.get(cnfHadoop) ;
Path source = new Path("/some/hdfs/path/to/XXX.har") ;
Path target = new Path("/some/hdfs/path/to/XXX.har.gz") ;
// alternative: "BZip2Codec" for better compression (but higher CPU cost)
// alternative: "SnappyCodec" or "Lz4Codec" for lower compression (but much lower CPU cost)
CompressionCodecFactory codecBootstrap = new CompressionCodecFactory(cnfHadoop) ;
CompressionCodec codecHadoop =codecBootstrap.getCodecByClassName(GzipCodec.class.getName()) ;
Compressor compressorHadoop =codecHadoop.createCompressor() ;
byte[] buffer = new byte[16*1024*1024] ;
int bufUsedCapacity ;
InputStream sourceStream =fsCluster.open(source) ;
OutputStream targetStream =codecHadoop.createOutputStream(fsCluster.create(target, true), compressorHadoop) ;
while ((bufUsedCapacity =sourceStream.read(buffer)) >0)
{ targetStream.write(buffer, 0, bufUsedCapacity) ; }
targetStream.close() ;
sourceStream.close() ;
..............

Related

ParquetProtoWriters creates an unreadable parquet file

My .proto file contains one field of map type.
Message Foo {
...
...
map<string, uint32> fooMap = 19;
}
I'm consuming messages from Kafka source and trying to write the messages as a parquet file to S3 bucket.
The relevant part of the code looks like this:
val basePath = "s3a:// ..."
env
.fromSource(source, WatermarkStrategy.noWatermarks(), "source")
.map(x => toJavaProto(x))
.sinkTo(
FileSink
.forBulkFormat(basePath, ParquetProtoWriters.forType(classOf(Foo)))
.withOutputFileConfig(
OutputFileConfig
.builder()
.withPartPrefix("foo")
.withPartSuffix(".parquet")
.build()
)
.build()
)
.setParallelism(1)
env.execute()
The result is that a parquet file was actually written for S3, but the file appears to be corrupted. When I try to read the file using the Avro / Parquet Viewer plugin I can see this error:
Unable to process file
.../Downloads/foo-9366c15f-270e-4939-ad88-b77ee27ddc2f-0.parquet
java.lang.UnsupportedOperationException: REPEATED not supported
outside LIST or MAP. Type: repeated group fooMap = 19 { optional
binary key (STRING) = 1; optional int32 value = 2; } at
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:277)
at
org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:264)
at
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:134)
at
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:185)
at
org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
at
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
at
uk.co.hadoopathome.intellij.viewer.fileformat.ParquetFileReader.getRecords(ParquetFileReader.java:99)
at
uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:193)
at
uk.co.hadoopathome.intellij.viewer.FileViewerToolWindow$2.doInBackground(FileViewerToolWindow.java:184)
at java.desktop/javax.swing.SwingWorker$1.call(SwingWorker.java:304)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.desktop/javax.swing.SwingWorker.run(SwingWorker.java:343) at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Flink version 1.15
proto 2
There are some breaking changes in parquet-format and parquet-mr. I'm not familiar with Flink, but I guess you have to configure org.apache.flink.formats.parquet.protobuf.ParquetProtoWriters to generate a correct format.
I used parquet-mr directly and encountered the same issue. An avro parquet reader is unable to read the parquet file generated by the following code:
import org.apache.parquet.proto.ProtoParquetWriter;
import org.apache.parquet.proto.ProtoWriteSupport;
...
var conf = new Configuration();
ProtoWriteSupport.setWriteSpecsCompliant(conf, false);
var builder = ProtoParquetWriter.builder(file)
.withMessage(Xxx.class)
.withCompressionCodec(CompressionCodecName.GZIP)
.withWriteMode(Mode.OVERWRITE)
.withConf(conf);
try (var writer = builder.build()) {
writer.write(pb.toBuilder());
}
If the config value is changed to true, it will succeed:
ProtoWriteSupport.setWriteSpecsCompliant(conf, true);
By looking its source code, we can know this function is for setting the boolean value of parquet.proto.writeSpecsCompliant in the config.
In ParquetProtoWriters.forType's source code, it create a factory with builder class ParquetProtoWriterBuilder, which uses org.apache.parquet.proto.ProtoWriteSupport internally. I guess you can somehow assign it with a correctly configured ProtoWriteSupport to it.
I also install this Python tool: https://pypi.org/project/parquet-tools/ to inspect parquet files. List fields generated in older format will be like:
...
############ Column(f2) ############
name: f2
path: f1.f2
...
and in newer format will be like:
...
############ Column(element) ############
name: element
path: f1.f2.list.element
...
Hope this article could give you some direction.
References
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/LogicalTypes.md#nested-types
https://github.com/apache/parquet-format/pull/51
https://github.com/apache/parquet-mr/pull/463

Generating a single output file for each processed input file in Apach Flink

I am using Scala and Apache Flink to build an ETL that reads all the files under a directory in my local file system periodically and write the result of processing each file in a single output file under another directory.
So an example of this is would be:
/dir/to/input/files/file1
/dir/to/intput/files/fil2
/dir/to/input/files/file3
and the output of the ETL would be exactly:
/dir/to/output/files/file1
/dir/to/output/files/file2
/dir/to/output/files/file3
I have tried various approaches including reducing the parallel processing to one when writing to the dataSink but I still can't achieve the required result.
This is my current code:
val path = "/path/to/input/files/"
val format = new TextInputFormat(new Path(path))
val socketStream = env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10)
val wordsStream = socketStream.flatMap(value => value.split(",")).map(value => WordWithCount(value,1))
val keyValuePair = wordsStream.keyBy(_.word)
val countPair = keyValuePair.sum("count")
countPair.print()
countPair.writeAsText("/path/to/output/directory/"+
DateTime.now().getHourOfDay.toString
+
DateTime.now().getMinuteOfHour.toString
+
DateTime.now().getSecondOfMinute.toString
, FileSystem.WriteMode.NO_OVERWRITE)
// The first write method I trid:
val sink = new BucketingSink[WordWithCount]("/path/to/output/directory/")
sink.setBucketer(new DateTimeBucketer[WordWithCount]("yyyy-MM-dd--HHmm"))
// The second write method I trid:
val sink3 = new BucketingSink[WordWithCount]("/path/to/output/directory/")
sink3.setUseTruncate(false)
sink3.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
sink3.setWriter(new StringWriter[WordWithCount])
sink3.setBatchSize(3)
sink3.setPendingPrefix("file-")
sink3.setPendingSuffix(".txt")
Both writing methods fail in producing the wanted result.
Can some with experience with Apache Flink guide me to the write approach please.
I solved this issue importing the next dependencies to run on local machine:
hadoop-aws-2.7.3.jar
aws-java-sdk-s3-1.11.183.jar
aws-java-sdk-core-1.11.183.jar
aws-java-sdk-kms-1.11.183.jar
jackson-annotations-2.6.7.jar
jackson-core-2.6.7.jar
jackson-databind-2.6.7.jar
joda-time-2.8.1.jar
httpcore-4.4.4.jar
httpclient-4.5.3.jar
You can review it on :
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html
Section "Provide S3 FileSystem Dependency"

Spark: run an external process in parallel

Is it possible with Spark to "wrap" and run an external process managing its input and output?
The process is represented by a normal C/C++ application that usually runs from command line. It accepts a plain text file as input and generate another plain text file as output. As I need to integrate the flow of this application with something bigger (always in Spark), I was wondering if there is a way to do this.
The process can be easily run in parallel (at the moment I use GNU Parallel) just splitting its input in (for example) 10 part files, run 10 instances in memory of it, and re-join the final 10 part files output in one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes to file, executes an external program, and outputs results to the standard output. After that all you have to do is to use pipe method:
rdd.pipe("your_wrapper")
The only serious considerations is IO performance. If it is possible it would be better to adjust program you want to call so it can read and write data directly without going through disk.
Alternativelly you can use mapPartitions combined with process and standard IO tools to write to the local file, call your program and read the output.
If you end up here based on the question title from a Google search, but you don't have the OP restriction that the external program needs to read from a file--i.e., if your external program can read from stdin--here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer
val showSampleRows = true
val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap{ case(file, pds) => { // pds is a PortableDataStream
val rows = new ArrayBuffer[Array[String]]()
var errors = List[String]()
val io = new ProcessIO (
in => { // "in" is an OutputStream; write the encrypted contents of the
// input file (pds) to this stream
IOUtils.copy(pds.open(), in) // open() returns a DataInputStream
in.close
},
out => { // "out" is an InputStream; read the decrypted data off this stream.
// Even though this runs in another thread, we can write to rows, since it
// is part of the closure for this function
for(line <- scala.io.Source.fromInputStream(out).getLines) {
// ...decode line here... for my data, it was pipe-delimited
rows += line.split('|')
}
out.close
},
err => { // "err" is an InputStream; read any errors off this stream
// errors is part of the closure for this function
errors = scala.io.Source.fromInputStream(err).getLines.toList
err.close
}
)
val cmd = List("/my/decryption/program", "--decrypt")
val exitValue = cmd.run(io).exitValue // blocks until subprocess finishes
println(s"-- Results for file $file:")
if (exitValue != 0) {
// TBD write to string accumulator instead, so driver can output errors
// string accumulator from #zero323: https://stackoverflow.com/a/31496694/215945
println(s"exit code: $exitValue")
errors.foreach(println)
} else {
// TBD, you'll probably want to move this code to the driver, otherwise
// unless you're using the shell, you won't see this output
// because it will be sent to stdout of the executor
println(s"row count: ${rows.size}")
if (showSampleRows) {
println("6 sample rows:")
rows.slice(0,6).foreach(row => println(" " + row.mkString("|")))
}
}
rows
}}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types

How to bundle many files in S3 using Spark

I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell for example, I saw that it was very easy to read the files, for example:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take for ever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use this
First You can get a Buffer/List of S3 Paths :
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest
def listFiles(s3_bucket:String, base_prefix : String) = {
var files = new ArrayList[String]
//S3 Client and List Object Request
var s3Client = new AmazonS3Client();
var objectListing: ObjectListing = null;
var listObjectsRequest = new ListObjectsRequest();
//Your S3 Bucket
listObjectsRequest.setBucketName(s3_bucket)
//Your Folder path or Prefix
listObjectsRequest.setPrefix(base_prefix)
//Adding s3:// to the paths and adding to a list
do {
objectListing = s3Client.listObjects(listObjectsRequest);
for (objectSummary <- objectListing.getObjectSummaries().asScala) {
files.add("s3://" + s3_bucket + "/" + objectSummary.getKey());
}
listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());
//Removing Base Directory Name
files.remove(0)
//Creating a Scala List for same
files.asScala
}
Now Pass this List object to the following piece of code, note : sc is an object of SQLContext
var df: DataFrame = null;
for (file <- files) {
val fileDf= sc.textFile(file)
if (df!= null) {
df= df.unionAll(fileDf)
} else {
df= fileDf
}
}
Now you got a final Unified RDD i.e. df
Optional, And You can also repartition it in a single BigRDD
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
have you tried something along the lines of sc.wholeTextFiles?
It creates an RDD where the key is the filename and the value is the byte array of the whole file. You can then map this so the key is the file date, and then groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html
At your scale, elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz") as it takes forever. What you can do is use AWS DistCp or something similar to move files into HDFS. Once its in HDFS, spark is quite fast in ingesting the information in whatever way suits you.
Note that most of these processes require some sort of file list so you'll need to generate that somehow. for 20 mil files, this creation of file list will be a bottle neck. I'd recommend creating a file that get appended with the file path, every-time a file gets uploaded to s3.
Same for output, put into hdfs and then move to s3 (although direct copy might be equally efficient).

How do I generate binary RFC822-style headers in Python 3.2?

How do I convince email.generator.Generator to use binary in Python 3.2? This seems like precisely the use case for the policy framework that was introduced in Python 3.3, but I would like my code to run in 3.2.
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO
data = "Key: \N{SNOWMAN}\r\n\r\n"
message = Parser().parse(StringIO(data))
with open("/tmp/rfc882test", "w") as out:
Generator(out, maxheaderlen=0).flatten(message)
Fails with UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128).
Your data is not a valid RFC2822 header, which I suspect misleads you. It's a Unicode string, but RFC2822 is always only ASCII. To have non-ASCII characters you need to encode them with a character set and either base64 or quoted-printable encoding.
Hence, valid code would be this:
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO
data = "Key: =?utf8?b?4piD?=\r\n\r\n"
message = Parser().parse(StringIO(data))
with open("/tmp/rfc882test", "w") as out:
Generator(out, maxheaderlen=0).flatten(message)
Which of course avoids the error completely.
The question is how to generate such headers as =?utf8?b?4piD?= and the answer lies in the email.header module.
I made this example with:
>>> from email import header
>>> header.Header('\N{SNOWMAN}', 'utf8').encode()
'=?utf8?b?4piD?='
To handle files that have a Key: Value format the email module is the wrong solution. Handling such files are easy enough without the email module, and you will not have to work around the restrictions of RF2822. For example:
# -*- coding: UTF-8 -*-
import io
import sys
if sys.version_info > (3,):
def u(s): return s
else:
def u(s): return s.decode('unicode-escape')
def parse(infile):
res = {}
payload = ''
for line in infile:
key, value = line.strip().split(': ',1)
if key in res:
raise ValueError(u("Key {0} appears twice").format(key))
res[key] = value
return res
def generate(outfile, data):
for key in data:
outfile.write(u("{0}: {1}\n").format(key, data[key]))
if __name__ == "__main__":
# Ensure roundtripping:
data = {u('Key'): u('Value'), u('Foo'): u('Bar'), u('Frötz'): u('Öpöpöp')}
with io.open('/tmp/outfile.conf', 'wt', encoding='UTF8') as outfile:
generate(outfile, data)
with io.open('/tmp/outfile.conf', 'rt', encoding='UTF8') as infile:
res = parse(infile)
assert data == res
That code took 15 minutes to write, and works in both Python 2 and Python 3. If you want line continuations etc that's easy to add as well.
Here is a more complete one that supports comments etc.
A useful solution comes from http://mail.python.org/pipermail/python-dev/2010-October/104409.html :
from email.parser import Parser
from email.generator import BytesGenerator
# How do I get surrogateescape from a BytesIO/StringIO?
data = "Key: \N{SNOWMAN}\r\n\r\n" # write this to headers.txt
headers = open("headers.txt", "r", encoding="ascii", errors="surrogateescape")
message = Parser().parse(headers)
with open("/tmp/rfc882test", "wb") as out:
BytesGenerator(out, maxheaderlen=0).flatten(message)
This is for a program that wants to read and write a binary Key: value file without caring about the encoding. To consume the headers as decoded text without being able to write them back out with Generator(), Parser().parse(open("headers.txt", "r", encoding="utf-8")) should be sufficient.