Today I am seeking your help with an issue I have been having for the last couple of days with bzip2 compression. We need to compress our output text files into bzip2 format.
The problem is that we only go from 5 GB uncompressed to 3.2 GB compressed with bzip2. Seeing other projects compress their 5 GB files to only 400 MB makes me wonder whether I am doing something wrong.
Here is my code:
iDf
  .repartition(iNbPartition)
  .write
  .option("compression", "bzip2")
  .mode(SaveMode.Overwrite)
  .text(iOutputPath)
I am also importing this codec:
import org.apache.hadoop.io.compress.BZip2Codec
Besides that, I am not setting any configs in my spark-submit, because I've tried many with no luck.
Would really appreciate your help with this.
Thanks for your help, everyone. The explanation was in the bzip2 algorithm itself: since my data is anonymized with random values, it is essentially random, and the compression algorithm is no longer effective on it.
Thank you again
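For anyone who hits the same thing, here is a small sketch of how to see the effect for yourself. It assumes the Apache Commons Compress library is on the classpath, and the sizes and strings are made up purely for illustration: random bytes barely shrink under bzip2, while repetitive text shrinks dramatically.

import java.io.ByteArrayOutputStream
import scala.util.Random
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream

object Bzip2EntropyDemo {
  // Compress a byte array with bzip2 and return the compressed size in bytes.
  def bzip2Size(data: Array[Byte]): Int = {
    val buffer = new ByteArrayOutputStream()
    val bzOut  = new BZip2CompressorOutputStream(buffer)
    bzOut.write(data)
    bzOut.close()
    buffer.size()
  }

  def main(args: Array[String]): Unit = {
    // High-entropy (random, anonymized-like) data vs. highly repetitive text.
    val randomData = Array.fill[Byte](5 * 1024 * 1024)(Random.nextInt(256).toByte)
    val repetitive = ("some repeated log line\n" * 200000).getBytes("UTF-8")

    println(s"random:     ${randomData.length} -> ${bzip2Size(randomData)} bytes")
    println(s"repetitive: ${repetitive.length} -> ${bzip2Size(repetitive)} bytes")
  }
}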
Related
I am trying to read a large gz file and then insert it into a table. This is taking very long.
sparkSession.read.format("csv")
  .option("header", "true")
  .load("file-about-5gb-size.gz")
  .repartition(1000)
  .coalesce(1000)
  .write.mode("overwrite")
  .format("orc")
  .insertInto(table)
Is there any way I can optimize this? Please help.
Note: I have used arbitrary repartition and coalesce values.
You won't be able to optimize the read if your file is gzip-compressed. Gzip is not a splittable compression format in Spark, so there is no way to avoid reading the complete file in a single task.
If you want to parallelize the read, you need to make the file splittable by decompressing it first and then processing it.
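If you control the input, here is a rough sketch of that workflow in Spark; the decompressed file name, the repartition count, and the table variable are placeholders taken from the question:

// 1) Decompress once outside Spark, e.g. gunzip file-about-5gb-size.gz,
//    which produces an uncompressed file such as file-about-5gb-size.csv.
// 2) Read the uncompressed CSV; Spark can now split it across many tasks.
val df = sparkSession.read
  .format("csv")
  .option("header", "true")
  .load("file-about-5gb-size.csv")

// Repartition only if you need to control the number of output files,
// then write into the target ORC table as before.
df.repartition(100)
  .write
  .mode("overwrite")
  .format("orc")
  .insertInto(table)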
I am getting a lot of data from a websocket stream and I want to store it on disk. The amount of data received is ~300 MB per hour, and I want to store this data long term (months, years).
In .NET there is a way to read/write from/to zipped files using compressed streams. Is there a way to write directly to a compressed file in Swift?
This is a macOS (OS X) question.
Edit:
Stream compression might be a solution here, but I am not used to working with unsafe pointers and don't even know whether it can be used to write to a compressed file... I have been stuck on this for a few hours now. A code sample or directions on how to approach it would help. A CocoaPods wrapper for stream compression would be even better.
gzlog does what you're looking for. It is written in C and uses the zlib library. zlib is available on macOS, and you can link to C code from Swift.
I'm trying to create a Spark RDD from several JSON files compressed into a tar archive.
For example, I have 3 files
file1.json
file2.json
file3.json
And these are contained in archive.tar.gz.
I want to create a DataFrame from the JSON files. The problem is that Spark is not reading the JSON files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
Is there some way to handle gzipped archives containing multiple files in Spark?
UPDATE
Using the method given in the answer to Read whole text files from a compression in Spark, I was able to get things running, but this method does not seem suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering if there is an efficient way to deal with the problem.
I'm trying to avoid extracting the archives and then merging the files together, as this would be time-consuming.
A solution is given in Read whole text files from a compression in Spark.
Using the code sample provided, I was able to create a DataFrame from the compressed archive like so:
val jsonRDD = sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)   // extractFiles is defined in the linked answer
  .mapValues(_.map(decode()))                     // decode is defined in the linked answer

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
This method works fine for tar archives of a relatively small size, but is not suitable for large archive sizes.
A better solution to the problem seems to be converting the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (as opposed to tar archives).
See: A Million Little Files – Digital Digressions by Stuart Sierra.
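As a rough sketch of that idea (the output path is a placeholder, and extractFiles is the helper from the linked answer), the one-time conversion and the subsequent parallel read could look like this:

import org.apache.hadoop.io.{BytesWritable, Text}

// One-time conversion: unpack each archive and write its contents into a
// splittable SequenceFile, keyed by the archive the bytes came from.
sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)          // (archivePath, Array[Array[Byte]])
  .flatMap { case (archive, files) => files.map(bytes => (archive, bytes)) }
  .map { case (archive, bytes) => (new Text(archive), new BytesWritable(bytes)) }
  .saveAsSequenceFile("json_seq")

// Later reads are splittable, so they run in parallel across the cluster.
val jsonStrings = sc
  .sequenceFile("json_seq", classOf[Text], classOf[BytesWritable])
  .map { case (_, bytes) => new String(bytes.copyBytes(), "UTF-8") }

val df = sqlContext.read.json(jsonStrings)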
Files inside a *.tar.gz file are, as you have already mentioned, compressed. You cannot put the 3 files into a single compressed tar archive and expect the import function (which looks only for text) to know how to decompress the files, unpack them from the tar archive, and then import each file individually.
I would recommend you take the time to upload each individual JSON file manually, since neither sc.textFile nor sqlContext.read.json can unpack a tar archive containing multiple files.
I have a very large CSV file (870 MB) that I'm trying to import into Matlab. Some of the data are numeric and some are text. I have 16 GB of RAM and an SSD, but the import wizard script is using 37 GB and doesn't progress past 0% scanning the file after a couple of hours.
Is there a way to break up the import wizard script so it imports the first 500,000 rows, saves them to variables, empties the dataArray, then imports the next 500,000 rows and appends them to the variables, and so on until the file is complete? I'm surprised that Matlab doesn't do something like this natively.
Thank you for your help.
Have a look at the memory-mapping approach I described here. If you know the format of your file, or can deduce it from the contents, I have found this to be the fastest way to read large CSV files into Matlab. It also helps reduce memory usage.
I'm going to work on comparing around 300 binary files using Scala, byte by byte, 4 MB each. However, judging from what I've already done, processing 15 files at the same time using java.io.BufferedInputStream took around 90 seconds on my machine, so I don't think my solution would scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just comparing the files for differences but processing them in the same sequential order. Let's say I have to look at the i-th byte of every file at the same time, then move on to byte (i + 1).
Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full speed.
If the files will always be this small (4 MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.
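A minimal sketch of that approach (file names are placeholders): load the first file once and compare the rest against it in series.

import java.nio.file.{Files, Paths}

// Load the reference file once; at ~4 MB it easily fits in memory.
val reference = Files.readAllBytes(Paths.get("files/file000.bin"))

// Compare every other file against it, one at a time.
val others = (1 until 300).map(i => f"files/file$i%03d.bin")
val mismatches = others.filterNot { path =>
  java.util.Arrays.equals(reference, Files.readAllBytes(Paths.get(path)))
}

println(s"${mismatches.size} files differ from the reference")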
You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
Read 512 KB of every file, sequentially. You might try reading from 2 to 8 at the same time -- perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
Process those 512 KB using Futures.
Go back to step 1, unless you are finished with the files.
Get the result back from the processing Futures.
In step 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as much as you can, maybe a bit less than that, but definitely not more than that.
By not reading each file in full in step 1, you spend some of the time that would otherwise go to reading on useful CPU work instead. You may experiment with lowering the number of bytes read in step 1 as well.
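Here is a rough sketch of that loop (the chunk size, file names, and the per-chunk work are all placeholders), using plain streams for step 1 and Scala Futures for step 2:

import java.io.{BufferedInputStream, FileInputStream}
import scala.collection.mutable.ListBuffer
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val chunkSize = 512 * 1024
val paths     = (0 until 300).map(i => f"files/file$i%03d.bin")
val streams   = paths.map(p => new BufferedInputStream(new FileInputStream(p)))

// Read the next chunk of one file; an empty array means end of file.
def readChunk(in: BufferedInputStream): Array[Byte] = {
  val buf = new Array[Byte](chunkSize)
  val n   = in.read(buf)
  if (n <= 0) Array.emptyByteArray else buf.take(n)
}

// Placeholder for whatever work you need to do on the aligned chunks.
def process(chunks: Seq[Array[Byte]]): Int = chunks.map(_.length).sum

val pending  = ListBuffer.empty[Future[Int]]
var finished = false
while (!finished) {
  val chunks = streams.map(readChunk)                 // step 1: read, I/O kept sequential
  finished = chunks.forall(_.isEmpty)
  if (!finished) pending += Future(process(chunks))   // step 2: CPU work overlaps the next read
}

// Step 4: collect the results (a real run would bound the in-flight Futures).
val results = Await.result(Future.sequence(pending.toList), Duration.Inf)
streams.foreach(_.close())
println(s"processed ${results.size} chunk groups")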
Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course you may be wanting to do a much deeper comparison than just "are these files the same?"
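For example, a quick first pass (the directory name is a placeholder) could group the files by length and only deep-compare the ones that share a length:

import java.io.File

// Files whose lengths differ cannot be identical, so group by length first.
val files    = new File("files").listFiles().filter(_.isFile)
val byLength = files.groupBy(_.length())

// Only groups with more than one file need a byte-level (or hash) comparison.
val candidates = byLength.filter { case (_, group) => group.length > 1 }
println(s"${candidates.size} length groups need a deeper comparison")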
If you are just looking to see whether the files are the same, I would suggest using a hashing algorithm like SHA-1 to see if they match.
Here is some Java source to make that happen.
Many large systems that handle data use SHA-1, including the NSA and Git.
It's simply more efficient to use a hash instead of a byte-by-byte compare. The hashes can also be stored for later, to see if the data has been altered.
Here is a talk by Linus Torvalds specifically about Git; it also mentions why he uses SHA-1.
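A small sketch of that idea (file names are placeholders), using the JDK's built-in MessageDigest rather than any external library:

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// SHA-1 digest of a file's contents, rendered as a hex string.
def sha1Hex(path: String): String = {
  val digest = MessageDigest.getInstance("SHA-1").digest(Files.readAllBytes(Paths.get(path)))
  digest.map("%02x".format(_)).mkString
}

// Group the files by hash: files in the same group are (almost certainly) identical,
// and the hashes can be stored to detect later changes.
val paths  = (0 until 300).map(i => f"files/file$i%03d.bin")
val groups = paths.groupBy(sha1Hex)
println(s"${groups.size} distinct contents among ${paths.size} files")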
I would suggest using NIO if possible. Introduction To Java NIO and NIO2 seems like a decent guide to using NIO if you are not familiar with it. I would not suggest reading a file and doing a comparison byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do comparisons from that.
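A minimal sketch of the NIO approach (file names and chunk size are placeholders), comparing two files chunk by chunk with FileChannel and ByteBuffer:

import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// Fill a buffer from a channel (until full or end of file) and flip it for reading.
def fill(ch: FileChannel, buf: ByteBuffer): Int = {
  buf.clear()
  var n = ch.read(buf)
  while (n != -1 && buf.hasRemaining) n = ch.read(buf)
  buf.flip()
  buf.remaining()
}

// Compare two files chunk by chunk without loading either one fully into memory.
def sameContents(pathA: String, pathB: String, chunkSize: Int = 64 * 1024): Boolean = {
  val a = FileChannel.open(Paths.get(pathA), StandardOpenOption.READ)
  val b = FileChannel.open(Paths.get(pathB), StandardOpenOption.READ)
  try {
    if (a.size() != b.size()) return false
    val bufA = ByteBuffer.allocate(chunkSize)
    val bufB = ByteBuffer.allocate(chunkSize)
    var n = fill(a, bufA)
    fill(b, bufB)
    while (n > 0) {
      if (bufA != bufB) return false   // ByteBuffer.equals compares the remaining bytes
      n = fill(a, bufA)
      fill(b, bufB)
    }
    true
  } finally {
    a.close(); b.close()
  }
}

println(sameContents("files/file000.bin", "files/file001.bin"))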