I am looking to use Scala to get faster performance when accessing and downloading an Amazon S3 file. The file arrives as an InputStream and it is large (over 30 million rows).
I have tried this in Python (pandas), but it is too slow. I am hoping to increase the speed with Scala.
So far I am doing this, but it is too slow. Have I hit a bottleneck in the stream itself, so that I cannot read data from it any faster than the code below does?
val obj = amazonS3Client.getObject(bucket_name, file_name)
val reader = new BufferedReader(new InputStreamReader(obj.getObjectContent()))
var list_of_lines = List.empty[String]
var line = reader.readLine
while(line != null) {
  list_of_lines = list_of_lines ::: List(line)
  line = reader.readLine
}
I'm looking for serious speed improvement compared to the above approach. Thanks.
I suspect that your performance bottleneck is appending to a (linked) List in that loop:
list_of_lines = list_of_lines ::: List(line)
Each append copies the entire list built so far, so with n = 30 million lines the loop performs about n²/2 ≈ 4.5 × 10^14 element copies in total, a few hundred trillion times the work of handling one line. Even if copying one element costs only 1ns, that adds up to about 450,000 seconds, or roughly five days.
Switching to prepending to the List and then reversing at the end should improve the speed of your loop by a factor of more than a million:
while(line != null) {
list_of_lines = line :: list_of_lines
line = reader.readLine
}
list_of_lines = list_of_lines.reverse
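For intuition about why the append version degrades, here is a minimal sketch of both loops side by side. The object and method names (ListBuildDemo, appendAll, prependAll, copiesForAppend) are illustrative and not from the question; the copy count assumes that `:::` copies every cell of its left operand, which an immutable singly linked list has to do.

```scala
object ListBuildDemo {
  // Slow: `:::` copies the whole left-hand list on every append, so building
  // n elements performs 0 + 1 + ... + (n - 1) = n(n - 1)/2 cell copies.
  def appendAll(lines: Iterator[String]): List[String] = {
    var acc = List.empty[String]
    for (line <- lines) acc = acc ::: List(line)
    acc
  }

  // Fast: `::` allocates one cell per element; one reverse restores the order.
  def prependAll(lines: Iterator[String]): List[String] = {
    var acc = List.empty[String]
    for (line <- lines) acc = line :: acc
    acc.reverse
  }

  // Number of cell copies the append version performs for n lines.
  def copiesForAppend(n: Long): Long = n * (n - 1) / 2
}
```

Both methods return the lines in the same order; only the amount of copying differs. For 30 million lines, ListBuildDemo.copiesForAppend(30000000L) comes to roughly 4.5 × 10^14.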
List is also notoriously memory-inefficient, so for this many elements, it may also be worth doing something like this (which is also more idiomatic Scala):
import scala.io.{ Codec, Source }
val obj = amazonS3Client.getObject(bucket_name, file_name)
val source = Source.fromInputStream(obj.getObjectContent())(Codec.defaultCharsetCodec)
val lines = source.getLines().toVector
Vector being more memory-efficient than List should dramatically reduce GC thrashing.
The best way to achieve better performance is using the TransferManager provided by the AWS Java SDKs. It's a high level file transfer manager that will automatically parallelise downloads. I'd recommend using SDK v2, but the same can be done with SDK v1. Though, be aware, SDK v1 comes with limitations and only multipart files can be downloaded in parallel.
You need the following dependency (assuming you are using sbt with Scala). But note, there's no benefit of using Scala over Java with the TransferManager.
libraryDependencies += "software.amazon.awssdk" % "s3-transfer-manager" % "2.17.243-PREVIEW"
Example (Java):
S3TransferManager transferManager = S3TransferManager.create();
FileDownload download =
transferManager.downloadFile(b -> b.destination(Paths.get("myFile.txt"))
.getObjectRequest(req -> req.bucket("bucket").key("key")));
download.completionFuture().join();
I recommend reading more on the topic here: Introducing Amazon S3 Transfer Manager in the AWS SDK for Java 2.x
The Microsoft documentation here:
https://learn.microsoft.com/en-us/azure/databricks/kb/sql/find-size-of-table#size-of-a-delta-table
suggests two methods:
Method 1:
import com.databricks.sql.transaction.tahoe._
val deltaLog = DeltaLog.forTable(spark, "dbfs:/<path-to-delta-table>")
val snapshot = deltaLog.snapshot // the current delta table snapshot
println(s"Total file size (bytes): ${deltaLog.snapshot.sizeInBytes}")
Method 2:
spark.read.table("<non-delta-table-name>").queryExecution.analyzed.stats
For my table, they both return ~300 MB.
But then in storage explorer Folder statistics or in a recursive dbutils.fs.ls walk, I get ~900MB.
So those two methods, which are much quicker than literally looking at every file, underreport by 67%. I would be fine using the slower methods, except that when I try to scale up to the entire container, it takes 55 hours to scan all 1 billion files and 2.6 PB.
So what is the best way to get the size of a table in ADLS Gen 2? Bonus points if it works for folders that are not tables as that's really the number I need. dbutils.fs.ls is single threaded and only works on the driver, so it's not even very parallelizable. It can be threaded but only within the driver.
deltaLog.snapshot returns just the current snapshot. The table's directory can contain more files than that: files belonging to historical versions that have since been deleted or replaced in the current snapshot.
Also it returns 0 without complaints for non-delta paths. So I'm using this piece of code to get a database-level summary:
import com.databricks.sql.transaction.tahoe._
val databasePath = "dbfs:/<path-to-database>"
def size(path: String): Long =
dbutils.fs.ls(path).map { fi => if (fi.isDir) size(fi.path) else fi.size }.sum
val tables = dbutils.fs.ls(databasePath).par.map { fi =>
val totalSize = size(fi.path)
val snapshotSize = DeltaLog.forTable(spark, fi.path).snapshot.sizeInBytes
(fi.name, totalSize / 1024 / 1024 / 1024, snapshotSize / 1024 / 1024 / 1024)
}
display(tables.seq.sorted.toDF("name", "total_size_gb", "snapshot_size_gb"))
This parallelizes on the driver only, but since it is just file listing, it's pretty fast. I admit I don't have a billion files, but if it's slow for you, use a bigger driver and tune the number of threads.
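The same "recursive size per top-level folder, in parallel on one machine" idea can be sketched without Databricks. The version below is only an illustration under stated assumptions: it uses java.io.File on a local file system as a stand-in for dbutils.fs.ls, and the names DirSizes, size and childSizes are hypothetical. It works for any folder, table or not, but it still visits every file, so it does not remove the cost of listing a billion objects.

```scala
import java.io.File
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object DirSizes {
  // listFiles returns null for unreadable directories, so wrap it in Option.
  private def children(dir: File): Seq[File] =
    Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty)

  // Recursively sum file sizes under `f` (sequential walk of one subtree).
  def size(f: File): Long =
    if (f.isDirectory) children(f).map(size).sum else f.length

  // Size each immediate child of `root` in its own Future, so the subtrees
  // are walked concurrently on the supplied thread pool.
  def childSizes(root: File)(implicit ec: ExecutionContext): Map[String, Long] = {
    val jobs = children(root).map(c => Future(c.getName -> size(c)))
    Await.result(Future.sequence(jobs), Duration.Inf).toMap
  }
}
```

Passing an ExecutionContext backed by a fixed-size thread pool is one way to "tune the number of threads"; listing is mostly I/O-bound, so the pool can reasonably be larger than the number of cores.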
I am working on an application where files are dropped into a file system every 30 seconds (possibly every 5 seconds). I have to read each file, parse it, and push some records to Redis.
In each file all records are independent and I am not doing any calculation that will require updateStateByKey.
My question: if a file is not processed completely because of some issue (e.g. a Redis connection issue or bad data in a file), I want to reprocess it (say, up to n times) and also keep track of the files already processed.
For testing purposes I am reading from a local folder. I am also not sure how to determine that a file has been fully processed so that I can mark it as completed (i.e. record in a text file or database that the file was processed).
val lines = ssc.textFileStream("E:\\SampleData\\GG")
val words = lines.map(x=>x.split("_"))
words.foreachRDD(
x=> {
x.foreach(
x => {
var jedis = jPool.getResource();
try{
i=i+1
jedis.set("x"+i+"__"+x(0)+"__"+x(1), x(2))
}finally{
jedis.close()
}
}
)
}
)
Spark has a fault tolerance guide. Read more:
https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#fault-tolerance-semantics
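Independent of Spark, the "retry up to n times, then record the file as done" part of the question can be sketched in plain Scala. This is not taken from the guide above; FileRetry, retry, processOnce and the in-memory `processed` set are hypothetical names, and a real job would keep the ledger somewhere durable (a text file, a database, or Redis itself).

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

object FileRetry {
  // Run `op` up to `maxAttempts` times; return the first success or the last failure.
  @annotation.tailrec
  def retry[A](maxAttempts: Int)(op: () => A): Try[A] =
    Try(op()) match {
      case s @ Success(_)                => s
      case Failure(_) if maxAttempts > 1 => retry(maxAttempts - 1)(op)
      case f                             => f
    }

  // Hypothetical in-memory ledger of completed files (not durable).
  val processed: mutable.Set[String] = mutable.Set.empty

  // Skip files already completed; mark a file done only after `process`
  // finished without throwing. Returns false if every attempt failed.
  def processOnce(file: String, maxAttempts: Int)(process: String => Unit): Boolean =
    if (processed.contains(file)) true
    else retry(maxAttempts)(() => process(file)) match {
      case Success(_) =>
        processed += file
        true
      case Failure(_) => false
    }
}
```

A retry replays the whole file, so records pushed before the failure are pushed again. That is harmless only if the Redis keys are derived from the record itself (a running counter in the key would create duplicates).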
The orange color is the "OldGen", Green is "Eden Space", and blue is "survivor space". I used YourKit to do this profiling. This is how I wrote my file reading code:
val inputStream = new FileInputStream("E:\\Allen\\DataScience\\train\\train.csv")
val sc = new Scanner(inputStream, "UTF-8")
var counter = 0
while (sc.hasNextLine) {
rowActors(counter % 20) ! Row(sc.nextLine())
counter += 1
}
sc.close()
inputStream.close()
It seems like a big chunk of memory is taken by Scanner. However, my original file is only 5 GB. I wonder if I am mishandling the file reading procedure. If not, how should I read in and process my file? I'm very frustrated with the garbage collection right now.
Akka-stream provides a safer way to process files in parallel: https://github.com/typesafehub/activator-akka-stream-scala/blob/master/src/main/scala/sample/stream/GroupLogFile.scala
I am using Mirth 3.0.1. I am reading a file (using the File Reader) with 34,000 records. Each record has 45 pipe-separated (|) columns. Mirth takes too much time reading the file from disk. Mirth is installed on the same server where the file is located. Earlier, I was facing a Java heap space issue, which I resolved by setting -Xms1024m -Xmx4096m in the files mcserver.vmoptions and mcservice.vmoptions. Now I have to solve the read performance issue. Please find the channel attached.
The answer to this problem is highly dependent on the solution itself. As an example, if you are doing transformations when you benchmark, it might be that the problem is not with reading the files, but rather with doing massive amounts of filtering and transformations in Mirth. Since Mirth converts everything you configure into basically one gigantic Javascript that executes on the server, it might just as well be that this is causing the performance problem. Pre-processor scripts might also create a problem if you do something that causes Mirth to read the whole file.
It might also be that the 34,000 lines in your file contain huge quantities of information, simply making the file very big and expensive to process. If every record in the file is supposed to create a new message within Mirth, you might also want to check your batch settings for the reader.
And in addition to this, the performance of the read operations from disk is of course affected a lot by the infrastructure and hardware of the platform itself. You did mention that you are reading the files locally and that you had to increase the memory for Mirth. All of this could of course be a problem in itself. To make a benchmark you would want to compare this to something else. Maybe write a small Java program to just read the file to compare performance outside of Mirth.
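A baseline of that kind could look like the sketch below. It is only an illustration: the ReadBaseline object and its measure method are hypothetical names, UTF-8 is an assumed encoding, and the pipe delimiter comes from the question. It reads the file line by line, splits each record, and reports how long that took, which gives a number to compare against the Mirth channel.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object ReadBaseline {
  // Read a pipe-delimited file line by line and return
  // (record count, column count of the widest record, elapsed milliseconds).
  def measure(path: Path): (Long, Int, Long) = {
    val start = System.nanoTime()
    val reader = Files.newBufferedReader(path, StandardCharsets.UTF_8) // assumed encoding
    try {
      var records = 0L
      var maxColumns = 0
      var line = reader.readLine()
      while (line != null) {
        // A limit of -1 keeps trailing empty columns in the count.
        maxColumns = math.max(maxColumns, line.split("\\|", -1).length)
        records += 1
        line = reader.readLine()
      }
      (records, maxColumns, (System.nanoTime() - start) / 1000000L)
    } finally reader.close()
  }
}
```

If this finishes quickly for the 34,000-record file, the time is probably being spent in the channel's processing rather than in reading from disk.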
Thanks for the suggestions.
I used router.routeMessage('channelName','PartOfMsg') to route the records from one channel to a second channel in batches of 5,000, out of the 34,000 records in the file. This made reading the file faster, with the records being processed at the same time.
For the Mirth community, below is the code to route the message from one channel to another. This solution also applies if you have a large number of records to process in batches.
In Source Transformer,
debug = "ON";
XML.ignoreWhitespace = true;
logger.debug('Inside source transformer "SplitFileIntoFiles" of channel: SplitFile');
var
subSegmentCounter = 0,
xmlMessageProcessCounter = 0,
singleFileLimit = 5000,
isError = false,
xmlMessageProcess = new XML(<delimited><row><column1></column1><column2></column2></row></delimited>),
newSubSegment = <row><column1></column1><column2></column2></row>,
totalPatientRecords = msg.children().length();
logger.debug('Total number of records found in the patient input file are: ');
logger.debug(totalPatientRecords);
try{
for each (seg in msg.children())
{
xmlMessageProcess.appendChild(newSubSegment);
xmlMessageProcess['row'][xmlMessageProcessCounter] = msg['row'][subSegmentCounter];
if (xmlMessageProcessCounter == singleFileLimit -1)
{
logger.debug('Now sending the 5000 records to the next channel from channel DOR Batch File Process IHI');
router.routeMessage('DOR SendPatientsToMedicare',xmlMessageProcess);
logger.debug('After sending the 5000 records to the next channel from channel DOR Batch File Process IHI');
xmlMessageProcessCounter = 0;
delete xmlMessageProcess['row'];
}
subSegmentCounter++;
xmlMessageProcessCounter++;
}// End of FOR loop
}// End of try block
catch (exception)
{
logger.error('The exception has been raised in source transformer "SplitFileIntoFiles" of channel: SplitFile');
logger.error(exception);
globalChannelMap.put('isFailed',true);
globalChannelMap.put('errDesc',exception);
return true;
}
if (xmlMessageProcessCounter > 1)
{
try
{
logger.debug('Now sending the remaining records to the next channel from channel DOR Batch File Process IHI');
router.routeMessage('DOR SendPatientsToMedicare',xmlMessageProcess);
logger.debug('After sending the remaining records to the next channel from channel DOR Batch File Process IHI');
delete xmlMessageProcess['row'];
}
catch (exception)
{
logger.error('The exception has been raised in source transformer "SplitFileIntoFiles" of channel: SplitFile');
logger.error(exception);
globalChannelMap.put('isFailed',true);
globalChannelMap.put('errDesc',exception);
return true;
}
}
return true;
// End of JavaScript
Hope this helps.
Problem: limit binary files download rate.
def test = {
Logger.info("Call test action")
val file = new File("/home/vidok/1.jpg")
val fileIn = new FileInputStream(file)
response.setHeader("Content-type", "application/force-download")
response.setHeader("Content-Disposition", "attachment; filename=\"1.jpg\"")
response.setHeader("Content-Length", file.length + "")
val bufferSize = 1024 * 1024
val bb = new Array[Byte](bufferSize)
val bis = new java.io.BufferedInputStream(fileIn)
var bytesRead = bis.read(bb, 0, bufferSize)
while (bytesRead > 0) {
bytesRead = bis.read(bb, 0, bufferSize)
//sleep(1000)?
response.writeChunk(bytesRead)
}
}
But it's working only for text files. How do I make it work with binary files?
You've got the basic idea right: each time you've read a certain number of bytes (which are stored in your buffer) you need to:
evaluate how fast you've been reading (= X B/ms)
calculate the difference between X and how fast you should have been reading (= Y ms)
use sleep(Y) on the downloading thread if needed to slow the download rate down
There's already a great question about this right here that should have everything you need. I think especially the ThrottledInputStream solution (which is not the accepted answer) is rather elegant.
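The three steps above can be sketched as a small stream wrapper. This is my own minimal version, not the code from the linked question; the class name ThrottledStream and the maxBytesPerSec parameter are made up for illustration. Because it works on raw bytes, it behaves the same for binary and text files.

```scala
import java.io.{FilterInputStream, InputStream}

// Wraps a stream and sleeps as needed so that the average read rate since
// construction stays at or below maxBytesPerSec.
class ThrottledStream(underlying: InputStream, maxBytesPerSec: Long)
    extends FilterInputStream(underlying) {

  private val startNanos = System.nanoTime()
  private var totalBytes = 0L

  private def throttle(justRead: Int): Unit = if (justRead > 0) {
    totalBytes += justRead
    // How long the bytes read so far should have taken at the target rate...
    val expectedMs = totalBytes * 1000L / maxBytesPerSec
    // ...compared with how long they actually took.
    val elapsedMs = (System.nanoTime() - startNanos) / 1000000L
    // If we are ahead of schedule, sleep for the difference.
    if (expectedMs > elapsedMs) Thread.sleep(expectedMs - elapsedMs)
  }

  override def read(): Int = {
    val b = super.read()
    if (b >= 0) throttle(1)
    b
  }

  override def read(buf: Array[Byte], off: Int, len: Int): Int = {
    val n = super.read(buf, off, len)
    throttle(n)
    n
  }
}
```

Wrapping the FileInputStream in this class and copying it to the response in modest chunks limits the download rate. Note that the sleep blocks the calling thread, which is why the points below about threading matter.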
A couple of points to keep in mind:
Downloading using 1 thread for everything is the simplest way, however it's also the least efficient way if you want to keep serving requests.
Usually, you'll want to at least offload the actual downloading of a file to its own separate thread.
To speed things up: consider downloading files in chunks (using HTTP Content-Range) and Java NIO. However, keep in mind that this will make things a lot more complex.
I wouldn't implement something which any good webserver should be able to do for me. In enterprise systems this kind of thing is normally handled by a web entry server or firewall. But if you have to do this, then the answer by tmbrggmn looks good to me. NIO is a good tip.