How can you find the size of a delta table quickly and accurately? - scala

The Microsoft documentation here:
https://learn.microsoft.com/en-us/azure/databricks/kb/sql/find-size-of-table#size-of-a-delta-table
suggests two methods:
Method 1:
import com.databricks.sql.transaction.tahoe._
val deltaLog = DeltaLog.forTable(spark, "dbfs:/<path-to-delta-table>")
val snapshot = deltaLog.snapshot // the current delta table snapshot
println(s"Total file size (bytes): ${deltaLog.snapshot.sizeInBytes}")
Method 2:
spark.read.table("<non-delta-table-name>").queryExecution.analyzed.stats
For my table, they both return ~300 MB.
But in Storage Explorer's folder statistics, or in a recursive dbutils.fs.ls walk, I get ~900 MB.
So the two quick methods report about 67% less than what is actually in storage. I would happily use the slower methods, except that when I scale up to the entire container, scanning all 1 billion files and 2.6 PB takes 55 hours.
So what is the best way to get the size of a table in ADLS Gen 2? Bonus points if it works for folders that are not tables as that's really the number I need. dbutils.fs.ls is single threaded and only works on the driver, so it's not even very parallelizable. It can be threaded but only within the driver.

deltaLog.snapshot returns just the current snapshot. The table's directory can contain more files than that: files belonging to historical versions, which have been deleted or replaced in the current snapshot, stay on disk until they are vacuumed.
Also it returns 0 without complaints for non-delta paths. So I'm using this piece of code to get a database-level summary:
import com.databricks.sql.transaction.tahoe._
val databasePath = "dbfs:/<path-to-database>"
def size(path: String): Long =
  dbutils.fs.ls(path).map { fi => if (fi.isDir) size(fi.path) else fi.size }.sum
val tables = dbutils.fs.ls(databasePath).par.map { fi =>
  val totalSize = size(fi.path)
  val snapshotSize = DeltaLog.forTable(spark, fi.path).snapshot.sizeInBytes
  (fi.name, totalSize / 1024 / 1024 / 1024, snapshotSize / 1024 / 1024 / 1024)
}
display(tables.seq.sorted.toDF("name", "total_size_gb", "snapshot_size_gb"))
This parallelizes on the driver only, but since it is just file listing, it's pretty fast. I admit I don't have a billion files; if it's slow for you, use a bigger driver and tune the number of threads.
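As a rough illustration of tuning the number of threads, here is a minimal sketch of the same walk with an explicit thread pool. It assumes a plain local file system read through java.io.File as a stand-in for dbutils.fs.ls, and the names sizeOf and sizesByChild are made up for this example. It is written in script or notebook-cell style.

```scala
import java.io.File
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Recursive size in bytes of a file or directory tree (sequential).
// listFiles() returns null on I/O errors, so it is wrapped in Option.
def sizeOf(f: File): Long =
  if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).map(sizeOf).sum
  else f.length()

// Sizes each direct child of `root` on a fixed pool of `threads` threads,
// so the thread count is an explicit knob (hypothetical helper).
def sizesByChild(root: File, threads: Int): Map[String, Long] = {
  val pool = Executors.newFixedThreadPool(threads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val children = Option(root.listFiles()).getOrElse(Array.empty[File]).toList
    val futures = children.map(c => Future(c.getName -> sizeOf(c)))
    Await.result(Future.sequence(futures), Duration.Inf).toMap
  } finally pool.shutdown()
}
```

Parallelism here is per top-level child only, as in the snippet above, so one very large subfolder would still be walked by a single thread.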


Scala Amazon S3 streaming object

I am looking to use Scala to get faster performance when accessing and downloading an Amazon S3 file. The file comes in as an InputStream and it is large (over 30 million rows).
I have tried this in Python (pandas), but it is too slow. I am hoping to increase the speed with Scala.
So far I am doing this, but it is too slow. Have I hit a bottleneck in the stream, so that I cannot read data from it any faster than the code below does?
val obj = amazonS3Client.getObject(bucket_name, file_name)
val reader = new BufferedReader(new InputStreamReader(obj.getObjectContent()))
var list_of_lines = List.empty[String]
var line = reader.readLine
while (line != null) {
  list_of_lines = list_of_lines ::: List(line)
  line = reader.readLine
}
I'm looking for serious speed improvement compared to the above approach. Thanks.
I suspect that your performance bottleneck is appending to a (linked) List in that loop:
list_of_lines = list_of_lines ::: List(line)
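Each ::: copies the entire left-hand list, so the total work grows with the square of the number of lines. With 30 million lines, it should take a few hundred trillion times longer to process all the lines than it takes to process one line. If the first iteration of that loop takes 1 ns, the whole loop should take around 4.5 × 10^14 ns, which is roughly five days.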
Switching to prepending to the List and then reversing at the end should improve the speed of your loop by a factor of more than a million:
while (line != null) {
  list_of_lines = line :: list_of_lines
  line = reader.readLine
}
list_of_lines = list_of_lines.reverse
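To make the difference between the two loops concrete, here is a back-of-the-envelope count. It is only a sketch under a simple cost model: appending with ::: to a list of length k copies k elements, prepending costs one step, and the final reverse costs one step per element. The names appendCopies and prependSteps are invented for this example.

```scala
// Elements copied when appending one line at a time with `:::`:
// the k-th append copies the k - 1 elements already in the list.
def appendCopies(n: Long): Long = n * (n - 1) / 2

// Prepending allocates one cell per line; the final reverse walks the list once.
def prependSteps(n: Long): Long = 2 * n

val n = 30000000L // 30 million lines, as in the question
val ratio = appendCopies(n) / prependSteps(n)
println(s"append copies: ${appendCopies(n)}, prepend steps: ${prependSteps(n)}, ratio: $ratio")
```

Under this model the ratio comes out in the millions, which matches the speed-up claimed above.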
List is also notoriously memory-inefficient, so for this many elements, it may also be worth doing something like this (which is also more idiomatic Scala):
import scala.io.{ Codec, Source }
val obj = amazonS3Client.getObject(bucket_name, file_name)
val source = Source.fromInputStream(obj.getObjectContent())(Codec.defaultCharsetCodec)
val lines = source.getLines().toVector
Vector being more memory-efficient than List should dramatically reduce GC thrashing.
The best way to achieve better performance is to use the TransferManager provided by the AWS Java SDKs. It's a high-level file transfer manager that automatically parallelises downloads. I'd recommend using SDK v2, but the same can be done with SDK v1. Be aware, though, that SDK v1 comes with limitations: only multipart files can be downloaded in parallel.
You need the following dependency (assuming you are using sbt with Scala). Note, however, that there's no benefit to using Scala over Java with the TransferManager.
libraryDependencies += "software.amazon.awssdk" % "s3-transfer-manager" % "2.17.243-PREVIEW"
Example (Java):
S3TransferManager transferManager = S3TransferManager.create();
FileDownload download =
    transferManager.downloadFile(b -> b.destination(Paths.get("myFile.txt"))
                                       .getObjectRequest(req -> req.bucket("bucket").key("key")));
download.completionFuture().join();
I recommend reading more on the topic here: Introducing Amazon S3 Transfer Manager in the AWS SDK for Java 2.x

Graphite showing rolling gap in data

I recently upgraded one of our Graphite instances from 0.9.2 to 1.1.1, and have since run into an issue where, for lack of a better term, there is a rolling gap in the data.
It shows the last few minutes correctly (I'm guessing what's in carbon cache), and after about 10-15 minutes past, it shows all of the data correctly as well.
However, inside that 10-15 minute gap, it's completely blank. I can see the gap both in Graphite, and in Grafana. It disappears after restarting carbon cache, and then comes back about a day later.
Example screenshot:
This happens for most graphs/dashboards I have.
I've spent a lot of effort optimizing disk IO, so I doubt that is the cause: CloudWatch shows 100% burst credit for the disk. It's an m3.xlarge instance with 4 cores and 16 GB RAM. The swap file is on ephemeral storage and looks barely utilized.
Using 1 Carbon Cache instance with Whisper backend.
storage_schemas.conf:
[carbon]
pattern = ^carbon\.
retentions = 60:90d
[dumbo]
pattern = ^collectd\.dumbo # load test containers, we don't care about their data
retentions = 300:1
[collectd]
pattern = ^collectd
retentions = 10s:8h,30s:1d,1m:3d,5m:30d,15m:90d
[statsite]
pattern = ^statsite
retentions = 10s:8h,30s:1d,1m:3d,5m:30d,15m:90d
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d
Non-default (or potentially relevant) carbon.conf settings:
[cache]
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 100 # was slagging disk write IO until I dropped it down from 500
MAX_CREATES_PER_MINUTE = 50
CACHE_WRITE_STRATEGY = sorted
RELAY_METHOD = rules
DESTINATIONS = 127.0.0.1:2004
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_QUEUE_SIZE = 10000
Graphite local_settings.py:
CARBONLINK_TIMEOUT = 10.0
CARBONLINK_QUERY_BULK = True
USE_WORKER_POOL = False
We've seen this with some workloads on 1.1.1. Can you try updating carbon to current master? If not, 1.1.2 will be released shortly, which should solve the problem.

Spark Local File Streaming - Fault tolerance

I am working on an application where some files are dropped into a file system every 30 seconds (possibly every 5 seconds). I have to read them, parse them, and push some records to Redis.
In each file all records are independent, and I am not doing any calculation that would require updateStateByKey.
My question: if a file is not processed completely because of some issue (e.g. a Redis connection issue or a data issue in the file), I want to reprocess it (say, n times) and also keep track of the files already processed.
For testing purposes I am reading from a local folder. I am also not sure how to determine that a file has been fully processed and mark it as completed (i.e. record in a text file or database that the file was processed).
val lines = ssc.textFileStream("E:\\SampleData\\GG")
val words = lines.map(x => x.split("_"))
words.foreachRDD { rdd =>
  rdd.foreach { fields =>
    val jedis = jPool.getResource()
    try {
      i = i + 1
      jedis.set("x" + i + "__" + fields(0) + "__" + fields(1), fields(2))
    } finally {
      jedis.close()
    }
  }
}
Spark has a fault-tolerance guide. Read more:
https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#fault-tolerance-semantics
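Independently of Spark's own guarantees, the "retry n times and remember what is done" part of the question can be sketched in plain Scala. This is only an illustration under assumptions: retry and FileTracker are hypothetical names, the set of completed files lives in memory in a single JVM (a real job would persist it to a file or database), and the handler stands in for whatever parses a file and pushes its records to Redis.

```scala
import scala.annotation.tailrec
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

// Runs `op` up to `maxAttempts` times and returns the first success,
// or the last failure if every attempt throws.
def retry[A](maxAttempts: Int)(op: => A): Try[A] = {
  require(maxAttempts >= 1, "maxAttempts must be at least 1")
  @tailrec
  def loop(attempt: Int): Try[A] = Try(op) match {
    case s @ Success(_)                      => s
    case Failure(_) if attempt < maxAttempts => loop(attempt + 1)
    case f                                   => f
  }
  loop(1)
}

// Remembers which files completed, so a file is never handled twice.
final class FileTracker(maxAttempts: Int) {
  private val done = mutable.Set.empty[String]

  def isDone(file: String): Boolean = done.contains(file)

  // Returns true if the file is (now) fully processed, false if all attempts failed.
  def process(file: String)(handler: String => Unit): Boolean =
    if (done.contains(file)) true
    else retry(maxAttempts)(handler(file)) match {
      case Success(_) => done += file; true
      case Failure(_) => false
    }
}
```

A file is marked as done only after the handler returns without throwing, so a handler that fails halfway will see the whole file again on the next attempt. The writes it performs therefore need to be idempotent, which holds for Redis SET with a key derived from the record.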

[Q/KDB+]: wsfull when creating splayed table from csv using `.Q.fs`

I have a 9.6GB csv file from which I would like to create an on-disk splayed table.
When I run this code, my 32-bit q process (on Win 10, 16GB RAM machine) runs out of memory ('wsfull) and crashes after creating an incomplete 4.68GB splayed table (see the screenshot).
path:{` sv (hsym x 0), 1_x}
symh: {`$1_ string x}
colnames: `ric`open`high`low`close`volume`date
dir: `:F:
db: `db
tbl: `ohlcv
tbldisk: path dir,db,tbl
tblsplayed: path dir,db,tbl,`
dbsympath: symh path dir,db
csvpath: `:F:/prices.csv
.Q.fs[{ .[ tblsplayed; (); ,; .Q.en[path dir,db] flip colnames!("SFFFFID";",")0:x]}] csvpath
What exactly is going on in the memory and on the disk behind the scene when reading the csv file with .Q.fs and 0:? Is the csv read row by row or column by column?
I thought that only the 132kB chunks are held in the memory at any given time, hoping that .Q.fs is 'wsfull resistant.
Is the q process actually taking in the whole column (splay) into memory, one at a time, as it increments the chunks?
Considering that: (according to this source, among others):
on 32-bit systems the main memory OLTP portion of a database is
limited to about 1GB of raw data, i.e. 1/4 of the address space
that would nearly explain running out of memory. As shown in this screenshot, taken right after the 'wsfull, a couple of columns are near the 1GB limit.
Here is a run with memory profiling:
.Q.fs[{ 0N!.Q.w[]; .[ tblsplayed; (); ,; .Q.en[path dir,db] flip colnames!("SFFFFID";",")0:x]}] csvpath
I believe q reads a csv row by row. Your q session probably crashes because memory is not cleared while this runs:
.Q.fs[{ .[ tblsplayed; (); ,; .Q.en[path dir,db] flip colnames!("SFFFFID";",")0:x]}] csvpath
Try adding .Q.gc[] to the function:
.Q.fs[{ .Q.gc[]; .[ tblsplayed; (); ,; .Q.en[path dir,db] flip colnames!("SFFFFID";",")0:x]}] csvpath

Best way to rate-limit downloads in Play Framework (Scala)

Problem: limit binary files download rate.
def test = {
  Logger.info("Call test action")
  val file = new File("/home/vidok/1.jpg")
  val fileIn = new FileInputStream(file)
  response.setHeader("Content-type", "application/force-download")
  response.setHeader("Content-Disposition", "attachment; filename=\"1.jpg\"")
  response.setHeader("Content-Length", file.length + "")
  val bufferSize = 1024 * 1024
  val bb = new Array[Byte](bufferSize)
  val bis = new java.io.BufferedInputStream(fileIn)
  var bytesRead = bis.read(bb, 0, bufferSize)
  while (bytesRead > 0) {
    bytesRead = bis.read(bb, 0, bufferSize)
    //sleep(1000)?
    response.writeChunk(bytesRead)
  }
}
But it's working only for text files. How can I make it work with binary files?
You've got the basic idea right: each time you've read a certain number of bytes (which are stored in your buffer) you need to:
evaluate how fast you've been reading (= X B/ms)
calculate the difference between X and how fast you should have been reading (= Y ms)
use sleep(Y) on the downloading thread if needed to slow the download rate down
There's already a great question about this right here that should have everything you need. I think especially the ThrottledInputStream solution (which is not the accepted answer) is rather elegant.
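The measure, compare, and sleep steps listed above can be sketched roughly as follows. This is a minimal illustration on plain java.io streams, not Play-specific code. The names sleepMillis and throttledCopy are made up for this example, and the pacing is a simple average over the whole transfer rather than anything more refined.

```scala
import java.io.{InputStream, OutputStream}

// How long to pause so that `bytesSoFar` bytes take at least
// bytesSoFar / bytesPerSecond seconds in total.
def sleepMillis(bytesSoFar: Long, elapsedMillis: Long, bytesPerSecond: Long): Long = {
  val expectedMillis = bytesSoFar * 1000 / bytesPerSecond
  math.max(0L, expectedMillis - elapsedMillis)
}

// Copies `in` to `out` as raw bytes, sleeping after each chunk whenever
// the transfer is ahead of the allowed rate. Returns the bytes copied.
def throttledCopy(in: InputStream, out: OutputStream,
                  bytesPerSecond: Long, bufferSize: Int = 8192): Long = {
  val buffer = new Array[Byte](bufferSize)
  val start = System.nanoTime()
  var total = 0L
  var read = in.read(buffer)
  while (read != -1) {
    out.write(buffer, 0, read) // write exactly the bytes just read
    total += read
    val elapsed = (System.nanoTime() - start) / 1000000
    val pause = sleepMillis(total, elapsed, bytesPerSecond)
    if (pause > 0) Thread.sleep(pause)
    read = in.read(buffer)
  }
  out.flush()
  total
}
```

Because only the first `read` bytes of the buffer are written, and they are written as bytes rather than converted to text, the same loop works for binary and text files alike. Thread.sleep blocks the calling thread, so this fits the one-thread-per-download setup discussed here.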
A couple of points to keep in mind:
Downloading using 1 thread for everything is the simplest way, however it's also the least efficient way if you want to keep serving requests.
Usually, you'll want to at least offload the actual downloading of a file to its own separate thread.
To speed things up: consider downloading files in chunks (using HTTP Content-Range) and Java NIO. However, keep in mind that this will make things a lot more complex.
I wouldn't implement something that any good web server should be able to do for me. In enterprise systems this kind of thing is normally handled by a web entry server or firewall. But if you have to do this, then the answer by tmbrggmn looks good to me. NIO is a good tip.