Async reading of large Cassandra table using Scala / Phantom-DSL - scala

I have an issue reading a table containing >800k rows. I need to read the rows from top to bottom in order to process them.
I use Scala and Phantom for the purpose.
Here is how my table looks.
CREATE TABLE raw (
    id uuid PRIMARY KEY,
    b1 text,
    b2 timestamp,
    b3 text,
    b4 text,
    b5 text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
So far I've tried to read the table using:
def getAllRecords : Future[Seq[row]] = select.fetch
or the fancier Play Enumerator, combined with an Iteratee:
def getAllRecords : Enumerator = select.fetchEnumerator
Neither of these works; it seems like Cassandra / the driver / my program always tries to read all records upfront. What am I missing here?

Have you tried reviewing the code in the bigger read tests?
class IterateeBigReadPerformanceTest extends BigTest with ScalaFutures {

  it should "read the correct number of records found in the table" in {
    val counter: AtomicLong = new AtomicLong(0)
    val result = TestDatabase.primitivesJoda.select
      .fetchEnumerator run Iteratee.forEach {
        r => counter.incrementAndGet()
      }

    result.successful {
      query => {
        info(s"done, reading: ${counter.get}")
        counter.get() shouldEqual 2000000
      }
    }
  }
}
This does not read your records upfront. In fact, we have tests that run for more than an hour to guarantee that GC pauses stay reasonable, there is no GC overhead, permgen/metaspace pressure remains within bounds, etc.
If anything has indeed changed, it is only in error, but this should still work.
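Applied to your own table, the same streaming pattern might look roughly like this (a minimal sketch: it assumes the method lives inside your table class so that select is in scope, that RawRow stands in for your row type, and that the usual implicit session, keyspace and execution context are available; exact import paths vary between phantom versions):
import com.websudos.phantom.dsl._
import com.websudos.phantom.iteratee.Iteratee

// Stream the rows and handle them one by one; nothing is materialised as a Seq upfront.
def processAllRecords(handle: RawRow => Unit): Future[Unit] = {
  select.fetchEnumerator run Iteratee.forEach { row =>
    handle(row)
  }
}
The enumerator relies on the driver's result-set paging (the fetch size) rather than loading everything at once, which is what keeps memory bounded for a table of this size.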

Related

Performance of PySpark DataFrames vs Glue DynamicFrames

So I recently started using Glue and PySpark for the first time. The task was to create a Glue job that does the following:
Load data from parquet files residing in an S3 bucket
Apply a filter to the data
Add a column, the value of which is derived from 2 other columns
Write the result to S3
Since the data is going from S3 to S3, I assumed that Glue DynamicFrames should be a decent fit for this, and I came up with the following code:
def AddColumn(r):
    if r["option_type"] == 'S':
        r["option_code_derived"] = 'S' + r["option_code_4"]
    elif r["option_type"] == 'P':
        r["option_code_derived"] = 'F' + r["option_code_4"][1:]
    elif r["option_type"] == 'L':
        r["option_code_derived"] = 'P' + r["option_code_4"]
    else:
        r["option_code_derived"] = None
    return r
glueContext = GlueContext(create_spark_context(role_arn=args['role_arn']))
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": [source_path], "recurse" : True}, format = source_format, additional_options = {"useS3ListImplementation":True})
filtered_gdf = Filter.apply(frame = inputGDF, f = lambda x: x["my_filter_column"] in ['50','80'])
additional_column_gdf = Map.apply(frame = filtered_gdf, f = AddColumn)
gdf_mapped = ApplyMapping.apply(frame = additional_column_gdf, mappings = mappings, transformation_ctx = "gdf_mapped")
glueContext.purge_s3_path(full_target_path_purge, {"retentionPeriod": 0})
outputGDF = glueContext.write_dynamic_frame.from_options(frame = gdf_mapped, connection_type = "s3", connection_options = {"path": full_target_path}, format = target_format)
This works but takes a very long time (just short of 10 hours with 20 G1.X workers).
Now, the dataset is quite large (almost 2 billion records, over 400 GB), but this was still unexpected (to me at least).
Then I gave it another try, this time with PySpark DataFrames instead of DynamicFrames.
The code looks like the following:
glueContext = GlueContext(create_spark_context(role_arn=args['role_arn'], source_bucket=args['s3_source_bucket'], target_bucket=args['s3_target_bucket']))
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = spark.read.parquet(full_source_path)
df_filtered = df.filter( (df.model_key_status == '50') | (df.model_key_status == '80') )
df_derived = df_filtered.withColumn('option_code_derived',
when(df_filtered.option_type == "S", concat(lit('S'), df_filtered.option_code_4))
.when(df_filtered.option_type == "P", concat(lit('F'), df_filtered.option_code_4[2:42]))
.when(df_filtered.option_type == "L", concat(lit('P'), df_filtered.option_code_4))
.otherwise(None))
glueContext.purge_s3_path(full_purge_path, {"retentionPeriod": 0})
df_reorderered = df_derived.select(target_columns)
df_reorderered.write.parquet(full_target_path, mode="overwrite")
This also works, but with otherwise identical settings (20 workers of type G1.X, same dataset), this takes less than 20 minutes.
My question is: Where does this massive difference in performance between DynamicFrames and DataFrames come from? Was I doing something fundamentally wrong in the first try?

How to use nn.MultiheadAttention together with nn.LSTM?

I'm trying to build a PyTorch network for image captioning.
Currently I have a working network with an Encoder and a Decoder, and I want to add an nn.MultiheadAttention layer to it (to be used as self-attention).
Currently my decoder looks like this:
class Decoder(nn.Module):
    def __init__(self, hidden_size, embed_dim, vocab_size, layers=1):
        super(Decoder, self).__init__()
        self.embed_dim = embed_dim
        self.vocab_size = vocab_size
        self.layers = layers
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_size, batch_first=True, num_layers=layers)
        #self.attention = nn.MultiheadAttention(hidden_size, num_heads=1, batch_first=True)
        self.fc = nn.Linear(hidden_size, self.vocab_size)

    def init_hidden(self, batch_size):
        h = torch.zeros(self.layers, batch_size, self.hidden_size).to(device)
        c = torch.zeros(self.layers, batch_size, self.hidden_size).to(device)
        return h, c

    def forward(self, features, caption):
        batch_size = caption.size(0)
        caption_size = caption.size(1)
        h, c = self.init_hidden(batch_size)
        embeddings = self.embedding(caption)
        lstm_input = torch.cat((features.unsqueeze(1), embeddings[:, :-1, :]), dim=1)
        output, (h, c) = self.lstm(lstm_input, (h, c))
        #output, _ = self.attention(output, output, output)
        output = self.fc(output)
        return output

    def generate_caption(self, features, max_caption_size=MAX_LEN):
        h, c = self.init_hidden(1)
        caption = ""
        embeddings = features.unsqueeze(1)
        for i in range(max_caption_size):
            output, (h, c) = self.lstm(embeddings, (h, c))
            #output, _ = self.attention(output, output, output)
            output = self.fc(output)
            _, word_index = torch.max(output, dim=2)  # take the word with highest probability
            if word_index == vocab.get_index(END_WORD):
                break
            caption += vocab.get_word(word_index) + " "
            embeddings = self.embedding(torch.LongTensor([word_index]).view(1, -1).to(device))
        return caption
and it gives relatively good results for image captioning.
I want to add the commented-out lines so that the model uses attention. But when I do that, the model breaks: although the loss becomes extremely low (decreasing from 2.7 to 0.2 during training, instead of 2.7 to 1 without the attention), caption generation does not really work (it predicts the same word over and over again).
My questions are:
Am I using nn.MultiheadAttention correctly? It seems very weird to me that it should be used after the LSTM, but I saw this online, and it works from a dimension-sizes perspective.
Any idea why my model breaks when I use attention?
EDIT: I also tried to put the attention before the LSTM, and it didn't work either (the network predicted the same caption for every picture).

Exclude items from training set data

I have my data in two tables, colors and excluded_colors.
colors contains all colors.
excluded_colors contains some colors that I wish to exclude from my training set.
I am trying to split the data into a training and testing set and ensure that the colors in excluded_colors are not in my training set but exist in the testing set.
In order to achieve the above, I did this
val colors = spark.sql("""
  select colors.*
  from colors
  LEFT JOIN excluded_colors
    ON excluded_colors.color_id = colors.color_id
  where excluded_colors.color_id IS NULL
""")

val trainer: (Int => Int) = (arg: Int) => 0
val sqlTrainer = udf(trainer)
val tester: (Int => Int) = (arg: Int) => 1
val sqlTester = udf(tester)

val splits = colors.randomSplit(Array(0.7, 0.3))
val train_colors = splits(0).select("color_id").withColumn("test", sqlTrainer(col("color_id")))
val test_colors = splits(1).select("color_id").withColumn("test", sqlTester(col("color_id")))
However, I'm realizing that by doing the above the colors in excluded_colors are completely ignored. They are not even in my testing set.
Question
How can I split the data 70/30 while also ensuring that the colors in excluded_colors are not in the training set but are present in the testing set?
What we want to do is remove the "excluded colors" from the training set, keep them in the testing set, and still end up with a 70/30 training/test split.
What we need is a bit of math.
Given the total dataset (TD) and the excluded-colors dataset (E), we can say for the train dataset (Tr) and the test dataset (Ts) that:
|Tr| = x * (|TD| - |E|)
|Ts| = |E| + (1 - x) * (|TD| - |E|)
We also know that |Tr| = 0.7 * |TD|.
Hence x = 0.7 * |TD| / (|TD| - |E|).
Now that we know the sampling fraction x, we can say:
Tr = (TD - E).sample(withReplacement = false, fraction = x)
// where (TD - E) is the result of the SQL expression above
Ts = TD.sample(withReplacement = false, fraction = 0.3)
// we sample the test set from the original dataset
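A rough Spark sketch of the formulas above (allColors and excludedColors are placeholder names for the full dataset TD and the exclusions E; colors is the anti-joined DataFrame from the question). Unlike the independent 0.3 sample above, this variant takes the complement of the training sample, which guarantees that every excluded color ends up in the test set:
val total = allColors.count().toDouble          // |TD|
val excluded = excludedColors.count().toDouble  // |E|
val x = 0.7 * total / (total - excluded)        // sampling fraction for Tr

// Tr: sample from (TD - E), i.e. from the anti-joined colors DataFrame
val train = colors.sample(withReplacement = false, fraction = x)

// Ts: everything not drawn into the training set, so it contains all the excluded colors
val test = allColors.except(train)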

Scope of 'spark.driver.maxResultSize'

I'm running a Spark job to aggregate data. I have a custom data structure called a Profile, which basically contains a mutable.HashMap[Zone, Double]. I want to merge all profiles that share a given key (a UUID), with the following code:
def merge = (up1: Profile, up2: Profile) => { up1.addWeights(up2); up1 }
val aggregated = dailyProfiles
  .aggregateByKey(new Profile(), 3200)(merge, merge).cache()
Curiously, Spark fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 116318 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
The obvious solution is to increase spark.driver.maxResultSize, but two things puzzle me.
It is too much of a coincidence that I get 1024.0 reported as bigger than 1024.0.
All the documentation and help I found while googling this particular error and configuration parameter indicates that it affects functions that bring a value back to the driver (say take() or collect()), but I'm not bringing ANYTHING back to the driver; I'm just reading from HDFS, aggregating, and saving back to HDFS.
Does anyone know why I'm getting this error?
Yes, it's failing because the values we see in the exception message are rounded to one decimal place, while the comparison happens in bytes.
The serialized output must therefore be more than 1024.0 MB and less than 1024.1 MB.
Check the attached Apache Spark code snippets; it's very interesting, and quite rare, to get this error. :)
Here totalResultSize > maxResultSize compares two Long values that hold the sizes in bytes, but msg holds the rounded value from Utils.bytesToString().
// TaskSetManager.scala
def canFetchMoreResults(size: Long): Boolean = sched.synchronized {
  totalResultSize += size
  calculatedTasks += 1
  if (maxResultSize > 0 && totalResultSize > maxResultSize) {
    val msg = s"Total size of serialized results of ${calculatedTasks} tasks " +
      s"(${Utils.bytesToString(totalResultSize)}) is bigger than spark.driver.maxResultSize " +
      s"(${Utils.bytesToString(maxResultSize)})"
    logError(msg)
    abort(msg)
    false
  } else {
    true
  }
}
Apache Spark 1.3 - source
// Utils.scala
def bytesToString(size: Long): String = {
  val TB = 1L << 40
  val GB = 1L << 30
  val MB = 1L << 20
  val KB = 1L << 10

  val (value, unit) = {
    if (size >= 2*TB) {
      (size.asInstanceOf[Double] / TB, "TB")
    } else if (size >= 2*GB) {
      (size.asInstanceOf[Double] / GB, "GB")
    } else if (size >= 2*MB) {
      (size.asInstanceOf[Double] / MB, "MB")
    } else if (size >= 2*KB) {
      (size.asInstanceOf[Double] / KB, "KB")
    } else {
      (size.asInstanceOf[Double], "B")
    }
  }
  "%.1f %s".formatLocal(Locale.US, value, unit)
}
Apache Spark 1.3 - source
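To make the rounding concrete, here is a small self-contained sketch (the byte values are invented for illustration) showing how two different byte counts can both format as "1024.0 MB":
import java.util.Locale

val maxResultSize   = 1L << 30             // 1073741824 bytes, i.e. the 1g default
val totalResultSize = (1L << 30) + 50000L  // larger in bytes, yet it rounds to the same string

def mb(size: Long): String = "%.1f MB".formatLocal(Locale.US, size.toDouble / (1L << 20))

println(mb(totalResultSize)) // 1024.0 MB
println(mb(maxResultSize))   // 1024.0 MB  -> hence "(1024.0 MB) is bigger than ... (1024.0 MB)"
Raising spark.driver.maxResultSize (for example to 2g), or setting it to 0 to disable the limit entirely, makes the error go away at the cost of allowing larger serialized results to accumulate on the driver.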

Paradoxical timing functions

I have a function to compute the time spent in a block:
import collection.mutable.{Map => MMap}

var timeTotalMap = MMap[String, Long]()
var numMap = MMap[String, Float]()
var averageMsMap = MMap[String, Float]()

def time[T](key: String)(block: => T): T = {
  val start = System.nanoTime()
  val res = block
  val total = System.nanoTime - start
  timeTotalMap(key) = timeTotalMap.getOrElse(key, 0L) + total
  numMap(key) = numMap.getOrElse(key, 0f) + 1
  averageMsMap(key) = timeTotalMap(key) / 1000000000f / numMap(key)
  res
}
I am timing both a function and the single place where it is called. The call site is timed with:
time("outerpos") { intersectPos(a, b) }
and the function itself starts with:
def intersectPos(p1: SPosition, p2: SPosition)(implicit unify: Option[Identifier] = None): SPosition =
  time("innerpos") { (p1, p2) match {
    ...
  }}
When I display the nano times for each key (timeTotalMap), I get (spaces added for legibility):
outerpos-> 37 870 034 714
innerpos-> 53 647 956
It means that the total execution time of innerpos is roughly 1000 times smaller than that of outerpos.
What?! There is a factor of 1000 between the two? It says that the outer call takes 1000x more time than all the inner calls combined. Am I missing something, or is there a memory-leaking ghost somewhere?
Update
When I compare the number of executions for each block (numMap), I find the following:
outerpos -> 362878
innerpos -> 21764
This is paradoxical. Even if intersectPos is called somewhere else, shouldn't the number of times it is called be at least as great as the number of times outerpos is called?
EDIT
If I move the line numMap(key) = numMap.getOrElse(key, 0f) + 1 to the top of the time function, then these numbers become approximately equal.
nanoTime on the JVM is considered safe but not very accurate. Here are some good reads:
Is System.nanoTime() completely useless?
System.currentTimeMillis vs System.nanoTime
Basically your test will suffer from timer inaccuracy. One way to work around that would be to run the time("outerpos") block for quite a long time (note that JIT compiler optimization might kick in at some point) and measure the start and end times only once. Take an average and repeat with time("innerpos"). That's all I could think of. Still not the best test ever ;)
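As a rough sketch of that idea (the helper name and iteration count are arbitrary): warm the block up so the JIT can do its work, then take a single start/end measurement around many iterations and divide by the count:
def benchmark[T](label: String, iterations: Int)(block: => T): Unit = {
  // warm-up phase so JIT compilation happens before we start measuring
  var i = 0
  while (i < iterations / 10) { block; i += 1 }

  // single start/end measurement around many iterations
  val start = System.nanoTime()
  i = 0
  while (i < iterations) { block; i += 1 }
  val elapsed = System.nanoTime() - start

  println(s"$label: ${elapsed / iterations} ns per call on average")
}
For example: benchmark("outerpos", 100000) { intersectPos(a, b) }, and the same again around the body of intersectPos for the inner measurement.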