Speed up geomesa query - scala

I've been testing GeoMesa with simple spatial queries and comparing it with PostGIS. For example, this SQL query runs in 30 seconds in PostGIS:
with series as (
select
generate_series(0, 5000) as i
),
points as (
select ST_Point(i, i*2) as geom from series
)
select st_distance(a.geom, b.geom) from points as a, points as b
Now, the following GeoMesa version takes 5 minutes (using -Xmx10g):
import org.apache.spark.sql.SparkSession
import org.locationtech.geomesa.spark.jts._
import org.locationtech.jts.geom._
object HelloWorld {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder
.config("spark.sql.crossJoin.enabled", "true")
.config("spark.executor.memory", "12g")
.config("spark.driver.memory", "12g")
.config("spark.cores.max", "4")
.master("local")
.appName("Geomesa")
.getOrCreate()
spark.withJTS
import spark.implicits._
val x = 0 until 5000
val y = for (i <- x) yield i*2
val coords = for ((i, n) <- x.zipWithIndex) yield (i, y(n))
val points = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
val points2 = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
val all_points = for {
i <- points
j <- points2} yield (i, j)
val df = all_points.toDF("point", "point2")
val df2 = df.withColumn("dist", st_distance($"point", $"point2"))
df2.show()
}
}
I'd have expected similar or better performance from GeoMesa. What can be done to tune a query like this?
FIRST EDIT
As Emilio suggests, this is not really a query but a computation.
This computation could have been written without Spark. The code below runs in less than two seconds:
import org.locationtech.jts.geom._
object HelloWorld {
def main(args: Array[String]): Unit = {
val x = 0 until 5000
val y = for (i <- x) yield i*2
val coords = for ((i, n) <- x.zipWithIndex) yield (i, y(n))
val points = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
val points2 = for {
i <- points
j <- points} yield i.distance(j)
println(points2.slice(0,30))
}
}

GeoMesa is not going to be as fast as PostGIS for small amounts of data. GeoMesa is designed for distributed, NoSQL databases. If your dataset fits in PostGIS, you should probably just use PostGIS. Once you start hitting the limits of PostGIS, you should consider using GeoMesa. GeoMesa does offer integration with arbitrary GeoTools data stores (including PostGIS), which can make some of the GeoMesa Spark and command-line features available to PostGIS.
For your particular snippet, I suspect that most of the time is spent spinning up an RDD and running through the loops. There isn't really a 'query', as you are just running a pair-wise calculation. If you are querying data stored in a table, then GeoMesa has a chance to optimize the scan. However, GeoMesa isn't a SQL database, and doesn't have any native support for joins. Generally the join is done in memory by Spark, although there are some things you can do to speed it up (e.g. a broadcast join or RDD partitioning). If you want to do complex spatial joins, you might want to check out GeoSpark and/or Magellan, which specialize in spatial Spark operations.

Related

spark scala: Performance degrade with simple UDF over large number of columns

I have a dataframe with 100 million rows and ~ 10,000 columns. The columns are of two types, standard (C_i) followed by dynamic (X_i). This dataframe was obtained after some processing, and the performance was fast. Now only 2 steps remain:
Goal:
A particular operation needs to be done on every X_i using identical subset of C_i columns.
Convert each X_i column to FloatType.
Difficulty:
Performance degrades terribly with increasing number of columns.
After a while, only 1 executor seems to work (%CPU use < 200%), even on a sample data with 100 rows and 1,000 columns. If I push it to 1,500 columns, it crashes.
Minimal code:
import spark.implicits._
import org.apache.spark.sql.functions.{col, udf} // needed for col(...) and udf(...) below
import org.apache.spark.sql.types.FloatType
// sample_udf
val foo = (s_val: String, t_val: String) => {
t_val + s_val.takeRight(1)
}
val foos_udf = udf(foo)
spark.udf.register("foos_udf", foo)
val columns = Seq("C1", "C2", "X1", "X2", "X3", "X4")
val data = Seq(("abc", "212", "1", "2", "3", "4"),("def", "436", "2", "2", "1", "8"),("abc", "510", "1", "2", "5", "8"))
val rdd = spark.sparkContext.parallelize(data)
var df = spark.createDataFrame(rdd).toDF(columns:_*)
df.show()
for (cols <- df.columns.drop(2)) {
df = df.withColumn(cols, foos_udf(col("C2"),col(cols)))
}
df.show()
for (cols <- df.columns.drop(2)) {
df = df.withColumn(cols,col(cols).cast(FloatType))
}
df.show()
Error on 1,500 column data:
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.isStreaming(LogicalPlan.scala:37)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$isStreaming$1.apply(LogicalPlan.scala:37)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$isStreaming$1.apply(LogicalPlan.scala:37)
at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
at scala.collection.immutable.List.exists(List.scala:84)
...
Thoughts:
Perhaps var could be replaced, but the size of the data is close to 40% of the RAM.
Perhaps the for loop used for type casting is degrading performance, though I can't see how, or what the alternatives are. From searching the internet, I have seen people suggest a foldLeft-based approach, but that apparently still gets translated to a for loop internally.
Any inputs on this would be greatly appreciated.
A faster solution was to call the UDF on the row itself rather than on each column. As Spark stores data as rows, the earlier approach exhibited terrible performance.
def my_udf(names: Array[String]) = udf[String,Row]((r: Row) => {
val row = Array.ofDim[String](names.length)
for (i <- 0 until row.length) {
row(i) = r.getAs(i)
}
...
}
...
val df2 = df1.withColumn(results_col,my_udf(df1.columns)(struct("*"))).select(col(results_col))
Type casting can be done as suggested by Riccardo
Not sure if this will fix the performance on your side with ~10,000 columns, but I was able to run it locally with 1,500 columns using the following code.
I addressed points #1 and #2, which may have had some impact on performance. One note: to my understanding, foldLeft should be a pure recursive function without an internal for loop, so it might have an impact on performance in this case.
Also, the two for loops can be simplified into a single loop, which I refactored as a foldLeft.
We might also get a performance increase if we replace the UDF with a built-in Spark function.
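import spark.implicits._
import org.apache.spark.sql.Row // used to build the sample rows
import org.apache.spark.sql.types.{FloatType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._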
// sample_udf
val foo = (s_val: String, t_val: String) => {
t_val + s_val.takeRight(1)
}
val foos_udf = udf(foo)
spark.udf.register("foos_udf", foo)
val numberOfColumns = 1500
val numberOfRows = 100
val colNames = (1 to numberOfColumns).map(s => s"X$s")
val colValues = (1 to numberOfColumns).map(_.toString)
val columns = Seq("C1", "C2") ++ colNames
val schema = StructType(columns.map(field => StructField(field, StringType)))
val rowFields = Seq("abc", "212") ++ colValues
val listOfRows = (1 to numberOfRows).map(_ => Row(rowFields: _*))
val listOfRdds = spark.sparkContext.parallelize(listOfRows)
val df = spark.createDataFrame(listOfRdds, schema)
df.show()
val newDf = df.columns.drop(2).foldLeft(df)((df, colName) => {
df.withColumn(colName, foos_udf(col("C2"), col(colName)) cast FloatType)
})
newDf.show()
Hope this helps!
*** EDIT
Found a much better solution that avoids loops. Simply build a single list of expressions and pass it to selectExpr; this way Spark casts all columns in one go without any kind of recursion. Continuing from my previous example:
Instead of doing the foldLeft, just replace it with these lines. I just tested it with 10k columns and 100 rows on my local computer, and it took a few seconds.
val selectExpression = Seq("C1", "C2") ++ colNames.map(s => s"cast($s as float)")
val newDf = df.selectExpr(selectExpression:_*)
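The same single-projection idea can also carry the UDF call, since the code above registers the function under the SQL name foos_udf. The sketch below only builds the expression strings; buildExpressions and its parameter names are hypothetical, and it assumes the column names are plain identifiers that need no backtick quoting.

```scala
// Hypothetical helper: builds one SQL expression per output column so that
// the UDF call and the cast both happen inside a single selectExpr projection.
object SelectExprBuilder {
  def buildExpressions(
      fixedCols: Seq[String],
      dynamicCols: Seq[String],
      udfName: String = "foos_udf", // assumed to be registered via spark.udf.register
      udfArgCol: String = "C2"      // the standard column passed as first UDF argument
  ): Seq[String] =
    fixedCols ++ dynamicCols.map(c => s"cast($udfName($udfArgCol, $c) as float) as $c")
}

// Usage with the DataFrame from the example above (needs a running SparkSession):
//   val exprs = SelectExprBuilder.buildExpressions(Seq("C1", "C2"), colNames)
//   val newDf = df.selectExpr(exprs: _*)
```

Because every column is produced by one projection, the logical plan stays flat regardless of how many X columns there are, which is the property the selectExpr approach relies on.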

Increase of hash tables in MinHashLSH, decreases accuracy and f1

I have used MinHashLSH with approxSimilarityJoin in Scala and Spark 2.4 to find edges in a network, i.e. link prediction based on document similarity. My problem is that as I increase the number of hash tables in MinHashLSH, my accuracy and F1 score decrease. Everything I have read about this algorithm suggests that this should not happen, so I assume I have an issue somewhere.
I have tried different numbers of hash tables and different Jaccard similarity thresholds, but I have the same exact problem: the accuracy decreases rapidly. I have also tried different samplings of my dataset and nothing changed. My workflow goes like this: I concatenate all the text columns of my DataFrame (title, authors, journal and abstract) and tokenize the concatenated column into words. Then I use a CountVectorizer to transform this "bag of words" into vectors. Next, I pass this column to MinHashLSH with some number of hash tables, and finally I do an approxSimilarityJoin to find similar "papers" whose distance is under my given threshold. My implementation is the following.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import UnsupervisedLinkPrediction.BroutForce.join
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, when}
import org.apache.spark.sql.types._
object lsh {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR) // show only errors
// val cores=args(0).toInt
// val partitions=args(1).toInt
// val hashTables=args(2).toInt
// val limit = args(3).toInt
// val threshold = args(4).toDouble
val cores="*"
val partitions=1
val hashTables=16
val limit = 1000
val jaccardDistance = 0.89
val master = "local["+cores+"]"
val ss = SparkSession.builder().master(master).appName("MinHashLSH").getOrCreate()
val sc = ss.sparkContext
val inputFile = "resources/data/node_information.csv"
println("reading from input file: " + inputFile)
println
val schemaStruct = StructType(
StructField("id", IntegerType) ::
StructField("pubYear", StringType) ::
StructField("title", StringType) ::
StructField("authors", StringType) ::
StructField("journal", StringType) ::
StructField("abstract", StringType) :: Nil
)
// Read the contents of the csv file in a dataframe. The csv file contains a header.
// var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
papers = papers.repartition(partitions) // repartition returns a new DataFrame, so keep the result
println("papers.rdd.getNumPartitions: "+papers.rdd.getNumPartitions)
import ss.implicits._
// Read the original graph edges, ground truth
val originalGraphDF = sc.textFile("resources/data/Cit-HepTh.txt").map(line => {
val fields = line.split("\t")
(fields(0), fields(1))
}).toDF("nodeA_id", "nodeB_id").cache()
val originalGraphCount = originalGraphDF.count()
println("Ground truth count: " + originalGraphCount )
val nullAuthor = ""
val nullJournal = ""
val nullAbstract = ""
papers = papers.na.fill(nullAuthor, Seq("authors"))
papers = papers.na.fill(nullJournal, Seq("journal"))
papers = papers.na.fill(nullAbstract, Seq("abstract"))
papers = papers.withColumn("nonNullAbstract", when(col("abstract") === nullAbstract, col("title")).otherwise(col("abstract")))
papers = papers.drop("abstract").withColumnRenamed("nonNullAbstract", "abstract")
papers.show(false)
val filteredGt= originalGraphDF.as("g").join(papers.as("p"),(
$"g.nodeA_id" ===$"p.id") || ($"g.nodeB_id" ===$"p.id")
).select("g.nodeA_id","g.nodeB_id").distinct().cache()
filteredGt.show()
val filteredGtCount = filteredGt.count()
println("Filtered GroundTruth count: "+ filteredGtCount)
//TOKENIZE
val tokPubYear = new Tokenizer().setInputCol("pubYear").setOutputCol("pubYear_words")
val tokTitle = new Tokenizer().setInputCol("title").setOutputCol("title_words")
val tokAuthors = new RegexTokenizer().setInputCol("authors").setOutputCol("authors_words").setPattern(",")
val tokJournal = new Tokenizer().setInputCol("journal").setOutputCol("journal_words")
val tokAbstract = new Tokenizer().setInputCol("abstract").setOutputCol("abstract_words")
println("Setting pipeline stages...")
val stages = Array(
tokPubYear, tokTitle, tokAuthors, tokJournal, tokAbstract
// rTitle, rAuthors, rJournal, rAbstract
)
val pipeline = new Pipeline()
pipeline.setStages(stages)
println("Transforming dataframe\n")
val model = pipeline.fit(papers)
papers = model.transform(papers)
println(papers.count())
papers.show(false)
papers.printSchema()
val udf_join_cols = udf(join(_: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String]))
val joinedDf = papers.withColumn(
"paper_data",
udf_join_cols(
papers("pubYear_words"),
papers("title_words"),
papers("authors_words"),
papers("journal_words"),
papers("abstract_words")
)
).select("id", "paper_data").cache()
joinedDf.show(5,false)
val vocabSize = 1000000
val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("paper_data").setOutputCol("features").setVocabSize(vocabSize).setMinDF(10).fit(joinedDf)
val isNoneZeroVector = udf({v: Vector => v.numNonzeros > 0}, DataTypes.BooleanType)
val vectorizedDf = cvModel.transform(joinedDf).filter(isNoneZeroVector(col("features"))).select(col("id"), col("features"))
vectorizedDf.show()
val mh = new MinHashLSH().setNumHashTables(hashTables)
.setInputCol("features").setOutputCol("hashValues")
val mhModel = mh.fit(vectorizedDf)
mhModel.transform(vectorizedDf).show()
vectorizedDf.createOrReplaceTempView("vecDf")
println("MinHashLSH.getHashTables: "+mh.getNumHashTables)
val dfA = ss.sqlContext.sql("select id as nodeA_id, features from vecDf").cache()
dfA.show(false)
val dfB = ss.sqlContext.sql("select id as nodeB_id, features from vecDf").cache()
dfB.show(false)
val predictionsDF = mhModel.approxSimilarityJoin(dfA, dfB, jaccardDistance, "JaccardDistance").cache()
println("Predictions:")
val predictionsCount = predictionsDF.count()
predictionsDF.show()
println("Predictions count: "+predictionsCount)
predictionsDF.createOrReplaceTempView("predictions")
val pairs = ss.sqlContext.sql("select datasetA.nodeA_id, datasetB.nodeB_id, JaccardDistance from predictions").cache()
pairs.show(false)
val totalPredictions = pairs.count()
println("Properties:\n")
println("Jaccard distance threshold: "+jaccardDistance+"\n") // `threshold` is not defined in this version
println("Hash tables: "+hashTables+"\n")
println("Ground truth: "+filteredGtCount)
println("Total edges found: "+totalPredictions +" \n")
println("EVALUATION PROCESS STARTS\n")
println("Calculating true positives...\n")
val truePositives = filteredGt.as("g").join(pairs.as("p"),
($"g.nodeA_id" === $"p.nodeA_id" && $"g.nodeB_id" === $"p.nodeB_id") || ($"g.nodeA_id" === $"p.nodeB_id" && $"g.nodeB_id" === $"p.nodeA_id")
).cache().count()
println("True Positives: "+truePositives+"\n")
println("Calculating false positives...\n")
val falsePositives = predictionsCount - truePositives
println("False Positives: "+falsePositives+"\n")
println("Calculating true negatives...\n")
val pairsPerTwoCount = (limit *(limit - 1)) / 2
val trueNegatives = (pairsPerTwoCount - truePositives) - falsePositives
println("True Negatives: "+trueNegatives+"\n")
val falseNegatives = filteredGtCount - truePositives
println("False Negatives: "+falseNegatives)
val truePN = (truePositives+trueNegatives).toFloat
println("TP + TN sum: "+truePN+"\n")
val sum = (truePN + falseNegatives+ falsePositives).toFloat
println("TP +TN +FP+ FN sum: "+sum+"\n")
val accuracy = (truePN/sum).toFloat
println("Accuracy: "+accuracy+"\n")
val precision = truePositives.toFloat / (truePositives+falsePositives).toFloat
val recall = truePositives.toFloat/(truePositives+falseNegatives).toFloat
val f1Score = 2*(recall*precision)/(recall+precision).toFloat
println("F1 score: "+f1Score+"\n")
ss.stop()
}
}
I forgot to mention that I am running this code on a cluster with 40 cores and 64 GB of RAM. Note that Spark's approxSimilarityJoin works with Jaccard distance, not with the Jaccard index. So the similarity threshold I provide is a Jaccard distance, which in my case is jaccardDistance = 1 - threshold (where threshold is the Jaccard index).
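The relationship between the two quantities can be sketched on plain Scala sets. This is only an illustration of the definitions, not Spark's implementation; the object and method names are hypothetical, and treating two empty sets as identical is a convention chosen for this sketch.

```scala
object Jaccard {
  // Jaccard index: |A ∩ B| / |A ∪ B|.
  // Convention for this sketch: two empty sets have index 1.0.
  def index[T](a: Set[T], b: Set[T]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 1.0 else (a intersect b).size.toDouble / unionSize
  }

  // Jaccard distance is the complement of the index.
  def distance[T](a: Set[T], b: Set[T]): Double = 1.0 - index(a, b)
}

// Example: two token sets sharing 2 of 4 distinct tokens.
//   Jaccard.index(Set("spark", "lsh", "graph"), Set("spark", "lsh", "hash"))
//   Jaccard.distance(Set("spark", "lsh", "graph"), Set("spark", "lsh", "hash"))
```

Under this definition, a distance threshold of 0.89 corresponds to keeping pairs whose Jaccard index is above roughly 0.11, which is a fairly permissive cut-off.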
I was expecting to get a higher accuracy and F1 score as I increase the number of hash tables. Do you have any idea what my issue is?
Thank you all in advance!
There are multiple visible problems here, and probably more hidden ones, so just to enumerate a few:
LSH is not really a classifier, and attempting to evaluate it as one doesn't make much sense, even if you assume that text similarity is somehow a proxy for citation (which is a big if).
If the problem were to be framed as a classification problem, it should be treated as multi-label classification (each paper can cite or be cited by multiple sources), not multi-class classification; hence simple accuracy is not meaningful.
Even if it were classification and could be evaluated as such, your calculations don't include the actual negatives, i.e. the pairs which don't meet the threshold of approxSimilarityJoin.
Also, setting the threshold to 1 restricts joins to either exact matches or cases of hash collisions, hence the preference towards LSH with higher collision rates.
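How the number of hash tables affects collision rates can be sketched with the usual OR-amplification formula. This assumes each table uses one independent MinHash function, so that two sets with Jaccard similarity s collide in a given table with probability s; whether that matches the exact Spark version in use is an assumption, and the names below are hypothetical.

```scala
object OrAmplification {
  // Probability that two sets with Jaccard similarity `similarity` share a hash
  // value in at least one of `numTables` independent tables: 1 - (1 - s)^L.
  def candidateProbability(similarity: Double, numTables: Int): Double = {
    require(similarity >= 0.0 && similarity <= 1.0, "similarity must be in [0, 1]")
    require(numTables >= 1, "numTables must be positive")
    1.0 - math.pow(1.0 - similarity, numTables)
  }
}

// Example: how the candidate probability grows for a weakly similar pair.
//   Seq(1, 4, 16).foreach { l =>
//     println(s"tables=$l p=${OrAmplification.candidateProbability(0.11, l)}")
//   }
```

Under these assumptions, a pair with similarity 0.11 becomes a candidate with probability 0.11 for one table and roughly 0.85 for 16 tables, so more tables means many more weakly similar pairs reach the join output.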
Additionally:
The text processing approach you took is rather pedestrian and prefers non-specific features (remember that you aren't optimizing your actual goal, but text similarity).
Such an approach, especially treating everything as equal, discards the majority of useful information in the set, primarily, but not limited to, temporal relationships.
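To see why plain accuracy says little in this setting, here is a minimal sketch that computes the usual metrics from pair counts. The names are hypothetical and the counts in the usage comment are made up purely for illustration; they are not results from the question.

```scala
object PairMetrics {
  final case class Metrics(accuracy: Double, precision: Double, recall: Double, f1: Double)

  // Standard confusion-matrix metrics; ratios with an empty denominator are reported as 0.0.
  def compute(tp: Long, fp: Long, fn: Long, tn: Long): Metrics = {
    val total = (tp + fp + fn + tn).toDouble
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    val f1 =
      if (precision + recall == 0.0) 0.0
      else 2 * precision * recall / (precision + recall)
    Metrics((tp + tn) / total, precision, recall, f1)
  }
}

// Illustrative counts only: 1000 papers give 499500 unordered pairs,
// almost all of which are true negatives.
//   PairMetrics.compute(tp = 10, fp = 90, fn = 40, tn = 499360)
```

With counts like these the accuracy stays above 0.999 while precision is 0.1 and recall is 0.2, so accuracy is dominated by the true negatives and barely reacts to the quality of the predicted edges.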

XGBoost failing after using windowing functions on label column

I have successfully trained an XGBoost model where trainDF is a DataFrame having two columns, features and label, with 11k 1's and 57M 0's (an unbalanced dataset). Everything works fine.
val undersample = 0.1
// Undersampling of 0's -- choosing 10%
val training1 = output1.filter($"datestr" < end_period1 &&
$"label" === 1)
val training0 = output1.filter($"datestr" < end_period1 &&
$"label" === 0).sample(
false, undersample)
val training = training0.unionAll(training1)
val trainDF = training.select("label",
"features").toDF("label", "features")
val paramMap = List("eta" -> 0.05,
"max_depth" -> 6,
"objective" -> "binary:logistic").toMap
val num_trees = 400
val num_cores = 200
val XGBModel = XGBoost.trainWithDataFrame(trainDF,
paramMap,
num_trees,
num_cores,
useExternalMemory = true)
Then, I want to change the y label with some windowing, so that in each group, I can predict y label earlier.
val sum_label = "sum_label"
val label_window_length = 19
val sliding_window_label = Window.partitionBy("id").orderBy(
asc("timestamp")).rowsBetween(0, label_window_length)
val training_source = output1.filter($"datestr" <
end_period1).withColumn(
sum_label, sum($"label").over(sliding_window_label)).drop(
"label").withColumnRenamed(sum_label, "label")
val training1 = training_source.filter(col("label") === 1)
val training0 = training_source.filter(col("label") === 0).sample(false, 0.099685)
val training = training0.unionAll(training1)
val trainDF = training.select("label",
"features").toDF("label", "features")
The result has 57M 0's and 214k 1's (roughly the same number of rows, though). There are no NAs in the "label" column of trainDF and the type is still double (nullable=true). Then XGBoost fails:
Name: ml.dmlc.xgboost4j.java.XGBoostError
Message: XGBoostModel training failed
StackTrace: at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:316)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithRDD(XGBoost.scala:293)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:138)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:35)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:169)
I can include the logs as needed. My confusion is that using the windowing function, while literally not changing any other setting, causes XGBoost to fail. I would appreciate any thoughts on this.
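For reference, the forward-looking window used above (rowsBetween(0, label_window_length), i.e. the current row plus the following rows) can be illustrated on a plain Scala sequence. This is a local sketch for a single id whose rows are already ordered by timestamp, not Spark code, and the names are hypothetical.

```scala
object ForwardWindow {
  // Sum of the current element and the next `following` elements
  // (the window is simply truncated at the end of the sequence).
  def forwardSum(labels: Seq[Int], following: Int): Seq[Int] =
    labels.indices.map(i => labels.slice(i, i + following + 1).sum)

  // Binary variant: 1 if any element in the forward window is positive.
  def forwardAny(labels: Seq[Int], following: Int): Seq[Int] =
    forwardSum(labels, following).map(s => if (s > 0) 1 else 0)
}

// Example with a window of the current row plus 2 following rows:
//   ForwardWindow.forwardSum(Seq(0, 1, 1, 0, 0), 2)
//   ForwardWindow.forwardAny(Seq(0, 1, 1, 0, 0), 2)
```

One thing the sketch makes visible is that a windowed sum can exceed 1 when a window contains several positives, so filtering on label === 1 and label === 0 leaves those rows out; a max-style or "any" label keeps them. Whether that is related to the failure reported here is not something this sketch can show.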
It turns out that saving the table trainDF in Hive and reloading it into Spark solves the problem:
trainDF.write.mode("overwrite").saveAsTable("database.tablename")
Then, you can easily load the table:
val trainDF = spark.sql("""select * from database.tablename""")
This trick solved the problem. It seems that the Spark windowing function is a bit unstable here, and saving the result into a Hive table makes it work.
A better way to do this is to use windowing functions in Hive instead of Spark.

How to filter a sorted RDD by taking top N rows

I have 2 key-value pair RDDs, A and B, that I work with. Let's say that B has 10000 rows and I have sorted B by its values:
B = B0.map(_.swap).sortByKey().map(_.swap)
I need to take the top 5000 from B and use that to join with A. I know I could do:
B1 = B.take(5000)
or
B1 = B.zipWithIndex().filter(_._2 < 5000).map(_._1)
It seems that both will trigger computation. Since B1 is just an intermediate result, I would like to have it not trigger real computation. Is there a better way to achieve that?
As far as I know, there is no other way to achieve that using RDDs. But you can leverage a DataFrame to achieve the same.
First, convert your RDD to a DataFrame.
Then limit the DataFrame to 5000 rows.
Then you can obtain the new RDD from the DataFrame.
Up to this point, no computation will be triggered by Spark.
Below is a sample proof of concept.
def main(arg: Array[String]): Unit = {
import spark.implicits._
val a =
Array(
Array("key_1", "value_1"),
Array("key_2", "value_2"),
Array("key_3", "value_3"),
Array("key_4", "value_4"),
Array("key_5", "value_5")
)
val rdd = spark.sparkContext.makeRDD(a)
val df = rdd.map({
case Array(key, value) => PairRdd(key, value)
}).toDF()
val dfWithTop = df.limit(3)
val rddWithTop = dfWithTop.rdd
// up to this point no computation has been triggered
// rddWithTop.take(100) will trigger computation
}
case class PairRdd(key: String, value: String)

Spark program performance - GC & Task Deserialization & Concurrent execution

I have a cluster of 4 machines, 1 master and three workers, each with 128G memory and 64 cores. I'm using Spark 1.5.0 in stand alone mode. My program reads data from Oracle tables using JDBC, then does ETL, manipulating data, and does machine learning tasks like k-means.
I have a cached DataFrame (myDF.cache()) which is the result of joining two other DataFrames. The DataFrame contains 27 million rows and the size of the data is around 1.5G. I need to filter the data and calculate 24 histograms as follows:
val h1 = myDF.filter("pmod(idx, 24) = 0").select("col1").histogram(arrBucket)
val h2 = myDF.filter("pmod(idx, 24) = 1").select("col1").histogram(arrBucket)
// ......
val h24 = myDF.filter("pmod(idx, 24) = 23").select("col1").histogram(arrBucket)
Problems:
Since my DataFrame is cached, I expect the filter, select, and histogram to be very fast. However, the actual time is about 7 seconds for each calculation, which is not acceptable. The UI shows that GC time takes 5 seconds and Task Deserialization Time takes 4 seconds. I've tried different JVM parameters but cannot improve it further. Right now I'm using
-Xms25G -Xmx25G -XX:MaxPermSize=512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
-XX:ParallelGCThreads=32 \
-XX:ConcGCThreads=8 -XX:InitiatingHeapOccupancyPercent=70
What puzzles me is that the size of the data is small compared with the available memory. Why does GC kick in every time filter/select/histogram runs? Is there any way to reduce the GC time and Task Deserialization Time?
I have to compute h[1-24] in parallel instead of sequentially. I tried Future, something like:
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._ // provides the duration syntax used in Await.result
val f1 = Future{myDF.filter("pmod(idx, 24) = 1").count}
val f2 = Future{myDF.filter("pmod(idx, 24) = 2").count}
val f3 = Future{myDF.filter("pmod(idx, 24) = 3").count}
val future = for {c1 <- f1; c2 <- f2; c3 <- f3} yield {
c1 + c2 + c3
}
val summ = Await.result(future, 180.seconds)
The problem is that here Future only means jobs are submitted to the scheduler near-simultaneously, not that they end up being scheduled and run simultaneously. Future used here doesn't improve performance at all.
How can I make the 24 computation jobs run simultaneously?
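For reference, the three hand-written Futures above generalize to all 24 buckets with Future.traverse. The sketch below uses a dedicated thread pool so that each submission gets its own thread; runAll and the job parameter are hypothetical names, and job stands in for a blocking Spark action such as myDF.filter(...).count. Whether the jobs then actually execute at the same time still depends on free cores and on the scheduler, as described above.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelBuckets {
  // Submits one job per bucket (0 until numBuckets) from its own thread and
  // waits for all results, returned in bucket order.
  def runAll(numBuckets: Int)(job: Int => Long): Seq[Long] = {
    val pool = Executors.newFixedThreadPool(numBuckets)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val all = Future.traverse((0 until numBuckets).toList)(b => Future(job(b)))
      Await.result(all, 180.seconds)
    } finally {
      pool.shutdown() // release the threads once all jobs are done or have failed
    }
  }
}

// Usage (needs the cached myDF from the question):
//   val counts = ParallelBuckets.runAll(24)(b => myDF.filter(s"pmod(idx, 24) = $b").count)
```

A fixed pool sized to the number of buckets is used here instead of the global execution context, since the global context limits its parallelism to roughly the number of driver cores.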
A couple of things you can try:
Don't compute pmod(idx, 24) all over again. Instead you can simply compute it once:
import org.apache.spark.sql.functions.{pmod, lit}
val myDfWithBuckets = myDF.withColumn("bucket", pmod($"idx", lit(24)))
Use SQLContext.cacheTable instead of cache. It stores the table using compressed columnar storage, which can be used to access only the required columns and, as stated in the Spark SQL and DataFrame Guide, "will automatically tune compression to minimize memory usage and GC pressure".
myDfWithBuckets.registerTempTable("myDfWithBuckets")
sqlContext.cacheTable("myDfWithBuckets")
If you can, cache only the columns you actually need instead of projecting each time.
It is not clear to me what the source of the histogram method is (do you convert to RDD[Double] and use DoubleRDDFunctions.histogram?) or what the argument is, but if you want to compute all histograms at the same time, you can try to groupBy bucket and apply the histogram once, for example using the histogram_numeric UDF:
import org.apache.spark.sql.functions.callUDF
val n: Int = ???
val histograms = myDfWithBuckets // named so that it can be referenced in the notes below
.groupBy($"bucket")
.agg(callUDF("histogram_numeric", $"col1", lit(n)))
If you use predefined ranges, you can obtain a similar effect using a custom UDF.
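The idea of grouping by bucket and computing every histogram in one pass over predefined ranges can be sketched on local Scala collections. This is not Spark code and not the histogram_numeric UDF; the names are hypothetical, edges is assumed to be sorted in ascending order, and the last range is treated as closed on the right, which is how DoubleRDDFunctions.histogram is documented to behave.

```scala
object GroupedHistogram {
  // Counts values per range [edges(i), edges(i + 1)); the last range also
  // includes its upper edge. Values outside the edges are ignored.
  def histogram(values: Seq[Double], edges: Array[Double]): Array[Long] = {
    val counts = Array.fill(edges.length - 1)(0L)
    for (v <- values if v >= edges.head && v <= edges.last) {
      // Index of the last edge that is <= v, clamped so that v == edges.last
      // falls into the final range.
      val i = math.min(edges.lastIndexWhere(_ <= v), counts.length - 1)
      counts(i) += 1
    }
    counts
  }

  // One grouping pass: rows are (idx, value) pairs, grouped by idx mod numBuckets.
  def histogramsByBucket(
      rows: Seq[(Int, Double)],
      edges: Array[Double],
      numBuckets: Int = 24
  ): Map[Int, Array[Long]] =
    rows
      .groupBy { case (idx, _) => Math.floorMod(idx, numBuckets) }
      .map { case (bucket, group) => bucket -> histogram(group.map(_._2), edges) }
}
```

In Spark the same shape would be a groupBy on the precomputed bucket column followed by a single aggregation, so the cached data is scanned once rather than 24 times.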
Notes
How do you extract the values computed by histogram_numeric? First, let's create a small helper:
import org.apache.spark.sql.Row
def extractBuckets(xs: Seq[Row]): Seq[(Double, Double)] =
xs.map(x => (x.getDouble(0), x.getDouble(1)))
Now we can map using pattern matching as follows:
import org.apache.spark.rdd.RDD
val histogramsRDD: RDD[(Int, Seq[(Double, Double)])] = histograms.map{
case Row(k: Int, hs: Seq[Row @unchecked]) => (k, extractBuckets(hs)) }