Parquet data size calculation from Spark Dataset - scala

Is there a way to calculate/estimate what the size of a Parquet file would be, starting from a Spark Dataset?
For example, I would need something like the following:
// This dataset would have 1GB of data, for example
val dataset: DataFrame = spark.table("users")
// I expect that `parquetSize` is 10MB.
val parquetSize: Long = ParquetSizeCalculator.from(dataset)
So, I need to know what the size of a Parquet file would be for a given Spark dataset. If there's no code/library out there, I would appreciate advice on how to calculate it myself.
I hope I made the question clear :-)
Thank you so much in advance.
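I'm not aware of a ready-made ParquetSizeCalculator, but one rough way to estimate it is to write a small sample of the dataset as Parquet, measure the bytes on disk, and scale up by the sampling fraction. A minimal sketch, assuming a 1% sample and a hypothetical temp path (compression ratios vary by data, so treat the result as an estimate):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

def estimateParquetSize(df: DataFrame, spark: SparkSession, fraction: Double = 0.01): Long = {
  val tmpPath = "/tmp/parquet-size-probe"  // hypothetical scratch location
  // Write a sample of the data as Parquet
  df.sample(withReplacement = false, fraction)
    .write.mode("overwrite").parquet(tmpPath)

  // Measure the bytes actually written, then extrapolate to the full dataset
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val sampleBytes = fs.getContentSummary(new Path(tmpPath)).getLength
  (sampleBytes / fraction).toLong
}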

Related

Computing number of rows in parquet

Do you know of any way to compute the number of rows in a Parquet file in Scala? Any Hadoop library? Or Parquet library? I would like to avoid Spark. I mean something like:
number_rows("hdfs:///tmp/parquet")
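One option that avoids Spark is to read only the Parquet footer metadata with the parquet-hadoop library. A minimal sketch for a single file, assuming parquet-hadoop is on the classpath (for a directory such as hdfs:///tmp/parquet you would list the part files and sum the counts); the helper name is just taken from the question:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

def numberRows(path: String): Long = {
  // Only the footer is read, so there is no full scan of the data
  val inputFile = HadoopInputFile.fromPath(new Path(path), new Configuration())
  val reader = ParquetFileReader.open(inputFile)
  try reader.getRecordCount finally reader.close()
}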

How to tackle skewness and output file size in Apache Spark

I am facing a skewness problem when I try to join 2 datasets. One of the data partitions (on the column I am performing the join on) is much more skewed than the rest, and because of this one of the final output part files is 40 times larger than the rest of the output part files.
I am using Scala, Apache spark for performing my calculation and file format used is parquet.
So I am looking for 2 solutions:
First, how can I tackle the skewness, since processing the skewed data
takes a lot of time? (For the skewed data I have tried broadcasting, but it did not help.)
Second, how can I make all the final output part files fall
within a 256 MB range? I have tried the property
spark.sql.files.maxPartitionBytes=268435456 but it is not making any
difference.
Thanks,
Skewness is a common problem when dealing with data.
To handle it, there is a technique called salting.
First, you may check out this video by Ted Malaska to get an intuition about salting.
Second, examine his repository on this topic.
I think each skewness issue has its own method of solving it.
Hope these materials will help you.
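For illustration, a minimal salting sketch in the DataFrame API might look like the following; the DataFrame names, the join key "join_key", and the 16 salt buckets are assumptions, not details from the question:
import org.apache.spark.sql.functions._

val saltBuckets = 16  // assumed number of salt values to spread the hot keys over

// Add a random salt to the large, skewed side of the join
val saltedBig = bigSkewedDf.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Replicate the smaller side once per salt value so every salted key finds a match
val saltedSmall = smallDf.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

// Join on the original key plus the salt, then drop the helper column
val joined = saltedBig.join(saltedSmall, Seq("join_key", "salt")).drop("salt")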

Read CSV into Matrix in Spark Shell

I have a ~1GB csv file (but am open to other data types, e.g. Parquet), with 5m rows and 23 columns, that I want to read into Spark so that I can multiply it to create a scoring matrix.
On a smaller version of the file I am currently using this process:
// csv -> array -> Dense Matrix
import org.apache.spark.mllib.linalg.{Matrix, Matrices, DenseMatrix}
val test = scala.io.Source.fromFile("/hdfs/landing/test/scoreTest.csv").getLines.toArray.flatMap(_.split(",")).map(_.toDouble)
val m1: DenseMatrix = new DenseMatrix(1000,23,test)
Then I can multiply m1 with m1.multiply(), which is all fine. However, when I try this with the large file I run into memory errors and other exceptions, even though the file is only 1GB.
Is this the best way to create a matrix object in Spark ready for multiplication? The whole "read in as an array, then convert to DenseMatrix" approach seems unnecessary and is causing memory issues.
Very new to scala/spark so any help is appreciated.
Note: I know that this could be done in memory in python, R, matlab etc but this is more a proof of concept so that it can be used for much larger files.
Try to use the distributed matrix implementations in org.apache.spark.mllib.linalg.distributed; these use the RDD API, so you'll benefit from the parallelism offered by Spark.
Please refer to the official documentation for more information.
I'd also recommend reading the blog post entitled Scalable Matrix Multiplication using Spark.
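As a minimal sketch of that distributed route (assuming a spark-shell where sc is available; the HDFS path is illustrative), the CSV can be loaded into a CoordinateMatrix and converted to a BlockMatrix for multiplication:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Turn each CSV cell into a MatrixEntry(row, col, value)
val entries = sc.textFile("hdfs:///landing/test/scoreTest.csv")
  .zipWithIndex()
  .flatMap { case (line, row) =>
    line.split(",").zipWithIndex.map { case (value, col) =>
      MatrixEntry(row, col, value.toDouble)
    }
  }

val coordMatrix = new CoordinateMatrix(entries)      // 5M x 23, stored across the cluster
val blockMatrix = coordMatrix.toBlockMatrix().cache()

// Multiply by its own transpose; both operands stay distributed
val product = blockMatrix.multiply(blockMatrix.transpose)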

Convert RDD to DStream to apply StreamingKMeans algorithm in Apache Spark MlLib

I have my scala code for anomaly detection on the KDD cup dataset.
The code is at https://github.com/prashantprakash/KDDDataResearch/blob/master/Code/approach1Plus2/src/main/scala/PCA.scala
I wanted to try a new technique using the StreamingKMeans algorithm from MLlib and update my StreamingKMeans model whenever line 288 in the above code is true, "if( dist < threshold ) {"; i.e. when the test point is classified as normal, update the KMeans model with the new "normal" data point.
I see that StreamingKMeans takes data in the form of DStreams.
Please help me convert the existing RDD to DStreams.
I found a link http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-DStream-td11145.html but it didn't help much.
Also, please advise if there is a better design to solve the problem.
As far as I know, an RDD cannot be converted into a DStream because an RDD is a collection of data, while a DStream is a concept referring to incoming data.
If you want to use StreamingKMeans, then instead of collecting the data into an RDD, ingest it directly as a DStream, possibly using KafkaUtils.createDirectStream or ssc.textFileStream.
Hope this helps!
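A minimal sketch of the textFileStream route: every file dropped into a monitored directory becomes a batch that updates the model. The directory path, the batch interval, k = 2, and the feature dimensionality are all assumptions for illustration:
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingKMeansSketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Each new file in this directory becomes one batch of the DStream
val normalPoints = ssc.textFileStream("hdfs:///tmp/normal-points")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

val numFeatures = 41  // assumed: set to the dimensionality of your feature vectors
val model = new StreamingKMeans()
  .setK(2)            // assumed number of clusters
  .setDecayFactor(1.0)
  .setRandomCenters(numFeatures, 0.0)

model.trainOn(normalPoints)  // the model is updated as new batches arrive

ssc.start()
ssc.awaitTermination()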

Cache an RDD before or after a split in Spark?

I am training an org.apache.spark.mllib.recommendation.ALS model on a quite big RDD rdd. I'd like to select a decent regularization hyperparameter so that my model doesn't over- (or under-) fit. To do so, I split rdd (using randomSplit) into a train set and a test set and perform cross-validation with a defined set of hyperparameters on these.
As I'm using the train and test RDDs several times in the cross-validation it seems natural to cache() the data at some point for faster computation. However, my Spark knowledge is quite limited and I'm wondering which of these two options is better (and why):
Cache the initial RDD rdd before splitting it, that is:
val train_proportion = 0.75
val seed = 42
rdd.cache()
val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
val train_set = split(0)
val test_set = split(1)
Cache the train and test RDDs after splitting the initial RDD:
val train_proportion = 0.75
val seed = 42
val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
val train_set = split(0).cache()
val test_set = split(1).cache()
My speculation is that option 1 is better because the randomSplit would also benefit from the fact that rdd is cached, but I'm not sure whether it would negatively impact the (multiple) future accesses to train_set and test_set with respect to option 2.
This answer seems to confirm my intuition, but it received no feedback, so I'd like to be sure by asking here.
What do you think? And more importantly: Why?
Please note that I have run the experiment on a Spark cluster, but it is often busy these days so my conclusions may be wrong. I also checked the Spark documentation and found no answer to my question.
If the computations on the RDD are made before the split, then it is better to cache it beforehand, as (in my experience) all the transformations will be run only once, triggered by the cache() action.
I supposed that split() + cache() + cache() would be 3 actions vs. cache() + split() being 2. EDIT: cache is not an action.
And indeed I found confirmation in other similar questions around the web.
Edit: to clarify my first sentence: the DAG will perform all the transformations on the RDD and then cache it, so everything done to it afterwards will need no more computation, although the split parts will be calculated again.
In conclusion, if you perform heavier transformations on the split parts than on the original RDD itself, you would want to cache them instead. (I hope someone will back me up here)
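To illustrate the point that cache() is lazy, here is a minimal sketch of option 1 where a cheap action is used to actually materialize the parent RDD before splitting:
rdd.cache()
rdd.count()  // cache() only marks the RDD; this action runs the upstream transformations once and stores the result

val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
val train_set = split(0)  // computed from the cached parent, but the split itself is recomputed on reuse unless cached too
val test_set = split(1)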