Computing number of rows in parquet - scala

Do you know any way to compute the number of rows in a Parquet file in Scala? Any Hadoop library? Or Parquet library? I would like to avoid Spark. I mean something like:
number_rows("hdfs:///tmp/parquet")
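
One way to do this without Spark is to read the row counts that every Parquet file stores in its footer and sum them over the directory. A minimal sketch, assuming the org.apache.parquet:parquet-hadoop artifact is on the classpath (numberOfRows is just an illustrative helper name):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

// Reads only the footers, so no data pages are materialised and no Spark is needed.
def numberOfRows(dir: String): Long = {
  val conf = new Configuration()
  val fs   = new Path(dir).getFileSystem(conf)

  fs.listStatus(new Path(dir))
    .filter(f => f.isFile && f.getPath.getName.endsWith(".parquet"))
    .map { f =>
      val reader = ParquetFileReader.open(HadoopInputFile.fromPath(f.getPath, conf))
      try reader.getFooter.getBlocks.asScala.map(_.getRowCount).sum
      finally reader.close()
    }
    .sum
}

// numberOfRows("hdfs:///tmp/parquet")

For Hive-style partitioned layouts you would recurse into subdirectories instead of the flat listStatus used here.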

Related

equivalent of sklearn's StratifiedGroupKFold for PySpark?

I have a dataframe for single-label binary classification with some class imbalance and I want to make a train-test split. Some observations are members of groups in the data that should only appear in either the test split or train split but not both.
Outside of PySpark, I could use StratifiedGroupKFold from sklearn. What is the easiest way to achieve the same effect with PySpark?
I looked at the sampleBy method from PySpark, but I'm not sure how to use it while keeping the groups separate.
Documentation links:
StratifiedGroupKFold
sampleBy
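
There is no built-in equivalent, but one common workaround is to split at the group level rather than the row level. Below is a minimal sketch (written in Scala to match the rest of this page; the same DataFrame API exists in PySpark) that assumes a DataFrame df with columns named label and group_id, every group carrying a single label, and an 80/20 split. All of those names and numbers are assumptions for illustration:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One row per group; assumes each group maps to exactly one label.
val groups = df.select("group_id", "label").distinct()

// Shuffle the groups within each label and send the first 80% to train.
val w = Window.partitionBy("label").orderBy(rand(42L))
val assigned = groups
  .withColumn("pct", percent_rank().over(w))
  .withColumn("is_train", col("pct") <= 0.8)

// Every row of a group lands in exactly one split, and each label keeps
// roughly the same class proportions in both splits.
val train = df.join(assigned.filter(col("is_train")).select("group_id"), Seq("group_id"))
val test  = df.join(assigned.filter(!col("is_train")).select("group_id"), Seq("group_id"))

This gives a single group-aware, stratified train/test split rather than full K-fold cross-validation, but the same idea can be repeated per fold.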

Parquet data size calculation from Spark Dataset

Is there a way to calculate/estimate what the size of a Parquet file would be, starting from a Spark Dataset?
For example, I would need something like the following:
// This dataset would have 1GB of data, for example
val dataset: DataFrame = spark.table("users")
// I expect that `parquetSize` is 10MB.
val parquetSize: Long = ParquetSizeCalculator.from(dataset)
So, I need to know what the size of a Parquet file would be, given a Spark dataset. If there's no code/library out there, I would appreciate advice on how to calculate it myself.
I hope I made the question clear :-)
Thank you so much in advance.
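
I'm not aware of a library that computes this up front, but a pragmatic estimate is to write a small sample to Parquet, measure the bytes actually produced, and extrapolate. A sketch; the helper name estimateParquetSize, the sample fraction, and the temp path are all made up for illustration:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

// Writes a sample, measures the on-disk size, and scales up.
// The result reflects whatever compression codec the writer is configured with.
def estimateParquetSize(df: DataFrame, fraction: Double, tmpPath: String)
                       (implicit spark: SparkSession): Long = {
  df.sample(withReplacement = false, fraction)
    .write.mode("overwrite").parquet(tmpPath)

  val fs = new Path(tmpPath).getFileSystem(spark.sparkContext.hadoopConfiguration)
  val sampledBytes = fs.getContentSummary(new Path(tmpPath)).getLength

  (sampledBytes / fraction).toLong
}

// val parquetSize = estimateParquetSize(spark.table("users"), 0.01, "/tmp/users_size_probe")

Compression and encoding are not perfectly linear in the row count, so treat the extrapolated number as an estimate rather than an exact figure.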

Dropping columns from a wide Spark Dataframe efficiently based on column values

If I have a wide dataframe (200m cols) that contains only IP addresses, and I want to drop the columns that contain null values or poorly formatted IP addresses, what would be the most efficient way to do this in Spark? My understanding is that Spark performs row-based processing in parallel, not column-based. Thus, if I attempt to apply transformations on a column, there would be a lot of shuffling. Would transposing the dataframe first, then applying filters to drop the rows, then re-transposing be a good way to take advantage of the parallelism of Spark?
You can store a matrix in CSC format using the structure org.apache.spark.ml.linalg.SparseMatrix.
If you can get away with filtering on this datatype and converting back to a dataframe, that would be your best bet.
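
As an alternative to the SparseMatrix route, if the frame is wide but still within what Spark SQL can plan (a few thousand columns, not 200 million), a single aggregation pass can count the null or malformed values in every column at once and then select only the clean columns. A sketch; the DataFrame name df and the IPv4 regex are assumptions:

import org.apache.spark.sql.functions._

// Rough IPv4 dotted-quad check; adjust for IPv6 or stricter validation.
val ipPattern = "^((25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.){3}(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)$"

// One pass over the data: count the bad values in every column simultaneously.
val badCounts = df.select(df.columns.map { c =>
  sum(when(col(c).isNull || !col(c).rlike(ipPattern), 1).otherwise(0)).alias(c)
}: _*).first()

// Keep only the columns with no nulls and no badly formatted addresses.
val keep = df.columns.filter(c => badCounts.getAs[Long](c) == 0L)
val cleaned = df.select(keep.map(col): _*)

Because everything is expressed as one aggregation, Spark scans the rows in parallel once and no transpose or shuffle of the full data is needed.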

Read CSV into Matrix in Spark Shell

I have a ~1GB CSV file (but am open to other data types, e.g. Parquet) with 5m rows and 23 columns that I want to read into Spark so that I can multiply it to create a scoring matrix.
On a smaller version of the file I'm currently using this process:
// csv -> array -> DenseMatrix
import org.apache.spark.mllib.linalg.{Matrix, Matrices, DenseMatrix}
val test = scala.io.Source.fromFile("/hdfs/landing/test/scoreTest.csv").getLines.toArray.flatMap(_.split(",")).map(_.toDouble)
val m1: DenseMatrix = new DenseMatrix(1000, 23, test)
Then I can multiply m1 with m1.multiply(), which is all fine. However, when I try this with the large file I run into memory errors and other issues, even though the file is only 1GB.
Is this the best way to create a matrix object in Spark ready for multiplication? The whole read-into-an-array-then-convert-to-DenseMatrix approach seems unnecessary and is causing memory issues.
Very new to Scala/Spark, so any help is appreciated.
Note: I know that this could be done in memory in Python, R, MATLAB, etc., but this is more a proof of concept so that it can be used for much larger files.
Try the distributed matrix implementations in org.apache.spark.mllib.linalg.distributed; they use the RDD API, so you'll benefit from the parallelism offered by Spark.
Please refer to the official documentation for more information.
I'd also recommend reading the blog post entitled Scalable Matrix Multiplication using Spark.
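
As a concrete illustration of that suggestion, a sketch that reads the CSV straight into an RDD and wraps it in a distributed matrix, so the driver never has to hold the whole array (the path is reused from the question; the choice of computing the 23x23 Gram matrix is illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Each CSV line becomes one row of the distributed matrix.
val rows = sc.textFile("/hdfs/landing/test/scoreTest.csv")
  .zipWithIndex()
  .map { case (line, idx) =>
    IndexedRow(idx, Vectors.dense(line.split(",").map(_.toDouble)))
  }

val mat = new IndexedRowMatrix(rows)

// BlockMatrix supports distributed multiplication; here A^T * A (23x23) is
// computed as an example, since A * A^T would be a 5m x 5m result.
val block   = mat.toBlockMatrix()
val product = block.transpose.multiply(block)

Nothing is collected to the driver until you explicitly ask for it, which is what avoids the out-of-memory behaviour of the scala.io.Source approach.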

Convert RDD to DStream to apply StreamingKMeans algorithm in Apache Spark MlLib

I have my Scala code for anomaly detection on the KDD Cup dataset.
The code is at https://github.com/prashantprakash/KDDDataResearch/blob/master/Code/approach1Plus2/src/main/scala/PCA.scala
I wanted to try a new technique using the StreamingKMeans algorithm from MLlib and update my StreamingKMeans model whenever line 288 in the above code is true ("if( dist < threshold ) {"); i.e., when the test point is classified as normal, update the KMeans model with the new "normal" data point.
I see that StreamingKMeans takes data in the form of DStreams.
Please help in converting the existing RDD to DStreams.
I found a link http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-DStream-td11145.html but it didn't help much.
Also, please advise if there is a better design to solve the problem.
As far as I know, an RDD cannot be converted into a DStream, because an RDD is a static collection of data, while a DStream represents data arriving over time.
If you want to use StreamingKMeans, take the data that you were forming into an RDD and instead ingest it as a DStream, possibly using KafkaUtils.createDirectStream or ssc.textFileStream.
Hope this helps!
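
For experimenting before wiring up a real source such as Kafka, Spark Streaming also offers ssc.queueStream, which replays existing RDDs as a DStream. A minimal sketch; the RDD name normalPoints, the feature dimension numFeatures, and k = 2 are assumptions about the existing code:

import scala.collection.mutable
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))

// Replay an existing RDD[Vector] as a one-batch DStream (handy for testing;
// use Kafka or textFileStream for a real pipeline).
val queue = mutable.Queue[RDD[Vector]](normalPoints)
val trainingStream = ssc.queueStream(queue)

val model = new StreamingKMeans()
  .setK(2)                               // illustrative value
  .setDecayFactor(1.0)
  .setRandomCenters(numFeatures, 0.0)    // dimension must match the data

// Each incoming batch updates the cluster centres.
model.trainOn(trainingStream)

ssc.start()
ssc.awaitTermination()

Points classified as normal can then be pushed into whatever source feeds the DStream, so the model keeps updating as new normal data arrives.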