I have a ~1GB CSV file (but I am open to other data types, e.g. Parquet), with 5M rows and 23 columns, that I want to read into Spark so that I can multiply it to create a scoring matrix.
On a smaller version of the file I am currently using this process:
// csv -> array -> DenseMatrix
import org.apache.spark.mllib.linalg.{Matrix, Matrices, DenseMatrix}
val test = scala.io.Source.fromFile("/hdfs/landing/test/scoreTest.csv").getLines.toArray.flatMap(_.split(",")).map(_.toDouble)
val m1: DenseMatrix = new DenseMatrix(1000, 23, test) // note: DenseMatrix expects its values in column-major order
Then I can multiply m1 with m1.multiply(), which all works fine. However, when I try this with the large file I run into memory error exceptions and other issues, even though the file is only 1GB.
Is this the best way to create a matrix object in Spark ready for multiplication? Reading everything in as an array and then converting to a DenseMatrix seems unnecessary and is causing the memory issues.
I am very new to Scala/Spark, so any help is appreciated.
Note: I know this could be done in memory in Python, R, MATLAB, etc., but this is more a proof of concept so that it can be used for much larger files.
Try the distributed matrix implementations in org.apache.spark.mllib.linalg.distributed. These use the RDD API, so you will benefit from the parallelism offered by Spark.
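For example, here is a minimal sketch of reading the CSV straight into an IndexedRowMatrix and multiplying via BlockMatrix (the path is taken from your question; computing the 23 x 23 Gram matrix M^T * M is my assumption, so adjust it to whichever product you actually need):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Each CSV line becomes one distributed row; nothing is collected to the driver.
val rows = sc.textFile("/hdfs/landing/test/scoreTest.csv")
  .map(_.split(",").map(_.toDouble))
  .zipWithIndex()
  .map { case (values, idx) => IndexedRow(idx, Vectors.dense(values)) }

val mat = new IndexedRowMatrix(rows)

// BlockMatrix supports distributed multiply; M^T * M yields a small 23 x 23 result.
val block = mat.toBlockMatrix()
val scores = block.transpose.multiply(block)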
Please refer to the official documentation for more information.
I'd also recommend reading the blog post entitled Scalable Matrix Multiplication using Spark.
Related
I am facing a skew problem when trying to join 2 datasets. One partition of the data (on the column I am performing the join on) is far more skewed than the rest, and because of this one of the final output part files is 40 times larger than the rest of the output part files.
I am using Scala and Apache Spark for my calculations, and the file format used is Parquet.
So I am looking for 2 solutions:
First, how can I tackle the skew? Processing the skewed data takes a lot of time. (For the skewed data I have tried broadcasting, but it did not help.)
Second, how can I make all the final output part files stay within a 256 MB range? I have tried the property spark.sql.files.maxPartitionBytes=268435456, but it is not making any difference.
Thanks,
Skewness is a common problem when dealing with data.
To handle it, there is a technique called salting.
First, you may check out this video by Ted Malaska to get some intuition about salting.
Second, examine his repository on this theme.
Each skewness issue tends to have its own method of solving it.
Hope these materials will help you.
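As a rough illustration of salting (largeDf, smallDf and the join column "key" are placeholders, and 16 buckets is an arbitrary choice):

import org.apache.spark.sql.functions._

val saltBuckets = 16

// Spread the hot key on the large, skewed side over random salt values.
val saltedLarge = largeDf.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Replicate the small side so every (key, salt) combination exists.
val saltedSmall = smallDf.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

// The join key is now (key, salt), so the skewed key lands in many partitions.
val joined = saltedLarge.join(saltedSmall, Seq("key", "salt")).drop("salt")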
Is there any convenient way to convert a DataFrame from Spark to the type used by DL4J? Currently, using DataFrames in algorithms with DL4J I get the error:
"type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".
In general, we use DataVec for that. I can point you at examples for that if you want. DataFrames make too many assumptions that make them too brittle to be used for real-world deep learning.
Beyond that, a DataFrame is typically not a good abstraction for representing linear algebra. (It falls down when dealing with images, for example.)
We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java
But in general, a DataSet is just a pair of NDArrays, just like in NumPy. If you have to use Spark tools and want to use NDArrays on the last mile only, then my advice would be to get the DataFrame to match some form of schema that is purely numerical, and map that to an NDArray "row".
In general, a big reason we do this is because all of our NDArrays are off-heap.
Spark has many limitations when it comes to working with its data pipelines, and it uses the JVM for things it shouldn't be used for (matrix math); we took a different approach that allows us to use GPUs and a bunch of other things efficiently.
When we do that conversion, it ends up being:
raw data -> numerical representation -> ndarray
What you could do is map DataFrames onto a double/float array and then use Nd4j.create(float/doubleArray), or you could also do:
someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(inputFloatArray), yourLabelINDArray))
That will give you a DataSet. You need a pair of NDArrays matching your input data and a label.
The label from there is relative to the kind of problem you're solving, whether that be classification or regression.
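For instance, a rough sketch of that last-mile mapping in Scala (it assumes a purely numeric DataFrame df with the label in the last column, which is my assumption, not something fixed by DL4J):

import org.apache.spark.sql.DataFrame
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j

def toDataSets(df: DataFrame) =
  df.rdd.map { row =>
    // Pull every numeric column out of the Row into a plain double array.
    val values = (0 until row.length).map(row.getDouble).toArray
    // Features = all but the last column, label = the last column.
    new DataSet(Nd4j.create(values.init), Nd4j.create(values.takeRight(1)))
  }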
I have Scala code for anomaly detection on the KDD Cup dataset.
The code is at https://github.com/prashantprakash/KDDDataResearch/blob/master/Code/approach1Plus2/src/main/scala/PCA.scala
I wanted to try a new technique using the StreamingKMeans algorithm from MLlib, updating my StreamingKMeans model whenever line 288 in the above code, "if (dist < threshold) {", is true; i.e. when the test point is classified as normal, update the KMeans model with the new "normal" data point.
I see that StreamingKMeans takes data in the form of DStreams.
Please help with converting the existing RDD to DStreams.
I found a link http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-DStream-td11145.html but it didn't help much.
Also, please advise if there is a better design to solve the problem.
As far as I know, an RDD cannot be converted into a DStream, because an RDD is a static collection of data, while a DStream represents data arriving over time.
If you want to use StreamingKMeans, then rather than forming the data into an RDD, build a DStream directly, possibly using KafkaUtils.createDirectStream or ssc.textFileStream.
Hope this helps!
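A minimal sketch of what that could look like (the directory path, batch interval, k and the feature dimensionality are all assumptions to adapt to your data):

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Watch a directory for new files instead of pre-building an RDD.
val trainingData = ssc.textFileStream("/hdfs/landing/kdd/stream")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))

val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0)
  .setRandomCenters(38, 0.0) // 38 = assumed number of numeric features

// The model is updated incrementally as each batch arrives.
model.trainOn(trainingData)

ssc.start()
ssc.awaitTermination()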
I have CSV files of size 6GB, and I tried using the import function in MATLAB to load them, but it failed due to a memory issue. Is there a way to reduce the size of the files?
I think the number of columns is causing the problem. I have 133076 rows by 2329 columns. I had another file with the same number of rows but only 12 columns, and MATLAB could handle that. However, once the number of columns increases, the files get really big.
Ultimately, if I can read the data column-wise, so that I have 2329 column vectors of length 133076, that would be great.
I am using MATLAB 2014a.
Numeric data are by default stored by MATLAB in double-precision format, which takes up 8 bytes per number. Data of size 133076 x 2329 therefore take up about 2.3 GiB in memory. Do you have that much free memory? If not, reducing the file size won't help.
If the problem is not that the data themselves don't fit into memory, but is really about the process of reading such a large csv file, then maybe the syntax
M = csvread(filename,R1,C1,[R1 C1 R2 C2])
might help; it allows you to read only part of the data at a time. Read the data in chunks and assemble them in a (preallocated!) array.
If you do not have enough memory, another possibility is to read chunkwise and then convert each chunk to single precision before storing it. This reduces memory consumption by a factor of two.
And finally, if you don't process the data all at once, but can implement your algorithm such that it uses only a few rows or columns at a time, that same syntax may help you to avoid having all the data in memory at the same time.
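Putting that together, a rough sketch (the file name and chunk size are placeholders):

% Read a 133076 x 2329 csv in row chunks, storing the result in single precision.
filename = 'bigdata.csv';
nRows = 133076; nCols = 2329; chunk = 10000;

M = zeros(nRows, nCols, 'single');   % preallocate!
for r1 = 0:chunk:nRows-1
    r2 = min(r1 + chunk - 1, nRows - 1);
    % csvread uses zero-based row/column offsets; ranges are inclusive
    M(r1+1:r2+1, :) = single(csvread(filename, r1, 0, [r1 0 r2 nCols-1]));
end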
I want to save an array efficiently in MATLAB. I have an array of size 3000 by 9000. If I save this array in a MAT-file using just the save function, it consumes around 214 MB. If I use fwrite with the float data type, it comes to around 112 MB. Is there any other way I can further reduce the disk space consumed when saving this array in MATLAB?
I would suggest writing in binary mode and then using a compression algorithm such as bzip2.
There are a few ways to reduce the required memory:
1. Reducing precision
Rather than using the double you normally have, consider using a single, or perhaps even a uint8 or logical. Writing the raw values with fwrite (as in your question) also achieves this, but you may want to compress further afterwards, since fwrite does not create a compressed file.
2. Utilizing a pattern
If your matrix has a certain pattern, this can sometimes be leveraged to store the data more efficiently, or at least the information needed to recreate the data. The most common example is when your matrix is storable as a few vectors, for example when it is sparse.
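To illustrate both ideas (the variable name A and the threshold are just for demonstration):

A = rand(3000, 9000);

% 1. Reduced precision: single halves the size relative to double.
As = single(A);
save('data_single.mat', 'As');

% 2. Structure: if most entries are zero, a sparse matrix stores only
%    the nonzero values and their indices.
A(A < 0.99) = 0;      % toy step to make A mostly zeros
S = sparse(A);
save('data_sparse.mat', 'S');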