spark: overflow / underflow for Decimal / BigDecinal - scala

just trying to find if there is a way to detect overflow/underflow for DecimalType column in spark (DataFrame API)? there is a method to detect this happening for numeric types with a small range of values.
I understand that could be done on the lower level via using UDFs and detecting if there is an overflow in the corresponding BigDecimal value representing result, but I would prefer to use the existing DataFrame function etc.

Related

How to compare case classes with floating numbers in Scalatest?

Can you compare two objects that have floating numbers in field or values of maps, without manually comparing each field; i.e., enforce tolerance whenever floats or doubles are encountered using custom matchers and Scala generics?

How to handle arithmetic overflow errors while dealing with aggregating the long data in spark scala

Wondering if there is any way to handle arithmetic overflow while doing aggregation on the dataframes? Note data is being aggregated over 30 day window and data is of long type.

Dropping columns from a wide Spark Dataframe efficiently based on column values

if i have a wide dataframe (200m cols) that contains only IP addresses and i want to drop the columns that contains null values or poorly formatted IP addresses, what would be the most efficient way to do this in Spark? My understanding is that Spark performs row based processing in parallel, not column based. Thus if I attempt to apply transformations on a column, there would be a lot of shuffling. Would transposing the dataframe first then applying filters to drop the rows, then retransposing be a good way to take advantage of the parallelism of spark?
You can store a matrix in CSC format using the structure org.apache.spark.ml.linalg.SparseMatrix
If you can get away with filtering on this datatype and converting back to a dataframe that would be your best bet

Converting Dataframe from Spark to the type used by DL4j

Is there any convenient way to convert Dataframe from Spark to the type used by DL4j? Currently using Daraframe in algorithms with DL4j I get an error:
"type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".
In general, we use datavec for that. I can point you at examples for that if you want. Dataframes make too many assumptions that make it too brittle to be used for real world deep learning.
Beyond that, a data frame is not typically a good abstraction for representing linear algebra. (It falls down when dealing with images for example)
We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java
But in general, a dataset is just a pair of ndarrays just like numpy. If you have to use spark tools, and want to use ndarrays on the last mile only, then my advice would be to get the dataframe to match some form of schema that is purely numerical, map that to an ndarray "row".
In general, a big reason we do this is because all of our ndarrays are off heap.
Spark has many limitations when it comes to working with their data pipelines and using the JVM for things it shouldn't be(matrix math) - we took a different approach that allows us to use gpus and a bunch of other things efficiently.
When we do that conversion, it ends up being:
raw data -> numerical representation -> ndarray
What you could do is map dataframes on to a double/float array and then use Nd4j.create(float/doubleArray) or you could also do:
someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(yourInputArray),yourLabelINDARray))
That will give you a "dataset" You need a pair of ndarrays matching your input data and a label.
The label from there is relative to the kind of problem you're solving whether that be classification or regression though.

RowMatrix from DataFrame containing null values

I have a DataFrame of user ratings (from 1 to 5) relative to movies. In order to get the DataFrame where the first column is movie id and the rest columns are the ratings for that movie by each user, I do the following:
val ratingsPerMovieDF = imdbRatingsDF
.groupBy("imdbId")
.pivot("userId")
.max("rating")
Now, here I get a DataFrame where most of the values are null due to the fact that most users have rated only few movies.
I'm interested in calculating similarities between those movies (item-based collaborative filtering).
I was trying to assemble a RowMatrix (for further similarities calculations using mllib) using the rating columns values. However, I don't know how to deal with null values.
The following code where I try to get a Vector for each row:
val assembler = new VectorAssembler()
.setInputCols(movieRatingsDF.columns.drop("imdbId"))
.setOutputCol("ratings")
val ratingsDF = assembler.transform(movieRatingsDF).select("imdbId", "ratings")
Gives me an error:
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
I could substitute them with 0s using .na.fill(0) but that would produce incorrect correlation results since almost all Vectors would become very similar.
Can anyone suggest what to do in this case? The end goal here is to calculate correlations between rows. I was thinking of using SparseVectors somehow (to ignore null values but I don't know how.
I'm new to Spark and Scala so some of this might make little sense. I'm trying to understand things better.
I believe you are approaching this in a wrong way. Dealing with nuances of Spark API is secondary to a proper problem definition - what exactly do you mean by correlation in case of sparse data.
Filling data with zeros in case of explicit feedback (rating), is problematic not because all Vectors would become very similar (variation of the metric will be driven by existing ratings, and results can be always rescaled using min-max scaler), but because it introduces information which is not present in the original dataset. There is a significant difference between item which hasn't been rated and item which has the lowest possible rating.
Overall you can approach this problem in two ways:
You can compute pairwise similarity using only entries where both items have non-missing values. This should work reasonably well if dataset is reasonably dense. It could be expressed using self-join on the input dataset. With pseudocode:
imdbRatingsDF.alias("left")
.join(imdbRatingsDF.alias("right"), Seq("userId"))
.where($"left.imdbId" =!= $"right.imdbId")
.groupBy($"left.imdbId", $"right.imdbId")
.agg(simlarity($"left.rating", $"right.rating"))
where similarity implements required similarity metric.
You can impute missing ratings, for example using some measure of central tendency. Using average (Replace missing values with mean - Spark Dataframe) is probably the most natural choice.
More advanced imputation techniques might provide more reliable results, but likely won't scale very well in a distributed system.
Note
Using SparseVectors is essentially equivalent to na.fill(0).