How to deal with data skewing in spark when using percent_rank - pyspark

As far as I know,Salting is a solution to data skewness in Spark. However, when calculate using percent_rank() over a window,I can't just simply add some salt to the key.If some random salt is added to the key,the distribution of data upon that key is different from original data ,thus the percent_rank result is also changed.So how can I get thing right?

Related

PySpark MLLib APproximate nearest neighbour search for multiple keys

I want to use ANN from PySpark. I have a DataFrame of 100K keys for which I want to perform top-10 ANN searches on an already transformed Spark DataFrame. But it seems that API of BucketedRandomProjectionLSH expects only one key at a time. I also want to avoid using approxSimilarityJoin, because it only allows you to set a threshold, but this would lead to a variable k for each key (it also fails in my case saying that for some records there are no NNs for a given threshold).
Currently, the best thing I came up with is .collect() the keys and call approxNearestNeighbors in a for loop on the driver, but it is terribly inefficient.
Does anyone know how I can get top-10 ANN searches for my 100K keys in parallel?
Thank you.

Time Series Predictions in PostgreSQL

I am new to PostgreSQL and database systems, and I am currently trying to create a database to store observed values as well as all predictions made in the past for some time series.
I have already built a table (actually a view) for observed values, with rows looking basically like:
(time, object, value)
Now I want to store predictions, which means for each time, what has been predicted by some software for the following next N time steps, N being variable since the software has different prediction types.
I have thought about multiple solutions, which are the following:
Store each prediction as a row, using max(N)=240 columns i.e (time, object, value 1, value 2, ..., value 240).
Store each prediction as a row, with the prediction values as a binary JSON, i.e (time, object, JSONB prediction).
Store each prediction value as a row, with a column specifying the delay of the prediction in hours, i.e
(time, object, delay, value).
I don't know how each of these choices would affect performance when I will retrieve and compute summary values on the predictions. A typical thing I would like to do is to retrieve the performance of the prediction for some delay, i.e. how big is the prediction error when we predict x days ahead, and I need this query to be executed pretty fast, to display it in a dashboard.
Which choice do you think is the best? Or do you have any other idea?
Thanks a lot!
Without further information about the access patterns for the collected data i would strongly recommend to use jsonb.
Using one column per timestep will result in bloat of the system catalog and statistics.
If you need to filter on the values of the predictions, you don't want to maintain 240 indexes also.
If you don't need to use these values within a WHERE condition you may use json instead of jsonb.

Are there alternative solution without cross-join in Spark 2?

Stackoverflow!
I wonder if there is a fancy way in Spark 2.0 to solve the situation below.
The situation is like this.
Dataset1 (TargetData) has this schema and has about 20 milion records.
id (String)
vector of embedding result (Array, 300 dim)
Dataset2 (DictionaryData) has this schema and has about 9,000 records.
dict key (String)
vector of embedding result (Array, 300 dim)
For each vector of records in dataset 1, I want to find the dict key that will be the maximum when I compute cosine similarity it with dataset 2.
Initially, I tried cross-join dataset1 and dataset2 and calculate cosine simliarity of all records, but the amount of data is too large to be available in my environment.
I have not tried it yet, but I thought of collecting dataset2 as a list and then applying udf.
Are there any other method in this situation?
Thanks,
There might be two options the one is to broadcast Dataset2 since you need to scan it for each row of Dataset1 thus avoid the network delays by accessing it from a different node. Of course in this case you need to consider first if your cluster can handle the memory cost which 9000rows x 300cols(not too big in my opinion). Also you still need your join although with broadcasting should be faster. The other option is to populate a RowMatrix from your existing vectors and leave spark do the calculations for you

RowMatrix from DataFrame containing null values

I have a DataFrame of user ratings (from 1 to 5) relative to movies. In order to get the DataFrame where the first column is movie id and the rest columns are the ratings for that movie by each user, I do the following:
val ratingsPerMovieDF = imdbRatingsDF
.groupBy("imdbId")
.pivot("userId")
.max("rating")
Now, here I get a DataFrame where most of the values are null due to the fact that most users have rated only few movies.
I'm interested in calculating similarities between those movies (item-based collaborative filtering).
I was trying to assemble a RowMatrix (for further similarities calculations using mllib) using the rating columns values. However, I don't know how to deal with null values.
The following code where I try to get a Vector for each row:
val assembler = new VectorAssembler()
.setInputCols(movieRatingsDF.columns.drop("imdbId"))
.setOutputCol("ratings")
val ratingsDF = assembler.transform(movieRatingsDF).select("imdbId", "ratings")
Gives me an error:
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
I could substitute them with 0s using .na.fill(0) but that would produce incorrect correlation results since almost all Vectors would become very similar.
Can anyone suggest what to do in this case? The end goal here is to calculate correlations between rows. I was thinking of using SparseVectors somehow (to ignore null values but I don't know how.
I'm new to Spark and Scala so some of this might make little sense. I'm trying to understand things better.
I believe you are approaching this in a wrong way. Dealing with nuances of Spark API is secondary to a proper problem definition - what exactly do you mean by correlation in case of sparse data.
Filling data with zeros in case of explicit feedback (rating), is problematic not because all Vectors would become very similar (variation of the metric will be driven by existing ratings, and results can be always rescaled using min-max scaler), but because it introduces information which is not present in the original dataset. There is a significant difference between item which hasn't been rated and item which has the lowest possible rating.
Overall you can approach this problem in two ways:
You can compute pairwise similarity using only entries where both items have non-missing values. This should work reasonably well if dataset is reasonably dense. It could be expressed using self-join on the input dataset. With pseudocode:
imdbRatingsDF.alias("left")
.join(imdbRatingsDF.alias("right"), Seq("userId"))
.where($"left.imdbId" =!= $"right.imdbId")
.groupBy($"left.imdbId", $"right.imdbId")
.agg(simlarity($"left.rating", $"right.rating"))
where similarity implements required similarity metric.
You can impute missing ratings, for example using some measure of central tendency. Using average (Replace missing values with mean - Spark Dataframe) is probably the most natural choice.
More advanced imputation techniques might provide more reliable results, but likely won't scale very well in a distributed system.
Note
Using SparseVectors is essentially equivalent to na.fill(0).

Running k-medoids algorithm in ELKI

I am trying to run ELKI to implement k-medoids (for k=3) on a dataset in the form of an arff file (using the ARFFParser in ELKI):
The dataset is of 7 dimensions, however the clustering results that I obtain show clustering only on the level of one dimension, and does this only for 3 attributes, ignoring the rest. Like this:
Could anyone help with how can I obtain a clustering visualization for all dimensions?
ELKI is mostly used with numerical data.
Currently, ELKI does not have a "mixed" data type, unfortunately.
The ARFF parser will split your data set into multiple relations:
a 1-dimensional numerical relation containing age
a LabelList relation storing sex and region
a 1-dimensional numerical relation containing salary
a LabelList relation storing married
a 1-dimensional numerical relation storing children
a LabelList relation storing car
Apparently it has messed up the relation labels, though. But other than that, this approach works perfectly well with arff data sets that consist of numerical data + a class label, for example - the use case this parser was written for. It is a well-defined and consistent behaviour, though not what you expected it to do.
The algorithm then ran on the first relation it could work with, i.e. age only.
So here is what you need to do:
Implement an efficient data type for storing mixed type data.
Modify the ARFF parser to produce a single relation of mixed type data.
Implement a distance function for this type, because the lack of a mixed type data representation means we do not have a distance to go with it either.
Choose this new distance function in k-Medoids.
Share the code, so others do not have to do this again. ;-)
Alternatively, you could write a script to encode your data in a numerical data set, then it will work fine. But in my opinion, the results of one-hot-encoding etc. are not very convincing usually.