BucketedRandomProjectionLSHModel approxNearestNeighbors function on entire dataframe - scala

I'm trying to evaluate an entire DataFrame through the approxNearestNeighbors function of BucketedRandomProjectionLSHModel.
What I expect:
A DataFrame containing the following information:
cookieId NN
id1 [id3, id5, id7]
id2 [id8, id9]
...
Input DataFrame (daily_content_transformed):
cookieID features(a sparse vector)
id1 sparse vector with features
id2 sparse vector with features
...
This works:
val key = Vectors.sparse(37599,
Array(1,4,6,7,16,57,81,104,166,225,290,692,763),
Array(1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0))
model.approxNearestNeighbors(daily_content_transformed, key, 20).show(20, false)
It returns a DataFrame with 21 rows. I could extract the cookieId column from this DataFrame and store it in the expected DataFrame.
Where I'm stuck:
Instead of hard-coding the key to retrieve the NN from, I want to run the method for every row in the input DataFrame and build the DataFrame described above.
Any help?
Edit in reply to first response:
After playing around with the suggestion to use approxSimilarityJoin instead of approxNearestNeighbors I came to the following conclusions:
The suggested solution works well for daily_content_transformed.limit(3000).
Starting from daily_content_transformed.limit(5000), my Spark job terminates with a java.lang.OutOfMemoryError.
My input table contains roughly 800,000 unique cookieIDs (rows).
Although the suggested solution works for small inputs, scalability is an issue.

BucketedRandomProjectionLSHModel doesn't provide the required API. I think you can approximate it using approxSimilarityJoin:
import org.apache.spark.sql.functions.{struct, udf, collect_list, sort_array}
// Placeholders: choose a maximum distance and the number of neighbors to keep
val threshold: Double = ???
val n: Int = ???
def take(n: Int) = udf((xs: Seq[String]) => xs.take(n))
model
  .approxSimilarityJoin(
    daily_content_transformed.alias("left"),
    daily_content_transformed.alias("right"),
    threshold)
  .groupBy($"datasetA.id" as "cookieId")
  // Collect pairs (dist, id)
  .agg(collect_list(struct($"distCol", $"datasetB.id" as "id")) as "NN")
  // Sort by dist (ascending, so nearest first), drop dist and take n
  .withColumn("NN", take(n)(sort_array($"NN").getItem("id")))
This keeps at most n neighbors per cookieId.
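To make the group, sort, and truncate step concrete, here is a minimal plain-Scala sketch of the same logic on an in-memory list of join results, without Spark. The names TopNeighbors, Pair and topN are made up for illustration, and dropping self-pairs is an addition of this sketch, not something the Spark snippet does.

```scala
object TopNeighbors {
  // One row of a similarity join: (left id, right id, distance). Hypothetical type.
  final case class Pair(left: String, right: String, dist: Double)

  // For each left id, keep the ids of the n closest right ids, nearest first.
  def topN(pairs: Seq[Pair], n: Int): Map[String, Seq[String]] =
    pairs
      .filter(p => p.left != p.right) // assumption: a self-join pairs each row with itself
      .groupBy(_.left)
      .map { case (id, ps) => id -> ps.sortBy(_.dist).take(n).map(_.right) }
}
```

Sorting ascending by distance before truncating is what makes the result "nearest" neighbors; sorting descending would keep the farthest matches that still fall under the threshold.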

Related

Vector vs Vectors in spark [duplicate]

This question already has an answer here:
Difference between spark Vectors and scala immutable Vector?
(1 answer)
Closed 4 years ago.
I am a newbie in Spark.
I am trying to read a text file that has data like:
timestamp id counter value
00:01 1 c1 0.5
00:02 5 c3 0.3
00:03 1 c2 0.1
00:04 2 c2 0.13
and transform them to:
(id, array_of_counters):
(1, [ c1 c2 ])
[ 0.5 0.1]
So, for every id, I create a 2D array, which will hold every counter and every value for that specific id in the text file.
I tried to do it with Vectors, but I think that what is stored in them must be Double, and that I cannot add two vectors unless they are Breeze vectors.
Then I found out there is a data structure called just Vector, but I can't find any details about it.
So, my question is: what are the main differences between Vector and Vectors in mllib?
Code:
val inputRdd = sc.textFile(inputFile).map(x => x.split(","))
val data = inputRdd.map(y => (y(1), Vector(y(2), y(3)))).reduceByKey(_++_)
I don't think a Vector is necessary or appropriate for what it appears you are trying to do here (I could be wrong, we need more specifics on what you want to accomplish). The only way it makes sense is if there is a fixed number of counters (c1, c2, etc.) for each id. If you simply want a set of every id, with its corresponding list of counters and values, try this (I'm assuming counters are unique to each id):
val data = inputRdd
.map(y => (y(1).toLong, y(2), y(3).toDouble))
.toDF("id", "counter", "value")
.groupBy("id")
.agg(collect_list(map($"counter", $"value")))
.as[(Long, Seq[Map[String, Double]])]
.map(r => (r._1, r._2.reduce(_++_)))
//this results in a Dataset[(Long, Map[String, Double])]
A Spark ml.linalg.Vector is basically an Array[Double], and would require a fixed number of counters for every record. You could transform the data above into a vector by ordering the Map[String, Double] by its keys (._1) and creating a Vector from its .values.
ml.linalg.Vectors is just a helper object with functions for creating Vector objects.
Factory methods for org.apache.spark.ml.linalg.Vector. We don't use the name Vector because Scala imports scala.collection.immutable.Vector by default.
It's also worth noting that mllib is intended for the older RDD API while ml is intended for the newer Dataframe/Dataset API.
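The map-to-vector conversion described above could be sketched in plain Scala as follows. This is only an illustration under an assumption the answer does not state: the full list of counter names is known up front, and a counter missing for an id is filled with 0.0. CountersToArray and toDense are hypothetical names.

```scala
object CountersToArray {
  // `names` fixes the order: the position of a counter name is its index in the vector.
  // Counters absent from the map default to 0.0 (an assumption of this sketch).
  def toDense(counters: Map[String, Double], names: Seq[String]): Array[Double] =
    names.map(name => counters.getOrElse(name, 0.0)).toArray
}
```

The resulting Array[Double] is what Vectors.dense accepts if an ml.linalg.Vector is needed afterwards.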
Edit: RDD[(Long, Seq[(String, Double)])]
val data = inputRdd
.map(y => (y(1).toLong, Seq[(String, Double)]((y(2), y(3).toDouble))))
.reduceByKey(_++_)

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I was trying out the map function, but I believe it loops through each row of a dataframe.
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala :
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
.setInputCols(df.columns)
.setOutputCols(df.columns.map(c => s"${c}_imputed"))
.setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
inputCols=df.columns,
outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: _*).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] that maps each column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k, v in col_avgs.items() }
df.fillna( col_avgs ).show()
The four steps are:
Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
Calculate the mean for each column, and save it as the dictionary col_avgs
The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the avg( prefix and the closing parenthesis to recover the original column name.
Fill the columns of the dataframe with the averages using col_avgs

Transforming Spark Dataframe Column

I am working with Spark dataframes. I have a categorical variable in my dataframe with many levels. I am attempting a simple transformation of this variable: keep only the top few levels that have more than n observations (say, 1000), and club all other levels into an "Others" category.
I am fairly new to Spark, so I have been struggling to implement this. This is what I have been able to achieve so far:
// Extract all levels having > 10000 observations (df is the dataframe name)
val levels_count = df.groupBy("Col_name").count.filter("count >10000").sort(desc("count"))
// Extract the level names
val level_names = levels_count.select("Col_name").rdd.map(x => x(0)).collect
This gives me an Array which has the level names that I would like to retain. Next, I should define the transformation function which can be applied to the column. This is where I am getting stuck. I believe we need to create a User defined function. This is what I tried:
// Define UDF
val var_transform = udf((x: String) => {
if (level_names contains x) x
else "others"
})
// Apply UDF to the column
val df_new = df.withColumn("Var_new", var_transform($"Col_name"))
However, when I try df_new.show it throws a "Task not serializable" exception. What am I doing wrong? Also, is there a better way to do this?
Thanks!
Here is a solution that is, in my opinion, better for such a simple transformation: stick to the DataFrame API and trust Catalyst and Tungsten to optimise it (e.g. by using a broadcast join):
val levels_count = df
.groupBy($"Col_name".as("new_col_name"))
.count
.filter("count >10000")
val df_new = df
.join(levels_count,$"Col_name"===$"new_col_name", joinType="leftOuter")
.drop("Col_name")
.withColumn("new_col_name",coalesce($"new_col_name", lit("other")))
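What the left outer join plus coalesce computes can be modelled on a plain Scala collection. This is only a sketch of the logic, not Spark code; ClubRareLevels, club, minCount and other are hypothetical names.

```scala
object ClubRareLevels {
  // Levels that occur more than minCount times are kept; everything else becomes `other`.
  def club(values: Seq[String], minCount: Int, other: String = "other"): Seq[String] = {
    // Counterpart of groupBy(...).count
    val counts = values.groupBy(identity).map { case (level, vs) => level -> vs.size }
    // Counterpart of filter("count > ..."): the levels that survive
    val frequent = counts.collect { case (level, c) if c > minCount => level }.toSet
    // Counterpart of the left outer join + coalesce: unmatched levels fall back to `other`
    values.map(v => if (frequent(v)) v else other)
  }
}
```

Rows whose level is missing from the filtered counts get no match in the join, which is why coalesce can substitute the fallback label for them.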

Spark RDD: Sum one column without creating SQL DataFrame

Is there an efficient way to sum up the values in a column in spark RDD directly? I do not want to create a SQL DataFrame just for this.
I have an RDD of LabeledPoint in which each LabeledPoint uses a sparse vector representation. Suppose I am interested in sum of the values of first feature.
The following code does not work for me:
//lp_RDD is RDD[LabeledPoint]
var total = 0.0
for(x <- lp_RDD){
total += x.features(0)
}
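The value of total after this loop is still 0.
The loop body runs on the executors, each of which updates its own copy of total, so the variable on the driver never changes. What you want is to extract the first element from the feature vector using RDD.map and then sum them all up using DoubleRDDFunctions.sum: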
val sum: Double = rdd.map(_.features(0)).sum()

Spark Build Custom Column Function, user defined function

I'm using Scala and want to build my own DataFrame function. For example, I want to treat a column like an array, iterate through each element and make a calculation.
To start off, I’m trying to implement my own getMax method. So column x would have the values [3,8,2,5,9], and the expected output of the method would be 9.
Here is what it looks like in Scala
def getMax(inputArray: Array[Int]): Int = {
var maxValue = inputArray(0)
for (i <- 1 until inputArray.length if inputArray(i) > maxValue) {
maxValue = inputArray(i)
}
maxValue
}
This is what I have so far, and get this error
"value length is not a member of org.apache.spark.sql.column",
and I don't know how else to iterate through the column.
def getMax(col: Column): Column = {
var maxValue = col(0)
for (i <- 1 until col.length if col(i) > maxValue){
maxValue = col(i)
}
maxValue
}
Once I am able to implement my own method, I will create a column function
val value_max: org.apache.spark.sql.Column = getMax(df.col("value")).as("value_max")
And then I hope to be able to use this in a SQL statement, for example
val sample = sqlContext.sql("SELECT value_max(x) FROM table")
and the expected output would be 9, given input column [3,8,2,5,9]
I am following an answer from another thread Spark Scala - How do I iterate rows in dataframe, and add calculated values as new columns of the data frame where they create a private method for standard deviation.
The calculations I will do will be more complex than this (e.g. I will be comparing each element in the column). Am I going in the right direction, or should I be looking more into User Defined Functions?
In a Spark DataFrame, you can't iterate through the elements of a Column using the approaches you thought of because a Column is not an iterable object.
However, to process the values of a column, you have some options and the right one depends on your task:
1) Using the existing built-in functions
Spark SQL already has plenty of useful functions for processing columns, including aggregation and transformation functions. Most of them you can find in the functions package (documentation here). Some others (binary functions in general) you can find directly in the Column object (documentation here). So, if you can use them, it's usually the best option. Note: don't forget the Window Functions.
2) Creating a UDF
If you can't complete your task with the built-in functions, you may consider defining a UDF (User Defined Function). They are useful when you can process each item of a column independently and you expect to produce a new column with the same number of rows as the original one (not an aggregated column). This approach is quite simple: first, you define a simple function, then you register it as a UDF, then you use it. Example:
def myFunc: (String => String) = { s => s.toLowerCase }
import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)
val newDF = df.withColumn("newCol", myUDF(df("oldCol")))
For more information, here's a nice article.
3) Using a UDAF
If your task is to create aggregated data, you can define a UDAF (User Defined Aggregation Function). I don't have a lot of experience with this, but I can point you to a nice tutorial:
https://ragrawal.wordpress.com/2015/11/03/spark-custom-udaf-example/
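To give a feel for what an aggregation function has to provide, here is a plain-Scala model of the getMax example. Spark's typed Aggregator class asks for the same four pieces (zero, reduce, merge, finish), but the trait below is a hypothetical stand-in written for this sketch, not Spark's class, and run only simulates partitions being aggregated separately and then combined.

```scala
object MaxAggregation {
  // Hypothetical stand-in for the shape of a user-defined aggregation.
  trait SimpleAggregator[In, Buf, Out] {
    def zero: Buf                           // empty buffer
    def reduce(buffer: Buf, value: In): Buf // fold one input value into a buffer
    def merge(a: Buf, b: Buf): Buf          // combine buffers from two partitions
    def finish(buffer: Buf): Out            // turn the final buffer into the result
  }

  // Maximum of Int values; None means that no value has been seen yet.
  object GetMax extends SimpleAggregator[Int, Option[Int], Option[Int]] {
    def zero: Option[Int] = None
    def reduce(buffer: Option[Int], value: Int): Option[Int] =
      Some(buffer.fold(value)(b => math.max(b, value)))
    def merge(a: Option[Int], b: Option[Int]): Option[Int] = (a, b) match {
      case (Some(x), Some(y)) => Some(math.max(x, y))
      case _                  => a.orElse(b)
    }
    def finish(buffer: Option[Int]): Option[Int] = buffer
  }

  // Aggregates each partition on its own, then merges the partial buffers.
  def run[In, Buf, Out](agg: SimpleAggregator[In, Buf, Out], partitions: Seq[Seq[In]]): Out = {
    val partials = partitions.map(p => p.foldLeft(agg.zero)(agg.reduce))
    agg.finish(partials.foldLeft(agg.zero)(agg.merge))
  }
}
```

The separate merge step is the part a simple loop over a column cannot express: partial results are computed independently and only combined at the end.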
4) Fall back to RDD processing
If you really can't use the options above, or if your processing task needs values from other rows to process one row and it's not an aggregation, then I think you would have to select the column you want and process it using the corresponding RDD. Example:
val singleColumnDF = df.select("column")
val myRDD = singleColumnDF.rdd
// process myRDD
So, these are the options I could think of. I hope it helps.
An easy example is given in the excellent documentation, where a whole section is dedicated to UDFs:
import org.apache.spark.sql._
import org.apache.spark.sql.functions.callUDF
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value"))