Spark RDD: Sum one column without creating SQL DataFrame - scala

Is there an efficient way to sum the values of a column in a Spark RDD directly? I do not want to create a SQL DataFrame just for this.
I have an RDD of LabeledPoint in which each LabeledPoint uses a sparse vector representation. Suppose I am interested in the sum of the values of the first feature.
The following code does not work for me:
//lp_RDD is RDD[LabeledPoint]
var total = 0.0
for(x <- lp_RDD){
total += x.features(0)
}
The value of total after this loop is still 0.

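The loop does not work because its body runs on the executors, each of which updates its own serialized copy of total; the variable on the driver is never touched. What you want is to extract the first element from the feature vector using RDD.map and then sum them all up using DoubleRDDFunctions.sum:
val sum: Double = lp_RDD.map(_.features(0)).sum()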
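As a minimal, self-contained sketch of the map-then-sum approach, the following assumes a local SparkContext and a tiny made-up RDD of LabeledPoint. The object name SumFirstFeature, the helper firstFeatureSum and the sample data are illustrative and not part of the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object SumFirstFeature {
  // Sums feature 0 over all points; the work runs on the executors
  // and only the final Double comes back to the driver.
  def firstFeatureSum(points: RDD[LabeledPoint]): Double =
    points.map(_.features(0)).sum()

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for this illustration.
    val conf = new SparkConf().setAppName("sum-first-feature").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sparse points; the second one has no entry at index 0.
      val points = sc.parallelize(Seq(
        LabeledPoint(1.0, Vectors.sparse(3, Array(0, 2), Array(1.5, 4.0))),
        LabeledPoint(0.0, Vectors.sparse(3, Array(1), Array(7.0))),
        LabeledPoint(1.0, Vectors.sparse(3, Array(0), Array(2.5)))
      ))
      println(firstFeatureSum(points)) // 1.5 + 0.0 + 2.5, prints 4.0
    } finally sc.stop()
  }
}
```

Indexing a sparse vector at a position that has no stored entry yields 0.0, so missing entries simply do not contribute to the sum.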

Related

Converting from vector column to Double[Array] column in scala Spark

I have a data frame doubleSeq whose structure is as below
res274: org.apache.spark.sql.DataFrame = [finalFeatures: vector]
The first record of the column is as follows
res281: org.apache.spark.sql.Row = [[3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]]
I want to extract the double array
[3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]
from this -
doubleSeq.head(1)(0)(0)
gives
Any = [3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]
which does not solve my problem.
The question "Scala Spark - split vector column into separate columns in a Spark DataFrame" does not solve my issue either, but it is a pointer in the right direction.
So you want to extract a Vector from a Row, and turn it into an array of doubles.
The problem with your code is that the get method (and the implicit apply method you are using) returns an object of type Any. Indeed, a Row is a generic, unparametrized object and there is no way to know at compile time what types it contains. It's a bit like Lists in Java 1.4 and before. To solve it in Spark, you can use the getAs method, which you can parametrize with a type of your choosing.
In your situation, you seem to have a dataframe containing a vector (org.apache.spark.ml.linalg.Vector).
import org.apache.spark.ml.linalg._
val firstRow = df.head(1)(0) // or simply df.head
val vect : Vector = firstRow.getAs[Vector](0)
// or all in one: df.head.getAs[Vector](0)
// to transform into a regular array
val array : Array[Double] = vect.toArray
Note also that you can access columns by name like this:
val vect : Vector = firstRow.getAs[Vector]("finalFeatures")
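If the goal is an array column for every row rather than a single extracted value, one possible sketch is a small UDF that calls toArray on each vector. The helper withArrayColumn, the output column name finalFeaturesArray and the sample values are assumptions made for illustration; only the finalFeatures column name comes from the question.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object VectorColumnToArray {
  // Adds an array<double> column holding the contents of an ml Vector column.
  def withArrayColumn(df: DataFrame, vectorCol: String, arrayCol: String): DataFrame = {
    val toArray = udf((v: Vector) => v.toArray)
    df.withColumn(arrayCol, toArray(col(vectorCol)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("vector-to-array").master("local[2]").getOrCreate()
    import spark.implicits._
    try {
      // Stand-in for the doubleSeq DataFrame from the question (made-up values).
      val doubleSeq = Seq(
        Tuple1(Vectors.dense(3.0, 6.0, -0.78)),
        Tuple1(Vectors.dense(1.0, 2.0, 0.5))
      ).toDF("finalFeatures")

      val converted = withArrayColumn(doubleSeq, "finalFeatures", "finalFeaturesArray")
      converted.printSchema()

      // Array columns are read back from a Row as a Seq.
      val first = converted.head.getAs[Seq[Double]]("finalFeaturesArray")
      println(first.mkString(", "))
    } finally spark.stop()
  }
}
```

On Spark 3.0 and later, org.apache.spark.ml.functions.vector_to_array should give the same result without a hand-written UDF.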

BucketedRandomProjectionLSHModel approxNearestNeighbors function on entire dataframe

I'm trying to evaluate an entire DataFrame through the approxNearestNeighbors function of BucketedRandomProjectionLSHModel
What I expect:
A DataFrame containing the following information:
cookieId NN
id1 [id3, id5, id7]
id2 [id8, id9]
...
Input DataFrame (daily_content_transformed):
cookieID features(a sparse vector)
id1 sparse vector with features
id2 sparse vector with features
...
This works:
val key = Vectors.sparse(37599,
Array(1,4,6,7,16,57,81,104,166,225,290,692,763),
Array(1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0))
model.approxNearestNeighbors(daily_content_transformed, key, 20).show(20, false)
It returns a DataFrame with 21 rows. I could extract the cookieId column from this DataFrame and store it in the expected DataFrame.
Where I'm stuck:
Instead of hard-coding the key to retrieve the NN from, I want to run the method for every row in the input DataFrame and build the DataFrame described above.
Any help?
Edit in reply to first response:
After playing around with the suggestion to use approxSimilarityJoin instead of approxNearestNeighbors I came to the following conclusions:
the suggested solution works well for daily_content_transformed.limit(3000)
starting from daily_content_transformed.limit(5000), my Spark job terminates with a java.lang.OutOfMemoryError.
my input table contains approximately 800,000 unique cookieIDs (rows).
Although the suggested solution works for small inputs, scalability is an issue.
BucketedRandomProjectionLSHModel doesn't provide the required API. I think you can approximate it using approxSimilarityJoin:
import org.apache.spark.sql.functions.{struct, udf, collect_list, sort_array}
// Placeholders: choose values that suit your data
val threshold: Double = ??? // maximum distance for a pair to be kept
val n: Int = ??? // number of neighbors to keep per cookieID
def take(n: Int) = udf((xs: Seq[String]) => xs.take(n))
model
  .approxSimilarityJoin(
    daily_content_transformed.alias("left"),
    daily_content_transformed.alias("right"),
    threshold)
  .groupBy($"datasetA.cookieID" as "cookieId")
  // Collect pairs (dist, id)
  .agg(collect_list(struct($"distCol", $"datasetB.cookieID" as "id")) as "NN")
  // Sort by dist (ascending, so nearest first), drop dist and take n
  .withColumn("NN", take(n)(sort_array($"NN").getItem("id")))
This keeps at most n neighbors per cookieId.

Converting literal to RDD for subsequent Cartesian Product

I cannot find in the documentation how the result of the following:
val DIM_Key_Max = rddA.map(x => (x._1)).max
can subsequently be converted to a single-entry RDD for joining with another RDD, or rather for a cartesian product.
I cannot find this anywhere. Can anyone help?
max returns a single object. To turn it into a single entry RDD, use parallelize:
sc.parallelize(List(DIM_Key_Max))
This returns an RDD with a single entry that can be used e.g. as an argument to cartesian.
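A minimal sketch of that idea, assuming a local SparkContext and made-up contents for rddA and a second RDD called rddB here (the question does not show either); the helper name maxTimesOther is also illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MaxAsSingleEntryRdd {
  // Pairs the largest key of rddA with every element of rddB.
  def maxTimesOther(rddA: RDD[(Int, String)], rddB: RDD[String]): RDD[(Int, String)] = {
    // max is an action: it returns a plain Int to the driver.
    val dimKeyMax: Int = rddA.map(_._1).max()
    // Wrap the value in a one-element RDD so it can take part in cartesian.
    val maxRdd = rddA.sparkContext.parallelize(Seq(dimKeyMax))
    maxRdd.cartesian(rddB)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("max-cartesian").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rddA = sc.parallelize(Seq((1, "a"), (3, "b"), (5, "c")))
      val rddB = sc.parallelize(Seq("x", "y"))
      // Sorted on the driver so the print order is stable.
      maxTimesOther(rddA, rddB).collect().sorted.foreach(println) // (5,x) then (5,y)
    } finally sc.stop()
  }
}
```

When only a single value has to be attached to every element, a plain rddB.map(b => (dimKeyMax, b)) gives the same pairs without building a second RDD, because the driver-side value is shipped to the executors inside the closure.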
You are getting something wrong here. max will not return an RDD which can be joined with another RDD; it returns a plain value:
val rdd=sc.parallelize(Array((1,2),(3,4),(5,6))).map(x=>x._1).max
rdd
rdd: Int = 5
rdd.getClass
res2: Class[Int] = int

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I tried the map function, but I believe it loops through each row of a dataframe.
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala :
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
.setInputCols(df.columns)
.setOutputCols(df.columns.map(c => s"${c}_imputed"))
.setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
inputCols=df.columns,
outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: *).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] which maps from the column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
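Put together, the pieces described above can be sketched end to end as follows. The sketch assumes a local SparkSession and a small made-up DataFrame in which every column is numeric; the object name MeanImputeSketch and the helper imputeWithMeans are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.mean

object MeanImputeSketch {
  // Replaces nulls in every column with that column's mean.
  // Assumption: all columns of df are numeric.
  def imputeWithMeans(df: DataFrame): DataFrame = {
    // One aggregate per column, collected to the driver as a single Row.
    val means: Seq[Any] = df.select(df.columns.map(mean(_)): _*).first.toSeq
    // Pair each column name with its mean.
    val fillMap: Map[String, Any] = df.columns.zip(means).toMap
    df.na.fill(fillMap)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mean-impute").master("local[2]").getOrCreate()
    import spark.implicits._
    try {
      // Made-up data; None becomes a null cell.
      val df = Seq[(Option[Double], Option[Double])](
        (Some(1.0), Some(10.0)),
        (None, Some(20.0)),
        (Some(3.0), None)
      ).toDF("ColA", "ColB")

      // The null in ColA becomes 2.0 and the null in ColB becomes 15.0.
      imputeWithMeans(df).show()
    } finally spark.stop()
  }
}
```

mean skips nulls when aggregating, which is why the fill values here are computed from the two non-null entries of each column.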
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k,v in col_avgs.items() }
df.fillna( col_avgs ).show()
The four steps are:
1. Create the dictionary mean_dict mapping column names to the aggregate operation (mean).
2. Calculate the mean for each column, and save it as the dictionary col_avgs.
3. The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the avg( prefix and the closing parenthesis so that only the column name remains.
4. Fill the columns of the dataframe with the averages using col_avgs.

Get range of Dataframe Row

So I've loaded a dataframe from a parquet file. This dataframe now contains an unspecified number of columns. The first column is a Label, and the following are features.
I want to save each row in the dataframe as a LabeledPoint.
So far I'm thinking:
val labeledPoints: RDD[LabeledPoint] = df.map { row => LabeledPoint(row.getInt(0), Vectors.dense(row.getDouble(1), row.getDouble(2))) }
It's easy to list the column indexes by hand, but this does not scale to a large number of columns. I'd like to be able to load the entire row starting from index 1 (since index 0 is the label) into a dense vector.
Any ideas?
This should do the trick
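// Assumption: the mllib LabeledPoint and DenseVector are the ones in use
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row
// df.rdd gives an RDD[Row], so the result is an RDD[LabeledPoint]
df.rdd.map {
  row: Row =>
    // Collect every column after the label (index 0) as a feature
    val data = for (index <- 1 until row.length) yield row.getDouble(index)
    val vector = new DenseVector(data.toArray)
    new LabeledPoint(row.getInt(0), vector)
}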