Squared Sums with aggregateByKey PySpark

I have the data set a,
a = sc.parallelize([((1,2),(10,20,[1,3])),((1,2),(30,40,[1]))])
and I need the following:
(1,2) is going to be the key
Since I want to calculate the streaming standard deviation of the first two values, I need to evaluate the plain sums and sums of squares for each of these values. In other words, I need
sumx = (10 + 30), sumx^2 = (10^2 + 30^2) for the first value,
and
sumx = (20 + 40), sumx^2 = (20^2 + 40^2) for the second value.
For the final value (the lists), I just want to concatenate them.
The final result needs to be:
[((1,2), (40, 1000, 60, 2000, [1,3]))]
Here is my code:
a.aggregateByKey((0.0,0.0,0.0,0.0,[]),\
(lambda x,y: (x[0]+y[0],x[0]*x[0]+y[0]*y[0],x[1]+y[1],x[1]*x[1]+y[1]*y[1],x[2]+y[2])),\
(lambda rdd1,rdd2: (rdd1[0]+rdd2[0],rdd1[1]+rdd2[1],rdd1[2]+rdd1[2],rdd1[3]+rdd2[3],rdd1[4]+rdd2[4]))).collect()
Unfortunately it returns the following error:
"TypeError: unsupported operand type(s) for +: 'float' and 'list'"
Any thoughts?

You can use HiveContext to solve this:
from pyspark.sql.context import HiveContext
hivectx = HiveContext(sc)
a = sc.parallelize([((1,2),(10,20,[1,3])),((1,2),(30,40,[1]))])
# Convert this to a dataframe
b = a.toDF(['col1','col2'])
# Explode col2 into individual columns
c = b.map(lambda x: (x.col1,x.col2[0],x.col2[1],x.col2[2])).toDF(['col1','col21','col22','col23'])
c.registerTempTable('mydf')
sql = """
select col1,
sum(col21) as sumcol21,
sum(POW(col21,2)) as sum2col21,
sum(col22) as sumcol22,
sum(POW(col22,2)) as sum2col22,
collect_set(col23) as col23
from mydf
group by col1
"""
d = hivectx.sql(sql)
# Get back your original dataframe
e = d.map(lambda x:(x.col1,(x.sumcol21,x.sum2col21,x.sumcol22,x.sum2col22,[item for sublist in x.col23 for item in sublist]))).toDF(['col1','col2'])
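For reference, the aggregateByKey approach from the question also works once the sequence function treats x as the running accumulator and y as the incoming value; the TypeError comes from adding the value's list into a float slot of the zero tuple. A minimal sketch (note that plain concatenation keeps the duplicate 1, so the list comes out as [1, 3, 1], whereas the collect_set approach above deduplicates):
a = sc.parallelize([((1, 2), (10, 20, [1, 3])), ((1, 2), (30, 40, [1]))])
# Accumulator layout: (sum_x, sum_x^2, sum_y, sum_y^2, concatenated_list)
zero = (0.0, 0.0, 0.0, 0.0, [])
def seq_op(acc, value):
    x, y, lst = value
    return (acc[0] + x, acc[1] + x * x, acc[2] + y, acc[3] + y * y, acc[4] + lst)
def comb_op(a1, a2):
    return (a1[0] + a2[0], a1[1] + a2[1], a1[2] + a2[2], a1[3] + a2[3], a1[4] + a2[4])
a.aggregateByKey(zero, seq_op, comb_op).collect()
# [((1, 2), (40.0, 1000.0, 60.0, 2000.0, [1, 3, 1]))]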

Related

How to convert multidimensional array to dataframe using Spark in Scala?

This is my first time using Spark or Scala, so I am a newbie. I have a 2D array, and I need to convert it to a dataframe. The sample data is a joined table in the form of rectangle (double), point (a, b) also doubles, and a boolean of whether or not the point lies within the rectangle.

My end goal is to return a dataframe with the name of the rectangle and how many times it appears where ST_Contains is true. Since the query returns all the instances where it is true, I am simply trying to sort by rectangle (they are named as doubles) and count each occurrence. I put that in an array and then try to convert it to a dataset. Here is some of my code and what I have tried:
// Join two datasets (not my code)
spark.udf.register("ST_Contains",(queryRectangle:String, pointString:String)=>(HotzoneUtils.ST_Contains(queryRectangle, pointString)))
val joinDf = spark.sql("select rectangle._c0 as rectangle, point._c5 as point from rectangle,point where ST_Contains(rectangle._c0,point._c5)")
joinDf.createOrReplaceTempView("joinResult")
// MY CODE
// above join gets a view with rectangle, point, and true. so I need to loop through and count how many for each rectangle
//sort by rectangle asc first
joinDf.orderBy("rectangle")
var a = Array.ofDim[String](1, 2)
for (row <- joinDf.rdd.collect) {
  var count = 1
  var previous_r = -1.0
  var r = row.mkString(",").split(",")(0).toDouble
  var p = row.mkString(",").split(",")(1).toDouble
  var c = row.mkString(",").split(",")(2).toDouble
  if (previous_r != -1) {
    if (previous_r == r) {
      // add another to the count
      count = count + 1
    } else {
      // stick the result in an array
      a ++= Array(Array(previous_r.toString, count.toString))
    }
  }
  previous_r = r
}
//create dataframe from array and return it
val df = spark.createDataFrame(a).toDF()
But I keep getting this error:
inferred type arguments [Array[String]] do not conform to method createDataFrame's type parameter bounds [A <: Product]
val df = spark.createDataFrame(a).toDF()
I also tried it without the .toDF() portion and still no luck. I tried it without the createDataFrame command and just the .toDF, but that did not work either.
A few things here:
createDataFrame has multiple variations and the one you end up trying is probably:
def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
Array[String] is not a Seq[A] with A <: Product: String is not a Product.
The fastest approach I can think of is to go into a Seq and then a DataFrame:
import spark.implicits._
Array("some string")
  .toSeq
  .toDF
or parallelize the Array[String] into an RDD[String] and then create the DataFrame.
That second toDF() adds no value; createDataFrame already returns a DataFrame (if it worked).

Length of dataframe inside UDF function

I need to write a complex User Defined Function (UDF) that takes multiple columns as input. Something like:
val uudf = udf{(value: Int, lag: Int, cumsum_p: Double) => value + lag + cumsum_p} // actually a more complex function but let's make it simple
The third parameter cumsum_p is a cumulative sum of p, where p depends on the length of the group it is computed over, because this udf will then be used in a groupby.
I came up with this solution, which is almost OK:
val uudf = udf{(value: Int, lag: Int, cumsum_p: Double) => value + lag + cumsum_p}
val w = Window.orderBy($"sale_qty")
df.withColumn("needThat",
  uudf(col("sale_qty"),
    lead("sale_qty", 1).over(w),
    sum(lit(1/length_group)).over(w)
  )
).show()
The problem is that if I replace lit(1/length_group) with lit(1/count("sale_qty")), the created column now contains only 1 element, which leads to an error...
You should compute count("sale_qty") first:
val w = Window.orderBy($"sale_qty")
df
  .withColumn("cnt", count($"sale_qty").over())
  .withColumn("needThat",
    uudf(col("sale_qty"),
      lead("sale_qty", 1).over(w),
      sum(lit(1)/$"cnt").over(w)
    )
  ).show()

Calculate cosine similarity of document relevance

I have got the normalized TF-IDF and also the keyword RDD, and now want to compute the cosine similarity to find the relevance score for the document.
So I tried:
import re
from pyspark.mllib.feature import Normalizer, HashingTF, IDF

documentRdd = sc.textFile("documents.txt").flatMap(lambda l: re.split(r'[^\w]+',l))
keyWords = sc.textFile("keywords.txt").flatMap(lambda l: re.split(r'[^\w]+',l))
normalizer1 = Normalizer()
hashingTF = HashingTF()
tf = hashingTF.transform(documentRdd)
tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
normalizedtfidf=normalizer1.transform(tfidf)
Now I wanted to calculate the cosine similarity between normalizedtfidf and keyWords. So I tried using
x = Vectors.dense(normalizedtfidf)
y = Vectors.dense(keywordTF)
print(1 - x.dot(y)/(x.norm(2)*y.norm(2)), "is the relevance score")
But this throws the error:
TypeError: float() argument must be a string or a number
which means I am passing the wrong format. Any help is appreciated.
Update
I then tried:
x = Vectors.sparse(normalizedtfidf.count(),normalizedtfidf.collect())
y = Vectors.sparse(keywordTF.count(),keywordTF.collect())
but got
TypeError: Cannot treat type as a vector
as the error.
You got the errors because you are attempting to convert an RDD into Vectors forcibly.
You can achieve what you need without doing the conversion, by following these steps:
Join both your RDDs into one RDD. Note that I am assuming you do not have a unique index in both RDDs for joining.
# Adding index to both RDDs by row.
rdd1 = normalizedtfidf.zipWithIndex().map(lambda arg : (arg[1], arg[0]))
rdd2 = keywordTF.zipWithIndex().map(lambda arg : (arg[1], arg[0]))
# Join both RDDs.
rdd_joined = rdd1.join(rdd2)
Map the joined RDD with a function to calculate the cosine distance.
def cosine_dist(row):
    x = row[1][0]
    y = row[1][1]
    return (1 - x.dot(y)/(x.norm(2)*y.norm(2)))

res = rdd_joined.map(cosine_dist)
You can then use your results or run collect to see them.
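For example, a minimal usage sketch to inspect the scores (collect is fine only while the joined RDD is small):
# Pull the cosine-distance scores back to the driver and print them.
for score in res.collect():
    print(score, "is the relevance score")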

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I was trying out the Map function, but I believe it loops through each row of a dataframe.
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala:
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
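Note that Imputer only handles numeric columns, so if the DataFrame mixes types it may help to restrict the column lists first. A minimal PySpark sketch (the dtype filter is an assumption along the lines of the median example below; on older Spark versions integer columns may additionally need casting to double):
from pyspark.ml.feature import Imputer
# Keep only numeric columns; Imputer cannot impute string columns.
num_cols = [c for c, t in df.dtypes if t in ("int", "bigint", "float", "double")]
imputer = Imputer(
    inputCols=num_cols,
    outputCols=["{}_imputed".format(c) for c in num_cols],
    strategy="mean"  # "median" is also supported
)
df_imputed = imputer.fit(df).transform(df)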
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: _*).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] which maps from the column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2:
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill:
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k,v in col_avgs.items() }  # items() works in both Python 2 and 3
df.fillna( col_avgs ).show()
The four steps are:
Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
Calculate the mean for each column, and save it as the dictionary col_avgs
The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the parentheses out.
Fill the columns of the dataframe with the averages using col_avgs

Pyspark data frame aggregation with user defined function

How can I use groupby(key).agg() with a user defined function? Specifically, I need a list of all unique values per key (not the count).
The collect_set and collect_list functions (for unordered and ordered results respectively) can be used to post-process groupby results. Starting out with a simple Spark dataframe:
df = sqlContext.createDataFrame(
    [('first-neuron', 1, [0.0, 1.0, 2.0]),
     ('first-neuron', 2, [1.0, 2.0, 3.0, 4.0])],
    ("neuron_id", "time", "V"))
Let's say the goal is to return the longest length of the V list for each neuron (grouped by name)
from pyspark.sql import functions as F
grouped_df = df.groupby('neuron_id').agg(F.collect_list('V'))
We have now grouped the V lists into a list of lists. Since we wanted the longest length we can run
import numpy as np
import pyspark.sql.types as sq_types

len_udf = F.udf(lambda v_list: int(np.max([len(v) for v in v_list])),
                returnType=sq_types.IntegerType())
max_len_df = grouped_df.withColumn('max_len', len_udf('collect_list(V)'))
This adds a max_len column containing the maximum length of the V lists.
I found pyspark.sql.functions.collect_set(col) which does the job I wanted.
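For example, on the toy dataframe above, a minimal sketch (collect_set drops duplicates, while collect_list keeps them):
from pyspark.sql import functions as F
# One row per neuron_id with the distinct time values observed for that key.
df.groupby('neuron_id').agg(F.collect_set('time').alias('unique_times')).show()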