How to standardize a column in PySpark without using StandardScaler? - pyspark

Seems like this should work, but I'm getting errors:
mu = mean(df[input])
sigma = stddev(df[input])
dft = df.withColumn(output, (df[input]-mu)/sigma)
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is
empty, and '`user`' is not an aggregate function. Wrap
'(((CAST(`sum(response)` AS DOUBLE) - avg(`sum(response)`)) /
stddev_samp(CAST(`sum(response)` AS DOUBLE))) AS `scaled`)' in
windowing function(s) or wrap '`user`' in first() (or first_value) if
you don't care which value you get.;;\nAggregate [user#0,
sum(response)#26L, ((cast(sum(response)#26L as double) -
avg(sum(response)#26L)) / stddev_samp(cast(sum(response)#26L as
double))) AS scaled#46]\n+- AnalysisBarrier\n +- Aggregate
[user#0], [user#0, sum(cast(response#3 as bigint)) AS
sum(response)#26L]\n +- Filter item_id#1 IN
(129,130,131,132,133,134,135,136,137,138)\n +-
Relation[user#0,item_id#1,response_value#2,response#3,trait#4,response_timestamp#5]
csv\n"
I'm not sure what's going on with this error message.

The error occurs because mean and stddev return unevaluated aggregate Column expressions, not numbers, and Spark will not let you mix aggregates with non-aggregated columns outside of a groupBy or a window. Using collect() is not a good solution in general, and you will see that it will not scale as your data grows.
If you don't want to use StandardScaler, a better way is to use a Window to compute the mean and standard deviation.
Borrowing the same example from StandardScaler in Spark not working as expected:
import numpy as np
from pyspark.sql.functions import col, mean, stddev
from pyspark.sql import Window

df = spark.createDataFrame(
    np.array(range(1, 10, 1)).reshape(3, 3).tolist(),
    ["int1", "int2", "int3"]
)
df.show()
#+----+----+----+
#|int1|int2|int3|
#+----+----+----+
#|   1|   2|   3|
#|   4|   5|   6|
#|   7|   8|   9|
#+----+----+----+
Suppose you wanted to standardize the column int2:
input_col = "int2"
output_col = "int2_scaled"
w = Window.partitionBy()
mu = mean(input_col).over(w)
sigma = stddev(input_col).over(w)
df.withColumn(output_col, (col(input_col) - mu)/(sigma)).show()
#+----+----+----+-----------+
#|int1|int2|int3|int2_scaled|
#+----+----+----+-----------+
#|   1|   2|   3|       -1.0|
#|   7|   8|   9|        1.0|
#|   4|   5|   6|        0.0|
#+----+----+----+-----------+
If you wanted to use the population standard deviation as in the other example, replace pyspark.sql.functions.stddev with pyspark.sql.functions.stddev_pop().

Fortunately, I was able to find code that works:
summary = df.select([mean(input).alias('mu'), stddev(input).alias('sigma')])\
            .collect().pop()
dft = df.withColumn(output, (df[input] - summary.mu) / summary.sigma)

Related

Scala (NOT pyspark) map linear regression coefficients to feature names (categorical and continuous)

I have a dataframe in Scala that looks like this:
df.show
+---+-----+-------------------+--------+------------------+--------+------+------------+-------------+
| id|group| normalized_amount|query_id| y| y1|group1|groupIndexed| groupEncoded|
+---+-----+-------------------+--------+------------------+--------+------+------------+-------------+
| 1| B| 0.22874172014806| 1| 0.317739988492575| 0| B| 1.0|(2,[1],[1.0])|
| 2| A| -1.42432215217563| 2| -1.32008967486074| 0| C| 0.0|(2,[0],[1.0])|
| 3| B| -2.03644548423379| 3| -1.65740392834359| 0| B| 1.0|(2,[1],[1.0])|
| 4| B| 0.425753803902096| 4|-0.127591370989296| 0| C| 0.0|(2,[0],[1.0])|
| 5| A| 0.521050829955076| 5| 0.824285664580579| 1| A| 2.0| (2,[],[])|
| 6| A|-0.0416682439998418| 6| 0.321350404322885| 1| C| 0.0|(2,[0],[1.0])|
| 7| A| -1.2787327462978| 7| -0.88099379032367| 0| A| 2.0| (2,[],[])|
| 8| A| 0.431780409975322| 8| 0.575249966796747| 1| C| 0.0|(2,[0],[1.0])|
And I'm performing a linear regression of y on group1 (a categorical variable of 3 categories) and normalized_amount (a continuous variable) as follows
var assembler = new VectorAssembler().setInputCols(Array("groupEncoded", "normalized_amount")).setOutputCol("features")
val dfFeatures = assembler.transform(df)
var lr = new LinearRegression()
var lrModel = lr.fit(dfFeatures)
var lrPrediction = lrModel.transform(dfFeatures)
I can access coefficients and standard errors as follows
lrModel.intercept
lrModel.coefficients //model coefficient estimates (not intercept)
lrModel.summary.coefficientStandardErrors //standard error of intercept and coefficients, not sure in which order
My questions are
how can I figure out which feature corresponds to which coefficient estimate (for categorical values, I need to figure out the coefficient of each category)? Same with standard errors?
how can I choose which category to "leave out" as the reference category?
how to perform a linear regression with no intercept?
I've seen some answers to similar questions, but they are all in pyspark and not in scala, and I'm only using scala
With a dataframe like your transformed df (the one that includes the prediction) and the fitted regression model, you can access the attributes of the VectorAssembler field. This code is from Databricks; I slightly adapted it for a LinearRegressionModel instead of a Pipeline. Note that you can choose whether you want intercept estimation or not:
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.DataFrame

val lrToFit: LinearRegression = ???
lrToFit.setFitIntercept(false)

// With this dataframe as your transformed df that includes the prediction
val df: DataFrame = ???
val lr: LinearRegressionModel = ???
val schema = df.schema

// Using the schema, the attributes of the VectorAssembler (features) can be extracted
val features = AttributeGroup.fromStructField(schema(lr.getFeaturesCol)).attributes.get.map(_.name.get)

val featureNames: Array[String] = if (lr.getFitIntercept) {
  Array("(Intercept)") ++ features
} else {
  features
}

val coefficients = lr.coefficients.toArray
val coeffs = if (lr.getFitIntercept) {
  Array(lr.intercept) ++ coefficients // keep the intercept aligned with "(Intercept)"
} else {
  coefficients
}

featureNames.zip(coeffs).foreach { case (feature, coeff) =>
  println(s"$feature\t$coeff")
}
This method is useful if you load a pretrained model, because in that case you might not know the order of the features in the VectorAssembler transformation. I think you will need to select the reference category manually; one way to influence it is sketched below.
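Regarding the reference category: as a hedged sketch of my own (not from the original answer, assuming Spark 2.3+), you can influence which category is left out by controlling the label order in StringIndexer with setStringOrderType, because OneHotEncoderEstimator drops the last index when dropLast is true. The column names below follow the question's dataframe:
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

// Sketch only: choose an ordering so the category you want as the reference
// receives the highest index; with dropLast = true that index is the one dropped.
val indexer = new StringIndexer()
  .setInputCol("group1")
  .setOutputCol("groupIndexed")
  .setStringOrderType("alphabetDesc") // for labels A/B/C this makes "A" the last index

val encoder = new OneHotEncoderEstimator() // called OneHotEncoder in Spark 3.x
  .setInputCols(Array("groupIndexed"))
  .setOutputCols(Array("groupEncoded"))
  .setDropLast(true) // the dropped index acts as the reference category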

How to use countDistinct using a window function in Spark/Scala?

I need to use a window function that is partitioned by 2 columns and do a distinct count on the 3rd column, with that as the 4th column. I can do a count without any issues, but using distinct count is throwing an exception -
org.apache.spark.sql.AnalysisException: Distinct window functions are not supported:
Is there any workaround for this?
A previous answer suggested two possible techniques: approximate counting and size(collect_set(...)). Both have problems.
If you need an exact count, which is the main reason to use COUNT(DISTINCT ...) in big data, approximate counting will not do. Also, the actual error rates of approximate counting can vary quite significantly for small data.
size(collect_set(...)) may cause a substantial slowdown in processing of big data because it uses a mutable Scala HashSet, which is a pretty slow data structure. In addition, you may occasionally get strange results, e.g., if you run the query over an empty dataframe, because size(null) produces the counterintuitive -1. Spark's native distinct counting runs faster for a number of reasons, the main one being that it doesn't have to produce all the counted data in an array.
The typical approach to solving this problem is with a self-join. You group by whatever columns you need, compute the distinct count or any other aggregate function that cannot be used as a window function, and then join back to your original data.
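For concreteness, here is a minimal sketch of that pattern (my own illustration, with hypothetical column names part1, part2 and value):
import org.apache.spark.sql.functions.countDistinct

// Aggregate once per group, then join the result back so every original row
// carries the distinct count, much like a window function would.
val counts = df
  .groupBy("part1", "part2")
  .agg(countDistinct("value").as("distinct_value"))

val result = df.join(counts, Seq("part1", "part2"))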
Use the approx_count_distinct function (or collect_set combined with size) on a window to mimic countDistinct functionality.
Example:
df.show()
//+---+---+---+
//|  i|  j|  k|
//+---+---+---+
//|  1|  a|  c|
//|  2|  b|  d|
//|  1|  a|  c|
//|  2|  b|  e|
//+---+---+---+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("i", "j")
df.withColumn("cnt", size(collect_set("k").over(windowSpec))).show()
//or using approx_count_distinct
df.withColumn("cnt", approx_count_distinct("k").over(windowSpec)).show()
//+---+---+---+---+
//|  i|  j|  k|cnt|
//+---+---+---+---+
//|  2|  b|  d|  2|
//|  2|  b|  e|  2|
//|  1|  a|  c|  1| //as c value repeated for 1,a partition
//|  1|  a|  c|  1|
//+---+---+---+---+
Trying to improve Sim's answer, if you want to do this:
//val newColName: String = ...
//val colToCount: Column = ...
//val aggregatingCols: Seq[Column] = ...
df.withColumn(newColName, countDistinct(colToCount).over(Window.partitionBy(aggregatingCols: _*)))
You must instead do this:
//val aggregatingCols: Seq[String] = ...
df.groupBy(aggregatingCols.head, aggregatingCols.tail:_*)
.agg(countDistinct(colToCount).as(newColName))
.select(newColName, aggregatingCols:_*)
.join(df, usingColumns = aggregatingCols)
This will return the number of distinct elements in the partition, using the dense_rank() function. When we sum the ascending and descending ranks, we always get the total number of distinct elements + 1:
dense_rank().over(Window.partitionBy("i").orderBy(c.asc)) + dense_rank().over(Window.partitionBy("i").orderBy(c.desc)) - 1
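For example, here is a quick sketch of my own (not part of the original answer) applying the trick to the i/j/k dataframe shown earlier:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

// Distinct count of "k" per (i, j) partition via the two-sided dense_rank trick.
df.withColumn("cnt",
  dense_rank().over(Window.partitionBy("i", "j").orderBy(col("k").asc)) +
  dense_rank().over(Window.partitionBy("i", "j").orderBy(col("k").desc)) - 1
).show()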

Apply UDF function to Spark window where the input parameter is a list of all column values in range

I would like to build a moving average on each row in a window. Let's say -10 rows. BUT if there are fewer than 10 rows available I would like to insert a 0 into the resulting row -> new column.
So what I am trying to achieve is using a UDF in an aggregate window with an input parameter List() (or whatever superclass) which has the values of all rows available.
Here's a code example that doesn't work:
val w = Window.partitionBy("id").rowsBetween(-10, +0)
dfRetail2.withColumn("test", udftestf(dfRetail2("salesMth")).over(w))
Expected output: List(1,2,3,4) if no more rows are available, and take this as the input parameter for the UDF function. The UDF function should return a calculated value, or 0 if fewer than 10 rows are available.
The above code terminates with: Expression 'UDF(salesMth#152L)' not supported within a window function.;;
You can use Spark's built-in Window functions along with when/otherwise for your specific condition, without the need for a UDF/UDAF. For simplicity, the sliding-window size is reduced to 4 in the following example with dummy data:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df = (1 to 2).flatMap(i => Seq.tabulate(8)(j => (i, i * 10.0 + j))).
  toDF("id", "amount")

val slidingWin = 4
val winSpec = Window.partitionBy($"id").rowsBetween(-(slidingWin - 1), 0)

df.
  withColumn("slidingCount", count($"amount").over(winSpec)).
  withColumn("slidingAvg", when($"slidingCount" < slidingWin, 0.0).
    otherwise(avg($"amount").over(winSpec))
  ).show
// +---+------+------------+----------+
// | id|amount|slidingCount|slidingAvg|
// +---+------+------------+----------+
// |  1|  10.0|           1|       0.0|
// |  1|  11.0|           2|       0.0|
// |  1|  12.0|           3|       0.0|
// |  1|  13.0|           4|      11.5|
// |  1|  14.0|           4|      12.5|
// |  1|  15.0|           4|      13.5|
// |  1|  16.0|           4|      14.5|
// |  1|  17.0|           4|      15.5|
// |  2|  20.0|           1|       0.0|
// |  2|  21.0|           2|       0.0|
// |  2|  22.0|           3|       0.0|
// |  2|  23.0|           4|      21.5|
// |  2|  24.0|           4|      22.5|
// |  2|  25.0|           4|      23.5|
// |  2|  26.0|           4|      24.5|
// |  2|  27.0|           4|      25.5|
// +---+------+------------+----------+
Per remark in the comments section, I'm including a solution via UDF below as an alternative:
def movingAvg(n: Int) = udf { (ls: Seq[Double]) =>
  val (avg, count) = ls.takeRight(n).foldLeft((0.0, 1)) {
    case ((a, i), next) => (a + (next - a) / i, i + 1)
  }
  if (count <= n) 0.0 else avg // Expand/Modify this for specific requirement
}

// To apply the UDF:
df.
  withColumn("average", movingAvg(slidingWin)(collect_list($"amount").over(winSpec))).
  show
Note that unlike sum or count, collect_list ignores rowsBetween() and generates partitioned data that can potentially be very large when passed to the UDF (hence the need for takeRight()). If the computed Window sum and count are sufficient for your specific requirement, consider passing them to the UDF instead.
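For instance, here is a minimal sketch of that idea (my own, not from the original answer, reusing slidingWin, winSpec and the imports from above):
// Pass only the window sum and count to the UDF instead of the collected list.
val avgOrZero = udf { (total: Double, cnt: Long) =>
  if (cnt < slidingWin) 0.0 else total / cnt
}

df.
  withColumn("slidingSum", sum($"amount").over(winSpec)).
  withColumn("slidingCount", count($"amount").over(winSpec)).
  withColumn("average", avgOrZero($"slidingSum", $"slidingCount")).
  show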
In general, especially if the data at hand is already in DataFrame format, using the built-in DataFrame API will perform and scale better than a user-defined UDF/UDAF, because it lets Spark's execution engine apply its optimizations. You might be interested in reading this article re: the advantages of the DataFrame/Dataset API over UDF/UDAF.

Spark withColumn working for modifying column but not adding a new one

Scala 2.12 and Spark 2.2.1 here. I have the following code:
myDf.show(5)
myDf.withColumn("rank", myDf("rank") * 10)
myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
When I run this, in the logs I see:
+----+----+-----------+----+
|fizz|buzz|rizzrankrid|rank|
+----+----+-----------+----+
|   2|   5| 1440370637| 128|
|   2|   5| 2114144780|1352|
|   2|   8|  199559784|3233|
|   2|   5| 1522258372| 895|
|   2|   9|  918480276| 882|
+----+----+-----------+----+
And now:
+----+----+-----------+-----+
|fizz|buzz|rizzrankrid| rank|
+----+----+-----------+-----+
|   2|   5| 1440370637| 1280|
|   2|   5| 2114144780|13520|
|   2|   8|  199559784|32330|
|   2|   5| 1522258372| 8950|
|   2|   9|  918480276| 8820|
+----+----+-----------+-----+
So, interesting:
The first withColumn works, transforming each row's rank value by multiplying it by 10
However, the second withColumn fails, which just adds the current date/time to all rows as a new lastRanOn column
What do I need to do to get the lastRanOn column addition working?
Your example is probably too simple, because modifying rank should also not work.
withColumn does not update the DataFrame; it creates a new DataFrame.
So you must do:
// if myDf is a var
myDf.show(5)
myDf = myDf.withColumn("rank", myDf("rank") * 10)
myDf = myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
or for example:
myDf.withColumn("rank", myDf("rank") * 10).withColumn("lastRanOn", current_date()).show(5)
Only then will you have the new column added, after reassigning the new DataFrame reference.

Pyspark Dataframe Apply function to two columns

Say I have two PySpark DataFrames df1 and df2.
df1= 'a'
1
2
5
df2= 'b'
3
6
And I want to find the closest df2['b'] value for each df1['a'], and add the closest values as a new column in df1.
In other words, for each value x in df1['a'], I want to find a y that achieves min(abs(x-y)) over all y in df2['b'] (note: we can assume that there is only one y that achieves the minimum distance), and the result would be
'a' 'b'
1 3
2 3
5 6
I tried the following code to create a distance matrix first (before finding the values achieving the minimum distance):
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def dict(x, y):
    return abs(x - y)

udf_dict = udf(dict, IntegerType())
sql_sc = SQLContext(sc)
udf_dict(df1.a, df2.b)
which gives
Column<PythonUDF#dist(a,b)>
Then I tried
sql_sc.CreateDataFrame(udf_dict(df1.a, df2.b))
which runs forever without giving error/output.
My questions are:
As I'm new to Spark, is my way of constructing the output DataFrame efficient? (My way would be creating a distance matrix for all the a and b values first, and then finding the minimum one.)
What's wrong with the last line of my code and how to fix it?
Starting with your second question: you can only apply a udf to an existing dataframe. I think you were thinking of something like this:
>>> df1.join(df2).withColumn('distance', udf_dict(df1.a, df2.b)).show()
+---+---+--------+
|  a|  b|distance|
+---+---+--------+
|  1|  3|       2|
|  1|  6|       5|
|  2|  3|       1|
|  2|  6|       4|
|  5|  3|       2|
|  5|  6|       1|
+---+---+--------+
But there is a more efficient way to apply this distance, by using the internal abs (min is imported here as well, so that the later aggregation uses Spark's min rather than Python's builtin):
>>> from pyspark.sql.functions import abs, min
>>> df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
Then you can find matching numbers by calculating:
>>> distances = df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
>>> min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
>>> distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()
+---+---+
|  a|  b|
+---+---+
|  5|  6|
|  1|  3|
|  2|  3|
+---+---+