I was doing some scaling on the dataset below using Spark MLlib:
+---+--------------+
| id| features|
+---+--------------+
| 0|[1.0,0.1,-1.0]|
| 1| [2.0,1.1,1.0]|
| 0|[1.0,0.1,-1.0]|
| 1| [2.0,1.1,1.0]|
| 1|[3.0,10.1,3.0]|
+---+--------------+
You can find this dataset at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml-scaling/part-00000-cd03406a-cc9b-42b0-9299-1e259fdd9382-c000.gz.parquet
After performing standard scaling I get the result below:
+---+--------------+------------------------------------------------------------+
|id |features |stdScal_06f7a85f98ef__output |
+---+--------------+------------------------------------------------------------+
|0 |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1 |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968] |
|0 |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1 |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968] |
|1 |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902] |
+---+--------------+------------------------------------------------------------+
If I perform min/max scaling (setting val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")), I get the result below:
+---+--------------+-------------------------------+
| id| features|minMaxScal_21493d63e2bf__output|
+---+--------------+-------------------------------+
| 0|[1.0,0.1,-1.0]| [5.0,5.0,5.0]|
| 1| [2.0,1.1,1.0]| [7.5,5.5,7.5]|
| 0|[1.0,0.1,-1.0]| [5.0,5.0,5.0]|
| 1| [2.0,1.1,1.0]| [7.5,5.5,7.5]|
| 1|[3.0,10.1,3.0]| [10.0,10.0,10.0]|
+---+--------------+-------------------------------+
Please find the code below:
// loading dataset
val scaleDF = spark.read.parquet("/data/simple-ml-scaling")
// using standardScaler
import org.apache.spark.ml.feature.StandardScaler
val ss = new StandardScaler().setInputCol("features")
ss.fit(scaleDF).transform(scaleDF).show(false)
// using min/max scaler
import org.apache.spark.ml.feature.MinMaxScaler
val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
val fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()
I know the formulas for standardization and min/max scaling, but I am unable to understand how it arrives at the values in the third column. Please help me understand the math behind it.
MinMaxScaler in Spark works on each feature individually. From the documentation we have:
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.
$$ Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min $$
[...]
So each column in the features array will be scaled separately.
In this case, the MinMaxScaler is set to have a minimum value of 5 and a maximum value of 10.
The calculation for each column will thus be:
In the first column, the minimum value is 1.0 and the maximum is 3.0. We have 1.0 -> 5.0 and 3.0 -> 10.0, so 2.0 will therefore become ((2.0 - 1.0) / (3.0 - 1.0)) * (10.0 - 5.0) + 5.0 = 7.5.
In the second column, the min value is 0.1 and the maximum is 10.1. We have 0.1 -> 5.0 and 10.1 -> 10.0. The only other value in the column is 1.1 which will become ((1.1-0.1) / (10.1-0.1)) * (10.0 - 5.0) + 5.0 = 5.5 (following the normal min-max formula).
In the third column, the minimum value is -1.0 and the maximum is 3.0, so we know -1.0 -> 5.0 and 3.0 -> 10.0. 1.0 sits exactly in the middle and will therefore become 7.5.
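As a quick sanity check, here is a minimal plain-Scala sketch (not part of the question's code) that applies the formula above by hand to the first feature column, with min = 5 and max = 10 as in the question:
val col1 = Seq(1.0, 2.0, 1.0, 2.0, 3.0)    // first entry of each features vector
val (eMin, eMax) = (col1.min, col1.max)    // 1.0 and 3.0
val (lo, hi) = (5.0, 10.0)                 // from setMin(5) and setMax(10)
val rescaled = col1.map(e => (e - eMin) / (eMax - eMin) * (hi - lo) + lo)
// rescaled: List(5.0, 7.5, 5.0, 7.5, 10.0) -- the first entry of each output vector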
I am looking for a way to join 2 dataframes, but with random rows matching the key. This strange request is due to a very long calculation needed to generate the positions.
I would like to do a kind of "random left join" in pyspark.
I have a dataframe with an areaID (string) and a count (int). The areaID is unique (around 7k).
+--------+-------+
| areaID | count |
+--------+-------+
| A | 10 |
| B | 30 |
| C | 1 |
| D | 25 |
| E | 18 |
+--------+-------+
I have a second dataframe with around 1000 precomputed rows for each areaID, with 2 position columns x (float) and y (float). This dataframe is around 7 million rows.
+--------+------+------+
| areaID | x | y |
+--------+------+------+
| A | 0.0 | 0 |
| A | 0.1 | 0.7 |
| A | 0.3 | 1 |
| A | 0.1 | 0.3 |
| ... | | |
| E | 3.15 | 4.17 |
| E | 3.14 | 4.22 |
+--------+------+------+
I would like to end with a dataframe like:
+--------+------+------+
| areaID | x | y |
+--------+------+------+
| A | 0.1 | 0.32 | < row 1/10 - randomly picked where areaID are the same
| A | 0.0 | 0.18 | < row 2/10
| A | 0.09 | 0.22 | < row 3/10
| ... | | |
| E | 3.14 | 4.22 | < row 1/18
| ... | | |
+--------+------+------+
My first idea is to iterate over each areaID of the first dataframe, filter the second dataframe by that areaID and sample count rows from it. The problem is that this is quite slow, with 7k load/filter/sample operations.
The second approach is to do an outer join on areaID, shuffle the dataframe (which seems quite complex), apply a rank and keep the rows where rank <= count, but I don't like the idea of loading a lot of data only to filter it afterward.
I am wondering if there is a way to do it using a "random" left join? In that case, I'll duplicate each row count times and apply it.
Many thanks in advance,
Nicolas
One can interpret the question as stratified sampling of the second dataframe, where the number of samples to be taken from each subpopulation is given by the first dataframe.
There is a Spark function for stratified sampling: sampleBy.
from pyspark.sql import functions as F

df1 = ...
df2 = ...

# first calculate the fraction for each areaID based on the required number
# given in df1 and the number of rows for that areaID in df2
fractionRows = df2.groupBy("areaID").agg(F.count("areaID").alias("count2")) \
    .join(df1, "areaID") \
    .withColumn("fraction", F.col("count") / F.col("count2")) \
    .select("areaID", "fraction") \
    .collect()
fractions = {f[0]: f[1] for f in fractionRows}

# now run the stratified sampling
df2.stat.sampleBy("areaID", fractions).show()
There is a caveat with this approach: since the sampling done by Spark is a random process, the exact number of rows requested in the first dataframe will not always be met exactly.
Edit: fractions > 1.0 are not supported by sampleBy. Looking at the Scala code of sampleBy shows why: the function is implemented as a filter with a random variable indicating whether to keep the row or not. Returning multiple copies of a single row will therefore not work.
A similar idea can be used to support fractions > 1.0: instead of using a filter, a udf is created that returns an array. The array contains one entry per copy of the row that should be contained in the result. After applying the udf, the array column is exploded and then dropped:
from pyspark.sql import functions as F
from pyspark.sql import types as T

fractions = {'A': 1.5, 'C': 0.5}

def ff(stratum, x):
    fraction = fractions.get(stratum, 0.0)
    ret = []
    while fraction >= 1.0:
        ret.append("x")
        fraction = fraction - 1
    if x < fraction:
        ret.append("x")
    return ret

f = F.udf(ff, T.ArrayType(T.StringType())).asNondeterministic()
seed = 42

df2.withColumn("r", F.rand(seed)) \
    .withColumn("r", f("areaID", F.col("r"))) \
    .withColumn("r", F.explode("r")) \
    .drop("r") \
    .show()
I have a pyspark df in which I am using a combination of window + udf functions to calculate the standard deviation over historical business dates. The challenge is that my df is missing dates on which there is no transaction. How do I calculate the std dev including these missing dates, without adding them as additional rows to my df, so that the df does not grow and run out of memory?
Sample Table & Current output
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |2886.751|
Current Code
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType
windowSpec = Window.partitionBy("ID").orderBy("date")
workdaysUDF = F.udf(lambda date1, date2: int(np.busday_count(date2, date1)) if (date1 is not None and date2 is not None) else None, IntegerType()) # UDF to calculate difference between business days#
df = df.withColumn("date_dif", workdaysUDF(F.col('Date'), F.first(F.col('Date')).over(windowSpec))) #column calculating business date diff#
windowval = lambda days: Window.partitionBy('id').orderBy('date_dif').rangeBetween(-days, 0)
df = df.withColumn("std_dev", F.stddev("amount").over(windowval(6))) \
    .drop("date_dif")
Desired output, where the values for the dates missing between 24 and 29 March are substituted with 0:
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |4915.96 |
Please note that I am only showing the std dev for a single date for illustration; there would be a value for each row since I am using a rolling window function.
Any help would be greatly appreciated.
PS: Pyspark version is 2.2.0 at enterprise so I do not have flexibility to change the version.
Thanks,
VSG
I have to apply some logic to a Spark dataframe or rdd (preferably a dataframe) which requires generating two extra columns. The first generated column depends on other columns of the same row, and the second generated column depends on the first generated column of the previous row.
Below is a representation of the problem statement in tabular format. Columns A and B are available in the dataframe; columns C and D are to be generated.
A | B | C | D
------------------------------------
1 | 100 | default val | C1-B1
2 | 200 | D1-C1 | C2-B2
3 | 300 | D2-C2 | C3-B3
4 | 400 | D3-C3 | C4-B4
5 | 500 | D4-C4 | C5-B5
Here is the sample data (using 1000 as the default value for C1):
A | B | C | D
------------------------
1 | 100 | 1000 | 900
2 | 200 | -100 | -300
3 | 300 | -200 | -500
4 | 400 | -300 | -700
5 | 500 | -400 | -900
The only solution I can think of is to coalesce the input dataframe to 1 partition, convert it to an rdd, and then apply a Python function (containing all the calculation logic) via the mapPartitions API.
However, this approach would put all the load on a single executor.
Looking at it mathematically: C2 = D1 - C1, and D1 = C1 - B1, so D1 - C1 becomes C1 - B1 - C1 = -B1.
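The same reasoning holds for every row i > 1, using only the definitions from the question's table:
$$ C_i = D_{i-1} - C_{i-1} = (C_{i-1} - B_{i-1}) - C_{i-1} = -B_{i-1}, \qquad D_i = C_i - B_i = -(B_{i-1} + B_i) $$
which matches the sample data, e.g. C3 = -B2 = -200 and D3 = -(B2 + B3) = -500. So C is just the negated B of the previous row, which is exactly what the lag below computes.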
In pyspark, the lag window function has a parameter called default; this should simplify your problem. Try this:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([(1, 100), (2, 200), (3, 300), (4, 400), (5, 500)], ['a', 'b'])
w = Window.orderBy('a')
df_lag = df.withColumn('c', F.lag(F.col('b') * -1, default=1000).over(w))
df_final = df_lag.withColumn('d', F.col('c') - F.col('b'))
Results:
df_final.show()
+---+---+----+----+
| a| b| c| d|
+---+---+----+----+
| 1|100|1000| 900|
| 2|200|-100|-300|
| 3|300|-200|-500|
| 4|400|-300|-700|
| 5|500|-400|-900|
+---+---+----+----+
If the operation is something more complex than subtraction, then the same logic applies: fill column C with your default value, calculate D, then use lag to calculate C, and recalculate D.
The lag() function may help you with that:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("A")
df1 = df1.withColumn("C", F.lit(1000))
df2 = (
    df1
    .withColumn("D", F.col("C") - F.col("B"))
    .withColumn("C",
                F.when(F.lag("C").over(w).isNotNull(),
                       F.lag("D").over(w) - F.lag("C").over(w))
                 .otherwise(F.col("C")))
    .withColumn("D", F.col("C") - F.col("B"))
)
I'm doing some testing of Spark decimal types for currency measures and am seeing some odd precision results when I set the scale and precision as shown below. I want to be sure that I won't have any data loss during calculations, but the example below is not reassuring. Can anyone tell me why this is happening with Spark SQL? I am currently on version 2.3.0.
val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val"""
spark.sql(sql).show
This returns
+----------------+
| val|
+----------------+
|0.33333300000000|
+----------------+
This is a currently open issue; see SPARK-27089. The suggested workaround is to adjust the setting below. I validated that the SQL statement works as expected with this setting set to false.
spark.sql.decimalOperations.allowPrecisionLoss=false
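For example, in a spark-shell session the workaround could be applied like this (a minimal sketch: setting the config at runtime and re-running the query from the question):
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql(sql).show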
Use BigDecimal to avoid precision loss. See Double vs. BigDecimal?
example:
scala> val df = Seq(BigDecimal("0.03"),BigDecimal("8.20"),BigDecimal("0.02")).toDS
df: org.apache.spark.sql.Dataset[scala.math.BigDecimal] = [value: decimal(38,18)]
scala> df.select($"value").show
+--------------------+
| value|
+--------------------+
|0.030000000000000000|
|8.200000000000000000|
|0.020000000000000000|
+--------------------+
Using BigDecimal:
scala> df.select($"value" + BigDecimal("0.1")).show
+-------------------+
| (value + 0.1)|
+-------------------+
|0.13000000000000000|
|8.30000000000000000|
|0.12000000000000000|
+-------------------+
If you don't use BigDecimal, there will be a loss of precision. In this case, 0.1 is a double:
scala> df.select($"value" + lit(0.1)).show
+-------------------+
| (value + 0.1)|
+-------------------+
| 0.13|
| 8.299999999999999|
|0.12000000000000001|
+-------------------+
I have a dataframe (inputDF) with 100 columns of decimal data type. I want to create a LabeledPoint using this dataframe (inputDF).
I am able to create the LabeledPoint by hardcoding each column index of the dataframe, which is not an optimal solution.
val outputLabelPoint = inputDF.map(x => new LabeledPoint(0.0, Vectors.dense(x.getAs[Double](0),x.getAs[Double](1),x.getAs[Double](2),x.getAs[Double](3), ...))
How can I create a LabeledPoint from the DataFrame directly, without hardcoding each column index of the dataframe?
Help would be much appreciated.
VectorAssembler may be the transformer you are looking for.
VectorAssembler is a transformer that combines a given list of columns into a single vector column.
BEFORE

 id | hour | mobile | userFeatures     | clicked
----|------|--------|------------------|---------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0

AFTER

 id | hour | mobile | userFeatures     | clicked | features
----|------|--------|------------------|---------|-----------------------------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]
Refer to the example in the Spark Doc for more details.
If you want more help, please describe your column names and how they are generated.
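For the 100-column case in the question, a minimal sketch could look like the following (assumptions: every column of inputDF is numeric and should go into the vector, the output column is named features, and the label is fixed at 0.0 as in the question's snippet):
import org.apache.spark.ml.feature.{LabeledPoint, VectorAssembler}
import org.apache.spark.ml.linalg.Vector

// assemble every column of inputDF into one vector column, without hardcoding indices
val assembler = new VectorAssembler()
  .setInputCols(inputDF.columns)
  .setOutputCol("features")

val assembled = assembler.transform(inputDF)

// build LabeledPoints from the assembled vectors
val outputLabelPoint = assembled.select("features").rdd.map { row =>
  LabeledPoint(0.0, row.getAs[Vector]("features"))
}
If the older mllib LabeledPoint is needed instead of the ml one, the assembled vectors would have to be converted, e.g. via org.apache.spark.mllib.linalg.Vectors.fromML.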