How to groupBy and perform data scaling over each group using MLlib in PySpark?

I have a dataset like the example below, and I am trying to group all rows by symbol and perform standard scaling within each group, so that in the end all my data is scaled per group. How can I do that with MLlib and PySpark? I could not find a single solution on the internet for this. Can anyone help?
+------+------------------+------------------+------------------+------------------+
|symbol| open| high| low| close|
+------+------------------+------------------+------------------+------------------+
| AVT| 4.115| 4.115| 4.0736| 4.0736|
| ZEC| 365.6924715181936| 371.9164684545918| 364.8854025324053| 369.5950712239761|
| ETH| 647.220769018717| 654.6370842160561| 644.8942258095359| 652.1231757197687|
| XRP|0.3856343600456335|0.4042970302356221|0.3662228285447956|0.4016658006619401|
| XMR|304.97650674864144|304.98649644294267|299.96970554155274| 303.8663243145598|
| LTC|321.32437862304715| 335.1872636382617| 320.9704201234651| 334.5057757774086|
| EOS| 5.1171| 5.1548| 5.1075| 5.116|
| BCH| 1526.839255299505| 1588.106037653013|1526.8392543926366|1554.8447136830328|
| DASH| 878.00000003| 884.03769206| 869.22000004| 869.22000004|
| BTC|17042.224796462127| 17278.87984139109|16898.509289685637|17134.611038665582|
| REP| 32.50162799| 32.501628| 32.41062673| 32.50162799|
| DASH| 858.98413357| 863.01413927| 851.07145059| 851.17051529|
| ETH| 633.1390884474979| 650.546984589714| 631.2674221381849| 641.4566047907362|
| XRP|0.3912300406160967|0.3915937383961073|0.3480682353334925|0.3488616679337076|
| EOS| 5.11| 5.1675| 5.0995| 5.1674|
| BCH|1574.9602789966184|1588.6004569127992| 1515.3| 1521.0|
| BTC| 17238.0199449088| 17324.83886467445|16968.291408828714| 16971.12960974206|
| LTC| 303.3999614441217| 317.6966006615225|302.40702519057584| 310.971265429805|
| REP| 32.50162798| 32.50162798| 32.345677| 32.345677|
| XMR| 304.1618444641083| 306.2720324372592|295.38042671416935| 295.520097663825|
+------+------------------+------------------+------------------+------------------+

I suggest you import the following:
import pyspark.sql.functions as f
Then you can compute the per-group statistics like this (not fully tested code):
stats_df = df.groupBy('symbol').agg(
    f.mean('open').alias('open_mean'),
    f.stddev('open').alias('open_stddev'))
This is the principle of how you would do it (you could instead use the min and max functions for MinMax scaling); then you just have to apply the standard scaling formula to your data using stats_df:
x' = (x - μ) / σ
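To then scale each row, a minimal untested sketch (handling only the open column here, and assuming the stats_df computed above) joins the statistics back onto the original dataframe and applies the formula:
import pyspark.sql.functions as f

# Join the per-symbol statistics back, then apply x' = (x - mean) / stddev to "open".
scaled_df = (df.join(stats_df, on='symbol')
    .withColumn('open_scaled', (f.col('open') - f.col('open_mean')) / f.col('open_stddev'))
    .drop('open_mean', 'open_stddev'))
scaled_df.show()
The same pattern extends to the high, low and close columns by adding one aggregate and one withColumn per column.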

Related

Average element wise List of Dense vectors in each row of a pyspark dataframe

I have a column in a pyspark dataframe that contains Lists of DenseVectors. Different rows might have Lists of different sizes but each vector in the list is of the same size. I want to calculate the element-wise average of each of those lists.
To be more concrete, let's say I have the following df:
|ID | Column |
| -------- | ------------------------------------------- |
| 0 | List(DenseVector(1,2,3), DenseVector(2,4,5))|
| 1 | List(DenseVector(1,2,3)) |
| 2 | List(DenseVector(2,2,3), DenseVector(2,4,5))|
What I would like to obtain is
|ID | Column |
| -------- | --------------------|
| 0 | DenseVector(1.5,3,4)|
| 1 | DenseVector(2,4,5) |
| 2 | DenseVector(2,3,4) |
Many thanks!
I don't think there is a direct pyspark function to do this. There is an ElementwiseProduct (which works differently from what is expected here) and others here. So, you could try to achieve this with a udf.
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

def elementwise_avg(vector_list):
    # Sum each component over all vectors in the list, then divide by the count.
    x = y = z = 0
    no_of_v = len(vector_list)
    for elem in vector_list:
        x += elem[0]
        y += elem[1]
        z += elem[2]
    return Vectors.dense(x / no_of_v, y / no_of_v, z / no_of_v)

elementwise_avg_udf = F.udf(elementwise_avg, VectorUDT())
df = df.withColumn("Elementwise Avg", elementwise_avg_udf("Column"))
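As a quick sanity check, a hypothetical toy dataframe matching the question's example can be built and run through the udf (assuming an active spark session):
from pyspark.ml.linalg import Vectors

# Toy dataframe mirroring the question's example.
df = spark.createDataFrame(
    [(0, [Vectors.dense(1, 2, 3), Vectors.dense(2, 4, 5)]),
     (1, [Vectors.dense(1, 2, 3)]),
     (2, [Vectors.dense(2, 2, 3), Vectors.dense(2, 4, 5)])],
    ["ID", "Column"])
df.withColumn("Elementwise Avg", elementwise_avg_udf("Column")).show(truncate=False)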

What is the best practice for grouping/clustering strings by similarity using Locality Sensitive Hashing (LSH)?

I'm experimenting with clustering over string data based on similarity (not pairwise). I have data as below:
+---------------+
|data |
+---------------+
|\a\b\c\d\e\f\ |
|\a\e\f\ |
|\b\c\e\f\ |
|\a\f\ |
|.... |
|\x\y\z\q\ |
|\x\y\z\a\ |
|\x\y\z\u\ |
|\x\y\z\i\o\ |
|\x\z\ |
+---------------+
The best approach I could find is to use Locality Sensitive Hashing (LSH) (including tokenization and bag-of-words) to generate a hash for every record based on similarity. The results were satisfying, but I still need to reduce the number of most-similar records by (re-)clustering the LSH outputs. Below are the LSH outputs, each a hash consisting of 8 strings joined with an underscore (_):
+------------------------------------------------------------------+
|LSH_output |
+------------------------------------------------------------------+
|20119135_8063284_2745546_7190819_8799344_480624_9176315_12270942 |
|20119135_8063284_2745546_7190819_8799344_480624_9176315_12270942 |
|1300777_12452151_2745546_15216392_888029_11668419_1698666_34291615|
|1300777_12452151_2745546_1008921_2198807_11668419_10255591_1253523|
|20119135_8063284_2745546_7190819_8799344_480624_9176315_12270942 |
|18083165_35936319_2745546_692046_8799344_480624_49177928_19901790 |
|6827074_1951067_2745546_4764851_2066255_206184_6494055_146137 |
|6827074_1951067_2745546_4999991_2066255_206184_6494055_146140 |
|18083165_10838208_2745546_692046_6599165_480624_1399232_15213423 |
|20119135_8063284_2745546_7190819_8799344_480624_9176315_12270942 |
|20119135_8063285_2745546_7190820_8799344_480624_9176315_12270942 |
+------------------------------------------------------------------+
After checking this answer concerning the state of the art, I tried the following approaches, but there are still lots of corner cases that could be grouped:
re-used LSH over LSH_output
used K-means
The most related post I found here is based on #Ryan Walker's solution as well as his comment:
The other vectorizers in sklearn (CountVectorizer, TFIDFVectorizer) will allow you to perform introspection, but they have a much bigger footprint; to fit them on large data sets, you can set max_features to a reasonable number.
I also tried to vectorize the LSH outputs for the above-mentioned approaches using:
from pyspark.ml.feature import RegexTokenizer
tokenizer = RegexTokenizer(inputCol="LSH_output", outputCol="tokens2", pattern="_")
tokenized = tokenizer.transform(out)
Results:
|LSH_output |tokens2 |
+-------------------------------------------------------------------+----------------------------------------------------------------------------+
|20119135_8063284_2745546_7190819_8799344_480624_9176315_12270942 |[20119135, 8063284, 2745546, 7190819, 8799344, 480624, 9176315, 12270942] |
|20119135_8063284_2745546_7190819_8799344_480624_9176315_12270942 |[20119135, 8063284, 2745546, 7190819, 8799344, 480624, 9176315, 12270942] |
|1300777_12452151_2745546_15216392_888029_11668419_1698666_34291615 |[1300777, 12452151, 2745546, 15216392, 888029, 11668419, 1698666, 34291615] |
|20119135_8063284_2745546_7190819_8799344_480624_9176315_12270942 |[20119135, 8063284, 2745546, 7190819, 8799344, 480624, 9176315, 12270942] |
|1300777_12452151_2745546_1008921_2198807_480624_10255591_1253523 |[1300777, 12452151, 2745546, 1008921, 2198807, 480624, 10255591, 1253523] |
|6827074_1951067_2745546_4764851_2066255_206184_6494055_146137 |[6827074, 1951067, 2745546, 4764851, 2066255, 206184, 6494055, 146137] |
|6827074_1951067_2745546_4764851_2066255_206184_6494055_146137 |[6827074, 1951067, 2745546, 4764851, 2066255, 206184, 6494055, 146137] |
|18083165_10838208_2745546_692046_6599165_480624_1399232_15213423 |[18083165, 10838208, 2745546, 692046, 6599165, 480624, 1399232, 15213423] |
|20119135_8063284_2745546_7190819_8799344_480624_9176315_12270942 |[20119135, 8063284, 2745546, 7190819, 8799344, 480624, 9176315, 12270942] |
|20119135_8063284_2745546_7190819_8799344_480624_9176315_12270942 |[20119135, 8063284, 2745546, 7190819, 8799344, 480624, 9176315, 12270942] |
+-------------------------------------------------------------------+----------------------------------------------------------------------------+
So far, the above-mentioned approaches couldn't cluster the results very well: the clustering algorithm creates a new cluster despite having already created a proper/close cluster for those observations (it groups records in which only 3 out of the 8 strings within LSH_output are similar into the same predicted cluster). If I re-use the LSH algorithm over LSH_output, it generates new hashes on top of LSH_output, say consisting of one string instead of 8, but it still couldn't group all similar versions.
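For reference, a rough, hypothetical sketch of feeding these tokens into a bag-of-words vectorizer and K-means (the vocabSize and k values below are made up, not taken from my actual experiments):
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import KMeans

# Bag-of-words over the underscore-separated tokens (hypothetical parameters).
cv = CountVectorizer(inputCol="tokens2", outputCol="features", vocabSize=262144)
vectorized = cv.fit(tokenized).transform(tokenized)

# K-means over the token-count vectors; k is a made-up value here.
kmeans = KMeans(featuresCol="features", predictionCol="cluster", k=20, seed=1)
clusters = kmeans.fit(vectorized).transform(vectorized)
clusters.select("LSH_output", "cluster").show(truncate=False)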
Note: I have included the dataset for further experiments.

pyspark - left join with random row matching the key

I am looking for a way to join 2 dataframes, but with random rows matching the key. This strange request is due to a very long calculation needed to generate positions.
I would like to do a kind of "random left join" in pyspark.
I have a dataframe with an areaID (string) and a count (int). The areaID values are unique (around 7k of them).
+--------+-------+
| areaID | count |
+--------+-------+
| A | 10 |
| B | 30 |
| C | 1 |
| D | 25 |
| E | 18 |
+--------+-------+
I have a second dataframe with around 1000 precomputed rows per areaID, with 2 position columns, x (float) and y (float). This dataframe is around 7 million rows.
+--------+------+------+
| areaID | x | y |
+--------+------+------+
| A | 0.0 | 0 |
| A | 0.1 | 0.7 |
| A | 0.3 | 1 |
| A | 0.1 | 0.3 |
| ... | | |
| E | 3.15 | 4.17 |
| E | 3.14 | 4.22 |
+--------+------+------+
I would like to end with a dataframe like:
+--------+------+------+
| areaID | x | y |
+--------+------+------+
| A | 0.1 | 0.32 | < row 1/10 - randomly picked where areaID are the same
| A | 0.0 | 0.18 | < row 2/10
| A | 0.09 | 0.22 | < row 3/10
| ... | | |
| E | 3.14 | 4.22 | < row 1/18
| ... | | |
+--------+------+------+
My first idea is to iterate over each areaID of the first dataframe, filter the second dataframe by that areaID, and sample count rows from it. The problem is that this is quite slow with 7k load/filter/sample passes.
The second approach is to do an outer join on areaID, then shuffle the dataframe (which seems quite complex), apply a rank, and keep rows where rank <= count, but I don't like having to load a lot of data only to filter it out afterwards.
I am wondering if there is a way to do it using a "random" left join? In that case, I would duplicate each row count times and apply it.
Many thanks in advance,
Nicolas
One can interpret the question as stratified sampling of the second dataframe where the number of samples to be taken from each subpopulation is given by the first dataframe.
There is a Spark function for stratified sampling: sampleBy.
from pyspark.sql import functions as F

df1 = ...
df2 = ...

# First calculate the fraction for each areaID based on the required number
# given in df1 and the number of rows for that areaID in df2.
fractionRows = df2.groupBy("areaID").agg(F.count("areaID").alias("count2")) \
    .join(df1, "areaID") \
    .withColumn("fraction", F.col("count") / F.col("count2")) \
    .select("areaID", "fraction") \
    .collect()
fractions = {f[0]: f[1] for f in fractionRows}

# Now run the stratified sampling.
df2.stat.sampleBy("areaID", fractions).show()
There is a caveat with this approach: as the sampling done by Spark is a random process, the exact number of rows given in the first dataframe will not always be met.
Edit: fractions > 1.0 are not supported by sampleBy. Looking at the Scala code of sampleBy shows why: the function is implemented as a filter with a random variable indicating whether to keep the row or not. Returning multiple copies of a single row therefore will not work.
A similar idea can be used to support fractions > 1.0: instead of using a filter, a udf is created that returns an array. The array contains one entry per copy of the row that should be contained in the result. After applying the udf, the array column is exploded and then dropped:
from pyspark.sql import functions as F
from pyspark.sql import types as T

fractions = {'A': 1.5, 'C': 0.5}

def ff(stratum, x):
    # One array entry per full copy of the row, plus possibly one more
    # depending on the fractional part and the random value x.
    fraction = fractions.get(stratum, 0.0)
    ret = []
    while fraction >= 1.0:
        ret.append("x")
        fraction = fraction - 1
    if x < fraction:
        ret.append("x")
    return ret

f = F.udf(ff, T.ArrayType(T.StringType())).asNondeterministic()

seed = 42
df2.withColumn("r", F.rand(seed)) \
    .withColumn("r", f("areaID", F.col("r"))) \
    .withColumn("r", F.explode("r")) \
    .drop("r") \
    .show()
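To sanity-check the result, a small hypothetical follow-up compares the number of sampled rows per areaID with the requested counts from df1 (exact matches are not guaranteed because of the randomness discussed above):
from pyspark.sql import functions as F

# Re-run the sampling, keep the result, and compare per-area counts with df1.
sampled = df2.withColumn("r", F.rand(seed)) \
    .withColumn("r", f("areaID", F.col("r"))) \
    .withColumn("r", F.explode("r")) \
    .drop("r")
sampled.groupBy("areaID").agg(F.count("*").alias("sampled_count")) \
    .join(df1, "areaID") \
    .show()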

Complex logic on a pyspark dataframe involving an existing value from the previous row as well as a previous-row value generated on the fly

I have to apply some logic to a Spark dataframe or RDD (preferably a dataframe) that requires generating two extra columns. The first generated column depends on other columns of the same row, and the second generated column depends on the first generated column of the previous row.
Below is a representation of the problem statement in tabular format. Columns A and B are available in the dataframe; columns C and D are to be generated.
A | B | C | D
------------------------------------
1 | 100 | default val | C1-B1
2 | 200 | D1-C1 | C2-B2
3 | 300 | D2-C2 | C3-B3
4 | 400 | D3-C3 | C4-B4
5 | 500 | D4-C4 | C5-B5
Here is the sample data
A | B | C | D
------------------------
1 | 100 | 1000 | 900
2 | 200 | -100 | -300
3 | 300 | -200 | -500
4 | 400 | -300 | -700
5 | 500 | -400 | -900
The only solution I can think of is to coalesce the input dataframe to 1 partition, convert it to an RDD, and then apply a Python function (containing all the calculation logic) with the mapPartitions API, as sketched below.
However, this approach may put all the load on one executor.
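A rough, hypothetical sketch of that single-partition mapPartitions idea (not from the original post; it assumes a dataframe df with lowercase columns a and b like the one built in the answer below, and a default of 1000 for the first row's C):
def compute_c_d(rows, default_c=1000):
    # Rows arrive ordered by 'a' inside the single partition; the first row's C
    # is the default value, afterwards C[i] = D[i-1] - C[i-1] and D[i] = C[i] - B[i].
    prev_c = prev_d = None
    for row in rows:
        c = default_c if prev_c is None else prev_d - prev_c
        d = c - row['b']
        prev_c, prev_d = c, d
        yield (row['a'], row['b'], c, d)

result = (df.coalesce(1)
            .sortWithinPartitions('a')
            .rdd
            .mapPartitions(compute_c_d)
            .toDF(['a', 'b', 'c', 'd']))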
Mathematically speaking, C2 = D1 - C1, and since D1 = C1 - B1, D1 - C1 becomes C1 - B1 - C1 => -B1. In other words, C of each row is simply -B of the previous row.
In pyspark, the lag function has a parameter called default; this should simplify your problem. Try this:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([(1,100),(2,200),(3,300),(4,400),(5,500)],['a','b'])
w=Window.orderBy('a')
df_lag =df.withColumn('c',F.lag((F.col('b')*-1),default=1000).over(w))
df_final = df_lag.withColumn('d',F.col('c')-F.col('b'))
Results:
df_final.show()
+---+---+----+----+
| a| b| c| d|
+---+---+----+----+
| 1|100|1000| 900|
| 2|200|-100|-300|
| 3|300|-200|-500|
| 4|400|-300|-700|
| 5|500|-400|-900|
+---+---+----+----+
If the operation is something more complex than subtraction, the same logic applies: fill column C with your default value, calculate D, then use lag to calculate C and recalculate D.
The lag() function may help you with that:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("A")
df1 = df1.withColumn("C", F.lit(1000))
df2 = (
df1
.withColumn("D", F.col("C") - F.col("B"))
.withColumn("C",
F.when(F.lag("C").over(w).isNotNull(),
F.lag("D").over(w) - F.lag("C").over(w))
.otherwise(F.col("C")))
.withColumn("D", F.col("C") - F.col("B"))
)

Calculating and aggregating data by date/time

I am working with a dataframe like this:
Id | TimeStamp | Event | DeviceId
1 | 5.2.2019 8:00:00 | connect | 1
2 | 5.2.2019 8:00:05 | disconnect| 1
I am using Databricks and pyspark for the ETL process. How can I calculate and create a dataframe like the one shown at the bottom? I have already tried using a UDF, but I could not find a way to make it work. I also tried iterating over the whole dataframe, but this is extremely slow.
I want to aggregate this dataframe to get a new dataframe that tells me, for each device, when and for how long it has been connected and disconnected:
Id | StartDateTime | EndDateTime | EventDuration |State | DeviceId
1 | 5.2.19 8:00:00 | 5.2.19 8:00:05| 0.00:00:05 |connected| 1
I think you can make this work with a window function and a few additional columns created with withColumn.
The code below should create the mapping for devices and build a table with the duration of each state. The only requirement is that connect and disconnect events alternate.
Then you can use the following code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import Window
import datetime
test_df = sqlContext.createDataFrame(
    [(1, datetime.datetime(2019, 2, 5, 8), "connect", 1),
     (2, datetime.datetime(2019, 2, 5, 8, 0, 5), "disconnect", 1),
     (3, datetime.datetime(2019, 2, 5, 8, 10), "connect", 1),
     (4, datetime.datetime(2019, 2, 5, 8, 20), "disconnect", 1)],
    ["Id", "TimeStamp", "Event", "DeviceId"])
# creation of a dataframe with 4 events for 1 device
test_df.show()
Output:
+---+-------------------+----------+--------+
| Id| TimeStamp| Event|DeviceId|
+---+-------------------+----------+--------+
| 1|2019-02-05 08:00:00| connect| 1|
| 2|2019-02-05 08:00:05|disconnect| 1|
| 3|2019-02-05 08:10:00| connect| 1|
| 4|2019-02-05 08:20:00|disconnect| 1|
+---+-------------------+----------+--------+
Then you can create the helper functions and the window:
# Create the window: per device, ordered by time descending.
my_window = Window.partitionBy("DeviceId").orderBy(col("TimeStamp").desc())
# Get the previous (i.e. next-in-time) timestamp and compute the duration in seconds.
get_prev_time = lag(col("TimeStamp"), 1).over(my_window)
time_diff = get_prev_time.cast("long") - col("TimeStamp").cast("long")

(test_df.withColumn("EventDuration", time_diff)
    .withColumn("EndDateTime", get_prev_time)  # apply the helper functions
    .withColumnRenamed("TimeStamp", "StartDateTime")  # rename according to your schema
    .withColumn("State", when(col("Event") == "connect", "connected").otherwise("disconnected"))  # create the state column
    .filter(col("EventDuration").isNotNull())  # drop the last events, which have no previous time
    .select("Id", "StartDateTime", "EndDateTime", "EventDuration", "State", "DeviceId")
    .show())
Output:
+---+-------------------+-------------------+-------------+------------+--------+
| Id| StartDateTime| EndDateTime|EventDuration| State|DeviceId|
+---+-------------------+-------------------+-------------+------------+--------+
| 3|2019-02-05 08:10:00|2019-02-05 08:20:00| 600| connected| 1|
| 2|2019-02-05 08:00:05|2019-02-05 08:10:00| 595|disconnected| 1|
| 1|2019-02-05 08:00:00|2019-02-05 08:00:05| 5| connected| 1|
+---+-------------------+-------------------+-------------+------------+--------+
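If you also need the total time each device spent in each state (the aggregation the question asks about), a short hypothetical follow-up could sum the per-event durations, assuming the chained expression above is assigned to a dataframe, say result_df, instead of ending in .show():
# Total connected/disconnected time in seconds per device (hypothetical follow-up).
# 'sum' here is pyspark.sql.functions.sum, available via the wildcard import above.
result_df.groupBy("DeviceId", "State") \
    .agg(sum("EventDuration").alias("TotalDurationSeconds")) \
    .show()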