pyspark PipelinedRDD fit to DataFrame column - pyspark

First of all, I'm new to the Python and Spark world.
I have homework from university, but I'm stuck in one place.
I clustered my data, and now I have my clusters in a PipelinedRDD
after this:
cluster = featurizedScaledRDD.map(lambda r: kmeansModelMllib.predict(r))
# cluster contains: [2,1,2,0,0,0,1,2]
Now I have cluster and my DataFrame dataDf, and I need to add cluster as a new column to dataDf.
I have:             I need:
+---+---+---+       +---+---+---+-------+
| x | y | z |       | x | y | z |cluster|
+---+---+---+       +---+---+---+-------+
| 0 | 1 | 1 |       | 0 | 1 | 1 |   2   |
| 0 | 0 | 1 |       | 0 | 0 | 1 |   1   |
| 0 | 8 | 0 |       | 0 | 8 | 0 |   2   |
| 0 | 8 | 0 |       | 0 | 8 | 0 |   0   |
| 0 | 1 | 0 |       | 0 | 1 | 0 |   0   |
+---+---+---+       +---+---+---+-------+

You can add an index to both RDDs using zipWithIndex, join them on that index, and convert the result back to a DataFrame.
swp = lambda x: (x[1], x[0])
cluster.zipWithIndex().map(swp).join(dataDf.rdd.zipWithIndex().map(swp)) \
.values().toDF(["cluster", "point"])
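In some cases it should be possible to use zip, which requires both RDDs to have the same number of partitions and the same number of elements in each partition: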
cluster.zip(dataDf.rdd).toDF(["cluster", "point"])
You can follow with .select("cluster", "point.*") to flatten the output.
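To see what the index join does, here is a minimal plain-Python sketch in which ordinary lists stand in for the two RDDs. The helper names `zip_with_index`, `swap`, and `join` are hypothetical stand-ins written for this illustration, not Spark APIs, and the sample values are the five rows shown in the question.

```python
# Plain lists standing in for the RDDs (values taken from the question's tables).
cluster = [2, 1, 2, 0, 0]
points = [(0, 1, 1), (0, 0, 1), (0, 8, 0), (0, 8, 0), (0, 1, 0)]


def zip_with_index(items):
    """Mimic RDD.zipWithIndex: pair each element with its position."""
    return [(item, i) for i, item in enumerate(items)]


def swap(pair):
    """Turn (value, index) into (index, value) so the index becomes the key."""
    return (pair[1], pair[0])


def join(left, right):
    """Mimic an inner join of two (key, value) collections on the key."""
    right_by_key = dict(right)
    return [(k, (v, right_by_key[k])) for k, v in left if k in right_by_key]


joined = join(map(swap, zip_with_index(cluster)),
              map(swap, zip_with_index(points)))

# Drop the index (like .values()) and flatten to (x, y, z, cluster) rows.
rows = [point + (c,) for _, (c, point) in sorted(joined)]
```

Because the join is keyed on the position, the pairing does not depend on how the data is partitioned, which is the reason to prefer it when the conditions for zip are not guaranteed.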

Related

Conditionally lag value over multiple rows

I am trying to find cases where one type of error causes multiple sequential instances of a second type of error on a vehicle. For example, if there are two vehicles, 'a' and 'b', and vehicle a has an error of type 1 ('error_1') on day 0, it can cause errors of type 2 ('error_2') on days 1, 2, 3, and 4. I want to create a variable named cascading_error that shows every consecutive error_2 following an error_1. Note that in the case of vehicle b, it is possible to have an error_2 without a preceding error_1, in which case the value for cascading_error should be 0.
Here's what I've tried:
from pyspark.sql import Window, functions as F
vals = [('a',0,1,0),('a',1,0,1),('a',2,0,1),('a',3,0,1),('a',4,0,1),('b',0,0,0),('b',1,0,0),('b',2,0,1), ('b',3,0,1)]
df = spark.createDataFrame(vals, ['vehicle','day','error_1','error_2'])
w = Window.partitionBy('vehicle').orderBy('day')
df = df.withColumn('cascading_error', F.lag(df.error_1).over(w) * df.error_2)
df = df.withColumn('cascading_error', F.when((F.lag(df.cascading_error).over(w)==1) & (df.error_2==1), F.lit(1)).otherwise(df.cascading_error))
df.show()
This is my result
| vehicle | day | error_1 | error_2 | cascading_error |
| ------- | --- | ------- | ------- | --------------- |
| a | 0 | 1 | 0 | null |
| a | 1 | 0 | 1 | 1 |
| a | 2 | 0 | 1 | 1 |
| a | 3 | 0 | 1 | 0 |
| a | 4 | 0 | 1 | 0 |
| b | 0 | 0 | 0 | null |
| b | 1 | 0 | 0 | 0 |
| b | 2 | 0 | 1 | 0 |
| b | 3 | 0 | 1 | 0 |
The code is generating the correct cascading_error value on days 1 and 2 for vehicle a, but not on days 3 and 4, which should also be 1. It seems that the logic of combining cascading_error with error_2 to update cascading_error only works for a single row, not sequential ones.
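Each lag only looks one row back, so the flag cannot propagate further than one extra row per pass. The intended rule is really a running state per vehicle. As a reference for the expected values, here is a minimal plain-Python sketch of that rule (not a Spark solution). The function name `cascading_errors` is hypothetical, the day-4 row for vehicle a is taken from the result table, and the handling of a row with both errors set is an assumption.

```python
from collections import defaultdict

# (vehicle, day, error_1, error_2), matching the rows in the result table.
rows = [
    ('a', 0, 1, 0), ('a', 1, 0, 1), ('a', 2, 0, 1), ('a', 3, 0, 1), ('a', 4, 0, 1),
    ('b', 0, 0, 0), ('b', 1, 0, 0), ('b', 2, 0, 1), ('b', 3, 0, 1),
]


def cascading_errors(rows):
    """Flag each error_2 that belongs to an unbroken run started by an error_1."""
    active = defaultdict(bool)  # per vehicle: is a cascade currently running?
    out = []
    for vehicle, day, e1, e2 in sorted(rows, key=lambda r: (r[0], r[1])):
        # An error_2 counts as cascading only if a cascade was already running.
        flag = 1 if (e2 == 1 and active[vehicle]) else 0
        out.append((vehicle, day, e1, e2, flag))
        # Assumption: an error_1 starts a cascade; an error_2 keeps it alive;
        # a row with neither error ends it.
        active[vehicle] = (e1 == 1) or (active[vehicle] and e2 == 1)
    return out


result = cascading_errors(rows)
```

With this rule, vehicle a gets 1 on days 1 through 4, and vehicle b gets 0 everywhere because its error_2 run is never preceded by an error_1.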

tableau calculate cumulative value with condition

I have a Tableau table with columns like this:
| ID | ww | count_flag |
| 1 | ww1 | 0 |
| 1 | ww2 | 1 |
| 1 | ww3 | 1 |
| 1 | ww4 | 0 |
| 1 | ww5 | 1 |
| 2 | ww1 | 1 |
| 2 | ww2 | 1 |
| 2 | ww3 | 1 |
| 2 | ww4 | 0 |
| 2 | ww5 | 1 |
...
Now I'd like to add a new column showing the consistent status for each ID across the ww (workweek) values: a running count of consecutive rows with count_flag = 1, which resets whenever count_flag is 0 or the ID changes. It should look like this:
| ID | ww | count_flag | consistent status |
| 1 | ww1 | 0 | 0 |
| 1 | ww2 | 1 | 1 |
| 1 | ww3 | 1 | 2 |
| 1 | ww4 | 0 | 0 |
| 1 | ww5 | 1 | 1 |
| 2 | ww1 | 1 | 1 |
| 2 | ww2 | 1 | 2 |
| 2 | ww3 | 1 | 3 |
| 2 | ww4 | 0 | 0 |
| 2 | ww5 | 1 | 1 |
...
How should I create the calculated field to add such a column to the table?
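The rule described above can be pinned down with a minimal plain-Python sketch. This is not Tableau syntax, only a reference for the expected values. The function name `consistent_status` is hypothetical, and the rows are assumed to be already sorted by ID and workweek.

```python
# (ID, ww, count_flag), as in the question's table.
rows = [
    (1, 'ww1', 0), (1, 'ww2', 1), (1, 'ww3', 1), (1, 'ww4', 0), (1, 'ww5', 1),
    (2, 'ww1', 1), (2, 'ww2', 1), (2, 'ww3', 1), (2, 'ww4', 0), (2, 'ww5', 1),
]


def consistent_status(rows):
    """Running count of consecutive count_flag == 1 rows, reset per ID and on 0."""
    out = []
    prev_id = None
    streak = 0
    for row_id, ww, flag in rows:
        # Reset when a new ID starts or the flag breaks the run.
        if row_id != prev_id or flag == 0:
            streak = 0
        if flag == 1:
            streak += 1
        out.append((row_id, ww, flag, streak))
        prev_id = row_id
    return out


status = [s for _, _, _, s in consistent_status(rows)]
```

The same idea in a calculated field would need a way to refer to the previous row's value within each ID, so the partitioning and sort order of the table calculation matter.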

calculate aggregation and percentage simultaneously after groupBy in scala/Spark Dataset/Dataframe

I am learning to work with Scala and Spark. This is my first time using them. I have a structured Scala Dataset (org.apache.spark.sql.Dataset) in the following format.
Region | Id | RecId | Widget | Views | Clicks | CTR
1 | 1 | 101 | A | 5 | 1 | 0.2
1 | 1 | 101 | B | 10 | 4 | 0.4
1 | 1 | 101 | C | 5 | 1 | 0.2
1 | 2 | 401 | A | 5 | 1 | 0.2
1 | 2 | 401 | D | 10 | 2 | 0.1
NOTE: CTR = Clicks/Views
I want to merge the rows regardless of Widget (i.e., grouping by Region, Id, RecId).
The expected output looks like this:
Region | Id | RecId | Views | Clicks | CTR
1 | 1 | 101 | 20 | 6 | 0.3
1 | 2 | 401 | 15 | 3 | 0.2
What I am getting is like below:
>>> ds.groupBy("Region","Id","RecId").sum().show()
Region | Id | RecId | sum(Views) | sum(Clicks) | sum(CTR)
1 | 1 | 101 | 20 | 6 | 0.8
1 | 2 | 401 | 15 | 3 | 0.3
I understand that it is summing up all the CTR values from the original rows, but I want to group as explained and still get the expected CTR value. I also don't want the column names to change, as they do in my approach.
Is there any way to calculate it in this manner? I also have #Purchases and ConversionRate (#Purchases/Views), and I want to do the same thing with that field. Any leads will be appreciated.
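You can calculate the CTR after the aggregation, dividing the summed Clicks by the summed Views (CTR = Clicks/Views). Try the code below.
import org.apache.spark.sql.functions.{col, sum}
ds.groupBy("Region","Id","RecId")
.agg(sum(col("Views")).as("Views"), sum(col("Clicks")).as("Clicks"))
.withColumn("CTR", col("Clicks") / col("Views"))
.show()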
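To check the numbers by hand, the same "aggregate first, then divide" idea can be sketched in plain Python. This is only an illustration on the question's sample rows, not Spark code. The variable names are hypothetical, and returning None for zero views is an assumption.

```python
from collections import defaultdict

# (Region, Id, RecId, Widget, Views, Clicks), as in the question's sample data.
rows = [
    (1, 1, 101, 'A', 5, 1),
    (1, 1, 101, 'B', 10, 4),
    (1, 1, 101, 'C', 5, 1),
    (1, 2, 401, 'A', 5, 1),
    (1, 2, 401, 'D', 10, 2),
]

# Sum Views and Clicks per (Region, Id, RecId), ignoring Widget.
totals = defaultdict(lambda: [0, 0])
for region, id_, rec_id, _widget, views, clicks in rows:
    group = totals[(region, id_, rec_id)]
    group[0] += views
    group[1] += clicks

# Derive CTR from the summed columns; guard against groups with zero views.
result = {
    key: (views, clicks, clicks / views if views else None)
    for key, (views, clicks) in totals.items()
}
```

A ratio such as a conversion rate can be handled the same way: sum its numerator and denominator separately, then divide once per group.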

Which Variables go on which side of a Karnaugh Map

For a Karnaugh map of three or more variables, deciding which side each variable goes on can make the solution simpler and easier to spot. But how do you know which variables go on which side?
E.g., for variables x, y and z, you could have x and y as column headers and z as a row header, or you could have y and z as column headers and x as a row header, which would give two different tables.
For maps with up to four variables, it is a matter of taste which variable is put on which side. However, Mahoney maps, an extension of Karnaugh maps for five or more variables, do require a certain ordering along the sides.
Expression for the following examples:
abcd!e + abc!de
Five-input Mahoney map:
Equivalent Karnaugh map:
          de                           de
      00  01  11  10               00  01  11  10
abc +---+---+---+---+        abc +---+---+---+---+
000 | 0 | 0 | 0 | 0 |        001 | 0 | 0 | 0 | 0 |
    +---+---+---+---+            +---+---+---+---+
010 | 0 | 0 | 0 | 0 |        011 | 0 | 0 | 0 | 0 |
    +---+---+---+---+            +---+---+---+---+
110 | 0 | 0 | 0 | 0 |        111 | 0 | 1 | 0 | 1 |
    +---+---+---+---+            +---+---+---+---+
100 | 0 | 0 | 0 | 0 |        101 | 0 | 0 | 0 | 0 |
    +---+---+---+---+            +---+---+---+---+
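The cells of the map above can be reproduced by evaluating the expression for each combination, with the column labels in Gray-code order (00, 01, 11, 10) so that adjacent cells differ in one variable. The following is a minimal sketch; the helper names `f`, `gray`, and `kmap_row` are hypothetical.

```python
def f(a, b, c, d, e):
    """The example expression abcd!e + abc!de, returning 0 or 1."""
    return int((a and b and c and d and not e) or (a and b and c and not d and e))


def gray(bits):
    """Gray-code labels for the given width, e.g. ['00', '01', '11', '10']."""
    return [format(n ^ (n >> 1), f"0{bits}b") for n in range(2 ** bits)]


def kmap_row(abc):
    """One map row: fix a, b, c and walk d, e through the Gray-code columns."""
    a, b, c = (int(ch) for ch in abc)
    return [f(a, b, c, int(de[0]), int(de[1])) for de in gray(2)]


columns = gray(2)
row_111 = kmap_row("111")
row_110 = kmap_row("110")
```

Only the row abc = 111 contains ones, in the columns de = 01 and de = 10, which matches the map.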
It is always possible to swap variables as shown here:
          de                           de
      00  01  11  10               00  01  11  10
abc +---+---+---+---+        abc +---+---+---+---+
000 | 0 | 0 | 0 | 0 |        001 | 0 | 0 | 0 | 0 |
    +---+---+---+---+            +---+---+---+---+
010 | 0 | 0 | 0 | 0 |        011 | 0 | 0 | 0 | 0 |
    +---+---+---+---+            +---+---+---+---+
110 | 0 | 0 | 0 | 0 |        111 | 0 | 1 | 0 | 1 |
    +---+---+---+---+            +---+---+---+---+
100 | 0 | 0 | 0 | 0 |        101 | 0 | 0 | 0 | 0 |
    +---+---+---+---+            +---+---+---+---+
Here you can find a nice online-tool to draw and simplify Karnaugh-Veitch/Mahoney maps.

Boolean function for True only if last True

Is there a way to get true only when the second value alone is true?
| A | B | Result |
|---|---|--------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
Looks like ~A & B would suffice.
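A quick way to confirm this is to enumerate all four input combinations. The sketch below uses Python integers 0 and 1 with the bitwise operators. The function name `result` is hypothetical; for Python booleans, `(not a) and b` expresses the same thing.

```python
def result(a, b):
    """Bitwise NOT A AND B; the AND with b (0 or 1) keeps only the lowest bit."""
    return ~a & b


# Full truth table as (A, B, Result) rows, in the same order as the question.
table = [(a, b, result(a, b)) for a in (0, 1) for b in (0, 1)]
```

The only row with a result of 1 is A = 0, B = 1, as in the table.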