Conditionally lag value over multiple rows - pyspark

I am trying to find cases where one type of error causes multiple sequential instances of a second type of error on a vehicle. For example, if there are two vehicles, 'a' and 'b', and vehicle a has an error of type 1 ('error_1') on day 0, it can cause errors of type 2 ('error_2') on days 1, 2, 3, and 4. I want to create a variable named cascading_error that shows every consecutive error_2 following an error_1. Note that in the case of vehicle b, it is possible to have an error_2 without a preceding error_1, in which case the value for cascading_error should be 0.
Here's what I've tried:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

vals = [('a',0,1,0),('a',1,0,1),('a',2,0,1),('a',3,0,1),('a',4,0,1),('b',0,0,0),('b',1,0,0),('b',2,0,1),('b',3,0,1)]
df = spark.createDataFrame(vals, ['vehicle','day','error_1','error_2'])
w = Window.partitionBy('vehicle').orderBy('day')
df = df.withColumn('cascading_error', F.lag(df.error_1).over(w) * df.error_2)
df = df.withColumn('cascading_error', F.when((F.lag(df.cascading_error).over(w)==1) & (df.error_2==1), F.lit(1)).otherwise(df.cascading_error))
df.show()
This is my result
| vehicle | day | error_1 | error_2 | cascading_error |
| ------- | --- | ------- | ------- | --------------- |
| a | 0 | 1 | 0 | null |
| a | 1 | 0 | 1 | 1 |
| a | 2 | 0 | 1 | 1 |
| a | 3 | 0 | 1 | 0 |
| a | 4 | 0 | 1 | 0 |
| b | 0 | 0 | 0 | null |
| b | 1 | 0 | 0 | 0 |
| b | 2 | 0 | 1 | 0 |
| b | 3 | 0 | 1 | 0 |
The code is generating the correct cascading_error value on days 1 and 2 for vehicle a, but not on days 3 and 4, which should also be 1. It seems that the logic of combining cascading_error with error_2 to update cascading_error only works for a single row, not sequential ones.
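One approach that propagates the flag over an arbitrarily long streak (a minimal sketch, assuming the df and window w defined above; the column name anchor_error_1 is made up for illustration) is to carry the error_1 value of the most recent row without an error_2 forward through each run of consecutive error_2 rows, using last() with ignorenulls:

# Carry forward the error_1 value of the latest row where error_2 == 0.
df = df.withColumn(
    'anchor_error_1',
    F.last(F.when(F.col('error_2') == 0, F.col('error_1')), ignorenulls=True).over(w))
# An error_2 row is part of a cascade if that carried value is 1.
df = df.withColumn(
    'cascading_error',
    F.coalesce(F.col('anchor_error_1'), F.lit(0)) * F.col('error_2'))
df.show()

With the sample data this would mark days 1 through 4 of vehicle a as cascading errors and leave both error_2 rows of vehicle b at 0.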

Related

postgresql: query two tables with same column names and show the result side by side ordered their column names, which occur in both tables

Given two tables (table1, table2) with the same column names (generation, parent), the desired output is a combination of all columns of both tables: the rows of table2 should be placed next to the rows of table1, matching on the generation column. Within each table, parent should be ordered ascending. The query result should have as many rows as table1.
Given the following tables
table1:
| generation | parent |
|:----------:|:------:|
| 0 | 1 |
| 0 | 2 |
| 0 | 3 |
| 1 | 3 |
| 1 | 2 |
| 1 | 1 |
| 2 | 2 |
| 2 | 1 |
| 2 | 3 |
table2:
| generation | parent |
|:----------:|:------:|
| 1 | 3 |
| 1 | 1 |
| 1 | 3 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
The following statements create and populate the two sample tables shown above:
create table table1(generation integer, parent integer);
insert into table1 (generation, parent) values(0,1),(0,2),(0,3),(1,3),(1,2),(1,1),(2,2),(2,1),(2,3);
create table table2(generation integer, parent integer);
insert into table2 (generation, parent) values(1,3),(1,1),(1,3),(2,1),(2,2),(2,3);
The query I am looking for should produce the following desired result:
| table1_generation | table1_parent | table2_generation | table2_parent |
|:-----------------:|:-------------:|:-----------------:|:-------------:|
| 0 | 1 | | |
| 0 | 2 | | |
| 0 | 3 | | |
| 1 | 1 | 1 | 1 |
| 1 | 2 | 1 | 3 |
| 1 | 3 | 1 | 3 |
| 2 | 1 | 2 | 1 |
| 2 | 2 | 2 | 2 |
| 2 | 3 | 2 | 3 |
Current query looks as follows:
with
p as (
select
generation,
parent
from
table1
order by
generation,
parent
), o as(
select
generation,
parent
from
table2
order by
generation,
parent
)
select
p.generation as table1_generation,
p.parent as table1_parent,
o.generation as table2_generation,
o.parent as table2_parent
from
p
left join o on
o.generation=p.generation;
Which leads to the following result:
| table1_generation | table1_parent | table2_generation | table2_parent |
|:-----------------:|:-------------:|:-----------------:|:-------------:|
| 0 | 1 | | |
| 0 | 2 | | |
| 0 | 3 | | |
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 3 |
| 1 | 1 | 1 | 3 |
| 1 | 2 | 1 | 1 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 1 | 3 |
| 1 | 3 | 1 | 1 |
| 1 | 3 | 1 | 3 |
| 1 | 3 | 1 | 3 |
| 2 | 1 | 2 | 1 |
| 2 | 1 | 2 | 2 |
| 2 | 1 | 2 | 3 |
| 2 | 2 | 2 | 1 |
| 2 | 2 | 2 | 2 |
| 2 | 2 | 2 | 3 |
| 2 | 3 | 2 | 1 |
| 2 | 3 | 2 | 2 |
| 2 | 3 | 2 | 3 |
This link led to the conclusion that a join might not be what is needed here. But UNION only appends rows, so it is completely unclear to me how the desired result can be achieved.
Any help is highly appreciated. Thanks in advance!
The main misunderstanding in this question comes from the mention of join, which is a precisely defined mathematical concept based on the Cartesian product and can be applied to any two sets. That explains the output your current query produces.
But, as the title says, you want to put the two tables side by side, taking advantage of the fact that each generation forms a group of three rows in both tables.
The following select returns the output you want.
I created an artificial join column in each CTE, row_number() OVER (order by generation, parent) as rnum, and shifted the second table by adding three. I hope this helps:
with
p as (
select
row_number() OVER (order by generation, parent) as rnum,
generation,
parent
from
table1
order by
generation,
parent
), o as(
select
row_number() OVER (order by generation, parent) as rnum,
generation,
parent
from
table2
order by
generation,
parent
)
select
p.generation as table1_generation,
p.parent as table1_parent,
o.generation as table2_generation,
o.parent as table2_parent
from
p
left join o on
o.rnum+3=p.rnum
order by 1,2,3,4;
Output:
| table1_generation | table1_parent | table2_generation | table2_parent |
|:-----------------:|:-------------:|:-----------------:|:-------------:|
| 0 | 1 | (null) | (null) |
| 0 | 2 | (null) | (null) |
| 0 | 3 | (null) | (null) |
| 1 | 1 | 1 | 1 |
| 1 | 2 | 1 | 3 |
| 1 | 3 | 1 | 3 |
| 2 | 1 | 2 | 1 |
| 2 | 2 | 2 | 2 |
| 2 | 3 | 2 | 3 |
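As a variant (just a sketch using the same two tables), you could number the rows within each generation and join on both generation and rnum. That avoids the hard-coded shift of three and keeps working even if the number of rows per generation changes:
with
p as (
  select row_number() over (partition by generation order by parent) as rnum,
         generation, parent
  from table1
), o as (
  select row_number() over (partition by generation order by parent) as rnum,
         generation, parent
  from table2
)
select
  p.generation as table1_generation,
  p.parent as table1_parent,
  o.generation as table2_generation,
  o.parent as table2_parent
from p
left join o on o.generation = p.generation and o.rnum = p.rnum
order by 1, 2, 3, 4;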

calculate aggregation and percentage simultaneous after groupBy in scala/Spark Dataset/Dataframe

I am learning to work with Scala and Spark; this is my first time using them. I have a structured Scala Dataset (org.apache.spark.sql.Dataset) in the following format.
| Region | Id | RecId | Widget | Views | Clicks | CTR |
| ------ | -- | ----- | ------ | ----- | ------ | --- |
| 1 | 1 | 101 | A | 5 | 1 | 0.2 |
| 1 | 1 | 101 | B | 10 | 4 | 0.4 |
| 1 | 1 | 101 | C | 5 | 1 | 0.2 |
| 1 | 2 | 401 | A | 5 | 1 | 0.2 |
| 1 | 2 | 401 | D | 10 | 2 | 0.1 |
NOTE: CTR = Clicks/Views
I want to aggregate these rows regardless of Widget, i.e. grouping by Region, Id, RecId.
The Expected Output I want is like following:
| Region | Id | RecId | Views | Clicks | CTR |
| ------ | -- | ----- | ----- | ------ | --- |
| 1 | 1 | 101 | 20 | 6 | 0.3 |
| 1 | 2 | 401 | 15 | 3 | 0.2 |
What I am getting is like below:
>>> ds.groupBy("Region","Id","RecId").sum().show()
| Region | Id | RecId | sum(Views) | sum(Clicks) | sum(CTR) |
| ------ | -- | ----- | ---------- | ----------- | -------- |
| 1 | 1 | 101 | 20 | 6 | 0.8 |
| 1 | 2 | 401 | 15 | 3 | 0.3 |
I understand that it is summing up the CTR values from the original rows, but I want to group as explained and still get the expected CTR value. I also don't want the column names to change, which is what happens with my approach.
Is there any way to calculate it in this manner? I also have #Purchases and ConversionRate (#Purchases/Views) and I want to do the same thing with those fields. Any leads will be appreciated.
You can calculate the CTR after the aggregation. Try the code below.
import org.apache.spark.sql.functions.{col, sum}

ds.groupBy("Region", "Id", "RecId")
  .agg(sum(col("Views")).as("Views"), sum(col("Clicks")).as("Clicks"))
  .withColumn("CTR", col("Clicks") / col("Views"))
  .show()
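The same pattern should extend to the purchases field mentioned in the question (a sketch; the exact column name, here Purchases, is assumed):
ds.groupBy("Region", "Id", "RecId")
  .agg(sum(col("Views")).as("Views"),
       sum(col("Clicks")).as("Clicks"),
       sum(col("Purchases")).as("Purchases")) // column name assumed
  .withColumn("CTR", col("Clicks") / col("Views"))
  .withColumn("ConversionRate", col("Purchases") / col("Views"))
  .show()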

pyspark PipelinedRDD fit to DataFrame column

First of all, I'm new to the Python and Spark world.
I have a university homework assignment, but I'm stuck in one place.
I ran clustering on my data and now I have my clusters in a PipelinedRDD after this:
cluster = featurizedScaledRDD.map(lambda r: kmeansModelMllib.predict(r))
cluster = [2,1,2,0,0,0,1,2]
Now I have cluster and my dataframe dataDf, and I need to attach cluster as a new column to dataDf.
I have:

| x | y | z |
|---|---|---|
| 0 | 1 | 1 |
| 0 | 0 | 1 |
| 0 | 8 | 0 |
| 0 | 8 | 0 |
| 0 | 1 | 0 |

I need:

| x | y | z | cluster |
|---|---|---|---------|
| 0 | 1 | 1 | 2 |
| 0 | 0 | 1 | 1 |
| 0 | 8 | 0 | 2 |
| 0 | 8 | 0 | 0 |
| 0 | 1 | 0 | 0 |
You can add an index using zipWithIndex, join on it, and convert back to a DataFrame.
swp = lambda x: (x[1], x[0])
cluster.zipWithIndex().map(swp).join(dataDf.rdd.zipWithIndex().map(swp)) \
.values().toDF(["cluster", "point"])
In some cases it should be possible to use zip:
cluster.zip(dataDf.rdd).toDF(["cluster", "point"])
You can follow with .select("cluster", "point.*") to flatten the output.
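Putting it together for the example data (a sketch, assuming cluster and dataDf as defined in the question):
# Key both RDDs by their position, join on that key, and expand the row struct.
swp = lambda x: (x[1], x[0])
result = cluster.zipWithIndex().map(swp) \
    .join(dataDf.rdd.zipWithIndex().map(swp)) \
    .values().toDF(["cluster", "point"]) \
    .select("point.*", "cluster")
result.show()
The join keeps each cluster label aligned with its original row via the index, although the output rows themselves may come back in a different order.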

Count occurrences of value in field for a particular ID using Redshift

I want to count the occurrences of particular values in a certain field for an ID. So what I have is this:
| Location ID | Group |
|:----------- |:---------|
| 1 | Group A |
| 2 | Group B |
| 3 | Group C |
| 4 | Group A |
| 4 | Group B |
| 4 | Group C |
| 3 | Group A |
| 2 | Group B |
| 1 | Group C |
| 2 | Group A |
And what I would hope to yield through some computer magic is this:
| Location ID | Group A Count | Group B Count | Group C count|
|:----------- |:--------------|:--------------|:-------------|
| 1 | 1 | 0 | 1 |
| 2 | 1 | 2 | 0 |
| 3 | 1 | 0 | 1 |
| 4 | 1 | 1 | 1 |
Is there some sort of pivoting function I can use in Redshift to achieve this?
This can be done with a CASE expression and a GROUP BY clause, as in the example.
SELECT l_id,
SUM(CASE WHEN l_group = 'Group A' THEN 1 ELSE 0 END) AS a,
SUM(CASE WHEN l_group = 'Group B' THEN 1 ELSE 0 END) AS b-- and so on
FROM location
GROUP BY l_id;
This should give you such result:
| l_id | a | b |
|------|---|---|
| 4 | 1 | 1 |
| 1 | 1 | 0 |
| 3 | 1 | 0 |
| 2 | 1 | 2 |
You can play with it on this SQL Fiddle.
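For completeness, a version matching the three count columns in the desired output (a sketch reusing the l_id, l_group and location names from the example above):
SELECT l_id,
       SUM(CASE WHEN l_group = 'Group A' THEN 1 ELSE 0 END) AS group_a_count,
       SUM(CASE WHEN l_group = 'Group B' THEN 1 ELSE 0 END) AS group_b_count,
       SUM(CASE WHEN l_group = 'Group C' THEN 1 ELSE 0 END) AS group_c_count
FROM location
GROUP BY l_id
ORDER BY l_id;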

PostgreSQL group by bug on Unicode strings?

I have a very weird thing happening, where I noticed that a group by (word) wasn't always grouping by word if that word is a UTF-8 string. In the same query, I get cases where it's been grouped correctly, and cases where it hasn't. I wonder if anybody knows what's up with that?
select *,count(*) over (partition by md5(word)) as k
from (
select word,count(*) as n
from :tmpwl
group by 1
) a order by 1,2 limit 12;
/* gives:
word | n | k
------+---+---
いい | 1 | 1
くず | 1 | 1
ごみ | 1 | 1
さま | 1 | 1
さん | 1 | 1
へま | 1 | 1
まめ | 1 | 1
よく | 1 | 1
ろく | 1 | 1
ネガ | 1 | 2 -- what the heck?
ネガ | 1 | 2
パス | 1 | 1
*/
Note that the following workaround works fine:
select word,n,count(*) over (partition by md5(word)) as k
from (
select md5(word),max(word) as word,count(*) as n
from :tmpwl
group by 1
) a order by 1,2 limit 12;
/* gives:
word | n | k
------+---+---
いい | 1 | 1
くず | 1 | 1
ごみ | 1 | 1
さま | 1 | 1
さん | 1 | 1
へま | 1 | 1
まめ | 1 | 1
よく | 1 | 1
ろく | 1 | 1
ネガ | 2 | 1
パス | 1 | 1
プア | 1 | 1
*/
The version is PostgreSQL 8.2.14 (Greenplum Database 4.0.4.0 build 3 Single-Node Edition) on x86_64-unknown-linux-gnu, compiled by GCC gcc.exe (GCC) 4.1.1 compiled on Nov 30 2010 17:20:26.
The source table :tmpwl:
\d :tmpwl
Table "pg_temp_25149.pdtmp_foo706453357357532"
Column | Type | Modifiers
----------+---------+-----------
baseword | text |
word | text |
value | integer |
lexicon | text |
nalts | bigint |
Distributed by: (word)