I know how to remove outliers with the IQR using pandas, but now I'm trying to learn PySpark. I have done some searching online, but most of the examples only flag outliers with 'yes' and 'no' and do not proceed to remove them. Additionally, I also don't understand the code they are writing.
I have also tried it out a bit myself, but to no avail. Ultimately, I don't know where to start.
Here's an example dataframe:
+----------+---------+--------------+---------------+----------------------+------------+
| town|flat_type| flat_model|remaining_lease|floor_area_sqm_imputed|resale_price|
+----------+---------+--------------+---------------+----------------------+------------+
|ANG MO KIO| 2 ROOM| Improved| 736| 44.0| 232000.0|
|ANG MO KIO| 3 ROOM|New Generation| 727| 67.0| 250000.0|
|ANG MO KIO| 3 ROOM|New Generation| 749| 67.0| 262000.0|
|ANG MO KIO| 3 ROOM|New Generation| 745| 68.0| 265000.0|
|ANG MO KIO| 3 ROOM|New Generation| 749| 67.0| 265000.0|
|ANG MO KIO| 3 ROOM|New Generation| 756| 68.0| 275000.0|
|ANG MO KIO| 3 ROOM|New Generation| 738| 68.0| 280000.0|
|ANG MO KIO| 3 ROOM|New Generation| 700| 67.0| 285000.0|
|ANG MO KIO| 3 ROOM|New Generation| 738| 68.0| 285000.0|
|ANG MO KIO| 3 ROOM|New Generation| 736| 67.0| 285000.0|
+----------+---------+--------------+---------------+----------------------+------------+
I plan on doing the removal for floor_area_sqm_imputed, so I don't need code that assumes there are multiple columns.
Any help is appreciated. I know it sounds like I just want answers instead of searching for myself.
Use the percentile_approx Spark SQL function to compute the first quartile (Q1) and third quartile (Q3), then filter records to that range.
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[["ANG MO KIO","2 ROOM","Improved",736,44.0,232000.0],["ANG MO KIO","3 ROOM","New Generation",727,67.0,250000.0],["ANG MO KIO","3 ROOM","New Generation",749,67.0,262000.0],["ANG MO KIO","3 ROOM","New Generation",745,68.0,265000.0],["ANG MO KIO","3 ROOM","New Generation",749,67.0,265000.0],["ANG MO KIO","3 ROOM","New Generation",756,68.0,275000.0],["ANG MO KIO","3 ROOM","New Generation",738,68.0,280000.0],["ANG MO KIO","3 ROOM","New Generation",700,67.0,285000.0],["ANG MO KIO","3 ROOM","New Generation",738,68.0,285000.0],["ANG MO KIO","3 ROOM","New Generation",736,67.0,285000.0]], schema=["town","flat_type","flat_model","remaining_lease","floor_area_sqm_imputed","resale_price"])
qtr_map = df.select(
    F.expr("percentile_approx(floor_area_sqm_imputed, 0.25) as Q1"),
    F.expr("percentile_approx(floor_area_sqm_imputed, 0.75) as Q3")
).collect()[0].asDict()

df = df.filter(
    (F.col("floor_area_sqm_imputed") >= qtr_map["Q1"])
    & (F.col("floor_area_sqm_imputed") <= qtr_map["Q2" if False else "Q3"]) if False else
    (F.col("floor_area_sqm_imputed") >= qtr_map["Q1"])
    & (F.col("floor_area_sqm_imputed") <= qtr_map["Q3"])
)
Output:
+----------+---------+--------------+---------------+----------------------+------------+
| town|flat_type| flat_model|remaining_lease|floor_area_sqm_imputed|resale_price|
+----------+---------+--------------+---------------+----------------------+------------+
|ANG MO KIO| 3 ROOM|New Generation| 727| 67.0| 250000.0|
|ANG MO KIO| 3 ROOM|New Generation| 749| 67.0| 262000.0|
|ANG MO KIO| 3 ROOM|New Generation| 745| 68.0| 265000.0|
|ANG MO KIO| 3 ROOM|New Generation| 749| 67.0| 265000.0|
|ANG MO KIO| 3 ROOM|New Generation| 756| 68.0| 275000.0|
|ANG MO KIO| 3 ROOM|New Generation| 738| 68.0| 280000.0|
|ANG MO KIO| 3 ROOM|New Generation| 700| 67.0| 285000.0|
|ANG MO KIO| 3 ROOM|New Generation| 738| 68.0| 285000.0|
|ANG MO KIO| 3 ROOM|New Generation| 736| 67.0| 285000.0|
+----------+---------+--------------+---------------+----------------------+------------+
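Note that filtering between Q1 and Q3 keeps only the middle 50% of the data. If you want the conventional IQR rule instead, dropping only values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR, here is a minimal sketch that reuses the same qtr_map dictionary (the 1.5 multiplier is the usual convention and is an assumption you can tune; apply it to the unfiltered dataframe):
# Conventional IQR fences; run this against the dataframe before the Q1..Q3 filter above.
iqr = qtr_map["Q3"] - qtr_map["Q1"]
lower_fence = qtr_map["Q1"] - 1.5 * iqr
upper_fence = qtr_map["Q3"] + 1.5 * iqr

df_no_outliers = df.filter(
    (F.col("floor_area_sqm_imputed") >= lower_fence)
    & (F.col("floor_area_sqm_imputed") <= upper_fence)
)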
I have an audit table that tracks the steps in a process, and I need to track the time for each step. It used to be stored in MS SQL Server but is now stored in PostgreSQL. I have the query from SQL Server and have not been successful at converting it. Here is the MS SQL working example: http://www.sqlfiddle.com/#!18/1b423/1
Here are the rules:
- The steps are not required to be sequential, so step 1 can happen after step 5.
- The records for an order are not stored sequentially by step or order, but rather are intermixed with other orders based upon the Time Entered.
- The sample data being ordered by Order Number then New is NOT normal and cannot be depended upon.
- Each step can be repeated for any given order; if repeated for an order, sum the times by step.
- The starting step record always has null in the Old column.
- The time in a step is calculated as the difference between when it is the value in the New column and when it is the value in the Old column for a given order.
- For the steps that the order never came out of, the time is computed up to the present moment.
- A step can be repeated many times, and I am only looking for the total time spent in each step.
I cannot get the date difference to sum or handle the null old status value for the first step. I get various forms of this error when running the following SQL.
ERROR:  function isnull(timestamp without time zone, timestamp with time zone) does not exist
LINE 4:     sum(a1.timeentered - isnull(a2.timeentered,now())) as "tota...
                                 ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
SELECT
a1.ordernumber,
a1."new" AS "Step",
sum(a1.timeentered - isnull(a2.timeentered,now())) as "total time"
FROM
audittrail AS a1
LEFT JOIN
audittrail AS a2
ON
a1."new" = a2."old" AND
a1.ordernumber = a2.ordernumber
GROUP BY
a1.ordernumber,
a1."new"
ORDER BY
a1.ordernumber ASC
Here is the sample data as well as a link to a sample online: http://www.sqlfiddle.com/#!17/e6fd5a
Old New Time Entered Order Number
NULL Step 1 4/30/12 10:43 1C2014A
Step 1 Step 2 5/2/12 10:17 1C2014A
Step 2 Step 3 5/2/12 10:28 1C2014A
Step 3 Step 4 5/2/12 11:14 1C2014A
Step 4 Step 5 5/2/12 11:19 1C2014A
Step 5 Step 9 5/3/12 11:23 1C2014A
NULL Step 1 5/18/12 15:49 1C2014B
Step 1 Step 2 5/21/12 9:21 1C2014B
Step 2 Step 3 5/21/12 9:34 1C2014B
Step 3 Step 4 5/21/12 10:08 1C2014B
Step 4 Step 5 5/21/12 10:09 1C2014B
Step 5 Step 6 5/21/12 16:27 1C2014B
Step 6 Step 9 5/21/12 18:07 1C2014B
NULL Step 1 6/12/12 10:28 1C2014C
Step 1 Step 2 6/13/12 8:36 1C2014C
Step 2 Step 3 6/13/12 9:05 1C2014C
Step 3 Step 4 6/13/12 10:28 1C2014C
Step 4 Step 6 6/13/12 10:50 1C2014C
Step 6 Step 8 6/13/12 12:14 1C2014C
Step 8 Step 4 6/13/12 15:13 1C2014C
Step 4 Step 5 6/13/12 15:23 1C2014C
Step 5 Step 8 6/13/12 15:30 1C2014C
Step 8 Step 9 6/18/12 14:04 1C2014C
This is the expected result:
| OrderNumber | Step | Total Time in Step (seconds) |
|-------------|--------|------------------------------|
| 1C2014A | Step 1 | 171240 |
| 1C2014A | Step 2 | 660 |
| 1C2014A | Step 3 | 2760 |
| 1C2014A | Step 4 | 300 |
| 1C2014A | Step 5 | 86640 |
| 1C2014A | Step 9 | 324902599 |
| 1C2014B | Step 1 | 235920 |
| 1C2014B | Step 2 | 780 |
| 1C2014B | Step 3 | 2040 |
| 1C2014B | Step 4 | 60 |
| 1C2014B | Step 5 | 22680 |
| 1C2014B | Step 6 | 6000 |
| 1C2014B | Step 9 | 323323159 |
| 1C2014C | Step 1 | 79680 |
| 1C2014C | Step 2 | 1740 |
| 1C2014C | Step 3 | 4980 |
| 1C2014C | Step 4 | 3840 |
| 1C2014C | Step 5 | 420 |
| 1C2014C | Step 6 | 5040 |
| 1C2014C | Step 8 | 875160 |
| 1C2014C | Step 9 | 320918539 |
This turned out to be harder than I thought. This is the full query I used. It doesn't have the substitution for the ISNULL function, but it gets most of the way there. I used the extract function from the Date/Time Functions. Specifically, to get everything in seconds, I used extract(epoch from ...).
SELECT
a1.ordernumber,
a1."new" AS "Step",
sum(extract(epoch from a2.timeentered) -
extract(epoch from a1.timeentered)) as total_time
FROM
audittrail AS a1
LEFT JOIN
audittrail AS a2
ON
a1.new = a2.old AND
a1.ordernumber = a2.ordernumber
GROUP BY
a1.ordernumber,
a1.new
ORDER BY a1.ordernumber ASC
which gives
| ordernumber | Step   | total_time |
|-------------|--------|------------|
| 1C2014A     | Step 1 | 171240     |
| 1C2014A     | Step 2 | 660        |
| 1C2014A     | Step 3 | 2760       |
| 1C2014A     | Step 4 | 300        |
| 1C2014A     | Step 5 | 86640      |
| 1C2014A     | Step 9 | NULL       |
| 1C2014B     | Step 1 | 235920     |
| 1C2014B     | Step 2 | 780        |
| 1C2014B     | Step 3 | 2040       |
| 1C2014B     | Step 4 | 60         |
| 1C2014B     | Step 5 | 22680      |
| 1C2014B     | Step 6 | 6000       |
| 1C2014B     | Step 9 | NULL       |
| 1C2014C     | Step 1 | 79680      |
| 1C2014C     | Step 2 | 1740       |
| 1C2014C     | Step 3 | 4980       |
| 1C2014C     | Step 4 | 3840       |
| 1C2014C     | Step 5 | 420        |
| 1C2014C     | Step 6 | 5040       |
| 1C2014C     | Step 8 | 875160     |
| 1C2014C     | Step 9 | NULL       |
To me this calculation looks wrong. It makes more sense to me that, for example, the entry for Step 3 of order 1C2014A should have a total_time of 11 minutes, or 660 seconds. To achieve this, swap old and new in the join and swap a1 and a2 in the sum(extract(epoch ...)) to become:
SELECT
a1.ordernumber,
a1.new AS Step,
sum(extract(epoch from a1.timeentered) -
extract(epoch from a2.timeentered)) as total_time
FROM
audittrail AS a1
LEFT JOIN
audittrail AS a2
ON
a1.old = a2.new AND
a1.ordernumber = a2.ordernumber
GROUP BY
a1.ordernumber,
a1.new
ORDER BY a1.ordernumber ASC
Just replace isnull and datediff with the equivalent PostgreSQL expressions in the second line of the query:
select a1.OrderNumber as "OrderNumber", a1.New as "Step",
extract('epoch' from
sum(coalesce(a2.TimeEntered, now()) - a1.TimeEntered))::integer
as "Total Time in Step (seconds)"
from AuditTrail a1
left join AuditTrail a2
on a1.New = a2.Old
and a1.OrderNumber = a2.OrderNumber
group by a1.OrderNumber, a1.New
order by a1.OrderNumber;
I am working with a Spark dataframe containing timeseries data, and one of the columns is an indicator for an event. It looks something like the dummy table below.
| id | time             | timeseries_data | event_indicator |
|----|------------------|-----------------|-----------------|
| a  | 2022-08-12 08:00 | 1               | 0               |
| a  | 2022-08-12 08:01 | 2               | 0               |
| a  | 2022-08-12 08:02 | 3               | 0               |
| a  | 2022-08-12 08:03 | 4               | 1               |
| a  | 2022-08-12 08:04 | 5               | 0               |
| a  | 2022-08-12 08:05 | 6               | 0               |
| b  | 2022-08-12 08:00 | 1               | 0               |
| b  | 2022-08-12 08:01 | 2               | 0               |
| b  | 2022-08-12 08:02 | 3               | 1               |
| b  | 2022-08-12 08:03 | 4               | 0               |
| b  | 2022-08-12 08:04 | 5               | 0               |
| b  | 2022-08-12 08:05 | 6               | 0               |
I now want to select samples before and after the event (including the sample where the event occurs): to start off, one sample before and one after, but also by time, so everything within 4 minutes of the event for each id.
I've tried to use a window function, but I can't figure out how to sort it out.
The desired result for id a is shown below. The event occurs at 2022-08-12 08:03 at sample 4, and I now want to extract the following into a new dataframe.
| id | time             | timeseries_data | event_indicator |
|----|------------------|-----------------|-----------------|
| a  | 2022-08-12 08:02 | 3               | 0               |
| a  | 2022-08-12 08:03 | 4               | 1               |
| a  | 2022-08-12 08:04 | 5               | 0               |
Edit:
I thought there would be a very simple solution to this problem; I am just new to PySpark, which is why I can't get it to work.
What I've tried is to use a window function per id.
windowPartition = Window.partitionBy([F.col("id")]).orderBy("time").rangeBetween(-1, 1)
test_df = df_dummy.where(F.col('event_indicator') == 1).over(windowPartition)
However, the error is that the DataFrame object has no attribute 'over'. So I need to figure out a way to apply this window to the entire dataframe, not just to a function.
From my understanding, lag/lead only take the lagged/leading value, whereas I want a consecutive dataframe of the time around the event indicator.
The timestamps are only dummy data; for me it currently does not matter whether the window is per minute or per second, so I've changed the question to per minute.
Currently the goal is to get an understanding of how I can extract a subset of the entire timeseries dataframe, to see how the data changes when something happens. An example could be a car driving normally, one tyre explodes, and we want to see what happened to the pressure in the x timeseries samples before and after the explosion. The next step might not use samples, but instead look at what happened to the data in the previous minute and the following minute.
@Emil, here is a solution:
from pyspark.sql import functions as F
from pyspark.sql import Window
data = [("a","2022-08-12 08:00","1","0"),
("a","2022-08-12 08:01","2","0"),
("a","2022-08-12 08:02","3","0"),
("a","2022-08-12 08:03","4","1"),
("a","2022-08-12 08:04","5","0"),
("a","2022-08-12 08:05","6","0"),
("a","2022-08-12 08:10","7","0"),
("a","2022-08-12 08:12","8","1"),
("a","2022-08-12 08:14","9","0"),
("b","2022-08-12 08:00","1","0"),
("b","2022-08-12 08:01","2","0"),
("b","2022-08-12 08:02","3","1"),
("b","2022-08-12 08:03","4","0"),
("b","2022-08-12 08:04","5","0"),
("b","2022-08-12 08:05","6","0"),
("b","2022-08-12 08:10","7","0"), # thesearemy testcase,shouldn't in output
("b","2022-08-12 08:12","8","1"), # theseare my testcase,shouldn't in output
("b","2022-08-12 08:17","9","0")] # thesearemy testcase,shouldn't in output
schema=["id","time","timeseries_data","event_indicator"]
df = spark.createDataFrame(data,schema)
df = df.withColumn("time", F.col("time").cast("timestamp"))\
       .withColumn("event_indicator", F.col("event_indicator").cast("int"))

# Window over each id ordered by time, and a window over each contiguous flagged group.
window_spec = Window.partitionBy("id").orderBy("time")
window_spec_groups_all = Window.partitionBy("id", "continous_groups")

# Keep a row if it is an event, or if the previous/next row (per id) is an event.
flag_cond_ = ((F.lag(F.col("event_indicator")).over(window_spec) == 1)
              | (F.col("event_indicator") == 1)
              | (F.lead(F.col("event_indicator")).over(window_spec) == 1))

# Seconds between the first and last timestamp of each contiguous group of flagged rows.
four_minutes_cond_ = (F.unix_timestamp(F.last(F.col("time")).over(window_spec_groups_all))
                      - F.unix_timestamp(F.first(F.col("time")).over(window_spec_groups_all)))

df_flt = (df.withColumn("flag", F.when(flag_cond_, F.lit(True)).otherwise(F.lit(False)))
            .withColumn("org_rnk", F.row_number().over(window_spec))
            .filter(F.col("flag"))
            .withColumn("flt_rnk", F.row_number().over(window_spec))
            # Rows that were adjacent in the original data end up with the same group id.
            .withColumn("continous_groups", (F.col("org_rnk") - F.col("flt_rnk")).cast("int"))
            .withColumn("four_minutes_cond_", four_minutes_cond_)
            .filter(F.col("four_minutes_cond_") <= 240)
            .select(schema))
df_flt.show(20,0)
Output:
+---+-------------------+---------------+---------------+
|id |time |timeseries_data|event_indicator|
+---+-------------------+---------------+---------------+
|a |2022-08-12 08:02:00|3 |0 |
|a |2022-08-12 08:03:00|4 |1 |
|a |2022-08-12 08:04:00|5 |0 |
|a |2022-08-12 08:10:00|7 |0 |
|a |2022-08-12 08:12:00|8 |1 |
|a |2022-08-12 08:14:00|9 |0 |
|b |2022-08-12 08:01:00|2 |0 |
|b |2022-08-12 08:02:00|3 |1 |
|b |2022-08-12 08:03:00|4 |0 |
+---+-------------------+---------------+---------------+
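If you instead want the purely time-based variant, keeping every row within 4 minutes of any event for the same id, here is a minimal sketch using a self-join against the event rows. Note this is a different interpretation from the group-duration filter above (for example, it would also keep the rows around b's 08:12 event), and the 240-second window is an assumption you can adjust:
# Event rows per id, keeping the event time for the range comparison.
events = (df.filter(F.col("event_indicator") == 1)
            .select("id", F.col("time").alias("event_time")))

# Keep any row whose time falls within +/- 4 minutes of an event for the same id;
# dropDuplicates handles rows covered by more than one event window.
df_time_window = (df.join(events, on="id", how="inner")
                    .filter(F.abs(F.unix_timestamp("time")
                                  - F.unix_timestamp("event_time")) <= 240)
                    .drop("event_time")
                    .dropDuplicates(["id", "time"]))

df_time_window.orderBy("id", "time").show(20, 0)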
I have the following two tables:
t1:([]sym:`AAPL`GOOG; histo_dates1:(2000.01.01+til 10;2000.01.01+til 10);histo_values1:(til 10;5+til 10));
t2:([]sym:`AAPL`GOOG; histo_dates2:(2000.01.05+til 5;2000.01.06+til 4);histo_values2:(til 5; 2+til 4));
What I want is to sum the histo_values of each symbol across the histo_dates, such that the resulting table would look like this:
t:([]sym:`AAPL`GOOG; histo_dates:(2000.01.01+til 10;2000.01.01+til 10);histo_values:(0 1 2 3 4 6 8 10 12 9;5 6 7 8 9 12 14 16 18 14))
So the resulting histo_dates should be the union of histo_dates1 and histo_dates2, and histo_values should be the sum of histo_values1 and histo_values2 across dates.
EDIT:
I insist on the union of the dates, as I want the resulting histo_dates to be the union of both histo_dates1 and histo_dates2.
There are a few ways. One would be to ungroup to remove nesting, join the tables, aggregate on sym/date and then regroup on sym:
q)0!select histo_dates:histo_dates1, histo_values:histo_values1 by sym from select sum histo_values1 by sym, histo_dates1 from ungroup[t1],cols[t1]xcol ungroup[t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
A possibly faster way would be to make each row a dictionary and then key the tables on sym and add them:
q)select sym:s, histo_dates:key each v, histo_values:value each v from (1!select s, d!'v from `s`d`v xcol t1)+(1!select s, d!'v from `s`d`v xcol t2)
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
Another option would be to use a plus join pj:
q)0!`sym xgroup 0!pj[ungroup `sym`histo_dates`histo_values xcol t1;2!ungroup `sym`histo_dates`histo_values xcol t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
See here for more on plus joins: https://code.kx.com/v2/ref/pj/
EDIT:
To explicitly make sure the result has the union of the dates, you could use a union join:
q)0!`sym xgroup select sym,histo_dates,histo_values:hv1+hv2 from 0^uj[2!ungroup `sym`histo_dates`hv1 xcol t1;2!ungroup `sym`histo_dates`hv2 xcol t2]
sym histo_dates histo_values
-------------------------------------------------------------------------------------------------------------------------------------------
AAPL 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 0 1 2 3 4 6 8 10 12 9
GOOG 2000.01.01 2000.01.02 2000.01.03 2000.01.04 2000.01.05 2000.01.06 2000.01.07 2000.01.08 2000.01.09 2000.01.10 5 6 7 8 9 12 14 16 18 14
Another way:
// rename the columns to be common names, ungroup the tables, and place the key on `sym and `histo_dates
q){2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
// add them together (or use pj in place of +), group on `sym
`sym xgroup (+) . {2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
// and to test this matches t, remove the key from the resulting table
q)t~0!`sym xgroup (+) . {2!ungroup `sym`histo_dates`histo_values xcol x} each (t1;t2)
1b
Another possible way, using functional amend:
//Column join the histo_dates* columns and get the distinct dates - drop idx
//Using a functional apply use the idx to determine which values to plus
//Join the two tables using sym as the key - Find the idx of common dates
(enlist `idx) _select sym,histo_dates:distinct each (histo_dates1,'histo_dates2),
histovalues:{#[x;z;+;y]}'[histo_values1;histo_values2;idx],idx from
update idx:(where each histo_dates1 in' histo_dates2) from ((1!t1) uj 1!t2)
One possible problem with this is that getting the idx depends on the date columns being sorted, which is usually the case.
I want to reshape a dataframe in Spark using Scala. I found most of the examples use groupBy and pivot. In my case I don't want to use groupBy. This is how my dataframe looks:
tagid timestamp value
1 1 2016-12-01 05:30:00 5
2 1 2017-12-01 05:31:00 6
3 1 2017-11-01 05:32:00 4
4 1 2017-11-01 05:33:00 5
5 2 2016-12-01 05:30:00 100
6 2 2017-12-01 05:31:00 111
7 2 2017-11-01 05:32:00 109
8 2 2016-12-01 05:34:00 95
And I want my dataframe to look like this:
timestamp 1 2
1 2016-12-01 05:30:00 5 100
2 2017-12-01 05:31:00 6 111
3 2017-11-01 05:32:00 4 109
4 2017-11-01 05:33:00 5 NA
5 2016-12-01 05:34:00 NA 95
I used pivot without groupBy and it throws an error:
df.pivot("tagid")
error: value pivot is not a member of org.apache.spark.sql.DataFrame.
How do I convert this? Thank you.
Doing the following should solve your issue.
df.groupBy("timestamp").pivot("tagId").agg(first($"value"))
You should have the final dataframe as:
+-------------------+----+----+
|timestamp |1 |2 |
+-------------------+----+----+
|2017-11-01 05:33:00|5 |null|
|2017-11-01 05:32:00|4 |109 |
|2017-12-01 05:31:00|6 |111 |
|2016-12-01 05:30:00|5 |100 |
|2016-12-01 05:34:00|null|95 |
+-------------------+----+----+
For more information you can check out the Databricks blog.
My table is something like this
id ...... amount...........food
+++++++++++++++++++++++++++++++++++++
1 ........ 5 ............. banana
1 ........ 4 ............. strawberry
2 ........ 2 ............. banana
2 ........ 7 ............. orange
2 ........ 8 ............. strawberry
3 ........ 10 .............lime
3 ........ 12 .............banana
What I want is a table displaying each food with the average of its amount across ids.
The table should look something like this, I think:
food ........... avg............
++++++++++++++++++++++++++++++++
banana .......... 6.3 ............
strawberry ...... 6 ............
orange .......... 7 ............
lime ............ 10 ............
I'm not really sure how to do this. If I use just avg(amount), it will just average over the whole amount column.
Did you try GROUP BY?
SELECT food, AVG(amount) "avg"
FROM table1
GROUP BY food
Here is SQLFiddle
Output:
| food | avg |
|------------|-------------------|
| lime | 10 |
| orange | 7 |
| strawberry | 6 |
| banana | 6.333333333333333 |