Identifying changes in large amounts of data using pyspark

I have a very large amount of data (about a billion rows) with a DATE column and a RESULT column.
The values in the RESULT column are predominantly the same, but every now and then there is a significant deviation in the value. I want to identify only the dates where there was a large deviation.
So from an input dataframe as such:
+----------+------+
| DATE|RESULT|
+----------+------+
|2020-06-24| 4.2|
|2020-05-17| 4.5|
|2020-05-11| 4.5|
|2020-07-30| 4.2|
|2020-07-30| 4.2|
|2020-06-29| 4.2|
|2020-06-29| 4.2|
|2020-03-04| 4.5|
|2020-06-01| 4.2|
|2020-06-27| 4.2|
|2020-06-29| 4.2|
|2020-06-29| 4.2|
|2020-04-17| 4.5|
|2020-04-17| 4.5|
|2020-01-04| 4.5|
|2020-02-29| 4.5|
|2020-07-07| 4.2|
|2020-05-07| 4.5|
|2020-06-09| 4.2|
|2020-06-22| 4.2|
+----------+------+
I would expect an output of:
+----------+------+
| DATE|RESULT|
+----------+------+
|2020-05-11| 4.5|
|2020-07-30| 4.2|
|2020-06-29| 4.2|
|2020-04-17| 4.5|
|2020-02-29| 4.5|
|2020-07-07| 4.2|
|2020-05-07| 4.5|
|2020-06-09| 4.2|
+----------+------+
I tried using the window and lag functions, but without a partitioning key the window forces the entire dataset onto a single node, which loses the advantage of distributed computing.
I came across a suggestion on StackOverflow to use the median and the Median Absolute Deviation (MAD) with a threshold to identify the records with abnormal shifts, but I could not find a MAD function in the pyspark.sql.functions library.
Does anyone have any better ideas? I would greatly appreciate it.
I am coding in pyspark, but if the solution is in spark/scala that's fine too.
Thank You

You may find this link useful for calculating MAD: https://www.advancinganalytics.co.uk/blog/2020/9/2/identifying-outliers-in-spark-30
Adding relevant content from that link below:
MAD = median(|x_i - x_m|)
where x_m is the median of the data set and x_i is a value in the data set.
In other words, MAD is the median of the absolute differences between each value and the median of the entire dataset.
Consider a df with columns 'category' and 'data_col'.
percentile() expects a column and an array of percentiles to calculate (for the median we can provide array(0.5), since the 50th percentile is the median) and returns an array of results.
import pyspark.sql.functions as F

MADdf = (df
    .groupby('category')
    .agg(F.expr('percentile(data_col, array(0.5))')[0].alias('data_col_median'))
    .join(df, "category", "left")
    .withColumn("data_col_difference_median",
                F.abs(F.col('data_col') - F.col('data_col_median')))
    .groupby('category', 'data_col_median')
    .agg(F.expr('percentile(data_col_difference_median, array(0.5))')[0]
          .alias('median_absolute_difference')))
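To tie this back to the question: once MADdf is computed, you can join it back onto the data and keep only the rows whose absolute deviation from the median exceeds some multiple of the MAD. A rough sketch, using the same df with 'category' and 'data_col' (for the question's data, RESULT plays the role of data_col, with a single dummy category or no grouping at all); the factor of 3 is an assumption, not something from the linked post:

import pyspark.sql.functions as F

threshold = 3.0  # hypothetical cutoff: flag rows more than 3 MADs from the median

outliers = (df
    .join(MADdf, "category", "left")
    .withColumn("abs_deviation", F.abs(F.col("data_col") - F.col("data_col_median")))
    .filter(F.col("abs_deviation") > threshold * F.col("median_absolute_difference")))

outliers.show()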

Related

Set column value depending on previous ones with Spark without repeating grouping attribute

Given the DataFrame:
+------------+---------+
|variableName|dataValue|
+------------+---------+
| IDKey| I1|
| b| y|
| a| x|
| IDKey| I2|
| a| z|
| b| w|
| c| q|
+------------+---------+
I want to create a new column with the corresponding IDKey values, where each value changes whenever the dataValue for IDKey changes. Here's the expected output:
+------------+---------+----------+
|variableName|dataValue|idkeyValue|
+------------+---------+----------+
| IDKey| I1| I1|
| b| y| I1|
| a| x| I1|
| IDKey| I2| I2|
| a| z| I2|
| b| w| I2|
| c| q| I2|
+------------+---------+----------+
I tried the following code, which uses mapPartitions() and a global variable:
var currentVarValue = ""
frame
  .mapPartitions { partition =>
    partition.map { row =>
      val (varName, dataValue) = (row.getString(0), row.getString(1))
      val idKeyValue = if (currentVarValue != dataValue && varName == "IDKey") {
        currentVarValue = dataValue
        dataValue
      } else {
        currentVarValue
      }
      ExtendedData(varName, dataValue, currentVarValue)
    }
  }
But this won't work for two fundamental reasons: Spark doesn't propagate a mutable driver-side variable across executors, and this doesn't comply with a functional programming style.
I would gladly appreciate any help on this. Thanks!
You cannot solve this elegantly and performantly in Spark, because there is not enough initial information for Spark to guarantee that all related data lands in the same partition. And if we do all the processing in a single partition, then that is not the true intent of Spark.
In fact a sensible partitionBy cannot be issued (for a Window function). The issue is that the data represents a long sequential list, and deciding a row's value would require looking across partitions to see whether data in the previous partition relates to the current one. That could be done, but it's quite a job; zero323 has an answer somewhere here that tries to solve this, but if I remember correctly it is cumbersome.
The logic to do it is easy enough, but using Spark for it is problematic.
Without a partitionBy, all the data gets shuffled to a single partition, which can result in OOM and disk-space problems.
Sorry.
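For completeness, here is a minimal PySpark sketch of the single-partition Window approach warned about above. It assumes, hypothetically, that the rows carry an explicit ordering column (row_order here); the original DataFrame has no such column, which is exactly the missing information this answer refers to, and the unpartitioned window pulls everything onto one partition:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "IDKey", "I1"), (2, "b", "y"), (3, "a", "x"),
     (4, "IDKey", "I2"), (5, "a", "z"), (6, "b", "w"), (7, "c", "q")],
    ["row_order", "variableName", "dataValue"])

# No partitionBy: Spark warns and moves all rows to a single partition.
w = (Window.orderBy("row_order")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# Carry forward the most recent IDKey dataValue seen so far.
result = df.withColumn(
    "idkeyValue",
    F.last(F.when(F.col("variableName") == "IDKey", F.col("dataValue")), True).over(w))

result.show()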

Create another dataframe from existing Dataframe with alias value in spark sql

I am using Spark 1.6 with Scala.
I have created a DataFrame which looks like the below.
DATA
SKU, MAKE, MODEL, GROUP, SUBCLS, IDENT
IM, AN4032X, ADH3M032, RM, 1011, 0
IM, A3M4936, MP3M4936, RM, 1011, 0
IM, AK116BC, 3M4936P, 05, ABC, 0
IM, A-116-B, 16ECAPS, RM, 1011, 0
I am doing data validation and want to capture, in a new DataFrame, any record which violates the rules.
Rules:
Column “GROUP” must be alphabetic characters
Column “SUBCLS” must be NUMERIC
Column “IDENT” must be 0
The new DataFrame will look like:
AUDIT TABLE
SKU, MAKE, AUDIT_SKU, AUDIT_MAKE, AUDIT_MODEL, AUDIT_GRP, AUDIT_SUBCLS, AUDIT_IDENT
IM, A-K12216BC, N, N, N, Y, Y, N
Y represents a rule violation and N represents a rule pass.
I have validated the rules using isNull or regex, for example checking column GROUP using:
df.where($"GROUP".rlike("^[A-Za-z]+$")).show
Can someone please help me with how I can do this in Spark SQL? Is it possible to create a DataFrame for the above scenario?
Thanks
You can use rlike with |:
val audited = df
  .withColumn("Group1",  when($"GROUP".rlike("^[\\d+]|[A-Za-z]\\d+"), "Y").otherwise("N"))
  .withColumn("SUBCLS1", when($"SUBCLS".rlike("^[0-9]"), "N").otherwise("Y"))
  .withColumn("IDENT1",  when($"IDENT" === "0", "N").otherwise("Y"))

audited.show()
+---+-------+--------+-----+------+-----+------+-------+------+
|SKU| MAKE| MODEL|GROUP|SUBCLS|IDENT|Group1|SUBCLS1|IDENT1|
+---+-------+--------+-----+------+-----+------+-------+------+
| IM|AN4032X|ADH3M032| RM| 1011| 0| N| N| N|
| IM|A3M4936|MP3M4936| 1RM| 1011| 0| Y| N| N|
| IM|AK116BC| 3M4936P| 05| ABC| 0| Y| Y| N|
| IM|A-116-B| 16ECAPS| RM1| 1011| 0| Y| N| N|
+---+-------+--------+-----+------+-----+------+-------+------+
I just wrote a *1 version of each column for understanding purposes; you can overwrite the original columns instead.
Let me know if you need any help with this.
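If the goal is to actually capture only the violating records in a new DataFrame, you can build the audit flags and then filter on them. A PySpark sketch of the same idea (shown with a modern SparkSession; on Spark 1.6 you would go through sqlContext instead, and the AUDIT_* names and positive-match regexes here are my own choices, not taken from the answer above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("IM", "AN4032X", "ADH3M032", "RM", "1011", "0"),
     ("IM", "AK116BC", "3M4936P", "05", "ABC", "0")],
    ["SKU", "MAKE", "MODEL", "GROUP", "SUBCLS", "IDENT"])

audit = (df
    .withColumn("AUDIT_GRP",    F.when(F.col("GROUP").rlike("^[A-Za-z]+$"), "N").otherwise("Y"))
    .withColumn("AUDIT_SUBCLS", F.when(F.col("SUBCLS").rlike("^[0-9]+$"), "N").otherwise("Y"))
    .withColumn("AUDIT_IDENT",  F.when(F.col("IDENT") == "0", "N").otherwise("Y")))

# Keep only the records that violate at least one rule.
violations = audit.filter((F.col("AUDIT_GRP") == "Y") |
                          (F.col("AUDIT_SUBCLS") == "Y") |
                          (F.col("AUDIT_IDENT") == "Y"))
violations.show()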

What is the correct way to calculate average using pyspark.sql functions?

In a PySpark DataFrame, I have a time series of different events and I want to calculate the average count of events by month. What is the correct way to do that using the pyspark.sql functions?
I have a feeling that this requires agg, avg, window partitioning, but I couldn't make it work.
I have grouped the data by event and month and obtained something like this:
+------+-----+-----+
| event|month|count|
+------+-----+-----+
|event1| 1| 1023|
|event2| 1| 1009|
|event3| 1| 1002|
|event1| 2| 1012|
|event2| 2| 1023|
|event3| 2| 1017|
|event1| 3| 1033|
|event2| 3| 1011|
|event3| 3| 1004|
+------+-----+-----+
What I would like to have is this:
+------+-------------+
| event|avg_per_month|
+------+-------------+
|event1| 1022.6666|
|event2| 1014.3333|
|event3| 1007.6666|
+------+-------------+
What is the correct way to accomplish this?
This should help you get the desired result:
df = spark.createDataFrame(
    [('event1', 1, 1023),
     ('event2', 1, 1009),
     ('event3', 1, 1002),
     ('event1', 2, 1012),
     ('event2', 2, 1023),
     ('event3', 2, 1017),
     ('event1', 3, 1033),
     ('event2', 3, 1011),
     ('event3', 3, 1004)],
    ["event", "month", "count"])
Example 1:
from pyspark.sql import functions as F

df.groupBy("event") \
  .agg(F.avg("count").alias("avg_per_month")) \
  .show()
Example 2:
df.groupBy("event") \
  .agg({'count': 'avg'}) \
  .withColumnRenamed("avg(count)", "avg_per_month") \
  .show()
(Note that calling .alias() on the result of groupBy().avg() names the DataFrame, not the aggregated column, so the aggregate has to be aliased or renamed explicitly, as above.)
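If you instead want the per-event average attached to every original row (the window-partitioning idea mentioned in the question) rather than a collapsed result, a windowed variant is also possible; a small sketch using the same df:

from pyspark.sql import Window, functions as F

w = Window.partitionBy("event")
df.withColumn("avg_per_month", F.avg("count").over(w)).show()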

How to write a large RDD to local disk through the Scala spark-shell?

Through a Scala spark-shell, I have access to an Elasticsearch db using the elasticsearch-hadoop-5.5.0 connector.
I generate my RDD by passing the following command in the spark-shell:
val myRdd = sc.esRDD("myIndex/type", myESQuery)
myRdd contains 2.1 million records across 15 partitions. I have been trying to write all the data to text file(s) on my local disk, but when I try to run operations that convert the RDD to an array, like myRdd.collect(), I overload the Java heap.
Is there a way to export the data (eg. 100k records at a time) incrementally so that I am never overloading my system memory?
When you use saveAsTextFile you can pass your filepath as "file:///path/to/output" to have it save locally.
Another option is to use rdd.toLocalIterator, which will allow you to iterate over the RDD on the driver. You can then write each line to a file. This method avoids pulling all the records in at once.
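For reference, a minimal PySpark-flavoured sketch of both options (the Scala calls are the same); the paths and the str() record formatting are placeholders, and note that with file:/// each executor writes its part-files to its own local disk unless you are running in local mode:

# Option 1: let Spark write part-files in parallel to a local path.
myRdd.map(lambda record: str(record)) \
     .saveAsTextFile("file:///path/to/output")

# Option 2: stream records through the driver, one partition at a time.
with open("/path/to/output.txt", "w") as f:
    for record in myRdd.toLocalIterator():
        f.write(str(record) + "\n")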
In case someone needs to do this in PySpark (to avoid overwhelming their driver), here's a complete example:
import random
from pyspark.sql import Row

# ========================================================================
# Convenience functions for generating DataFrame Row()s w/ random ints.
# ========================================================================
NR, NC = 100, 10   # Number of Row()s; number of columns.
fn_row = lambda x: Row(*[random.randint(*x) for _ in range(NC)])
fn_df = (lambda x, y: spark.createDataFrame([fn_row(x) for _ in range(NR)])
                           .toDF(*[f'{y}{c}' for c in range(NC)]))
# ========================================================================
Generate a DataFrame with 100-Rows of 10-Columns; containing integer values between [1..100):
>>> myDF = fn_df((1,100),'c')
>>> myDF.show(5)
+---+---+---+---+---+---+---+---+---+---+
| c0| c1| c2| c3| c4| c5| c6| c7| c8| c9|
+---+---+---+---+---+---+---+---+---+---+
| 72| 88| 74| 81| 68| 80| 45| 32| 49| 29|
| 78| 6| 55| 2| 23| 84| 84| 84| 96| 95|
| 25| 77| 64| 89| 27| 51| 26| 9| 56| 30|
| 16| 16| 94| 33| 34| 86| 49| 16| 21| 86|
| 90| 69| 21| 79| 63| 43| 25| 82| 94| 61|
+---+---+---+---+---+---+---+---+---+---+
Then, using DataFrame.toLocalIterator(), "stream" the DataFrame Row by Row, applying whatever post-processing is desired. This avoids overwhelming Spark driver memory.
Here, we simply print() the Rows to show that each is the same as above:
>>> it = myDF.toLocalIterator()
>>> for _ in range(5): print(next(it)) # Analogous to myDF.show(5)
>>>
Row(c0=72, c1=88, c2=74, c3=81, c4=68, c5=80, c6=45, c7=32, c8=49, c9=29)
Row(c0=78, c1=6, c2=55, c3=2, c4=23, c5=84, c6=84, c7=84, c8=96, c9=95)
Row(c0=25, c1=77, c2=64, c3=89, c4=27, c5=51, c6=26, c7=9, c8=56, c9=30)
Row(c0=16, c1=16, c2=94, c3=33, c4=34, c5=86, c6=49, c7=16, c8=21, c9=86)
Row(c0=90, c1=69, c2=21, c3=79, c4=63, c5=43, c6=25, c7=82, c8=94, c9=61)
And if you wish to "stream" DataFrame Rows to a local file, perhaps transforming each Row along the way, you can use this template:
>>> it = myDF.toLocalIterator()  # Refresh the iterator here.
>>> with open('/tmp/output.txt', mode='w') as f:
...     for row in it: print(row, file=f)

Change Behaviour of Built-in Spark Sql Functions

Is there any way to prevent spark sql functions from nulling values?
For example I have the following dataframe
df.show
+--------------------+--------------+------+------------+
| Title|Year Published|Rating|Length (Min)|
+--------------------+--------------+------+------------+
| 101 Dalmatians| 01/1996| G| 103|
|101 Dalmatians (A...| 1961| G| 79|
|101 Dalmations II...| 2003| G| 70|
I want to apply spark sqls date_format function to Year Published column.
val sql = """date_format(`Year Published`, 'MM/yyyy')"""
val df2 = df.withColumn("Year Published", expr(sql))
df2.show
+--------------------+--------------+------+------------+
| Title|Year Published|Rating|Length (Min)|
+--------------------+--------------+------+------------+
| 101 Dalmatians| null| G| 103|
|101 Dalmatians (A...| 01/1961| G| 79|
|101 Dalmations II...| 01/2003| G| 70|
The first row of the Year Published column has been nulled as the original value was in a different date format than the other dates.
This behaviour is not unique to date_format; for example, format_number will null non-numeric types.
With my dataset I expect different date formats and dirty data with unparseable values. I have a use case where if the value of a cell cannot be formatted then I want to return the current value as opposed to null.
Is there a way to make spark use the original value in df instead of null if the function for df2 cannot be applied correctly?
What I've tried
I've looked at wrapping Expressions in org.apache.spark.sql.catalyst.expressions but could not see a way to replace the existing functions.
The only working solution I could find is creating my own date_format and registering it as a udf but this isn't practical for all functions. I'm looking for a solution that will never return null if the input to a function is non-null or an automated way to wrap all existing spark functions.
You could probably use the coalesce function for your purposes:
coalesce(date_format(`Year Published`, 'MM/yyyy'), `Year Published`)
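Plugged into the withColumn call from the question, that looks like the following (shown as a PySpark sketch; the Scala version with expr is the same apart from syntax):

from pyspark.sql import functions as F

# Fall back to the original string whenever date_format cannot parse it.
df2 = df.withColumn(
    "Year Published",
    F.expr("coalesce(date_format(`Year Published`, 'MM/yyyy'), `Year Published`)"))

df2.show()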