PySpark structured streaming and filtered processing for parts

I want to evaluate a streamed (unbounded) DataFrame in Spark 2.4:
time         id  value
6:00:01.000  1   333
6:00:01.005  1   123
6:00:01.050  2   544
6:00:01.060  2   544
When all the data for id 1 has arrived in the DataFrame and the data for the next id 2 starts coming in, I want to run calculations on the complete data of id 1. But how do I do that? I don't think I can use window functions, since I do not know the duration in advance and it also varies per id. I also have no way of knowing the ids from any source other than the streamed DataFrame.
The only solution that comes to my mind uses a stored variable (a memory of the previous id) and a while loop:
id_old = 0  # start value
while True:
    id_cur = id_from_dataframe       # pseudocode: latest id seen in the stream
    if id_cur != id_old:             # id has changed
        do_calculation_for(id_old)   # the previous id is now complete
        id_old = id_cur
But I do not think that this is the right solution. Can you give me a hint or point me to documentation? I cannot find examples or documentation for this.

I got it running with a combination of watermarking and grouping:
import pyspark.sql.functions as F

d2 = (d1.withWatermark("time", "60 second")
        .groupBy("id", F.window("time", "40 second"))
        .agg(F.count("*").alias("count"),
             F.min("time").alias("time_start"),
             F.max("time").alias("time_stop"),
             F.round(F.avg("value"), 1).alias("value_avg")))
Most of the documentation only shows the basics, grouping by a time window alone; I found just one example that grouped by an additional column, so I put my 'id' there.
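To actually run this as a streaming query, the aggregation has to be written to a sink. Below is a minimal sketch, assuming d1 is a streaming DataFrame (for example from a Kafka or socket source); the console sink and the 10-second trigger are only illustrative choices, not part of the original solution:

query = (d2.writeStream
           .outputMode("append")                 # a window is emitted once the watermark passes its end
           .format("console")                    # assumed sink, for illustration only
           .option("truncate", False)
           .trigger(processingTime="10 seconds")
           .start())
query.awaitTermination()

In append mode the aggregates for an id/window are only emitted after the watermark has moved past that window, which is what makes "process id 1 once later data has arrived" possible, at the cost of the watermark delay.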

Related

Skewed Window Function & Hive Source Partitions?

The data I am reading via Spark is a highly skewed Hive table, with the following partition-size stats (MIN, 25TH, MEDIAN, 75TH, MAX) from the Spark UI:
MIN:     1506.0 B / 0
25TH:    232.4 KB / 27288
MEDIAN:  247.3 KB / 29025
75TH:    371.0 KB / 42669
MAX:     269.0 MB / 27197137
I believe it is causing problems downstream in the job when I perform some window functions and pivots.
I tried exploring this parameter to limit the partition size, however nothing changed and the partitions are still skewed upon read:
spark.conf.set("spark.sql.files.maxPartitionBytes")
Also, when I cache this DF with the Hive table as source, it takes a few minutes and even causes some GC in the Spark UI, most likely because of the skew as well.
Does spark.sql.files.maxPartitionBytes work on Hive tables, or only on files?
What is the best course of action for handling this skewed Hive source?
Would something like a stage-barrier write to Parquet, or salting, be suitable for this problem?
I would like to avoid .repartition() on read, as it adds another layer to a job that is already a data roller-coaster.
Thank you
==================================================
After further research, it appears the window function is causing skewed data too, and this is where the Spark job hangs.
I am performing some time-series filling via a double window function (a forward fill followed by a backward fill to impute all the null sensor readings) and am trying to follow this article to try a salting method to distribute the work evenly ... however the following code produces all null values, so the salting method is not working.
Not sure why I am getting skew after the window, since each measure item I am partitioning by has roughly the same number of records (checked via .groupBy()) ... so why would salt be needed?
+--------------------+-------+
| measure | count|
+--------------------+-------+
| v1 |5030265|
| v2 |5009780|
| v3 |5030526|
| v4 |5030504|
...
salt post => https://medium.com/appsflyer/salting-your-spark-to-scale-e6f1c87dd18
from pyspark.sql import Window
import pyspark.sql.functions as F

nSaltBins = 300  # based off the number of "measure" values
df_fill = df_fill.withColumn("salt", (F.rand() * nSaltBins).cast("int"))

# FILLS [FORWARD + BACKWARD]
# note: the 'salt' column is created above but not used in the windows below
window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)

# FORWARD FILLING IMPUTER
ffill_imputer = F.last(df_fill['new_value'], ignorenulls=True)\
    .over(window)
df_fill = df_fill.withColumn('value_impute_temp', ffill_imputer)\
    .drop("value", "new_value")

window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(0, Window.unboundedFollowing)

# BACKWARD FILLING IMPUTER
bfill_imputer = F.first(df_fill['value_impute_temp'], ignorenulls=True)\
    .over(window)
df_fill = df_fill.withColumn('value_impute_final', bfill_imputer)\
    .drop("value_impute_temp")
Salting can be helpful when a single partition is big enough not to fit in memory on a single executor. This can happen even if all the keys are equally distributed (as in your case).
You have to include the salt column in the partitionBy clause you are using to create the Window:
window = Window.partitionBy('measure', 'salt')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
Then you have to create another window, which will operate on the intermediate result:
window1 = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
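A minimal sketch of how the two windows could be chained for the forward fill, following this outline and starting again from the original df_fill (with the salt column added but before the single-window fill); the intermediate column name partial_ffill is an assumption, while df_fill, new_value and value_impute_temp come from the question:

# first pass over the salted window ('measure', 'salt')
partial = df_fill.withColumn(
    "partial_ffill",
    F.last("new_value", ignorenulls=True).over(window))

# second pass over the un-salted window1 on the intermediate result,
# producing the value_impute_temp column that the backward-fill step consumes
df_fill = partial.withColumn(
    "value_impute_temp",
    F.last("partial_ffill", ignorenulls=True).over(window1))

Because the salt in the question is assigned randomly per row, this two-pass version is not guaranteed to reproduce the single-pass forward fill exactly; how close it gets depends on how the salt is assigned relative to the time ordering.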
Hive-based solution:
You can enable skew join optimization using Hive configuration. The applicable settings are:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
See the Databricks tips for this: skew hints may work in this case.

Error using filter on column value with Spark dataframe

Please refer to my sample code below:
sampleDf is my sample Scala DataFrame that I want to filter on two columns, startIPInt and endIPInt.
var row = sampleDf.filter("startIPInt <=" + ip).filter("endIPInt >= " + ip)
I now want to view the content of this row.
The following takes barely a second to execute but does not show me the contents of this row object:
println(row)
But this code takes too long to execute:
row.show()
So my question is: how should I view the content of this row object? Or is there an issue with the way I am filtering my dataframe?
My initial approach was to use filter as mentioned here: https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/DataFrame.html#filter(java.lang.String)
According to that, the following line of code gives me an error about "overloaded method 'filter'":
var row = sampleDf.filter($"startIPInt" <= ip).filter($"endIPInt" >= ip)
Can anyone help me understand what is happening here, and what is the right and fastest way to filter and view the contents of a dataframe as above?
First, using filter you don't really get a Row object; you get a new DataFrame.
The reason show takes longer to execute is that Spark is lazy. It only computes transformations when an action is called on the DataFrame (see e.g. Spark Transformation - Why its lazy and what is the advantage?). Calling println on a DataFrame does not trigger anything, so the filter transformations are not actually computed; show, on the other hand, requires some computation, which is why it is slower to execute.
Using
sampleDf.filter("startIPInt <=" + ip).filter("endIPInt >= " + ip)
and
sampleDf.filter($"startIPInt" <= ip).filter($"endIPInt" >= ip)
are equivalent and should give the same result, as long as you have imported the Spark implicits (needed for the $ notation).
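To actually see the contents, an action has to be called on the resulting DataFrame; a short sketch in PySpark for consistency with the rest of this document (the Scala DataFrame API behaves the same way), reusing the column names from the question:

result = sampleDf.filter(sampleDf["startIPInt"] <= ip) \
                 .filter(sampleDf["endIPInt"] >= ip)

print(result)            # prints only the schema; no computation happens
result.show()            # action: runs the filters and prints the matching rows
rows = result.collect()  # action: returns the matching rows to the driver as Row objects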

Spark Dataframe returns an inconsistent value on count()

I am using PySpark to perform some computations on data obtained from a PostgreSQL database. My pipeline is something similar to this:
limit = 1000
query = "(SELECT * FROM table LIMIT {}) as filter_query"
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://path/to/db") \
    .option("dbtable", query.format(limit)) \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()
df.createOrReplaceTempView("table")
df.count()  # 1000
So far, so good. The problem starts when I perform some transformations on the data:
counted_data = spark.sql("SELECT column1, count(*) as count FROM table GROUP BY column1").orderBy("column1")
counted_data.count() # First value
counted_data_with_additional_column = counted_data.withColumn("column1", my_udf_function)
counted_data_with_additional_column.count() # Second value, inconsistent with the first count (should be the same)
The first transformation alters the number of rows (the value should be <= 1000). However, the second one does not; it just adds a new column. How can it be that I am getting different results from count()?
The explanation is actually quite simple, but a bit tricky. Spark may perform additional reads from the input source (in this case the database). Since another process was inserting data into the database, these additional calls read slightly different data than the original read, causing the inconsistent behaviour. A simple call to df.cache() after the read prevents the further reads. I figured this out by analyzing the traffic between the database and my computer: indeed, some further SQL commands were issued that matched my transformations. After adding the cache() call, no further traffic appeared.
Since you are using LIMIT 1000, you might be getting a different 1000 records on each execution. And since you get different records each time, the result of the aggregation will be different. In order to get consistent behaviour with LIMIT you can try the following approaches.
Either cache your DataFrame with cache() or persist(), which ensures that Spark uses the same data for as long as it remains in memory.
A better approach, though, is to sort the data on some unique column and then take the 1000 records, which ensures that you get the same 1000 records each time.
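A minimal sketch of both options; the unique id column used for ordering is an assumption about the table:

# option 1: cache the DataFrame right after the read so later actions reuse the same rows
df = df.cache()
df.count()  # materializes the cache

# option 2: make the 1000-row sample deterministic by ordering on a unique column (assumed 'id')
query = "(SELECT * FROM table ORDER BY id LIMIT {}) as filter_query"

Pushing the ORDER BY into the JDBC query keeps the same 1000 rows coming back even if the read is re-executed.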
Hope it helps.

Split Spark DataFrame into parts

I have a table of distinct users, which has 400,000 users. I would like to split it into 4 parts, and expect each user to be located in one part only.
Here is my code:
val numPart = 4
val size = 1.0 / numPart
val nsizes = Array.fill(numPart)(size)
val data = userList.randomSplit(nsizes)
Then I write each data(i), for i from 0 to 3, into Parquet files. When I read the directory back, group by user id, and count by part, there are some users located in two or more parts.
I still have no idea why.
I have found the solution: cache the DataFrame before you split it.
Should be
val data = userList.cache().randomSplit(nsizes)
Still not sure why. My guess is that each time the randomSplit function "fills" one part, it reads the records from userList, which is re-evaluated from the Parquet file(s) and may yield a different order of rows; that is why some users are lost and some users are duplicated.
That is just my guess. If someone has an answer or explanation, I will update.
References:
(Why) do we need to call cache or persist on a RDD
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
http://159.203.217.164/using-sparks-cache-for-correctness-not-just-performance/
If your goal is to split it into different files, you can use functions.hash to calculate a hash, then mod 4 to get a number between 0 and 3, and when you write the Parquet use partitionBy, which will create a directory for each of the 4 values.
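A minimal sketch of that approach, written in PySpark for consistency with the rest of this document (the Scala API is analogous); the user_id column name and the output path are assumptions:

import pyspark.sql.functions as F

parted = userList.withColumn(
    "part",
    F.abs(F.hash("user_id")) % 4)  # deterministic per user, so a user always lands in the same part

parted.write.partitionBy("part").parquet("/tmp/user_parts")  # one sub-directory per part value

Unlike randomSplit, the hash is a pure function of the user id, so re-evaluating the input cannot move a user into a different part.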

Using groupBy in Spark and getting back to a DataFrame

I am having difficulty working with DataFrames in Spark with Scala. If I have a DataFrame from which I want to extract a column of unique entries, when I use groupBy I don't get a DataFrame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event     | other_stuff
34131231   | thing     | stuff
83423984   | notathing | notstuff
34131231   | thing     | morestuff
and I would like the unique machine ids where event is 'thing', stored in a new DataFrame, to allow me to do some filtering of some kind. Using
val machineId = logs
  .where($"event" === "thing")
  .select("machine_id")
  .groupBy("machine_id")
I get a val of GroupedData back, which is a pain to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique id's from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived but hopefully it explains what my issue is. It may be I don't know enough about GroupedData objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct, not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general; these can be used to apply aggregate functions and convert the result back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this:
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or equivalent method.
A groupBy in Spark followed by an aggregation and then a select statement will return a DataFrame. For your example it would be something like:
val machineId = logs
  .groupBy("machine_id", "event")
  .agg(max("other_stuff"))
  .where($"event" === "thing")
  .select($"machine_id")