Spark Structure Streaming Differentiate over time - spark-structured-streaming

I have a table with a timestamp column (t) and a list of columns for which I would like to compute the difference over time (v), grouped by some key(k): v_diff(t) = v(t)-v(t-1) for each k independently.
Normally I would write:
lag_window = Window.partitionBy(COLS_TO_DIFF).orderBy('timestamp')
for col in COLS_TO_DIFF:
df = df.withColumn(
col + "_diff",
df[col] - F.lag(df[col]).over(lag_window))
Yet for streaming I get:
AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;
How do I get around that?
Note: my data is streaming slowly in batches

Related

Joing large RDDs in scala spark

I want to join large(1TB) data RDD with medium(10GB) size data RDD. There was an earlier processing on large data with was completing in 8 hours. I then joined the medium sized data to get an info that need to be add to the processing(its a simple join, which takes the value of second column and add it to the final output along with the large data processed output. But this job is running longer for more than 1 day. How do I optimize it? I tried to refer some solutions like Refer. But the solution are for spark dataframe. How do I optimize it for RDD?
Large dataset
1,large_blob_of_info
2,large_blob_of_info
3,large_blob_of_info
4,large_blob_of_info
5,large_blob_of_info
6,large_blob_of_info
Medium size data
3,23
2,45
1,67
4,89
Code that I have have.
rdd1.join(rdd2).map(a => a.x)
val result = input
.map(x => {
val row = x.split(",")
(row(0), row(2))
}).filter(x=> x._1 != null !x._1.isEmpty)

Extract and Replace values from duplicates rows in PySpark Data Frame

I have duplicate rows of the may contain the same data or having missing values in the PySpark data frame.
The code that I wrote is very slow and does not work as a distributed system.
Does anyone know how to retain single unique values from duplicate rows in a PySpark Dataframe which can run as a distributed system and with fast processing time?
I have written complete Pyspark code and this code works correctly.
But the processing time is really slow and its not possible to use it on a Spark Cluster.
'''
# Columns of duplicate Rows of DF
dup_columns = df.columns
for row_value in df_duplicates.rdd.toLocalIterator():
print(row_value)
# Match duplicates using std name and create RDD
fill_duplicated_rdd = ((df.where((sf.col("stdname") == row_value['stdname'] ))
.where(sf.col("stdaddress")== row_value['stdaddress']))
.rdd.map(fill_duplicates))
# Creating feature names for the same RDD
fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
(sf.col("stdaddress")== row_value['stdaddress'])))
.rdd.map(fill_duplicated_columns_extract)).first())
# Creating DF using the previous RDD
# This DF stores value of a single set of matching duplicate rows
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
try:
col_value = ([str(value[column]) for value in
df_streamline.select(col(column)).distinct().rdd.toLocalIterator() if value[column] != ""])
if len(col_value) >= 1:
# non null or empty value of a column store here
# This value is a no duplicate distinct value
col_value = col_value[0]
#print(col_value)
# The non-duplicate distinct value of the column is stored back to
# replace any rows in the PySpark DF that were empty.
df_dedup = (df_dedup
.withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
& (sf.col("stdaddress")== row_value['stdaddress'])
,col_value)
.otherwise(df_dedup[column])))
#print(col_value)
except:
print("None")
'''
There are no error messages but the code is running very slow. I want a solution that fills rows with unique values in PySpark DF that are empty. It can fill the rows with even mode of the value
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
try:
# distinct() was replaced by isNOTNULL().limit(1).take(1) to improve the speed of the code and extract values of the row.
col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
df_dedup = (df_dedup
.withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
& (sf.col("stdaddress")== row_value['stdaddress'])
,col_value)
.otherwise(df_dedup[column])))
"""

finding the count, pass and fail percentage of the multiple values from single column while aggregating another column using pyspark

Data
I want to apply groupby for column1 and want to calculate the percentage of passed and failed percentage for each 1 and as well count
Example ouput I am looking for
Using pyspark I am doing the below code but I am only getting the percentage
levels = ["passed", "failed","blocked"]
exprs = [avg((col("Column2") == level).cast("double")*100).alias(level)
for level in levels]
df = sparkSession.read.json(hdfsPath)
result1 = df1.select('Column1','Column2').groupBy("Column1").agg(*exprs)
You would need to explicitly calculate the counts, and then do some string formatting to combine the percentages in the counts into a single column.
from pyspark.sql.functions import avg, col, count, concat, lit
levels = ["passed", "failed","blocked"]
# percentage aggregations
pct_exprs = [avg((col("Column2") == level).cast("double")*100).alias('{}_pct'.format(level))
for level in levels]
# count aggregations
count_exprs = [sum((col("Column2") == level).cast("int")).alias('{}_count'.format(level))
for level in levels]
# combine all aggregations
exprs = pct_exprs + count_exprs
# string formatting select expressions
select_exprs = [
concat(
col('{}_pct'.format(level)).cast('string'),
lit('('),
col('{}_count'.format(level)).cast('string'),
lit(')')
).alias('{}_viz'.format(level))
for level in levels
]
df = sparkSession.read.json(hdfsPath)
result1 = (
df1
.select('Column1','Column2')
.groupBy("Column1")
.agg(*exprs)
.select('Column1', *select_exprs)
)
NB: it seems like you are trying to use Spark to make a nice visualization of the results of your calculations, but I don't think Spark is well-suited for this task. If you have few enough records that you can see all of them at once, you might as well work locally in Pandas or something similar. And if you have enough records that using Spark makes sense, then you can't see all of them at once anyway so it doesn't matter too much whether they look nice.

How to get the row top 1 in Spark Structured Streaming?

I have an issue with Spark Streaming (Spark 2.2.1). I am developing a real time pipeline where first I get data from Kafka, second join the result with another table, then send the Dataframe to a ALS model (Spark ML) and it return a streaming Dataframe with one additional column predit. The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
I tried:
Apply SQL functions like Limit, Take, sort
dense_rank() function
search in StackOverflow
I read Unsupported Operations but doesn't seem to be much there.
Additional with the highest score I would send to a Kafka queue
My code is as follows:
val result = lines.selectExpr("CAST(value AS STRING)")
.select(from_json($"value", mySchema).as("data"))
//.select("data.*")
.selectExpr("cast(data.largo as int) as largo","cast(data.stock as int) as stock","data.verificavalormax","data.codbc","data.ide","data.timestamp_cli","data.tef_cli","data.nombre","data.descripcion","data.porcentaje","data.fechainicio","data.fechafin","data.descripcioncompleta","data.direccion","data.coordenadax","data.coordenaday","data.razon_social","data.segmento_app","data.categoria","data.subcategoria")
result.printSchema()
val model = ALSModel.load("ALSParaTiDos")
val fullPredictions = model.transform(result)
//fullPredictions is a streaming dataframe with a extra column "prediction", here i need the code to get the first row
val query = fullPredictions.writeStream.format("console").outputMode(OutputMode.Append()).option("truncate", "false").start()
query.awaitTermination()
Update
Maybe I was not clear, so I'm attaching an image with my problem. Also I wrote a more simple code to complement it: https://gist.github.com/.../9193c8a983c9007e8a1b6ec280d8df25
detailing what i need. Please I will appreciate any help :)
TL;DR Use stream-stream inner joins (Spark 2.3.0) or use memory sink (or a Hive table) for a temporary storage.
I think that the following sentence describes your case very well:
The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
Machine learning aside as it gives you a streaming Dataset with predictions so focusing on finding a maximum value in a column in a streaming Dataset is the real case here.
The first step is to calculate the max value as follows (copied directly from your code):
streaming.groupBy("idCustomer").agg(max("score") as "maxscore")
With that, you have two streaming Datasets that you can join as of Spark 2.3.0 (that has been released few days ago):
In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming Datasets/DataFrames.
Inner joins on any kind of columns along with any kind of join conditions are supported.
Inner join the streaming Datasets and you're done.
Try this:
Implement a function that extract the max value of the column and then filter your dataframe with the max
def getDataFrameMaxRow(df:DataFrame , col:String):DataFrame = {
// get the maximum value
val list_prediction = df.select(col).toJSON.rdd
.collect()
.toList
.map { x => gson.fromJson[JsonObject](x, classOf[JsonObject])}
.map { x => x.get(col).getAsString.toInt}
val max = getMaxFromList(list_prediction)
// filter dataframe by the maximum value
val df_filtered = df.filter(df(col) === max.toString())
return df_filtered
}
def getMaxFromList(xs: List[Int]): Int = xs match {
case List(x: Int) => x
case x :: y :: rest => getMaxFromList( (if (x > y) x else y) :: rest )
}
And in the body of your code add:
import com.google.gson.JsonObject
import com.google.gson.Gson
import org.apache.spark.sql.DataFrame
val fullPredictions = model.transform(result)
val df_with_row_max = getDataFrameMaxRow(fullPredictions, "prediction")
Good Luck !!

Best way to gain performance when doing a join count using spark and scala

i have a requirement to validate an ingest operation , bassically, i have two big files within HDFS, one is avro formatted (ingested files), another one is parquet formatted (consolidated file).
Avro file has this schema:
filename, date, count, afield1,afield2,afield3,afield4,afield5,afield6,...afieldN
Parquet file has this schema:
fileName,anotherField1,anotherField1,anotherField2,anotherFiel3,anotherField14,...,anotherFieldN
If i try to load both files in a DataFrame and then try to use a naive join-where, the job in my local machine takes more than 24 hours!, which is unaceptable.
ingestedDF.join(consolidatedDF).where($"filename" === $"fileName").count()
¿Which is the best way to achieve this? ¿dropping colums from the DataFrame before doing the join-where-count? ¿calculating the counts per dataframe and then join and sum?
PD
I was reading about map-side-joint technique but it looks that this technique would work for me if there was a small file able to fit in RAM, but i cant assure that, so, i would like to know which is the prefered way from the community to achieve this.
http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
I would approach this problem by stripping down the data to only the field I'm interested in (filename), making a unique set of the filename with the source it comes from (the origin dataset).
At this point, both intermediate datasets have the same schema, so we can union them and just count. This should be orders of magnitude faster than using a join on the complete data.
// prepare some random dataset
val data1 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.8).map(i => (s"file$i", i, "rubbish"))
val data2 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.7).map(i => (s"file$i", i, "crap"))
val df1 = sparkSession.createDataFrame(data1).toDF("filename", "index", "data")
val df2 = sparkSession.createDataFrame(data2).toDF("filename", "index", "data")
// select only the column we are interested in and tag it with the source.
// Lets make it distinct as we are only interested in the unique file count
val df1Filenames = df1.select("filename").withColumn("df", lit("df1")).distinct
val df2Filenames = df2.select("filename").withColumn("df", lit("df2")).distinct
// union both dataframes
val union = df1Filenames.union(df2Filenames).toDF("filename","source")
// let's count the occurrences of filename, by using a groupby operation
val occurrenceCount = union.groupBy("filename").count
// we're interested in the count of those files that appear in both datasets (with a count of 2)
occurrenceCount.filter($"count"===2).count