PyFlink user-defined aggregate function does not accumulate values; merge and accumulate methods do not seem to be working

I am running the following PyFlink code, which consumes Kafka messages and performs sliding window aggregations:
# Import modules
import os
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, EnvironmentSettings, DataTypes, AggregateFunction
from pyflink.table.window import Slide
from pyflink.table.expressions import col, lit
from pyflink.table.udf import udaf
# Class to create alarm based on the number of ids in the window
class AlarmGen(AggregateFunction):

    def create_accumulator(self):
        return ['']

    def get_value(self, accumulator):
        num_ids = len(list(set(list(accumulator[0].split(',')[:-1]))))
        if num_ids > 7:
            return 'Alarm'
        else:
            return 'No Alarm'

    def accumulate(self, accumulator, value):
        accumulator[0] += value

    def get_result_type(self):
        return DataTypes.STRING()

    def get_accumulator_type(self):
        return DataTypes.STRING()

    def merge(self, accumulator, accumulators):
        accumulator += accumulators[0]
# Define environment
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
st_env = StreamTableEnvironment.create(env, environment_settings=env_settings)
# Define source table
st_env.execute_sql(
    f"""
    CREATE TABLE source (
        id INT,
        name STRING,
        age INT,
        ts BIGINT,
        all_ids STRING,
        rowtime AS TO_TIMESTAMP(FROM_UNIXTIME(ts, 'yyyy-MM-dd HH:mm:ss')),
        WATERMARK FOR rowtime AS rowtime - INTERVAL '0' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = '{os.environ["KAFKA_TOPIC"]}',
        'scan.startup.mode' = 'latest-offset',
        'properties.bootstrap.servers' = '{os.environ["KAFKA_HOST"]}',
        'properties.zookeeper.connect' = '{os.environ["ZOOKEEPER_HOST"]}',
        'properties.group.id' = '{os.environ["KAFKA_CONSUMER_GROUP"]}',
        'format' = 'json'
    )
    """
)
# Define sink table
st_env.execute_sql(
    """
    CREATE TABLE sink (
        average_age DOUBLE,
        window_end TIMESTAMP(3),
        alarm STRING
    ) WITH (
        'connector' = 'print',
        'print-identifier' = 'Average Age, Window, alarm: '
    )
    """
)
# Make functions available via udaf
alarm_gen = udaf(AlarmGen(), result_type=DataTypes.STRING())
# Execute function over a window
st_env.from_path("source")\
    .window(Slide.over(lit(5).seconds)\
        .every(lit(5).seconds).on("rowtime").alias("w"))\
    .group_by("w")\
    .select(col("age").avg, col("w").end, alarm_gen(col("all_ids")))\
    .execute_insert("sink")\
    .wait()
The inputs from the Kafka producer are in the following format:
{'id': 1, 'name': 'Osbourne', 'age': '73', 'ts': 1675667790, 'all_ids': '0,1,2,3,4,'}
The purpose of the AlarmGen class is to take the rows of the all_ids column, concatenate their values, extract the unique IDs, and count them. If the count exceeds 7, the function should return 'Alarm'; otherwise it returns 'No Alarm'.
Right now it seems the accumulator always remains '' (an empty string), whether it is passed into the accumulate function or the merge function.
As a result, the num_ids variable computed in get_value is always 0.
Any ideas why the merge and accumulate methods may not be behaving as they should?
I am working with PyFlink version 1.14.6.
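For reference, here is a minimal sketch of how the PyFlink documentation structures a general aggregate function, with a Row-based accumulator and get_accumulator_type declared as a matching ROW type rather than a plain STRING. The class name, field layout, and threshold below are illustrative assumptions, not a verified fix for the code above:
from pyflink.common import Row
from pyflink.table import AggregateFunction, DataTypes
from pyflink.table.udf import udaf

class IdAlarm(AggregateFunction):
    # Hypothetical UDAF: collects comma-separated id strings and flags when
    # the number of distinct ids exceeds a threshold.

    def create_accumulator(self):
        # Row-based accumulator with a single string field
        return Row("")

    def get_value(self, accumulator):
        ids = {i for i in accumulator[0].split(',') if i}
        return 'Alarm' if len(ids) > 7 else 'No Alarm'

    def accumulate(self, accumulator, value):
        if value is not None:
            accumulator[0] += value

    def merge(self, accumulator, accumulators):
        for other in accumulators:
            accumulator[0] += other[0]

    def get_result_type(self):
        return DataTypes.STRING()

    def get_accumulator_type(self):
        # The declared type mirrors the Row("") structure returned above
        return DataTypes.ROW([DataTypes.FIELD("f0", DataTypes.STRING())])

id_alarm = udaf(IdAlarm(), result_type=DataTypes.STRING())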

Related

Add error handling decorator in addition to pandas_udf

I'd like to create a decorator that handles errors inside of a pandas_udf. I've made a few attempts with no luck, so I wanted to see whether anyone has been successful in doing this.
Below is some initial code I've tried but it fails. In this example, I'm trying to decorate the function pandas_divide with both pandas_udf and a new decorator to detect errors, return_code.
I'm not sure whether my idea is possible, given that pandas UDFs require us to define a single return data type (whereas wrapping the call in a safe wrapper would mean the column could hold either the function's output or an exception). I tried researching whether I could define a new PySpark data type that is the union of one data type, an exception, and None, but did not have any luck. Is this possible?
I was also thinking of using a closure to try and get this functionality but closures are new to me so I'm still looking into this.
from pyspark.sql import types as T
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
# create dataframe for testing
df = spark.range(0, 10).withColumn('id', (F.col('id') / 10).cast('integer')).withColumn('v', F.rand())
columns = ['id', 'v']
vals = [(1, 2), (2, 0), (3, 0)]
new_rows_df = spark.createDataFrame(vals, columns)
df = df.union(new_rows_df)
df.cache()
df.count()
display(df)
class ReturnCode:
    def __init__(self):
        self.pass1 = 'PASS'
        self.fail1 = 'FAIL'

    def __call__(self, fn, *args, **kwargs):
        def inner_func(self, *args, **kwargs):
            try:
                output = func(**kwargs)
                return_code = self.pass1
            except Exception as ex:
                output = f"{ex}"
                return_code = self.fail1
            return (return_code, output)
        return inner_func

return_code = ReturnCode()

@pandas_udf(T.StructType(T.StructField('return_code', T.StringType()), T.StructField('value', T.IntegerType())))
@return_code
def pandas_divide(v):
    if v == 0:
        raise
    return 1/v

# pandas_divide(0)[0]
df = df.withColumn('pandas_divide', pandas_divide(F.col('v')))
df.show()
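As a point of comparison, here is a rough, untested sketch that moves the try/except inside the UDF itself instead of composing two decorators. It assumes Spark 3.0+, where a scalar pandas_udf may declare a StructType return type and produce a pandas DataFrame, and it encodes both the status and the value as strings because a single struct field cannot hold a union of a number and an exception. The names safe_divide and result_schema are made up for illustration:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf

# Hypothetical schema: carry a status code and the result (or error message) as strings
result_schema = T.StructType([
    T.StructField('return_code', T.StringType()),
    T.StructField('value', T.StringType()),
])

@pandas_udf(result_schema)
def safe_divide(v: pd.Series) -> pd.DataFrame:
    codes, values = [], []
    for x in v:
        try:
            if x == 0:
                raise ZeroDivisionError('division by zero')
            codes.append('PASS')
            values.append(str(1 / x))
        except Exception as ex:
            codes.append('FAIL')
            values.append(str(ex))
    return pd.DataFrame({'return_code': codes, 'value': values})

# Usage (assuming the df defined above):
# df = df.withColumn('result', safe_divide(F.col('v')))
# df.select('v', 'result.return_code', 'result.value').show()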

Is it possible (and how) to specify an SQL query on the command line with spark-submit?

I have the following code:
def main(args: Array[String]) {
  var dvfFiles : String = "g:/data/gouv/dvf/raw"
  var q : String = ""
  //q = "SELECT distinct DateMutation, NVoie, IndVoie, Voie, Valeur, CodeTypeLocal, TypeLocal, Commune FROM mutations WHERE Commune = 'ICI' and Valeur > 100000 and CodeTypeLocal in (1, 2) order by Valeur desc"
  args.sliding(2, 2).toList.collect {
    case Array("--sfiles", argFiles: String) => dvfFiles = argFiles
    case Array("--squery", argQ: String) => q = argQ
  }
  println(s"files from: ${dvfFiles}")
If I run the following command:
G:\dev\fromGit\dvf\spark>spark-submit .\target\scala-2.11\dfvqueryer_2.11-1.0.jar \
--squery "SELECT distinct DateMutation, NVoie, IndVoie, Voie, Valeur, CodeTypeLocal, \
TypeLocal, Commune FROM mutations WHERE (Commune = 'ICI') and (Valeur > 100000) and (CodeTypeLocal in (1, 2)) order by Valeur desc"
I got the following result:
== SQL ==
SELECT distinct DateMutation, NVoie, IndVoie, Voie, Valeur, CodeTypeLocal, TypeLocal, Commune FROM mutations WHERE (Commune = 'ICI') and (Valeur and (CodeTypeLocal in (1, 2)) order by Valeur desc
----------------------------------------------------------------------------------------------^^^
with the ^^^ pointing at the FROM.
I also notice the missing > 100000 after Valeur.
The query itself is correct: if I uncomment the //q = ... line, package the code, and submit it, everything works fine.
It seems that part of the query is being lost while the arguments are read. One solution to this problem is to pass the entire SELECT query as a single argument and read it into a string value. In that form it can be passed straight to the sql function to run your query. Below is how you can build out the function:
//The Package Tree
package stack.overFlow

//Call all needed packages
import org.apache.spark.sql.{DataFrame, SparkSession, Column, SQLContext}
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql

//Object Name
object demoCode {
  def main(args: Array[String]) {
    ///Build the contexts
    var spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    var sc = spark.sparkContext
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    //Set the query as a string for argument 1
    val commandQuery : String = args(0)

    //Pass query to the sql function
    val inputDF = spark.sql(commandQuery)
  }
}
Once the code compiles you will need two things: (1) the JAR and (2) the package tree and class for running the function. When running both of those with --class, all you need to do is include a space and pass through the SQL query, so at run time it will be loaded into the Spark session.
spark-submit --class stack.overFlow.demoCode /home/user/demo_code/target/demoCode-compilation-jar.jar \
"SELECT distinct DateMutation, NVoie, IndVoie, Voie, Valeur, CodeTypeLocal, TypeLocal, Commune FROM mutations WHERE (Commune = 'ICI') and (Valeur > 100000) and (CodeTypeLocal in (1, 2)) order by Valeur desc"
Would this help your use-case or do you need it to be in another format?

How to take data from several parquet files at once?

I need your help because I am new to the Spark framework.
I have a folder with a lot of parquet files. The names of these files all have the same format: DD-MM-YYYY. For example: '01-10-2018', '02-10-2018', '03-10-2018', etc.
My application has two input parameters: dateFrom and dateTo.
When I try to use the following code, the application hangs. It seems like the application scans all the files in the folder.
val mf = spark.read.parquet("/PATH_TO_THE_FOLDER/*")
  .filter($"DATE".between(dateFrom + " 00:00:00", dateTo + " 23:59:59"))
mf.show()
I need to get the data for a given period as fast as possible.
I think it would be great to divide the period into days, read the files separately, and then join them like this:
val mf1 = spark.read.parquet("/PATH_TO_THE_FOLDER/01-10-2018");
val mf2 = spark.read.parquet("/PATH_TO_THE_FOLDER/02-10-2018");
val finalDf = mf1.union(mf2).distinct();
dateFrom and dateTo are dynamic, so I don't know how to organize the code correctly right now. Please help!
@y2k-shubham I tried to test the following code, but it raises an error:
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}
val dateFrom = DateTime.parse("2018-10-01")
val dateTo = DateTime.parse("2018-10-05")
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}
val datesInBetween: Seq[DateTime] = getDatesInBetween(dateFrom, dateTo)
val unionDf: DataFrame = datesInBetween.foldLeft(spark.emptyDataFrame) { (intermediateDf: DataFrame, date: DateTime) =>
  intermediateDf.union(spark.read.parquet("PATH" + date.toString("yyyy-MM-dd") + "/*.parquet"))
}
unionDf.show()
ERROR:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 20 columns;
It seems the intermediateDf DataFrame is empty at the start. How do I fix the problem?
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.{DataFrame, SparkSession}
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
def dateRangeInclusive(start: String, end: String): Iterator[LocalDate] = {
  val startDate = LocalDate.parse(start, formatter)
  val endDate = LocalDate.parse(end, formatter)
  Iterator.iterate(startDate)(_.plusDays(1))
    .takeWhile(d => d.isBefore(endDate) || d.isEqual(endDate))
}
val spark = SparkSession.builder().getOrCreate()
val data: DataFrame = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => spark.read.parquet(s"/path/to/directory/${formatter.format(d)}"))
  .reduce(_ union _)
I also suggest using the native JSR 310 API (part of Java SE since Java 8) rather than joda-time, since it is more modern and does not require external dependencies. Note that first creating a sequence of paths and doing map+reduce is probably simpler for this use case than a more general foldLeft-based solution.
Additionally, you can use reduceOption, then you'll get an Option[DataFrame] if the input date range is empty. Also, if it is possible for some input directories/files to be missing, you'd want to do a check before invoking spark.read.parquet. If your data is on HDFS, you should probably use the Hadoop FS API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(new Configuration(spark.sparkContext.hadoopConfiguration))
val data: Option[DataFrame] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .filter(p => fs.exists(new Path(p)))
  .map(spark.read.parquet(_))
  .reduceOption(_ union _)
While I haven't tested this piece of code, it should work (perhaps with slight modification):
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}
// return no of days between two dates
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays
// return sequence of dates between two dates
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}

// read parquet data of given date-range from given path
// (you might want to pass SparkSession in a different manner)
def readDataForDateRange(path: String, from: DateTime, to: DateTime)(implicit spark: SparkSession): DataFrame = {
  // get date-range sequence
  val datesInBetween: Seq[DateTime] = getDatesInBetween(from, to)

  // read data of from-date (needed because schema of all DataFrames should be same for union)
  val fromDateDf: DataFrame = spark.read.parquet(path + "/" + datesInBetween.head.toString("yyyy-MM-dd"))

  // read and union remaining dataframes (functionally)
  val unionDf: DataFrame = datesInBetween.tail.foldLeft(fromDateDf) { (intermediateDf: DataFrame, date: DateTime) =>
    intermediateDf.union(spark.read.parquet(path + "/" + date.toString("yyyy-MM-dd")))
  }

  // return union-df
  unionDf
}
Reference: How to calculate 'n' days interval date in functional style?

Display column names as a List[Column] in Scala

I want to insert the list of columns from a DataFrame into a List[Column] so I can perform a select. That is, I want to get the list of columns and insert it automatically into a List[Column]. Any help? Thanks.
object PCA extends App {
  val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
  val strPath = "C:/Users/mhattabi/Desktop/testBis2.txt"
  val intial_Data = spark.read.option("header", true).csv(strPath)

  //array string contains names of column
  val arrayList = intial_Data.columns
  var colsList = List[Column]()

  //wanna insert name of column into the listColum
  arrayList.foreach(p => colsList.)

  //i want to have something like
  //val colsList = List(col("col1"),col("col2"))
  //intial_Data.select(colsList:_*).show
}
You could use the col function as follows:
var colsList = List[Column]()
arrayList.foreach { c => colsList :+= col(c) }
Remember to import sql functions to use col:
import org.apache.spark.sql.functions._
I would rather use an immutable list than a mutable one, building it with a transformation like below.
val arrayList = initial_Data.columns
val colsList = arrayList.map(col)

How do I filter rows based on whether a column value is in a Set of Strings in a Spark DataFrame

Is there a more elegant way of filtering based on values in a Set of String?
def myFilter(actions: Set[String], myDF: DataFrame): DataFrame = {
  val containsAction = udf((action: String) => {
    actions.contains(action)
  })
  myDF.filter(containsAction('action))
}
In SQL you can do
select * from myTable where action in ('action1', 'action2', 'action3')
How about this:
myDF.filter("action in (1,2)")
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(1,2).map(lit(_)):_*))
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(lit(1),lit(2)):_*))
Additional support will be added to make this cleaner in 1.5