Add error handling decorator in addition to pandas_udf - pyspark

I'd like to create a decorator that handles errors inside of a pandas_udf. I've made a few attempts with no luck, so I wanted to see if anyone has been successful in doing this.
Below is some initial code I've tried, but it fails. In this example, I'm trying to decorate the function pandas_divide with both pandas_udf and a new error-detecting decorator, return_code.
I'm not sure if my idea is possible, given that pandas UDFs require us to define a single return data type (whereas wrapping the function in a safe call would allow either the function's output or an exception to be returned in the column). I tried researching whether I could define a new PySpark data type that is the union of one data type, an exception, and None, but did not have any luck - is this possible?
I was also thinking of using a closure to get this functionality, but closures are new to me, so I'm still looking into this.
from pyspark.sql import types as T
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
# create dataframe for testing
df = spark.range(0, 10).withColumn('id', (F.col('id') / 10).cast('integer')).withColumn('v', F.rand())
columns = ['id', 'v']
vals = [(1, 2), (2, 0), (3, 0)]
new_rows_df = spark.createDataFrame(vals, columns)
df = df.union(new_rows_df)
df.cache()
df.count()
display(df)
class ReturnCode:
    def __init__(self):
        self.pass1 = 'PASS'
        self.fail1 = 'FAIL'

    def __call__(self, fn):
        def inner_func(*args, **kwargs):
            try:
                output = fn(*args, **kwargs)
                return_code = self.pass1
            except Exception as ex:
                output = f"{ex}"
                return_code = self.fail1
            return (return_code, output)
        return inner_func

return_code = ReturnCode()

@pandas_udf(T.StructType([T.StructField('return_code', T.StringType()), T.StructField('value', T.IntegerType())]))
@return_code
def pandas_divide(v):
    if v == 0:
        raise ValueError('v is zero')
    return 1 / v

# pandas_divide(0)[0]
df = df.withColumn('pandas_divide', pandas_divide(F.col('v')))
df.show()
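One possible direction, sketched under assumptions (Spark 3.x, where a scalar pandas UDF may return a pandas DataFrame for a StructType result; the names safe_elementwise and result_schema are illustrative, not from the original post): have the wrapper catch errors per element and emit a struct column with a return code and a nullable value. This trades away the vectorization benefit of the pandas UDF in exchange for per-element error capture.

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf

# struct<return_code: string, value: double>; value is null when the call fails
result_schema = T.StructType([
    T.StructField('return_code', T.StringType()),
    T.StructField('value', T.DoubleType()),
])

def safe_elementwise(fn):
    # wrap a scalar Python function so every element is evaluated inside try/except
    def wrapper(s: pd.Series) -> pd.DataFrame:
        codes, values = [], []
        for x in s:
            try:
                values.append(fn(x))
                codes.append('PASS')
            except Exception as ex:
                values.append(None)
                codes.append(f'FAIL: {ex}')
        return pd.DataFrame({'return_code': codes, 'value': values})
    return wrapper

@pandas_udf(result_schema)
@safe_elementwise
def pandas_divide(v):
    return 1 / v

df = df.withColumn('divide_result', pandas_divide(F.col('v')))
df.select('v', 'divide_result.return_code', 'divide_result.value').show(truncate=False)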


How to remove "Missing transform attribute error"?

I am writing code in Palantir Foundry using PySpark and I have an error which I am unable to figure out.
The error is:
A TransformInput object does not have an attribute withColumn.
Please check the spelling and/or the datatype of the object.
My code, for your reference:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.functions import when
from transforms.api import configure, transform, Input, Output

@transform(
    result=Output('Output_data_file_location'),
    first_input=Input('Input_file1'),
    second_input=Input('Input_file2'),
)
def function_temp(first_input, second_input, result):
    from pyspark.sql.functions import monotonically_increasing_id
    res = ncbs.withColumn("id", monotonically_increasing_id())
    # Recode type
    res = res.withColumn("old_col_type", F.when(
        (F.col("col_type") == 'left') | (F.col("col_type") == 'right'), 'turn'
    ).when(
        (F.col("col_type") == 'up') | (F.col("col_type") == 'down'), 'straight'
    ))
    res = res.withColumnRenamed("old_col_type", "t_old_col_type") \
        .withColumnRenamed("old_col2_type", "t_old_col2_type")
    res = res.filter(res.col_type == 'straight')
    res = res.join(second_input,  # eqNullSafe is like an equal sign but includes null in join
                   (res.col1.eqNullSafe(second_input.pre_col1)) &
                   (res.col2.eqNullSafe(second_input.pre_col2)),
                   how='left') \
        .drop(*["pre_col1", "pre_col2"]).withColumnRenamed("temp_result", "final_answer")
    result.write_dataframe(res)
Can anyone help me with the error? Thanks in advance.
The error you are receiving explains it pretty well: you are calling .withColumn() on an object that is not a regular Spark DataFrame but a TransformInput object. You need to call the .dataframe() method to access the DataFrame.
The documentation for reference.
In addition, you should move the monotonically_increasing_id import to the top of the file, since Foundry's transform logic level versioning only works when imports happen at the module level, according to the documentation.
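A minimal sketch of what that fix could look like, assuming the same transform signature as in the question and that the question's ncbs refers to the first input (the column logic is elided):

import pyspark.sql.functions as F
from pyspark.sql.functions import monotonically_increasing_id  # module-level import
from transforms.api import transform, Input, Output

@transform(
    result=Output('Output_data_file_location'),
    first_input=Input('Input_file1'),
    second_input=Input('Input_file2'),
)
def function_temp(first_input, second_input, result):
    # TransformInput -> Spark DataFrame
    res = first_input.dataframe().withColumn("id", monotonically_increasing_id())
    second_df = second_input.dataframe()
    # ... the rest of the column logic from the question, applied to res and second_df ...
    result.write_dataframe(res)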

Flink: RowRowConverter seems to fail for nested DataTypes

I am trying to load a complex JSON file (multiple different data types, nested objects/arrays, etc.) from my local machine, read it in as a source using the Table API filesystem connector, convert it into a DataStream, and then do some action afterwards (not shown here for brevity).
The conversion gives me a DataStream of type DataStream[Row], which I need to convert to DataStream[RowData] (for sink purposes; I won't go into details here). Thankfully, there's a RowRowConverter utility that helps to do this mapping. It worked when I tried a completely flat JSON, but once I introduced arrays and maps within the JSON, it no longer works.
Here is the exception that was thrown - a null pointer exception:
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.allocateWriter(ArrayObjectArrayConverter.java:140)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toBinaryArrayData(ArrayObjectArrayConverter.java:114)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toInternal(ArrayObjectArrayConverter.java:93)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toInternal(ArrayObjectArrayConverter.java:40)
at org.apache.flink.table.data.conversion.DataStructureConverter.toInternalOrNull(DataStructureConverter.java:61)
at org.apache.flink.table.data.conversion.RowRowConverter.toInternal(RowRowConverter.java:75)
at flink.ReadJsonNestedData$.$anonfun$main$2(ReadJsonNestedData.scala:48)
Interestingly, when I set up my breakpoints and debugger, this is what I discovered: the first time RowRowConverter::toInternal is called, it works and goes all the way down to ArrayObjectArrayConverter::allocateWriter().
However, for some strange reason, RowRowConverter::toInternal runs twice, and if I continue stepping through, eventually it comes back here, which is where the null pointer exception happens.
Example of the JSON (simplified to a single nested field for brevity); I placed it in my /src/main/resources folder:
{"discount":[670237.997082,634079.372133,303534.821218]}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.table.api.DataTypes
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment
import org.apache.flink.table.data.conversion.RowRowConverter
import org.apache.flink.table.types.FieldsDataType
import org.apache.flink.table.types.logical.RowType
import scala.collection.JavaConverters._

object ReadJsonNestedData {
  def main(args: Array[String]): Unit = {
    // setup
    val jsonResource = getClass.getResource("/NESTED.json")
    val jsonFilePath = jsonResource.getPath
    val tableName = "orders"
    val readJSONTable =
      s"""
         | CREATE TABLE $tableName (
         |   `discount` ARRAY<DECIMAL(12, 6)>
         | ) WITH (
         |   'connector' = 'filesystem',
         |   'path' = '$jsonFilePath',
         |   'format' = 'json'
         | )""".stripMargin
    val colFields = Array(
      "discount"
    )
    val defaultDataTypes = Array(
      DataTypes.ARRAY(DataTypes.DECIMAL(12, 6))
    )
    val rowType = RowType.of(defaultDataTypes.map(_.getLogicalType), colFields)
    val defaultDataTypesAsList = defaultDataTypes.toList.asJava
    val dataType = new FieldsDataType(rowType, defaultDataTypesAsList)
    val rowConverter = RowRowConverter.create(dataType)

    // Job
    val env = StreamExecutionEnvironment.getExecutionEnvironment()
    val tableEnv = StreamTableEnvironment.create(env)
    tableEnv.executeSql(readJSONTable)
    val ordersTable = tableEnv.from(tableName)
    val dataStream = tableEnv
      .toDataStream(ordersTable)
      .map(row => rowConverter.toInternal(row))
    dataStream.print()
    env.execute()
  }
}
I would hence like to know:
1. Why the RowRowConverter is not working, and how I can remedy it
2. Why RowRowConverter::toInternal runs twice for the same Row, which may be the cause of that NullPointerException
3. Whether my method of instantiating and using the RowRowConverter is correct, based on my code above
Thank you!
Environment:
IntelliJ 2021.3.2 (Ultimate)
AdoptOpenJDK 1.8
Scala: 2.12.15
Flink: 1.13.5
Flink Libraries Used (for this example):
flink-table-api-java-bridge
flink-table-planner-blink
flink-clients
flink-json
The first call of RowRowConverter::toInternal is an internal implementation detail that makes a deep copy of the StreamRecord emitted by the table source; it is independent of the converter in your map function. The reason for the NPE is that the RowRowConverter in the map function has not been initialized by calling RowRowConverter::open. You can use a RichMapFunction instead and invoke RowRowConverter::open in RichMapFunction::open.
Thank you to @renqs for the answer.
Here is the code, if anyone is interested.
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.table.data.RowData
import org.apache.flink.table.data.conversion.RowRowConverter
import org.apache.flink.table.types.FieldsDataType
import org.apache.flink.types.Row

class ConvertRowToRowDataMapFunction(fieldsDataType: FieldsDataType)
  extends RichMapFunction[Row, RowData] {

  private final val rowRowConverter = RowRowConverter.create(fieldsDataType)

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    rowRowConverter.open(this.getClass.getClassLoader)
  }

  override def map(row: Row): RowData =
    this.rowRowConverter.toInternal(row)
}

// at the main function
// ... continue from previous
val dataStream = tableEnv
  .toDataStream(personsTable)
  .map(new ConvertRowToRowDataMapFunction(dataType))

How can pyspark remember something in memory like class attributes in mapreduce?

I have a table with two columns: image_url and comment. The same image may have many comments, and the data are sorted by image_url within the files.
I need to crawl each image and convert it to binary, which takes a long time, so for the same image I want to do it only once.
In MapReduce, I can remember the last row and its result in memory:
class Mapper:
    def __init__(self):
        self.image_url = None
        self.image_bin = None

    def run(self, image_url, comment):
        if image_url != self.image_url:
            self.image_url = image_url
            self.image_bin = process(image_url)
        return self.image_url, self.image_bin, comment
How can I do it in PySpark? Either the RDD or the DataFrame API is fine.
I would advise you to simply process a grouped version of your dataframe, something like this:
from pyspark.sql import functions as F

# Assuming df is your dataframe
df = df.groupBy("image_url").agg(F.collect_list("comment").alias("comments"))
df = df.withColumn("image_bin", process(F.col("image_url")))
df.select(
    "image_url",
    "image_bin",
    F.explode("comments").alias("comment"),
).show()
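Note that this assumes process is available as a Spark UDF returning the binary image data. A hedged sketch of what that wrapper could look like (the urllib download is illustrative, not from the original post):

import urllib.request

from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.BinaryType())
def process(url):
    # hypothetical helper: download the image bytes for a single URL
    with urllib.request.urlopen(url) as resp:
        return resp.read()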
I found that mapPartitions works. The code looks like this:
def do_cover_partition(partitionData):
    last_url = None
    last_bin = None
    for row in partitionData:
        data = row.asDict()
        print(data)
        if data['cover_url'] != last_url:
            last_url = data['cover_url']
            last_bin = url2bin(last_url)
        print(data['comment'])
        data['frames'] = last_bin
        yield data

columns = ["id", "cover_url", "comment", "frames"]
df = df.rdd.mapPartitions(do_cover_partition).map(lambda x: [x[c] for c in columns]).toDF(columns)

Scala withColumn only if both columns exist

I have seen some variations of this question asked but haven't found exactly what I'm looking for. Here is the question:
I have some report names that I have collected in a dataframe and pivoted. The trouble I am having is with the resilience of report_name: I can't be assured that in every 90-day window data will be present and that Rpt1, Rpt2, and Rpt3 will all be there. So how do I go about creating a calculation ONLY if the column is present? I have outlined how my code looks right now. It works if all columns are there, but I'd like to future-proof it so that if a report is not present in the 90-day window the pipeline does not error out, but instead just skips the .withColumn addition.
df1 = (reports.alias("r")
    .groupBy(uniqueid)
    .filter("current_date<=90")
    .pivot(report_name)
    ...)

Result would be the following columns: uniqueid, Rpt1, Rpt2, Rpt3

+---+-----+-----+-----+
|id |Rpt1 |Rpt2 |Rpt3 |
+---+-----+-----+-----+
|205|72   |36   |12   |
+---+-----+-----+-----+

df2 = (df1.alias("d1")
    .withColumn("new_calc", expr("Rpt2/Rpt3")))
You can catch the error with a Try monad and return the original dataframe if withColumn fails.
import scala.util.Try
import org.apache.spark.sql.functions.expr

val df2 = Try(df1.withColumn("new_calc", expr("Rpt2/Rpt3")))
  .getOrElse(df1)
  .alias("d1")
You can also define it as a method if you want to reuse it:
import org.apache.spark.sql.{Column, DataFrame}

def withColumnIfExist(df: DataFrame, colName: String, col: Column): DataFrame =
  Try(df.withColumn(colName, col)).getOrElse(df)

val df3 = withColumnIfExist(df1, "new_calc", expr("Rpt2/Rpt3"))
  .alias("d1")
And if you need to chain multiple transformations you can use it with transform:
val df4 = df1.alias("d1")
  .transform(withColumnIfExist(_, "new_calc", expr("Rpt2/Rpt3")))
  .transform(withColumnIfExist(_, "new_calc_2", expr("Rpt1/Rpt2")))
Or you can implement it as an extension method with an implicit class:
implicit class RichDataFrame(df: DataFrame) {
  def withColumnIfExist(colName: String, col: Column): DataFrame =
    Try(df.withColumn(colName, col)).getOrElse(df)
}
val df5 = df1.alias("d1")
  .withColumnIfExist("new_calc", expr("Rpt2/Rpt3"))
  .withColumnIfExist("new_calc_2", expr("Rpt1/Rpt2"))
Since withColumn works on any Dataset, you can also make withColumnIfExist generic so it works for all Datasets, including DataFrames:
import org.apache.spark.sql.Dataset

implicit class RichDataset[A](ds: Dataset[A]) {
  def withColumnIfExist(colName: String, col: Column): DataFrame =
    Try(ds.withColumn(colName, col)).getOrElse(ds.toDF)
}

How to correctly handle Option in Spark/Scala?

I have a method, createDataFrame, which returns an Option[DataFrame]. I then want to 'get' the DataFrame and use it in later code. I'm getting a type mismatch that I can't fix:
val df2: DataFrame = createDataFrame("filename.txt") match {
  case Some(df) => { // proceed with pipeline
    df.filter($"activityLabel" > 0)
  }
  case None => println("could not create dataframe")
}

val Array(trainData, testData) = df2.randomSplit(Array(0.5, 0.5), seed = 12345)
I need df2 to be of type DataFrame, otherwise the later code won't recognise df2 as a DataFrame, e.g. val Array(trainData, testData) = df2.randomSplit(Array(0.5, 0.5), seed = 12345).
However, the case None branch is not of type DataFrame; it returns Unit, so this won't compile. But if I don't declare the type of df2, the later code won't compile either, as df2 is not recognised as a DataFrame. If someone can suggest a fix that would be helpful - I've been going round in circles with this for some time. Thanks.
What you need is a map. If you map over an Option[T], you are doing something like: "if it's None, do nothing; otherwise transform the content of the Option into something else". In your case that content is the DataFrame itself. So inside that myDFOpt.map() call you can put all your DataFrame transformations, and only at the very end do the pattern matching you did, where you may print something if you have a None.
edit:
val df2 = createDataFrame("filename.txt").map { df =>
  val filteredDF = df.filter($"activityLabel" > 0)
  val Array(trainData, testData) = filteredDF.randomSplit(Array(0.5, 0.5), seed = 12345)
  (trainData, testData) // df2 is an Option[(DataFrame, DataFrame)]
}