Spark UDF using Annotations - pyspark

I am trying to understand how to register a UDF using the @udf annotation in Spark, but I am not getting the expected outcome, whereas it works if I use spark.udf.register.
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import *

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

def to_date_format_udf(d_str):
    l = [char for char in d_str]
    return "".join(l[0:2]) + "/" + "".join(l[2:4]) + " " + "".join(l[4:6]) + ":" + "".join(l[6:])

spark.udf.register("to_date_format_udf", to_date_format_udf, StringType())

str = "02190925"
print(to_date_format_udf(str))
with this code I am getting the expected outcome:
2/19 09:25
But when I try to follow the Databricks documentation on @udf,
I get the following outcome:
Column<b'to_date_format_udf(02190925)'>
Here is my modification based on the Databricks documentation:
@udf(returnType=StringType())
def to_date_format_udf(d_str):
    l = [char for char in d_str]
    return "".join(l[0:2]) + "/" + "".join(l[2:4]) + " " + "".join(l[4:6]) + ":" + "".join(l[6:])

print(to_date_format_udf("02190925"))

In the first case, you get the expected output because the input is applied directly to the plain Python function: the UDF machinery is not invoked at all, and the call is treated as a normal Python call.
However, the @udf annotation (a Python decorator) modifies the behavior of to_date_format_udf so that it returns a Column expression, which Spark only evaluates when an action is taken.
Invoking spark.sql('select to_date_format_udf("02190925")').show() would yield the same result in both cases.
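For instance, a minimal usage sketch of the decorated UDF against a DataFrame (the sample DataFrame and its column name raw_date are assumptions for illustration):

df = spark.createDataFrame([("02190925",), ("03151045",)], ["raw_date"])
# Calling the decorated function on a column (or column name) builds a Column expression;
# Spark evaluates it when an action such as show() runs.
df.select(to_date_format_udf("raw_date").alias("formatted")).show()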


pyspark.sql.utils.ParseException error when filtering the df

I want to select all rows from a PySpark df except the rows where an array column contains a certain value. It works with the code below in the notebook:
<pyspark df>.filter(~exists("<col name>", lambda x: x=="hello"))
But when I write it like this:
cond = '~exists("<col name>", lambda x: x=="hello")'
df = df.filter(cond)
I get the error below:
pyspark.sql.utils.ParseException:
extraneous input 'x' expecting {')', ','}(line 1, pos 32)
I really can't spot any typo. Could someone give me a hint if I missed something?
Thanks, J
To pass the condition in through a variable, it needs to be written as a Spark SQL expression string rather than a Python lambda. So it can be modified to:
cond = '!exists(col_name, x -> x == "hello")'
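For instance, a minimal sketch (assuming a DataFrame df with an array column named col_name and an existing SparkSession named spark):

df = spark.createDataFrame([(1, ["hello", "x"]), (2, ["y"])], ["id", "col_name"])

cond = '!exists(col_name, x -> x == "hello")'
df.filter(cond).show()  # keeps only the rows whose array does not contain "hello"

# Equivalent Python-side form (exists is in pyspark.sql.functions since Spark 3.1):
# from pyspark.sql.functions import exists
# df.filter(~exists("col_name", lambda x: x == "hello")).show()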

Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions

from pyspark.sql.window import Window
import mpu
from pyspark.sql.functions import udf, lag, col, asc
from math import sin, cos, sqrt, atan2

windowSpec = Window.partitionBy("UserID").orderBy(asc("Timestamp"))
df14 = df.withColumn("newLatitude", lag("Latitude", 1).over(windowSpec)) \
    .withColumn("newLongitude", lag("Longitude", 1).over(windowSpec)) \
    .drop('AllZero', " Date", "Time", "Altitude")
df15 = df14.orderBy(col("UserID").asc(), col("Timestamp").asc())
df16 = df15.na.drop()

from geopy.distance import geodesic
origin = (30.172705, 31.526725)  # (latitude, longitude), don't confuse the order
dist = (30.288281, 31.732326)
print(geodesic(origin, dist).meters)

df17 = df16.withColumn("distance", geodesic((col("Latitude"), col("Longitude")), (col("newLatitude"), col("newLongitude"))).meters)
df17.show()
I use the lag function to put the previous Latitude and Longitude next to the original ones, but when I try to calculate the distance between these two sets of coordinates, it goes wrong like this:
/usr/local/spark/python/pyspark/sql/column.py in __nonzero__(self)
    688
    689     def __nonzero__(self):
--> 690         raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    691                          "'~' for 'not' when building DataFrame boolean expressions.")
    692     __bool__ = __nonzero__

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
I really don't understand what was going on.
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType
from geopy.distance import geodesic

def dist_col(a, b, c, d):
    col_dist = geodesic((a, b), (c, d)).meters
    return col_dist

# the UDF's return type is declared as FloatType
new_f = F.udf(dist_col, FloatType())

df17 = df16.withColumn('dist', new_f(col("Latitude"), col("Longitude"), col("newLatitude"), col("newLongitude")))
I created a function to do the calculation outside of withColumn, and used a UDF to define the parameter and return types, since geodesic cannot operate on Column objects directly.
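One caveat worth noting (this None check is an addition, not part of the answer above): lag() leaves nulls in newLatitude/newLongitude for the first row of each partition, and geodesic() fails on None inputs, so a guard inside the UDF avoids that. A minimal sketch:

from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
from geopy.distance import geodesic

def dist_col_safe(a, b, c, d):
    # the first row per UserID has no previous point, so lag() produces nulls
    if a is None or b is None or c is None or d is None:
        return None
    return float(geodesic((a, b), (c, d)).meters)

dist_udf = F.udf(dist_col_safe, FloatType())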

How to union multiple dynamic inputs in Palantir Foundry?

I want to union multiple datasets in Palantir Foundry. The names of the datasets are dynamic, so I would not be able to give the dataset names in transform_df() statically. Is there a way I can dynamically take multiple inputs into transform_df and union all of those dataframes?
I tried looping over the datasets like this:
li = ['dataset1_path', 'dataset2_path']

union_df = None
for p in li:
    @transforms_df(
        Output(p + "_output"),
        my_input=Input(p),
    )
    def my_compute_function(my_input):
        return my_input

    if union_df is None:
        union_df = my_compute_function
    else:
        union_df = union_df.union(my_compute_function)
But this doesn't generate the unioned output.
This should work for you with some changes. This is an example of a dynamic dataset built from JSON files, so your situation may be only a little different, but the approach should be adaptable to any dynamic input file type or internal Foundry dataset you can specify. The generic example below works on a set of JSON files uploaded to a dataset node in the platform and is fully dynamic. Doing a union after this should be a simple matter.
There is some bonus logging going on here as well.
Hope this helps.
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging


def transform_generator():
    transforms = []
    transf_dict = {## enter your dynamic mappings here ##}

    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext

            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                file_dates.append(data)

            logging.info('info logs:')
            logging.info(file_dates)

            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())
            df_2 = df_2.drop_duplicates()
            out.write_dataframe(df_2)

        transforms.append(update_set)

    return transforms


TRANSFORMS = transform_generator()
So this question breaks down into two questions.
How to handle transforms with programmatic input paths
To handle transforms with programmatic inputs, it is important to remember two things:
1st - Transforms determine your inputs and outputs at CI time. This means that you can have Python code that generates transforms, but you cannot read paths from a dataset; they need to be hardcoded into the Python code that generates the transform.
2nd - Your transforms will be created once, during the CI execution. This means that you can't have an increment or special logic that generates different paths whenever the dataset builds.
With these two premises, as in your example or @jeremy-david-gamet's (thanks for the reply, gave you a +1), you can have Python code that generates your paths at CI time.
dataset_paths = ['dataset1_path', 'dataset2_path']

for path in dataset_paths:
    @transforms_df(
        Output(f"{path}_output"),
        my_input=Input(path),
    )
    def my_compute_function(my_input):
        return my_input
However, to union them you'll need a second transform that executes the union. It will need to take multiple inputs, so you can use *args or **kwargs for this:
dataset_paths = ['dataset1_path', 'dataset2_path']

all_args = [Input(path) for path in dataset_paths]
all_args.append(Output("path/to/unioned_dataset"))

@transforms_df(*all_args)
def my_compute_function(*args):
    input_dfs = []
    for arg in args:
        # there are other arguments like ctx in the args list, so we need to check for type.
        # You can also use kwargs for more determinism.
        if isinstance(arg, pyspark.sql.DataFrame):
            input_dfs.append(arg)

    # now that you have your dfs in a list you can union them
    # Note I didn't test this code, but it should be something like this
    ...
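For the elided union step, a minimal sketch (assuming all the collected DataFrames share the same schema) could replace the ellipsis inside my_compute_function:

    from functools import reduce

    # fold the collected DataFrames into a single unioned DataFrame
    unioned = reduce(lambda left, right: left.unionByName(right), input_dfs)
    return unioned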
How to union datasets with different schemas.
For this part there are plenty of Q&As out there on how to union DataFrames with different schemas in Spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row


def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))

    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)

    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
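A quick illustrative usage of customUnion (the toy DataFrames below are assumptions, just to show that columns missing on either side come back as nulls):

spark = SparkSession.builder.getOrCreate()
df_a = spark.createDataFrame([(1, "x")], ["id", "only_in_a"])
df_b = spark.createDataFrame([(2, "y")], ["id", "only_in_b"])
customUnion(df_a, df_b).show()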
Since inputs and outputs are determined at CI time, we cannot form truly dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the paths of the datasets share the same root, the following seems to require minimal maintenance:
from transforms.api import transform_df, Input, Output
from functools import reduce

datasets = [
    'dataset1',
    'dataset2',
    'dataset3',
]

inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}
kwargs = {
    **{'output': Output('output/folder/path/unioned_dataset')},
    **inputs
}


@transform_df(**kwargs)
def my_compute_function(**inputs):
    unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
    return unioned_df
Regarding unions of different schemas, since Spark 3.1 one can use this:
df1.unionByName(df2, allowMissingColumns=True)
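For example, a small sketch (the toy DataFrames are only for illustration):

df1 = spark.createDataFrame([(1, "a")], ["id", "only_in_df1"])
df2 = spark.createDataFrame([(2, "b")], ["id", "only_in_df2"])
# columns missing on either side are filled with nulls
df1.unionByName(df2, allowMissingColumns=True).show()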

How to explode a struct column with a prefix?

My goal is to explode a Spark struct column (i.e., take the fields from inside the struct and expose them alongside the remaining columns of the dataset), which I have already done, but changing the inner field names by prepending an arbitrary string. One of the motivations is that my struct can contain columns that have the same name as columns outside of it, so I need a way to differentiate them easily. Of course, I do not know beforehand what the columns inside my struct are.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job alright - I use this writing:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what its inner columns are. I browsed the documentation and could not find any method on the Column class that does that. I then changed my approach: take the schema of the DataFrame, filter the result by the name of the column, and extract the matching column from the resulting array. The problem is that the element I find has the type StructField, which again presents no option to extract its inner fields, whereas what I would really like is to get hold of a StructType element, which has the getFields method that does exactly what I want (that is, it shows me the names of the inner columns, so I can iterate over them, use them in my select, and prepend the prefix I want). I know of no way to convert a StructField to a StructType.
My last attempt would be to parse the output of StructField.toString - which contains all the names and types of the inner columns, although that feels really dirty, and I'd rather avoid that lowly approach.
Any elegant solution to this problem?
Well, after reading my own question again, I figured out an elegant solution to the problem: I just needed to select all the columns the way I was doing, and then compare the result back to the original dataframe to figure out which columns were new. Here is the final result. I also made it so that the exploded columns show up in the same place as the original struct column, so as not to break the flow of information:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = {
    val prefix = column + "_"
    val originalPosition = df.columns.indexOf(column)
    val dfWithAllColumns = df.select("*", column + ".*")
    val explodedColumns = dfWithAllColumns.columns diff df.columns
    val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
    val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
    df.select(finalColumnsList: _*)
  }
}
Of course, you can customize the prefix, the separator, and so on - but that is simple; anyone could tweak the parameters. The usage remains the same.
In case anyone is interested, here is something similar for PySpark:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def explode_struct(df: DataFrame, column: str) -> DataFrame:
    original_position = df.columns.index(column)
    new_columns = df.select(column + ".*").columns
    exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
    col_list = [F.col(c) for c in df.columns]
    col_list.pop(original_position)
    col_list[original_position:original_position] = exploded_columns
    return df.select(col_list)
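A hypothetical usage sketch (the struct column name "info" and the sample schema are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ("Alice", 30))], "id INT, info STRUCT<name: STRING, age: INT>")
# the struct's fields come out as info_name and info_age, in the struct's original position
explode_struct(df, "info").show()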

I have an issue with regex extract with multiple matches

I am trying to extract 60 ML and 0.5 ML from the string "60 ML of paracetomol and 0.5 ML of XYZ". This string is part of a column X in a Spark dataframe. Though I am able to test my regex code to extract 60 ML and 0.5 ML in a regex validator, I am not able to extract them using regexp_extract, as it returns only the first match. Hence I am getting only 60 ML.
Can you suggest the best way of doing it using a UDF?
Here is how you can do it with a python UDF:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import re

data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
df = sc.parallelize(data).toDF('str:string')

# Define the function you want to return
def extract(s):
    all_matches = re.findall(r'\d+(?:\.\d+)? ML', s)
    return all_matches

# Create the UDF; note that you need to declare the return schema matching the returned type
extract_udf = udf(extract, ArrayType(StringType()))

# Apply it
df2 = df.withColumn('extracted', extract_udf('str'))
Python UDFs take a significant performance hit over native DataFrame operations. After thinking about it a little more, here is another way to do it without using a UDF. The general idea is to replace all the text that isn't what you want with commas, then split on comma to create your array of final values. If you only want the numbers, you can update the regexes to take 'ML' out of the capture group.
pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)
df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
df4 = df3.withColumn('a', split('a', r','))
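For the sample row, the final column should contain both matches; a quick check (display formatting is approximate):

df4.select('a').show(truncate=False)
# expected value for the sample string: [60 ML, 0.5 ML]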