DataFrame' object has no attribute 'add_suffix' - pyspark

from pyspark.sql.functions import *
z= k.withColumn('date', when( k.date > 29, 1).otherwise(0)).collect()
i want to add suffix to the dataframe
z1 = k.add_suffix(19)
getting Error as
AttributeError: 'DataFrame' object has no attribute 'add_suffix'
Thanks

If you would like to add a suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_suffix(sdf, suffix):
for c in sdf.columns:
sdf = sdf.withColumnRenamed(c, '{}{}'.format(c, suffix))
return sdf
You can amend sdf.columns as you see fit.

try with .withColumnRenamed function instead of add_suffix.
z1 = k.withColumnRenamed('date', 'date_19')
(or)
You can create a lambda function that can add suffix to all columnnames in data frame.
reference:-
How to add suffix and prefix to all columns in python/pyspark dataframe

Related

pyspark.sql.utils.ParseException error when filtering the df

I want to select all rows from a pyspark df except some rows where the array contains 1. It works with the code below in the notebook:
<pyspark df>.filter(~exists("<col name>", lambda x: x=="hello"))
But when I write it as this:
cond = '~exists("<col name>", lambda x: x=="hello")'
df = df.filter(con)
I got error as below:
pyspark.sql.utils.ParseException:
extraneous input 'x' expecting {')', ','}(line 1, pos 32)
I really can't spot any typo. Could someone give me a hint if I missed something?
Thanks, J
To pass in the conditions through variable, it needs to be written in the form of
expr str of spark sql. So it can be modified to:
cond = '!exists(col_name, x -> x == "hello")'

How to union multiple dynamic inputs in Palantir Foundry?

I want to Union multiple datasets in Palantir Foundry, the name of the datasets are dynamic so I would not be able to give the dataset names in transform_df() statically. Is there a way I can dynamically take multiple inputs into transform_df and union all of those dataframes?
I tried looping over the datasets like:
li = ['dataset1_path', 'dataset2_path']
union_df = None
for p in li:
#transforms_df(
my_input = Input(p),
Output(p+"_output")
)
def my_compute_function(my_input):
return my_input
if union_df is None:
union_df = my_compute_function
else:
union_df = union_df.union(my_compute_function)
But, this doesn't generate the unioned output.
This should be able to work for you with some changes, this is an example of dynamic dataset with json files, your situation would maybe be only a little different. Here is a generalized way you could be doing dynamic json input datasets that should be adaptable to any type of dynamic input file type or internal to foundry dataset that you can specify. This generic example is working on a set of json files uploaded to a dataset node in the platform. This should be fully dynamic. Doing a union after this should be a simple matter.
There's some bonus logging going on here as well.
Hope this helps
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging
def transform_generator():
transforms = []
transf_dict = {## enter your dynamic mappings here ##}
for value in transf_dict:
#transform(
out=Output(' path to your output here '.format(val=value)),
inpt=Input(" path to input here ".format(val=value)),
)
def update_set(ctx, inpt, out):
spark = ctx.spark_session
sc = spark.sparkContext
filesystem = list(inpt.filesystem().ls())
file_dates = []
for files in filesystem:
with inpt.filesystem().open(files.path) as fi:
data = json.load(fi)
file_dates.append(data)
logging.info('info logs:')
logging.info(file_dates)
json_object = json.dumps(file_dates)
df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
df_2 = df_2.withColumn('upload_date', F.current_date())
df_2.drop_duplicates()
out.write_dataframe(df_2)
transforms.append(update_logs)
return transforms
TRANSFORMS = transform_generator()
So this question breaks down in two questions.
How to handle transforms with programatic input paths
To handle transforms with programatic inputs, it is important to remember two things:
1st - Transforms will determine your inputs and outputs at CI time. Which means that you can have python code that generates transforms, but you cannot read paths from a dataset, they need to be hardcoded into your python code that generates the transform.
2nd - Your transforms will be created once, during the CI execution. Meaning that you can't have an increment or special logic to generate different paths whenever the dataset builds.
With these two premises, like in your example or #jeremy-david-gamet 's (ty for the reply, gave you a +1) you can have python code that generates your paths at CI time.
dataset_paths = ['dataset1_path', 'dataset2_path']
for path in dataset_paths:
#transforms_df(
my_input = Input(path),
Output(f"{path}_output")
)
def my_compute_function(my_input):
return my_input
However to union them you'll need a second transform to execute the union, you'll need to pass multiple inputs, so you can use *args or **kwargs for this:
dataset_paths = ['dataset1_path', 'dataset2_path']
all_args = [Input(path) for path in dataset_paths]
all_args.append(Output("path/to/unioned_dataset"))
#transforms_df(*all_args)
def my_compute_function(*args):
input_dfs = []
for arg in args:
# there are other arguments like ctx in the args list, so we need to check for type. You can also use kwargs for more determinism.
if isinstance(arg, pyspark.sql.DataFrame):
input_dfs.append(arg)
# now that you have your dfs in a list you can union them
# Note I didn't test this code, but it should be something like this
...
How to union datasets with different schemas.
For this part there are plenty of Q&A out there on how to union different dataframes in spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row
def customUnion(df1, df2):
cols1 = df1.columns
cols2 = df2.columns
total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
def expr(mycols, allcols):
def processCols(colname):
if colname in mycols:
return colname
else:
return lit(None).alias(colname)
cols = map(processCols, allcols)
return list(cols)
appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
return appended
Since inputs and outputs are determined at CI time, we cannot form true dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the paths of datasets share the same root, the following seems to require minimum maintenance:
from transforms.api import transform_df, Input, Output
from functools import reduce
datasets = [
'dataset1',
'dataset2',
'dataset3',
]
inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}
kwargs = {
**{'output': Output('output/folder/path/unioned_dataset')},
**inputs
}
#transform_df(**kwargs)
def my_compute_function(**inputs):
unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
return unioned_df
Regarding unions of different schemas, since Spark 3.1 one can use this:
df1.unionByName(df2, allowMissingColumns=True)

pyspark: How can I use multiple `lit(val)` in a computation where `val` is a variable in python?

D = # came from numpy.int64 via pandas
E = # came from numpy.int64 via pandas
import pyspark.sql.functions as F
output_df.withColumn("c", F.col("A") - F.log(F.lit(D) - F.lit(E)))
I tried to use multiple lit inside pyspark with column operation. But I keep getting errors like
*** AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
But these ones work
D=2
output_df.withColumn("c", F.lit(D))
output_df.withColumn("c", F.lit(2))
Try this
df.withColumn("c", F.col("A") - F.log(F.lit(int(D - E))))
D = int(D)
E = int(E)
Just add these two lines and it will work. The issue is that pyspark doesn't know how to handle numpy.int64

How to explode a struct column with a prefix?

My goal is to explode (ie, take them from inside the struct and expose them as the remaining columns of the dataset) a Spark struct column (already done) but changing the inner field names by prepending an arbitrary string. One of the motivations is that my struct can contain columns that have the same name as columns outside of it - therefore, I need a way to differentiate them easily. Of course, I do not know beforehand what are the columns inside my struct.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job alright - I use this writing:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what are its inner columns. I browsed the documentation and could not find any method on the Column class that does that. Then, I changed my approach to taking the schema of the DataFrame, then filtering the result by the name of the column, and extracting the column found from the resulting array. The problem is that this element I find has the type StructField - which, again, presents no option to extract its inner field - whereas what I would really like is to get handled a StructType element - which has the .getFields method, that does exactly what I want (that is, showing me the name of the inner columns, so I can iterate over them and use them on my select, prepending the prefix I want to them). I know no way to convert a StructField to a StructType.
My last attempt would be to parse the output of StructField.toString - which contains all the names and types of the inner columns, although that feels really dirty, and I'd rather avoid that lowly approach.
Any elegant solution to this problem?
Well, after reading my own question again, I figured out an elegant solution to the problem - I just needed to select all the columns the way I was doing, and then compare it back to the original dataframe in order to figure out what were the new columns. Here is the final result - I also made this so that the exploded columns would show up in the same place as the original struct one, so not to break the flow of information:
implicit class Implicit(df: DataFrame) {
def explodeStruct(column: String) = {
val prefix = column + "_"
val originalPosition = df.columns.indexOf(column)
val dfWithAllColumns = df.select("*", column + ".*")
val explodedColumns = dfWithAllColumns.columns diff df.columns
val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
df.select(finalColumnsList: _*)
}
}
Of course, you can customize the prefix, the separator, and etc - but that is simple, anyone could tweak the parameters and such. The usage remains the same.
In case anyone is interested, here is something similar for PySpark:
def explode_struct(df: DataFrame, column: str) -> DataFrame:
original_position = df.columns.index(column)
original_columns = df.columns
new_columns = df.select(column + ".*").columns
exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
col_list = [F.col(c) for c in df.columns]
col_list.pop(original_position)
col_list[original_position:original_position] = exploded_columns
return df.select(col_list)

How to make a mutated copy of PySpark Row object?

from pyspark.sql import Row
A a Row object is immutable. It can be converted to Python dictionary then mutated then back to a Row object. Is there a way to make a mutable or mutated copy without this conversion to dictionary and back to row?
This is need in a function run in mapPartitions.
Both row.asDict() and **dict do not preserve the ordering of your fields. Note that in python 3.6+ this might change. see PEP 468
Similar to what #hahmed said. This dynamically creates a mutated row BUT with the same schema as the row passed in.
from pyspark.sql import Row
from collections import OrderedDict
def copy(row, **kwargs):
d = OrderedDict(zip(row.__fields__, row)) #note this is not recursive
for key, value in kwargs.iteritems():
d[key]=value
MyRow = Row(row.__fields__)
return MyRow(*d.values())
This is useful if you need to transform your dataframe as a RDD and then make it a DF again
eg.
df_schema = df.schema
rdd = df_schema.rdd.map(lambda row: copy(row, field=newvalue))
new_df = spark.createDataFrame(rdd, df_schema)
Here is the dynamic solution for making a mutated copy I came up with:
from pyspark.sql import Row
def copy(row, **kwargs):
dict = {}
for attr in list(row.__fields__):
dict[attr] = row[attr]
for key, value in kwargs.items():
dict[key] = value
return Row(**dict)
row = Row(name="foo", age=45)
print(row) #Row(age=45, name='foo')
new_row = copy(row, name="bar")
print(new_row) #Row(age=45, name='bar')
Depending on your actual use case, one possibility would be simply create a new Row object from the existing one.
from pyspark.sql import Row
R = Row('a', 'b', 'c')
r = R(1,2,3)
Let's say we want to change a to 3 for r, make a new Row object out of r:
R(3, r.b, r.c)
# Row(a=3, b=2, c=3)
While r still is:
r
# Row(a=1, b=2, c=3)