How to make a mutated copy of PySpark Row object? - pyspark

from pyspark.sql import Row
A a Row object is immutable. It can be converted to Python dictionary then mutated then back to a Row object. Is there a way to make a mutable or mutated copy without this conversion to dictionary and back to row?
This is need in a function run in mapPartitions.

Both row.asDict() and **dict do not preserve the ordering of your fields. Note that in python 3.6+ this might change. see PEP 468
Similar to what #hahmed said. This dynamically creates a mutated row BUT with the same schema as the row passed in.
from pyspark.sql import Row
from collections import OrderedDict
def copy(row, **kwargs):
d = OrderedDict(zip(row.__fields__, row)) #note this is not recursive
for key, value in kwargs.iteritems():
d[key]=value
MyRow = Row(row.__fields__)
return MyRow(*d.values())
This is useful if you need to transform your dataframe as a RDD and then make it a DF again
eg.
df_schema = df.schema
rdd = df_schema.rdd.map(lambda row: copy(row, field=newvalue))
new_df = spark.createDataFrame(rdd, df_schema)

Here is the dynamic solution for making a mutated copy I came up with:
from pyspark.sql import Row
def copy(row, **kwargs):
dict = {}
for attr in list(row.__fields__):
dict[attr] = row[attr]
for key, value in kwargs.items():
dict[key] = value
return Row(**dict)
row = Row(name="foo", age=45)
print(row) #Row(age=45, name='foo')
new_row = copy(row, name="bar")
print(new_row) #Row(age=45, name='bar')

Depending on your actual use case, one possibility would be simply create a new Row object from the existing one.
from pyspark.sql import Row
R = Row('a', 'b', 'c')
r = R(1,2,3)
Let's say we want to change a to 3 for r, make a new Row object out of r:
R(3, r.b, r.c)
# Row(a=3, b=2, c=3)
While r still is:
r
# Row(a=1, b=2, c=3)

Related

How to union multiple dynamic inputs in Palantir Foundry?

I want to Union multiple datasets in Palantir Foundry, the name of the datasets are dynamic so I would not be able to give the dataset names in transform_df() statically. Is there a way I can dynamically take multiple inputs into transform_df and union all of those dataframes?
I tried looping over the datasets like:
li = ['dataset1_path', 'dataset2_path']
union_df = None
for p in li:
#transforms_df(
my_input = Input(p),
Output(p+"_output")
)
def my_compute_function(my_input):
return my_input
if union_df is None:
union_df = my_compute_function
else:
union_df = union_df.union(my_compute_function)
But, this doesn't generate the unioned output.
This should be able to work for you with some changes, this is an example of dynamic dataset with json files, your situation would maybe be only a little different. Here is a generalized way you could be doing dynamic json input datasets that should be adaptable to any type of dynamic input file type or internal to foundry dataset that you can specify. This generic example is working on a set of json files uploaded to a dataset node in the platform. This should be fully dynamic. Doing a union after this should be a simple matter.
There's some bonus logging going on here as well.
Hope this helps
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging
def transform_generator():
transforms = []
transf_dict = {## enter your dynamic mappings here ##}
for value in transf_dict:
#transform(
out=Output(' path to your output here '.format(val=value)),
inpt=Input(" path to input here ".format(val=value)),
)
def update_set(ctx, inpt, out):
spark = ctx.spark_session
sc = spark.sparkContext
filesystem = list(inpt.filesystem().ls())
file_dates = []
for files in filesystem:
with inpt.filesystem().open(files.path) as fi:
data = json.load(fi)
file_dates.append(data)
logging.info('info logs:')
logging.info(file_dates)
json_object = json.dumps(file_dates)
df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
df_2 = df_2.withColumn('upload_date', F.current_date())
df_2.drop_duplicates()
out.write_dataframe(df_2)
transforms.append(update_logs)
return transforms
TRANSFORMS = transform_generator()
So this question breaks down in two questions.
How to handle transforms with programatic input paths
To handle transforms with programatic inputs, it is important to remember two things:
1st - Transforms will determine your inputs and outputs at CI time. Which means that you can have python code that generates transforms, but you cannot read paths from a dataset, they need to be hardcoded into your python code that generates the transform.
2nd - Your transforms will be created once, during the CI execution. Meaning that you can't have an increment or special logic to generate different paths whenever the dataset builds.
With these two premises, like in your example or #jeremy-david-gamet 's (ty for the reply, gave you a +1) you can have python code that generates your paths at CI time.
dataset_paths = ['dataset1_path', 'dataset2_path']
for path in dataset_paths:
#transforms_df(
my_input = Input(path),
Output(f"{path}_output")
)
def my_compute_function(my_input):
return my_input
However to union them you'll need a second transform to execute the union, you'll need to pass multiple inputs, so you can use *args or **kwargs for this:
dataset_paths = ['dataset1_path', 'dataset2_path']
all_args = [Input(path) for path in dataset_paths]
all_args.append(Output("path/to/unioned_dataset"))
#transforms_df(*all_args)
def my_compute_function(*args):
input_dfs = []
for arg in args:
# there are other arguments like ctx in the args list, so we need to check for type. You can also use kwargs for more determinism.
if isinstance(arg, pyspark.sql.DataFrame):
input_dfs.append(arg)
# now that you have your dfs in a list you can union them
# Note I didn't test this code, but it should be something like this
...
How to union datasets with different schemas.
For this part there are plenty of Q&A out there on how to union different dataframes in spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row
def customUnion(df1, df2):
cols1 = df1.columns
cols2 = df2.columns
total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
def expr(mycols, allcols):
def processCols(colname):
if colname in mycols:
return colname
else:
return lit(None).alias(colname)
cols = map(processCols, allcols)
return list(cols)
appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
return appended
Since inputs and outputs are determined at CI time, we cannot form true dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the paths of datasets share the same root, the following seems to require minimum maintenance:
from transforms.api import transform_df, Input, Output
from functools import reduce
datasets = [
'dataset1',
'dataset2',
'dataset3',
]
inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}
kwargs = {
**{'output': Output('output/folder/path/unioned_dataset')},
**inputs
}
#transform_df(**kwargs)
def my_compute_function(**inputs):
unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
return unioned_df
Regarding unions of different schemas, since Spark 3.1 one can use this:
df1.unionByName(df2, allowMissingColumns=True)

How to explode a struct column with a prefix?

My goal is to explode (ie, take them from inside the struct and expose them as the remaining columns of the dataset) a Spark struct column (already done) but changing the inner field names by prepending an arbitrary string. One of the motivations is that my struct can contain columns that have the same name as columns outside of it - therefore, I need a way to differentiate them easily. Of course, I do not know beforehand what are the columns inside my struct.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job alright - I use this writing:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what are its inner columns. I browsed the documentation and could not find any method on the Column class that does that. Then, I changed my approach to taking the schema of the DataFrame, then filtering the result by the name of the column, and extracting the column found from the resulting array. The problem is that this element I find has the type StructField - which, again, presents no option to extract its inner field - whereas what I would really like is to get handled a StructType element - which has the .getFields method, that does exactly what I want (that is, showing me the name of the inner columns, so I can iterate over them and use them on my select, prepending the prefix I want to them). I know no way to convert a StructField to a StructType.
My last attempt would be to parse the output of StructField.toString - which contains all the names and types of the inner columns, although that feels really dirty, and I'd rather avoid that lowly approach.
Any elegant solution to this problem?
Well, after reading my own question again, I figured out an elegant solution to the problem - I just needed to select all the columns the way I was doing, and then compare it back to the original dataframe in order to figure out what were the new columns. Here is the final result - I also made this so that the exploded columns would show up in the same place as the original struct one, so not to break the flow of information:
implicit class Implicit(df: DataFrame) {
def explodeStruct(column: String) = {
val prefix = column + "_"
val originalPosition = df.columns.indexOf(column)
val dfWithAllColumns = df.select("*", column + ".*")
val explodedColumns = dfWithAllColumns.columns diff df.columns
val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
df.select(finalColumnsList: _*)
}
}
Of course, you can customize the prefix, the separator, and etc - but that is simple, anyone could tweak the parameters and such. The usage remains the same.
In case anyone is interested, here is something similar for PySpark:
def explode_struct(df: DataFrame, column: str) -> DataFrame:
original_position = df.columns.index(column)
original_columns = df.columns
new_columns = df.select(column + ".*").columns
exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
col_list = [F.col(c) for c in df.columns]
col_list.pop(original_position)
col_list[original_position:original_position] = exploded_columns
return df.select(col_list)

UDF function to check whether my input dataframe has duplicate columns or not using pyspark

I need to return boolean false if my input dataframe has duplicate columns with the same name. I wrote the below code. It identifies the duplicate columns from the input dataframe and returns the duplicated columns as a list. But when i call this function it must return boolean value i.e., if my input dataframe has duplicate columns with the same name it must return flase.
#udf('string')
def get_duplicates_cols(df, df_cols):
duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))
for i in duplicate_col_index:
df_cols[i] = df_cols[i] + '_duplicated'
df2 = df.toDF(*df_cols)
cols_to_remove = [c for c in df_cols if '_duplicated' in c]
return cols_to_remove
duplicate_cols = udf(get_duplicates_cols,BooleanType())
You don't need any UDF, you simple need a Python function. The check will be in Python not in JVM. So, as #Santiago P said you can use checkDuplicate ONLY
def checkDuplicate(df):
return len(set(df.columns)) == len(df.columns)
Assuming that you pass the data frame to the function.
udf(returnType=BooleanType())
def checkDuplicate(df):
return len(set(df.columns)) == len(df.columns)

DataFrame' object has no attribute 'add_suffix'

from pyspark.sql.functions import *
z= k.withColumn('date', when( k.date > 29, 1).otherwise(0)).collect()
i want to add suffix to the dataframe
z1 = k.add_suffix(19)
getting Error as
AttributeError: 'DataFrame' object has no attribute 'add_suffix'
Thanks
If you would like to add a suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_suffix(sdf, suffix):
for c in sdf.columns:
sdf = sdf.withColumnRenamed(c, '{}{}'.format(c, suffix))
return sdf
You can amend sdf.columns as you see fit.
try with .withColumnRenamed function instead of add_suffix.
z1 = k.withColumnRenamed('date', 'date_19')
(or)
You can create a lambda function that can add suffix to all columnnames in data frame.
reference:-
How to add suffix and prefix to all columns in python/pyspark dataframe

Spark - Create a DataFrame from a list of Rows generated in a loop

I have a loop which generates rows in each iteration. My goal is to create a dataframe, with a given schema, that contents just those rows. I have in mind a set of steps to follow, but I am not able to add a new Row to a List[Row] in each loop iteration
I am trying the following approach:
var listOfRows = List[Row]()
val dfToExtractValues: DataFrame = ???
dfToExtractValues.foreach { x =>
//Not really important how to generate here the variables
//So to simplify all the rows will have the same values
var col1 = "firstCol"
var col2 = "secondCol"
var col3 = "thirdCol"
val newRow = RowFactory.create(col1,col2,col3)
//This step I am not able to do
//listOfRows += newRow -> Just for strings
//listOfRows.add(newRow) -> This add doesnt exist, it is a addString
//listOfRows.aggregate(1)(newRow) -> This is not how aggreage works...
}
val rdd = sc.makeRDD[RDD](listOfRows)
val dfWithNewRows = sqlContext.createDataFrame(rdd, myOriginalDF.schema)
Can someone tell me what am I doing wrong, or what could I change in my approach to generate a dataframe from the rows I'm generating?
Maybe there is a better way to collect the Rows instead of List[Row]. But then I need to convert that other type of collection into a dataframe.
Can someone tell me what am I doing wrong
Closures:
First of all it looks like you skipped over Understanding Closures in the Programming Guide. Any attempt to modify variables passed with closure is futile. All you can do is modify a copy and changes won't be reflected globally.
Variable doesn't make object mutable:
Following
var listOfRows = List[Row]()
creates a variable. Assigned List is as immutable as it was. If it wasn't in the Spark context you could create a new List and reassign:
listOfRows = newRow :: listOfRows
Note that we perpend not append - you don't want to append to the list in a loop.
Variables with immutable objects are useful, when you want to share data (it is common pattern in Akka for example), but don't have many applications in Spark.
Keep things distributed:
Finally never fetch data to the driver just to distribute it again. You should also avoid unnecessary conversions between RDDs and DataFrames. It is best to use DataFrame operators all the way:
dfToExtractValues.select(...)
but if you need something more complex map:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
dfToExtractValues.map(x => ...)(RowEncoder(schema))