Basic Pyspark Question - If Else Equivalent - select

Hi, very basic question, but I am new to PySpark. I want my function to return different columns based on an input argument, but I can't figure out how to do this. The Python equivalent would be:
if model == 'a':
    return df[[colA, colB]]
elif model == 'b':
    return df[[colA, colB, colC]]
Thanks in advance

The PySpark equivalent would be to use select to fetch the required columns:
if model == 'a':
    return df.select(*[colA, colB])
elif model == 'b':
    return df.select(*[colA, colB, colC])
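A minimal sketch wrapping this in a function (the column names colA, colB, colC are placeholders from the question; a real call would pass actual column name strings):
from pyspark.sql import DataFrame

def select_by_model(df: DataFrame, model: str) -> DataFrame:
    # return a narrower or wider projection depending on the model argument
    if model == 'a':
        return df.select('colA', 'colB')
    elif model == 'b':
        return df.select('colA', 'colB', 'colC')
    raise ValueError(f"unknown model: {model}")

# usage: df_a = select_by_model(df, 'a')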

Related

How to union multiple dynamic inputs in Palantir Foundry?

I want to union multiple datasets in Palantir Foundry. The names of the datasets are dynamic, so I would not be able to give the dataset names to transform_df() statically. Is there a way I can dynamically take multiple inputs into transform_df and union all of those dataframes?
I tried looping over the datasets like this:
li = ['dataset1_path', 'dataset2_path']
union_df = None
for p in li:
    @transforms_df(
        my_input=Input(p),
        Output(p + "_output")
    )
    def my_compute_function(my_input):
        return my_input

    if union_df is None:
        union_df = my_compute_function
    else:
        union_df = union_df.union(my_compute_function)
But this doesn't generate the unioned output.
This should work for you with some changes. It is a generalized example of dynamic JSON input datasets, and it should be adaptable to any dynamic input file type or internal Foundry dataset that you can specify. This generic example works on a set of JSON files uploaded to a dataset node in the platform, and it is fully dynamic. Doing a union after this should be a simple matter.
There is some bonus logging going on here as well.
Hope this helps.
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging


def transform_generator():
    transforms = []
    transf_dict = {}  # enter your dynamic mappings here

    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext

            # read every JSON file in the input dataset's filesystem
            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                file_dates.append(data)

            logging.info('info logs:')
            logging.info(file_dates)

            # turn the collected JSON records into a DataFrame
            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())
            df_2 = df_2.drop_duplicates()
            out.write_dataframe(df_2)

        transforms.append(update_set)

    return transforms


TRANSFORMS = transform_generator()
So this question breaks down into two questions.
How to handle transforms with programmatic input paths
To handle transforms with programmatic inputs, it is important to remember two things:
1st - Transforms will determine your inputs and outputs at CI time. This means that you can have Python code that generates transforms, but you cannot read paths from a dataset; they need to be hardcoded into the Python code that generates the transform.
2nd - Your transforms will be created once, during the CI execution. This means that you can't have incremental logic or special logic that generates different paths whenever the dataset builds.
With these two premises, like in your example or @jeremy-david-gamet's (thanks for the reply, gave you a +1), you can have Python code that generates your paths at CI time.
dataset_paths = ['dataset1_path', 'dataset2_path']

for path in dataset_paths:
    @transforms_df(
        my_input=Input(path),
        Output(f"{path}_output")
    )
    def my_compute_function(my_input):
        return my_input
However, to union them you'll need a second transform that executes the union. You'll need to pass it multiple inputs, so you can use *args or **kwargs for this:
import pyspark.sql

dataset_paths = ['dataset1_path', 'dataset2_path']
all_args = [Input(path) for path in dataset_paths]
all_args.append(Output("path/to/unioned_dataset"))

@transforms_df(*all_args)
def my_compute_function(*args):
    input_dfs = []
    for arg in args:
        # there are other arguments like ctx in the args list, so we need to check the type.
        # You can also use kwargs for more determinism.
        if isinstance(arg, pyspark.sql.DataFrame):
            input_dfs.append(arg)

    # now that you have your dfs in a list you can union them
    # Note I didn't test this code, but it should be something like this
    ...
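A sketch of how that elided union step might look, assuming all the input datasets share the same schema (the helper name union_all is hypothetical; input_dfs is the list built above):
from functools import reduce

def union_all(input_dfs):
    # fold the list of DataFrames into one unioned DataFrame
    return reduce(lambda left, right: left.unionByName(right), input_dfs)

# inside my_compute_function you would then: return union_all(input_dfs)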
How to union datasets with different schemas.
For this part there are plenty of Q&As out there on how to union dataframes with different schemas in Spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row


def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))

    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)

    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
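A hypothetical usage of customUnion (the column names and the spark session are assumptions, not from the original answer):
# df_a has columns (id, name); df_b has columns (id, city)
df_a = spark.createDataFrame([(1, "alice")], ["id", "name"])
df_b = spark.createDataFrame([(2, "berlin")], ["id", "city"])

# columns missing on either side are filled with nulls before the union
result = customUnion(df_a, df_b)  # resulting columns: city, id, name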
Since inputs and outputs are determined at CI time, we cannot form truly dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the paths of the datasets share the same root, the following seems to require minimal maintenance:
from transforms.api import transform_df, Input, Output
from functools import reduce

datasets = [
    'dataset1',
    'dataset2',
    'dataset3',
]

inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}

kwargs = {
    **{'output': Output('output/folder/path/unioned_dataset')},
    **inputs
}


@transform_df(**kwargs)
def my_compute_function(**inputs):
    unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
    return unioned_df
Regarding unions of different schemas, since Spark 3.1 one can use this:
df1.unionByName(df2, allowMissingColumns=True)
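A small illustrative example of what that produces (the data and the spark session are hypothetical; requires Spark 3.1+):
df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
df2 = spark.createDataFrame([(2, "x")], ["id", "city"])

df1.unionByName(df2, allowMissingColumns=True).show()
# +---+----+----+
# | id|name|city|
# +---+----+----+
# |  1|   a|null|
# |  2|null|   x|
# +---+----+----+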

UDF function to check whether my input dataframe has duplicate columns or not using pyspark

I need to return boolean False if my input dataframe has duplicate columns with the same name. I wrote the code below. It identifies the duplicate columns in the input dataframe and returns the duplicated columns as a list, but when I call this function it must return a boolean value, i.e., if my input dataframe has duplicate columns with the same name it must return False.
@udf('string')
def get_duplicates_cols(df, df_cols):
    duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))
    for i in duplicate_col_index:
        df_cols[i] = df_cols[i] + '_duplicated'
    df2 = df.toDF(*df_cols)
    cols_to_remove = [c for c in df_cols if '_duplicated' in c]
    return cols_to_remove

duplicate_cols = udf(get_duplicates_cols, BooleanType())
You don't need any UDF; you simply need a Python function. The check will happen in Python, not in the JVM. So, as @Santiago P said, you can use checkDuplicate only:
def checkDuplicate(df):
    return len(set(df.columns)) == len(df.columns)
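A hypothetical quick check (assuming an active spark session):
df = spark.createDataFrame([(1, 2)], ["a", "a"])  # two columns both named "a"
print(checkDuplicate(df))  # False, because the column names are not unique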
Assuming that you pass the data frame to the function:
@udf(returnType=BooleanType())
def checkDuplicate(df):
    return len(set(df.columns)) == len(df.columns)

pyspark udf return values

I created a UDF that returns a list of lists (the built-in list object). I saved the returned values to a new column, but found that it was converted to a string. I need it as a list of lists in order to apply posexplode. What is the correct way to do it?
def conc(hashes, band_width):
    ...
    ...
    return combined_chunks  # its type: list[list[float]]

concat = udf(conc)

# bands column becomes a string
mh2 = mh1.withColumn("bands", concat(col('hash'), lit(bandwidth)))
I solved it:
concat = udf(conc, ArrayType(VectorUDT()))
And in conc: return a list of dense vectors using Vectors.dense.
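A minimal end-to-end sketch of that solution (the chunking logic inside conc is illustrative only; mh1, hash and bandwidth come from the question):
from pyspark.sql.functions import udf, col, lit, posexplode
from pyspark.sql.types import ArrayType
from pyspark.ml.linalg import Vectors, VectorUDT

def conc(hashes, band_width):
    # split the hash list into bands of band_width values each (illustrative)
    chunks = [hashes[i:i + band_width] for i in range(0, len(hashes), band_width)]
    # wrap each chunk in a dense vector so Spark keeps a real array column instead of a string
    return [Vectors.dense(c) for c in chunks]

concat = udf(conc, ArrayType(VectorUDT()))
mh2 = mh1.withColumn("bands", concat(col('hash'), lit(bandwidth)))
# posexplode now works because "bands" is an ArrayType column
mh3 = mh2.select("*", posexplode("bands").alias("pos", "band"))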

Return value in a matrix

I'm trying to make a function that lets you choose rows and columns, returns that value, and prints a graph. I'm new at MATLAB, but here's what I have been writing.
function [sorted] = createMatrix()
    rows = input('rows?');
    columns = input('columns?');
    unsorted = randi(100, rows, columns);
    sorted = sort(unsorted);
This is the first function, which creates and sorts the matrix. It works just fine, though I think it is not returning any value as output: the workspace has one line named "ans" with my matrix, though not the name I wanted it to have.
I don't have any problem with the second function that shows the 3D graph!
So the big problem I think I have is the matrix output!
Thank you!
The function is written correctly.
function [sorted] = createMatrix()
    rows = input('rows?');
    columns = input('columns?');
    unsorted = randi(100, rows, columns);
    sorted = sort(unsorted);
I think you are calling the function as createMatrix(), which is why the matrix is stored in ans.
To solve this:
theNameYouWant = createMatrix();

How to add a string as data to a dataset?

I use the following code to create a simple dataset and add the first two rows:
data = dataset([1; 2],[3; 4],'VarNames', {'A', 'B'})
After that, I would like to change the value 4 to 'test':
data(1,2) = 'test'
Since this throws the following exception:
Error using dataset/subsasgnParens (line 198)
Right hand side must be a dataset array.
Error in dataset/subsasgn (line 79)
a = subsasgnParens(a,s,b,creating);
I also tried:
data(1,2) = dataset('test');
But this is also not working. Hence my question: how can I add a string to a dataset like the one I have created, using the method I'm using (I have to specify the row and column)?
You can't do
data(1,2) = dataset('test');
because 'test' is char type while the rest of your data are doubles, and because the string 'test' is four elements which you're trying to put into one element of an array.
You need to use cell arrays. If you want to use the dataset function capabilities, see the cell2dataset and dataset2cell functions. For example:
data = dataset([1; 2],[3; 4],'VarNames',{'A', 'B'})
data2 = dataset2cell(data);
data2{3,1} = 'test';
data3 = cell2dataset(data2,'ReadVarNames',true);