Pass Column to Function where the column can be empty - pyspark

I am trying to pass a column name to a function, where the name passed in can sometimes be empty or blank.
For example:
from pyspark.sql.functions import col, lit, when

def test(df, segment):
    score_df = df \
        .withColumn('model_segment', when(lit(segment) == '', lit('')).otherwise(col(segment)))
    return score_df
This works:
test(df,'my_existing_column').show()
However, this errors:
test(df,'').show()
It fails with the message: cannot resolve '``' given input columns
I get why it is doing that, but how would I go about handling this kind of scenario?

You can get the list of DataFrame fields with df.columns, then check whether the incoming parameter exists in that list. If it does, run the show action; otherwise, raise a custom exception saying that the field does not exist.
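For example, a minimal sketch of that check (ValueError stands in for the custom exception; the empty-string fallback follows the question's intent):

from pyspark.sql.functions import col, lit

def test(df, segment):
    # Decide in Python which branch to take, so col(segment) is never built
    # for an empty or unknown column name.
    if segment == '':
        return df.withColumn('model_segment', lit(''))
    if segment not in df.columns:
        raise ValueError(f"Column '{segment}' does not exist in the DataFrame")
    return df.withColumn('model_segment', col(segment))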

Related

How to add null value in Azure Datafactory Derived columns expression builder

I am currently using Azure Data Factory, in which I am creating a Derived Column. Since the field will always be blank, I want the value to be NULL.
In the Derived Column I am currently adding expressions such as toString("null") and toString(null()), but the value appears as a string. I only want null to appear, without quotes, in the JSON document.
I have reproduced the above and got the results below.
I tried assigning null() directly to a column and it gave the error below.
So, in an ADF Data Flow, null() should be wrapped in a conversion function such as toInteger() or toString().
When I use toString(null()) for the row where id is 4 in a derived column of the data flow, with a JSON sink, it gives me the output below.
You can see that the row with id==4 skips the null-valued key in the JSON. If you apply toString(null()) to every row, that key will be skipped in every row.
You can go through this link by #ShaikMaheer-MSFT to understand more about this.
AFAIK, the workaround for this is to store the null as the string 'null' so that the key is kept in the JSON, like this, and later handle it as per your requirement.

Can't convert table value to string in ADF

I'm trying to loop over data in a SQL table, but when I try to use the value inside a ForEach loop activity using #item(), I get the error:
"Failed to convert the value in 'table' property to 'System.String' type. Please make sure the payload structure and value are correct."
So the row value can't be converted to a string.
Could that be my problem? And if so, what can I do about it?
Here is the pipeline:
I reproduced the above scenario with a SQL table containing table names in the Lookup activity and CSV files from ADLS Gen2 in the Copy activity, and got the same error.
The above error arises when we pass the lookup output array items directly into a string parameter inside the ForEach.
If we look at the below lookup output:
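Illustratively, it has roughly this shape (the second file name here is assumed for the example):

{
    "count": 2,
    "value": [
        { "tablename": "sample1.csv" },
        { "tablename": "sample2.csv" }
    ]
}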
The value array above is not a plain array; it is an array of objects. So #item() in the first iteration of the ForEach is a single object, { "tablename": "sample1.csv" }. But our parameter expects a string value, which is why it gives the above error.
To resolve this, use #item().tablename, which returns the table name in each iteration of the ForEach.
My repro for your reference:
I have given the same in the sink as well, and this is my output.
Pipeline Execution
Copied data in target

Azure Data Factory DataFlow md5 specific columns

First of all, I have my array of columns parameter called $array_merge_keys
$array_merge_keys = ['Column1', 'Column2', 'NoColumnInSomeCases']
I then want to hash them. If the third column, NoColumnInSomeCases, does not exist, I would like to treat it as null or some placeholder string; otherwise, use its value.
But when I use these with byNames(), it returns NULL because the last column does not exist, even though the first and second still have values. I would expect byNames($array_merge_keys) to always return a value so that the columns can be hashed.
Since that problem cannot be solved, I went back to filtering down to only the columns that actually exist:
filter(columnNames('', true()), contains(['Column1', 'Column2', 'NoColumnInSomeCases'], #item_1 == #item)) => ['Column1', 'Column2']
But this runs into another problem: byNames() cannot be computed on the fly; it says 'byNames' does not accept column or argument parameters:
array(byNames(filter(columnNames('', true()), contains(['Column1', 'Column2', 'NoColumnInSomeCases'], #item_1 == #item))))
Spark job failed: { "text/plain":
"{"runId":"649f28bf-35af-4472-a170-1b6ece50c551","sessionId":"a26089f4-b0f4-4d24-8b79-d2a91a9c52af","status":"Failed","payload":{"statusCode":400,"shortMessage":"DF-EXPR-030
at Derive 'CreateTypeFromFile'(Line 35/Col 36): Column name function
'byNames' does not accept column or argument
parameters","detailedMessage":"Failure 2022-04-13 05:26:31.317
failed DebugManager.processJob,
run=649f28bf-35af-4472-a170-1b6ece50c551, errorMessage=DF-EXPR-030 at
Derive 'CreateTypeFromFile'(Line 35/Col 36): Column name function
'byNames' does not accept column or argument parameters"}}\n" } -
RunId: 649f28bf-35af-4472-a170-1b6ece50c551
I have tried lots of ways, even creating a new derived column (before that stream name) to store ['Column1', 'Column2']. But it said that a column cannot be referenced within the byNames() function.
Is there any elegant solution?
It is true that byName() cannot evaluate with late binding. You need to either use a Select transformation to set the columns in the stream you wish to hash first or send in the column names via a parameter. Since that is "early column binding", byName() will work.
You can use a get metadata activity in the pipeline to inspect which columns are present in your source before calling the data flow, allowing you to send a pipeline parameter with just those columns you wish to hash.
Alternatively, you can create a new branch, use a select matching rule, then hash the row based on those columns (see example below).

Parameters on Pentaho not working when inside of braces

I am trying to perform a simple PostgreSQL task with Pentaho. I need to update a jsonb value based on a user's input. So it looks like:
UPDATE table
set column = jsonb_set(
column,
'{path}',
'${new_param_value}'
)
WHERE id = ${id_param}
I added an alert and checked the network call, both indicating the correct parameter values are being passed. However, the value is never updated. If I substitute the user's entered param value with a hard coded number, it works as expected.
Ex: Value is updated successfully to 3, regardless of input
UPDATE table
set column = jsonb_set(
column,
'{path}',
'3'
)
WHERE id = ${id_param}
I guess that there is an issue with finding the parameter when it is inside {} braces. I've tried using ` tick marks (backticks) to account for this, and using a WITH clause before my statement, but I am still stuck. Any suggestions help!

web2py db query select showing field name

When I have the following query:
str(db(db.items.id==int(row)).select(db.items.imageName)) + "\n"
The output includes the field name:
items.imageName
homegear\homegear.jpg
How do I remove it so that the field name is not included and I get just the selected image name?
I tried referencing it like a list: [1] gives me an out-of-range error, and with [0] I end up with:
<Row {'imageName': 'homegear\\homegear.jpg'}>
The above is not a list; what object is that, and how can I reference it?
Thanks!
John
db(db.items.id==int(row)).select(db.items.imageName) returns a Rows object, and its __str__ method converts it to CSV output, which is what you are seeing.
A Rows object contains Row objects, and a Row object contains field values. To access an individual field value, you must first index the Rows object to extract the Row, and then get the individual field value as an attribute of the Row. So, in this case, it would be:
db(db.items.id==int(row)).select(db.items.imageName)[0].imageName
or:
db(db.items.id==int(row)).select(db.items.imageName).first().imageName
The advantage of rows.first() over rows[0] is that the former returns None in case there are no rows, whereas the latter will generate an exception (this doesn't help in the above case, because the subsequent attempt to access the .imageName attribute would raise an exception in either case if there were no rows).
Note, even when the select returns just a single row with a single field, you still have to explicitly extract the row and the field value as above.
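For instance, a defensive version of the same lookup (using the names from the question) might look like this:

rows = db(db.items.id == int(row)).select(db.items.imageName)
record = rows.first()  # a Row object, or None if there is no matching record
image_name = record.imageName if record else None  # just the string, e.g. 'homegear\\homegear.jpg'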