I have written a function that converts a DataFrame's columns to a specified schema in PySpark. The cast function silently sets an entry to null if it cannot convert the value to the target datatype.
e.g. F.col(col_name).cast(IntegerType()) will typecast to Integer, and if the column value is a Long it will make it null.
Is there any way to capture the cases where it converts to null? In a data pipeline that runs daily, if those cases are not captured, the values will silently become null and be passed to the upstream systems.
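One way to capture them (a minimal sketch in PySpark; the helper name cast_with_audit and the audit handling are my own assumptions, not an established API) is to cast into a temporary column and compare it with the original: rows that were non-null before the cast but null after it are the silent failures.

from pyspark.sql import functions as F

def cast_with_audit(df, col_name, target_type):
    # Cast into a temporary column so the original value stays available
    casted = df.withColumn("_casted", F.col(col_name).cast(target_type))
    # Rows that were non-null before the cast but null afterwards failed to convert
    failed = casted.filter(F.col(col_name).isNotNull() & F.col("_casted").isNull())
    # Replace the original column with the casted value and drop the temp column
    result = casted.withColumn(col_name, F.col("_casted")).drop("_casted")
    return result, failed

In a daily pipeline you could then count the failed rows, or write them to an audit table, and raise an alert instead of silently passing nulls to the upstream systems.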
First of all, I have an array-of-columns parameter called $array_merge_keys:
$array_merge_keys = ['Column1', 'Column2', 'NoColumnInSomeCases']
So then I am going to hash them; if the third column, NoColumnInSomeCases, does not exist, I would like to treat it as null or some placeholder string, otherwise use its value.
But actually, when I use them with byNames(), it returns NULL because the last column does not exist, even though the first and second still have values. I would expect byNames($array_merge_keys) to always return a value so that I can hash it.
Since that problem cannot be solved, I fell back to filtering for only the columns that exist:
filter(columnNames('', true()), contains(['Column1', 'Column2', 'NoColumnInSomeCases'], #item_1 == #item)) => ['Column1', 'Column2']
But that runs into another problem: byNames() cannot be computed on the fly. It fails with 'byNames' does not accept column or argument parameters:
array(byNames(filter(columnNames('', true()), contains(['Column1', 'Column2', 'NoColumnInSomeCases'], #item_1 == #item))))
Spark job failed:
{"runId":"649f28bf-35af-4472-a170-1b6ece50c551","sessionId":"a26089f4-b0f4-4d24-8b79-d2a91a9c52af","status":"Failed","payload":{"statusCode":400,"shortMessage":"DF-EXPR-030 at Derive 'CreateTypeFromFile'(Line 35/Col 36): Column name function 'byNames' does not accept column or argument parameters","detailedMessage":"Failure 2022-04-13 05:26:31.317 failed DebugManager.processJob, run=649f28bf-35af-4472-a170-1b6ece50c551, errorMessage=DF-EXPR-030 at Derive 'CreateTypeFromFile'(Line 35/Col 36): Column name function 'byNames' does not accept column or argument parameters"}}
RunId: 649f28bf-35af-4472-a170-1b6ece50c551
I have tried lots of ways, and even created a new derived column (before that stream) to store ['Column1', 'Column2']. But it said that a column cannot be referenced within the byNames() function.
Is there an elegant solution?
It is true that byName() cannot evaluate with late binding. You need to either use a Select transformation to set the columns in the stream you wish to hash first, or send in the column names via a parameter. Since that is "early column binding", byName() will work.
You can use a Get Metadata activity in the pipeline to inspect which columns are present in your source before calling the data flow, allowing you to send a pipeline parameter with just those columns you wish to hash.
Alternatively, you can create a new branch, use a Select transformation with a matching rule, and then hash the row based on those columns.
I am using the expression builder in a Derived Column action of Azure Data Factory. I have an iif statement that adds objects to a single array of objects based on whether 5 columns are null. Within the iif statement, if the column is not null it adds an object to the array, and I did not specify an action for when the column is null. So if 3 of the columns have a value, there should be 3 objects in the array, but the 2 empty columns show up as 2 "null" values within the array. I don't want that; I just want to cleanly have only the 3 objects in the array. How can I convert the null values to whitespace, or is there a better way to get this done?
I've made a test to convert null values to whitespace successfully.
My source data is a CSV file with 6 columns, where some columns may contain null values.
In the data flow, I'm using a Derived Column transformation to convert the null values.
In the data preview, we can see the null values were replaced with whitespace/blanks.
Summary:
So we can use the expression iif(isNull(<Column_Name>),'\n',<Column_Name>) to replace the NULL value with whitespace.
I have a scenario in which I have to remove some text from a date and then convert it into DATETIME. But when I use the method below, it results in NULL output.
SELECT CAST(REPLACE('2017-09-21','',NULL) AS DATETIME)
The same output occurs when using CONVERT. Why is this happening?
It's documented:
Returns NULL if any one of the arguments is NULL.
You pass NULL as the last argument.
You could use NULLIF instead of REPLACE:
SELECT CAST(NULLIF('20170921', '') AS DATETIME)
This will return NULL if the string is empty and a casted datetime otherwise.
You did specify NULL. In SQL (all products) NULL means UNKNOWN. Applying any function to an unknown value results in an unknown result, hence NULL. The exceptions are the functions specifically meant to deal with NULLs.
It's hard to see how your expression could be fixed, though, since you are trying to replace an empty string with NULL. What is the empty string supposed to match?
I have a Postgres query with one input parameter of type varchar. The value of that parameter is used in the WHERE clause. Until now, only a single value was sent to the query, but now we need to send multiple values so that they can be used with an IN clause.
Earlier:
value = 'abc'
where data = value          -- current usage
Now:
value = 'abc,def,ghk'
where data in (value)       -- intended usage
I tried many ways, e.g. providing the value as
value='abc','def','ghk'
Or
value="abc","def","ghk" etc.
But none of them works; the query returns no results even though matching data is available. If I provide the values directly in the IN clause, I see the data.
I think I need to somehow split the parameter, which is a comma-separated string, into multiple values, but I am not sure how to do that.
Please note it's a Postgres DB.
You can try splitting the input string into an array, something like this:
where data = ANY(string_to_array('abc,def,ghk',','))
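If the parameter is sent from application code, here is a minimal sketch of the same idea (assuming psycopg2; the table my_table and its column data are hypothetical names for illustration):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
value = 'abc,def,ghk'  # the comma-separated varchar parameter, unchanged
# string_to_array splits the string on ',' and = ANY(...) behaves like IN over the array
cur.execute(
    "SELECT * FROM my_table WHERE data = ANY(string_to_array(%s, ','))",
    (value,),
)
rows = cur.fetchall()

This way the query keeps a single varchar parameter and Postgres does the splitting.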
I am using a tFilterRow to filter out empty rows. While trying to use it, I am only getting one function value, 'absolute value'.
I want to filter values with a length greater than 0.
Why am I not getting any other functions?
As mentioned in the comments, the length function is only available to schema columns that have the String data type.
To filter out any rows that have a null value in a column, you can use a tFilterRow configured so that the column being checked is not equal to null.
If you are dealing with the primitive int (rather than the Integer class), the primitive can never be null and instead defaults to 0, so you'll want to set the check to not equal to 0 instead.