Remove fields that are null or 0 in a StructType - scala

I am creating an aggregation over several fields, some of which can be 0 or null (I can keep null or change it to 0, depending on what works). The result is finally written to DynamoDB. I want to remove these null fields to meet the record size requirements of DynamoDB.
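import org.apache.spark.sql.functions.{col, collect_set, struct}

// Pack the address and its phone numbers into one struct per row
df.withColumn("phone_nos", struct(
    col("address"),
    col("home_number"),
    col("cell_number"),
    col("work_number")))
  // Collect the distinct structs for each customer
  .groupBy("name")
  .agg(collect_set(col("phone_nos")).alias("phone_nos_by_address"))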
Each customer can have multiple addresses, but not necessarily every phone number for each address.
I have tried to_json. While that does remove the null fields, the consumer of the DynamoDB table cannot read the value as a JSON string. I need a way to keep the map structure but remove the null fields.
Any help is appreciated.
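One way to picture the desired result, as a minimal sketch in plain Python rather than Spark: each entry keeps its map shape but loses the keys whose value is null or 0. The helper name drop_empty_fields and the sample values are hypothetical. In Spark itself a StructType always carries all of its fields, so the same idea would presumably need a map-typed column instead of a struct.

```python
from typing import Any, Dict

def drop_empty_fields(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of record without keys whose value is None or 0."""
    return {k: v for k, v in record.items() if v is not None and v != 0}

# Hypothetical phone_nos entries for one customer (values are made up)
phone_nos = [
    {"address": "1 Main St", "home_number": "555-0100",
     "cell_number": None, "work_number": 0},
    {"address": "2 Oak Ave", "home_number": None,
     "cell_number": "555-0199", "work_number": None},
]

# Each entry keeps only the fields that actually carry a value
phone_nos_by_address = [drop_empty_fields(p) for p in phone_nos]
print(phone_nos_by_address)
```

Because every entry ends up with a different set of keys, the result is closer to a list of maps than to a list of fixed-schema structs.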

Related

Only saving files without null value on Nifi

I'm an absolute newbie trying out NiFi and PostgreSQL on Docker Compose.
I have a sample CSV file with 4 columns.
I want to split this CSV file into two,
based on whether a row contains a null value or not.
Grade ,BreedNm ,Gender ,Price
C++ ,beef_cattle ,Female ,10094
C++ ,milk_cow ,Female ,null
null ,beef_cattle ,Male ,12704
B++ ,milk_cow ,Female ,16942
For example, the table above should be split into two tables, one containing rows 1 and 4 and the other containing rows 2 and 3,
with each of them saved into a PostgreSQL table.
Below is what I have tried so far.
I was trying to:
split the flowfile in two, keeping only rows without null values on the left side and rows with null values on the right side;
write each of them into its own table, named 'valid' and 'invalid' respectively.
However, I do not know how to split the CSV file and save the parts as PostgreSQL tables through NiFi.
Can anyone help?
What you could do is use a RouteOnContent processor with its Match Requirement set to "content must contain match", with the match being null. Anything that matches null would then be routed one way, and anything not matching null would be routed a different way. I'm not sure it's possible the way you're doing it, but that is one possibility. The match could be something like (.*?)null
I used the QueryRecord processor with two SQL statements, one selecting the rows that contain a null value and the other selecting the rows that do not, and it worked as intended!
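The row-level split that the two QueryRecord statements perform can be sketched in plain Python. This is not NiFi configuration, only an illustration of the logic on the sample data above; the function name split_on_null is hypothetical, and treating the literal text "null" (or an empty cell) as a null value is an assumption based on how the sample file is written.

```python
import csv
import io

SAMPLE = """Grade ,BreedNm ,Gender ,Price
C++ ,beef_cattle ,Female ,10094
C++ ,milk_cow ,Female ,null
null ,beef_cattle ,Male ,12704
B++ ,milk_cow ,Female ,16942
"""

def split_on_null(text):
    """Partition CSV rows into (valid, invalid) by presence of a null cell."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    valid, invalid = [], []
    for row in reader:
        cells = [c.strip() for c in row]
        # Assumption: an empty cell or the literal text "null" counts as null
        has_null = any(c == "" or c.lower() == "null" for c in cells)
        (invalid if has_null else valid).append(dict(zip(header, cells)))
    return valid, invalid

valid, invalid = split_on_null(SAMPLE)
print("valid:", valid)
print("invalid:", invalid)
```

Each list would then go to its own table ('valid' and 'invalid' in the question).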

Mapping Data Flows Dynamic Column Updates

I have a text input source. This has over 100 columns so I won't show all of them here - a cut-down view of the data would be:
CustomerNo  DOB         DOD   Status
01418495    01/02/1940  NULL  1
01418496    01/01/1930  NULL  1
The users want to be able to update/override any of these columns during processing by providing another input text file containing the PK (CustomerNo) and the key/value pairs of the columns to be updated e.g.
CustomerNo  Variable  New Value
01418495    DOB       01/12/1941
01418496    DOD       01/01/2021
01418496    Status    0
Can this data be used to create dynamic columns somehow that update the customer records regardless of the columns they want to update - in the example above this would result in:
CustomerNo  DOB         DOD         Status
01418495    01/02/1941  NULL        1
01418496    01/01/1930  01/01/2021  0
I have looked at the documentation but don't see any examples of how something like this could be achieved. Thanks in advance for any advice.
You would use a technique similar to what I describe in this video: https://www.youtube.com/watch?v=q7W6J-DUuJY. What I've done is created a file with rules that have expressions and then apply those rules dynamically inside of my data flow.
The key to making this work is using the expr() function to dynamically evaluate the expression from the external file.
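To make the requested outcome concrete, here is a minimal sketch in plain Python of applying the key/value overrides to the customer rows. It is not a Mapping Data Flows expression and does not use expr(); the function name apply_overrides is hypothetical, and skipping overrides that name an unknown customer or column is an assumption.

```python
customers = [
    {"CustomerNo": "01418495", "DOB": "01/02/1940", "DOD": None, "Status": "1"},
    {"CustomerNo": "01418496", "DOB": "01/01/1930", "DOD": None, "Status": "1"},
]

# (CustomerNo, Variable, New Value) rows from the override file
overrides = [
    ("01418495", "DOB", "01/12/1941"),
    ("01418496", "DOD", "01/01/2021"),
    ("01418496", "Status", "0"),
]

def apply_overrides(rows, changes):
    """Return copies of rows with each (key, column, value) change applied."""
    by_key = {r["CustomerNo"]: dict(r) for r in rows}
    for customer_no, column, new_value in changes:
        target = by_key.get(customer_no)
        # Assumption: overrides for unknown customers or columns are ignored
        if target is not None and column in target:
            target[column] = new_value
    return list(by_key.values())

updated = apply_overrides(customers, overrides)
for row in updated:
    print(row)
```

The column to update is looked up by name at run time, which is what lets the same logic cover all 100+ columns without listing them.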

Postgres COALESCE inside nullif for 2 different fields

I am new to SQL and Postgres and have a quick question. Right now I have 2 different tables, one with car info and one with partial car info, and I would like to sort on car.vin or partial_car.vin, depending on which one exists, sending all nulls and empty strings to the end of the sort. Currently my ORDER BY clause looks like:
ORDER BY nullif(coalesce(car.vin, partial_car.partial_vin), '') asc nulls last limit 50 offset 0
My expectation is that coalesce will take the first non-null value and use it for sorting, or return null, which is then sent to the end. So far I haven't been able to make sense of my results: null values are being placed in between actual values, for example. If I change this to coalesce(car.vin, ''), I see it work properly again. Does anyone have an idea why this is the behavior? Let me know if you need anything more from me.
It was human error on my end. The object being sent to the client was not being populated properly with the partial data. So the sorting was correct, but I was seeing blanks because those values were not present.
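The expected behavior of the sort key can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Postgres, a single hypothetical cars table instead of the two joined tables, and "sort_key IS NULL" in the ORDER BY as a portable substitute for NULLS LAST; all of those are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single table standing in for the car / partial_car join
conn.execute("CREATE TABLE cars (vin TEXT, partial_vin TEXT)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [("B2", None), (None, "A1"), ("", "C3"), (None, None), (None, "")],
)

rows = conn.execute(
    """
    SELECT sort_key FROM (
        SELECT NULLIF(COALESCE(vin, partial_vin), '') AS sort_key FROM cars
    )
    ORDER BY sort_key IS NULL, sort_key ASC
    """
).fetchall()

# COALESCE only skips NULL, so an empty-string vin hides partial_vin 'C3'
sort_keys = [r[0] for r in rows]
print(sort_keys)
```

Real values sort first and every null or empty key lands at the end, which matches the conclusion that the ORDER BY itself was correct. The row with an empty vin and a populated partial_vin is worth noting: its key becomes null because COALESCE treats the empty string as a value.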

Convert varchar parameter with CSV into column values postgres

I have a Postgres query with one input parameter of type varchar.
The value of that parameter is used in the WHERE clause.
Until now only a single value was sent to the query, but now we need to send multiple values so that they can be used with an IN clause.
Earlier:
value='abc'
where data=value  // current usage
Now:
value='abc,def,ghk'
where data in (value)  // intended usage
I tried many ways, i.e. providing the value as
value='abc','def','ghk'
or
value="abc","def","ghk" etc.
But none of them works, and the query does not return any result even though matching data is available. If I provide the values directly in the IN clause, I see the data.
I think I should somehow split the comma-separated parameter string into multiple values, but I am not sure how to do that.
Please note it's a Postgres DB.
You can try to split the input string into an array. Something like this:
where data = ANY(string_to_array('abc,def,ghk',','))
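Why the split matters can be shown with a minimal sketch in plain Python; it mirrors the logic rather than running Postgres. The sample data_rows list is hypothetical. With IN (value) the whole parameter is one string, so only a row equal to 'abc,def,ghk' could ever match, whereas splitting first gives one candidate per item.

```python
# Hypothetical contents of the "data" column
data_rows = ["abc", "xyz", "ghk", "abc,def,ghk"]
value = "abc,def,ghk"

# IN (value): the parameter is a single string, compared as a whole
matched_as_single = [d for d in data_rows if d in (value,)]

# = ANY(string_to_array(value, ',')): split first, then compare per item
wanted = value.split(",")
matched_as_list = [d for d in data_rows if d in wanted]

print(matched_as_single)
print(matched_as_list)
```

As with string_to_array, the split here does not trim whitespace, so a parameter such as 'abc, def' would produce the item ' def' with a leading space.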

Null db values and defaults

I have 2 fields that I'm adding to a current database table with data in it. One is a bit and one is an int. If I am setting defaults for both, should I just set them to not null since there is no case where they would be null?
If you will ever need to store data where you need the ability to indicate "we don't know" then you may consider allowing null values.
For example, I store data from remote sensors. When I am unable to retrieve the sensor data, like due to network problems, I use null.
If, however, you require that a value always be present, then you should use the NOT NULL constraint.
Yes, that would do the trick. If you set those columns as not null and you don't specify a default value, you'll definitely get an error from the DB.
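The effect of adding NOT NULL columns with defaults to a table that already holds data can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in, since the question does not name the database; the table and column names are hypothetical, INTEGER replaces the bit type, and the exact error behavior may differ in other database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO readings (id) VALUES (?)", [(1,), (2,)])

# NOT NULL plus a default: existing rows pick up the default value
conn.execute("ALTER TABLE readings ADD COLUMN is_active INTEGER NOT NULL DEFAULT 0")
conn.execute("ALTER TABLE readings ADD COLUMN retries INTEGER NOT NULL DEFAULT 3")
existing = conn.execute(
    "SELECT id, is_active, retries FROM readings ORDER BY id"
).fetchall()

# NOT NULL without a default: SQLite refuses to add the column
try:
    conn.execute("ALTER TABLE readings ADD COLUMN broken INTEGER NOT NULL")
    add_without_default_failed = False
except sqlite3.OperationalError:
    add_without_default_failed = True

# An explicit NULL is still rejected even though a default exists
try:
    conn.execute("INSERT INTO readings (id, is_active) VALUES (3, NULL)")
    null_insert_failed = False
except sqlite3.IntegrityError:
    null_insert_failed = True

print(existing, add_without_default_failed, null_insert_failed)
```

The default only fills in values that are omitted; the NOT NULL constraint is what guards against an explicit NULL, so the two settings complement each other.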