Merge Rows (diff) step flags records as "changed" when they should be "identical" - PostgreSQL

I'm using the Merge Rows (diff) step to compare two data sets. Set A contains the records from the source (compare rows) and set B contains the records from the target (reference rows).
Merge Rows (diff) flags records as "changed" when they are actually "identical".
I'm inserting or updating records in my target table using the "Synchronize after merge" step, which inserts when a record is flagged "new" and updates when it is flagged "changed".
So every time I execute my transformation, the flag comes up as "changed", which should not happen.
Both data sets come from a PostgreSQL database.
I used the "Sort rows" step to sort both data sets on the primary key field in the transformation.
In the Merge Rows step, I used the key field to match the records in both streams and compared the value fields.
I expect the following behavior from the Merge Rows (diff) flag:
If a value has changed, I want to see the flag "changed".
If the record is identical, I want to see the flag "identical".
When a new record comes from the source, I want to see the flag "new".
Screenshots: sorted source and sorted target.

Double check; triple check. There is some difference, somewhere. And don't forget about leading or trailing whitespace.
Are you absolutely sure that systemmmodstamp has the same value in both cases?
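If nothing obvious stands out, it can help to compare the value fields directly in PostgreSQL. The query below is only a sketch with hypothetical names (source_table, target_table, id, some_value); substitute your own tables and columns. It lists rows whose key matches but whose raw values differ, and shows whether trimming whitespace removes the difference:
SELECT s.id,
       s.some_value AS source_value,
       t.some_value AS target_value,
       trim(s.some_value) IS DISTINCT FROM trim(t.some_value) AS differs_after_trim
FROM source_table AS s
JOIN target_table AS t USING (id)
WHERE s.some_value IS DISTINCT FROM t.some_value;
If differs_after_trim comes back false for the reported rows, leading or trailing whitespace is the culprit.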

Related

Update column in a dataset only if matching record exists in another dataset in Tableau Prep Builder

Is there any way to do this? Basically, I'm trying to do the equivalent of a SQL UPDATE ... SET when a matching record for one or more key fields exists in another dataset.
I tried using Joins and Merge. Joins seem to require more steps, and Merge appends records instead of updating the correlated rows.
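For context, the SQL being described would be roughly the following (a sketch only; the table and column names target_ds, source_ds, some_field, key_1, key_2 are placeholders):
UPDATE target_ds AS t
SET some_field = s.some_field
FROM source_ds AS s
WHERE t.key_1 = s.key_1
  AND t.key_2 = s.key_2;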

Spark Delta Table Updates

I am working in a Microsoft Azure Databricks environment using Spark SQL and PySpark.
So I have a delta table on a data lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day, with no primary/unique key. All these records have a "status" column which can either be NULL (if everything looks good on that specific record) or not null (say, if a particular lookup mapping for a particular column is not found). Additionally, my process includes another folder called "mapping" which gets refreshed on a periodic basis, let's say nightly to keep it simple, from which mappings are read.
On a daily basis, there is a good chance that about 100-200 rows get errored out (status column containing not-null values). From these files, on a daily basis (hence the partitioning by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to be received. The downstream job, in addition to the valid-status records, should also check whether a mapping is now found for the errored records and, if present, take them further down the pipeline as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally, I would first update the delta table/lake with the correct mapping and set the status column to, say, "available_for_reprocessing"; my downstream job would then pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems very difficult to do with Delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html", and the update example there only covers a simple update with constant values, not updates based on other tables.
The other, most inefficient, option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records, and write the dataframe back into the delta lake using the replaceWhere option. This is very inefficient, as we would be reading everything (hundreds of millions of records) and writing everything back just to process at most a thousand records. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is only for very simple updates. Without a unique key, it also seems impossible to update specific records. Can someone please help?
I can use PySpark or Spark SQL, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario I believe the SQL command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
    SELECT *
    FROM your_db.table_name AS b
    JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
    JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
    -- ... add further lookups if needed
    WHERE
        b.status = 'incorrect' AND
        a.lookup_column_a = b.lookup_column_a AND
        a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
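Adapted to the status/mapping scenario in the question, the same pattern could look roughly like this (a sketch only; events, mapping, lookup_column and mapped_value are placeholder names):
MERGE INTO events AS e
USING mapping AS m
ON e.lookup_column = m.lookup_column
WHEN MATCHED AND e.status IS NOT NULL THEN
UPDATE SET e.mapped_value = m.mapped_value, e.status = 'available_for_reprocessing'
Here only rows that previously errored out (status not null) and now have a matching mapping row are updated; the downstream job can then pick up the 'available_for_reprocessing' rows alongside the valid ones.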

Optimal approach for large bulk updates to indexed bit sets

The environment for this question is PostgreSQL 9.6.5 on AWS RDS.
The question is about an optimal schema design and batch update strategy for a table with 300 million rows containing the following logical data model:
id: primary key, string up to 40 characters long
code: integer 1-999
year: integer year
flags: variable number (1000+) each associated with a name, new flags added over time. Ideally, a flag should be thought of as having three values: absent (null), on (true/1) and off (false/0). It is possible, at the cost of additional updates (see below), to treat a flag as a simple bit (on or off, no absent). "On" values are typically very sparse: < 1/1000.
Queries typically involve boolean expressions on the presence or absence of one or more flags (by name) with code and year occasionally involved also.
The data is updated in batch via Apache Spark, i.e., updates can be represented as flat file(s), e.g., in COPY format, or as SQL operations. Only one update is active at any one time. Updates to code and year are very infrequent. Updates to flags affect 1-5% of rows per update (3-15 million rows). It is possible for the update rows to include all flags and their values, just the "on" flags to be updated or just the flags whose values have changed. In the former case, Spark would need to query the data to get the current values of flags.
There will be a small read load during updates.
The question is about an optimal schema and associated update strategy to support the query & updates as described.
Some comments from research so far:
Using 1,000+ boolean columns would create a very efficient row representation but, in addition to some DDL complexity, would require 1,000+ indexes.
Bit strings would be great if there was a way to index individual bits. Also, they do not offer a good way to represent absent flags. Using this approach would require maintaining a lookup table between flag names and bit IDs. Merging updates, if needed, works with ||, though, given PostgreSQL's MVCC there doesn't seem to be much benefit to updating just flags as opposed to replacing an entire row.
JSONB fields offer indexing (a rough sketch follows this list). They also offer null representation, but that comes at a cost: all flags that are "off" would need to be explicitly set, which would make the fields quite large. If we ignore null representation, JSONB fields would be relatively small. To further shrink them, we could use short 1-3 character field names with a lookup table. Same comments re: merging as with bit strings.
tsvector/tsquery: I have no experience with this data type, but in theory it seems to be an exact representation of a set of "on" flags by name. It would require a lookup table mapping flag names to tokens, with the additional requirement of ensuring there are no collisions due to stemming.
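For reference, a minimal sketch of the JSONB option above (table, index and flag names are placeholders, and only explicitly set flags are stored):
-- the flags column holds only explicitly set flags; absent flags are simply missing keys
ALTER TABLE main_table ADD COLUMN flags jsonb;
-- a jsonb_path_ops GIN index supports the @> containment operator used below
CREATE INDEX main_table_flags_idx ON main_table USING gin (flags jsonb_path_ops);
-- rows where flag "f12" is on and flag "f7" is off
SELECT id FROM main_table WHERE flags @> '{"f12": true}' AND flags @> '{"f7": false}';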
Don't store the flags in the main table.
Assuming that the main table is called data, define something like the following:
CREATE TABLE flag_names (
id smallint PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE flag (
flagname_id smallint NOT NULL REFERENCES flag_names(id),
data_id text NOT NULL REFERENCES data(id),
value boolean NOT NULL,
PRIMARY KEY (flagname_id, data_id)
);
If a new flag is created, insert a new row in flag_names.
If a flag is set to TRUE or FALSE, insert or update a row in the flag table.
Join flag with data to test if a certain flag is set.
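For example, to find the data rows where a given flag is on, a query along these lines should work (a sketch using the tables above; 'some_flag' is a placeholder name):
SELECT d.*
FROM data AS d
JOIN flag AS f ON f.data_id = d.id
JOIN flag_names AS fn ON fn.id = f.flagname_id
WHERE fn.name = 'some_flag'
  AND f.value;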

Sort unmatched records using joinkeys

I have two GDG files (the -1 and 0 versions). Using these two files, a flat file needs to be generated which will have Insert records (records which are not in the -1 version but are in the +0 version), Delete records (records which are in the -1 version but are not in the +0 version) and Update records (records which are in both versions, but where the +0 version might have changes in some of the fields). How can I get those update records? Can I do it using JOINKEYS, and if yes, how?
Note: The update can be anywhere from column 1 to the last column of the file(+0 version of the GDG)
It is a simple JOINKEYS:
OPTION COPY
JOINKEYS F1=INA,FIELDS=(4,80),SORTED,NOSEQCK
JOINKEYS F2=INB,FIELDS=(4,80),SORTED,NOSEQCK
JOIN UNPAIRED
REFORMAT FIELDS=(F1:1,227,F2:1,227,?)
The OPTION COPY is for the Main Task, the bit which runs after the joined file is produced. SORT FIELDS=COPY is equivalent to OPTION COPY.
The assumption is that your data is in key order already. If not, remove the SORTED,NOSEQCKs but bear in mind that you may get "spurious" matches, by equal keys not in the same position on the file relative to inserts and deletes.
JOIN UNPAIRED gives you matches and both types of mismatch. JOIN UNPAIRED,F1,F2 is equivalent.
The REFORMAT statement defines the records on the joined file. You want all the data from both/either record, and you want to know whether there was a match, and if no match, which input file had the record. That is what the question-mark (?) is. It will contain 'B' (on both files), '1' (on F1, or the first physically present JOINKEYS, only) or '2' (on the other JOINKEYS file only).
Then you need to output the data. I'll assume you want the data in separate places:
OUTFIL FNAMES=INSERT,
INCLUDE=(455,1,CH,EQ,C'1'),
BUILD=(1,227)
OUTFIL FNAMES=DELETE,
INCLUDE=(455,1,CH,EQ,C'2'),
BUILD=(228,227)
OUTFIL FNAMES=CHANGE,
INCLUDE=(455,1,CH,EQ,C'B',
AND,
1,227,CH,NE,228,227,CH),
BUILD=(1,454)
OUTFIL FNAMES=UNCHNGE,
SAVE,
BUILD=(1,227)
INCLUDE= (or OMIT=) includes or omits the data from the "OUTFIL Group". OUTFILs "run" concurrently (as in the same record is presented to each in turn, then the next record, etc).
FNAMES gives you the DDname to put in the JCL.
For CHANGE, the INCLUDE is for the first record (known to match due to the test for 'B') not being equal to the second. It is not exactly clear what output you want here. Currently those are output as F2 appended to F1, and the entire (twice-the-size) record is written. You could also write the records in "pairs" (BUILD=(1,227,/,228,227)) or just one or the other of the records.
SAVE is a thing which says "if this record hasn't appeared on any OUTFIL, output it here". It is certainly useful for testing, even if you don't want it in the final code.

Only row key in a Cassandra column family

I'm using Cassandra 1.1.8 and today I saw in my keyspace a column family with the following content
SELECT * FROM challenge;
KEY
----------------------------
49feb2000100000a556522ed68
49feb2000100000a556522ed74
49feb2000100000a556522ed7a
49feb2000100000a556522ed72
49feb2000100000a556522ed76
49feb2000100000a556522ed6a
49feb2000100000a556522ed70
49feb2000100000a556522ed78
49feb2000100000a556522ed6e
49feb2000100000a556522ed6c
So, only rowkeys.
Yesterday those rows were there and I ran some deletions (exactly on those rows). I'm using Hector
Mutator<byte []> mutator = HFactory.createMutator(keyspace, BYTES_ARRAY_SERIALIZER)
.addDeletion(challengeRowKey(...), CHALLENGE_COLUMN_FAMILY_NAME)
.execute();
This is a small development and test environment on a single machine / single node so I don't believe the hardware details are relevant.
Probably I'm doing something stupid, or I didn't get the point about how things are working, but as far as I understood the rows above are not valid... the column name and column value coordinates are missing, so there are no valid cells (row key / column name / column value)... is that right?
I read about ghost reads, but I think that is a scenario for a distributed environment... can this happen after one day and on a single Cassandra node?
From http://www.datastax.com/docs/1.0/dml/about_writes#about-deletes:
"The row key for a deleted row may still appear in range query results. When you delete a row in Cassandra, it marks all columns for that row key with a tombstone. Until those tombstones are cleared by compaction, you have an empty row key (a row that contains no columns). These deleted keys can show up in results of get_range_slices() calls. If your client application performs range queries on rows, you may want to have it filter out row keys that return empty column lists."