nvarchar(), MD5, and matching data in Bigquery - tsql

I am consolidating a couple of datasets in BQ, and in order to do that i need to run MD5 on some data.
The problem I'm having is a chunk of the data is coming MD5'ed already from Azure, and the original field is nvarchar()
I'm not as familiar with Azure, but what I find is that:
HASHBYTES('MD5',CAST('6168a1f5-6d4c-40a6-a5a4-3521ba7b97f5' as nvarchar(max)))
returns
0xCD571D6ADB918DC1AD009FFC3786C3BC (which is expected value)
where
HASHBYTES('MD5','6168a1f5-6d4c-40a6-a5a4-3521ba7b97f5')
returns
0x94B04255CD3F8FEC2B3058385F96822A which is equivalent to what i get if i run
MD5(TO_HEX(MD5('6168a1f5-6d4c-40a6-a5a4-3521ba7b97f5'))) in Bigquery, but it is unfortunately not what i need to match to, i need to match to the nvarchar version in BQ but i cannot figure out how to do that.

Figured out the problem, posting here for posterity.
The field in Azure is being stored as a nvarchar(50) which is encoded as UTF-16LE

Related

Oracle ORA_HASH alternative in Postgres(MD5)

I am trying to find an alternative to the ORA_HASH Oracle function in Postgres(edb). I know there are hashtext() and MD5(). Hashtext should be ideal, but it's not documented and so I can't use it for some reasons. I'd like to know if there is any way of using MD5() and getting the same value that you'd get in ORA_HASH giving the same value for both of them.
For example, this is how I use it and what I get in Oracle:
SELECT ORA_HASH('huyu') FROM dual;
----------------------------------
3284595515
I'd like to do something similar in postgres, so that if I pass the 'huyu' value, I'd get the exact same '3284595515' value.
Now, I know that ORA_HASH restores a number and MD5 restores hexadecimal value. I think I'd need the function that converts the hexadecimal 32 into a number, but I can't get the same value and I'm not sure if it is possible.
Thank you in advance for any suggestions
If you rely on the exact implementation of ORA_HASH, which is a proprietary hash function in a closed-source software, you are locked to this vendor, sorry.
I don't see how using PostgreSQL's hashtext is worse than ORA_HASH: it is documented (in the source), it's implementation is public, and it is not going to be removed, because it is required for hash partitioning.

Db2 for I: Cpyf *nochk emulation

In the IBM i system there's a way to copy a from a structured file to one without structure using Cpyf *nochk.
How can it be done with sql?
The answer may be "You can't", not if you are using DDL defined tables anyway. The problem is that *NOCHK just dumps data into the file like a flat file. Files defined with CRTPF, whether they have source, or are program defined, don't care about bad data until read time, so they can contain bad data. In fact you can even read bad data out of a file if you use a program definition for that file.
But, an SQL Table (one defined using DDL) cannot contain bad data. No matter how you write it, the database validates the data at write time. Even the *NOCHK option of the CPYF command cannot coerce bad data into an SQL table.
There really isn't an easy way
Closest would be to just build a big character string using CONCAT...
insert into flatfile
select mycharfld1
concat cast(myvchar as char(20))
concat digits(zonedFld3)
from mytable
That works for fixed length, varchar (if casted to char) and zoned decimal...
Packed decimal would be problematic..
I've seen user defined functions that can return the binary character string that make up a packed decimal...but it's very ugly
I question why you think you need to do this.
You can use QSYS2.QCMDEXC stored procedure to execute OS commands.
Example:
call qsys2.qcmdexc ( 'CPYF FROMFILE(QTEMP/FILE1) TOFILE(QTEMP/FILE2) MBROPT(*replace) FMTOPT(*NOCHK)' )

Informatica SQ returns different result

I am trying to pull data from DB2 via informatica, I have a SQ query that pulls few fields based on joins for 4 different tables.
When I run the query directly in the database, it returns the expected result, however when I run it in informatica and run a debugger, I see something else.
Please note all the columns data perfectly match, except one single column.
Weird thing is, this is a calculated field from the table based on a case statement:
CASE WHEN Column1='3' THEN 'N' ELSE 'Y' END.
Since this is a calculated field with a length of one string, I have connected from the source to SQ from one of the sources having 1 character length.
This returns 'Y' when executed in the database, the same query when I copy paste in SQ of information and run it, I get a data 'E', and this data can never be possible as I expect only a N or a Y. I have verified the column order, that its in the right place. This is very strange, is something going wrong because of the CASE Statement?
Save yourself the hassle, put an expression transformation after tge source qualifier and calculate, port value there then forget about it
I think i got the issue. We use Informatica PowerExchange to connect to a as400 system(DB2), and it seems that when we are trying to set a flag information in AS400, and pass it to informatica via PowerExchange, it converts it to binary, and to solve this, there needs to be an entry in the PowerExchange configuration file.
Unfortunately, i myself was not aware that it could be related to PowerExchange instead of powercenter itself.!!
Thanks for your assistance! Below is the KB about it.
https://kb.informatica.com/solution/4/Pages/17498.aspx

Oracle DB link - where clause evaluation

i have a DB2 data source and an Oracle 12c target.
The Oracle has a DB link to the DB2 defined which is working in general.
Now i have a huge table in the DB2 which has a timestamp column (lets call it ROW_CHANGED) for row changes. I want to retrieve rows which have changed after a particular time.
Running
SELECT * FROM lib.tbl WHERE ROW_CHANGED >'2016-08-01 10:00:00'
on the DB2 returns exactly 1 row after ca. 90 secs which is fine.
Now i try the same query from the Oracle via the db link:
SELECT * FROM lib.tbl#dblink_name WHERE ROW_CHANGED >TO_TIMESTAMP('2016-08-01 10:00:00')
This runs for hours and ends up in a timeout.
I read some Oracle docs and found distributed query optimization tips but most of them refer to joining a local to a remote table which is not my case.
In my desperation, i have tried the DRIVING_SITE hint, without effect.
Now i wonder when the WHERE part of the query will be evaluated. Since i have to use Oracle syntax and not DB2 syntax for the query, is it possible the Oracle will try to first copy the full table and apply the where clause afterwards? I did some research but did not find anything which would help me in this direction.
The ROW_CHANGED is a hidden column in the DB2, if that matters.
Thx for any hint in advance.
Update
Thanks#all for help. I'll share what did the trick for me.
First of all i have used TO_TIMESTAMP since the DB2 column is also Timestamp (not date) and i had expected to circumvent implicit conversions by this.
Without the explicit conversion i ran into ORA-28534: Heterogeneous Services preprocessing error and i have no hope of touching the DB config within reasonable time.
The explain plan btw did not bring much. It showed a FULL hint and no conversion on the predicates. Indeed it showed the ROW_CHANGED column as Date, i wonder why.
I have tried Justins suggestion to use a bind variable, however i got ORA-28534 again. Next thing i did was to wrap it into a pl/sql block (will run in a SP anyway later).
declare
v_tmstmp TIMESTAMP := 01.08.16 10:00:00;
begin
INSERT INTO ORAUSER.TMP_TBL (SRC_PK,ROW_CHANGED)
SELECT SRC_PK,ROW_CHANGED
FROM lib.tbl#dblink_name
WHERE ROW_CHANGED > v_tmstmp;
end;
This was executing in the same time as in DB2 itself. The date format is DD.MM.YY here since it is the default unfortunately.
When changing the variable assignment to
v_tmstmp TIMESTAMP := TO_TIMESTAMP('01.08.16 10:00:00','DD.MM.YY HH24:MI:SS');
I got the same problem as before.
Meanwhile the DB2 operators have created an index in the ROW_CHANGED column which i requested earlier that day. This has solved the problem in general it seems. Even my original query finishes in no time now.
If you are actually using an Oracle-specific conversion function like to_timestamp, that forces the predicate to be evaluated on the Oracle side. Oracle isn't going to know how to convert a built-in function like to_timestamp into an exactly equivalent function call in DB2.
If you used a bind variable, that would be more likely to get evaluated on the DB2 side. But that may be complicated by the data type mapping between different databases-- there may not be a perfect mapping between one engine's date and another engine's timestamp data type. If this was a numeric column, a bind variable would be almost certain to get pushed. In this case, it probably involves playing around a bit to figure out exactly what data type to use for your variable that works for your framework, Oracle, and DB2.
If using a bind variable doesn't work, you can force the predicate to be evaluated on the remote server using the dbms_hs_passthrough package. That lets you send a query verbatim to the remote server which allows you to do things like use functions defined in your DB2 database. That's a bit of overkill in this situation, hopefully, but it's nice to have the hammer as your backup if the simpler solution doesn't work quickly enough.

boolean field in redshift copy

I am producing a comma-separated file in S3 that needs to be copied to a staging table in a redshift database using the postgres COPY command.
It has one boolean field. With every sensible way I can think of to represent the boolean value in the file, redshift copy complains, usually with "Unknown boolean format".
I'm going to give up and change the staging table field to a smallint so that I can proceed with the copy and translate the value on the load from staging to the final redshift table, but I'm curious if anyone knows the correct incantation.
A zero or one works just fine for us.
Check your loads carefully, it may well be another issue that's 'pushing' invalid data into your boolean column.
For instance, we had all kinds of crazy characters embedded in our data that would cause errors like that. I eventually settled on using the US character for the record separator.
Check to make sure you're excluding the headers during the COPY command.
I ran into the same problem, but adding the ignoreheader 1 option (ignores 1 header line during import) solved the issue.