IBM Datastage - Join Stage not doing Exact match?

We are using IBM Datastage to match data coming via extracts from various mainframes.
One of the steps is to remove data in Oracle tables that are coming down in the extracts that night and it uses the Join stage to match an extract from an Oracle table against a dataset created from one of the extracts.
The issue is that the Join doesn't seem to do an exact match; it behaves as if it were doing a LIKE on the keys.
Has anyone seen this before?
Example of Data:
Oracle Extract
POLICY_NUMBER SQL Type(VarChar) Extended(Unicode) Length(13)
POLICY_NUMBER   CLIENT_NUMBER
A 123456        12345
A 123456W       A 123456W01
A 234567        23456
A 234567J       A 234567J01
Nightly Extract
POLICY_NUMBER SQL Type(Char) Extended(Unicode) Length(8)
POLICY_NUMBER   PRODUCT
A 123456        LIFE
A 234567        PENSION
Dataset after join
POLICY_NUMBER   CLIENT_NUMBER
A 123456        12345
A 123456        A 123456W01
A 234567        23456
A 234567        A 234567J01

Adjust the data types - it seems the comparison is done on the char(8). I have not seen this before, and a test with two different char-length columns did not match in my test.
There is no LIKE comparison in the Join stage.
I think if you adjust the data types so that both are varchar(13) it should work.
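For example, a hedged sketch of that fix: in a Transformer (or Modify) stage placed before the Join on the nightly-extract link, redefine the key with the same metadata as the Oracle side and trim the CHAR padding (the link name lnk_nightly is hypothetical):
Output column: POLICY_NUMBER   VarChar(13)   Extended: Unicode
Derivation:    Trim(lnk_nightly.POLICY_NUMBER)
With both join keys carrying identical VarChar(13) metadata and no trailing blanks, only exact policy numbers should match.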
Otherwise open a PMR.

Related

Aminoacid screening library in Knime

I have a task to create a tetrapeptide screening library of amino acids using KNIME. I have never used KNIME before, sadly. I need to create a workflow with all 20 amino acids, cross them with another 20, then cross the result with another 20, and repeat to get the final result of tetrapeptides. Can someone suggest how to input the amino acids in KNIME? Thank you very much!
Use a Table Creator node to enter the amino acid single-letter codes, one per row. Now use a Cross Joiner node to cross-join the table with itself - you should now have a table with rows like:
A|A
A|C
etc.
Now put this table into both inputs of a second Cross Joiner node, which should now give you quite a long table starting something like:
A|A|A|A
A|A|A|C
A|C|A|A
A|C|A|C
etc.
Now use a Column Aggregator node: select all columns as aggregation columns, set the aggregation method to Concatenate, and change the delimiter to an empty string.
This will give you a table with a single column, 'Peptide':
AAAA
AAAC
ACAA
ACAC
etc.
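For readers more comfortable with SQL, the same cross-join-and-concatenate logic can be sketched as follows (the amino_acids table and code column are hypothetical stand-ins for the Table Creator output):
-- one row per single-letter code: 'A', 'C', ..., 'Y'
SELECT a.code || b.code || c.code || d.code AS peptide
FROM amino_acids a
CROSS JOIN amino_acids b
CROSS JOIN amino_acids c
CROSS JOIN amino_acids d;
This is exactly what the two Cross Joiner nodes plus the Column Aggregator produce: 20^4 = 160,000 tetrapeptide strings.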
If you want the output as a chemical structure, then as of v1.36.0 the Vernalis community contribution contains a Speedy Sequence to SMILES node which will convert the sequence to a SMILES string (make sure you select the option indicating that your input column is a Protein!)
The full workflow is: Table Creator → Cross Joiner → Cross Joiner → Column Aggregator → (optionally) Speedy Sequence to SMILES.

Tableau - Calculated field of different columns based on different partition of the same table

Sorry for the stupid question.
Situation: I have a partitioned table (the partition is the week of the year) with some metrics (e.g. frequency of some keywords); I need to run an analysis of metrics belonging to different partitions (e.g. the trend between the frequency of a keyword in week 32 compared to week 3). The ultimate purpose is to create a dashboard where the user can choose the week of the year and is presented with the calculated analysis on the go.
So far I have used a live query with two parameters (week_1 and week_2) that joins data from the same table based on the two different parameters. You can imagine that the dashboard recomputes everything once one of the parameters is changed by the user. To avoid long waiting times, I have set the two parameters to a non-existent default value (0, zero), so that the dashboard opens very quickly. Then I prompt the user to stop the dashboard, enter the new parameters of choice, and restart the dashboard to load the new computations.
My question is: is it possible to achieve the same by using an extract of the table? The table itself should not be excessively big (it should be 15 million records spanning 3 years) and as far as I know the extracts are performant with those numbers.
I am quite new to Tableau, so I would like to know from more expert people if there is a more optimal way to do such a thing without using live queries.
Please feel free to ask for more information if I was not clear! However, I cannot share my workbook, as it contains sensitive information.
Edit:
+-----------+---------+-----------+
| partition | keyword | frequency |
+-----------+---------+-----------+
| 202032    | hello   |      5000 |
| 202032    | ciao    |       567 |
| ...       |         |           |
| 202031    | hello   |      2323 |
| 202031    | ciao    |     34567 |
| ...       |         |           |
| 20203     | hello   |         2 |
| 20203     | ciao    |      1000 |
+-----------+---------+-----------+
With the live query, I can join the table where partition = 202032 with the same table where partition = 20203 and make a new table with a column where I compute e.g. a trend between the two frequencies:
+---------+---------------------+-------------+
| keyword | partitions_compared | trend       |
+---------+---------------------+-------------+
| hello   | 202032 - 20203      | +1billion % |
| ciao    | 202032 - 20203      | +1K %       |
+---------+---------------------+-------------+
With the live query I join on the keywords.
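For reference, a minimal sketch of what such a live query can look like (the keyword_metrics table name is hypothetical, <week_1>/<week_2> stand in for the Tableau parameters, and concatenation syntax varies by database):
SELECT a.keyword,
       a.partition || ' - ' || b.partition AS partitions_compared,
       (a.frequency - b.frequency) * 100.0 / b.frequency AS trend
FROM keyword_metrics a
JOIN keyword_metrics b ON b.keyword = a.keyword
WHERE a.partition = <week_1>
  AND b.partition = <week_2>;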
Thanks a lot in advance and have a great day!
Cheers

Inaccurate COUNT DISTINCT Aggregation with Date dimension in Google Data Studio

When I aggregate values in Google Data Studio with a date dimension on a PostgreSQL Connector, I see buggy behaviour. The symptom is that performing COUNT(DISTINCT) returns the same value as COUNT().
My theory is that it has something to do with the aggregation on the data occurring after the count has already happened. If I attempt the exact same aggregation on the same data in an exported CSV instead of directly from a PostgreSQL Connector Data Source, the issue does not reproduce.
My PostgreSQL Connector is connecting to Amazon Redshift (jdbc:postgresql://*******.eu-west-1.redshift.amazonaws.com) with the following custom query:
SELECT
userid,
submissionid,
date
FROM mytable
Workaround
If I stop using the default date field for the Date Dimension and aggregate my own dates directly within the SQL query (date_byweek), the COUNT(DISTINCT) aggregation works as expected:
SELECT
userid,
submissionid,
to_char(date,'YYYY-IW') as date_byweek
FROM mytable
While this workaround solves my immediate problem, it sucks because I miss out on all the date functionality provided by Data Studio (Hierarchy Drill Down, Date Range filtering, etc.). Not to mention it reduces my confidence in what else may be "buggy" within the product 😞
How to Reproduce
If you'd like to re-create the issue, using the following data as a PostgreSQL Data Source should suffice:
> SELECT * FROM mytable
userid submissionid
-------- -------------
1 1
2 2
1 3
1 4
3 5
> COUNT(DISTINCT userid) -- ERROR: Returns 5 when data source is PostgreSQL
> COUNT(DISTINCT userid) -- EXPECTED: Returns 3 when data source is CSV (exported from same PostgreSQL query above)
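Running the same aggregation directly against the source confirms the expected numbers (a quick sketch using the table from the post):
SELECT COUNT(DISTINCT userid) AS distinct_users,  -- 3
       COUNT(userid) AS total_rows                -- 5
FROM mytable;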
I'm happy to report that as of Sep 17 2020, there's a workaround.
Data Studio added the DATETIME_TRUNC function (see https://support.google.com/datastudio/answer/9729685?), which lets you add a custom field that truncates the original date to whatever granularity you want, without triggering the distinct bug.
Attempting to set the display granularity in the report still causes the bug (i.e., you'll still see Oct 1 2020 12:00:00 instead of Oct 2020).
This can be solved by creating a SECOND custom field, which just returns the first, and then you can add IT to the report, change the display granularity, and everything will work OK.
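For concreteness, a minimal sketch of the two calculated fields (the field names are hypothetical; the DATETIME_TRUNC syntax is from the linked documentation):
date_week = DATETIME_TRUNC(date, WEEK)
date_week_display = date_week
Add date_week_display to the chart and change the display granularity on that field; the distinct counts then remain correct.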
I had the same issue with the MySQL Connector. My problem was solved when I changed the date field format in the DB from DATETIME (YYYY-MM-DD HH:MM:SS) to INT (Unix timestamp). After connecting this table to Google Data Studio I set the type for this field to Date (YYYYMMDD) and everything works as expected. Hope this may help you :)
In this Google forum there is a curious solution by Damien Choizit that involves combining your data source with itself. It works well for me.
https://support.google.com/datastudio/thread/13600719?hl=en&msgid=39060607
It says:
I figured out a solution in my case: I used a Blend Data joining twice the same data source with corresponding join key(s), then I specified a data range dimension only on the left side and selected the columns I wanted to CTD aggregate as "dimensions" (and not metric!) on the right side.

Multiple PDF export with single JRXML

I am using iReport for designing the JRXML and JasperReports Server for report scheduling.
I have developed a simple single JRXML in iReport with a SQL query that executes, fetches the records, and generates a single PDF.
Now I have different types of records, like:
DEPT   NAME     Salary
----   ------   ------
HR     MR XYZ   500000
MFG    MR PQR   300000
HR     MR ABC   400000
EDU    MR DEF   350000
EDU    MR SSS   400000
Now my requirement is to generate an individual PDF for each department. That means one PDF must contain only HR-related data, another must contain only EDU-related data, etc.
How can I do this with iReport and JasperReports Server?
You can parameterize your query. If I understand correctly, and choosing the department is a matter of a simple WHERE clause, you can add this to the query:
WHERE dept like $P{DEPT_PARAM}
And pass that parameter from JasperReports Server as needed ("%" for when you need to print all the departments).
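For example, a hedged sketch of the full report query (the employee table and column names are assumptions based on the sample data; only the WHERE clause above comes from the answer):
SELECT dept, name, salary
FROM employee
WHERE dept LIKE $P{DEPT_PARAM}
ORDER BY dept, name
Scheduling the report once per department value (HR, MFG, EDU, ...) then produces one PDF per department, while passing "%" yields a single PDF covering them all.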

SQL Server 2008: Pivot column with no aggregate function workaround

Yes I know, this question has been asked MANY times, but after reading all the posts I found that there wasn't an answer that fits my need. So, here's my question: I would like to take a column of values and pivot them into rows of 6 columns.
I want to take this (a single column of values):
G
081278
12
00123535
John Doe
123456
And turn it into this:
Letter   Date     Code   Ammount    Name       Account
G        081278   12     00123535   John Doe   123456
I have 110000 values in this one column in one table called TempTable. I need all the values displayed because each row is an entity to itself. For instance, there is one unique entry for all of the Letter, Date, Code, Ammount, Name, and Account columns. I understand that an aggregate function is required, but is there a workaround that will allow me to get this desired result?
Just use a MAX aggregate (see the sketch after the list below).
If one row = one column (per group of 6 rows), then MAX of a single value = that row value.
However, the data you've posted is insufficient. I don't see anything to:
associate the 6 rows per group
distinguish whether a row is "Letter" or "Name"
There is no implicit row order or number to rely upon to generate the groups
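If such an ordering column did exist, a minimal sketch of the MAX-based pivot could look like this (the columns rn and val are hypothetical, and every 6 consecutive rows are assumed to form one entity):
;WITH numbered AS (
    SELECT val,
           (rn - 1) / 6 AS grp,   -- which entity the row belongs to
           (rn - 1) % 6 AS pos    -- position within the entity: 0..5
    FROM TempTable
)
SELECT MAX(CASE WHEN pos = 0 THEN val END) AS Letter,
       MAX(CASE WHEN pos = 1 THEN val END) AS [Date],
       MAX(CASE WHEN pos = 2 THEN val END) AS Code,
       MAX(CASE WHEN pos = 3 THEN val END) AS Ammount,
       MAX(CASE WHEN pos = 4 THEN val END) AS Name,
       MAX(CASE WHEN pos = 5 THEN val END) AS Account
FROM numbered
GROUP BY grp;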
Unfortunately, the maximum number of columns in a SQL Server 2008 SELECT statement is 4,096, as per the MSDN Maximum Capacity documentation.
Instead of using a pivot, you might consider dynamic SQL to do what you want:
-- Build a comma-separated list of the column's quoted values and SELECT it back
-- as a single row (ColName and TableName are placeholders for the real names)
Declare @SQLColumns nvarchar(max), @SQL nvarchar(max)
select @SQLColumns = (select '''' + ColName + ''',' from TableName for XML Path(''))
set @SQLColumns = left(@SQLColumns, len(@SQLColumns) - 1)  -- drop the trailing comma
set @SQL = 'Select ' + @SQLColumns
exec sp_executesql @SQL