Which stage is used to combine two data streams without a common key field in DataStage (IBM)?

I'm using DataStage version 11.7 and encountered the error message below from the Lookup stage while compiling the job:
"The supplied expression was empty."
In the Lookup Stage, there are two links from two transformers and there is no common key column between the two datasets.
I googled how to merge or combine two datasets from two transformers without a common key column. However, I couldn't find a proper way to solve this issue or to implement my job in DataStage.
Is there anyone who knows how to solve this problem? If so, please let me know which stage is good for my job or how to solve the error. I would appreciate it.

If you need an n:m join, add a dummy column to each input link and fill it with a constant value like 1, then join over that column. Decide whether multiple matches should produce multiple output rows or whether the first match 'wins' - the latter would effectively be a random n:1, since every row matches when joining over a constant value.
But if you need to join specific rows, that indicates there actually is a common key, just not an obvious or visible one. Either transform the sources so that they get a common key, or use an anchor table that provides the relations: join it to the first source, then join the second source.
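For intuition, the dummy-key trick expressed in SQL (a sketch; src_a and src_b are hypothetical tables standing in for the two input links):
-- every row gets the same constant key, so the join yields the full n:m cross match
select a.*, b.*
from (select 1 as dummy_key, src_a.* from src_a) a
join (select 1 as dummy_key, src_b.* from src_b) b
  on a.dummy_key = b.dummy_key;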

Related

Performing a delete in an ADF mapping data flow - Updated with potential solution

I am trying to do upsert and delete in a mapping data flow.
There is a dimension table, DimCustomer.
It is being populated with data from a file.
If a SHA2 hash does not match, then upsert.
If the CustomerID is missing from the rawSource data, then delete.
The upsert works, but the delete does not. It's likely because in the sink I have selected the CustomerID column as the key, but this means it can never delete a record if the entire record, including the key, is missing from the source.
Is there a prescribed design pattern for this scenario?
The easiest solution I can think of is a second data flow, in which the only CustomerIDs sent to the sink are the ones with no matching CustomerID in the source (effectively a right outer join), but I want to see if this is indeed the best way to do this.
Update:
The best solution I can come up with is, in the above data flow, to add an additional column, the formula for which is coalesce(RawCustomerData#CustomerID,DimCustomer#CustomerID).
This ensures there is a CustID column that always has a value.
In the sink, I change the mapping so that this CustID maps to the sink CustomerID.
The delete now works as expected. Still unsure if this is the best solution but it works and doesn't appear to cause a major performance issue.
In my experience, that's the best solution: adding a new column solves the problem much more easily than other operations. The simplest effective way is the best solution. You don't need to create another data flow activity to achieve it, or redesign the AlterRow logic.
Your Solution:
Add an additional column CustID, the formula for which is: coalesce(RawCustomerData#CustomerID,DimCustomer#CustomerID)
This ensures there is a CustID column that always has a value. In the sink, you change the mapping so that this custID maps to the sink CustomerID.
The delete now works as expected.
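For intuition, the delete detection this setup performs is equivalent to the following SQL (a sketch reusing the names from the coalesce formula): the rows it returns are exactly those where the raw-source key is null and the coalesce falls back to the dimension key.
-- rows present in the dimension but absent from the source are the delete candidates
select d.CustomerID
from DimCustomer d
left join RawCustomerData r on r.CustomerID = d.CustomerID
where r.CustomerID is null;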

Error in makeClassifTask - columns to join must specify "on="

I am getting an error here for makeClassifTask() from the mlr package.
task = makeClassifTask(data = data[,2:20441], target='Disease')
Entering this, I get the following error:
Provided data is not a pure data.frame but from class data.table, hence it will be converted.
Error in `[.data.table`(data, target) :
When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
If someone could help me out it'd be great.
Given that you did not provide the data, I can only do some guessing and suggest reading the documentation at https://mlr3book.mlr-org.com/tasks.html.
It looks like you left out the first column of your dataset, which might be your target. Hence makeClassifTask() cannot find your target column.
As @Shreyash Gputa correctly pointed out, changing the data.table object to a data.frame object solves the issue:
task = makeClassifTask(data = as.data.frame(data[,2:20441]), target='Disease')
Given of course that data[,2:20441] contains the target variable Disease...
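Equivalently, a sketch using data.table's in-place converter (assuming the data.table package is loaded and that the subset still contains Disease):
library(data.table)
setDF(data) # convert the data.table to a data.frame by reference
task = makeClassifTask(data = data[,2:20441], target='Disease')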

How to get counts from AlterRow transformation in Azure Data Factory

I have an AlterRow transformation that marks each row with the appropriate CRUD operation in an ADFv2 data flow. I don't see any output variables on this activity that will give me the total inserts, updates, etc. I do, however, see methods in the expression syntax to tell me if a particular row is an IsInsert(), IsUpdate(), etc.
Would the correct way to get counts be to:
1. Add another output from the AlterRow transformation
2. Add a derived column that uses the expression syntax IsInsert(), IsUpdate() to set the operation type (I, U, D)
3. Add an aggregate that groups by this column to get total counts for each operation
When creating the aggregate, I don't see any metadata that would allow me to group by the CRUD operation type so I assume I would have to create this myself, but it seems like it should already be there since that's the purpose of the AlterRow transformation. Am I working too hard to get these counts?
Add an aggregate after your AlterRow with no group-by and use these formulas:
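For example (a sketch, assuming the data flow expression language's countIf() aggregate together with the isInsert()/isUpdate()/isDelete()/isUpsert() row-marker checks; the output column names are illustrative):
insertCount: countIf(isInsert())
updateCount: countIf(isUpdate())
deleteCount: countIf(isDelete())
upsertCount: countIf(isUpsert())
Each expression counts only the rows that AlterRow marked with that policy, so a single aggregate row carries all the totals.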

PostgreSQL - Compare ts_vector fields

I have two tables holding data that comes from two different sources. One field in each table contains the title of a movie, but for reasons out of my control the titles are not always exactly the same.
So I use ts_vector to get rid of the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two tsvectors without taking the numeric values into account, just the text content. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare tsvectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.
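For completeness, the comparison itself could be written as a join on the stripped vectors (a sketch reusing the tbl1/tbl2/ts_title names above):
-- equality on stripped tsvectors compares text content only, ignoring positions and weights
select *
from tbl1 t1
join tbl2 t2 on strip(t1.ts_title) = strip(t2.ts_title);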

Tableau: Create a table calculation that sums distinct string values (names) when condition is met

I am getting my data from a denormalized table, where I keep names and actions (among other things). I want to create a calculated field that returns a count of workgroup names, but only when there are more than five actions present in the DB for a given workgroup.
Here's how I did it when I wanted to check whether a certain action has been registered for a workgroup:
WINDOW_SUM(COUNTD(IF [action] = "ADD" THEN [workgroup_name] END))
When I try to do a similar thing with a count, I am getting "Cannot mix aggregate and non-aggregate arguments":
WINDOW_SUM(COUNTD(IF COUNT([Number of Records]) > 5 THEN [workgroup_name] END))
I know that there's a problem with the IF clause, but I don't know how to fix it.
How do I change the IF to be valid? Or maybe there's an easier way to do this that I am missing?
EDIT:
(after Inox's response)
I know that my problem is mixing aggregate with non-aggregate fields. I can't use a filter to do it, because I want to use this later as part of a more complicated view - filtering would destroy the whole idea.
No, the problem is mixing aggregated arguments (e.g., sum, count) with non-aggregated ones (e.g., any field used directly). And that's what you're doing by mixing COUNT([Number of Records]) with [workgroup_name].
If your goal is to know how many unique workgroup_name values have more than 5 records (which is what your code suggests), I think it's easier to filter and then count.
So first drag workgroup_name to Filters, go to the Condition tab, and select By field: Number of Records, Count, >, 5.
This way you'll keep only the workgroup_names that have more than 5 records.
Now you can go with a simple COUNTD(workgroup_name)
EDIT: After clarification
Okay, then you need to add a marker that is fixed in your database, so table calculations won't help you.
By definition table calculation depends on the fields that are on the worksheet (and how you decide to use those fields to partition or address), and it's only calculated AFTER being called in a sheet. That way, each time you call the function it will recalculate, and for some analysis you may want to do, the fields you need to make the table calculation correct won't be there.
Same thing applies to aggregations (counts, sums,...), the aggregation depends, well, on the level of aggregation you have.
In this case it's better to manipulate your data prior to connecting it to Tableau. I don't see a direct way (a single calculated field) that would solve your problem. What can be done is to generate a db from Tableau (with the aggregated number of records for each workgroup_name), then export it to csv or mdb and reconnect it to Tableau. But if you can manipulate your database outside Tableau, that's usually the better solution.
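For example, the pre-aggregation could be done in SQL before connecting to Tableau (a sketch, assuming the denormalized table is called actions and carries a workgroup_name column):
-- one row per workgroup that has more than five actions
select workgroup_name, count(*) as action_count
from actions
group by workgroup_name
having count(*) > 5;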