Parallel job is adding extra columns when outputting to a dataset - datastage

The last job before my dataset is written is a transformation. It's a lot more complex than this, but the basics are:
input = A Integer, B Integer and C Integer
output = A Integer, if B > 10 then C else 0 -> C Integer
So, to clarify, column A is just passed through and columns B and C are used to perform a transformation that is called "C" in the final output link.
When I examine the columns being written to the dataset I see A and C. I can save the table definition and this is also just columns A and C. However, when I actually run the job, column B also ends up in the dataset, so I end up with (in any order) columns A, B and C.
I've tried deleting my output dataset, then recreating it, giving it a new name, but it always ends up with that "working column" B in it for some reason I don't fully understand. I don't see how it's picking up a column that isn't in the final output link and choosing to add it against my wishes.
I don't want column B in my dataset, it's wasteful to store it and it's confusing for developers as it shouldn't be there in the first place. How do I stop DataStage from writing it?

It seems you have RCP (Runtime Column Propagation) activated - that will propagate all available columns, independent of the ones you specified.
Go to the stage (Transformer) - Properties - Output tab, where there is a Runtime Column Propagation checkbox - remove the check mark.
In other stages it could be located on the Columns tab as well.
In the job properties of your job there is also a setting which enables RCP for new links - remove this mark as well to avoid this problem in future job extensions.
For more details, check the IBM DataStage documentation on RCP.

Create sample value for failure records spark

I have a scenario where my dataframe has 3 columns a, b and c. I need to validate whether the length of all the columns is equal to 100. Based on the validation I am creating status columns like a_status, b_status, c_status with values 5 (Success) and 10 (Failure). In failure scenarios I need to update the count and create new columns a_sample, b_sample, c_sample with up to 5 failure sample values separated by ",". For creating the sample columns I tried this:
df = df.select(df.columns.toList.map(col(_)) :::
  df.columns.toList.map(x =>
    lit(getSample(df.select(x, x + "_status").filter(x + "_status=10").select(x).take(5))).alias(x + "_sample")
  ).toList: _*)
The getSample method just takes an array of rows and concatenates them into a string. This works fine for a limited number of columns and a small data size. However, if the number of columns is > 200 and the data is > 1 million rows, it creates a huge performance impact. Is there an alternative approach?
While the details of your problem statement are unclear, you can break up the task into two parts:
Transform data into a format where you identify several different types of rows you need to sample.
Collect sample by row type.
The industry jargon for "row type" is stratum/strata and the way to do (2), without collecting data to the driver, which you don't want to do when the data is large, is via stratified sampling, which Spark implements via df.stat.sampleBy(). As a statistical function, it doesn't work with exact row numbers but fractions. If you absolutely must get a sample with an exact number of rows there are two strategies:
Oversample by fraction and then filter out the unneeded rows, e.g., using the row_number() window function partitioned by stratum, followed by a filter such as 'row_num <= n.
Build a custom user-defined aggregate function (UDAF), firstN(col, n). This will be much faster but a lot more work. See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
An additional challenge for your use case is that you want this done per column. This is not a good fit with Spark's transformations such as grouping or sampleBy, which operate on rows. The simple approach is to make several passes through the data, one column at a time. If you absolutely must do this in a single pass through the data, you'll need to build a much more custom UDAF or Aggregator, e.g., the equivalent of takeFirstNFromAWhereBHasValueC(n, colA, colB, c).
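The contract of sampleBy() and the oversample-then-filter workaround can be sketched outside Spark. Below is a plain-Python illustration (the data and column names are hypothetical, and real Spark code would use df.stat.sampleBy() and a row_number() window instead):

```python
import random
from collections import defaultdict

def sample_by(rows, strata_col, fractions, seed=42):
    # Bernoulli-sample each row with the fraction configured for its
    # stratum -- the same contract as Spark's df.stat.sampleBy():
    # fractions, not exact row counts.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(r[strata_col], 0.0)]

def first_n_per_stratum(rows, strata_col, n):
    # Exact-count variant: the analogue of a row_number() window
    # partitioned by stratum followed by a 'row_num <= n' filter.
    counts = defaultdict(int)
    kept = []
    for r in rows:
        counts[r[strata_col]] += 1
        if counts[r[strata_col]] <= n:
            kept.append(r)
    return kept

rows = [{"status": s, "val": i} for i, s in enumerate([5, 10, 10, 10, 5, 10])]
print(first_n_per_stratum(rows, "status", 2))  # keeps vals 0, 1, 2, 4
```

In Spark the first_n_per_stratum logic would be distributed via Window.partitionBy(statusCol).orderBy(...), avoiding any collect to the driver.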

Tableau - Find Null Values across Specific Dimension Values

Working in Tableau - My data set looks like this:
Filename Run Score
File1 Run1 80
File1 Run2 Null
File1 Run3 Null
File1 Run4 60
File2 Run1 70
I need to be able to filter the file data based on Nulls being in certain runs. My current plan is a calculated field being used as either a parameter or filter (or both):
IF $score_for{$file}{'Run2'} == Null && $score_for{$file}{'Run3'} == Null
THEN $file{'calc value'} = 1 (or 'null values in runs I care about')
Then I can filter all 1's out of the charts and look at the files that did work for runs 2 & 3.
I have a feeling I can do this using INCLUDE, but for the life of me I can't figure out how that works. I've watched their training video three times.
It looks like your end goal is to identify files that satisfy a condition - in this case, files with non-null values for the runs of interest. This is a good case for using Tableau sets.
There are a lot of ways to think of sets: as a named filter, as a Boolean function defined for each data row, or as a mathematical set whose members come from some discrete field. I'd recommend something along the lines of:
Define the set of Runs of Interest -- right click on the Run field in the data pane in the left sidebar. Choose Create Set. Call it "Runs of Interest" and manually select the Runs that you want to belong to that set: Run2 and Run3 in your example.
Define the set of Files that Worked -- right click on the Files field, Create Set. Name it "Working Files", and then instead of manually selecting set members, choose the Use All radio button at the top of the set dialog, and then choose the Condition tab to define the condition that distinguishes working files from non-working files.
Enter a condition as a formula such as: MIN(NOT ISNULL([Score])) which will be satisfied for Files where EVERY data row has a non-null score. If instead you want files to belong to the set if ANY data row has a non-null score, then use MAX() instead of MIN().
Now that you have your Working Files set, place it on the filter shelf to restrict the viz to only working files. You can also use sets on the row/col shelves or in calculated fields. You can edit the Runs of Interest set as needed, and the Working Files set will adjust accordingly.
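For intuition, the set condition corresponds to this per-file aggregation, sketched in plain Python with hypothetical data (not Tableau syntax):

```python
# (file, run) -> score; None stands for a Null score
scores = {
    ("File1", "Run1"): 80,
    ("File1", "Run2"): None,
    ("File1", "Run3"): None,
    ("File1", "Run4"): 60,
    ("File2", "Run2"): 75,
    ("File2", "Run3"): 65,
}
runs_of_interest = {"Run2", "Run3"}

def working_files(scores, runs_of_interest, require_all=True):
    # MIN(NOT ISNULL([Score])) over a file's rows behaves like all();
    # MAX(NOT ISNULL([Score])) behaves like any().
    by_file = {}
    for (f, run), score in scores.items():
        if run in runs_of_interest:
            by_file.setdefault(f, []).append(score is not None)
    agg = all if require_all else any
    return {f for f, flags in by_file.items() if agg(flags)}

print(working_files(scores, runs_of_interest))  # -> {'File2'}
```

File1 drops out because its Run2 and Run3 scores are Null, which is exactly what the MIN(NOT ISNULL([Score])) condition encodes.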

COUNTIF based on three conditions using OFFSET and MATCH

Please see example screengrab
I would like to populate cell M2. First, match K2 (Taylor) against the column headers C1:I1 to locate the matching results column (here C2:C32). Then I would like to find the number of times "a" appears in C2:C32 where Type (column B) = "r".
So the result would be 3 (Reynolds, Maggio & Hamilton).
As you can see I've managed to populate Column R with totals without comparing against Type (Column B) but am having great difficulty understanding how to extend the comparison, intentionally without the use of helper columns/rows.
Any help would be greatly appreciated.
Since you have to depend on 2 columns, you will have to use COUNTIFS. Without being dynamic, the formula for M2 would be:
=COUNTIFS($B$2:$B$32,"r",$C$2:$C$32,"a")
          ^------------^ ^------------^
          1st Condition   2nd Condition
To make it dynamic, only the second column needs to be changed:
=COUNTIFS($B$2:$B$32,"r",OFFSET($B$2:$B$32,0,MATCH($K2,$C$1:$I$1,0)),"a")
Your totals formula could also be simplified to this (deriving the range instead of manually sizing it as 32 rows high, for instance):
=COUNTA(OFFSET($B$2:$B$32,0,MATCH($K2,$C$1:$I$1,0)))
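The two-condition count with a dynamically chosen column can also be expressed procedurally. Here is a plain-Python sketch of what COUNTIFS plus OFFSET/MATCH computes (the names and data are made up for illustration):

```python
# Header row (C1:I1) and data rows: Type in column B, results in C..I
headers = ["Taylor", "Reynolds", "Smith"]  # hypothetical names
rows = [
    # (Type, Taylor, Reynolds, Smith)
    ("r", "a", "b", "a"),
    ("r", "a", "a", "b"),
    ("s", "a", "b", "b"),
    ("r", "b", "a", "a"),
    ("r", "a", "b", "b"),
]

def countifs_dynamic(rows, headers, person, type_value, result_value):
    # MATCH($K2, $C$1:$I$1, 0) -> position of the person's column
    col = headers.index(person)
    # COUNTIFS -> count rows satisfying both conditions at once
    return sum(1 for r in rows
               if r[0] == type_value and r[1 + col] == result_value)

print(countifs_dynamic(rows, headers, "Taylor", "r", "a"))  # -> 3
```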

Closed loop does not work in a Talend Job

I have a Talend Job, where somehow a closed loop is formed by the components. Image is as follows:
The schemas of both tMap outputs are the same. After connecting one tMap to tUnite, when I try to connect the second tMap, it does not connect.
I have heard that Talend does not allow a closed loop in a Job. Is that true? If yes, why?
Someone had a similar question here, but found no answers.
Talend actually creates a Java program; essentially that is the reason for the limitation you've encountered.
tUnite takes all the data provided by each of its inputs in turn, i.e. all of A, then all of B, then all of C.
It cannot take row 1 from A, then row 1 from B, then row 1 from C, then row 2 from A, then row 2 from B, etc., because of the nature of the programming loops generated for each flow.
However, tMap multiple outputs or tReplicate do create row 1 to A then row 1 to B then row 1 to C then row 2 to A then row 2 to B etc..
This is why you cannot split and then rejoin flows.
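The loop shapes described above can be sketched schematically (plain Python, purely illustrative, not generated Talend code). A split followed by a rejoin would require the fan-out loop to pause mid-row while the merge drains each branch completely, which straight-line generated code cannot do:

```python
def unite(*flows):
    # tUnite-style merge: one loop per input flow,
    # each consumed completely before the next starts.
    out = []
    for flow in flows:          # all of A, then all of B, ...
        for row in flow:
            out.append(row)
    return out

def replicate(flow, n_outputs):
    # tReplicate / multi-output tMap: fans out row by row.
    outputs = [[] for _ in range(n_outputs)]
    for row in flow:            # row 1 to A, row 1 to B, then row 2 ...
        for out in outputs:
            out.append(row)
    return outputs

print(unite([1, 2], [3, 4]))    # [1, 2, 3, 4]
print(replicate([1, 2], 2))     # [[1, 2], [1, 2]]
```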
PreetyK has explained the why. I'll explain how to work around this limitation.
You can store the output from tMap_10 and tMap_11 in a tHashOutput each. On the 2nd tHashOutput you must check the "Link with a tHashOutput" checkbox and then select the other tHashOutput from the drop-down list. This tells it to write to the same buffer as the 1st tHashOutput, effectively making a union of your tMap_10 and tMap_11 outputs.
On the next subjob, you use a tHashInput to read from your tHashOutput (a single tHashInput suffices, as the two outputs share the same buffer).
Note that by default these components are hidden. You have to go to File > Project Settings > Designer > Palette settings, and then move them from the left pane to the right pane. You will then find them in your palette.

Macro that will copy only cells with data and paste in a different worksheet

I need to create a macro that will copy the cells from A2 to O2 in the worksheet DL, continuing down for a varying number of rows (it depends on the month). I need this pasted into the worksheet Efficiency starting at A2:O2.
Because the number of rows of data changes every time I create the report, I'm running into issues creating an effective macro.
Also, some of the rows don't have information in every column, but I still want the blank cells to be copied in this case. Basically, if there is data in column A, I want the rest of the row through column O to be copied as well.
Do any of those suggestions help you out?
Also, I'm a bit confused about what you are referring to when you write "rows". A row has a number (1 to 1048576 in Office 2010 and up) and a column has a letter (A to XFD in Office 2010 and up). So what do you mean by "rows A2 to O2"?
EDIT: Also, to copy from A to O every time, you simply set the range from A to O. For example, Range("A2:O10") includes every cell from A2 to O10. Since you set the range with simple strings, you can pick the row numbers with the help of one of the suggestions in the link I gave you.
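Putting the pieces together, a minimal sketch of such a macro might look like this. It assumes the sheet names DL and Efficiency from the question and that column A is populated for every data row; it is untested, so adjust it to your workbook:

```vba
Sub CopyDLToEfficiency()
    Dim lastRow As Long
    ' Find the last row with data in column A of sheet DL
    lastRow = Worksheets("DL").Cells(Worksheets("DL").Rows.Count, "A").End(xlUp).Row
    ' Copy A2:O<lastRow>; blank cells inside the range are copied too
    Worksheets("DL").Range("A2:O" & lastRow).Copy _
        Destination:=Worksheets("Efficiency").Range("A2")
End Sub
```

Because the last row is derived from column A at run time, the macro adapts to the varying number of rows each month.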