Find a word inside text file using Pentaho Kettle/Spoon/PDI - tsql

I am creating a data comparison/verification script using SQL and Spoon PDI. We're moving data between two servers, and to make sure we've got all the data, we have SQL queries that show a date and the quantity of rows transferred.
Example:
Serv1: 20150522 | 100
Serv2: 20150522 | 100
The script will then try to union these values, and if it fails we'll get a fail email. However, we wish to change this setup to write the outcome to a text file, and based on that text file send either a pass or fail email.
The idea behind this is that we have multiple tables we're comparing, so we wish to write the outcomes of all eight comparisons to a text file and, based on the final text file, send the outcome - rather than spamming our email inbox if multiple steps fail.
The format of the text file we wish to have is either match -> send email or mismatch [step-name] [date] -> send email.
Usually I wouldn't ask a question without having tried something first, but I've searched everywhere on Google and tried the knowledge I currently have, and nothing is going the way I wish it to. I believe this is due to the logic I am using.
I am not asking for a solution to this, or for someone to do it for me. I am simply asking for guidance along the correct path.

I would do this in a transformation with a step for each union, where each step outputs the comparison name and its result. This would give you a data set at the end that looks something like this:
comparison_name | result
Union A | true
Union B | false
Union C | true
You would then be able to output those results to a text file in another step to get your result file, which is sent out regardless of whether the job passed or failed.
Lastly you would loop through the result rows in the stream: if all are true, you could use an email step to send out a "pass" email, and if any one is false, send out a "fail" email.
EDIT:
To get the date of the pass or fail you could either get the date from each individual union query result by adding it to the query like so:
SELECT CURRENT_DATE
Or you could use the Get System Info step in Spoon, which has multiple ways of injecting the current date into the data stream (system date fixed, start date range of the transformation, today 00:00:00, etc.).
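For illustration, a single comparison step's query might look something like the sketch below. The table names (Serv1Counts, Serv2Counts) are placeholders for whatever your per-server count queries actually produce, and the date function depends on your database dialect.
-- Hypothetical sketch of one comparison producing a name, date, and result
SELECT 'Union A' AS comparison_name,
       CURRENT_DATE AS check_date,
       CASE WHEN s1.row_count = s2.row_count THEN 'true' ELSE 'false' END AS result
FROM Serv1Counts s1
JOIN Serv2Counts s2 ON s1.load_date = s2.load_date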


In SSRS, can you group multiple parameter values into one?

I am relatively new to SSRS but have been working with SQL for years. I have been tasked with creating a report that reflects shipped items based on their status. For example, I have x number of items with varying statuses including "IN_TRANSIT", "RECEIVING", "SHIPPED", "WORKING", and "CLOSED". The requestor is asking if I can provide the following options in a report drop down:
"IN_PROCESS" Status filter including all statuses except "CLOSED".
"CLOSED".
Essentially, they want to be able to view all non-closed statuses, closed statuses, or all. Right now, I have it set so you can individually select all statuses, essentially getting them the data they want, just not with the "right" parameters.
My question is, does SSRS provide a way to essentially 'group' the non-closed statuses into one inside the report, so that when they select "IN_PROCESS" it sends those non-closed statuses to the SQL query I have built in? The problem with using SQL for this is that the dataset I created to generate the dropdown options provides "CLOSED" and "IN_PROCESS" as its output options, but when they select "IN_PROCESS" (sending that value to the filter in the report), since it's not an actual status, nothing comes back.
If more information or clarification is required, please let me know.
Thanks ahead of time!
You can create a new column in your SQL query and use a CASE statement to assign the value IN_PROCESS or CLOSED for the applicable status. Then you will just need to set the filtering condition to match the SSRS parameter to the new column.
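As a rough sketch of that idea (MyTable is a placeholder; the status values are the ones from the question):
SELECT t.*,
       CASE WHEN t.Status = 'CLOSED' THEN 'CLOSED'
            ELSE 'IN_PROCESS'
       END AS StatusGroup
FROM MyTable t
The report's filter (or a WHERE clause) then compares the parameter value against the new StatusGroup column.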
How often this case is likely to be reused should help determine how to approach it. If it sounds like it might become a regular process ("Oh, can we have another report with the same filter but showing xyz?"), then take the time to set it up correctly and it will save time in the future.
Personally I would add a database table, if possible, that contains the status names and then a status group name (ignoring full normalisation for the sake of simplicity here).
CREATE TABLE StatusGroups(Status varchar(10), StatusGroup varchar(10))
INSERT INTO StatusGroups VALUES
('IN_TRANSIT', 'In Process'),('RECEIVING', 'In Process'),('SHIPPED', 'In Process'),('WORKING', 'In Process'),('CLOSED', 'Closed')
Then a simple view
CREATE VIEW MyNewView AS
SELECT t.*, g.StatusGroup
FROM MyTable t
JOIN StatusGroups g on t.STATUS = g.Status
Now change your report dataset query to use this view passing in the report parameter like this...
SELECT *
FROM MyNewView
WHERE StatusGroup = @myReportParameter
Your dataset for your report parameter's available values list could then be something like
SELECT DISTINCT StatusGroup FROM StatusGroups
This way, if you ever add more status or status group values, you can add an entry to this table and everything will work without you ever having to edit your report.

In a data flow task, how do I restrict rows flowing using a value from another source?

I have an excel sheet with many tabs. Say one is called wsMain and the other is called wsDate.
In my data flow transformation I am able to successfully load the data from wsMain to my table.
Now I have to update this transformation: I have to fetch the maximum date from the worksheet wsDate and only load data from wsMain where the date is less than or equal to the maximum date in wsDate (that is the only column available).
So far I have figured out that I need to create a new Excel connection manager to read the data from wsDate, and I have used the Aggregate transformation to get the maximum date.
Now the question is how do I use this date to restrict the rows coming from wsMain?
I understand from the link below that you can store the value in a variable, but what do I do next?
SSIS set result set from data flow to variable
I have tried using a merge join but not sure if I am doing it right.
Here is what it looks like now:
I could not achieve the above, but I would be interested to know if it is possible. As a workaround I have created a separate data flow where I have stored the value in a variable and then used the variable in a conditional split to filter the required rows:
Here is a step by step guide I followed to write the variable:
https://www.proteanit.com/2008/12/11/ssis-writing-to-a-package-variable-in-a-dataflow/
You can obtain the maximum value of the wsDate column first, then use this as a filter to avoid introducing unnecessary records into the data flow that would only be discarded by the Conditional Split. An overview of this process is below. I'd also recommend confirming the data types of all columns involved.
Create an SSIS DateTime variable and name this something descriptive such as MaxDate.
Create a Data Flow Task before the current one with an Excel Source component. Use the SQL command option for the Data Access Mode and enter a SQL statement to return the max value of the wsDate column. In the following example, ExcelSource is the name of the sheet that you're pulling from. I'd suggest confirming the query with the Preview button on the Excel Source as well.
Add a Script Component (not Task) after the Excel Source. Add the MaxDate variable in the ReadWriteVariables field on the main page of the Script Component. On the Inputs and Outputs pane, add the output column from the Excel Source as an Input Column with the ReadOnly Usage Type. Example C# code for this is below. Note that variables can only be written to in the PostExecute method. The Input0_ProcessInputRow method is called once for each row that passes through, though there will only be a single row in this case. In the following code, MaxExcelDate is the name of the output column from the Excel Source.
On the Excel Source component in the Data Flow Task where the records are imported from Excel, change the Data Access Mode to SQL command and enter a SQL statement to return records that have a date less than or equal to the maximum wsDate value. This is the last example below, and the ? is a placeholder for the parameter. After entering this SQL, click the Parameters button and select Parameter0 for the Parameters field, the MaxDate variable for the Variables field, and a direction of Input. The Conditional Split can then be removed, since the unwanted records will now be filtered out.
Excel MAX wsDate SELECT:
SELECT MAX(wsDate) AS MaxExcelDate FROM ExcelSource
C# Script Component:
// Holds the max date read from the single row passing through.
DateTime maxDate;

public override void PostExecute()
{
    base.PostExecute();
    // Variables can only be written to in PostExecute.
    Variables.MaxDate = maxDate;
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Called once per row; only the single MAX row passes through here.
    maxDate = Row.MaxExcelDate;
}
Excel Command with Date Filter:
SELECT
Column1,
Column2,
Column3
FROM ExcelSheet
WHERE DateColumn <= ?
Yes, it is possible. In the data flow, you will need to determine the max date, which you already have. Next, you will need to MERGE JOIN the two data flows on the date column. From there, you will feed it into a CONDITIONAL SPLIT and split where the date columns match [i.e., !ISNULL()] versus do not match [i.e., ISNULL()]. In your case, you only want the matches. The non-matches will be disregarded.
Note: if you use an INNER JOIN on the MERGE JOIN where there is only one date (i.e., MaxDate) to join on, then this will take care of the row filtering for you. You will not need a CONDITIONAL SPLIT.
Welcome to ETL.
Update
It is a real pain that SSIS's MERGE JOINs only perform joins on EQUAL operations as opposed to LESS THAN and GREATER THAN operations. You will need to separate the data flows.
Use a Script Component to scan the Excel file for the MAX date and assign that value to a package variable in SSIS. Alternatively, you can have a dates table in SQL Server and then use an Execute SQL Task in SSIS to retrieve the MAX date from the table and assign that value to a package variable (see the sketch after these notes).
Modify your existing data flow to remove the reading of the Excel date file completely. Then add a DERIVED COLUMN transformation with a new column that is mapped to the SSIS package variable storing the MAX date. You can name the derived column 'MaxDate'.
Add a conditional split transformation with the following CONDITION logic: [AsOfDt] <= [MaxDate]
Set the Output Name to Insert Records
Note: The CONDITIONAL SPLIT creates a new output data flow with restricted/filtered rows. It does not create a new column within the existing data flow. Think of this as a transposition of data flow output from column modification to row modification. Only those rows that match the condition will be sent to the output that you desire. I assume you only want to Insert these records, so I named it that. You can choose whatever naming convention you prefer
Note 2: Sorry for not making the Update my original answer - I haven't used the AGGREGATE transformation before so I was not aware that it restricts row output as opposed to reading a value in the data flow and then assigning it to a variable. That would be a terrific transformation for Microsoft to add to SSIS. It appears that the ROWCOUNT and SCRIPT COMPONENT transformations are the only ones that have the ability to set a package variable value within the data flow.
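For the dates-table alternative mentioned in the first step, the Execute SQL Task's query could be as simple as the sketch below; dbo.DatesTable and DateValue are assumed names, and the single-row result would be mapped to the MaxDate package variable via the task's Result Set settings.
-- Hypothetical dates table; adjust names to your schema
SELECT MAX(DateValue) AS MaxDate
FROM dbo.DatesTable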

jaspersoft print specific records from csv file

I have a Label set up in Jaspersoft Studio that references a data adapter file in CSV format. The CSV file contains thousands of records. I want the end user to be able to select or key in the "order no" for the specific records to print. If one order no is entered, that record is found and printed. If ten order nos are entered, 10 records will print.
Thank You.
You could use parameters to let the user provide input. But because CSV is not a query language, you use a Filter Expression on the data source instead.
Here's how to add a parameter:
https://community.jaspersoft.com/wiki/using-report-parameters
And this is how to use a Filter Expression:
https://community.jaspersoft.com/wiki/how-apply-parameters-csv-data-source
Detail: Filter by attribute:
In the query dialog, you select the Filter Expression tab and fill the field with code like this:
$F{order_no}.equals($P{order_no})
This code will filter to the CSV rows whose order_no field equals the order_no parameter.
Filter by multiple order_no:
JasperReports allows users to do some scripting in Java or Groovy (depending on their settings), so you can do more complicated tasks, like splitting the input into an array and using it to search rows.
What I have in mind is to ask the end user to separate order numbers with spaces and use this script to filter the data:
Arrays.asList($P{order_no}.split(" ")).indexOf($F{order_no}) > -1
I have not tested this code yet, but hopefully you get the idea. (Experiment with the script).

How to get all missing days between two dates

I will try to explain the problem on an abstract level first:
I have X amount of data as input, which is always going to have a DATE field. Before, the dates that came as input (after some processing) were put in a table as output. Now I am asked to output both the input dates and every date between the minimum date received and one year from that moment. If there was originally no input for some day between these two dates, all fields must come out as 0, or equivalent.
Example: I have two inputs, one with '18/03/2017' and the other with '18/03/2018'. I now need to create output data for all the missing dates between '18/03/2017' and '18/03/2018'. So, output '19/03/2017' with every field set to 0, and the same for the 20th and 21st and so on.
I know how to do this programmatically, but in PowerCenter I do not. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimum date, day0. Then, with an aggregator, create 365 fields holding day0+1, day0+2, and so on, to create an artificial year.
After that we do several transformations, like sorting the dates and a union between them, to get the data ready for a joiner. The idea of the joiner is to do a full outer join between the original data and the data (from the previous aggregator) that is going to have all its fields set to 0.
Then a router picks, with one of its groups, the data that had actual dates (and fields without nulls), and with another group the rows where all fields are null; those fields are then given a 0 before finally being written to a table.
I am wondering how this can be achieved while, for starters, removing the need to add 365 days to a date. If I were to do this same process for 10 years instead of one, the task gets ridiculous really quickly.
I was wondering about an XOR type of operation, or some other function, that would cut the number of steps needed for what I (maybe wrongly) feel is a simple task. Currently I need 5 steps just to know which dates are missing between two dates, a minimum and one year from that point.
I have tried to be as clear as possible, but if I failed at any point please let me know!
I'm not sure what the aggregator is supposed to do?
The same with the 'full outer' join? A normal join on a constant port is fine :)
Can you calculate the needed number of 'duplicates' before the 'joiner'? In that case a lookup configured to return 'all rows' and a less-than-or-equal predicate can help make the mapping much more readable.
In any case you will need a helper table (or file) with a sequence of numbers between 1 and the number of potential duplicates (or more).
I use our time dimension in the warehouse, which has one row per day from 1753-01-01 through the next 200,000 days, and a primary integer column with values from 1 and up.
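In SQL terms, that lookup amounts to something like the sketch below; time_dimension, calendar_date, and input_table are assumed names, and the date arithmetic syntax varies by database.
-- One output row per calendar day in the wanted range
SELECT d.calendar_date
FROM time_dimension d
WHERE d.calendar_date >= (SELECT MIN(datefield) FROM input_table)
  AND d.calendar_date <= (SELECT MIN(datefield) FROM input_table) + 365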
You've identified that you know how to do this programmatically, and to be fair this problem is more suited to that sort of solution... but that doesn't exclude PowerCenter by any means: just feed the 2 dates into a Java transformation and apply some code to produce all dates between them, outputting a record for each. The Java transformation is ideal for record generation.
OK... so you could override your source qualifier to achieve this in the selection query itself (I'm giving an Oracle-based example as it's what I'm used to, and I'm assuming your input data is from a table). I looked up the CONNECT BY syntax here:
SQL to generate a list of numbers from 1 to 100
SELECT MIN(tablea.DATEFIELD) + levquery.n - 1 AS Port1
FROM tablea,
     (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365) levquery
GROUP BY levquery.n
(Check if the query works for you - I don't have access to a PC to test it at the minute.)
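To the 10-year concern above: only the LEVEL bound changes, e.g. roughly ten years of days (equally untested):
SELECT MIN(tablea.DATEFIELD) + levquery.n - 1 AS Port1
FROM tablea,
     (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 3653) levquery
GROUP BY levquery.n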

Changing text in CSV file using Perl

I know nothing about Perl. I have looked at some online tutorials and am at a loss for the following.
I do a query in PostgreSQL that saves to a CSV file. However, one element needs to be changed after the CSV file is created, and I have no idea how to do it.
The existing query results are like this
phone date time staff email and customer ID -- my explanation
1112223333,10/21/2013,3:00 AM,sklund@myemail.comSMIB010170 -- data in csv
After query is completed, the data in the time field must be converted to:
1112223333,10/21/2013,03:00am,sklund@myemail.comSMIB010170
As you can see, the time needs to be amended to include a leading 0 if the hour is less than ten, and the AM must be changed to am.
Is there a simple Perl script that can do this? The lines of data, of course, will be different, as each line would reflect results of the query for the day.
If someone can point me to a tutorial, link, or help in this I'd be very grateful.
This will do what you need.
I assume you want the space before AM removed as well? You don't mention it in your question.
# Zero-pad the hour to two digits, lowercase AM/PM, and drop the space before it:
perl -pe 's/,(\d{1,2}):(\d\d)\s+([AP]M),/sprintf ",%02d:%02d%s,",$1,$2,lc $3/ei' mylogfile > newlogfile