Handling and assigning the date into respective categories - datastage

I have an input file like below and trying to convert the multiple customer records into respective quarters and also record per customer. Once the quarter (like Q2 2019) is derived from the data, now the latest one should goto TimeFrame4 and old ones to 3,2,1 order.
So far was able to derive the quarters using the transformer, but after that I was stuck in how to identify and assign them to respective buckets (TimeFrame1 TimeFrame2 TimeFrame3 TimeFrame4). Any ideas on how to implement this effectively (input has 50M records) in DataStage (11.3 Parallel job).
Input:
CustID Contacted_Time
1 2018-12-25
1 2019-06-15
1 2019-01-03
2 2019-02-24
2 2019-03-05
I need desired Output like below:
CustID TimeFrame1 TimeFrame2 TimeFrame3 TimeFrame4
1 null Q4 2018 Q1 2019 Q2 2019
2 null null null Q1 2019

You could sort the data by CustId and Contacted_Time desc, filter the data (because I assume there could be more than 4) to four contacts and once you got the quarters assign a helper column with the number (also in transformer).
Finally a Pivot stage can do the verticalization or you could do this in a transformer as well i.e. with a loop.

Related

Select string FROM in Grafana InfluxDB

I have some sales data in InfluxDB, basically every 10 seconds I record the total number of sales from SQL, then break that down into sales of different items in the fields.
So a row might look like:
time Bicycles Shovels Hats Candles Sales
01/01/2019 12 6 2 4 24
02/01/2019 14 9 3 5 31
(This is dummy data for this question and doesn't reflect the number in the images, which are from production tests I've been doing with actual data)
It's important to stress this is accumulative, each entry grows by the number of sales.
I can then use Grafana to show me the difference(max(sales)) group by time(1d) to show the total sales (per day in this case)
What I am also trying to do is show the most sales per group (day) in a table, a record sales table, how well we have done in the past.
So I have 3 queries in Grafana:
SELECT time, max(difference) FROM (
SELECT difference(max("Sales"))
FROM "autogen"."Paid_Orders"
WHERE $timeFilter
GROUP BY time(1d) fill(none)
)
Swap "sales" for bicycles and hats (whatever) in the subsequent queries and I end up with a table that looks like:
Problem is I don't know which line is which field. Normally (in SQL) I would do something like SELECT 'Sales' as Type, etc... to add in a column with "Sales" in it. But I can't see that as an option in either Grafana or Influx.
Am I recording the data wrong and could I be leveraging Tags for this?
How can I identify the rows in this table?
yes, your data structure doesn't look good. Instead of one record:
time Bicycles Shovels Hats Candles Sales
01/01/2019 12 6 2 4 24
save multiple records with different item tag (sales is int field):
time item sales
01/01/2019 Bicycles 12
01/01/2019 Shovels 6
01/01/2019 Hats 2
01/01/2019 Candles 4
Then group by time AND item tag + SPREAD function can be used to remove subquery, e.g.
SELECT time, SPREAD(sales)
FROM "autogen"."Paid_Orders"
WHERE $timeFilter
GROUP BY time(1d), "item" fill(none)

How to dynamically pivot based on rows data and parameter value?

I am trying to pivot using crosstab function and unable to achieve for the requirement. Is there is a way to perform crosstab dynamically and also dynamic result set?
I have tried using crosstab built-in function and unable to meet my requirement.
select * from crosstab ('select item,cd, type, parts, part, cnt
from item
order by 1,2')
AS results (item text,cd text, SUM NUMERIC, AVG NUMERIC);
Sample Data:
ITEM CD TYPE PARTS PART CNT
Item 1 A AVG 4 1 10
Item 1 B AVG 4 2 20
Item 1 C AVG 4 3 30
Item 1 D AVG 4 4 40
Item 1 A SUM 4 1 10
Item 1 B SUM 4 2 20
Item 1 C SUM 4 3 30
Item 1 D SUM 4 4 40
Expected Results:
ITEM CD PARTS TYPE_1 CNT_1 TYPE_1 CNT_1 TYPE_2 CNT_2 TYPE_2 CNT_2 TYPE_3 CNT_3 TYPE_3 CNT_3 TYPE_4 CNT_4 TYPE_4 CNT_4
Item 1 A 4 AVG 10 SUM 10 AVG 20 SUM 20 AVG 30 SUM 30 AVG 40 SUM 40
The PARTS value is based on a parameter passed by the user. If the user passes 2 for example, there will be 4 rows in the result set (2 parts for AVG and 2 parts of SUM).
Can I achieve this requirement using CROSSTAB function or is there a custom SQL statement that need to be developed?
I'm not following your data, so I can't offer examples based on it. But I have been looking at pivot/cross-tab features over the past few days. I was just looking at dynamic cross tabs just before seeing your post. I'm hoping that your question gets some good answers, I'll start off with a bit of background.
You can use the crosstab extension for standard cross tabs, what when wrong when you tried it? Here's an example I wrote for myself the other day with a bunch of comments and aliases for clarity. The pivot is looking at item scans to see where the scans were "to", like the warehouse or the floor.
/* Basic cross-tab example for crosstab (text) format of pivot command.
Notice that the embedded query has to return three columns, see the aliases.
#1 is the row label, it shows up in the output.
#2 is the category, what determines how many columns there are. *You have to work this out in advance to declare them in the return.*
#3 is the cell data, what goes in the cross tabs. Note that this form of the crosstab command may return NULL, and coalesce does not work.
To get rid of the null count/sums/whatever, you need crosstab (text, text).
*/
select *
from crosstab ('select
specialty_name as row_label,
scanned_to as column_splitter,
count(num_inst)::numeric as cell_data
from scan_table
group by 1,2
order by 1,2')
as scan_pivot (
row_label citext,
"Assembly" numeric,
"Warehouse" numeric,
"Floor" numeric,
"QA" numeric);
As a manual alternative, you can use a series of FILTER statements. Here's an example that summaries errors_log records by day of the week. The "down" is the error name, the "across" (columns) are the days of the week.
select "error_name",
count(*) as "Overall",
count(*) filter (where extract(dow from "updated_dts") = 0) as "Sun",
count(*) filter (where extract(dow from "updated_dts") = 1) as "Mon",
count(*) filter (where extract(dow from "updated_dts") = 2) as "Tue",
count(*) filter (where extract(dow from "updated_dts") = 3) as "Wed",
count(*) filter (where extract(dow from "updated_dts") = 4) as "Thu",
count(*) filter (where extract(dow from "updated_dts") = 5) as "Fri",
count(*) filter (where extract(dow from "updated_dts") = 6) as "Sat"
from error_log
where "error_name" is not null
group by "error_name"
order by 1;
You can do the same thing with CASE, but FILTER is easier to write.
It looks like you want something basic, maybe the FILTER solution appeals? It's easier to read than calls to crosstab(), since that was giving you trouble.
FILTER may be slower than crosstab. Probably. (The crosstab extension is written in C, and I'm not sure how smart FILTER is about reading off indexes.) But I'm not sure as I haven't tested it out yet. (It's on my to do list, but I haven't had time yet.) I'd be super interested if anyone can offer results. We're on 11.4.
I wrote a client-side tool to build FILTER-based pivots over the past few days. You have to supply the down and across fields, an aggregate formula and the tool spits out the SQL. With support for coalesce for folks who don't want NULL, ROLLUP, TABLESAMPLE, view creation, and some other stuff. It was a fun project. Why go to that effort? (Apart from the fun part.) Because I haven't found a way to do dynamic pivots that I actually understand. I love this quote:
"Dynamic crosstab queries in Postgres has been asked many times on SO all involving advanced level functions/types. Consider building your needed query in application layer (Java, Python, PHP, etc.) and pass it in a Postgres connected query call. Recall SQL is a special-purpose, declarative type while app layers are general-purpose, imperative types." – Parfait
So, I wrote a tool to pre-calculate and declare the output columns. But I'm still curious about dynamic options in SQL. If that's of interest to you, have a look at these two items:
https://postgresql.verite.pro/blog/2018/06/19/crosstab-pivot.html
Flatten aggregated key/value pairs from a JSONB field?
Deep magic in both.

pentaho distinct count over date

I am currently working on Pentaho and I have the following problem:
I want to get a "rooling distinct count on a value, which ignores the "group by" performed by Business Analytics. For instance:
Date Field
2013-01-01 A
2013-02-05 B
2013-02-06 A
2013-02-07 A
2013-03-02 C
2013-04-03 B
When I use a classical "distinct count" aggregator in my schema, sum it, and then add "month" to column, I get:
Month Count Sum
2013-01 1 1
2013-02 2 3
2013-03 1 4
2013-04 1 5
What I would like to get would be:
Month Sum
2013-01 1
2013-02 2
2013-03 3
2013-04 3
which is the distinct count of all Fields so far. Does anyone has any idea on this topic?
my database is in Postgre, and I'm looking for any solution under PDI, PSW, PBA or PME.
Thank you!
A naive approach in PDI is the following:
Sort the rows by the Field column
Add a sequence for changing values in the Field column
Map all sequence values > 1 to zero
These first 3 effectively flag the first time a value was seen (no matter the date).
Sort the rows by year/month
Sum the mapped sequence values by year+month
Get a Cumulative Sum of all the previous sums
These 3 aggregate the distinct values per month, then keep a cumulative sum. In PDI this might look something like:
I posted a Gist of this transformation here.
A more efficient solution is to parallelize the two sorts, then join at the latest point possible. I posted this one as it is easier to explain, but it shouldn't be too difficult to take this transformation and make it more parallel.

SQL Sum and Group By for a running Tally?

I'm completely rewriting my question to simplify it. Sorry if you read the prior version. (The previous version of this question included a very complex query example that created a distraction from what I really need.) I'm using SQL Express.
I have a table of lessons.
LessonID StudentID StudentName LengthInMinutes
1 1 Chuck 120
2 2 George 60
3 2 George 30
4 1 Chuck 60
5 1 Chuck 10
These would be ordered by date. (Of course the actual table is thousands of records with dates and other lesson-related data but this is a simplification.)
I need to query this table such that I get all rows (or a subset of rows by a date range or by student), but I need my query to add a new column we might call PriorLessonMinutes. That is, the sum of all minutes of all lessons for the same student in lessons of PRIOR dates only.
So the query would return:
LessonID StudentID StudentName LengthInMinutes PriorLessonMinutes
1 1 Chuck 120 0
2 2 George 60 0
3 2 George 30 60 (The sum Length from row 2 only)
4 1 Chuck 60 120 (The sum Length from row 1 only)
5 1 Chuck 10 180 (The sum of Length from rows 1 and 4)
In essence, I need a running tally of the sum of prior lesson minutes for each student. Ideally the tally shouldn't include the current row, but if it does, no big deal as I can do subtraction in the code that receives the query.
Further, (and this is important) if I retrieve only a subset of records, (for example by a date range) PriorLessonMinutes must be a sum that considers rows that are NOT returned.
My first idea was to use SUM() and to GROUP BY Student, but that isn't right because unless I'm mistaken it would include a sum of minutes for all rows for each student, including rows that come after the row which aren't relevant to the sum I need.
OPTIONS I'M REJECTING: I could scan through all rows in my code that receives it, (although this would force me to retrieve all rows unnecessarily) but that's obviously inefficient. I could also put a real data field in there and populate it, but this too presents problems when other records are deleted or altered.
I have no idea how to write such a query together. Any guidance?
This is a great opportunity to use Windowed Aggregates. The trick is that you need SQL Server 2012 Express. If you can get it, then this is the query you are looking for:
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
Note that it returns NULLs instead of 0s (zeroes). If you insist on zeroes, use COALESCE function to turn NULLs into zeroes.
I suggest using a nested query to limit the number of rows returned:
select * from
(
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
) as NestedLessons
where LessonId > 3 -- this is an example of a filter
This way the filter is applied after the aggregation is complete.
Now, if you want to apply a filter that doesn't affect the aggregation (like only querying data for a certain student), you should apply the filter to the inner query, as pruning the rows that don't affect the computation early (like data for other students) will improve the performance.
I feel the following code will serve your purpose.Check it:-
select Students.StudentID ,Students.First, Students.Last,sum(Lessons.LengthInMinutes)
as TotalPriorMinutes from lessons,students
where Lessons.StartDateTime < getdate()
and Lessons.StudentID = Students.StudentID
and StartDateTime >= '20090130 00:00:00' and StartDateTime < '20790101 00:00:00'
group by Students.StudentID ,Students.First, Students.Last

Calculating change in leaders for baseball stats in MSSQL

Imagine I have a MSSQL 2005 table(bbstats) that updates weekly showing
various cumulative categories of baseball accomplishments for a team
week 1
Player H SO HR
Sammy 7 11 2
Ted 14 3 0
Arthur 2 15 0
Zach 9 14 3
week 2
Player H SO HR
Sammy 12 16 4
Ted 21 7 1
Arthur 3 18 0
Zach 12 18 3
I wish to highlight textually where there has been a change in leader for each category
so after week 2 there would be nothing to report on hits(H); Zach has joined Arthur with most strikeouts(SO) at
18; and Sammy is new leader in homeruns(HR) with 4
So I would want to set up a process something like
a) save the past data(week 1) as table bbstatsPrior,
b) updates the bbstats for the new results - I do not need assistance with this
c) compare between the tables for the player(s with ties) with max value for each column
and spits out only where they differ
d) move onto next column and repeat
In any real world example there would be significantly more columns to calculate for
Thanks
Responding to Brents comments, I am really after any changes in the leaders for each category
So I would have something like
select top 1 with ties player
from bbstatsPrior
order by H desc
and
select top 1 with ties player,H
from bbstats
order by H desc
I then want to compare the player from each query (do I need to do temp tables) . If they differ I want to output the second select statement. For the H category Ted is leader `from both tables but for other categories there are changes between the weeks
I can then loop through the columns using
select name from sys.all_columns sc
where sc.object_id=object_id('bbstats') and name <>'player'
If the number of stats doesn't change often, you could easily just write a single query to get this data. Join bbStats to bbStatsPrior where bbstatsprior.week < bbstats.week and bbstats.week=#weekNumber. Then just do a simple comparison between bbstats.Hits to bbstatsPrior.Hits to get your difference.
If the stats change often, you could use dynamic SQL to do this for all columns that match a certain pattern or are in a list of columns based on sys.columns for that table?
You could add a column for each stat column to designate the leader using a correlated subquery to find the max value for that column and see if it's equal to the current record.
This might get you started, but I'd recommend posting what you currently have to achieve this and the community can help you from there.