How to make a function in IBM DB2 to sum fields 01 to 06 in month 5, 01 to 07 in month 6, and so on? - db2

We have a database that stores our clients' bank accounts like this:
|client|c01|c02|c03|c04|d01|d02|d03|d04|
|a |3€ |5€ |4€ |0€ |-2€|-1€|-4€| 0€|
This is the structure of the table in month 3. In month 4 it would be:
|client|c01|c02|c03|c04|c05|d01|d02|d03|d04|d05|
|a |3€ |5€ |4€ |2€ |0€ |-2€|-1€|-4€|-2€| 0€|
Note the new c05 and d05 columns: the database automatically adds them each month.
Because the set of columns keeps changing, I can't easily get the sum of c01, c02, c03, c04, d01, d02, d03, d04. I was thinking of writing a function that checks the current month and loops over the columns so it can select and sum them without errors.
If you have a better way to do it, suggestions are welcome.
But the main question is: how do I write a function that can sum a variable number of columns?
thanks

There's something about adding a column every month that bugs me. I know it can be a valid OLAP strategy as selective de-normalization, but it just feels weird here. Usually with these kinds of things the entire width is specified up front, if for no other reason than to avoid ALTER TABLE statements. I don't know enough about what you're storing there to give a recommendation otherwise, but I guess I just prefer fully normalized structures - for example, something like the sketch below.
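For illustration only, a fully normalized layout could look something like this (table and column names are invented, not from your schema); one row per client per month, so no columns ever need to be added:

-- Hypothetical normalized alternative: one row per client per month.
CREATE TABLE ACCOUNT_MOVEMENTS (
    CLIENT    VARCHAR(20)   NOT NULL,
    MONTH_NUM SMALLINT      NOT NULL,            -- 1..12, or use a DATE if you prefer
    CREDIT    DECIMAL(15,2) NOT NULL DEFAULT 0,  -- the cNN amount for that month
    DEBIT     DECIMAL(15,2) NOT NULL DEFAULT 0,  -- the dNN amount for that month
    PRIMARY KEY (CLIENT, MONTH_NUM)
);

-- Summing up to the current month is then a plain GROUP BY, with no dynamic SQL:
SELECT CLIENT, SUM(CREDIT + DEBIT) AS TOTAL
FROM   ACCOUNT_MOVEMENTS
WHERE  MONTH_NUM <= MONTH(CURRENT DATE)
GROUP BY CLIENT;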
Having done similar things myself, I'd say your best bet is dynamic SQL. You can place it inside your stored procedure, PREPAREing and EXECUTEing as usual - a rough sketch follows.
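To make that concrete, here is a rough, untested sketch of a SQL PL procedure that builds the column list for a given number of months and sums it through a dynamically prepared cursor. The table name ACCOUNTS and the procedure name are assumptions, not from your post; adjust them to your schema:

-- Hypothetical sketch: sums C01..Cnn and D01..Dnn for a caller-supplied number of months.
CREATE OR REPLACE PROCEDURE SUM_MONTH_COLUMNS (IN p_months INTEGER, OUT p_total DECIMAL(15,2))
LANGUAGE SQL
BEGIN
  DECLARE v_expr VARCHAR(3000) DEFAULT '';
  DECLARE v_sql  VARCHAR(4000);
  DECLARE i      INTEGER DEFAULT 1;
  DECLARE c1 CURSOR FOR s1;                       -- cursor over the dynamically prepared statement

  -- Build "C01 + D01 + C02 + D02 + ..." for the requested number of months
  WHILE i <= p_months DO
    SET v_expr = v_expr ||
                 CASE WHEN i = 1 THEN '' ELSE ' + ' END ||
                 'C' || RIGHT('0' || TRIM(CHAR(i)), 2) ||
                 ' + D' || RIGHT('0' || TRIM(CHAR(i)), 2);
    SET i = i + 1;
  END WHILE;

  SET v_sql = 'SELECT SUM(' || v_expr || ') FROM ACCOUNTS';

  PREPARE s1 FROM v_sql;
  OPEN c1;
  FETCH c1 INTO p_total;
  CLOSE c1;
END

Calling CALL SUM_MONTH_COLUMNS(5, ?) would then sum C01..C05 and D01..D05; you could instead derive the month count from MONTH(CURRENT DATE) inside the procedure if that matches how the columns get added.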

Related

Scala Spark/Databricks: Efficiently load multiple partitions with different schema?

I have data that's partitioned as year/month/day. I want to be able to load an arbitrary date range - a start date and end date, rather than just a particular day/month/year. The data has mildly different schema for different days.
I can load only a single item at a single level - like, "2020", "July 2020" or "July 1st, 2020". This is fast, and with mergeschema = true any schema issues will be handled for me. But, I can't choose to load a particular week or other arbitrary range that goes across partitions.
I can load at the top level with "mergeschema = true", convert the year/month/day fields to a single date column and filter on that column. This can do arbitrary ranges, handles the schema issue but is slow, as it looks at all the data without benefiting from the partitioning. It will also fail if there are schema issues that can't be handled with mergeschema, even if those only exist outside of the range I'm loading. (For instance, if I'm trying to load a week in the middle of July, but there's badly-formatted data in April, it will fail if I try to load and then filter.)
I can programmatically figure out the set of partitions that correspond to the date range in question, load them and union them together. This is fast and only looks at the data it needs to load, but the union call fails if there are schema differences.
I'm on the verge of writing a "MergeSchema" function myself so that I can union different dataframes and add null columns where needed (as would happen if I'd loaded with "mergeschema"), but this feels like a really awkward and difficult solution to what seems like a simple problem.
What's the correct way to handle this? I can't change the sources I'm loading from, they're handled by other teams a long way away from me.
Use brackets or braces in the path patterns:
"2020/07/[1-7]" or
"2020/07/{1,2,3,4...}"
basePath = 's3://some-bucket/year=2020/'
paths = [
    's3://some-bucket/year=2020/month=06/day=2[6-9]',
    's3://some-bucket/year=2020/month=06/day=30',
    's3://some-bucket/year=2020/month=07/day=[1-3]',
]
df = spark.read.option("basePath", basePath).json(paths)

How to get all missing days between two dates

I will try to explain the problem on an abstract level first:
I have X amount of data as input, which will always have a DATE field. Previously, the dates that came as input (after some processing) were put into a table as output. Now I am asked to output both the input dates and every date between the minimum date received and one year from that moment. If there was originally no input for some day between these two dates, all fields must come out as 0, or the equivalent.
Example: I have two inputs, one with '18/03/2017' and the other with '18/03/2018'. I now need to create output data for all the missing dates between '18/03/2017' and '18/04/2017'. So, output '19/03/2017' with every field set to 0, and the same for the 20th, the 21st, and so on.
I know how to do this programmatically, but not in PowerCenter. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimum date, day0. Then, with an aggregator, create 365 fields holding day0+1, day0+2, and so on, to create an artificial year.
After that we do several transformations (sorting the dates, a union between them, etc.) to get the data ready for a joiner. The idea of the joiner is to do a Full Outer Join between the original data and the all-zero data we got from the previous aggregator.
Then a router sends the rows that had actual dates (and no null fields) to one group and the rows where all fields are null to another group; those null fields are then set to 0 and finally written to a table.
I am wondering how this can be achieved while, for starters, removing the need to add 365 days to a date. If I were to do this same process for 10 years instead of one, the task gets ridiculous really quickly.
I was wondering about an XOR type of operation, or some other function that would cut down the number of steps needed for what I (maybe wrongly) feel is a simple task. Currently I need 5 steps just to know which dates are missing between two dates: a minimum and one year from that point.
I have tried to be as clear as possible, but if I failed at any point please let me know!
I'm not sure what the aggregator is supposed to do.
The same goes for the 'full outer' join - a normal join on a constant port is fine :)
Can you calculate the needed number of 'duplicates' before the joiner? In that case a lookup configured to return 'all rows', together with a less-than-or-equal predicate, can make the mapping much more readable.
In any case you will need a helper table (or file) with a sequence of numbers between 1 and the number of potential duplicates (or more).
I use our time dimension in the warehouse, which has one row per day from 1753-01-01 for the next 200,000 days, and a primary integer column with values from 1 and up...
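If you don't have such a helper table yet, a minimal sketch of one, using Oracle syntax to match the source-qualifier suggestion below (the table and column names are made up):

-- Hypothetical helper table: one row per day, keyed by an increasing integer.
CREATE TABLE CALENDAR_HELPER (
    DAY_NUM  NUMBER NOT NULL PRIMARY KEY,
    CAL_DATE DATE   NOT NULL
);

-- Populate roughly ten years of consecutive days from an arbitrary anchor date.
INSERT INTO CALENDAR_HELPER (DAY_NUM, CAL_DATE)
SELECT LEVEL,
       DATE '2017-01-01' + LEVEL - 1
FROM   dual
CONNECT BY LEVEL <= 3650;

Joining your minimum input date against this table with a less-than-or-equal condition then yields one row per day in the range, which is exactly what the lookup needs to return.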
You've identified that you know how to do this programmatically, and to be fair this problem is more suited to that sort of solution... but that doesn't exclude PowerCenter by any means: just feed the two dates into a Java transformation, apply some code to produce all the dates between them, and output a record for each. The Java transformation is ideal for record generation.
OK... so you could override your source qualifier to achieve this in the selection query itself (I'm giving an Oracle-based example as it's what I'm used to, and I'm assuming your incoming data is from a table). I looked up the CONNECT BY syntax here:
SQL to generate a list of numbers from 1 to 100
SELECT MIN(tablea.DATEFIELD) + levquery.n - 1 AS Port1 FROM tablea, (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365) levquery GROUP BY levquery.n
(Check whether the query works for you - I don't have access to a PC to test it at the minute.)

Apply matlab function to large table grouped by variables

I have a large table in Matlab with 7 variables and about 2 million rows. The first column/variable has IDs, the second has dates, and the third variable has prices. For each ID and each date I want to check whether the price was above 100 on each of the previous 6 days. I have a solution but it's very slow, so I would like ideas for improving speed. My solution is the following (with some toy data):
Data = table(reshape(repmat(1:4,3000,1),12000,1),repmat(datestr(datenum(2001,01,31):1:datenum(2009,04,18)),4,1),normrnd(200,120,12000,1),...
'VariableNames',{'ID','Date','Price'});
function y = Lag6days(x)
    % flag rows where all of the previous 6 prices were above 100
    y = zeros(size(x));
    for i = 7:size(x,1)
        y(i,1) = sum(x(i-6:i-1,1) > 100) == 6;
    end
end
Func = @Lag6days;
A = varfun(Func,Data,'GroupingVariables',{'ID'},'InputVariables','Price');
Any suggestions?
This might have something to do with the table data structure - which I'm not really used to.
Consider using 'OutputFormat','cell' in the call to varfun; this seems to work for me.
Of course, you would have to make sure that the grouping procedure of varfun is stable, so that your dates don't get mixed up.
You could consider extracting each ID group into separate vectors by using:
A1 = Lag6days(Data.Price(Data.ID==1));
...
So you can have more control over your dates getting shuffled.
PS: Obviously your algorithm will only work if your prices are already sorted by date and there's exactly one price entry per day. It would be good practice to check these assumptions.

SSAS 2008 Date in Compound Key

I'm trying to design a cube in SSAS 2008 for data whose base unit is Member-Month, meaning that for each member there is demographic data, certain other indicators that may change, and dollar amounts paid per month. I feel like I need to include MemberID and MonthKey in the same dimension, but this seems like the wrong approach in the case when I just want to see dollars by month. If so, would I put both a Month Key and the Member-Month Key in the fact table? Or use a surrogate key in the Member-Month dimension, but include the MemberID and MonthKey in it? It seems wrong to have Month in two different places (Member-Month and Date). Any help is appreciated!
If I understand your question correctly, you should create a member table, a month (or date) table, and a fact table with FactKey, MemberKey, MonthKey and Amount columns. Then you can create Member and Month dimensions.
You should not add month data to the member dimension. The relationship between the Month and Member dimensions is already established by the fact table, which holds all the data needed to say which member-month combinations exist.
This is a very simple design problem and is easily implemented in SSAS - a rough sketch of the tables follows.
Hope this helps.
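For illustration, the relational tables behind that design could look roughly like this (all names and types are made up, not from the original question):

-- Illustrative star schema: Member and Month dimensions plus a Member-Month grain fact table.
CREATE TABLE DimMember (
    MemberKey INT PRIMARY KEY,          -- surrogate key
    MemberID  VARCHAR(20) NOT NULL,     -- business key
    Gender    CHAR(1),                  -- ...and other demographic attributes
    BirthDate DATE
);

CREATE TABLE DimMonth (
    MonthKey  INT PRIMARY KEY,          -- e.g. 201405 for May 2014
    YearNum   INT,
    MonthName VARCHAR(20)
);

CREATE TABLE FactMemberMonth (
    FactKey    INT PRIMARY KEY,
    MemberKey  INT NOT NULL REFERENCES DimMember (MemberKey),
    MonthKey   INT NOT NULL REFERENCES DimMonth (MonthKey),
    AmountPaid DECIMAL(12,2)            -- dollars paid for that member in that month
);

The Member-Month grain then lives only in the fact rows, so Month never has to appear inside the Member dimension, and "dollars by month" is just the amount measure sliced by the Month dimension.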

Filemaker: making queries of large data more efficient

OK, I have a Master table of shipments and a separate Charges table. There are millions of records in each, and the data came into FileMaker from a legacy system, so all the fields are defined as Text even though they may really be Date, Number, etc.
There's a date field in the charges table. I want to create a number field to represent just the year. I can use the Middle function to parse the field and get just the year in a Calculation field. But wouldn't it be faster to have the year as a literal number field, especially since I'm going to be filtering and sorting? So how do I turn this calculation into its value? I've tried just changing the Calculation field to Number, but it just renders blanks.
There's something wrong with your calculation; it should not turn blank just because the field type is different. E.g.:
Middle("10-12-2010", 7, 4)
should suffice, provided the calc result is set to Number. You may also wrap it in GetAsNumber(...), but really there's no difference as long as the field type is right.
If you have FM Advanced, try to set up your calc in the Data Viewer (Tools -> Data Viewer) rather than in Define Fields, this would be faster and, once you like the result, you can transfer it into a field or make a replace. But, from the searching/sorting standpoint there's no difference between a (stored) calculation and a regular field, so replacing is pointless and, actually, more dangerous, as there's no way to undo a wrong replace.
Here's what I was looking for, from http://help.filemaker.com/app/answers/detail/a_id/3366/~/converting-unstored-calculation-fields-to-store-data:
Basically, instead of using a Calculation field, you create an EMPTY Number, Date or Text field, use Replace Field Contents from the Records menu, and put your calculation (or reference, or both) there.
Not dissing FileMaker at all, but millions of records means FileMaker is probably the wrong choice here. Your system will be slow, slow, slow. FileMaker is great for workgroups and there is no way to develop a database app faster. But one thing FileMaker is not good at is handling huge numbers of records.
BTW, Mikhail Edoshin is exactly right.