How to create a "bucket" variable in SAS from ranges given in another table - merge

I am trying to create a bucket variable in SAS that will split transactions into various buckets. However, depending on the retailer where the transactions occurred, the buckets have different lengths and end points. For example, Bucket 1 for Retailer 1 is from June 2017 to July 2018, while for Retailer 2 it is from January 2018 to November 2018. The retailers, bucket labels, and end points for the buckets are stored in an Excel file which I have imported successfully. The transactions are stored in a separate table with retailer information and a "date incurred" column. I am struggling to create a bucket variable in the transactions table. Does SAS allow for conditional logic when merging, like "if the transaction date is between these two dates assign this bucket value"? Is merging even the best way to add the bucket info to the transactions table?
Thank you so much for your help - this is my first ever Stack Overflow question, and I am teaching myself SAS for the first time. Please let me know what other information I can provide to make answering this question easier!

Add the condition to the SQL join, for example:
transactions a
left join
buckets b on a.Retailer = b.Retailer
and a.TransactionDate between b.BucketStart and b.BucketEnd
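A complete PROC SQL step along those lines might look like the sketch below. The dataset and column names (transactions, buckets, Retailer, TransactionDate, BucketStart, BucketEnd, BucketLabel) are assumptions based on your description, so substitute your own:

proc sql;
  create table work.transactions_bucketed as
  select a.*,
         b.BucketLabel
  from transactions a
  left join buckets b
    on  a.Retailer = b.Retailer
    and a.TransactionDate between b.BucketStart and b.BucketEnd;
quit;

The left join keeps transactions that fall outside every bucket (their BucketLabel comes back missing); use an inner join if you only want transactions that land in a bucket. If the bucket dates came in from Excel as character strings, convert them to SAS dates first (e.g. with input() and an appropriate date informat) so the BETWEEN comparison works.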

Related

FileMaker Pro 16: creating records that share a date (work rota)

I am currently trying to create a rota within FileMaker 16 and I can't figure out how to create records that share a date.
I want to be able to have people assigned to jobs and jobs assigned to dates, but currently when I create jobs with the same date it creates a new record instead of assigning the job to the one that already exists.
I currently have 3 tables: jobs, dates and people. I have a 4th layout with a portal where I want to view records related to jobs that are set for a certain day.
Any help would be much appreciated.
Many thanks.
I am not 100% convinced you need a Dates table. Do you have anything specific to record about a date, other than its existence?
However, you certainly need a join table of Assignments, with fields for:
PersonID
JobID
Date
(this is assuming your rota is daily, otherwise you will need to indicate the shift or hours too).
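For what it's worth, the Assignments table is the classic many-to-many join table. In plain SQL terms (FileMaker builds this with table occurrences and relationships rather than SQL, so this is only an illustration of the shape, with made-up field names):

create table Assignments (
  AssignmentID integer primary key,  -- in FileMaker, an auto-enter serial number
  PersonID     integer,              -- matches the key field in People
  JobID        integer,              -- matches the key field in Jobs
  WorkDate     date                  -- the day this person works this job
);

The portal mentioned in the question would then be based on Assignments (related to Jobs by JobID), so several people can share a job and several jobs can share a date without duplicating either.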
I think your structure should change here. Instead have:
Parent:
ProjectId
Date
Child (the assignment):
PersonId
JobId
Date
Then make ProjectId your primary key, so that becomes your parent record, and you are just assigning the person, job & date as the child. That way you can always add to the previous record without relying on the date field, and you can then filter via dates etc.

How to Handle Rows that Change over Time in Druid

I'm wondering how we could handle data that changes over time in Druid. I realize that Druid is built for streaming data where we wouldn't expect a particular row to have data elements change. However, I'm working on a project where we want to stream transactional data from a logistics management system, but there's a calculation that happens in that system that can change for a particular transaction based on other transactions. What I mean:
- 9th of the month: I post transaction A with a date of today (the 9th), which brings the stock on hand to 0 units.
- 10th of the month: I post transaction B with a date of the 1st of the month, crediting my stock by 10 units. At this point (on the 10th of the month) the stock on hand for transaction A recalculates to 10 units. The same would be true for ALL transactions after the 1st of the month.
As I understand it, we would re-extract transaction A, resulting in transaction A2.
The stock-on-hand dimension is incredibly important to our metrics, specifically for identifying when stockouts occur (when stock on hand = 0). In the above example, if I have two rows for transaction A, I would mistakenly identify a stockout from the original row (A1), whereas the re-extracted row (A2) is the source of truth.
Is there any ability to archive a row and replace it with an updated row, or do we need to add logic to our queries that finds the rows with the freshest timestamp per transaction id?
Thanks
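As an aside on the query-side option raised at the end of the question: the "freshest row per transaction id" logic has a standard SQL shape, sketched below. This is generic SQL rather than Druid's native query language, and the names (transactions, transaction_id, ingest_time) are placeholders; whether you can express it directly depends on your Druid version and query layer.

-- Keep only the most recently ingested row per transaction id.
SELECT t.*
FROM transactions t
JOIN (
  SELECT transaction_id, MAX(ingest_time) AS max_ingest
  FROM transactions
  GROUP BY transaction_id
) latest
  ON  t.transaction_id = latest.transaction_id
  AND t.ingest_time    = latest.max_ingest;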
I have two thoughts that I hope help you. The key documentation for this is "Updating Existing Data": http://druid.io/docs/latest/ingestion/update-existing-data.html which gives you three options: Lookup Tables, Reindexing, and Delta Ingestion. The last one, Delta Ingestion, is only for adding new rows to old segments, so it's not very useful for you; let's go over the other two.
Reindexing: You can crunch all the numbers that change in your ETL process, identify the segments that would need to be reloaded, and simply have Druid re-index those segments. That will replace the stock-on-hand value for A in your example whenever you run the re-indexing.
Lookups: If you have stock values for multiple products, you can store the product id in the segment and have that be immutable, but look up the stock-on-hand value in a lookup. So, you would store:
A, 2018-01-01, product-id: 123
And in your lookup, you'd have:
product-id: 123, stock-on-hand: 0
And later, you'd update the lookup and change that to 10. This would update any rows that reference product-id: 123.
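If you query through Druid SQL, a registered lookup like that can be referenced at query time with the LOOKUP function (available in recent Druid SQL versions; check yours). A rough sketch, assuming a lookup registered as 'stock_on_hand' and a datasource named transactions (both placeholder names), and remembering that lookups return strings:

-- Find rows whose current (looked-up) stock on hand is zero.
SELECT
  product_id,
  CAST(LOOKUP(product_id, 'stock_on_hand') AS BIGINT) AS stock_on_hand
FROM transactions
WHERE CAST(LOOKUP(product_id, 'stock_on_hand') AS BIGINT) = 0;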
I can't be sure but you may be mixing up dimensions and metrics while you're doing this, and you may need to read over that terminology in OLAP descriptions like this: https://en.wikipedia.org/wiki/Online_analytical_processing
Good luck!

Executing query in chunks on Greenplum

I am trying to create a way to convert bulk date queries into incremental queries. For example, if a query has a where condition specified as
WHERE date > now()::date - interval '365 days' and date < now()::date
this will fetch a year's data if executed today. If the same query is executed tomorrow, 365 days of data will again be fetched. However, I already have the last 364 days of data from the previous run. I just want a single day's data to be fetched and a single day's data to be deleted from the system, so that I end up with 365 days of data at better performance. This data is to be stored in a separate temp table.
To achieve this, I create an incremental query, which will be executed in the next run. However, deleting the single day's data is proving tricky when that "date" column does not feature in the SELECT clause but only in the WHERE condition, as the temp table schema will then not have the "date" column.
So I thought of executing the bulk query in chunks and assigning an ID to each chunk. This way, I can delete one chunk and add another while the rest of the data remains unaffected.
Is there a way to achieve this in Postgres or Greenplum, like some built-in functionality? I went through the whole documentation but could not find any.
Also, if not, is there any better solution to this problem?
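One note on the specific snag of the "date" column not being in the SELECT list: a simple workaround is to carry it (or a chunk id derived from it) into the staging table anyway, purely so the incremental delete has something to key on. A sketch with placeholder names (source_table, col1, col2):

-- Initial load: keep the date alongside the columns the query actually needs.
CREATE TABLE txn_window AS
SELECT col1, col2,
       date AS chunk_date          -- carried along only for incremental maintenance
  FROM source_table
 WHERE date > now()::date - interval '365 days'
   AND date < now()::date;

-- Each subsequent run: drop the day that aged out of the window, add the newly completed day.
DELETE FROM txn_window
 WHERE chunk_date <= now()::date - interval '365 days';

INSERT INTO txn_window
SELECT col1, col2, date
  FROM source_table
 WHERE date >= now()::date - interval '1 day'
   AND date < now()::date;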
I think this is best handled with something like an aggregates table (I assume the issue is that you have heavy aggregates to compute over a lot of data). This doesn't necessarily cause normalization problems (and data warehouses often denormalize anyway). The aggregates you need can be stored per day, so you cut down to one record per day for the closed data, plus the non-closed data. Restricting the aggregates to data which cannot change is what avoids the insert/update anomalies that normalization normally prevents.
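A minimal sketch of that aggregates-table idea, again with placeholder names (the source table, the measures, and the exact window rules will differ for your data):

-- One row per day of closed (no longer changing) data.
CREATE TABLE daily_aggregates (
  agg_date   date,
  txn_count  bigint,
  txn_total  numeric
) DISTRIBUTED BY (agg_date);

-- Each run: append the newly completed day ...
INSERT INTO daily_aggregates (agg_date, txn_count, txn_total)
SELECT date, count(*), sum(amount)
  FROM source_table
 WHERE date >= now()::date - interval '1 day'
   AND date < now()::date
 GROUP BY date;

-- ... and trim whatever has fallen out of the 365-day window.
DELETE FROM daily_aggregates
 WHERE agg_date <= now()::date - interval '365 days';

Queries over the trailing year then hit the small per-day table, and only the current, still-changing data has to be computed from the raw rows.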

sql server partitioned table

I am preparing for 70-451 exam. There is a question I got:
You are a database developer. You plan to design a database solution by using SQL Server 2008. The database will contain a table named Claims. The Claims table will contain a large amount of data. You plan to partition the data into following categories:
Open claims
Claims closed before January 1, 2005
Claims closed between January 1, 2005 and December 31, 2007
Claims closed from January 1, 2008 till date
The close_date field in the Claims table is a date data type and is populated only if the claim has been closed. You need to design a partition function to segregate records into the defined categories.
what should you do?
A Create a RANGE RIGHT partition function by using the values 20051231, 20071231, and 20080101.
B Create a RANGE RIGHT partition function by using the values 20051231, 20071231, and NULL.
C Create a RANGE LEFT partition function by using the values 20051231, 20071231, and 20080101.
D Create a RANGE LEFT partition function by using the values 20051231, 20071231, and NULL.
Can someone answer this?
I've looked at this a few times, and I can't see any of them being right.
The partition for claims closed before Jan 1, 2005 is not generated by any of them, since the first boundary value in every answer is 20051231. Whether LEFT or RIGHT is used is then immaterial: every value up to 31st Dec 2005 ends up in a single partition, and LEFT/RIGHT only determines whether that boundary date itself is included.
I would have expected a LEFT with 20041231, or a RIGHT with 20050101, to be in the mix somewhere.
If the answers all started with 20041231 instead of 20051231, then I would take answer D as correct. Either this question has a typo, or the test does.
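For reference, the shape answer D describes (with the corrected 20041231 boundary, and NULL isolating the open, never-closed claims) would be roughly the following; the function name is made up and I have not run this against SQL Server 2008:

-- RANGE LEFT: each boundary value belongs to the partition on its left.
-- Partitions: close_date IS NULL (open) | up to 2004-12-31 | 2005-01-01 to 2007-12-31 | 2008-01-01 onward
CREATE PARTITION FUNCTION pfClaimsByCloseDate (date)
AS RANGE LEFT FOR VALUES (NULL, '20041231', '20071231');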
I took the exam this week and this question came up. I left a comment on the question about the unrelated 20051231 date.

How to make a function in IBM DB2 to sum fields 01 to 06 in month 5, 01 to 07 in month 6, and so on?

We have a database that stores the bank account of our clients like this:
|client|c01|c02|c03|c04|d01|d02|d03|d04|
|a |3€ |5€ |4€ |0€ |-2€|-1€|-4€| 0€|
This is the structure of the table when we are in month 3. In month 4 it would be:
|client|c01|c02|c03|c04|c05|d01|d02|d03|d04|d05|
|a |3€ |5€ |4€ |2€ |0€ |-2€|-1€|-4€|-2€| 0€|
Note the new c05 and d05 columns.
The database automatically adds those columns each month.
Because of this change in the columns, I can't easily get the sum of c01, c02, c03, c04, d01, d02, d03, d04. I was thinking of making a function that checks the current month and loops in order to select and sum those columns without errors.
If you have a better idea to do it, you are welcome.
But the main question is how to make a function that is able to sum a variable number of columns?
thanks
There's something about adding a column every month that bugs me. I know it can be a valid OLAP strategy as selective de-normalization, but it just feels weird here. Usually with these kinds of things, the entire width is specified up front, if for no other reason than to avoid ALTER TABLE statements. I don't know enough about what you're storing there to recommend otherwise, but I guess I just prefer fully normalized structures.
Having done similar things before, I think your best bet will be to use dynamic SQL. You can place that inside your stored procedure, PREPAREing and EXECUTEing as normal.
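A rough sketch of that dynamic-SQL approach in DB2 SQL PL. The table name (accounts), the client column, and the month-to-column rule (month m has columns 01 through m+1, per the pattern in the question title) are assumptions, so adjust them to your schema:

CREATE OR REPLACE PROCEDURE SUM_CLIENT_COLUMNS (IN  p_client VARCHAR(20),
                                                OUT p_total  DECIMAL(15,2))
LANGUAGE SQL
BEGIN
  DECLARE v_cols VARCHAR(2000) DEFAULT '';
  DECLARE v_stmt VARCHAR(3000);
  DECLARE v_n    INTEGER;
  DECLARE i      INTEGER DEFAULT 1;

  -- Month m has columns c01..c(m+1) and d01..d(m+1).
  SET v_n = MONTH(CURRENT DATE) + 1;

  -- Build "c01 + d01 + c02 + d02 + ..." for the current month.
  WHILE i <= v_n DO
    SET v_cols = v_cols
                 || CASE WHEN i = 1 THEN '' ELSE ' + ' END
                 || 'c' || RIGHT('0' || RTRIM(CHAR(i)), 2)
                 || ' + d' || RIGHT('0' || RTRIM(CHAR(i)), 2);
    SET i = i + 1;
  END WHILE;

  SET v_stmt = 'SET ? = (SELECT ' || v_cols || ' FROM accounts WHERE client = ?)';
  PREPARE s1 FROM v_stmt;
  EXECUTE s1 INTO p_total USING p_client;
END

When running this from the command line you will need an alternate statement terminator (for example @), since the procedure body itself contains semicolons.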