Good day,
I wish to merge each observation to the next closest date.
The datasets are huge (500 MB to 1 GB), so PROC SQL is out of the question.
I have two data sets. The first (Fleet) has the observations; the second has a date and the generation number to use for further processing. Like this:
data Fleet
CreatedPortalDate
2013/2/19
2013/8/22
2013/8/25
2013/10/01
2013/10/07
data gennum_list
date
01/12/2014
08/12/2014
15/12/2014
22/12/2014
29/12/2014
...
What I'd like to have is a link-table like this:
data link_table
CreatedPortalDate date
14-12-03 01/12/2014
14-12-06 01/12/2014
14-12-09 08/12/2014
14-12-11 08/12/2014
14-12-14 08/12/2014
With the logic that
Date < CreatedPortalDate and (CreatedPortalDate - date) = min(CreatedPortalDate - date)
What I came up with is a bit clunky and I'm looking for an efficient/better way to accomplish this.
data all_comb;
set devFleet(keep=createdportaldate);
do i=1 to n;
set gennum_list(keep=date) point=i nobs=n;
if createdportaldate > date
and createdportaldate - 15 < date then do;/*Assumption, the generations are created weekly.*/
distance= createdportaldate - date;
output;
end;
end;
run;
proc sort data=all_comb; by createdportaldate distance; run;
data link_table;
set all_comb(drop=distance);
by createdportaldate;
if first.createdportaldate;
run;
Any ideas on how to improve or approach this issue?
Ignorant idea: could I create a hash table where the distance would be stored?
Arrays, maybe? Somehow.
EDIT:
Common format? Done.
Where do the billion rows come from? Yes, there are other data involved, but the date is the only linking variable.
Sorted? Yes, the data is sorted and can be sorted again.
Are the gennum dates always seven days apart? No. That's the tricky part. Otherwise I could use week and year (or other binning) as a unique identifier.
Huge is a relative term, today's huge is tomorrow's speck.
Key data features indicate a direct-addressing lookup scheme is possible:
Date values are integers.
Date value ranges are limited.
A date value, or any of the next 14 days, will be used as a lookup verifier.
The key is a date value, which can be used as an array index.
Load the Gennum lookup once as follows
array gennum_of ( %sysfunc(today()) ) _temporary_; /* one slot per possible date value */
if last_date then
  do index = last_date to date - 1;
    gennum_of(index) = last_date; /* every day up to the next gennum date maps to the previous gennum date */
  end;
last_date = date;
And fetch a gennum as
if portaldate > last_date
  then portal_gennum = last_date;
  else portal_gennum = gennum_of(portaldate);
If you have many rows due to grouping by account ids, you will have to clear and reload the gennum array per group.
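For what it's worth, a minimal end-to-end sketch of this scheme (untested; it assumes gennum_list is sorted by date, all dates lie between 1960 and today, and the question's dataset/variable names):
data link_table;
  array gennum_of ( %sysfunc(today()) ) _temporary_;
  /* load the gennum lookup once, on the first iteration */
  if _n_ = 1 then do until (last_gen);
    set gennum_list (keep=date rename=(date=gen_date)) end=last_gen;
    if last_date then
      do index = last_date to gen_date - 1;
        gennum_of(index) = last_date; /* each day maps to the latest gennum date on or before it */
      end;
    last_date = gen_date;
  end;
  retain last_date;
  set fleet (keep=createdportaldate);
  if createdportaldate >= last_date
    then date = last_date;                    /* on or after the newest gennum date: use the newest */
    else date = gennum_of(createdportaldate); /* direct lookup; missing if before the first gennum date */
  drop gen_date index last_date;
run;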
This is a typical application of a SAS by statement.
The by statement in a data step is meant to read two or more data sets at once, sorted by a common variable.
The common variable is the date, but it is named differently in the two datasets. In SQL you would solve that by requiring equality of the one variable to the other (Fleet.CreatedPortalDate = gennum_list.date), but the by statement does not allow such a construction, so we have to rename (at least) one of them while reading the datasets. That is what we do in the rename clause within the options of gennum_list.
data all_comb;
merge gennum_list (in = in_gennum rename = (date = CreatedPortalDate))
Fleet (in = in_fleet);
by CreatedPortalDate;
I chose to combine the by statement with a merge statement, though a set would have done the job too; but then the order of the two input datasets makes a difference.
Also note that I requested SAS to create the indicator variables in_gennum and in_fleet, which show in which input dataset a value was present. It is handy to know that this type of variable is not written to the result data set.
However, we have to recover the date from the CreatedPortalDate, of course
if in_gennum then date = CreatedPortalDate;
If you are new to SAS, you will be surprised that the above statement does not work unless you explicitly instruct SAS to retain the value of date from one observation to the next. (Observation is SAS jargon for row.)
retain date;
And here we write out one observation for each observation read from the Fleet dataset.
if in_fleet then output;
run;
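For reference, the statements above assembled into a single step (a sketch; both inputs must be sorted by their date variable):
data all_comb;
  merge gennum_list (in = in_gennum rename = (date = CreatedPortalDate))
        Fleet (in = in_fleet);
  by CreatedPortalDate;
  retain date;
  if in_gennum then date = CreatedPortalDate; /* remember the latest gennum date */
  if in_fleet then output;                    /* one row per Fleet observation   */
run;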
The advantages of this approach are
you need much less logic to correctly combine the observations from both input datasets (and that is what the data step was invented for)
you never have to retain an array of values in memory, so you cannot have overflow problems
this solution is linear, O(n), in the size of the datasets (apart from the sorting), so we know upfront that doubling the amount of data will only double the time.
Disclaimer: this answer is under construction; it will be tested later this week.
Related
I'm designing a data warehouse using dimensional modeling. I've read most of The Data Warehouse Toolkit by Kimball & Ross. My question is regarding the columns in a dimension table that hold dates. For example, here is a table for Users of the application:
CREATE TABLE user_dim (
user_key BIGINT, -- surrogate key
user_id BIGINT, -- natural key
user_name VARCHAR(100),
...
user_added_date DATE, -- type 0, date user added to the system
...
-- Type-2 SCD administrative columns
row_start_date DATE, -- first effective date for this row
row_end_date DATE, -- last effective date for this row, 9999-12-31 if current
row_current_flag VARCHAR(10) -- current or expired
);
The last three attributes are for implementing type 2 slowly-changing dimensions. See Kimball, pages 150-151.
Question 1: Is there a best practice for the data type of the row_start_date and row_end_date columns? The type could be DATE (as shown), STRING/VARCHAR/CHAR ("YYYY-MM-DD"), or even BIGINT (a foreign key to the Date Dimension). I don't think there would be much filtering on the row start/end dates, so a key to the Date Dimension is not required.
Question 2: Is there a best practice for the data type of dimension attributes such as "user_added_date"? I can see someone wanting reports on users added per fiscal quarter, so using a foreign key to the Date Dimension would be helpful. Any downsides to this, besides having to join from the User Dimension to the Date Dimension to display the attribute?
If it matters, I'm using Amazon Redshift.
Question 1: For the SCD from and to dates I suggest you use TIMESTAMP. My preference is WITHOUT TIME ZONE, ensuring all of your timestamps are UTC.
Question 2: I always set up a date dimension table whose logical key is the actual date. That way you can join any date (e.g. the start date of the user) to the date dimension to find, say, the "fiscal month" off the date dimension. But you can also read the date without joining to the date dimension, as it is plain to see (stored as a date).
With Redshift (or any columnar MPP DBMS) it is good practice to denormalise a little, e.g. use a star schema rather than a snowflake schema. This is because of the efficiencies that columnar storage brings, and it deals with the inefficient joins (because there are no indexes).
For Question 1: row_start_date and row_end_date are not part of the incoming data. As you mentioned, they are created artificially for SCD Type 2 purposes, so they should not have a key to the Date dimension. The User dim has no reason to have a key to the Date dimension. For the data type, YYYY-MM-DD should be fine.
For Question 2: If you have a requirement like this, I would suggest creating a derived fact table (often called an accumulating snapshot fact table) to keep derived measures like user_added_date.
For more info see https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/accumulating-snapshot-fact-table/
How do I set up an index on a DynamoDB table to compare dates? For example, I have a column in my table called synchronizedAt, and I want my query to fetch all the rows that were never synchronized (i.e. 0) or weren't synchronized in the past 2 weeks, i.e. (new Date().getTime()) - (1000 * 60 * 60 * 24 * 7 * 4)
It depends on the other attributes of your table.
You may use a Hash and Range primary key if the set of hash values is relatively small and stable; in this case you could filter the dates by putting them in the Range. The queries will still have to specify the hash value as well, so it may or may not make sense to pre-query all the hash values and loop over them, asking for the interesting Range inside the loop for each value.
An alternative could be a Hash and Range GSI. In this case you might put a fixed dummy value as the hash key, in order to query the range over all items at once.
Lastly, there is the less efficient Scan, but with large tables it will be a problem (the larger the table, the more time the Scan will take to complete).
I had a similar requirement to query on a date range. In my case the date range was the only criterion. The issue with DynamoDB is that you cannot create an index with just a range key; it always requires a hash key, and a Query on such an index always expects an equality condition on the hash key.
So I tricked the DB. I created a key called Century and populated it with the century of the date. For example, for 1 Jan 2019 the Century key value is 20; for 1 Jan 2020 it is also 20. It is very easy to derive from any date. Then I created a GSI with Century as the hash key and the date as the range key. While querying, it is easy to derive the century from the date range and build the query condition: hash key equal to the century, plus the date range. Since I am dealing with data spanning no more than 5 years, the trick won't fail for the next 75 years. :)
It is not the nicest workaround, but it works quite well for me. Maybe it will help someone else as well.
I have a birth date variable in this format: 15APR1954
I need to set a new variable that presents the current age, as if today's date were 01.01.2011.
In order to use the variable, how do I convert the date?
Otherwise it gives me the following error: "The MDY function call does not have enough arguments".
data DAT2;set DAT1;
array BD{*} birth_date;
Curage=0;
do i=1 to dim(BD);
Curage+(MDY(01012011)-(birth_date));
end;
drop i;
run;
The best way to calculate age is to use the SAS built-in function yrdif().
data dat2;
set dat1;
curage = yrdif(birth_date, today(), 'AGE');
run;
The function today() returns today's date. If you want the age as of a certain date, e.g. 2011-01-01 like in your example, you can replace today() with '01JAN2011'd or with mdy(1, 1, 2011). (Note that your syntax for mdy() was incorrect.)
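For example, to fix the age at 2011-01-01:
data dat2;
  set dat1;
  curage = yrdif(birth_date, '01JAN2011'd, 'AGE'); /* age as of a fixed date */
run;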
I'll also note that your array approach doesn't make a whole lot of sense; you're defining an array with only one element, so you might as well just perform operations on that value. Arrays are useful when you wish to perform identical operations on a group of two or more variables, as sketched below. For thorough information on array processing in SAS, see this section of the documentation.
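For illustration, a sketch where an array starts to pay off (hire_date and the output variable names are hypothetical):
data dat3;
  set dat1;
  array dates{2} birth_date hire_date;  /* hire_date is a hypothetical second date */
  array yrs{2}   yrs_birth  yrs_hire;
  do i = 1 to dim(dates);
    yrs{i} = yrdif(dates{i}, today(), 'AGE'); /* years elapsed since each date */
  end;
  drop i;
run;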
What I'm trying to do is list all of the data in a specific table. I tried the code snippets shown below, and I still get data from outside the range (2005, etc.).
This was the first one I tried
=IIF((Fields!APP_CREATE_DATETIME.Value >="{ts '2014-01-01 00:00:00'}") AND (Fields!APP_CREATE_DATETIME.Value > "{ts '2014-01-31 00:00:00'}"), Switch(Fields!DLR_NAME.Value, "JAN"), nothing)
Then this
=IIF((Fields!APP_CREATE_DATETIME.Value >="2014-01-01 00:00:00") AND (Fields!APP_CREATE_DATETIME.Value > "2014-01-31 00:00:00"), Switch(Fields!DLR_NAME.Value, "JAN"), nothing)
The SQL column in the table itself is of DATETIME format
I'm not sure what you're trying to do there, but that isn't the right syntax for Switch():
=SWITCH(Conditional1, Result1,
Conditional2, Result2,
Conditional3, Result3
)
I think it would be easier to set two filters on the table: one for greater than the start date and the other for less than the end date. That would be much easier to understand and maintain. But that is mostly because this seems to be simple date range filtering, if the SWITCH() doesn't actually do anything else.
I'd recommend using the DateSerial() function to generate the date rather than relying on converting a string value. Otherwise you'll need to use the CDate() function to convert the string, but string dates always feel a little unreliable to me.
=DateSerial(1970,1,1)
or
=Cdate("1970-01-01")
I am working on writing SAS code and, since I am new to SAS (I have worked in R all the time), I am having trouble understanding the date formats in SAS.
I have a SAS data set Sales_yyyymm, and I am creating code that takes the user's input of a date value; if sales data exists for that date, I need to set a flag to 1, else 0. Currently I am using this code -
%Let check_date = 20010120;
Data A;
Set B;
If date=&check_date then Date_Flag = 1;
else Date_Flag = 0;
run;
date is the Date column in my SAS data set Sales_yyyymm, and the values are like 20130129, 20110412, 20140120, etc.
But if I run this code, I get all my Date_Flag values as 0. The IF condition never evaluates to true, and I am not sure how or why this is happening.
Any idea?
Thanks!
You need to read the Understanding How SAS Handles Dates article to learn how SAS internally stores a date and how arithmetic on dates, including comparisons, is carried out.
In SAS, every date is a unique number on a number line. Dates before January 1, 1960, are negative numbers; those after January 1, 1960, are positive. Because SAS date values are numeric variables, you can sort them easily, determine time intervals, and use dates as constants, as arguments in SAS functions, or in calculations.
As I mentioned in the comments you really need to specify your date literals in "ddmmmyyyy"d format.
So, %Let check_date = 20010120; should be written as:
%Let check_date = 20JAN2001;
Data A;
Set B;
If date="&check_date."d then Date_Flag = 1;
else Date_Flag = 0;
run;
20010120 translated into a SAS date goes beyond the valid range SAS can handle. E.g. 2001012 (note there is no zero at the end) corresponds to "03AUG7438" - yes, that's year 7438!
Whereas 14995 is the integer that SAS understands to be the date 20JAN2001
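A quick way to see both representations:
data _null_;
  x = "20JAN2001"d;
  put x=;        /* prints x=14995, the raw numeric value         */
  put x= date9.; /* prints x=20JAN2001, the same value formatted  */
run;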