first. and last. with multiple occurrences - date

I am trying to gather dates from a dataset using SAS's first.variable and last.variable automatic variables. Here is the code to create a reproducible example:
data example;
  infile datalines delimiter=",";
  input id $ code $ valid_from valid_to;
  format valid_from valid_to IS8601DA10.;
datalines;
1A,ABC,20058,20177
1A,DEF,20178,20481
1A,DEF,20482,20605
1A,DEF,20606,21548
1A,DEF,21549,21638
1A,DEF,21639,21729
1A,ABC,21730,21733
1A,ABC,21734,21808
1B,MNO,20200,20259
1B,PQR,20260,20269
1B,STU,20270,20331
1B,VWX,20332,20361
1B,VWX,20362,22108
1B,VWX,22109,22164
1B,VWX,22165,22165
1B,VWX,22166,2936547
;
run;
The idea is to get, for each id, only one observation per code with the corresponding range of dates it covers.
Here is my code:
proc sort data=example out=example_sorted; by code valid_from; run;

data collapse_val_dates;
  set example_sorted;
  by code valid_from;
  if first.code = 1 and last.code = 1 then do;
    output;
  end;
  if first.code = 1 and last.code = 0 then do;
    hold = valid_from;
    retain hold;
  end;
  if first.code = 0 and last.code = 1 then do;
    valid_from = hold;
    output;
  end;
  drop hold;
run;
Here is the result (table collapse_val_dates):
+----+------+------------+------------+
| id | code | valid_from | valid_to |
+----+------+------------+------------+
| 1A | ABC | 2014-12-01 | 2019-09-16 |
| 1A | DEF | 2015-03-31 | 2019-06-29 |
| 1B | MNO | 2015-04-22 | 2015-06-20 |
| 1B | PQR | 2015-06-21 | 2015-06-30 |
| 1B | STU | 2015-07-01 | 2015-08-31 |
| 1B | VWX | 2015-09-01 | 9999-12-31 |
+----+------+------------+------------+
It produces what I expect for id=1B but not for id=1A. Because code=ABC appears once at the beginning and twice at the end, the result table puts valid_from=2014-12-01.
What I would like is the valid_from for code=ABC to be 2019-06-30. In other words, I would like SAS to "forget" the first occurrence of a code if one or more other codes appear in between. The final table would look like this:
+----+------+------------+------------+
| id | code | valid_from | valid_to |
+----+------+------------+------------+
| 1A | DEF | 2015-03-31 | 2019-06-29 |
| 1A | ABC | 2019-06-30 | 2019-09-16 |
| 1B | MNO | 2015-04-22 | 2015-06-20 |
| 1B | PQR | 2015-06-21 | 2015-06-30 |
| 1B | STU | 2015-07-01 | 2015-08-31 |
| 1B | VWX | 2015-09-01 | 9999-12-31 |
+----+------+------------+------------+

In a single pass over the data you cannot output a date range for a code while serially processing a group, because later rows in the group could overwrite the date range you want.
You will need to either
code multiple steps, or
perform a single pass and use temporary storage.
Presume have is sorted by id and valid_from, and that valid_to never overlaps a succeeding valid_from within the id group.
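If you are starting from the question's example data, a data set have in that order can be produced with a simple sort (a minimal sketch; the name have is just the one used below):
proc sort data=example out=have;
  by id valid_from;
run;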
Multiple Steps
Compute a group number for rows grouped by contiguous code for use in final ordering.
* multi step way;
data stage1;
  set have;
  by id code notsorted;
  if first.code then group_number + 1;
run;

proc sort data=stage1 out=stage2;
  by id code group_number valid_from;
run;

* remember there can be multiple contiguous code groups within an id;
data stage3;
  do until (last.code);
    set stage2;
    by id code group_number;
    if first.group_number then _start = valid_from;
    if last.code then do;
      valid_from = _start;
      OUTPUT; /* date range of the last contiguous code group */
    end;
  end;
  drop _start;
run;

proc sort data=stage3 out=want(drop=group_number);
  by id valid_from;
run;
Single Pass
A DOW loop (a loop that has a SET statement within it) can compute a result over a group and subgroup and output one row per combination. Temporary storage can be a hash (for an arbitrary number of subgroups), or an array for an assumed maximum number of subgroups.
Example:
A temporary array of fixed size 1,000 is used to store per-code values that are updated while the group is examined.
* find the range of dates for the last set of contiguous rows of a code within id;
data want(keep=id code valid_:);
  array dates (1000,2) 8 _temporary_;  /* ,1 for _from and ,2 for _to */
  array codes (1000) $50 _temporary_;
  array seq   (1000) 8 _temporary_;    /* sequence for output order */

  * process the id group;
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id code notsorted;

    * save start of date range in temporary storage;
    if first.code then do;
      * linear search for the slot to use for this code subgroup;
      do _index = 1 by 1 until (missing(codes(_index)) or codes(_index) = code);
      end;
      codes(_index) = code;
      dates(_index,1) = valid_from;
      seq(_index) = _n_ + _index / 1000;  * encode order value with lookup index;
    end;

    * save end of date range;
    if last.code then dates(_index,2) = valid_to;
  end;

  *---;
  * output one row per code within the id;
  call sortn(of seq(*));  * order by first date of the last code subgroup;
  do _index = 1 to dim(seq);
    if missing(seq(_index)) then continue;

    * extract encoded information;
    _ix = round((seq(_index) - int(seq(_index))) * 1000);
    code = codes(_ix);
    valid_from = dates(_ix,1);
    valid_to = dates(_ix,2);
    OUTPUT;
  end;

  * clear the temporary arrays for the next group;
  call missing(of dates(*), of codes(*), of seq(*));
run;


UPDATE from temp table picking the "last" row per group

Suppose there is a table with data:
+----+-------+
| id | value |
+----+-------+
| 1 | 0 |
| 2 | 0 |
+----+-------+
I need to do a bulk update, and I use COPY FROM STDIN for fast insertion into a temp table without constraints, so it can contain duplicate values in the id column.
Temp table to update from:
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
+----+-------+
If I simply run a query like this:
UPDATE test target SET value = source.value FROM tmp_test source WHERE target.id = source.id;
I get the wrong result:
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 2 | 1 |
+----+-------+
I need the target table to contain the values that appeared last in the temporary table.
What is the most effective way to do this, given that the target table may contain millions of records and the temporary table tens of thousands?
Assuming you want to take the value from the row that was inserted last into the temp table, physically, you can (ab-)use the system column ctid, signifying the physical location:
UPDATE test AS target
SET    value = source.value
FROM  (
   SELECT DISTINCT ON (id)
          id, value
   FROM   tmp_test
   ORDER  BY id, ctid DESC
   ) source
WHERE  target.id = source.id
AND    target.value <> source.value;  -- skip empty updates
About DISTINCT ON:
Select first row in each GROUP BY group?
This relies on an implementation detail and is not backed by the SQL standard. If some insert method does not write rows in sequence (like a future "parallel" INSERT), it breaks. Currently, it should work. About ctid:
How do I decompose ctid into page and row numbers?
If you want a safe way, you need to add a user column to signify the order of rows, like a serial column. But do you really care? Your tiebreaker seems rather arbitrary. See:
Temporary sequence within a SELECT
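A minimal sketch of that safer variant; the seq column is an assumption added to the question's tmp_test, and COPY gets an explicit column list so the serial default records insertion order:
CREATE TEMP TABLE tmp_test (
   seq   bigserial  -- records insertion order explicitly
 , id    int
 , value int
);

-- COPY tmp_test (id, value) FROM STDIN;  -- seq is filled by its default

UPDATE test AS target
SET    value = source.value
FROM  (
   SELECT DISTINCT ON (id)
          id, value
   FROM   tmp_test
   ORDER  BY id, seq DESC  -- last inserted row per id wins
   ) source
WHERE  target.id = source.id
AND    target.value <> source.value;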
AND target.value <> source.value
skips empty updates - assuming both columns are NOT NULL. Else, use:
AND target.value IS DISTINCT FROM source.value
See:
How do I (or can I) SELECT DISTINCT on multiple columns?

Column-wise autocomplete

I have a table in a PostgreSQL database with four columns that contain increasingly more detailed information (think state->city->street->number), along with a column where everything is concatenated according to some simple formatting rules. Example:
| kommun     | trakt        | block | enhet | beteckning             |
|------------|--------------|-------|-------|------------------------|
| Mora       | Gislövs Läge | 9     | 16    | Mora Gislövs Läge 9:16 |
| Mora       | Gisslaved    | *     | 8     | Mora Gisslaved 8       |
| Mora       | Gisslaved    | *     | 9     | Mora Gisslaved 9       |
| Lilla Edet | Sanda        | GA    | 1     | Lilla Edet Sanda GA:1  |
A web service uses this table to implement a word-wise autocomplete, where the user gets input suggestions as they drill down. An input of mora gis will result in
["Mora Gislövs", "Mora Gisslaved"]
Currently, this is done by splitting the concatenated column by word in this query:
select distinct trim(substring(beteckning from '(^(\S+\s?){NUMPARTS})')) as bet
from beteckning_ac
where upper(beteckning) like upper('mora gis%')
order by bet
Where NUMPARTS is the number of words in the input - 2 in this case.
Now I want the autocomplete to be done column-wise rather than word-wise, so mora gis would now result in this instead:
["Mora Gislövs Läge", "Mora Gisslaved"]
Since the first two columns can contain an arbitrary number of words, I can no longer use the input to determine how many columns to include in the response. Is there a way to do this, or have I maybe gone about this autocomplete business all wrong?
CREATE OR REPLACE FUNCTION get_auto(text)
-- $1 is the user input
RETURNS setof text
LANGUAGE plpgsql
AS $function$
declare
   NUMPARTS int := array_length(regexp_split_to_array($1, ' '), 1);
begin
   return query
   select distinct  -- distinct, so repeated prefixes collapse to one suggestion
      case
         when (NUMPARTS = 1) then kommun
         when (NUMPARTS = 2) then kommun||' '||trakt
         when (NUMPARTS = 3) then kommun||' '||trakt||' '||block
         when (NUMPARTS = 4) then kommun||' '||trakt||' '||block||' '||enhet
         -- alter if you want to
      end
   from
      auto_complete  -- your table name here
   where
      upper(beteckning) like upper($1)||'%';  -- case-insensitive, as in the question
end;
$function$;
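With the sample table loaded as auto_complete, a call with the question's input would return one row per suggestion (in no guaranteed order without an ORDER BY):
SELECT get_auto('mora gis');
-- Mora Gislövs Läge
-- Mora Gisslaved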

PostgreSQL BETWEEN selects record when not fulfilled

Why does this query return a record?
db2=> select * FROM series WHERE start <= '882001010000' AND "end" >= '882001010000' ORDER BY timestamp DESC LIMIT 1;
  id   |      timestamp      |  start   |   end
-------+---------------------+----------+----------
 23443 | 2016-12-23 17:10:05 | 88160000 | 88209999
or with BETWEEN:
db2=> select * FROM series WHERE '882001010000' BETWEEN start AND "end" ORDER BY timestamp DESC LIMIT 1;
  id   |      timestamp      |  start   |   end
-------+---------------------+----------+----------
 23443 | 2016-12-23 17:10:05 | 88160000 | 88209999
start and end are TEXT columns.
They return a record because you are comparing strings, not numbers.
Hence '8' is between '7000000' and '9000', because string comparison proceeds character by character.
If you want numeric comparisons, you can cast the values to numbers. Or, better yet, store the values as numerics; Postgres supports very large precisions.
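For instance, a minimal sketch of the cast approach against the question's table (same columns as above):
SELECT *
FROM   series
WHERE  start::numeric <= 882001010000
AND    "end"::numeric >= 882001010000
ORDER  BY timestamp DESC
LIMIT  1;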

Update intermediate result

EDIT
As requested, a little background on what I want to achieve. I have a table that I want to query, but I don't want to change the table itself. Next, the result of the SELECT query (what I called the 'intermediate table') needs to be cleaned a bit. For example, certain cells of certain rows need to be swapped and some strings need to be trimmed. Of course this could all be done as postprocessing in, e.g., Python, but I was hoping to do it all in one query statement.
Being new to PostgreSQL, I want to update the intermediate table that results from a SELECT statement. So I basically want to edit the resulting table of a SELECT statement in one query. I'd like to avoid having to store the intermediate result.
I've tried the following 'with clause':
with result as (
   select a
   from   b
)
update result as r
set    a = 'd'
...but that results in ERROR: relation "result" does not exist, while the following does work:
with result as (
   select a
   from   b
)
select *
from   result
As I said, I'm new to Postgresql so it is entirely possible that I'm using the wrong approach.
Depending on the complexity of the transformations you want to perform, you might be able to munge it into the SELECT, which would let you get away with a single query:
WITH foo AS (SELECT lower(name), freq, cumfreq, rank, vec FROM names WHERE name LIKE 'G%')
SELECT ... FROM foo WHERE ...
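If a single pass of expressions is not enough, CTEs can be chained so each stage cleans up the previous one. A hypothetical sketch of the question's swap-and-trim cleanup; the table and columns (raw, a, b, needs_swap) are made-up names for illustration:
WITH selected AS (
   SELECT a, b, needs_swap
   FROM   raw
), cleaned AS (
   SELECT trim(CASE WHEN needs_swap THEN b ELSE a END) AS a  -- swap cells where flagged
        , trim(CASE WHEN needs_swap THEN a ELSE b END) AS b  -- and trim strings
   FROM   selected
)
SELECT * FROM cleaned;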
Or, for more or less unlimited manipulation options, you could create a temp table that will disappear at the end of the current transaction. That doesn't get the job done in a single query, but it does get it all done on the SQL server, which might still be worthwhile.
db=# BEGIN;
BEGIN
db=# CREATE TEMP TABLE foo ON COMMIT DROP AS SELECT * FROM names WHERE name LIKE 'G%';
SELECT 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
----------+-------+---------+------+-----------------------
GREEN | 0.183 | 11.403 | 35 | 'KRN':1 'green':1
GONZALEZ | 0.166 | 11.915 | 38 | 'KNSL':1 'gonzalez':1
GRAY | 0.106 | 15.921 | 69 | 'KR':1 'gray':1
GONZALES | 0.087 | 18.318 | 94 | 'KNSL':1 'gonzales':1
GRIFFIN | 0.084 | 18.659 | 98 | 'KRFN':1 'griffin':1
(5 rows)
db=# UPDATE foo SET name = lower(name);
UPDATE 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
--------+-------+---------+-------+---------------------
grube | 0.002 | 67.691 | 7333 | 'KRP':1 'grube':1
gasper | 0.001 | 69.999 | 9027 | 'KSPR':1 'gasper':1
gori | 0.000 | 81.360 | 28946 | 'KR':1 'gori':1
goeltz | 0.000 | 85.471 | 47269 | 'KLTS':1 'goeltz':1
gani | 0.000 | 86.202 | 51743 | 'KN':1 'gani':1
(5 rows)
db=# COMMIT;
COMMIT
db=# SELECT * FROM foo;
ERROR: relation "foo" does not exist

How to read this multi-format data in SAS?

I have an odd dataset that I need to import into SAS, splitting the records into two tables depending on formatting, and dropping some records altogether. The data is structured as follows:
c Comment line 1
c Comment line 2
t lines init
a 'mme006' M 8 99 15 '111 ME - RANDOLPH ST'
    path=no
    dwt=0.01 42427 ttf=1 us1=3 us2=0
    dwt=#0 42350 ttf=1 us1=1.8 us2=0 lay=3
    dwt=>0 42352 ttf=1 us1=0.5 us2=18.13
    42349 lay=3
a 'mme007' M 8 99 15 '111 ME - RANDOLPH ST'
    path=no
    dwt=+0 42367 ttf=1 us1=0.6 us2=0
    dwt=0.01 42368 ttf=1 us1=0.6 us2=35.63 lay=3
    dwt=#0 42369 ttf=1 us1=0.3 us2=0
    42381 lay=3
Only the lines beginning with a, dwt or an integer need to be kept.
For the lines beginning with a, the desired output is a table like this, called "lines", which contains the first two non-a values in the row:
name | type
--------+------
mme006 | M
mme007 | M
For the dwt/integer rows, the table "itins" would look like so:
anode | dwt | ttf | us1 | us2 | lay
------+------+-----+-----+-------+-----
42427 | 0.01 | 1 | 3.0 | 0.00 |
42350 | #0 | 1 | 1.8 | 0.00 | 3
42352 | >0 | 1 | 0.5 | 18.13 |
42349 | | | | | 3 <-- line starting with integer
42367 | +0 | 1 | 0.6 | 0.00 |
42368 | 0.01 | 1 | 0.6 | 35.63 | 3
42369 | #0 | 1 | 0.3 | 0.00 |
42381 | | | | | 3 <-- line starting with integer
The code I have so far is almost there, but not quite:
data lines itins;
  infile in1 missover;
  input #1 first $1. @;
  if first in ('c','t') then delete;
  else if first='a' then do;
    input name $ type $;
    output lines;
  end;
  else do;
    input #1 path=$ dwt=$ anode ttf= us1= us2= us3= lay=;
    if path='no' then delete;
    output itins;
  end;
run;
The problems:
The "lines" table is correct, except I can't get rid of the quotes around the "name" values (e.g. 'mme006')
In the "itins" table, "ttf", "us1", and "us2" are being populated correctly. However, "anode" and "lay" are always null, and "dwt" has values like #0 4236 and 0.01 42, always 8 characters long, borrowing part of what should be in "anode".
What am I doing wrong?
DEQUOTE() will remove matched quotation marks.
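For example (a one-line sketch, applied where name is read in the data step):
name = dequote(name);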
Your problem with dwt is that you need to tell SAS what informat to use; if dwt is four characters long, use :$4. instead of just $.
However, anode is a problem. The solution I came up with is:
data lines itins;
  infile in1 missover;
  input #1 first $1. @;
  if first in ('c','t') then delete;
  else if first='a' then do;
    input name $ type $;
    output lines;
  end;
  else do;
    input #1 path= $ @;
    if path='no' then delete;
    else do;
      if substr(_infile_,5,1)='d' then do;
        input dwt= :$12. ttf= us1= us2= us3= lay=;
        anode = input(scan(dwt,2,' '), best.);
        dwt = scan(dwt,1,' ');
        output itins;
      end;
      else do;
        input @5 anode 5. lay=;
        output itins;
      end;
    end;
  end;
run;
Basically, check for path first; then, if it's not a path row, check for the 'd' in dwt. If that's present, read the line in that fashion, incorporating anode into dwt and splitting it back out afterwards. If it's not present, just read anode and lay.
If dwt can have widths other than 2-4, such that anode's position shifts, this won't work, and you'll have to explicitly figure out the position of anode to read it in properly, as sketched below.
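A hypothetical fallback along those lines: instead of relying on fixed columns, scan the raw record for the first all-digit token and take that as anode (a sketch under the assumption that anode is the only bare integer token on the line):
/* inside the non-path branch of the data step */
do _i = 1 to countw(_infile_, ' ');
  _tok = scan(_infile_, _i, ' ');
  if notdigit(strip(_tok)) = 0 then do;  /* 0 means every character is a digit */
    anode = input(_tok, best.);
    leave;
  end;
end;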