How to read this multi-format data in SAS?

I have an odd dataset that I need to import into SAS, splitting the records into two tables depending on formatting, and dropping some records altogether. The data is structured as follows:
c Comment line 1
c Comment line 2
t lines init
a 'mme006' M 8 99 15 '111 ME - RANDOLPH ST'
    path=no
    dwt=0.01 42427 ttf=1 us1=3 us2=0
    dwt=#0 42350 ttf=1 us1=1.8 us2=0 lay=3
    dwt=>0 42352 ttf=1 us1=0.5 us2=18.13
    42349 lay=3
a 'mme007' M 8 99 15 '111 ME - RANDOLPH ST'
    path=no
    dwt=+0 42367 ttf=1 us1=0.6 us2=0
    dwt=0.01 42368 ttf=1 us1=0.6 us2=35.63 lay=3
    dwt=#0 42369 ttf=1 us1=0.3 us2=0
    42381 lay=3
Only the lines beginning with a, dwt or an integer need to be kept.
For the lines beginning with a, the desired output is a table like this, called "lines", which contains the first two non-a values in the row:
name   | type
-------+------
mme006 | M
mme007 | M
For the dwt/integer rows, the table "itins" would look like so:
anode | dwt  | ttf | us1 | us2   | lay
------+------+-----+-----+-------+-----
42427 | 0.01 | 1   | 3.0 |  0.00 |
42350 | #0   | 1   | 1.8 |  0.00 | 3
42352 | >0   | 1   | 0.5 | 18.13 |
42349 |      |     |     |       | 3    <-- line starting with integer
42367 | +0   | 1   | 0.6 |  0.00 |
42368 | 0.01 | 1   | 0.6 | 35.63 | 3
42369 | #0   | 1   | 0.3 |  0.00 |
42381 |      |     |     |       | 3    <-- line starting with integer
The code I have so far is almost there, but not quite:
data lines itins;
    infile in1 missover;
    input @1 first $1. @;
    if first in ('c','t') then delete;
    else if first='a' then do;
        input name $ type $;
        output lines;
    end;
    else do;
        input @1 path= $ dwt= $ anode ttf= us1= us2= us3= lay=;
        if path='no' then delete;
        output itins;
    end;
run;
The problems:
The "lines" table is correct, except I can't get rid of the quotes around the "name" values (e.g. 'mme006')
In the "itins" table, "ttf", "us1", and "us2" are being populated correctly. However, "anode" and "lay" are always null, and "dwt" has values like #0 4236 and 0.01 42, always 8 characters long, borrowing part of what should be in "anode".
What am I doing wrong?

DEQUOTE() will remove matched quotation marks.
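For example, in the branch that reads the a lines:
input name $ type $;
name = dequote(name);
output lines;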
Your problem with dwt is that you need to tell named input which informat to use; without one it reads the variable's default length of 8 characters, blanks included, which is why dwt swallows part of anode. So if dwt is four long, use :$4. instead of just $.
However, anode is a problem. The solution I came up with is:
data lines itins;
    infile in1 missover;
    input @1 first $1. @;
    if first in ('c','t') then delete;
    else if first='a' then do;
        input name $ type $;
        output lines;
    end;
    else do;
        input @1 path= $ @;
        if path='no' then delete;
        else do;
            if substr(_infile_,5,1)='d' then do;
                input dwt= :$12. ttf= us1= us2= us3= lay=;
                anode=input(scan(dwt,2,' '),best.);
                dwt=scan(dwt,1,' ');
                output itins;
            end;
            else do;
                input @5 anode 5. lay=;
                output itins;
            end;
        end;
    end;
run;
Basically, check for path first; then if it's not a path row, check for the 'd' in dwt. If that's present, read the line that way, incorporating anode into dwt and then splitting it off later. If it's not present, just read in anode and lay.
If dwt can have widths other than 2-4 such that it might need to be shorter, then this probably won't work, and you'll have to explicitly figure out the position of anode to read it in properly.
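If you do end up in that situation, one option (a sketch, not tested against your full file) is to scan _infile_ for the first all-digit token instead of assuming a fixed starting column:
length _tok $20;
do _i = 1 to countw(_infile_, ' ');
    _tok = scan(_infile_, _i, ' ');
    if notdigit(strip(_tok)) = 0 then do; /* token is all digits: treat it as anode */
        anode = input(_tok, best12.);
        leave;
    end;
end;
drop _tok _i;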


first. and last. with multiple occurrences

I am trying to gather dates from a dataset using the first.variable and last.variable from SAS. Here is the code to create a reproducible example:
data example;
    infile datalines delimiter=",";
    input id $ code $ valid_from valid_to;
    format valid_from IS8601DA10. valid_to IS8601DA10.;
datalines;
1A,ABC,20058,20177
1A,DEF,20178,20481
1A,DEF,20482,20605
1A,DEF,20606,21548
1A,DEF,21549,21638
1A,DEF,21639,21729
1A,ABC,21730,21733
1A,ABC,21734,21808
1B,MNO,20200,20259
1B,PQR,20260,20269
1B,STU,20270,20331
1B,VWX,20332,20361
1B,VWX,20362,22108
1B,VWX,22109,22164
1B,VWX,22165,22165
1B,VWX,22166,2936547
;
run;
The idea is to get, for each id, only one observation per code, with the corresponding range of dates it covers.
Here is my code:
proc sort data=example out=example_sorted; by code valid_from; run;
data collapse_val_dates;
    set example_sorted;
    by code valid_from;
    if first.code = 1 and last.code = 1 then do;
        output;
    end;
    if first.code = 1 and last.code = 0 then do;
        hold = valid_from;
        retain hold;
    end;
    if first.code = 0 and last.code = 1 then do;
        valid_from = hold;
        output;
    end;
    drop hold;
run;
Here is the result (table collapse_val_dates):
+----+------+------------+------------+
| id | code | valid_from | valid_to |
+----+------+------------+------------+
| 1A | ABC | 2014-12-01 | 2019-09-16 |
| 1A | DEF | 2015-03-31 | 2019-06-29 |
| 1B | MNO | 2015-04-22 | 2015-06-20 |
| 1B | PQR | 2015-06-21 | 2015-06-30 |
| 1B | STU | 2015-07-01 | 2015-08-31 |
| 1B | VWX | 2015-09-01 | 9999-12-31 |
+----+------+------------+------------+
It produces what I expect for id=1B but not for id=1A. Indeed, since code=ABC appears once at the beginning and twice at the end, the result table puts valid_from=2014-12-01.
What I would like is the valid_from for code=ABC to be 2019-06-30. In other words, I would like SAS to "forget" the first occurrence of a code if one or more other codes appear in between. The final table would look like this:
+----+------+------------+------------+
| id | code | valid_from | valid_to |
+----+------+------------+------------+
| 1A | DEF | 2015-03-31 | 2019-06-29 |
| 1A | ABC | 2019-06-30 | 2019-09-16 |
| 1B | MNO | 2015-04-22 | 2015-06-20 |
| 1B | PQR | 2015-06-21 | 2015-06-30 |
| 1B | STU | 2015-07-01 | 2015-08-31 |
| 1B | VWX | 2015-09-01 | 9999-12-31 |
+----+------+------------+------------+
In a single serial pass over the data you can't output a date range for a code as soon as you see it end, because later rows in the id group may start a new run of the same code and replace the date range you want.
You will need to either
code multiple steps, or
perform a single pass and use temporary storage.
Both solutions below presume have is sorted by id and valid_from, and that valid_to never overlaps a succeeding valid_from within the id group.
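The question's dataset is named example, so mapping it to the name used here is just:
proc sort data=example out=have;
    by id valid_from;
run;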
Multiple Steps
Compute a group number for rows grouped by contiguous code for use in final ordering.
* multi step way;
data stage1;
    set have;
    by id code notsorted;
    if first.code then group_number+1;
run;

proc sort data=stage1 out=stage2;
    by id code group_number valid_from;
run;

* remember there can be multiple contiguous code groups within an id;
data stage3;
    do until (last.code);
        set stage2;
        by id code group_number;
        if first.group_number then _start = valid_from;
        if last.code then do;
            valid_from = _start;
            OUTPUT; /* date range of the last contiguous group of this code */
        end;
    end;
    drop _start;
run;

proc sort data=stage3 out=want(drop=group_number);
    by id valid_from;
run;
Single Pass
A DOW loop (a loop that has a SET statement within it) can compute a result over a group and subgroup and output one row per combination. Temporary storage can be a hash (for an arbitrary number of subgroups), or an array for an assumed maximum number of subgroups.
Example:
Temporary arrays of fixed size 1,000 are used to store per-code values that are updated while examining the group.
* find the range of the dates from the last set of contiguous rows of a code within id;
data want(keep=id code valid_:);
    array dates (1000,2) 8 _temporary_; /* ,1 for _from and ,2 for _to */
    array codes (1000) $50 _temporary_;
    array seq (1000) 8 _temporary_; /* sequence for output order */

    * process the id group;
    do _n_ = 1 by 1 until (last.id);
        set have;
        by id code notsorted;

        * save start of date range in temporary storage;
        if first.code then do;
            * linear search for the slot to use for this code;
            do _index = 1 by 1
                until (missing(codes(_index)) or codes(_index)=code);
            end;
            codes(_index) = code;
            dates(_index,1) = valid_from;
            seq(_index) = _n_ + _index / 1000; * encode order value with lookup index;
        end;

        * save end of date range;
        if last.code then
            dates(_index,2) = valid_to;
    end;

    *---;
    * output each code within id;
    call sortn (of seq(*)); * ascending by the start order of each code's last run;
    do _index = 1 to dim(seq);
        if missing(seq(_index)) then continue;
        * extract encoded information;
        _ix = round((seq(_index) - int(seq(_index))) * 1000);
        code = codes(_ix);
        valid_from = dates(_ix,1);
        valid_to = dates(_ix,2);
        OUTPUT;
    end;

    * clear out temporary arrays for next group processing;
    call missing (of dates(*), of codes(*), of seq(*));
run;
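And for completeness, a minimal sketch of the hash variant mentioned above (untested). Keying the hash by code means a later contiguous run of a code simply replaces the earlier run's date range, and a final sort restores date order:
data ranges(keep=id code valid_from valid_to);
    if _n_ = 1 then do;
        declare hash h();
        h.defineKey('code');
        h.defineData('code','valid_from','valid_to');
        h.defineDone();
        declare hiter hi('h');
    end;
    * one DOW pass per id group;
    do until (last.id);
        set have;
        by id code notsorted;
        if first.code then _start = valid_from;
        if last.code then do;
            valid_from = _start;
            rc = h.replace(); /* a later run of this code overwrites the earlier entry */
        end;
    end;
    * write one row per code for this id, then clear for the next group;
    rc = hi.first();
    do while (rc = 0);
        output;
        rc = hi.next();
    end;
    rc = h.clear();
run;

proc sort data=ranges out=want_hash;
    by id valid_from;
run;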

Computation of the number of uppercase letters in a string

I've bumped into a seemingly simple problem that I'm unable to solve. I would like to determine whether the number of uppercase letters is greater than the number of lowercase letters (ignoring special characters, spaces, etc.).
Example
id | text | upper_greater_lower | note
------------------------------------------------------------------
1 | Hello World | False | because |HW| < |elloorld|
2 | The XYZ | True | because |TXYZ| > |he|
3 | Foo!!! | False | because |F| < |oo|
4 | BAr??? | True | because |BA| > |r|
My initial idea was to determine the number of lowercase letters, then the number of uppercase letters, and finally compare them. However, I'm unable to do so in any elegant and efficient way.
I expect to handle ~30M rows with ~300 characters each.
What would you suggest?
Thanks!
Using regular expression magic, that could be:
SELECT length(regexp_replace(textcol, '[^[:upper:]]', '', 'g'))
> length(regexp_replace(textcol, '[^[:lower:]]', '', 'g'))
FROM atable;
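To see what the two regexp_replace() calls produce, here is a quick check against one of the sample strings from the question:
SELECT length(regexp_replace('The XYZ', '[^[:upper:]]', '', 'g')) AS n_upper, -- 4 (T, X, Y, Z)
       length(regexp_replace('The XYZ', '[^[:lower:]]', '', 'g')) AS n_lower; -- 2 (h, e)
Since 4 > 2, the comparison yields true for id=2, as expected.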

Column-wise autocomplete

I have a table in a PostgreSQL database with four columns that contain increasingly more detailed information (think state->city->street->number), along with a column where everything is concatenated according to some simple formatting rules. Example:
| kommun     | trakt        | block | enhet | beteckning             |
| Mora       | Gislövs Läge | 9     | 16    | Mora Gislövs Läge 9:16 |
| Mora       | Gisslaved    | *     | 8     | Mora Gisslaved 8       |
| Mora       | Gisslaved    | *     | 9     | Mora Gisslaved 9       |
| Lilla Edet | Sanda        | GA    | 1     | Lilla Edet Sanda GA:1  |
A web service uses this table to implement a word-wise autocomplete, where the user gets input suggestions as they drill down. An input of mora gis will result in
["Mora Gislövs", "Mora Gisslaved"]
Currently, this is done by splitting the concatenated column by word in this query:
select distinct trim(substring(beteckning from '(^(\S+\s?){NUMPARTS})')) as bet
from beteckning_ac
where upper(beteckning) like upper('mora gis%')
order by bet
Here NUMPARTS is the number of words in the input (2 in this case).
Now I want the autocomplete to be done column-wise rather than word-wise, so mora gis would now result in this instead:
["Mora Gislövs Läge", "Mora Gisslaved"]
Since the first two columns can contain an arbitrary number of words, I can no longer use the input to determine how many columns to include in the response. Is there a way to do this, or have I maybe gone about this autocomplete business all wrong?
CREATE OR REPLACE FUNCTION get_auto(text)
-- $1 is your input
RETURNS setof text
LANGUAGE plpgsql
AS $function$
declare
    NUMPARTS int := array_length(regexp_split_to_array($1, ' '), 1);
begin
    return query
    select distinct
        case
            when (NUMPARTS = 1) then kommun
            when (NUMPARTS = 2) then kommun||' '||trakt
            when (NUMPARTS = 3) then kommun||' '||trakt||' '||block
            when (NUMPARTS = 4) then kommun||' '||trakt||' '||block||' '||enhet
            -- alter if you want to
        end
    from
        auto_complete -- your table name here
    where
        upper(beteckning) like upper($1)||'%';
end;
$function$;
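With the case-insensitive match above, calling the function on the example input should return the column-wise suggestions (assuming the sample rows are loaded into auto_complete):
SELECT get_auto('mora gis');
-- Mora Gislövs Läge
-- Mora Gisslaved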

How to automatically calculate the SUS Score for a given spreadsheet in LibreOffice Calc?

I have several spreadsheets for a SUS-Score usability test.
They have this form:
|                                            | Strongly disagree |   |   |   | Strongly agree |
| I think that I would use this system often | x                 |   |   |   |                |
| I found the system too complex             |                   | x |   |   |                |
| (..)                                       |                   |   |   |   | x              |
| (...)                                      | x                 |   |   |   |                |
To calculate the SUS-Score you have 3 rules:
Odd item: Pos - 1
Even item: 5 - Pos
Add Score, multiply by 2.5
So for the first entry (odd item) you have: Pos - 1 = 1 - 1 = 0
Second item (even): 5 - Pos = 5 - 2 = 3
Now I have several of those spreadsheets and want to calculate the SUS-Score automatically. I've changed the x to a 1 and tried to use IF(F5=1;5-1). But I would need an IF-condition for every column: =IF(F5=1;5-1;IF(E5=1;4-1;IF(D5=1;3-1;IF(C5=1;2-1;IF(B5=1;1-1))))). Is there an easier way to calculate the score, based on the position in the table?
I would use a helper table and then SUM() all the cells of the helper table and multiply by 2.5. This formula (modified as needed, see notes below) can start your helper table and be copy-pasted to fill out the entire table:
=IF(D2="x";IF(MOD(ROW();2)=1;5-D$1;D$1-1);"")
Here D is an answer column.
Depending on which row (odd/even) your answers start in, you may need to change the =1 after the MOD function to =0.
This assumes the position numbers are in row 1; if they are in a different row, change the number after the $ accordingly.
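Once the helper table is filled, the final score is just its sum times 2.5. For example, if the helper table occupied H2:L11 (a hypothetical range; adjust to wherever you put it):
=SUM(H2:L11)*2.5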

Update intermediate result

EDIT
As requested, a little background on what I want to achieve. I have a table that I want to query, but I don't want to change the table itself. Next, the result of the SELECT query (what I called the 'intermediate table') needs to be cleaned a bit. For example, certain cells of certain rows need to be swapped and some strings need to be trimmed. Of course this could all be done as postprocessing in, e.g., Python, but I was hoping to do all of it with one query statement.
Being new to PostgreSQL, I want to update the intermediate table that results from a SELECT statement. So I basically want to edit the table resulting from a SELECT statement in one query, without having to store the intermediate result.
I've tried the following WITH clause:
with result as (
select
a
from
b
)
update result as r
set
a = 'd'
...but that results in ERROR: relation "result" does not exist, while the following does work:
with result as (
select
a
from
b
)
select
*
from
result
As I said, I'm new to Postgresql so it is entirely possible that I'm using the wrong approach.
A CTE defined in a WITH clause is just a named subquery, not a real relation, so it can't be the target of an UPDATE; that's why PostgreSQL reports that the relation does not exist. Depending on the complexity of the transformations you want to perform, though, you might be able to munge them into the SELECT itself, which would let you get away with a single query:
WITH foo AS (SELECT lower(name), freq, cumfreq, rank, vec FROM names WHERE name LIKE 'G%')
SELECT ... FROM foo WHERE ...
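For the kinds of cleanups described in the question, that could look something like the following (the column names col1..col3 and the swap condition are made up for illustration):
SELECT CASE WHEN col1 > col2 THEN col2 ELSE col1 END AS col1, -- swap the two cells when out of order
       CASE WHEN col1 > col2 THEN col1 ELSE col2 END AS col2,
       trim(col3) AS col3                                     -- trim stray whitespace
FROM b;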
Or, for more or less unlimited manipulation options, you could create a temp table that will disappear at the end of the current transaction. That doesn't get the job done in a single query, but it does get it all done on the SQL server, which might still be worthwhile.
db=# BEGIN;
BEGIN
db=# CREATE TEMP TABLE foo ON COMMIT DROP AS SELECT * FROM names WHERE name LIKE 'G%';
SELECT 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
----------+-------+---------+------+-----------------------
GREEN | 0.183 | 11.403 | 35 | 'KRN':1 'green':1
GONZALEZ | 0.166 | 11.915 | 38 | 'KNSL':1 'gonzalez':1
GRAY | 0.106 | 15.921 | 69 | 'KR':1 'gray':1
GONZALES | 0.087 | 18.318 | 94 | 'KNSL':1 'gonzales':1
GRIFFIN | 0.084 | 18.659 | 98 | 'KRFN':1 'griffin':1
(5 rows)
db=# UPDATE foo SET name = lower(name);
UPDATE 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
--------+-------+---------+-------+---------------------
grube | 0.002 | 67.691 | 7333 | 'KRP':1 'grube':1
gasper | 0.001 | 69.999 | 9027 | 'KSPR':1 'gasper':1
gori | 0.000 | 81.360 | 28946 | 'KR':1 'gori':1
goeltz | 0.000 | 85.471 | 47269 | 'KLTS':1 'goeltz':1
gani | 0.000 | 86.202 | 51743 | 'KN':1 'gani':1
(5 rows)
db=# COMMIT;
COMMIT
db=# SELECT * FROM foo;
ERROR: relation "foo" does not exist