SPSS Merging Data with duplicate Keys - merge

I am currently attempting to join 2 datasets using SPSS syntax but am struggling as I have duplicate values on the keys. I would like for the joined data to be duplicated for each instance of the key on the source dataset (or other way round as it doesn't matter which is the source).
The datasets are like the following -
Data1 (3rd column placeholder)
batch
run
date
A
1
1
A
2
1
A
3
1
B
1
1
C
1
1
C
2
1
D
1
1
E
1
1
Data2
batch
Value1
Value2
A
1
21
A
2
22
A
3
23
A
4
24
B
5
25
B
6
26
B
7
27
B
8
28
C
9
29
C
10
30
C
11
31
C
12
32
D
13
33
D
14
34
D
15
35
D
16
36
E
17
37
E
18
38
E
19
39
E
20
40
Current attempt
What I have just now is a method where I CASETOVARS on Data1 before matching it onto Data2 and then VARSTOCASES to expand it out. This works perfectly with my test data but, unfortunately, it requires that I know exactly how many 'runs' there will be. That will not be known in production. It could be 1 or more.
Is there a method to join these datasets while expanding the joined data into the multliple cases in the source?
I am open to using macros but am not able to utilise Python solutions for this (which would probably be easier!).
edit - Unfortunately, extensions are also not possible for me to use.
CASESTOVARS
/ID = batch .
DATASET ACTIVATE data2 .
MATCH FILES
/FILE = *
/TABLE = data1
/BY batch .
EXECUTE .
VARSTOCASES
/MAKE run FROM BATCH_RUN_ID.1 TO BATCH_RUN_ID.3 .
EXECUTE .

If Python and dependent extention command are not availabe, here's an idea how to solve the dynamic list length for the varstocases phase.
What you'll do is basically to create a new dataset with the maximum number of runs possible, attach your read dataset to it, and then set the varstocases to go for that maximum number of runs (blank rows are dropped automatically):
dataset name orig.
data list free/throwthisrow (f1) BATCH_RUN_ID.1 to BATCH_RUN_ID.50 (50F8.2) .
begin data
1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
end data.
add files /file=* /file=orig .
EXECUTE.
select if missing(throwthisrow).
VARSTOCASES
/MAKE run FROM BATCH_RUN_ID.1 TO BATCH_RUN_ID.50 /drop throwthisrow.
EXECUTE .

To complete your present approach you can use spssinc select variables extention command (see examples of use here and here and here). You will use it to automatically create a list of the variables you want to name in your varstocases command, so that the syntax will automatically adapt itself to the number of runs in the data:
So after varstocases and match files:
spssinc select variables macroname="!from" /properties pattern = "BATCH_RUN_ID".
VARSTOCASES /MAKE run FROM !from .

Related

Can't get the written file content in q?

I've copy the exact example in q for mortals as follows:
q)h:hopen `:D:/q4m/raw
q)h[42]
548i
q)h 10 20 30
548i
q)hclose h
q)get `:D:/q4m/raw
'D:/q4m/raw
[0] get `:D:/q4m/raw
Look into the directory, the file was created there. Why can't I get it?
Instead, if I do:
q)h:hopen `:D:/q4m/L
q)h[42]
628i
q)h[10 20 30]
628i
q)hclose h
q)get `:D:/q4m/L
0 1 2 3 4 42 10 20 30
Things get normal, why?
After testing the given code I believe your issue may be in how you intialise the file.
I assume in the code that works that you use some variation of
`:D:/q4m/L set til 5
before.
However this is not done for
`:D:/q4m/raw
If you were to use
`:D:/q4m/raw set til 5
or alternatively
.[`:D:/q4m/raw;();:;()]
beforehand then the first set of code will work.
Additionally, if we look at the binary using
read1 `:D:/q4m/raw
and
read1 `:D:/q4m/L
and the output does not include 07 near the beginning then it is not being recognised as a proper kdb list. That is, hopen simply appends to the binary file instead of amending it. (If you notice the 05 byte that indicates length of the list, this doesn't increase when you add via the handle).
eg.
The first method you get
q)read1 `:D:/q4m/raw
0x2a000000000000000a0000000000000014000000000000001e00000000000000
which dosen't really mean anything in q.
The second method gives
q)read1 `:D:/q4m/L
0xfe2007000000000005000000000000000000000000000000010000000000000002000000000..
which is a proper kdb list (notice the 07 which indicates type).
If you wish to instead just read in /q4m/raw then I suggest setting an empty list, hopen to that list and pass it `:D:/q4m/raw as follows
q)`:empty set 0#0
`:empty
q)h:hopen `:empty
q)h read1 `:D:/q4m/raw
3i
q)get `:empty
42 10 20 30
This will only work if all entries are the same type.

finding matching pattern, performing calculations and shifting the columns of file

I need help to format below file with some calculation for a particular row having some pattern
hnt 1 454 454
gft 10 8844 8853
step 2 23 24
str 10 Check sum(00244-00240) 420 434
dert 03 14 16
ghh 33 Check sum(12366-12361) 8008 8046
I need to have four column file by performing subtraction for the row having text "check sum".
I wish to remove the text "Check sum" and then subtract the numbers given in ( ). For e.g. (00244-00240) will be subtracted and will be having value '4' and this '4' will be added to left column which has the value '10'.
so now this value will become '14'. After this calculation the other value on that row will shift left column wise. Thus making four columns table instead of six columns
The desired output is
hnt 1 454 454
gft 10 8844 8853
step 2 23 24
str 14 420 434
dert 03 14 16
ghh 38 8008 8046
I am new to shell scrip and appreciate your help to get above desired output using awk or sed or both. I am also ok if this can be achieved without using this awk and sed and by using other command in unix shell script
Perl can do the parsing and the addition:
perl -pe 's/(\d+)\s+Check sum\((\d+)([+-]\d+)\)/$1+$2+$3/e' file
hnt 1 454 454
gft 10 8844 8853
step 2 23 24
str 14 420 434
dert 03 14 16
ghh 38 8008 8046
If you want the output to look prettier, pipe the result through column -t
You can try this one liner:
sed "s/Check//;s/sum//;s/(//;s/)//" filename|awk '/-/{sub(/-/," ");for (i=1;i<=NF;i++);{calc=$3-$4;$2=$2+calc;$3="";$4=""}}1'
Note:
This answer does not retain the original white-spaces for the rows having calculation performed

Using sed to copy data between two numerical patterns to a new file

I'm running a bunch (~320) computational chemistry experiments and I need to pull a small amount of the data out of each of the files so that I can do some work on it in MatLab.
I'm pretty sure I can use sed to make this work, but try as I might I don't seem to be able to do so.
I need all of the data starting at the line beginning with "1 1" and ending with the line starting with "33 33".
I J FI(I,J) k(I,J) K(I,J)
1 1 -337.13279 -0.06697 -0.00430
2 2 3804.89120 8.52972 0.54787
3 3 3195.69653 6.01702 0.38648
4 4 3189.18684 5.99253 0.38490
5 5 3183.73262 5.97205 0.38359
6 6 3174.47525 5.93737 0.38136
7 7 3167.88746 5.91275 0.37978
8 8 1628.80868 1.56311 0.10040
9 9 1623.56055 1.55306 0.09975
10 10 1518.21620 1.35806 0.08723
11 11 1476.93012 1.28520 0.08255
12 12 1341.24087 1.05990 0.06808
13 13 1312.30373 1.01466 0.06517
14 14 1264.73004 0.94242 0.06053
15 15 1185.62592 0.82822 0.05320
16 16 1175.54013 0.81419 0.05230
17 17 1170.41211 0.80710 0.05184
18 18 1090.20196 0.70027 0.04498
19 19 1039.29190 0.63639 0.04088
20 20 1015.00116 0.60699 0.03899
21 21 1005.05773 0.59516 0.03823
22 22 986.55965 0.57345 0.03683
23 23 917.65537 0.49615 0.03187
24 24 842.93089 0.41863 0.02689
25 25 819.00146 0.39520 0.02538
26 26 758.39720 0.33888 0.02177
27 27 697.11173 0.28632 0.01839
28 28 628.75684 0.23292 0.01496
29 29 534.75856 0.16849 0.01082
30 30 499.35579 0.14692 0.00944
31 31 422.01320 0.10493 0.00674
32 32 409.30255 0.09870 0.00634
33 33 227.12411 0.03039 0.00195
33 2nd derivatives larger than 0.371D-04 over 561
MatLab is not a fan of text, so I'd like to not use text delimiters (though there are some in the header of this data section) and keep the data contained to only the numeric lines.
The data files contain a lot of other numbers as well, so I need to match the occurrence of "1 1" at the start of the line and "33 33" as the end of the copy. These 'indices' exist only in this block of info.
I attempted to use
% sed -n /"1 1"/,/"33 33"/p input.file > output.file
But I get a WHOLE BUNCH of data in the output file as it copies everything that shows up between any "1" and "33"
Is there any way to do what I'm looking for?
Also, I'm using the tcsh as that is what my servers run.
How about using awk
awk '$1=="1"&&$2=="1"{t=1};t;$1=="33"&&$2=="33"{t=0}' file
Recommand by #mklement0, if there is only one block, to avoid processing the remainder of the file you can update the command to:
awk '$1=="1"&&$2=="1"{t=1};t;$1=="33"&&$2=="33"{exit}' file
Your problem is twofold. First, there are two blanks between the ones, but your regex only allows for one (judging from the now indented code). Second, you are probably not precise enough; the /1 1/ pattern matches 11 11, for example, and 111 111 and so on.
So, you should consider:
sed -n -e '/^ *1 *1 /,/^33 *33 /p' -e '/^33 33 /q' input.file > output.file
The patterns are anchored to the start of line by the ^ (caret). The numbers are separated by one or more blanks (there are other, longer-winded ways of writing that in standard sed; the + option is not standard sed but is widely available). And the numbers are terminated by a blank. The chances are that the first expression alone will give you what you want. The second expression terminates the search early when it recognizes the 33 33 input line, which can save a significant amount of file I/O and hence processing time if the input file is big enough.
If the lines with ID numbers in the hundreds have some different format, then it should be fairly straight-forward to tweak the regexes to match what is used. If the data contains tabs instead of (or as well as) blanks, you can tweak the regexes to manage that, too.
If you data is all formatted exactly the same as this file, then you can use sed to just read the 3rd through the 35th line (rows 1 1 - 33 33). This is a lot easier than parsing the values, but does require that the files have a standard format:
sed -n 3,35p data.txt
Another cheap way would be to grep for only numeric lines, and take only the first 33:
grep "^[0-9 ][0-9 .-]*$" data.txt | head -n 33

copying every nth line and duplicating it on it's following line

I am trying to make test files for the project, and I figured in order to make a bradycardia test file from an example file of a normal ECG.
Therefore I would need to copy every third line and insert it into the next line.
for example:
a = [
1
2
3
4
5
6
7
8
9
10]
and I want:
b = [
1
2
3
3
4
5
6
6
7
8
9
9
10]
and so on... but since the file is 6000 characters long, obviously i cannot manually copy it. And I need it to be 9000 characters long I've tried looking online on how to do this, and am having no luck.
Any suggestions?
b=zeros(floor(4/3*length(a)),1);
b(1:4:end)=a(1:3:end);
b(2:4:end)=a(2:3:end);
b(3:4:end)=a(3:3:end);
b(4:4:end)=a(3:3:end);
Another way:
b = a(sort([1:numel(a) 3:3:numel(a)]))
And here is a third faster and simpler method
b = a(round(1:0.75:numel(a)))
This only works if length(a) is a multiple of 3, but seems to be faster than the other answers, at least for large vectors:
b = reshape([reshape(a,3,[]); a(3:3:end).'],[],1);

What do the numbers on a Git Diff header mean? [duplicate]

This question already has answers here:
What does the “##…##” meta line with at signs in svn diff or git diff mean?
(3 answers)
Closed 7 years ago.
Every time I run git diff, for each single changes I made, I get some sort of header with numbers, for example:
## -169,14 +167,12 ## function Browser(window, document, body, XHR, $log) {.....
I wonder what does the four numbers mean? I guess -169 means that this particular line of code that follows was originally in line 169 but now is in 167? And what do 14 and 12 mean?
This header is called set of change, or hunk. Each hunk starts with a line that contains, enclosed in ##, the line or line range from,no-of-lines in the file before (with a -) and after (with a +) the changes. After that come the lines from the file. Lines starting with a - are deleted, lines starting with a + are added. Each line modified by the patch is surrounded with 3 lines of context before and after.
An addition looks like this:
## -75,6 +103,8 ##
foo
bar
baz
+line1
+line2
more context
and more
and still context
That means, in the original file before line 78 (= 75 + 3 lines of context) add two lines. These will be lines 106 (= 103 + 3 lines of context) through 107 after all changes.
Note the difference in from numbers (-75 vs +103), this means that there were other changes in this file before this particular hunk, that added 28 (103 - 75) lines of code.
A deletion looks like this:
## -75,7 +75,6 ##
foo
bar
baz
-line1
more context
and more
and still context
That means, delete line 78 (= 75 + 3 lines of context) in the original file. The unchanged context will be on lines 75 to 80 after all changes.
Note that from numbers in this hunk are equal (-75 and +75), this means that either there were no changes before this hunk, or amount of added and deleted lines in previous changes are the same.
Finally, a change looks like this:
## -70,7 +70,7 ##
foo
bar
baz
-red
+blue
more context
and more
still context
That means, change line 73 (= 70 + 3 lines of context) in the file before all changes, which contains red to blue. The changed line is also line 73 (= 70 + 3 lines of context) in the file after all changes.
I wonder what does the four numbers mean?
Let's analyze a simple example
The format is basically the same the diff -u unified diff.
We start with numbers from 1 to 16 and remove 2, 3, 14 and 15:
diff -u <(seq 16) <(seq 16 | grep -Ev '^(2|3|14|15)$')
Output:
## -1,6 +1,4 ##
1
-2
-3
4
5
6
## -11,6 +9,4 ##
11
12
13
-14
-15
16
## -1,6 +1,4 ## means:
-1,6 means that this piece of the first file starts at line 1 and shows a total of 6 lines. Therefore it shows lines 1 to 6.
1
2
3
4
5
6
- means "old", as we usually invoke it as diff -u old new.
+1,4 means that this piece of the second file starts at line 1 and shows a total of 4 lines. Therefore it shows lines 1 to 4.
+ means "new".
We only have 4 lines instead of 6 because 2 lines were removed! The new hunk is just:
1
4
5
6
## -11,6 +9,4 ## for the second hunk is analogous:
on the old file, we have 6 lines, starting at line 11 of the old file:
11
12
13
14
15
16
on the new file, we have 4 lines, starting at line 9 of the new file:
11
12
13
16
Note that line 11 is the 9th line of the new file because we have already removed 2 lines on the previous hunk: 2 and 3.
Summary:
Assume git diff will output [0-3] lines of context [before/after] [first/last] changes
## -[original file's number of first line displayed],[context lines + removed lines] +[changed file's number of first line displayed],[context lines + added lines] ##