Delete lines in a file that start with specific strings - sed

I have some files that look like this:
Node Present
1 243
2 445
10 65
4 456
43 8
...
I need to remove the values corresponding to specific nodes, and I have a file specifying these nodes that looks like this:
1
4
...
The idea is to delete the lines that start with the values specified in my second file. I know that "sed" can do something like this, but I do not know how to apply it for all the values specified in the second file. Moreover, I want to delete node 1 but not node 100, and with my approach node 100 also gets erased:
sed '/^1/d'

This sort of problem is typically done with awk, and is quite common. Read the first file into an array, and then use it to process subsequent files. For example:
$ cat skip
1
4
$ cat input
Node Present
1 243
2 445
10 65
4 456
43 8
...
$ cat input2
Node Present
1 243
2 445
4 456
43 8
4 587
...
$ awk 'NR==FNR{a[$1] = 1; next} ! a[$1] && FNR > 1' skip input input2
2 445
10 65
43 8
...
2 445
43 8
...
The initial NR == FNR causes those commands to be executed only on the first file, loading the array with the ids you wish to skip. For the subsequent files, a line is printed when its first column does not appear in the array and it is not the header line, which is what FNR > 1 enforces. (FNR is the per-file record number, i.e. the line number within the current file, while NR counts lines across all files.)
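If you would rather have each data file's filtered result written to its own output file instead of everything going to stdout, a small variant (a sketch of mine, not part of the original answer; the .filtered suffix is arbitrary) redirects each print to a file named after the current input:
$ awk 'NR==FNR{a[$1] = 1; next} ! a[$1] && FNR > 1 {print > (FILENAME ".filtered")}' skip input input2
This creates input.filtered and input2.filtered, leaving the original files untouched.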

sed is not the right tool for this job. I suggest using awk like this:
awk 'NR == FNR {ids[$1]; next} FNR == 1 || !($1 in ids)' ids nodes
Node Present
2 445
10 65
43 8
where the input files are:
cat ids
1
4
cat nodes
Node Present
1 243
2 445
10 65
4 456
43 8

If you want to stick with sed and bash, here is one possible way to do it. Please note that, for large files, this will take orders of magnitude longer to run than the awk answers (similar to the time-complexity difference between a linear search and a binary search).
sed -f <(sed 's/^/\/^/;s/[[:blank:]]*$/[[:blank:]]\/d/' ids) nodes
Here ids is the filename of the single-column file and nodes is that of the two-column file.
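To see what this does, the inner sed turns each id into an anchored delete command, so for the ids above the script fed to sed -f looks like this:
/^1[[:blank:]]/d
/^4[[:blank:]]/d
which is the anchored form that deletes node 1 but leaves node 100 alone.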

Related

rsync tries to copy although the file is already copied

I'm trying to understand the following; maybe one of you guys can help me out by explaining what exactly happens here:
My goal is to write a script which copies the files found by a find command, runs the rsync commands in parallel and backs them up, re-creating the folder structure from the source on the destination as well.
Somehow my initial script does not work like that, and I don't really have an idea of how to fix it to behave that way.
Initially, I copied the files with "ls -1 /foldername" as sourceFolder, which is not best practice, so I tried to change it to using "find" as described below.
find "$sourceFolder" -maxdepth 2 -mindepth 2 -print0 | xargs --verbose -0 -I {} -P $maxThreads -n 1 rsync -valhP {}/ "$destFolder"/ --checksum --human-readable --stats --dry-run >> "$logDir"/result.log
--
If I run this script, it literally recopies the file(s), although they already exist in the destination folder where I would expect them to be.
sending incremental file list
.d..tp..... ./
>f+++++++++ macs.txt
Number of files: 2 (reg: 1, dir: 1)
Number of created files: 1 (reg: 1)
Number of deleted files: 0
Number of regular files transferred: 1
Total file size: 242 bytes
Total transferred file size: 242 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.484 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 83
Total bytes received: 22
So as it turned out, sourceFolder and destFolder have to match, and I can't run anything like sourceFolder/folder1/folder1a destFolder/.
Can anyone help me out with what's wrong and why it behaves like that?
Thanks a Lot,
M.
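Not from the original thread, but one common way to get the source folder structure recreated on the destination is rsync's --relative (-R) option together with a /./ marker in the path, which tells rsync where the reproduced part of the path should start. A hedged sketch reusing your variable names:
find "$sourceFolder"/./ -maxdepth 2 -mindepth 2 -print0 | xargs --verbose -0 -I {} -P "$maxThreads" -n 1 rsync -valhPR --checksum {} "$destFolder"/ >> "$logDir"/result.log
With -R, a path such as /source/./folder1/folder1a is copied to destFolder/folder1/folder1a instead of being flattened into destFolder/.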

Read .mat files from Unix shell

Is there a way to read .mat files from the UNIX command line shell?
Such as cat mymat.mat?
I am aware of the possibility of loading it in MATLAB or Python, but these are not available to me at the moment.
GNU Octave may be an option, as it can be installed free of charge.
Say you ran a session something like this and created two arrays, A and B:
octave:1> A = [ 1:3; 4:6; 7:9 ];
octave:2> B = [ 11:13; 14:16; 17:19 ];
octave:3> save -7 myfile.mat A B
Then, in the shell, outside of Octave, you could do this to see the names of the variables in the file:
$ octave-cli <<< "who -file myfile.mat"
Sample Output
Variables in the file myfile.mat:
A B
And then this to dump the variables:
$ octave-cli <<< "load myfile.mat;A"
Sample Output
A =
1 2 3
4 5 6
7 8 9
And:
$ octave-cli <<< "load myfile.mat;B"
Sample Output
B =
11 12 13
14 15 16
17 18 19
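If you want output that plain shell tools like cat or awk can consume directly, one further option (a hedged sketch of mine, not part of the answer above) is to have Octave re-save the data in its plain-text format:
$ octave-cli <<< "load myfile.mat; save -text myfile.txt A B"
$ cat myfile.txt
save -text writes the variables as human-readable numbers, preceded by short header comments naming each variable.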
No. .mat-files are saved in a binary format. You will not be able to interpret their content from the UNIX shell.

Adding data to a text file with SED not changing the file size

I have some text files where I need to add 1 character to the beginning of every line of the file.
In Windows, I found that a quick way to do this was by installing Cygwin and using the following command, which prepends the letter N to every line of the file:
$ sed 's/^/N/' inputFile.txt > outputFile.txt
What I found strange was that after I added a new character to the front of each line, the file size was almost completely unchanged. I tested this further, to see if I could recreate the problem with the following steps:
Created a text file called "Test.txt", which had 10,000 lines with the word "TEST" on each line.
Created a text file called "TestWithNPrefix.txt" which had 10,000 lines with the word "NTEST" on each line.
Executed the following command to create another file which had 10,000 lines of "NTEST"
$ sed 's/^/N/' Test.txt > "SEDTest.txt"
Results
"Test" and "SEDTest" were almost the exact same size, while "TestWithNPrefix" was 10KB larger.
Test = 59,998 Bytes; SEDTest = 59,999 Bytes; TestWithNPrefix = 69,998 Bytes
When I ran the "fc" command in Command Prompt, it returned that there were no differences between "SEDTest" and "TestWithNPrefix". "FC" between "SEDTest" and "Test" returned "Resync Filed. Files are too different".
Can someone please help me understand what is causing these file size discrepancies?
EDIT: I created the files "Test.txt" and "TestWithNPrefix.txt" in UltraEdit. I just typed out the word "TEST"/"NTEST", then copied and pasted it 10,000 times.
Not an answer, but a comment with formatting:
You seem to be running into some odd situation with DOS versus Unix line endings. I have to ask: how are you creating the files? I would expect 10,000 lines of "TEST\r\n" to be exactly 60,000 bytes in size, not 59,998.
On Linux (I don't have access to a Cygwin environment at the moment):
$ yes $'TEST\r' | head -n 10000 > Test
$ ll Test
-rw-r--r-- 1 jackman jackman 60000 Jan 8 13:06 Test
$ sed 's/^/N/' Test > SEDTest
$ ll *Test
-rw-r--r-- 1 jackman jackman 70000 Jan 8 13:06 SEDTest
-rw-r--r-- 1 jackman jackman 60000 Jan 8 13:06 Test
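One quick way to pin down which line endings the UltraEdit-created files actually contain (a hedged suggestion of mine, not part of the original exchange) is to dump the first bytes and look for \r \n pairs versus bare \n:
$ od -c Test.txt | head -n 2
file Test.txt will usually tell you as well, reporting something like "ASCII text, with CRLF line terminators" for DOS-style files.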

delete the lines that don't have 4 columns

I have some files as follows. I would like to delete the lines that don't have 4 columns.
263 126.9 263 50.2
264 76.6 264 6.2
265 62.3 265 49.9
266 84.2 266 18.3
7 63.8
8 59.7
9 36.4
11 12.0
Desired output
263 126.9 263 50.2
264 76.6 264 6.2
265 62.3 265 49.9
266 84.2 266 18.3
That's straightforward with awk:
awk 'NF==4' input.txt
NF is the number of fields; if it is equal to 4, the default action is executed, which is to print the whole line.
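Since you mention having several files, a hedged usage sketch for rewriting each one in place (the filenames and the temporary-file approach are mine, not part of the answer):
for f in file1 file2 file3; do
    awk 'NF==4' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done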
The same can be done with sed:
sed '/\([^[:blank:]]\{1,\}[[:blank:]]\{1,\}\)\{3\}[^[:blank:]]\{1,\}/ !d'
For this sample, where the fields are numeric, the pattern can be tightened to:
sed '/\([0-9.]\{1,\}[[:blank:]]\{1,\}\)\{3\}[0-9.]\{1,\}/ !d'
Taking sudo_O's remark into account so that ONLY 4-column lines are kept (the versions above also allow lines with more columns), and with a parameter variable for the number of columns (which must be at least 2):
ColNbr=4
ColBefore=$(( ${ColNbr} - 1 ))
sed "/^\([^[:blank:]]\{1,\}[[:blank:]]\{1,\}\)\{${ColBefore}\}[^[:blank:]]\{1,\}[[:blank:]]*$/ !d"
As he states, sed is not the best tool to use in this case (read the comments for the arguments).
This might work for you (GNU sed):
sed -r 's/\S+/&/4;t;d' file
The substitution only succeeds if the line has a fourth non-blank field; in that case t branches to the end of the script and the line is printed, otherwise d deletes it.

US-ASCII encoding with Odd and Even numbers?

I split the list of numbers 1-99 into files of 2 bytes each. Then I noticed that each odd number between 11-99 needs 2 files, i.e. 4 bytes, while each even number between 11-99 needs 1 file, i.e. 2 bytes. A single file is enough for the numbers between 1-10. You can confirm the finding below.
How can you explain the finding?
What did I do?
Save the numbers to a file, e.g. in Vim:
:%!python -c "for i in range(1,100): print i"
:w! list_0_100
Split the file into 2-byte files with the GNU split command from Coreutils:
$ split -b2 -d list_0_100
Finding: each odd number between 11-99 needs two files, i.e. 4 bytes. In contrast, each even number between 11-99 needs only one file, i.e. 2 bytes:
$ head x*
==> x07 <==
8
==> x08 <==
9
==> x09 <==
10
==> x10 <==
1
==> x11 <==
1
==> x12 <==
12
==> x13 <==
1
==> x14 <==
3
==> x15 <==
14
==> x16 <==
1
==> x17 <==
5
Each number greater than 9 requires three bytes (one byte for each digit, and one more byte for the newline). Writing _ instead of the newline character (for visibility), you have this:
10_11_12_13_14
After the split:
10 _1 1_ 12 _1 3_ 14
The even numbers happen to lie within one file, but the odd numbers get split across two files.
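You can see the same thing by dumping the file's bytes (a hedged illustration of mine; the exact od column layout may vary slightly between platforms). The nine single-digit numbers occupy 2 bytes each, 18 bytes in total, so "10" starts at an even offset and its two digits land in a single 2-byte chunk; from then on the digits of each even number fall inside one chunk while the digits of each odd number straddle a chunk boundary:
$ od -c list_0_100 | head -n 2
0000000   1  \n   2  \n   3  \n   4  \n   5  \n   6  \n   7  \n   8  \n
0000020   9  \n   1   0  \n   1   1  \n   1   2  \n   1   3  \n   1   4
Pairing these bytes off two at a time reproduces exactly the x09, x10, x11, ... files shown in the head output above.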