Using diff/sdiff

I have one text file containing 78 numbers, and another text file containing 63 numbers that were pulled from the first file. Therefore, there are 15 numbers in text1 that don't exist in text2. How can I find out which ones these are?
I've tried commands such as "sdiff text1 text2", and cannot find these specific 15 numbers for the life of me. I'm sure it's simple, but I'm obviously missing it.

Use the comm utility.
E.g., in bash:
comm -23 <(sort -n textfile1) <(sort -n textfile2)
comm requires sorted input, hence the process substitutions.
By default, comm outputs 3 columns: lines only in file 1, lines only in file 2, and lines in both files.
-23 suppresses columns 2 and 3, i.e., the command outputs only the lines exclusive to textfile1.
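A minimal illustration with two throwaway sample files (the contents here are made up for the demo):
$ printf '%s\n' 1 2 3 4 5 > textfile1
$ printf '%s\n' 2 4 > textfile2
$ comm -23 <(sort -n textfile1) <(sort -n textfile2)
1
3
5
Note that comm compares lines in lexical sort order; for single-digit numbers numeric and lexical order coincide, but with mixed-width numbers GNU comm may warn that a numerically sorted input is out of order.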

Comm command issue

I'm trying to compare two gene lists and extract the common ones. I sorted my .txt files and used the comm command:
comm gene_list1.txt gene_list2.txt
Strangely, when I check the output, there are many common genes that are not printed in the third column. Here is part of the output:
As you can see, AAAS and AAGAB etc. exist in both files, but they are not printed as common lines! Any idea why this happens?
Thank you
$ comm file1.txt file2.txt
The output of the above command contains three columns: the first column starts at the beginning of the line (no leading tab) and contains names only present in file1.txt.
The second column contains names only present in file2.txt and is indented by one tab.
The third column contains names common to both files and is indented by two tabs from the beginning of the line.
This is the default output format of the comm command when no option is used.
I am assuming both input files are sorted; comm only pairs up lines when both inputs are in the same sorted order, which is the usual reason common lines fail to show up in the third column. The required command for your use case would then be
$ comm -12 gene_list1.txt gene_list2.txt
This suppresses columns 1 and 2 (they are not displayed), since you are only interested in the elements common to both files.
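A quick sketch with made-up gene names (both files already in sorted order):
$ printf '%s\n' AAAS AAGAB BRCA1 > gene_list1.txt
$ printf '%s\n' AAAS AAGAB TP53 > gene_list2.txt
$ comm -12 gene_list1.txt gene_list2.txt
AAAS
AAGAB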

Insert filename into text file with sed

I've been learning about sed and finding it very useful, but cannot find an answer to this in any of the many guides and examples ... I'd like to insert the filename of a text file, minus its path and extension, into a specific line within the text itself. Possible?
In such cases, the correct starting point is the man page. The manual of sed does not provide a feature for sed to know the current filename, but sed does support inserting text before/after a line.
As a result, you need to isolate the filename separately, store the text in a variable, and inject this text after/before the line you wish.
Example:
$ a="/home/gv/Desktop/PythonTests/cpu.sh"
$ a="${a##*/}";echo "$a"
cpu.sh
$ a="${a%.*}"; echo "$a"
cpu
$ cat file1
LOCATION 0 X 0
VALUE 1a 2 3
VALUE 1b 2 3
VALUE 1c 2 3
$ sed "2a $a" file1 # Inject the contents of variable $a after line2
LOCATION 0 X 0
VALUE 1a 2 3
cpu
VALUE 1b 2 3
VALUE 1c 2 3
$ sed "2i $a" file1 # Inject the contetns of variable $a before line2
LOCATION 0 X 0
cpu
VALUE 1a 2 3
VALUE 1b 2 3
VALUE 1c 2 3
$ sed "2a George" file1 #Inject a fixed string "George" after line 2
LOCATION 0 X 0
VALUE 1a 2 3
George
VALUE 1b 2 3
VALUE 1c 2 3
Explanation:
a="${a##*/}" : Removes all chars from the beginning of string up to last found slash / (longer match)
a="${a%.*}" : Remove all chars starting from the end of the string up to the first found dot . (short match) . You can also use %% for the longest found dot.
sed "2a $a" : Insert after line 2 the contents of variable $a
sed "2i $q" : Insert before line 2 the contents of $a
Optionally, you can use sed -i to make the changes in-place, i.e., directly in the file being processed.
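Putting it together with sed -i (a GNU sed option), a minimal sketch reusing the example above; file1 is modified in place:
$ a="/home/gv/Desktop/PythonTests/cpu.sh"
$ a="${a##*/}"; a="${a%.*}"
$ sed -i "2i $a" file1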
Regarding I've been learning about sed: you may have been wasting your time, as there isn't a lot TO learn about sed beyond s/old/new. Sure, there's a ton of other language constructs and things you could do with sed, but in practice you should avoid them all and simply use awk instead. If you edit your question to include concise, testable sample input and expected output and add an awk tag, then we can show you how to do whatever you want to do the right way.
Meanwhile, it sounds like all you need is:
$ cat /usr/tmp/file
a
b
c
d
e
$ awk 'NR==3{print gensub(/.*\//,"",1,FILENAME)} 1' /usr/tmp/file
a
b
file
c
d
e
The above inserts the current file name, with the leading path stripped, before line 3 of the file being read. It uses GNU awk for gensub(); with other awks you'd just use sub() and a variable.
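For awks without gensub(), a sketch of the sub()-and-variable approach mentioned above, which also strips an extension as the question asked (the regexes are my own):
$ awk 'NR==3{f=FILENAME; sub(/.*\//,"",f); sub(/\.[^.]*$/,"",f); print f} 1' /usr/tmp/file
The output is the same as above, since /usr/tmp/file has no extension to strip.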

How to use sed or gawk to delete a text block up to the third line before the last one

Good day,
I was wondering how to delete a text block like this:
1
2
3
4
5
6
7
8
and delete everything after the second line up to the third line before the last one, to obtain:
1
2
6
7
8
Thanks in advance!!!
BTW this text block is just an example; the real text blocks I'm working on are huge, and they differ from one another in the number of lines.
Getting the number of lines with wc and using awk to print the requested range:
$ awk 'NR<M || NR>N-M' M=3 N="$(wc -l < file)" file
1
2
6
7
8
This allows you to easily change the range by just changing the value of M.
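For example, with M=2 the command keeps one line at the top and two at the bottom (generating the sample input with seq for the demo):
$ seq 8 > file
$ awk 'NR<M || NR>N-M' M=2 N="$(wc -l < file)" file
1
7
8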
This might work for you (GNU sed):
sed '3,${:a;$!{N;s/\n/&/3;Ta;D}}' file
or if you prefer:
sed '1,2b;:a;$!{N;s/\n/&/3;Ta;D}' file
These always print the first two lines, then maintain a running window of the lines that follow.
Until the end of the file is reached, the oldest line is popped off the window and deleted; at the end of the file the three lines remaining in the window are printed.
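For example (again generating the sample input with seq):
$ seq 8 | sed '3,${:a;$!{N;s/\n/&/3;Ta;D}}'
1
2
6
7
8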
Since you mentioned the files are huge and their line counts differ from file to file, I would suggest this awk one-liner:
awk 'NR<3{print;next}{delete a[NR-3];a[NR]=$0}END{for(x=NR-2;x<=NR;x++)print a[x]}' file
it processes the input file only once, without pre-calculating the total number of lines
it keeps minimal data in memory: at any point during processing, only 3 lines are stored.
If you want to change the filtering criteria, for example removing from line x to the y-th line before the end, you simply change the offsets in the one-liner (a generalized sketch follows the test below).
Add a test:
kent$ seq 8|awk 'NR<3{print;next}{delete a[NR-3];a[NR]=$0}END{for(x=NR-2;x<=NR;x++)print a[x]}'
1
2
6
7
8
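A sketch of that generalization, with x (the first line to delete) and y (the number of lines to keep at the end) as awk variables introduced here for illustration:
kent$ seq 8|awk -v x=3 -v y=3 'NR<x{print;next}{delete a[NR-y];a[NR]=$0}END{for(i=NR-y+1;i<=NR;i++)print a[i]}'
1
2
6
7
8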
Using sed:
sed -n '
## Append second line, print first two lines and delete them.
N;
p;
s/^.*$//;
## Read next three lines removing leading newline character inserted
## by the "N" command.
N;
s/^\n//;
N;
:a;
N;
## I will keep three lines in buffer until last line when I will print
## them and exit.
$ { p; q };
## Not last line yet, so remove one line of the buffer based on a FIFO algorithm.
s/^[^\n]*\n//;
## Goto label "a".
ba
' infile
It yields:
1
2
6
7
8

Ubuntu: Splitting a file into three files with a third of the total number of lines in each

I have a simple ASCII text file with a string on each line, something like
aa1
aa2
ab1
...
with a total of N lines. I know I can use the split command to split it into a fixed number of lines per file. How do I specify the number of files I want to split it into and let split decide how many lines go into each file? For example, if the file had 100 lines, I want to be able to specify
split 3 foo.txt
and have it write out three files xaa, xab and xac with 33, 33 and 34 lines respectively. Is this even possible? Or do I have to write a custom Perl script for this?
Try doing this:
split -n l/3 file
(The l/3 form splits into 3 files without breaking lines; a plain -n 3 splits by byte count and may split a line across files.)
see
man split | less +/'^\s*-n'
There's no option for that[*].
You could use wc to get the number of lines and divide by 3, so it's a few lines of whatever scripting language you want to use.
([*] Update: on Ubuntu there is, and that's what the question was about; -n does not seem to be available on all Linux systems, or in older versions of split.)
If your split implementation doesn't accept the -n parameter you can use this bash function:
split_n() { split -l $(( ($1 + $(wc -l < "$2") - 1) / $1 )) "$2" "${3:-$2.}"; }
You can invoke it as
split_n 3 file.txt
or
split_n 3 file.txt prefix
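The arithmetic is a ceiling division, i.e., lines_per_file = ceil(total_lines / $1), so the last output file may come up short. For example (file names are just for the demo):
$ seq 100 > file.txt
$ split_n 3 file.txt
$ wc -l file.txt.a*
 34 file.txt.aa
 34 file.txt.ab
 32 file.txt.ac
100 total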
Given your comment that you do not have the -n option in your split, here is a slightly hackier approach you could take.
lines=$(wc -l < foo.txt)
lines=$((lines/3+1))
split -l "$lines" foo.txt
If you do this often you could store it in a script, taking the number of splits and the filename as arguments:
#!/bin/bash
splits=$1
filename=$2
lines=$(wc -l < "$filename")
lines=$((lines / splits + 1))
split -l "$lines" "$filename"
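Saved as, say, splitfile.sh (a hypothetical name) and made executable, it would be invoked as:
$ ./splitfile.sh 3 foo.txt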

Keep the content of a text file with matching columns in the command line

Basically I am trying to operate on files in the command line like this:
File1:
,1,this is some content,
,2,another content,
,3,blablabla,
,4,xxxxxxxx,
,5,yyyyyyyy,
,6,zzzzzzzzzz,
... ...
File2:
1
3
4
5
Now I want to keep the lines of File1 whose numbers also appear in File2, so the output should be:
,1,this is some content,
,3,blablabla,
,4,xxxxxxxx,
,5,yyyyyyyy,
I used comm -3 file1 file2 but it doesn't work. Then I tried sed, but that didn't work either. Is there any other handy tool?
The following will work on the example as given; it won't work if numbers appear in your strings after the comma:
grep -F -f File2 File1
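With the sample files above, this gives:
$ grep -F -f File2 File1
,1,this is some content,
,3,blablabla,
,4,xxxxxxxx,
,5,yyyyyyyy,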
An alternative would be
join -t, -1 2 -2 1 -o 1.1,1.2,1.3 File1 File2
Here is how that works:
-t, treats the , as the field separator
-1 2 look at the second column in file 1
-2 1 look at the first column in file 2
-o 1.1,1.2,1.3 output the first, second and third columns of file 1 (note the list must be a single argument, without spaces)
This still has the drawback that if there are multiple commas in the text that follows, the output is truncated after the first of them ("field 3" is the last one output).
Fixing that issue requires the use of xargs:
join -t, -1 2 -2 1 -o 1.1,1.2 File1 File2 | xargs -Ixx grep xx File1
Explanation:
-Ixx : replace the string xx in the command that follows with each of the output lines from the preceding command, then execute that command for each line. This means we will find the lines that match the leading ,number, part, which should make us insensitive to anything else.
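With the sample File1 and File2 above, the full pipeline should then print the complete lines:
$ join -t, -1 2 -2 1 -o 1.1,1.2 File1 File2 | xargs -Ixx grep xx File1
,1,this is some content,
,3,blablabla,
,4,xxxxxxxx,
,5,yyyyyyyy,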