Comm command issue

I'm trying to compare two gene lists and extract the common ones. I sorted my .txt files and used the comm command:
comm gene_list1.txt gene_list2.txt
Strangely, when I check the output, there are many common genes that are not printed in the third column. Here is part of the output:
As you can see, AAAS and AAGAB etc. exist in both files, but they are not printed as common lines! Any idea why this happens?
Thank you

$ comm file1.txt file2.txt
The output of the above command contains three columns. The first column has no leading tab and contains names present only in file1.txt.
The second column contains names present only in file2.txt and is prefixed by one tab.
The third column contains names common to both files and is prefixed by two tabs from the beginning of the line.
This is the default output format produced by the comm command when no options are used.
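For example, with two hypothetical, already-sorted gene lists (the file names and contents are made up for illustration), the default three-column output looks like this:

$ printf 'AAAS\nAAGAB\nBRCA1\n' > gene_list1.txt
$ printf 'AAAS\nAAGAB\nTP53\n' > gene_list2.txt
$ comm gene_list1.txt gene_list2.txt
		AAAS
		AAGAB
BRCA1
	TP53

Here AAAS and AAGAB (common) are indented by two tabs, TP53 (only in gene_list2.txt) by one tab, and BRCA1 (only in gene_list1.txt) by none.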
Assuming both input files are sorted, the required command for your use case would be:
$ comm -12 gene_list1.txt gene_list2.txt
The -12 option suppresses columns 1 and 2 (they are not displayed), since you are only interested in the elements common to both files.
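With the same hypothetical files as above, only the common genes remain:

$ comm -12 gene_list1.txt gene_list2.txt
AAAS
AAGAB

If common genes are still missing with real data, the two files were most likely sorted under different locale settings; re-sorting both with the same locale (for example LC_ALL=C sort) before running comm usually resolves that.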


Joining specific lines in file

I have a text file (snippet below) containing some public-domain corporate earnings report data, formatted as follows:
Current assets:
Cash and cash equivalents
$ 21,514 $ 21,120
Short-term marketable securities
33,769 20,481
Accounts receivable
12,229 16,849
Inventories
2,281 2,349
and what I'm trying to do (with sed) is the following: if the current line starts with a capital letter, and the next line starts with whitespace, copy the last N characters from the next line into the last N columns of the current line, then delete the next line. I'm doing it this way, because there are other lines in the files that begin with whitespace that I want to ignore. The results should look like the following:
Current assets:
Cash and cash equivalents $ 21,514 $ 21,120
Short-term marketable securities 33,769 20,481
Accounts receivable 12,229 16,849
Inventories 2,281 2,349
The closest I've come to getting what I want is:
sed -i -r ':a;N;$!ba;s/[^A-Z]*\n([[:space:]])/\1/g' file.txt
and I believe I've got the pattern matching right, but the subsequent substitution really messes up the alignment of the columns of numbers. When I first started this, it seemed like a simple operation, but hours of searching and experimenting haven't helped. I'm open to solutions that use something other than sed, but would prefer to keep it strictly bash. Thank you much!
This might work for you (GNU sed):
sed -r '/^[[:upper:]]/{N;/\n\s/{h;x;s/\n.*//;s/./ /g;x;G;s/(\n *)(.*)\1$/\2/};P;D}' file
This solution only processes pairs of consecutive lines where the first starts with an upper-case letter and the second starts with whitespace. All other lines are printed as-is.
Having gathered the two lines into the pattern space (PS), a copy is made and stored in the hold space (HS). Processing then swaps to the HS: the second line is removed and every character of the first line is turned into a space. Processing swaps back to the PS and the HS is appended to it. Using pattern matching and back references, a run of spaces equal to the length of the first line is stripped from the front of the second line, so the remaining text stays in its original columns when the two lines are joined.
The joined line is printed and then deleted. If the second line did not begin with a space, the P and D commands print only the first line and restart the script with the second line, so it is re-appraised by the regexp at the start of the sed script.
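If strict column alignment is not essential, a hypothetical awk sketch (assuming the numeric lines really do begin with whitespace, as described in the question) can buffer each capital-letter line and simply append the following whitespace-led line to it:

$ awk '
    {
      if (pending != "") {
        if ($0 ~ /^[[:space:]]/) { print pending $0; pending = ""; next }   # join the pair
        print pending; pending = ""                                         # no join; flush the held line
      }
      if ($0 ~ /^[[:upper:]]/) { pending = $0; next }                       # hold for a possible join
      print
    }
    END { if (pending != "") print pending }
  ' file.txt

Unlike the sed answer above, this keeps the second line's leading whitespace as-is rather than trimming it to preserve the original column positions.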

adding delimiters to end of file

I am working on a TPT script to process some large files we have. Right now, each record in the file uses | as the field delimiter.
The problem is that not all fields are used by each record. For example, record 1 may have 100 fields and record 2 may have 260. For TPT to work, we need a delimiter for each field, so for the records that have fewer than 261 fields populated, I need to append the appropriate number of pipes to the end of each record.
So, taking my example above, record one would have 161 pipes appended to the end and record two would have 1.
I have a perl script which will count the number of pipes in each record, but I am not sure how to take that information and append that many pipes to the end of the record.
perl -ne 'print scalar(split(/\|/, $_)) . "\n"'
Any advice?
To get the number of pipe symbols, you can use the tr operator.
my $count = tr/|//;
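For example, on a hypothetical two-record sample file, the per-record pipe counts look like this:

$ printf 'a|b|c\nd|e|f|g|h|i\n' > sample.txt
$ perl -lne 'print tr/|//' sample.txt
2
5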
Subtract the number of pipe symbols from the maximum to get the number of pipes to add, and use the x (repetition) operator to produce them:
perl -lne 'print $_, "|" x (260 - tr/|//)'
I'm not sure the number 260 is correct; it depends on whether a pipe also starts or ends the line.
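Continuing the hypothetical sample above, and pretending the maximum is 5 pipes per record instead of 260, the one-liner pads each short record up to that maximum:

$ perl -lne 'print $_, "|" x (5 - tr/|//)' sample.txt
a|b|c|||
d|e|f|g|h|i

For the real files, replace 5 with whatever maximum the TPT layout actually requires.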

Remove similar rows after merging 2 files in Notepad++

I have two files, file1.txt and file2.txt, both containing client mailing lists. I want to:
1. Merge the files.
2. Exclude the entries that appear in both files.
example:
file1.txt
email1
email2
email3
email4
file2.txt
email5
email6
email4
email1
email8
The result should be:
email2
email3
email5
email6
email8
How can I do this with Notepad++ or any other program?
Thanks
Here is a three-step approach:
merge the files by copying their lines into a single file
sort the merged file: Edit -> Line Operations -> Sort Lines Lexicographically (you need a current version of Notepad++ for this menu function)
remove duplicated lines by a search and replace:
Find what: (.*)\r\n\1\r\n
Replace whith: (leave empty)
select Regular expression in the lower left and click on Replace or Replace All
Use \r\n for Windows files and only \n for Unix files.
The regexp searches for something (.*) that is repeated on the next line (that is the \1 between the line endings) and replaces the matched two lines with nothing, i.e. removes both of them.
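Since the question allows for any other program, a command-line alternative (a sketch assuming one address per line, as in the example) gives the same result with sort and uniq -u, which keeps only lines that occur exactly once in the merged input:

$ sort file1.txt file2.txt | uniq -u
email2
email3
email5
email6
email8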

Extracting specific variables from a | delimited .txt into a new .txt

I have a .txt that is say 100,000 rows (observations) by 50 columns (variables), and the variables are | delimited. I would like to extract the 8th and 9th variables (or 7 and 8 if the indexing were to start at 0). In doing so, I'd like to create a new .txt that is 100,000 rows (the same observations) by 2 columns (these 2 variables) in which these 2 variables remain | delimited.
For example, the data in one row is formatted as:
var1|var2|var3|var4|var5|var6|var7|var8|var9|var10|var11 .........
I'd like to create a .txt with this row being:
var7|var8
I've tried:
$ perl -wplaF'|' -e'$_ = join "|", @F[7, 8]' fileoriginal.txt > filenew.txt
This output is just kind of gibberish, however.
Any help would be greatly appreciated!
The argument to -F is compiled into a regular expression, and | is a special character in regular expressions. To use a literal | char, you need to escape it on the command line.
One of
perl -F\\\| -wlape ...
perl -F'\|' -wlape ...
does the trick on Unix.
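As a hedged illustration (the sample line below is made up), once the separator is escaped the autosplit array @F contains whole fields again, and @F[7, 8] selects the 8th and 9th of them:

$ echo 'var1|var2|var3|var4|var5|var6|var7|var8|var9' | perl -wplaF'\|' -e '$_ = join "|", @F[7, 8]'
var8|var9

If the desired output is really var7|var8 as shown in the question, use @F[6, 7] instead.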

Use sed to delete a matched regexp and the line (or two) underneath it

OK I found this question:
How do I delete a matching line, the line above and the one below it, using sed?
and I've just spent the last hour trying to write something that will match a string and delete the line containing the string and the line beneath it (or, as a variant, delete the 2 lines beneath it).
I feel I'm now typing random strings. Please somebody help me.
If I've understood that correctly, to delete the matched line and the one line after it:
/matchstr/{N;d;}
To delete the matched line and the two lines after it:
/matchstr/{N;N;d;}
N appends the next line to the pattern space; d then deletes the whole pattern space (all of the gathered lines).
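As a quick hypothetical check against the sample file used in the awk answer below, the first form deletes the line containing "two" and the line after it:

$ sed '/two/{N;d;}' file
one
four
five
six
seven
eight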
You can also use awk, e.g. to search for the word "two" and delete it along with the 2 lines after it:
$ cat file
one
two
three
four
five
six
seven
eight
$ awk -vnum=2 '/two/{for(i=0;i<=num;i++)getline}1' file
one
five
six
seven
eight