Remove similar rows after merging 2 files in Notepad++

I have two files, file1.txt and file2.txt. Both contain client mailing lists. I want to:
1- Merge the files.
2- Exclude the entries that appear in both files.
example:
file1.txt
email1
email2
email3
email4
file2.txt
email5
email6
email4
email1
email8
The result should be:
email2
email3
email5
email6
email8
How can I do this with Notepad++ or any other program?
Thanks

There is a three-step approach:
1. Merge the files by copying their lines into a single file.
2. Sort the merged file: Edit -> Line Operations -> Sort Lines Lexicographically (you need a current version of Notepad++ for this menu function, see the answer here).
3. Remove the duplicated lines with a search and replace:
Find what: ^(.*)\r\n\1\r\n
Replace with: (leave empty)
Select Regular expression in the lower left and click Replace or Replace All.
Use \r\n for Windows files and only \n for Unix files.
The regexp searches for a whole line, ^(.*), that is repeated on the next line (that is the \1 between the line endings) and replaces the matched two lines with nothing, i.e. it removes both of them. The ^ anchor keeps the match from starting in the middle of a line, and the last line of the file needs a trailing line ending to be matched.
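If "any other program" may be a small script, the same merge-then-drop-pairs idea can be sketched in Python. This is only an illustration: the function name merge_unique is made up here, and it assumes that neither file contains the same entry twice on its own.

```python
from collections import Counter

def merge_unique(first, second):
    """Merge two lists of entries and drop every entry that occurs more than once."""
    merged = first + second            # step 1: merge the two lists
    counts = Counter(merged)           # count each entry across the merged list
    # Keep only entries seen exactly once; the original order is preserved.
    return [entry for entry in merged if counts[entry] == 1]

file1 = ["email1", "email2", "email3", "email4"]
file2 = ["email5", "email6", "email4", "email1", "email8"]
result = merge_unique(file1, file2)
print("\n".join(result))
```

With the sample lists from the question this prints email2, email3, email5, email6 and email8, one per line. No sorting is needed here because the counting does not depend on duplicates being adjacent.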

Related

Comm command issue

I'm trying to compare two gene lists and extract the common ones. I sorted my .txt files and used comm command:
comm gene_list1.txt gene_list2.txt
Strangely, when I check the output, many common genes are not printed in the third column. Here is part of the output:
As you can see, AAAS and AAGAB etc. exist in both files, but they are not printed as common lines! Any idea why this happens?
Thank you
$ comm file1.txt file2.txt
The output of the above command contains three columns. The first column has no leading tab and contains the names present only in file1.txt.
The second column is preceded by one tab and contains the names present only in file2.txt.
The third column is preceded by two tabs and contains the names common to both files.
This is the default output of the comm command when no option is used.
Assuming both input files are sorted, the command for your use case would be:
$ comm -12 gene_list1.txt gene_list2.txt
The -12 option suppresses columns 1 and 2, so only the lines common to both files are displayed.
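To make the three columns concrete, here is a rough Python sketch of the same walk over two sorted lists. It is not how comm itself is implemented. The function name comm and the sample lists are made up for illustration; only AAAS and AAGAB come from the question.

```python
def comm(first, second):
    """Split two sorted lists into (only_first, only_second, common)."""
    only_first, only_second, common = [], [], []
    i = j = 0
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            common.append(first[i])        # column 3: in both lists
            i += 1
            j += 1
        elif first[i] < second[j]:
            only_first.append(first[i])    # column 1: only in the first list
            i += 1
        else:
            only_second.append(second[j])  # column 2: only in the second list
            j += 1
    only_first.extend(first[i:])           # leftovers belong to one list only
    only_second.extend(second[j:])
    return only_first, only_second, common

# Hypothetical sample lists; both must be sorted, as comm requires.
list1 = sorted(["AAAS", "AAGAB", "ABCA1"])
list2 = sorted(["AAAS", "AAGAB", "ABCB1"])
only1, only2, both = comm(list1, list2)
print("\n".join(both))
```

The last list, both, corresponds to what comm -12 prints.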

Joining specific lines in file

I have a text file (snippet below) containing some public-domain corporate earnings report data, formatted as follows:
Current assets:
Cash and cash equivalents
$ 21,514 $ 21,120
Short-term marketable securities
33,769 20,481
Accounts receivable
12,229 16,849
Inventories
2,281 2,349
and what I'm trying to do (with sed) is the following: if the current line starts with a capital letter, and the next line starts with whitespace, copy the last N characters from the next line into the last N columns of the current line, then delete the next line. I'm doing it this way, because there are other lines in the files that begin with whitespace that I want to ignore. The results should look like the following:
Current assets:
Cash and cash equivalents $ 21,514 $ 21,120
Short-term marketable securities 33,769 20,481
Accounts receivable 12,229 16,849
Inventories 2,281 2,349
The closest I've come to getting what I want is:
sed -i -r ':a;N;$!ba;s/[^A-Z]*\n([[:space:]])/\1/g' file.txt
and I believe I've got the pattern matching ok, but the subsequent substitution really messes up the alignment of the columns of numbers. When I first started this, this seemed like a simple operation, but hours of searching and experimenting haven't helped. I'm open to any solutions that use something else other than sed, but would prefer to keep it strictly bash. Thank you much!
This might work for you (GNU sed):
sed -r '/^[[:upper:]]/{N;/\n\s/{h;x;s/\n.*//;s/./ /g;x;G;s/(\n *)(.*)\1$/\2/};P;D}' file
This solution only processes two consecutive lines where the first starts with an upper-case letter and the second starts with whitespace. All other lines are printed as is.
Having gathered the two lines into the pattern space (PS), a copy is made and stored in the hold space (HS). Processing now swaps to the HS, where the second line is removed and every character of the first line is turned into a space. Processing then swaps back to the PS and the HS is appended to it. Using matching and back references, a run of leading spaces as long as the first line is removed from the second line, so the first line takes the place of that padding and the columns of numbers stay aligned.
The line(s) are then printed and deleted. If the second line did not begin with whitespace, the P and D commands print and delete only the first line, so the second line is not lost but is re-appraised by the regexp at the start of the sed script.
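For anyone who finds the hold-space juggling hard to follow, the same overlay idea can be sketched in Python. This is a separate illustration, not a translation of the sed script. The name join_wrapped, the width of 52 columns and the single-space fallback for lines with too little padding are all assumptions made here.

```python
def join_wrapped(lines):
    """Overlay a label line onto the leading padding of the value line after it."""
    out = []
    i = 0
    while i < len(lines):
        cur = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if cur[:1].isupper() and nxt[:1].isspace():
            padding = len(nxt) - len(nxt.lstrip())
            if padding >= len(cur):
                # The label replaces the same number of leading spaces.
                out.append(cur + nxt[len(cur):])
            else:
                # Assumption: too little padding, so fall back to one space.
                out.append(cur + " " + nxt.lstrip())
            i += 2
        else:
            out.append(cur)
            i += 1
    return out

# Value lines are right-aligned to a hypothetical width of 52 columns.
sample = [
    "Current assets:",
    "Cash and cash equivalents",
    f"{'$ 21,514 $ 21,120':>52}",
    "Short-term marketable securities",
    f"{'33,769 20,481':>52}",
]
joined = join_wrapped(sample)
print("\n".join(joined))
```

Because each label only replaces padding of its own length, every joined line keeps the width of the original value line and the numbers stay in their columns.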

Copy selected content from one sheet of Notepad++ to another

I have data which is pipe separated, e.g.
1|2|3|4|5|6|7|8|9|10|
I have to copy and paste (to a new sheet) only the part that is between pipes 6 and 9.
I have 10,000 rows like this.
How can we do this? How can we write a macro for it? Is there any other solution?
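Copy the entire text into a new buffer, then edit the text to remove the unwanted parts. You can do that with a regular expression replace-all of ^(?:[^|\r\n]*\|){5}([^|\r\n]*)\|.*$ with \1. As written, this keeps only the sixth field of each line; change the count and the capture group to keep a different range.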
Explanation
^ - start of line
(?: - start of a non-capturing group
[^|\r\n]* - zero or more characters that are not a | or newlines or carriage returns
\| - a |
){5} - exactly 5 occurrences of the previous group
-- the effect of the above is to match the unwanted leading characters
([^|\r\n]*) - a group containing the characters to keep
-- the wanted part of the line is saved in capture group 1
\|.*$ - a | then everything else to the end of the line
-- matches the unwanted right-hand part of the line
The final $ is not strictly needed. But, when considered with the opening ^, it serves to document that the regular expression looks at the whole line.
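To check what the expression keeps before running it on 10,000 rows, it can be tried in Python, whose re module accepts the same pattern. This is only a sketch: it assumes \n line endings, and fields_between is a hypothetical helper added here to show one way of taking a range of fields instead of a single one.

```python
import re

# Same pattern as above: skip five fields, capture the sixth, drop the rest.
PATTERN = re.compile(r"^(?:[^|\r\n]*\|){5}([^|\r\n]*)\|.*$", re.MULTILINE)

def keep_sixth_field(text):
    """Replace every line with the contents of capture group 1."""
    return PATTERN.sub(r"\1", text)

def fields_between(line, first_pipe, last_pipe):
    """Hypothetical helper: text between two pipes, with pipes counted from 1."""
    return "|".join(line.split("|")[first_pipe:last_pipe])

row = "1|2|3|4|5|6|7|8|9|10|"
print(keep_sixth_field(row))        # 6
print(fields_between(row, 6, 9))    # 7|8|9
```

The split-based helper avoids the regular expression altogether, which may be easier to adjust when the wanted range changes.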

How can I ignore line endings when comparing files?

I am comparing two text files and I get the following result:
diff file1 file2 | grep 12345678
> 12345678
< 12345678
As you can see, the same string exists in both files, and both files were sorted with sort.
The line endings must be getting in the way here (Windows vs. Unix).
Is there a way to get diff to ignore line endings on Unix?
Use the --strip-trailing-cr option:
diff --strip-trailing-cr file1 file2
The option causes diff to strip the trailing carriage return character before comparing the files.
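The effect of the option can be imitated in a few lines of Python, which may help where that option is not available. This is a sketch of the idea only, not of how diff works internally. The function names and the second sample value are made up for illustration.

```python
import difflib

def strip_trailing_cr(text):
    """Split text into lines and drop one trailing carriage return per line."""
    return [line[:-1] if line.endswith("\r") else line
            for line in text.split("\n")]

def diff_ignoring_cr(a_text, b_text):
    """Unified diff of two texts after removing trailing carriage returns."""
    a = strip_trailing_cr(a_text)
    b = strip_trailing_cr(b_text)
    return list(difflib.unified_diff(a, b, fromfile="file1",
                                     tofile="file2", lineterm=""))

windows_text = "12345678\r\n23456789\r\n"   # Windows line endings
unix_text = "12345678\n23456789\n"          # Unix line endings
print(diff_ignoring_cr(windows_text, unix_text))   # prints [] (no differences)
```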

Use sed to delete a matched regexp and the line (or two) underneath it

OK I found this question:
How do I delete a matching line, the line above and the one below it, using sed?
and just spent the last hour trying to write something that will match a string and delete the line containing the string and the line beneath it (or a variant - delete 2 lines beneath it).
I feel I'm now typing random strings. Please somebody help me.
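If I've understood that correctly, to delete the matching line and one line after it:
sed '/matchstr/{N;d;}' file
To delete the matching line and the two lines after it:
sed '/matchstr/{N;N;d;}' file
N appends the next line of input to the pattern space.
d deletes the whole pattern space, i.e. the matching line together with the appended line(s).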
You can use awk, e.g. search for the word "two" and skip that line and the 2 lines after it:
$ cat file
one
two
three
four
five
six
seven
eight
$ awk -vnum=2 '/two/{for(i=0;i<=num;i++)getline}1' file
one
five
six
seven
eight
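The same skip-counter logic can be written out in Python, which may be easier to adapt than the one-liners. This is a sketch with a made-up function name. Like the sed and awk versions, it does not test the skipped lines for further matches.

```python
import re

def drop_match_and_following(lines, pattern, num):
    """Drop each line matching pattern together with the num lines after it."""
    kept = []
    skip = 0
    for line in lines:
        if skip > 0:
            skip -= 1            # still inside the block being dropped
            continue
        if re.search(pattern, line):
            skip = num           # drop this line and the next num lines
            continue
        kept.append(line)
    return kept

lines = ["one", "two", "three", "four", "five", "six", "seven", "eight"]
print("\n".join(drop_match_and_following(lines, "two", 2)))
```

For the sample file this prints one, five, six, seven and eight, the same as the awk output.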