How can I ignore line endings when comparing files? - diff

I am comparing two text files and I get the following result:
diff file1 file2 | grep 12345678
> 12345678
< 12345678
As you can see, the same string exists in both files, and both files were sorted with sort.
The line endings must be getting in the way here (Windows vs. Unix).
Is there a way to get diff to ignore line endings on Unix?

Use the --strip-trailing-cr option:
diff --strip-trailing-cr file1 file2
The option causes diff to strip the trailing carriage return character before comparing the files.
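A minimal sketch of the effect, assuming GNU diff (--strip-trailing-cr is a GNU extension). The file names unix.txt and dos.txt are hypothetical and are created in a temporary directory:

```shell
# Work in a throwaway directory; the file names are illustrative only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '12345678\nabcdefgh\n'     > unix.txt   # LF line endings
printf '12345678\r\nabcdefgh\r\n' > dos.txt    # CRLF line endings

# Plain diff treats the trailing CR as part of each line, so the files differ.
if diff unix.txt dos.txt > /dev/null; then plain=same; else plain=different; fi

# With --strip-trailing-cr the trailing CR is ignored during the comparison.
if diff --strip-trailing-cr unix.txt dos.txt > /dev/null; then stripped=same; else stripped=different; fi

echo "plain: $plain, stripped: $stripped"
```

The option only affects the comparison; neither file is modified.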

Related

Using grep and sed to extract file name and number

I have a list of files in the current directory. Some of them contain the keyword "speed", and I assume that the line containing the keyword also contains a number.
For example, in the file "filename.txt", I have the following lines:
some text
speed: this is the keyword, and equals 150
some text
I want to use a combination of grep and sed to get the following output:
filename: 150
Currently, I can only extract file names and the line that contains the keyword using grep, but I don't know how to form the output as above using a combination of grep and sed. The grep command I have so far is:
grep -r "speed"
which gives me:
filename.txt:speed: this is the keyword, and equals 150
Any help would be appreciated!
As Wiktor Stribiżew has mentioned in the comment,
the command below will produce the desired output:
awk '/speed/{print FILENAME": "$NF}' filename.txt
Explanation
/speed/ selects the lines that contain the keyword used as a reference for extracting.
{print FILENAME": "$NF}
print FILENAME prints the name of the file being read (here filename.txt, including the extension).
NF holds the number of fields in the current line, so $NF is the last field; for this text file that is 150.
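The question asks for filename: 150 without the extension. A possible variation, sketched here under the assumption that everything after the last dot should be dropped, copies FILENAME into a hypothetical variable f and strips the extension with sub():

```shell
# Recreate the sample file from the question in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'some text\nspeed: this is the keyword, and equals 150\nsome text\n' > filename.txt

# Copy FILENAME to f, remove the final ".ext" from f, then print f and the last field.
result=$(awk '/speed/ { f = FILENAME; sub(/\.[^.]*$/, "", f); print f ": " $NF }' filename.txt)
echo "$result"   # filename: 150
```

FILENAME itself is left untouched, so the same program can be run over several files at once.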
Assuming the filenames do not contain a colon (:) character, would you please try the following:
grep -r "speed" | sed -E 's/^([^:]+):[^0-9]*([0-9]+)/\1: \2/'
In the sed command:
^([^:]+) matches the filename and the 1st capture group is assigned to it.
[^0-9]* matches non-digits to be skipped.
([0-9]+) matches digits and the 2nd capture group is assigned to it.
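A small check of the substitution on the sample line from the question. This sketch uses grep -H on a single file instead of grep -r so that the filename prefix is predictable; -H is supported by GNU and BSD grep:

```shell
# Recreate the sample file from the question in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'some text\nspeed: this is the keyword, and equals 150\nsome text\n' > filename.txt

# grep prints "filename.txt:speed: ... 150"; sed keeps the filename and the first number.
result=$(grep -H "speed" filename.txt | sed -E 's/^([^:]+):[^0-9]*([0-9]+)/\1: \2/')
echo "$result"   # filename.txt: 150
```

Because ([0-9]+) captures the first run of digits after the filename, a line with an earlier number would report that number instead.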

Input from one file and match it in other and print until a pattern match

I have two files. File1 contains the following IDs:
id/35651
id/35325
id/20993
id/30167
id/29807
id/28315
id/29759
id/27715
id/26884
id/30412
File2 contains multiple IDs in the same format as File1, each preceded by > and followed by a multiline description. I want to print all the IDs, with their descriptions, from File2 that are present in File1.
File2 is huge. Here is a smaller version:
>id/30412
GCACACATTTTCTCGCGCTCTCTCCGGCTCTCCTTTGTTTATTTTCTAATCTATATTTTTACTGGAAGAT
TTCCTCTTTATTCTCTCCCGCCCTCCTACAAGCGCTCTTGCTGGCCGTCTGGGTGCACACACCGCTCCCT
CGATCACCCCAGCCCCCTTCCTGGTCTCCCGAGCGCGGGGTTTGAAGGTCACCTCCTTTCCAGTCCCCGT
GCGAGCCGCGCTGCCGCCGCCTCCTCCAGCCAGAGTCGGTGGGACTGGCTGCGCTGCCCTGAAGTGGTTC
TCCAAGCAGCGCGGAGGGTGGCGGACGGCGGACGGAGCCCAGGGGCCGCGTCGGGTGGGGAAACCCGAAC
>id/28315
TCGCGGAGGGGAATCCCTCCCCCTCCGCCCCAGCCCCCCAGCAGCACCCGCGGTGGGGCGGGGGCGCTCT
GCCAGCCCCGGGAACAGCAGAGGCGGCGGCACTGGCTGGACCCACGCGCGCGCCTCCGGGGCTGAAGAAG
GAAGGAGTGAGCCGAGCCGAGCACCCCACATCTGGAGGGGACAGCCAGCCGTGGGCCCCGCCCCGGCGTC
CGGAGCAGGAGAACTCCGAGCTTCTTGCCCAGGCAGAGAGAGCAGGAGCGGACCGCGCGCCCGGGATTGA
>id/2313
GAGTCCTTGCGCTCCAGACCCCCACCCAGTGGCCGCCAGGGTCCCCGCCTGTCCGGACCCTCGCCGCGCC
CAGGCAGGCGCGCCAGGGCGGGGCTGACCTGCCCGCGAAGTTGCGGACAGTGCGTGAGAAACCAGCACCC
CCTTTATGGAAACTGGTCAAAGAACTCATGCAAGTGGAACTTACAGCTTCCTTGATCGGACTCAGCATTC
AGGGCCCAGTTTGCTCCCCCGCAGAACGGTATCCCCGCGGAATACACGGCCCCTCATCCCCACCCCGCGC
CAGAGTACACAGGCCAGACCACGGTTCCCGAGCACACATTAAACCTGTACCCTCCCGCCCAGACGCACTC
>id/26884
CGAGCAGAGCCCGGCGGACACGAGCGCTCAGACCGTCTCTGGCACCGCCACACAGACAGATGACGCAGCA
CCGACGGATGGCCAGCCCCAGACACAACCTTCTGAAAACACGGAAAACAAGTCTCAGCCCAAGCGGCTGC
ATGTCTCCAATATCCCCTTCAGGTTCCGGGATCCGGACCTCAGACAAATGTTTGGTCAATTTGGTAAAAT
CTTAGATGTTGAAATTATTTTTAATGAGCGAGGCTCAAAGGGATTTGGTTTCGTAACTTTCGAAAATAGT
>id/29807
GCCGATGCGGACAGGGCGAGGGAGAAATTACACGGCACCGTGGTAGAGGGCCGTAAAATCGAGGTAAATA
ATGCCACAGCACGTGTAATGACAAATAAAAAGACCGTCAACCCTTATACAAATGGCTGGAAATTGAATCC
AGTTGTGGGTGCAGTCTACAGTCCCGAATTCTATGCAGCACGGTCCTGTTGTGCCAGGCCAACCAGGAGG
GATCTTCCATGTACAGTGCCCCCAGTTCACTTGTATATACTTCTGCAATGCCAGGCTTCCCGTATCCAGC
AGCCACCGCCGCGGCCGCCTACCGAGGGGCGCACCTGCGAGGCCGCGGTCGCACCGTGTACAACACCTTC
>id/980
AGGGCCGCGGCGCCCCCGCCCCCGATCCCGGCCTACGGCGGTGTTGTTTACCAGGATGGATTTTATGGTG
CAGACATTTATGGTGGTTATGCTGCATACCGCTACGCCCAGCCTACCCCTGCCACTGCCGCTGCCTACAG
TGACAGTTACGGACGAGTTTATGCTGCCGACCCCTACCACCACGCACTTGCTCCAGCCCCCACCTACGGC
GTTGGTGCCATGAATGCTTTTGCACCTTTGACTGATGCCAAGACTAGGAGCCATGCTGATGATGTGGGTC
TCGTTCTTTCTTCATTGCAGGCTAGTATATACCGAGGGGGATACAACCGTTTTGCTCCATACTAAATGAC
AAAACCATAAAAACCTTCCAATGTGGGGAGAAAGGAAGCTTTCCGAGGCCTGAGTATTGCAATACATGCA
GTAGTACATCATTTTAGCAACTCT
I can do it one by one with the following command:
sed -n -e '/id\/30412/,/id/p' File2
But I am not sure how to tell sed to get the input from File1.
Also, is it possible not to print the matching pattern (the next id/number line) at the end of the output?
This might work for you (GNU sed):
sed 's|id/\(.*\)|\\#^>id/\1$#{:\1;n;/^>/ba;b\1}|' file1 |
sed -e ':a' -f - -e 'd' file2
Build a sed script from file1 and run it against file2.
For each id, build a loop that prints the current line, fetches the next line (n), and checks whether that line begins with >. If it does, the script branches to :a and checks for a new id; otherwise it branches to a label unique to the current id and continues printing.
Lines that do not match any id are deleted (d).
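For comparison, here is a sketch of a different technique (awk instead of a generated sed script), not the method of the answer above. It assumes each header line in File2 is exactly > followed by an id from File1; the array name wanted and the flag keep are hypothetical, and the sample data is shortened:

```shell
# Shortened sample data in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'id/30412\nid/28315\n' > File1
printf '>id/30412\nGCAC\nTTCC\n>id/2313\nGAGT\n>id/28315\nTCGC\n' > File2

result=$(awk '
  NR == FNR { wanted[">" $0]; next }   # first file: remember each id as a header line
  /^>/      { keep = ($0 in wanted) }  # header line: decide whether to keep this record
  keep                                 # print header and description lines of kept records
' File1 File2)
echo "$result"
```

Each record ends when the next header is read, so the following id line is printed only if it is itself listed in File1. The lookup is an exact match, so id/2313 does not match id/23130.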

How to insert the content of a file two lines after the line where a pattern is found?

I have a file like the one below. I want to search for the pattern "Unix" and insert the content of another file two lines after the line where the pattern is matched. I want to do it in sed.
$ cat text1
Unix
Windows
Database
Wintel
Sql
Java
$
Output should be
Unix
Windows
Database
CONTENT OF ANOTHER FILE
Wintel
Sql
Java
It looks a bit funny, but this works with both GNU sed and BSD sed (on Mac OS X), and should work with most versions of sed:
sed -e '/Unix/{N;N;p;r content' -e 'd;}' data
Or:
sed -e '/Unix/{
N
N
p
r content
d
}' data
The N commands add extra lines to the pattern space (so the pattern space holds three lines containing Unix, Windows and Database); the p command prints the pattern space; the r content reads the file content and adds it to the output; the d deletes the pattern space; the {} group these operations so that they only occur when the input line matches Unix.
The r content must be at the end of a line of the script, or at the end of a -e argument, as shown. Trying to add a semicolon after it does not work (after all, the file name might contain a semicolon).
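A runnable sketch of the first form, using the sample data from the question. The input file is named text1 as in the question, and the inserted file is named content as in the command above:

```shell
# Recreate the sample files in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Unix\nWindows\nDatabase\nWintel\nSql\nJava\n' > text1
printf 'CONTENT OF ANOTHER FILE\n' > content

# On the Unix line: pull in two more lines, print all three,
# queue the file for output, then delete the pattern space.
result=$(sed -e '/Unix/{N;N;p;r content' -e 'd;}' text1)
echo "$result"
```

The d is needed because p has already printed the three lines; without it they would be printed a second time at the end of the cycle.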
This might work for you (GNU sed):
sed '/Unix/!b;n;n;r another_file' text1
If the line doesn't contain Unix, bail out. Otherwise print it and get the next line, repeat, and then read in the second file.
N.B. The second line following Unix is printed first because it is still part of the current cycle; the contents of another_file are queued by r and written to the output at the end of the current cycle.

Is it possible to show line number in side-by-side diff output?

I am using diff with the -y and --suppress-common-lines options and the output is almost perfect except I'd like to see the line numbers of the changes.
Example:
file1:
line a
line b
line c
file2:
line a
line B
line c
line d
command and output:
$ diff -y --suppress-common-lines file1 file2
line b | line B
> line d
Is this combination of options possible with diff or do I need another tool?
Unfortunately, the -y option uses its own formatting style internally (as do the --LTYPE-line-format options), so you cannot combine formatting options with -y.
You cannot reproduce what -y does with formatting parameters, so there is no direct workaround within diff (I checked the diff 3.2 source code).
You need to use another tool.
If you are always comparing lines with the same line numbers, you can use something like this:
$ awk 'NR==FNR{a[NR]=$0;next}{x=a[FNR];if($0!=x)printf("%s;%s;%s\n",FNR,x,$0)}' file1 file2
327;有る;ある
431;先ず;まず
543;連れて行く;連れていく
719;幾ら;いくら
1318;込む;混む
1415;かわいそう;可哀相
1713;だんだん;段々
2491;大みそか;大晦日
4120;もうける;儲ける
4510;ほほ笑む;微笑む
4512;もうかる;儲かる
5727;剥げる;剝げる
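FNR (the record number within the current file) is equal to NR (the total record number) only while awk is processing the first file, so the first block stores the lines of file1 in the array a. The next statement skips to the next record.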
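Applied to the file1 and file2 from the question, the same awk program can be sketched as follows (the files are recreated in a temporary directory):

```shell
# Recreate the sample files from the question in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'line a\nline b\nline c\n'         > file1
printf 'line a\nline B\nline c\nline d\n' > file2

# Store file1 by line number, then report each line of file2 that differs
# from the line with the same number in file1.
result=$(awk 'NR==FNR{a[NR]=$0;next}{x=a[FNR];if($0!=x)printf("%s;%s;%s\n",FNR,x,$0)}' file1 file2)
echo "$result"
```

Line 4 exists only in file2, so its file1 column is empty. Lines that exist only in file1 are not reported, and an inserted or deleted line shifts every comparison after it, which is why this only suits files whose lines stay aligned.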

diff + ignore lines spaces in case text is the same but on different lines number

I used diff --ignore-all-space
to ignore white space when I run diff file1 file2.
What do I need to add if I also want to ignore line positions (the text in file1 and file2 is the same but on different line numbers)?
file1 and file2 actually contain the same text, but the position of the text in file1 differs from its position in file2.
for example
diff --ignore-all-space
391a376
> AAAAAAAA
397d381
< AAAAAAAA
423a408
>
Lidia
It's not related to ignoring spaces. If a line is at a different position in file2 than in file1, diff reports it as removed in one place and added in another, even though the line itself is exactly the same. That is the result you're getting, so diff works as intended. If you really don't care about the order of the lines in your files (which I doubt), you can try sorting them (with the sort command) before diffing them.
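A minimal sketch of the sort-then-diff idea, using hypothetical sample contents and hypothetical output names file1.sorted and file2.sorted:

```shell
# Two files with the same lines in a different order, in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'AAAAAAAA\nBBBBBBBB\nCCCCCCCC\n' > file1
printf 'BBBBBBBB\nCCCCCCCC\nAAAAAAAA\n' > file2

# Comparing the files as they are reports the moved line.
if diff file1 file2 > /dev/null; then unsorted=same; else unsorted=different; fi

# Sorting both files first removes the effect of line order.
sort file1 > file1.sorted
sort file2 > file2.sorted
if diff file1.sorted file2.sorted > /dev/null; then sorted=same; else sorted=different; fi

echo "unsorted: $unsorted, sorted: $sorted"
```

Any line numbers in the output then refer to the sorted copies, not to the original files.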