How to show diff of two files but only show lines which have common starting string? - diff

How to find diff of two files but only show lines which have common starting string?
For example,
file1:
start1 1234
1234
start2 1234
file2:
start1 ABCD
ABCD
start2 ABCD
And the diff should be just:
> start1 1234
---
< start1 ABCD
> start2 1234
----
< start2 ABCD
or something like this:
start1
start2

You would need to script it yourself, because a classic diff (or git diff --no-index, which works on any two files, even outside a Git repository) only displays hunks.
See "In the context of git (and diff), what is a “hunk”": a hunk can display more than just the differing lines.
diff finds sequences of lines common to both files, interspersed with groups of differing lines called hunks.
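One way to script it, as a minimal sketch rather than a definitive tool: the awk program below assumes that the "common starting string" is the first whitespace-separated field, and that only lines with at least two fields carry such a prefix. The helper name prefix_diff and the "<" / ">" output markers are made up for illustration.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf 'start1 1234\n1234\nstart2 1234\n' > "$tmp/file1"
printf 'start1 ABCD\nABCD\nstart2 ABCD\n' > "$tmp/file2"

# prefix_diff FILE1 FILE2 (hypothetical helper)
# Pass 1 (FILE1): remember every multi-field line under its first field.
# Pass 2 (FILE2): if the first field was seen and the lines differ, print both.
prefix_diff() {
    awk 'NR == FNR { if (NF > 1) seen[$1] = $0; next }
         NF > 1 && ($1 in seen) && seen[$1] != $0 {
             print "< " seen[$1]
             print "> " $0
         }' "$1" "$2"
}

prefix_diff "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

With the sample files this prints the start1 and start2 pairs and skips the bare 1234 / ABCD lines, because those lines have no prefix field.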

Related

Using sed to extract data from a file. I know the string I'm looking but I need to get the whole block of data that this string is in

I'm using sed to extract data from a file that contains lots of data in the same style. I want every occurrence of a specific string, but the string is part of a block of information and I want to extract the whole block based on that string.
Example data in file:
123
AAA
ABC
ZZZ
123
KJG
HJY
ZZZ
123
LPC
ABC
TRY
ZZZ
In this example 123 is the start of the block of data I want and ZZZ the end. ABC is the string I search for. So from this example my output should be:
123
AAA
ABC
ZZZ
123
LPC
ABC
TRY
ZZZ
sed -n '/ABC/{:a;p;n;/123/b;ba};' testfile.txt > testfile2.txt
the output with this is
ABC
ZZZ
ABC
TRY
ZZZ
so I'm not getting the data before ABC in the block
This might work for you (GNU sed):
sed -n '/123/{:a;N;/ZZZ/!ba;/ABC/p}' file
Gather up lines between 123 and ZZZ and then print them if they contain ABC.
N.B. n prints the current line (unless -n has suppressed auto-printing) and replaces it with the next, whereas N appends the next line to the pattern space, inserting a newline. The gathered lines therefore stay in the pattern space and remain searchable.
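Where GNU sed is not available, the same gather-then-test idea can be sketched in awk. This is an alternative technique, not the sed answer above; it assumes the start, end, and search markers are whole lines, and extract_blocks with its argument order is a hypothetical helper.

```shell
# extract_blocks FILE FIRST LAST WANT (hypothetical helper)
# Buffer each block from a FIRST line to the next LAST line and print the
# buffer only if some line inside it equals WANT.
extract_blocks() {
    awk -v first="$2" -v last="$3" -v want="$4" '
        $0 == first { buf = ""; hit = 0; inblock = 1 }
        inblock {
            buf = buf $0 "\n"
            if ($0 == want) hit = 1
            if ($0 == last) {
                if (hit) printf "%s", buf
                inblock = 0
            }
        }' "$1"
}

# Sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' 123 AAA ABC ZZZ 123 KJG HJY ZZZ 123 LPC ABC TRY ZZZ > "$tmp/testfile.txt"
extract_blocks "$tmp/testfile.txt" 123 ZZZ ABC
rm -rf "$tmp"
```

The middle block (123, KJG, HJY, ZZZ) is buffered like the others but dropped when its end marker arrives, because no ABC line was seen inside it.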

How to replace every 2nd tab character with a newline character using sed

given the input
123\t456\tabc\tdef
create the output
123\t456\nabc\tdef
which would display like
123 456
abc def
Note that it needs to work across multiple lines, not just two.
EDIT
a better example might help clarify.
input (there is only expected to be 1 line of input)
1\t2\t3\t4\t5\t6\t7\t8
expected output
1 2
3 4
5 6
7 8
...
With GNU sed:
sed 's/\t/\n/2;P;D;' file
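s/\t/\n/2 replaces the second tab in the pattern space with a newline. P then prints the pattern space up to that newline, and D deletes the printed part and restarts the cycle on the remainder without reading new input, so every second tab ends up replaced.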
This little trick should work:
sed 's/\(\t[^\t]*\)\t/\1\n/g' < input_file.txt
EDIT:
Below is an example:
$ cat 1.txt
one two three four five six seven
five six seven
$ sed 's/\(\t[^\t]*\)\t/\1\n/g' < 1.txt
one two
three four
five six
seven
five six
seven
$
EDIT2:
For MacOS' standard sed try this:
$ sed $'s/\\(\t[^\t]*\\)\t/\\1\\\n/g' < 1.txt
$'...' is bash's ANSI-C quoting: bash itself expands escape sequences such as \t and \n before sed sees the script.
Let's say the following is the Input_file:
cat Input_file
123 456 abc def
Then the following may help you get the fields into 2 columns.
xargs -n2 < Input_file
Output will be as follows.
123 456
abc def
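Another portable route, not taken from the answers above, is to turn every tab into a newline and then let paste rejoin the lines two at a time. This is only a sketch: pair_fields is a hypothetical helper name, and it assumes the fields themselves contain no tabs or newlines.

```shell
# pair_fields (hypothetical helper): read tab-separated fields on stdin and
# write them back two per line, separated by a tab.
pair_fields() {
    # tr puts each field on its own line; "paste - -" reads stdin twice per
    # output line, so consecutive lines are joined in pairs with a tab.
    tr '\t' '\n' | paste - -
}

printf '1\t2\t3\t4\t5\t6\t7\t8\n' | pair_fields
```

Unlike xargs -n2, this keeps spaces inside a field intact, because only tabs are treated as separators. With an odd number of fields, the last output line holds a single field followed by a trailing tab.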

How to search only first pattern range in sed

My input file looks something like this
Start1
some text
that I want
modified
Pattern1
some other text
which I do not want
to modify
End1
Start1
Pattern2
End1
My sed pattern looks like this
/Start1/,/Pattern1/c\
Start1\
Modification text here\
Pattern1\
additional modifications
I only want the text within the first range of Start1 and End1 modified.
Additionally, I am specifying Pattern1, which does not exist in the second range.
I run
sed -i -f <sed_file> <input_file>
However, my output is given below. For some reason it wipes out the second range even though Pattern1 does not exist in it.
Start1
Modification text here
Pattern1
additional modifications
some other text
which I do not want
to modify
End1
Expected result
Start1
Modification text here
Pattern1
additional modifications
some other text
which I do not want
to modify
End1
Start1
Pattern2
End1
Try this one
sed ':A;/Start1/!b;N;/Pattern1/!bA;s/\(Start1\n\)\(.*\)\(\nPattern1\)/\1Modification text here\3\nadditional modifications/' infile
In GNU sed:
sed -e '/START/,/END/c TEXT'
is not the same as
sed -e '/START/,/END/{c TEXT' -e '}'
The first omits the whole range from the output stream and emits a single instance of TEXT into the output stream upon reaching the end of the range. The second replaces each line in the range with TEXT.
Your issue is that the second range is being omitted from the output stream even though you never reach the end of the second range. /START/,/END/c where /END/ never appears is basically like /START/,$d
The only solutions that I can figure are clunky:
/Start1/,/Pattern1/{
/Pattern1/{
# Insert into output stream
i\
Start1\
Modification text here\
Pattern1\
additional modifications
# Read in the rest of the file
:a
$!N
$!ba
# Remove the original Pattern1 line from the pattern space
# (Remove first line and newline of pattern space)
s/^[^\n]*\n//
# Print pattern space and quit
q
}
# Delete lines in the range other than /Pattern1/
d
}
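If awk is acceptable, the "first range only" requirement can also be sketched as a small state machine. This swaps sed for awk and is not the approach from the answers above; the helper name replace_first_range is hypothetical, and the replacement text is hard-coded from the question.

```shell
# replace_first_range FILE (hypothetical helper)
# state 0: before the first Start1, 1: buffering the first range, 2: done.
replace_first_range() {
    awk '
        state == 0 && /Start1/ { state = 1; buf = $0; next }
        state == 1 {
            buf = buf "\n" $0
            if (/Pattern1/) {
                # Range closed: emit the replacement instead of the buffer.
                print "Start1"
                print "Modification text here"
                print "Pattern1"
                print "additional modifications"
                state = 2
            }
            next
        }
        { print }
        # Range never closed: give the buffered lines back unchanged.
        END { if (state == 1) print buf }
    ' "$1"
}
```

Because the lines of an open range are buffered rather than discarded, a Start1 that is never followed by Pattern1 is printed unchanged at end of input instead of being wiped out.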

*nix: perform set union/intersection/difference of lists

I sometimes need to compare two text files. Obviously, diff shows the differences; it also hides the similarities, which is kind of the point.
Suppose I want to do other comparisons on these files: set union, intersection, and subtraction, treating each line as an element in the set.
Are there similarly simple common utilities or one-liners which can do this?
Examples:
a.txt
john
mary
b.txt
adam
john
$> set_union a.txt b.txt
john
mary
adam
$> set_intersection a.txt b.txt
john
$> set_difference a.txt b.txt
mary
Union: sort -u files...
Intersection: sort files... | uniq -d
Overall difference (elements which are just in one of the files):
sort files... | uniq -u
Mathematical difference (elements only once in one of the files):
sort files... | uniq -u | sort - <(sort -u fileX ) | uniq -d
The first two commands get me all unique elements. Then we merge this with the file we're interested in. Command breakdown for sort - <(sort -u fileX ):
The - will process stdin (i.e. the list of all unique elements).
<(...) is process substitution: the shell runs the command and passes the outer command a path (a named pipe or a /dev/fd entry) from which its output can be read.
So this gives us a mix of all unique elements plus all unique elements in fileX. The duplicates are then the unique elements which are only in fileX.
If you want to get the common lines between two files, you can use the comm utility.
A.txt :
A
B
C
B.txt
A
B
D
and then, using comm will give you :
$ comm <(sort A.txt) <(sort B.txt)
		A
		B
C
	D
In the first column, you have what is in the first file and not in the second.
In the second column, you have what is in the second file and not in the first.
In the third column, you have what is in both files.
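Building on those three columns, comm's -1, -2 and -3 options suppress the corresponding column, which gives intersection and difference directly. The sketch below uses the a.txt / b.txt data from the question and plain sorted temporary files instead of process substitution, so it also runs in a POSIX sh; the set_* helper names are hypothetical and mirror the names used in the question.

```shell
tmp=$(mktemp -d)
printf 'john\nmary\n' > "$tmp/a.txt"
printf 'adam\njohn\n' > "$tmp/b.txt"

# comm needs sorted input; sort -u also drops duplicate lines within a file.
sort -u "$tmp/a.txt" > "$tmp/a.sorted"
sort -u "$tmp/b.txt" > "$tmp/b.sorted"

# Hypothetical helpers; each takes two files that are already sorted.
set_union()        { sort -u "$1" "$2"; }
set_intersection() { comm -12 "$1" "$2"; }   # keep column 3 only
set_difference()   { comm -23 "$1" "$2"; }   # keep column 1 only

set_union        "$tmp/a.sorted" "$tmp/b.sorted"
set_intersection "$tmp/a.sorted" "$tmp/b.sorted"
set_difference   "$tmp/a.sorted" "$tmp/b.sorted"
rm -rf "$tmp"
```

Note that set_difference is not symmetric: swapping the two arguments yields the lines that are only in the second file.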
If you don't mind using a bit of Perl, and if your file sizes are reasonable such that they can be written into a hash, you could collect the files into two hashes to do:
#...get common keys in an array...
my @both_things;
for (keys %from_1) {
    push @both_things, $_ if exists $from_2{$_};
}
#...put unique things in an array...
my @once_only;
for (keys %from_1) {
    push @once_only, $_ unless exists $from_2{$_};
}
I can't comment on Aaron Digulla's answer, which despite being accepted does not actually compute the set difference.
The set difference A\B with the given inputs should only return mary, but the accepted answer also incorrectly returns adam.
This answer has an awk one-liner that correctly computes the set difference:
awk 'FNR==NR {a[$0]++; next} !a[$0]' b.txt a.txt

diff + ignore lines spaces in case text is the same but on different lines number

I used diff --ignore-all-space to ignore whitespace when I run diff file1 file2.
What do I need to add if I also want to ignore line positions (the text in file1 and file2 is the same but on different line numbers)?
file1 and file2 actually contain the same text, but the position of the text in file1 differs from its position in file2.
for example
diff --ignore-all-space
391a376
> AAAAAAAA
397d381
< AAAAAAAA
423a408
>
Lidia
This has nothing to do with ignoring whitespace. If a line is in a different position in file2 than in file1, diff reports it as removed in one place and added in another, even though the line itself is identical. That is the result you're seeing, so diff is working as designed. If you really don't care about the order of the lines in your files (which I doubt), you can sort them (with the sort command) before diffing them.
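The sort-then-diff suggestion can be sketched as a small wrapper. diff_unordered is a hypothetical helper name; the sketch assumes that line order carries no meaning in either file, and it diffs sorted temporary copies so the originals stay untouched.

```shell
# diff_unordered FILE1 FILE2 (hypothetical helper)
# Sort copies of both files, then diff the copies, so that only added or
# removed lines are reported and moved lines are not.
diff_unordered() {
    a=$(mktemp) && b=$(mktemp) || return 2
    sort "$1" > "$a"
    sort "$2" > "$b"
    status=0
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Same lines, different order: diff_unordered prints nothing and succeeds.
tmp=$(mktemp -d)
printf 'AAAAAAAA\nBBBB\n' > "$tmp/file1"
printf 'BBBB\nAAAAAAAA\n' > "$tmp/file2"
diff_unordered "$tmp/file1" "$tmp/file2" && echo "same lines"
rm -rf "$tmp"
```

The line numbers in any reported differences refer to the sorted copies, not to the original files.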