Diff command - avoiding monolithic grouping of consecutive differing lines

Playing around with the standard Linux diff command, I could not find a way to avoid the following type of grouping in its output (the listings here assume the unified format). This question concerns the case where each line differs only slightly from its counterpart in the other file, so it is more useful to see each line next to its counterpart.
Instead of having groups like this show up in the comparison output:
- line 1
- line 2
- line 3
+ line 1 modified
+ line 2 modified
+ line 3 modified
To get this:
- line 1
+ line 1 modified
- line 2
+ line 2 modified
- line 3
+ line 3 modified
Of course, this is a convenience question, since it can be accomplished by writing your own code to post-process the diff output, or by replacing the LCS algorithm with one of your own. I don't think variants like wdiff would help much: the plain diff -U0 output format fits my needs very well except for this grouping property, whereas wdiff introduces other aspects that are not optimal for my case.
I'm looking for a command-line way, or a library that can be used in code, not a UI tool.

I was trying to solve this myself. The closest I got was this:
diff -y -W 10000 file1 file2 | grep '|' | sed 's/\s*|\s*/\n/g'
The one issue is that this assumes there are no whitespace differences at the beginning of the lines (or that you don't care about them).
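For completeness, the post-processing route mentioned in the question can be sketched with a small awk filter over diff -U0 output. This is only a sketch; old.txt and new.txt are placeholder file names, and the script simply buffers each hunk's run of "-" lines and run of "+" lines, then prints the two runs interleaved:
diff -U0 old.txt new.txt | awk '
  /^--- / || /^\+\+\+ / { print; next }   # pass the two file-header lines through
  /^-/  { minus[m++] = $0; next }         # buffer a run of removed lines
  /^\+/ { plus[p++]  = $0; next }         # buffer the matching run of added lines
  {                                       # any other line (e.g. a @@ hunk header) ends the runs
    n = (m > p ? m : p)
    for (i = 0; i < n; i++) {
      if (i < m) print minus[i]
      if (i < p) print plus[i]
    }
    m = p = 0
    print
  }
  END {                                   # flush whatever is still buffered at end of input
    n = (m > p ? m : p)
    for (i = 0; i < n; i++) {
      if (i < m) print minus[i]
      if (i < p) print plus[i]
    }
  }'
Because -U0 emits no context lines, every hunk is exactly one run of removals followed by one run of additions, so pairing the two runs index by index gives the interleaved layout shown above.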

Related

Diff tool filter

How can I diff two files but ignore all differences between comment strings? I would like to see the comments in the resulting diff, but not have the tool consider differences between comments to be real differences.
File1.py
# File 1 code
print("code")
print("same code")
print("code") # comment 1
File2.py
# File 2 code
print("different code")
print("same code")
print("code") # comment 2
When I diff file1.py and file2.py I want to be able to ignore comments, but still print them in the diff. Perhaps some command like:
diff -y file1.py file2.py -- magicRegex "#.*"
The desired output might look like:
# File 1 code                # File 2 code
print("code")              | print("different code")
print("same code")           print("same code")
print("code") # comment 1    print("code") # comment 2
I was thinking more about this today. Ideally, there's a tool out there to do this, but if not, I think this might work, depending on how much it is worth to you to script it:
Comment-preserving diff algorithm:
1. For file1 and file2, process them and create 2 new files for each:
i. A version of each file with the comments removed, (file1.py.nocom).
Lines containing only a comment would not be removed. Just the comment
removed. The line numbering would need to stay the same.
ii. A file containing the locations for all the comments as well as the
actual comment text. Something like:
1,1:# File 1 code
4,15:# comment 1
2. Do the diff between file1.py.nocom and file2.py.nocom, but without the -y
flag. This will be easier to parse. Even easier, use the -c flag with a
really high value. Hopefully you can get the whole file in the diff
without any missing "common" lines that way.
3. Go through the output from #2 and add back in the comments using the info
from 1.ii. I experimented with manually editing the diff from #2 and
applying it with vim, but it didn't seem to like one of the "common" lines
having a comment change. But there may be some tool that will allow you to
view it. Barring that:
4. Use the commented diff output to recreate yourself the -y flag style
output. I guess the tricky part will be determining the width of the
left side and printing out the right column. If on #2 you weren't able
to get all the common lines into the diff output using the -c flag, then
here you'll have to re-add those missing common lines.
The above won't (easily) work with docstrings, and there are probably other cases I haven't thought of. I guess it might need to be tweaked if you have addition/removal of comment lines between the files as well. But there's my two cents. It seems doable, but definitely a chunk of work.
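A rough sketch of step 1, assuming plain #-comments only (no docstrings, no # inside strings); the .nocom and .com file names are just placeholders following the naming above:
# 1.i: strip the comments but keep the line count identical
sed 's/#.*$//' file1.py > file1.py.nocom
# 1.ii: record "line,column:comment text" for every comment, e.g. "4,15:# comment 1"
awk 'match($0, /#.*/) { printf "%d,%d:%s\n", NR, RSTART, substr($0, RSTART) }' file1.py > file1.py.com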
You could preprocess the files with sed, using a wrapper that does something like:
sed -e 's/#.*$//' file1.py > file1.stripped
sed -e 's/#.*$//' file2.py > file2.stripped
diff -y file1.stripped file2.stripped
rm file1.stripped file2.stripped
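With bash you can avoid the temporary files altogether by using process substitution (same idea, just inlined):
diff -y <(sed 's/#.*$//' file1.py) <(sed 's/#.*$//' file2.py)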

Why are no output files written in a prinseq-lite Perl loop?

I am completely new to this type of coding/command lines, so I am sorry if I am asking this question in a wrong way.
I want to loop over all files in a directory (I am quality-trimming DNA sequencing files in .fastq format).
I have written this loop:
for i in *.fastq; do
perl /apps/prinseqlite/0.20.4/prinseq-lite.pl -fastq $i -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 -trim_tail_right 15 -trim_tail_left 15 -out_good /proj/forhot/qfiltered/looptest/$i_filtered.fastq -out_bad null; done
The code itself seems to run: I can see in my terminal that it is picking up the right files and doing the trimming (it writes a summary log to the terminal as it goes), but no output files are generated, i.e. these ones:
-out_good /proj/forhot/qfiltered/looptest/$i_filtered.fastq
If I run the code in a non-loop way, on just one file, it works (the output is generated), like this example:
prinseq-lite.pl -fastq 60782_merged_rRNA.fastq -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 -trim_tail_right 15 -trim_tail_left 15 -out_good 60782_merged_rRNA_filt_codeTEST.fastq -out_bad null
Is there a simple reason/answer to this?
This problem has nothing to do with Perl at all.
/proj/forhot/qfiltered/looptest/$i_filtered.fastq is read by the shell as interpolating the contents of i_filtered. There is no such shell variable, so this argument turns into /proj/forhot/qfiltered/looptest/.fastq ($i_filtered turns into nothing).
Therefore all of your prinseq-lite.pl executions place their output in the same file, which (because its name starts with a .) is "hidden": You need to use ls -a to see it, not just ls.
Fix
... -out_good /proj/forhot/qfiltered/looptest/${i}_filtered.fastq
Note that this would give you e.g. 60782_merged_rRNA.fastq_filtered.fastq for an input file of 60782_merged_rRNA.fastq. If you want to get rid of the duplicate .fastq part, you need something like:
... -out_good /proj/forhot/qfiltered/looptest/"${i%.fastq}"_filtered.fastq
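Putting it together, the corrected loop might look like this (paths and trimming options copied from the question; an untested sketch):
for i in *.fastq; do
  # quoting "$i" is just defensive; ${i%.fastq} strips the extension before adding _filtered
  perl /apps/prinseqlite/0.20.4/prinseq-lite.pl -fastq "$i" \
    -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 \
    -trim_tail_right 15 -trim_tail_left 15 \
    -out_good /proj/forhot/qfiltered/looptest/"${i%.fastq}"_filtered.fastq \
    -out_bad null
done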

how do I delete a regex-delimited range plus a few lines with sed?

I have a file containing a header I want to get rid of. I don't have a good way of addressing either the last line of the header or the first line of the data, but I can address the line before the next-to-last line of the header via a regular expression.
Example input:
a bunch of make output which I don't care about
for junk in blah; do
can't check for done!
done
for test in blurfl; do # this is the addressable line
more garbage
done
line 1
line 2
line 3
line 4
line 5
I've done the obvious 1,/for test in blurfl/d, but that doesn't get the next two lines. I can make the command {N;d}, which gets rid of the next line, but {N;N;d} just blows away the rest of the file except the last line. I figured out that this is because the range isn't slurped up and treated as a single entity; instead it is processed line by line.
I feel like I'm missing something obvious because I don't know some sed idiom, but none of the examples on the web or in the GNU manual have managed to trigger anything useful.
I can do this in awk, but other transformations I need to do make awk somewhat, well, awkward. But GNU sed is acceptable.
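For what it's worth, one way to sidestep the slurping issue entirely is a two-pass pipe: the first sed deletes everything up to and including the addressable line, and the second sed's 1,2 then refers to the already-trimmed stream (input.txt is a placeholder name):
sed '1,/for test in blurfl/d' input.txt | sed '1,2d'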
I have to disagree about [not] using awk. Anything non-trivial is almost always easier in awk than sed [even the sed manpage says so]. Personally, I'd use perl, but ...
So, here's the awk script:
BEGIN {
    phase = 0
}

# initial match -- find second loop
phase == 0 {
    if ($0 ~ /for test in blurfl/) {
        phase = 1
        next
    }
}

# wait for end of second loop
phase == 1 {
    if ($0 ~ /done/) {
        phase = 2
        next
    }
}

# print phase
phase == 2 {
    print($0)
}
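Saved as, say, strip_header.awk (the file names here are arbitrary), running it on the example input above prints only the data lines:
$ awk -f strip_header.awk input.txt
line 1
line 2
line 3
line 4
line 5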
If you wish to torture yourself [and sed] for complex changes, well, caveat emptor, but don't say I didn't warn you ...
I don't think you can do multi-line matches in sed. The first time I went down this rabbit hole I ended up using awk, which does support them, but these days I'd probably use Python or Ruby for this kind of thing.

Reading huge .csv files with MATLAB - file is not well organized

I have several .csv files that I read with MATLAB using textscan, because csvread and xlsread do not support files of this size (200 MB-600 MB).
I use this line to read it:
C = textscan(fileID,'%s%d%s%f%f%d%d%d%d%d%d%d','delimiter',',');
The problem I have found is that sometimes the data is not in this format, and then textscan stops reading at that line without any error.
So what I have done is to read it this way:
C = textscan(fileID,'%s%d%s%f%f%s%s%s%s%s%s%s%s%s%s%s','delimiter',',');
This way I can see that in 2 rows out of 3 million there is a change in the format.
I want to read all the lines except the bad/different lines.
In addition, is it possible to read only the lines where the first string is 'PAA'?
I have tried to load the file directly into MATLAB, but it is super slow and sometimes gets stuck, or for the really big ones it reports a memory problem.
Any recommendations?
For large files which are still small enough to fit your memory, parsing all lines at once is typically the best choice.
f = fopen('data.txt');
g = textscan(f,'%s','delimiter','\n');
fclose(f);
In a next step you have to identify the lines starting with PAA; use strncmp.
Now, having your data filtered, apply your textscan expression above to each line. If it fails, try the other one.
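A minimal sketch of that idea, reusing the snippet above (the loose fallback format and the emptiness check used to detect a failed parse are my assumptions, not tested against your data):
f = fopen('data.txt');
g = textscan(f,'%s','delimiter','\n');
fclose(f);

lines = g{1};                               % one cell per line of the file
lines = lines(strncmp(lines, 'PAA', 3));    % keep only the lines starting with PAA

fmtStrict = '%s%d%s%f%f%d%d%d%d%d%d%d';     % the expected 12-field format
fmtLoose  = '%s%d%s%f%f%s%s%s%s%s%s%s%s%s%s%s';

C = cell(numel(lines), 1);
for k = 1:numel(lines)
    C{k} = textscan(lines{k}, fmtStrict, 'delimiter', ',');
    if isempty(C{k}{end})                   % strict parse stopped early: fall back
        C{k} = textscan(lines{k}, fmtLoose, 'delimiter', ',');
    end
end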
MATLAB is slow with this kind of thing because it needs to load everything into memory. I would suggest using grep/bash/cmd lines to reduce your file to readable lines before processing them in MATLAB. On Linux you can use:
awk -F ',' '$1 ~ /^PAA/' yourfile.csv > yourNewFile.csv
This will give you a new file with all the lines whose first field starts with PAA (NOTE: case sensitive).
To find lines that do not have the expected format, you can use:
awk -F ',' 'NF != 12 {print NR, $0}' yourfile.csv > yourNewFile.csv
This counts the comma-separated fields on each line and prints, together with its line number, any line that does not have exactly 12 fields.

use perl to extract specific output lines

I'm endeavoring to create a system to generalize rules from input text. I'm using reVerb to create my initial set of rules. Using the following command[*], for instance:
$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n
To generate output of the form:
1 stdin
2 1
3 Bananas
4 are an excellent source of
5 potassium
6 0
7 1
8 1
9 6
10 6
11 7
12 0.9999999997341693
13 Bananas are an excellent source of potassium .
14 NNS VBP DT JJ NN IN NN .
15 B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16 bananas
17 be source of
18 potassium
I'm currently piping the output to a file, which includes the preceding white space and numbers as depicted above.
What I'm really after is just the simple rule at the end, i.e. lines 16, 17 & 18. I've been trying to create a script to extract just that component and put it into a new file in the form of a Prolog clause, i.e. be source of(bananas, potassium).
Is that feasible? Can Prolog rules contain white space like that?
I think I'm locked into getting all that output from reVerb so, what would be the best way to extract the desirable component? With a Perl script? Or maybe sed?
*Later I plan to replace this with a larger input file as opposed to just single sentences.
This seems wasteful. Why not leave the tabs as they are, and use:
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q | cut --fields=16,17,18
And yes, you can have rules like this in Prolog. See the answer by @mat. You need to know a bit of Prolog before you move on, I guess.
It is easier, however, to just make the string a valid name for a predicate:
be_source_of with underscores instead of spaces
or 'be source of' with spaces, and enclosed in single quotes.
You can probably use awk to do what you want with the three fields. See for example the printf command in awk. Or you can parse it again from Prolog directly. Both are beyond the scope of your current question, I feel.
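For example, assuming reVerb's -q output really is tab-separated with the extraction in fields 16-18 as listed above, an awk printf can emit each triple directly as a clause, here using the underscore spelling suggested earlier (a sketch, not tested against real reVerb output):
echo "Bananas are an excellent source of potassium." \
  | ./reverb -q \
  | awk -F '\t' '{ rel = $17; gsub(/ /, "_", rel); printf "%s(%s, %s).\n", rel, $16, $18 }'
This would print something like: be_source_of(bananas, potassium).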
sed -n ':cycle
$!{N
4,$D
b cycle
}
s/\(.*\)\n\(.*\)\n\(.*\)/\2 (\1,\3)/p' YourFile
If the numbers are in the output and not just shown here for reference, change the last sed action to:
s/^ *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)/\2 (\1,\3)/p
assuming the last 3 lines are the source of your "rules"
Regarding the Prolog part of the question:
Yes, Prolog facts can contain whitespace like this, with suitable operator declarations present.
For example:
:- op(700, fx, be).
:- op(650, fx, source).
:- op(600, fx, of).
Example query and its result, to let you see the shape of terms that are created with this syntax:
?- write_canonical(be source of(a, b)).
be(source(of(a,b))).
Therefore, with these operator declarations, a fact like:
be source of(a, b).
is exactly the same as stating:
be(source(of(a,b))).
Depending on use cases and other definitions, it may even be an advantage to create this kind of fact (i.e., facts of the form be/1 instead of source_of/2). If source_of/2 is the only kind of fact you need, you can simply write:
source_of(a, b).
This creates no redundant wrappers and is easier to use.
Or, as Boris suggested, you can use single quotes as in 'be source of'/2.