I was using:
sed '/\Additional categories/a new'
and it works great for inserting a new line immediately after the pattern.
Now I need to find a block, then get to the first blank line, and insert another blank line. File example:
Additional categories
stuff A
stuff B

other stuff
other stuff
Additional categories
stuff A
stuff B
stuff C
stuff D

other stuff
other stuff
Desired result:
Additional categories
stuff A
stuff B


other stuff
other stuff
Additional categories
stuff A
stuff B
stuff C
stuff D


other stuff
other stuff
Just one blank line is added after each block, so that I have more space to make changes in the data file. If this has already been answered, share the link and I will delete my question. Thanks.
You can try this with GNU sed:
sed '/Additional categories/{:A;N;/\n$/!bA;s/$/\n/}' infile
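For readers who find the sed loop hard to follow, the same logic can be sketched in Python. This is only an illustration, not part of the original answer: the function name pad_blocks and its marker parameter are made up here, and the input is assumed to be a list of lines with the newlines already stripped.

```python
def pad_blocks(lines, marker="Additional categories"):
    """Return a copy of lines with one extra blank line inserted after the
    first empty line that follows each line containing marker."""
    out = []
    in_block = False  # True between a marker line and the next empty line
    for line in lines:
        out.append(line)
        if marker in line:
            in_block = True
        elif in_block and line == "":
            out.append("")  # the additional blank line
            in_block = False
    return out


example = [
    "Additional categories", "stuff A", "stuff B", "",
    "other stuff", "other stuff",
]
print("\n".join(pad_blocks(example)))
```

Like the sed command, this only treats a completely empty line as the end of a block; a line containing spaces does not count.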
How can I diff two files but ignore all differences between comment strings? I would like to see the comments in the resulting diff, but not have the tool consider differences between comments to be real differences.
File1.py
# File 1 code
print("code")
print("same code")
print("code") # comment 1
File2.py
# File 2 code
print("different code")
print("same code")
print("code") # comment 2
When I diff file1.py and file2.py I want to be able to ignore comments, but still print them in the diff. Perhaps some command like:
diff -y file1.py file2.py --magicRegex "#.*"
The desired output might look like:
# File 1 code                  # File 2 code
print("code")                | print("different code")
print("same code")             print("same code")
print("code") # comment 1      print("code") # comment 2
I was thinking more about this today. Ideally, there's a tool out there to do this, but if not, I think this might work, depending on how much it is worth to you to script it:
Comment-preserving diff algorithm:
1. For file1 and file2, process them and create 2 new files for each:
i. A version of each file with the comments removed, (file1.py.nocom).
Lines containing only a comment would not be removed. Just the comment
removed. The line numbering would need to stay the same.
ii. A file containing the locations for all the comments as well as the
actual comment text. Something like:
1,1:# File 1 code
4,15:# comment 1
2. Do the diff between file1.py.nocom and file2.py.nocom, but without the -y
flag. This will be easier to parse. Even easier, use the -C flag (number of
context lines) with a really high value. Hopefully you can get the whole
file in the diff without any missing "common" lines that way.
3. Go through the output from #2 and add back in the comments using the info
from 1.ii. I experimented with manually editing the diff from #2 and
applying it with vim, but it didn't seem to like one of the "common" lines
having a comment change. But there may be some tool that will allow you to
view it. Barring that:
4. Use the commented diff output to recreate the -y flag style output
yourself. I guess the tricky part will be determining the width of the
left side and printing out the right column. If on #2 you weren't able
to get all the common lines into the diff output using the -C flag, then
here you'll have to re-add those missing common lines.
The above won't (easily) work with docstrings, and there are probably other cases I haven't thought of. I guess it might need to be tweaked if comment lines are added or removed between the files as well. But there's my two cents. It seems doable, but definitely a chunk of work.
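Step 1 of the outline above could be sketched roughly as follows. This is an assumption-laden illustration, not a finished tool: split_comments is a made-up name, the input is assumed to be Python source that the standard tokenize module can parse, and columns are reported 1-based to match the 4,15 example.

```python
import io
import tokenize


def split_comments(source):
    """Return (source_without_comments, comments), where comments is a list
    of (line, column, text) tuples. Line numbering is left unchanged."""
    lines = source.splitlines()
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # row is 1-based, col is 0-based
            comments.append((row, col + 1, tok.string))
            # Cut the comment off but keep the (possibly empty) line.
            lines[row - 1] = lines[row - 1][:col].rstrip()
    return "\n".join(lines) + "\n", comments


file1 = '# File 1 code\nprint("code")\nprint("same code")\nprint("code") # comment 1\n'
nocom, locations = split_comments(file1)
for line, col, text in locations:
    print(f"{line},{col}:{text}")
```

Using the tokenizer rather than a regular expression means a # inside a string literal is not mistaken for a comment, although tokenize can raise an error on source that is not valid Python.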
You could preprocess them with sed. You could make a wrapper that does something like:
sed -e 's/#.*$//' file1.py > file1.stripped
sed -e 's/#.*$//' file2.py > file2.stripped
diff -y file1.stripped file2.stripped
rm file1.stripped file2.stripped
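The sed wrapper hides the comments in the output, whereas the question asks to keep them visible. One way to get closer, sketched here with Python's standard difflib module, is to match lines on their comment-free text but display the original lines. The names strip_comment and comment_blind_rows are invented for this sketch, and the comment stripping is as naive as the sed expression (any # counts as a comment start).

```python
import difflib


def strip_comment(line):
    # Naive assumption: every '#' starts a comment, so a '#' inside a
    # string literal is cut off too (same as the sed command).
    return line.split("#", 1)[0].rstrip()


def comment_blind_rows(left, right):
    """Pair up original lines, matching them on their comment-free text.

    Returns (left_line, mark, right_line) tuples. mark is ' ' when the code
    matches, '|' when it differs, '<' or '>' when only one side has a line.
    """
    a = [strip_comment(line) for line in left]
    b = [strip_comment(line) for line in right]
    rows = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for k in range(max(i2 - i1, j2 - j1)):
            l = left[i1 + k] if i1 + k < i2 else None
            r = right[j1 + k] if j1 + k < j2 else None
            if tag == "equal":
                mark = " "
            elif l is None:
                mark = ">"
            elif r is None:
                mark = "<"
            else:
                mark = "|"
            rows.append((l or "", mark, r or ""))
    return rows


file1 = ['# File 1 code', 'print("code")', 'print("same code")',
         'print("code") # comment 1']
file2 = ['# File 2 code', 'print("different code")', 'print("same code")',
         'print("code") # comment 2']
for l, mark, r in comment_blind_rows(file1, file2):
    print(f"{l:<28} {mark} {r}")
```

The fixed column width of 28 is arbitrary; a real wrapper would compute it from the longest left-hand line.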
I have a file containing a header I want to get rid of. I don't have a good way of addressing either the last line of the header or the first line of the data, but I can address the line before the next-to-last line of the header via a regular expression.
Example input:
a bunch of make output which I don't care about
for junk in blah; do
can't check for done!
done
for test in blurfl; do # this is the addressable line
more garbage
done
line 1
line 2
line 3
line 4
line 5
I've done the obvious 1,/for test in blurfl/d, but that doesn't get the next two lines. I can make the command {N;d}, which gets rid of the next line, but {N;N;d} just blows away the rest of the file except the last line. I figured out that this is because the range isn't slurped up and treated as a single entity; instead the command is run line by line.
I feel like I'm missing something obvious because I don't know some sed idiom, but none of the examples on the web or in the GNU manual have managed to trigger anything useful.
I can do this in awk, but other transformations I need to do make awk somewhat, well, awkward. But GNU sed is acceptable.
I have to disagree about [not] using awk. Anything non-trivial is almost always easier in awk than sed [even the sed manpage says so]. Personally, I'd use perl, but ...
So, here's the awk script:
BEGIN {
    phase = 0
}
# initial match -- find second loop
phase == 0 {
    if ($0 ~ /for test in blurfl/) {
        phase = 1
        next
    }
}
# wait for end of second loop
phase == 1 {
    if ($0 ~ /done/) {
        phase = 2
        next
    }
}
# print phase
phase == 2 {
    print($0)
}
If you wish to torture yourself [and sed] for complex changes, well, caveat emptor, but don't say I didn't warn you ...
I don't think you can do multi-line matches in sed. The first time I went down this rabbit hole I ended up using awk, which can support them, but these days I'd probably use Python or Ruby for this kind of thing.
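As a rough idea of what the Python route might look like: find the addressable line, then skip a fixed number of lines after it. The function name drop_header and the extra parameter are assumptions made for this sketch; the value 2 comes from the two header lines that follow the match in the example input.

```python
import re


def drop_header(lines, pattern=r"for test in blurfl", extra=2):
    """Drop everything up to the first line matching pattern, plus the
    following `extra` lines, and return the rest."""
    for i, line in enumerate(lines):
        if re.search(pattern, line):
            return lines[i + 1 + extra:]
    return lines  # no match: leave the input untouched


text = """a bunch of make output which I don't care about
for junk in blah; do
can't check for done!
done
for test in blurfl; do # this is the addressable line
more garbage
done
line 1
line 2
line 3
line 4
line 5"""
print("\n".join(drop_header(text.splitlines())))
```

Returning the input unchanged when nothing matches is a design choice of this sketch; the sed range 1,/for test in blurfl/d would instead delete the whole file in that case.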
As the title suggests, I want to comment out part of a line in MATLAB.
Specifically, I want to comment out a part in the middle of a line, not everything up to the end of the line.
The reason is that I have to try two different versions of a line and I don't want to duplicate the line. I know it is easy to comment/uncomment if I duplicate the line, but I want it this way.
Within one line is not possible (afaik), but you can split up your term into multiple lines:
x=1+2+3 ... optional comments for each line
... * factorA ... can be inserted here
* factorB ...
+4;
Here * factorA is commented out and * factorB is used, resulting in the term x=1+2+3*factorB+4.
The documentation contains a similar example, commenting out one part of an array:
header = ['Last Name, ', ...
'First Name, ', ...
... 'Middle Initial, ', ...
'Title']
Nope, this is not possible. From help '%':
% Percent. The percent symbol is used to begin comments.
Logically, it serves as an end-of-line character. Any
following text on the line is ignored or printed by the
HELP system.
So just copy-paste the line, or write a tiny function so that it's easier to switch between versions.
Let's say I am trying to read in data line by line from a file called input.txt. There's about 20 lines and each line consists of 3 different data types. If I use this code:
while(!file.eof()){ ..... }
Does this function look at only one data type from each line per iteration, or does it look at all the data types at once for each line per iteration, so that the next iteration would look at the next line instead of the next data type?
Many thanks.
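.eof() only looks at the end-of-file flag. That flag is set after a read has already tried to run past the end of the file, so a loop that tests it first executes one extra time with a failed read. This is not desirable.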
A great blog post on how this works and best practice can be found here.
Basically, use
std::string line;
while(getline(file, line)) { ... }
or
while (file >> some_data) { ... }
as it will notice errors and the end of the file at the correct time and act accordingly.
So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now. I then have to read each sequence, compute the relative amounts of 'G' and 'C' within each sequence, and write the names of the genes and their respective 'G' and 'C' content to a TAB-delimited file.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' are, but it probably doesn't matter.
Generally:
1. Open the input file.
2. Read and parse the data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
3. Compute whatever you're trying to compute.
4. Print to screen (standard out) or to another file.
A "TAB-delimited" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character, as a quick Google or Stack Overflow search would tell you.
Here is an approach using the awk utility, which can be run from the command line. Save the following program to a file and run it with awk -f <path> <sequence file>
# A label line starts with ">": report the previous gene, then start a new one
/^>/ {
    if (name != "") report()
    name = substr($0, 2)
    g = c = total = 0
    next
}
# Any other line is sequence data: go through every base in the line
{
    for (i = 1; i <= length($0); i++) {
        # toupper() because some bases are lowercase in some fasta files
        base = toupper(substr($0, i, 1))
        # "total" is increased by 1 for every base encountered
        total++
        if (base == "C") c++
        else if (base == "G") g++
    }
}
# The END block reports the last gene in the file
END {
    if (name != "") report()
}
# Print the gene name and its G and C content in percent, separated by tabs
function report() {
    if (total > 0)
        print name "\tG content:\t" (100*g/total) "%\tC content:\t" (100*c/total) "%"
}
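For comparison, the same per-gene computation can be sketched in Python (the question itself is about Perl, so treat this only as an outline of the logic). The name gc_rows is made up for this example, and the input is assumed to be the lines of a FASTA file in which a sequence may span several lines.

```python
def gc_rows(lines):
    """Yield (label, g_percent, c_percent) for each FASTA record."""
    label, bases = None, []

    def finish():
        seq = "".join(bases).upper()  # some files use lowercase bases
        n = len(seq) or 1             # avoid dividing by zero on empty records
        return label, 100 * seq.count("G") / n, 100 * seq.count("C") / n

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if label is not None:
                yield finish()
            label, bases = line[1:], []
        elif line:
            bases.append(line)
    if label is not None:
        yield finish()


fasta = [">geneA", "GGCC", "AATT", ">geneB", "gcgc"]
for name, g, c in gc_rows(fasta):
    # One tab-separated row per gene: name, G percentage, C percentage.
    print(f"{name}\t{g:.1f}\t{c:.1f}")
```

To produce the TAB-delimited file, the same rows can be written to a file opened for writing instead of being printed.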