Remove rows which contain a specific delimiter with a different number of columns - sed

I want to remove rows that contain a specific delimiter and have a different number of columns. Sample input:
CPU Load for sdp4
7c:e5:3b:6e:2e:5f:d9:4d:68:4d:d5:57:3a:cb:4d:45.
02:30PM up 1 day, 9:20, 2 users, load average: 6.88, 5.96, 5.57
In that case I want to remove every line containing the delimiter ":", such as:
7c:e5:3b:6e:2e:5f:d9:4d:68:4d:d5:57:3a:cb:4d:45.
I want to remove any line like this that contains that delimiter.
Expected view:
CPU Load for sdp4
02:30PM up 1 day, 9:20, 2 users, load average: 6.88, 5.96, 5.57

You could try:
grep -v ":..:" yourfile
That will remove all lines containing a pair of colons separated by any two characters, a pattern which doesn't appear in the lines that you want to keep.
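Since the title mentions sed, the equivalent sed delete command would be (a minimal sketch; yourfile is a placeholder):
sed '/:..:/d' yourfile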

Try this grep line (-P enables Perl-compatible regular expressions):
grep -Pv '^..(:..)+\.$' file
with your example:
kent$ echo " CPU Load for sdp4
7c:e5:3b:6e:2e:5f:d9:4d:68:4d:d5:57:3a:cb:4d:45.
02:30PM up 1 day, 9:20, 2 users, load average: 6.88, 5.96, 5.57"|grep -Pv '^..(:..)+\.$'
CPU Load for sdp4
02:30PM up 1 day, 9:20, 2 users, load average: 6.88, 5.96, 5.57
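The pattern anchors the whole line: two characters, then one or more groups of a colon plus two characters, ending in a literal dot, so only the fingerprint-style lines match. The same expression also works as an extended-regex sed delete, should you prefer sed (an untested sketch):
sed -E '/^..(:..)+\.$/d' file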

How to copy last 13 characters of a string?

In Notepad++ I have a list of entries, and at the end of each entry is a phone number (with dashes, 12 characters total). How do I either remove all the text before the number, or copy/cut the number from the end of the entry, for multiple entries at once? Thanks!
i.e.
1 $1,300 Deposit $1,300 Available 12/15/16 2050 Hurricane Shoals 678-790-0986
2 7 $1,400 Deposit $1,400 Available 12/22/16 1453 Alamein Dr  404-294-6441
3 $1,500 - $1,590 Not Income Based  /  Deposit $1,500 - $1,590 678-328-7952
Here is a way:
Ctrl+H
Find what: ^.*([\d-]{12})$
Replace with: $1
check Regular expression
Replace all
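The greedy ^.* consumes everything up to the last 12 digits-and-dashes characters, which the capture group keeps. Outside Notepad++, the same substitution can be scripted with sed (a sketch, assuming GNU sed and that the list lives in a file called entries.txt):
sed -E 's/^.*([0-9-]{12})$/\1/' entries.txt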

Why is my sed script to split a FASTA file so slow?

I have a 600 MB FASTA file containing many alignment blocks from 12 species, and I want to split it into smaller FASTA files, each containing one block with its corresponding alignments.
I have a sed script that looks like this:
#!/bin/bash
# Nblocks is the highest block_index present in the file
for i in $(seq 0 "$Nblocks"); do
    sed -n "/block_index=$i|/,/^$/p" genome12species.fasta > bloque$i.fasta
done
This works at a small scale, but for a file as big as 600 MB it takes too long, around 2 days, and I don't think that's a limitation of the machine I am running it on.
Does anyone know how to make this faster?
The input FASTA file looks like this:
dm3.chr3R(-):17092630-17092781|sequence_index=0|block_index=4|species=dm3|dm3_4_0
GGCGGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droGri2.scaffold_15074(-):2610183-2610334|sequence_index=0|block_index=4|species=droGri2|droGri2_4_0
GGCGGAGATCAAGAATCGTGTTGGGCCGCCGTCGAGCGCCACCGATAACGCTAGCAAAGTGAAAATCGATCAGGGACGCCCAGTGGAAAACAATAGATCTGGTTGCTGCTAAATAA-CTCTGATTGTGAATCATTATTTTATTATACAATTa
droMoj3.scaffold_6540(+):33866311-33866462|sequence_index=0|block_index=4|species=droMoj3|droMoj3_4_0
TGCCGAGATTAAGAATCGTGTCGGTCCGCCGTCCAGCGCAACCGACAATGCAAGCAAAGTGAAAATCGATCAGGGACGTCCAGTGGAGAACACCAGATCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCATTATTTTATTatacaatta
droVir3.scaffold_12822(+):1248119-1248270|sequence_index=0|block_index=4|species=droVir3|droVir3_4_0
GGCCGAGATTAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACCGATAATGCTAGCAAAGTGAAAATCGATCAGGGTCGTCCAGTGGAGAACACCAAATCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droWil1.scaffold_181130(-):16071336-16071488|sequence_index=0|block_index=4|species=droWil1|droWil1_4_0
GGCCGAGATTAAGAATCGTGTTGGGCCGCCGTCCAGCGCCACTGATAATGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAATACCAAATCCGGTTGCTGCTGAATAAACTCTGATTGTGAATCATTATTTTATTATACAATTA
droPer1.super_19(-):1310088-1310239|sequence_index=0|block_index=4|species=droPer1|droPer1_4_0
GGCTGAGATCAAGAATCGCGTCGGACCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAAACCCAATTCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
dp4.chr2(-):5593491-5593642|sequence_index=0|block_index=4|species=dp4|dp4_4_0
GGCTGAGATCAAGAATCGCGTCGGACCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAAGCCCAATTCTGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droAna3.scaffold_13340(-):3754154-3754305|sequence_index=0|block_index=4|species=droAna3|droAna3_4_0
GGCCGAGATCAAGAATCGCGTCGGGCCACCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAGATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattataaaatta
droEre2.scaffold_4770(+):4567591-4567742|sequence_index=0|block_index=4|species=droEre2|droEre2_4_0
GGCCGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droYak2.chr3R(-):5883047-5883198|sequence_index=0|block_index=4|species=droYak2|droYak2_4_0
GGCCGAGATCAAGAATCGCGTCGGGCCGCCATCCAGCGCCACCGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droSec1.super_38(+):36432-36583|sequence_index=0|block_index=4|species=droSec1|droSec1_4_0
GGCGGAGATCAAGAATCGCGTCGGTCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
droSim1.chr3R(+):4366350-4366501|sequence_index=0|block_index=4|species=droSim1|droSim1_4_0
GGCGGAGATCAAGAATCGCGTCGGGCCGCCGTCCAGCGCCACTGACAACGCTAGCAAAGTGAAAATCGATCAAGGACGTCCAGTGGAAAACACCAAATCCGGTTGCTGCTGAATAA-CTCTGATTGTGAATCattattttattatacaatta
dm3.chr3R(-):17092781-17092867|sequence_index=0|block_index=5|species=dm3|dm3_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTCATGACGATGGC
droSim1.chr3R(+):4366264-4366350|sequence_index=0|block_index=5|species=droSim1|droSim1_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTTATGACGATGGC
droSec1.super_38(+):36346-36432|sequence_index=0|block_index=5|species=droSec1|droSec1_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTTGAGCAGGCCTTCATGACGATGGC
droYak2.chr3R(-):5883198-5883284|sequence_index=0|block_index=5|species=droYak2|droYak2_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACATCGGCCAAGAGCGCCACCAACGTGGAGCAGGCCTTCATGACGATGGC
droEre2.scaffold_4770(+):4567505-4567591|sequence_index=0|block_index=5|species=droEre2|droEre2_5_0
GAGTACGCCGCCCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTGGAGCAGGCCTTCATGACGATGGC
droAna3.scaffold_13340(+):20375068-20375148|sequence_index=0|block_index=5|species=droAna3|droAna3_5_0
------GCCGAAAACTTCGACATGCCCTTCTTCGAGGTCTCTTGCAAGTCAAACATCAATATTGAAGATGCGTTTCTTTCCCTGGC
dp4.chr2(-):5593642-5593728|sequence_index=0|block_index=5|species=dp4|dp4_5_0
GAGTATGCAGCTCAGTTAGGCATTCCATTTCTTGAAACTTCGGCCAAGAGCGCCACGAACGTGGAGCAGGCCTTCATGACGATGGC
droPer1.super_19(-):1310239-1310325|sequence_index=0|block_index=5|species=droPer1|droPer1_5_0
GAGTATGCAGCTCAGTTAGGCATTCCATTTCTTGAAACTTCGGCCAAGAGCGCCACGAACGTGGAGCAGGCCTTCATGACGATGGC
droWil1.scaffold_181130(-):16071488-16071574|sequence_index=0|block_index=5|species=droWil1|droWil1_5_0
GAATATGCGGCTCAGTTAGGCATTCCATTCCTTGAAACTTCGGCAAAGAGTGCCACCAATGTGGAGCAGGCCTTTATGACGATGGC
droVir3.scaffold_12822(+):1248033-1248119|sequence_index=0|block_index=5|species=droVir3|droVir3_5_0
GAGTACGCACATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAACGTGGAGCAGGCATTTATGACGATGGC
droMoj3.scaffold_6540(+):33866225-33866311|sequence_index=0|block_index=5|species=droMoj3|droMoj3_5_0
GAGTATGCACATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCCAAGAGCGCCACCAATGTAGAGCAGGCATTCATGACGATGGC
droGri2.scaffold_15074(-):2610334-2610420|sequence_index=0|block_index=5|species=droGri2|droGri2_5_0
GAGTACGCAAATCAGTTAGGCATTCCATTCCTTGAAACTTCGGCGAAGAGTGCCACCAATGTGGAACAGGCATTCATGACGATGGC
Here's an awk one-liner to get you started. It uses the same regex range as your sed, with the matched block_index captured in m[1] (the three-argument match() requires GNU awk). Your loop re-reads the whole 600 MB file once per block; this reads it once, so 600 MB should take just a few minutes:
awk 'match($0, /block_index=([0-9]+)\|/, m),/^$/ {print >"bloque"m[1]".fasta"}'
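If gawk isn't available, a more portable sketch does the extraction with sub() instead; the close() call keeps the number of open files down when there are many blocks:
awk '/block_index=/ {
         n = $0
         sub(/.*block_index=/, "", n)    # strip everything before the index
         sub(/\|.*/, "", n)              # strip everything after it
         out = "bloque" n ".fasta"
     }
     /^$/ { if (out) close(out); out = ""; next }   # blank line ends a block
     out  { print > out }' genome12species.fasta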

How to use fish shell to add numbers to files

I have a very simple mp3 player, and the order in which it plays audio files is based on the file names. The rule is that there must be a three-digit number at the beginning of each file name, such as:
001file.mp3
002file.mp3
003file.mp3
I want to write a fish shell function sortmp3 to add numbers to the files of a directory. Say the directory myfiles contains these files:
aaa.mp3
bbb.mp3
ccc.mp3
When I run sortmp3 myfiles, the file names will be changed to:
001aaa.mp3
002bbb.mp3
003ccc.mp3
But my questions are:
How do I generate sequential numbers?
How do I make sure each number is exactly three digits wide?
I would write this, which makes no assumptions about how many files there are in a directory:
function sortmp3
    set -l files *
    set -l i
    for i in (seq (count $files))
        echo mv $files[$i] (printf "%03d%s" $i $files[$i])
    end
end
Remove the "echo" if you like how it works.
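With the example directory above, the dry run should print:
mv aaa.mp3 001aaa.mp3
mv bbb.mp3 002bbb.mp3
mv ccc.mp3 003ccc.mp3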
You can generate sequential numbers with the seq tool - an external program.
That only takes care of the first question; it won't pad the numbers to three characters.
To do that, there's a variety of choices:
printf '%s\n' 00(seq 0 99) | rev | cut -c 1-3 | rev
printf '%s\n' 00(seq 0 99) | sed 's/^.*\(...\)$/\1/'
The 00(seq 0 99) part generates the numbers from "0" to "99" with two zeroes prepended, i.e. "000" to "0099". The later parts of each pipeline trim them to the last three characters, removing the superfluous leading zeroes.
Or with the next fish version, you can use the new string tool:
string sub -s -3 -- 00(seq 0 99)
Depending on your specific situation, use the seq command to generate sequential numbers or the math command to increment a counter. To format the number with a predictable number of leading zeros, use the printf command:
set idx 12
printf '%03d' $idx
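Putting the counter and printf together, a math-based sortmp3 might look like this (a sketch only: it renames the .mp3 files of the current directory in glob order and assumes fewer than 1000 files):
function sortmp3
    set -l i 0
    for f in *.mp3
        set i (math $i + 1)                # increment the counter
        mv -- $f (printf '%03d%s' $i $f)   # aaa.mp3 -> 001aaa.mp3, ...
    end
end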

Emulate SAS' data step FIRST. variable using Linux command line tools

Let's say I have the first column of the following dataset in a file, and I want to emulate the flag in the second column so that I export only the rows tied to flag = 1 (the dataset is pre-sorted by the target column):
1 1
1 0
1 0
2 1
2 0
2 0
I could run awk '!seen[$1]++' dataset, but I would run into a problem for very large files (the seen array keeps growing). Is there an alternative that handles this without tracking every single unique value of the target column (here column #1)? Thanks.
So you only have the first column and would like to generate the second? I think a slightly different awk command could work:
awk '{if (last==$1) {flag=0} else {last=$1; flag=1}; print $0,flag}' file.txt
Basically, you just check whether the first field matches the last one you've seen. Since the input is sorted, you don't have to keep track of everything you've seen, only the last value, to know whether it has changed.
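If you only need the rows where the flag would be 1 (the first row of each group), you can skip the flag entirely; a minimal sketch under the same sorted-input assumption:
awk '$1 != last { print } { last = $1 }' file.txt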
Seems like grep would be fine for this (assuming the flag column already exists in the file):
$ grep " 1" dataset

How to get the difference between time data kept in two different files in Unix?

I have two files, each with some 200K timestamps in a single column. I want to find the difference between corresponding rows (mapped one to one) in seconds.
For example:
One file has 2013-06-04 11:21:28 and the second file has 2013-06-04 11:21:55 in the same row, so I want to get the output 27, that is, 27 seconds.
Can someone help me with a Unix command to get this done?
ddiff from https://github.com/hroptatyr/dateutils to the rescue:
ddiff 2012-03-01T12:17:00 2012-03-02T14:00:00
=>
92580s
paste -d, a b | while IFS=, read t1 t2
do
    echo "$(( $( date -d "$t2" +%s ) - $( date -d "$t1" +%s ) ))"
done
That should do it.
Filenames are assumed to be "a" and "b".
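For 200K rows, forking two date processes per line will be slow. GNU awk can do the whole job in one pass with mktime() (a sketch, assuming the same files a and b, timestamps in YYYY-MM-DD HH:MM:SS form, interpreted in the local time zone):
paste -d'|' a b | gawk -F'|' '{
    t1 = $1; t2 = $2
    gsub(/[-:]/, " ", t1)    # mktime wants "YYYY MM DD HH MM SS"
    gsub(/[-:]/, " ", t2)
    print mktime(t2) - mktime(t1)
}'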