sed with more than 100000 characters - sed

I need to identify and remove several runs of 100000 N (as in the character N) from an 18 GB file. They occur in long strings. The command I want to use is:
sed -r '/N{100000}/d' bigFile > newBigFile
The error I get is that the { is an illegal character. Decreasing the number to 10000 yields no errors, and the process runs just fine.
Help is appreciated.

I've checked sed on my Fedora Linux and found that the repetition count in a {n} interval has a maximum of 2^15 - 1, so the largest count you can write in the regex is 32767:
sed -r 's/N{32767}//g' bigFile > newBigFile
You can also multiply this value by repeating a longer group (e.g. three characters per repetition, which matches 3 x 32767 = 98301 characters):
sed -r 's/(NNN){32767}//g' bigFile > newBigFile
You can even leave the upper bound open, if that is acceptable in your case:
sed -r 's/N{32767,}//g' bigFile > newBigFile
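Note that the question deletes whole lines (/.../d) while the commands above only strip the run itself (s/...//g). A minimal sketch of the difference on a toy input, assuming a threshold of 5 instead of 100000 and a sed that accepts -E for extended regular expressions; the sample strings are made up for illustration:

```shell
# Toy stand-in for the 18 GB file: threshold 5 instead of 100000.
sample=$(printf 'ACGTNNNNNNACGT\nACNNGT\n')

# Delete every line containing a run of at least 5 N (the question's approach).
deleted=$(printf '%s\n' "$sample" | sed -E '/N{5,}/d')

# Remove only the run and keep the rest of the line (the answer's approach).
stripped=$(printf '%s\n' "$sample" | sed -E 's/N{5,}//g')

printf 'deleted:\n%s\n' "$deleted"
printf 'stripped:\n%s\n' "$stripped"
```

With the open-ended interval the whole run is removed, however long it is; a fixed {5} would leave any extra N behind.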

Related

How to make a regex work with the perl command and extract numbers from a file?

I'm trying to extract a number from a tab-delimited file and store it in a variable. I'm approaching the problem with a regex that I was able to build thanks to some research online.
The file is composed as follows:
0 0 2500 5000
1 5000 7500 10000
2 10000 12500 15000
3 15000 17500 20000
4 20000 22500 25000
5 25000 27500 30000
I need to extract the number in the second column given a number of the first one. I wrote and tested online the regex:
(?<=5\t).*?(?=\t)
I need the 25000 from the sixth line.
I started working with sed but as you already know, it doesn't like lookbehind and lookahead pattern even with the -E option to enable extended version of regular expressions. I tried also with awk and grep and failed for similar reasons.
Going further I found that perl could be the right command but I'm not able to make it work properly. I'm trying with the command
perl -pe '/(?<=5\t).*?(?=\t)/' | INFO.out
but I admit my poor knowledge and I'm a bit lost.
The next step would be to read the "5" in the regex from a variable so if you already know problems that could rise, please let me know.
No need for lookbehinds -- split each line on whitespace and check whether the first field is 5.
In Perl there is a command-line option convenient for this, -a, with which each line gets split for us and we get the @F array with the fields
perl -lanE'say $F[1] if $F[0] == 5' data.txt
Note that this tests for 5 numerically (==)
grep supports -P for perl regex, and -o for only-matching, so this works with a lookbehind:
grep -Po '(?<=5\t)\d+' file
That can use a shell variable pretty easily:
VAR=5 && grep -Po "(?<=$VAR\t)\d+" file
Or perl -n, to show using s///e to match and print capture group:
perl -lne 's/^5\t(\d+)/print $1/e' file
Why do you need to use a regex? If all you are doing is finding lines starting with a 5 and getting the second column you could use sed and cut, e.g.:
<infile sed -n '/^5\t/p' | cut -f2
Output:
25000
One option is to use sed, match 5 at the start of the string and after the tab capture the digits in a group
sed -En 's/^5\t([[:digit:]]+)\t.*/\1/p' file > INFO.out
The file INFO.out contains:
25000
Using sed
$ var1=$(sed -n 's/^5[^0-9][^0-9]*\([0-9]*\).*/\1/p' input_file)
$ echo "$var1"
25000
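For the follow-up in the question (reading the "5" from a variable), an exact field comparison avoids building a regex from the variable at all. A minimal sketch using awk, assuming the file really is tab-separated; the sample rows, including the extra row starting with 15, and the names data and key are made up for illustration:

```shell
# Hypothetical sample file with the tab-separated layout from the question.
data=$(mktemp)
printf '4\t20000\t22500\t25000\n5\t25000\t27500\t30000\n15\t75000\t77500\t80000\n' > "$data"

key=5   # value to look up in the first column

# -v passes the shell variable into awk; $1 == key compares the whole field,
# so the row starting with 15 is not picked up.
value=$(awk -F'\t' -v key="$key" '$1 == key { print $2 }' "$data")
echo "$value"

rm -f "$data"
```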

Modify number with pattern X.XXX

I have variable data made up of digits and an optional minus sign ("[-]", "[0-9]"). It may look like this:
source NUMBER -> result NUMBER AFTER MODIFY
XXX -> 0.XXX,
XXXX -> X.XXX,
XXXXX -> XX.XXX,
-XXX -> -0.XXX,
-XXXX -> -X.XXX,
-XXXXX -> -XX.XXX,
Can this be done with sed?
I'd say:
sed -r 's/[0-9]{3}$/.&/; s/^(-?)\./\10./' filename
That is:
s/[0-9]{3}$/.&/ # put a dot before the last three digits in a line
s/^(-?)\./\10./ # if the result of this begins with . or -., insert a 0
# before the .
-r requires GNU sed. If you're on BSD or Mac OS X, which comes with BSD sed, you could use
sed 's/[0-9]\{3\}$/.&/;s/^\(-\{0,1\}\)\./\10./' filename
That's the same thing with basic instead of extended regex syntax (\{0,1\} replaces ?, because \? is a GNU extension in basic regular expressions).
Addendum: Come to think of it, this appears to be equivalent to
awk '{ printf("%.3f\n", $0 / 1000) }' filename
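sed 's/\(-*\)\(.*\)\(...\)/\10\2.\3/;s/^\(-*\)0\([1-9]\)/\1\2/' YourFile
Another way: always insert a 0 after the optional sign, then remove it again when it is not needed (the second substitution is anchored at the start of the line so that a 0 among the decimals, as in 0.105, is left alone).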
perl -pe 's/(\d{3})\b/.$1/;
s/\B\./0./' file
line1 : 222<word-boundary> --> .222
line2 : <non-word-boundary>.222 --> 0.222
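A quick way to check any of these commands is to feed it one value per case from the question. A minimal sketch using the first sed answer, assuming a sed that accepts -E (used here in place of -r); the sample numbers are made up:

```shell
# One sample value per case listed in the question.
formatted=$(printf '%s\n' 123 1234 12345 -123 -1234 -12345 |
  sed -E 's/[0-9]{3}$/.&/; s/^(-?)\./\10./')

printf '%s\n' "$formatted"
```

This should print 0.123, 1.234, 12.345, -0.123, -1.234 and -12.345, one per line.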

Remove all lines before a match with sed

I'm using sed to filter a list of files. I have a sorted list of folders and I want to get all lines after a specific one. To do this I'm using the solution described here, which works well with every input I tried except when the match is on the first line. In that case sed removes all lines of the input.
Here is an example:
$ ls -1 /
bin
boot
...
sys
tmp
usr
var
vmlinuz
$ ls -1 / | sed '1,/tmp/d'
usr
var
vmlinuz
$ ls -1 / | sed '1,/^bin$/d'
# sed will delete all lines from the input stream
How should I change the command so that it also handles the edge case where the first line is matched by the regexp?
BTW, sed '1,1d' works correctly and removes only the first line.
Try this (GNU sed only):
sed '0,/^bin$/d'
The output is:
$ sed '0,/^bin$/d' file
boot
...
sys
tmp
usr
var
vmlinuz
This sed command will print all lines after and including the matching line:
sed -n '/^WHATEVER$/,$p'
The -n switch makes sed print only when told (the p command).
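If you don't want to include the matching line, you can tell sed to delete from the start of the file through the matching line (note that this does not handle a match on line 1, because with a numeric start address sed only begins checking the end pattern on the following line):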
sed '1,/^WHATEVER$/d'
(We use the d command which deletes lines.)
You can also try:
awk '/searchname/{p=1;next}{if(p){print}}'
EDIT(considering the comment from Joe)
awk '/searchname/{p++;if(p==1){next}}p' Your_File
I would insert a tag before a match and delete in scope /start/,/####tag####/.
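The first-line case fails because, in an address range, sed only starts looking for the end pattern on the line after the one that matched the start address. A minimal sketch that reproduces this and shows a portable awk alternative, assuming a short made-up listing; the flag name f is arbitrary:

```shell
listing=$(printf 'bin\nboot\nsys\ntmp\n')

# With 1 as the start address, sed looks for /^bin$/ from line 2 onwards,
# never finds it, and so deletes through the end of the input.
with_sed=$(printf '%s\n' "$listing" | sed '1,/^bin$/d')

# awk: print a line only if an earlier line has already matched.
with_awk=$(printf '%s\n' "$listing" | awk 'f; /^bin$/ { f = 1 }')

printf 'sed printed: [%s]\n' "$with_sed"
printf 'awk printed:\n%s\n' "$with_awk"
```

The awk version does not depend on the GNU-only 0,/re/ address and behaves the same whether the match is on the first line or later.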

remove spaces from cells in matrix

I have a matrix (5800 rows and 350 columns) of numbers. Each cell is one of:
0 / 0
1 / 1
2 / 2
What is the fastest way to remove all spaces in each cell, to have:
0/0
1/1
2/2
Sed, R, anything that will do it fastest.
If you are going for efficiency, you should probably use coreutils tr for such a simple task:
tr -d ' ' < infile
I compared the posted answers against a 300K file, using GNU awk, GNU sed, perl v5.14.2 and GNU coreutils v8.13. The tests were each run 30 times, this is the average:
awk - 1.52s user 0.01s system 99% cpu 1.529 total
sed - 0.89s user 0.00s system 99% cpu 0.900 total
perl - 0.59s user 0.00s system 98% cpu 0.600 total
tr - 0.02s user 0.00s system 90% cpu 0.020 total
All tests were run as above (cmd < infile) and with the output directed to /dev/null.
Using sed:
sed "s/ \/ /\//g" input.txt
It means:
Replace the string " / " (/ \/ /) by one slash (/\/) and do it globally (/g).
Here's an awk alternative that removes every space on each line:
awk '{gsub(" ",""); print}' input.txt > output.txt
Explanations:
awk '{...}': invoke awk, then for each line do the stuff enclosed by braces.
gsub(" ","");: replace all space chars (single or multiple in a row) with the empty string.
print: print the entire line
input.txt: specifying your input file as argument to awk
> output.txt: redirect output to a file.
A perl solution could look like this:
perl -pwe 'tr/ //d' input.txt > output.txt
You can add the -i switch to do in-place edit.
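The answers above differ in one respect: tr, awk and perl remove every space, while the sed command only removes the spaces around a slash. That matters only if the columns themselves are separated by spaces, which the question does not say. A minimal sketch under that assumption, with a made-up three-cell row:

```shell
row='0 / 0 1 / 1 2 / 2'   # assumed layout: cells separated by single spaces

# tr deletes every space, including the ones between cells.
with_tr=$(printf '%s\n' "$row" | tr -d ' ')

# sed only collapses " / " into "/" and keeps the cell separators.
with_sed=$(printf '%s\n' "$row" | sed 's| / |/|g')

printf 'tr:  %s\n' "$with_tr"    # 0/01/12/2
printf 'sed: %s\n' "$with_sed"   # 0/0 1/1 2/2
```

If the columns are tab-separated, both approaches give the same result and tr remains the fastest choice.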

How to output lines 800-900 of a file with a unix command?

I want to output all lines between a and b in a file.
This works but seems like overkill:
head -n 900 file.txt | tail -n 100
My lack of unix knowledge seems to be the limit here. Any suggestions?
sed -n '800,900p' file.txt
This will print (p) lines 800 through 900, including both line 800 and 900 (i.e. 101 lines in total). It will not print any other lines (-n).
Adjust from 800 to 801 and/or 900 to 899 to make it do exactly what you think "between 800 and 900" should mean in your case.
Found a prettier way: Using sed, to print out only lines between a and b:
sed -n -e 800,900p filename.txt
From the blog post: Using sed to extract lines in a text file
One way I am using it is to find (and diff) similar sections of files:
sed -n -e 705,830p mnetframe.css > tmp1; \
sed -n -e 830,955p mnetframe.css > tmp2; \
diff --side-by-side tmp1 tmp2
Which will give me a nice side-by-side comparison of similar sections of a file :)
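On a large file, sed -n '800,900p' still reads to the end of the input after printing line 900. A minimal sketch that quits right after the last wanted line, using seq as a stand-in for file.txt so that line N contains the number N:

```shell
# Print lines 800-900, then quit at line 900 instead of reading on.
range=$(seq 1 2000 | sed -n '800,900p;900q')

# The head/tail form of the same range needs 101 lines, not 100.
with_head=$(seq 1 2000 | head -n 900 | tail -n 101)

printf '%s\n' "$range" | head -n 1   # first line of the range
printf '%s\n' "$range" | tail -n 1   # last line of the range
```

With tail -n 100, as in the question, the result would start at line 801 instead of 800.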