remove spaces from cells in matrix - perl

I have a matrix(5800 rows and 350 columns) of numbers. Each cell is either
0 / 0
1 / 1
2 / 2
What is the fastest way to remove all spaces in each cell, to have:
0/0
1/1
2/2
Sed, R, anything that will do it fastest.

If you are going for efficiency, you should probably use coreutils tr for such a simple task:
tr -d ' ' < infile
I compared the posted answers against a 300K file, using GNU awk, GNU sed, perl v5.14.2 and GNU coreutils v8.13. Each test was run 30 times; these are the averages:
awk - 1.52s user 0.01s system 99% cpu 1.529 total
sed - 0.89s user 0.00s system 99% cpu 0.900 total
perl - 0.59s user 0.00s system 98% cpu 0.600 total
tr - 0.02s user 0.00s system 90% cpu 0.020 total
All tests were run as above (cmd < infile), with the output redirected to /dev/null.
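For anyone who wants to try the tr approach on a small sample first, here is a minimal sketch. The function name strip_spaces and the sample row are made up for illustration, and it assumes the cells are tab-separated (the question does not say).

```shell
# Hypothetical helper name; it only wraps the tr call from the answer above.
strip_spaces() {
    tr -d ' '
}

# Made-up sample row: three tab-separated cells in the question's "N / N" form.
printf '0 / 0\t1 / 1\t2 / 2\n' | strip_spaces
```

Note that tr -d ' ' deletes every space on the line, so it is only safe when spaces never act as the cell separator.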

Using sed:
sed "s/ \/ /\//g" input.txt
It means:
Replace the string " / " (/ \/ /) by one slash (/\/) and do it globally (/g).

Here's an awk alternative that gives the same result on this data (it removes every space, not just the ones around the slash):
awk '{gsub(" ",""); print}' input.txt > output.txt
Explanations:
awk '{...}': invoke awk, then for each line do the stuff enclosed by braces.
gsub(" ","");: replace all space chars (single or multiple in a row) with the empty string.
print: print the entire line
input.txt: specifying your input file as argument to awk
> output.txt: redirect output to a file.
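One thing worth checking before choosing between the sed and awk commands: sed rewrites only the " / " sequence, while gsub(" ","") removes every space. The sketch below assumes, purely for illustration, that the cells are separated by single spaces; the helper names are hypothetical.

```shell
# Replace only " / " with "/" (the sed answer).
slash_only() {
    sed 's/ \/ /\//g'
}

# Remove every space on the line (the awk answer).
all_spaces() {
    awk '{gsub(" ",""); print}'
}

# Made-up row with space-separated cells.
printf '0 / 0 1 / 1\n' | slash_only   # cells stay separated
printf '0 / 0 1 / 1\n' | all_spaces   # cells run together
```

With tab-separated cells both commands give the same output, so the difference only matters for space-delimited input.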

A perl solution could look like this:
perl -pwe 'tr/ //d' input.txt > output.txt
You can add the -i switch to edit the file in place.

How to make regex works with perl command and extract numbers from a file?

I'm trying to extract a number from a tab-delimited file and store it in a variable. I'm approaching the problem with a regex that I was able to build thanks to some research online.
The file is composed as follows:
0 0 2500 5000
1 5000 7500 10000
2 10000 12500 15000
3 15000 17500 20000
4 20000 22500 25000
5 25000 27500 30000
I need to extract the number in the second column given a number of the first one. I wrote and tested online the regex:
(?<=5\t).*?(?=\t)
I need the 25000 from the sixth line.
I started working with sed but as you already know, it doesn't like lookbehind and lookahead pattern even with the -E option to enable extended version of regular expressions. I tried also with awk and grep and failed for similar reasons.
Going further I found that perl could be the right command but I'm not able to make it work properly. I'm trying with the command
perl -pe '/(?<=5\t).*?(?=\t)/' | INFO.out
but I admit my poor knowledge and I'm a bit lost.
The next step would be to read the "5" in the regex from a variable, so if you already know of problems that could arise, please let me know.
No need for lookbehinds -- split each line on whitespace and check whether the first field is 5.
In Perl there is a convenient command-line option for this, -a, which splits each line for us and puts the fields in the @F array:
perl -lanE'say $F[1] if $F[0] == 5' data.txt
Note that this tests for 5 numerically (==)
grep supports -P for perl regex, and -o for only-matching, so this works with a lookbehind:
grep -Po '(?<=5\t)\d+' file
That can use a shell variable pretty easily:
VAR=5 && grep -Po "(?<=$VAR\t)\d+" file
Or perl -n, to show using s///e to match and print capture group:
perl -lne 's/^5\t(\d+)/print $1/e' file
Why do you need to use a regex? If all you are doing is finding lines starting with a 5 and getting the second column you could use sed and cut, e.g.:
<infile sed -n '/^5\t/p' | cut -f2
Output:
25000
One option is to use sed, match 5 at the start of the string and after the tab capture the digits in a group
sed -En 's/^5\t([[:digit:]]+)\t.*/\1/p' file > INFO.out
The file INFO.out contains:
25000
Using sed
$ var1=$(sed -n 's/^5[^0-9]*\([^ ]*\).*/\1/p' input_file)
$ echo "$var1"
25000
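For the follow-up step of reading the "5" from a variable, here is a minimal awk sketch. None of the answers above uses awk this way, so treat it as one possible approach; the helper name lookup and the temporary file are hypothetical. Comparing the whole first field also avoids matching a row such as 15, which a pattern like 5\t without an anchor would hit.

```shell
# Hypothetical helper: print column 2 of the row whose column 1 equals the key.
# $1 = key, $2 = tab-delimited file
lookup() {
    awk -F '\t' -v key="$1" '$1 == key { print $2 }' "$2"
}

# A few rows from the question, written with real tab characters.
data=$(mktemp)
printf '%s\t%s\t%s\t%s\n' 0 0 2500 5000  4 20000 22500 25000  5 25000 27500 30000 > "$data"

var=5
value=$(lookup "$var" "$data")   # store the result in a shell variable
echo "$value"
rm -f "$data"
```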

Sed Process Substitution on Insert - Without Backslashes

I have a function that prints a header that needs to be applied across several files, but if I use a sed process substitution, the lines prior to the last have a backslash \ on them.
E.g.
function print_header() {
cat << EOF
-------------------------------------------------------------------
$(date '+%B %d, %Y # ~ %r') ID:$(echo $RANDOM)
EOF
}
If I then take a file such as test.txt:
line 1
line 2
line 3
line 4
line 5
sed "1 i $(print_header | sed 's/$/\\/g')" test.txt
I get:
-------------------------------------------------------------------\
November 24, 2015 # ~ 11:18:28 AM ID:13187
line 1
line 2
line 3
line 4
line 5
Notice the troublesome backslash at the end of the first line, I'd like to not have that backslash appear. Any ideas?
I would use cat for that:
cat <(print_header) file > file_with_header
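Since the header has to go onto several files, the same cat idea can be wrapped in a loop. This is only a sketch: add_header is a hypothetical name, print_header is the function from the question, and the temporary-file handling is one possible way to rewrite a file in place.

```shell
# The question's header function (RANDOM is a bash feature; plain sh leaves it empty).
print_header() {
cat << EOF
-------------------------------------------------------------------
$(date '+%B %d, %Y # ~ %r') ID:$RANDOM
EOF
}

# Hypothetical helper: prepend the header to each file named on the command line.
add_header() {
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        { print_header; cat "$f"; } > "$tmp" &&
            cat "$tmp" > "$f"      # copy back so the file keeps its permissions
        rm -f "$tmp"
    done
}

# Usage (file names are examples):
# add_header test.txt other.txt
```

No sed is involved, so no trailing backslashes are needed in the header text.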
This behavior depends on the sed dialect. Unfortunately, it's one of the things which depends on which version you have.
To simplify debugging, try specifying verbatim text. Here's one from a Debian system.
vnix$ sed '1i\
> foo\
> bar' <<':'
> hello
> goodbye
> :
foo
bar
hello
goodbye
Your diagnostics appear to indicate that your sed dialect does not in fact require the backslash after the first i.
Since you are generating the contents of the header programmatically anyway, my recommended solution would be to refactor the code so that you can avoid this conundrum. If you don't want cat - test.txt <<EOF then maybe experiment with sed '1r/dev/stdin' <<EOF test.txt (I could not get 1r- to work, but /dev/stdin should be portable to any Linux.)
Here is my kludgy fix, if you can find something more elegant I'll gladly credit you:
sed "1 i $(print_header | sed 's/$/\\/g;$s/$/\x01/')" test.txt | tr -d '\001'
This puts an unprintable SOH character (\x01, ASCII Start Of Header) after the inserted text, which precludes the backslashes; I then pipe the result through tr to delete the SOH characters.

how to replace the tabs with empty space in each file of a directory

I would like to replace the tabs in each file of a directory with the corresponding number of spaces. I already found a solution (11094383) that replaces each tab with a fixed number of spaces:
find ./ -type f -exec sed -i 's/\t/    /g' {} \;
In the solution above tabs are replaced with four spaces. But in my case tabs can occupy more spaces - e.g. 8.
An example of a file with tabs that should be replaced with 8 spaces is:
NSMl1 100 PSHELL 0.00260 400000 400200 400300
400400 400500 400600 400700 400800 400900
401000 401100 400100 430000 430200 430300
430400 430500 430600 430700 430800 430900
431000 431100 430100 401200 431200
Here the lines with tabs are the 3rd to the 5th.
An example of a file with tabs that should be replaced with 4 spaces is:
RBE2 1101001 5000511 123456 1100
Could anybody help?
The classic answer is to use the pr command with options to expand tabs into an appropriate number of spaces, turning off the pagination features:
pr -e8 -l1 -t …files…
The tricky part is getting the file over-written that seems to be part of the question. Of course, sed in the GNU and BSD (Mac OS X) incarnations supports overwriting with the -i option — with variant behaviours between the two as BSD sed requires a suffix for the backup files and GNU sed does not. However, sed does not (readily) support converting tabs to an appropriate number of blanks, so it isn't wholly appropriate.
There's a script overwrite (which I abbreviate to ow) in The UNIX Programming Environment that can do that. I've been using the script since 1987 (first checkin — last updated in 2005).
#!/bin/sh
# Overwrite file
# From: The UNIX Programming Environment by Kernighan and Pike
# Amended: remove PATH setting; handle file names with blanks.
case $# in
0|1) echo "Usage: $0 file command [arguments]" 1>&2
exit 1;;
esac
file="$1"
shift
new=${TMPDIR:-/tmp}/ovrwr.$$.1
old=${TMPDIR:-/tmp}/ovrwr.$$.2
trap "rm -f '$new' '$old' ; exit 1" 0 1 2 15
if "$@" >"$new"
then
cp "$file" "$old"
trap "" 1 2 15
cp "$new" "$file"
rm -f "$new" "$old"
trap 0
exit 0
else
echo "$0: $1 failed - $file unchanged" 1>&2
rm -f "$new" "$old"
trap 0
exit 1
fi
It would be possible and arguably better to use the mktemp command on most systems these days; it didn't exist way back then.
In the context of the question, you could then use:
find . -type f -exec ow {} pr -e8 -t -l1 {} \;
You do need to process each file separately.
If you are truly determined to use sed for the job, then you have your work cut out. There's a gruesome way to do it. There is a notational problem; how to represent a literal tab; I will use \t to denote it. The script would be stored in a file, which I'll assume is script.sed:
:again
/^\(\([^\t]\{8\}\)*\)\t/s//\1        /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{1\}\)\t/s//\1\3       /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{2\}\)\t/s//\1\3      /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{3\}\)\t/s//\1\3     /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{4\}\)\t/s//\1\3    /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{5\}\)\t/s//\1\3   /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{6\}\)\t/s//\1\3  /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{7\}\)\t/s//\1\3 /
t again
That's using the classic sed notation.
You can then write:
sed -f script.sed …data-files…
If you have GNU sed or BSD (Mac OS X) sed, you can use the extended regular expressions instead:
:again
/^(([^\t]{8})*)\t/s//\1        /
/^(([^\t]{8})*)([^\t]{1})\t/s//\1\3       /
/^(([^\t]{8})*)([^\t]{2})\t/s//\1\3      /
/^(([^\t]{8})*)([^\t]{3})\t/s//\1\3     /
/^(([^\t]{8})*)([^\t]{4})\t/s//\1\3    /
/^(([^\t]{8})*)([^\t]{5})\t/s//\1\3   /
/^(([^\t]{8})*)([^\t]{6})\t/s//\1\3  /
/^(([^\t]{8})*)([^\t]{7})\t/s//\1\3 /
t again
and then run:
sed -r -f script.sed …data-files… # GNU sed
sed -E -f script.sed …data-files… # BSD sed
What do the scripts do?
The first line sets a label; the last line jumps to that label if any of the s/// operations in between made a substitution. So, for each line of the file, the script loops until there are no matches made, and hence no substitutions performed.
The 8 substitutions deal with:
A block of zero or more sequences of 8 non-tabs, which is captured, followed by
a sequence of 0-7 more non-tabs, which is also captured, followed by
a tab.
It replaces that match with the captured material, followed by an appropriate number of spaces.
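The "appropriate number of spaces" is just the distance to the next multiple of 8. As a rough illustration of that arithmetic (this is an awk sketch, not the sed script itself; the function name tabs_to_spaces is made up), each tab is replaced by 8 minus the current column modulo 8 spaces:

```shell
# Hypothetical helper: expand tabs to 8-column tab stops using awk.
tabs_to_spaces() {
    awk '{
        n = split($0, part, "\t")          # pieces of the line between tabs
        out = ""
        for (i = 1; i < n; i++) {
            out = out part[i]
            pad = 8 - (length(out) % 8)    # distance to the next tab stop (1..8)
            out = out sprintf("%" pad "s", "")
        }
        print out part[n]
    }' "$@"
}

# Two characters before the tab, so the tab becomes six spaces.
printf 'ab\tc\n' | tabs_to_spaces
```

A tab that follows exactly 8 non-tab characters gets 8 spaces, which corresponds to the first substitution in the sed scripts; the other seven substitutions cover the remainders 1 to 7.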
One curiosity found during the testing is that if a line ends with white space, the pr command removes that trailing white space.
There's also the expand command on some systems (BSD or Mac OS X at least), which preserves the trailing white space. Using that is simpler than pr or sed.
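Where expand is available, a sketch of the whole job could look like this. The helper name detab_files and the temporary-file handling are assumptions for illustration; -t 8 sets the tab stops every 8 columns.

```shell
# Hypothetical helper: expand tabs to 8-column tab stops, rewriting each file in place.
detab_files() {
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        expand -t 8 "$f" > "$tmp" && cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}

# A tab after two characters becomes six spaces, not a fixed four or eight.
printf 'ab\tc\n' | expand -t 8

# Usage with find (one file per invocation is fine too):
# find . -type f -exec sh -c '...' \;   # or simply: detab_files file1 file2
```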
With these sed scripts, and using the BSD or GNU sed with backup files, you can write:
find . -type f -exec sed -i.bak -r -f script.sed {} +
(GNU sed notation; substitute -E for -r for BSD sed.)

How to output lines 800-900 of a file with a unix command?

I want to output all lines between a and b in a file.
This works but seems like overkill:
head -n 900 file.txt | tail -n 100
My lack of unix knowledge seems to be the limit here. Any suggestions?
sed -n '800,900p' file.txt
This will print (p) lines 800 through 900, including both line 800 and 900 (i.e. 101 lines in total). It will not print any other lines (-n).
Adjust from 800 to 801 and/or 900 to 899 to make it do exactly what you think "between 800 and 900" should mean in your case.
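If a and b live in variables, a small wrapper along these lines may help. The name lines_between is hypothetical, and the extra q command is an addition that makes sed stop reading after line b instead of scanning the rest of the file. The sample file is generated only for the demonstration.

```shell
# Hypothetical helper: print lines a..b (inclusive) of a file, then stop reading.
# $1 = a, $2 = b, $3 = file
lines_between() {
    sed -n "${1},${2}p;${2}q" "$3"
}

# Build a 1000-line sample file whose lines are just their own line numbers.
sample=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$sample"

lines_between 800 900 "$sample" | wc -l             # 101 lines
head -n 900 "$sample" | tail -n 101 | head -n 1     # first line is 800
rm -f "$sample"
```

The head/tail pipeline from the question needs tail -n 101 to include line 800; with tail -n 100 it starts at line 801.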
Found a prettier way: Using sed, to print out only lines between a and b:
sed -n -e 800,900p filename.txt
From the blog post: Using sed to extract lines in a text file
One way I am using it is to find (and diff) similar sections of files:
sed -n -e 705,830p mnetframe.css > tmp1; \
sed -n -e 830,955p mnetframe.css > tmp2; \
diff --side-by-side tmp1 tmp2
Which will give me a nice side-by-side comparison of similar sections of a file :)

How can I apply Unix's / Sed's / Perl's transliterate (tr) to only a specific column?

I have program output that looks like this (tab delim):
$ ./mycode somefile
0000000000000000000000000000000000 238671
0000000000000000000000000000000001 0
0000000000000000000000000000000002 0
0000000000000000000000000000000003 0
0000000000000000000000000000000010 0
0000000000000000000000000000000011 1548.81
0000000000000000000000000000000012 0
0000000000000000000000000000000013 937.306
What I want to do is on FIRST column only: replace 0 with A, 1 with C, 2 with G, and 3 with T.
Is there a way I can transliterate that output piped directly from "mycode"?
Yielding this:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 238671
...
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACT 937.306
Using Perl:
C:\> ./mycode file | perl -lpe "($x,$y)=split; $x=~tr/0123/ACGT/; $_=qq{$x\t$y}"
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 238671
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACA 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACC 1548.81
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACG 0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACT 937.306
You can use single quotes in Bash:
$ ./mycode file | perl -lpe '($x,$y)=split; $x=~tr/0123/ACGT/; $_="$x\t$y"'
As @ysth notes in the comments, perl actually provides the command line options -a and -F:
-a autosplit mode with -n or -p (splits $_ into @F)
...
-F/pattern/ split() pattern for -a switch (//'s are optional)
Using those:
perl -lawnF'\t' -e '$,="\t"; $F[0] =~ y/0123/ACGT/; print @F'
It should be possible to do it with sed. Put this in a file (you can do it on the command line too, with -e; just don't forget the semicolons, or use a separate -e for each line). (EDIT: Keep in mind that since your data is tab delimited, the character in the first s// should in fact be a tab, not a space; make sure your editor doesn't turn it into spaces.)
#!/usr/bin/sed -f
h
s/ .*$//
y/0123/ACGT/
G
s/\n[0-3]*//
and use
./mycode somefile | sed -f sedfile
or chmod 755 sedfile and do
./mycode somefile | sedfile
The steps performed are:
copy buffer to hold space (replacing held content from previous line, if any)
remove trailing stuff (from first space to end of line)
transliterate
append contents from hold space
remove the newline (from the append step) and all digits following it (up to the space)
Worked for me on your data at least.
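To try the steps on tab-delimited input without fighting the editor, the tab can be generated with printf and spliced into the first substitution. This is a sketch under that assumption; the function name translit_first_column is made up.

```shell
# Hypothetical helper wrapping the sed steps above, with a literal tab in the first s//.
translit_first_column() {
    tab=$(printf '\t')
    # h: save line; s: keep column 1; y: transliterate; G: append saved line;
    # final s: drop the newline and the original (untranslated) column 1.
    sed -e h -e "s/${tab}.*\$//" -e 'y/0123/ACGT/' -e G -e 's/\n[0-3]*//'
}

# Made-up short row in the question's format.
printf '0013\t937.306\n' | translit_first_column
```

The second column is untouched because the transliteration runs while only the first column is in the pattern space.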
EDIT:
Ah, you wanted a one-liner...
GNU sed
sed -e "h;s/ .*$//;y/0123/ACGT/;G;s/\n[0-3]*//"
or old-school sed (no semicolons)
sed -e h -e "s/ .*$//" -e "y/0123/ACGT/" -e G -e "s/\n[0-3]*//"
@sarathi
AWK solution for this
awk '{gsub("0","A",$1);gsub("1","C",$1);gsub("2","G",$1);gsub("3","T",$1); print $1"\t"$2}' temp.txt