How do I remove whitespace but maintain column structure?

I am trying to take several rows of data that look like this:
green open foundational-bus-layer-comm-ticket-details-f MqWrI9I6Q7enZnLjH9xZHw 32 1 4488163 0 14.7gb 7.4gb
green open foundational-cm-add-salesforce-customer-number-c GA3dXwz3Rn2_EmZGV1oEfg 32 1 219696 0 1gb 520.3mb
close foundational-sls-dtl-bpcs-otc-stg g2xS6fDRR0OW_W_24UjuYQ
green open foundational-cm-dw-customer-dim-hist-filtered koNU-arFQHSFOEkmj_xc9w 32 1 141210 0 887.1mb 450mb
green open datasync-dm-customer-vw-coalesce-a rvEuYU4NQ0SS69qB3UGLCA 32 1 2656210 0 11.6gb 5.8gb
And use this sed command to remove extra whitespace: sed 's/\s+/ /g'
The issue is that in doing this I get the following:
green open foundational-bus-layer-comm-ticket-details-f MqWrI9I6Q7enZnLjH9xZHw 32 1 4488163 0 14.7gb 7.4gb
green open foundational-bus-layer-comm-instrument-customer-f WF0wR4O3RxOZ2bzwm_yGRw 32 1 842214 0 1.5gb 808mb
close foundational-sls-dtl-bpcs-otc-stg g2xS6fDRR0OW_W_24UjuYQ
green open foundational-cm-add-salesforce-customer-number-c GA3dXwz3Rn2_EmZGV1oEfg 32 1 219696 0 1gb 520.3mb
green open foundational-cm-dw-customer-dim-hist-filtered koNU-arFQHSFOEkmj_xc9w 32 1 141210 0 887.1mb 450mb
What I would like is something that looks like this:
green open foundational-bus-layer-comm-ticket-details-f MqWrI9I6Q7enZnLjH9xZHw 32 1 4488163 0 14.7gb 7.4gb
green open foundational-bus-layer-comm-instrument-customer-f WF0wR4O3RxOZ2bzwm_yGRw 32 1 842214 0 1.5gb 808mb
      close foundational-sls-dtl-bpcs-otc-stg g2xS6fDRR0OW_W_24UjuYQ
green open foundational-cm-add-salesforce-customer-number-c GA3dXwz3Rn2_EmZGV1oEfg 32 1 219696 0 1gb 520.3mb
green open foundational-cm-dw-customer-dim-hist-filtered koNU-arFQHSFOEkmj_xc9w 32 1 141210 0 887.1mb 450mb
So I would like to maintain the column structure while also removing the extra whitespace.
Any ideas??
**********EDIT************
I tried the suggestion below, and got the following:
green open foundational-bus-layer-comm-contract-line-item-f 3987969 6.2gb
green open foundational-idea-dlvry-lot-vldtd 0 4.2kb
green open .trek-new 0 1.2kb
green open add-pabbto-idaqowner-idaq-customerinformation-v9c2 948 3.4mb
close add-pabbto-idaowner-results-cc-v26
green open sym-tib-add-openorder-detail 261763 399.7mb
green open idn 10417 8.2mb
green open sym-adc-outboundinvoice-c 43012 46mb
So.. close? But the "close" still needs to move over...

You could try this with GNU sed:
sed 's/  */ /3g;s/  */'$'\1''/2' infile | column -s $'\1' -t
Explanation:
s/  */ /3g
The pattern is a space followed by " *" (so there are two spaces before the asterisk), which matches a run of one or more spaces. This command replaces every such run with a single space, from the third occurrence to the end of the line.
The start of the line is left unchanged.
So the first line
green open foundational-bus-layer-comm-ticket-details-f MqWrI9I6Q7enZnLjH9xZHw 32 1 4488163 0 14.7gb 7.4gb
becomes
green open foundational-bus-layer-comm-ticket-details-f MqWrI9I6Q7enZnLjH9xZHw 32 1 4488163 0 14.7gb 7.4gb
The changes start after -f.
The problematic line
close foundational-sls-dtl-bpcs-otc-stg g2xS6fDRR0OW_W_24UjuYQ
becomes
close foundational-sls-dtl-bpcs-otc-stg g2xS6fDRR0OW_W_24UjuYQ
The changes start after -stg, because the leading whitespace on this line counts as the first occurrence.
s/  */'$'\1''/2
Replaces the second run of one or more spaces with the character 0x01 (written $'\1' in bash).
So the first line becomes
green openHex01foundational-bus-layer-comm-ticket-details-f MqWrI9I6Q7enZnLjH9xZHw 32 1 4488163 0 14.7gb 7.4gb
The problematic line becomes
closeHex01foundational-sls-dtl-bpcs-otc-stg g2xS6fDRR0OW_W_24UjuYQ
column -s $'\1' -t
Formats the output as a two-column table, using the character 0x01 as the separator.
If the file contains tabs rather than only spaces, you can use
sed 's/[[:blank:]][[:blank:]]*/ /3g;s/[[:blank:]][[:blank:]]*/'$'\1''/2' infile | column -s $'\1' -t
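For readers without GNU sed or column, the same two steps can be sketched in Python. This is only an illustration of the idea, not the answer's own code: the function name squeeze_keep_prefix is made up here, and the sketch assumes the separators are plain spaces and that two spaces between the two output columns is acceptable.

```python
import re

def squeeze_keep_prefix(lines):
    """Keep each line untouched up to its second run of spaces,
    squeeze the remainder to single spaces, then align the two parts."""
    rows = []
    for line in lines:
        # With a capturing group, re.split keeps the runs of spaces,
        # so odd indices hold separators and even indices hold text.
        parts = re.split(r"( +)", line.rstrip("\n"))
        if len(parts) >= 5:
            prefix = "".join(parts[:3])  # text, first run, text
            rest = re.sub(r" +", " ", "".join(parts[4:]))
        else:
            prefix, rest = "".join(parts), ""
        rows.append((prefix, rest))
    width = max((len(p) for p, _ in rows), default=0)
    # Pad the prefix column to a common width, like column -t does.
    return [f"{p:<{width}}  {r}".rstrip() if r else p for p, r in rows]

if __name__ == "__main__":
    import sys
    for row in squeeze_keep_prefix(sys.stdin.read().splitlines()):
        print(row)
```

Run as a filter (for example by piping the listing into the script), it leaves the indentation of a "close" line alone, because the leading spaces count as the first run, just as in the sed version.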

Related

Why does sed (insert line) output spaces between each character?

I have split a larger data file into individual 2-column files for each field. This results in something like this:
0.00 3.02211e+07
1.00 3.02211e+07
2.00 3.02211e+07
3.00 3.02211e+07
4.00 3.02211e+07
5.00 3.01295e+07
6.00 3.00608e+07
7.00 2.99768e+07
When I try to add a row via sed,
sed -i '1i pressure-prof' myfile.txt
the output has a space character between each character (including existing spaces). If I look in Notepad++, the extra spaces appear as the ASCII "NULL". In the terminal it looks like this:
pressure-prof
0 . 0 0 3 . 0 2 2 1 1 e + 0 7
1 . 0 0 3 . 0 2 2 1 1 e + 0 7
2 . 0 0 3 . 0 2 2 1 1 e + 0 7
3 . 0 0 3 . 0 2 2 1 1 e + 0 7
4 . 0 0 3 . 0 2 2 1 1 e + 0 7
5 . 0 0 3 . 0 1 2 9 5 e + 0 7
6 . 0 0 3 . 0 0 6 0 8 e + 0 7
7 . 0 0 2 . 9 9 7 6 8 e + 0 7
This is on Windows, and I think sed is being provided by cygwin or msys2. I don't know if that has anything to do with the output format issues.
Yes, I can resort to opening up files in a text editor and just adding that way. I would like to be able to utilize sed in the future though.
Thanks for any thoughts and assistance.

tr -d ' ' < myfile.txt | sed '2,$s/./& /4' > mf2 && mv mf2 myfile.txt
Run that after you've finished adding your rows. tr first deletes every space, and then sed, on every line after the header, re-inserts a space after the fourth character. This assumes the first column is always exactly four characters wide, as in 0.00.
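The NUL characters that Notepad++ shows between the visible characters are consistent with a file stored as UTF-16, although the question does not confirm the encoding. Under that assumption, a minimal Python sketch can decode the data, add the header line, and write plain UTF-8 that sed handles normally. The function name prepend_header_utf16 and the little-endian fallback are assumptions made for this example.

```python
import codecs

def prepend_header_utf16(data: bytes, header: str) -> bytes:
    """Decode UTF-16 bytes, put `header` on a new first line,
    and return the result encoded as UTF-8."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The byte order mark selects the byte order and is dropped.
        text = data.decode("utf-16")
    else:
        # Assumption: no BOM means little-endian, the usual Windows layout.
        text = data.decode("utf-16-le")
    return (header + "\n" + text).encode("utf-8")
```

A typical use would read myfile.txt in binary mode, pass the bytes through this function with "pressure-prof" as the header, and write the returned bytes back out in binary mode.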

Extraction of rows which have a value > 50

How can I select the lines that contain a value > 50 from a large matrix of 21 columns and 150 rows? For example:
miRNameIDs degradome AGO LKM......till 21
osa-miR159a 0 42 42
osa-miR396e 0 7 9
vun-miR156a 121 77 4
ppt-miR156a 12 7 4
gma-miR6300 118 2 0
bna-miR156a 0 114 48
gma-miR156k 0 46 1
osa-miR1882e 0 7 0
.
.
.
Desired output is:-
miRNameIDs degradome AGO LKM......till 21
vun-miR156a 121 77 4
gma-miR6300 118 2 0
bna-miR156a 0 114 48
.
.
.
till 150 rows

Using a Perl one-liner:
perl -ane 'print if $. == 1 || grep {$_ > 50} @F[1..$#F]' file.txt
Explanation:
Switches:
-a: Splits each line on whitespace and loads the fields into the array @F.
-n: Creates a while(<>){...} loop over each line of the input file.
-e: Tells perl to execute the code given on the command line.
Code:
$. == 1: Checks whether the current line is line number 1 (the header).
grep {$_ > 50} @F[1..$#F]: Looks at every field after the first to see whether it is greater than 50.
||: Logical OR operator. If either of the conditions above is true, the line is printed.
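The same filter can be sketched in Python for comparison. The function name rows_with_value_above is invented for this example, and unlike Perl, which treats non-numeric fields as 0, this sketch assumes every field after the first on a data row is numeric and raises ValueError otherwise.

```python
def rows_with_value_above(lines, threshold=50):
    """Keep the header line plus every row in which at least one
    field after the first is greater than `threshold`."""
    kept = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        # The header is kept unconditionally, so its text is never converted.
        if number == 1 or any(float(value) > threshold for value in fields[1:]):
            kept.append(line)
    return kept

if __name__ == "__main__":
    with open("file.txt") as handle:  # same input name as the one-liner
        for row in rows_with_value_above(handle.read().splitlines()):
            print(row)
```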

Ghostscript postscript pswrite is encoding text

Why is Ghostscript pswrite encoding my text in its output? Consider the following MWE:
%!PS-Adobe-3.0
%%Title: mwe.ps
%%Pages: 001
%%BoundingBox: 0 0 595 842
%%EndComments
%%Page: 1 1
%%PageBoundingBox: 0 0 595 842
0 0 1 setrgbcolor
0 0 595 842 rectfill
1 0 0 setrgbcolor
247 371 100 100 rectfill
/Times-Roman findfont
72 scalefont
setfont
newpath
247 300 moveto
(Chris) show
showpage
Saving this MWE to file and viewing in GSview will display a blue page with red square and my name underneath. Now run this file through Ghostscript 9.06 with the following command line:
"c:\Program Files\gs\gs9.06\bin\gswin64c.exe" ^
-dSAFER -dBATCH -dNOPAUSE ^
-sDEVICE=pswrite -sPAPERSIZE=a4 -r72 -sOutputFile=mwe_gs.ps mwe.ps
See Ghostscript output below. Can someone please explain what is happening here. Whilst the two rectfill commands are still apparent, my text (Chris) has been encoded and is no longer distinguishable.
Is there an alternative postscript device which would retain my text please?
<snip>
%%Page: 1 1
%%PageBoundingBox: 0 0 595 842
%%BeginPageSetup
GS_pswrite_2_0_1001 begin
595 842 /a4 setpagesize
/pagesave save store 197 dict begin
1 1 scale
%%EndPageSetup
gsave mark
255 0 r6
0 0 595 842 rf
255 0 r3
247 371 100 100 rf
Q q
0 0 595 0 0 842 ^ Y
255 0 r3
249 299 43 50 /5D
$C
,6CW56m1G"ZORNkWR*rB:!c2;9rlWTH="2^^[(q"h>cG<omZ2l^=qC[XbO:8_[?kji-8^"N#3q*
jhL~>
,
289 300 41 49 /0P
$C
4r?0p$m<EkK3,0>s8W-!s8W-!s8W,u]<1irI=*p=<t0>_#<)>Is8K6,aTi'$~>
,
325 300 30 33 /5I
$C
49S"pc4+Rhs8W-!s8W)oqdD:saRZq[4+k%):]~>
,
349 300 24 49 /0T
$C
4q%Ms%;PqCs8W-!s8W%1_qkn/K?*sYFSGd:5Q~>
,
377 299 23 34 /5M
$C
-TQR7$&O'!K+D:XribR9;$mr4#sqUi.T#,dX=Y&Llg+F`d^HC#%$"]~>
,
cleartomark end end pagesave restore
showpage
%%PageTrailer
%%Trailer
%%Pages: 1
%%EOF
NOTE: This might seem an odd activity but I'm exploring the idea of using Ghostscript to 'clean up' postscript output from Matlab application..

The 'text' has been converted to images, not vector paths. This is a serious limitation of the pswrite device and one of the reasons it is deprecated; you should use the ps2write device instead. The only reason the pswrite device is still included at all is that epswrite uses it (which is why pswrite and epswrite output look the same). At some point there will be an eps2write device and pswrite will be binned.
ps2write output is, by default, compressed. If you want uncompressed output, use the -dCompressPages=false switch on the command line.
If all you want is the location of the text you might consider the txtwrite device. The default implementation of this creates a plain text representation of the input, but you can have it output a faked up XML instead which includes things like the origin of the text.

Here is a simple example of redefining the show operator so that it prints position information about each show call and then performs the standard show operation. Ghostscript can run multiple files in sequence, so the header file acts as a prefix to the other file and alters the standard behavior.
The redefined show could also have reported the font name and size, and the data could have been written to a disk file rather than dumped to the console. Any other operator could have been redefined as well, such as rectfill, fill, or stroke. Because the original operator is still called, you can convert a .ps to .pdf using the pdfwrite device while obtaining position information at the same time.
gswin32c.exe -dBATCH -dNOPAUSE header.ps trash.ps
gswin32c.exe -sDEVICE=pdfwrite -dCompressPages=false -sOutputFile=test.pdf header.ps trash.ps
output
currentpoint x:247.0 y:300.0 pathbbox 249.015,298.992 400.066,349.184 text:Chris currentrgbcolor:1.0,0.0,0.0( )
currentpoint x:50.0 y:90.0 pathbbox 50.8682,89.2852 181.327,139.184 text:Fred currentrgbcolor:1.0,0.0,0.0( )
currentpoint x:150.0 y:200.0 pathbbox 150.867,184.298 304.154,247.673 text:Mary currentrgbcolor:1.0,0.0,0.0( )
currentpoint x:300.0 y:350.0 pathbbox 300.867,348.993 598.79,398.681 text:Mr. Green currentrgbcolor:0.0,1.0,0.0( )
currentpoint x:100.0 y:400.0 pathbbox 100.866,399.202 358.547,449.183 text:Mr. Blue currentrgbcolor:0.0,0.0,1.0( )
Header.ps
/mydict 5 dict def
mydict begin
/show
{
(currentpoint ) print
currentpoint exch 10 string cvs ( x:) print print 10 string cvs ( y:) print print
gsave dup false charpath flattenpath
( pathbbox ) print
pathbbox
4 -1 roll 10 string cvs print (,) print
3 -1 roll 10 string cvs print ( ) print
2 -1 roll 10 string cvs print (,) print
10 string cvs print ( ) print
grestore
( text:) 10 string cvs print
dup print ( ) print
( currentrgbcolor:) print
currentrgbcolor
3 -1 roll 10 string cvs print (,) print
2 -1 roll 10 string cvs print (,) print
10 string cvs print ( ) ==
systemdict /show get exec
} def
trash.ps
%!PS-Adobe-3.0
%%Title: mwe.ps
%%Pages: 001
%%BoundingBox: 0 0 595 842
%%EndComments
%%Page: 1 1
%%PageBoundingBox: 0 0 595 842
0 0 1 setrgbcolor
0 0 595 842 rectfill
1 0 0 setrgbcolor
247 371 100 100 rectfill
/Times-Roman findfont
72 scalefont
setfont
newpath
247 300 moveto (Chris) show
50 90 moveto (Fred) show
150 200 moveto (Mary) show
0 1 0 setrgbcolor
300 350 moveto (Mr. Green) show
0 0 1 setrgbcolor
100 400 moveto (Mr. Blue) show
showpage

The text has been converted to vector paths. 249 299 43 50 /5D begins the first letter "C", then 289 300 is the "h", 325 300 the "r", and so on.
What pswrite has done is eliminate the need for a font: while your original code used /Times-Roman, the distilled code doesn't need any font, but rather draws the text using vectors.
I'm not sure exactly what you are after, but you could try "ps2write" or "epswrite" as alternatives to "pswrite". pswrite writes PostScript level 1 and ps2write writes PostScript level 2. Nobody requires level 1 anymore, so level 2 should be acceptable. epswrite writes Encapsulated PostScript (EPS).

sed replace end of line

I'm trying to add a bunch of 0s at the end of a line. The line is identified by being followed by a line that starts with "expr1".
In Vim what I do is:
s/\nexpr1/ 0 0 0 0 0 0\rexpr1/
and it works fine. I know that in Ubuntu \n is what normally terminates a line, but whenever I use it in the replacement I get a ^@ symbol, so \r works fine for me. I thought I'd use this with sed, but it hasn't really worked. Here is what I normally write:
sed "s/\nexpr1/ 0 0 0 0 0 0\rexpr1/" infile > outfile

The end-of-line marker is $. Try this:
s/$/ 0 0 0 0 0 0/
Depending on your environment, you might need to escape the $.

awk '{$0=$0" 0 0 0 0 0 "}1' file > tmp && mv tmp file
ruby -i.bak -ne '$_=$_.chomp!+" 0 0 0 0 0\n";print' file

awk '$(NF + 1) = " 0 0 0 0 0 0"' infile > outfile
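The commands above append the zeros to every line. The condition from the question, padding only a line that is immediately followed by one starting with "expr1", can be sketched in Python as follows. The function name pad_before_marker is hypothetical, and the sketch assumes the lines have already been split and carry no trailing newline.

```python
def pad_before_marker(lines, marker="expr1", suffix=" 0 0 0 0 0 0"):
    """Append `suffix` to each line whose next line starts with `marker`."""
    result = list(lines)
    for index in range(len(result) - 1):
        if result[index + 1].startswith(marker):
            result[index] += suffix
    return result

if __name__ == "__main__":
    # infile and outfile are the names used in the question.
    with open("infile") as source:
        padded = pad_before_marker(source.read().splitlines())
    with open("outfile", "w") as target:
        target.write("\n".join(padded) + "\n")
```

Looking one line ahead in a list avoids the multi-line pattern space that a sed solution would need for the same test.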

line extraction depending on range for specific columns

I would like to extract some lines from a text file; I have started to experiment with sed lately.
I have a file with the structure
88 3 3 0 0 1 101 111 4 3
89 3 3 0 0 1 3 4 112 102
90 3 3 0 0 1 102 112 113 103
91 3 3 0 0 2 103 113 114 104
What I would like to do is extract lines according to the second column. I use something like this in my bash script (argument 2 is the input file):
sed -n '/^[0-9]* [23456789]/ p' < $2 > out
However, I have entries outside the range [23456789], for instance 10. Since 10 is composed of 1 and 0, I guess both characters would have to be in the range, but there are entries with '1' in the second column that I do not want to keep. So how can I match the '10's but not the '1's?
Best,
Umut

sed -rn '/^[0-9]* ([23456789]|10)/ p' < $2 > out
You need extended regular expression support (-r) to use the | (or) operator.
Another interesting way is:
sed -rn '/^[0-9]* ([23456789]|[0-9]{2,})/ p' < $2 > out
This means either one of [23456789] or two or more digits. Note that it accepts any second column with two or more digits, not just 10.

The instant you see variable-sized columns in your data, you should start thinking about awk:
awk '$2 > 1 && $2 < 11 {print}{}'
will do the trick assuming your file format is correct.
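The numeric comparison on the second column can also be sketched in Python. The function name and the default bounds of 2 and 10 are chosen here to mirror the awk condition; they are assumptions about what the asker wants, and lines whose second field is not a plain non-negative integer are simply skipped.

```python
def rows_with_second_column_in(lines, low=2, high=10):
    """Keep lines whose second whitespace-separated field is an
    integer between `low` and `high`, inclusive."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit() and low <= int(fields[1]) <= high:
            kept.append(line)
    return kept
```

Comparing the field as a number avoids the prefix problem of the regex approach, where a pattern meant for 10 can also match values such as 100.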

sed -rn '/^[0-9]* (2|3|4|5|6|7|8|9|10)/p' < $2 > out
