Awk: merge orphaned text to specific field in line above - sed

Given a tab-delimited text file containing item information:
41850 0.4 0.5 LG EN RP Billy Makes a Fridgewell, Norm
Friend
9338 0.4 0.5 LG EN RP Shine, The Musical! Mustard, Colonel
7255 0.5 0.5 LG EN RP Can You Play the Truman, Harriet
Jew's Harp
9314 0.5 0.5 LG EN RP Hi, Skippy Plum, Prof
Note the "orphaned" titles on two of the lines. Using Awk, how can I merge this orphan back into the title field above?
Pseudo-awk:
awk '/^[[:digit:]]/{getline; ???
if next line ~ /^[[:alpha:]]/ title=$7 + previous
END{print $0}' <FILE
Anyway the steps seem to be:
Either
Find "normal" lines,
Test if following line is "orphan"
if so, append "orphan" to field 7 [ title field ],
Print line
or
Find "orphans"
Somehow append to field 7 of previous line [ there will never be two consecutive orphans ]
The first way seems easiest to me --- but then, I'm the one in ignorance here.

This might work for you (GNU sed):
sed '$!N;/\n\([^\t]*\t\)\{7\}/!s/\(\t[^\t]*\)\n\(.*\)/ \2\1/;P;D' file

$ tac file | awk 'BEGIN{FS=OFS="\t"} NF==1{s=" "$0;next} {$7=$7 s; s=""}1' | tac
41850 0.4 0.5 LG EN RP Billy Makes a Friend Fridgewell, Norm
9338 0.4 0.5 LG EN RP Shine, The Musical! Mustard, Colonel
7255 0.5 0.5 LG EN RP Can You Play the Jew's Harp Truman, Harriet
9314 0.5 0.5 LG EN RP Hi, Skippy Plum, Prof
Here's an alternative approach without tac and using GNU awk (just replace gensub() with 2 sub() calls or match() or whatever if you don't want to use gawk):
$ cat tst.awk
BEGIN { FS="\t" }
NF==1 { s = gensub(/([^\t]+[\t]){6}[^\t]+/, "\\0 "$1, "", s); next }
{ printf "%s",s; s=$0 ORS }
END { printf "%s",s }
$ gawk -f tst.awk file
41850 0.4 0.5 LG EN RP Billy Makes a Friend Fridgewell, Norm
9338 0.4 0.5 LG EN RP Shine, The Musical! Mustard, Colonel
7255 0.5 0.5 LG EN RP Can You Play the Jew's Harp Truman, Harriet
9314 0.5 0.5 LG EN RP Hi, Skippy Plum, Prof

I realize the question is tagged awk, but this might be one of those times when it's easier with Perl:
perl -F"\t" -lane 'BEGIN { $, = "\t" }
if (/^\d{2}/) { print #saved if #saved; #saved = #F }
else { $saved[6].=" $_" };
END { print #saved }' foo.txt
Though here's an awk version of the same idea (with some improvements via Ed Morton):
awk -F"\t" '/^[0-9][0-9]/ { if (prefix) { print prefix"\t"title"\t"suffix }
prefix=$1
for ( i=2; i<=6; ++i ) prefix=prefix"\t"$i
title=$7; suffix=$8
next }
{ title = title" "$0 }
END { print prefix"\t"title"\t"suffix }' foo.txt
Both scripts give me this output, which looks like what you want:
41850 0.4 0.5 LG EN RP Billy Makes a Friend Fridgewell, Norm
9338 0.4 0.5 LG EN RP Shine, The Musical! Mustard, Colonel
7255 0.5 0.5 LG EN RP Can You Play the Jew's Harp Truman, Harriet
9314 0.5 0.5 LG EN RP Hi, Skippy Plum, Prof

Related

Option to cut values below a threshold in papaja::apa_table

I can't figure out how to selectively print values in a table above or below some value. What I'm looking for is known as "cut" in Revelle's psych package. MWE below.
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
print(derp, cut=0.5) #removes all loadings smaller than 0.5
derp <- print(derp, cut=0.5) #apa_table still doesn't print like this
Question is, how do I add that cut to an apa_table? Printing apa_table(derp) prints the entire table, including all values.
The print-method from psych does not return the formatted loadings but only the table of variance accounted for. You can, however, get the result you want by manually formatting the loadings table:
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
# Class `loadings` cannot be coerced to data.frame or matrix
class(derp$Structure)
[1] "loadings"
# Class `matrix` is supported by apa_table()
derp_loadings <- unclass(derp$Structure)
class(derp_loadings)
[1] "matrix"
# Remove values below "cut"
derp_loadings[derp_loadings < 0.5] <- NA
colnames(derp_loadings) <- paste("Factor", 1:3)
apa_table(
derp_loadings
, caption = "Factor loadings"
, added_stub_head = "Item"
, format = "pandoc" # Omit this in your R Markdown document
, format.args = list(na_string = "") # Don't print NA
)
*Factor loadings*
Item Factor 1 Factor 2 Factor 3
---------- --------- --------- ---------
reason.4 0.60
reason.16
reason.17 0.65
reason.19
letter.7 0.61
letter.33 0.56
letter.34 0.65
letter.58
matrix.45
matrix.46
matrix.47
matrix.55
rotate.3 0.70
rotate.4 0.73
rotate.6 0.63
rotate.8 0.63

Ghostscript -- convert PS to PNG, rotate, and scale

I have an application that creates very nice data plots rendered in PostScript with letter size and landscape mode. An example of the input file is at http://febo.com/uploads/blip.ps. [ Note: this image renders properly in a viewer, but the PNG conversion comes out with the image sideways. ] I need to convert these PostScript files into PNG images that are scaled down and rotated 90 degrees for web presentation.
I want to do this with ghostscript and no other external tool, because the conversion program will be used on both Windows and Linux systems and gs seems to be a common denominator. (I'm creating a perl script with a "PS2png" function that will call gs, but I don't think that's relevant to the question.)
I've searched the web and spent a couple of days trying to modify examples I've found, but nothing I have tried does the combination of (a) rotate, (b) resize, (c) maintain the aspect ratio and (d) avoid clipping.
I did find an example that injects a "scale" command into the postscript stream, and that seems to work well to scale the image to the desired size while maintaining the aspect ratio. But I can't find a way to rotate the resized image so that the, e.g., 601 x 792 point (2504 x 3300 pixel) postscript input becomes an 800 x 608 pixel png output.
I'm looking for the ghostscript/postscript fu that I can pass to the gs command line to accomplish this.
I've tried gs command lines with various combinations of -dFIXEDMEDIA, -dFitPage, -dAutoRotatePages=/None, or /All, -c "<> setpagedevice", changing -dDISPLAYWIDTHPOINTS and -dDISPLAYHEIGHTPOINTS, -g[width]x[height], -dUseCropBox with rotated coordinates, and other things I've forgotten. None of those worked, though it wouldn't surprise me if there's a magic combination of some of them that will. I just haven't been able to find it.
Here is the core code that produces the scaled but not rotated output:
## "$molps" is the input ps file read to a variable
## insert the PS "scale" command
$molps = $sf . " " . $sf . " scale\n" . $molps;
$gsopt1 = " -r300 -dGraphicsAlphaBits=4 -dTextAlphaBits=4";
$gsopt1 = $gsopt1 . " -dDEVICEWIDTHPOINTS=$device_width_points";
$gsopt1 = $gsopt1 . " -dDEVICEHEIGHTPOINTS=$device_height_points";
$gsopt1 = $gsopt1 . " -sOutputFile=" . $outfile;
$gscmd = "gs -q -sDEVICE=pnggray -dNOPAUSE -dBATCH " . $gsopt1 . " - ";
system("echo \"$molps\" \| $gscmd");
$device_width_points and $device_height_points are calculated by taking the original image size and applying the scaling factor $sf.
I'll be grateful to anyone who can show me the way to accomplish this. Thanks!
Better Answer:
You almost had it with your initial research. Just set orientation in the gs call:
... | gs ... -dAutoRotatePages=/None -c '<</Orientation 3>> setpagedevice' ...
cf. discussion of setpagedevice in the Red Book, and ghostscript docs (just before section 6.2)
Original Answer:
As well as "scale", you need "rotate" and "translate", not necessarily in that order.
Presumably these are single-page PostScript files?
If you know the bounding box of the Postscript, and the dimensions of the png, it is not too arduous to calculate the necessary transformation. It'll be about one line of code. You just need to ensure you inject it in the correct place.
Chapter 6 of the Blue Book has lots of details
A ubc.ca paper provides some illustrated examples (skip to page 4)
Simple PostScript file to play around with. You'll just need the three translate,scale,rotate commands in some order. The rest is for demonstrating what's going on.
%!
% function to define a 400x400 box outline, origin at 0,0 (bottom left)
/box { 0 0 moveto 0 400 lineto 400 400 lineto 400 0 lineto closepath } def
box clip % pretend the box is our bounding box
clippath stroke % draw initial black bounding box
(Helvetica) findfont 50 scalefont setfont % setup a font
% draw box, and show some text # 100,100
box stroke
100 100 moveto (original) show
% try out some transforms
1 0 0 setrgbcolor % red
.5 .5 scale
box stroke
100 100 moveto (+scaled) show
0 1 0 setrgbcolor % green
300 100 translate
box stroke
100 100 moveto (+translated) show
0 0 1 setrgbcolor % blue
45 rotate
box stroke
100 100 moveto (+rotated) show
showpage
It may be possible to insert the calculated transformation into the gs commandline like this:
... | gs ... -c '1 2 scale 3 4 translate 5 6 rotate' -# ...
Thanks to JHNC, I think I have it licked now, and for the benefit of posterity, here's what worked. (Please upvote JHNC, not this answer.)
## code to determine original size, scaling factor, rotation goes above
my $device_width_points;
my $device_height_points;
my $orientation;
if ($rotation) {
$orientation = 3;
$device_width_points = $ytotal_png_pt;
$device_height_points = $xtotal_png_pt;
} else {
$orientation = 0;
$device_width_points = $xtotal_png_pt;
$device_height_points = $ytotal_png_pt;
}
my $orientation_string =
" -dAutoRotatePages=/None -c '<</Orientation " .
$orientation . ">> setpagedevice'";
## $ps = .ps file read into variable
## insert the PS "scale" command
$ps = $sf . " " . $sf . " scale\n" . $ps;
$gsopt1 = " -r300 -dGraphicsAlphaBits=4 -dTextAlphaBits=4";
$gsopt1 = $gsopt1 . " -dDEVICEWIDTHPOINTS=$device_width_points";
$gsopt1 = $gsopt1 . " -dDEVICEHEIGHTPOINTS=$device_height_points";
$gsopt1 = $gsopt1 . " -sOutputFile=" . $outfile;
$gsopt1 = $gsopt1 . $orientation_string;
$gscmd = "gs -q -sDEVICE=pnggray -dNOPAUSE -dBATCH " . $gsopt1 . " - ";
system("echo \"$ps\" \| $gscmd");
One of the problems I had was that some options apparently don't play well together -- for example, I tried using the -g option to set the output size in pixels, but in that case the rotation didn't work. Using the DEVICE...POINTS commands instead did work.

How to extract data set from a text file?

Am quite new in the Unix field and I am currently trying to extract data set from a text file. I tried with sed, grep, awk but it seems to only work with extracting lines, but I want to extract an entire dataset... Here is an example of file from which I'd like to extract the 2 data sets (figures after the lines "R.Time Intensity")
[Header]
Application Name LabSolutions
Version 5.87
Data File Name C:\LabSolutions\Data\Antoine\170921_AC_FluoSpectra\069_WT3a derivatized lignin LiCl 430_GPC_FOREVER_430_049.lcd
Output Date 2017-10-12
Output Time 12:07:32
[Configuration]
Instrument Name BOTAN127-Instrument1
Instrument # 1
Line # 1
# of Detectors 3
Detector ID Detector A Detector B PDA
Detector Name Detector A Detector B PDA
# of Channels 1 1 2
[LC Chromatogram(Detector A-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
Ex. Wavelength(nm) 405
Em. Wavelength(nm) 430
R.Time (min) Intensity
0,00000 -709779
0,00833 -709779
0,01667 17
0,02500 3
0,03333 7
0,04167 19
0,05000 9
0,05833 5
0,06667 2
0,07500 24
0,08333 48
[LC Chromatogram(Detector B-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
R.Time (min) Intensity
0,00000 149
0,00833 149
0,01667 -1
I would greatly appreciate any idea. Thanks in advance.
Antoine
awk '/^[^0-9]/&&d{d=0} /R.Time/{d=1}d' file
Brief explanation,
Set d as a flag to determine print line or not
/^[^0-9]/&&d{d=0}: if regex ^[^0-9] matched && d==1, disabled d
/R.Time/{d=1}: if string "R.Time" searched, enabled d
awk '/R.Time/,/LC/' file|grep -v -E "R.Time|LC"
grep part will remove the R.Time and LC lines that come as a part of the output from awk
I think it's a job for sed.
sed '/R.Time/!d;:A;N;/\n$/!bA' infile

How to delete all the lines which have blank fields with sed or awk?

I would like to delete all the lines which have blank fields from text files. How can I change the following code to make the changes directly to the files?
awk '!/^\t|\t\t|\t$/' *.txt
AD 125.9 MN 124.9
AC 38.9 VG 13.2
AV 34.6 BG 33.0
GL 126.2
CY 34.9
CY 44.9
desired output
AD 125.9 MN 124.9
AC 38.9 VG 13.2
AV 34.6 BG 33.0
alternative awk one-liner:
awk 'NF==4' yourFile
One way using GNU sed:
sed -n -s -i '/[^ \t]* [^ \t]* [^ \t]* [^ \t]/p' *.txt
Results:
AD 125.9 MN 124.9
AC 38.9 VG 13.2
AV 34.6 BG 33.0
You could use a for loop with your original awk command:
for f in *.txt
do
cp $f /tmp/mytempfile
awk '!/^\t|\t\t|\t$/' /tmp/mytempfile > $f
done
Using your input this might work (but it depends if there are spaces or tabs):
sed '/^.\{13\}\s/d' INPUTFILE
Or this:
awk 'substr($0,14,1) = " " { print }' INPUTFILE
assuming your file has a max of 4 fields:
awk '{for(i=1;i<5;i++){if($i=="")next;}print}' file_name
tested below:
> cat temp
AD 125.9 MN 124.9
AC 38.9 VG 13.2
AV 34.6 BG 33.0
GL 126.2
CY 34.9
CY 44.9
>awk '{for(i=1;i<5;i++){if($i=="")next;}print}' temp
AD 125.9 MN 124.9
AC 38.9 VG 13.2
AV 34.6 BG 33.0
>

Insert space between pairs of characters - sed

Another sed question! I have nucleotide data in pairs
1 Affx-14150122 0 75891 00 CT TT CT TT CT
split by spaces and I need to put a space into every pair, eg
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
I've tried sed 's/[A-Z][A-Z]/ &/g' and sed 's/[A-Z][A-Z]/& /g'
And both A-Z replaced with .. and it never splits the pair as I'd like it to (it puts spaces before or after or splits every other pair or similar!).
I assume that this will work for you, however it's not perfect!
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g' matches whitespace (\s) upper case character ([A-Z]), puts that in a group (\(...\)), and then matches upper case character and stores that in second group. Then this match is substituted by first group (\1) space second group (\2).
NOTE:
This fails when you have sequences that are longer than 2 characters.
An solution using awk which modifies only pairs of characters and might be more robust depending on your input data:
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
awk '
{
for(i=1;i<=NF;i++) {
if($i ~ /^[A-Z][A-Z]$/){
$i=substr($i,1,1)" "substr($i,2,1)
}
}
}
1'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T1
This might work for you (GNU sed):
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' |
sed ':a;s/\(\s\S\)\(\S\(\s\|$\)\)/\1 \2/g;ta'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This second method works but might provide false positives:
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' | sed 's/\<\(.\)\(.\)\>/\1 \2/g'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This is actually easier in python than in awk:
echo caca | python -c 'import sys;\
for line in sys.stdin: print (" ".join(line))'
c a c a