TXR: removing trailing and leading commas in data respecting header line
I have a lot of data like the following:
There are many ways data could be missing.,,,,,,,,,
,,,,,,,,,,,
An entire interior column could be missing.,,,,,,,,,
[missing/data/inside],,,,,,,,,
a,b,c,,,,,,,
1,,3,,,,,,,
1,,4,,,,,,,
3,,2,,,,,,,
,,,,,,,,,
An indented data with 2 completely missing columns.,,,,,,,,,
,,,,,,,[missing/data/outside],,
,,,,,,,a,b,c
,,,,,,,,3,
,,,,,,,,4,,,,,,,,
,,,,,,,,2,,,,,,,,
I want to tidy it up a bit into:
There are many ways data could be missing.
An entire interior column could be missing.
[missing/data/inside]
a,b,c
1,,3
1,,4
3,,2
An indented data with 2 completely missing columns.
[missing/data/outside]
a,b,c
,3,
,4,
,2,
The challenges are:
keeping all non-table text annotations (cleaning up any leading or trailing commas)
keeping the appropriate number of commas in data tables based on their header
If I didn't have the second challenge, I would just pipe my output through sed:
... | output | sed 's/,*$//g' | sed 's/^,*//g'
I trust that the number of commas to the left of the data will be equal in the header and data lines. However, I can't trust the same for the trailing commas.
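To make the second challenge concrete, here is a minimal Python sketch (not TXR, and not a solution in the asker's tool) contrasting the sed-style strip with a header-aware clip. The function names naive_strip and clip_to_header are hypothetical, and the sketch assumes the first and last headings are non-empty, so that stripping the header's padding commas leaves exactly the table's columns.

```python
def naive_strip(line):
    # Same effect as: sed 's/,*$//' | sed 's/^,*//'
    return line.strip(",")


def clip_to_header(header_line, data_line):
    # Assumption from the question: leading commas match in header and data lines.
    lead = len(header_line) - len(header_line.lstrip(","))
    # Column count is taken from the header once its padding commas are stripped.
    width = len(header_line.strip(",").split(","))
    fields = data_line.split(",")
    return ",".join(fields[lead:lead + width])


header = ",,,,,,,a,b,c"
row = ",,,,,,,,4,,,,,,,,"
print(naive_strip(row))             # 4    (the blank cells are lost)
print(clip_to_header(header, row))  # ,4,  (blank cells are kept)
```

The naive strip cannot tell padding commas from empty cells, which is why the header has to supply both the indent and the column count.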
I've written the following TXR code:
@(define empty_line)@\
@ (cases)@\
@/,*/@(eol)@\
@ (or)@\
@/[ ]*/@(eol)@\
@ (or)@\
@(eol)@\
@ (end)@\
@(end)
@(define heading)@/[a-z]+(:[^,]*)?/@(end)
@(define header)@\
@ (cases)@\
@ (heading),@(header)@\
@ (or)@\
@ (heading)@\
@ (end)@\
@(end)
@(define content (hdr))@/.*/@(end)
@(define table (loc head data))
@/,*/[@loc]@(skip)
@{lead /,*/}@{head (header)}@(skip)
@ (collect)
@lead@{data (content head)}@(skip)
@ (until)
@(empty_line)
@ (end)
@(end)
@(collect)
@annotation
@(empty_line)
@(table loc head data)
@(end)
@(output)
@ (repeat)
@annotation
[@loc]
@head
@ (repeat)
@data
@ (end)
@ (end)
@(end)
How might I write the content function to extract out the appropriate number of columns from the input data? I thought maybe it might be as easy as using the coll or rep directives like:
@(define content (hdr))@\
@ (coll :gap 0 :times (length (split-str hdr ",")))@{x /[^,]/}@(end)@\
@(end)
This code doesn't reliably capture or clean up annotations, since an annotation can appear anywhere that is not a table. How can I extract them and clean them up? I tried a few ways using @(maybe) and another nested @(collect), with no luck.
@ (maybe)
@ (collect)
@/,*/@annotation@/,*/
@ (until)
@(empty_line)
@/,*/[@loc]@(skip)
@ (end)
@ (end)
Update:
I tried to solve just the table data collection part independently, for which I wrote the following code:
@(define heading)@/[^,]+/@(end)
@(define header)@\
@ (cases)@\
@ (heading),@(header)@\
@ (or)@\
@ (heading)@\
@ (end)@\
@(end)
@(define content (hdr))@\
@ (coll :gap 1 :mintimes 1 :maxtimes (length (split-str hdr ",")))@\
@/[^,]*/@\
@ (end)@\
@(end)
@{lead /,*/}@{head (header)}@(skip)
@(collect :gap 0 :vars (data))
@lead@{data (content head)}@/,*/
@(end)
@(output)
@head
@ (repeat)
@data
@ (end)
@(end)
Here is my sample data:
,,alpha,foxtrot: m,tango: b,,
,,1,a,3,,
,,1,b,,,
,,whisky,c,foxtrot,,
,,,d,,,
,,1,,,,
,,,c,,,,,,
The code gives the correct result in all cases except for the penultimate line. It seems to me the trick to solving this problem is to write a regular expression for coll that correctly extracts blank data. Is there another approach that would make this possible? For example, appending the necessary remaining commas?
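The "append the remaining commas" idea can be sketched outside TXR as follows. This is a Python illustration under stated assumptions, not a fix for the coll expression: normalize_row is a hypothetical helper, the indent and width are derived from the sample header above, and the final row ",,1" is an invented short row added only to show the padding.

```python
def normalize_row(row, lead, width):
    fields = row.split(",")
    # Pad with empty fields (the "remaining commas") when the row is too short.
    fields += [""] * (lead + width - len(fields))
    # Keep only the columns that sit under the header.
    return ",".join(fields[lead:lead + width])


header = ",,alpha,foxtrot: m,tango: b,,"
lead = len(header) - len(header.lstrip(","))   # indent shared with the data rows
width = len(header.strip(",").split(","))      # number of headings

rows = [",,1,a,3,,", ",,1,b,,,", ",,whisky,c,foxtrot,,",
        ",,,d,,,", ",,1,,,,", ",,,c,,,,,,",
        ",,1"]  # last row is hypothetical: shorter than the header
for row in rows:
    print(normalize_row(row, lead, width))
```

Because every row is padded to at least lead + width fields before slicing, rows with too few and too many trailing commas are handled by the same two lines.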
Just for reference, here is something I hacked up using a somewhat different approach. Input is split early into fields, and things proceed from there.
It works on the sample data but doesn't capture it in the right way (following the syntax of annotation lines, empty line, table). Also, it isn't checking whether the data lines in the table have only blank fields before the indented position.
There may be something of use in this anyway.
@(define get-fields (f line))
@ (bind f @(split-str line ","))
@(end)
@(define is-empty (f line))
@ (require (or [all f empty]
               [all line (op eql #\space)]))
@(end)
@(define is-table-start (f loc pos))
@ (next :list f)
@ (skip)
@ (line pos)
[@loc]
@ (rebind pos @(pred pos))
@ (require (and [all [f 0..pos] empty]
                [all [f (succ pos)..:] empty]))
@(end)
@(define is-headings (f pos))
@ (require (and [all [f 0..pos] empty]
                (empty [drop-while empty
                        (drop-while (f^$ #/[a-z]+(:[^,]*)?/)
                                    [f pos..:])])))
@(end)
@(define out-fields (f))
@ (do (put-line `@{f ","}`))
@(end)
@(repeat)
@line
@ (get-fields f line)
@ (cases)
@ (is-empty f line)
@ (do (put-line))
@ (or)
@ (is-table-start f loc pos)
@ hline
@ (get-fields hf hline)
@ (is-headings hf pos)
@ (collect :gap 0)
@ dline
@ (get-fields df dline)
@ (until)
@ (is-empty df dline)
@ (end)
@ (do (put-line `[@loc]`))
@ (bind headings @(take-while [notf empty] (drop pos hf)))
@ (bind endpos @(+ pos (length headings)))
@ (merge tbl hf df)
@ (output)
@ (repeat)
@ {tbl [pos..endpos] ","}
@ (end)
@ (end)
@ (or)
@ (bind trim-f @[take-while [notf empty] [drop-while empty f]])
@ (do (put-line `@{trim-f ","}`))
@ (end)
@(end)
Below is code which seems to work:
@(define empty_line)@\
@ (cases)@\
@/,*/@(eol)@\
@ (or)@\
@/[ ]*/@(eol)@\
@ (or)@\
@(eol)@\
@ (end)@\
@(end)
@(define heading)@/[^,]+/@(end)
@(define header)@\
@ (cases)@\
@ (heading),@(header)@\
@ (or)@\
@ (heading)@\
@ (end)@\
@(end)
@(define content (hdr))@\
@/[^,]*/@\
@ (coll :gap 0 :times (- (length (split-str hdr ",")) 1))@\
,@/[^,]*/@\
@ (end)@\
@(end)
@(define table (loc head data))
@/,*/[@loc]@(skip)
@{lead /,*/}@{head (header)}@(skip)
@ (collect)
@lead@{data (content head)}@(skip)
@ (until)
@(empty_line)
@ (end)
@(end)
@(collect)
@ (collect)
@/,*/@{annotation /[A-Za-z0-9]+.*[^,]+/}@/,*/
@ (until)
@ (cases)
@(empty_line)
@/,*/[@loc]@(skip)
@ (or)
@(eof)
@ (end)
@ (end)
@(empty_line)
@(table loc head data)
@(end)
@(output)
@ (repeat)
@ (repeat)
@annotation
@ (end)
[@loc]
@head
@ (repeat)
@data
@ (end)
@ (end)
@(end)
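For comparison, the overall line-by-line logic (annotation, blank line, [location] marker, header, data rows) can be sketched as a small state machine in Python. This is only an illustration under stated assumptions, not a translation of the TXR above: tidy is a hypothetical name, a table is assumed to start with a bracketed marker line followed immediately by its header, and a line that is empty apart from commas and spaces is assumed to end the table.

```python
def tidy(lines):
    out = []
    state = "text"        # "text", "header" (next line is a header) or "table"
    lead = width = 0
    for line in lines:
        core = line.strip(",")
        if not core.strip():
            # Blank line (only commas/spaces): ends any table and is dropped.
            state = "text"
            continue
        if state == "table":
            # Data row: pad if short, then keep the columns under the header.
            fields = line.split(",")
            fields += [""] * (lead + width - len(fields))
            out.append(",".join(fields[lead:lead + width]))
            continue
        if state == "header":
            lead = len(line) - len(line.lstrip(","))
            width = len(core.split(","))
            state = "table"
        elif core.startswith("[") and core.endswith("]"):
            state = "header"
        # Annotations, markers and headers are emitted with edge commas removed.
        out.append(core)
    return out


sample = [
    "An entire interior column could be missing.,,,,,,,,,",
    "[missing/data/inside],,,,,,,,,",
    "a,b,c,,,,,,,",
    "1,,3,,,,,,,",
    ",,,,,,,,,",
    "An indented data with 2 completely missing columns.,,,,,,,,,",
    ",,,,,,,[missing/data/outside],,",
    ",,,,,,,a,b,c",
    ",,,,,,,,4,,,,,,,,",
]
print("\n".join(tidy(sample)))
```

One consequence of the blank-line rule is that a data row with every cell empty would also end the table, which mirrors the @(until) @(empty_line) condition used in the table function.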