TXR: Parsing summary reports containing unicode with a more complicated syntax using functions


I'm trying to parse a "summary" region of a bunch of computer reports, where the report names and their associated variables change from file to file. A made-up example in the same format is shown below:
Summary Report
Bath Tub
Temperature: 30 °C
Water ready
volume: 200000 cm³
Bath Room
Floor Area: 40 ft²
Door Height: 9 ± 0.1 ft
Full Report Set
It's hard to see from the above what the white space looks like, so here is a screenshot of my text editor with visible white space.
The region of interest starts with Summary Report and ends with Full Report Set. Properties can potentially span two lines. The property names are aligned such that the colon : stays at the same character position within each sub-report.
From the diagnostic output, it appears my attempt to exploit this fact is not working.
txr: (src/generic-micrometrics-report.txr:36) chr mismatch (position 11 vs. k)
txr: (src/generic-micrometrics-report.txr:36) variable k binding mismatch (13 vs. 12)
txr: (src/generic-micrometrics-report.txr:36) chr mismatch (position 12 vs. k)
txr: (src/generic-micrometrics-report.txr:36) string matched, position 13-18 (data/dummy-generic-report.txt:6)
txr: (src/generic-micrometrics-report.txr:36) Temperature: 30 °C
txr: (src/generic-micrometrics-report.txr:36) ^ ^
txr: (src/generic-micrometrics-report.txr:23) spec ran out of data
txr: (source location n/a) function (capture (nil (k . 13) (report . "Bath Tub"))) failed
I've included the code below. Can you explain why this code does not work? Am I doing what I think I'm doing with the colon_position function? If so, why is it failing? How would you write the capture function? Is this the general approach you would take? Is there a better way? Thanks so much for all your help and advice.
@; This output format always starts with or ends with at least 2 blank spaces.
@; Fully blank spaced lines follow each property value pair line.
@(define blank_spaces)
@/[ ]+/@(eol)
@(end)
@; All colons align at the same column position within the body of a report.
@; If that doesn't happen, that means there is nothing to capture,
@; which shouldn't happen.
@; This function should bind the appropriate position without updating
@; the line position.
@; Reports end when there is an empty line, so don't look past that.
@(define colon_position (column))
@(trailer)
@(gather :vars (column))
@(skip)@(chr column):@(skip)
@(until)
@(end)
@(end)
@; Capture values for a property. Values are always given on a single line.
@; If there is error information, it will be indicated by a ± character.@\x00B1
@(define capture (value error units))
@(cases)@value@\ ±@\ @error@\ @units@/[ ]+/@(eol)@\
@(or)@value@\ @units@/[ ]+/@(eol)@(bind error "")@\
@(end)
@(end)
Summary Report
@(collect :vars (report property value error units))
@report
@(forget k)
@(colon_position k)
@(cases)
@property@(chr k): @(capture value error units)@(blank_spaces)
@(ord)
@; Properties can span two lines. I have not seen any that span more.
@property_head@(chr k) @(blank_spaces)
@property_tail@(chr k): @(capture value error units)@(blank_spaces)
@(merge property property_head property_tail)
@(cat property " ")
@(end)
@(blank_spaces)
@(end)
Full Report Set
@(output)
report,property,value,error,units
@(repeat)
@report,@property,@value,@error,@units
@(end)
@(end)

After making some changes here and there, I'm now getting this output:
report,property,value,error,units
Bath Tub,Temperature,30,,°C
Bath Tub,Water ready volume,200000,,cm³
Bath Room,Floor Area,40,,ft²
Bath Room,Door Height,9,0.1,ft
Code:
@; This output format always starts with or ends with at least 2 blank spaces.
@; Fully blank spaced lines follow each property value pair line.
@(define blank_spaces)@\
@/[ ]*/@(eol)@\
@(end)
@; All colons align at the same column position within the body of a report.
@; If that doesn't happen, that means there is nothing to capture,
@; which shouldn't happen.
@; This function should bind the appropriate position without updating
@; the line position.
@; Reports end when there is an empty line, so don't look past that.
@(define colon_position (column))
@ (trailer)
@ (gather :vars (column))
@ (skip)@(chr column):@(skip)
@(until)
@(end)
@(end)
@; Capture values for a property. Values are always given on a single line.
@; If there is error information, it will be indicated by a ± character.@\x00B1
@(define capture (value error units))@\
@(cases)@value@\ ±@\ @error@\ @units @(eol)@\
@(or)@value@\ @units@/[ ]+/@(eol)@(bind error "")@\
@(end)@\
@(end)
Summary Report
@(collect :vars (report property value error units))
@report
@ (colon_position k)
@ (collect)
@ (cases)
@property@(chr k): @(capture value error units)@(blank_spaces)
@ (or)
@; Properties can span two lines. I have not seen any that span more.
@property_head@(chr k) @(blank_spaces)
@property_tail@(chr k): @(capture value error units)@(blank_spaces)
@ (merge property property_head property_tail)
@ (cat property " ")
@ (end)
@ (until)
@ (end)
@(until)
Full Report Set
@(end)
@(output)
report,property,value,error,units
@ (repeat)
@ (repeat)
@report,@property,@value,@error,@units
@ (end)
@ (end)
@(end)
The trick with the colon actually works (nice application of trailer and chr there). What trips up the code is a number of small details: @(or) misspelled as @(orf), pattern functions that should be horizontal but lack the proper @\ line continuations, a flaw in @(blank_spaces) that makes it consume some spaces unconditionally, spurious whitespace before @(merge), and so on.
Also, the main problem is that the data is doubly nested, so we need a collect within a collect. We also need proper @(until) termination patterns. For the inner collect, I chose two blank lines; that seems to be what terminates the sections (it works for the data sample). The outer collect is terminated on the Full Report Set, but that is not strictly necessary.
To go with the nested collection, we use a nested repeat in the output.
I applied some indentation. Horizontal functions can use whitespace indentation because leading whitespace after line continuations is ignored.
The @(forget k) is gone; there is no k in the scope there. Each iteration of the surrounding collect will freshly bind k in an environment that is devoid of k.
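For readers who want to cross-check the extraction logic outside TXR, here is a rough Python sketch of the same idea: find the colon column once per report, then treat any line without a colon at that column as the head of a two-line property name. It is not a translation of the TXR code. The whitespace layout in SAMPLE is an assumption (the original indentation is not visible in the question), as is the simplification that reports are separated by blank lines with no blank lines inside a report. The names parse_summary, blocks and VALUE_RE are made up for this illustration.

```python
import re

# Assumed layout: colons aligned within each report, reports separated
# by blank lines. The real files may differ.
SAMPLE = """\
Summary Report

  Bath Tub
    Temperature: 30 °C
    Water ready
         volume: 200000 cm³

  Bath Room
     Floor Area: 40 ft²
    Door Height: 9 ± 0.1 ft

Full Report Set
"""

# value, optional "± error", then units
VALUE_RE = re.compile(r"^(\S+)(?:\s+±\s+(\S+))?\s+(\S+)$")


def blocks(lines):
    """Yield runs of consecutive non-blank lines."""
    block = []
    for line in lines:
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def parse_summary(text):
    lines = text.splitlines()
    body = lines[lines.index("Summary Report") + 1:lines.index("Full Report Set")]
    rows = []
    for block in blocks(body):
        report, props = block[0].strip(), block[1:]
        # Colon column of this report, taken from the first line that has a colon.
        k = next(line.index(":") for line in props if ":" in line)
        head = []
        for line in props:
            if len(line) > k and line[k] == ":":
                name = " ".join(head + [line[:k].strip()])
                m = VALUE_RE.match(line[k + 1:].strip())
                if m is None:
                    raise ValueError(f"unrecognized value syntax: {line!r}")
                rows.append((report, name, m.group(1), m.group(2) or "", m.group(3)))
                head = []
            else:
                # No colon at column k: first half of a two-line property name.
                head.append(line.strip())
    return rows


rows = parse_summary(SAMPLE)
for row in rows:
    print(",".join(row))
```

Under the assumed layout this prints the same four data rows as the CSV output shown earlier, without the header line.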
Addendum: here is a diff against the code for making it more robust against unexpected data. As it is, the inner @(collect) will silently skip over nonmatching elements, which means that if the file contains elements that do not conform to the expected cases, they will be ignored. This behavior is already being taken advantage of: it is why the blank lines between the data items are ignored. We can tighten that with a :gap 0 (collected regions must be consecutive) and by handling the blank lines as a case. A fallback case can then diagnose an input line as unrecognized:
diff --git a/extract.txr b/extract.txr
index 8c93d89..3d1fac6 100644
--- a/extract.txr
+++ b/extract.txr
@@ -24,6 +24,7 @@
@(or)@value@\ @units@/[ ]+/@(eol)@(bind error "")@\
@(end)@\
@(end)
+@(name file)
Summary Report
@(collect :vars (report property value error units))
@@ -31,7 +32,7 @@
@report
@ (colon_position k)
-@ (collect)
+@ (collect :gap 0)
@ (cases)
@property@(chr k): @(capture value error units)@(blank_spaces)
@ (or)
@@ -40,6 +41,12 @@
@property_tail@(chr k): @(capture value error units)@(blank_spaces)
@ (merge property property_head property_tail)
@ (cat property " ")
+@ (or)
+
+@ (or)
+@ (line ln)
+@ badline
+@ (throw error `@file:@ln unrecognized syntax: @badline`)
@ (end)
@ (until)

Related

multi-character separator in `set datafile separator "|||"` doesn't work

I have an input file example.data with a triple-pipe as separator, dates in the first column, and also some more or less unpredictable text in the last column:
2019-02-01|||123|||345|||567|||Some unpredictable textual data with pipes|,
2019-02-02|||234|||345|||456|||weird symbols # and commas, and so on.
2019-02-03|||345|||234|||123|||text text text
When I try to run the following gnuplot5 script
set terminal png size 400,300
set output 'myplot.png'
set datafile separator "|||"
set xdata time
set timefmt "%Y-%m-%d"
set format x "%y-%m-%d"
plot "example.data" using 1:2 with linespoints
I get the following error:
line 8: warning: Skipping data file with no valid points
plot "example.data" using 1:2 with linespoints
^
"time.gnuplot", line 8: x range is invalid
Even stranger, if I change the last line to
plot "example.data" using 1:4 with linespoints
then it works. It also works for 1:7 and 1:10, but not for other numbers. Why?

When using the
set datafile separator "chars"
syntax, the string is not treated as one long separator. Instead, every character listed between the quotes becomes a separator on its own. From [Janert, 2016]:
If you provide an explicit string, then each character in the string will be
treated as a separator character.
Therefore,
set datafile separator "|||"
is actually equivalent to
set datafile separator "|"
and a line
2019-02-05|||123|||456|||789
is treated as if it had ten columns, of which only the columns 1,4,7,10 are non-empty.
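The column arithmetic is easy to verify outside gnuplot. The Python snippet below is only an illustration of the splitting rule, not gnuplot's actual implementation: it splits the example line on the single character | and lists which 1-based columns are non-empty.

```python
line = "2019-02-05|||123|||456|||789"

# Every "|" is a separator on its own, so "|||" produces two empty fields.
fields = line.split("|")
non_empty = [i for i, field in enumerate(fields, start=1) if field]

print(len(fields))   # 10
print(non_empty)     # [1, 4, 7, 10]
```

This matches the observation in the question that `using 1:4`, `1:7` and `1:10` work while other column numbers do not.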
Workaround
Find some other character that is unlikely to appear in the dataset (in the following, I'll assume \t as an example). If you can't dump the dataset with a different separator, use sed to replace ||| by \t:
sed 's/|||/\t/g' example.data > modified.data # in the command line
then proceed with
set datafile separator "\t"
and modified.data as input.
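If sed is not available, the same preprocessing step is a plain string replacement in most languages. Here is a minimal Python sketch; the function name retab is made up for this example, and it assumes that the literal string ||| never occurs inside the free-text column.

```python
def retab(lines, old="|||", new="\t"):
    """Replace every occurrence of the multi-character separator."""
    return [line.replace(old, new) for line in lines]


rows = [
    "2019-02-01|||123|||345|||567|||Some unpredictable textual data with pipes|,",
    "2019-02-03|||345|||234|||123|||text text text",
]
converted = retab(rows)

# Single pipes in the text column survive; only the triple pipe is replaced.
print(converted[0].split("\t"))
```

To convert a whole file, read its lines, pass them through retab, and write the result to modified.data.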

You basically gave the answer yourself.
If you can influence the separator in your data, use a separator which typically does not occur in your data or text. I always thought \t was made for that.
If you cannot influence the separator in your data, use an external tool (awk, Python, Perl, ...) to modify your data. In these languages it is probably a "one-liner". gnuplot has no direct replace function.
If you don't want to install external tools and want to ensure platform independence, there is still a way to do it with gnuplot. Not just a "one-liner", but there is almost nothing you can't also do with gnuplot ;-).
Edit: simplified version with the input from @Ethan (https://stackoverflow.com/a/54541790/7295599).
Assuming you have your data in a dataset named $Data, the following code replaces ||| with \t and puts the result into $DataOutput.
### Replace string in dataset
reset session
$Data <<EOD
# data with special string separators
2019-02-01|||123|||345|||567|||Some unpredictable textual data with pipes|,
2019-02-02|||234|||345|||456|||weird symbols # and commas, and so on.
2019-02-03|||345|||234|||123|||text text text
EOD
# replace string function
# prefix RS_ to avoid variable name conflicts
replaceStr(s,s1,s2) = (RS_s='', RS_n=1, (sum[RS_i=1:strlen(s)] \
((s[RS_n:RS_n+strlen(s1)-1] eq s1 ? (RS_s=RS_s.s2, RS_n=RS_n+strlen(s1)) : \
(RS_s=RS_s.s[RS_n:RS_n], RS_n=RS_n+1)), 0)), RS_s)
set print $DataOutput
do for [RS_j=1:|$Data|] {
print replaceStr($Data[RS_j],"|||","\t")
}
set print
print $DataOutput
### end of code
Output:
# data with special string separators
2019-02-01 123 345 567 Some unpredictable textual data with pipes|,
2019-02-02 234 345 456 weird symbols # and commas, and so on.
2019-02-03 345 234 123 text text text

How to write or print multiple variable outputs with headers, line by line per tick to command center in netlogo

How does one write or print multiple variable outputs with headers, line by line per tick, to the command center in NetLogo? The idea is to print the per-tick results of more than one variable (reported by procedures) so that they appear as follows in the command center output window:
length weight height area
24.2 23.1 22.0 25.1
18.7 19.2 10.4 22.0
and so on, updating per tick in columnar form.
I eventually want to be able to use the export-output command to transport the output to a csv file at the end of the simulation run. I know there are other ways of doing this but I want it this way specifically for a reason.

You need the type and print commands. Your heading would need to be printed during initialisation and the variable values would need to be printed each tick. Assuming your procedures are named calc-length etc., the code would look something like this. Note that there is no spacing control or other formatting.
to setup
...
print "length weight height area"
...
end
to go
...
dump-to-screen
...
end
to dump-to-screen
type calc-length type " " type calc-weight type " "
type calc-height type " " print calc-area
end

TXR: How to combine all lines where the following line begins with a tab?

I am trying to parse the text output of a shell command using txr.
The text output uses a tab indented line following it to continue the current line (not literal \t characters as I show below). Note that on other variable assignment lines (that don't represent extended length values), there are leading spaces in the input.
Variable Group: 1
variable = the value of the variable
long_variable = the value of the long variable
\tspans across multiple lines
really_long_variable = this variable extends
\tacross more than two lines, but it
\tis unclear how many lines it will end up extending
\tacross ahead of time
Variable Group: 2
variable = the value of the variable in group 2
long_variable = this variable might not be that long
really_long_variable = neither might this one!
How might I capture these using the txr pattern language? I know about the @(freeform) directive and its optional numeric argument to treat the next n lines as one big line. Thus, it seems to me the right approach would be something like:
@(collect)
Variable Group: @i
variable = @value
@(freeform 2)
long_variable = @long_value
@(set long_value @(regsub #/[\t ]+/ "" long_value))
@(freeform (count-next-lines-starting-with-tab))
really_long_variable = @really_long_value
@(set really_long_value @(regsub #/[\t ]+/ "" really_long_value))
@(end)
However, it's not clear to me how I might write the count-next-lines-starting-with-tab procedure with TXR lisp. On the other hand, maybe there is another better way I could approach this problem. Could you provide any suggestions?
Thanks in advance!

Let's apply the KISS principle; we don't need to bring in @(freeform). Instead we can separately capture the main line and the continuation lines for the (potentially) multi-line variables. Then, intelligently combine them with @(merge):
@(collect)
Variable Group: @i
variable = @value
long_variable = @l_head
@ (collect :gap 0 :vars (l_cont))
	@l_cont
@ (end)
really_long_variable = @rl_head
@ (collect :gap 0 :vars (rl_cont))
	@rl_cont
@ (end)
@ (merge long_variable l_head l_cont)
@ (merge really_long_variable rl_head rl_cont)
@(end)
Note that the big indentations in the above are supposed to be literal tabs. Instead of literal tabs, we can encode tabs using @\t.
Test run on the real data with \t replaced by tabs:
$ txr -Bl new.txr data
(i "1" "2")
(value "the value of the variable" "the value of the variable in group 2")
(l_head "the value of the long variable" "this variable might not be that long")
(l_cont ("spans across multiple lines") nil)
(rl_head "this variable extends" "neither might this one!")
(rl_cont ("across more than two lines, but it" "is unclear how many lines it will end up extending"
"across ahead of time") nil)
(long_variable ("the value of the long variable" "spans across multiple lines")
("this variable might not be that long"))
(really_long_variable ("this variable extends" "across more than two lines, but it"
"is unclear how many lines it will end up extending" "across ahead of time")
("neither might this one!"))
We use a strict collect with :vars for the continuation lines, so that the variable is bound (to nil) even if nothing is collected. :gap 0 prevents these inner collects from scanning across lines that don't start with tabs: another strictness measure.
@(merge) has "special" semantics for combining lists of strings that have different nesting levels; it's perfect for assembling data from different levels of collection and is basically tailor-made for this kind of thing. This problem is very similar to extracting HTTP, Usenet or e-mail headers, which can have continuation lines.
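The head-plus-continuations structure is the same one used when unfolding mail headers. As a language-neutral illustration (this is Python, not TXR, and fold_continuations is a name invented for the example), the grouping can be sketched as follows, assuming continuation lines start with a literal tab:

```python
def fold_continuations(lines):
    """Group each line with the tab-indented lines that follow it."""
    folded = []
    for line in lines:
        if line.startswith("\t") and folded:
            # Continuation: attach to the most recent main line.
            folded[-1].append(line.strip())
        else:
            folded.append([line.strip()])
    return folded


sample = [
    " long_variable = the value of the long variable",
    "\tspans across multiple lines",
    " really_long_variable = this variable extends",
    "\tacross more than two lines, but it",
    "\tis unclear how many lines it will end up extending",
    "\tacross ahead of time",
]
groups = fold_continuations(sample)
joined = [" ".join(group) for group in groups]
print(joined[0])
```

Each inner list has the same shape as the merged long_variable and really_long_variable bindings in the test run above: the head first, then zero or more continuation pieces.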
On the topic of how to write a Lisp function to look ahead in the data, the most important aspect is how to get a handle on the data at the current position. The TXR pattern matching works by backtracking over a lazy list of strings (lines/records). We can use the @(data) directive to capture the list pointer at the given input position. Then we can just treat that as a list:
@(data here)
@(bind tab-start-lines @(length (take-while (f^ #/\t/) here)))
Now tab-start-lines holds the number of consecutive lines, starting at the current position, that begin with a tab. However, take-while has a termination condition bug, unfortunately; if the following data consists of nothing but one or more tab lines, it misbehaves. Until TXR 166 is released, this requires a little workaround: (take-while [iff stringp (f^ #/\t/)] here).
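For comparison only, the same look-ahead count can be written in Python with itertools.takewhile. This is a sketch of the idea rather than of TXR's behaviour; count_tab_lines is a hypothetical helper, and the second call covers the "nothing but tab lines" input mentioned above.

```python
from itertools import takewhile


def count_tab_lines(lines):
    """Count the leading run of lines that start with a tab."""
    return sum(1 for _ in takewhile(lambda line: line.startswith("\t"), lines))


print(count_tab_lines(["\tfoo", "\tbar", "baz", "\tlater"]))  # 2
print(count_tab_lines(["\tfoo", "\tbar"]))                    # 2
```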

Efficient string editing in Applescript

I'm writing an applescript that needs to take a string and output just the numbers in that string. I have a method that works
do shell script "sed s/[a-zA-Z\\']//g <<< " & s
where s is the input string, but this script is doing this thousands of times and it ends up taking on the order of twenty minutes to get through them all. Is there any way I could make this faster?
Expected input is a string that can contain pretty much anything (except no / or \. I've tried to have there be no whitespace characters as that breaks my method, but sometimes I get one still). Expected output is a string of numbers (or an empty string). For example the sentence "1 man paid 1,23 cents for 51 cans of beer " would have the desired output of simply "112351", and "I h8 dealing with all these numb3rs" would output "83"

If you really have to pick the numbers out of one string at a time, then this approach may be faster, even though it looks like a lot more code. (If you stay with a shell command such as tr, specifying its full path will also save a few milliseconds per call.)
set mlist to "1 man paid 1,23 cents for 51 can\\'s of beer "
script o
property l : missing value
end script
set o's l to words of mlist
repeat with i from 1 to (count o's l)
try
item i of o's l as integer
on error
set item i of o's l to missing value
end try
end repeat
set o's l to o's l's text
set tids to my text item delimiters
set my text item delimiters to space
set o's l to text items of o's l
set my text item delimiters to tids
set o's l to o's l as text
set my text item delimiters to ","
set o's l to text items of o's l
set my text item delimiters to tids
set numb3rs to o's l as text
log numb3rs
--> (*112351*)
This approach may be faster, because you save the overhead of do shell script for every line you got, it should also just return the numbers, if those numbers are well formed (correct decimal separator). I haven't tried it with e-notation, but I think that should work as well.

have you tried tr?
tr -d '[:alpha:][:punct:]' <<< string

Another sed-solution, to make sure that only numbers are displayed:
do shell script "echo " & quoted form of s & "|sed s/[^0-9]//g"

I offer this alternate code. I don't see why McUsr's code needs to be that complicated.
set s to "1 man paid 1,23 cents for 51 can\\'s of beer "
set n to ""
repeat with i from 1 to (length of s)
set c to (character i of s)
# http://stackoverflow.com/questions/20313799
if c is in "0123456789" then
set n to n & c
end if
end repeat
return n
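To check the expected outputs from the question independently of AppleScript, the character filter can be written in a few lines of Python. This is just a reference for the intended result, not a replacement for the scripts above; digits_only is a name made up here, and the check is deliberately restricted to ASCII digits.

```python
def digits_only(s):
    """Keep only the ASCII digits of s, in order."""
    return "".join(c for c in s if c in "0123456789")


print(digits_only("1 man paid 1,23 cents for 51 cans of beer "))  # 112351
print(digits_only("I h8 dealing with all these numb3rs"))         # 83
```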

AutoHotKey Source Code Line Break

Is there a way to break a line in AutoHotKey source code? My lines are getting longer than 80 characters and I would like to split them neatly. I know we can do this in some other languages, such as VBA, as in the example below:
http://www.excelforum.com/excel-programming-vba-macros/564301-how-do-i-break-vba-code-into-two-or-more-lines.html
If Day(Date) > 10 _
And Hour(Time) > 20 Then _
MsgBox "It is after the tenth " & _
"and it is evening"
Is there a source code line break in AutoHotKey? I use an older version of AutoHotKey, ver 1.0.47.06.

There is a Splitting a Long Line into a Series of Shorter Ones section in the documentation:
Long lines can be divided up into a collection of smaller ones to
improve readability and maintainability. This does not reduce the
script's execution speed because such lines are merged in memory the
moment the script launches.
Method #1: A line that starts with "and", "or", ||, &&, a comma, or a
period is automatically merged with the line directly above it (in
v1.0.46+, the same is true for all other expression operators except
++ and --). In the following example, the second line is appended to the first because it begins with a comma:
FileAppend, This is the text to append.`n ; A comment is allowed here.
, %A_ProgramFiles%\SomeApplication\LogFile.txt ; Comment.
Similarly, the following lines would get merged into a single line
because the last two start with "and" or "or":
if (Color = "Red" or Color = "Green" or Color = "Blue" ; Comment.
or Color = "Black" or Color = "Gray" or Color = "White") ; Comment.
and ProductIsAvailableInColor(Product, Color) ; Comment.
The ternary operator is also a good candidate:
ProductIsAvailable := (Color = "Red")
? false ; We don't have any red products, so don't bother calling the function.
: ProductIsAvailableInColor(Product, Color)
Although the indentation used in the examples above is optional, it might improve
clarity by indicating which lines belong to ones above them. Also, it
is not necessary to include extra spaces for lines starting with the
words "AND" and "OR"; the program does this automatically. Finally,
blank lines or comments may be added between or at the end of any of
the lines in the above examples.
Method #2: This method should be used to merge a large number of lines
or when the lines are not suitable for Method #1. Although this method
is especially useful for auto-replace hotstrings, it can also be used
with any command or expression. For example:
; EXAMPLE #1:
Var =
(
Line 1 of the text.
Line 2 of the text. By default, a line feed (`n) is present between lines.
)
; EXAMPLE #2:
FileAppend, ; The comma is required in this case.
(
A line of text.
By default, the hard carriage return (Enter) between the previous line and this one will be written to the file as a linefeed (`n).
By default, the tab to the left of this line will also be written to the file (the same is true for spaces).
By default, variable references such as %Var% are resolved to the variable's contents.
), C:\My File.txt
In the examples above, a series of lines is bounded at
the top and bottom by a pair of parentheses. This is known as a
continuation section. Notice that the bottom line contains
FileAppend's last parameter after the closing parenthesis. This
practice is optional; it is done in cases like this so that the comma
will be seen as a parameter-delimiter rather than a literal comma.
Please read the documentation link for more details.
So your example can be rewritten as the following:
If Day(Date) > 10
And Hour(Time) > 20 Then
MsgBox
(
It is after the tenth
and it is evening
)

I'm not aware of a general way of doing this, but it seems you can break a line and start the remainder of the broken line (e.g. the next real line) with an operator. As long as the second line (and the third, fourth, etc., as applicable) starts with (optional whitespace plus) an operator, AHK will treat the whole thing as one line.
For instance:
hello := "Hello, "
. "world!"
MsgBox %hello%
The presence of the concatenation operator . at the logical beginning of the second line here makes AHK treat both lines as one.
(I also tried leaving the operator at the end of the first line and starting the second off with a double-quoted string; that didn't work.)
