ruamel parser error on reading special characters - ruamel.yaml

I am using ruamel.yaml (0.15.37) and have a data structure like:
- !Message
Name: my message
Messages:
- !Message
name: InputMsg1
- !Variable
Name: control_word
Length: 8
Type: Signed
Unit: % # ruamel parser error
If I read the YAML file I get the error:
File "_ruamel_yaml.pyx", line 904, in
_ruamel_yaml.CParser._parse_next_event (ext/_ruamel_yaml.c:12818) ruamel.yaml.scanner.ScannerError: while scanning for the next token
found character that cannot start any token
If I start with any other character then no error will be generated.
- !Message
Name: my message
Messages:
- !Message
name: InputMsg1
- !Variable
Name: control_word
Length: 8
Type: Signed
Unit: a % # no parser error
I also tried writing the percent sign as an HTML escape.

The percent sign is an indicator character and those cannot start plain scalars. So you will have to quote the percent sign:
Unit: "%"
or
Unit: '%'
(you can probably also make it a literal block scalar:
Unit: |
%
or a folded scalar, but I don't think that is any more readable).
Since & is an indicator character as well, that will throw the same error, but you seem to (mistakenly) assume you can do HTML escapes in YAML (you can't).
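If the values are generated by a script, the quoting rule above can be applied mechanically. The following is a minimal sketch in plain Python, not part of ruamel.yaml: the helper name quote_if_needed is hypothetical, and the character set is the list of indicator characters from the YAML 1.2 specification. It is deliberately conservative (for example, -, ? and : may start a plain scalar in some positions, but are quoted here anyway) and it ignores the other reasons a scalar may need quoting.

```python
# Indicator characters as listed in the YAML 1.2 specification.
YAML_INDICATORS = set("-?:,[]{}#&*!|>'\"%@`")


def quote_if_needed(value: str) -> str:
    """Single-quote value if its first character is a YAML indicator.

    Hypothetical helper: it only looks at the first character.
    """
    if value and value[0] in YAML_INDICATORS:
        # Inside a single-quoted scalar, a literal ' is written as ''.
        return "'" + value.replace("'", "''") + "'"
    return value


print("Unit: " + quote_if_needed("%"))    # Unit: '%'
print("Unit: " + quote_if_needed("a %"))  # Unit: a %
```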

Safely Evaluating Input of Multiple Types - OPA Gatekeeper/Rego

I'm trying to deploy a Constraint Template to my Kubernetes cluster to enforce that PodDisruptionBudgets contain a maxUnavailable percentage higher than a given percentage, and to deny integer values.
However, I'm unsure how to safely evaluate maxUnavailable since it can be an integer or a string. Here is the constraint template I am using:
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
name: pdbrequiredtolerance
spec:
crd:
spec:
names:
kind: PdbRequiredTolerance
validation:
# Schema for the `parameters` field
openAPIV3Schema:
properties:
minAllowed:
type: integer
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package pdbrequiredtolerance
# Check that maxUnavailable exists
violation[{"msg": msg }] {
not input.review.object.spec.maxUnavailable
msg := "You must use maxUnavailable on your PDB"
}
# Check that maxUnavailable is a string
violation[{"msg": msg}] {
not is_string(input.review.object.spec.maxUnavailable)
msg := "maxUnavailable must be a string"
}
# Check that maxUnavailable is a percentage
violation[{"msg": msg}] {
not endswith(input.review.object.spec.maxUnavailable,"%")
msg := "maxUnavailable must be a string ending with %"
}
# Check that maxUnavailable is in the acceptable range
violation[{"msg": msg}] {
percentage := split(input.review.object.spec.maxUnavailable, "%")
to_number(percentage[0]) < input.parameters.minAllowed
msg := sprintf("You must have maxUnavailable of %v percent or higher", [input.parameters.minAllowed])
}
When I enter a PDB with a value that's too low, I receive the expected error:
Error from server ([pdb-must-have-max-unavailable] You must have maxUnavailable of 30 percent or higher)
However, when I use a PDB with an integer value:
Error from server (admission.k8s.gatekeeper.sh: __modset_templates["admission.k8s.gatekeeper.sh"]["PdbRequiredTolerance"]_idx_0:14: eval_type_error: endswith: operand 1 must be string but got number)
This is because the endswith built-in requires a string operand. Is there any way around this in Gatekeeper? Both PDBs I specified are valid Kubernetes manifests. I do not wish to return this confusing error to our end users, and would rather make clear that they cannot use integers.
I believe this was solved elsewhere, but for posterity, one solution is simply to convert the value, whose type can vary, to a known type (such as string) before doing the comparison or operation.
maxUnavailable := sprintf("%v", [input.review.object.spec.maxUnavailable])
maxUnavailable can now safely be dealt with as a string regardless of the original type.
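The same normalise-first idea, sketched in Python rather than Rego purely for illustration. The function name check_max_unavailable and the message for a non-numeric percentage are assumptions; the other messages are copied from the template above. The point is that once the value has been turned into a string, the string checks can no longer fail with a type error.

```python
def check_max_unavailable(value, min_allowed):
    """Return a list of violation messages for a maxUnavailable value."""
    if value is None:
        return ["You must use maxUnavailable on your PDB"]

    # Plays the role of sprintf("%v", [...]): int or str both become str.
    text = str(value)

    if not text.endswith("%"):
        return ["maxUnavailable must be a string ending with %"]

    try:
        percent = float(text[:-1])
    except ValueError:
        # Hypothetical message, not in the original template.
        return ["maxUnavailable must be a number followed by %"]

    if percent < min_allowed:
        return [
            "You must have maxUnavailable of %d percent or higher" % min_allowed
        ]
    return []


print(check_max_unavailable(1, 30))
print(check_max_unavailable("20%", 30))
print(check_max_unavailable("50%", 30))
```

An integer such as 1 now produces the readable "must be a string ending with %" message instead of an evaluation error.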

Sed replace with conditions

I have the following text file.
2017-03-01 10:57:50,892 [Thread-977] limits.compiler : ERROR - Error in formula Undefined_CountryDom
cannot get field LOCKS_CountryDom, String, CountryDom, belongs to Header, scalar (dynamic index: 172)
limits.compiler.LimitsVariablesException: cannot get field LOCKS_CountryDom, String, CountryDom, belongs to Header, scalar (dynamic index: 172)
at limits.compiler.ExpressionHandler.evaluateBoolean(ExpressionHandler.java:170)
at limits.compiler.ExpressionHandler.getBoolean(ExpressionHandler.java:266)
2017-03-01 10:57:50,700 [Thread-231] console : ERROR - at limits.compiler.ExpressionHandler.getString(ExpressionHandler.java:700)
2017-03-01 10:57:50,892 [Thread-977] console : ERROR - at limits.compiler.compliance.ComplianceCheckFactoryImpl.compileDefaultMessageExpression(ComplianceCheckFactoryImpl.java:107)
2017-03-01 10:57:50,892 [Thread-564] console : ERROR - at limits.compiler.compliance.ComplianceCheckFactoryImpl.createOverflow(ComplianceCheckFactoryImpl.java:231)
2017-03-01 10:57:50,893 [Thread-977] console : ERROR - at limits.compiler.compliance.ComplianceCheckFactoryImpl.evaluateTickLockCombinations(ComplianceCheckFactoryImpl.java:498)
2017-03-01 10:57:50,893 [Thread-977] console : ERROR - at limits.engine.stream.TickWriterImpl.doMLCOperations(TickWriterImpl.java:2488)
I need to remove prefixes such as 2017-03-01 10:57:50,700 [Thread-231] console : ERROR - so that the lines with a timestamp followed by - at look like the stack-trace lines without a timestamp above.
The result should be something like this:
2017-03-01 10:57:50,892 [Thread-977] limits.compiler : ERROR - Error in formula Undefined_CountryDom
cannot get field LOCKS_CountryDom, String, CountryDom, belongs to Header, scalar (dynamic index: 172)
limits.compiler.LimitsVariablesException: cannot get field LOCKS_CountryDom, String, CountryDom, belongs to Header, scalar (dynamic index: 172)
at limits.compiler.ExpressionHandler.evaluateBoolean(ExpressionHandler.java:170)
at limits.compiler.ExpressionHandler.getBoolean(ExpressionHandler.java:266)
at limits.compiler.ExpressionHandler.getString(ExpressionHandler.java:700)
at limits.compiler.compliance.ComplianceCheckFactoryImpl.compileDefaultMessageExpression(ComplianceCheckFactoryImpl.java:107)
at limits.compiler.compliance.ComplianceCheckFactoryImpl.createOverflow(ComplianceCheckFactoryImpl.java:231)
at limits.compiler.compliance.ComplianceCheckFactoryImpl.evaluateTickLockCombinations(ComplianceCheckFactoryImpl.java:498)
at limits.engine.stream.TickWriterImpl.doMLCOperations(TickWriterImpl.java:2488)
How can I do that?
You can group regular expressions within sed's substitution command, s/regex/replacement/. In this case we use two groups, each written as \(a_regex_group\), and print only the second one, \2.
sed 's/\(^[0-9]*.*-\s\s\)\(.*$\)/\t\2/'
This chops off everything that starts with digits (^[0-9]*), followed by arbitrary characters (.*) up to and including a dash and two whitespace characters (-\s\s), and leaves the rest (\2) with a leading tab (\t). Note that \s and \t are GNU sed extensions.
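For comparison, here is a rough Python equivalent of that substitution. It is a sketch under assumptions: the pattern is stricter than the sed expression (it only touches lines whose logger is console and whose message starts with at), it accepts any amount of whitespace after the dash, and the function name is made up for the example.

```python
import re

# Timestamp, thread, "console : ERROR -", then the stack frame to keep.
PREFIX = re.compile(
    r"^\d{4}-\d{2}-\d{2} [\d:,]+ \[Thread-\d+\] console : ERROR -\s+(at .*)$"
)


def strip_console_prefix(line: str) -> str:
    """Replace the log prefix with a tab; leave other lines untouched."""
    return PREFIX.sub(r"\t\1", line)


sample = [
    "2017-03-01 10:57:50,892 [Thread-977] limits.compiler : ERROR - "
    "Error in formula Undefined_CountryDom",
    "2017-03-01 10:57:50,700 [Thread-231] console : ERROR - "
    "at limits.compiler.ExpressionHandler.getString(ExpressionHandler.java:700)",
]
for line in sample:
    print(strip_console_prefix(line))
```

Anchoring on console and at keeps the first log line intact, which a greedy .*- pattern would not guarantee.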

How to recognize ID, Literals and Comments in Lex file

I have to write a lex program that has these rules:
Identifiers: String of alphanumeric (and _), starting with an alphabetic character
Literals: Integers and strings
Comments: Start with ! character, go to until the end of the line
Here is what I came up with
[a-zA-Z][a-zA-Z0-9]+ return(ID);
[+-]?[0-9]+ return(INTEGER);
[a-zA-Z]+ return ( STRING);
!.*\n return ( COMMENT );
However, I still get a lot of errors when I compile this lex file.
What do you think the error is?
It would have helped if you'd shown more clearly what the problem was with your code. For example, did you get an error message or did it not function as desired?
There are a couple of problems with your code, but it is mainly correct. The first issue I see is that you have not divided your lex program into the necessary parts with the %% divider. The first part of a lex program is the declarations section, where regular expression patterns are specified. The second part is the rules section, where patterns are paired with the actions to run when they match. The (optional) third section is where any code (for the compiler) is placed. Code for the compiler can also be placed in the declarations section when delineated by %{ and %} at the start of a line.
If we put your code through lex we would get this error:
"SoNov16.l", line 1: bad character: [
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: ]
"SoNov16.l", line 1: bad character: +
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: (
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: )
"SoNov16.l", line 1: bad character: ;
Did you get something like that? In your example code you are specifying actions (the return(ID); is an example of an action) and thus your code is for the second section. You therefore need to put a %% line ahead of it. It will then be a valid lex program.
Your code (probably) depends on a parser, which declares and consumes the tokens. For testing purposes it is often easier to just print the tokens first. I solved this problem by writing a C macro that does the print and can be redefined to do the return at a later stage. Something like this:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
[a-zA-Z][a-zA-Z0-9]+ TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
[a-zA-Z]+ TOKEN (STRING);
!.*\n TOKEN (COMMENT);
If we build and test this, we get the following:
abc
String: abc Matched: ID
abc123
String: abc123 Matched: ID
! comment text
String: ! comment text
Matched: COMMENT
Not quite correct. We can see that the ID rule is matching what should be a string. This is due to the ordering of the rules: when two rules match the same text, lex uses the one listed first. We have to put the String rule first to ensure it matches first, unless of course you were supposed to match strings inside quotes? You also missed the underscore from the ID pattern. It's also a good idea to match and discard any whitespace characters:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
[a-zA-Z]+ TOKEN (STRING);
[a-zA-Z][a-zA-Z0-9_]+ TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
!.*\n TOKEN (COMMENT);
[ \t\r\n]+ ;
Which when tested shows:
abc
String: abc Matched: STRING
abc123_
String: abc123_ Matched: ID
-1234
String: -1234 Matched: INTEGER
abc abc123 ! comment text
String: abc Matched: STRING
String: abc123 Matched: ID
String: ! comment text
Matched: COMMENT
Just in case you wanted strings in quotes, that is easy too:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
\"[^"]+\" TOKEN (STRING);
[a-zA-Z][a-zA-Z0-9_]* TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
!.*\n TOKEN (COMMENT );
[ \t\r\n] ;
"abc"
String: "abc" Matched: STRING
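To make the matching behaviour concrete, here is a small Python sketch that imitates how lex chooses a rule: the longest match wins, and on a tie the rule listed first wins. It is an illustration, not generated lex output. The rule table mirrors the quoted-string version above, with three assumptions: the ID pattern uses * so that single-letter identifiers are accepted, the comment pattern stops before the newline, and whitespace is dropped via a hypothetical SKIP rule.

```python
import re

# Rules in priority order, mirroring the quoted-string lex example.
RULES = [
    ("STRING", re.compile(r'"[^"]+"')),
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("INTEGER", re.compile(r"[+-]?[0-9]+")),
    ("COMMENT", re.compile(r"!.*")),
    ("SKIP", re.compile(r"[ \t\r\n]+")),
]


def tokenize(text):
    """Return (token_name, lexeme) pairs using lex-style rule selection."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strict ">" keeps the earlier rule when two matches tie.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        if best_name != "SKIP":
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


for token in tokenize('abc abc123_ -1234 "hi" ! comment text'):
    print(token)
```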

How to read a string containing a comma and a hash sign with textread?

My prototype data line looks like this:
(1) 11 July England 0-0 Uruguay # Wembley Stadium, London
Currently I'm using this:
[no,dd,mm,t1,p1,p2,t2,loc]=textread('1966.txt','(%d) %d %s %s %d-%d %s # %[%s \n]');
But it gives me the following error:
Error using dataread
Trouble reading string from file (row 1, field 12) ==> Wembley Stadium, London\n
Error in textread (line 174)
[varargout{1:nlhs}]=dataread('file',varargin{:}); %#ok<REMFF1>
So it seems to have trouble reading a string that contains a comma, or it's the hash sign (#) that causes trouble. I read the documentation thoroughly, but nowhere does it mention what to do when you have special characters such as #, or when you want to read a string that contains a delimiter that I don't want recognized as a delimiter.
You want
[no,dd,mm,t1,p1,p2,t2,loc] = ...
textread('1966.txt','(%d) %d %s %s %d-%d %s # %[^\n]');
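The %[^\n] conversion reads every character up to the end of the line, so the comma and spaces in the location are kept. As a cross-check of what the format string is meant to capture, here is the same line parsed with a Python regular expression. This is only a sketch for comparison: the function name is made up, and like %s it assumes team names contain no spaces.

```python
import re

# One group per field of the textread format; the last group plays the
# role of %[^\n] and takes the rest of the line.
LINE = re.compile(r"\((\d+)\) (\d+) (\S+) (\S+) (\d+)-(\d+) (\S+) # (.+)")


def parse_match_line(line: str):
    """Return (no, dd, mm, t1, p1, p2, t2, loc) for one line of the file."""
    match = LINE.match(line)
    if match is None:
        raise ValueError("line does not match the expected layout")
    no, dd, mm, t1, p1, p2, t2, loc = match.groups()
    return int(no), int(dd), mm, t1, int(p1), int(p2), t2, loc


print(parse_match_line(
    "(1) 11 July England 0-0 Uruguay # Wembley Stadium, London"
))
```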

How to use Unicode codepoints above U+FFFF in Rebol 3 strings like in Rebol 2?

I know you can't use caret style escaping in strings for codepoints bigger than ^(FF) in Rebol 2, because it doesn't know anything about Unicode. So this doesn't generate anything good, it looks messed up:
print {Q: What does a Zen master's {Cow} Say? A: "^(03BC)"!}
Yet the code works in Rebol 3 and prints out:
Q: What does a Zen master's {Cow} Say? A: "μ"!
That's great, but R3 maxes out its ability to hold a character in a string at all at U+FFFF apparently:
>> type? "^(FFFF)"
== string!
>> type? "^(010000)"
** Syntax error: invalid "string" -- {"^^(010000)"}
** Near: (line 1) type? "^(010000)"
The situation is a lot better than the random behavior of Rebol 2 when it met codepoints it didn't know about. However, there used to be a workaround in Rebol for storing strings if you knew how to do your own UTF-8 encoding (or got your strings by way of loading source code off disk). You could just assemble them from individual characters.
So the UTF-8 encoding of U+010000 is #F0908080, and you could before say:
workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]
And you'd get a string with that single codepoint encoded using UTF-8, that you could save to disk in code blocks and read back in again. Is there any similar trick in R3?
There is a workaround using the string! datatype as well. You cannot use UTF-8 in that case, but you can use a UTF-16 workaround as follows:
utf-16: "^(d800)^(dc00)"
This encodes the ^(10000) code point as a UTF-16 surrogate pair. In general, the following function can do the encoding:
utf-16: func [
code [integer!]
/local low high
] [
case [
code < 0 [do make error! "invalid code"]
code < 65536 [append copy "" to char! code]
code < 1114112 [
code: code - 65536
low: code and 1023
high: code - low / 1024
append append copy "" to char! high + 55296 to char! low + 56320
]
'else [do make error! "invalid code"]
]
]
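The arithmetic in that function is the standard surrogate-pair calculation. Here is the same computation as a Python sketch, returning the UTF-16 code units as integers. The function name is chosen for the example and is not part of any Rebol or Python API.

```python
def utf16_code_units(code: int):
    """Return the UTF-16 code units (as integers) for one code point."""
    if code < 0 or code > 0x10FFFF:
        raise ValueError("invalid code")
    if code < 0x10000:
        return [code]
    code -= 0x10000                    # 20 bits remain
    high = 0xD800 + (code >> 10)       # top 10 bits
    low = 0xDC00 + (code & 0x3FF)      # bottom 10 bits
    return [high, low]


print([hex(unit) for unit in utf16_code_units(0x10000)])
# ['0xd800', '0xdc00']
```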
Yes, there is a trick...which is the trick you should have been using in R2 as well. Don't use a string! Use a binary! if you have to do this sort of thing:
good-workaround: #{F0908080}
It would've worked in Rebol2, and it works in Rebol3. You can save it and load it without any funny business.
In fact, if you care about Unicode at all, ever...stop doing string processing that uses codepoints higher than ^(7F) if you are stuck in Rebol 2 and not 3. We'll see why by looking at that terrible workaround:
terrible-workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]
..."And you'd get a string with that single UTF-8 codepoint"...
The only thing you should get is a string with four individual character codepoints, and with 4 = length? terrible-workaround. Rebol2 is broken because string! is basically no different from binary! under the hood. In fact, in Rebol2 you could alias the two types back and forth without making a copy, look up AS-BINARY and AS-STRING. (This is impossible in Rebol3 because they really are fundamentally different, so don't get attached to the feature!)
It's somewhat deceptive to see these strings reporting a length of 4, and there's a false comfort of each character producing the same value if you convert them to integer!. Because if you ever write them out to a file or port somewhere, and they need to be encoded, you'll get bitten. Note this in Rebol2:
>> to integer! #"^(80)"
== 128
>> to binary! #"^(80)"
== #{80}
But in R3, you have a UTF-8 encoding when binary conversion is needed:
>> to integer! #"^(80)"
== 128
>> to binary! #"^(80)"
== #{C280}
So you will be in for a surprise when your seemingly-working code does something different at a later time, and winds up serializing differently. In fact, if you want to know how "messed up" R2 is in this regard, look at why you got a weird symbol for your "mu". In R2:
>> to binary! #"^(03BC)"
== #{BC}
It just threw the "03" away. :-/
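For reference, the correct UTF-8 byte sequences for the code points discussed here can be checked with a few lines of Python. The helper name utf8_hex is made up for the example.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as uppercase hex."""
    return chr(code_point).encode("utf-8").hex().upper()


print(utf8_hex(0x80))     # C280
print(utf8_hex(0x03BC))   # CEBC
print(utf8_hex(0x10000))  # F0908080
```

The R3 result #{C280} agrees with the first line, while the R2 result #{BC} for the mu is just the low byte of the code point rather than a UTF-8 encoding.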
So if you need for some reason to work with Unicode strings and can't switch to R3, try something like this for the cow example:
mu-utf8: #{CEBC}
utf8: rejoin [#{} {Q: What does a Zen master's {Cow} Say? A: "} mu-utf8 {"!}]
That gets you a binary. Only convert it to string for debug output, and be ready to see gibberish. But it is the right thing to do if you're stuck in Rebol2.
And to reiterate the answer: it's also what to do if you are for some odd reason stuck needing to use those higher codepoints in Rebol3:
utf8: rejoin [#{} {Q: What did the Mycenaean's {Cow} Say? A: "} #{F0908080} {"!}]
I'm sure that would be a very funny joke if I knew what LINEAR B SYLLABLE B008 A was. Which leads me to say that most likely, if you're doing something this esoteric you probably only have a few codepoints being cited as examples. You can hold most of your data as string up until you need to slot them in conveniently, and hold the result in a binary series.
UPDATE: If one hits this problem, here is a utility function that can be useful for working around it temporarily:
safe-r2-char: charset [#"^(00)" - #"^(7F)"]
unsafe-r2-char: charset [#"^(80)" - #"^(FF)"]
hex-digit: charset [#"0" - #"9" #"A" - #"F" #"a" - #"f"]
r2-string-to-binary: func [
str [string!] /string /unescape /unsafe
/local result s e escape-rule unsafe-rule safe-rule rule
] [
result: copy either string [{}] [#{}]
escape-rule: [
"^^(" s: 2 hex-digit e: ")" (
append result debase/base copy/part s e 16
)
]
unsafe-rule: [
s: unsafe-r2-char (
append result to integer! first s
)
]
safe-rule: [
s: safe-r2-char (append result first s)
]
rule: compose/deep [
any [
(either unescape [[escape-rule |]] [])
safe-rule
(either unsafe [[| unsafe-rule]] [])
]
]
unless parse/all str rule [
print "Unsafe codepoints found in string! by r2-string-to-binary"
print "See http://stackoverflow.com/questions/15077974/"
print mold str
throw "Bad codepoint found by r2-string-to-binary"
]
result
]
If you use this instead of a to binary! conversion, you will get consistent behavior in both Rebol2 and Rebol3. (It effectively implements a solution for terrible-workaround style strings.)