lex default token definition syntax

I guess this is a simple question, but I have found no reference. I have a small lex file defining some tokens from a string and altering them (actually converting them to uppercase).
Basically it is a list of commands like this:
word {setToUppercase(yytext);}
Where setToUppercase is a procedure to change case and store it.
I need the complete input string with the altered words in place. Is there a way to define a default token (a rule for everything else) so that I can associate the remaining text, unaltered, with storage in an output string?

You can do that in one shot with:
.|\n {save_str(yytext);}

I said it was an easy one.
. {save_str(yytext);}
\n {save_str(yytext);}
This way every character, including newline, is handled.
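The effect of such a catch-all rule can also be sketched outside lex. The Python below is only an illustration of the idea, not lex itself: the keyword list and the name `rewrite` are made up for the example, and Python's regex alternation takes the first alternative that matches, whereas lex prefers the longest match.

```python
import re

# Hypothetical keyword list; in the lex file each of these would be its own rule.
KEYWORDS = ("select", "from")

# Keywords are tried first; the last alternative plays the role of `.|\n`
# and matches any single character, newline included.
TOKEN = re.compile("(?P<kw>" + "|".join(KEYWORDS) + r")|(?P<rest>.|\n)")

def rewrite(text: str) -> str:
    """Uppercase the keywords and copy everything else through unchanged."""
    out = []
    for m in TOKEN.finditer(text):
        if m.group("kw") is not None:
            out.append(m.group("kw").upper())  # altered token
        else:
            out.append(m.group("rest"))        # default rule: stored unaltered
    return "".join(out)

print(rewrite("select a\nfrom b"))
```

Because the default alternative consumes exactly one character, no input is ever dropped, which is the point of the `.|\n` rule.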


What are allowed characters for a submodule name in registerModule?

registerModule() expects a submodule key as a third parameter.
I think it should probably not contain a space and only alphabetic characters (or alphanumeric?) and underscore ('_'), but I'm not really sure.
I could not find specific information for this.
The function uses \TYPO3\CMS\Core\Utility\GeneralUtility::underscoredToUpperCamelCase to generate the full module name, which combines the main module and the submodule, joined by an underscore (_).
So you already guessed the right answer.
It's a bit complicated to answer!
The official API documentation does not give exact information. I have worked on some extensions that have multiple submodules, and I'm quite sure special characters are not allowed in the submodule key.
eg. web_TestTestbe123 (mainModulename_subModuleKey)
I have noticed the following characteristics for the key:
Key must be lowercase
No spaces allowed
Numeric values are fine
Does this make sense?
I found this in the documentation just now:
Backend modules
1. The modkey is made up of alphanumeric characters only. It does not contain underscores and starts with a letter.
https://docs.typo3.org/m/typo3/reference-coreapi/master/en-us/ExtensionArchitecture/NamingConventions/Index.html
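If you want to check a key against that rule before registering the module, a small validator is enough. This is a sketch in Python rather than PHP, and it assumes the quoted convention is read literally (alphanumeric only, no underscores, first character a letter); `is_valid_modkey` is a made-up name, not a TYPO3 function.

```python
import re

# Assumption: the naming convention quoted above, read literally.
# Alphanumeric characters only, no underscores, first character is a letter.
MODKEY = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_valid_modkey(key: str) -> bool:
    """Return True if the whole key matches the convention."""
    return MODKEY.fullmatch(key) is not None

for candidate in ("testbe123", "test_be", "1test", "test be"):
    print(candidate, is_valid_modkey(candidate))
```

The pattern accepts uppercase letters because the documentation line does not mention case. If lowercase turns out to be required, restrict the character classes accordingly.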

Parse::RecDescent does it only operate on a string?

I was looking at using Parse::RecDescent to parse some large files. I was figuring I'd pass it a token at a time. After looking at it a while, it appears that the tokenizer is built into it and you have to pass it the whole string up front. Is this correct?
Yes. You normally pass the complete text to be parsed as a string.
However, note that it's documented that if you pass the text as a reference:
$parser->startrule(\$text);
then the matched portion of $text will be removed, leaving only what did not match. It may be possible to design your grammar so that you can parse a file in chunks.
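The chunking idea (consume what matches, keep the unmatched tail, append the next chunk) can be sketched independently of Parse::RecDescent. The Python below is not that module's API. It assumes a hypothetical record format `name=value;`, and `parse_chunks` is a made-up name.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical record format for illustration: "name=value;"
RECORD = re.compile(r"\s*(\w+)=(\w+);")

def parse_chunks(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Parse records from a stream of text chunks of arbitrary size."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        pos = 0
        while True:
            m = RECORD.match(buffer, pos)
            if m is None:
                break              # incomplete record: wait for more input
            yield m.group(1), m.group(2)
            pos = m.end()
        buffer = buffer[pos:]      # keep only the part that did not match
    if buffer.strip():
        raise ValueError(f"unparsed input left over: {buffer!r}")

print(list(parse_chunks(["a=1; b", "=2;c=", "3;"])))
```

This only works because each record ends with an explicit terminator, so a record split across two chunks can never be mistaken for a complete one. A grammar parsed in chunks needs some boundary like that.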

Need token delimiter that doesn't conflict with PowerShell or RegEx special characters

I have written a PowerShell function that expands tokens quite nicely. Unfortunately, I used % as my token delimiter, and now I find that % is a special character in PowerShell v3 that is likely to conflict. I am thinking I might use ~ as my delimiter, but I wonder if there is a best practice here? Something that is sure to not conflict with either PowerShell or RegEx, and since my users are not always IT folks, and only need to use my tool a few weeks out of the year, something that is really obvious would be helpful.
I am currently expanding tokens before I do any other RegEx processing, and before I do any $ExecutionContext.InvokeCommand.ExpandString(), but I keep finding new opportunities to expand the tool and I may be doing those things earlier at some point, so I want to future proof the data as much as possible.
Any advice is greatly appreciated!
Gordon
How about using # as your delimiter?
It has no special meaning in regex by default (it only starts a comment when the ignore-pattern-whitespace option is on). In PowerShell it designates a comment, but only when it appears outside a quoted string literal. In this application you're using it as a string delimiter, so it's always going to be used in a quoted string argument.
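Whichever delimiter you settle on, escaping it before building the pattern means the choice does not depend on whether it is a regex metacharacter. Here is a rough sketch in Python rather than PowerShell. `expand_tokens` and the token names are made up for the example, and in PowerShell `[regex]::Escape` plays the same role as `re.escape`.

```python
import re

def expand_tokens(text: str, values: dict, delimiter: str = "#") -> str:
    """Replace <delimiter>Name<delimiter> with values["Name"]."""
    d = re.escape(delimiter)  # safe even if the delimiter is a regex metacharacter
    pattern = re.compile(d + r"(\w+)" + d)

    def substitute(m):
        # Unknown tokens are left untouched instead of raising an error.
        return str(values.get(m.group(1), m.group(0)))

    return pattern.sub(substitute, text)

print(expand_tokens("Hello #Name#, see #Missing#", {"Name": "Gordon"}))
print(expand_tokens("Hello $Name$", {"Name": "Gordon"}, delimiter="$"))
```

Taking the delimiter as a parameter also makes it cheap to change later if the current choice starts to conflict with something.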

Wikipedia (MediaWiki) URI encoding scheme

How do Wikipedia (or MediaWiki in general) encode page titles in URIs? It's not normal URI encoding, since spaces are replaced with underscores and double quotes are not encoded and things like that.
http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restrictions%29 - here you've got some kind of description of what their engine enforces on article names.
They should have something like this in their LocalSettings.php:
$wgArticlePath = '/wiki/$1';
and proper server URI rewrites configuration - they seem to be using Apache (HTTP header), so it's probably mod_rewrite. http://www.mediawiki.org/wiki/Manual:Short_URL
You can also refer to the index.php file for an article on Wikipedia like this: http://en.wikipedia.org/w/index.php?title=Foo%20bar and get redirected by the engine to http://en.wikipedia.org/wiki/Foo_bar. Behind the scenes mod_rewrite translates it into /index.php?title=Foo_bar. For the MediaWiki engine it's the same as if you visited http://en.wikipedia.org/w/index.php?title=Foo_bar - this page doesn't redirect you.
The process is quite complex and isn't exactly pretty. You need to look at the Title class found in includes/Title.php. You should start with the newFromText method, but the bulk of the logic is in the secureAndSplit method.
Note that (as ever with MediaWiki) the code is not decoupled in the slightest. If you want to replicate it, you'll need to extract the logic rather than simply re-using the class.
The logic looks something like this:
Decode character references (e.g. &eacute; becomes é)
Convert spaces to underscores
Check whether the title is a reference to a namespace or interwiki
Remove hash fragments (e.g. Apple#Name)
Remove forbidden characters
Forbid subdirectory links (e.g. ../directory/page)
Forbid triple tilde sequences (~~~) (for some reason)
Limit the size to 255 bytes
Capitalise the first letter
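A loose sketch of a subset of these steps, in Python rather than PHP, might look like the following. It is not a port of secureAndSplit: the namespace/interwiki check and the forbidden-character list are left out, the whitespace collapsing is my own simplification, and `normalize_title` is a made-up name.

```python
import html

MAX_TITLE_BYTES = 255

def normalize_title(raw: str) -> str:
    """Apply a simplified subset of the title rules listed above."""
    title = html.unescape(raw)           # decode character references
    title = title.split("#", 1)[0]       # remove the hash fragment
    title = "_".join(title.split())      # spaces -> underscores (trimmed, runs collapsed)
    if "~~~" in title:
        raise ValueError("triple tilde sequences are forbidden")
    if any(part in (".", "..") for part in title.split("/")):
        raise ValueError("subdirectory links are forbidden")
    if len(title.encode("utf-8")) > MAX_TITLE_BYTES:
        raise ValueError("title is longer than 255 bytes")
    return title[:1].upper() + title[1:]  # capitalise the first letter

print(normalize_title("foo bar#Section"))
print(normalize_title("caf&eacute; society"))
```

The length check is done on the UTF-8 encoding because the limit is in bytes, not characters.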
Furthermore, I believe I'm right in saying that quotation marks don't need to be encoded by the original user -- browsers can handle them transparently.
I hope that helps!

Apostrophe issue in RTF

I have a function within a custom CRM web application (old VB.Net circa 2003) that takes a set of fields from a database and merges them with placeholders in a set of RTF-based template documents. These generate merged letters and documentation. The code essentially loops through each line of the RTF template file and replaces any instances of the placeholder values with text from a database record. The issue I'm having is that users have pasted a certain type of apostrophe into the web app (and therefore into the database) that is not rendering correctly in the resulting RTF file. It is rendering like this - ’.
I need a way to spot this invalid apostrophe in the code and replace it with a valid one. Unfortunately when I paste the invalid apostrophe into the Visual Studio editor it gets converted into the correct one. So I need another way to express this invalid apostrophe's value. Unfortunately I do not know a great deal about unicode and other encodings so I'm calling out for help with this.
Any ideas?
If you really just want to figure out what the character is you might want to try and paste it into a text editor like ultraedit. It has a hex mode that you can flip to to see the actual underlying bytes.
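In order to do the replace once you've figured out the character, you'd do something like this in VB (Replace returns a new string, so assign the result):
text = text.Replace(ChrW(&H2019), "'") ' &H2019 (decimal 8217) is the right single quotation mark; use whatever code you actually find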
Note that you might not be able to figure it out easily with the text editor, because the character might also get mangled when it is pasted from the clipboard. You might instead want to print the character codes from your code as debug output. You can use the AscW function to do that.
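To show what that debug output could look like, here is a sketch in Python rather than VB. `ord` plays the role of AscW, and `non_ascii_report` is a made-up helper name.

```python
def non_ascii_report(text: str) -> list:
    """List (index, character, code point) for every character above 127."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# The curly apostrophe is written as an escape so that no editor can alter it.
sample = "It\u2019s"
for index, ch, code in non_ascii_report(sample):
    print(index, code)
```

Running the same kind of loop over a value read straight from the database tells you which character is actually stored, without going through the clipboard.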
I can't help but think that it may actually simply be a case of specifying the correct encoding to use when you write out the stream though. Assuming you're using a StreamWriter you can specify it on the constructor. I'm guessing you actually want ASCII given your requirement.
oWriter = New System.IO.StreamWriter(path, False, System.Text.Encoding.ASCII)
It looks like you probably want to escape characters outside the 8-bit range (>255).
You can do that using \uNNNN, where NNNN is the decimal character code, followed by a fallback character, according to the Wikipedia article.
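A minimal sketch of that escaping, in Python rather than VB, could look like this. It assumes the RTF convention of a signed 16-bit decimal value after \u, followed by one fallback character, and for simplicity it escapes everything above 127 rather than only values above 255. `rtf_escape` is a made-up name, and newlines are not handled.

```python
def rtf_escape(text: str) -> str:
    """Escape text for insertion into an RTF document body."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)            # RTF control characters
        elif ord(ch) < 128:
            out.append(ch)                   # plain ASCII passes through
        else:
            # Emit one \uN per UTF-16 code unit so characters beyond the
            # Basic Multilingual Plane become a surrogate pair.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536            # the value is a signed 16-bit number
                out.append("\\u%d?" % unit)  # "?" is the fallback character
    return "".join(out)

print(rtf_escape("It\u2019s a {test}"))
```

Applying something like this to each database value before it is merged into the template keeps the output file pure ASCII, which also fits the StreamWriter suggestion above.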