Parse::RecDescent does it only operate on a string? - perl

I was looking at using Parse::RecDescent to parse some large files. I was figuring I'd pass it a token at a time. After looking at it a while, it appears that the tokenizer is built into it and you have to pass it the whole string up front. Is this correct?

Yes. You normally pass the complete text to be parsed as a string.
However, note that it's documented that if you pass the text as a reference:
$parser->startrule(\$text);
then the matched portion of $text will be removed, leaving only what did not match. It may be possible to design your grammar so that you can parse a file in chunks.
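The "parse in chunks" idea can be sketched roughly as follows. This is a Python illustration of the general pattern only, not Parse::RecDescent code: the `key=value;` record format, the `RECORD` regex, and the `parse_prefix` and `parse_stream` helpers are all made up for the example. The point is that each pass consumes whatever matches and carries the unmatched tail over to the next chunk.

```python
import io
import re

# Hypothetical record grammar: one "key=value;" record at a time.
RECORD = re.compile(r"\s*(\w+)=(\w+);")


def parse_prefix(buffer):
    """Consume as many complete records as possible from the front.

    Returns (records, remainder); remainder is the unmatched tail.
    """
    records = []
    pos = 0
    while True:
        m = RECORD.match(buffer, pos)
        if m is None:
            break
        records.append((m.group(1), m.group(2)))
        pos = m.end()
    return records, buffer[pos:]


def parse_stream(stream, chunk_size=16):
    """Read the stream in chunks, carrying the unmatched tail forward."""
    leftover = ""
    results = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        records, leftover = parse_prefix(leftover + chunk)
        results.extend(records)
    return results, leftover


records, leftover = parse_stream(io.StringIO("a=1; b=2; c=3; d="), chunk_size=5)
```

This only works because a record cannot match until its terminating `;` has arrived, so a record split across two chunks is simply retried once more input is available. A grammar without such a clear end-of-record marker would need more care.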

Related

How to properly parse a multi-character token in tree-sitter scanner function

Tree-sitter allows you to use an external scanner for those tokens that are tricky to parse or that depend on specific states like multiline strings.
The scanner takes a convenient lexer object with several methods that allow you to "scan" the document looking for the proper token characters.
Two of the key parts of this lexer are lookahead, which tells you the next character the lexer is "looking at", and advance, which moves the lexer pointer to the next character.
However, after reading the docs and several other parsers that make use of this, it's still not clear to me whether calling these methods will "affect" the overall tree-sitter parser or whether they are just local to my function invocation.
Especially tricky is trying to parse a multi-character token (more than 2 characters, in fact) because you need to "advance" the lexer, consuming the potential next chars that may be part of other tokens. One possible escape is to just return false after consuming the tokens and let tree-sitter go to the next step in the parsing, but that may skip other valid tokens that potentially depend on the characters that I already consumed.
Of course I can move this parsing to the bottom of the scan function, but then maybe other shorter tokens may shadow this longer one and also produce an incorrect parsing.
As far as I know, there is no way to "rewind" the parser to undo the "consumption" of the characters, so I am not sure how to deal with this.
The tokens that I'm trying to parse are {js| for string opening and |js} for string closing.
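To make the problem concrete, here is a toy model in Python of matching a multi-character token one character at a time. `ToyLexer` and `scan_literal` are hypothetical stand-ins and not the tree-sitter API. In particular, the toy lexer lets the caller restore its position, and whether a real external scanner can get the same effect is exactly what is being asked.

```python
class ToyLexer:
    """Hypothetical stand-in for a scanner's lexer; not the tree-sitter API."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    @property
    def lookahead(self):
        # Next character the lexer is looking at, or "" at end of input.
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def advance(self):
        # Move the lexer pointer to the next character.
        self.pos += 1


def scan_literal(lexer, literal):
    """Match `literal` at the current position, consuming it only on success."""
    start = lexer.pos
    for expected in literal:
        if lexer.lookahead != expected:
            lexer.pos = start  # undo the partial consumption
            return False
        lexer.advance()
    return True


opening = ToyLexer("{js|hello|js}")
matched_open = scan_literal(opening, "{js|")

partial = ToyLexer("{json")
matched_partial = scan_literal(partial, "{js|")
```

In the second case the first three characters match and the fourth does not, so the position is put back to 0 and other token rules still get to see `{`, `j` and `s`.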

Removing spaces from a string using Powershell

I have an issue where extracting data from a database sometimes (quite often) adds spaces inside strings of text, where they should not be.
What I'm trying to do is create a small script that will look at these strings and remove the spaces.
The problem is that the spaces can be in any position in the string, and the string is a variable that changes.
Example:
"StaffID": "0000 25" <- The space in the number should not be there.
Is there a way to have the script look at this particular line and, if it finds spaces, remove them?
Or: "DateOfBirth": "23-10-199 0" <- It would also need to look at these spaces and remove them.
The problem is that the same data also has lines such as:
"Address": " 91 Broad street" <- The spaces should be here obviously.
I've tried using TRIM, but that only removes spaces from start/end.
Worth mentioning that the data extracted is in json format and is then imported using API into the new system.
You should think about the logic of what you want to do, and whether or not it's programmatically possible to determine if you can teach your script where it is or is not appropriate to put spaces. As it is, this is one of the biggest problems facing AI research right now, so unfortunately you're probably going to have to do this by hand.
If it were me, I'd specify the kind of data format that I expect from each column, and try my best to attempt to parse those strings. For example, if you know that StaffID doesn't contain spaces, you can have a rule that just deletes them:
$staffid = $staffid -replace '\s+', ''
There are some more complicated things that you can do with forced formatting (.replace) that have already been covered in this answer, but again, that requires some expectation of exactly what data is going to come out of what column.
You might want to look more closely at where those spaces are coming from, rather than process the output like this. Is the retrieval script doing it? Maybe you can optimize the database that you're drawing from?
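The per-field rule idea can be sketched like this. It is written in Python rather than PowerShell, purely as an illustration of the logic. The `NO_SPACE_FIELDS` set and the `clean_record` helper are assumptions for the example; which fields belong in the set depends on what you know about your own data.

```python
import json
import re

# Assumed rule set: fields listed here must never contain whitespace.
NO_SPACE_FIELDS = {"StaffID", "DateOfBirth"}


def clean_record(record):
    """Strip all whitespace from the listed fields; leave the others alone."""
    cleaned = {}
    for key, value in record.items():
        if key in NO_SPACE_FIELDS and isinstance(value, str):
            cleaned[key] = re.sub(r"\s+", "", value)
        else:
            cleaned[key] = value
    return cleaned


raw = '{"StaffID": "0000 25", "DateOfBirth": "23-10-199 0", "Address": " 91 Broad street"}'
cleaned = clean_record(json.loads(raw))
```

Working on the parsed JSON rather than on raw lines means the rule is tied to the field name, so the Address value keeps its spaces.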

lex default token definition syntax

I guess this is a simple question, but I have found no reference. I have a small lex file defining some tokens from a string and altering them (actually converting them to uppercase).
Basically it is a list of commands like this:
word {setToUppercase(yytext);}
Where setToUppercase is a procedure to change case and store it.
I need to have the complete input string with the altered words. Is there a way to define a default token / rest of tokens so I can associate them with an unaltered storage in an output string?
You can do that in one shot with:
.|\n {save_str(yytext);}
I said it was an easy one.
. {save_str(yytext);}
\n {save_str(yytext);}
This way all characters and newline are treated.
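For anyone who wants to see the effect of such a catch-all rule outside lex, here is a rough Python emulation. The keyword list (`select`, `from`) and the `transform` function are made up for the example, and the `\b` word boundaries are an added assumption that a plain lex pattern would not give you. Specific words are uppercased, and every other character, newline included, is copied through unchanged.

```python
import re

# "word" plays the role of the specific rules; "other" is the default rule,
# equivalent in spirit to `.|\n` in the lex file.
TOKEN = re.compile(r"(?P<word>\b(?:select|from)\b)|(?P<other>.|\n)")


def transform(text):
    out = []
    for m in TOKEN.finditer(text):
        if m.lastgroup == "word":
            out.append(m.group().upper())  # altered token
        else:
            out.append(m.group())          # default: store unaltered
    return "".join(out)


result = transform("select a\nfrom b")
```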

regex to get string within 2 strings

"artistName":"Travie McCoy", "collectionName":"Billionaire (feat. Bruno Mars) - Single", "trackName":"Billionaire (feat. Bruno Mars)",
I wish to get the artist name, so Travie McCoy, from within that code using regex. Please note I am using RegexKitLite for the iPhone SDK, if this changes things.
Thanks
"?artistName"?\s*:\s*"([^"]*)("|$) should do the trick. It even handles some variations in the string:
White space before and after the :
artistName with and without the quotes
missing " at the end of the artist name if it is the last thing on the line
But there will be many more variations in the input you might encounter that this regex will not match.
Also you don’t want to use a regex for matching this for performance reasons. Right now you might only be interested in the artistName field. But some time later you will want information from the other fields. If you just change the field name in the regex you’ll have to match the whole string again. Much better to use a parser and transform the whole string into a dictionary where you can access the different fields easily. Parsing the whole string shouldn’t take much longer than matching the last key/value pair using a regex.
This looks like some kind of JSON, there are lots of good and complete parsers available. It isn’t hard to write one yourself though. You could write a simple recursive descent parser in a couple of hours. I think this is something every programmer should have done at least once.
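\"?artistName\"?\\s*:\\s*\"([^\"]*)(\"|$)
That's the same pattern escaped for an Objective-C string literal (quotes and backslashes both need a backslash in front).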
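To compare the two approaches side by side, here is a small Python sketch (not RegexKitLite). It runs the regex from the answer against the sample line, then parses the same line as JSON. The JSON part assumes the fragment is the inside of a JSON object with a trailing comma, which is a guess based on the one sample line shown.

```python
import json
import re

line = ('"artistName":"Travie McCoy", '
        '"collectionName":"Billionaire (feat. Bruno Mars) - Single", '
        '"trackName":"Billionaire (feat. Bruno Mars)",')

# Same pattern as in the answer; group 1 is the artist name.
ARTIST = re.compile(r'"?artistName"?\s*:\s*"([^"]*)("|$)')
match = ARTIST.search(line)
artist_by_regex = match.group(1) if match else None

# Assumption: dropping the trailing comma and adding braces yields valid JSON.
fields = json.loads("{" + line.rstrip().rstrip(",") + "}")
artist_by_parser = fields["artistName"]
```

With the parsed dictionary, the other fields (`collectionName`, `trackName`) are available without matching the string again.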

Matching beginning of words in a NSString

Is there a method built in to NSString that tokenizes the string and searches the beginning of each token? the compare method seems to only do the beginning of a string, and using rangeOfString isn't really sufficient because it doesn't have knowledge of tokens. Right now I'm thinking the best way to do this is to call
[myString componentsSeparatedByString:@" "]
and then loop over the resulting array, calling compare on each component of the string. Is this built-in and I just missed it?
Using CFStringTokenizer for, um, tokenizing strings will be more robust than splitting on @" ", but searching the results is still left up to you.
You may want to look into RegexKit Lite:
http://regexkit.sourceforge.net/#RegexKitLite/
Although it's a third party library, it's basically a very small (one class) wrapper built around the built-in fairly powerful regular expression engine.
It seems like this would be more useful since you could have non-capturing expressions match around the token-separators and then the capturing portion include or not include the text you are looking for along with the remaining text between tokens. If you have not used regular expressions much before, you'll want to read some kind of reference but just be aware you can separate out matching patterns from content you want to see with a cryptic but very powerful syntax.
I'm also not sure you can use CFStringTokenizer on the iPhone since the iPhone specific doc set has no reference for it.
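The regex approach boils down to anchoring the search term at a word boundary. Here is a minimal sketch in Python rather than Objective-C. `words_starting_with` is a made-up helper, and it assumes the prefix starts with a letter, digit or underscore, since `\b` only behaves as a word start in that case.

```python
import re


def words_starting_with(text, prefix):
    """Return every word in `text` that begins with `prefix` (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\w*", re.IGNORECASE)
    return pattern.findall(text)


hits = words_starting_with("The other theory", "the")
```

"other" is skipped even though it contains "the", because the match has to start at the beginning of a word.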