Regex for strings in BibTeX

I'm trying to write a BibTeX parser with flex/bison. Here are the rules for strings in BibTeX:
Strings can be enclosed in double quotes "..." or in braces {...}
In a string, braces can be nested
Inside a string, the braces should be balanced (invalid string: {this is a { test})
Inside an inner {...}, you can have any characters, including a double quote. So this string is valid: {This is a string {test"} and it is valid}
Any ideas on how to do this?

Now you're going into the field of a text parser. Surprisingly, nobody has made a bibtex library for Actionscript that I could find, so it's an interesting problem. If you do make one, do the community a favor and open source it :)
It won't be easy, since you essentially have to go character by character, check for the characters you need, and build logic around them. However, I recommend looking at as3corelib's implementation of the JSON parser, which is somewhat similar to what you're trying to accomplish. You'll at least get an idea of how to do it using a tokenizer, and it's a very good start for your project.
Good luck.
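To make the character-by-character idea concrete, here is a minimal sketch in Python (not flex/bison, and not taken from as3corelib). The function name read_bibtex_string is hypothetical. It assumes the rules listed in the question: a string opens with { or ", braces must balance, and a double quote only closes a quoted string at brace depth 0. Regular expressions alone cannot count nesting, so a flex version would probably need a start condition plus a depth counter that follows the same logic.

```python
def read_bibtex_string(text, start=0):
    """Return (value, next_index) for the BibTeX string starting at text[start].

    Raises ValueError if the string is unterminated or its braces are unbalanced.
    """
    opener = text[start]
    if opener not in '{"':
        raise ValueError("expected '{' or '\"' at position %d" % start)
    depth = 0  # nesting depth of braces inside the string
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                if opener == "{":
                    # This brace closes the string itself.
                    return text[start + 1:i], i + 1
                raise ValueError("unbalanced '}' at position %d" % i)
            depth -= 1
        elif ch == '"' and opener == '"' and depth == 0:
            # A quote only ends a quoted string outside any inner braces.
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated string starting at position %d" % start)
```

With this sketch, {This is a string {test"} and it is valid} is accepted, while {this is a { test} raises ValueError because the outer brace is never closed.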

Related

php preg_replace string doesn't work

I have tried to minify a JSON file with str_replace(), but it doesn't work the way I used it.
//I want to minify a json file with php.
//Here I am trying to replace ", {" with ",{"
//$result = preg_replace('/abc/', 'def', $string); # Replace all 'abc' with 'def'
$new = preg_replace('/, {/', ',{', $new); //doesn't work.. why?
As for the specific issue: { is a special character in regular expressions, so my first thought was that you need to escape it (see the Meta-characters section of PCRE syntax in the PHP manual) and change the first argument to '/, \{/'. Never mind: as @Hugo demonstrated, the unescaped pattern should work too, because PCRE treats a { that does not begin a valid quantifier as a literal character. Without telling us how your approach failed, we can't help more.
More importantly, this is terribly error-prone. What about a JSON string like ["hello, {name}"]? Your attempt will incorrectly "minify" the part inside the quotes and turn it into ["hello,{name}"]. Not a critical bug in this case, but it might be more severe in other cases. Handling string literals properly is a pain. The simplest way to actually minify JSON is json_encode(json_decode($json)), since PHP by default does not pretty-print or put unnecessary whitespace into JSON.
And finally, maybe you don't really need to do this. If you are doing this to save HTTP traffic or something, just make sure your server gzips responses, caches properly, etc.
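The decode-then-re-encode round trip suggested above can be sketched as follows. This is a Python illustration of the same idea rather than the PHP call itself; minify_json is a hypothetical helper name. Unlike PHP's json_encode, Python's json.dumps adds spaces after separators by default, so the sketch passes compact separators explicitly.

```python
import json

def minify_json(text):
    # Parse the document, then serialize it again without optional whitespace.
    # separators=(",", ":") removes the spaces Python would add by default.
    return json.dumps(json.loads(text), separators=(",", ":"))

pretty = '{ "greeting": "hello, {name}", "items": [1, 2, {"a": null}] }'
print(minify_json(pretty))
# prints {"greeting":"hello, {name}","items":[1,2,{"a":null}]}
```

Because the text is parsed first, the ", {" inside the string value is left alone, which is exactly what the plain search-and-replace approach gets wrong.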

.tmlanguage escape sequences and rule priorities

I'm implementing a syntax highlighter in Apple's Swift language by parsing .tmlanguage files and applying styles to an NSMutableAttributedString.
I'm testing with JavaScript code, a javascript.tmlanguage file, and the monokai.tmtheme theme (the last two are included in Sublime Text 3) to check that the syntax gets highlighted correctly. By applying each rule (pattern) in the .tmlanguage file in the order they appear, the syntax is almost perfectly highlighted.
The problem I'm having right now is that I don't know how to tell that a quote (") should be treated as escaped when it has a backslash before it (\"). Am I missing something in the .tmlanguage file that specifies that? The other problem is that I have no idea how to tell that some rules should be ignored when inside others, for example:
Double slashes are taken as comments when inside strings: in "http://stackoverflow.com/", everything after // is recognised as a comment.
Double or single quotes are taken as strings when inside comments: in // press "Enter" to continue, the word "Enter" gets highlighted as a string when it should be the same color as the comment.
So, I don't know if there is some priority for some rules over others in the convention, or if there is something in the files that I haven't noticed.
Help please!
Update:
Here is a better example of what I meant by escape quotes:
I'm getting the wrong colors: all the letters should be yellow except for the escaped sequence (/"), which should be blue.
The question is: how do I know that /" should be escaped? The rule for that piece of code is:
Maybe I am late to answer this. You can apply the following method.
(Ugly) In your end regex, use ([^/])(") and in your endCaptures, it would be
1 = string.quote.double.js
2 = punctuation.definition.string.end.js
If the string must be single line, you can use match=(")(.*)("), captures=
1 = punctuation.definition.string.begin.js
2 = string.quote.double.js
3 = punctuation.definition.string.end.js
and use your patterns
You can try applyEndPatternLast and see if it is allowed. Setting applyEndPatternLast=1 will do.
The priority is that earlier rules in the file are prioritized over later rules. As an example, in my Python Improved language definition, I have a scope that contains a series of all-caps constants used in Django, a popular Python web framework. I also have a generic constant.other.allcaps.python scope that recognizes (just about) anything in all caps. Since the Django constants rule is before the allcaps rule in the .tmLanguage file, I can color it with a theme using one color, while the later-occurring "highlight everything in all caps" only grabs identifiers that are NOT part of the first list.
Because of this, you should put your "comments" scope(s) as early in the file as possible, then write your parser in such a way that it obeys the rule I described above. However, it's slightly more complicated than that, as I believe items in the repository are prioritized based on where their include line is, not where the repository rule is defined in the file. You may want to do some testing to verify that, though.
Unfortunately I'm not sure what you mean about the escaped quotes - could you expand on that, and maybe add an example or two?
Hope this helps.
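The ordering rule described above can be modelled with a small sketch. This is a simplified Python illustration, not the Swift highlighter from the question and not the full TextMate algorithm. The rule list and scope names are hypothetical. The two ideas it shows are that rules are tried in file order at each position, and that text consumed by one rule is never offered to another.

```python
import re

# Hypothetical, simplified rule list: (scope name, pattern), in file order.
RULES = [
    ("comment.line", re.compile(r"//.*")),
    ("string.quoted.double", re.compile(r'"(?:\\.|[^"\\])*"')),
    ("constant.numeric", re.compile(r"\d+")),
]

def tokenize(line):
    """Yield (scope, text) pairs; characters no rule matches are skipped."""
    pos = 0
    while pos < len(line):
        for scope, pattern in RULES:
            m = pattern.match(line, pos)  # only matches starting exactly at pos
            if m:
                yield scope, m.group()
                pos = m.end()  # consumed text is never offered to other rules
                break
        else:
            pos += 1  # no rule starts here; move on one character
```

Run on x = "http://stackoverflow.com/" // press "Enter", this yields one string token followed by one comment token. The // inside the string is never seen by the comment rule, and the quotes inside the comment are never seen by the string rule.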
Assuming that / is the correct character for escaping a double quote mark, the following should work:
"str_double_quote": {
    "begin": "\"",
    "end": "\"",
    "name": "string.quoted.double.swift",
    "patterns": [
        {
            "name": "constant.character.escape.swift",
            "match": "/[\"/]"
        }
    ]
}
You can match an escaped double quote mark (/") and a literal forward slash (//) in the patterns to consume them before the end marker is used to handle them.
If the character for escaping is actually a backslash, then the tricky bit is that there are two levels of escaping, for the JSON encoding as well as the regular expression syntax. To match \", the regular expression requires you to escape the backslash (\\"). JSON requires you to escape backslashes and double quotes, resulting in \\\\\" in a TextMate JSON grammar file. The match expression would thus be \\\\[\"\\\\].
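How the inner escape pattern keeps the end pattern from firing early can be sketched like this. It is a minimal Python model that assumes the backslash variant of the rule. The names BEGIN, END, ESCAPE and scan_string are hypothetical, and a real TextMate engine does considerably more.

```python
import re

BEGIN = re.compile(r'"')
END = re.compile(r'"')
ESCAPE = re.compile(r'\\["\\]')  # a backslash followed by a quote or a backslash

def scan_string(text, start):
    """Return (end_index, escape_spans) for the string beginning at text[start].

    Returns None if no string begins at start or the string is unterminated.
    """
    m = BEGIN.match(text, start)
    if not m:
        return None
    pos, escapes = m.end(), []
    while pos < len(text):
        if END.match(text, pos):
            return pos + 1, escapes
        esc = ESCAPE.match(text, pos)
        if esc:
            escapes.append(esc.span())
            pos = esc.end()  # jump past the escaped quote so END never sees it
        else:
            pos += 1
    return None
```

The escape starts at the backslash, one character before the quote, so it is matched first and swallows the quote. That is why the end pattern does not close the string at \".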

How to parse special characters in XML for iPad?

I am having a problem parsing XML files that contain special characters such as single and double quotes (', "). I am using NSXMLParser's parser:foundCharacters: method to collect characters in my code.
<synctext type = "word" >They raced to the park Arthur pointed to a sign "Whats that say" he asked Zoo said DW Easy as pie</synctext>
When I parse and save the text from the above tag of my XML file, the resulting string appears in GDB as
"\n\t\tThey raced to the park Arthur pointed to a sign \"Whats that say\" he asked Zoo said DW Easy as pie";
Observe there are 2 issues:
1) Unwanted characters at the beginning of the string.
2) The double quotes around Whats that say.
Can anyone please help me get rid of these unwanted characters and read special characters properly?
NSString *trimmed = [string stringByTrimmingCharactersInSet:[NSCharacterSet characterSetWithCharactersInString:@" \n\t"]];
The parser is apparently returning exactly what's in the string. That is, the XML was coded with the starting tag on one line, a newline, two tabs, and the start of the string. And quotes in the string are obviously there in the original (and it's not clear in at least this example why you'd want to delete them).
But if you want these characters gone then you need to post-process the string. You can use Rams' statement to eliminate the newline and tabs, and stringByReplacingOccurrencesOfString:withString: to zap the quotes.
(Note that some XML parsers can be instructed to return strings like this with the leading/trailing stuff stripped, but I'm not sure about this one. The quotes will always be there, though.)
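The same two post-processing steps can be illustrated outside of Cocoa. The sketch below uses Python's xml.etree.ElementTree rather than NSXMLParser. It assumes the tag content starts on a new line indented with tabs, as the GDB output suggests, and the surrounding page element is made up for the example. The backslashes before the quotes in the GDB output are most likely just how the debugger prints a quote inside a string, not characters in the data.

```python
import xml.etree.ElementTree as ET

# Assumed layout: the text starts on its own line, indented with two tabs.
xml_source = (
    '<page>\n\t<synctext type="word">\n'
    '\t\tThey raced to the park "Whats that say" he asked\n'
    '\t</synctext>\n</page>'
)

node = ET.fromstring(xml_source).find("synctext")
raw = node.text                         # includes the newline and tabs from the file
trimmed = raw.strip()                   # drop leading/trailing whitespace
without_quotes = trimmed.replace('"', "")  # only if the quotes are really unwanted
```

Here raw begins with a newline and two tabs because that whitespace is part of the element's text, which matches the answer's point that the parser returns exactly what is in the file.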

regex to get string within 2 strings

"artistName":"Travie McCoy", "collectionName":"Billionaire (feat. Bruno Mars) - Single", "trackName":"Billionaire (feat. Bruno Mars)",
I wish to get the artist name (Travie McCoy) from that code using regex. Please note I am using RegexKitLite for the iPhone SDK, if this changes things.
Thanks
"?artistName"?\s*:\s*"([^"]*)("|$) should do the trick. It even handles some variations in the string:
White space before and after the :
artistName with and without the quotes
missing " at the end of the artist name if it is the last thing on the line
But there will be many more variations in the input you might encounter that this regex will not match.
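As a quick check of the variations listed above, here is the same pattern exercised in Python instead of RegexKitLite. The helper name artist_name is hypothetical, and the two engines may differ in corner cases this sketch does not touch.

```python
import re

# The pattern from the answer, written as a Python raw string.
ARTIST = re.compile(r'"?artistName"?\s*:\s*"([^"]*)("|$)')

def artist_name(line):
    """Return the captured artist name, or None if the field is not found."""
    m = ARTIST.search(line)
    return m.group(1) if m else None
```

For the sample line in the question this returns Travie McCoy. It also returns Travie McCoy for artistName : "Travie McCoy, where the key is unquoted, there are spaces around the colon, and the closing quote is missing at the end of the line.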
Also you don’t want to use a regex for matching this for performance reasons. Right now you might only be interested in the artistName field. But some time later you will want information from the other fields. If you just change the field name in the regex you’ll have to match the whole string again. Much better to use a parser and transform the whole string into a dictionary where you can access the different fields easily. Parsing the whole string shouldn’t take much longer than matching the last key/value pair using a regex.
This looks like some kind of JSON, and there are lots of good and complete parsers available. It isn’t hard to write one yourself, though. You could write a simple recursive descent parser in a couple of hours. I think this is something every programmer should have done at least once.
\"?artistName\"?\\s*:\\s*\"([^\"]*)(\"|$)
That's the same pattern escaped for an Objective-C string literal.
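The parse-into-a-dictionary approach recommended above could look like the following. This is a Python sketch using the standard json module, not an iPhone SDK parser. It assumes the fragment is the inside of a JSON object, possibly with a trailing comma, and parse_fields is a hypothetical helper name.

```python
import json

fragment = (
    '"artistName":"Travie McCoy", '
    '"collectionName":"Billionaire (feat. Bruno Mars) - Single", '
    '"trackName":"Billionaire (feat. Bruno Mars)",'
)

def parse_fields(fragment):
    # Assumption: the fragment is the body of a JSON object with an optional
    # trailing comma, so wrapping it in braces makes it valid JSON.
    body = fragment.strip().rstrip(",")
    return json.loads("{" + body + "}")

fields = parse_fields(fragment)
print(fields["artistName"])  # prints Travie McCoy
```

Once the fields are in a dictionary, reading trackName or collectionName later costs a lookup rather than another pass over the string.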

Matching beginning of words in a NSString

Is there a method built in to NSString that tokenizes the string and searches the beginning of each token? the compare method seems to only do the beginning of a string, and using rangeOfString isn't really sufficient because it doesn't have knowledge of tokens. Right now I'm thinking the best way to do this is to call
[myString componentsSeparatedByString:@" "]
and then loop over the resulting array, calling compare on each component of the string. Is this built-in and I just missed it?
Using CFStringTokenizer for, um, tokenizing strings will be more robust than splitting on @" ", but searching the results is still left up to you.
You may want to look into RegexKit Lite:
http://regexkit.sourceforge.net/#RegexKitLite/
Although it's a third party library, it's basically a very small (one class) wrapper built around the built-in fairly powerful regular expression engine.
This seems more useful here, since non-capturing groups can match the token separators while a capturing group picks out the text you are looking for, with or without the rest of the token. If you have not used regular expressions much before, you'll want to read a reference first. The syntax is cryptic but very powerful, and it lets you separate the patterns you match from the content you want to extract.
I'm also not sure you can use CFStringTokenizer on the iPhone since the iPhone specific doc set has no reference for it.
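The regular-expression route can be sketched in a few lines. This is a Python illustration rather than RegexKitLite. The helper name has_word_with_prefix is hypothetical, and it assumes the prefix itself starts with a letter or digit, since the \b word boundary behaves differently otherwise.

```python
import re

def has_word_with_prefix(text, prefix):
    # \b anchors the match at a word boundary, so only the start of a word
    # can match; re.escape keeps the prefix from being read as a pattern.
    pattern = r"\b" + re.escape(prefix)
    return re.search(pattern, text, flags=re.IGNORECASE) is not None
```

For "The quick brown fox", the prefix bro matches at the start of brown, while row does not, because it only occurs in the middle of a word.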