Parsing a Gmail-style advanced search syntax?

Parsing a Gmail-style advanced search syntax? - perl

I want to parse a search string similar to that provided by Gmail using Perl. An example input would be "tag:thing by:{user1 user2} {-tag:a by:user3}". I want to put it into a tree structure, such as
{and => [
"tag:thing",
{or => [
"by:user1",
"by:user2",
]},
{or => [
{not => "tag:a"},
"by:user3",
]},
}
The general rules are:
Tokens separated by space default to the AND operator.
Tokens in braces are alternative options (OR). The braces can go before or after the field specifier. i.e. "by:{user1 user2}" and "{by:user1 by:user2}" are equivalent.
Tokens prefixed with a hyphen are excluded.
These elements can also be combined and nested: e.g. "{by:user5 -{tag:k by:user3}} etc".
I'm thinking of writing a context-free grammar to represent these rules, and then parsing it into the tree. Is this unnecessary? (Is this possible using simple regexps?)
What modules are recommended for doing parsing context-free grammars?
(Eventually this will be used to generate an database query with DBIx::Class.)

Regex doesn't do nested things (like parenthesis) very well. By the time you get your regex counting parenthesis and capturing correctly, you could probably have a decent CFG parser. CFGs can logically guarantee correct parsing, while with a regex solution you're leaving a lot up to the magic. I can't recommend any Perl CFG libaries, but coding one sounds very cathartic.

YAPP might do what you want. You can use it to generate and then use a LALR(1) Parsing Automaton.

If your query isn't tree structured, then regexes will do the job for you.
For example:
my $search = "tag:thing by:{user1 user2} {-tag:a by:user3}"
my #tokens = split /(?![^{]*})\s+/, $search;
foreach (#tokens) {
my $or = s/[{}]//g; # OR mode
my ($default_field_specifier) = /(\w+):/;
}
Even if your query is tree structured, regexes can make recursive parsing much more pleasant:
$_ = "by:{user1 z:{user2 3} } x {-tag:a by:user3} zz";
pos($_) = 0;
scan_query("");
sub scan_query {
my $default_specifier = shift;
while (/\G\s*((?:[-\w:]+)|(?={))({)?/gc) {
scan_query($1), next if $2;
my $query_token = $default_specifier . $1;
}
/\G\s*\}/gc;
}
Regexes are awesome :)!

Parse::Recdescent can generate parsers for this sort of thing. You probably need some experience with parsers to use it effectively though.

Related

Cleaner way of declaring many classes in perl

I´m working in a code which follows some steps, and each of this steps is done in one class. Right now my code looks like this:
use step_1_class;
use step_2_class;
use step_3_class;
use step_4_class;
use step_5_class;
use step_6_class;
use step_7_class;
...
use step_n_class;
my $o_step_1 = step_1_class->new(#args1);
my $o_step_2 = step_2_class->new(#args2);
my $o_step_3 = step_3_class->new(#args3);
my $o_step_4 = step_4_class->new(#args4,$args_4_1);
my $o_step_5 = step_5_class->new(#args5);
my $o_step_6 = step_6_class->new(#args6,$args_6_1);
my $o_step_7 = step_7_class->new(#args7);
...
my $o_step_n = step_n_class->new(#argsn);
Is there a cleaner way of declaring this somewhat similar classes wihtout using hundreds of lines?

Your use classes as written are equivalent to
BEGIN {
require step_1_class;
step_1_class->import() if step_1_class->can('import');
require step_2_class;
step_2_class->import() if step_2_class->can('import');
...
}
This can be rewritten as
BEGIN {
foreach my $i ( 1 .. $max_class ) {
eval "require step_${i}_class";
"step_${i}_class"->import() if "step_${i}_class"->can('import');
}
}
The new statements are a little more complex as you have separate variables and differing parameters, however this can be worked around by storing all the objects in an array and also preprocessing the parameters like so
my #steps;
my #parameters = ( undef, \#args1, \#args2, \#args3, [ #args4, $args_4_1], ...);
for ($i = 1; $i <= $max_class; $i++) {
push #steps, "step_${i}_class"->new(#{$parameters[$i]});
}

You can generate the use clauses in a Makefile. Generating the construction of the object will be more tricky as the arguments aren't uniform - but you can e.g. save the exceptions to a hash. This will make deployment more complex and searching the code tricky.
It might be wiser to rename each step by its purpose, and group the steps together logically to form a hierarchy instead of a plain sequence of steps.

What are all the Unicode properties a Perl 6 character will match?

The .uniprop returns a single property:
put join ', ', 'A'.uniprop;
I get back one property (the general category):
Lu
Looking around I didn't see a way to get all the other properties (including derived ones such as ID_Start and so on). What am I missing? I know I can go look at the data files, but I'd rather have a single method that returns a list.
I am mostly interested in this because regexes understand properties and match the right properties. I'd like to take any character and show which properties it will match.

"A".uniprop("Alphabetic") will get the Alphabetic property. Are you asking for what other properties are possible?
All these that have a checkmark by them will likely work. This just displays that status of roast testing for it https://github.com/perl6/roast/issues/195
This may more more useful for you, https://github.com/rakudo/rakudo/blob/master/src/core/Cool.pm6#L396-L483
The first hash is just mapping aliases for the property names to the full names. The second hash specifices whether the property is B for boolean, S for a string, I for integer, nv for numeric value, na for Unicode Name and a few other specials.
If I didn't understand you question please let me know and I will revise this answer.
Update: Seems you want to find out all the properties that will match. What you will want to do is iterate all of https://github.com/rakudo/rakudo/blob/master/src/core/Cool.pm6#L396-L483 and looking only at string, integer and boolean properties. Here is the full thing: https://gist.github.com/samcv/ae09060a781bb4c36ae6cac80ea9325f
sub MAIN {
use Test;
my $char = 'a';
my #result = what-matches($char);
for #result {
ok EVAL("'$char' ~~ /$_/"), "$char ~~ /$_/";
}
}
use nqp;
sub what-matches (Str:D $chr) {
my #result;
my %prefs = prefs();
for %prefs.keys -> $key {
given %prefs{$key} {
when 'S' {
my $propval = $chr.uniprop($key);
if $key eq 'Block' {
#result.push: "<:In" ~ $propval.trans(' ' => '') ~ ">";
}
elsif $propval {
#result.push: "<:" ~ $key ~ "<" ~ $chr.uniprop($key) ~ ">>";
}
}
when 'I' {
#result.push: "<:" ~ $key ~ "<" ~ $chr.uniprop($key) ~ ">>";
}
when 'B' {
#result.push: ($chr.uniprop($key) ?? "<:$key>" !! "<:!$key>");
}
}
}
#result;
}
sub prefs {
my %prefs = nqp::hash(
'Other_Grapheme_Extend','B','Titlecase_Mapping','tc','Dash','B',
'Emoji_Modifier_Base','B','Emoji_Modifier','B','Pattern_Syntax','B',
'IDS_Trinary_Operator','B','ID_Continue','B','Diacritic','B','Cased','B',
'Hangul_Syllable_Type','S','Quotation_Mark','B','Radical','B',
'NFD_Quick_Check','S','Joining_Type','S','Case_Folding','S','Script','S',
'Soft_Dotted','B','Changes_When_Casemapped','B','Simple_Case_Folding','S',
'ISO_Comment','S','Lowercase','B','Join_Control','B','Bidi_Class','S',
'Joining_Group','S','Decomposition_Mapping','S','Lowercase_Mapping','lc',
'NFKC_Casefold','S','Simple_Lowercase_Mapping','S',
'Indic_Syllabic_Category','S','Expands_On_NFC','B','Expands_On_NFD','B',
'Uppercase','B','White_Space','B','Sentence_Terminal','B',
'NFKD_Quick_Check','S','Changes_When_Titlecased','B','Math','B',
'Uppercase_Mapping','uc','NFKC_Quick_Check','S','Sentence_Break','S',
'Simple_Titlecase_Mapping','S','Alphabetic','B','Composition_Exclusion','B',
'Noncharacter_Code_Point','B','Other_Alphabetic','B','XID_Continue','B',
'Age','S','Other_ID_Start','B','Unified_Ideograph','B','FC_NFKC_Closure','S',
'Case_Ignorable','B','Hyphen','B','Numeric_Value','nv',
'Changes_When_NFKC_Casefolded','B','Expands_On_NFKD','B',
'Indic_Positional_Category','S','Decomposition_Type','S','Bidi_Mirrored','B',
'Changes_When_Uppercased','B','ID_Start','B','Grapheme_Extend','B',
'XID_Start','B','Expands_On_NFKC','B','Other_Uppercase','B','Other_Math','B',
'Grapheme_Link','B','Bidi_Control','B','Default_Ignorable_Code_Point','B',
'Changes_When_Casefolded','B','Word_Break','S','NFC_Quick_Check','S',
'Other_Default_Ignorable_Code_Point','B','Logical_Order_Exception','B',
'Prepended_Concatenation_Mark','B','Other_Lowercase','B',
'Other_ID_Continue','B','Variation_Selector','B','Extender','B',
'Full_Composition_Exclusion','B','IDS_Binary_Operator','B','Numeric_Type','S',
'kCompatibilityVariant','S','Simple_Uppercase_Mapping','S',
'Terminal_Punctuation','B','Line_Break','S','East_Asian_Width','S',
'ASCII_Hex_Digit','B','Pattern_White_Space','B','Hex_Digit','B',
'Bidi_Paired_Bracket_Type','S','General_Category','S',
'Grapheme_Cluster_Break','S','Grapheme_Base','B','Name','na','Ideographic','B',
'Block','S','Emoji_Presentation','B','Emoji','B','Deprecated','B',
'Changes_When_Lowercased','B','Bidi_Mirroring_Glyph','bmg',
'Canonical_Combining_Class','S',
);
}

OK, so here's another take on answering this question, but the solution is not perfect. Bring the downvotes!
If you join #perl6 channel on freenode, there's a bot called unicodable6 which has functionality that you may find useful. You can ask it to do this (e.g. for character A and π simultaneously):
<AlexDaniel> propdump: Aπ
<unicodable6> AlexDaniel, https://gist.github.com/b48e6062f3b0d5721a5988f067259727
Not only it shows the value of each property, but if you give it more than one character it will also highlight the differences!
Yes, it seems like you're looking for a way to do that within perl 6, and this answer is not it. But in the meantime it's very useful. Internally Unicodable just iterates through this list of properties. So basically this is identical to the other answer in this thread.
I think someone can make a module out of this (hint-hint), and then the answer to your question will be “just use module Unicode::Propdump”.

What's wrong with my Meteor publication?

I have a publication, essentially what's below:
Meteor.publish('entity-filings', function publishFunction(cik, queryArray, limit) {
if (!cik || !filingsArray)
console.error('PUBLICATION PROBLEM');
var limit = 40;
var entityFilingsSelector = {};
if (filingsArray.indexOf('all-entity-filings') > -1)
entityFilingsSelector = {ct: 'filing',cik: cik};
else
entityFilingsSelector = {ct:'filing', cik: cik, formNumber: { $in: filingsArray} };
return SB.Content.find(entityFilingsSelector, {
limit: limit
});
});
I'm having trouble with the filingsArray part. filingsArray is an array of regexes for the Mongo $in query. I can hardcode filingsArray in the publication as [/8-K/], and that returns the correct results. But I can't get the query to work properly when I pass the array from the router. See the debugged contents of the array in the image below. The second and third images are the client/server debug contents indicating same content on both client and server, and also identical to when I hardcode the array in the query.
My question is: what am I missing? Why won't my query work, or what are some likely reasons it isn't working?

In that first screenshot, that's a string that looks like a regex literal, not an actual RegExp object. So {$in: ["/8-K/"]} will only match literally "/8-K/", which is not the same as {$in: [/8-K/]}.
Regexes are not EJSON-able objects, so you won't be able to send them over the wire as publish function arguments or method arguments or method return values. I'd recommend sending a string, then inside the publish function, use new RegExp(...) to construct a regex object.
If you're comfortable adding new methods on the RegExp prototype, you could try making RegExp an EJSON-able type, by putting this in your server and client code:
RegExp.prototype.toJSONValue = function () {
return this.source;
};
RegExp.prototype.typeName = function () {
return "regex";
}
EJSON.addType("regex", function (str) {
return new RegExp(str);
});
After doing this, you should be able to use regexes as publish function arguments, method arguments and method return values. See this meteorpad.

/8-K/.. that's a weird regex. Try /8\-K/.
A minus (-) sign is a range indicator and usually used inside square brackets. The reason why it's weird because how could you even calculate a range between 8 and K? If you do not escape that, it probably wouldn't be used to match anything (thus your query would not work). Sometimes, it does work though. Better safe than never.
/8\-K/ matches the string "8-K" anywhere once.. which I assume you are trying to do.
Also it would help if you would ensure your publication would always return something.. here's a good area where you could fail:
if (!cik || !filingsArray)
console.error('PUBLICATION PROBLEM');
If those parameters aren't filled, console.log is probably not the best way to handle it. A better way:
if (!cik || !filingsArray) {
throw "entity-filings: Publication problem.";
return false;
} else {
// .. the rest of your publication
}
This makes sure that the client does not wait unnecessarily long for publications statuses as you have successfully ensured that in any (input) case you returned either false or a Cursor and nothing in between (like surprise undefineds, unfilled Cursors, other garbage data.

How do I merge multiple FLV files using FLV::Splice module?

If I have an array that contains multiple FLV files like so:
my #src_files = qw (1.flv 2.flv 3.flv 4.flv 5.flv 6.flv 7.flv);
I can call flvbind.exe as an external commandline program to do the merging like so:
my $args = join(' ',#src_files);
my $dest_file = 'merged.flv';
system "flvbind $dest_file $args";
But the usage example for FLV::Splice is this:
use FLV::Splic;
my $converter = FLV::Splice->new();
$converter->add_input('first.flv');
$converter->add_input('second.flv');
$converter->save('output.flv');
I seem to be unable to figure out how to do the same thing like the flvbind does.
Do I have to verbosely add every param like so:
use FLV::Splice;
my $dest_file = 'merged.flv';
my $converter = FLV::Splice->new();
$converter->add_input('1.flv');
$converter->add_input('2.flv');
$converter->add_input('3.flv');
$converter->add_input('4.flv');
$converter->add_input('5.flv');
$converter->add_input('6.flv');
$converter->add_input('7.flv');
$converter->save("$dest_file");
This does not look like the correct way. I mean do I have to verbosely add every param? Is there a way to simplify the repeated use of the add_input method? Any pointers? Thanks like always :)
UPDATE:
It turns out that this is a silly question. Thanks, #Eric for giving me the correct answer. I thought of using a for loop to reduce the repeated use of the add_input method but somehow I thought it would not work and I thought I was stuck. Well, I'll be reminding myself not to easily bother other people next time.

It's fairly easy to reduce the repetition:
$converter->add_input($_) for #src_files;
Or wrap the whole thing in a subroutine:
sub flv_splice {
my $dest_file = shift;
my $converter = FLV::Splice->new();
$converter->add_input($_) for #_;
$converter->save($dest_file);
}
flv_splice 'merged.flv', #src_files;

How should I treat strings of digits in XML::RPC and Drupal?

I am trying to use an XML-RPC server on my Drupal (PHP) backend to make it easier for my Perl backend to talk to it. However, I've run into an issue and I'm not sure which parts, if any, are bugs. Essentially, some of the variables I need to pass to Drupal are strings that sometimes are strings full of numbers and the Drupal XML-RPC server is returning an error that when a string is full of numbers it is not properly formed.
My Perl code looks something like this at the moment.
use strict;
use warnings;
use XML::RPC;
use Data::Dumper;
my $xmlrpc = XML::RPC->new(URL);
my $result = $xmlrpc->call( FUNCTION, 'hello world', '9876352345');
print Dumper $result;
The output is:
$VAR1 = {
'faultString' => 'Server error. Invalid method parameters.',
'faultCode' => '-32602'
};
When I have the Drupal XML-RPC server print out the data it receives, I notice that the second argument is typed as i4:
<param>
<value>
<i4>9876352345</i4>
</value>
I think when Drupal then finishes processing the item, it is typing that variable as an int instead of a string. This means when Drupal later tries to check that the variable value is properly formed for a string, the is_string PHP function returns false.
foreach ($signature as $key => $type) {
$arg = $args[$key];
switch ($type) {
case 'int':
case 'i4':
if (is_array($arg) || !is_int($arg)) {
$ok = FALSE;
}
break;
case 'base64':
case 'string':
if (!is_string($arg)) {
$ok = FALSE;
}
break;
case 'boolean':
if ($arg !== FALSE && $arg !== TRUE) {
$ok = FALSE;
}
break;
case 'float':
case 'double':
if (!is_float($arg)) {
$ok = FALSE;
}
break;
case 'date':
case 'dateTime.iso8601':
if (!$arg->is_date) {
$ok = FALSE;
}
break;
}
if (!$ok) {
return xmlrpc_error(-32602, t('Server error. Invalid method parameters.'));
}
}
What I'm not sure about is on which side of the divide the issue lies or if there is something else I should be using. Should the request from the Perl side be typing the content as a string instead of i4 or is the Drupal side of the request too stringent for the string type? My guess is that the issue is the latter, but I don't know enough about how an XML-RPC server is supposed to work to know for sure.

The number 9876352345 is too big to fit in a 32bit integer. That might cause the problem.

are you using frontier? perhaps you could declare the string explicitly?
my $result =
$xmlrpc->call( FUNCTION, 'hello world', $xmlrpc->string('9876352345') );
info from the client docs:
By default, you may pass ordinary Perl values (scalars) to be encoded. RPC2 automatically converts them to XML-RPC types if they look like an integer, float, or as a string. This assumption causes problems when you want to pass a string that looks like "0096", RPC2 will convert that to an because it looks like an integer.

I don't have any experience with the XML::RPC package, but I'm the author of the RPC::XML CPAN module. As with the Frontier package, I provide a way to force a value into a specific type when it would otherwise default to something else.
If I had to guess, I would say that the package you're using simple does a regular-expression match on the data to decide how to type it. I had a similar problem with my package, and given the way Perl handles scalar values the only real way around it is to force it with explicit declaration. As a previous answerer pointed out, the value in question is actually outside the range of the <i4> type (which is a signed 32-bit value). So even if you had intended it to be an integer value, it would have been invalid with regards to the XML-RPC spec.
I would recommend switching to one of the other XML-RPC packages, which have clearer ways of explicitly typing data. According to the docs for XML::RPC, it is possible to force the typing of data, but I found it to be unclear and not very well explained.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Parsing a Gmail-style advanced search syntax? - perl

YAPP might do what you want. You can use it to generate and then use a LALR(1) Parsing Automaton.

Parse::Recdescent can generate parsers for this sort of thing. You probably need some experience with parsers to use it effectively though.

Related

Cleaner way of declaring many classes in perl

What are all the Unicode properties a Perl 6 character will match?

What's wrong with my Meteor publication?

How do I merge multiple FLV files using FLV::Splice module?

How should I treat strings of digits in XML::RPC and Drupal?

Categories

Resources