Perl & MongoDB binary data

Perl & MongoDB binary data - perl

From the MongoDB manual:
By default, all database strings are UTF8. To save images, binaries,
and other non-UTF8 data, you can pass the string as a reference to the
database.
I'm fetching pages and want store the content for later processing.
I can not rely on meta-charset, because many pages has utf8 content but wrongly declaring iso-8859-1 or similar
so can't use Encode (don't know the originating charset)
therefore, I want store the content simply as flow of bytes (binary data) for later processing
Fragment of my code:
sub save {
my ($self, $ok, $url, $fetchtime, $request ) = #_;
my $rawhead = $request->headers_as_string;
my $rawbody = $request->content;
$self->db->content->insert(
{ "url" => $url, "rhead" => \$rawhead, "rbody" => \$rawbody } ) #using references here
if $ok;
$self->db->links->update(
{ "url" => $url },
{
'$set' => {
'status' => $request->code,
'valid' => $ok,
'last_checked' => time(),
'fetchtime' => $fetchtime,
}
}
);
}
But get error:
Wide character in subroutine entry at
/opt/local/lib/perl5/site_perl/5.14.2/darwin-multi-2level/MongoDB/Collection.pm
line 296.
This is the only place where I storing data.
The question: The only way store binary data in MondoDB is encode them e.g. with base64?

It looks like another sad story about _utf8_ flag...
I may be wrong, but it seems that headers_as_string and content methods of HTTP::Message return their strings as a sequence of characters. But MongoDB driver expects the strings explicitly passed to it as 'binaries' to be a sequence of octets - hence the warning drama.
A rather ugly fix is to take down the utf8 flag on $rawhead and $rawbody in your code (I wonder shouldn't it be really done by MongoDB driver itself?), by something like this...
_utf8_off $rawhead;
_utf8_off $rawbody; # ugh
The alternative is to use encode('utf8', $rawhead) - but then you should use decode when extracting values from DB, and I doubt it's not uglier.

Your data is characters, not octets. Your assumption seems to be that you are just passing things through as octets, but you must have violated that assumption earlier somehow by decoding incoming text data, perhaps even without you noticing.
So simply do not decode, data stay octets, storing into the db won't fail.

Related

Convert Hash To Array in Perl Catalyst

I need help regarding handling of Perl variables. Here I am getting input as a hash. I now need to send this hash variable to another subroutine. How can pass data as an argument to another subroutine? The code below shows how I am approaching this:
if ($csData->{'CUSTOMER_INVOICE_DETAILS'})
{
$c->log->debug("API Response:". Dumper $csData->{'CUSTOMER_INVOICE_DETAILS'});
my $Charges = [];
my #customerCharges = $csData->{'CUSTOMER_INVOICE_DETAILS'};
foreach(#customerCharges)
{
my ($customername,$customeramount) = split /:/;
my $charge_hash = ({
customername => $customername,
customeramount => $customeramount
});
push(#$Charges, $charge_hash);
}
my #ReturnCharges = $self->API->get_customer_charges($Charges, $Customer->customerid, $params->{'invoiceid'});
The other subroutine where this data is being received is as follows:
sub get_customer_charges
{
my $self = shift;
my ($charge, $CustomerId, $INID) = #_;
my $http_request = {
action => 'GetTariff',
customerid => $CustomerId,
csid => $INID,
};
my $markups = $self->APIRequest($http_request);
###Charge Level ID Inserting As 10
my #ChargeLevels;
my #BaseLevelID;
foreach my $ch (#$charge)
{
my ($customername,$customeramount) = split(':', $ch->{'customername'}, $ch->{'customername'});
my $chargelevel = join(':', $ch->{'customername'}, $ch->{'customeramount'}, '10');
push(#BaseLevelID, $chargelevel);
}
push(#ChargeLevels, #BaseLevelID);
return #ChargeLevels;
}
When I print to the server log for CUSTOMER_INVOICE_DETAILS variable I am getting the following values:
API Response:$VAR1 = {
'Product' => '34.04',
'basetax' => '2.38',
'vattax' => '4.36'
};
After sending data to second subroutine the data coming in server log for second subroutine variable is as following:
Charges in API:$VAR1 = 'HASH(0xb75d6d8)::10';
Can anyone help how could I send the hash data from one subroutine to another?

Given your comments and that your source is:
API Response:$VAR1 = {
'Product' => '34.04',
'basetax' => '2.38',
'vattax' => '4.36'
};
And you're looking for:
API Response:$VAR1 = { 34.04:2.38:4.36:10 };
(and somehow you're getting:
Charges in API:$VAR1 = 'HASH(0xb75d6d8)::10';
This suggests this may be as simple as using the values system call. values extracts an array of all the values in the hash. Something like this (guessing a bit on which part of your code needs it).
my #list_of_values = values ( %{$csData->{'CUSTOMER_INVOICE_DETAILS'}} );

You say you want to "convert" a hash to an array, but your issue seems more complex and subtle so simple conversion is not likely what will solve your problem. Something in your subroutine is returning a hash reference when the rest of your code does not expect it to do so. If the data-structure you are passing contains the correct information but not in the form you expect, then you can either change the code to produce it in the expected form (e.g. to return an ARRAY) or change your subroutine so that it is able to handle the data that it is passed correctly.
As for "converting a hash" per se, if your data structure doesn't contain complex nested references and all you want to do is "convert" your hash to an array or list, then you can simply assign the hash to an array. Perhaps I'm not understanding your question but if this kind of simple "flattening" is all you want then you could try:
my $customer_purchase = {
'Product' => '34.04',
'basetax' => '2.38',
'vattax' => '4.36'
};
my #flat_customer_purchase = %{ $customer_purchase };
say "#flat_customer_purchase" ;
Output:
basetax 2.38 Product 34.04 vattax 4.36
You can then supply the hash data as the "array" to the second subroutine. e.g. treat #flat_customer_purchase as a list:
use List::AllUtils ':all';
say join " ", pairkeys #flat_customer_purchase
# basetax Product vattax
say join " ", pairvalues #flat_customer_purchase
# 2.38 34.04 4.36
etc.
NB: this assumes that for some reason you must pass an array. The example of running the pairvalues routine simply replicates #Sobrique's suggestion to use values directly on the hash you are passing but in my answer this grabs the values pairs from the array instead of the hash.
My sense is that there is more to the question. If API Response is a more complicated hash/object or, if for some other reason this basic perl doesn't work, then you will have to supply more information. You need to find out where your unexpected hash reference is coming from before you can decide how to handle it. You might find this SO discussion helpful:
Are Perl subroutines call-by-reference or call-by-value?

Output '{{$NEXT}}' with Text::Template

The NextRelease plugin for Dist::Zilla looks for {{$NEXT}} in the Changes file to put the release datetime information. However, I can't get this to be generated using my profile.ini. Here's what I have:
[GenerateFile / Generate-Changes ]
filename = Changes
is_template = 1
content = Revision history for {{$dist->name}}
content =
;todo: how can we get this to print correctly with a template?
content = {{$NEXT}}
content = initial release
{{$dist->name}} is correctly replaced with my distribution name, but {{$NEXT}} is, as-is, replaced with nothing (since it's not escaped and there is no $NEXT variable in). I've tried different combinations of slashes to escape the braces, but it either results in nothing or an error during generation with dzil new. How can I properly escape this string so that after dzil processes it with Text::Template it outputs {{$NEXT}}?

In the content, {{$NEXT}} is being interpreted as a template block and, as you say, wants to fill itself in as the contents of the missing $NEXT.
Instead, try:
content = {{'{{$NEXT}}'}}
Example program:
use 5.14.0;
use Text::Template 'fill_in_string';
my $result = fill_in_string(
q<{{'{{$NEXT}}'}}>,
DELIMITERS => [ '{{', '}}' ],
BROKEN => sub { die },
);
say $result;

What is the character encoding?

I have several characters that aren't recognized properly.
Characters like:
º
á
ó
(etc..)
This means that the characters encoding is not utf-8 right?
So, can you tell me what character encoding could it be please.

We don't have nearly enough information to really answer this, but the gist of it is: you shouldn't just guess. You need to work out where the data is coming from, and find out what the encoding is. You haven't told us anything about the data source, so we're completely in the dark. You might want to try Encoding.Default if these are files saved with something like Notepad.
If you know what the characters are meant to be and how they're represented in binary, that should suggest an encoding... but again, we'd need to know more information.

read this first http://www.joelonsoftware.com/articles/Unicode.html
There are two encodings: the one that was used to encode string and one that is used to decode string. They must be the same to get expected result. If they are different then some characters will be displayed incorrectly. we can try to guess if you post actual and expected results.

I wrote a couple of methods to narrow down the possibilities a while back for situations just like this.
static void Main(string[] args)
{
Encoding[] matches = FindEncodingTable('Ÿ');
Encoding[] enc2 = FindEncodingTable(159, 'Ÿ');
}
// Locates all Encodings with the specified Character and position
// "CharacterPosition": Decimal position of the character on the unknown encoding table. E.G. 159 on the extended ASCII table
//"character": The character to locate in the encoding table. E.G. 'Ÿ' on the extended ASCII table
static Encoding[] FindEncodingTable(int CharacterPosition, char character)
{
List matches = new List();
byte myByte = (byte)CharacterPosition;
byte[] bytes = { myByte };
foreach (EncodingInfo encInfo in Encoding.GetEncodings())
{
Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
char[] chars = thisEnc.GetChars(bytes);
if (chars[0] == character)
{
matches.Add(thisEnc);
break;
}
}
return matches.ToArray();
}
// Locates all Encodings that contain the specified character
static Encoding[] FindEncodingTable(char character)
{
List matches = new List();
foreach (EncodingInfo encInfo in Encoding.GetEncodings())
{
Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
char[] chars = { character };
byte[] temp = thisEnc.GetBytes(chars);
if (temp != null)
matches.Add(thisEnc);
}
return matches.ToArray();
}

Encoding is the form of modifying some existing content; thus allowing it to be parsed by the required destination protocols.
An example of encoding can be seen when browsing the internet:
The URL you visit: www.example.com, may have the search facility to run custom searches via the URL address:
www.example.com?search=...
The following variables on the URL require URL encoding. If you was to write:
www.example.com?search=cat food cheap
The browser wouldn't understand your request as you have used an invalid character of ' ' (a white space)
To correct this encoding error you should exchange the ' ' with '%20' to form this URL:
www.example.com?search=cat%20food%20cheap
Different systems use different forms of encoding, in this example I have used standard Hex encoding for a URL. In other applications and instances you may find the need to use other types of encoding.
Good Luck!

Parsing a Gmail-style advanced search syntax?

I want to parse a search string similar to that provided by Gmail using Perl. An example input would be "tag:thing by:{user1 user2} {-tag:a by:user3}". I want to put it into a tree structure, such as
{and => [
"tag:thing",
{or => [
"by:user1",
"by:user2",
]},
{or => [
{not => "tag:a"},
"by:user3",
]},
}
The general rules are:
Tokens separated by space default to the AND operator.
Tokens in braces are alternative options (OR). The braces can go before or after the field specifier. i.e. "by:{user1 user2}" and "{by:user1 by:user2}" are equivalent.
Tokens prefixed with a hyphen are excluded.
These elements can also be combined and nested: e.g. "{by:user5 -{tag:k by:user3}} etc".
I'm thinking of writing a context-free grammar to represent these rules, and then parsing it into the tree. Is this unnecessary? (Is this possible using simple regexps?)
What modules are recommended for doing parsing context-free grammars?
(Eventually this will be used to generate an database query with DBIx::Class.)

Regex doesn't do nested things (like parenthesis) very well. By the time you get your regex counting parenthesis and capturing correctly, you could probably have a decent CFG parser. CFGs can logically guarantee correct parsing, while with a regex solution you're leaving a lot up to the magic. I can't recommend any Perl CFG libaries, but coding one sounds very cathartic.

YAPP might do what you want. You can use it to generate and then use a LALR(1) Parsing Automaton.

If your query isn't tree structured, then regexes will do the job for you.
For example:
my $search = "tag:thing by:{user1 user2} {-tag:a by:user3}"
my #tokens = split /(?![^{]*})\s+/, $search;
foreach (#tokens) {
my $or = s/[{}]//g; # OR mode
my ($default_field_specifier) = /(\w+):/;
}
Even if your query is tree structured, regexes can make recursive parsing much more pleasant:
$_ = "by:{user1 z:{user2 3} } x {-tag:a by:user3} zz";
pos($_) = 0;
scan_query("");
sub scan_query {
my $default_specifier = shift;
while (/\G\s*((?:[-\w:]+)|(?={))({)?/gc) {
scan_query($1), next if $2;
my $query_token = $default_specifier . $1;
}
/\G\s*\}/gc;
}
Regexes are awesome :)!

Parse::Recdescent can generate parsers for this sort of thing. You probably need some experience with parsers to use it effectively though.

How should I treat strings of digits in XML::RPC and Drupal?

I am trying to use an XML-RPC server on my Drupal (PHP) backend to make it easier for my Perl backend to talk to it. However, I've run into an issue and I'm not sure which parts, if any, are bugs. Essentially, some of the variables I need to pass to Drupal are strings that sometimes are strings full of numbers and the Drupal XML-RPC server is returning an error that when a string is full of numbers it is not properly formed.
My Perl code looks something like this at the moment.
use strict;
use warnings;
use XML::RPC;
use Data::Dumper;
my $xmlrpc = XML::RPC->new(URL);
my $result = $xmlrpc->call( FUNCTION, 'hello world', '9876352345');
print Dumper $result;
The output is:
$VAR1 = {
'faultString' => 'Server error. Invalid method parameters.',
'faultCode' => '-32602'
};
When I have the Drupal XML-RPC server print out the data it receives, I notice that the second argument is typed as i4:
<param>
<value>
<i4>9876352345</i4>
</value>
I think when Drupal then finishes processing the item, it is typing that variable as an int instead of a string. This means when Drupal later tries to check that the variable value is properly formed for a string, the is_string PHP function returns false.
foreach ($signature as $key => $type) {
$arg = $args[$key];
switch ($type) {
case 'int':
case 'i4':
if (is_array($arg) || !is_int($arg)) {
$ok = FALSE;
}
break;
case 'base64':
case 'string':
if (!is_string($arg)) {
$ok = FALSE;
}
break;
case 'boolean':
if ($arg !== FALSE && $arg !== TRUE) {
$ok = FALSE;
}
break;
case 'float':
case 'double':
if (!is_float($arg)) {
$ok = FALSE;
}
break;
case 'date':
case 'dateTime.iso8601':
if (!$arg->is_date) {
$ok = FALSE;
}
break;
}
if (!$ok) {
return xmlrpc_error(-32602, t('Server error. Invalid method parameters.'));
}
}
What I'm not sure about is on which side of the divide the issue lies or if there is something else I should be using. Should the request from the Perl side be typing the content as a string instead of i4 or is the Drupal side of the request too stringent for the string type? My guess is that the issue is the latter, but I don't know enough about how an XML-RPC server is supposed to work to know for sure.

The number 9876352345 is too big to fit in a 32bit integer. That might cause the problem.

are you using frontier? perhaps you could declare the string explicitly?
my $result =
$xmlrpc->call( FUNCTION, 'hello world', $xmlrpc->string('9876352345') );
info from the client docs:
By default, you may pass ordinary Perl values (scalars) to be encoded. RPC2 automatically converts them to XML-RPC types if they look like an integer, float, or as a string. This assumption causes problems when you want to pass a string that looks like "0096", RPC2 will convert that to an because it looks like an integer.

I don't have any experience with the XML::RPC package, but I'm the author of the RPC::XML CPAN module. As with the Frontier package, I provide a way to force a value into a specific type when it would otherwise default to something else.
If I had to guess, I would say that the package you're using simple does a regular-expression match on the data to decide how to type it. I had a similar problem with my package, and given the way Perl handles scalar values the only real way around it is to force it with explicit declaration. As a previous answerer pointed out, the value in question is actually outside the range of the <i4> type (which is a signed 32-bit value). So even if you had intended it to be an integer value, it would have been invalid with regards to the XML-RPC spec.
I would recommend switching to one of the other XML-RPC packages, which have clearer ways of explicitly typing data. According to the docs for XML::RPC, it is possible to force the typing of data, but I found it to be unclear and not very well explained.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse