Perl - XML::LibXML: bad parsing performance on Apache2 default page

I was testing some code that includes XML parsing. For a simple test I requested / from my localhost, and the response was my Apache2 default page.
So far, so good.
The response is XHTML and therefore XML, so I used it as parser input (about 11k in size).
XML::LibXML->load_xml (string => $response);
It takes about 16 s to finish, with no error.
If I give it another XML file of twice the size, it takes practically no time.
So... why?
Apache/2.4.10
Debian/8.6
XML::LibXML/2.0128
EDIT
I need to mention that I removed the non-XML HTTP-header.
So the string starts with
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
and ends with
</html>
EDIT
Link: http://s000.tinyupload.com/index.php?file_id=88759644475809123183

One possibility is that each time you parse the document the parser is downloading the DTD from W3C. You could confirm this using strace or similar tools depending on your platform.
The DTD contains (among other things) the named entity definitions, which map, for example, the string &nbsp; to the character U+00A0. So in order to parse HTML documents, the parser does need the DTD; however, fetching it via HTTP each time is obviously not a good idea.
One approach is to install a copy of the DTD locally and use that. On Debian/Ubuntu systems you can just install the w3c-dtd-xhtml package which also sets up the appropriate XML catalog entries to allow libxml to find it.
Another approach is to use XML::LibXML->load_html instead of XML::LibXML->load_xml. In HTML parsing mode, the parser is more forgiving of markup errors and I think also always uses a local copy of the DTD.
The parser also provides options which allow you to specify your own handler routine for retrieving reference URIs.

Related

Encoding a GPX file such that it's accepted by the /matchroute endpoint of the Here API

I am trying to call the resource /matchroute via a GET request.
However, I can't figure out how to encode the GPX file so that the resource accepts my request: I always receive HTTP error 400 as a response from the Here server.
As example data I used the following file:
<?xml version="1.0"?>
<gpx version="1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.topografix.com/GPX/1/0"
xsi:schemaLocation="http://www.topografix.com/GPX/1/0
http://www.topografix.com/GPX/1/0/gpx.xsd">
<trk>
<trkseg>
<trkpt lat="51.10177" lon="0.39349"/>
<trkpt lat="51.10181" lon="0.39335"/>
<trkpt lat="51.10255" lon="0.39366"/>
<trkpt lat="51.10398" lon="0.39466"/>
<trkpt lat="51.10501" lon="0.39533"/>
</trkseg>
</trk>
</gpx>
which I took from this example.
I encoded this file using MATLAB's function matlab.net.base64encode which yielded the following base64-encoded string:
PD94bWwgdmVyc2lvbj0iMS4wIj8+PGdweCB2ZXJzaW9uPSIxLjAieG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8y
MDAxL1hNTFNjaGVtYS1pbnN0YW5jZSJ4bWxucz0iaHR0cDovL3d3dy50b3BvZ3JhZml4LmNvbS9HUFgvMS8wInhzaTpz
Y2hlbWFMb2NhdGlvbj0iaHR0cDovL3d3dy50b3BvZ3JhZml4LmNvbS9HUFgvMS8wIGh0dHA6Ly93d3cudG9wb2dyYWZp
eC5jb20vR1BYLzEvMC9ncHgueHNkIj48dHJrPjx0cmtzZWc+PHRya3B0IGxhdD0iNTEuMTAxNzciIGxvbj0iMC4zOTM0
OSIvPjx0cmtwdCBsYXQ9IjUxLjEwMTgxIiBsb249IjAuMzkzMzUiLz48dHJrcHQgbGF0PSI1MS4xMDI1NSIgbG9uPSIw
LjM5MzY2Ii8+PHRya3B0IGxhdD0iNTEuMTAzOTgiIGxvbj0iMC4zOTQ2NiIvPjx0cmtwdCBsYXQ9IjUxLjEwNTAxIiBs
b249IjAuMzk1MzMiLz48L3Rya3NlZz48L3Ryaz48L2dweD4=
However, as stated before, the HERE server consistently responds with HTTP-error 400 to my request
https://rme.api.here.com/2/matchroute.json?app_id={app_id}&app_code={app_code}&routemode=car&file=...
where "..." equals the above mentioned base64-encoded string.
Question: Could anyone please provide a code sample showing how to encode the above mentioned GPX file correctly (ideally in MATLAB language) so that the /matchroute resource is able to respond?
Remarks:
If I use the base64 string
UEsDBBQAAAAIANmztEQSwaeZzwAAAM8BAAAQAAAAc2FtcGxlLXRyYWNlLmdweIXPTQuCMBwG8HufQnZv%2F605S0k9dj
EIungdZjpSJ27kPn6%2BRBgYXcYYv2cPzzG2deU8805L1YSIYoLiaHMsWvv9uBlYowOrZYhKY9oAoO973DOsugJ2hFBI
z8k1K%2FNabGWjjWiy%2FJ36ShjVqqITd2lxpmo4XVKgMP6vZaCneKIyYabivzHnr4BhCbb6hoZRpnvMp86L%2BdIapx
ImRJxiSuh%2Bj5xq7CWY%2Bcz1EaypA10qxlfVjvOl8rxVxfzDQrk%2FFCfLRs7YpOCzA%2BZd49LoBVBLAQIUABQAAA
AIANmztEQSwaeZzwAAAM8BAAAQAAAAAAAAAAEAIAAAAAAAAABzYW1wbGUtdHJhY2UuZ3B4UEsFBgAAAAABAAEAPgAAAP
0AAAAAAA%3D%3D
from this example the GET request works. However, I couldn't figure out how to reproduce this encoding myself so that I am able to encode my own data accordingly.
Link to the Here API definition: https://developer.here.com/documentation/route-match/topics/resource-matchroute-request.html
Looking at the two base64 strings, I can tell you the fundamental difference between them: the first one (which doesn't work) is not URL-escaped, whereas the second one (which works) is.
You can convert between the two formats manually using various online tools like this one. The escaped version of the non-working base64 string, in case you want to test it, is:
PD94bWwgdmVyc2lvbj0iMS4wIj8+PGdweCB2ZXJzaW9uPSIxLjAieG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8y
%0AMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSJ4bWxucz0iaHR0cDovL3d3dy50b3BvZ3JhZml4LmNvbS9HUFgvMS8wInhza
Tpz%0AY2hlbWFMb2NhdGlvbj0iaHR0cDovL3d3dy50b3BvZ3JhZml4LmNvbS9HUFgvMS8wIGh0dHA6Ly93d3cudG9wb2
dyYWZp%0AeC5jb20vR1BYLzEvMC9ncHgueHNkIj48dHJrPjx0cmtzZWc+PHRya3B0IGxhdD0iNTEuMTAxNzciIGxvbj0
iMC4zOTM0%0AOSIvPjx0cmtwdCBsYXQ9IjUxLjEwMTgxIiBsb249IjAuMzkzMzUiLz48dHJrcHQgbGF0PSI1MS4xMDI1
NSIgbG9uPSIw%0ALjM5MzY2Ii8+PHRya3B0IGxhdD0iNTEuMTAzOTgiIGxvbj0iMC4zOTQ2NiIvPjx0cmtwdCBsYXQ9I
jUxLjEwNTAxIiBs%0Ab249IjAuMzk1MzMiLz48L3Rya3NlZz48L3Ryaz48L2dweD4%3D
I'm not an expert on this, but as I understand, you need to URL-encode strings only when you want to paste them as-is into the web path of your browser (read about "URL Params"). If you construct your HTTP requests the right way™ (by this I mean specify the headers of the request and the key-value pairs correctly), you shouldn't have to worry about URL-encoding at all, since the tool that you're using (in this case, MATLAB) should take care of the conversion for you.
Unfortunately, I cannot test this theory, as I have no access to the discussed API - but I am fairly certain that this would solve your problem.
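To make the escaping point concrete, here is a minimal Python sketch, assuming the file is sent as plain base64 in a query parameter. The helper name encode_for_query and the shortened GPX string are made up for illustration; nothing here is taken from the HERE documentation.

```python
import base64
from urllib.parse import quote, unquote

def encode_for_query(raw: bytes) -> str:
    """Base64-encode raw bytes, then percent-encode the result for a URL query."""
    b64 = base64.b64encode(raw).decode("ascii")
    # safe="" forces '+', '/' and '=' to be escaped as %2B, %2F and %3D
    return quote(b64, safe="")

# Shortened, hypothetical GPX payload used only for this demonstration
gpx = b'<?xml version="1.0"?><gpx version="1.0"><trk/></gpx>'
escaped = encode_for_query(gpx)

# Round trip: undo the percent-encoding, then the base64 encoding
assert base64.b64decode(unquote(escaped)) == gpx
```

Note that base64.b64encode does not insert line breaks, so the result contains no %0A sequences like the ones visible in the escaped string above.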
I had the exact same problem.
The documentation seems to be incomplete. You can check here for additional information. Several things I did to solve this:
Use filetype='CSV' or filetype='GPX' as a parameter. The documentation says the file type is guessed if it is not passed; that is actually not true. After passing an XML file, the API told me my file didn't look like a 'CSV'.
Compression is OPTIONAL. I suggest avoiding it completely; I could not find a suitable compression either. It works fine with plain base64 encoding.
I suggest actually using CSV, because the XML returns parsing errors.
In Python:
import base64
import os
import requests

# Minimal CSV trace: a header line plus one latitude,longitude pair
data = '''latitude,longitude
51.10177,0.39349
'''
# Passing params lets requests percent-encode the base64 payload ('+', '/', '=')
r = requests.get(
    'https://rme.api.here.com/2/matchroute.json',
    params={
        'app_id': os.getenv('HERE_APP_ID'),
        'app_code': os.getenv('HERE_APP_CODE'),
        'routemode': 'car',
        'filetype': 'CSV',
        'file': base64.b64encode(data.encode()).decode(),
    },
)

TYPO3 7.6: 404 error page: HTML wrapped in numbers

I created my own “404 Page not found” error page on a TYPO3 website and implemented it via the /typo3conf/LocalConfiguration.php as follows, using the page’s Speaking URL path:
return [
...
'FE' => [
...
'pageNotFound_handling' => '/page-not-found/',
]
]
Now when I call a non-existing page, the error page gets displayed but there is a 4-digit alphanumeric number (hexadecimal as far as I’ve seen by now) BEFORE the HTML source code and a “0” AFTER it. Example (the number in the beginning is different after most of the reloads):
37b3
<!DOCTYPE html>
...
</html>
0
When calling the error page URL itself the page is returned correctly without those numbers.
Having the RealURL extension activated or deactivated does not make a difference.
Thanks a lot in advance!
I added the full description from the install tool and I guess we might find the solution there.
How TYPO3 should handle requests for non-existing/accessible pages.
empty (default)
The next visible page upwards in the page tree is shown.
'true' or '1'
An error message is shown.
String
Static HTML file to show (reads content and outputs with correct headers), e.g. notfound.html or http://www.example.org/errors/notfound.html.
Prefix "REDIRECT:"
If prefixed with "REDIRECT:" it will redirect to the URL/script after the prefix.
Prefix "READFILE:"
If prefixed with "READFILE" then it will expect the remaining string to be a HTML file which will be read and outputted directly after having the marker "###CURRENT_URL###" substituted with REQUEST_URI and ###REASON### with reason text, for example: READFILE:fileadmin/notfound.html.
Prefix "USER_FUNCTION:"
If prefixed with "USER_FUNCTION:" a user function is called, e.g. USER_FUNCTION:fileadmin/class.user_notfound.php:user_notFound->pageNotFound where the file must contain a class user_notFound with a method pageNotFound() inside with two parameters $param and $ref.
What you configured:
You're passing a plain string, thus TYPO3 expects to find a file, which you don't have, because your value is more like a URL.
Given what you are trying to achieve, I'd go with REDIRECT:/page-not-found/.
Thanks for pointing this one out btw, I will remove the string configuration from the core since it does not make sense to have more people trip into this pitfall.
In short: change the following line in the FE section of your LocalConfiguration.php:
'pageNotFound_handling' => '/your404page.html',
to
'pageNotFound_handling' => 'REDIRECT:/your404page.html',
Cause
The actual cause is a combination of chunked Transfer-Encoding and TYPO3 not being able to decode that in some cases. In your case the page not found handler eventually uses GeneralUtility::getUrl() to retrieve the error page.
If you have [SYS][curlUse] enabled it will use cUrl to retrieve the page and there is no problem.
If you don't have [SYS][curlUse] enabled it will open a socket, read the headers and then read the rest of the body. If the webserver uses "chunked" Transfer-Encoding the body will contain blocks of data, and each block starts with a line giving its length in hexadecimal format. The content ends with an empty block (with, of course, a line with the length "0").
cUrl apparently knows how to decode chunked data.
getUrl() itself does not know how to handle chunked data and uses the content as is as the page content.
In TYPO3 8 LTS the guzzle library is used to handle HTTP requests. In the guzzle code I can't find anything about handling chunked data. Guzzle will check if the cUrl PHP extension is present and use that as preferred transport. In most installations cUrl is present and since this decodes chunked data automagically no problem is visible. I have to test guzzle with PHP that has cUrl disabled to see if the issue is also present in v8/master.
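The chunk format described above (a hexadecimal length line, the data, and a final block of length "0") can be sketched in a few lines of Python. This is only an illustration of the wire format, not TYPO3 or guzzle code; the function name decode_chunked is made up, and the sketch ignores trailers.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body (chunk extensions and trailers are ignored)."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal; anything after ';' is a chunk extension
        size = int(body[pos:line_end].split(b";")[0].strip(), 16)
        if size == 0:
            break  # the empty chunk marks the end of the content
        start = line_end + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return bytes(out)

# Two chunks of 5 and 7 bytes, followed by the terminating empty chunk
raw = b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n"
decoded = decode_chunked(raw)
```

A body that is output without this step keeps the length lines, which matches the "37b3" before the HTML and the "0" after it in the question.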
Workaround/solution
If the PHP extension cUrl is enabled in your installation you can simply set [SYS][curlUse] in the Install Tool. The numbers around the 404 page content will disappear.

validate MPD file - using MPEG-DASH

I have just started using MPEG-DASH (from the client side), following the c057623_ISO_IEC_23009-1_2012 spec.
Does anyone know of a public library or open-source tool to validate the MPD files I receive?
I have no problem in processing the xml.
Any help will be appreciated.
You may want to check this MPEG-DASH MPD Validator
The DASH Industry Forum provides great software resources for all things MPEG DASH.
Here is another MPD Validator from dashif: DASHIF Validator.
Regarding the error "Cannot find the declaration of element ‘MPD'" reported by the MPEG-DASH MPD Validator mentioned above: I observed that it can occur even when the MPD tag is present, if the namespace differs from the expected text (the comparison is case-sensitive), such as:
<MPD xmlns="urn:mpeg:DASH:schema:MPD:2011" ...>
instead of
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" ...>
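A quick pre-check for this particular mistake can be done before handing the file to a validator. The following Python sketch only compares the root element's namespace against the lowercase URN shown above; it is not a schema validation, and the function name mpd_namespace_ok is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URN as it appears in the working example above
MPD_NS = "urn:mpeg:dash:schema:mpd:2011"

def mpd_namespace_ok(xml_text: str) -> bool:
    """Return True if the root element is MPD in the exact (case-sensitive) namespace."""
    root = ET.fromstring(xml_text)
    return root.tag == "{%s}MPD" % MPD_NS

good = '<MPD xmlns="urn:mpeg:dash:schema:mpd:2011"/>'
bad = '<MPD xmlns="urn:mpeg:DASH:schema:MPD:2011"/>'
print(mpd_namespace_ok(good), mpd_namespace_ok(bad))
```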

How to create and implement a pixel tracking code

OK, here's a goal I've been looking for a while.
As is well known, most advertising and analytics companies use a so-called "pixel" code to track website views, transactions, conversions, etc.
I have a general idea of how it works; the problem is how to implement it. The tracking code consists of a few parts.
The tracking code itself.
This is the code that the user inserts on their webpage in the <head> section. The main goal of this code is to set some customer-specific variables and to call the *.js file.
*.js file.
This file holds all the magic of CRUD (create/read/update/delete) for cookies and tracks the user's events and interaction with the webpage.
The pixel code.
This is an <img> tag with the src attribute pointing to an image file (*.gif, for example) that takes all the parameters collected on the page and stores them in the database.
Example:
WordPress pixel code: <img id="wpstats" src="http://stats.wordpress.com/g.gif?host=www.hostname.com&list_of_cookies_value_pairs;" alt="">
Google Analytics:
http://www.google-analytics.com/__utm.gif?utmwv=4&utmn=769876874&etc
Now, it's obvious that the *.gif request has to reach a server side scripting language in order to read the parameters data and store them in a db.
Does anyone have an idea how to implement this in Zend?
UPDATE
Another thing I'm interested in: how do I prevent the user's browser from loading a cached *.gif? Will a random parameter value do the trick? Example: src="pixel.gif?nocache=random_number", where the nocache parameter value is different on every request.
As Zend is built using PHP, it might be worth reading the following question and answer: Developing a tracking pixel.
In addition to this answer and as you're looking for a way of avoiding caching the tracking image, the easiest way of doing this is to append a unique/random string to it, which is generated at runtime.
For example, server-side and with the creation of each image, you might add a random URL id:
<?php
// Generate a random 8-digit id (rand(8, 8) would always return 8)
$rand_id = mt_rand(10000000, 99999999);
// Echo the image tag, appending the random id as a cache-busting parameter
echo "<img src='pixel.php?a=".urlencode($vara)."&b=".urlencode($varb)."&rand=".$rand_id."'>";
?>
Just adding my 2 cents to this thread because I think an important, and frequently used, option is missing: you don't necessarily need a scripting language to capture the request. A more efficient approach is to use the web server access log (like apache access log for instance) to log the request and then handle that log with whatever tools you see fit, like ELK stack for instance.
This makes serving the requests much lighter because no scripting language is loaded to prepare the response, just native apache response, which is typically much more efficient.
First of all, the *.gif doesn't need to be that file type; the only thing of interest is the Content-Type HTTP header. Set that to image/gif (or any other appropriate type) at the beginning, execute your code, and render some sort of image to the response body.
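The same idea is independent of the server-side language. As a rough sketch in Python (not Zend), assuming the handler receives the raw query string: the function name pixel_response, the in-memory list used in place of a database, and the sample parameters are all made up for illustration.

```python
import base64
from urllib.parse import parse_qs

# A commonly used 1x1 transparent GIF, stored as base64
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

def pixel_response(query_string: str, log: list):
    """Record the query parameters, then return (headers, body) for the pixel."""
    log.append(parse_qs(query_string))  # stand-in for a database insert
    headers = {
        "Content-Type": "image/gif",
        "Content-Length": str(len(PIXEL)),
        # Ask browsers and proxies not to reuse a cached copy
        "Cache-Control": "no-store, no-cache, must-revalidate",
    }
    return headers, PIXEL

hits = []
headers, body = pixel_response("host=www.hostname.com&page=%2Fhome", hits)
```

Sending a Cache-Control header like this complements the random query parameter discussed above; either one alone may be enough in practice.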
All of the answers above are correct, but since one of them mentions "g.gif" specifically:
You can write a simple PHP script that records each hit in an SQL table or in a file (for example with fwrite to file.txt),
where a variable $opened serves as a counter that is incremented each time someone opens your mail, and then save that script as "g.gif".
To make this work, add these lines:
<Files "g.gif">
AddType application/x-httpd-php .gif
</Files>
to your ".htaccess" file, but be sure to create a new directory for that g.gif (or whatever.gif) that contains only g.gif and the .htaccess file.

Wrong encoding when saving forms on Orbeon

I created my own persistence layer for SQL Server, and the CRUD works fine,
BUT I'm having some trouble with the encoding, I think.
When I am about to save something, I receive the XML text from XForms like this:
<?xml version="1.0" encoding="UTF-8"?><xhtml:html xmlns:xhtml="http://www.w3 ...............
<metadata>
<application-name>w4</application-name>
<form-name>usuario</form-name>
<title xml:lang="en">Cadastro</title>
<description xml:lang="en">Usuário</description> ---------PROBLEM!!!
</metadata>
<xforms:instance....................
Any ideas how to solve this??
In general, you need to make sure, when you are decoding the XML, to properly deal with the character encoding. How exactly to do that depends on the programming language or framework you are using, but you should:
if possible use an XML parser and just feed it the bytes (the parser will take care of handling the encoding by itself)
never assume a default or platform encoding when converting bytes to characters (Java in particular has a number of APIs which, for very wrong reasons, use a default encoding which is platform-dependent)
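Both points can be illustrated with a short Python sketch. The XML fragment is a made-up stand-in for the form data in the question, and latin-1 is just one example of a wrongly assumed encoding.

```python
import xml.etree.ElementTree as ET

# Bytes as they might arrive from the form, declared and encoded as UTF-8
payload = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<description>Usuário</description>'
).encode("utf-8")

# Feeding the raw bytes to the parser lets it honour the declared encoding
parsed = ET.fromstring(payload).text

# Decoding with a guessed encoding first garbles every non-ASCII character
garbled = payload.decode("latin-1")

print(parsed)
print(garbled)
```

The first value keeps the accented character intact, while the second shows the typical two-character garbage in place of "á".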