Loading error of tdbloader2: Illegal character in IRI - encoding

I'm trying to replicate DBpedia locally for an experiment.
I downloaded the latest DBpedia dataset from http://downloads.dbpedia.org/2015-10/core/
and stored the files in a directory dbp_201510/.
I tried to load the dataset using tdbloader2:
tdbloader2 --loc tdb dbp_201510/*
However, I receive the following error:
ERROR [line: 2, col: 145] Illegal character in IRI (codepoint 0x60, '`'): <http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/[`]...>
org.apache.jena.riot.RiotException: [line: 2, col: 145] Illegal character in IRI (codepoint 0x60, '`'): <http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/[`]...>
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:136)
at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:165)
at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:108)
at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:71)
at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:58)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:176)
at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:861)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:667)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:637)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:626)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:617)
at org.apache.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:165)
at jena.cmd.CmdMain.mainMethod(CmdMain.java:93)
at jena.cmd.CmdMain.mainRun(CmdMain.java:58)
at jena.cmd.CmdMain.mainRun(CmdMain.java:45)
at org.apache.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:85)
In addition, I receive a lot of warnings like the following:
WARN [line: 92881, col: 1 ] Bad IRI: <http://dbpedia.org/resource/Ranma_½> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
WARN [line: 92882, col: 1 ] Bad IRI: <http://dbpedia.org/resource/Ranma_½> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
I am using Apache Jena 3.0.1.
I'm looking for a way to avoid this error. In addition, is there a good way to load the data without warnings?
I did the same thing with the previous version of DBpedia (http://downloads.dbpedia.org/2015-04/core/), and loading completed successfully without any warnings or errors.

The data should be made legal before loading. The character 0x60 ('`') is not legal in a URI. Maybe you want to replace it with %60 (it is then a different URI).
In many large datasets, data isn't perfect. It is worth checking it before loading using "riot --validate".
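For example, a quick pre-flight check and a sketch of the percent-encoding idea might look like this (assuming the dumps have been unpacked to plain N-Triples; the file name in the second command is a placeholder):
riot --validate dbp_201510/*
sed -i 's/`/%60/g' dbp_201510/offending-file.nt
Whether rewriting the data like this is acceptable depends on whether those IRIs need to match IRIs used elsewhere.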
The warnings are just warnings, not errors; they indicate that the UTF-8 is not in the standard's preferred form and might cause matching problems later. It looks like ½ can be written in different ways in UTF-8.
(I'm sure the DBpedia team would appreciate some feedback.)

Related

Cannot create user at runtime with Query with parameters

I am using C++ Builder 10.2.3 (RAD Studio 10.2.3 Tokyo) with InterBase 2017.
I need to create users at runtime for my user registration.
If I build the query at runtime, with no parameters, it works; but this creates problems with MBCS characters that I will explain later.
If I create the query at design time with parameters and try to set the parameters at runtime, I get the error message below:
[Application: ]
[Error] -104 335544569 Dynamic SQL Error
SQL error code = -104
Token unknown - line 2, char 14
?
The query I am using is below:
CREATE USER myuser
SET PASSWORD :mypass,
FIRST NAME :myfirstname,
LAST NAME :myname;
I replace the first line of the query at runtime, so there is no parameter there; in any case, InterBase cannot handle MBCS characters in the user name.
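For reference, a minimal sketch of the setup described above (component and variable names are illustrative, not taken from the real project; UserName, Password, FirstName and LastName are String values collected from the UI):
// First line is rebuilt at runtime because the user name cannot be a parameter;
// the remaining values are meant to be bound as parameters.
String sql = String("CREATE USER ") + UserName + "\n"
             "SET PASSWORD :mypass,\n"
             "FIRST NAME :myfirstname,\n"
             "LAST NAME :myname";
IBQuery1->SQL->Text = sql;
IBQuery1->ParamByName("mypass")->AsString      = Password;
IBQuery1->ParamByName("myfirstname")->AsString = FirstName;
IBQuery1->ParamByName("myname")->AsString      = LastName;
IBQuery1->ExecSQL();   // raises the "Token unknown - line 2, char 14" error shown above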
I need to use a query with parameters because my application handles multi-byte characters (MBCS), such as Chinese and Japanese, and this is the only option to be sure of a proper conversion to UTF8 in InterBase. If the conversion of MBCS characters is not done, I cannot back up and restore my database: when I try to restore with MBCS characters in the first and last name, I get an error message that InterBase cannot transliterate between character sets.
Based on the error message, it appears that it does not recognize the query parameters.
I tried with both TIBQuery and TIBSQL; same issue. Using a stored procedure is also impossible: it does not recognize the CREATE keyword.
So, how can I fix this?

Azure APIM Policy Editor

I would very much like to be able to set Azure APIM policy attributes based on a user's JWT claims. I have been able to set string values for things like counter-key and increment-condition, but I can't set all attributes. I imagined doing something like the following:
<rate-limit-by-key
calls="#((int) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/LimitRate/Limit", "5"))"
renewal-period="#((int) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/LimitRate/Duration/InSeconds", "60"))"
counter-key="#((string)context.Variables["Subject"])"
increment-condition="#(context.Response.StatusCode == 200)"
/>
However, there seems to be some validation happening when I save the policy, as I get the following error:
Error in element 'rate-limit-by-key' on line 98, column 10: The 'calls' attribute is invalid - The value '#((int) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/LimitRate/Limit", "5"))' is invalid according to its datatype 'http://www.w3.org/2001/XMLSchema:int' - The string '#((int) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/LimitRate/Limit", "5"))' is not a valid Int32 value.
I even have trouble setting a string parameter (albeit one with a strict format):
<quota-by-key
calls="10"
bandwidth="100"
renewal-period="#((string) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/Quota/RenewalPeriod", "P00Y00M01DT00H00M00S"))"
counter-key="#((string)context.Variables["Subject"])"
/>
This gives the following when I try to save the policy:
Error in element 'quota-by-key' on line 99, column 6: #((string) context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/Quota/RenewalPeriod", "P00Y00M01DT00H00M00S")) is not in a valid format. Provide number of seconds or use 'PxYxMxDTxHxMxS' format where 'x' is a number.
I have tried a large set of variations: casting, Convert.ToInt32, claims that are not strings, #{return 5}, #(5), etc., but there seems to be some validation happening at save time that is stopping it.
Is there a way around this issue? I think it would be a useful feature to add to my API.
The calls attribute on rate-limit-by-key and quota-by-key does not support policy expressions. Internal limitations unfortunately prevent us from treating it on a per-request basis. The best you can do is categorize requests into a few finite groups and apply the rate limit/quota conditionally using a choose policy.
Or try using the increment-count attribute to control by how much the counter is increased for each request.
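To illustrate the choose-based grouping, here is a rough sketch reusing the expressions from the question (the claim value and the limits are made up, so treat this only as a shape to adapt):
<choose>
    <when condition="#(context.Variables["IdentityToken"].AsJwt().Claims.GetValueOrDefault("/LimitRate/Limit", "5") == "100")">
        <rate-limit-by-key calls="100" renewal-period="60"
                           counter-key="#((string)context.Variables["Subject"])"
                           increment-condition="#(context.Response.StatusCode == 200)" />
    </when>
    <otherwise>
        <rate-limit-by-key calls="5" renewal-period="60"
                           counter-key="#((string)context.Variables["Subject"])"
                           increment-condition="#(context.Response.StatusCode == 200)" />
    </otherwise>
</choose>
Each branch uses a fixed calls value, which is what keeps the validator happy; only the condition is evaluated per request.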

Special character handling when fetching data from MS SQL Server using Perl DBD

I have an MS SQL Server 2008 database from which I am fetching data using the Perl DBD::Sybase module. But there are some special characters in the DB, like the copyright symbol, trademark symbol, etc., which are not being imported properly: Perl seems to change all of these special characters to a question mark. Is there a way to fix this?
I have tried specifying charset=utf8 in the connection string. The doc mentions a syb_enable_utf8 (bool) setting, but whenever I try that, I get an error:
Can't locate object method "syb_enable_utf8" via package "DBI::db"
One solution I found was this:
use Encode qw(encode_utf8);
Then, wherever you are writing data to a file or anywhere else, use Encode::encode_utf8($data);
where $data is the column value you have fetched from MS SQL Server.
I don't use DBD::Sybase, but a) I use a lot of other DBDs and b) I am currently collecting information about Unicode support in DBDs. According to the pod, you need at least OpenClient 15.x when using syb_enable_utf8. Are you using 15.x or later? Perhaps syb_enable_utf8 is not defined if your client is older than 15.x, or perhaps you have too old a version of DBD::Sybase. Unfortunately, I cannot see from the Changes file when syb_enable_utf8 was added.
However, when you say "can't locate method", I think that is a clue: syb_enable_utf8 is not a method, it is an attribute (listed under Sybase Specific Attributes in the pod). So you need to add it to your connect call or set it via a connection handle like this:
my $h = DBI->connect("dbi:Sybase:something","user","password", {syb_enable_utf8 => 1});
or
$h->{syb_enable_utf8} = 1;
You should also read the parts of the pod describing what happens when syb_enable_utf8 is set, as it appears from the documentation that it only applies to UNIVARCHAR, UNICHAR, and UNITEXT columns.
Lastly, you need to ensure you insert the data correctly in the first place. I'd guess that if it is not inserted from Perl with syb_enable_utf8 and charset=utf8, or your data is not proper Unicode characters in Perl before you insert it, you'll get garbage back.
The comment Raze2dust made had nothing to do with your issue but is worth heeding if you are going to write the data retrieved from your database elsewhere. Just remember to decode any data input to your script and encode any data output.
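For example, in rough terms (the file handle and column name here are made up):
use Encode qw(decode encode);
my $comment = decode('UTF-8', $raw_input);             # bytes coming into the script -> Perl characters
print $out_fh encode('UTF-8', $row->{description});    # Perl characters -> UTF-8 bytes on the way out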

Problem with the deprecation of the postgresql XML2 module 'xml_is_well_formed' function

We need to make extensive use of the 'xml_is_well_formed' function provided by the XML2 module.
Yet the documentation says that the xml2 module will be deprecated, since "XML syntax checking and XPath queries" are covered by the XML-related functionality based on the SQL/XML standard in the core server from PostgreSQL 8.3 onwards.
However, the core function XMLPARSE does not provide equivalent functionality since when it detects an invalid XML document,
it throws an error rather than returning a truth value (which is what we need and currently have with the 'xml_is_well_formed' function).
For example:
select xml_is_well_formed('<br></br2>');
xml_is_well_formed
--------------------
f
(1 row)
select XMLPARSE( DOCUMENT '<br></br2>' );
ERROR: invalid XML document
DETAIL: Entity: line 1: parser error : expected '>'
<br></br2>
^
Entity: line 1: parser error : Extra content at the end of the document
<br></br2>
^
Is there some way to use the new core XML functionality to simply return a truth value in the way that we need?
After asking about this on the pgsql-hackers mailing list, I am happy to report that the developers agreed it was still needed, and the function has now been moved into core.
See:
http://web.archiveorange.com/archive/v/alpsnGpFlZa76Oz8DjLs
and
http://postgresql.1045698.n5.nabble.com/review-xml-is-well-formed-td2258322.html
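Until you are on a server version that ships the built-in function, one workaround is a small PL/pgSQL wrapper that traps the parse error and turns it into a truth value. This is only a sketch, and the function name is my own choice so it does not clash with the xml2 one:
CREATE OR REPLACE FUNCTION xml_is_well_formed_doc(doc text) RETURNS boolean AS $$
BEGIN
    PERFORM XMLPARSE(DOCUMENT doc);   -- throws on invalid XML
    RETURN true;
EXCEPTION WHEN OTHERS THEN
    RETURN false;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

select xml_is_well_formed_doc('<br></br2>');   -- returns f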

PostgreSQL UTF8 Handling

From time to time my PostgreSQL DB reports a strange error:
[client] postgres7 error: [-1: ERROR: invalid byte sequence for encoding \"UTF8\": 0xb4
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by \"client_encoding\".] in adodb_throw(INSERT INTO
page_comments(pageid, pagetype, sender_name, sender_mail, sender_url, comment, owner_uid, owner_gid, sortorder, level, parent)
VALUES(
1493,
102,
\'alexis\',
\'xxx#xxx.es\',
\'\',
\'Next friday i´ll visit Barcelona so in case you need one of this mugs please let me know.\',
1000,
1000,
1,
1,
NULL
), )
Now, I see that it is coming from the funny apostrophe sign. Yet I am totally confused, as the DB was initialized in UTF8, the web application serves UTF8 pages, and, moreover, the content is even utf8_encoded before it is pushed into the database.
Does anybody know how to avoid this error?
U+00B4, ACUTE ACCENT, is encoded as '\xb4' in ISO-8859-1. In UTF-8, it would be '\xc2\xb4'. So some part of your application changes the encoding to Latin-1. Find and fix that place, and the error should go away.
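For completeness: if, as a stopgap, the client really is sending Latin-1 and cannot be fixed immediately, you can tell the server so and let it convert the bytes on the way in (per connection, or via the driver's connection options):
SET client_encoding TO 'LATIN1';
The cleaner fix is still the one above: make the application hand the server genuine UTF-8 (where ´ is the two bytes \xc2\xb4), so client_encoding can remain UTF8.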