Handling illegal character <0x1A> in database output - scala

I am working on a data transformation pipeline which reads data from an Oracle SQL relational DB, writes it to an RDF triplestore, and then pulls it into JVM memory. The original database contains some cells whose string values start with the Unicode character sometimes represented as <0x1A> or U+001A. This is probably in the database by mistake, but I have no control over that database and have to deal with it as is. I also can't modify the strings, as they are later used as primary keys to look up information from other tables in the database (yes, I understand this is not ideal). I am working on Windows.
The cells containing this character are mapped to literal values in the triplestore. When attempting to pull and iterate through the data from the triplestore, I receive the following error due to the presence of the illegal character:
error: org.eclipse.rdf4j.query.QueryEvaluationException:
org.eclipse.rdf4j.query.QueryEvaluationException:
org.eclipse.rdf4j.query.resultio.QueryResultParseException:
org.xml.sax.SAXParseException; lineNumber: 1085; columnNumber: 14;
An invalid XML character (Unicode: 0x1a) was found in the element content of the document.
In case it's interesting, here's the code I'm using to iterate over the results from the triplestore:
val cxn = getDatabaseConnection()
val query = getTriplestoreQuery()
val tupleQueryResult = cxn.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate()
// fails at this line when illegal XML character is discovered
while (tupleQueryResult.hasNext()) {
    // do some stuff with the data
}
I'm struggling a bit because I have to find a way to pull this data into memory without modifying the strings as they currently exist in the database. I haven't been able to find an escape solution for this case yet. My last resort would be to catch the QueryEvaluationException and simply not process the damaged strings, but it would be preferable to be able to salvage this data.
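For reference, here is roughly what that last-resort fallback would look like, built from the snippet above (getDatabaseConnection() and getTriplestoreQuery() are the same helpers assumed there). Note that once the XML parser hits the bad character, the rest of that result stream is lost, so this only skips the damaged rows rather than salvaging them:

import org.eclipse.rdf4j.query.{QueryEvaluationException, QueryLanguage}

val cxn = getDatabaseConnection()
val query = getTriplestoreQuery()
val tupleQueryResult = cxn.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate()
try {
    while (tupleQueryResult.hasNext()) {
        val bindingSet = tupleQueryResult.next()
        // do some stuff with the data
    }
} catch {
    case e: QueryEvaluationException =>
        // reached a damaged string; anything after it in this result stream is skipped
        println(s"Stopped processing at the damaged row: ${e.getMessage}")
} finally {
    tupleQueryResult.close()
}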

Related

Db2 for i: Cpyf *nochk emulation

In the IBM i system there's a way to copy from a structured file to one without structure using Cpyf *nochk.
How can it be done with SQL?
The answer may be "You can't", at least not if you are using DDL-defined tables. The problem is that *NOCHK just dumps data into the file as it would into a flat file. Files defined with CRTPF, whether they have source or are program-defined, don't care about bad data until read time, so they can contain bad data. In fact, you can even read bad data out of a file if you use a program definition for that file.
But, an SQL Table (one defined using DDL) cannot contain bad data. No matter how you write it, the database validates the data at write time. Even the *NOCHK option of the CPYF command cannot coerce bad data into an SQL table.
There really isn't an easy way.
The closest would be to just build a big character string using CONCAT...
insert into flatfile
  select mycharfld1
         concat cast(myvchar as char(20))
         concat digits(zonedFld3)
    from mytable
That works for fixed-length fields, varchar (if cast to char) and zoned decimal...
Packed decimal would be problematic...
I've seen user-defined functions that can return the binary character string that makes up a packed decimal... but it's very ugly.
I question why you think you need to do this.
You can use QSYS2.QCMDEXC stored procedure to execute OS commands.
Example:
call qsys2.qcmdexc ( 'CPYF FROMFILE(QTEMP/FILE1) TOFILE(QTEMP/FILE2) MBROPT(*replace) FMTOPT(*NOCHK)' )

How does one prevent MS Access from concatenating the schema and table names, thereby taking them over the 64-character limit?

I have been trying to get around this for several days now with no luck. I loaded LibreOffice to see how it would handle it, and its native support for PostgreSQL works wonderfully; I can see the true data structure, which is how I found out I was dealing with more than one table. What I am seeing in MS Access is the two names concatenated together. The concatenation takes them over the 64-character limit that appears to be built into the ODBC driver. I have seen many references to modifying namedatalen on the server side, but my problem is on the ODBC side. Most of the tables are under the 64-character limit even with the concatenation and work fine, so I know everything else is working. The specific error I am getting is:
'your_extra_long_schema_name_your_table_name_that_you_want_to_get_data_from'
is not a valid name. Make sure it does not include invalid characters
or punctuation and that it is not too long.
Object names in an Access database are limited to 64 characters. When creating an ODBC linked table in the Access UI, the default behaviour is to concatenate the schema name and table name with an underscore and use that as the linked table name, so, for example, the remote table table1 in the schema public would produce a linked table in Access named public_table1. If such a name exceeds 64 characters then Access will throw an error.
However, we can use VBA to create the table link with a shorter name, like so:
Option Compare Database
Option Explicit

Sub so38999346()
    DoCmd.TransferDatabase acLink, "ODBC Database", "ODBC;DSN=PostgreSQL35W", acTable, _
            "public.hodor_hodor_hodor_hodor_hodor_hodor_hodor_hodor_hodor_hodor", _
            "hodor_linked_table"
End Sub
(Tested with Access 2010.)

Min length for SQL Name/Identifier in PostgreSQL?

I need to parse an SQL name for PostgreSQL and be sure it is always compliant.
I know that, for example, an SQL name/identifier can even consist of a single white space.
What I'm trying to find out is: is there a single use case within the entire PostgreSQL syntax where it is possible to pass in an empty "" SQL name/identifier and still have it considered valid?
I want to know whether I should always parse "" as invalid in PostgreSQL or not.
An empty quoted identifier is not valid as per the SQL standard.
You get the following error message when you try to create a table with "":
ERROR: zero-length delimited identifier at or near """"
And what would it identify, anyway? There is no such thing as an identifier that identifies the empty identifier.
You can find out more about the syntax here: http://www.postgresql.org/docs/9.4/static/sql-syntax-lexical.html
Quoted identifiers can contain any character, except the character with code zero. (To include a double quote, write two double quotes.) This allows constructing table or column names that would otherwise not be possible, such as ones containing spaces or ampersands. The length limitation still applies.
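Based on those rules, a minimal Scala sketch of the validation/quoting the question asks about could look like this (quoteIdentifier is a hypothetical helper, not part of any PostgreSQL driver): the empty name is rejected outright, code zero is the only forbidden character, and embedded double quotes are doubled:

def quoteIdentifier(name: String): String = {
    // zero-length delimited identifiers are rejected, matching the error shown above
    require(name.nonEmpty, "zero-length delimited identifier")
    // code zero is the only character a quoted identifier cannot contain
    require(!name.exists(_ == '\u0000'), "identifier cannot contain the NUL character")
    // to include a double quote, write two double quotes; the length limitation
    // (63 bytes by default) still applies, and PostgreSQL truncates longer names
    "\"" + name.replace("\"", "\"\"") + "\""
}

quoteIdentifier(" ")          // a single space is a legal identifier once quoted
quoteIdentifier("say \"hi\"") // becomes "say ""hi""" after quoting
quoteIdentifier("")           // throws IllegalArgumentException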

Scala Slick, Trouble with inserting Unicode into database

When I create a new row of data in which several columns may contain Unicode, the columns that do contain Unicode are being corrupted.
However, if I insert that data directly using the mysql CLI, Slick will retrieve that Unicode data fine.
Is there anything I should add to my table class to tell Slick that this column may be a Unicode string?
I found the problem: I have to set the character encoding for the connection.
db.default.url="jdbc:mysql://localhost/your_db_name?characterEncoding=UTF-8"
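The line above is the configuration-file form; when the connection is built in code instead, the same URL parameter can be passed to Slick directly. A minimal sketch, assuming Slick 3.x, with placeholder credentials and driver class:

import slick.jdbc.MySQLProfile.api._

// characterEncoding=UTF-8 on the JDBC URL is what prevents the corrupted columns
val db = Database.forURL(
    url = "jdbc:mysql://localhost/your_db_name?characterEncoding=UTF-8",
    user = "your_user",
    password = "your_password",
    driver = "com.mysql.cj.jdbc.Driver" // placeholder; use the class that matches your connector version
)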
You probably need to configure that on the db schema side by setting the right collation.

Migrating Low-Values in flat file to RDB

I have an indexed file where a particular field now holds alphanumeric values, and this field is part of the key. That column has LOW-VALUES in one row and SPACES in another row, and the two rows are treated as unique records in the indexed file. But when I try to migrate this to an RDB, I get a unique key violation, since LOW-VALUES in the RDB is treated as spaces. Has anyone faced a similar situation, and how did you handle it?
Note: right now, I'm just planning to replace LOW-VALUES with "RANDOM" text. I just want to know whether there is any other way to handle LOW-VALUES in an RDB.
It is a bit odd that a record key would contain either spaces or low-values. Strikes me that you may be migrating some "bad data".
However, if these are valid values, then you need to replace one of them (low-values, probably binary zeros, or spaces) with something else that will not conflict with any currently existing, or likely to exist, value for that key.
Keys on one file are often held as references in other files - you will need to track down and convert all of these as well. Failing to do so will lead to a corrupted database (broken RI constraints etc).
This does not look like a "pretty" situation.
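If it helps, here is a rough Scala sketch of the replacement idea applied during the migration step (the sentinel text is only a placeholder; whatever you pick must never occur as a real key, and the same mapping must be applied to every referencing file as well):

// Map a key that consists entirely of low-values (binary zeros) to a sentinel
// value, so it no longer collides with the all-spaces key once loaded into the RDB.
def normalizeKey(raw: String): String =
    if (raw.nonEmpty && raw.forall(_ == '\u0000')) "##LOW-VALUES##" // placeholder sentinel
    else raw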