I'm retrieving some binary files (in this case, some PDFs) from a database using ExecuteSQL, which returns the result in an Avro FlowFile. I can't figure out how to get the binary result out of the Avro records.
I've tried using ConvertAvroToJSON, which gives me an object like:
{"MYBLOB": {"bytes": "%PDF-1.4\n [...] " }}
However, using EvaluateJSONPath to grab $.MYBLOB.bytes causes corruption, because the binary bytes get converted to UTF-8.
None of the record writer options with ConvertRecord seem appropriate for binary data.
The best solution I can think of is to base64 encode the binary before it leaves the database, then I'm dealing with only character data and can decode it in NiFi. But that's extra steps and I'd prefer not to do that.
You may need a scripted solution in this case (as a workaround) to get the field and decode it using your own encoding. In any case, please feel free to file a Jira case; ConvertAvroToJSON is deprecated, but we should support character sets for the JsonRecordSetWriter in ExecuteSQLRecord/ConvertRecord (if that doesn't work for you either).
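For illustration only (this is not part of the original answer): a minimal standalone Java sketch of that workaround, reading the Avro file that ExecuteSQL produced and writing the MYBLOB column's bytes out untouched. Inside NiFi the same logic would typically live in an ExecuteScript/InvokeScriptedProcessor script (e.g. Groovy); the file arguments here are just placeholders.

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;

public class ExtractBlob {
    public static void main(String[] args) throws Exception {
        // args[0]: Avro file written by ExecuteSQL, args[1]: destination for the raw PDF bytes
        try (FileInputStream in = new FileInputStream(args[0]);
             FileOutputStream out = new FileOutputStream(args[1]);
             DataFileStream<GenericRecord> records =
                     new DataFileStream<>(in, new GenericDatumReader<>())) {
            GenericRecord row = records.next();                  // first row of the result set
            ByteBuffer blob = (ByteBuffer) row.get("MYBLOB");    // Avro "bytes" fields come back as ByteBuffers
            byte[] bytes = new byte[blob.remaining()];
            blob.get(bytes);
            out.write(bytes);                                    // raw binary, no charset conversion
        }
    }
}

If the result set can contain more than one row, iterate over records instead of taking just the first one.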
I have a PostGIS + Debezium/Kafka + Debezium/Connect setup that is streaming changes from one database to another. I have been watching the messages via Kowl and everything is moving accordingly.
My problem arises when I'm reading the message from my Kafka topic, the geometry (wkb) column in particular.
This is my Kafka message:
{
  "schema":{
    "type":"struct",
    "fields":[...],
    "optional":false,
    "name":"ecotx_geometry_kafka.ecotx_geometry_impo..."
  },
  "payload":{
    "before":null,
    "after":{
      "id":"d6ad5eb9-d1cb-4f91-949c-7cfb59fb07e2",
      "type":"MultiPolygon",
      "layer_id":"244458fa-e6e0-4c6c-a7e1-5bf0afce2fb8",
      "geometry":{
        "wkb":"AQYAACBqCAAAAQAAAAEDAAAAAQAAAAUAAABwQfUo...",
        "srid":2154
      },
      "custom_style":null,
      "style_id":"default_layer_style"
    },
    "source":{...},
    "op":"c",
    "ts_ms":1618854994546,
    "transaction":null
  }
}
As can be seen, the WKB information is something like "AQAAAAA...", despite the information inserted in my database being "01060000208A7A000000000000" or "LINESTRING(0 0,1 0)".
And I don't know how to parse/transform it to a ByteArray or a Geometry in my Consumer app (Kotlin/Java) to further use in GeoTools.
I don't know if I'm missing an import that is able to translate this information.
I have seen a few questions from people posting their JSON messages, and every message that has a geom field (streamed with Debezium) got changed to something like "AAAQQQAAAA".
Having said that, how can I parse/decode/translate it into something that can be used by GeoTools?
Thanks.
#UPDATE
Additional info:
After an insert, when I analyze my slot changes (querying the database using the pg_logical_slot_get_changes function), I'm able to see my changes in WKB:
{"change":[{"kind":"insert","schema":"ecotx_geometry_import","table":"geometry_data","columnnames":["id","type","layer_id","geometry","custom_style","style_id"],"columntypes":["uuid","character varying(255)","uuid","geometry","character varying","character varying"],"columnvalues":["469f5aed-a2ea-48ca-b7d2-fe6e54b27053","MultiPolygon","244458fa-e6e0-4c6c-a7e1-5bf0afce2fb8","01060000206A08000001000000010300000001000000050000007041F528CB332C413B509BE9710A594134371E05CC332C4111F40B87720A594147E56566CD332C4198DF5D7F720A594185EF3C8ACC332C41C03BEDE1710A59417041F528CB332C413B509BE9710A5941",null,"default_layer_style"]}]}
This would be useful in the consumer app. The issue definitely lies in the Kafka message content itself; I'm just not sure what is transforming this value, Kafka or DBZ/Connect.
I think it is just a different way to represent binary columns in PostGIS and in JSON. The WKB is a binary field, meaning it has bytes with arbitrary values, many of which have no corresponding printable characters. PostGIS prints it out using hex encoding, thus it looks like '01060000208A7A...' (hex digits), but internally it is just bytes. Kafka's JSON uses Base64 encoding instead for exactly the same binary content.
Let's test with a prefix of your string:
select to_base64(from_hex('01060000206A080000010000000103000000010000000500'))
AQYAACBqCAAAAQAAAAEDAAAAAQAAAAUA
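On the consumer side, this illustrative Java sketch (not from the original answer) decodes the Base64 wkb field and parses it with the JTS WKBReader that GeoTools builds on; JTS can generally read the EWKB that PostGIS emits. The method name and signature are just for illustration.

import java.util.Base64;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKBReader;

public class WkbFromKafka {
    // Turn the message's geometry.wkb (Base64 text) into a JTS Geometry usable with GeoTools.
    static Geometry parse(String base64Wkb, int srid) throws ParseException {
        byte[] wkb = Base64.getDecoder().decode(base64Wkb);  // same bytes PostGIS displays as hex
        Geometry geom = new WKBReader().read(wkb);           // parses the (E)WKB written by PostGIS
        geom.setSRID(srid);                                  // the SRID arrives as a separate field (2154)
        return geom;
    }
}

Feeding it the full wkb value from the message above should give back a MultiPolygon.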
I'm trying to figure out the answer to one of my other questions, but maybe this will help me anyway.
When I persist an entity to the server, the byte[] property holds different information than what I persisted. I'm persisting to the server in UTF-8.
An example.
{"name":"asd","image":[91,111,98,106,101,99,116,32,65,114,114,97,121,66,117,102,102,101,114,93],"description":"asd"}
This is the payload I send to the server.
This is what the server has
{"id":2,"name":"asd","description":"asd","image":"W29iamVjdCBBcnJheUJ1ZmZlcl0="}
As you can see, the image byte array is different.
What I'm trying to do is get the image bytes saved on the server and display them on the front end, but I don't know how to get the original bytes.
No, you are wrong. Both methods stored the ASCII string [object ArrayBuffer].
You are confusing the data with its representation. The data is the same, but your two examples represent the binary data in two different ways:
the first as an array of bytes (decimal representation), the second in a classic representation for binary data: Base64 (you can recognize it by the final = character).
So you just have two different representations of the same data; the data itself is stored in the same manner.
You may need to specify how you want the binary data back in string form (as in your example), i.e., which representation to use.
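A quick way to convince yourself (illustrative Java, not part of the original answer) is to decode both representations and compare:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class SameData {
    public static void main(String[] args) {
        // The byte array from the request payload (decimal representation)
        byte[] fromJson = {91,111,98,106,101,99,116,32,65,114,114,97,121,66,117,102,102,101,114,93};
        // The Base64 string the server returned
        byte[] fromBase64 = Base64.getDecoder().decode("W29iamVjdCBBcnJheUJ1ZmZlcl0=");

        System.out.println(new String(fromJson, StandardCharsets.US_ASCII));   // [object ArrayBuffer]
        System.out.println(new String(fromBase64, StandardCharsets.US_ASCII)); // [object ArrayBuffer]
    }
}

Both print [object ArrayBuffer], the default string form of a JavaScript ArrayBuffer, which suggests the buffer was stringified on the client before being put into the payload, so the original image bytes were most likely never sent.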
Actually, I need to create a transformation that will read a JSON file from the system directory and rename the JSON fields (keys) based on the metadata inputs. Finally, it should write the modified JSON into a '.js' file using the JSON output step. This conversion must be done using the ETL Metadata Injection step.
Since I am new to the Pentaho Data Integration tool, can anyone help me with sample '.ktr' files for the above scenario?
Thanks in advance.
The same use case is covered in the official Pentaho documentation here, except it is done with Excel files rather than JSON objects.
Now, the Metadata Injection step requires building rather sophisticated machinery, while JSON is rather simple to transform with a small JavaScript step. So, where do you get the "dictionary" (source field name -> target field name) from?
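Not a .ktr, but to make the "dictionary" idea concrete, here is a small illustrative sketch (in Java rather than the JavaScript step, purely to show the logic) of the renaming that the injected metadata would drive: each key of a flat JSON record is looked up in the source-to-target dictionary and replaced when a mapping exists.

import java.util.LinkedHashMap;
import java.util.Map;

public class RenameKeys {
    // record: one flat JSON object as key/value pairs; dictionary: source field name -> target field name
    static Map<String, Object> rename(Map<String, Object> record, Map<String, String> dictionary) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            out.put(dictionary.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }
}

Whatever form the dictionary takes (a spreadsheet, a database table, a property file), it is exactly the input the Metadata Injection step needs.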
I set up a Spark Streaming pipeline that gets measurement data via Kafka. This data was serialized using Avro. The data can be of two types, EquidistantData and DiscreteData. I created these using an avdl file and the sbt-avrohugger plugin. I use the variant that generates Scala case classes that inherit from SpecificRecord.
In my receiving application, I can get the two schemas by querying EquidistantData.SCHEMA$ and DiscreteData.SCHEMA$.
Now, my Kafka stream gives me RDDs whose value class is Array[Byte]. So far so good.
How can I find out from the byte array which schema was used when serializing it, i.e., whether to use EquidistantData.SCHEMA$ or DiscreteData.SCHEMA$?
I thought of sending appropriate info in the message key. Currently, I don't use the message key. Would this be feasible, or can I get the schema somehow from the serialized byte array I received?
Followup:
Another possibility would be to use separate topics for discrete and equidistant data. Would this be feasible?
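A plain Avro binary payload does not carry its schema, so an out-of-band hint like the message key or a separate topic is a common approach (a schema registry or Avro single-object encoding are alternatives). Below is a minimal Java sketch of the message-key idea, under the assumption (mine, not from the question) that the producer writes "equidistant" or "discrete" into the key; the two schemas would come from EquidistantData.SCHEMA$ and DiscreteData.SCHEMA$.

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class MeasurementDecoder {
    private final Schema equidistant;
    private final Schema discrete;

    MeasurementDecoder(Schema equidistant, Schema discrete) {
        this.equidistant = equidistant;   // e.g. EquidistantData.SCHEMA$
        this.discrete = discrete;         // e.g. DiscreteData.SCHEMA$
    }

    // Pick the schema from the record key, then decode the raw Avro bytes.
    GenericRecord decode(String key, byte[] value) throws IOException {
        Schema schema = "equidistant".equals(key) ? equidistant : discrete;
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        return reader.read(null, DecoderFactory.get().binaryDecoder(value, null));
    }
}

Separate topics would work the same way, with the topic name instead of the key deciding which schema to use.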
The proposed storage model is to store attachments in separate files (or blobs), and to store the email itself as a MIME multipart message, with references to the attached file and how it was encoded. This allows the user to Show Original, but does not require me to actually store the less efficient base64 with the message. Most of the time I will be able to store just the base64 line length that was used.
This way, we can perform attachment-level deduplication.
But how can the deduplication go further? Here are my thoughts:
All attachments and emails could be compressed (byte-level deduplicated) individually of course.
I could compress sets of maybe 12 attachments together in a single file. Compressing multiple files of the same type (for example, PDFs), even those from the same sender, may be more effective.
The MIME messages can also be compressed in sets.
I am not concerned about search efficiency because there will be full text indexing used.
Searching the emails would of course use a form of full-text index, which would not be compressed.
A decompressed cache would be created when an email first arrives, and would only be deleted after the email has not been viewed for some time.
Do you have any advice in this area? What is normal for an email storage system?
decode all base64 MIME parts, not only attachments
calculate a secure hash of the decoded content
replace the part with a reference in the email body, or create a custom header with the list of extracted MIME parts
store it in blob storage under the secure hash (content-addressable storage); a sketch of this flow follows this list
use a reference counter for deletions and garbage collection, or a smarter double counter (https://docs.wildduck.email/#/in-depth/attachment-deduplication, https://medium.com/#andrewsumin/efficient-storage-how-we-went-down-from-50-pb-to-32-pb-99f9c61bf6b4)
or store each reference relation (hash-emailid) in the db
carefully check and control base64 folds; some emails have a shorter line in the middle, some have additional characters (dot, whitespace) at the end
store the encoding parameters (folds, tail) in the reference in the email body for exact reconstruction
compress compressible attachments, but be careful with content-addressable storage, because compression changes the content hash
JPEG images can be significantly compressed losslessly using JPEG XL or https://github.com/dropbox/lepton
WAV files can be compressed using FLAC, etc.
the Content-Type is sender-specified; the same attachment can arrive with different content types
quoted-printable encoded parts are hard to decode and reconstruct exactly; there are many encoder parameters, because each encoder escapes different characters and folds lines differently
be careful with the reference format, so a malicious sender cannot craft an email containing a reference and fetch an attachment they do not own; alternatively, detect and escape references in received emails
small MIME parts may not be worth extracting until a specific number of duplicates are present in the system
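To tie the decode/hash/store/reference-count items above together, here is a minimal in-memory Java sketch (illustrative only; a real system would keep the counters in the database, as described above, and would also record the base64 folding parameters for exact reconstruction):

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AttachmentStore {
    private final Path blobDir;
    private final Map<String, Integer> refCounts = new ConcurrentHashMap<>(); // hash -> reference count

    AttachmentStore(Path blobDir) { this.blobDir = blobDir; }

    // Decode a base64 MIME part, store it once under its content hash,
    // and return the hash to embed as a reference in the stripped email.
    String store(String base64Part) throws Exception {
        byte[] content = Base64.getMimeDecoder().decode(base64Part);   // tolerant of folded lines
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
        String hash = HexFormat.of().formatHex(digest);                // content-addressable key
        if (refCounts.merge(hash, 1, Integer::sum) == 1) {
            Files.write(blobDir.resolve(hash), content);               // write the first copy only
        }
        return hash;
    }

    // Drop one reference; delete the blob once nothing points at it any more.
    void release(String hash) throws Exception {
        if (refCounts.merge(hash, -1, Integer::sum) <= 0) {
            refCounts.remove(hash);
            Files.deleteIfExists(blobDir.resolve(hash));
        }
    }
}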