Using the CBOR format instead of JSON in the Elasticsearch ingest plugin - elastic-stack

In the documentation of the Ingest Attachment Processor Plugin for Elasticsearch, it is mentioned: "If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will skip the base64 decoding then." Could anyone please shed some light on this, or perhaps share an example of how to achieve it? I need to index a very large number of documents, each of significant size, so I need to minimise latency.
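For reference, here is a rough sketch of what this can look like from Python, using the cbor2 and requests packages. The index name, document id, and pipeline name (attachment) are made up, and the pipeline is assumed to contain an attachment processor whose field is set to data; the key points are putting the raw file bytes (not base64 text) into that field and sending the body with the application/cbor content type.
import cbor2
import requests

# Hypothetical index, id and pipeline; the pipeline is assumed to hold an
# attachment processor configured with "field": "data".
ES_URL = "http://localhost:9200/my-index/_doc/1?pipeline=attachment"

with open("report.pdf", "rb") as f:
    raw = f.read()  # raw bytes, no base64 step

# cbor2 serialises Python bytes as a CBOR byte string, which is what lets
# the attachment processor skip base64 decoding.
body = cbor2.dumps({"data": raw})

resp = requests.put(ES_URL, data=body, headers={"Content-Type": "application/cbor"})
print(resp.status_code, resp.text)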

Related

How best to store HTML alongside strings in Cloud Storage

I have a collection of data, and each item consists of a chunk of HTML plus a few strings, for example:
html: <div>html...</div>, name string: html chunk 1, date string: 01-01-1999, location string: London, UK. I would like to store this information together as a single cloud storage object. Specifically, I am using Google Cloud Storage. There are two ways I can think of doing this. One is to store the strings as custom metadata, and the HTML as the actual file contents. The other is to store all the information as a JSON file, with the HTML as a base64-encoded string.
I want to avoid a situation where after having stored a lot of data, I find there is some limitation to the approach I am using. What is the proper way to do this - is either of these approaches bad practice? Assuming there is no problem with either, I would go with the JSON approach because it is easier to pass around all the data together as a file.
There isn't one specific right way to do what you're describing; there are potential pitfalls and performance trade-offs, but they depend on what you're doing with the data and why. Do you ever need access to the metadata for queries? You won't be able to do that efficiently if you pack everything into one JSON object. What will you parse the data with later? Does it have built-in support for JSON? Does it support something else? Is speed a consideration? Is cloud storage space a consideration? Does a user have the ability to input the HTML, and could they potentially perform some sort of attack? How do you use the data when you retrieve it? How stable is the format of the data? You could use JSON, Protocol Buffers, packed binary blobs in a length | value format, base64 with a delimiter, or zip files turned into binary blobs; do whatever suits your application and allows a clean, structured design that you can test and maintain.
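Purely to make the two options in the question concrete (not a recommendation for either), here is a rough Python sketch using the google-cloud-storage client; the bucket and object names are made up.
import base64
import json
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-example-bucket")  # hypothetical bucket name

html = "<div>html...</div>"
fields = {"name": "html chunk 1", "date": "01-01-1999", "location": "London, UK"}

# Option 1: the HTML is the object's contents, the strings are custom metadata.
blob = bucket.blob("chunks/chunk-1.html")
blob.metadata = fields
blob.upload_from_string(html, content_type="text/html")

# Option 2: everything in one JSON object, with the HTML base64-encoded.
doc = dict(fields, html=base64.b64encode(html.encode("utf-8")).decode("ascii"))
blob = bucket.blob("chunks/chunk-1.json")
blob.upload_from_string(json.dumps(doc), content_type="application/json")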

Swift FileHandle seek/readData performance

Context:
I have a project where I store a lot of data in binary files and data files. I retrieve offsets from a binary file, stored as UInt64, and each of these offsets gives me the position of a UTF-8 encoded string in another file.
I am attempting, given all the offsets, to reconstruct all the strings from the UTF-8 file. The file that holds all the strings has a size of exactly 20437 bytes / approx 177000 strings.
Assume I have already retrieved all the offsets and now need to rebuild each string, one at a time. I also have the length in bytes of every string.
Method 1:
I open a FileHandle on the UTF-8 encoded file, and for each offset I seek to it and perform a readData(ofLength:). The whole operation is very slow: more than 35 seconds.
Method 2:
I initialize a Data object with Data(contentsOf: URL).
Then I perform a Data.subdata(in: Range) for each string I want to build. The range starts at offset and ends at offset + size.
This loads the entire file into RAM but lets me retrieve the bytes I need for each string. It is much faster than the first option, though probably still not ideal.
How can I get the best performance for this particular task?
I recently went through a similar experience when caching/loading binary data to/from disk.
I'm not sure what the ultimate process is for best performance, but you can improve the performance of method 2 further still by using a "slice" of the Data object instead of data.subdata(). This is similar to using array slices.
This is probably because, instead of creating more Data objects holding copies of the original data, the data returned from the slice references the source Data object. This made a significant difference for me, as my source data was actually pretty large. You should profile both methods and see whether it makes a noticeable difference for you.
https://developer.apple.com/documentation/foundation/data/1779919-subscript
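The copy-versus-view distinction described above is not Swift-specific. As a loose analogy only (the thread is about Swift's Data; the file name and offsets below are invented), here is the same idea in Python, where a memoryview slice references the original buffer instead of copying it, much like a Data slice versus subdata(in:).
with open("strings.bin", "rb") as f:   # hypothetical file of concatenated UTF-8 strings
    buf = f.read()                     # one read of the whole file, like Data(contentsOf:)

view = memoryview(buf)                 # slicing a memoryview does not copy the buffer

def string_at(offset: int, size: int) -> str:
    chunk = view[offset:offset + size]  # zero-copy slice of the original buffer
    return str(chunk, "utf-8")          # decoding produces the final string

# offsets and sizes would come from the separate offsets file in the question
print(string_at(0, 5))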

PostgreSQL: store a gzip or JSON as text

I need to save a JSON document of about 20 MB (it includes some base64-encoded JPEG images).
Is there any performance advantage to saving it in a binary field, a JSON field, or a text field?
Any suggestions on how to save it?
The most efficient way to store this would be to extract the image data, base64-decode it, and store it in a bytea field. Then store the rest of the json in a json or text field. Doing that is likely to save you quite a bit of storage because you're storing the highly compressed JPEG data directly, rather than a base64-encoded version.
If you can't do that, or don't want to, you should just shove the whole lot into a json field. PostgreSQL will attempt to compress it, but base64 of a JPEG won't compress too wonderfully with the fast-but-not-very-powerful compression algorithm PostgreSQL uses. So it'll likely be significantly bigger.
There is no difference in storage terms between text and json. (jsonb, in 9.4, is different - it's optimised for fast access, rather than compact storage).
For example, if I take this 17.5MB JPEG, it's 18MB as bytea. Base64-encoded it's 24MB uncompressed. If I shove that into a json field with minimal json syntax wrapping it remains 24MB - which surprised me a little, I expected to save some small amount of storage with TOAST compression. Presumably it wasn't considered compressible enough.
(BTW, base64 encoded binary isn't legal as an unmodified json value as you must escape slashes)
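A rough sketch of the first suggestion, in Python with psycopg2; the table, column, and JSON key names are made up for illustration.
import base64
import json
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=example")  # hypothetical connection string
cur = conn.cursor()

# Hypothetical schema: the image as bytea, everything else as jsonb.
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id    serial PRIMARY KEY,
        image bytea,
        meta  jsonb
    )
""")

with open("payload.json") as f:                 # the ~20 MB JSON from the question
    doc = json.load(f)

raw_jpeg = base64.b64decode(doc.pop("image"))   # assume a top-level "image" key holds the base64 JPEG

cur.execute(
    "INSERT INTO documents (image, meta) VALUES (%s, %s)",
    (psycopg2.Binary(raw_jpeg), Json(doc)),     # raw bytes go in, not base64 text
)
conn.commit()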

PostgreSQL protocol data representation format specification?

I am reading the PostgreSQL protocol document. The document specifies the message flow and the containment format, but doesn't mention how actual data fields are encoded in text/binary.
For the text format, there's no mention at all. What does this mean? Should I just use SQL value expressions? Or is there some extra documentation for this? If it's just SQL value expressions, does this mean the server will parse them again?
And which part of the source code should I investigate to see how binary data is encoded?
Update
I read the manual again, and I found a mention of the text format. So there actually is a description of the text representation, and it was my fault for missing this paragraph:
The text representation of values is whatever strings are produced and
accepted by the input/output conversion functions for the particular
data type.
There are two possible data formats: text or binary. The default is the text format, which means the only transformation is the server <-> client encoding conversion (or nothing, when client and server use the same encoding). The text format is very simple, trivial even: all result data is transformed to human-readable text and sent to the client. Binary data such as bytea is transformed to human-readable text too (the hex or escape output format is used). The output side is simple; there is not much to describe in the docs.
postgres=# select current_date;
date
────────────
2013-10-27
(1 row)
In this case the server sends the string "2013-10-27" to the client. The first four bytes of the field are the length; the remaining bytes are the data.
Input is a little more difficult, because you may or may not separate the data from the query, depending on which API you use. With the simplest API, Postgres expects the SQL statement with the data embedded in it. More complex APIs expect the SQL statement and the data separately.
On the other hand, using the binary format is significantly more difficult, because each data type has its own specific format. Every PostgreSQL data type has two functions, send and recv, which write data to the output message stream and read data from the input message stream. Similar functions exist for casting to/from plain text (the out and in functions). Some client drivers are able to convert from the PostgreSQL binary format to host binary formats.
Some information:
libpq API http://www.postgresql.org/docs/9.3/static/libpq.html
you can look at the PostgreSQL source for the send/recv and out/in functions; see for example the date implementation in src/backend/utils/adt/date.c. The implementation of libpq in src/interfaces/libpq is interesting too
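To make the field layout described above concrete, here is a small Python sketch; it only mimics how a single DataRow field is laid out (a 4-byte length prefix followed by the payload) and does not speak the actual protocol.
import struct

def encode_text_field(value: str) -> bytes:
    # Text format: int32 length prefix, then the human-readable payload bytes.
    payload = value.encode("utf-8")
    return struct.pack("!i", len(payload)) + payload

def encode_binary_int4_field(value: int) -> bytes:
    # Binary format for an int4: length 4, then the value in network byte order.
    return struct.pack("!i", 4) + struct.pack("!i", value)

# The date example above: the server sends the text "2013-10-27".
print(encode_text_field("2013-10-27").hex(" "))   # 00 00 00 0a 32 30 31 33 ...
print(encode_binary_int4_field(42).hex(" "))      # 00 00 00 04 00 00 00 2a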
-
The things closest to a spec of a PostgreSQL binary format I could find were the documentation and the source code of the "libpqtypes" library. I know, a terrible state of the documentation for such a huge product.
The text representation of values is whatever strings are produced and
accepted by the input/output conversion functions for the particular
data type. In the transmitted representation, there is no trailing
null character; the frontend must add one to received values if it
wants to process them as C strings. (The text format does not allow
embedded nulls, by the way.)
Binary representations for integers use network byte order (most
significant byte first). For other data types consult the
documentation or source code to learn about the binary representation.
Keep in mind that binary representations for complex data types might
change across server versions; the text format is usually the more
portable choice.
(quoted from the documentation, link)
The binary protocol is therefore not stable across versions, so you should probably treat it as an implementation detail and not use the binary representation. The text representation is, AFAICT, just the format of literals in SQL queries.

How to convert binary data into string format in J2ME

I am using the gzip algorithm in J2ME. After compressing the string, I tried to send the compressed data as a text message, but the size increased drastically. So I used base64 encoding to convert the compressed binary to text, but after encoding the size still increases. Please suggest an encoding technique which, when used, keeps the data size the same.
I tried sending a binary SMS, but as its limit is 134 characters I want to compress the data before sending the SMS.
You have some competing requirements here.
The fact you're considering using SMS as a transport mechanism makes me suspect that the data you have to send is quite short to start with.
Compression algorithms (in general) work best with large amounts of data and can end up creating a longer output than you started with if you start with something very short.
There are very few useful encoding changes that will leave you with output the same length as when you started. (I'm struggling to think of anything really useful right now.)
You may want to consider alternative transmission methods or alternatives to the compression techniques you have tried.
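The expansion described above is easy to demonstrate. The question is about J2ME, but purely to illustrate the size arithmetic, here is a small Python sketch (the exact byte counts will vary a little with the input):
import base64
import gzip

msg = "Short SMS payload".encode("utf-8")

compressed = gzip.compress(msg)
encoded = base64.b64encode(compressed)

print(len(msg))         # 17 bytes of original text
print(len(compressed))  # larger than the input: gzip header/trailer overhead dominates short data
print(len(encoded))     # base64 then adds roughly another third on top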