PostgreSQL: store a gzip or JSON as text

I need to save a JSON document of about 20 MB (it includes some base64-encoded JPEG images).
Is there any advantage in performance if I save it in a binary field, a JSON field, or a text field?
Any suggestions on how to save it?

The most efficient way to store this would be to extract the image data, base64-decode it, and store it in a bytea field. Then store the rest of the json in a json or text field. Doing that is likely to save you quite a bit of storage because you're storing the highly compressed JPEG data directly, rather than a base64-encoded version.
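A minimal sketch of that split, assuming the incoming JSON holds the image under a hypothetical "photo" key and a table with a bytea photo column plus a json meta column:

import base64
import json

import psycopg2

def store_document(conn, doc):
    # Pull the base64 image out and decode it back to raw, already-compressed JPEG bytes
    photo_bytes = base64.b64decode(doc.pop("photo"))
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO documents (photo, meta) VALUES (%s, %s)",
            (psycopg2.Binary(photo_bytes), json.dumps(doc)),  # bytea + json
        )

conn = psycopg2.connect("dbname=mydb")  # connection details are placeholders
store_document(conn, json.load(open("payload.json")))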
If you can't do that, or don't want to, you should just shove the whole lot into a json field. PostgreSQL will attempt to compress it, but base64 of a JPEG won't compress well with the fast-but-not-very-powerful compression algorithm PostgreSQL uses, so it'll likely be significantly bigger.
There is no difference in storage terms between text and json. (jsonb, in 9.4, is different - it's optimised for fast access, rather than compact storage).
For example, if I take this 17.5MB JPEG, it's 18MB as bytea. Base64-encoded, it's 24MB uncompressed. If I shove that into a json field with minimal JSON syntax wrapping it, it remains 24MB, which surprised me a little: I expected to save some small amount of storage with TOAST compression. Presumably it wasn't considered compressible enough.
(BTW, base64-encoded binary is legal inside a JSON string as-is: JSON allows, but does not require, escaping the forward slashes base64 can produce)
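If you want to check what TOAST does with your own data, pg_column_size reports the stored size of each representation. A quick sketch with psycopg2 (table and column names are hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=mydb")
with conn.cursor() as cur:
    # Compare on-disk sizes of the same payload stored three ways
    cur.execute("""
        SELECT pg_column_size(raw_bytes) AS bytea_size,
               pg_column_size(b64_text)  AS text_size,
               pg_column_size(json_doc)  AS json_size
        FROM samples
    """)
    print(cur.fetchone())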

Related

How best to store HTML alongside strings in Cloud Storage

I have a collection of data where each item contains a chunk of HTML and a few strings, for example:
html: <div>html...</div>, name string: html chunk 1, date string: 01-01-1999, location string: London, UK. I would like to store this information together as a single cloud storage object. Specifically, I am using Google Cloud Storage. There are two ways I can think of doing this. One is to store the strings as custom metadata and the HTML as the actual file contents. The other is to store all the information as a JSON file, with the HTML as a base64-encoded string.
I want to avoid a situation where after having stored a lot of data, I find there is some limitation to the approach I am using. What is the proper way to do this - is either of these approaches bad practice? Assuming there is no problem with either, I would go with the JSON approach because it is easier to pass around all the data together as a file.
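For concreteness, the custom-metadata variant I have in mind would look something like this with the google-cloud-storage Python client (bucket and object names are placeholders):

from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("chunks/html-chunk-1")
# The strings ride along as custom metadata; the HTML is the object body itself
blob.metadata = {"name": "html chunk 1", "date": "01-01-1999", "location": "London, UK"}
blob.upload_from_string("<div>html...</div>", content_type="text/html")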
There isn't one specific right way to do what you're talking about; there are potential pitfalls and performance criteria, but they depend on what you're doing with the data and why:
Do you ever need access to the metadata for queries? You won't be able to do that efficiently if you pack everything into one variable as a JSON object.
What are you parsing the data with later? Does it have built-in support for JSON? Does it support something else?
Is speed a consideration? Is cloud storage space a consideration?
Does a user have the ability to input the HTML, and could they potentially perform some sort of attack?
How do you use the data when you retrieve it? How stable is the format of the data?
You could use JSON, Protocol Buffers, packed binary blobs in a length|value format (sketched below), base64 with a delimiter, or zip files turned into binary blobs; do what suits your application and allows a clean, structured design that you can test and maintain.
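For instance, the length|value option is only a few lines; this sketch uses a 4-byte big-endian length prefix per field, with the field order fixed by convention on both sides:

import struct

def pack(fields):
    # Each field: 4-byte big-endian length, then the raw bytes
    return b"".join(struct.pack(">I", len(f)) + f for f in fields)

def unpack(blob):
    fields, i = [], 0
    while i < len(blob):
        (n,) = struct.unpack_from(">I", blob, i)
        fields.append(blob[i + 4 : i + 4 + n])
        i += 4 + n
    return fields

packed = pack([b"<div>html...</div>", b"html chunk 1", b"01-01-1999", b"London, UK"])
assert unpack(packed)[3] == b"London, UK"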

Using the CBOR format instead of JSON in the Elasticsearch ingest plugin

In the documentation of the Ingest Attachment Processor Plugin in Elasticsearch, it is mentioned: "If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will skip the base64 decoding then." Could anyone please shed some light on this, or maybe share an example of how to achieve it? I need to index a very large number of documents of significant size, so I need to minimise latency.
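The documentation is terse here, so the following is only a sketch of my understanding: encode the whole document as CBOR (whose byte-string type carries the file directly, with no base64 step), send it with an application/cbor content type, and point at an ingest pipeline whose attachment processor reads the binary field. The pipeline name ("attachment"), index, and field names below are assumptions:

import cbor2
import requests

with open("report.pdf", "rb") as f:
    raw = f.read()                                  # raw file bytes, never base64-encoded

doc = {"filename": "report.pdf", "data": raw}       # CBOR encodes bytes natively

resp = requests.put(
    "http://localhost:9200/docs/_doc/1?pipeline=attachment",
    data=cbor2.dumps(doc),
    headers={"Content-Type": "application/cbor"},
)
print(resp.json())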

Retrieve the compressed (binary format) of the big column

As noted here, when storing documents (say, text or xml datatypes with EXTENDED storage) larger than about 2 kB, PostgreSQL compresses them.
For table columns that were compressed this way, how can I retrieve the compressed (binary) form of the column?
NOTE - Typical applications:
Operations such as a long-term checksum of the document, e.g. SHA256(compressed). PS: as this is a matter of convention, no further compression is needed; the condition is inherited, i.e. SHA256(less than 2k ? text : compressed).
Copying or downloading compressed documents directly (without CPU cost). PS: the complementary operation (for rows under 2 kB) is on-the-fly compression, when uniformity is required.
If this is possible at all, it would require writing a function in C.
Instead of going that way, I would recommend that you use EXTERNAL rather than EXTENDED storage and compress the data yourself before storing it in the database. That way you don't waste any space, and you can decide when to retrieve the compressed data and when to decompress it.
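A minimal sketch of that approach, assuming a hypothetical documents table; the body column is set to EXTERNAL so PostgreSQL stores the already-compressed bytes without trying to recompress them:

import hashlib
import zlib

import psycopg2

conn = psycopg2.connect("dbname=mydb")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id   serial PRIMARY KEY,
            body bytea   -- holds zlib-compressed text
        )
    """)
    cur.execute("ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL")

    text = open("document.xml", "rb").read()
    compressed = zlib.compress(text, 9)
    # Long-term checksum taken over the compressed form, as in the question
    checksum = hashlib.sha256(compressed).hexdigest()
    cur.execute("INSERT INTO documents (body) VALUES (%s)",
                (psycopg2.Binary(compressed),))
conn.close()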

How to store an image in Postgres using a Flask model

I am trying to store an image using a Flask model. I don't know how to store the image in Postgres, so I have encoded the image to base64 and I am trying to store the resulting text in Postgres. It is working, but is there any recommended way to store that encoded text, or the image itself, in Postgres using a Flask model?
class User_tbl(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    mobile = db.Column(db.String(13), unique=True)
    country = db.Column(db.String(30))
    image = db.Column(db.String(256))

    def __init__(self, mobile, country, image):
        self.mobile = mobile
        self.country = country
        self.image = image
I know that maybe it's too late to answer this question, but these days I was trying to solve something similar, and none of the proposed solutions seemed to shed light on the main problem.
Of course, any best practice rests on your needs. In general terms, however, you will find that embedding a file in the database is not good practice. Well, it depends.
Reading the "Storing Binary files in the Database" page on the PostgreSQL wiki, I discovered that there are some circumstances in which this practice is instead highly recommended, for instance when the files must be ACID.
In those cases, at least in Postgres, the bytea datatype is to be preferred over text or BLOB binary, sometimes at the cost of somewhat higher memory requirements for the server.
In this case:
1) You don't need special SQLAlchemy dialects. The LargeBinary datatype will suffice, since it is translated as a "large and/or unlengthed binary type for the target platform".
2) You don't need any encode/decode functions in PostgreSQL, at least in this specific case.
3) As I said before, it is not always a good strategy to save files into the filesystem. In any case, do not use a text data type with base64 encoding: your data will be inflated by roughly 33%, with a correspondingly large storage impact, whereas bytea does not have that drawback.
Thus, I propose these changes to your model:
class User_tbl(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    mobile = db.Column(db.String(13), unique=True)
    country = db.Column(db.String(30))
    image = db.Column(db.LargeBinary)
Then you can save files into Postgres simply by passing your FileStorage parameter as a binary:
image = request.files['fileimg'].read()
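A hypothetical upload route tying this together ("fileimg" is the form field name from the snippet above; app, db, and User_tbl are the objects defined earlier):

from flask import request

@app.route("/users", methods=["POST"])
def create_user():
    user = User_tbl(
        mobile=request.form["mobile"],
        country=request.form["country"],
        image=request.files["fileimg"].read(),  # raw bytes into LargeBinary/bytea
    )
    db.session.add(user)
    db.session.commit()
    return "ok", 201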
It would be far easier to avoid all of this encoding and decoding and simply save it as a binary blob. In which case, use a sqlalchemy.dialects.postgresql.BYTEA column.
I know of the encode and decode functions in PostgreSQL for dealing with base64 data, see:
https://www.postgresql.org/docs/current/static/functions-string.html
(encode/decode)
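If you do keep base64 text in a column, the decoding can be pushed into the query itself; a small sketch (table and column names hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=mydb")
with conn.cursor() as cur:
    # decode(..., 'base64') turns the stored text back into bytea server-side
    cur.execute("SELECT decode(image, 'base64') FROM user_tbl WHERE id = %s", (1,))
    raw_jpeg = bytes(cur.fetchone()[0])   # ready to write to disk or serve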
The recommended way to store an image in Postgres via Flask is to store the image in your static folder (where you store JavaScript & CSS files) and serve it via a web server, i.e. nginx, which will do that more efficiently than Flask. You should only store the path to the image in Postgres, and keep the actual image on the file system.
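A sketch of that pattern, reusing the model from above with image holding only a relative path (directory and field names are illustrative):

import os
from flask import request
from werkzeug.utils import secure_filename

@app.route("/upload", methods=["POST"])
def upload():
    f = request.files["fileimg"]
    fname = secure_filename(f.filename)
    dest = os.path.join(app.static_folder, "uploads")
    os.makedirs(dest, exist_ok=True)
    f.save(os.path.join(dest, fname))                  # file lives on disk
    user = User_tbl(mobile=request.form["mobile"],
                    country=request.form["country"],
                    image=f"uploads/{fname}")          # DB stores the path only
    db.session.add(user)
    db.session.commit()
    return "ok", 201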

Compressing ASCII data to fit within a UTF-32 API?

I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress & obfuscate (or encrypt) the string values that will be persisted in Unicode.
My desire is to either compress this schema data, or to encrypt it from prying eyes. I don't think it's possible to do both well.
Considering that I want to restrict my source data to valid, printable ASCII; how can I "compress" that original string value into a value that is either smaller, obfuscated, or both?
Here is how I imagine this working (though you may have a better way):
1. This source code will take a given String as input
2. The bytes representation of that string will be taken (UTF-8, ASCII, you decide)
3. Some magic happens - this is the part I need your help on
4. The resulting bytes will be converted into an int or long (no decimal points)
5. The number will be converted into a corresponding character using this utility:
http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651
(note that the utility will be used to enforce the constraint that the "final" Unicode name must not include the characters '/', '\', '#', '?' or '%')
Background
Microsoft Azure Table storage has an API that accepts Unicode data for the property names. It is a schema-free database (so columns can be created ad hoc), and therefore the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.
In addition, I'm working on a utility that dynamically encrypts/decrypts Azure Table Data, but the schema is unencrypted. I'd like to mask or obfuscate this header information somehow.
These are just some ideas.
Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also, before compressing and/or encrypting, store the data by packing the bits so they fit into fewer bytes.
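Packing printable ASCII 7 bits at a time is a simple transform: eight characters fit in seven bytes, a flat 12.5% saving before any compression or encryption. A sketch:

def pack7(s):
    bits, nbits, out = 0, 0, bytearray()
    for ch in s.encode("ascii"):          # each character uses only 7 bits
        bits = (bits << 7) | ch
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)

print(len("hello, world!"), len(pack7("hello, world!")))   # 13 -> 12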
If you can use UTF-32, UTF-8, and so on in step 5, you have access to all the characters in the Unicode Standard, up to 0x10FFFD, with some exceptions: for example, some code points, such as 0xFFFF, are noncharacters in the Unicode Standard, and others, such as the surrogate 0xD800, are not valid as standalone characters.