As noted here, when a document (say a text or xml column with EXTENDED storage) exceeds roughly 2 kB, PostgreSQL stores it compressed.
For a table column that has been compressed this way, how can I retrieve the compressed (binary) form of the value?
NOTE - Typical applications:
Operations such as a "long-term checksum of the document", e.g. SHA256(compressed). PS: since this is a matter of convention, no complementary compression is needed; the condition is simply inherited: SHA256(less than 2 kB ? text : compressed).
Copying or downloading the compressed documents directly (without CPU cost). PS: for rows under 2 kB, the operation can be complemented with on-the-fly compression when uniformity is required.
If this is possible at all, it would require writing a function in C.
Instead of going that way, I would recommend that you use EXTERNAL rather than EXTENDED storage and compress the data yourself before you store them in the database. That way you don't waste any space, and you can decide when to retrieve the compressed data and when to uncompress them.
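As a minimal sketch of that approach in Python (assuming psycopg2 and a hypothetical docs table whose bytea column has been set to EXTERNAL storage), the application compresses the text itself, so the stored bytes are exactly what a long-term checksum like SHA256(compressed) is computed over, and they can be copied back out without any server-side decompression:

```python
import hashlib
import zlib

import psycopg2  # assumed client driver; any PostgreSQL driver works the same way

# Hypothetical schema:
#   CREATE TABLE docs (id bigint PRIMARY KEY, body bytea);
#   ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;  -- no TOAST compression

def store_document(conn, doc_id, text):
    """Compress client-side, store the compressed bytes, and return a
    SHA-256 checksum computed over exactly what was stored."""
    compressed = zlib.compress(text.encode("utf-8"))
    checksum = hashlib.sha256(compressed).hexdigest()
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO docs (id, body) VALUES (%s, %s)",
            (doc_id, psycopg2.Binary(compressed)),
        )
    conn.commit()
    return checksum

def fetch_document(conn, doc_id):
    """Pull the compressed bytes back out (no server-side CPU spent on
    decompression) and inflate them locally."""
    with conn.cursor() as cur:
        cur.execute("SELECT body FROM docs WHERE id = %s", (doc_id,))
        (compressed,) = cur.fetchone()
    return zlib.decompress(bytes(compressed)).decode("utf-8")
```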
Related
I need to store a lot of files (millions per day). On average, a file is about 20 KB. I also need to store some metadata for these files (date, source, classification, etc.), and I need to be able to access and retrieve the files according to queries on that metadata (no joins, only filtering with WHERE clauses). Writes must be fast; read times are not as important.
As far as I understand, I have 3 possible ways of storing data:
Use an RDBMS (e.g. PostgreSQL) to store the metadata and the file paths. Execute queries, then read the matching files from the file system.
Use only Cassandra (my company uses Cassandra). Store both metadata and file content in Cassandra.
Use Postgres + Cassandra together. Store metadata and Cassandra keys in Postgres, query Postgres to retrieve the Cassandra keys, then get the actual file content from Cassandra.
What are the advantages, disadvantages for these options? I am thinking I should go with option 2 but cannot be sure.
Thanks
It really depends on the size of your files. Storing large files in Cassandra generally isn't the best solution: you'd have to chunk the files at some point and store the content across separate columns using wide rows. In that case it would be better to use a distributed file system like Ceph.
But if the files are only 20 KB, the overhead of a distributed FS will not be worth it, and Cassandra will do a good job storing this amount of data in a single column as a blob. You just need to be aware of the memory footprint while working with those rows: each time you retrieve such a file from Cassandra, the whole content is loaded onto the heap unless you chunk it using a clustering key.
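A rough Python sketch of option 2, assuming the DataStax driver and a made-up filestore.files table; efficient querying by metadata would still need a suitable partition key or secondary indexes, which are not shown here:

```python
import uuid
from datetime import datetime, timezone

from cassandra.cluster import Cluster  # DataStax Python driver

# Hypothetical schema:
#   CREATE TABLE filestore.files (
#       id uuid PRIMARY KEY,
#       created_at timestamp,
#       source text,
#       classification text,
#       content blob
#   );

cluster = Cluster(["127.0.0.1"])        # assumed contact point
session = cluster.connect("filestore")  # assumed keyspace

insert = session.prepare(
    "INSERT INTO files (id, created_at, source, classification, content) "
    "VALUES (?, ?, ?, ?, ?)"
)

def store_file(data, source, classification):
    """Store a ~20 KB file as a single blob column alongside its metadata."""
    file_id = uuid.uuid4()
    session.execute(
        insert,
        (file_id, datetime.now(timezone.utc), source, classification, data),
    )
    return file_id
```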
I need to save a JSON document that is about 20 MB in size (it includes some base64-encoded JPEG images).
Is there any performance advantage to saving it in a binary field, a JSON field, or a text field?
Any suggestions on how to save it?
The most efficient way to store this would be to extract the image data, base64-decode it, and store it in a bytea field. Then store the rest of the json in a json or text field. Doing that is likely to save you quite a bit of storage because you're storing the highly compressed JPEG data directly, rather than a base64-encoded version.
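A rough sketch of that split in Python, assuming psycopg2, a hypothetical records table, and that the base64 JPEG sits under an "image" key in the document:

```python
import base64
import json

import psycopg2

# Hypothetical schema:
#   CREATE TABLE records (id bigint PRIMARY KEY, meta json, image bytea);

def store_record(conn, rec_id, raw_json):
    """Pull the base64 image out of the JSON, store it as raw bytes,
    and keep the remaining (much smaller) JSON alongside it."""
    doc = json.loads(raw_json)
    image_b64 = doc.pop("image")          # assumed key holding the base64 JPEG
    image_bytes = base64.b64decode(image_b64)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO records (id, meta, image) VALUES (%s, %s, %s)",
            (rec_id, json.dumps(doc), psycopg2.Binary(image_bytes)),
        )
    conn.commit()
```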
If you can't do that, or don't want to, you should just shove the whole lot in a json field. PostgreSQL will attempt to compress it, but base64 of a JPEG won't compress too wonderfully with the fast-but-not-very-powerful compression algorithm PostgreSQL uses. So it'll likely be significantly bigger.
There is no difference in storage terms between text and json. (jsonb, in 9.4, is different - it's optimised for fast access, rather than compact storage).
For example, if I take this 17.5MB JPEG, it's 18MB as bytea. Base64-encoded it's 24MB uncompressed. If I shove that into a json field with minimal json syntax wrapping it remains 24MB - which surprised me a little, I expected to save some small amount of storage with TOAST compression. Presumably it wasn't considered compressible enough.
(BTW, base64 encoded binary isn't legal as an unmodified json value as you must escape slashes)
I have a 215 MB CSV file which I have parsed and stored in Core Data, wrapped in my own custom objects. The problem is that my Core Data SQLite file is around 260 MB. The CSV file contains about 4.5 million lines of data on my city's transit system (bus stops, times, routes, etc.).
I have tried modifying attributes so that arrays of strings representing stop times are stored instead as NSData, but for some reason the file size still remains at around 260 MB.
I can't ship an app this size. I doubt anyone would want to download a 260MB app even if it means they have the whole city's transit schedule on it.
Are there any ways to compress or minimize the storage space used (even if it means not using core data, I am willing to hear suggestions)?
EDIT: I just want to provide an update right now because I have been staring at the file size in disbelief. With some clever manipulation involving strings, indexing and database normalization in general, I have managed to reduce the size down to 6.5MB or 2.6MB when compressed. About 105,000 objects stored in Core Data containing the full details of the city's transit system. I'm almost in tears right now D':
Unless your original CSV is encoded in a really foolish manner, it seems unlikely that the size will get below 100 MB, no matter how much you compress it. That's still really large for an app. The solution is to move your data to a web service. You may want to download and cache significant parts, but if you're talking about millions of records, then fetching from a server seems best. Besides, I have to believe that the transit system changes from time to time, and it would be frustrating to have to upgrade a many-tens-of-MB app every time there was a single stop adjustment.
Having said that, there are some things you may consider:
Move booleans into bit fields. You can put 64 booleans into an NSUInteger. (And don't use a full 64-bit integer if you just need 8 bits; store the smallest thing you can.) See the packing sketch after this list.
Compress how you store times. There are only 1440 minutes in a day; you can store that in 2 bytes. Transit times are generally not to the second, so they don't need a CGFloat.
Days of the week and dates can similarly be compressed.
Obviously you should normalize any strings. Look at the CSV for duplicated string values on many lines.
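To make the packing concrete, here is a small sketch (in Python for brevity; the same layout can be produced with struct packing into NSData on iOS): booleans go into a single integer, and departure times are stored as 2-byte minutes-since-midnight values.

```python
import struct

def pack_flags(flags):
    """Pack up to 64 booleans into one unsigned integer, one bit each."""
    word = 0
    for i, flag in enumerate(flags):
        if flag:
            word |= 1 << i
    return word

def pack_times(minutes_since_midnight):
    """Pack departure times (0..1439 minutes) as 2 bytes each, little-endian."""
    return struct.pack("<%dH" % len(minutes_since_midnight), *minutes_since_midnight)

def unpack_times(blob):
    return list(struct.unpack("<%dH" % (len(blob) // 2), blob))

# Four departure times take 8 bytes instead of four strings or floats.
packed = pack_times([360, 375, 391, 408])   # 06:00, 06:15, 06:31, 06:48
assert unpack_times(packed) == [360, 375, 391, 408]
```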
I generally would recommend raw SQLite rather than Core Data for this kind of problem. Core Data is more about object persistence than raw data storage. The fact that you're seeing a 20% bloat over CSV (which is not itself highly efficient) is not a good direction for this problem.
If you want to get even tighter, and don't need very good searching capabilities, you can create packed data blobs. I used to do this on phone switches where memory was extremely tight. You create a bit field struct and allocate 5 bits for one variable, and 7 bits for another, etc. With that, and some time shuffling things so they line up correctly on word boundaries, you can get pretty tight.
Since you care most about your initial download size, and may be willing to expand your data later for faster access, you can consider very domain-specific compression. For example, in the list above I mentioned how to get down to 2 bytes for a time. You could probably get down to 1 byte in many cases by storing each time as the delta in minutes since the previous time (since most of your times will be increasing by fairly small steps if they're bus and train schedules). Abandoning the database, you could create a very tightly encoded data file that you extract into a database on first launch.
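As an illustrative sketch of that delta idea (again in Python): the first time takes 2 bytes and each subsequent time takes 1 byte, assuming consecutive departures are never more than 255 minutes apart.

```python
def delta_encode_times(minutes):
    """First time as 2 bytes, then each later time as a 1-byte delta
    from the previous one (assumes gaps of at most 255 minutes)."""
    out = bytearray(minutes[0].to_bytes(2, "little"))
    for prev, cur in zip(minutes, minutes[1:]):
        delta = cur - prev
        if not 0 <= delta <= 255:
            raise ValueError("delta out of range; fall back to 2-byte encoding")
        out.append(delta)
    return bytes(out)

def delta_decode_times(blob):
    times = [int.from_bytes(blob[:2], "little")]
    for delta in blob[2:]:
        times.append(times[-1] + delta)
    return times

encoded = delta_encode_times([360, 375, 391, 408])
assert delta_decode_times(encoded) == [360, 375, 391, 408]
assert len(encoded) == 5   # 5 bytes for four times instead of 8
```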
You also can use domain-specific knowledge to encode your strings into smaller tokens. If I were encoding the NY subway system, I would notice that some strings show up a lot, like "Avenue", "Road", "Street", "East", etc. I'd probably encode those as unprintable ASCII like ^A, ^R, ^S, ^E, etc. I'd probably encode "138 Street" as two bytes (0x8A13). This of course is based on my knowledge that the byte 0x8a never shows up in NY subway stop names. It's not a general solution (in Paris it might be a problem), but it can be used to highly compress data that you have special knowledge of. In a city like Washington DC, I believe their highest numbered street is 38th St, and then there's a 4-value direction. So you can encode that in two bytes: first a "numbered street" token, and then a bit field with 2 bits for the quadrant and 6 bits for the street number. This kind of thinking can potentially significantly shrink your data size.
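A toy version of that tokenization, with a made-up token table and the assumption that bytes 0x01-0x1F never appear in real stop names:

```python
# Hypothetical token table: control bytes stand in for frequent words.
TOKENS = {
    "Avenue": b"\x01",
    "Street": b"\x02",
    "Road":   b"\x03",
    "East":   b"\x04",
    "West":   b"\x05",
}

def tokenize(name):
    data = name.encode("ascii")
    for word, token in TOKENS.items():
        data = data.replace(word.encode("ascii"), token)
    return data

def detokenize(data):
    for word, token in TOKENS.items():
        data = data.replace(token, word.encode("ascii"))
    return data.decode("ascii")

assert tokenize("138 Street") == b"138 \x02"            # 5 bytes instead of 10
assert detokenize(tokenize("East 138 Street")) == "East 138 Street"
```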
You might be able to perform some database normalization.
Look for anything that might be redundant, or the same values being stored in multiple rows. You will probably need to restructure your database so these duplicate values (if any) are stored in separate tables and then referenced from their original row by means of IDs.
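For example, a sketch of that kind of normalization using raw SQLite from Python, with made-up routes and stop_times tables: each distinct route name is stored once and every stop-time row references it by an integer id.

```python
import sqlite3

conn = sqlite3.connect("transit.sqlite")
cur = conn.cursor()

# Instead of repeating the route name on millions of stop-time rows,
# store each distinct name once and reference it by integer id.
cur.executescript("""
CREATE TABLE IF NOT EXISTS routes (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS stop_times (
    stop_id  INTEGER NOT NULL,
    route_id INTEGER NOT NULL REFERENCES routes(id),
    minutes  INTEGER NOT NULL   -- minutes since midnight
);
""")

def route_id(name):
    cur.execute("INSERT OR IGNORE INTO routes (name) VALUES (?)", (name,))
    cur.execute("SELECT id FROM routes WHERE name = ?", (name,))
    return cur.fetchone()[0]

cur.execute(
    "INSERT INTO stop_times (stop_id, route_id, minutes) VALUES (?, ?, ?)",
    (42, route_id("Main Street Express"), 375),
)
conn.commit()
```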
How big is the sqlite file compressed? If it's satisfactorily small, the simplest thing would be to ship it compressed, then uncompress it to NSCachesDirectory.
My SQLite file has a size of 7 MB. I want to reduce its size. How can I do that? When I simply compress it, it comes to only around 1.2 MB. Can I compress my mydb.sqlite to a zip file? If that is not possible, is there any other way to reduce the size of my SQLite file?
It is possible to compress it beforehand, but that is largely redundant. You will compress your binary before distribution, Apple distributes your app through the store compressed, and compressing an already-compressed file is fruitless. Thus, any work you do to compress beforehand should not have much of an effect on the resulting size of your application.
Without details of what you are storing in the DB it's hard to give specific advice. The usual generics of DB design will apply: normalise your database. For example:
Reduce/remove repeating data. If you have text/data that is repeated, store it once and use a key to reference it.
If you are storing large chunks of data, you might be able to zip and unzip these in and out of the database in your app code rather than trying to zip the DB (see the sketch below).
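A small sketch of that second point using Python's sqlite3 and zlib with a made-up chunks table; the same pattern applies with NSData and a compression routine in app code.

```python
import sqlite3
import zlib

conn = sqlite3.connect("mydb.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, payload BLOB)")

def store_chunk(chunk_id, text):
    """Compress a large chunk of text/data before it goes into the row."""
    conn.execute(
        "INSERT OR REPLACE INTO chunks (id, payload) VALUES (?, ?)",
        (chunk_id, zlib.compress(text.encode("utf-8"))),
    )
    conn.commit()

def load_chunk(chunk_id):
    """Read the compressed blob back and inflate it in app code."""
    (payload,) = conn.execute(
        "SELECT payload FROM chunks WHERE id = ?", (chunk_id,)
    ).fetchone()
    return zlib.decompress(payload).decode("utf-8")
```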
QUESTION:
Is it better to send large data blobs in JSON for simplicity, or send them as binary data over a separate connection?
If the former, can you offer tips on how to optimize the JSON to minimize size?
If the latter, is it worth it to logically connect the JSON data to the binary data using an identifier that appears in both, e.g., as "data" : "< unique identifier >" in the JSON and with the first bytes of the data blob being < unique identifier > ?
CONTEXT:
My iPhone application needs to receive JSON data over the 3G network. This means that I need to think seriously about efficiency of data transfer, as well as the load on the CPU.
Most of the data transfers will be relatively small packets of text data for which JSON is a natural format and for which there is no point in worrying much about efficiency.
However, some of the most critical transfers will be big blobs of binary data -- definitely at least 100 kilobytes of data, and possibly closer to 1 megabyte as customers accumulate a longer history with the product. (Note: I will be caching what I can on the iPhone itself, but the data still has to be transferred at least once.) It is NOT streaming data.
I will probably use a third-party JSON SDK -- the one I am using during development is here.
Thanks
You could try to compress the JSON (gzip, perhaps) before you send it and then uncompress it on the client side.
But I'm not sure how that affects iPhone performance.
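For example, a server-side sketch in Python: serialize to compact JSON (no extra whitespace) and gzip the result. If the server also sets Content-Encoding: gzip, NSURLConnection/NSURLSession should decompress the response transparently on the iPhone; the decompression cost is typically small compared with the 3G transfer time saved, though that is worth measuring on the device.

```python
import gzip
import json

def encode_payload(obj):
    """Serialize to compact JSON (no extra whitespace) and gzip it."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)

def decode_payload(blob):
    return json.loads(gzip.decompress(blob).decode("utf-8"))

# Rough size check on a made-up history payload.
payload = {"history": [{"t": i, "v": i * 0.5} for i in range(10000)]}
blob = encode_payload(payload)
print(len(json.dumps(payload)), "bytes as plain JSON ->", len(blob), "bytes gzipped")
```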