What is the maximum string length I can store in GoInstant? - goinstant

Rather than deal with a lot of sub-keys, I'd like to store a large JSON structure. Will this cause the world to end?

A string can contain any kind of data, as long as it's less than 32 KB in length, which, for reference, is approximately 16 pages of text!
The official docs have this and more information about Keys: https://developers.goinstant.com/v1/javascript_api/key/index.html
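If it helps, here is a rough sketch in Scala (not a GoInstant API call, and assuming the limit is measured in bytes of the serialized string) of checking a JSON payload against that 32 KB limit before writing it to a single key:

object KeyPayloadCheck {
  // The documented per-string limit from the answer above.
  val MaxBytes: Int = 32 * 1024

  def fitsInOneKey(json: String): Boolean =
    json.getBytes("UTF-8").length < MaxBytes

  def main(args: Array[String]): Unit = {
    val payload = """{"album":"holiday","photos":["a.jpg","b.jpg"]}"""
    println(fitsInOneKey(payload)) // true for this small example
  }
}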

Related

Marc21 Binary Decoder with Akka-Stream

I'm trying to decode MARC21 binary data records, which have the following specification concerning the field that provides the length of the record:
A Computer-generated, five-character number equal to the length of the
entire record, including itself and the record terminator. The number
is right justified and unused positions contain zeros.
I am trying to use Akka Stream's Framing.lengthField, but I just don't know how to specify the size of that field. I imagine that a character is 8 bits, maybe 16 for a number; I am not sure, and I wonder if that depends on the platform or language. In short, the question is: is it possible to say what the size of that field is, knowing that I am in Scala/Java?
Also, what does this mean:
"The number is right justified and unused positions contain zeros"
Does that have implications for how one reads the value, if it is collected properly?
If anyone knows anything about this, please share.
EDIT1
Context:
I am trying to build a stream-processing graph where the first stage processes the result of a sys command run against a Symphony (vendor cataloging system) server: a stream of unstructured byte chunks which, as a whole, represents all the MARC21 records requested (full dump or partial dump).
By processing I mean chunking that unstructured stream of bytes into a stream of frames, where the frames are the records.
In other words, reading the bytes for one record at a time and emitting each record individually to the next stage.
The next stage will consist of emitting that record (bytes) to Apache Kafka.
Obviously the emission stage would be fully parallelized to speed up the process.
The Symphony server does not have the capability to stream a dump when requested, especially over the network. Hence this Akka Streams based processing graph to perform that work, for fast ingestion/production and overall streaming processing of our dumps in our fast-data infrastructure.
EDIT2
Based on @badcook's input, I wonder if computeFrameSize could be used here. I'm not sure; I am slightly confused by the function and what it takes as parameters.
A little clarification would be much appreciated.
It looks like you're trying to parse MARC 21 records.
In that case I would recommend you just take a look at MARC4J and use that.
If you want to integrate it with Akka Streams, or even if you want to parse MARC records your own way, I would recommend breaking up your byte stream with Framing.delimiter, using the MARC 21 record terminator (ASCII control character 0x1D), into complete MARC records, rather than trying to stream and work with fragments of MARC records. It'll be a lot easier.
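Here is a minimal sketch of that framing stage (assuming Akka 2.6+, where the implicit ActorSystem provides the materializer; the source and the 100,000-byte maximum record size are placeholders):

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Framing, Source}
import akka.util.ByteString

object MarcFraming {
  def main(args: Array[String]): Unit = {
    // Akka 2.6+: the implicit ActorSystem also provides the stream materializer.
    implicit val system: ActorSystem = ActorSystem("marc-framing")

    // Placeholder for the real unstructured byte source (e.g. the sys-command output).
    val rawBytes = Source.single(ByteString("...record bytes...", "UTF-8"))

    // Split the byte stream into one complete MARC 21 record per frame,
    // using the record terminator 0x1D as the delimiter.
    val records = rawBytes.via(
      Framing.delimiter(
        delimiter = ByteString(0x1d.toByte),
        maximumFrameLength = 100000,   // assumed upper bound on a single record's size
        allowTruncation = true         // tolerate a trailing chunk without a terminator
      )
    )

    // Each emitted element is now one record's bytes, ready for the Kafka stage.
    records.runForeach(record => println(s"framed record of ${record.length} bytes"))
  }
}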
As for your specific questions: the MARC 21 specification uses characters rather than raw bytes when talking about its structure. It specifies two character encodings into raw bytes, UTF-8 and MARC-8, both of which are variable-width encodings. Hence, no, it is not true that every character is one byte; there is no single answer to how many bytes a character takes up.
"[R]ight justified and unused positions contain zeroes" is another way of saying that numbers are padded from the left with 0s. In this case this line comes from a larger quote staying that the numerical string must be 5 characters long. That means if you are trying to represent the number 1, you must represent it as 00001.

Is there a way to crack a 129 character length hash?

I was given a hash 129 characters in length (probably Whirlpool or SHA-512) to crack. I have tried cracking it using the whirlpooldeep and hashcat tools (without wordlists, i.e. without a dictionary attack) but with no success. I'm sorry for the ambiguous question, but this is all the information I have about my current task. Any suggestion would be gladly appreciated.
I assume this task is feasible because whoever gave you the hash prepared it on purpose in a way that gives you a chance to crack it. If this is not the case, you have no chance at current computer performance, nor will you within the next few decades.
So, assuming it is feasible, e.g. because the hash is of some short data like less than 10 bytes:
You are probably making a mistake in how you compare the hashes. The hash should not be 129 characters long; it should be 128. That you see a file with 129 bytes is probably because there is a line feed (\n) at the end of the file.
Moreover, for performance, you should not compare the hashes in hexadecimal format (length 128) but in binary format (length 64).
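As a quick sanity check (plain JDK, shown here with SHA-512; Whirlpool also produces a 512-bit digest), the raw digest is 64 bytes and its hex encoding is 128 characters, so a 129-byte file is almost certainly the hex form plus a trailing newline:

import java.security.MessageDigest

object DigestLength {
  def main(args: Array[String]): Unit = {
    val raw = MessageDigest.getInstance("SHA-512").digest("example".getBytes("UTF-8"))
    val hex = raw.map("%02x".format(_)).mkString

    println(raw.length)  // 64 raw bytes (512 bits)
    println(hex.length)  // 128 hex characters; a trailing '\n' would make a 129-byte file
  }
}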
A rainbow table seems to be the perfect tool for your scenario.
There is software called RainbowCrack for this sort of hash reversal. There should be rainbow tables available for SHA-512, which should be, roughly estimated, about 1 TB in size (for finding original data of length up to 9 bytes).
Using this kind of hash reversal is likely to take you a day or two to get into the matter and understand it, and, if it's feasible, after downloading the 1 TB you are likely to get the result in less than a day.
Before you start all this, I would ask the source of your hash how long the data is that you are searching for. This could easily end up being a kid's joke, e.g. your source hashed 512 bits of random data, in which case you would have absolutely no chance and would just waste your time.

Encoding scheme for large data sets

Assume there is a secure transport layer that securely transfers files. What if we want to transfer multiple files over this channel in one round? They should be encoded into the same file, so that when Bob receives that file he is able to decode it and see the multiple files (e.g. photo albums). I think ASN.1 is good for small data sets (e.g. text certificates), but for large data sets such an encoding scheme can increase the ciphertext size.
My question is: what encoding rule do you recommend for large data sets? It must be secure (well audited, so as to be exploit-free) and efficient (it must not significantly increase the size at large scale).
ASN.1 is actually an efficient encoding. If you look at DER encoding of a long sequence of bytes (an OCTET STRING), you will see something like this:
04 [len] [the data bytes]
where 04 is the single tag byte (identifying an OCTET STRING), and len is the length of the data expressed in a reasonably compact format. The data bytes then follow, as is. The overhead for a long stream of n bytes is 2+ceil(log2(n)/8) bytes; in other words, for a stream of up to, say, 256 terabytes, the overhead will be at most eight bytes.
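For illustration, here is a rough Scala sketch (not a full ASN.1 library) of DER-encoding an OCTET STRING, showing where that small overhead comes from:

object DerOctetString {
  // DER definite-length encoding: short form for lengths < 128, long form otherwise.
  def encodeLength(n: Int): Array[Byte] =
    if (n < 0x80) Array(n.toByte)
    else {
      val digits = BigInt(n).toByteArray.dropWhile(_ == 0x00.toByte)
      (0x80 | digits.length).toByte +: digits
    }

  // OCTET STRING: tag byte 0x04, then the length, then the raw data as-is.
  def encode(data: Array[Byte]): Array[Byte] =
    Array(0x04.toByte) ++ encodeLength(data.length) ++ data

  def main(args: Array[String]): Unit = {
    val data = Array.fill[Byte](1000)(0x42.toByte)
    val encoded = encode(data)
    println(encoded.length - data.length)  // 4 bytes of overhead for a 1000-byte payload
  }
}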
One quirk of DER is that you have to know the total length of the data before beginning to encode it. However, the BER encoding (of which DER is just a subset) can come to the rescue: a value can be split into successive chunks, without needing a prior notion of the number of chunks. Overhead per chunk will be minimal (say 4 bytes for chunks up to 64 kilobytes), allowing for a total overhead of less than 0.01%. You will have trouble doing better.
If you want to split that big stream into several internal "files" then you will need some convention, and, there again, ASN.1 is efficient:
BigStream ::= SEQUENCE OF SEQUENCE {
    fileName UTF8String,
    fileData OCTET STRING
}
One general header for the SEQUENCE OF (less than 8 bytes); then, for each file, the overhead beyond the data itself and the name (a UTF-8 string) will be less than 20 bytes (usually substantially less) if you use DER. For huge files whose length is not easily computable in advance, BER encoding can be used, with an overhead which can be lowered to 0.01%, as explained above.
If you are short on size, and the data is "normal data" (with some structure), consider applying compression. Most programming frameworks can use zlib or already offer a zlib-compatible implementation (e.g. Java and C#/.NET both have one).
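For example, here is a brief Scala sketch using the JDK's built-in Deflater, one of the zlib-compatible implementations mentioned above:

import java.io.ByteArrayOutputStream
import java.util.zip.Deflater

object ZlibSketch {
  def compress(data: Array[Byte]): Array[Byte] = {
    val deflater = new Deflater()
    deflater.setInput(data)
    deflater.finish()

    val out = new ByteArrayOutputStream()
    val buffer = new Array[Byte](4096)
    while (!deflater.finished()) {
      val written = deflater.deflate(buffer)
      out.write(buffer, 0, written)
    }
    deflater.end()
    out.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val original = ("some structured, repetitive data " * 100).getBytes("UTF-8")
    println(s"${original.length} bytes -> ${compress(original).length} bytes")
  }
}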
As for audit, this is not a matter of encoding but of implementation. ASN.1/BER encoding can be implemented in a few lines of code, especially if you are interested only in OCTET STRING, UTF8String and SEQUENCE. We are talking about less than 1000 lines of commented Java or C# here; if you cannot audit that, then you cannot audit anything.

Concept of "block size" in a cache

I am just beginning to learn the concept of Direct mapped and Set Associative Caches.
I have some very elementary doubts. Here goes.
Suppose addresses are 32 bits long and I have a 32 KB cache with a 64-byte block size and 512 frames. How much data is actually stored inside a "block"? If I have an instruction which loads a value from a memory location, and that value is a 16-bit integer, does one of the 64-byte blocks now store only a 16-bit (2-byte) integer value? What about the other 62 bytes within the block? If I now have another load instruction which also loads a 16-bit integer value, this value goes into another block of another frame, depending on the load address (if the address maps to the same frame as the previous instruction, then the previous value is evicted and the block again stores only 2 bytes out of 64). Correct?
Please forgive me if this seems like a very stupid doubt; I just want to get my concepts right.
I typed up this email for someone to explain caches, but I think you might find it useful as well.
You have 32-bit addresses that can refer to bytes in RAM.
You want to be able to cache the data that you access, to use them later.
Let's say you want a 1-MiB (2^20 bytes) cache.
What do you do?
You have 2 restrictions you need to meet:
Caching should be as uniform as possible across all addresses. i.e. you don't want to bias toward any particular kind of address.
How do you do this? Use remainder! With mod, you can evenly distribute any integer over whatever range you want.
You want to help minimize bookkeeping costs. That means, e.g., if you're caching in blocks of 1 byte, you don't want to store 4 bytes of metadata just to keep track of where 1 byte belongs.
How do you do that? You store blocks that are bigger than just 1 byte.
Let's say you choose 16-byte (2^4-byte) blocks. That means you can cache 2^20 / 2^4 = 2^16 = 65,536 blocks of data.
You now have a few options:
You can design the cache so that data from any memory block could be stored in any of the cache blocks. This would be called a fully-associative cache.
The benefit is that it's the "fairest" kind of cache: all blocks are treated completely equally.
The tradeoff is speed: To find where to put the memory block, you have to search every cache block for a free space. This is really slow.
You can design the cache so that data from any memory block could only be stored in a single cache block. This would be called a direct-mapped cache.
The benefit is that it's the fastest kind of cache: you do only 1 check to see if the item is in the cache or not.
The tradeoff is that, now, if you happen to have a bad memory access pattern, you can have 2 blocks kicking each other out successively, with unused blocks still remaining in the cache.
You can do a mixture of both: map each memory block into one of a small set of cache blocks. This is what real processors do -- they have N-way set-associative caches.
Direct-mapped cache:
Now you have 65,536 blocks of data, each block being of 16 bytes.
You store it as 65,536 "rows" inside your cache, with each "row" consisting of the data itself, along with the metadata (regarding where the block belongs, whether it's valid, whether it's been written to, etc.).
Question:
How does each block in memory get mapped to each block in the cache?
Answer:
Well, you're using a direct-mapped cache, using mod. That means addresses 0 to 15 will be mapped to block 0 in the cache; 16 to 31 get mapped to block 1, etc., and it wraps around as you reach the 1-MiB mark.
So, given memory address M, how do you find the row number N? Easy: N = (M mod 2^20) / 2^4.
But that only tells you where to store the data, not how to retrieve it. Once you've stored it and try to access it again, you have to know which 1-MiB region of memory the data came from, right?
So that's one piece of metadata: the tag bits. If it's in row N, all you need to know is what the quotient was during the mod operation, which, for a 32-bit address, is 12 bits (since the remainder is 20 bits).
So your tag becomes 12 bits long -- specifically, the topmost 12 bits of any memory address.
And you already knew that the lowermost 4 bits are used for the offset within a block (since memory is byte-addressed, and a block is 16 bytes).
That leaves 16 bits for the "index" bits of a memory address, which can be used to find which row the address belongs to. (It's just a division + remainder operation, but in binary.)
You also need other bits: e.g. you need to know whether a block is in fact valid or not, because when the CPU is turned on the cache contains invalid data. So you add 1 bit of metadata: the Valid bit.
There are other bits you'll learn about, used for optimization, synchronization, etc., but these are the basic ones. :)
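To make the bit split concrete, here is a small Scala sketch (using an arbitrary example address) of the tag/index/offset breakdown for the 1-MiB direct-mapped cache with 16-byte blocks described above:

object DirectMappedAddress {
  def main(args: Array[String]): Unit = {
    val address = 0xDEADBEEFL              // arbitrary 32-bit example address

    val offset = address & 0xF             // low 4 bits: byte within the 16-byte block
    val index  = (address >> 4) & 0xFFFF   // next 16 bits: which of the 65,536 rows
    val tag    = (address >> 20) & 0xFFF   // top 12 bits: which 1-MiB region of memory

    println(f"tag=0x$tag%03x index=0x$index%04x offset=0x$offset%x")
  }
}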
I'm assuming you know the basics of tag, index, and offset, but here's a short explanation as I learned it in my computer architecture class. Blocks are replaced in 64-byte units, so every time a new block is put into the cache it replaces all 64 bytes, regardless of whether you only need one byte. That's why, when addressing the cache, there is an offset that specifies the byte you want to get from the block. Take your example: if only a 16-bit integer is being loaded, the cache will search for the block by the index, check the tag to make sure it's the right data, and then get the bytes according to the offset. Now if you load another 16-bit value, let's say with the same index but a different tag, it will replace the 64-byte block with the new block and get the info from the specified offset (assuming direct mapped).
I hope this helps! If you need more info or this is still fuzzy let me know, I know a couple of good sites that do a good job of teaching this.

How to reduce the size of an sqlite3 database for iphone?

edit: many thanks for all the answers. Here are the results after applying the optimisations so far:
Switching to sorting the characters and run length encoding - new DB size 42M
Dropping the indexes on the booleans - new DB size 33M
The really nice part is that this hasn't required any changes in the iPhone code.
I have an iPhone application with a large dictionary held in SQLite format (read-only). I'm looking for ideas to reduce the size of the DB file, which is currently very large.
Here is the number of entries and resulting size of the sqlite DB:
franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db
-rw-r--r-- 1 frank staff 59M 8 Oct 23:08 dictionary.db
franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt
453154 dictionary.txt
...an average of about 135 bytes per entry.
Here is my DB schema:
create table words (word text primary key, sowpods boolean, twl boolean, signature text)
create index sowpods_idx on words(sowpods)
create index twl_idx on words(twl)
create index signature_idx on words(signature)
Here is some sample data:
photoengrave|1|1|10002011000001210101010000
photoengraved|1|1|10012011000001210101010000
photoengraver|1|1|10002011000001210201010000
photoengravers|1|1|10002011000001210211010000
photoengraves|1|1|10002011000001210111010000
photoengraving|1|1|10001021100002210101010000
The last field represents the letter frequencies for anagram retrieval (each position is in the range 0..9). The two booleans represent sub dictionaries.
I need to do queries such as:
select signature from words where word = 'foo'
select word from words where signature = '10001021100002210101010000' order by word asc
select word from words where word like 'foo' order by word asc
select word from words where word = 'foo' and (sowpods='1' or twl='1')
One idea I have is to encode the letter frequencies more efficiently, e.g. binary-encode them as a blob (perhaps with RLE, as there are many zeros?). Any ideas for how best to achieve this, or other ideas to reduce the size? I am building the DB in Ruby and reading it on the phone in Objective-C.
Also is there any way to get stats on the DB so I can see what is using the most space?
Have you tried running the "vacuum" command to make sure you don't have extra space in the DB you forgot to reclaim?
Remove the indexes on sowpods and twl -- they are probably not helping your query times and are definitely taking lots of space.
You can get stats on the database using sqlite3_analyzer from the SQLite downloads page.
As a totally different approach, you could try using a Bloom filter instead of a comprehensive database. Basically, a Bloom filter consists of a bunch of hash functions, each of which is associated with a bitfield. For each legal word, each hash function is evaluated and the corresponding bit in the corresponding bitfield is set. The drawback is that it's theoretically possible to get false positives, but those can be minimized/practically eliminated with enough hashes. The plus side is a huge space saving.
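Here is a toy Scala sketch of that idea (the sizes and the MurmurHash3-based hash functions are illustrative choices, not a tuned design):

import scala.collection.mutable
import scala.util.hashing.MurmurHash3

class BloomFilter(bits: Int, numHashes: Int) {
  private val bitset = new mutable.BitSet(bits)

  // Derive several hash positions per word by varying the seed.
  private def positions(word: String): Seq[Int] =
    (0 until numHashes).map(seed => (MurmurHash3.stringHash(word, seed) & 0x7fffffff) % bits)

  def add(word: String): Unit = positions(word).foreach(bitset += _)

  // May return a false positive, but never a false negative.
  def mightContain(word: String): Boolean = positions(word).forall(bitset.contains)
}

object BloomFilterDemo {
  def main(args: Array[String]): Unit = {
    val filter = new BloomFilter(bits = 4 * 1024 * 1024, numHashes = 7) // ~512 KB of bits
    Seq("photoengrave", "photoengraved", "photoengraver").foreach(filter.add)
    println(filter.mightContain("photoengrave")) // true
    println(filter.mightContain("zzzzzz"))       // almost certainly false
  }
}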
I'm not clear on all the use cases for the signature field but it seems like storing an alphabetized version of the word instead would be beneficial.
The creator of SQLite sells a version of SQLite that includes database compression (and encryption). This would be perfect.
Your best bet is to use compression, which unfortunately SQLite does not support natively at this point. Luckily, someone took the time to develop a compression extension for it which could be what you need.
Otherwise I'd recommend storing your data mostly in compressed format and uncompressing on the fly.
As a text field, signature is currently using at least 26 * 8 = 208 bits (26 bytes) per entry, but if you were to pack the data into a bitfield, you could probably get away with only 3 bits per letter (reducing your maximum frequency per letter to 7). That would mean you could pack the entire signature into 26 * 3 bits = 78 bits = 10 bytes. Even if you used 4 bits per letter (for a maximum frequency of 15 per letter), you would only use 104 bits (13 bytes).
EDIT: After a bit more thought, I think 4 bits per letter (instead of 3) would be a better idea because it would make the binary math easier.
EDIT2: Reading through the docs on SQLite data types, it seems that you might be able to just make the "signature" field span 26 columns of type INTEGER and SQLite will do the right thing and only use as many bits as required to store the value.
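For illustration, here is a rough Scala sketch (assuming the digit-string signature format shown in the sample data) of packing the 26-character signature into a 13-byte blob at 4 bits per letter:

object SignaturePacking {
  // Pack the 26 digit characters (0..9 each) into 13 bytes, two digits per byte.
  def pack(signature: String): Array[Byte] = {
    require(signature.length == 26 && signature.forall(_.isDigit))
    signature.map(_.asDigit).grouped(2).map { case Seq(hi, lo) => ((hi << 4) | lo).toByte }.toArray
  }

  def unpack(packed: Array[Byte]): String =
    packed.flatMap(b => Seq((b >> 4) & 0xF, b & 0xF)).mkString

  def main(args: Array[String]): Unit = {
    val sig  = "10002011000001210101010000"
    val blob = pack(sig)
    println(blob.length)          // 13 bytes instead of 26 text characters
    println(unpack(blob) == sig)  // true
  }
}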
Do I reckon correctly that you have about 450K words like that in your database?
I've got no clue about the iPhone, and I'm no expert on SQLite, but... as long as SQLite does not offer a way to save the file as gz right away (maybe it already does internally? No, it does not look like that when you say it's about 135 bytes per entry, not even with both indexes), I would move away from the table approach, save the data "manually" using a dictionary-style compression approach, and build the rest on the fly and in memory. That should perform VERY well on your type of data.
Wait... are you using that signature to allow for full-text searching or mistyping recognition? Would full-text search in SQLite not make that field obsolete?
As noted storing "Signature" more efficiently seems like a good idea.
However, it also seems like you could gain a ton of space savings by using some kind of lookup table for words - since you seem to be taking a root word and then appending "er", "ed", "es", etc why not have a column with a numeric ID that references a root word from a separate lookup table, and then a separate column with a numeric ID that references a table of common word suffixes that would be appended to the base word.
If there were any tricks around storing shorthand versions of signatures for multiple entries with a single root word, you could also employ those to reduce the size of stored signatures (not sure what algorithm is producing those values)
This also seems to make a lot of sense to me as you have the "word" column as a primary key, but do not even index it - just create a separate numeric column that is the primary ID for the table.
Hmm... an iPhone... doesn't it have a permanent data connection?
I think this is where a web application / web service could jump in snugly.
Move most of your business logic to the web server (it will have real SQL with FTS and lots of memory) and deliver that info online to the client on the device.
As mentioned elsewhere, lose the indexes on the boolean columns, they will almost certainly be slower (if used at all) than a table scan and are going to use space needlessly.
I'd consider applying a simple compression to the words; Huffman coding is pretty good for this sort of thing. Also, I'd look at the signatures: sort the columns in letter-frequency order and don't bother storing trailing zeroes, which can be implied. I guess you could Huffman-encode those, too.
Always assuming your encoded strings don't upset SQLite, of course.