I want to change the file's bytes for encryption, but when I use the readAsBytes() method on a large file, I get an out-of-memory error. Is there any way to encrypt a large file with less memory consumption?
Thank you
Generally speaking, you need a temporary buffer to hold your data. If your RAM is not large enough (very likely on mobile devices) it has to be the disk.
So create a second file, and read the first file in batches of bytes small enough for your memory to handle. Your encryption method should be able to work incrementally, as this is a very common requirement. Write the resulting batches of encrypted output to the second file. Once you are done, delete or overwrite the original.
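A minimal sketch of this batching pattern (Python for brevity; in Dart the same idea applies by streaming with File.openRead() instead of readAsBytes(). Note the XOR "cipher" here is a placeholder to keep the example self-contained, not real encryption):

```python
CHUNK_SIZE = 64 * 1024  # read/encrypt/write 64 KB at a time

def toy_cipher(chunk: bytes, key: int) -> bytes:
    # Placeholder only: a real app should use an authenticated cipher
    # (e.g. AES-CTR or AES-GCM from a crypto library), which can also
    # be applied chunk by chunk.
    return bytes(b ^ key for b in chunk)

def encrypt_file(src_path: str, dst_path: str, key: int = 0x5A) -> None:
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK_SIZE)  # never more than 64 KB in RAM
            if not chunk:
                break
            dst.write(toy_cipher(chunk, key))
```

Peak memory stays at one chunk regardless of file size, which is exactly why block/stream ciphers are designed to process data incrementally.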
Many compression algorithms take advantage of the fact that there's redundancy/patterns in data. aaaaaaaaaabbbbbbbbbbbcccccccccccc could be compressed to 10'a'11'b'12'c', for example.
But since there's no more redundancy in my compressed data, I can't really compress it further. However, I can encrypt or encode it and turn it into a different string of bytes: xyzxyzxyzxyzxyz.
If the random bits just so happened to have a pattern in them, it seems that it'd be easy to take advantage of that: 5'xyz'
Here's what our flow looks like:
Original: aaaaaaaaaabbbbbbbbbbbcccccccccccc
Compressed: 10'a'11'b'12'c'
Encrypted: xyzxyzxyzxyzxyz
Compressed again: 5'xyz'
But the more data you have, the larger your file, the more effective many forms of compression will be. Huffman encoding, especially, seems like it'd work really well on random bits of data, especially when the file gets pretty large!
I imagine this would be atrocious when you need data fast, but I think it could have merits for storing archives, or other stuff like that. Maybe downloading a movie over a network would only take 1MB of bandwidth instead of 4MB. Then, you could unpack the movie as the download happened, getting the full 4MB file on your hard drive without destroying your network's bandwidth.
So I have a few questions:
Do people ever encode data so it can be compressed better?
Do people ever "double-compress" their data?
Are there any well-known examples of "double" compression, where data is compressed, encrypted or encoded, then compressed again?
Good encryption produces high-quality random data, so it cannot be compressed. The probability of a compressible result "just so happening" to come out of an encryption is the same as from any other random data source: effectively zero.
Double compression is like perpetual motion: it is an oft-discussed idea, but it never works. If it did, you could compress and compress and compress and get the file down to 1 bit. See
How many times can a file be compressed?
The fundamental problem is that most files are NOT compressible--random, encrypted files even less so.
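This is easy to demonstrate (a Python sketch; zlib stands in for any general-purpose compressor, and deterministic pseudorandom bytes stand in for ciphertext):

```python
import hashlib
import zlib

# Highly repetitive data, like the aaa...bbb...ccc example above.
repetitive = (b"a" * 10 + b"b" * 11 + b"c" * 12) * 3000

# A stand-in for ciphertext: chained SHA-256 digests. They share the
# property that matters here -- no exploitable redundancy.
chunks, seed = [], b"seed"
for _ in range(3125):              # 3125 * 32 bytes = 100000 bytes
    seed = hashlib.sha256(seed).digest()
    chunks.append(seed)
random_like = b"".join(chunks)

print(len(repetitive), len(zlib.compress(repetitive)))    # shrinks dramatically
print(len(random_like), len(zlib.compress(random_like)))  # slightly *larger* than the input
```

The random-looking data actually grows a few bytes when "compressed", because the compressor falls back to storing it verbatim plus framing overhead.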
To answer your questions:
1) Yes! See the Burrows-Wheeler transform.
2) No.
3) No.
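To expand on (1): the Burrows-Wheeler transform is the classic example of an encoding that makes data more compressible without compressing anything itself (bzip2 is built around it). A naive sketch:

```python
def bwt(s: str) -> str:
    # Naive O(n^2 log n) Burrows-Wheeler transform using a '\0'
    # sentinel; real implementations use suffix arrays instead of
    # materialising every rotation.
    s += "\0"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(repr(bwt("banana")))  # 'annb\x00aa' -- identical letters clustered together
```

The transformed string tends to cluster identical characters into runs, which a subsequent run-length or entropy coder can then exploit.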
I'm trying to understand something a bit better about mmap. I recently read this portion of the accepted answer to the related Stack Overflow question "mmap and memory usage" (quoted below):
Let's say you read a 100MB chunk of data, and according to the initial 1MB of header data, the information that you want is located at offset 75MB, so you don't need anything between 1~74.9MB! You have read it for nothing but to make your code simpler. With mmap, you will only read the data you have actually accessed (rounded to 4kb, or the OS page size, which is mostly 4kb), so it would only read the first and the 75th MB.
I understand most of the benefits of mmap (no need for context switches, no need to swap contents out, etc.), but I don't quite understand this offset part. If we don't mmap and we need information at the 75MB offset, can't we do that with standard POSIX file I/O calls without having to use mmap? Why exactly does mmap help here?
Of course you could. You can always open a file and read just the portions you need.
mmap() can be convenient when you don't want to write said code yourself, or when you need sparse access to the contents and don't want to write a bunch of caching logic.
With mmap(), you're "mapping" the entire contents of the file to offsets in memory. Most implementations of mmap() do this lazily, so each ~4K block of the file is read on demand, as you access those memory locations.
All you have to do is access the data in your file as if it were a huge array of chars (e.g. int *someInt = (int *)&map[75000000]; return *someInt;), and let the OS worry about which portions of the file have been read, when to read them, how much to read, writing dirty blocks back to the file, and purging the memory to free up RAM.
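The same idea in a runnable form (Python's mmap module wraps the same system call; the file here is a small stand-in for the question's 100MB example, with the interesting byte placed at a large offset):

```python
import mmap
import os
import tempfile

# Build a sample file: mostly zeros, with one interesting byte at a
# large offset (a stand-in for "the data at 75MB").
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00" * 1_000_000)
    f.seek(750_000)
    f.write(b"\x2a")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # Only the ~4K page containing this offset is faulted in from
        # disk; the rest of the file is never read.
        value = m[750_000]

os.remove(path)
print(value)
```

Indexing the map looks like ordinary array access, but the read from disk happens lazily at the moment of the page fault.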
I have a comma-separated data file; let's assume each record is of fixed length.
How does the OS (Linux) determine which parts of the data are kept in one page on the hard disk?
Does it simply look at the file and organize the records one after the other (sequentially) in a page? Is it possible to set this programmatically, or does the OS take care of it automatically?
Your question is quite general - you didn't specify which filesystem - so the answer will be too.
Generally speaking the OS does not examine the data being written to a file. It simply writes the data to enough disk sectors to contain the data. If the sector size is 4K, then bytes 0-4095 are written to the first sector, bytes 4096-8191 are written to the second sector, etc. The OS does this automatically.
Very few programs wish to manage their own disk-sector allocation. One exception is high-performance database management systems, which often implement their own filesystem in order to have low-level control of the file-data-to-sector mapping.
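What your program can control is how it reads the file: with fixed-length records, record i starts at byte i * record_size, so you can seek straight to it and the OS will only fetch the pages/sectors backing that range. A sketch (record size and padding are illustrative choices, not anything the question specifies):

```python
RECORD_SIZE = 32  # fixed length per record in bytes, padded with spaces

def write_records(path: str, records: list[str]) -> None:
    with open(path, "wb") as f:
        for r in records:
            f.write(r.ljust(RECORD_SIZE).encode("ascii"))

def read_record(path: str, index: int) -> str:
    with open(path, "rb") as f:
        # Jump straight to the record: the OS reads only the disk
        # blocks backing this byte range, not the whole file.
        f.seek(index * RECORD_SIZE)
        return f.read(RECORD_SIZE).decode("ascii").rstrip()
```

This is exactly the layout the answer describes: the bytes go to sectors in order, and fixed-length records make the byte offset of any record computable.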
I have to store about 10k text lines in an Array. Each line is stored as a separate encrypted entry. When the app runs I only need to access a small number and decrypt them - depending on user input. I thought of some kind of lazy evaluation but don't know how to do it in this case.
This is how I build up my array: [allElements addObject:@"wdhkasuqqbuqwz"]. The string is encrypted. Access looks like txt = [[allElements objectAtIndex:n] decrypt];
The problem is that this uses lots of memory from the very start - and I don't need most of the items anyway, I just don't know which ones ;). I'm also hesitant to store the text externally, e.g. in a text file, since that would make it easier to access.
Is there a way to minimize memory usage in such a case?
PS: initialization is very fast, so that's not an issue here.
So it's quite a big array, although not really big enough to be triggering any huge memory warnings (unless my maths has gone horribly wrong, I reckon your array of 10,000 40-character strings is about 0.76 MB). Perhaps there are other things going on in your app causing these warnings - are you loading any large images or many assets?
What I'm a little confused about is how you're currently storing these elements before you initialise the array. You say you don't want to store the text externally in a text file, but you must be holding the values in some kind of file before initialising your array - unless, of course, they are generated on the fly.
If you've encrypted correctly, you shouldn't need to care whether your values are stored in plain-sight or not. Hopefully you're using an established standard and not rolling your own encryption, so really I think worrying about users getting hold of the file is a moot point. After all, the whole point of encryption is being able to hide data in plain sight.
What I would recommend, as a couple of your commenters already have, is that you use some form of database storage. Core Data was made for this purpose: handling large amounts of data with minimal memory impact. But again, I'm not sure how that array alone could trigger a memory warning, so I suspect there's other stuff going on in your app that's eating up your memory.
A project I'm working on requires detection of duplicate files. Under normal circumstances I would simply compare the file bytes in blocks or hash value of the entire file contents. However, the system does not have access to the entire file - only the first 50KB or so. It also knows the total file size of the original file.
I was thinking of implementing the following: each time a file is added, I would look for possible duplicates using both the total file size and a hash calculation of (file-size)+(first-20KB-of-file). The hash algorithm itself is not the issue at this stage, but will likely be MurmurHash2.
Another option is to also store, say, bytes 9000 through 9020 and use that as a third condition when looking up a duplicate copy or alternatively to compare byte-to-byte when the aforementioned lookup method returns possible duplicates in a last attempt to discard false positives.
How naive is my proposed implementation? Is there a reliable way to predict the amount of false positives? What other considerations should I be aware of?
Edit: I forgot to mention that the files are generally going to be compressed archives (ZIP,RAR) and on occasion JPG images.
You can use file size, hashes and partial-contents to quickly detect when two files are different, but you can only determine if they are exactly the same by comparing their contents in full.
It's up to you to decide whether the failure rate of your partial-file check will be low enough to be acceptable in your specific circumstances. Bear in mind that even an "exceedingly unlikely" event will happen frequently if you have enough volume. But if you know the type of data the files will contain, you can judge the chances of two near-identical files (identical in the first 50kB) cropping up.
I would think that if a partial-file-match is acceptable to you, then a hash of those partial file contents is probably going to be pretty acceptable too.
If you have access to 50kB then I'd use all 50kB rather than only the first 20kB in my hash.
Picking an arbitrary 20 bytes probably won't help much: your file contents will either be very different, in which case hash+size clashes will be unlikely, or they will be very similar, in which case the chances of a randomly chosen 20 bytes differing will be quite low.
In your circumstances I would check the size, then a hash of the available data (50kB), and then, if those suggest a match, do a brute-force comparison of the available data just to minimise the risk - assuming you don't expect to be adding so many duplicates that this would bog the system down.
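A sketch of that lookup key (Python; SHA-256 is used because it's in the standard library, but MurmurHash2, as the question mentions, would serve the same role):

```python
import hashlib

PREFIX_LEN = 50 * 1024  # only the first ~50 KB of each file is available

def fingerprint(total_size: int, prefix: bytes) -> str:
    # Candidate-duplicate key: total file size plus a hash of the
    # available prefix. Matching keys mean "candidate duplicate",
    # to be confirmed by comparing the available bytes directly;
    # differing keys mean the files are definitely different.
    h = hashlib.sha256()
    h.update(total_size.to_bytes(8, "little"))
    h.update(prefix[:PREFIX_LEN])
    return h.hexdigest()
```

Note that two files with the same size and the same first 50 KB but different tails will still collide, which is exactly the residual false-positive risk discussed above.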
It depends on the file types, but in most cases false positives will be pretty rare.
You probably won't have any with Office or graphics files. And executables are supposed to have a checksum in the header.
I'd say that the most likely false positives you may encounter are in source code files. They change often, and it may happen that a programmer changes only a few symbols somewhere after the first 20K.
Other than that I'd say they are pretty unlikely.
Why not use a hash of the first 50 KB and store the size on the side? That would give you the most security with what you have to work with (that said, the content after the first 50 KB could be totally different without you knowing, so it's not a truly reliable system).
I find this risky. It's likely that you would catch most duplicates with this method, but the possibility of false positives is huge. What about two versions of a 5MB XML document whose last chapter is modified?