Image based steganography that survives resizing? - matlab

I am using a startech capture card for capturing video from the source machine..I have encoded that video using matlab so every frame of that video will contain that marker...I run that video on the source computer(HDMI out) connected via HDMI to my computer(HDMI IN) once i capture the frame as bitmap(1920*1080) i re-size it to 1280*720 i send it for processing , the processing code checks every pixel for that marker.
The issue is my capture card is able to capture only at 1920*1080 where as the video is of 1280*720. Hence in order to retain the marker I am down scaling the frame captured to 1280*720 which in turn alters the entire pixel array I believe and hence I am not able to retain marker I fed in to the video.
In that capturing process the image is going through up-scaling which in turn changes the pixel values.
I am going through few research papers on Steganography but it hasn't helped so far. Is there any technique that could survive image resizing and I could retain pixel values.
Any suggestions or pointers will be really appreciated.

My advice is to start with searching for an alternative software that doesn't rescale, compress or otherwise modify any extracted frames before handing them to your control. It may save you many headaches and days worth of time. If you insist on implementing, or are forced to implement a steganography algorithm that survives resizing, keep on reading.
I can't provide a specific solution because there are many ways this can be (possibly) achieved and they are complex. However, I'll describe the ingredients a solution will most likely involve and your limitations with such an approach.
Resizing a cover image is considered an attack as an attempt to destroy the secret. Other such examples include lossy compression, noise, cropping, rotation and smoothing. Robust steganography is the medicine for that, but it isn't all powerful; it may be able to provide resistance to only specific types attacks and/or only small scale attacks at that. You need to find or design an algorithm that suits your needs.
For example, let's take a simple pixel lsb substitution algorithm. It modifies the lsb of a pixel to be the same as the bit you want to embed. Now consider an attack where someone randomly applies a pixel change of -1 25% of the time, 0 50% of the time and +1 25% of the time. Effectively, half of the time it will flip your embedded bit, but you don't know which ones are affected. This makes extraction impossible. However, you can alter your embedding algorithm to be resistant against this type of attack. You know the absolute value of the maximum change is 1. If you embed your secret bit, s, in the 3rd lsb, along with setting the last 2 lsbs to 01, you guarantee to survive the attack. More specifically, you get xxxxxs01 in binary for 8 bits.
Let's examine what we have sacrificed in order to survive such an attack. Assuming our embedding bit and the lsbs that can be modified all have uniform probabilities, the probability of changing the original pixel value with the simple algorithm is
change | probability
-------+------------
0 | 1/2
1 | 1/2
and with the more robust algorithm
change | probability
-------+------------
0 | 1/8
1 | 1/4
2 | 3/16
3 | 1/8
4 | 1/8
5 | 1/8
6 | 1/16
That's going to affect our PSNR quite a bit if we embed a lot of information. But we can do a bit better than that if we employ the optimal pixel adjustment method. This algorithm minimises the Euclidean distance between the original value and the modified one. In simpler terms, it minimises the absolute difference. For example, assume you have a pixel with binary value xxxx0111 and you want to embed a 0. This means you have to make the last 3 lsbs 001. With a naive substitution, you get xxxx0001, which has a distance of 6 from the original value. But xxx1001 has only 2.
Now, let's assume that the attack can induce a change of 0 33.3% of the time, 1 33.3% of the time and 2 33.3%. Of that last 33.3%, half the time it will be -2 and the other half it will be +2. The algorithm we described above can actually survive a +2 modification, but not a -2. So 16.6% of the time our embedded bit will be flipped. But now we introduce error correcting codes. If we apply such a code that has the potential to correct on average 1 error every 6 bits, we are capable of successfully extracting our secret despite the attack altering it.
Error correction generally works by adding some sort of redundancy. So even if part of our bit stream is destroyed, we can refer to that redundancy to retrieve the original information. Naturally, the more redundancy you add, the better the error correction rate, but you may have to double the redundancy just to improve the correction rate by a few percent (just arbitrary numbers here).
Let's appreciate here how much information you can hide in a 1280x720 (grayscale) image. 1 bit per pixel, for 8 bits per letter, for ~5 letters per word and you can hide 20k words. That's a respectable portion of an average novel. It's enough to hide your stellar Masters dissertation, which you even published, in your graduation photo. But with a 4 bit redundancy per 1 bit of actual information, you're only looking at hiding that boring essay you wrote once, which didn't even get the best mark in the class.
There are other ways you can embed your information. For example, specific methods in the frequency domain can be more resistant to pixel modifications. The downside of such methods are an increased complexity in coding the algorithm and reduced hiding capacity. That's because some frequency coefficients are resistant to changes but make embedding modifications easily detectable, then there are those that are fragile to changes but they are hard to detect and some lie in the middle of all of this. So you compromise and use only a fraction of the available coefficients. Popular frequency transforms used in steganography are the Discrete Cosine Transform (DCT) and Discrete Wavelet Transform (DWT).
In summary, if you want a robust algorithm, the consistent themes that emerge are sacrificing capacity and applying stronger distortions to your cover medium. There have been quite a few studies done on robust steganography for watermarks. That's because you want your watermark to survive any attacks so you can prove ownership of the content and watermarks tend to be very small, e.g. a 64x64 binary image icon (that's only 4096 bits). Even then, some algorithms are robust enough to recover the watermark almost intact, say 70-90%, so that it's still comparable to the original watermark. In some case, this is considered good enough. You'd require an even more robust algorithm (bigger sacrifices) if you want a lossless retrieval of your secret data 100% of the time.
If you want such an algorithm, you want to comb the literature for one and test any possible candidates to see if they meet your needs. But don't expect anything that takes only 15 lines to code and 10 minutes of reading to understand. Here is a paper that looks like a good start: Mali et al. (2012). Robust and secured image-adaptive data hiding. Digital Signal Processing, 22(2), 314-323. Unfortunately, the paper is not open domain and you will either need a subscription, or academic access in order to read it. But then again, that's true for most of the papers out there. You said you've read some papers already and in previous questions you've stated you're working on a college project, so access for you may be likely.
For this specific paper, table 4 shows the results of resisting a resizing attack and section 4.4 discusses the results. They don't explicitly state 100% recovery, but only a faithful reproduction. Also notice that the attacks have been of the scale 5-20% resizing and that only allows for a few thousand embedding bits. Finally, the resizing method (nearest neighbour, cubic, etc) matters a lot in surviving the attack.

I have designed and implemented ChromaShift: https://www.facebook.com/ChromaShift/
If done right, steganography can resiliently (i.e. robustly) encode identifying information (e.g. downloader user id) in the image medium while keeping it essentially perceptually unmodified. Compared to watermarks, steganography is a subtler yet more powerful way of encoding information in images.
The information is dynamically multiplexed into the Cb Cr fabric of the JPEG by chroma-shifting pixels to a configurable small bump value. As the human eye is more sensitive to luminance changes than to chrominance changes, chroma-shifting is virtually imperceptible while providing a way to encode arbitrary information in the image. The ChromaShift engine does both watermarking and pure steganography. Both DRM subsystems are configurable via a rich set of of options.
The solution is developed in C, for the Linux platform, and uses SWIG to compile into a PHP loadable module. It can therefore be accessed by PHP scripts while providing the speed of a natively compiled program.

Related

Performing Intra-frame Prediction in Matlab

I am trying to implement a hybrid video coding framework which is used in the H.264/MPEG-4 video standard for which I need to perform 'Intra-frame Prediction' and 'Inter Prediction' (which in other words is motion estimation) of a set of 30 frames for video processing in Matlab. I am working with Mother-daughter frames.
Please note that this post is very similar to my previously asked question but this one is solely based on Matlab computation.
Edit:
I am trying to implement the framework shown below:
My question is how to perform horizontal coding method which is one of the nine methods of Intra Coding framework? How are the pixels sampled?
What I find confusing is that Intra Prediction needs two inputs which are the 8x8 blocks of input frame and the 8x8 blocks of reconstructed frame. But what happens when coding the very first block of the input frame since there will be no reconstructed pixels to perform horizontal coding?
In the image above the whole system is a closed loop where do you start?
END:
Question 1: Is intra-predicted image only for the first image (I-frame) of the sequence or does it need to be computed for all 30 frames?
I know that there are five intra coding modes which are horizontal, vertical, DC, Left-up to right-down and right-up to left-down.
Question 2: How do I actually get around comparing the reconstructed frame and the anchor frame (original current frame)?
Question 3: Why do I need a search area? Can the individual 8x8 blocks be used as a search area done one pixel at a time?
I know that pixels from reconstructed block are used for comparing, but is it done one pixel at a time within the search area? If so wouldn't that be too time consuming if 30 frames are to be processed?
Continuing on from our previous post, let's answer one question at a time.
Question #1
Usually, you use one I-frame and denote this as the reference frame. Once you use this, for each 8 x 8 block that's in your reference frame, you take a look at the next frame and figure out where this 8 x 8 block best moved in this next frame. You describe this displacement as a motion vector and you construct a P-frame that consists of this information. This tells you where the 8 x 8 block from the reference frame best moved in this frame.
Now, the next question you may be asking is how many frames is it going to take before we decide to use another reference frame? This is entirely up to you, and you set this up in your decoder settings. For digital broadcast and DVD storage, it is recommended that you generate an I-frame every 0.5 seconds or so. Assuming 24 frames per second, this means that you would need to generate an I-frame every 12 frames. This Wikipedia article was where I got this reference.
As for the intra-coding modes, these tell the encoder in what direction you should look for when trying to find the best matching block. Actually, take a look at this paper that talks about the different prediction modes. Take a look at Figure 1, and it provides a very nice summary of the various prediction modes. In fact, there are nine all together. Also take a look at this Wikipedia article to get better pictorial representations of the different mechanisms of prediction as well. In order to get the best accuracy, they also do subpixel estimation at a 1/4 pixel accuracy by doing bilinear interpolation in between the pixels.
I'm not sure whether or not you need to implement just motion compensation with P-frames, or if you need B frames as well. I'm going to assume you'll be needing both. As such, take a look at this diagram I pulled off of Wikipedia:
Source: Wikipedia
This is a very common sequence for encoding frames in your video. It follows the format of:
IBBPBBPBBI...
There is a time axis at the bottom that tells you the sequence of frames that get sent to the decoder once you encode the frames. I-frames need to be encoded first, followed by P-frames, and then B-frames. A typical sequence of frames that are encoded in between the I-frames follow this format that you see in the figure. The chunk of frames in between I-frames is what is known as a Group of Pictures (GOP). If you remember from our previous post, B-frames use information from ahead and from behind its current position. As such, to summarize the timeline, this is what is usually done on the encoder side:
The I-frame is encoded, and then is used to predict the first P-frame
The first I-frame and the first P-frame are then used to predict the first and second B-frame that are in between these frames
The second P-frame is predicted using the first P-frame, and the third and fourth B-frames are created using information between the first P-frame and the second P-frame
Finally, the last frame in the GOP is an I-frame. This is encoded, then information between the second P-frame and the second I-frame (last frame) are used to generate the fifth and sixth B-frames
Therefore, what needs to happen is that you send I-frames first, then the P-frames, and then the B-frames after. The decoder has to wait for the P-frames before the B-frames can be reconstructed. However, this method of decoding is more robust because:
It minimizes the problem of possible uncovered areas.
P-frames and B-frames need less data than I-frames, so less data is transmitted.
However, B-frames will require more motion vectors, and so there will be some higher bit rates here.
Question #2
Honestly, what I have seen people do is do a simple Sum-of-Squared Differences between one frame and another to compare similarity. You take your colour components (whether it be RGB, YUV, etc.) for each pixel from one frame in one position, subtract these with the colour components in the same spatial location in the other frame, square each component and add them all together. You accumulate all of these differences for every location in your frame. The higher the value, the more dissimilar this is between the one frame and the next.
Another measure that is well known is called Structural Similarity where some statistical measures such as mean and variance are used to assess how similar two frames are.
There are a whole bunch of other video quality metrics that are used, and there are advantages and disadvantages when using any of them. Rather than telling you which one to use, I defer you to this Wikipedia article so you can decide which one to use for yourself depending on your application. This Wikipedia article describes a whole bunch of similarity and video quality metrics, and the buck doesn't stop there. There is still on-going research on what numerical measures best capture the similarity and quality between two frames.
Question #3
When searching for the best block from an I-frame that has moved in a P-frame, you need to restrict the searching to a finite sized windowed area from the location of this I-frame block because you don't want the encoder to search all of the locations in the frame. This would simply be too computationally intensive and would thus make your decoder slow. I actually mentioned this in our previous post.
Using one pixel to search for another pixel in the next frame is a very bad idea because of the minuscule amount of information that this single pixel contains. The reason why you compare blocks at a time when doing motion estimation is because usually, blocks of pixels have a lot of variation inside the blocks which are unique to the block itself. If we can find this same variation in another area in your next frame, then this is a very good candidate that this group of pixels moved together to this new block. Remember, we're assuming that the frame rate for video is adequately high enough so that most of the pixels in your frame either don't move at all, or move very slowly. Using blocks allows the matching to be somewhat more accurate.
Blocks are compared at a time, and the way blocks are compared is using one of those video similarity measures that I talked about in the Wikipedia article I referenced. You are certainly correct in that doing this for 30 frames would indeed be slow, but there are implementations that exist that are highly optimized to do the encoding very fast. One good example is FFMPEG. In fact, I use FFMPEG at work all the time. FFMPEG is highly customizable, and you can create an encoder / decoder that takes advantage of the architecture of your system. I have it set up so that encoding / decoding uses all of the cores on my machine (8 in total).
This doesn't really answer the actual block comparison itself. Actually, the H.264 standard has a bunch of prediction mechanisms in place so that you're not looking at all of the blocks in an I-frame to predict the next P-frame (or one P-frame to the next P-frame, etc.). This alludes to the different prediction modes in the Wikipedia article and in the paper that I referred you to. The encoder is intelligent enough to detect a pattern, and then generalize an area of your image where it believes that this will exhibit the same amount of motion. It skips this area and moves onto the next.
This assignment (in my opinion) is way too broad. There are so many intricacies in doing motion prediction / compensation that there is a reason why most video engineers already use available tools to do the work for us. Why invent the wheel when it has already been perfected, right?
I hope this has adequately answered your questions. I believe that I have given you more questions than answers really, but I hope that this is enough for you to delve into this topic further to achieve your overall goal.
Good luck!
Question 1: Is intra-predicted image only for the first image (I-frame) of the sequence or does it need to be computed for all 30 frames?
I know that there are five intra coding modes which are horizontal, vertical, DC, Left-up to right-down and right-up to left-down.
Answer: intra prediction need not be used for all the frames.
Question 2: How do I actually get around comparing the reconstructed frame and the anchor frame (original current frame)?
Question 3: Why do I need a search area? Can the individual 8x8 blocks be used as a search area done one pixel at a time?
Answer: we need to use the block matching algo for finding the motion vector. so search area is reqd. Normally size of the search area should be larger than the block size. larger the search area, more the computation and higher the accuracy.

Bit operators issues and Steganography in image processing

Steganography link shows a demonstration of steganography. My question is when the number of bits to be replaced, n =1, then the method is irreversible i.e the Cover is not equal to Stego (in ideal and perfect cases the Cover used should be identical to the Steganography result). It only works perfectly when the number of bits to be replaced is n=4,5,6!! When n=7, the Stego image becomes noisy and different from the Cover used and the result does not become inconspicuous. So, it is evident that there has been an operation of steganography. Can somebody please explain why that is so and what needs to be done so as to make the process reversible and lossless.
So let's see what the code does. From the hidden image you extract the n most significant bits (MSB) and hide them in the n least significant bits (LSB) in the cover image. There are two points to notice about this, which answer your questions.
The more bits you change in your cover image, the more distorted your stego image will look like.
The more information you use from the hidden image, the closer the reconstructed image will look to the original one. The following link (reference) shows you the amount of information of an image from the most to the least significant bit.
If you want to visually check the difference between the cover and stego images, you can use the Peak Signal-to-Noise-Ratio (PSNR) equation. It is said the human eye can't distinguish differences for PSNR > 30 dB. Personally, I wouldn't go for anything less than 40 but it depends on what your aim is. Be aware that this is not an end-all, be-all type of measurement. The quality of your algorithm depends on many factors.
No cover and stego images are supposed to be the same. The idea is to minimise the differences so to resist detection and there are many compromises to achieve that, such as the size of the message you are willing to hide.
Perfect retrieval of a secret image requires hiding all the bits of all the pixels, which means you can only hide a secret 1/8th of the cover image size. Note though that this is worst case scenario, which doesn't consider encryption, compression or other techniques. That's the idea but I won't provide a code snippet based on the above because it is very inflexible.
Now, there are cases where you want the retrieval to be lossless, either because the data are encrypted or of sensitive nature. In other cases an approximate retrieval will do the job. For example, if you were to encode only the 4 MSB of an image, someone extracting the secret would still get a good idea of what it initially looked like. If you still want a lossless method but not the one just suggested, you need to use a different algorithm. The choice of the algorithm depends on various characteristics you want it to have, including but not restricted to:
robustness (how resistant the hidden information is to image editing)
imperceptibility (how hard it is for a stranger to know the existence of a secret, but not necessarily the secret itself, e.g. chi-square attack)
type of cover medium (e.g., specific image file type)
type of secret message (e.g., image, text)
size of secret

Virtual Memory Page Replacement Algorithms

I have a project where I am asked to develop an application to simulate how different page replacement algorithms perform (with varying working set size and stability period). My results:
Vertical axis: page faults
Horizontal axis: working set size
Depth axis: stable period
Are my results reasonable? I expected LRU to have better results than FIFO. Here, they are approximately the same.
For random, stability period and working set size doesnt seem to affect the performance at all? I expected similar graphs as FIFO & LRU just worst performance? If the reference string is highly stable (little branches) and have a small working set size, it should still have less page faults that an application with many branches and big working set size?
More Info
My Python Code | The Project Question
Length of reference string (RS): 200,000
Size of virtual memory (P): 1000
Size of main memory (F): 100
number of time page referenced (m): 100
Size of working set (e): 2 - 100
Stability (t): 0 - 1
Working set size (e) & stable period (t) affects how reference string are generated.
|-----------|--------|------------------------------------|
0 p p+e P-1
So assume the above the the virtual memory of size P. To generate reference strings, the following algorithm is used:
Repeat until reference string generated
pick m numbers in [p, p+e]. m simulates or refers to number of times page is referenced
pick random number, 0 <= r < 1
if r < t
generate new p
else (++p)%P
UPDATE (In response to #MrGomez's answer)
However, recall how you seeded your input data: using random.random,
thus giving you a uniform distribution of data with your controllable
level of entropy. Because of this, all values are equally likely to
occur, and because you've constructed this in floating point space,
recurrences are highly improbable.
I am using random, but it is not totally random either, references are generated with some locality though the use of working set size and number page referenced parameters?
I tried increasing the numPageReferenced relative with numFrames in hope that it will reference a page currently in memory more, thus showing the performance benefit of LRU over FIFO, but that didn't give me a clear result tho. Just FYI, I tried the same app with the following parameters (Pages/Frames ratio is still kept the same, I reduced the size of data to make things faster).
--numReferences 1000 --numPages 100 --numFrames 10 --numPageReferenced 20
The result is
Still not such a big difference. Am I right to say if I increase numPageReferenced relative to numFrames, LRU should have a better performance as it is referencing pages in memory more? Or perhaps I am mis-understanding something?
For random, I am thinking along the lines of:
Suppose theres high stability and small working set. It means that the pages referenced are very likely to be in memory. So the need for the page replacement algorithm to run is lower?
Hmm maybe I got to think about this more :)
UPDATE: Trashing less obvious on lower stablity
Here, I am trying to show the trashing as working set size exceeds the number of frames (100) in memory. However, notice thrashing appears less obvious with lower stability (high t), why might that be? Is the explanation that as stability becomes low, page faults approaches maximum thus it does not matter as much what the working set size is?
These results are reasonable given your current implementation. The rationale behind that, however, bears some discussion.
When considering algorithms in general, it's most important to consider the properties of the algorithms currently under inspection. Specifically, note their corner cases and best and worst case conditions. You're probably already familiar with this terse method of evaluation, so this is mostly for the benefit of those reading here whom may not have an algorithmic background.
Let's break your question down by algorithm and explore their component properties in context:
FIFO shows an increase in page faults as the size of your working set (length axis) increases.
This is correct behavior, consistent with Bélády's anomaly for FIFO replacement. As the size of your working page set increases, the number of page faults should also increase.
FIFO shows an increase in page faults as system stability (1 - depth axis) decreases.
Noting your algorithm for seeding stability (if random.random() < stability), your results become less stable as stability (S) approaches 1. As you sharply increase the entropy in your data, the number of page faults, too, sharply increases and propagates the Bélády's anomaly.
So far, so good.
LRU shows consistency with FIFO. Why?
Note your seeding algorithm. Standard LRU is most optimal when you have paging requests that are structured to smaller operational frames. For ordered, predictable lookups, it improves upon FIFO by aging off results that no longer exist in the current execution frame, which is a very useful property for staged execution and encapsulated, modal operation. Again, so far, so good.
However, recall how you seeded your input data: using random.random, thus giving you a uniform distribution of data with your controllable level of entropy. Because of this, all values are equally likely to occur, and because you've constructed this in floating point space, recurrences are highly improbable.
As a result, your LRU is perceiving each element to occur a small number of times, then to be completely discarded when the next value was calculated. It thus correctly pages each value as it falls out of the window, giving you performance exactly comparable to FIFO. If your system properly accounted for recurrence or a compressed character space, you would see markedly different results.
For random, stability period and working set size doesn't seem to affect the performance at all. Why are we seeing this scribble all over the graph instead of giving us a relatively smooth manifold?
In the case of a random paging scheme, you age off each entry stochastically. Purportedly, this should give us some form of a manifold bound to the entropy and size of our working set... right?
Or should it? For each set of entries, you randomly assign a subset to page out as a function of time. This should give relatively even paging performance, regardless of stability and regardless of your working set, as long as your access profile is again uniformly random.
So, based on the conditions you are checking, this is entirely correct behavior consistent with what we'd expect. You get an even paging performance that doesn't degrade with other factors (but, conversely, isn't improved by them) that's suitable for high load, efficient operation. Not bad, just not what you might intuitively expect.
So, in a nutshell, that's the breakdown as your project is currently implemented.
As an exercise in further exploring the properties of these algorithms in the context of different dispositions and distributions of input data, I highly recommend digging into scipy.stats to see what, for example, a Gaussian or logistic distribution might do to each graph. Then, I would come back to the documented expectations of each algorithm and draft cases where each is uniquely most and least appropriate.
All in all, I think your teacher will be proud. :)

Transferring data using ultrasound

Yamaha InfoSound and ShopKick application use technologies that allow to transfer data using ultrasound. That is playing an inaudible signal (>18kHz) that can be picked up by modern mobile phones (iOS, Android).
What is the approach used in such technologies? What kind of modulation they use?
I see several problems with this approach. First, 18kHz is not inaudible. Many people cannot hear it, especially as they age, but I know I certainly can (I do regular hearing tests, work-related). Also, most phones have different low-pass filters on their A/D converters, and many devices, especially older Android ones (I've personally seen that happen), filter everything below 16 kHz or so. Your app therefore is not guaranteed to work on any hardware. The iPhone should probably be able to do it.
In terms of modulation, it could be anything really, but I would definitely rule out AM. Sound has next to zero robustness when it comes to volume. If I were to implement something like that, I would go with FSK. I would think that PSK would fail due to acoustic reflections and such. The difficulty is that you're working with non-robust energy transfer within a very narrow bandwidth. I certainly do not doubt that it can be achieved, but I don't see something like this proving reliable. Just IMHO, that is.
Update: Now that i think about it, a plain on-off would work with a single tone if you're not transferring any data, just some short signals.
Can't say for Yamaha InfoSound and ShopKick, but what we used in our project was a variation of frequency modulation: the frequency of the carrier is modulated by a digital binary signal, where 0 and 1 correspond to 17 kHz and 18 kHz respectively. As for demodulator, we tried heterodyne. More details you could find here: http://rnd.azoft.com/mobile-app-transering-data-using-ultrasound/
There's nothing special in being ultrasound, the principle is the same as data transmission through a modem, so any digital modulation is -in principle- feasible. You only have a specific frequency band (above 18khz) and some practical requisites (the medium is very unreliable, I guess) that suggest to use a simple-robust scheme with low-bit rate.
I don't know how they do it but this is how I do it:
If it is a string then make sure it's not a long one (the longer the higher is the error probability ). Lets assume we're working with the vital part of the ASCII code, namely up to character number 127, then all you need is 7 bits per character. Transform this character into bits and modulate those bits using QFSK (there are several modulations to choose from, frequency shift based ones have turned out to be the most robust I've tried from the conventional ones... I've created my own modulation scheme for this use case). Select the carrier frequencies as 18.5,19,19.5, and 20 kHz (if you want to be mathematically strict in your design, select frequency values that assure you both orthogonality and phase continuity at symbol transitions, if you can't, a good workaround to avoid abrupt symbols transitions is to multiply your symbols by a window of the same size, eg. a Gaussian or Bartlet ). In my experience you can move this values in the range from 17.5 to 20.5 kHz (if you go lower it will start to bother people using your app, if you go higher the average type microphone frequency response will attenuate your transmission and induce unwanted errors).
On the receiver side implement a correlation or matched filter receiver (an FFT receiver works as well, specially a zero padded one but it might be a little bit slower, I wouldn't recommend Goertzel because frequency shift due to Doppler effect or speaker-microphone non-linearities could affect your reception). Once you have received the bit stream make characters with them and you will recover your message
If you face too many broadcasting errors, try selecting a higher amount of samples per symbol or band-pass filtering each frequency value before giving them to the demodulator, using an error correction code such as BCH or Reed Solomon is sometimes the only way to assure an error free communication.
One topic everybody always forgets to talk about is synchronization (to know on the receiver side when the transmission has begun), you have to be creative here and make a lot a tests with a lot of phones before you can derive an actual detection threshold that works on all, notice that this might also be distance dependent
If you are unfamiliar with these subjects I would recommend a couple of great books:
Digital Modulation Techniques from Fuqin Xiong
DIGITAL COMMUNICATIONS Fundamentals and Applications from BERNARD SKLAR
Digital Communications from John G. Proakis
You might have luck with a library I created for sound-based modems, libquiet. It gives you a handful of profiles to work from, including a slow "Ultrasonic whisper" profile with spectral content above 19kHz. The library is written in C but would require some work to interface with iOS.

Mixing sound files of different size

I want to mix audio files of different size into a one single .wav file without clipping any file.,i.e. The resulting file size should be equal to the largest sized file of all.
There is a sample through which we can mix files of same size
[(http://www.modejong.com/iOS/#ex4 )(Example 4)].
I modified the code to get the mixed file as a .wav file.
But I am not able to understand that how to modify this code for unequal sized files.
If someone can help me out with some code snippet,i'll be really thankful.
It should be as easy as sending all the files to the mixer simultaneously. When any single file gets to the end, just treat it as if the remainder is filled with zeroes. When all files get to the end, you are done.
Note that the example code says it returns an error if there would be clipping (the sum of the waves is greater than the max representable value.). This condition is more likely if you are mixing multiple inputs. The best way around it is to create some "headroom" in the input waves. You can do either do this in preprocessing, by ensuring that each wave's volume is no more than X% of maximum. (~80-90%, depending on number of inputs.). The other way is to do it dynamically in the mixer code by multiplying each sample by some value <1.0 as you add it to the mix.
If you are selecting the waves to mix at runtime and failure due to clipping is unacceptable, you will need to modify the sample code to pin the values at max/min instead of returning an error. Don't just let them overflow or you will get noisy artifacts.
(Clipping creates artifacts as well, but when you haven't created enough headroom before mixing, it is definitely preferrable to overflow. It is a more familiar-sounding type of distortion, similar to what you get when you overdrive your speakers. See this wikipedia article on clipping:
Clipping is preferable to the alternative in digital systems—wrapping—which occurs if the digital hardware is allowed to "overflow", ignoring the most significant bits of the magnitude, and sometimes even the sign of the sample value, resulting in gross distortion of the signal.
How I'd do it:
Much like the mix_buffers function that you linked to, but pass in 2 parameters for mixbufferNumSamples. Iterate over the whole of the longer of the two buffers. When the index has gone beyond the end of the shorter buffer, simply set the sample from that buffer to 0 for the rest of the function.
If you must avoid clipping and do it in real-time and you know nothing else about the two sounds, you must provide enough headroom. The simplest method is by halving each of the samples before mixing:
mixed = s1/2 + s2/2;
This ensures that the resultant mixed sample won't overflow an int16_t. It will have the side effect of making everything quieter though.
If you can run it offline, you can calculate a scale factor to apply to both waveforms which will keep the peaks when summed below the maximum allowed value.
Or you could mix them all at full volume to an int32_t buffer, keeping track of the largest (magnitude) mixed sample and then go back through the buffer multiplying each sample by a scale factor which will make that extreme sample just reach the +32767/-32768 limits.