Creating and loading .pngs in RGBA4444/RGBA5551 for OpenGL - iPhone

I'm creating an OpenGL game and so far I have been using .pngs in the RGBA8888 format as texture sheets, but those are too memory hungry, and my app crashes frequently. I read on Apple's site that such formats should only be used when that much quality is really needed, and it recommends RGBA4444 and RGBA5551 instead (I already converted my textures to PVR, but the quality loss is too great in most of the sprite sheets).
I only need to use GL_UNSIGNED_SHORT_5_5_5_1 or GL_UNSIGNED_SHORT_4_4_4_4 in the glTexImage2D call inside my texture loader class in order to load my textures, but first I need to convert my texture sheets to RGBA4444 and RGBA5551, and I'm clueless about how I could achieve this.
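For reference, this is roughly the call I mean; width, height and pixels stand in for my loader's variables:
// upload pre-packed 16-bit texels (GL_UNSIGNED_SHORT_5_5_5_1 for RGBA5551 data)
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_RGBA, GL_UNSIGNED_SHORT_4_4_4_4, pixels);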

Seriously? There are libraries to do this kind of conversion, and some of them use asm or specialized SSE commands to accelerate it, which will be fast. But frankly, this is just a bit of bit twiddling: it's pretty easy to roll your own format converter in C/C++.
Your basic process would be:
Given a buffer of RGBA8888 encoded values
Create a buffer big enough to hold the RGBA4444 or RGBA5551 values. In this case it's simple: half the size.
Loop over the source buffer, unpacking each component, and repacking into the destination format, and write it into the destination buffer.
#include <stdint.h>
#include <stdlib.h>

void* rgba8888_to_rgba4444(
    void* src, // IN, pointer to source buffer
    int cb)    // IN, size of source buffer, in bytes
{
    int i;
    // compute the actual number of pixel elements in the buffer.
    int cpel = cb/4;
    uint32_t* psrc = (uint32_t*)src;
    // create the RGBA4444 buffer
    uint16_t* pdst = (uint16_t*)malloc(cpel*2);
    // convert every pixel
    for(i=0; i<cpel; i++)
    {
        // read a source pixel
        uint32_t pel = psrc[i];
        // unpack the source data as 8 bit values
        uint32_t r = pel & 0xff;
        uint32_t g = (pel >> 8) & 0xff;
        uint32_t b = (pel >> 16) & 0xff;
        uint32_t a = (pel >> 24) & 0xff;
        // convert to 4 bit values by keeping the top 4 bits of each
        r >>= 4;
        g >>= 4;
        b >>= 4;
        a >>= 4;
        // and store
        pdst[i] = (uint16_t)(r | g << 4 | b << 8 | a << 12);
    }
    return pdst;
}
I did the actual conversion loop very wastefully; the components can be extracted, converted, and repacked in a single pass, making for far faster code. I did it this way to make the conversion explicit and easy to change. Also, I'm not sure that I got the component order the right way around, so it might be b, r, g, a, but it shouldn't affect the result of the function, as it repacks in the same order into the destination buffer.
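The same approach covers the RGBA5551 case the question asks about. Here's an untested sketch along the same lines (the function name is just illustrative, and the same caveat about component order applies):
#include <stdint.h>
#include <stdlib.h>

void* rgba8888_to_rgba5551(void* src, int cb)
{
    int i;
    int cpel = cb/4;
    uint32_t* psrc = (uint32_t*)src;
    uint16_t* pdst = (uint16_t*)malloc(cpel*2);
    for(i=0; i<cpel; i++)
    {
        uint32_t pel = psrc[i];
        // keep the top 5 bits of r, g, b and only the top bit of a
        uint32_t r = (pel >> 3) & 0x1f;
        uint32_t g = (pel >> 11) & 0x1f;
        uint32_t b = (pel >> 19) & 0x1f;
        uint32_t a = (pel >> 31) & 0x01;
        pdst[i] = (uint16_t)(r | g << 5 | b << 10 | a << 15);
    }
    return pdst;
}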

Using ImageMagick you can create RGBA4444 PNG files by running:
convert source.png -depth 4 destination.png
You can get ImageMagick from MacPorts.

You may consider using Imagination's PVRTexTool for Windows. It's specifically for creating PVR textures in every supported color format. It can create both PVRTC compressed textures (what you call "PVR") as well as uncompressed textures in 8888, 5551, 4444, etc.
However, it doesn't output PNGs (only PVRs), so your loading code would have to change. Also, sometimes PVRs are much larger than PNGs, because the pixels in PNGs are compressed with deflate compression.
Since you're most likely running OS X, you can use Darwine (now WineBottler) to run it (and other Windows programs) on OS X.
You'll need to register as an Imagination developer before you can download PVRTexTool. Registration and the tool are both free.
Once you set it up, it's pretty painless and it gives you a decent GUI for working with PVRs.

You might also want to look at how to optimize RGBA8888 images for conversion to RGBA4444 using Floyd-Steinberg dithering in GIMP: http://www.youtube.com/watch?v=v1xGYecsnX0

You could also use http://www.texturepacker.com for conversion.

Here is an optimized in-place conversion of Chris' code, which should run about 2x as fast but is not as straightforward. The in-place conversion helps to avoid crashes by lowering the memory spike. Just thought I'd share in case anyone was planning on using this code. I've tested it and it works great:
void* rgba8888_to_rgba4444(
    void* src, // IN, pointer to source buffer
    int cb)    // IN, size of source buffer, in bytes
{
    int i;
    // compute the actual number of pixel elements in the buffer.
    int cpel = cb/4;
    uint32_t* psrc = (uint32_t*)src;
    // the destination aliases the source buffer; the 16-bit writes never
    // overtake the 32-bit reads, so converting in place is safe here
    uint16_t* pdst = (uint16_t*)src;
    // convert every pixel
    for(i=0; i<cpel; i++)
    {
        // read a source pixel
        uint32_t pel = psrc[i];
        // shift and mask the top 4 bits of each component straight into
        // its target position. note: this packs r into the top nibble
        // (the GL_UNSIGNED_SHORT_4_4_4_4 layout), which is nibble-reversed
        // relative to the function above
        unsigned r = (pel << 8) & 0xf000;
        unsigned g = (pel >> 4) & 0x0f00;
        unsigned b = (pel >> 16) & 0x00f0;
        unsigned a = (pel >> 28) & 0x000f;
        // and store
        pdst[i] = (uint16_t)(r | g | b | a);
    }
    return pdst;
}

Related

Reducing LUT utilization in a Vivado HLS design (RSA cryptosystem using Montgomery multiplication)

A question/problem for anyone experienced with Xilinx Vivado HLS and FPGA design:
I need help reducing the utilization numbers of a design within the confines of HLS (i.e. can't just redo the design in an HDL). I am targeting the Zedboard (Zynq 7020).
I'm trying to implement 2048-bit RSA in HLS, using the Tenca-Koç multiple-word radix-2 Montgomery multiplication algorithm.
I wrote this algorithm in HLS and it works in simulation and in C/RTL cosim. My algorithm is here:
#define MWR2MM_m 2048 // Bit-length of operands
#define MWR2MM_w 8 // word size
#define MWR2MM_e 257 // number of words per operand
// Type definitions
typedef ap_uint<1> bit_t; // 1-bit scan
typedef ap_uint< MWR2MM_w > word_t; // 8-bit words
typedef ap_uint< MWR2MM_m > rsaSize_t; // m-bit operand size
/*
 * Multiple-word radix-2 Montgomery multiplication using a carry-propagate adder
 */
void mwr2mm_cpa(rsaSize_t X, rsaSize_t Yin, rsaSize_t Min, rsaSize_t* out)
{
    // extend operands by 2 extra words of 0
    ap_uint<MWR2MM_m + 2*MWR2MM_w> Y = Yin;
    ap_uint<MWR2MM_m + 2*MWR2MM_w> M = Min;
    ap_uint<MWR2MM_m + 2*MWR2MM_w> S = 0;
    ap_uint<2> C = 0; // two carry bits
    bit_t qi = 0;     // an intermediate result bit
    // Store concatenations in a temporary variable to eliminate HLS compiler
    // warnings about shift count (10 bits: an 8-bit word plus the 2-bit carry)
    ap_uint<MWR2MM_w + 2> temp_concat = 0;

    // scan X bit-by-bit
    for (int i=0; i<MWR2MM_m; i++)
    {
        qi = (X[i]*Y[0]) xor S[0];
        // C gets the top two bits of temp_concat, word 0 of S gets the bottom 8 bits
        temp_concat = X[i]*Y.range(MWR2MM_w-1,0) + qi*M.range(MWR2MM_w-1,0) + S.range(MWR2MM_w-1,0);
        C = temp_concat.range(9,8);
        S.range(MWR2MM_w-1,0) = temp_concat.range(7,0);

        // scan Y and M word-by-word, for each bit of X
        for (int j=1; j<=MWR2MM_e; j++)
        {
            temp_concat = C + X[i]*Y.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + qi*M.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j);
            C = temp_concat.range(9,8);
            S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) = temp_concat.range(7,0);
            S.range(MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)) = (S.bit(MWR2MM_w*j), S.range(MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)+1));
        }
        S.range(S.length()-1, S.length()-MWR2MM_w) = 0;
        C = 0;
    }
    // if the final partial sum is greater than the modulus, bring it back into range
    if (S >= M)
        S -= M;
    *out = S;
}
Unfortunately, the LUT utilization is huge.
This is problematic because I need to be able to fit multiple of these blocks in hardware as AXI4-Lite slaves.
Could someone please provide a few suggestions as to how I can reduce the LUT utilization, WITHIN THE CONFINES OF HLS?
I've already tried the following:
Experimenting with different word lengths
Switching the top-level inputs to arrays so they map to BRAM (i.e. not using ap_uint<2048>, but instead ap_uint<MWR2MM_w> foo[MWR2MM_e])
Experimenting with all sorts of directives: compartmentalizing into multiple inline functions, dataflow architecture, resource limits on lshr, etc. (the sketch below shows the general shape of these attempts)
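For illustration, a minimal sketch of what those directive experiments looked like; the interface, core names, and limits here are examples rather than the actual design, and the loop body is a placeholder:
#include "ap_int.h"

#define MWR2MM_w 8
#define MWR2MM_e 257
typedef ap_uint<MWR2MM_w> word_t;

// placeholder body - only the array interface and the directives are the point
void mwr2mm_words(word_t X[MWR2MM_e], word_t Yin[MWR2MM_e],
                  word_t Min[MWR2MM_e], word_t out[MWR2MM_e])
{
#pragma HLS RESOURCE variable=X core=RAM_1P_BRAM
#pragma HLS RESOURCE variable=Yin core=RAM_1P_BRAM
#pragma HLS RESOURCE variable=Min core=RAM_1P_BRAM
#pragma HLS ALLOCATION instances=lshr limit=1 operation
    for (int j = 0; j < MWR2MM_e; j++)
    {
#pragma HLS PIPELINE II=1
        out[j] = X[j] ^ Yin[j] ^ Min[j];
    }
}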
However, nothing really drives the LUT utilization down in a meaningful way. Is there a glaringly obvious way to reduce the utilization that is apparent to anyone?
In particular, I've seen papers on implementations of the mwr2mm algorithm that use only one DSP block and one BRAM. Is this even worth attempting to implement using HLS? Or is there no way that I can actually control the resources that the algorithm is mapped to without describing it in HDL?
Thanks for the help.

NEON: loading uint8_t array into 128-bit register

I need to load values from a uint8_t array into a 128-bit NEON register. There is a similar question, but it had no good answers.
My solution is:
#include <arm_neon.h>

uint8_t arr[8] = {1,2,3,4}; // vld1_u8 reads 8 bytes, so the array must hold at least 8
// load 8-bit vals into a 64-bit reg (only the first 4 matter here)
uint8x8_t _vld1_u8 = vld1_u8(arr);
// widen to 16-bit and move to a 128-bit reg
uint16x8_t _vmovl_u8 = vmovl_u8(_vld1_u8);
// get the low 64 bits and move them to a 64-bit reg
uint16x4_t _vget_low_u16 = vget_low_u16(_vmovl_u8);
// widen to 32-bit and move to a 128-bit reg
uint32x4_t ld32x4 = vmovl_u16(_vget_low_u16);
This works fine, but it seems to me that this approach is not the fastest. Maybe there is a better and faster way to load 8-bit data into a 128-bit register as 32-bit values?
Edit:
Thanks to @FrankH. I've come up with a second version using some hack:
uint8x16x2_t z = vzipq_u8(vld1q_u8(arr), q_zero);
uint8x16_t rr = *(uint8x16_t*)&z;
z = vzipq_u8(rr, q_zero);
ld32x4 = *(uint32x4_t*)&z;
It boils down to this assembly (when compiler optimisations are on):
vld1.8 {d16, d17}, [r5]
vzip.8 q8, q9
vorr q9, q4, q4
vzip.8 q8, q9
So there are no redundant stores and it's pretty fast. But it is still about 1.5x slower than the first solution.
You can do a "double zip" with zeroes:
uint16x4_t zero = 0;
uint32x4_t ld32x4 =
    vreinterpretq_u32_u16(
        vzipq_u8(
            vzip_u8(
                vld1_u8(arr),
                vreinterpret_u8_u16(zero)
            ),
            zero
        )
    );
Since the vreinterpretq_*() are no-ops, this boils down to three instructions. Don't have a crosscompiler around at the moment, can't validate that :(
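For what it's worth, here is a variant of that double-zip idea that does type-check, at the cost of spelling out the register pairing with .val[...] and vcombine_u8() (the function name is illustrative, and it may cost an extra move compared to the ideal aliasing):
#include <arm_neon.h>

uint32x4_t widen_u8_to_u32(const uint8_t *arr)
{
    // first zip: interleave the 8 source bytes with zeroes -> 16-bit lanes
    uint8x8x2_t z1 = vzip_u8(vld1_u8(arr), vdup_n_u8(0));
    uint8x16_t w = vcombine_u8(z1.val[0], z1.val[1]);
    // second zip: interleave with zeroes again -> 32-bit lanes;
    // val[0] now holds arr[0..3] as four uint32 lanes
    uint8x16x2_t z2 = vzipq_u8(w, vdupq_n_u8(0));
    return vreinterpretq_u32_u8(z2.val[0]);
}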
Edit:
Don't get me wrong there ... while vreinterpretq_*() doesn't result in a Neon instruction, it's not a no-op; that's because it stops the compiler from doing the kind of funky things you'd see if you instead used widerVal.val[0]. All it tells the compiler is something like:
"you've got a uint8x16x2_t but I want to use only half of that as a uint8x16_t, give me half the registers."
Or:
"you have a uint8x16x2_t but I want to use those regs as a uint32x4_t instead."
I.e. it tells the compiler to alias sets of NEON registers, preventing the stores/loads to/from the stack that you'd get if you did the explicit sub-set access through the .val[...] syntax.
In a way, the .val[...] syntax "is a hack", while the better method, the use of vreinterpretq_*(), merely "looks like a hack". Not using it results in more instructions and slower/inferior code.

How to store data larger than 128 bytes in JavaCard

I can't write data at an index above 128 in a byte array.
code is given below.
private void Write1(APDU apdu) throws ISOException
{
    apdu.setIncomingAndReceive();
    byte[] apduBuffer = apdu.getBuffer();
    byte j = (byte)apduBuffer[4]; // the incoming byte count, let's say 160
    Buffer1 = new byte[j];        // initialize an array, intended size 160
    for (byte i = 0; i < j; i++)
        Buffer1[(byte)i] = (byte)apduBuffer[5 + i];
}
It gives me the error 6F 00 (which as I understand means end of file reached).
I am using:
smart card type = contact card
Java Card 2.2.2 with JCOP, using APDUs
Your code contains several problems:
As already pointed out by 'pst', you are using a signed byte value, which works only up to 127 - use a short instead
You are creating a new buffer Buffer1 on every call of your Write1 method. On Java Card there is usually no automatic garbage collection, therefore memory allocation should only be done once, when the applet is installed. If you only want to process the data in the APDU buffer, just use it from there. And if you want to copy data from one byte array into another, better use javacard.framework.Util.arrayCopy(..).
You are calling apdu.setIncomingAndReceive() but ignore the return value. The return value gives you the number of bytes of data you can read.
The following code is from the API docs and shows the common way:
short bytesLeft = (short) (buffer[ISO7816.OFFSET_LC] & 0x00FF);
if (bytesLeft < (short)55) ISOException.throwIt( ISO7816.SW_WRONG_LENGTH );
short readCount = apdu.setIncomingAndReceive();
while (bytesLeft > 0) {
    // process bytes in buffer[5] to buffer[readCount+4];
    bytesLeft -= readCount;
    readCount = apdu.receiveBytes(ISO7816.OFFSET_CDATA);
}
short j = (short) (apdu_buffer[ISO7816.OFFSET_LC] & 0xFF);
Elaborating on pst's answer: a byte can represent 2^8 = 256 different values, but if you are working with signed numbers they wrap around. So 128 will actually be -128, 129 will be -127, and so on.
Update: While the following answer is "valid" for normal Java, please refer to Robert's answer for Java Card-specific information, as well as additional concerns/approaches.
In Java a byte has values in the range [-128, 127], so when you say "160", that's not what the code is really giving you :)
Perhaps you'd like to use:
int j = apduBuffer[4] & 0xFF;
That "upcasts" the value apduBuffer[4] to an int while treating the original byte data as an unsigned value.
Likewise, i should also be an int (to avoid a nasty overflow-and-loop-forever bug), and the System.arraycopy method could be handy as well...
(I have no idea if that is the only/real problem -- or if the above is a viable solution on a Java Card -- but it sure is a problem and aligns with the "128 limit" mentioned.)
Happy coding.

EXC_BAD_ACCESS when calling avcodec_encode_video

I have an Objective-C class (although I don't believe this is anything Obj-C specific) that I am using to write a video out to disk from a series of CGImages. (The code I am using at the top to get the pixel data comes right from Apple: http://developer.apple.com/mac/library/qa/qa2007/qa1509.html). I successfully create the codec and context - everything is going fine until it gets to avcodec_encode_video, when I get EXC_BAD_ACCESS. I think this should be a simple fix, but I just can't figure out where I am going wrong.
I took out some error checking for succinctness. 'c' is an AVCodecContext*, which is created successfully.
-(void)addFrame:(CGImageRef)img
{
    CFDataRef bitmapData = CGDataProviderCopyData(CGImageGetDataProvider(img));
    long dataLength = CFDataGetLength(bitmapData);
    uint8_t* picture_buff = (uint8_t*)malloc(dataLength);
    CFDataGetBytes(bitmapData, CFRangeMake(0, dataLength), picture_buff);

    AVFrame *picture = avcodec_alloc_frame();
    avpicture_fill((AVPicture*)picture, picture_buff, c->pix_fmt, c->width, c->height);

    int outbuf_size = avpicture_get_size(c->pix_fmt, c->width, c->height);
    uint8_t *outbuf = (uint8_t*)av_malloc(outbuf_size);

    out_size = avcodec_encode_video(c, outbuf, outbuf_size, picture); // ERROR occurs here
    printf("encoding frame %3d (size=%5d)\n", i, out_size);
    fwrite(outbuf, 1, out_size, f);

    CFRelease(bitmapData);
    free(picture_buff);
    av_free(outbuf); // allocated with av_malloc, so release with av_free
    av_free(picture);
    i++;
}
I have stepped through it dozens of times. Here are some numbers...
dataLength = 408960
picture_buff = 0x5c85000
picture->data[0] = 0x5c85000 -- which I take to mean that avpicture_fill worked...
outbuf_size = 408960
and then I get EXC_BAD_ACCESS at avcodec_encode_video. Not sure if it's relevant, but most of this code comes from api-example.c. I am using Xcode, compiling for armv6/armv7 on Snow Leopard.
Thanks so much in advance for help!
I don't have enough information here to point to the exact error, but I think the problem is that the input picture contains less data than avcodec_encode_video() expects:
avpicture_fill() only sets some pointers and numeric values in the AVFrame structure. It does not copy anything, and does not check whether the buffer is large enough (and it cannot, since the buffer size is not passed to it). It does something like this (copied from ffmpeg source):
size = picture->linesize[0] * height;
picture->data[0] = ptr;
picture->data[1] = picture->data[0] + size;
picture->data[2] = picture->data[1] + size2;
picture->data[3] = picture->data[1] + size2 + size2;
Note that the width and height are passed from the variable "c" (the AVCodecContext, I assume), so they may be larger than the actual size of the input frame.
It is also possible that the width/height is good, but the pixel format of the input frame is different from what is passed to avpicture_fill(). (note that the pixel format also comes from the AVCodecContext, which may differ from the input). For example, if c->pix_fmt is RGBA and the input buffer is in YUV420 format (or, more likely for iPhone, a biplanar YCbCr), then the size of the input buffer is width*height*1.5, but avpicture_fill() expects the size of width*height*4.
So checking the input/output geometry and pixel formats should lead you to the cause of the error. If it does not help, I suggest that you should try to compile for i386 first. It is tricky to compile FFMPEG for the iPhone properly.
Does the codec you are encoding with support the RGB color space? You may need to use libswscale to convert to I420 before encoding. What codec are you using? Can you post the code where you initialize your codec context?
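If the color space is the problem, here is a rough, untested sketch of that conversion, reusing 'c' and 'picture_buff' from the question and the same old FFmpeg API:
#include <libswscale/swscale.h>

struct SwsContext *sws = sws_getContext(c->width, c->height, PIX_FMT_RGBA,
                                        c->width, c->height, PIX_FMT_YUV420P,
                                        SWS_BICUBIC, NULL, NULL, NULL);
AVFrame *yuv = avcodec_alloc_frame();
uint8_t *yuv_buf = (uint8_t*)av_malloc(
    avpicture_get_size(PIX_FMT_YUV420P, c->width, c->height));
avpicture_fill((AVPicture*)yuv, yuv_buf, PIX_FMT_YUV420P, c->width, c->height);

const uint8_t *src[1] = { picture_buff };  // tightly packed RGBA rows
int src_stride[1] = { 4 * c->width };
sws_scale(sws, src, src_stride, 0, c->height, yuv->data, yuv->linesize);
// ...then pass 'yuv' to avcodec_encode_video() instead of 'picture'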
The function RGBtoYUV420P may help you.
http://www.mail-archive.com/libav-user@mplayerhq.hu/msg03956.html

Decoding ima4 audio format

To reduce the download size of an iPhone application I'm compressing some audio files. Specifically I'm using afconvert on the command line to change .wav format to .caf format w/ ima4 compression.
I've read this (wooji-juice.com) awesome post about this exact topic. I'm having trouble w/ the "decoding ima4 packets" step. I've looked at their sample code and I'm stuck. Please help w/ some pseudo code or sample code that can guide me in the right direction.
Thanks!
Additional info:
Here is what I've completed and where I'm having trouble...
I can play .wav files in both the simulator and on the phone.
I can compress .wav files to .caf w/ ima4 compression using afconvert on the command line. I'm using the SoundEngine that came w/ CrashLanding (I fixed one memory leak).
I modified the SoundEngine code to look for the mFormatID 'ima4'.
I don't understand the blog post linked above starting w/ "Calculating the size of the unpacked data". Why do I need to do this? Also, what does the term "packet" refer to? I'm very new to any sort of audio programming.
After gathering all the data from Wooji-Juice, the Multimedia Wiki and Apple, here is my proposal (it may need some experimenting):
File structure
Apple IMA4 files are made of packets of 34 bytes. This is the packet unit used to build the file.
Each 34 bytes packet has two parts:
the first 2 bytes contain the preamble: an initial predictor and a step index
the remaining 32 bytes contain the sound nibbles (each 4-bit nibble is used to retrieve a 16-bit sample)
Each packet has 32 bytes of compressed data, which represent 64 samples of 16 bits.
If the sound file is stereo, the packets are interleaved (one for the left, one for the right); there must be an even number of packets.
Decoding
Each packet of 34 bytes will lead to the decompression of 64 samples of 16 bits. So the size of the uncompressed data is 128 bytes per packet.
The decoding pseudo code looks like:
int[] ima_index_table = ... // Index table from [Multimedia Wiki][2]
int[] ima_step_table = ...  // Step table from [Multimedia Wiki][2]

byte[] packet = ...  // A packet of 34 compressed bytes
short[] output = ... // The output buffer of 64 samples (128 bytes)

int preamble = (packet[0] << 8) | packet[1];
int predictor = preamble & 0xFF80;  // Upper 9 bits, see [Multimedia Wiki][2]
int step_index = preamble & 0x007F; // Lower 7 bits, see [Multimedia Wiki][2]
int step = ima_step_table[step_index];
int diff;
int i;
int j = 0;
for (i = 2; i < 34; i++) {
    byte data = packet[i];
    int lower_nibble = data & 0x0F;
    int upper_nibble = (data & 0xF0) >> 4;

    // Decode the lower nibble (careful: bit 3 of the nibble is a sign bit)
    step_index += ima_index_table[lower_nibble]; // then clamp to [0, 88]
    diff = ((signed)lower_nibble + 0.5f) * step / 4;
    predictor += diff;
    step = ima_step_table[step_index];
    // Clamp the predictor to the 16-bit output range
    if (predictor > 32767)
        output[j++] = 32767;
    else if (predictor < -32768)
        output[j++] = -32768;
    else
        output[j++] = (short) predictor;

    // Decode the upper nibble
    step_index += ima_index_table[upper_nibble]; // then clamp to [0, 88]
    diff = ((signed)upper_nibble + 0.5f) * step / 4;
    predictor += diff;
    step = ima_step_table[step_index];
    // Clamp the predictor to the 16-bit output range
    if (predictor > 32767)
        output[j++] = 32767;
    else if (predictor < -32768)
        output[j++] = -32768;
    else
        output[j++] = (short) predictor;
}
The term "packet" refers to a group of compressed audio samples with a header. You need the header to decode the data immediately following. If you consider your ima4 file to be a book, then each packet is a page. At the top are the values needed to decode that page, followed by the compressed audio.
That's why you need to calculate the size of the unpacked data (and then make space for it) -- since it's compressed, you need to convert data from compressed audio to uncompressed audio before you can output it. In order to allocate an output buffer, you need to know how big it has to be (note: you may need to output in chunks that are larger than a single packet at a time).
It looks like the typical structure, per the earlier "Overview" section, is that sets of 64 samples, each 16 bits (so 128 bytes), are translated to a 2-byte header and a 32-byte set of compressed samples (34 bytes in all). So, in the typical case, you can produce your expected output data size by taking the input data size, dividing by 34 to get the number of packets, then multiplying by 128 bytes for the uncompressed audio per packet.
You shouldn't hard-code that, though. It looks like you should instead query kAudioFilePropertyDataFormat to get mBytesPerPacket -- this is the "34" value above -- and mFramesPerPacket -- this is the 64, above, that gets multiplied by 2 (for 16-bit samples) to make 128 bytes of output.
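As a sketch of that query (error handling omitted; audioFile and compressedSize are assumed to come from your loading code):
#include <AudioToolbox/AudioToolbox.h>

AudioStreamBasicDescription asbd;
UInt32 propSize = sizeof(asbd);
AudioFileGetProperty(audioFile, kAudioFilePropertyDataFormat, &propSize, &asbd);

// For ima4, mBytesPerPacket is 34 and mFramesPerPacket is 64
UInt64 packetCount  = compressedSize / asbd.mBytesPerPacket;
UInt64 unpackedSize = packetCount * asbd.mFramesPerPacket * sizeof(SInt16);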
Then, for each packet, you will need to run through the decoding described in the post. In somewhat longer pseudo C-code, assuming you are getting arrays of bytes, to handle the header:
packet = GetPacket();
Header = (packet[0] << 8) | packet[1]; // Big-endian 16-bit value
step_index = Header & 0x007f; // Lower seven bits
predictor = Header & 0xff80;  // Upper nine bits
for (i = 2; i < mBytesPerPacket; i++)
{
    nibble = packet[i] & 0x0f;        // Low nibble
    process that nibble, per the blogpost -- be careful on sign-extension!
    nibble = (packet[i] & 0xf0) >> 4; // High nibble
    process that nibble, per the blogpost -- be careful on sign-extension!
}
The sign-extension above refers to the fact that the post involves handling each nibble both in an unsigned and a signed way. If the high bit of a nibble (bit 3) is 1, then it is negative; additionally, a bit-shift may do sign-extension. This is not handled in the above pseudocode.
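To make that concrete, here is one way to handle a nibble without relying on sign-extension behavior, following the diff formula from the proposal above (variable names match the pseudocode):
int magnitude = nibble & 0x07;  // low three bits
int diff = (int)((magnitude + 0.5f) * step / 4);
if (nibble & 0x08)              // bit 3 set means negative
    diff = -diff;
predictor += diff;              // then clamp to [-32768, 32767]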