How many bits do I need to store AB+C? - cpu-architecture

I was wondering about this-
If A, B are 16-bit numbers and C is 8-bit, how many bits would I need to store the result ? 32 or 33 ?
And, what if C was a 16-bit number? What then ?
I would appreciate if I got answers with an explanation of the hows and whys.

Why don't you just take the maximum value for each register, and check the result?
If all registers are unsigned:
0xFFFF * 0xFFFF + 0xFF = 0xFFFE0100 // 32 bits are enough
0xFFFF * 0xFFFF + 0xFFFF = 0xFFFF0000 // 32 bits are enough
If all registers are signed (two's complement), then 0xFFFF = -1, so it is no longer the extreme value; the largest magnitude is 0x8000 = -32768. The largest product is then (-32768) * (-32768) = 0x40000000 (negative * negative = positive), which is smaller than the unsigned maximum. Register C changes that result only a little, so a signed 32-bit result is still enough, whether C is 8 or 16 bits wide.
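The "plug in the maximum values" check above can be sketched in C. This is only an illustration under stated assumptions: the helper names bits_needed and max_ab_plus_c are made up for this example, and the operands are treated as unsigned.

```c
#include <stdint.h>

/* Smallest number of bits that can hold the unsigned value v (0 still takes 1 bit). */
int bits_needed(uint64_t v)
{
    int n = 1;
    while (v >>= 1)
        ++n;
    return n;
}

/* Largest value of A*B + C for unsigned A and B of ab_bits each and C of c_bits.
   Assumes ab_bits <= 31 and c_bits <= 63 so that nothing overflows uint64_t. */
uint64_t max_ab_plus_c(unsigned ab_bits, unsigned c_bits)
{
    uint64_t ab_max = (UINT64_C(1) << ab_bits) - 1;
    uint64_t c_max  = (UINT64_C(1) << c_bits) - 1;
    return ab_max * ab_max + c_max;
}
```

With ab_bits = 16, both c_bits = 8 and c_bits = 16 give a value that needs exactly 32 bits. More generally, (2^n - 1)^2 + 2 * (2^n - 1) = 2^(2n) - 1, so even A*B + C + D with all operands n bits wide still fits in 2n bits.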

Related

Nibble shuffling with x64 SIMD

I'm aware of byte shuffling instructions, but I'd like to do the same with nibbles (4-bit values), concretely I'd like to shuffle 16 nibbles in a 64-bit word. My shuffling indices are also stored as 16 nibbles. What's the most efficient implementation of this?
Arbitrary shuffles with a control vector that has to be stored this way? Ugh, hard to work with. I guess you'd have to unpack both to feed SSSE3 pshufb and then re-pack that result.
Probably just punpcklbw against a right-shifted copy, then AND mask to keep only the low 4 bits in each byte. Then pshufb.
Sometimes an odd/even split is easier than widening each element (so bits just stay within their original byte or word). In this case, if we could change your nibble index numbering, punpcklqdq could put the odd or even nibbles in the high half, ready to bring them back down and OR.
But without doing that, re-packing is a separate problem. I guess combine each adjacent pair of bytes into the low byte of a word, perhaps with pmaddubsw if throughput is more important than latency. Then you can packuswb (against zero or itself) or pshufb (with a constant control vector).
If you were doing multiple such shuffles, you could pack two vectors down to one, to store with movhps / movq. Using AVX2, it might be possible to have all the other instructions working on two independent shuffles in the two 128-bit lanes.
// UNTESTED, requires only SSSE3
#include <stdint.h>
#include <immintrin.h>
uint64_t shuffle_nibbles(uint64_t data, uint64_t control)
{
    __m128i vd = _mm_cvtsi64_si128(data);    // movq
    __m128i vd_hi = _mm_srli_epi32(vd, 4);   // x86 doesn't have a SIMD byte shift
    vd = _mm_unpacklo_epi8(vd, vd_hi);       // every nibble at the bottom of a byte, with high garbage
    vd = _mm_and_si128(vd, _mm_set1_epi8(0x0f));  // clear high garbage for later merging
    __m128i vc = _mm_cvtsi64_si128(control);
    __m128i vc_hi = _mm_srli_epi32(vc, 4);
    vc = _mm_unpacklo_epi8(vc, vc_hi);
    vc = _mm_and_si128(vc, _mm_set1_epi8(0x0f));  // make sure high bit is clear, else pshufb zeros that element.
    // AVX-512VBMI vpermb doesn't have that problem, if you have it available
    vd = _mm_shuffle_epi8(vd, vc);
    // left-hand input is the unsigned one, right hand is treated as signed bytes.
    vd = _mm_maddubs_epi16(vd, _mm_set1_epi16(0x1001));  // hi nibbles << 4 (*= 0x10), lo nibbles *= 1.
    // vd has nibbles merged into bytes, but interleaved with zero bytes
    vd = _mm_packus_epi16(vd, vd);  // duplicate vd into low & high halves.
    // Pack against _mm_setzero_si128() if you're not just going to movq into memory or a GPR and you want the high half of the vector to be zero.
    return _mm_cvtsi128_si64(vd);
}
Masking the data with 0x0f ahead of the shuffle (instead of after) allows more ILP on CPUs with two shuffle units. At least if they already had the uint64_t values in vector registers, or if the data and control values are coming from memory so both can be loaded in the same cycle. If coming from GPRs, 1/clock throughput for vmovq xmm, reg means there's a resource conflict between the dep chains so they can't both start in the same cycle. But since the data might be ready before the control, masking early keeps it off the critical path for control->output latency.
If latency is a bottleneck instead of the usual throughput, consider replacing pmaddubsw with right-shift by 4, por, and AND/pack. Or pshufb to pack while ignoring garbage in odd bytes. Since you'd need another constant anyway, might as well make it a pshufb constant instead of and.
If you had AVX-512, a shift and bit-blend with vpternlogd could avoid needing to mask the data before shuffling, and vpermb instead of vpshufb would avoid needing to mask the control, so you'd avoid the set1_epi8(0x0f) constant entirely.
clang's shuffle optimizer didn't spot anything, just compiling it as-written like GCC does (https://godbolt.org/z/xz7TTbM1d), even with -march=sapphirerapids. Not spotting that it could use vpermb instead of vpand / vpshufb.
shuffle_nibbles(unsigned long, unsigned long):
        vmovq   xmm0, rdi
        vpsrld  xmm1, xmm0, 4
        vpunpcklbw      xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
        vmovq   xmm1, rsi
        vpsrld  xmm2, xmm1, 4
        vpunpcklbw      xmm1, xmm1, xmm2 # xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
        vmovdqa xmm2, xmmword ptr [rip + .LCPI0_0] # xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
        vpand   xmm0, xmm0, xmm2
        vpand   xmm1, xmm1, xmm2
        vpshufb xmm0, xmm0, xmm1
        vpmaddubsw      xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
        vpackuswb       xmm0, xmm0, xmm0
        vmovq   rax, xmm0
        ret
(Without AVX, it requires 2 extra movdqa register-copy instructions.)
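To make the three stages of the SSSE3 version easier to follow (and to have something to test it against), here is a plain scalar model of the same unpack / shuffle / re-pack pipeline. It is a sketch, not a performance suggestion; the function names are invented for this example, and it assumes the convention that nibble 0 is the least significant one.

```c
#include <stdint.h>

/* Stage 1: widen 16 nibbles into 16 bytes, nibble i landing in out[i]
   (the effect of punpcklbw against a right-shifted copy, then AND 0x0f). */
void unpack_nibbles(uint64_t x, uint8_t out[16])
{
    for (int i = 0; i < 16; ++i)
        out[i] = (x >> (4 * i)) & 0x0f;
}

/* Stage 3: merge each byte pair back into one byte as lo * 1 + hi * 16,
   which is what pmaddubsw against 0x1001 followed by a pack computes. */
uint64_t pack_nibbles(const uint8_t in[16])
{
    uint64_t x = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t byte = in[2 * i] * 1u + in[2 * i + 1] * 16u;
        x |= byte << (8 * i);
    }
    return x;
}

uint64_t shuffle_nibbles_ref(uint64_t data, uint64_t control)
{
    uint8_t d[16], c[16], r[16];
    unpack_nibbles(data, d);
    unpack_nibbles(control, c);
    for (int i = 0; i < 16; ++i)   /* Stage 2: pshufb when every index is 0..15 */
        r[i] = d[c[i]];
    return pack_nibbles(r);
}
```

Since the data 0xfedcba9876543210 has nibble i equal to i, shuffling it returns the control word unchanged, which makes a handy first sanity check for any of the vector versions.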
I came across this problem today. In AVX-512 you can use vpmultishiftqb, an amusing instruction available in Ice Lake and later (and apparently in Zen 4, according to Wikipedia), to shuffle nibbles much more quickly. Its power lies in its ability to permute bytes in an unaligned fashion: for each of the eight control bytes in a 64-bit element, it selects an unaligned 8-bit chunk from the corresponding 64-bit data element. Below is an implementation.
#include <immintrin.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
// Convention: (a & (0xf << (4 * i))) >> (4 * i) is the ith nibble of a
// (i.e., lowest-significant is 0)
uint64_t shuffle_nibbles(uint64_t data, uint64_t indices) {
#if defined(__AVX512VBMI__) && defined(__AVX512VL__)
    // If your data is already in vectors, then this method also works in parallel
    const __m128i lo_nibble_msk = _mm_set1_epi8(0x0f);
    __m128i v_data = _mm_cvtsi64_si128(data);
    __m128i v_indices = _mm_cvtsi64_si128(indices);
    __m128i indices_lo = _mm_and_si128(lo_nibble_msk, v_indices);
    __m128i indices_hi = _mm_andnot_si128(lo_nibble_msk, v_indices);
    indices_lo = _mm_slli_epi32(indices_lo, 2);
    indices_hi = _mm_srli_epi32(indices_hi, 2);
    // Look up unaligned bytes
    __m128i shuffled_hi = _mm_multishift_epi64_epi8(indices_hi, v_data);
    __m128i shuffled_lo = _mm_multishift_epi64_epi8(indices_lo, v_data);
    shuffled_hi = _mm_slli_epi32(shuffled_hi, 4);
    // msk ? lo : hi
    __m128i shuffled = _mm_ternarylogic_epi32(lo_nibble_msk, shuffled_lo, shuffled_hi, 202);
    return _mm_cvtsi128_si64(shuffled);
#else
    // Fallback scalar implementation (preferably Peter Cordes's SSE solution--this is as an example)
    uint64_t result = 0;
    for (int i = 0; i < 16; ++i) {
        indices = (indices >> 60) + (indices << 4);
        int idx = indices & 0xf;
        result <<= 4;
        result |= (data >> (4 * idx)) & 0xf;
    }
    return result;
#endif
}
int main() {
    // 0xaa025411fe034102
    uint64_t r1 = shuffle_nibbles(0xfedcba9876543210, 0xaa025411fe034102);
    // 0x55fdabee01fcbefd
    uint64_t r2 = shuffle_nibbles(0x0123456789abcdef, 0xaa025411fe034102);
    // 0xaaaa00002222aaaa
    uint64_t r3 = shuffle_nibbles(0xaa025411fe034102, 0xeeee11110000ffff);
    printf("0x%" PRIx64 "\n", r1);
    printf("0x%" PRIx64 "\n", r2);
    printf("0x%" PRIx64 "\n", r3);
}
Clang yields:
.LCPI0_0:
        .zero   16,60
shuffle_nibbles(unsigned long, unsigned long):
        vmovq   xmm0, rdi
        vmovq   xmm1, rsi
        vpslld  xmm2, xmm1, 2
        vpsrld  xmm1, xmm1, 2
        vmovdqa xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60]
        vpand   xmm1, xmm1, xmm3
        vpmultishiftqb  xmm1, xmm1, xmm0
        vpand   xmm2, xmm2, xmm3
        vpmultishiftqb  xmm0, xmm2, xmm0
        vpslld  xmm1, xmm1, 4
        vpternlogd      xmm1, xmm0, dword ptr [rip + .LCPI0_1]{1to4}, 216
        vmovq   rax, xmm1
In my case, I am shuffling nibbles in 64-bit-element vectors; this method also avoids the need for widening. If your shuffles are constant and you stay in vectors, this method reduces to a measly four instructions: 2x vpmultishiftqb, 1x vpslld, and 1x vpternlogd. Counting µops suggests a latency of 5 cycles and a throughput of one shuffle every 2 cycles, bottlenecked on shuffle µops, for 128- and 256-bit vectors; and one every 3 cycles for 512-bit vectors, due to reduced execution units for the latter two instructions.
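If you don't have the hardware at hand, the trick can be checked with a scalar model. The sketch below emulates one 64-bit lane of vpmultishiftqb (each result byte is the 8 bits of the data starting at the bit offset given by the low 6 bits of the matching control byte, wrapping around) and then builds the nibble shuffle on top of it. The function names are hypothetical, chosen only for this example.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of vpmultishiftqb: result byte i is the
   8 bits of data starting at bit (control byte i & 63), with wrap-around. */
uint64_t multishift_qb(uint64_t ctrl, uint64_t data)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; ++i) {
        unsigned s = (unsigned)(ctrl >> (8 * i)) & 63;
        /* rotate right by s; s == 0 is special-cased to avoid a 64-bit shift */
        uint64_t rot = s ? (data >> s) | (data << (64 - s)) : data;
        out |= (rot & 0xff) << (8 * i);
    }
    return out;
}

uint64_t shuffle_nibbles_multishift(uint64_t data, uint64_t indices)
{
    const uint64_t lo_mask = UINT64_C(0x0f0f0f0f0f0f0f0f);
    uint64_t idx_lo = (indices & lo_mask) << 2;   /* nibble index * 4 = bit offset */
    uint64_t idx_hi = (indices & ~lo_mask) >> 2;  /* same, for the high nibbles */
    uint64_t lo = multishift_qb(idx_lo, data);
    uint64_t hi = multishift_qb(idx_hi, data) << 4;
    return (lo & lo_mask) | (hi & ~lo_mask);      /* msk ? lo : hi */
}
```

Each looked-up byte carries the wanted nibble in its low half plus four bits of garbage above it; the final blend keeps only the wanted halves, which is why no masking of the data is needed beforehand.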

Understanding CRC32 value as division remainder

I'm struggling to understand the CRC algorithm. I've been reading this tutorial and, if I got it correctly, a CRC value is just the remainder of a division where the message serves as the dividend and the divisor is a predefined value, carried out in a special kind of polynomial arithmetic. It looked quite simple so I tried implementing CRC-32:
public static uint Crc32Naive(byte[] bytes)
{
    uint poly = 0x04c11db7; // (Poly)
    uint crc = 0xffffffff; // (Init)
    foreach (var it in bytes)
    {
        var b = (uint)it;
        for (var i = 0; i < 8; ++i)
        {
            var prevcrc = crc;
            // load LSB from current byte into LSB of crc (RefIn)
            crc = (crc << 1) | (b & 1);
            b >>= 1;
            // subtract polynomial if we've just popped out 1
            if ((prevcrc & 0x80000000) != 0)
                crc ^= poly;
        }
    }
    return Reverse(crc ^ 0xffffffff); // (XorOut) (RefOut)
}
(where the Reverse function reverses bit order)
It is supposed to be analogous to following method of division (with some additional adjustments):
            1100001010
       _______________
10011 ) 11010110110000
        10011,,.,,....
        -----,,.,,....
         10011,.,,....
         10011,.,,....
         -----,.,,....
          00001.,,....
          00000.,,....
          -----.,,....
           00010,,....
           00000,,....
           -----,,....
            00101,....
            00000,....
            -----,....
             01011....
             00000....
             -----....
              10110...
              10011...
              -----...
               01010..
               00000..
               -----..
                10100.
                10011.
                -----.
                 01110
                 00000
                 -----
                  1110 = Remainder
For 0x00 the function returns 0xd202ef8d, which is correct, but for 0x01 it returns 0xd302ef8d instead of 0xa505df1b (I've been using this page to verify my results).
The result from my implementation makes more sense to me: incrementing the dividend by 1 should only change the remainder by 1, right? But it turns out that the result should look completely different. So apparently I am missing something obvious. What is it? How can changing the least significant digit of the dividend influence the result this much?
This is an example of a left shifting CRC that emulates division, with the CRC initialized = 0, and no complementing or reversing of the crc. The example code is emulating a division where 4 bytes of zeroes are appended to bytes[] ({bytes[],0,0,0,0} is the dividend, the divisor is 0x104c11db7, the quotient is not used, and the remainder is the CRC).
public static uint Crc32Naive(byte[] bytes)
{
    uint poly = 0x04c11db7; // (Poly is actually 0x104c11db7)
    uint crc = 0; // (Init)
    foreach (var it in bytes)
    {
        crc ^= ((uint)it) << 24; // xor next byte to upper 8 bits of crc (cast to uint so the xor stays unsigned)
        for (var i = 0; i < 8; ++i) // cycle the crc 8 times
        {
            var prevcrc = crc;
            crc = (crc << 1);
            // subtract polynomial if we've just popped out 1
            if ((prevcrc & 0x80000000) != 0)
                crc ^= poly;
        }
    }
    return crc;
}
It's common to initialize the CRC to something other than zero, but it's not that common to post-complement the CRC, and I'm not aware of any CRC that does a post bit reversal of the CRC.
Another variation of CRC is one that right shifts, normally used to emulate hardware where data is sent in bytes, least significant bit first.
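The claim that the left-shifting loop emulates a division of {bytes[],0,0,0,0} by 0x104c11db7 can be checked with a small C sketch. Both functions below are illustrative ports written for this comparison (the names are made up): the first follows the answer's loop with the CRC initialized to 0, the second performs the literal bit-by-bit long division with 32 zero bits appended.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY UINT32_C(0x04C11DB7)  /* low 32 bits of the divisor 0x104C11DB7 */

/* Byte-at-a-time form: xor each byte into the top of the CRC, then cycle 8 times. */
uint32_t crc32_shift(const uint8_t *bytes, size_t len)
{
    uint32_t crc = 0;
    for (size_t n = 0; n < len; ++n) {
        crc ^= (uint32_t)bytes[n] << 24;
        for (int i = 0; i < 8; ++i) {
            uint32_t top = crc & UINT32_C(0x80000000);
            crc <<= 1;
            if (top)
                crc ^= CRC32_POLY;
        }
    }
    return crc;
}

/* Literal long division: message bits enter at the low end, MSB first,
   followed by 4 appended zero bytes. */
uint32_t crc32_division(const uint8_t *bytes, size_t len)
{
    uint32_t rem = 0;
    for (size_t n = 0; n < len + 4; ++n) {
        uint8_t b = (n < len) ? bytes[n] : 0;
        for (int i = 7; i >= 0; --i) {
            uint32_t top = rem & UINT32_C(0x80000000);
            rem = (rem << 1) | ((uint32_t)(b >> i) & 1u);
            if (top)
                rem ^= CRC32_POLY;
        }
    }
    return rem;
}
```

For the single byte 0x01 both return 0x04c11db7 rather than 1. The appended zeroes multiply the message by 2^32 before dividing, so a change in the lowest message bit is spread over the whole remainder, which is why the asker's "remainder changes by 1" intuition does not hold.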

Find most significant bit in Swift

I need to find the value (or position) of the most significant bit (MSB) of an integer in Swift.
Eg:
Input number: 9
Input as binary: 1001
MS value as binary: 1000 -> (which is 8 in decimal)
MS position as decimal: 3 (because 1<<3 == 1000)
Many processors (Intel, AMD, ARM) have instructions for this. In C, these are exposed. Are these instructions similarly available in Swift through a library function, or would I need to implement some bit twiddling?
The value is more useful in my case.
If a position is returned, then the value can be easily derived by a single shift.
Conversely, computing position from value is not so easy unless a fast Hamming Weight / pop count function is available.
You can use the flsl() function ("find last set bit, long"):
import Foundation // flsl() comes from the C library on Apple platforms
let x = 9
let p = flsl(x)
print(p) // 4
The result is 4 because flsl() and the related functions number the bits starting at 1, the least significant bit.
On Intel platforms you can use the _bit_scan_reverse intrinsic; in my test in a macOS application this translated to a BSR instruction.
import _Builtin_intrinsics.intel
let x: Int32 = 9
let p = _bit_scan_reverse(x)
print(p) // 3
You can use the properties leadingZeroBitCount and trailingZeroBitCount to find the Most Significant Bit and Least Significant Bit.
For example,
let i: Int = 95
let lsb = i.trailingZeroBitCount
let msb = Int.bitWidth - 1 - i.leadingZeroBitCount
print("i: \(i) = \(String(i, radix: 2))") // i: 95 = 1011111
print("lsb: \(lsb) = \(String(1 << lsb, radix: 2))") // lsb: 0 = 1
print("msb: \(msb) = \(String(1 << msb, radix: 2))") // msb: 6 = 1000000
If you look at the disassembly (ARM Mac) in LLDB for the Least Significant Bit code, it reverses the bits with rbit and then uses a single clz instruction to count the zeroed bits. (ARM Reference)
** 15 let lsb = i.trailingZeroBitCount
0x100ed947c <+188>: rbit x9, x8
0x100ed9480 <+192>: clz x9, x9
0x100ed9484 <+196>: mov x10, x9
0x100ed9488 <+200>: str x10, [sp, #0x2d8]
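For comparison, here is what the "bit twiddling" route mentioned in the question looks like when no intrinsic is available. It is a portable C sketch, not something taken from the answers above; the function names are made up, and in practice a compiler builtin or the Swift properties shown earlier would normally be preferable.

```c
#include <stdint.h>

/* Value of the most significant set bit of v (0 if v == 0).
   Smear the top bit into every lower position, then keep only the top one. */
uint64_t msb_value(uint64_t v)
{
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    return v - (v >> 1);
}

/* Zero-based position of the most significant set bit, or -1 if v == 0. */
int msb_position(uint64_t v)
{
    int pos = -1;
    while (v) {
        v >>= 1;
        ++pos;
    }
    return pos;
}
```

For the input 9 from the question, msb_value gives 8 and msb_position gives 3, matching 1<<3 == 1000.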

Maximum integer in Perl

Set $i=0 and do ++$i while it increases. Which number would we reach?
Note that it may be not the same as maximum integer in Perl (as asked in the title), because there may be gaps between adjacent integers which are greater than 1.
"Integer" can refer to a family of data types (int16_t, uint32_t, etc). There's no gap in the numbers these can represent.
"Integer" can also refer to numbers without a fractional component, regardless of the type of the variable used to store it. ++ will seamlessly transition between data types, so this is what's relevant to this question.
Floating point numbers can store integers in this sense, and it's possible to store very large numbers as floats without being able to add one to them. The reason for this is that floating point numbers are stored using the following form:
[+/-]1._____..._____ * 2**____
For example, let's say the mantissa of your floats can store 52 bits after the decimal, and you want to add 1 to 2**53.
     __52 bits__
    /           \
  1.00000...00000  * 2**53    Large power of two
+ 1.00000...00000  * 2**0     1
--------------------------
  1.00000...00000  * 2**53
+ 0.00000...000001 * 2**53    Normalized exponents
--------------------------
  1.00000...00000  * 2**53
+ 0.00000...00000  * 2**53    What we really get due to limited number of bits
--------------------------
  1.00000...00000  * 2**53    Original large power of two
So it is possible to hit a gap when using floating point numbers. However, you started with a number stored as signed integer.
$ perl -MB=svref_2object,SVf_IVisUV,SVf_NOK -e'
   $i = 0;
   $sv = svref_2object(\$i);
   print $sv->FLAGS & SVf_NOK    ? "NV\n"   # Float
      :  $sv->FLAGS & SVf_IVisUV ? "UV\n"   # Unsigned int
      :                            "IV\n";  # Signed int
'
IV
++$i will leave the number as a signed integer value ("IV") until it cannot anymore. At that point, it will start using unsigned integer values ("UV").
$ perl -MConfig -MB=svref_2object,SVf_IVisUV,SVf_NOK -e'
   $i = hex("7F".("FF"x($Config{ivsize}-2))."FD");
   $sv = svref_2object(\$i);
   for (1..4) {
      ++$i;
      printf $sv->FLAGS & SVf_NOK    ? "NV %.0f\n"
         :   $sv->FLAGS & SVf_IVisUV ? "UV %u\n"
         :                             "IV %d\n", $i;
   }
'
IV 2147483646
IV 2147483647            <-- 2**31 - 1  Largest IV
UV 2147483648
UV 2147483649
or
IV 9223372036854775806
IV 9223372036854775807   <-- 2**63 - 1  Largest IV
UV 9223372036854775808
UV 9223372036854775809
Still no gap because no floating point numbers have been used yet. But Perl will eventually use floating point numbers ("NV") because they have a far larger range than integers. ++$i will switch to using a floating point number when it runs out of unsigned integers.
When that happens depends on your build of Perl. Not all builds of Perl have the same integer and floating point number sizes.
On one machine:
$ perl -V:[in]vsize
ivsize='4'; # 32-bit integers
nvsize='8'; # 64-bit floats
On another:
$ perl -V:[in]vsize
ivsize='8'; # 64-bit integers
nvsize='8'; # 64-bit floats
On a system where nvsize is larger than ivsize
On these systems, the first gap will happen above the largest unsigned integer. If your system uses IEEE double-precision floats, your floats have 53 bits of precision. They can represent without loss all integers from -2**53 to 2**53 (inclusive). ++ will fail to increment beyond that.
$ perl -MConfig -MB=svref_2object,SVf_IVisUV,SVf_NOK -e'
   $i = eval($Config{nv_overflows_integers_at}) - 3;
   $sv = svref_2object(\$i);
   for (1..4) {
      ++$i;
      printf $sv->FLAGS & SVf_NOK    ? "NV %.0f\n"
         :   $sv->FLAGS & SVf_IVisUV ? "UV %u\n"
         :                             "IV %d\n", $i;
   }
'
NV 9007199254740990
NV 9007199254740991
NV 9007199254740992   <-- 2**53      Requires 1 bit of precision as a float
NV 9007199254740992   <-- 2**53 + 1  Requires 54 bits of precision as a float
                                     but only 53 are available.
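The stall at 2**53 is a property of the floating point format rather than of Perl, so it can be reproduced in C. The sketch below assumes that double is an IEEE 754 binary64 value and that the default round-to-nearest mode is in effect; the function names are made up for this example.

```c
/* x + 1 rounded to double. The volatile store forces the result into a real
   64-bit double even on targets that compute with extra precision. */
double increment(double x)
{
    volatile double r = x + 1.0;
    return r;
}

/* Count upward from start until adding 1 no longer changes the value,
   and return the value at which the counter is stuck. */
double first_stuck_value(double start)
{
    double x = start;
    for (;;) {
        double next = increment(x);
        if (next == x)
            return x;
        x = next;
    }
}
```

Starting a few steps below 2**53, the loop stops at 9007199254740992: the exact sum 2**53 + 1 lies halfway between two representable doubles and is rounded back down to 2**53.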
On a system where nvsize is no larger than ivsize
On these systems, the first gap in the floats lies below the largest unsigned integer. Switching to floating point numbers will allow you to go one further (a large power of two), but that's it. ++ will fail to increment beyond the largest unsigned integer + 1.
$ perl -MConfig -MB=svref_2object,SVf_IVisUV,SVf_NOK -e'
   $i = hex(("FF"x($Config{ivsize}-1))."FD");
   $sv = svref_2object(\$i);
   for (1..4) {
      ++$i;
      printf $sv->FLAGS & SVf_NOK    ? "NV %.0f\n"
         :   $sv->FLAGS & SVf_IVisUV ? "UV %u\n"
         :                             "IV %d\n", $i;
   }
'
UV 18446744073709551614
UV 18446744073709551615   <-- 2**64 - 1  Largest UV
NV 18446744073709551616   <-- 2**64      Requires 1 bit of precision as a float
NV 18446744073709551616   <-- 2**64 + 1  Requires 65 bits of precision as a float
                                         but only 53 are available.
This is on 32-bit perl,
perl -e "$x=2**53-5; printf qq{%.f\n}, ++$x for 1..10"
9007199254740988
9007199254740989
9007199254740990
9007199254740991
9007199254740992
9007199254740992
9007199254740992
9007199254740992
9007199254740992
9007199254740992
Well, on my 64-bit machine it's 18446744073709551615 (much easier to write as ~0), after which it increases one more time to 1.84467440737096e+19 and stops incrementing.

Convert 16bit colour to 32bit

I've got an 16bit bitmap image with each colour represented as a single short (2 bytes), I need to display this in a 32bit bitmap context. How can I convert a 2 byte colour to a 4 byte colour in C++?
The input format contains each colour in a single short (2 bytes).
The output format is 32bit RGB. This means each pixel has 3 bytes I believe?
I need to convert the short value into RGB colours.
Excuse my lack of knowledge of colours, this is my first adventure into the world of graphics programming.
Normally a 16-bit pixel is 5 bits of red, 6 bits of green, and 5 bits of blue data. The minimum-error solution (that is, the one for which the output colour is as close a match as possible to the input colour) is:
red8bit = (red5bit << 3) | (red5bit >> 2);
green8bit = (green6bit << 2) | (green6bit >> 4);
blue8bit = (blue5bit << 3) | (blue5bit >> 2);
To see why this solution works, let's look at a red pixel. Our 5-bit red is some fraction fivebit/31. We want to translate that into a new fraction eightbit/255. Some simple arithmetic:
fivebit     eightbit
-------  =  --------
  31          255
Yields:
eightbit = fivebit * 8.226
Or closely (note the squiggly ≈):
eightbit ≈ (fivebit * 8) + (fivebit * 0.25)
That operation is a multiply by 8 and a divide by 4. Owch - both operations that might take forever on your hardware. Lucky thing they're both powers of two and can be converted to shift operations:
eightbit = (fivebit << 3) | (fivebit >> 2);
The same steps work for green, which has six bits per pixel, but you get an accordingly different answer, of course! The quick way to remember the solution is that you're taking the top bits off of the "short" pixel and adding them on at the bottom to make the "long" pixel. This method works equally well for any data set you need to map up into a higher resolution space. A couple of quick examples:
five bit space    eight bit space    error
00000             00000000            0%
11111             11111111            0%
10101             10101101           +0.15%
00111             00111001           -1.01%
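The bit-replication rule can be tried out with a short C sketch that also compares it against exact scaling with rounding. The function names and the 0RGB output layout are assumptions made for this example, not something fixed by the answer.

```c
#include <stdint.h>

/* Bit replication: shift up, then refill the low bits from the top bits. */
uint8_t expand5(uint8_t v) { return (uint8_t)((v << 3) | (v >> 2)); }
uint8_t expand6(uint8_t v) { return (uint8_t)((v << 2) | (v >> 4)); }

/* Exact scaling for comparison: v * 255 / max, rounded to nearest. */
uint8_t scale_exact(uint8_t v, unsigned max)
{
    return (uint8_t)((v * 255u + max / 2) / max);
}

/* Largest absolute difference between the two methods over all 5-bit inputs. */
int max_diff5(void)
{
    int worst = 0;
    for (unsigned v = 0; v < 32; ++v) {
        int d = (int)expand5((uint8_t)v) - (int)scale_exact((uint8_t)v, 31);
        if (d < 0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* RGB565 (rrrrrggg gggbbbbb) to 0RGB8888, assuming that channel order. */
uint32_t rgb565_to_0rgb8888(uint16_t p)
{
    uint8_t r = expand5((uint8_t)((p >> 11) & 0x1f));
    uint8_t g = expand6((uint8_t)((p >> 5) & 0x3f));
    uint8_t b = expand5((uint8_t)(p & 0x1f));
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | b;
}
```

In this sketch the shifts never differ from the rounded exact value by more than one step out of 255, and black and white map exactly to 0x00 and 0xFF per channel.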
Common formats include BGR0, RGB0, 0RGB, 0BGR. In the code below I have assumed 0RGB. Changing this is easy, just modify the shift amounts in the last line.
unsigned long rgb16_to_rgb32(unsigned short a)
{
    /* 1. Extract the red, green and blue values */
    /* from rrrr rggg gggb bbbb */
    unsigned long r = (a & 0xF800) >> 11;
    unsigned long g = (a & 0x07E0) >> 5;
    unsigned long b = (a & 0x001F);
    /* 2. Convert them to 0-255 range:
       There is more than one way. You can just shift them left:
       to 00000000 rrrrr000 gggggg00 bbbbb000
           r <<= 3;
           g <<= 2;
           b <<= 3;
       But that means your image will be slightly dark and
       off-colour as white 0xFFFF will convert to F8,FC,F8
       So instead you can scale by multiply and divide: */
    r = r * 255 / 31;
    g = g * 255 / 63;
    b = b * 255 / 31;
    /* This ensures 31/31 converts to 255/255 */
    /* 3. Construct your 32-bit format (this is 0RGB): */
    return (r << 16) | (g << 8) | b;
    /* Or for BGR0:
       return (r << 8) | (g << 16) | (b << 24);
    */
}
Multiply the three (four, when you have an alpha layer) values by 16 - that's it :)
You have a 16-bit color and want to make it a 32-bit color. This gives you four times four bits, which you want to convert to four times eight bits. You're adding four bits, but you should add them to the right side of the values. To do this, shift them by four bits (multiply by 16). Additionally you could compensate a bit for inaccuracy by adding 8 (you're adding 4 bits, which has the value of 0-15, and you can take the average of 8 to compensate)
Update: This only applies to colors that use 4 bits for each channel and have an alpha channel.
There are some questions about the model, like is it HSV or RGB?
If you wanna ready, fire, aim I'd try this first.
#include <stdint.h>
uint32_t convert(uint16_t _pixel)
{
    uint32_t pixel;
    pixel = (uint32_t)_pixel;
    return ((pixel & 0xF000) << 16)
         | ((pixel & 0x0F00) << 12)
         | ((pixel & 0x00F0) << 8)
         | ((pixel & 0x000F) << 4);
}
This maps 0xRGBA -> 0xRRGGBBAA, or possibly 0xHSVA -> 0xHHSSVVAA, but it won't do 0xHSVA -> 0xRRGGBBAA.
I'm here long after the fight, but I actually had the same problem with ARGB color instead, and none of the answers are truly right. Keep in mind that this answer addresses a slightly different situation, where we want to do this conversion:
AAAARRRRGGGGBBBB -> AAAAAAAARRRRRRRRGGGGGGGGBBBBBBBB
If you want to keep the same ratio of your color, you simply have to do a cross-multiplication. You want to convert a value x between 0 and 15 to a value y between 0 and 255, therefore you want y = 255 * x / 15.
However, 255 = 15 * 17, and 17 is 16 + 1, so you now have y = 17 * x = 16 * x + x.
Which is actually the same as doing a four-bit shift to the left and then adding the value again (or more visually, duplicating the value: 0b1101 becomes 0b11011101).
Now that you have this, you can compute your whole number by doing:
a = v & 0b1111000000000000
r = v & 0b111100000000
g = v & 0b11110000
b = v & 0b1111
return b | b << 4 | g << 4 | g << 8 | r << 8 | r << 12 | a << 12 | a << 16
Moreover, as the lower bits won't have much effect on the final color, if exactness isn't necessary you can gain some performance by simply multiplying each component by 16:
return b << 4 | g << 8 | r << 12 | a << 16
(All the left-shift amounts look strange because we did not bother doing a right shift first.)
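The same conversion written out in C, as a sketch: the first function extracts each 4-bit channel and multiplies it by 17, the second mirrors the "no right shift" expression from this answer so the two can be compared. Both function names are made up for this example.

```c
#include <stdint.h>

/* ARGB4444 -> ARGB8888 by nibble duplication: x * 17 == (x << 4) | x == 255 * x / 15. */
uint32_t argb4444_to_argb8888(uint16_t v)
{
    uint32_t a = (v >> 12) & 0xf;
    uint32_t r = (v >> 8) & 0xf;
    uint32_t g = (v >> 4) & 0xf;
    uint32_t b = v & 0xf;
    return ((a * 17u) << 24) | ((r * 17u) << 16) | ((g * 17u) << 8) | (b * 17u);
}

/* Same result using the masks-in-place form, with no right shifts. */
uint32_t argb4444_to_argb8888_noshift(uint16_t v16)
{
    uint32_t v = v16;
    uint32_t a = v & 0xF000;
    uint32_t r = v & 0x0F00;
    uint32_t g = v & 0x00F0;
    uint32_t b = v & 0x000F;
    return b | b << 4 | g << 4 | g << 8 | r << 8 | r << 12 | a << 12 | a << 16;
}
```

For example, the pixel 0xF1D8 becomes 0xFF11DD88 with either function: every nibble is simply written twice.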