Offset in cache Mips - cpu-architecture

I know that offset is : block size=2^n (offset=n). But i have seen that when block size = 8 bytes we do : 8=2^n so offset=n=3 bits, which is correct, but when block size = 1 word, i have seen 1=2^n (offset=n=0). Dont we need to convert word to bytes if we know that cache has 32-bit memory address? (So we have 32bit=4bytes, 4=2^n offset is 2 in that case).

Your did right, It's intuitive that one should know that a word is 4 byte in 32 bit processor and 8 byte in 64 bit.
The byte offset can also can be find in this way, Assume you have address size 32 bits then
byte_offset = 32-tag_bits-set_bits.
In order to solve problem of this kind it's good to know some useful parameter and equation.
Parameter to know
C = cache capacity
b = block size
B = number of blocks
N = degree of associativity
S = number of set
tag_bits
set_bits (also called index)
byte_offset
v = valid bits
Equations to know
B = C/b
S = B/N
b = 2^(byte_offset)
S = 2^(set_bits)
Memory Address
|___tag________|____set___|___byte offset_|

Related

Why are the hex numbers for big endian different than little endian?

#include<stdio.h>
int main()
{
typedef unsigned char *byte_pointer;
void show_bytes(byte_pointer start, size_t len)
{
int i;
for (i = 0; i < len; i++)
{
printf(" %.2x", start[i]);
printf("\n");
}
}
void show_int(int x)
{
show_bytes((byte_pointer) &x, sizeof(int));
}
void show_float(int x)
{
show_bytes((byte_pointer) &x, sizeof(float));
}
void show_pointer(int x)
{
show_bytes((byte_pointer) &x, sizeof(void *));
}
int a = 0x12345678;
byte_pointer ap = (byte_pointer) &a;
show_bytes(ap, 3);
return 0;
}
(Solutions according to the CS:APP book)
Big endian: 12 34 56
Little endian: 78 56 34
I know systems have different conventions for storage allocation but if two systems use the same convention but are different endian why are the hex values different?
Endian-ness is an issue that arises when we use more than one storage location for a value/type, which we do because somethings won't fit in a single storage location.
As soon as we use multiple storage locations for a single value that gives rise to the question of:  What part of the value will we store in each storage location?
The first byte of a two-byte item will have a lower address than the second byte, and in particular, the address of the second byte will be at +1 from the address of the lower byte.
Storing a two-byte item in two bytes of storage, do we store the most significant byte first and the least significant byte second, or vice versa?
We choose to use directly consecutive bytes for the two bytes of the two-byte item, so no matter which (endian) way we choose to store such an item, we refer to the whole two-byte item by the lower address (the address of its first byte).
We can express these storage choices with a formula, here item[0] refer to the first byte while item[1] refers to the second byte.
item[0] = value >> 8 // also value / 256
item[1] = value & 0xFF // also value % 256
value = (item[0]<<8) | item[1] // also item[0]*256 | item[1]
--vs--
item[0] = value & 0xFF // also value % 256
item[1] = value >> 8 // also value / 256
value = item[0] | (item[1]<<8) // also item[0] | item[1]*256
The first set of formulas is for big endian, and the second for little endian.
By these formulas, it doesn't matter what order we access memory as to whether item[0] first, then item[1], or vice versa, or both at the same time (common in hardware), as long as the formulas for one endian are consistently used.
If the item in question is a four-byte value, then there are 4 possible orderings(!) — though only two of them are truly sensible.
For efficiency, the hardware offers us multibyte memory access in one instruction (and with one reference, namely to the lowest address of the multibyte item), and therefore, the hardware itself needs to define and consistently use one of the two possible/reasonable orderings.
If the hardware did not offer multibyte memory access, then the ordering would be entirely up to the software program itself to define (accessing memory one byte at a time), and the program could choose big or little endian, even differently for each variable, as long as it consistently accesses the multiple bytes of memory in the same manner to reassemble the values stored there.
In a similar manner, when we define a structure of multiple items (e.g. struct point { int x; int y; }, software chooses whether x comes first or y comes first in memory ordering.  However, since programmers (and compilers) will still choose to use hardware instructions to access individual fields such as x in one go, the hardware's endian configuration remains necessary.

How to read and write bits in a chunk of memory in Swift

I would like to know how to read a binary file into memory (writing it to memory like an "Array Buffer" from JavaScript), and write to different parts of memory 8-bit, 16-bit, 32-bit etc. values, even 5 bit or 10 bit values.
extension Binary {
static func readFileToMemory(_ file) -> ArrayBuffer {
let data = NSData(contentsOfFile: "/path/to/file/7CHands.dat")!
var dataRange = NSRange(location: 0, length: ?)
var ? = [Int32](count: ?, repeatedValue: ?)
data.getBytes(&?, range: dataRange)
}
static func writeToMemory(_ buffer, location, value) {
buffer[location] = value
}
static func readFromMemory(_ buffer, location) {
return buffer[location]
}
}
I have looked at a bunch of places but haven't found a standard reference.
https://github.com/nst/BinUtils/blob/master/Sources/BinUtils.swift
https://github.com/apple/swift/blob/master/stdlib/public/core/ArrayBuffer.swift
https://github.com/uraimo/Bitter/blob/master/Sources/Bitter/Bitter.swift
In Swift, how do I read an existing binary file into an array?
Swift - writing a byte stream to file
https://apple.github.io/swift-nio/docs/current/NIO/Structs/ByteBuffer.html
https://github.com/Cosmo/BinaryKit/blob/master/Sources/BinaryKit.swift
https://github.com/vapor-community/bits/blob/master/Sources/Bits/Data%2BBytesConvertible.swift
https://academy.realm.io/posts/nate-cook-tryswift-tokyo-unsafe-swift-and-pointer-types/
https://medium.com/#gorjanshukov/working-with-bytes-in-ios-swift-4-de316a389a0c
I would like for this to be as low-level as possible. So perhaps using UnsafeMutablePointer, UnsafePointer, or UnsafeMutableRawPointer.
Saw this as well:
let data = NSMutableData()
var goesIn: Int32 = 42
data.appendBytes(&goesIn, length: sizeof(Int32))
println(data) // <2a000000]
var comesOut: Int32 = 0
data.getBytes(&comesOut, range: NSMakeRange(0, sizeof(Int32)))
println(comesOut) // 42
I would basically like to allocate a chunk of memory and be able to read and write from it. Not sure how to do that. Perhaps using C is the best way, not sure.
Just saw this too:
let rawData = UnsafeMutablePointer<UInt8>.allocate(capacity: width * height * 4)
If you're looking for low level code you'll need to use UnsafeMutableRawPointer. This is a pointer to a untyped data. Memory is accessed in bytes, so 8 chunks of at least 8 bits. I'll cover multiples of 8 bits first.
Reading a File
To read a file this way, you need to manage file handles and pointers yourself. Try the the following code:
// Open the file in read mode
let file = fopen("/Users/joannisorlandos/Desktop/ownership", "r")
// Files need to be closed manually
defer { fclose(file) }
// Find the end
fseek(file, 0, SEEK_END)
// Count the bytes from the start to the end
let fileByteSize = ftell(file)
// Return to the start
fseek(file, 0, SEEK_SET)
// Buffer of 1 byte entities
let pointer = UnsafeMutableRawPointer.allocate(byteCount: fileByteSize, alignment: 1)
// Buffer needs to be cleaned up manually
defer { pointer.deallocate() }
// Size is 1 byte
let readBytes = fread(pointer, 1, fileByteSize, file)
let errorOccurred = readBytes != fileByteSize
First you need to open the file. This can be done using Swift strings since the compiler makes them into a CString itself.
Because cleanup is all for us on this low level, a defer is put in place to close the file at the end.
Next, the file is set to seek the end of the file. Then the distance between the start of the file and the end is calculated. This is used later, so the value is kept.
Then the program is set to return to the start of the file, so the application starts reading from the start.
To store the file, a pointer is allocated with the amount of bytes that the file has in the file system. Note: This can change inbetween the steps if you're extremely unlucky or the file is accessed quite often. But I think for you, this is unlikely.
The amount of bytes is set, and aligned to one byte. (You can learn more about memory alignment on Wikipedia.
Then another defer is added to make sure no memory leaks at the end of this code. The pointer needs to be deallocated manually.
The file's bytes are read and stored in the pointer. Do note that this entire process reads the file in a blocking manner. It can be more preferred to read files asynchronously, if you plan on doing that I'll recommend looking into a library like SwiftNIO instead.
errorOccurred can be used to throw an error or handle issues in another manner.
From here, your buffer is ready for manipulation. You can print the file if it's text using the following code:
print(String(cString: pointer.bindMemory(to: Int8.self, capacity: fileByteSize)))
From here, it's time to learn how to read manipulate the memory.
Manipulating Memory
The below demonstrates reading byte 20..<24 as an Int32.
let int32 = pointer.load(fromByteOffset: 20, as: Int32.self)
I'll leave the other integers up to you. Next, you can alos put data at a position in memory.
pointer.storeBytes(of: 40, toByteOffset: 30, as: Int64.self)
This will replace byte 30..<38 with the number 40. Note that big endian systems, although uncommon, will store information in a different order from normal little endian systems. More about that here.
Modifying Bits
As you notes, you're also interested in modifying five or ten bits at a time. To do so, you'll need to mix the previous information with the new information.
var data32bits = pointer.load(fromByteOffset: 20, as: Int32.self)
var newData = 0b11111000
In this case, you'll be interested in the first 5 bits and want to write them over bit 2 through 7. To do so, first you'll need to shift the bits to a position that matches the new position.
newData = newData >> 2
This shifts the bits 2 places to the right. The two left bits that are now empty are therefore 0. The 2 bits on the right that got shoved off are not existing anymore.
Next, you'll want to get the old data from the buffer and overwrite the new bits.
To do so, first move the new byte into a 32-bits buffer.
var newBits = numericCast(newData) as Int32
The 32 bits will be aligned all the way to the right. If you want to replace the second of the four bytes, run the following:
newBits = newBits << 16
This moves the fourth pair 16 bit places left, or 2 bytes. So it's now on position 1 starting from 0.
Then, the two bytes need to be added on top of each other. One common method is the following:
let oldBits = data32bits & 0b11111111_11000001_11111111_11111111
let result = oldBits | newBits
What happens here is that we remove the 5 bits with new data from the old dataset. We do so by doing a bitwise and on the old 32 bits and a bitmap.
The bitmap has all 1's except for the new locations which are being replaced. Because those are empty in the bitmap, the and operator will exclude those bits since one of the two (old data vs. bitmap) is empty.
AND operators will only be 1 if both sides of the operator are 1.
Finally, the oldBits and the newBits are merged with an OR operator. This will take each bit on both sides and set the result to 1 if the bits at both positions are 1.
This will merge successfully since both buffers contain 1 bits that the other number doesn't set.

CAPL - Converting 4 raw bytes into floating point

CAPL - Vector.
I receive message ID 0x110 which holds current information:
0x3E6978D5 -> 0.228
Currently I can read the data and save into Enviroment Variable to show in Panel using:
putValue(slow_current, this.long(4));
But I don't know how to convert the HEX 4 bytes into float variable, since I cannot use address or casting (float* x = (float *)&vBuffer;)
How to make this conversion in CAPL script? Thanks.
Typically your dbc-file shall contain conversion info from raw value (in your case 4B long) to physical value in form of factor and offset definition:
So your physical value of current shall be calculated as follows:
phys_val = (raw_value * factor) + offset
Note: if you define negative offset then you actually subtracting it in equation above.
But it seems you don't have dbc-file so you need to figure out factor and offset by yourself (if you have 2 example raw values and know their physical equivalent then it shall be as easy as finding linear equation parameters -> y = ax + b).
CAPL shall look like this:
variables
{
float current_phys;
/* adjust below values to your needs */
float factor = 0.001
dword offset = -1000
}
on message 0x110
{
current_phys = (this.long(4) * factor) + offset;
write(current_phys);
}
Alternate solution if you don't want to force transform the value:
You define a sysvar type float(double) and use that sysvar in the panel
(link to it), instead of the envVar
or you change the type of envVar to float(double).
The translation into float will be done automatically
.
Caveat: usually this trick requires that the input number is also 8 bytes as the defined CAPL float range 8 bytes. But you have this by message payload length constraint= 8bytes.
Does not look good, but works:
received msg: 0x3E6978D5
putValue(float4byte,interpretAdFloat(this.long(4)));
float4byte = 0.23
i just reused Vinícius Oliveira solution to avoid creating environment variable. it worked
float floatvalue;
floatvalue = interpretAsFloat(HexValue);
input (HexValue) = 0x3fe20e3a
output(floatvalue() = 1.76606

Reducing LUT utilization in a Vivado HLS design (RSA cryptosystem using montgomery multiplication)

A question/problem for anyone experienced with Xilinx Vivado HLS and FPGA design:
I need help reducing the utilization numbers of a design within the confines of HLS (i.e. can't just redo the design in an HDL). I am targeting the Zedboard (Zynq 7020).
I'm trying to implement 2048-bit RSA in HLS, using the Tenca-koc multiple-word radix 2 montgomery multiplication algorithm, shown below (More algorithm details here):
I wrote this algorithm in HLS and it works in simulation and in C/RTL cosim. My algorithm is here:
#define MWR2MM_m 2048 // Bit-length of operands
#define MWR2MM_w 8 // word size
#define MWR2MM_e 257 // number of words per operand
// Type definitions
typedef ap_uint<1> bit_t; // 1-bit scan
typedef ap_uint< MWR2MM_w > word_t; // 8-bit words
typedef ap_uint< MWR2MM_m > rsaSize_t; // m-bit operand size
/*
* Multiple-word radix 2 montgomery multiplication using carry-propagate adder
*/
void mwr2mm_cpa(rsaSize_t X, rsaSize_t Yin, rsaSize_t Min, rsaSize_t* out)
{
// extend operands to 2 extra words of 0
ap_uint<MWR2MM_m + 2*MWR2MM_w> Y = Yin;
ap_uint<MWR2MM_m + 2*MWR2MM_w> M = Min;
ap_uint<MWR2MM_m + 2*MWR2MM_w> S = 0;
ap_uint<2> C = 0; // two carry bits
bit_t qi = 0; // an intermediate result bit
// Store concatenations in a temporary variable to eliminate HLS compiler warnings about shift count
ap_uint<MWR2MM_w> temp_concat=0;
// scan X bit-by bit
for (int i=0; i<MWR2MM_m; i++)
{
qi = (X[i]*Y[0]) xor S[0];
// C gets top two bits of temp_concat, j'th word of S gets bottom 8 bits of temp_concat
temp_concat = X[i]*Y.range(MWR2MM_w-1,0) + qi*M.range(MWR2MM_w-1,0) + S.range(MWR2MM_w-1,0);
C = temp_concat.range(9,8);
S.range(MWR2MM_w-1,0) = temp_concat.range(7,0);
// scan Y and M word-by word, for each bit of X
for (int j=1; j<=MWR2MM_e; j++)
{
temp_concat = C + X[i]*Y.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + qi*M.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j);
C = temp_concat.range(9,8);
S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) = temp_concat.range(7,0);
S.range(MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)) = (S.bit(MWR2MM_w*j), S.range( MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)+1));
}
S.range(S.length()-1, S.length()-MWR2MM_w) = 0;
C=0;
}
// if final partial sum is greater than the modulus, bring it back to proper range
if (S >= M)
S -= M;
*out = S;
}
Unfortunately, the LUT utilization is huge.
This is problematic because I need to be able to fit multiple of these blocks in hardware as axi4-lite slaves.
Could someone please provide a few suggestions as to how I can reduce the LUT utilization, WITHIN THE CONFINES OF HLS?
I've already tried the following:
Experimenting with different word lengths
switching the top level inputs to arrays so they are BRAM (i.e. not using ap_uint<2048>, but instead ap_uint foo[MWR2MM_e])
Experimenting with all sorts of directives: compartmentalizing into multiple inline functions, dataflow architecture, resource limits on lshr, etc.
However, nothing really drives the LUT utilization down in a meaningful way. Is there a glaringly obvious way that I could reduce the utilization that is apparent to anyone?
In particular, I've seen papers on implementations of the mwr2mm algorithm that (only use one DSP block and one BRAM). Is this even worth attempting to implement using HLS? Or is there no way that I can actually control the resources that the algorithm is mapped to without describing it in HDL?
Thanks for the help.

How to do bitwise operation decently?

I'm doing analysis on binary data. Suppose I have two uint8 data values:
a = uint8(0xAB);
b = uint8(0xCD);
I want to take the lower two bits from a, and whole content from b, to make a 10 bit value. In C-style, it should be like:
(a[2:1] << 8) | b
I tried bitget:
bitget(a,2:-1:1)
But this just gave me separate [1, 1] logical type values, which is not a scalar, and cannot be used in the bitshift operation later.
My current solution is:
Make a|b (a or b):
temp1 = bitor(bitshift(uint16(a), 8), uint16(b));
Left shift six bits to get rid of the higher six bits from a:
temp2 = bitshift(temp1, 6);
Right shift six bits to get rid of lower zeros from the previous result:
temp3 = bitshift(temp2, -6);
Putting all these on one line:
result = bitshift(bitshift(bitor(bitshift(uint16(a), 8), uint16(b)), 6), -6);
This is doesn't seem efficient, right? I only want to get (a[2:1] << 8) | b, and it takes a long expression to get the value.
Please let me know if there's well-known solution for this problem.
Since you are using Octave, you can make use of bitpack and bitunpack:
octave> a = bitunpack (uint8 (0xAB))
a =
1 1 0 1 0 1 0 1
octave> B = bitunpack (uint8 (0xCD))
B =
1 0 1 1 0 0 1 1
Once you have them in this form, it's dead easy to do what you want:
octave> [B A(1:2)]
ans =
1 0 1 1 0 0 1 1 1 1
Then simply pad with zeros accordingly and pack it back into an integer:
octave> postpad ([B A(1:2)], 16, false)
ans =
1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0
octave> bitpack (ans, "uint16")
ans = 973
That or is equivalent to an addition when dealing with integers
result = bitshift(bi2de(bitget(a,1:2)),8) + b;
e.g
a = 01010111
b = 10010010
result = 00000011 100010010
= a[2]*2^9 + a[1]*2^8 + b
an alternative method could be
result = mod(a,2^x)*2^y + b;
where the x is the number of bits you want to extract from a and y is the number of bits of a and b, in your case:
result = mod(a,4)*256 + b;
an extra alternative solution close to the C solution:
result = bitor(bitshift(bitand(a,3), 8), b);
I think it is important to explain exactly what "(a[2:1] << 8) | b" is doing.
In assembly, referencing individual bits is a single operation. Assume all operations take the exact same time and "efficient" a[2:1] starts looking extremely inefficient.
The convenience statement actually does (a & 0x03).
If your compiler actually converts a uint8 to a uint16 based on how much it was shifted, this is not a 'free' operation, per se. Effectively, what your compiler will do is first clear the "memory" to the size of uint16 and then copy "a" into the location. This requires an extra step (clearing the "memory" (register)) that wouldn't normally be needed.
This means your statement actually is (uint16(a & 0x03) << 8) | uint16(b)
Now yes, because you're doing a power of two shift, you could just move a into AH, move b into AL, and AH by 0x03 and move it all out but that's a compiler optimization and not what your C code said to do.
The point is that directly translating that statement into matlab yields
bitor(bitshift(uint16(bitand(a,3)),8),uint16(b))
But, it should be noted that while it is not as TERSE as (a[2:1] << 8) | b, the number of "high level operations" is the same.
Note that all scripting languages are going to be very slow upon initiating each instruction, but will complete said instruction rapidly. The terse nature of Python isn't because "terse is better" but to create simple structures that the language can recognize so it can easily go into vectorized operations mode and start executing code very quickly.
The point here is that you have an "overhead" cost for calling bitand; but when operating on an array it will use SSE and that "overhead" is only paid once. The JIT (just in time) compiler, which optimizes script languages by reducing overhead calls and creating temporary machine code for currently executing sections of code MAY be able to recognize that the type checks for a chain of bitwise operations need only occur on the initial inputs, hence further reducing runtime.
Very high level languages are quite different (and frustrating) from high level languages such as C. You are giving up a large amount of control over code execution for ease of code production; whether matlab actually has implemented uint8 or if it is actually using a double and truncating it, you do not know. A bitwise operation on a native uint8 is extremely fast, but to convert from float to uint8, perform bitwise operation, and convert back is slow. (Historically, Matlab used doubles for everything and only rounded according to what 'type' you specified)
Even now, octave 4.0.3 has a compiled bitshift function that, for bitshift(ones('uint32'),-32) results in it wrapping back to 1. BRILLIANT! VHLL place you at the mercy of the language, it isn't about how terse or how verbose you write the code, it's how the blasted language decides to interpret it and execute machine level code. So instead of shifting, uint32(floor(ones / (2^32))) is actually FASTER and more accurate.