I wish to know if there could be any significant difference in terms of memory efficiency between marshaling a struct and marshaling a marshaled struct.
Example:
Assume we have a struct B with some fields.
message B{...}
The common representation:
message A {
  B b = 1;
}
Another way:
message A {
  bytes b = 1;
}
Where b is a marshaled B struct.
Generally, is this good practice? Are there any efficiency implications?
Thanks,
Elad
At the payload level, they are identical - however, in terms of how implementations treat them, there may be differences. The most obvious difference is that you can't use a bytes until you further deserialize it; this has pros and cons:
if you weren't ever going to touch it anyway, this could be fine and advantageous - avoiding some CPU processing that you didn't need for read or write; this will also mean that any downstream allocations (strings, etc) don't need to happen - so you only have a single allocation chunk: easy and efficient
if you do need to read it, then in addition to making life less convenient, you could have allocated an extra chunk of memory for the raw form (a chunk of bytes), and you'll need to allocate for the deserialized form; if you went straight to the deserialized form, most implementations would have skipped that intermediate allocation
So: yes, it will have different characteristics. Whether they are advantageous (or the opposite) depends on whether you also need to do the extra deserialization step on the bytes payload.
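To make the "identical at the payload level" point concrete, here is a minimal C++ sketch that hand-encodes a length-delimited field the way the protobuf wire format describes it (tag, varint length, raw bytes). It does not use the protobuf library; the function names appendVarint and encodeLengthDelimited are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append an unsigned integer as a protobuf base-128 varint.
void appendVarint(std::string& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<char>(value));
}

// Encode a length-delimited field (wire type 2): tag, length, raw payload.
// A nested message field and a bytes field with the same field number are
// both written this way, so the resulting bytes are the same.
std::string encodeLengthDelimited(std::uint32_t fieldNumber,
                                  const std::string& payload) {
    assert(fieldNumber >= 1);  // field numbers start at 1
    std::string out;
    appendVarint(out, (static_cast<std::uint64_t>(fieldNumber) << 3) | 2u);
    appendVarint(out, payload.size());
    out += payload;
    return out;
}
```

Whether `payload` came from serializing a B or is just an opaque blob, the encoder cannot tell the difference; the distinction only exists in how the generated code treats the field after parsing.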
I think it's bad practice to declare a bytes field in place of a message you would otherwise have specified in the proto file.
This is called a specification hole: you will have to write additional documentation describing how the receiver should interpret the bytes.
I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:
struct protoarray {
    size_t dim1;
    size_t dim2;
    size_t dim3;
    float* data;
};
I create two variables of type protoarray, dynamically allocate space to data via malloc and cudaMalloc on the host and device side, and update dim1, dim2 and dim3 to reflect the size of array I want this struct to represent. I read in this thread that the struct should be passed via copy. So this is what I do in my kernel
__global__ void kernel(curandState_t *state, protoarray arr_device){
    const size_t dim1 = arr_device.dim1;
    const size_t dim2 = arr_device.dim2;
    for(size_t j(0); j < dim2; j++){
        for(size_t i(0); i < dim1; i++){
            // Do something
        }
    }
}
The struct is passed by copy, so all its contents are copied into shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel and setting a breakpoint at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576, nowhere near large enough to cause overflow, but this value copies correctly into dim1 as 2, which means that the for loops execute as I intended them to. As a side question, is using size_t, which is essentially unsigned long long int, bad practice, seeing as GPUs are made of 32-bit cores?
Generally, how safe is it to pass struct and class into kernels as arguments, or is it bad practice that should be avoided at all cost? I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory, and that they should be very lightweight if I want to pass them by value.
This is a partial answer, since without a proper program to look into, it is difficult/impossible to guess why you would see an invalid value in your arr_device.dim1.
The struct is passed by copy, so all its contents are copied into shared memory of each block.
Incorrect. Kernel arguments are stored in constant memory, which is device-global and not block-specific. They are not stored in shared memory (which is block-specific).
When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).
Generally, how safe is it to pass struct and class into kernels as arguments
My personal rule of thumb on this matter is: If the struct/class...
is trivially-copyable; and
all members of the struct/class are defined for both the host and the device side, or at least designed with GPU use in mind;
then it should be safe to pass to a kernel.
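The first condition of that rule of thumb can be checked at compile time on the host. The following is a small sketch in plain C++ (no CUDA required); owning_array and is_kernel_argument_candidate are hypothetical names introduced only for this example.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// The struct from the question: plain sizes plus a raw pointer.
struct protoarray {
    std::size_t dim1;
    std::size_t dim2;
    std::size_t dim3;
    float* data;
};

// Hypothetical counter-example: std::vector owns host memory and has a
// non-trivial copy constructor, so a byte-wise copy is not a valid copy.
struct owning_array {
    std::vector<float> data;
};

// Hypothetical helper: true when T can be copied byte-for-byte.
template <typename T>
constexpr bool is_kernel_argument_candidate() {
    return std::is_trivially_copyable<T>::value;
}

static_assert(is_kernel_argument_candidate<protoarray>(),
              "protoarray can be copied byte-for-byte");
static_assert(!is_kernel_argument_candidate<owning_array>(),
              "owning_array cannot be copied byte-for-byte");
```

Passing this check is necessary but not sufficient: the data pointer inside protoarray still has to point at memory the device can actually read.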
passing struct and class into kernels as arguments [ - ] is [it] bad practice that should be avoided at all cost?
No. But remember that most C++ libraries only provide host-side code and were not written with GPU use in mind. So I'd be wary of using non-trivial classes without a lot of scrutiny.
I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory
Yes, this can be problematic. However - if you used cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged() - then this should "just work", i.e. the relevant memory pages will be fetched to the GPU or the CPU as necessary when accessed. See:
Unified Memory in CUDA for beginners
Beyond GPU Memory Limits with Unified Memory on Pascal
and that they should be very lightweight if I want to pass [objects to kernels] by value.
Yes, because each and every thread has to read each argument from constant memory before it can use that argument. And while constant memory allows this to happen relatively quickly, it's still a bunch of overhead that you want to minimize.
Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.
I see that there are many ways to serialize/deserialize Haskell objects:
Data.Serialize -> encode, decode functions
Data.Binary http://code.haskell.org/binary/
MsgPack, JSON, BSON, etc
In my application, I want to setup a simple TCP client-server, where client may send serialized Haskell record objects. How does one decide between these serialization alternatives?
Additionally, when objects serialized into strings are sent over the network using Network.Socket, strings are returned. Is there a slightly higher level library, that works at the level of whole TCP messages? In other words, is there a way to avoid writing parsing code on the receive end that:
collects results of a sequence of recv() calls,
detects that a whole object has been received, and
then parses it into a Haskell type?
In my application, the objects are not expected to be too large (maybe about ~1MB max).
As for the second part of your question, two things are required:
1. An incremental parser that doesn't need to have the whole document in memory to start parsing, and which can be fed with the partial chunks of data arriving from the wire. Also, when the parsing succeeds it must return any "leftover data" along with the parsed value.
2. A source of data with "pushback capabilities", that allows you to "unread" any leftovers so that they are available to the next parsing attempt.
The most popular library providing (1) is attoparsec. As for (2), all the three main streaming libraries (conduit, io-streams, and pipes) offer some kind of pushback functionality (the latter using the auxiliary pipes-parse package). All three libraries can integrate with attoparsec parsers as well (see here, here and here).
(Another option, of course, is to prefix each message with its length and read only that exact number of bytes.)
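The length-prefix option is language-neutral, so here is a rough sketch of it in C++ rather than Haskell. It assumes a 4-byte big-endian length header, which is a choice made for this example and not something the question specifies; frameMessage and FrameDecoder are hypothetical names. The decoder keeps leftover bytes between calls, which is the same "pushback" idea described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Prefix a payload with its length as a 4-byte big-endian integer.
std::string frameMessage(const std::string& payload) {
    assert(payload.size() <= 0xFFFFFFFFu);  // must fit in the 4-byte header
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::string out;
    out.push_back(static_cast<char>((n >> 24) & 0xFF));
    out.push_back(static_cast<char>((n >> 16) & 0xFF));
    out.push_back(static_cast<char>((n >> 8) & 0xFF));
    out.push_back(static_cast<char>(n & 0xFF));
    out += payload;
    return out;
}

// Accumulates whatever each recv() call returned and hands back
// only complete messages.
class FrameDecoder {
public:
    // Feed one chunk; returns every message completed by it.
    std::vector<std::string> feed(const std::string& chunk) {
        buffer_ += chunk;
        std::vector<std::string> messages;
        while (buffer_.size() >= 4) {
            std::uint32_t n = 0;
            for (int i = 0; i < 4; ++i) {
                n = (n << 8) | static_cast<unsigned char>(buffer_[i]);
            }
            if (buffer_.size() - 4 < n) break;  // body not fully received yet
            messages.push_back(buffer_.substr(4, n));
            buffer_.erase(0, 4 + static_cast<std::size_t>(n));
        }
        return messages;
    }

private:
    std::string buffer_;  // leftover bytes waiting for the rest of a frame
};
```

The sender calls frameMessage on each serialized record; the receiver passes every chunk it reads to feed and deserializes whatever comes back. A real server would also want to cap the accepted length so a bad header cannot make it buffer without bound.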
To answer the first part of your question (about data serialization), I would say that everything you listed sounds fine. Since you are dealing with pretty big (1MB) serializations, I think that the most important thing is laziness. There is another serialization library, called cereal, that has strict serializations, and you wouldn't want that because you'd need to build the whole result up in memory before sending it out. I'll give a shout out to aeson (http://hackage.haskell.org/package/aeson-0.8.0.2/docs/Data-Aeson.html), which you can use with GHC Generics to get something simple like this:
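{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson (FromJSON, ToJSON)
import GHC.Generics (Generic)

data Shape = Rect Int Int | Circle Double | Other String Int
  deriving (Generic)
instance FromJSON Shape -- uses the Generic-based default
instance ToJSON Shape -- uses the Generic-based default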
And then, bam!, you've got access to the encode and decode methods. I don't know about a higher level TCP library. Hopefully, someone else will have more insight on that.
I have a series of StrBuf objects and want to know the most efficient way to concatenate them together.
There's the add() method but the docs say, "Add x.toStr to the end of this buffer". If I'm doing this over and over and over again, I'd imagine that StrBuf.toStr() is not that performant.
(I know the real answer is to just use the one StrBuf, but humour me here!)
Cheers.
UPDATE:
Looking at the Java source, under the hood StrBuf uses a Java StringBuilder which uses a char array as its internal buffer. So @Adrian, yeah, it's important to have a large initial buffer.
As far as StrBuf.toStr() is concerned, a new Java String is created using Arrays.copyOfRange() - which is reasonable but unnecessary given there's an append(StringBuffer sb) method.
Not really - there are no optimizations in StrBuf for that case. Are you seeing performance issues with what you have today?
Generally for our APIs we would pass around the OutStream instance from StrBuf.out and not the actual StrBuf instance. Not sure if that helps your case or not.
I suppose you could create a new StrBuf with a big enough capacity:
capacity
Int capacity
The number of characters this buffer can hold without allocating more memory.
make
new make(Int capacity := 16)
Create with initial capacity (defaults to 16)
I've been looking through the Apple documentation for the NSData class, and I didn't really find it too enlightening. I know how to use the class but I don't really understand the gravity of the advantages that it may or may not provide. I know it's a simple question but perhaps it would be good to have such information as a reference.
Advantages over what? Certainly, it's useful to represent an arbitrary block of data as an object just as it's useful to represent a string, a number, or a value as an object. Memory management becomes simpler and is consistent with memory management for all other objects, and there are a number of useful methods defined.
Say you want to read a binary file into memory. We won't worry about the reasons why -- there are as many reasons as there are data file formats. You'll have to:
Check the size of the file
Allocate a block of memory of the proper size
Open the file
Read the contents into memory
Close the file
Remember to free the memory when you're done with it (a condition that can sometimes be tricky to detect)
(Optional) Worry about whether the block of memory has been modified
With NSData, you can just create a new instance from a path or URL and not have to think about the rest.
So if you want to look at the sync block for an object under SOS, you have to look at -4 bytes (on 32-bit machines) before the object address. Does anyone know what the rationale is for going back 4 bytes? I mean they could have the sync block at 0, then the type handle at +4 and then the object fields at +8.
This is an implementation detail, so I can't give you the exact reason for the placement of the syncblock. However, if you look at the shared source CLI, you'll see that the runtime has all sorts of optimizations for how objects are allocated and used, and actually the data associated with a single instance is located in several different places. The syncblock for instance is just an index value for a structure located elsewhere. Similarly the MethodTable and the EEClass are stored elsewhere. These are all implementation details. The important point IMO is understanding how to dig out the information needed during debugging. It is of much less importance to understand why the implementation details are as they are.
I'd say it matches expectations, especially for structs that have been explicitly laid out. As Brian says, it's just an implementation detail though. It's similar to how many implementations of malloc will allocate more space than requested, store the allocation size in the first four (or eight) bytes, and then return a pointer that is offset to point to the next byte beyond that.
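The malloc comparison can be sketched in a few lines of C++. This is only an illustration of the "header at a negative offset" layout, not how any particular malloc or the CLR is actually implemented; BlockHeader, sizedAlloc, sizedAllocSize and sizedFree are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical header stored immediately before each block handed out.
// alignas keeps the returned pointer suitably aligned for ordinary types.
struct alignas(std::max_align_t) BlockHeader {
    std::size_t size;  // number of bytes the caller asked for
};

// Allocate header + payload and return a pointer just past the header.
void* sizedAlloc(std::size_t size) {
    void* raw = std::malloc(sizeof(BlockHeader) + size);
    if (raw == nullptr) return nullptr;
    BlockHeader* header = new (raw) BlockHeader{size};
    return header + 1;
}

// Recover the header by stepping back from the user pointer.
std::size_t sizedAllocSize(const void* p) {
    assert(p != nullptr);
    return (static_cast<const BlockHeader*>(p) - 1)->size;
}

// Free the whole block, starting from the hidden header.
void sizedFree(void* p) {
    if (p != nullptr) std::free(static_cast<BlockHeader*>(p) - 1);
}
```

The caller only ever sees the pointer to its own data, so offset 0 is always "the start of the thing I asked for", while the bookkeeping lives just in front of it. That is the same shape as the object header sitting before the address that the object reference points at.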