DMA Scatter-Gather and contiguous memory? - linux-device-driver

I understand that the point of scatter-gather in DMA is to gather data from small memory segments, because large contiguous blocks of memory are not easy to find. So if I need to do S-G, should I allocate the small pieces of memory using kmalloc? What is the maximum size I can allocate in one go for one descriptor?
Is there any other use for the S-G feature?
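As a rough sketch (assuming the standard Linux scatterlist API; the chunk count and size are arbitrary and error unwinding is abbreviated), allocating the pieces with kmalloc() and describing them with an sg_table could look like the following. Note that kmalloc()'s per-call ceiling is KMALLOC_MAX_SIZE, which varies with kernel configuration (historically 128 KB, commonly a few MB on modern kernels), and large requests can fail under fragmentation.

#include <linux/slab.h>
#include <linux/scatterlist.h>

#define NCHUNKS  8
#define CHUNK_SZ 4096   /* one page per descriptor; illustrative only */

static int build_sg(struct sg_table *table, void *bufs[NCHUNKS])
{
    struct scatterlist *sg;
    int i, ret;

    ret = sg_alloc_table(table, NCHUNKS, GFP_KERNEL);
    if (ret)
        return ret;

    /* Each descriptor points at a small, independently allocated chunk,
     * so no large physically contiguous block is ever needed. */
    for_each_sg(table->sgl, sg, NCHUNKS, i) {
        bufs[i] = kmalloc(CHUNK_SZ, GFP_KERNEL);
        if (!bufs[i])
            return -ENOMEM;   /* real code would free what it already got */
        sg_set_buf(sg, bufs[i], CHUNK_SZ);
    }
    return 0;   /* table is now ready for dma_map_sg() */
}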

Related

Error using zeros Out of memory

When I try running
Adj = zeros(x*y);
I am receiving the following error:
Error using zeros
Out of memory. Type HELP MEMORY for your options.
where x*y = 37901. Checking my PC's storage, I know the C drive doesn't have much free space, but 34.2 GB should be more than enough for creating a 37901*37901 matrix.
When I run the memory command, this is what I get:
>> memory
Maximum possible array: 4825 MB (5.059e+09 bytes) *
Memory available for all arrays: 4825 MB (5.059e+09 bytes) *
Memory used by MATLAB: 12369 MB (1.297e+10 bytes)
Physical Memory (RAM): 12218 MB (1.281e+10 bytes)
* Limited by System Memory (physical + swap file) available.
How can I solve this issue? (I am using MATLAB 2017b)
Coding-wise, variables are normally stored in memory (your computer's RAM) rather than on the hard disk. That's what your error is complaining about: you don't have enough memory to store the variable you want to allocate.
The default numeric type in MATLAB is double, which represents a double-precision floating-point value and takes up 8 bytes of memory. Hence, you are trying to allocate:
37901 * 37901 * 8 = 11491886408 bytes
~= 10.7 gigabytes
But you only have something like 11.9 gigabytes of physical memory, and MATLAB is telling you that it can't allocate an array larger than roughly 4.7 gigabytes. As a workaround, I suggest you take a look at Tall Arrays, a MATLAB feature tailored to handling very big data flows:
Tall arrays are used to work with out-of-memory data that is backed by
a datastore. Datastores enable you to work with large data sets in
small chunks that individually fit in memory, instead of loading the
entire data set into memory at once. Tall arrays extend this
capability to enable you to work with out-of-memory data using common
functions.
What is a Tall Array?
Since the data is not loaded into memory all at once, tall arrays can be arbitrarily large in the first dimension
(that is, they can have any number of rows). Instead of writing
special code that takes into account the huge size of the data, such
as with techniques like MapReduce, tall arrays let you work with large
data sets in an intuitive manner that is similar to the way you would
work with in-memory MATLAB® arrays. Many core operators and functions
work the same with tall arrays as they do with in-memory arrays.
MATLAB works with small chunks of the data at a time, handling all of
the data chunking and processing in the background, so that common
expressions, such as A+B, work with big data sets.
Benefits of Tall Arrays
Unlike in-memory arrays, tall arrays typically remain unevaluated until you request that the calculations
be performed using the gather function. This deferred evaluation
allows you to work quickly with large data sets. When you eventually
request output using gather, MATLAB combines the queued calculations
where possible and takes the minimum number of passes through the
data. The number of passes through the data greatly affects execution
time, so it is recommended that you request output only when
necessary.

Difference between paging and segmentation

I am trying to understand both paradigms of memory management; however, I fail to see the big picture and the difference between the two. Paging consists of bringing fixed-size pages from secondary storage into primary storage in order to do some task requested by a process. Segmentation consists of assigning each unit in a process its own address space, so the units are allowed to grow. I don't quite see how they are related, and that's because there are still a lot of holes in my understanding. Can someone fill them in?
I think you have something confused. One problem is that the term "segment" has multiple meanings.
Segmentation is a method of memory management. Memory is managed in segments that are of variable or fixed length, depending upon the processor. Segments originated on 16-bit processors as a means to access more than 64K of memory.
On the PDP-11, programmers used segments to map different memory into the 64K address space. At any given time a process could only access 64K of memory but the memory that made up that 64K could change.
The 8086 and its successors used segments with base registers. Each segment could address 64K (a limit that grew with later processors), and a process could have 4 segments (more in later processors).
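For instance, real-mode 8086 addressing forms a 20-bit physical address from a 16-bit segment register and a 16-bit offset. A small sketch in C (purely illustrative):

#include <stdint.h>
#include <stdio.h>

/* Real-mode 8086: physical = segment * 16 + offset (20-bit result). */
static uint32_t real_mode_addr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

int main(void)
{
    /* Many seg:off pairs alias the same physical address. */
    printf("0x%05X\n", (unsigned)real_mode_addr(0xB800, 0x0000)); /* 0xB8000 */
    printf("0x%05X\n", (unsigned)real_mode_addr(0xB000, 0x8000)); /* 0xB8000 */
    return 0;
}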
Paging allows a process to have a larger address space than there is physical memory available.
The 8086's successors used the kludge of paging on top of segments. However, that bit of ugliness has finally gone away in 64-bit mode.
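To see how the combined scheme fits together, here is a simplified sketch (a single-level page table stands in for the real multi-level x86 structures): segmentation first produces a linear address, and paging then maps it to a physical one.

#include <stdint.h>

#define PAGE_SIZE 4096u

/* frame[vpn] = physical frame number for virtual page vpn (toy table). */
static uint32_t frame[1u << 20];

static uint32_t translate(uint32_t seg_base, uint32_t offset)
{
    uint32_t linear = seg_base + offset;       /* segmentation step */
    uint32_t vpn    = linear / PAGE_SIZE;      /* paging step: page number */
    uint32_t within = linear % PAGE_SIZE;      /*   ...and offset in page */
    return frame[vpn] * PAGE_SIZE + within;
}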
You got your answer right there: paging deals with fixed-size pages, while segmentation deals with variable-size logical units (code, data, stack, and so on). When the two are combined, each segment's memory is itself divided into pages, not the other way around.

Why do you need blocks when you have sectors and why is the block size a multiple of sector size?

Reading about disk structure, I came across the statement that the block size is a multiple of the sector size. My first thought is: why do you even need blocks when you have sectors? And secondly, why is the block size a multiple of the sector size, like 1, 2, or 4 sectors?
Why can't it be half a sector? What's the rationale here? This is not for homework.
A block is a filesystem abstraction: all filesystem operations are performed in multiples of blocks. In other words, the smallest logically addressable unit for the filesystem is the block, not the sector.
The smallest addressable unit on a block device is the sector. The sector size is a physical property of the block device and is the fundamental unit of all block devices.
Most block devices have 512-byte sectors (although other sizes are common; for example, some CD-ROM discs have 2-kilobyte sectors), while block sizes are commonly 512 bytes, 1 KB, or 4 KB. Since the hardware can only transfer whole sectors, a block can never be smaller than a sector, which is why the block size must be a whole multiple of the sector size.
Early in the computing industry, the term "block" was loosely used to refer to a small chunk of data. Later the term referring to the data area was replaced by sector, and block became associated with the data packets that are passed in various sizes by different types of data streams.
read more here: http://en.wikipedia.org/wiki/Disk_sector
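As a concrete (Linux-specific, illustrative) way to see both numbers, you can ask a device for its logical sector size with the BLKSSZGET ioctl and check the preferred I/O block size via stat(2); the device path here is just an example and needs read permission:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sda";
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int sector = 0;
    if (ioctl(fd, BLKSSZGET, &sector) == 0)
        printf("logical sector size: %d bytes\n", sector);

    struct stat st;
    if (fstat(fd, &st) == 0)
        printf("preferred I/O block size: %ld bytes\n", (long)st.st_blksize);

    close(fd);
    return 0;
}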

Dynamic allocation in kernel space

I have been trying to allocate space using malloc in kernel space for a driver I am working on (using malloc is a constraint here; I am not allowed to allocate space in any other manner), but if I try to allocate "too many" elements (~500 instances of a very small struct), only a fraction of the space I require is actually allocated.
Reducing the number of allocated elements worked for me with no problems. Does dynamic allocation in kernel space have limits that might be causing the behavior I am seeing?
malloc is a user-space library function; you cannot use it in kernel space. The function that allocates memory in kernel space is kmalloc().
You can also use vmalloc(). I suggest you read the thread What is the difference between vmalloc and kmalloc? for some clarification on vmalloc() and kmalloc().
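A tiny sketch of the practical difference (alloc_big is a made-up name; in recent kernels, kvmalloc()/kvfree() wrap exactly this fallback pattern):

#include <linux/slab.h>
#include <linux/vmalloc.h>

/* kmalloc() needs physically contiguous pages, so large requests can fail
 * under fragmentation; vmalloc() only needs virtually contiguous space. */
static void *alloc_big(size_t size)
{
    void *p = kmalloc(size, GFP_KERNEL);   /* fast path, DMA-friendly */
    if (!p)
        p = vmalloc(size);                 /* slower, but fragmentation-proof */
    return p;
}
/* Caller must then free with kfree() or vfree() accordingly. */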
I also suggest you search SO before asking questions, because someone has already asked this here.

How to efficiently do scattered summing with SSE/x86

I've been tasked with writing a program that does streaming sums of vectors into scattered memory locations, at the absolute maximum speed possible. The input data is a destination ID and an XYZ float vector, so something like:
[198, {0.4,0,1}], [775, {0.25,0.8,0}], [12, {0.5,0.5,0.02}]
and I need to sum them into memory like so:
memory[198] += {0.4,0,1}
memory[775] += {0.25,0.8,0}
memory[12] += {0.5,0.5,0.02}
To complicate matters, there will be multiple threads doing this at the same time, reading from different input streams but summing to the same memory. I don't anticipate there being a lot of contention for the same memory locations, but there will be some. The data sets will be pretty large - multiple streams of 10+ GB apiece that we'll be streaming simultaneously from multiple SSDs to get the highest possible read bandwidth. I'm assuming SSE for the math, although it certainly doesn't have to be that way.
The results won't be used for a while, so I don't need to pollute the cache... but I'm summing into memory, not just writing, so I can't use something like MOVNTPS, right? But since the threads won't be stepping on each other that much, how can I do this without a lot of locking overhead? Would you do this with memory fencing?
Thanks for any help. I can assume Nehalem and above, if that makes a difference.
You can use spin locks for synchronized access to array elements (one lock per ID) and SSE for the summing. In C++, depending on the compiler, intrinsic functions may be available, e.g. Streaming SIMD Extensions and InterlockedExchange in Visual C++.
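A rough sketch of that idea in C (C11 atomics plus SSE intrinsics; the striped lock table - rather than literally one lock per ID - and all sizes and names here are illustrative assumptions):

#include <immintrin.h>
#include <stdatomic.h>
#include <stdint.h>

#define NIDS   1000000
#define NLOCKS 4096                        /* striped: locks[id % NLOCKS] */

static atomic_int locks[NLOCKS];           /* 0 = free, 1 = held */
static float dest[NIDS][4] __attribute__((aligned(16)));  /* xyz + padding */

static void scatter_add(uint32_t id, const float *xyz)  /* xyz: 4 aligned floats */
{
    atomic_int *lk = &locks[id % NLOCKS];

    /* Spin until we own this element's lock. */
    while (atomic_exchange_explicit(lk, 1, memory_order_acquire))
        _mm_pause();                       /* be polite to the sibling hyperthread */

    __m128 sum = _mm_add_ps(_mm_load_ps(dest[id]), _mm_load_ps(xyz));
    _mm_store_ps(dest[id], sum);

    atomic_store_explicit(lk, 0, memory_order_release);
}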
Your program's performance will be limited by memory bandwidth. Don't expect significant speed improvement from multithreading unless you have a multi-CPU (not just multi-core) system.
Start one thread per CPU. Statically distribute the destination data between these threads, and provide each thread with the same input data. This makes better use of a NUMA architecture and avoids extra memory traffic for thread synchronization.
In the case of a single-CPU system, use only one thread to access the destination data.
Probably the only practical use for more cores is to load input data with additional threads.
One obvious optimization is to align the destination data on 16-byte boundaries (to avoid touching two cache lines when accessing a single data element).
You can use SIMD to perform the addition, or allow the compiler to vectorize your code automatically, or just leave this operation completely unoptimized - it doesn't matter; it's nothing compared to the memory bandwidth problem.
As for polluting the cache with output data: MOVNTPS cannot help here, but you can use PREFETCHNTA to prefetch output data elements several steps ahead while minimizing cache pollution. Whether it will improve performance or degrade it, I don't know. It avoids cache thrashing, but leaves most of the cache unused.
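For what it's worth, a sketch of the PREFETCHNTA idea through the _mm_prefetch intrinsic; the look-ahead distance AHEAD is a guess that would need measuring, and the locking from the sketch above is omitted here for clarity:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define AHEAD 8   /* iterations to prefetch ahead; tune empirically */

void sum_stream(const uint32_t *ids, const float (*vecs)[4],
                float (*dest)[4], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Non-temporal hint: fetch the future destination without
         * displacing much of the existing cache contents. */
        if (i + AHEAD < n)
            _mm_prefetch((const char *)dest[ids[i + AHEAD]], _MM_HINT_NTA);

        __m128 sum = _mm_add_ps(_mm_loadu_ps(dest[ids[i]]),
                                _mm_loadu_ps(vecs[i]));
        _mm_storeu_ps(dest[ids[i]], sum);
    }
}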