What is the complexity of external merge sort? - merge

On Wikipedia the time complexity of external sort is given as
O( (N/B) · log_{M/B} (N/B) )
where N is the total size of the data, M is the memory size, and B is the number of chunks in the memory. I can understand the log part, since we sort each chunk in RAM; however, I could not understand why the base of the log is M/B.
Any help would be appreciated!

After the sorting phase, the merge phase processes m runs in parallel, therefore you get the base m = M/B.
Source: wikipedia.org/wiki/External_memory_algorithm

The confusion is due to these parts of the question:
"M is memory size" and "B is the number of chunks in the memory."
In the wiki article, B is the block size per chunk, so the number of chunks in the memory = M/B. The wiki time complexity is ignoring the fact that one of the chunks is used for merged output, and that the algorithm uses a k-way merge where k = (M/B)-1.
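To make the merge phase concrete, here is a minimal k-way merge sketch in Python (illustrative only; the in-memory runs, the heapq-based merge, and the function name are my own assumptions, not something from the Wikipedia article):

import heapq

def kway_merge(runs, output):
    # Merge k sorted runs into a single sorted output stream.
    # With memory M and block size B, k is at most (M/B) - 1: one block is
    # buffered per input run and one block is reserved for the merged output.
    iters = [iter(r) for r in runs]
    heap = []
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, idx))
    while heap:
        value, idx = heapq.heappop(heap)
        output.append(value)
        nxt = next(iters[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))
    return output

# Example: merging 3 sorted runs
print(kway_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]], []))   # [1, 2, 3, ..., 9]

Each merge pass reduces the number of runs by a factor of about M/B, which is where the log base M/B in the bound comes from.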

yolov4.cfg: increasing subdivisions parameter consequences

I'm trying to train a custom dataset using the Darknet framework and YOLOv4. I built my own dataset, but I get an "Out of memory" message in Google Colab. It also said "try to change subdivisions to 64" or something like that.
I've searched around for the meaning of the main .cfg parameters such as batch, subdivisions, etc., and I understand that increasing the subdivisions number means splitting into smaller "pictures" before processing, thus avoiding the fatal "CUDA out of memory". And indeed switching to 64 worked well. Now I couldn't find the answer to the ultimate question anywhere: are the final weight file and accuracy "crippled" by doing this? More specifically, what are the consequences for the final result? If we put aside the training time (which would surely increase since there are more subdivisions to train), what will the accuracy be?
In other words: if we use exactly the same dataset and train using 8 subdivisions, then do the same using 64 subdivisions, will the best_weight file be the same? And will the object detection success % be the same or worse?
Thank you.
First, read the comments on the question.
Suppose you have 100 batches, with
batch size = 64
subdivisions = 8
Darknet will divide each batch into 64/8 = 8 mini-batches of 8 images each.
It then loads and works on these 8 parts one by one, so that a machine with low RAM capacity can still train; you can change the parameter according to your RAM capacity.
You can also reduce the batch size, so it takes less space in RAM.
It does nothing to the dataset's images.
It just splits a large batch that can't be loaded into RAM at once into smaller pieces.
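This is not Darknet's actual code, just a rough Python sketch of the general idea (splitting one batch into mini-batches and accumulating gradients before a single weight update); the model API used here is hypothetical:

def train_one_batch(model, images, labels, subdivisions, lr):
    # Process one batch (e.g. 64 images) in `subdivisions` smaller pieces,
    # so only batch/subdivisions images need to be in memory at once.
    mini = len(images) // subdivisions            # e.g. 64 // 8 = 8 images
    accumulated = None
    for s in range(subdivisions):
        imgs = images[s * mini:(s + 1) * mini]    # only `mini` images loaded
        lbls = labels[s * mini:(s + 1) * mini]
        grads = model.backward(model.forward(imgs), lbls)   # hypothetical API
        accumulated = grads if accumulated is None else [
            a + g for a, g in zip(accumulated, grads)]
    model.update(accumulated, lr)                 # one update per full batch

If the framework accumulates gradients over the subdivisions like this (to my understanding, Darknet does), the weight update is still based on the full batch, so raising subdivisions mainly trades speed and memory rather than accuracy; small numerical differences (e.g. from batch-norm statistics computed per mini-batch) are still possible.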

How to calculate the best numberOfPartitions for coalesce?

So, I understand that in general one should use coalesce() when:
the number of partitions decreases due to a filter or some other operation that may result in reducing the original dataset (RDD, DF). coalesce() is useful for running operations more efficiently after filtering down a large dataset.
I also understand that it is less expensive than repartition as it reduces shuffling by moving data only if necessary. My problem is how to define the parameter that coalesce takes (idealPartionionNo). I am working on a project which was passed to me from another engineer and he was using the below calculation to compute the value of that parameter.
// DEFINE OPTIMAL PARTITION NUMBER
implicit val NO_OF_EXECUTOR_INSTANCES = sc.getConf.getInt("spark.executor.instances", 5)
implicit val NO_OF_EXECUTOR_CORES = sc.getConf.getInt("spark.executor.cores", 2)
val idealPartionionNo = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR
This is then used with a partitioner object:
val partitioner = new HashPartitioner(idealPartionionNo)
but also used with:
RDD.filter(x=>x._3<30).coalesce(idealPartionionNo)
Is this the right approach? What is the main idea behind the idealPartionionNo value computation? What is the REPARTITION_FACTOR? How do I generally work to define that?
Also, since YARN is responsible for identifying the available executors on the fly, is there a way of getting that number (AVAILABLE_EXECUTOR_INSTANCES) on the fly and using it to compute idealPartionionNo (i.e. replace NO_OF_EXECUTOR_INSTANCES with AVAILABLE_EXECUTOR_INSTANCES)?
Ideally, some actual examples of the form:
Here's a dataset (size);
Here's a number of transformations and possible reuses of an RDD/DF.
Here is where you should repartition/coalesce.
Assume you have n executors with m cores and a partition factor equal to k
then:
The ideal number of partitions would be ==> ???
Also, if you can refer me to a nice blog that explains these I would really appreciate it.
In practice, the optimal number of partitions depends more on the data you have, the transformations you use, and the overall configuration than on the available resources.
If the number of partitions is too low you'll experience long GC pauses, different types of memory issues, and lastly suboptimal resource utilization.
If the number of partitions is too high then maintenance cost can easily exceed processing cost. Moreover, if you use non-distributed reducing operations (like reduce in contrast to treeReduce), a large number of partitions results in a higher load on the driver.
You can find a number of rules which suggest oversubscribing partitions compared to the number of cores (factor 2 or 3 seems to be common) or keeping partitions at a certain size but this doesn't take into account your own code:
If you allocate a lot you can expect long GC pauses and it is probably better to go with smaller partitions.
If a certain piece of code is expensive then your shuffle cost can be amortized by a higher concurrency.
If you have a filter you can adjust the number of partitions based on a discriminative power of the predicate (you make different decisions if you expect to retain 5% of the data and 99% of the data).
In my opinion:
With one-off jobs, keep a higher number of partitions to stay on the safe side (slower is better than failing).
With reusable jobs start with conservative configuration then execute - monitor - adjust configuration - repeat.
Don't try to use fixed number of partitions based on the number of executors or cores. First understand your data and code, then adjust configuration to reflect your understanding.
Usually, it is relatively easy to determine the amount of raw data per partition for which your cluster exhibits stable behavior (in my experience it is somewhere in the range of few hundred megabytes, depending on the format, data structure you use to load data, and configuration). This is the "magic number" you're looking for.
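As an illustration of using such a magic number (the 200 MB target below is a made-up value, not a recommendation), the arithmetic is simply:

import math

def partitions_for(total_size_mb, target_partition_mb=200):
    # Partition count from the raw data size and an empirically found
    # per-partition "magic number" for your cluster.
    return max(1, math.ceil(total_size_mb / target_partition_mb))

# e.g. ~50 GB of raw input, cluster behaves well around 200 MB per partition
print(partitions_for(50 * 1024))   # -> 256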
Some things you have to remember in general:
Number of partitions doesn't necessarily reflect data distribution. Any operation that requires a shuffle (*byKey, join, RDD.partitionBy, Dataset.repartition) can result in non-uniform data distribution. Always monitor your jobs for symptoms of a significant data skew.
Number of partitions in general is not constant. Any operation with multiple dependencies (union, coGroup, join) can affect the number of partitions.
Your question is a valid one, but Spark partitioning optimization depends entirely on the computation you're running. You need to have a good reason to repartition/coalesce; if you're just counting an RDD (even if it has a huge number of sparsely populated partitions), then any repartition/coalesce step is just going to slow you down.
Repartition vs coalesce
The difference between repartition(n) (which is the same as coalesce(n, shuffle = true)) and coalesce(n, shuffle = false) has to do with the execution model. The shuffle model takes each partition in the original RDD, randomly sends its data around to all executors, and results in an RDD with the new (smaller or greater) number of partitions. The no-shuffle model creates a new RDD which loads multiple partitions as one task.
Let's consider this computation:
sc.textFile("massive_file.txt")
.filter(sparseFilterFunction) // leaves only 0.1% of the lines
.coalesce(numPartitions, shuffle = shuffle)
If shuffle is true, then the text file / filter computations happen in a number of tasks given by the defaults in textFile, and the tiny filtered results are shuffled. If shuffle is false, then the number of total tasks is at most numPartitions.
If numPartitions is 1, then the difference is quite stark. The shuffle model will process and filter the data in parallel, then send the 0.1% of filtered results to one executor for downstream DAG operations. The no-shuffle model will process and filter the data all on one core from the beginning.
Steps to take
Consider your downstream operations. If you're just using this dataset once, then you probably don't need to repartition at all. If you are saving the filtered RDD for later use (to disk, for example), then consider the tradeoffs above. It takes experience to become familiar with these models and when one performs better, so try both out and see how they perform!
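If you want to try both variants out, a rough PySpark sketch might look like the following (the file path, filter predicate and partition count are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="coalesce-demo")

lines = sc.textFile("massive_file.txt")                    # placeholder path
kept = lines.filter(lambda line: "rare_token" in line)     # sparse filter

# No-shuffle coalesce: upstream read/filter work is squeezed into at most 8 tasks
narrow = kept.coalesce(8)

# Shuffle variant (what repartition does): read/filter stay fully parallel,
# then the small filtered result is shuffled into 8 partitions
wide = kept.coalesce(8, shuffle=True)

print(narrow.getNumPartitions(), wide.getNumPartitions())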
As others have answered, there is no formula which calculates what you ask for. That said, you can make an educated guess on the first part and then fine-tune it over time.
The first step is to make sure you have enough partitions. If you have NO_OF_EXECUTOR_INSTANCES executors and NO_OF_EXECUTOR_CORES cores per executor then you can process NO_OF_EXECUTOR_INSTANCES*NO_OF_EXECUTOR_CORES partitions at the same time (each would go to a specific core of a specific instance).
That said, this assumes everything is divided equally between the cores and everything takes exactly the same time to process. This is rarely the case. There is a good chance that some of them would be finished before others, either because of locality (e.g. the data needs to come from a different node) or simply because they are not balanced (e.g. if you have data partitioned by root domain then partitions including google would probably be quite big). This is where the REPARTITION_FACTOR comes into play. The idea is that we "overbook" each core, and therefore if one finishes very quickly and one finishes slowly we have the option of dividing the tasks between them. A factor of 2-3 is generally a good idea.
Now let's take a look at the size of a single partition. Let's say your entire data is X MB in size and you have N partitions. Each partition would be on average X/N MB. If N is large relative to X then you might have a very small average partition size (e.g. a few KB). In this case it is usually a good idea to lower N because the overhead of managing each partition becomes too high. On the other hand, if the size is very large (e.g. a few GB) then you need to hold a lot of data at the same time, which would cause issues such as garbage collection, high memory usage, etc.
The optimal size is a good question, but generally people seem to prefer partitions of 100-1000 MB, though in truth tens of MB would probably also be fine.
Another thing you should note is how your partitions change as you do the calculation. For example, let's say you start with 1000 partitions of 100 MB each, but then you filter the data so that each partition shrinks to 1 KB; at that point you should probably coalesce. Similar issues can happen when you do a groupBy or join. In such cases both the size of the partitions and the number of partitions change and might reach an undesirable size.
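Here is that heuristic in code form (purely illustrative; the function name and the size thresholds are my own assumptions):

def ideal_partition_no(executor_instances, executor_cores,
                       repartition_factor, data_size_mb,
                       min_partition_mb=32, max_partition_mb=1024):
    # Start from cores * factor (the "overbooking" idea above), then
    # sanity-check the resulting average partition size.
    n = executor_instances * executor_cores * repartition_factor
    avg_mb = data_size_mb / n
    if avg_mb < min_partition_mb:       # partitions too tiny -> management overhead
        n = max(1, data_size_mb // min_partition_mb)
    elif avg_mb > max_partition_mb:     # partitions too big -> GC / memory pressure
        n = -(-data_size_mb // max_partition_mb)    # ceiling division
    return int(n)

# 5 executors x 2 cores, factor 3, ~100 GB of input data
print(ideal_partition_no(5, 2, 3, 100 * 1024))   # -> 100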

When writing a large array directly to disk in MATLAB, is there any need to preallocate?

I need to write an array that is too large to fit into memory to a .mat binary file. This can be accomplished with the matfile function, which allows random access to a .mat file on disk.
Normally, the accepted advice is to preallocate arrays, because expanding them on every iteration of a loop is slow. However, when I was asking how to do this, it occurred to me that this may not be good advice when writing to disk rather than RAM.
Will the same performance hit from growing the array apply, and if so, will it be significant when compared to the time it takes to write to disk anyway?
(Assume that the whole file will be written in one session, so the risk of serious file fragmentation is low.)
Q: Will the same performance hit from growing the array apply, and if so will it be significant when compared to the time it takes to write to disk anyway?
A: Yes, performance will suffer if you significantly grow a file on disk without pre-allocating. The performance hit will be a consequence of fragmentation. As you mentioned, fragmentation is less of a risk if the file is written in one session, but will cause problems if the file grows significantly.
A related question was raised on the MathWorks website, and the accepted answer was to pre-allocate when possible.
If you don't pre-allocate, then the extent of your performance problems will depend on:
your filesystem (how data are stored on disk, the cluster-size),
your hardware (HDD seek time, or SSD access times),
the size of your mat file (whether it moves into non-contiguous space),
and the current state of your storage (existing fragmentation / free space).
Let's pretend that you're running a recent Windows OS, and so are using the NTFS file-system. Let's further assume that it has been set up with the default 4 kB cluster size. So, space on disk gets allocated in 4 kB chunks and the locations of these are indexed to the Master File Table. If the file grows and contiguous space is not available then there are only two choices:
Re-write the entire file to a new part of the disk, where there is sufficient free space.
Fragment the file, storing the additional data at a different physical location on disk.
The file system chooses to do the least-bad option, #2, and updates the MFT record to indicate where the new clusters will be on disk.
Now, the hard disk needs to physically move the read head in order to read or write the new clusters, and this is a (relatively) slow process. In terms of moving the head, and waiting for the right area of disk to spin underneath it ... you're likely to be looking at a seek time of about 10ms. So for every time you hit a fragment, there will be an additional 10ms delay whilst the HDD moves to access the new data. SSDs have much shorter seek times (no moving parts). For the sake of simplicity, we're ignoring multi-platter systems and RAID arrays!
If you keep growing the file at different times, then you may experience a lot of fragmentation. This really depends on when / how much the file is growing by, and how else you are using the hard disk. The performance hit that you experience will also depend on how often you are reading the file, and how frequently you encounter the fragments.
MATLAB stores data in Column-major order, and from the comments it seems that you're interested in performing column-wise operations (sums, averages) on the dataset. If the columns become non-contiguous on disk then you're going to hit lots of fragments on every operation!
As mentioned in the comments, both read and write actions will be performed via a buffer. As @user3666197 points out, the OS can speculatively read ahead of the current data on disk, on the basis that you're likely to want that data next. This behaviour is especially useful if the hard disk would be sitting idle at times - keeping it operating at maximum capacity and working with small parts of the data in buffer memory can greatly improve read and write performance. However, from your question it sounds as though you want to perform large operations on a huge (too big for memory) .mat file. Given your use-case, the hard disk is going to be working at capacity anyway, and the data file is too big to fit in the buffer - so these particular tricks won't solve your problem.
So... Yes, you should pre-allocate. Yes, a performance hit from growing the array on disk will apply. Yes, it will probably be significant (it depends on specifics like the amount of growth, fragmentation, etc). And if you're going to really get into the HPC spirit of things then stop what you're doing, throw away MATLAB, shard your data and try something like Apache Spark! But that's another story.
Does that answer your question?
P.S. Corrections / amendments welcome! I was brought up on POSIX inodes, so sincere apologies if there are any inaccuracies in here...
Preallocating a variable in RAM and preallocating on the disk don't solve the same problem.
In RAM
To expand a matrix in RAM, MATLAB creates a new matrix with the new size, copies the values of the old matrix into the new one, and deletes the old one. This costs a lot of performance.
If you preallocate the matrix, its size does not change, so there is no reason for MATLAB to do this matrix copying.
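The same effect can be demonstrated in any array language. A small Python/NumPy analogy (not MATLAB code; the array length is arbitrary) of growing versus preallocating in RAM:

import time
import numpy as np

n = 20000

# Growing: every append allocates a new array and copies the old contents
t0 = time.time()
grown = np.empty(0)
for i in range(n):
    grown = np.append(grown, i)     # O(current length) copy on every iteration
t_grow = time.time() - t0

# Preallocated: the array is created once and only filled in
t0 = time.time()
pre = np.empty(n)
for i in range(n):
    pre[i] = i
t_pre = time.time() - t0

print(f"growing: {t_grow:.3f}s, preallocated: {t_pre:.3f}s")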
On the hard-disk
The problem on the hard-disk is fragmentation as GnomeDePlume said. Fragmentation will still be a problem, even if the file is written in one session.
Here is why: The hard disk will generally be a little fragmented. Imagine
# to be memory blocks on the hard disk that are full
M to be memory blocks on the hard disk that will be used to save data of your matrix
- to be free memory blocks on the hard disk
Now the hard disk could look like this before you write the matrix onto it:
###--##----#--#---#--------------------##-#---------#---#----#------
When you write parts of the matrix (e.g. MMM blocks) you could imagine the process to look like this (I give an example where the file system will just go from left to right and use the first free space that is big enough - real file systems are different):
First matrix part:
###--##MMM-#--#---#--------------------##-#---------#---#----#------
Second matrix part:
###--##MMM-#--#MMM#--------------------##-#---------#---#----#------
Third matrix part:
###--##MMM-#--#MMM#MMM-----------------##-#---------#---#----#------
And so on ...
Clearly the matrix file on the hard disk is fragmented although we wrote it without doing anything else in the meantime.
This could be avoided if the matrix file were preallocated. In other words, we tell the file system how big our file will be, or in this example, how many memory blocks we want to reserve for it.
Imagine the matrix needs 12 blocks: MMMMMMMMMMMM. We tell the file system that we need this much by preallocating, and it will try to accommodate our needs as best as it can. In this example we are lucky: there is free space with >= 12 memory blocks.
Preallocating (We need 12 memory blocks):
###--##----#--#---# (------------) --------##-#---------#---#----#------
The file system reserves the space between the parentheses for our matrix and will write into there.
First matrix part:
###--##----#--#---# (MMM---------) --------##-#---------#---#----#------
Second matrix part:
###--##----#--#---# (MMMMMM------) --------##-#---------#---#----#------
Third matrix part:
###--##----#--#---# (MMMMMMMMM---) --------##-#---------#---#----#------
Fourth and last part of the matrix:
###--##----#--#---# (MMMMMMMMMMMM) --------##-#---------#---#----#------
Voilà, no fragmentation!
Analogy
Generally you could imagine this process as buying cinema tickets for a large group. You would like to stick together as a group, but there are already some seats in the theatre reserved by other people. For the cashier to be able to accommodate your request (a large group that wants to stick together), he/she needs to know how big your group is (preallocating).
A quick answer to the whole discussion (in case you do not have the time to follow or the technical understanding):
Pre-allocation in Matlab is relevant for operations in RAM. Matlab does not give low-level access to I/O operations and thus we cannot talk about pre-allocating something on disk.
When writing a large amount of data to disk, it has been observed that the fewer the writes, the faster the task executes and the smaller the fragmentation on disk.
Thus, if you cannot write in one go, split the writes in big chunks.
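As an illustration of "reserve the space up front, then write in big chunks", here is a Python analogy using a memory-mapped file rather than MATLAB's matfile (dimensions, chunk size and file name are made up):

import numpy as np

rows, cols = 100_000, 100
out = np.memmap("big_array.dat", dtype="float64", mode="w+",
                shape=(rows, cols))   # the full file size is reserved on disk up front

chunk = 10_000                        # write in large row blocks, not row by row
for start in range(0, rows, chunk):
    stop = min(start + chunk, rows)
    out[start:stop, :] = np.random.rand(stop - start, cols)

out.flush()                           # make sure everything has reached the disk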
Prologue
This answer is based on the original post and on the clarifications (both) provided by the author during the past week.
The question of the adverse performance hit introduced by low-level, physical-media-dependent "fragmentation" (caused by both the file-system and the file-access layers) is confronted, both in its time-domain magnitude and in how often it is repeated during computation, with the real problems of such an approach.
Finally, a state-of-the-art, in principle fastest-possible solution to the given task is proposed, so as to minimise the damage from both wasted effort and mis-interpretation errors stemming from idealised or otherwise invalid assumptions - such as the assumption that the risk of "serious file fragmentation is low" because the whole file will be written in one session (which is simply not guaranteed during the many multi-core / multi-process operations of a contemporary O/S acting in real time over the creation and repeated extensive modification (ref. the MATLAB size limits) of TB-sized BLOB file-objects inside contemporary COTS file systems).
One may hate the facts, however the facts remain true out there until a faster & better method moves in
First, before considering performance, realise the gaps in the concept
The real performance adverse hit is not caused by HDD-IO or related to the file fragmentation
RAM is not an alternative for the semi-permanent storage of the .mat file
Additional operating-system limits and interventions, plus additional driver- and hardware-based abstractions, were left out of the assumptions about unavoidable overheads
The said computational scheme was omitted from the review of what will have the biggest impact / influence on the resulting performance
Given:
The whole processing is intended to be run just once, no optimisation / iterations, no continuous processing
Data have 1E6 double Float-values x 1E5 columns = about 0.8 TB (+HDF5 overhead)
In spite of original post, there is no random IO associated with the processing
Data acquisition phase communicates with a .NET to receive DataELEMENTs into MATLAB
That means, since v7.4,
a 1.6 GB limit on MATLAB WorkSpace in a 32bit Win ( 2.7 GB with a 3GB switch )
a 1.1 GB limit on MATLAB biggest Matrix in wXP / 1.4 GB wV / 1.5 GB
a bit "released" 2.6 GB limit on MATLAB WorkSpace + 2.3 GB limit on a biggest Matrix in a 32bit Linux O/S.
Having a 64bit O/S will not help any kind of a 32bit MATLAB 7.4 implementation and will fail to work due to another limit, the maximum number of cells in array, which will not cover the 1E12 requested here.
The only chance is to have both
both a 64bit O/S ( wXP, Linux, Solaris )
and a 64bit MATLAB 7.5+
MathWorks' source for R2007a cited above, for newer MATLAB R2013a you need a User Account there
Data storage phase assumes block-writes of a row-ordered data blocks ( a collection of row-ordered data blocks ) into a MAT-file on an HDD-device
Data processing phase assumes to re-process the data in a MAT-file on an HDD-device, after all inputs have been acquired and marshalled to a file-based off-RAM-storage, but in a column-ordered manner
just column-wise mean()-s / max()-es are needed to calculate ( nothing more complex )
Facts:
MATLAB uses a "restricted" implementation of an HDF5 file-structure for binary files.
Review performance measurements on real-data & real-hardware ( HDD + SSD ) to get feeling of scales of the un-avoidable weaknesses thereof
The Hierarchical Data Format (HDF) was born in 1987 at the National Center for Supercomputing Applications (NCSA), some 20 years ago. Yes, that old. The goal was to develop a file format that combines flexibility and efficiency to deal with extremely large datasets. Somehow the HDF file did not make it into the mainstream, as just a few industries were indeed able to really make use of its terrifying capacities or simply did not need them.
FLEXIBILITY means that the file structure bears some overhead that you do not need if the content of the array is not changing (you pay the cost without consuming any benefit of using it), and the assumption that HDF5's limits on the overall size of the data it can contain somehow help and save the MATLAB side of the problem is not correct.
MAT-files are good in principle, as they avoid an otherwise persistent need to load a whole file into RAM to be able to work with it.
Nevertheless, MAT-files do not serve well the simple task as it was defined and clarified here. An attempt to do so will result in just poor performance, and HDD-IO file fragmentation (adding a few tens of milliseconds during write-throughs and somewhat less than that on read-aheads during the calculations) will not help at all in judging the core reason for the overall poor performance.
A professional solution approach
Rather than moving the whole gigantic set of 1E12 DataELEMENTs into a MATLAB in-memory proxy data array that is just scheduled for a next coming sequenced stream of HDF5 / MAT-file HDD-device IOs (write-throughs and O/S-vs-hardware-device-chain conflicting / sub-optimised read-aheads), just so as to have all that immense work "ready" for a few trivially simple calls of the mean() / max() MATLAB functions (which will do their best to revamp each of the 1E12 DataELEMENTs in just another order, and even TWICE -- yes -- another circus right after the first job-processing nightmare has gone all the way down through all the HDD-IO bottlenecks, back into MATLAB in-RAM objects), redesign this very step into a pipelined BigDATA processing from the very beginning.
while true                                     % ref. comment Simon W Oct 1 at 11:29
    [ isStillProcessingDotNET, ...             % a FLAG from .NET reader function
      aDotNET_RowOfVALUEs ...                  % a ROW from .NET reader function
      ] = GetDataFromDotNET( aDtPT );          % .NET reader
    if ( isStillProcessingDotNET )             % Yes, more rows are still to come ...
        aRowCOUNT = aRowCOUNT + 1;             % keep .INC for aRowCOUNT ( mean() )
        for i = 1:size( aDotNET_RowOfVALUEs, 2 )     % stepping across each column
            aValue = aDotNET_RowOfVALUEs(i);
            anIncrementalSumInCOLUMN(i) = ...
            anIncrementalSumInCOLUMN(i) + aValue;    % keep .SUM for each column ( mean() )
            if ( aMaxInCOLUMN(i) < aValue )          % retest for a "max.update()"
                 aMaxInCOLUMN(i) = aValue;           % .STO a just found "new" max
            end
        end
        continue                               % force re-loop
    else
        break
    end
end
%-------------------------------------------------------------------------------------------
% FINALLY:
% all results are pre-calculated right at the end of .NET reading phase:
%
% -------------------------------
% BILL OF ALL COMPUTATIONAL COSTS ( for given scales of 1E5 columns x 1E6 rows ):
% -------------------------------
% HDD.IO: **ZERO**
% IN-RAM STORAGE:
% Attr Name Size Bytes Class
% ==== ==== ==== ===== =====
% aMaxInCOLUMNs 1x100000 800000 double
% anIncrementalSumInCOLUMNs 1x100000 800000 double
% aRowCOUNT 1x1 8 double
%
% DATA PROCESSING:
%
% 1.000.000x .NET row-oriented reads ( same for both the OP and this, smarter BigDATA approach )
% 1x INT in aRowCOUNT, %% 1E6 .INC-s
% 100.000x FLOATs in aMaxInCOLUMN[] %% 1E5 * 1E6 .CMP-s
% 100.000x FLOATs in anIncrementalSumInCOLUMN[] %% 1E5 * 1E6 .ADD-s
% -----------------
% about 15 sec per COLUMN of 1E6 rows
% -----------------
% --> mean()s are anIncrementalSumInCOLUMN./aRowCOUNT
%-------------------------------------------------------------------------------------------
% PIPE-LINE-d processing takes in TimeDOMAIN "nothing" more than the .NET-reader process
%-------------------------------------------------------------------------------------------
Your pipelined BigDATA computation strategy will, in a smart way, principally avoid interim storage buffering in MATLAB, as it progressively calculates the results in not more than about 3 x 1E6 ADD/CMP registers, all with a static layout. It avoids proxy storage into an HDF5 / MAT-file, absolutely avoids all HDD-IO related bottlenecks and the low BigDATA sustained-read speeds (not speaking at all about interim / BigDATA sustained writes...), and also avoids ill-performing memory-mapped use just for counting means and maxes.
Epilogue
The pipeline processing is nothing new under the Sun.
It re-uses what speed-oriented HPC solutions already use for decades
[ generations before BigDATA tag has been "invented" in Marketing Dept's. ]
Forget about zillions of HDD-IO blocking operations & go into a pipelined distributed process-to-process solution.
There is nothing faster than this
If it were, all FX business and HFT Hedge Fund Monsters would already be there...

There are four processes of 1 GB, 1.2 GB, 2 GB and 2 GB, and the RAM available is 2 GB. We have a time-shared system.

Which of the following is the most appropriate scheduling algorithm?
Options being-
a. all processes are loaded sequentially 1 by 1
b. load one process at a time and execute processes in RR fashion
c. load the 1 GB and 1.2 GB processes first, then processes 3 and 4 follow
d. All processes can be loaded together and CPU time shared among them
I came across this question somewhere and I was confused, as the answer could be (D) if we consider virtual memory and otherwise (B). Am I missing something here?
In my opinion, virtual memory should be taken into account here. It's clearly logical. Let me give you the answer by negation.
A.) Clearly not the case, as CPU cycles will be wasted.
B.) If we are loading one process at a time, then it doesn't matter what algorithm we apply afterwards. It's the same as (A).
C.) Taking virtual memory into account, if we can load P1 and P2, then for some smaller page size we can load P3 and P4 at the same time too.
D.) As I stated in (C), for an arbitrarily small page size, we can load all of them simultaneously and schedule them using the Round Robin scheduling algorithm.

Prefetch distance and degree of prefetch

What is the difference between prefetch distance and degree of prefetching?
Prefetching typically deals with entire cache lines. So a given prefetch request will bring in the cache line that would hold the specified address.
Due to the huge differences in memory speeds, it can take many cycles to bring data into the cache; some latencies are in the dozens of cycles, if not longer. Now, the only way to really benefit from a prefetch is to issue it far enough ahead of the actual use of the data so that there's enough time for the machine to pull the data into the cache. This implies that data access must be predictable, so one can anticipate what memory needs to be in the cache. The simplest case is marching through a linear array. Now, a common scenario (in 'scientific code') is a loop that reads the data then processes it. The cache miss penalty may be high and the processor may be very fast, so simply prefetching the next cache line may not be sufficient: we may finish processing the data corresponding to the current cache line and end up waiting for the neighbouring cache line before it has arrived at the cache. So we may have to fetch further away than the next cache line.
How far ahead you prefetch is the distance, e.g. 512 bytes. The degree of prefetching is the distance in terms of cache lines, i.e. if your cache line is 256 bytes, the degree of prefetching is 2.
Prefetching degree is the number of cache lines to prefetch at each trigger.
Prefetching distance comes from prefetching array elements within a loop: D = ceil(l / s), where l is the average memory latency in number of cycles and s is the cycle time of the shortest execution path through one loop iteration. D is the number of iterations ahead at which a given array element should be prefetched, so that the memory latency is covered.
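A quick worked example of that formula (the latency and loop-cycle numbers below are made up):

import math

l = 120                 # assumed average memory latency, in cycles
s = 15                  # assumed cycles for the shortest path through one loop iteration

D = math.ceil(l / s)    # prefetch distance in loop iterations
print(D)                # 8: while working on a[i], prefetch the line holding a[i + 8]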