Can RoCC read a large chunk of data from dcache at once? How about write?

I am new to Rocket Chip and want to design a coprocessor to accelerate data processing. My question is how to exchange a large chunk of data between the core and the accelerator for a single custom instruction, and whether the dcache can be used for that exchange. I checked LazyRoCC.scala and found that io.mem.resp.bits.data is only xLen bits wide. Does that mean the data exchanged between the dcache and RoCC is limited to xLen bits? Is there another way to do this data exchange? Thank you in advance!

Related

Where does the GPU read/write data

I am trying to understand the lines below, from here:
How quickly can data be sent to the GPU or read back from it?
How fast can the GPU kernel read and write data?
How quickly can data be sent to the GPU? Which peripheral device sends data to the GPU? Where does the GPU kernel read and write data: to the data bus?
Does this implementation show us how everything (the GPU and other peripheral devices) contributes to computation performance?

TensorFlow: store training data in GPU memory

I am pretty new to TensorFlow. I used to use Theano for deep learning development. I notice one difference between the two: where the input data can be stored.
Theano supports shared variables, which store input data in GPU memory to reduce data transfer between CPU and GPU.
In TensorFlow, we need to feed data into a placeholder, and the data can come from CPU memory or from files.
My question is: is it possible to store input data in GPU memory with TensorFlow? Or does it already do that in some magic way?
Thanks.

If your data fits on the GPU, you can load it into a constant on GPU from e.g. a numpy array:
with tf.device('/gpu:0'):
    tensorflow_dataset = tf.constant(numpy_dataset)
One way to extract minibatches would be to slice that array at each step with tf.slice, instead of feeding it:
batch = tf.slice(tensorflow_dataset, [index, 0], [batch_size, -1])
There are many possible variations around that theme, including using queues to prefetch the data to GPU dynamically.
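
To make the tf.slice call above concrete, here is a minimal pure-Python sketch of what a slice with begin [index, 0] and size [batch_size, -1] selects from a 2-D array. The helper name slice_rows and the toy dataset are made up for illustration; it only mimics the begin/size convention (a size of -1 means "all remaining elements in that dimension") and is not TensorFlow code.

```python
def slice_rows(data, begin, size):
    """Mimic tf.slice on a 2-D list of lists: begin = [row, col], size = [rows, cols].

    A size of -1 means "everything remaining in that dimension".
    """
    row0, col0 = begin
    n_rows, n_cols = size
    row_end = len(data) if n_rows == -1 else row0 + n_rows
    out = []
    for row in data[row0:row_end]:
        col_end = len(row) if n_cols == -1 else col0 + n_cols
        out.append(row[col0:col_end])
    return out


# Toy dataset: 6 examples with 3 features each (made-up values).
dataset = [[10 * r + c for c in range(3)] for r in range(6)]
batch_size = 2

# Walk over the dataset one minibatch at a time, as a training loop would.
batches = [slice_rows(dataset, [index, 0], [batch_size, -1])
           for index in range(0, len(dataset), batch_size)]
```

In the real graph, index would be the only thing changing per step, so only a scalar crosses from CPU to GPU instead of the whole minibatch.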

It is possible, as has been indicated, but make sure that it is actually useful before devoting too much effort to it. At least at present, not every operation has GPU support, and the list of operations without such support includes some common batching and shuffling operations. There may be no advantage to putting your data on GPU if the first stage of processing is to move it to CPU.
Before trying to refactor code to use on-GPU storage, try at least one of the following:
1) Start your session with device placement logging to log which ops are executed on which devices:
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(config=config)
2) Try to manually place your graph on GPU by putting its definition in a with tf.device('/gpu:0'): block. This will throw exceptions if ops are not GPU-supported.

using PAPI for reading Performance monitoring counters in Intel Core i7

I want to read the performance monitoring counters on a Core i7.
The output for each event contains just one value and gives no information about which core that value is for.
How can I read the event counts for each core separately with PAPI?
Thank you so much.

PAPI counts are based on threads, not on cores. If you want core-based measurements, you may want to consider Intel PCM, which is capable of providing per-core counts.
PCM is a bit tricky to use, and its counts may not match PAPI's, since PCM accounts for things slightly differently.
Does it answer your question?
tjr

large data file in matlab doesn't load/import

I have been trying to load a data file (CSV) into 64-bit MATLAB running on 64-bit Windows 7, but I get memory-related errors. The file is about 3 GB and contains the date (dd/mm/yyyy hh:mm:ss) in the first column and bid and ask prices in two more columns. The memory command returns the following:
Maximum possible array: 19629 MB (2.058e+010 bytes) *
Memory available for all arrays: 19629 MB (2.058e+010 bytes) *
Memory used by MATLAB: 522 MB (5.475e+008 bytes)
Physical Memory (RAM): 16367 MB (1.716e+010 bytes)
* Limited by System Memory (physical + swap file) available.
Can somebody please explain why, if the maximum possible array size is 19.6 GB, MATLAB would throw a memory error while importing a data array that is only about 3 GB? Apologies if this is a simple question for the experienced; I have little experience with process/application memory management.
I would also greatly appreciate it if someone could suggest a way to load this dataset into the MATLAB workspace.
Thank you.

I am no expert in memory management, but from experience I can tell you that you will run into all kinds of problems if you're importing/exporting 3 GB text files.
I would either use an external tool to split your data before you read it, or look into storing that data in another format that is better suited to large datasets. Personally, I have used HDF5 in the past; it is designed for large sets of data and is also supported by MATLAB.
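
As a rough idea of what "an external tool to split your data" could look like, here is a minimal Python sketch that streams a CSV row by row and writes it out in fixed-size chunks, so the whole file is never held in memory. The function name split_csv, the chunk file naming, and the sample rows are all made up for illustration; the column layout (timestamp, bid, ask) follows the question.

```python
import csv
import os
import tempfile


def split_csv(path, out_dir, rows_per_chunk):
    """Split a CSV into files of at most rows_per_chunk rows, streaming row by row."""
    chunk_paths = []
    writer = None
    out_file = None
    with open(path, newline="") as src:
        for i, row in enumerate(csv.reader(src)):
            if i % rows_per_chunk == 0:
                # Start a new chunk file and close the previous one.
                if out_file is not None:
                    out_file.close()
                chunk_path = os.path.join(out_dir, f"chunk_{len(chunk_paths):04d}.csv")
                out_file = open(chunk_path, "w", newline="")
                writer = csv.writer(out_file)
                chunk_paths.append(chunk_path)
            writer.writerow(row)
    if out_file is not None:
        out_file.close()
    return chunk_paths


# Small made-up file in the layout the question describes: timestamp, bid, ask.
work_dir = tempfile.mkdtemp()
sample_path = os.path.join(work_dir, "ticks.csv")
with open(sample_path, "w", newline="") as f:
    w = csv.writer(f)
    for n in range(10):
        w.writerow([f"01/01/2013 00:00:{n:02d}", 1.30 + n / 1000, 1.31 + n / 1000])

chunks = split_csv(sample_path, work_dir, rows_per_chunk=4)
```

For a 3 GB file you would pick rows_per_chunk so that each piece imports comfortably, then load the pieces one at a time.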
In the meantime, these links may help:
Working with a big CSV file in MATLAB
Handling Large Data Sets Efficiently in MATLAB

I've posted before showing how to use memmapfile() to read huge text files in MATLAB. This technique may help you as well.
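
The MATLAB memmapfile() code from that earlier post is not reproduced here. To illustrate the same general idea (memory-mapping a file so the OS pages it in on demand instead of reading it all at once), here is a small sketch using Python's mmap module as a stand-in. The function name count_lines_mmap and the sample data are made up for illustration.

```python
import mmap
import os
import tempfile


def count_lines_mmap(path):
    """Count newline-terminated lines by scanning a memory-mapped view of the file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            return count


# Tiny made-up sample in the question's layout: timestamp, bid, ask.
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    for n in range(5):
        f.write(f"01/01/2013 00:00:0{n},1.30,1.31\n".encode("ascii"))

n_lines = count_lines_mmap(sample_path)
```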

Communication between processor and high speed peripheral

Considering that a processor runs at 100 MHz and data arrives at the processor from an external device/peripheral at a rate of 1000 Mbit/s (8 bits per clock cycle @ 125 MHz), what is the best way to handle traffic that arrives faster than the processor runs?

First off, you can't do it in software. There would be no way to sample the digital lines at a sufficient rate, or to do anything useful with the data.
You need to use a hardware FIFO buffer or memory cell. When a data burst comes in, it can be buffered in the high-speed FIFO and then read out as needed by the processor.
Drop-in high-speed FIFO chips are surprisingly expensive (though most are dual-ported). To cut cost, you would be best off using an SRAM chip and a hardware adder to increment the address lines on incoming data.

This is not an uncommon situation for software. semaj said the right word: this is a system engineering issue. Other folks have the right answer too. If you want to look at or process that data with the 100 MHz processor, it is not going to happen, so don't bother trying. You CAN look at snapshots of it, or have the hardware filter out the specific percentage of it that you are looking for. At the end of the day, though, it is a systems issue: what does the hardware provide, where does it put this data, and what is the software's task for this data? Does it see X buffers of data come in on the goesinta, and then notify the goesouta hardware that there are X buffers ready to go? Does the hardware examine and align the buffers so that you can look at a header and then decide where to route the hardware? Once you do your system engineering you will know whether you can use that processor, and if you can, what its job is and how to do it.
Your direct question: what is the best way to handle it? The best way is to have hardware (FPGA, ASIC, etc.) move the data into and out of some storage device (probably RAM of some sort), not necessarily the same RAM the processor runs out of (DMA is a good thing to avoid). The hardware is something the software can talk to, but you cannot examine all of that data, so don't try. What kind of data this is, what form it takes, what the software looks at, how much work you are willing to force the hardware to do, and so on determine the rest of the answer. If you expect a certain (guaranteed) percentage to be bad, or not to belong to this processor, have the hardware filter that out; what is left you can process.
Networking is a good example of this: PCs have GigE ports but cannot process GigE line-rate data. That is why we use switches now instead of hubs: hardware slices out a percentage of the data so the PC can handle it, and the protocols take care of the data that cannot be processed by resending it later. The switches' processors don't look at all of the data either; the hardware slices it up so the software can examine just the header. Or sometimes the software simply manages tables that drive the hardware, and the hardware does all the work of processing the data.
Do your system engineering and the answers will simply fall out.

You buffer it. Typically, data from a device is written to a memory buffer (a circular queue) using DMA (no CPU involved). The CPU reads from the memory buffer at a constant rate. Devices usually send data in bursts, which keeps the buffer from filling up. If there is too much data, the buffer overflows.
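
The circular-queue idea in this answer can be sketched in a few lines of Python. This is only a software model under made-up assumptions: the producer stands in for the DMA engine and pushes two items per step during a burst, the consumer stands in for the CPU and pops one item per step, and the names RingBuffer and simulate plus all the sizes are invented for illustration.

```python
class RingBuffer:
    """Fixed-size circular queue: a producer writes at head, a consumer reads at tail."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0      # next slot the producer writes
        self.tail = 0      # next slot the consumer reads
        self.count = 0     # items currently stored
        self.dropped = 0   # items lost to overflow

    def push(self, item):
        """Producer side (stands in for the DMA engine). Returns False on overflow."""
        if self.count == self.capacity:
            self.dropped += 1
            return False
        self.slots[self.head] = item
        self.head = (self.head + 1) % self.capacity
        self.count += 1
        return True

    def pop(self):
        """Consumer side (stands in for the CPU). Returns None when empty."""
        if self.count == 0:
            return None
        item = self.slots[self.tail]
        self.tail = (self.tail + 1) % self.capacity
        self.count -= 1
        return item


def simulate(capacity, burst_len, idle_len, cycles):
    """Producer bursts for burst_len steps, then idles idle_len steps; consumer pops 1 per step."""
    buf = RingBuffer(capacity)
    consumed = 0
    period = burst_len + idle_len
    for step in range(cycles):
        if step % period < burst_len:
            # During a burst the device is faster than the consumer: 2 items per step.
            buf.push(step)
            buf.push(step)
        if buf.pop() is not None:
            consumed += 1
    return consumed, buf.dropped


# Same traffic pattern, two buffer sizes.
ok_consumed, ok_dropped = simulate(capacity=8, burst_len=4, idle_len=6, cycles=100)
bad_consumed, bad_dropped = simulate(capacity=2, burst_len=4, idle_len=6, cycles=100)
```

With the larger buffer the idle gaps let the consumer catch up and nothing is dropped; with the smaller one the same bursts overflow, which is the failure mode the answer warns about.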

DMA (direct memory access) is possibly the solution. However, it seems unlikely that the memory bus could run faster than the processor core, so 125 MHz writes could not be sustained and the receiving peripheral would have to accept data into a register wider than 8 bits. For example, a 16-bit register would allow memory writes at 62.5 MHz, which may be achievable. The receiving device would also have to accept an external clock that is both faster than and asynchronous to the core clock. And of course the receiving peripheral must support DMA.
Unless you are more specific about your hardware and the communication protocol it is difficult to give anything other than a general answer.
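
The register-width arithmetic in this answer can be checked with a short Python sketch. The 1000 Mbit/s line rate and 100 MHz core clock come from the question; the helper name word_rate_mhz is made up, the 32-bit case is added only for comparison, and treating the core clock as the upper bound on memory write rate is the answer's own assumption.

```python
def word_rate_mhz(bit_rate_mbit_s, word_bits):
    """Transfers per second (in MHz) needed to carry bit_rate_mbit_s with word_bits-wide words."""
    return bit_rate_mbit_s / word_bits


line_rate = 1000   # Mbit/s, from the question
core_clock = 100   # MHz, from the question

# Required write rate for each candidate register width.
rates = {bits: word_rate_mhz(line_rate, bits) for bits in (8, 16, 32)}

# Widths whose write rate fits under the core clock (assumed upper bound for the bus).
feasible = [bits for bits, mhz in rates.items() if mhz <= core_clock]
```

Here rates[8] is 125.0 MHz, which exceeds the 100 MHz core, while rates[16] is 62.5 MHz, matching the figure in the answer.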
