We are trying to store some big buffers (8MB each) in Redis using the ServiceStack wrapper. We use the “RedisNativeClient.Set(string key, byte[] value)” API to set the buffers.
Both client and server reside on the same machine.
Persistence is disabled.
We are currently using the evaluation version of ServiceStack.
The problem is that we get very poor performance - around 60 MB/sec.
Using other C# wrappers (e.g. "Sider"), we get better performance (~400 MB/sec).
The code I used for my measurements:
public void SimpleTest()
{
Stopwatch sw;
long ms1, ms2, interval;
int nBytesHandled = 0;
int nBlockSizeBytes = 8000000;
int nMaxIterations = 5;
byte[] pBuffer = new byte[(int)(nBlockSizeBytes)];
// Create Redis Wrapper
ServiceStack.Redis.RedisNativeClient m_serviceStackRedisClient = new ServiceStack.Redis.RedisNativeClient();
// Clear DB
m_serviceStackRedisClient.FlushAll();
sw = Stopwatch.StartNew();
ms1 = sw.ElapsedMilliseconds;
for (int i = 0; i < nMaxIterations; i++)
{
m_serviceStackRedisClient.Set("eitan" + i.ToString(), pBuffer);
nBytesHandled += nBlockSizeBytes;
}
ms2 = sw.ElapsedMilliseconds;
interval = ms2 - ms1;
// Calculate rate
double dMBPerSEc = nBytesHandled / 1024.0 / 1024.0 / ((double)interval / 1000.0);
Console.WriteLine("Rate {0:N4}", dMBPerSEc);
}
What could the problem be?
Thanks,
Eitan.
ServiceStack.Redis uses a reusable buffer pool to reduce memory pressure by reusing a pool of byte buffers. The size of the default byte[] buffer is 1450 bytes, to fit within the Ethernet MTU packet size. Whilst this default configuration is optimal for the normal use case of smaller payloads (<100k), it looks like it ends up being slower for larger payloads (1MB+).
Based on this, the ServiceStack.Redis client has now been modified so that it no longer uses the buffer pool for payloads larger than 500k, a threshold that is now configurable with RedisConfig.BufferPoolMaxSize, e.g.:
RedisConfig.BufferPoolMaxSize = 500000;
The default 1450 byte size of the byte[] buffer is now also configurable with:
RedisConfig.BufferLength = 1450;
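For example, a minimal sketch of how the two settings could be applied together (assuming, as the snippets above suggest, that they are static properties set once before any client is created; the 8MB payload mirrors the benchmark in the question):
using ServiceStack.Redis;
class BufferConfigExample
{
    static void Main()
    {
        // Payloads larger than this bypass the reusable buffer pool.
        RedisConfig.BufferPoolMaxSize = 500000;
        // Size of each pooled byte[] buffer; 1450 bytes fits within the Ethernet MTU.
        RedisConfig.BufferLength = 1450;
        // Large values are now written without going through the pool.
        var redis = new RedisNativeClient();
        redis.Set("big0", new byte[8000000]);
    }
}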
This change improves the throughput of ServiceStack.Redis for larger payloads, as seen in the RedisBenchmarkTests suite, which uses your benchmark with different payload sizes, e.g.:
public void Run(string name, int nBlockSizeBytes, Action<int,byte[]> fn)
{
Stopwatch sw;
long ms1, ms2, interval;
int nBytesHandled = 0;
int nMaxIterations = 5;
byte[] pBuffer = new byte[nBlockSizeBytes];
// Create Redis Wrapper
var redis = new RedisNativeClient();
// Clear DB
redis.FlushAll();
sw = Stopwatch.StartNew();
ms1 = sw.ElapsedMilliseconds;
for (int i = 0; i < nMaxIterations; i++)
{
fn(i, pBuffer);
nBytesHandled += nBlockSizeBytes;
}
ms2 = sw.ElapsedMilliseconds;
interval = ms2 - ms1;
// Calculate rate
double dMBPerSEc = nBytesHandled / 1024.0 / 1024.0 / (interval / 1000.0);
Console.WriteLine(name + ": Rate {0:N4}, Total: {1}ms", dMBPerSEc, ms2);
}
Results running from my MacBook Pro and redis-server running in an Ubuntu VirtualBox VM:
1K Results:
ServiceStack.Redis 1K: Rate 4.7684, Total: 1ms
Sider 1K: Rate 0.4768, Total: 10ms
10K Results:
ServiceStack.Redis 10K: Rate 47.6837, Total: 1ms
Sider 10K: Rate 4.3349, Total: 11ms
100K Results:
ServiceStack.Redis 100K: Rate 26.4910, Total: 18ms
Sider 100K: Rate 20.7321, Total: 23ms
1MB Results:
ServiceStack.Redis 1MB: Rate 103.6603, Total: 46ms
Sider 1MB: Rate 70.1231, Total: 68ms
8MB Results:
ServiceStack.Redis 8MB: Rate 77.0646, Total: 495ms
Sider 8MB: Rate 84.3960, Total: 452ms
ServiceStack.Redis is faster for the smaller payloads and now much closer to Sider for the 8MB payloads.
This change is available from v4.0.41+, which is now available on MyGet.
Using Scala 2.12.8 on Java HotSpot 17.0.1, I have a large number of instances of a class which contains code like this:
var tmpSet: mutable.Set[SomeType] = mutable.Set[SomeType]()
lazy val finalSet: immutable.Set[SomeType] = { val tmp = tmpSet.toSet; tmpSet = null; tmp }
During initialization, the logic builds the tmpSet in all of the objects. It then runs this:
for(i <- 0 until numberOfInstances ) instance(i).finalSet.size
to force the conversion to an immutable.Set which will be used for all further processing.
Before the conversion, using an -Xmx14G parameter, about 4.5G of memory has been consumed (for all the tmpSets). Running the conversion always throws an OOM. I have placed traces of memory use at points within the for(...) loop and can see memory usage steadily increasing until the OOM.
Any idea what is happening here? Even if the GC is disabled and does not recover any of the tmpSet instances that have been set to null, there should still be enough RAM -- unless an immutable.Set takes far more memory than the equivalent mutable.Set.
WITHDRAWING this question. I wrote a testbed (below, as an answer) to mimic this situation and it does NOT show this behaviour -- so it must be some other problem within my codebase.
Testbed code to mimic the situation, and it does not exhibit the problem.
Note: I had trouble pasting the code with some special symbols. In particular, I replaced < with &lt; to keep the HTML happy (around line 16). I also increased the indent.
/** Test case to explore OOM issue when converting a large number of mutable.Set's to immutable.Set's
* The 'lazy val finalSet' logic converts each mutable.Set into an immutable.Set, releasing the mutable.Set in the process.
*
* The original expectation was that this would consume roughly the same amount of memory as before the conversion!!!
**/
import scala.collection.mutable
import scala.collection.immutable
case class TestSetOOM(numStrings:Int) {
import TestSetOOM._
// tmpSet is loaded with sizeOfSet unique strings when this class is initialized
var tmpSet:mutable.Set[String] = mutable.Set[String]()
// finalSet is an immutable.Set initialized when finalSet.size is invoked. tmpSet is converted, and then released
lazy val finalSet:immutable.Set[String] = { val tmp = tmpSet.toSet; tmpSet = null; tmp}
// Executes at initialization
for(i <- 0 until numStrings) tmpSet += getString
}
object TestSetOOM {
/////// The basic parameters controlling the tests ////////
val numInstances = 10000 // How many instances of TestSetOOM to create
val sizeOfString = 1000 // How large to make each test string (+ 16 for the number)
val sizeOfSet = 1000 // How many test strings to add to the 'tmpSet' within each instance
val gcEveryN = 1000 // During the conversion phase, request a GC & pause every N instances processed
// NOTE: Would REALLY love to have a method to FORCE a complete, deep GC for use
// in exactly test routines such as this!!!
val pauseMillis = 100 // Number of milliseconds to pause during the gcEveryN, can be zero for no pause
// The hope is that if we pause the main thread then the JVM may actually run the GC
val reportAsMB = true // True == show memory reports as MB, false as GBs
//////// ... end of basic parameters ... ////////
val baseData = numInstances.toLong * sizeOfSet * (sizeOfString + 16)
def main(ignored:Array[String]):Unit = {
var instances:List[TestSetOOM] = Nil // Create numInstances and prepend to this list
for(i <- 0 until numInstances) instances = TestSetOOM(sizeOfSet) :: instances
runtime.gc
if(pauseMillis > 0) Thread.sleep(pauseMillis)
ln("")
ln(f"Instances: $numInstances%,d, Size Of mutable.Set: $sizeOfSet%,d, Size of String: ${sizeOfString + 16}%,d == ${MBorGB(baseData)} base data size")
ln("")
ln(s" BASELINE -- $memUsedStr -- after initialization of all test data")
var dummy = 0L
var cnt = 0
instances.foreach { item =>
dummy += item.finalSet.size // Forces the conversion
cnt += 1
if(gcEveryN > 0 && (cnt % gcEveryN) == 0){
runtime.gc
if(pauseMillis > 0) Thread.sleep(pauseMillis)
ln(f"$cnt%,11d converted -- $memUsedStr ")
}
}
ln("")
ln(s" FINAL -- $memUsedStr")
}
val runtime = Runtime.getRuntime
// Get a memory report in either MBs or GBs
def memUsedStr = { val max = runtime.maxMemory
val ttl = runtime.totalMemory
val free = runtime.freeMemory
val used = ttl - free
s"Memory -- Max: ${MBorGB(max)}, Total: ${MBorGB(ttl)}, Used: ${MBorGB(used)}"
}
def MBorGB(amount:Long) = { val amt = amount / (if(reportAsMB) 1000000 else 1000000000L)
val as = if(reportAsMB) "MB" else "GB"
f"$amt%,9d $as"
}
// Generate strings with leading unique numbers as test data so the compiler cannot
// recognize as equal and memoize just one if they were all the same!
val emptyString = "X" * sizeOfString
var numString = 0L
def getString = { numString += 1
f"${numString}%,16d$emptyString"
}
def ln(str:String) = println(str)
}
Output of execution
Instances: 10,000, Size Of mutable.Set: 1,000, Size of String: 1,016 == 10,160 MB base data size
BASELINE -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,746 MB -- after initialization of all test data
1,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,783 MB
2,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,825 MB
3,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,867 MB
4,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,909 MB
5,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,951 MB
6,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,992 MB
7,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,034 MB
8,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,076 MB
9,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,117 MB
10,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,158 MB
FINAL -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,162 MB
I have implemented a simple linear probing hash map with an array-of-structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. By default, this struct gets padded by the compiler, as key and value are 64-bit integers but the valid flag only needs a single bool. Hence, I have also tried packing the struct at the cost of unaligned access. I was hoping to get better performance from the packed/unaligned version due to higher memory density (we do not waste bandwidth on transferring padding bytes).
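As a quick size check (assuming the usual 8-byte alignment of uint64_t):
sizeof(entry) = 8 (key) + 8 (value) + 1 (flag) = 17 bytes, padded up to 24 bytes; the packed variant stays at 17 bytes
so packing saves 7 bytes of padding per entry.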
When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I do find a difference between unaligned and padded. Here is a graph depicting the results for hash map load factors of 25% and 50%, with different successful query rates on the x axis (0, 25, 50, 75, 100): As you can see, on Intel the grey and blue (circle and square) lines almost overlap; the benefit of struct packing is marginal. On AMD, however, the line representing the unaligned/packed structs is consistently higher, i.e., we get more throughput.
In order to investigate this, I tried to build a smaller microbenchmark. In this microbenchmark, we perform a similar workload, but without the hash map find logic (i.e., we just pick random indices in the array and advance a little from there). Please find the benchmark here:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>
void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }
template <typename T>
void doNotOptimize(T const& val) {
asm volatile("" : : "r,m"(val) : "memory");
}
struct PaddedStruct {
uint64_t key;
uint64_t value;
bool is_valid;
PaddedStruct() { reset(); }
void reset() {
key = uint64_t{};
value = uint64_t{};
is_valid = 0;
}
};
struct PackedStruct {
uint64_t key;
uint64_t value;
uint8_t is_valid;
PackedStruct() { reset(); }
void reset() {
key = uint64_t{};
value = uint64_t{};
is_valid = 0;
}
} __attribute__((__packed__));
int main() {
const uint64_t size = 134217728;
uint16_t repetitions = 0;
uint16_t advancement = 0;
std::cin >> repetitions;
std::cout << "Got " << repetitions << std::endl;
std::cin >> advancement;
std::cout << "Got " << advancement << std::endl;
std::cout << "Initializing." << std::endl;
std::vector<PaddedStruct> padded(size);
std::vector<PackedStruct> unaligned(size);
std::vector<uint64_t> queries(size);
// Initialize the structs with random values + prefault
std::random_device rd;
std::mt19937 gen{rd()};
std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};
for (uint64_t i = 0; i < padded.size(); ++i) {
padded[i].key = dist(gen);
padded[i].value = dist(gen);
padded[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
unaligned[i].key = padded[i].key;
unaligned[i].value = padded[i].value;
unaligned[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
queries[i] = dist2(gen);
}
std::cout << "Running benchmark." << std::endl;
ClobberMemory();
auto start_padded = std::chrono::high_resolution_clock::now();
PaddedStruct* padded_ptr = nullptr;
uint64_t sum = 0;
for (uint16_t j = 0; j < repetitions; j++) {
for (const uint64_t& query : queries) {
for (uint16_t i = 0; i < advancement; i++) {
padded_ptr = &padded[query + i];
if (padded_ptr->is_valid) [[likely]] {
sum += padded_ptr->value;
}
}
doNotOptimize(sum);
}
}
ClobberMemory();
auto end_padded = std::chrono::high_resolution_clock::now();
uint64_t padded_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl; // print sum to avoid that it gets optimized out
ClobberMemory();
auto start_unaligned = std::chrono::high_resolution_clock::now();
uint64_t sum2 = 0;
PackedStruct* packed_ptr = nullptr;
for (uint16_t j = 0; j < repetitions; j++) {
for (const uint64_t& query : queries) {
for (uint16_t i = 0; i < advancement; i++) {
packed_ptr = &unaligned[query + i];
if (packed_ptr->is_valid) [[likely]] {
sum2 += packed_ptr->value;
}
}
doNotOptimize(sum2);
}
}
ClobberMemory();
auto end_unaligned = std::chrono::high_resolution_clock::now();
uint64_t unaligned_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}
When running the benchmark, I pick repetitions = 3 and advancement = 5, i.e., after compiling and running it, you have to enter 3 (and press Enter) and then 5 (and press Enter). I updated the source code to (a) avoid loop unrolling by the compiler, which was possible while repetitions/advancement were hardcoded, and (b) switch to pointers into the vector, as that more closely resembles what the hash map is doing.
On the Intel CPU, I get:
Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185
On the AMD CPU, I get:
Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926
So while in this microbenchmark Intel still benefits a little from the unaligned access, for the AMD CPU both the absolute and the relative improvement are higher. I cannot explain this. In general, from what I've learned from relevant SO threads, unaligned access for a single member is just as expensive as aligned access, as long as it stays within a single cache line (1). Also in (1), a reference to (2) is given, which claims that the cache fetch width can differ from the cache line size. However, except for Linus Torvalds' mail, I could not find any other documentation of cache fetch widths in processors, and especially not for my two concrete CPUs, to figure out whether that might have something to do with this.
Does anybody have an idea why the AMD CPU benefits much more from the struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effects on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the differences here.
Thank you so much.
(1) Relevant SO thread: How can I accurately benchmark unaligned access speed on x86_64?
(2) https://www.realworldtech.com/forum/?threadid=168200&curpostid=168779
The L1 Data Cache fetch width on the Intel Xeon Gold 5220S (and all the other Skylake/CascadeLake Xeon processors) is up to 64 naturally-aligned Bytes per cycle per load.
The core can execute two loads per cycle for any combination of size and alignment that does not cross a cacheline boundary. I have not tested all the combinations on the SKX/CLX processors, but on Haswell/Broadwell, throughput was reduced to one load per cycle whenever a load crossed a cacheline boundary, and I would assume that SKX/CLX are similar. This can be viewed as a necessary feature rather than a "penalty" -- a line-splitting load might need to use both ports to load a pair of adjacent lines, then combine the requested portions of the lines into a payload for the target register.
Loads that cross page boundaries have a larger performance penalty, but to measure it you have to be very careful to understand and control the locations of the page table entries for the two pages: DTLB, STLB, in the caches, or in main memory. My recollection is that the most common case is pretty fast -- partly because the "Next Page Prefetcher" is pretty good at pre-loading the PTE entry for the next page into the TLB before a sequence of loads gets to the end of the first page. The only case that is painfully slow is for stores that straddle a page boundary, and the Intel compiler works very hard to avoid this case.
I have not looked at the sample code in detail, but if I were performing this analysis, I would be careful to pin the processor frequency, measure the instruction and cycle counts, and compute the average number of instructions and cycles per update. (I usually set the core frequency to the nominal (TSC) frequency just to make the numbers easier to work with.) For the naturally-aligned cases, it should be pretty easy to look at the assembly code and estimate what the cycle counts should be. If the measurements are similar to observations for that case, then you can begin looking at the overhead of unaligned accesses in reference to a more reliable understanding of the baseline.
Hardware performance counters can be valuable for this case as well, particularly the DTLB_LOAD_MISSES events and the L1D.REPLACEMENT event. It only takes a few high-latency TLB miss or L1D miss events to skew the averages.
The number of cache-line accesses when using 24-byte data structures may be the same as when using 17-byte data structures.
Please see this blog post: https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/
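As a rough sketch of that argument for these two layouts (assuming 64-byte cache lines, an array that starts on a line boundary, and counting the whole struct for each lookup):
24-byte entries: the start offsets repeat mod 64 as 0, 24, 48, 8, 32, 56, 16, 40; only the entries starting at 48 and 56 straddle a line, so lines touched per entry = 1 + 2/8 = 1.25
17-byte entries: since gcd(17, 64) = 1, the start offsets hit every residue mod 64 equally often, and the 16 offsets 48..63 straddle a line, so lines touched per entry = 1 + 16/64 = 1.25
On average, both layouts therefore touch the same number of cache lines per entry, which is consistent with the point of the blog post.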
I have a question regarding Postgres autovacuum / vacuum settings.
I have a table with 4.5 billion rows and there was a period of time with a lot of updates resulting in ~ 1.5 billion dead tuples. At this point autovacuum was taking a long time (days) to complete.
When looking at the pg_stat_progress_vacuum view I noticed that:
max_dead_tuples = 178956970
resulting in multiple index rescans (index_vacuum_count)
According to the docs, max_dead_tuples is the number of dead tuples that we can store before needing to perform an index vacuum cycle, based on maintenance_work_mem.
According to this, one dead tuple requires 6 bytes of space.
So 6 bytes x 178956970 = ~1GB.
But my settings are
maintenance_work_mem = 20GB
autovacuum_work_mem = -1
So what am I missing? Why didn't all my 1.5 billion dead tuples fit in max_dead_tuples, since 20GB should give enough space, and why were multiple index vacuum cycles necessary?
There is a hard-coded limit of 1GB on the memory used to hold dead tuples in one VACUUM cycle; see the source:
/*
* Return the maximum number of dead tuples we can record.
*/
static long
compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
{
long maxtuples;
int vac_work_mem = IsAutoVacuumWorkerProcess() &&
autovacuum_work_mem != -1 ?
autovacuum_work_mem : maintenance_work_mem;
if (useindex)
{
maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
maxtuples = Min(maxtuples, INT_MAX);
maxtuples = Min(maxtuples, MAXDEADTUPLES(MaxAllocSize));
/* curious coding here to ensure the multiplication can't overflow */
if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
maxtuples = relblocks * LAZY_ALLOC_TUPLES;
/* stay sane if small maintenance_work_mem */
maxtuples = Max(maxtuples, MaxHeapTuplesPerPage);
}
else
maxtuples = MaxHeapTuplesPerPage;
return maxtuples;
}
MaxAllocSize is defined in src/include/utils/memutils.h as
#define MaxAllocSize ((Size) 0x3fffffff) /* 1 gigabyte - 1 */
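As a quick back-of-the-envelope check against the numbers in the question (one dead-tuple pointer, an ItemPointerData, is 6 bytes):
MaxAllocSize / 6 = 1073741823 / 6 ≈ 178956970
which matches the max_dead_tuples you observed, and
1500000000 / 178956970 ≈ 8.4
so roughly nine index vacuum cycles were needed, regardless of how high maintenance_work_mem is set.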
You could lobby on the pgsql-hackers list to increase the limit.
I'm trying to capture my desktop using the Desktop Duplication API, encode the D3DTexture2D using NVENC, and send it over the local network. Everything performs very well until I reach the part where we need to lock the bitstream and extract the data. Below is the code used:
NV_ENC_LOCK_BITSTREAM lockBitstreamData = { NV_ENC_LOCK_BITSTREAM_VER };
lockBitstreamData.outputBitstream = vOutputBuffer[m_iGot % m_nEncoderBuffer];
lockBitstreamData.doNotWait = false;
auto starti = std::chrono::system_clock::now();
NVENC_API_CALL(m_nvenc.nvEncLockBitstream(m_hEncoder, &lockBitstreamData));
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - starti;
std::time_t end_time = std::chrono::system_clock::to_time_t(end);
std::cout << "finished computation at " << std::ctime(&end_time)
<< "elapsed time: " << elapsed_seconds.count() << "s\n";
uint8_t *pData = (uint8_t *)lockBitstreamData.bitstreamBufferPtr;
if (vPacket.size() < i + 1)
{
vPacket.push_back(std::vector<uint8_t>());
}
vPacket[i].clear();
vPacket[i].insert(vPacket[i].end(), &pData[0], &pData[lockBitstreamData.bitstreamSizeInBytes]);
i++;
NVENC_API_CALL(m_nvenc.nvEncUnlockBitstream(m_hEncoder, lockBitstreamData.outputBitstream));
The "NVENC_API_CALL(m_nvenc.nvEncLockBitstream(m_hEncoder, &lockBitstreamData));" takes anything from under 10ms when under desktop at low load to an average of 90ms when I run a game in full screen under heavy load. Our constraints require "real-time" 60fps so anything over 16ms is too high. Is there a way to get that down?
I have built a tricopter from scratch based on a .NET Micro Framework board from TinyCLR.com. I used the FEZ Mini which runs at 72 MHz. Read more about my project at: http://bit.ly/TriRot.
So after a pre-flight check, where I initialise and test each component (calibrating the IMU, spinning each motor, checking that I get receiver data, etc.), it enters a permanent loop which calls the flight controller method on each iteration.
I'm now trying to tune my PID controller using the Ziegler-Nichols method, but I am always getting a progressively larger overshoot. I was eventually able to get a [mostly] stable oscillation using proportional control only (setting Ki and Kd to 0); timing the oscillation period with a stopwatch averaged out to 3.198 seconds.
I came across the answer (by Rex Logan) on a similar question by chris12892.
I was initially using the "Duration" variable in milliseconds which made my copter highly aggressive, obviously because I was multiplying the running integrator error by thousands on each loop. I then divided it by another thousand to bring it to seconds, but I'm still battling...
What I don't understand from Rex's answer is:
Why does he ignore the time variable in the integral and differential parts of the equations? Is that right or is it a typo?
What he means by the remark
In a normal sampled system the delta term would be one...
One what? Should this be one second under normal circumstances? What if this value fluctuates?
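For reference, the discrete PID form that the method below implements, with dt being the measured loop duration in seconds, is:
output = Kp * e + Ki * sum(e * dt) + Kd * (e - e_prev) / dt
If the sample period were fixed, the dt factors would be constants that could be folded into Ki and Kd, which is presumably what the "delta term would be one" remark refers to; with a variable loop time, dt has to stay in the equations, as it does in the code.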
My flight controller method is below:
private static Single[] FlightController(Single[] imuData, Single[] ReceiverData)
{
Int64 TicksPerMillisecond = TimeSpan.TicksPerMillisecond;
Int64 CurrentTicks = DateTime.Now.Ticks;
Int64 TickCount = CurrentTicks - PreviousTicks;
PreviousTicks = CurrentTicks;
Single Duration = (TickCount / TicksPerMillisecond) / 1000F;
const Single Kp = 0.117F; //Proportional Gain (Instantaneous offset)
const Single Ki = 0.073170732F; //Integral Gain (Permanent offset)
const Single Kd = 0.001070122F; //Differential Gain (Change in offset)
Single RollE = 0;
Single RollPout = 0;
Single RollIout = 0;
Single RollDout = 0;
Single RollOut = 0;
Single PitchE = 0;
Single PitchPout = 0;
Single PitchIout = 0;
Single PitchDout = 0;
Single PitchOut = 0;
Single rxThrottle = ReceiverData[(int)Channel.Throttle];
Single rxRoll = ReceiverData[(int)Channel.Roll];
Single rxPitch = ReceiverData[(int)Channel.Pitch];
Single rxYaw = ReceiverData[(int)Channel.Yaw];
Single[] TargetMotorSpeed = new Single[] { rxThrottle, rxThrottle, rxThrottle };
Single ServoAngle = 0;
if (!FirstRun)
{
Single imuRoll = imuData[1] + 7;
Single imuPitch = imuData[0];
//Roll ----- Start
RollE = rxRoll - imuRoll;
//Proportional
RollPout = Kp * RollE;
//Integral
Single InstanceRollIntegrator = RollE * Duration;
RollIntegrator += InstanceRollIntegrator;
RollIout = RollIntegrator * Ki;
//Differential
RollDout = ((RollE - PreviousRollE) / Duration) * Kd;
//Sum
RollOut = RollPout + RollIout + RollDout;
//Roll ----- End
//Pitch ---- Start
PitchE = rxPitch - imuPitch;
//Proportional
PitchPout = Kp * PitchE;
//Integral
Single InstancePitchIntegrator = PitchE * Duration;
PitchIntegrator += InstancePitchIntegrator;
PitchIout = PitchIntegrator * Ki;
//Differential
PitchDout = ((PitchE - PreviousPitchE) / Duration) * Kd;
//Sum
PitchOut = PitchPout + PitchIout + PitchDout;
//Pitch ---- End
TargetMotorSpeed[(int)Motors.Motor.Left] += RollOut;
TargetMotorSpeed[(int)Motors.Motor.Right] -= RollOut;
TargetMotorSpeed[(int)Motors.Motor.Left] += PitchOut;// / 2;
TargetMotorSpeed[(int)Motors.Motor.Right] += PitchOut;// / 2;
TargetMotorSpeed[(int)Motors.Motor.Rear] -= PitchOut;
ServoAngle = rxYaw + 15;
PreviousRollE = RollE;
PreviousPitchE = PitchE;
}
FirstRun = false;
return new Single[] {
(Single)TargetMotorSpeed[(int)TriRot.LeftMotor],
(Single)TargetMotorSpeed[(int)TriRot.RightMotor],
(Single)TargetMotorSpeed[(int)TriRot.RearMotor],
(Single)ServoAngle
};
}
Edit: I found that I had two bugs in my code above (fixed now). I was integrating and differentiating with the last IMU values as opposed to the last error values. That got rid of the runaway situation completely. The only problem now is that it seems to be a bit slow: when I perturb the system, it responds very quickly and stops the disturbance from continuing, but it takes a long time to get back to the setpoint (0), about 10 seconds or more. Is this now just down to tuning the PID? I'll give the suggestions below a go and let you know if any of them make a difference.
One question I have is:
being a .NET board, I don't want to bank on any kind of accurate timing, so instead of trying to work out at what frequency I am executing that method, surely it is better to measure the actual elapsed time and factor that into the equations, or am I misunderstanding something?
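As an illustrative sketch only (not the actual flight code), this is how the dt-scaled update could be pulled into its own small class, measuring the elapsed time from DateTime ticks as the method above already does; the class and member names here are made up:
using System;
public class PidAxis
{
    private readonly Single _kp, _ki, _kd;
    private Single _integrator;
    private Single _previousError;
    private long _previousTicks = DateTime.Now.Ticks;
    public PidAxis(Single kp, Single ki, Single kd)
    {
        _kp = kp; _ki = ki; _kd = kd;
    }
    public Single Update(Single setpoint, Single measured)
    {
        long now = DateTime.Now.Ticks;
        Single dt = (Single)(now - _previousTicks) / TimeSpan.TicksPerSecond;
        _previousTicks = now;
        if (dt <= 0) return 0;                   // guard against a zero or negative interval
        Single error = setpoint - measured;
        _integrator += error * dt;               // integral term accumulates error * seconds
        Single derivative = (error - _previousError) / dt;
        _previousError = error;                  // store the error, not the raw IMU value
        return _kp * error + _ki * _integrator + _kd * derivative;
    }
}
Per axis this would be called once per loop, e.g. rollPid.Update(rxRoll, imuRoll), with the result added to or subtracted from the motor targets exactly as RollOut and PitchOut are used above.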