Multi-threaded signature generation in C# - PKCS#11

I am using a PKCS#11-compliant crypto device that secures my private key. The device is capable of generating 500 RSA-2048-bit signatures per second. I have written a C#.NET application that interfaces with it through the Pkcs11Interop wrapper. Here is my code:
#region Initialization
Pkcs11 pkcs11 = new Pkcs11(pkcsLibraryPath, true);  // true = use OS locking for multi-threaded access
Slot slot = pkcs11.GetSlotList(true)[slotIndex];    // true = only slots with a token present
Session session = slot.OpenSession(false);          // false = read-write session
session.Login(CKU.CKU_USER, hsmPIN);
List<ObjectAttribute> searchObject = new List<ObjectAttribute>(2);
searchObject.Add(new ObjectAttribute(CKA.CKA_CLASS, (uint)CKO.CKO_PRIVATE_KEY));
searchObject.Add(new ObjectAttribute(CKA.CKA_LABEL, keyLabelName));
ObjectHandle privateKeyHandle = session.FindAllObjects(searchObject)[0];
byte[][] dataToBeSigned = new byte[500][];               // 500 messages to sign (populated elsewhere)
byte[][] signature = new byte[dataToBeSigned.Length][];  // one signature per message
#endregion Initialization
#region SEQUENTIAL Signing Loop
for (int i = 0; i < dataToBeSigned.Length; i++)
{
    signature[i] = session.Sign(new Mechanism(CKM.CKM_SHA256_RSA_PKCS_PSS), privateKeyHandle, dataToBeSigned[i]);
}
#endregion SEQUENTIAL Signing Loop
#region UNMANAGED Parallel Loop
Parallel.For(0, dataToBeSigned.Length, index =>
{
    signature[index] = session.Sign(new Mechanism(CKM.CKM_SHA256_RSA_PKCS_PSS), privateKeyHandle, dataToBeSigned[index]);
});
#endregion UNMANAGED Parallel Loop
#region MANAGED Parallel Loop
Parallel.For(0, dataToBeSigned.Length, index =>
{
    lock (session)
    {
        signature[index] = session.Sign(new Mechanism(CKM.CKM_SHA256_RSA_PKCS_PSS), privateKeyHandle, dataToBeSigned[index]);
    }
});
#endregion MANAGED Parallel Loop
That is all of my code. My six points:
1. With the sequential signing loop I achieve only 250-280 signings per second, never the 500 specified by my crypto OEM. I need at least 440-480 signings per second. How can I achieve this using a sequential for loop?
2. Why does my UNMANAGED parallel loop always throw an exception? Even when I handle those exceptions, 40% of the signings fail (session.Sign() returns null). Why is that?
3. With the MANAGED parallel loop I reach a maximum of 280 signings per second, the same as the sequential loop. Why is my MANAGED parallel loop slow? Is it because of the lock? If I remove the lock, it becomes the UNMANAGED parallel loop. How can I handle this?
4. If you feel my multithreading code (or my PKCS#11 programming in general) is wrong, please suggest a method to achieve maximum speed.
5. If you feel the Pkcs11Interop wrapper could be what is keeping me from full speed, please suggest other wrappers. I have tried the NCryptoki and PKCS11.net wrappers, but could not reach maximum speed with them either.
6. I am 100% confident that my PKCS#11-compliant device is capable of 500 signings per second; I confirmed this with my OEM. Only when I operate the device programmatically (from either C# or Java) does the speed go down.
I request the experts of this forum to clarify the six points above.
Many Thanks.
Karthick

You need to create a new session for each signing operation.
Please read "Chapter 6 - General overview" of the PKCS#11 v2.20 specification. All basic concepts of the PKCS#11 API, including the thread/operation isolation provided by sessions, are explained there.
After you finish this mandatory reading, you can take a look at the Pkcs11RsaSignature class in the Pkcs11Interop.PDF project for a working code sample.
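For illustration, here is a minimal sketch (untested) of that advice applied to the code from the question: each parallel worker opens its own session and signs with it. It reuses pkcs11, slot, privateKeyHandle and the data arrays from the question's initialization region. Note that login state is per token, so the single Login() call above covers the new sessions, and object handles remain valid across all sessions of the application.

Parallel.For(0, dataToBeSigned.Length, index =>
{
    // A fresh session per signing operation; Dispose() closes it.
    using (Session workerSession = slot.OpenSession(false))
    {
        signature[index] = workerSession.Sign(
            new Mechanism(CKM.CKM_SHA256_RSA_PKCS_PSS),
            privateKeyHandle,
            dataToBeSigned[index]);
    }
});

If per-operation session setup proves costly, the same idea can be applied with one session per worker thread instead of one per call.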

Related

How to capture command line input from Vert.x

Env: Mac OS 12.1, JDK 17, Vert.x 4.2.4
Question: how can I capture command line input from a verticle? So far I have tried the following in the public void start(Promise<Void> startPromise) throws Exception method:
getVertx().createSharedWorkerExecutor("sys-in").executeBlocking(promise -> {
    try (final BufferedReader br = new BufferedReader(new InputStreamReader(System.in))) {
        String line;
        int count = 0;
        do {
            System.out.print("message to MC: ");
            line = br.readLine();
            count++;
            //doSth(line); // e.g. send line over multicast
        } while (count < 3);
    } catch (Throwable t) {
        // log.info("<start> ", t);
    } finally {
        // bye(); // send a final message and close vertx
        promise.complete();
    }
});
This starts, reads 3 nulls from br, and exits. I also tried a separate ExecutorService, in vain, and couldn't find any help in the Vert.x docs either. Any hints are appreciated. For the record:
- I am aware of Vert.x's warnings about doing blocking work.
- Vert.x might not be meant to be used this way, but it would be cool if reading from the command line could be done with the same toolkit.
I understand what you are trying to accomplish, but the problem is that it goes against the fundamentals of the verticle concept. Waiting for user input is a potentially infinitely blocking operation, i.e. there is no guarantee the user will ever enter anything. In that case you are left with a verticle that hangs forever, wasting resources and stuck in one spot. Multiply this if you are using worker verticles and you might have serious problems with the app. This issue is also emphasized here: https://vertx.io/docs/vertx-core/java/#blocking_code (under Warning).
The link also suggests a solution: a separate thread. A non-Vert.x thread won't mind being blocked, and when the user input arrives it can inform the Vert.x part of the application via the event bus, so the code that depends on that input can then run.
This might not be the solution you had in mind since it is not pure Vert.x, but keep in mind that Vert.x is just another tool, and that tool is not a good fit for what you are trying to accomplish here. However, it pairs well with plain Java, which won't mind.
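To illustrate, a minimal sketch (untested) of that suggestion: a plain daemon thread blocks on stdin and forwards each line over the event bus. The address "mc.input" is a made-up example, and the usual imports (java.io.*) plus a vertx reference are assumed.

Thread stdinThread = new Thread(() -> {
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    String line;
    try {
        while ((line = br.readLine()) != null) {          // blocks, but only this plain thread
            vertx.eventBus().publish("mc.input", line);   // hand the input over to Vert.x
        }
    } catch (IOException e) {
        // log and give up on console input
    }
});
stdinThread.setDaemon(true);  // don't keep the JVM alive just for stdin
stdinThread.start();

A verticle can then react with vertx.eventBus().consumer("mc.input", msg -> ...) without ever blocking an event-loop thread.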

Adding a delay in FiddlerScript

I would like to add a uniform delay to responses in all sessions that Fiddler intercepts. The use of "response-trickle-delay" is unacceptable, since that doesn't actually introduce a uniform delay, but rather delays each 1KB of transfer (which simulates low bandwidth, rather than high latency).
The only reference I could find was here, which used the following atrocity (DO NOT USE!):
static function wait(msecs)
{
    var start = new Date().getTime();
    var cur = start;
    while (cur - start < msecs)
    {
        cur = new Date().getTime();
    }
}
and wait(5000); is inserted into OnBeforeResponse.
As expected, it locked up my computer, started overheating my CPU, and forced me to quit Fiddler.
I'm looking for something:
Less stupid, and
As simple as possible.
It looks like FiddlerScript is written in JScript.NET, and from what I gather there is a setTimeout() function, but I'm having trouble calling it (and I don't know JavaScript or .NET at all). Here is my OnBeforeResponse:
static function OnBeforeResponse(oSession: Session) {
    if (m_Hide304s && oSession.responseCode == 304) {
        oSession["ui-hide"] = "true";
    }
    setTimeout(function(){}, 1000);
}
It just gives a syntax error at the setTimeout line. Can setTimeout() be used from a FiddlerScript to introduce uniform delay?
FiddlerScript uses JScript.NET, which can reference .NET assemblies, and System.Threading contains Sleep.
In Tools > Fiddler Options > Extensions, add the path to System.Threading.dll. On my machine, this was located at C:\Windows\Microsoft.NET\Framework\v4.0.30319.
In FiddlerScript, add import System.Threading;.
You can now add lines like Thread.Sleep(1000).
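Putting it together, the handler from the question becomes something like this (a sketch; the 1000 ms value is just an example, and each session runs on its own thread, so the sleep delays only that response):

static function OnBeforeResponse(oSession: Session) {
    if (m_Hide304s && oSession.responseCode == 304) {
        oSession["ui-hide"] = "true";
    }
    Thread.Sleep(1000); // uniform 1-second delay for every response
}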

Using burst_read/write with register model

I have a register space of 16 registers. These are accessible through a serial bus (single as well as burst access). I have a UVM register model defined for these registers. However, none of the register model methods supports burst transactions on the bus.
As a workaround I could:
1. Declare a memory model for the same space and use it whenever I need burst access. But it seems redundant to declare two separate classes for the same thing, and this approach won't mirror the register values correctly.
2. Create a function that loops for the number of bytes and accesses the registers one by one. However, this doesn't create a burst transaction on the bus.
So I would like to know if there is a way to use burst_read and burst_write methods with the register model. It would be nice if burst_read and burst_write supported mirroring (the current implementation doesn't), but if not I can use .predict and .set, so that's not a big concern.
Alternatively, can I easily implement a method for the register model to support burst operations?
I found this to help get you started:
http://forums.accellera.org/topic/716-uvm-register-model-burst-access/
The poster mentions using the optional 'extension' argument that read/write take. You could store the burst length inside a container object (think int vs. Integer in Java) and then pass that as an argument when calling write() on the first register.
A rough sketch (not tested):
// inside your register sequence
uvm_queue #(int) container = new("container");
container.push_front(4);  // burst length
start_reg.write(status, data, .extension(container));

// inside your adapter
function uvm_sequence_item reg2bus(const ref uvm_reg_bus_op rw);
  int burst_len = 1;  // default to a single access
  uvm_reg_item reg_item = get_item();
  uvm_queue #(int) extension;
  if ($cast(extension, reg_item.extension) && extension != null)
    burst_len = extension.pop_front();
  // build the bus item here based on the burst length
  // ...
endfunction
I've used uvm_queue because there isn't any trivial container object in UVM.
After combining the opinions provided by Tudor and the links in the discussion, here is what works for adding burst operations to the register model.
This implementation doesn't show all the code, only the parts required for burst support. I've tested it for write and read operations with serial protocols (SPI / I2C); the register model values are updated correctly, as are the RTL registers.
Create a class to hold the data and the burst length:
class burst_class extends uvm_object;
  `uvm_object_utils(burst_class)
  int burst_length;
  byte data [$];
  function new (string name = "burst_class");
    super.new(name);
  endfunction
endclass
Inside the register sequence (for a read, don't initialize the data queue):
burst_class obj;
obj = new ("burstInfo");
obj.burst_length = 4; // replace with actual length
obj.data.push_back (data1);
obj.data.push_back (data2);
obj.data.push_back (data3);
obj.data.push_back (data4);
start_reg.read (status,...., .extension(obj));
start_reg.write (status, ...., .extension (obj));
After a successful operation, the data values will have been taken from (write) or collected into (read) the obj object.
In the adapter class, reg2bus is updated for writes and bus2reg for reads. All the information about the transaction is available in reg2bus, except the data in the case of a read.
Adapter class members:
uvm_reg_item start_reg;
int burst_length;
burst_class adapter_obj;
reg2bus implementation:
start_reg = this.get_item();
burst_length = 1;  // default, so the existing single-access adapter path still works
if ($cast(adapter_obj, start_reg.extension) && adapter_obj != null)
  burst_length = adapter_obj.burst_length;
Then size the bus transaction according to burst_length and assign the data correctly.
For reads, bus2reg needs to be updated as well. It already has all the control information, since reg2bus always executes before bus2reg, so use the values captured in reg2bus: according to burst_length, assign the read data to the object passed through the extension (adapter_obj in this case).
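To make that concrete, here is a rough sketch (untested) of such a bus2reg, assuming a hypothetical bus sequence item type serial_tx with addr, is_write and a data[$] byte queue; adapter_obj and burst_length are the members captured in reg2bus above:

virtual function void bus2reg(uvm_sequence_item bus_item, ref uvm_reg_bus_op rw);
  serial_tx tx;  // hypothetical bus transaction type
  if (!$cast(tx, bus_item))
    `uvm_fatal("ADAPTER", "Wrong item type passed to bus2reg")
  rw.kind   = tx.is_write ? UVM_WRITE : UVM_READ;
  rw.addr   = tx.addr;
  rw.data   = tx.data[0];   // first beat follows the normal mirror path
  rw.status = UVM_IS_OK;
  // For a burst read, hand the collected beats back to the sequence
  // through the extension object captured in reg2bus.
  if (!tx.is_write && adapter_obj != null && burst_length > 1)
    foreach (tx.data[i])
      adapter_obj.data.push_back(tx.data[i]);
endfunction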

Parallel.For System.OutOfMemoryException

We have a fairly simple program that's used for creating backups. I'm attempting to parallelize it, but I'm getting an OutOfMemoryException inside an AggregateException. Some of the source folders are quite large, and the program doesn't crash until about 40 minutes after it starts. I don't know where to start looking, so the code below is a near-exact dump of all the code, sans directory structure and exception-logging code. Any advice as to where to start looking?
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
namespace SelfBackup
{
    class Program
    {
        static readonly string[] saSrc = {
            "\\src\\dir1\\",
            //...
            "\\src\\dirN\\", // this folder is over 6 GB
        };
        static readonly string[] saDest = {
            "\\dest\\dir1\\",
            //...
            "\\dest\\dirN\\",
        };

        static void Main(string[] args)
        {
            Parallel.For(0, saDest.Length, i =>
            {
                string sDest = saDest[i]; // the destination for this source folder
                try
                {
                    if (Directory.Exists(sDest))
                    {
                        // Delete the directory first so old stuff gets cleaned up.
                        Directory.Delete(sDest, true);
                    }
                    // Recursive copy function.
                    clsCopyDirectory.copyDirectory(saSrc[i], sDest);
                }
                catch (Exception e)
                {
                    // Standard error logging.
                    CL.EmailError();
                }
            });
        }
    }
}
///////////////////////////////////////
using System.IO;
using System.Threading.Tasks;

namespace SelfBackup
{
    static class clsCopyDirectory
    {
        public static void copyDirectory(string Src, string Dst)
        {
            Directory.CreateDirectory(Dst);

            /* Copy all the files in the folder.
               If and when .NET 4.0 is installed, change
               Directory.GetFiles to Directory.EnumerateFiles for
               slightly better performance. */
            Parallel.ForEach<string>(Directory.GetFiles(Src), file =>
            {
                /* An exception thrown here may be arbitrarily deep into
                   this recursive function. There's also a good chance that
                   if one copy fails here, so too will other files in the
                   same directory, so we don't want to spam out hundreds of
                   error e-mails, but we don't want to abort altogether.
                   Instead, the best solution is probably to throw back up
                   to the original caller of copyDirectory and move on to
                   the next Src/Dst pair by not catching any possible
                   exception here. */
                File.Copy(file,                                      // src
                          Path.Combine(Dst, Path.GetFileName(file)), // dest
                          true);                                     // overwrite
            });

            // Call this function again for every directory in the folder.
            Parallel.ForEach(Directory.GetDirectories(Src), dir =>
            {
                copyDirectory(dir, Path.Combine(Dst, Path.GetFileName(dir)));
            });
        }
    }
}
The Threads debug window shows 417 Worker threads at the time of the exception.
EDIT: The copying is from one server to another. I'm now trying to run the code with the last Parallel.ForEach changed to a regular foreach.
I'm making a few guesses here, as I haven't yet had feedback on the comment to your question.
I'm guessing that the large number of worker threads arises because actions (an action being the unit of work carried out in the parallel foreach) are taking longer than a specified amount of time, so the underlying ThreadPool grows the number of threads. The ThreadPool follows an algorithm of growing the pool so that new tasks are not blocked by existing long-running tasks, e.g. if all my current threads have been busy for half a second, I'll start adding more threads to the pool. However, you are going to get into trouble if all tasks are long-running and any new tasks you add make existing tasks run even longer. This is why you are probably seeing a large number of worker threads - possibly because of disk thrashing or slow network I/O (if networked drives are involved).
I'm also guessing that files are being copied from one disk to another, or from one location to another on the same disk. In that case, adding threads to the problem is not going to help much. The source and destination disks each have only one set of heads, so trying to make them do multiple things at once is likely to actually slow things down:
The disk heads will be lurching all over the place.
Your disk\OS caches may be frequently invalidated.
In short, this problem may not be a good fit for parallelization.
Update
In answer to your comment: if you are getting a speed-up using multiple threads on smaller datasets, then you could experiment with lowering the maximum number of threads used in your parallel foreach, e.g.
ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

Parallel.ForEach(Directory.GetFiles(Src), options, file =>
{
    // Do stuff
});
But please do bear in mind that disk thrashing may negate any benefits from parallelization in the general case. Play about with it and measure your results.

SQLConnection Pooling - Handling InvalidOperationExceptions

I am designing a highly concurrent CCR application in which it is imperative that I do NOT block or put a thread to sleep.
I am hitting SqlConnection pool issues - specifically, InvalidOperationExceptions when calling SqlConnection.Open.
I could retry a handful of times, but that doesn't really solve the problem.
The ideal solution for me would be a way of periodically re-checking the connection for availability that doesn't tie up a thread.
Any ideas?
[Update]
Here is a related problem/solution posted at another forum.
That solution requires a manually managed connection pool. I'd rather have a solution that is more dynamic, i.e. one that kicks in when needed.
Harry, I've run into this as well, also whilst using the CCR. My experience was that, having completely decoupled my dispatcher threads from blocking on any I/O, I could consume and process work items much faster than the SqlConnection pool could cope with. Once the maximum pool limit was hit, I ran into the sort of errors you are seeing.
The simplest solution is to pre-allocate a number of non-pooled asynchronous SqlConnection objects and post them to some central Port<SqlConnection> object. Then whenever you need to execute a command, do so within an iterator with something like this:
public IEnumerator<ITask> Execute(SqlCommand someCmd)
{
    // Assume that 'connPort' has been posted with some open
    // connection objects.
    try
    {
        // Wait for a connection to become available and assign
        // it to the command.
        yield return connPort.Receive(item => someCmd.Connection = item);

        // Wait for the async command to complete.
        var iarPort = new Port<IAsyncResult>();
        var iar = someCmd.BeginExecuteNonQuery(iarPort.Post, null);
        yield return iarPort.Receive();

        // Process the response.
        var rc = someCmd.EndExecuteNonQuery(iar);
        // ...
    }
    finally
    {
        // Put the connection back in the 'connPort' pool
        // when we're done.
        if (someCmd.Connection != null)
            connPort.Post(someCmd.Connection);
    }
}
The nice thing about using the CCR is that it is trivial to add the following features to this basic piece of code:
Timeout - just make the initial receive (for an available connection) a 'Choice' with a timeout port; see the sketch below.
Dynamic pool sizing - to increase the size of the pool, just post a new open SqlConnection to 'connPort'; to decrease it, yield a receive on connPort, then close the received connection and throw it away.
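For example, a rough sketch (untested) of the timeout variant, assuming a CCR DispatcherQueue named queue and the same connPort and someCmd as above:

// Race the pooled-connection receive against a 5-second timer.
var timeoutPort = new Port<DateTime>();
queue.EnqueueTimer(TimeSpan.FromSeconds(5), timeoutPort);
yield return Arbiter.Choice(
    connPort.Receive(conn => someCmd.Connection = conn),
    timeoutPort.Receive(t => { /* no connection became available in time; handle it */ }));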
Yes, the connections are kept open and out of the ADO.NET connection pool. In the above example, the port is the pool.
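For completeness, a sketch (untested) of the pre-allocation step mentioned at the top; poolSize and connString are placeholder names, 'Pooling=false' keeps the connections out of ADO.NET's pool, and 'Asynchronous Processing=true' enables the Begin/End command methods:

var connPort = new Port<SqlConnection>();
for (int i = 0; i < poolSize; i++)
{
    var conn = new SqlConnection(connString + ";Pooling=false;Asynchronous Processing=true");
    conn.Open();          // opened once, up front
    connPort.Post(conn);  // the port is the pool
}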