Is a data race safe in ARMv8? - atomic

As we know, accessing aligned fundamental data types on the Intel x86 architecture is atomic. How about ARMv8?
I have tried to get the answer from the Arm Architecture Reference Manual for Armv8-A (A-profile architecture), and I did find something related to atomicity: ARMv8 is other-multi-copy atomic, which promises that accesses by multiple threads to one same LOCATION are atomic. But it says a LOCATION is a byte. I am wondering: if thread 1 writes an aligned uint64_t in memory without a lock and thread 2 reads or writes it without a lock at the same time, is that atomic? (uint64_t is 8 bytes, but a LOCATION is only one byte.)

This is explained in section B2.2 of the Armv8 Architecture Reference Manual. In general, ordinary loads and stores of up to 64 bits, if naturally aligned, are single-copy atomic. In particular, if one thread stores to an address and another thread loads from that same address, the load is guaranteed to see either the old or the new value, with no tearing or other undefined behavior. This is roughly analogous to a relaxed load or store in C or C++; indeed, you can see that compilers emit ordinary load and store instructions for such atomic accesses: https://godbolt.org/z/cWjaed9rM
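For illustration, here is a minimal C11 sketch (the names shared, writer and reader are mine) of the kind of code in the Godbolt link: a relaxed atomic access to an aligned uint64_t compiles to a plain str / ldr on AArch64:

#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t shared;  /* naturally aligned 8-byte location */

/* Compiles to a plain str on AArch64: single-copy atomic, no barrier. */
void writer(uint64_t v) {
    atomic_store_explicit(&shared, v, memory_order_relaxed);
}

/* Compiles to a plain ldr: returns either the old or the new value, never a mix. */
uint64_t reader(void) {
    return atomic_load_explicit(&shared, memory_order_relaxed);
}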
Let's prove this with an example. For simplicity, let's use an aligned 2-byte halfword H, calling its bytes H0 and H1. Suppose that in the distant past, H was initialized to 0x0000 by a store instruction Wi; the respective writes to bytes H0 and H1 will be denoted Wi.0 and Wi.1. Now let a new store instruction Wn = {Wn.0,Wn.1} store the value 0xFFFF, and let it race with a load instruction R = {R.0,R.1}. Each of the accesses Wi, Wn, R is single-copy atomic by B2.2.1, first two bullets. We wish to show that either R.0 and R.1 both return 0x00, or else they both return 0xFF.
By B2.3.2 there is a reads-from relation pairing each read with some write. R.0 must read-from either Wi.0 or Wn.0, as those are the only two writes to H0, and thus it must return either 0x00 or 0xFF. Likewise, R.1 must also return either 0x00 or 0xFF. If they both return 0x00 we are done, so suppose that one of them, say R.1, returns 0xFF, and let us show that R.0 also returns 0xFF.
We are supposing that R.1 reads-from Wn.1. By B2.2.2 (2), none of the overlapping writes generated by Wn are coherence-after the corresponding overlapping reads generated by R, in the sense of B2.3.2. In particular, Wn.0 is not coherence-after R.0.
Note that Wn.0 is coherence-after Wi.0 (coherence order is a total order on writes, so one must come after the other, and we are assuming Wi took place very long ago, with sufficient sequencing or synchronization in between). So if R.0 reads-from Wi.0, we then have that Wn.0 is coherence-after R.0 (definition of coherence-after, second sentence). We just argued that is not the case, so R.0 does not read-from Wi.0; it must read-from Wn.0 and therefore return 0xFF. ∎
Note that on x86, ordinary loads and stores implicitly come with acquire and release ordering, respectively; this is not true on ARM64, where you have to use ldar / stlr for that.
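For example (again a C11 sketch with made-up names), a release store pairs with an acquire load to publish data; on AArch64 these compile to stlr and ldar, whereas on x86 they are ordinary mov instructions:

#include <stdatomic.h>
#include <stdint.h>

uint64_t payload;    /* plain data, published via the flag */
_Atomic int ready;

void publish(uint64_t v) {
    payload = v;                                            /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release); /* stlr on AArch64 */
}

uint64_t consume(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire)) /* ldar on AArch64 */
        ;                 /* spin until the flag is set */
    return payload;       /* guaranteed to observe v */
}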

Related

Thread safe operations on XDP

I was able to confirm from the documentation that bpf_map_update_elem is an atomic operation if done on HASH_MAPs. Source: https://man7.org/linux/man-pages/man2/bpf.2.html ["map_update_elem() replaces existing elements atomically"]
My question is twofold.
What if the element does not exist, is the map_update_elem still atomic?
Is the XDP operation bpf_map_delete_elem thread safe from User space program?
The map is a HASH_MAP.
Atomic ops, race conditions and thread safety are sort of complex in eBPF, so I will make a broad answer since it is hard to judge from your question what your goals are.
Yes, both the bpf_map_update_elem command via the syscall and the helper function update the map 'atomically', which in this case means that if we go from value 'A' to value 'B', a program always sees either 'A' or 'B', never some combination of the two (the first bytes of 'B' and the last bytes of 'A', for example). This is true for all map types, and it holds for all map-modifying syscall commands (including bpf_map_delete_elem).
This however doesn't make race conditions impossible since the value of the map may have changed between a map_lookup_elem and the moment you update it.
What is also good to keep in mind is that the map_lookup_elem syscall command (userspace) works differently from the helper function (kernelspace). The syscall will always return a copy of the data, which isn't mutable. But the helper function returns a pointer to the location in kernel memory where the map value is stored, and you can directly update the map value this way without using the map_update_elem helper. That is why you often see hash maps used like:
value = bpf_map_lookup_elem(&hash_map, &key);
if (value) {
    __sync_fetch_and_add(&value->packets, 1);
    __sync_fetch_and_add(&value->bytes, skb->len);
} else {
    struct pair val = {1, skb->len};
    bpf_map_update_elem(&hash_map, &key, &val, BPF_ANY);
}
Note that in this example, __sync_fetch_and_add is used to update parts of the map value. We need to do this since updating it like value->packets++; or value->packets += 1; would result in a race condition. __sync_fetch_and_add emits an atomic CPU instruction which in this case fetches, adds, and writes back all in one instruction.
Also, in this example the two struct fields are each updated atomically, but not together: it is still possible for packets to have been incremented while bytes has not yet. If you want to avoid this, you need to use a spinlock (using the bpf_spin_lock and bpf_spin_unlock helpers), as sketched below.
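A minimal sketch of the spinlock approach (assuming the struct pair value type from the example above, extended with a lock; the lock must be a struct bpf_spin_lock member of the map value):

struct pair {
    struct bpf_spin_lock lock;
    __u64 packets;
    __u64 bytes;
};

value = bpf_map_lookup_elem(&hash_map, &key);
if (value) {
    bpf_spin_lock(&value->lock);   /* both fields now change as one unit */
    value->packets += 1;
    value->bytes += skb->len;
    bpf_spin_unlock(&value->lock);
}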
Another way to sidestep the issue entirely is to use the _PER_CPU variants of maps, where you trade off contention (and thus speed) against memory use.
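For example (a sketch using libbpf map-definition conventions; the names and sizes are illustrative, and the value here is the plain two-field pair, since spin locks are not allowed in per-CPU maps), every CPU gets its own copy of each value, so the program can update it without atomics, and userspace sums the per-CPU copies when it reads:

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __type(value, struct pair);
} hash_map SEC(".maps");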

IEC61131-3 directly represented variables: data width and datatype

Directly represented variables (DRV) in IEC61131-3 languages include in their "addresses" a data-width specifier: X for 1 bit, B for byte, W for word, D for dword, etc.
Furthermore, when a DRV is declared, an IEC data type is specified, as for any variable (BYTE, WORD, INT, REAL...).
I'm not sure how these two things are related. Are they orthogonal or not? Can one define a REAL variable with a W (word) address? What would be the expected result?
A book says:
Assigning a data type to a flag or I/O address enables the programming
system to check whether the variable is being accessed correctly. For
example, a variable declared by AT %QD3 : DINT; cannot be
inadvertently accessed with UINT or REAL.
which does not make things clearer for me. Take for example this fragment (recall that W means word, i.e., 16 bits, and that both DINT and REAL correspond to 32 bits):
X AT %MW3 : DINT;
Y AT %MD4.1 : DINT;
Z AT %MD4.1 : REAL;
The first line maps a 32-bit IEC variable to a 16-bit location. Is this legal? Would the write/read be equivalent to a "cast", or what?
The other lines declare two 32-bit IEC variables of different types that point to the same address (I guess this should be legal). What is the expected result when reading or writing?
Like everything in the PLC world, it's all vendor- and model-specific, unfortunately.
The Siemens compiler would not let you declare a REAL at an address with a bit component like MD4.1; it allowed only MD4, and the data length had to be a double word (MB4 was not allowed).
Reading would not be equivalent to a cast. For example, you declare MW2 as an integer and copy some value there; the PLC stores the integer in, let's say, two's-complement format. Later in the program you read MD2 as a real. The PLC does not try to convert the integer to a real; it just blindly reads the bytes and treats them as a real, regardless of what was saved or declared there. There was no automatic casting.
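The effect is the same as reinterpreting raw bytes in C (an illustrative sketch, not PLC code): the stored bit pattern of the integer is simply read back as a float:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    int32_t i = 1;             /* stored as two's-complement 0x00000001 */
    float f;
    memcpy(&f, &i, sizeof f);  /* reread the same 4 bytes as a float */
    printf("%g\n", f);         /* prints 1.4013e-45, not 1 */
    return 0;
}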
This is how things worked in Siemens S7 PLCs. But you have to be very careful, since each vendor does things its own way.

How can we prove that a bitcoin block is always solvable?

I'm trying to implement a simple cryptocurrency similar to bitcoin, just to understand it deeply down to the code level.
I understand that a bitcoin block contains a hash of the previous block, many transactions and a reward transaction for the solver.
The miner basically runs SHA256 on this candidate block combined with a random number (the nonce). As long as the first certain number of digits of the hash result are zeros, we say this block is solved, and we broadcast the result to the entire network to claim the reward.
But I have never seen anyone prove that a block is solvable at all. I guess this is guaranteed by SHA256? Because the output size is fixed, after trying enough inputs you are guaranteed to hit every hash result? But how can you prove that the solution distribution of a block is even (uniform), so that you can indeed cover all hash results?
Now, suppose a block is indeed always solvable. Can I assume that using 64 bits for the random integer is enough to solve it? How about 32 bits? Or do I have to use an infinite-bit integer?
for example, in the basiccoin project:
the code for proof of work is the following:
def POW(block, hashes):
    halfHash = tools.det_hash(block)
    block[u'nonce'] = random.randint(0, 10000000000000000000000000000000000000000)
    count = 0
    while tools.det_hash({u'nonce': block['nonce'],
                          u'halfHash': halfHash}) > block['target']:
        count += 1
        block[u'nonce'] += 1
        if count > hashes:
            return {'error': False}
        if restart_signal.is_set():
            restart_signal.clear()
            return {'solution_found': True}
        ''' for testing sudden loss in hashpower from miners.
        if block[u'length']>150:
        else: time.sleep(0.01)
        '''
    return block
This code picks a random number in [0, 10000000000000000000000000000000000000000] as a starting point, and then it just increments the value one by one:
block[u'nonce'] += 1
I'm not a Python programmer, so I don't know how Python handles the type of the integer; there is no handling of integer overflow.
I'm trying to implement a similar thing in C++, and I don't know what kind of integer can guarantee a solution.
but how can you prove that the solution distribution of a block is even (uniform), so that you can indeed cover all hash results?
SHA256 is deterministic, so if you rehash the txns it will always produce the same 256-bit hash.
The client nodes keep all the txns and their hashes in the merkle tree for the network clients to propagate and verify the longest possible block chain.
The merkle tree is the essential data structure for recording the hashes of previous blocks.
From there the chain of hash confirmations can be tracked from the origin (genesis) block.
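On the nonce-width question: note that Python integers are arbitrary-precision, so the loop above can never overflow. A fixed-width nonce, by contrast, is not guaranteed to contain a solution for a given candidate block. Real Bitcoin miners, whose header nonce is only 32 bits, change another field of the block (the extraNonce in the coinbase transaction, or the timestamp) once the nonce space is exhausted, which yields a fresh search space. A minimal C sketch of that structure (the struct fields and the hash_block / meets_target helpers are hypothetical stand-ins):

#include <stdbool.h>
#include <stdint.h>

struct block {
    uint32_t nonce;        /* 32-bit header nonce, as in Bitcoin */
    uint64_t extra_nonce;  /* lives in the coinbase transaction */
    /* ... previous-block hash, merkle root, timestamp, target ... */
};

uint64_t hash_block(const struct block *b);  /* hypothetical hash routine */
bool meets_target(uint64_t h);               /* hypothetical difficulty check */

void mine(struct block *b) {
    for (b->extra_nonce = 0; ; b->extra_nonce++) {  /* fresh search space per pass */
        for (uint32_t n = 0; ; n++) {
            b->nonce = n;
            if (meets_target(hash_block(b)))
                return;                              /* solved */
            if (n == UINT32_MAX)
                break;                               /* nonce space exhausted */
        }
    }
}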

Difference of SystemVerilog data types (reg, logic, bit)

There are different data types in SystemVerilog that can be used like the following:
reg [31:0] data;
logic [31:0] data;
bit [31:0] data;
How do the three of them differ?
reg and wire were the original types. Wires are continuously assigned and regs are evaluated at particular points; the advantage here is that the simulator can make optimisations.
wire w_data;
assign w_data = y;
// Same function as above using reg
reg r_data;
always @*
  r_data = y;
A common mistake when learning Verilog is to assume that the reg type implies a register in hardware. The simulator optimisation mentioned earlier can still be made from the context of its usage.
SystemVerilog introduces logic, which can be used in place of both wire and reg.
logic w_data;
assign w_data = y;
// Same function as above using logic
logic r_data;
always @*
  r_data = y;
The types bit and byte have also been created; they can only hold 2 states, 0 or 1, with no x or z. byte implies bit [7:0]. Using these types offers a small speed improvement, but I would recommend not using them in RTL, as your verification may miss uninitialized values or critical resets.
The usage of bit and byte would be more common in testbench components, but can lead to issues in case of having to drive x's to stimulate data corruption and recovery.
Update
At the time of writing I was under the impression that logic could not be used for tristate, I am unable to find the original paper that I based this on. Until further updates, comments or edits, I revoke my assertion that logic can not be used to create tri-state lines.
The tri type has been added, for explicitly defining a tri-state line. It is based on the properties of a wire, logic is based on the properties of a reg.
tri t_data;
assign t_data = (drive) ? y : 1'bz ;
If you no longer have to support backwards compatibility with Verilog, then I would recommend switching to logic and tri. Using logic aids refactoring, and tri reflects the design intent of a tristate line.
The choice of the name reg turned out to be a mistake, because the existence of registers is instead inferred based on how assignments are performed. Due to this, use of reg is essentially deprecated in favor of logic, which is actually the same type.
logic is a 1-bit, 4-state data type
bit is a 1-bit, 2-state data type which may simulate faster than logic
If a logic is also declared as a wire, it has the additional capability of supporting multiple drivers. Note that by default wire is equivalent to wire logic.
In general, the "nets" (such as wire and tri) are most suitable for designing communication buses.
Practically speaking, for RTL it usually doesn't matter whether you declare with reg, or logic, or wire. However, if you have to make an explicit declaration of a 4-state type (as opposed to when you don't), you should typically choose logic since that is what is intended by the language.
Related articles:
What’s the deal with those wire’s and reg’s in Verilog
An analysis of the "logic" data type by Cliff Cummings - 20021209
As I'm unable to add a comment, I have to write what looks like a new answer but isn't. Sigh!
@e19293001, @Morgan: logic defines a 4-state variable, unlike bit, and hence a logic variable can be used to store 1'bz, so the following code is valid and compiles:
logic t_data;
assign t_data = (drive) ? y : 1'bz ;
But I agree that to reflect the design intent tri should be used instead of logic in these cases (although I must say I don't see people using tri instead of logic/wire too often).
reg and logic are exactly the same. These data types appear inside always or initial blocks and store values, e.g. in always @(a) b <= a; the reg b gets evaluated only when a changes, but otherwise it simply stores the value it was last assigned.
wires are simply connections and need to be continuously driven. I agree that they can behave identically, as @Morgan mentioned, but they can be imagined as a piece of hard wire, whose value changes only when the other end, the source, changes.
How do the three of them differ?
There is no difference between logic and reg.
The difference between bit and the other two is that bit is 2-state, whereas logic/reg are 4-state.
Refer to IEEE Std 1800-2017, section 6.11.2, 2-state (two-value) and 4-state (four-value) data types:
logic and reg denote the same type.
Also, section 6.3.1 Logic values:
The SystemVerilog value set consists of the following four basic values:
0 —represents a logic zero or a false condition
1 —represents a logic one or a true condition
x —represents an unknown logic value
z —represents a high-impedance state
Several SystemVerilog data types are 4-state types, which can store
all four logic values. All bits of 4-state vectors can be
independently set to one of the four basic values. Some SystemVerilog
data types are 2-state, and only store 0 or 1 values in each bit of a
vector.
The logic data type doesn't permit multiple drivers; in the case of multiple assignments, the last assignment wins. A net type such as wire gives X if multiple drivers try to drive it with different values.
The reg data type can be assigned only in procedural code, whereas the logic data type can be used in both procedural and continuous assignments.

Fastest possible string key lookup for known set of keys

Consider a lookup function with the following signature, which needs to return an integer for a given string key:
int GetValue(string key) { ... }
Consider furthermore that the key-value mappings, numbering N, are known in advance when the source code for function is being written, e.g.:
// N=3
{ "foo", 1 },
{ "bar", 42 },
{ "bazz", 314159 }
So a valid (but not perfect!) implementation for the function for the input above would be:
int GetValue(string key)
{
switch (key)
{
case "foo": return 1;
case "bar": return 42;
case "bazz": return 314159;
}
// Doesn't matter what we do here, control will never come to this point
throw new Exception();
}
It is also known in advance exactly how many times (C>=1) the function will be called at run-time for every given key. For example:
C["foo"] = 1;
C["bar"] = 1;
C["bazz"] = 2;
The order of such calls is not known, however. E.g. the above could describe the following sequence of calls at run-time:
GetValue("foo");
GetValue("bazz");
GetValue("bar");
GetValue("bazz");
or any other sequence, provided the call counts match.
There is also a restriction M, specified in whatever units are most convenient, defining the upper memory bound of any lookup tables and other helper structures that can be used by GetValue (the structures are initialized in advance; that initialization is not counted against the complexity of the function). For example, M=100 chars, or M=256 sizeof(object reference).
The question is, how to write the body of GetValue such that it is as fast as possible - in other words, the aggregate time of all GetValue calls (note that we know the total count, per everything above) is minimal, for given N, C and M?
The algorithm may require a reasonable minimal value for M, e.g. M >= char.MaxValue. It may also require that M be aligned to some reasonable boundary - for example, that it may only be a power of two. It may also require that M must be a function of N of a certain kind (for example, it may allow valid M=N, or M=2N, ...; or valid M=N, or M=N^2, ...; etc).
The algorithm can be expressed in any suitable language or other form. For runtime performance constraints on the generated code, assume that the generated code for GetValue will be in C#, VB or Java (really, any language will do, so long as strings are treated as immutable arrays of characters - i.e. O(1) length and O(1) indexing, and no other data computed for them in advance). Also, to simplify this a bit, answers which assume that C=1 for all keys are considered valid, though those answers which cover the more general case are preferred.
Some musings on possible approaches
The obvious first answer to the above is using a perfect hash, but generic approaches to finding one seem to be imperfect. For example, one can easily generate a table for a minimal perfect hash using Pearson hashing for the sample data above, but then the input key would have to be hashed for every call to GetValue, and Pearson hash necessarily scans the entire input string. But all sample keys actually differ in their third character, so only that can be used as the input for the hash instead of the entire string. Furthermore, if M is required to be at least char.MaxValue, then the third character itself becomes a perfect hash.
For a different set of keys this may no longer be true, but it may still be possible to reduce the amount of characters considered before the precise answer can be given. Furthermore, in some cases where a minimal perfect hash would require inspecting the entire string, it may be possible to reduce the lookup to a subset, or otherwise make it faster (e.g. a less complex hashing function?) by making the hash non-minimal (i.e. M > N) - effectively sacrificing space for the sake of speed.
It may also be that traditional hashing is not such a good idea to begin with, and it's easier to structure the body of GetValue as a series of conditionals, arranged such that the first checks for the "most variable" character (the one that varies across most keys), with further nested checks as needed to determine the correct answer. Note that "variance" here can be influenced by the number of times each key is going to be looked up (C). Furthermore, it is not always readily obvious what the best structure of branches should be - it may be, for example, that the "most variable" character only lets you distinguish 10 keys out of 100, but for the remaining 90 that one extra check is unnecessary to distinguish between them, and on average (considering C) there are more checks per key than in a different solution which does not start with the "most variable" character. The goal then is to determine the perfect sequence of checks.
You could use a Boyer-Moore-type search, but I think that a trie would be a much more efficient method. You can modify the trie to collapse words as their hit counts reach zero, thus reducing the number of searches you have to do the farther down the line you get. The biggest benefit is that you are doing array lookups on the indexes, which is much faster than a comparison.
You've talked about a memory limitation when it comes to precomputation - is there also a time limitation?
I would consider a trie, but one where you didn't necessarily start with the first character. Instead, find the index which will cut down the search space most, and consider that first. So in your sample case ("foo", "bar", "bazz") you'd take the third character, which would immediately tell you which string it was. (If we know we'll always be given one of the input words, we can return as soon as we've found a unique potential match.)
Now assuming that there isn't a single index which will get you down to a unique string, you need to determine the character to look at after that. In theory you precompute the trie to work out for each branch what the optimal character to look at next is (e.g. "if the third character was 'a', we need to look at the second character next; if it was 'o' we need to look at the first character next"), but that potentially takes a lot more time and space. On the other hand, it could save a lot of time - because having gone down one character, each of the branches may have an index to pick which will uniquely identify the final string, but be a different index each time. The amount of space required by this approach would depend on how similar the strings were, and might be hard to predict in advance. It would be nice to be able to dynamically do this for all the trie nodes you can, but then when you find you're running out of construction space, determine a single order for "everything under this node". (So you don't end up storing a "next character index" on each node underneath that node, just the single sequence.) Let me know if this isn't clear, and I can try to elaborate...
How you represent the trie will depend on the range of input characters. If they're all in the range 'a'-'z' then a simple array would be incredibly fast to navigate, and reasonably efficient for trie nodes where there are possibilities for most of the available options. Later on, when there are only two or three possible branches, that becomes wasteful in memory. I would suggest a polymorphic Trie node class, such that you can build the most appropriate type of node depending on how many sub-branches there are.
None of this performs any culling - it's not clear how much can be achieved by culling quickly. One situation where I can see it helping is when the number of branches from one trie node drops to 1 (because of the removal of a branch which is exhausted), that branch can be eliminated completely. Over time this could make a big difference, and shouldn't be too hard to compute. Basically as you build the trie you can predict how many times each branch will be taken, and as you navigate the trie you can subtract one from that count per branch when you navigate it.
That's all I've come up with so far, and it's not exactly a full implementation - but I hope it helps...
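To make the idea concrete, here is a minimal C sketch of such a trie node (assuming keys drawn from 'a'..'z' and that the input is always one of the known keys; the polymorphic-node idea would replace the fixed child array with a smaller representation where few branches exist):

struct trie_node {
    int value;                    /* the answer if this node is unique, else -1 */
    int next_index;               /* which character position to test next */
    struct trie_node *child[26];  /* dense branch table for 'a'..'z' */
};

int get_value(const struct trie_node *node, const char *key) {
    while (node->value < 0)  /* not yet narrowed to a unique key */
        node = node->child[key[node->next_index] - 'a'];
    return node->value;
}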
Is a binary search of the table really so awful? I would take the list of potential strings and "minimize" them, then sort them, and finally do a binary search upon the block of them.
By minimize I mean reducing them to the minimum they need to be, kind of a custom stemming.
For example if you had the strings: "alfred", "bob", "bill", "joe", I'd knock them down to "a", "bi", "bo", "j".
Then put those in to a contiguous block of memory, for example:
char *table = "a\0bi\0bo\0j\0"; // last 0 is really redundant..but
char *keys[4];
keys[0] = table;
keys[1] = table + 2;
keys[2] = table + 5;
keys[3] = table + 8;
Ideally the compiler would do all this for you if you simply go:
keys[0] = "a";
keys[1] = "bi";
keys[2] = "bo";
keys[3] = "j";
But I can't say if that's true or not.
Now you can bsearch that table, and the keys are as short as possible. If you hit the end of a key, you have a match. If not, then follow the standard bsearch algorithm.
The goal is to get all of the data close together and keep the code itty bitty so that it all fits in to the CPU cache. You can process the key from the program directly, no pre-processing or adding anything up.
For a reasonably large number of keys that are reasonably distributed, I think this would be quite fast. It really depends on the number of strings involved. For smaller numbers, the overhead of computing hash values etc. is more than a search like this. For larger values, it's worth it. Just what those numbers are depends on the algorithms etc.
This, however, is likely the smallest solution in terms of memory, if that's important.
This also has the benefit of simplicity.
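Here is a minimal C sketch of the lookup (the values array is hypothetical; the comparator matches on the minimized prefix, so hitting the end of the table key counts as a match):

#include <stdlib.h>
#include <string.h>

static const char *keys[] = { "a", "bi", "bo", "j" };  /* minimized, sorted */
static const int values[] = { 10, 20, 30, 40 };        /* hypothetical results */

/* Compare only as many characters as the minimized entry contains. */
static int cmp(const void *k, const void *e) {
    const char *key = k;
    const char *entry = *(const char *const *)e;
    return strncmp(key, entry, strlen(entry));
}

int GetValue(const char *key) {
    const char **hit = bsearch(key, keys, 4, sizeof keys[0], cmp);
    return hit ? values[hit - keys] : -1;
}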
Addenda:
You don't have any specifications on the inputs beyond 'strings'. There's also no discussion about how many strings you expect to use, their length, their commonality or their frequency of use. These can perhaps all be derived from the "source", but not planned upon by the algorithm designer. You're asking for an algorithm that creates something like this:
inline int GetValue(char *key) {
return 1234;
}
For a small program that happens to use only one key all the time, all the way up to something that creates a perfect hash algorithm for millions of strings. That's a pretty tall order.
Any design going after "squeezing every single bit of performance possible" needs to know more about the inputs than "any and all strings". That problem space is simply too large if you want it the fastest possible for any condition.
An algorithm that handles strings with extremely long identical prefixes might be quite different than one that works on completely random strings. The algorithm could say "if the key starts with "a", skip the next 100 chars, since they're all a's".
But if these strings are sourced by human beings, and they're using long strings of the same letters, and not going insane trying to maintain that data, then when they complain that the algorithm is performing badly, you reply that "you're doing silly things, don't do that". But we don't know the source of these strings either.
So, you need to pick a problem space to target the algorithm. We have all sorts of algorithms that ostensibly do the same thing because they address different constraints and work better in different situations.
Hashing is expensive, and laying out hashmaps is expensive. If there's not enough data involved, there are better techniques than hashing. If you have a large memory budget, you could make an enormous state machine, based upon N states per node (N being your character set size, which you don't specify: BAUDOT? 7-bit ASCII? UTF-32?). That will run very quickly, unless the amount of memory consumed by the states smashes the CPU cache or squeezes out other things.
You could possibly generate code for all of this, but you may run into code size limits (you don't say what language, either; Java has a 64K method bytecode limit, for example).
But you don't specify any of these constraints. So, it's kind of hard to get the most performant solution for your needs.
What you want is a look-up table of look-up tables.
If memory cost is not an issue you can go all out.
class LutMap {
    static final int POSSIBLE_CHARCODES = 256; // 256 for ASCII, 65536 for 16-bit Unicode
    int value;
    LutMap[] next = new LutMap[POSSIBLE_CHARCODES];
}

int GetValue(String key) {
    LutMap root = Global.AlreadyCreatedLutMap; // prebuilt table of tables
    for (int x = 0; x < key.length(); x++) {
        int c = key.charAt(x);
        if (root.next[c] == null) {
            return root.value; // longest stored prefix wins
        }
        root = root.next[c];
    }
    return root.value; // key fully consumed
}
I reckon that it's all about finding the right hash function. As long as you know what the key-value relationship is in advance, you can do an analysis to try to find a hash function that meets your requirements. Taking the example you've provided, treat the input strings as binary integers:
foo = 0x666F6F (hex value)
bar = 0x626172
bazz = 0x62617A7A
The last hex digit, present in all of them, is different in each. Analyse further:
foo = 0xF = 1111
bar = 0x2 = 0010
bazz = 0xA = 1010
Bit-shift to the right twice, discarding overflow, you get a distinct value for each of them:
foo = 0011
bar = 0000
bazz = 0010
Bit-shift to the right twice again, adding the overflow to a new buffer:
foo = 0010
bar = 0000
bazz = 0001
You can use those to query a static 3-entry lookup table. I reckon this highly personal hash function would take 9 very basic operations: get the nibble (2), bit-shift (2), bit-shift and add (4), and query (1), and a lot of these operations can be compressed further through clever assembly usage. This might well be faster than taking run-time information into account.
Have you looked at TCB? Perhaps the algorithm used there can be used to retrieve your values. It sounds a lot like the problem you are trying to solve, and from experience I can say TCB is one of the fastest key-store lookups I have used. It has constant lookup time, regardless of the number of keys stored.
Consider using Knuth–Morris–Pratt algorithm.
Pre-process the given map into a large string, like below:
String string = "{foo:1}{bar:42}{bazz:314159}";
int length = string.length();
According to KMP, preprocessing time for the string will take O(length).
Searching for any word/key will take O(w), where w is the length of the word/key.
You will need to make two modifications to the KMP algorithm:
keys should appear in order in the joined string
instead of returning true/false, it should parse the number and return it
I hope this gives a good hint.
Here's a feasible approach to determine the smallest subset of chars to target for your hash routine:
let:
k be the number of distinct chars across all your keywords
c be the max keyword length
n be the number of keywords
in your example (padded shorter keywords w/spaces):
"foo "
"bar "
"bazz"
k = 7 (f,o,b,a,r,z, ), c = 4, n = 3
We can use this to compute a lower bound for our search: we need at least log_k(n) chars to uniquely identify a keyword. If log_k(n) >= c, then you'll need to use the whole keyword, and there's no reason to proceed.
Next, eliminate one column at a time and check whether there are still n distinct values remaining. Use the number of distinct chars in each column as a heuristic to optimize the search:
2 2 3 2
f o o .
b a r .
b a z z
Eliminate the columns with the fewest distinct chars first. If you have <= log_k(n) columns remaining, you can stop. Optionally, you could randomize a bit and eliminate the column with the 2nd-fewest distinct chars, or try to recover if an eliminated column results in fewer than n distinct words. This algorithm is roughly O(n!), depending on how much you try to recover. It's not guaranteed to find an optimal solution, but it's a good tradeoff.
Once you have your subset of chars, proceed with the usual routines for generating a perfect hash. The result should be an optimal perfect hash.
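As an illustration of the end result for the sample keys (a hand-written C sketch of what the generated code could look like, with values taken from the question): column elimination leaves only index 2, where 'o', 'r' and 'z' already differ, so the "hash" degenerates into a single character dispatch:

int GetValue(const char *key) {
    switch (key[2]) {        /* the only column that survives elimination */
    case 'o': return 1;      /* "foo"  */
    case 'r': return 42;     /* "bar"  */
    case 'z': return 314159; /* "bazz" */
    }
    return -1;               /* unreachable for valid keys */
}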