Tokenize an mmapped string using strtok

I have mmapped a huge file containing 4k * 4k floats. Since it is a text file, I need to mmap it as a char string and work with that. Now I need to parse the floats and write them into a 2D array. If I tokenize it using strtok, it will not work, since the mmapped string is not modifiable. If I copy the string into a std::string and then tokenize it using getline, that works, but I suspect I lose the performance I gained from mmap. How do I solve this problem optimally?

You can try some different solutions, but you will have to benchmark to find out which one is the best for you. It's not always clear that mmap()ing a file and processing the memory-mapped pages directly is the best solution. Especially if you make a single sequential pass through the file, a loop that read()s pieces at a time into a buffer can be faster, even if you use madvise() together with mmap(). Again, benchmark if you want to know what is fastest for you.
Some solutions you might try:
mmap() with PROT_READ|PROT_WRITE and MAP_PRIVATE and then use your existing strtok() code (see the first sketch after this list). This allows strtok() to write the NUL bytes it wants to write, without having those changes be reflected in the file. If you choose this solution, you should probably call madvise(MADV_DONTNEED) on the parts of the file you have already processed, else memory usage will grow linearly.
Implement your own variant of strtok() that returns the length of the matched token instead of a NUL-terminated string (see the second sketch after this list). It's not difficult, using memchr() or a simple scan, and this way you don't need to modify the memory. You might then need to pass the resulting tokens to functions that take a string and a length instead of a NUL-terminated string. There aren't many such functions in the C library, but even so you might be able to get away with calling functions like strtod() if the tokens are guaranteed to end in some non-digit delimiter. Or you can copy them into a small stack-allocated buffer (they're floats, they can't be that long, right?)
Use a read()-and-process loop instead of mmap().
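For the first option, a minimal sketch (the file name is hypothetical, and error handling is omitted for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("floats.txt", O_RDONLY);   /* hypothetical input file */
    struct stat st;
    fstat(fd, &st);
    /* PROT_WRITE with MAP_PRIVATE: strtok() can write its NUL bytes into
       our copy-on-write pages; the file on disk is never modified. */
    char *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, fd, 0);
    /* Caveat: strtok() expects NUL termination. Bytes past EOF in the last
       page read as 0, but only if the file size is not an exact multiple of
       the page size -- guard that case in production code. */
    for (char *tok = strtok(data, " \t\n"); tok != NULL;
         tok = strtok(NULL, " \t\n"))
        printf("%f\n", atof(tok));
    munmap(data, st.st_size);
    close(fd);
    return 0;
}

For the second option, a sketch of a length-returning tokenizer that leaves a read-only mapping untouched (names are illustrative):

#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return the next whitespace-delimited token in [p, end), or NULL when the
   input is exhausted. *len receives the token length; nothing is written. */
static const char *next_token(const char *p, const char *end, size_t *len)
{
    while (p < end && isspace((unsigned char)*p))
        p++;
    const char *start = p;
    while (p < end && !isspace((unsigned char)*p))
        p++;
    *len = (size_t)(p - start);
    return *len ? start : NULL;
}

/* Copy each token into a small stack buffer so strtof() always sees a
   NUL-terminated string, even at the very end of the mapping. */
static float token_to_float(const char *tok, size_t len)
{
    char buf[64];
    if (len >= sizeof buf)
        len = sizeof buf - 1;
    memcpy(buf, tok, len);
    buf[len] = '\0';
    return strtof(buf, NULL);
}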

What is the fastest way to search through strings in Objective-C?

I am implementing a sort of autocomplete for an iOS app. The data I am using for the autocomplete values is a comma-separated text file with about 100,000 strings. This is what I am doing now:
Read the text file, and create an NSArray holding 100,000 NSStrings.
As the user types, do [array containsObject:text]
Surely there is a better/faster way to do this lookup. Any thoughts?
Absolutely, there is! It's not "in Objective-C" though: most likely, you would need to code it yourself.
The idea is to convert your list of strings into a trie (a prefix tree), a data structure that lets you search by prefix very quickly. Searching for possible completions in a trie is very fast, but the structure itself is not easy to build. A quick search on the internet revealed that there is no readily available implementation in Objective-C, but you may be able to port an implementation from another language, use a C implementation, or even write your own if you are not particularly pressed for time.
Perhaps an easier approach would be to sort your strings alphabetically and run a binary search on the prefix entered so far. Though not as efficient as a trie, the sorted-array approach is acceptable for 100K strings, because you get to the right spot in under seventeen checks (log2(100,000) is about 16.6).
The simplest is probably binary search. See -[NSArray indexOfObject:inSortedRange:options:usingComparator:].
In particular, I'd try something like this:
Pre-sort the array that you save to the file
When you load the array, possibly re-sort it with -[NSArray sortedArrayUsingSelector:] and @selector(compare:) (if you are worried about it being accidentally unsorted or about the Unicode sort order changing for some edge cases). This should be approximately O(n), assuming the array is mostly sorted already.
To find the first potential match, [array indexOfObject:searchString inSortedRange:(NSRange){0, [array count]} options:NSBinarySearchingInsertionIndex|NSBinarySearchingFirstEqual usingComparator:^(id a, id b){ return [a compare:b]; }] (note that the method takes a comparator block, not a selector)
Walk down the array until the entries no longer contain searchString as a prefix. You probably want to do case/diacritic/width-insensitive comparisons to determine whether it is a prefix (NSAnchoredSearch|NSCaseInsensitiveSearch|NSDiacriticInsensitiveSearch|NSWidthInsensitiveSearch)
This may not "correctly" handle all locales (Turkish in particular), but neither will replacing compare: with localizedCompare:, nor will naïve string-folding. (It is only 9 lines long, but took about a day of work time to get right and has about 40 lines of code and 200 lines of test, so I probably shouldn't share it here.)
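Putting steps 3 and 4 together, a minimal sketch (it assumes an array of plain NSStrings pre-sorted with compare:; the function name is illustrative):

#import <Foundation/Foundation.h>

/* Return all strings in `sorted` that begin with `prefix`. */
static NSArray *CompletionsForPrefix(NSArray *sorted, NSString *prefix)
{
    /* Step 3: binary search for the first potential match. */
    NSUInteger start = [sorted indexOfObject:prefix
                               inSortedRange:NSMakeRange(0, sorted.count)
                                     options:NSBinarySearchingInsertionIndex |
                                             NSBinarySearchingFirstEqual
                             usingComparator:^NSComparisonResult(id a, id b) {
                                 return [a compare:b];
                             }];
    /* Step 4: walk forward while entries still start with the prefix. */
    NSMutableArray *completions = [NSMutableArray array];
    for (NSUInteger i = start; i < sorted.count; i++) {
        NSString *candidate = [sorted objectAtIndex:i];
        NSRange r = [candidate rangeOfString:prefix
                                     options:NSAnchoredSearch |
                                             NSCaseInsensitiveSearch |
                                             NSDiacriticInsensitiveSearch |
                                             NSWidthInsensitiveSearch];
        if (r.location == NSNotFound)
            break;   /* sorted input: no later entry can match */
        [completions addObject:candidate];
    }
    return completions;
}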

Parse bit strings in Perl

When working with unpack, I had hoped that b3 would return a bitstring, 3 bits in length.
The code that I had hoped to be writing (for parsing a websocket data packet) was:
my($FIN,$RSV1, $RSV2, $RSV3, $opcode, $MASK, $payload_length) = unpack('b1b1b1b1b4b1b7',substr($read_buffer,0,2));
I noticed that this doesn't do what I had hoped.
If I use b16 instead of the template above, I get the entire 2 bytes loaded into the first variable as "1000000101100001".
That's great, and I have no problem with that.
I can use what I've got so far, by doing a bunch of substrings, but is there a better way of doing this? I was hoping there would be a way to process that bit string with a template similar to the one I attempted to make. Some sort of function where I can pass the specification for the packet on the right hand side, and a list of variables on the left?
Edit: I don't want to do this with a regex, since it will be in a very tight loop that will occur a lot.
Edit2: Ideally it would be nice to be able to specify what the bit string should be evaluated as (Boolean, integer, etc).
If I have understood correctly, your goal is to split the 2-byte input into 7 new variables.
For this purpose you can use bitwise operations. Note that $read_buffer holds raw bytes, so convert them to an integer first; then this is an example of how to get your $opcode value:
my $word = unpack 'n', substr($read_buffer, 0, 2); # the two bytes as one big-endian 16-bit integer
my $b4 = $word & 0x0f00; # your mask to filter bits 9-12
my $opcode = $b4 >> 8;   # rshift your bits
You can do the same manipulations (maybe in a single statement, if you want) for all your variables, and it should execute at a reasonably good speed; see the sketch below.
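A sketch extracting all seven fields this way; the bit positions assume the standard WebSocket frame header layout from RFC 6455, and $read_buffer is assumed to hold at least two bytes:

my $word = unpack 'n', substr($read_buffer, 0, 2);  # big-endian 16-bit integer

my $FIN            = ($word >> 15) & 0x01;
my $RSV1           = ($word >> 14) & 0x01;
my $RSV2           = ($word >> 13) & 0x01;
my $RSV3           = ($word >> 12) & 0x01;
my $opcode         = ($word >>  8) & 0x0f;
my $MASK           = ($word >>  7) & 0x01;
my $payload_length =  $word        & 0x7f;

Each result is an ordinary Perl integer, so $FIN and $MASK can be used directly as booleans, which also covers the second edit of the question.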

Is there a way to know how much memory space a file would take?

Is there a way to know beforehand how much memory space a file will take?
For example, let's say I have a file of 1 GB. How would this file size translate to memory size?
I take your example from the comment and elaborate on what might happen to a text file when loaded into memory: Some time ago, "text" usually meant ASCII (as the least common denominator, at least). Lots of software, written in a language like C, would represent such ASCII strings as a char* type. This resulted in a more-or-less exact match in memory requirements: every byte in the input file would occupy one byte when loaded into RAM.
But that has changed in recent years with the rise of Unicode. The same text file, loaded by a simple Java program (using Java's String type, which is very likely), would take up twice the amount of RAM. This is because the Java String type represents each character internally using UTF-16 (at least 16 bits per character), whereas ASCII uses only one byte per character.
What I'm trying to say here is: there is no easy answer to your question; it always depends on who reads the data and what they are about to do with it.
One thing is true quite often: data does not become smaller by being loaded.
If you read the whole file into memory at once, you'll need at least as much free memory as the size of the file. Much of the time people don't actually need to do so; they just don't know another way. For an explanation of the problem and alternatives, see:
http://www.effectiveperlprogramming.com/2010/01/memory-map-files-instead-of-slurping-them/
You can check yourself by writing a little test script with Memory::Usage.
From its documentation's synopsis:
use Memory::Usage;
my $mu = Memory::Usage->new();
# Record amount of memory used by current process
$mu->record('starting work');
# Do the thing you want to measure
$object->something_memory_intensive();
# Record amount in use afterwards
$mu->record('after something_memory_intensive()');
# Spit out a report
$mu->dump();
Then you'll know how much your build of Perl, given whatever character encoding you intend to use, and whatever method of dealing with the file you intend to implement, will consume in memory.
If you can avoid loading the whole file at once, and instead just iterate over it line by line or record by record, the memory concern goes away. So it would help to know what you actually are trying to accomplish. You may have an XY problem.
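For example, a minimal line-by-line sketch (the file name is a placeholder); memory use stays bounded no matter how large the file is:

open my $fh, '<', 'big_file.txt' or die "Cannot open: $!";
while (my $line = <$fh>) {
    chomp $line;
    # process just this one line; earlier lines are no longer held in memory
}
close $fh;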
perldoc -f stat
stat Returns a 13-element list giving the status info for a file,
either the file opened via FILEHANDLE or DIRHANDLE, or named by
EXPR. If EXPR is omitted, it stats $_. Returns the empty list
if "stat" fails. Typically used as follows:
($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
$atime,$mtime,$ctime,$blksize,$blocks)
= stat($filename);
Note the $size return value. It is the size of the file in bytes. If you are going to slurp the entire file into memory you will need at least $size bytes. Then again, you might need a whole lot more (or even a whole lot less), depending on what you do to the contents of the file.
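If that one value is all you need, the -s file test operator returns it without unpacking the whole stat list:

my $size = -s $filename;   # file size in bytes, same as stat's $size slot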

dd speed issue using ibs/obs

I have a loop where I use dd to copy a stream to a disk. I am using a larger block size with 'bs' throughout the process for speed reasons. However, for one specific line I have to use 'ibs' and 'obs', because my 'seek' location is not a multiple of the 'bs' I use elsewhere.
My question is: Is there a way using dd or any other program/Perl module to write out a blocksize different from the one used to 'seek'?
dd if=/dev/ram1 of=/dev/sdb1 seek=2469396480 ibs=1048576 obs=1 count=1
As you can see above, while the raw data is read in one 1M block, I have to write it out in 1-byte segments because I need to seek to a specific location at byte granularity. This makes the write about 1/100th as fast.
Is there a workaround? Or is there a way to do this in Perl without using dd?
Thanks,
Nick
This problem is inherent in dd. If your desired seek location has no factor of suitable magnitude (big enough for good performance but small enough to use as a buffer size) then you're stuck. This happens among other times when your desired seek location is a large prime.
In this specific case, as Mark Mann pointed out, you do have good options: 2469396480 is 2355 blocks of size 1048576, or 1024 blocks of size 2411520, etc... But that's not a generic answer.
To do this generically, you'll want to use something other than dd. Fortunately, dd's task is really simple, and all you need is something like the following sketch in Perl (error handling kept minimal):
use Fcntl qw(O_RDONLY O_WRONLY SEEK_SET);
sysopen my $in,  '/dev/ram1', O_RDONLY or die "open: $!";
sysopen my $out, '/dev/sdb1', O_WRONLY or die "open: $!";
sysseek $out, 2469396480, SEEK_SET or die "seek: $!";
my $remaining = 1048576;    # total bytes to copy
my $chunk     = 1048576;    # any convenient buffer size
while ($remaining > 0) {
    my $n = sysread $in, my $buf, ($chunk < $remaining ? $chunk : $remaining);
    die "read: $!" unless defined $n;
    last if $n == 0;        # end of input
    syswrite $out, $buf or die "write: $!";
    $remaining -= $n;
}
It looks like the source of your copy is a ramdisk of some sort. If you really want screaming performance, you might try another method besides reading chunks into a buffer and writing the buffer out to the output file. For example you can mmap() the source file and write() directly from the mapped address. The OS may (or may not) optimize away one of the RAM-to-RAM copy operations. Note that such methods will be less portable and less likely to be available in high level languages like Perl.
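A rough sketch of that mmap()-and-write() variant in C, using the devices and offset from the question (error handling omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int in  = open("/dev/ram1", O_RDONLY);
    int out = open("/dev/sdb1", O_WRONLY);
    size_t len = 1048576;                     /* bytes to copy */
    /* Map the source, then write() straight from the mapped address; the
       kernel reads the pages directly, with no extra user-space buffer. */
    char *src = mmap(NULL, len, PROT_READ, MAP_PRIVATE, in, 0);
    lseek(out, (off_t)2469396480LL, SEEK_SET);   /* byte-granular seek */
    write(out, src, len);
    munmap(src, len);
    close(in);
    close(out);
    return 0;
}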

Can the C preprocessor perform simple string manipulation?

This is a C macro weirdness question.
Is it possible to write a macro that takes a string constant X ("...") as argument and evaluates to a string Y of the same length, such that each character of Y is a [constant] arithmetic expression of the corresponding character of X?
This is not possible, right ?
No, the C preprocessor considers string literals to be a single token and therefore it cannot perform any such manipulation.
What you are asking for should be done in actual C code. If you are worried about runtime performance and wish to delegate this fixed task at compile time, modern optimising compilers should successfully deal with code like this - they can unroll any loops and pre-compute any fixed expressions, while taking code size and CPU cache use patterns into account, which the preprocessor has no idea about.
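To illustrate that runtime approach, a minimal sketch; the +1 is a placeholder for whatever [constant] arithmetic expression is wanted:

#include <stdio.h>

/* Apply a fixed per-character transformation. For a string literal argument,
   an optimising compiler can unroll this loop and fold the arithmetic. */
static void transform(char *dst, const char *src)
{
    size_t i;
    for (i = 0; src[i] != '\0'; i++)
        dst[i] = (char)(src[i] + 1);   /* placeholder expression */
    dst[i] = '\0';
}

int main(void)
{
    char y[sizeof "HAL"];   /* same length as the input literal */
    transform(y, "HAL");
    puts(y);                /* prints "IBM" */
    return 0;
}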
On the other hand, you may want your code to include such a modified string literal, but do not want or need the original - e.g. you want to have obfuscated text that your program will decode and you do not want to have the original strings in your executable. In that case, you can use some build-system scripting to do that by, for example, using another C program to produce the modified strings and defining them as macros in the C compiler command line for your actual program.
As already said by others, the preprocessor sees entire strings as tokens. There is only one exception: the _Pragma operator, which takes a string as argument and tokenizes its contents to pass them to a #pragma directive.
So unless you're targeting a _Pragma, the only way to do things in the preprocessing phases is to have them written as token sequences, manipulate those, and stringify them at the end.