Swift FileHandle seek/readData performance - swift

Context:
I have a project where I store a lot of data in binary files and data files. I retrieve offsets from a binary file, stored as UInt64, and each of these offsets gives me the position of a UTF-8-encoded String in another file.
I am attempting, given all the offsets, to reconstruct all the strings from the UTF-8 file. The file that holds all the strings has a size of exactly 20437 bytes / approx. 177000 strings.
Assume I have already retrieved all the offsets and now need to rebuild each string, one at a time. I also have the length in bytes of every String.
Method 1:
I open a FileHandle on the UTF-8-encoded file, and for each offset I seek to it and perform a readData(ofLength:). The whole operation is very slow: more than 35 seconds.
Method 2:
I initialize a Data object with Data(contentsOf: URL).
Then I perform a Data.subdata(in: Range) for each string I want to build. The range starts at offset and ends at offset + size.
This loads the entire file into RAM and lets me retrieve the bytes I need for each String. It is much faster than the first option, but loading the whole file into memory probably isn't ideal either.
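For illustration, here is roughly what the two methods look like (the file path, offsets, and lengths below are placeholders):

```swift
import Foundation

// Placeholder inputs: the UTF-8 strings file plus the offsets/lengths
// recovered from the binary index file.
let stringsURL = URL(fileURLWithPath: "/path/to/strings.utf8")
let offsets: [UInt64] = [0, 12, 40]   // placeholder values
let lengths: [Int] = [12, 28, 7]      // placeholder values

// Method 1: seek + readData(ofLength:) for every single string.
func methodOne() throws -> [String] {
    let handle = try FileHandle(forReadingFrom: stringsURL)
    defer { handle.closeFile() }
    return zip(offsets, lengths).compactMap { offset, length in
        handle.seek(toFileOffset: offset)
        return String(data: handle.readData(ofLength: length), encoding: .utf8)
    }
}

// Method 2: load the whole file once, then subdata(in:) for every string.
func methodTwo() throws -> [String] {
    let data = try Data(contentsOf: stringsURL)
    return zip(offsets, lengths).compactMap { offset, length in
        let start = Int(offset)
        return String(data: data.subdata(in: start ..< start + length), encoding: .utf8)
    }
}
```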
How can I get the best performance for this particular task?

I recently went through a similar experience when caching/loading binary data to/from disk.
I'm not sure what the ultimate process is for best performance, but you can improve the performance of method 2 further still by using a "slice" of the Data object instead of data.subdata(in:). This is similar to using array slices.
This is probably because, instead of creating more Data objects with copies of the original data, the slice references the source Data object's storage. This made a significant difference for me, as my source data was actually pretty large. You should profile both methods and see if it makes a noticeable difference for you.
https://developer.apple.com/documentation/foundation/data/1779919-subscript
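A rough sketch of what I mean, assuming offsets and byte lengths like in your question (the names are placeholders):

```swift
import Foundation

func rebuildStrings(from url: URL, offsets: [UInt64], lengths: [Int]) throws -> [String] {
    let data = try Data(contentsOf: url)
    return zip(offsets, lengths).map { offset, length in
        let start = Int(offset)
        // data[range] is a slice that shares the underlying buffer,
        // whereas data.subdata(in: range) copies those bytes into a new Data.
        let slice = data[start ..< start + length]
        return String(decoding: slice, as: UTF8.self)
    }
}
```

Note that a Data slice keeps the indices of its parent, so always index it with the parent's offsets rather than assuming it starts at 0.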

Related

Flutter/Dart read portions of file as bytes rather than List<int>?

Right now, I'm reading files with the openRead() stream because I didn't want to read in the entire file at once with readAsBytesSync(), since some files might be multiple gigabytes and could possibly cause a memory exception when reading huge files. But when I read in a small portion of a file and process the bytes with other libraries, I have to convert the List<int> that openRead() provides with Uint8List.fromList() before I can hand the read bytes to other functionality that won't accept a List<int>.
These constant conversions are probably taking up processing time which I'd rather not have to spend. Is it possible to read portions of a file the way openRead()'s stream does but get a Uint8List rather than having to convert the list? Or is there a way to do a readAsBytesSync() but supply start/end offsets, which this function doesn't have?
https://api.flutter.dev/flutter/dart-io/File/openRead.html
https://api.flutter.dev/flutter/dart-io/File/readAsBytesSync.html (no start/end arguments)

Swift: how can I iterate over the contents of Data one Character at a time (without converting Data to String)

I have a Data struct whose contents (essentially an array of UInt8) represent the contents of an (arbitrarily large) file on storage. The data is character data, and I want to be able to iterate over it one Character at a time (as opposed to one UInt8 at a time). I do not want to convert the data to a String, as this could potentially consume a lot of memory (the Data object was created using mappedIfSafe to try to minimise memory usage). Does anyone have any thoughts on how I might achieve this?
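One possible direction (just a sketch; it yields Unicode scalars rather than full grapheme-cluster Characters, so it may only approximate what is asked) is to decode the memory-mapped bytes incrementally:

```swift
import Foundation

// Walk a memory-mapped Data one Unicode scalar at a time, without ever
// materialising the whole file as a String.
func forEachScalar(in url: URL, _ body: (Unicode.Scalar) -> Void) throws {
    let data = try Data(contentsOf: url, options: .mappedIfSafe)
    var decoder = UTF8()
    var bytes = data.makeIterator()
    decoding: while true {
        switch decoder.decode(&bytes) {
        case .scalarValue(let scalar): body(scalar)
        case .emptyInput:              break decoding
        case .error:                   break decoding   // malformed UTF-8
        }
    }
}
```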

Marc21 Binary Decoder with Akka-Stream

I'm trying to decode Marc21 binary data records which have the following specification concerning the field that provide the length of the record.
A Computer-generated, five-character number equal to the length of the
entire record, including itself and the record terminator. The number
is right justified and unused positions contain zeros.
I am trying to use
Akka Stream's Framing.lengthField; however, I just don't know how to specify the size of that field. I imagine that a character is 8 bits, maybe 16 for a number; I am not sure, and I wonder whether that depends on the platform or language. In short, the question is: is it possible to say what the size of that field is, knowing that I am working in Scala/Java?
Also, what does this mean:
"The number is right justified and unused positions contain zeros"
Does that have implications for how one reads the value, provided it is collected properly?
If anyone knows anything about this, please share.
EDIT1
Context:
I am trying to build a stream-processing graph where the first stage would be processing the result of a sys command run against a Symphony (vendor cataloging system) server, which is a stream of unstructured byte chunks which, as a whole, represent all the MARC 21 records requested (full dump or partial dump).
By processing I mean chunking that unstructured stream of bytes into a stream of frames, where the frames are the records.
In other words, reading the bytes for one record at a time and emitting each record individually to the next stage.
The next stage will consist of emitting that record (bytes) to Apache Kafka.
Obviously the emission stage would be fully parallelized to speed up the process.
The Symphony server does not have the capability to stream a dump when requested, especially over the network. Hence this Akka Streams-based graph processing to perform that work, for fast ingestion/production and overall streaming processing of our dumps in our fast data infrastructure.
EDIT2
Based on @badcook's input, I wonder if computeFrameSize could be used here. I'm not sure; I am slightly confused by the function and what it takes as parameters.
A little clarification would be much appreciated.
It looks like you're trying to parse MARC 21 records.
In that case I would recommend you just take a look at MARC4J and use that.
If you want to integrate it with Akka streams, or even if you want to parse MARC records your own way, I would recommend breaking up your byte stream with Framing.delimiter using the MARC 21 record terminator (ASCII control character 1D) into complete MARC records, rather than trying to stream and work with fragments of MARC records. It'll be a lot easier.
As for your specific questions: the MARC 21 specification uses characters rather than raw bytes when talking about its structure. It specifies two character encodings into raw bytes, UTF-8 and MARC-8, both of which are variable-width encodings. Hence, no, it is not true that every character is a byte. There is no single answer to how many bytes a character takes up.
"[R]ight justified and unused positions contain zeroes" is another way of saying that numbers are padded from the left with 0s. In this case this line comes from a larger quote stating that the numerical string must be 5 characters long. That means if you are trying to represent the number 1, you must represent it as 00001.

Will using Core Data speed up key-value parsing?

Currently I am downloading and parsing my data from an XML file that is on my server. At one point, I have to check my first XML key value against all of the values from a second XML file, store the matching values in an array of NSDictionarys, and then display them in a table view. This can take up to about 5 seconds, and I want to speed this up.
I am wondering whether, if I download my XML files into a Core Data structure first, this will speed up the process, so the load time isn't as long when checking these values against one another.
No, Core Data won't speed up parsing the XML data -- you'll need to parse the XML before adding the data to your Core Data store. It may or may not speed up the next step, where you apparently are looking for matches, but as you haven't really described what's going on there it's difficult to say one way or the other.
Sounds like you are effectively doing a comparison of the keys in the two XML documents. If you're using XML-based API calls to do that, you may be doing a linear search in XML doc 1 for every key in doc 2. If there are N keys in doc 1 and M keys in doc 2, that's on the order of N * M operations.
Walking through each document just once to get all the keys and add them to something like Core Data (or just an NSDictionary) that's optimized for retrieval by key seems like it could be an improvement, assuming that looking for matches is what's slowing things down. (If most of your time is being spent simply parsing the XML in the first place, you won't gain much by speeding up the matching.)
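To illustrate the idea with a rough sketch (hypothetical shapes; it assumes each parsed XML document boils down to key/value pairs):

```swift
// Index the second document by key once, then each lookup is O(1),
// instead of scanning all of doc 2 for every key in doc 1 (N * M work).
func matchingValues(doc1: [(key: String, value: String)],
                    doc2: [(key: String, value: String)]) -> [[String: String]] {
    let index = Dictionary(doc2.map { ($0.key, $0.value) },
                           uniquingKeysWith: { first, _ in first })
    return doc1.compactMap { entry in
        guard let match = index[entry.key] else { return nil }
        return ["key": entry.key, "doc1Value": entry.value, "doc2Value": match]
    }
}
```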

How would you minimize or compress Core Data sqlite file size?

I have a 215MB CSV file which I have parsed and stored in Core Data, wrapped in my own custom objects. The problem is my Core Data sqlite file is around 260MB. The CSV file contains about 4.5 million lines of data on my city's transit system (bus stops, times, routes, etc.).
I have tried modifying attributes so that arrays of strings representing stop times are stored instead as NSData, but for some reason the file size still remains at around 260MB.
I can't ship an app this size. I doubt anyone would want to download a 260MB app even if it means they have the whole city's transit schedule on it.
Are there any ways to compress or minimize the storage space used (even if it means not using core data, I am willing to hear suggestions)?
EDIT: I just want to provide an update right now because I have been staring at the file size in disbelief. With some clever manipulation involving strings, indexing and database normalization in general, I have managed to reduce the size down to 6.5MB or 2.6MB when compressed. About 105,000 objects stored in Core Data containing the full details of the city's transit system. I'm almost in tears right now D':
Unless your original CSV is encoded in a really foolish manner, it seems unlikely that the size is going to get below 100MB, no matter how much you compress it. That's still really large for an app. The solution is to move your data to a web service. You may want to download and cache significant parts, but if you're talking about millions of records, then fetching from a server seems best. Besides, I have to believe that from time to time the transit system changes, and it would be frustrating to have to upgrade a many-tens-of-MB app every time there was a single stop adjustment.
Having said that, there are some things you may consider:
Move booleans into bit fields. You can put 64 booleans into an NSUInteger. (And don't use a full 64-bit integer if you just need 8 bits. Store the smallest thing you can.)
Compress how you store times. There are only 1440 minutes in a day; you can store that in 2 bytes. Transit times are generally not to the second; they don't need a CGFloat. (Both of these ideas are sketched a few lines below.)
Days of the week and dates can similarly be compressed.
Obviously you should normalize any strings. Look at the CSV for duplicated string values on many lines.
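A rough sketch of those first two ideas, in Swift for illustration (the same bit twiddling works in Objective-C): a departure packed into 4 bytes instead of floats and separate boolean attributes.

```swift
struct PackedDeparture {
    private var bits: UInt32 = 0   // 4 bytes for the whole record

    // 11 bits are enough for a minute-of-day value (0..<1440).
    var minuteOfDay: UInt16 {
        get { UInt16(bits & 0x7FF) }
        set { bits = (bits & ~UInt32(0x7FF)) | UInt32(newValue & 0x7FF) }
    }

    // 7 booleans (runs on Monday, Tuesday, ...) packed into 7 bits.
    var weekdayMask: UInt8 {
        get { UInt8((bits >> 11) & 0x7F) }
        set { bits = (bits & ~UInt32(0x7F << 11)) | (UInt32(newValue & 0x7F) << 11) }
    }
}
```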
I generally would recommend raw sqlite rather than core data for this kind of problem. Core Data is more about object persistence than raw data storage. The fact that you're seeing a 20% bloat over CSV (which is not itself highly efficient) is not a good direction for this problem.
If you want to get even tighter, and don't need very good searching capabilities, you can create packed data blobs. I used to do this on phone switches where memory was extremely tight. You create a bit field struct and allocate 5 bits for one variable, and 7 bits for another, etc. With that, and some time spent shuffling things so they line up correctly on word boundaries, you can get pretty tight.
Since you care most about your initial download size, and may be willing to expand your data later for faster access, you can consider very domain-specific compression. For example, in the above discussion, I mentioned how to get down to 2 bytes for a time. You could probably get down to 1 byte in many cases by storing times as delta minutes since the last time (since most of your times are going to be increasing by fairly small steps if they're bus and train schedules). Abandoning the database, you could create a very tightly encoded data file that you could extract into a database on first launch.
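A sketch of that delta idea (illustrative only): one byte per departure in the common case, with a three-byte escape for any gap of 255 minutes or more.

```swift
// Assumes times are minute-of-day values sorted in ascending order.
func encodeDeltas(_ times: [UInt16]) -> [UInt8] {
    var bytes: [UInt8] = []
    var previous: UInt16 = 0
    for time in times {
        let delta = Int(time) - Int(previous)
        if delta >= 0 && delta < 255 {
            bytes.append(UInt8(delta))                 // common case: one byte
        } else {
            // escape: 0xFF followed by the absolute time in two bytes
            bytes.append(contentsOf: [0xFF, UInt8(time >> 8), UInt8(time & 0xFF)])
        }
        previous = time
    }
    return bytes
}

func decodeDeltas(_ bytes: [UInt8]) -> [UInt16] {
    var times: [UInt16] = []
    var previous: UInt16 = 0
    var i = 0
    while i < bytes.count {
        if bytes[i] == 0xFF {
            previous = UInt16(bytes[i + 1]) << 8 | UInt16(bytes[i + 2])
            i += 3
        } else {
            previous &+= UInt16(bytes[i])
            i += 1
        }
        times.append(previous)
    }
    return times
}
```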
You also can use domain-specific knowledge to encode your strings into smaller tokens. If I were encoding the NY subway system, I would notice that some strings show up a lot, like "Avenue", "Road", "Street", "East", etc. I'd probably encode those as unprintable ASCII like ^A, ^R, ^S, ^E, etc. I'd probably encode "138 Street" as two bytes (0x8A13). This of course is based on my knowledge that è (0x8a) never shows up in the NY subway stops. It's not a general solution (in Paris it might be a problem), but it can be used to highly compress data that you have special knowledge of. In a city like Washington DC, I believe their highest numbered street is 38th St, and then there's a 4-value direction. So you can encode that in two bytes, first a "numbered street" token, and then a bit field with 2 bits for the quadrant and 6 bits for the street number. This kind of thinking can potentially significantly shrink your data size.
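And a sketch of that tokenising idea (hypothetical token values; it assumes stop names are plain ASCII, so bytes above 0x7F are free to use):

```swift
// Hypothetical single-byte tokens for words that appear very frequently.
let wordTokens: [String: UInt8] = ["Street": 0x8A, "Avenue": 0x8B,
                                   "Road": 0x8C, "East": 0x8D]

func encodeStopName(_ name: String) -> [UInt8] {
    var bytes: [UInt8] = []
    for (index, word) in name.split(separator: " ").enumerated() {
        if index > 0 { bytes.append(0x20) }        // keep the space separator
        if let token = wordTokens[String(word)] {
            bytes.append(token)                    // one byte for a frequent word
        } else {
            bytes.append(contentsOf: word.utf8)    // raw UTF-8 otherwise
        }
    }
    return bytes
}
```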
You might be able to perform some database normalization.
Look for anything that might be redundant, or the same values being stored in multiple rows. You will probably need to restructure your database so these duplicate values (if any) are stored in separate tables and then referenced from their original rows by means of IDs.
How big is the sqlite file compressed? If it's satisfactorily small, the simplest thing would be to ship it compressed, then uncompress it to NSCachesDirectory.