Decoding/parsing CSV and CSV-like files in Swift - swift

I'll have to write a very customised CSV-like parser/decoder. I have looked for open source ones on GitHub, but have not found any that fit my needs. I can solve this, but my question is whether it would be a total violation of key/value decoding to implement this as a TopLevelDecoder in Swift.
I have keys, but not exactly key/value pairs. In CSV files, there is rather a key for each column of data.
There are a number of problems with the files I need to parse:
Commas are not only for separation of fields, but there are also commas within some fields. Example:
// If I convert to an array
struct Family {
    let name: String?
    let parents: [String?]
    let siblings: [String?]
}
In this example, both parents' names are within the same field and need to be converted into an array, as does the siblings field.
"Name", "Parents","Siblings"
"Danny", "Margaret, John","Mike, Jim, Jane"
In the case of the parents, I could have split that into two fields in a struct like
struct Family {
    let name: String?
    let mother: String?
    let father: String?
}
but with the Siblings field that doesn't work, since there can be anywhere from zero to many siblings. Therefore I will have to use an array.
There are cases when I will split into two fields though.
Not all of the files I need to parse are strictly CSV. All of the files have tabular data (comma- or tab-separated), but some of the files have a few rows of comments (sometimes containing metadata) that I need to consider. Those files have a .txt extension instead of .csv.
## File generated 2020-05-02
"Name", "Parents","Siblings"
"Danny", "Margaret, John","Mike, Jim, Jane"
Therefore I need to peek at the first line(s) to determine if there are such comments, and after that has been parsed I can continue to treat the rest of the file as CSV.
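A minimal sketch of that peek-then-parse step, assuming comment lines start with "#" and that quoted fields never span several lines. The names splitFields and parseTable are hypothetical helpers for illustration, not part of any library:

```swift
import Foundation

/// Splits one delimited line into fields. Delimiters inside double quotes are
/// kept, and a doubled quote ("") inside a quoted field becomes one literal quote.
/// Leading and trailing spaces of each field are trimmed.
func splitFields(_ line: String, delimiter: Character = ",") -> [String] {
    var fields: [String] = []
    var current = ""
    var inQuotes = false
    var index = line.startIndex
    while index < line.endIndex {
        let character = line[index]
        let next = line.index(after: index)
        if character == "\"" {
            if inQuotes, next < line.endIndex, line[next] == "\"" {
                // Escaped quote inside a quoted field: keep one quote, skip both.
                current.append("\"")
                index = line.index(after: next)
                continue
            }
            inQuotes.toggle()
        } else if character == delimiter, !inQuotes {
            fields.append(current.trimmingCharacters(in: .whitespaces))
            current = ""
        } else {
            current.append(character)
        }
        index = next
    }
    fields.append(current.trimmingCharacters(in: .whitespaces))
    return fields
}

/// Separates leading comment lines (assumed to start with "#") from the
/// tabular part, then treats the first remaining line as the header.
func parseTable(_ text: String, delimiter: Character = ",")
    -> (comments: [String], header: [String], rows: [[String]]) {
    let lines = text.split(whereSeparator: \.isNewline).map(String.init)
    let comments = Array(lines.prefix(while: { $0.hasPrefix("#") }))
    let table = lines.dropFirst(comments.count).map { splitFields($0, delimiter: delimiter) }
    return (comments, table.first ?? [], Array(table.dropFirst()))
}
```

Passing "\t" as the delimiter would cover the tab-separated files. A field such as "Margaret, John" comes back as one string, so splitting it into an array is a separate step per column.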
I plan to make it look like any Decoder from the application's point of view, but internally my decoder can handle things as if they were key/value pairs, because there is just one set of keys: the first line in the file, if there are no comments at the beginning. I still want to use CodingKeys though.
What are your thoughts? Should I implement it as a decoder (actually TopLevelDecoder in Swift), or would that be an abuse of the idea of key/value decoding? The alternative is to implement this as a parser, but I have to handle several types of files (JSON, GraphQL, CSV and CSV-like files), and I think my application code would be a lot simpler if I could use Decoders for all the types of files.
For JSON there's no problem, since there is already a JSON decoder in Swift. For GraphQL it's not a problem either, because I can write a decoder with an unkeyed container. The problem files are those CSV and CSV-like files.
Some of them have everything in double quotes, both the "keys" in the CSV header and the values. Some only have double quotes for the keys, but not for the values. Some have comma-separated fields, and some tab-separated. Some have commas within fields, which need special handling. Some have comments at the beginning of the file, which need to be skipped before parsing the rest of the file as CSV.
Some files have two fields in the first column. I have no influence whatsoever over the format of these files, so I just have to deal with it.
If you wonder what files they are, I can tell you that they are files of raw DNA, files with DNA matches, files with common DNA segments with people I have matching DNA with. It's quite a few slightly different files, from several DNA testing companies. I wish they all had used JSON in a standard format, where all keys also were standard for all the companies. But they all have different CSV headers, and other differences.
I also have to decode Gedcom files, which sort of also has key/value coded pairs, but that format too doesn't conform to a pure key/value coding in the files.
Also: I have searched for others with similar problems, but none were exactly the same, so I didn't want to hijack their threads.
See this thread Advice for going from CSV > JSON > Swift objects
That was more of a question of how to convert from CSV to JSON and then to internal data structs in Swift. I know I can write a parser to solve this, but I think it would be more elegant to handle all these files with decoders, but I want your thoughts about it.
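For comparison, the CSV > JSON > Swift objects route can still go through Decodable and CodingKeys without writing a full Decoder. The sketch below is only an illustration under assumptions: decodeRow and its listColumns parameter are hypothetical names, the row is assumed to be already split into fields, and the Family struct uses non-optional properties for brevity.

```swift
import Foundation

struct Family: Decodable, Equatable {
    let name: String
    let parents: [String]
    let siblings: [String]

    // Maps the CSV header names onto the Swift property names.
    enum CodingKeys: String, CodingKey {
        case name = "Name"
        case parents = "Parents"
        case siblings = "Siblings"
    }
}

/// Hypothetical helper: pairs header names with one row's values, turns the
/// columns named in `listColumns` into arrays, and lets JSONDecoder do the rest.
func decodeRow<T: Decodable>(_ type: T.Type,
                             header: [String],
                             values: [String],
                             listColumns: Set<String>) throws -> T {
    var object: [String: Any] = [:]
    for (key, value) in zip(header, values) {
        if listColumns.contains(key) {
            // "Mike, Jim, Jane" becomes ["Mike", "Jim", "Jane"]; an empty field becomes [].
            object[key] = value.split(separator: ",")
                .map { $0.trimmingCharacters(in: .whitespaces) }
        } else {
            object[key] = value
        }
    }
    let data = try JSONSerialization.data(withJSONObject: object)
    return try JSONDecoder().decode(type, from: data)
}
```

This costs an extra serialisation round trip per row, so it is a trade-off against a dedicated decoder rather than a replacement for one.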
I was also thinking of making a new protocol
protocol ColumnCodingKey: CodingKey {
}
I haven't decided yet what to have in the protocol, if anything.
It might work by just having it empty like in the example, and then let my decoder conform to it, then it maybe wouldn't be a very big violation of the key/value decoding.
Thanks in advance!

CSV files could be parsed using regular expressions. To get you started, this might save some time. It's hard to know what you really need, because it looks like there are many different scenarios, and it might grow to even more situations.
A regular expression to parse one line in a CSV file might look something like this
(?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
Here is a detailed description of how it works, with a JavaScript sample
Build a CSV parser
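As a rough Swift counterpart to that JavaScript sample, the same pattern can be run through NSRegularExpression. The function name fields(in:) is made up for this sketch. Note that quoted fields come back with their surrounding quotes still attached, so they would need stripping afterwards:

```swift
import Foundation

// The pattern from the answer, written as a raw string so the quotes need no escaping.
let csvFieldPattern = #"(?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)"#

/// Returns every match of the pattern in one line, in order.
func fields(in line: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: csvFieldPattern) else { return [] }
    let range = NSRange(line.startIndex..., in: line)
    return regex.matches(in: line, range: range).compactMap { match in
        Range(match.range, in: line).map { String(line[$0]) }
    }
}
```

The pattern assumes a comma delimiter with no spaces around it, so the tab-separated files and headers like "Name", "Parents" (with a space after the comma) would need an adjusted pattern.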

Related

Identifying and formatting XML String to readable format in XMLParser

I am working in Swift and I have a set of Data that I can encode as a String like this:
<CONTAINER><Creator type="NSNull"/><Category type="NSNull"/><UMID type="NSArray"><CHILD>d1980b265cbd415c90f5d5f04efcb5df</CHILD><CHILD>7e0252c137c249fc92bd0f844effe27f</CHILD></UMID><Channels type="NSNumber">1</Channels></CONTAINER>
I am looking for a way to format this string as XML with indents so I can use XMLParser to properly read through it, which it currently does not. I imagine NSNull is when the object is empty, I just haven't seen this format so I don't know what to search for. Could it be closer to a Dictionary object? If so I'd be happy to format it as that as well.
I've also tried to create a XMLDocument from the data, but it doesn't fix the format.
EDIT:
I wanted to add a bit more information to help clarify what I am trying to do. This string above is derived from an encrypted piece of metadata from a file. In my code I identify the chunk of data that is encrypted, then decrypt it, and then convert that data to a string. It's worth noting that the string ends up having null characters in between each valid character, but I strip those out and end up with this string.
Copying this string into an XML validator confirms it is valid XML. What is confusing to me is its format, in which it has object types such as NSNull and NSNumber. My post was originally hoping to identify this type of format. It seems like more than just XML.
In response to some of the comments, I have used XML Parser delegate with other XML strings and have a basic understanding of how it works. I should have originally mentioned that and instead said that XML Parser does not recognize any of these elements or strings within them.
UPDATE:
The issue ended up being the null characters in between each valid character. Stripping those out and then running it through XML Parser worked great. Thanks all.
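A minimal sketch of that fix, assuming the payload only contains ASCII-range characters so that dropping every zero byte is safe (interleaved nulls often mean the bytes are UTF-16, in which case decoding them as UTF-16 may be the cleaner route). ElementCollector and elementNames(inRaw:) are hypothetical names:

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in a separate module outside Apple platforms
#endif

/// Records the name of every element the parser encounters.
final class ElementCollector: NSObject, XMLParserDelegate {
    var elements: [String] = []

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        elements.append(elementName)
    }
}

/// Strips the null bytes from the decrypted data, then parses what is left.
func elementNames(inRaw data: Data) -> [String] {
    let cleaned = Data(data.filter { $0 != 0 })
    let parser = XMLParser(data: cleaned)
    let collector = ElementCollector()
    parser.delegate = collector
    parser.parse()
    return collector.elements
}
```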

Is it possible to dynamically name the part-XXXX files based on the value of the column used to partition the Dataset?

I have a val dataset = Dataset[FeedData], where FeedData is something like case class FeedData(feed: String, data: XYZ).
I want to avoid post-processing the files, so I decided to call dataset.repartition($"feed").write.json("s3a://...") so that each feed ends up in a different file. The problem is that the files are still named along the lines of part-XXXX, so I can't easily pick out the relevant file for a given feed without a) opening them all to check the values for feed inside, or b) post-processing the files to be more friendly.
I want the files to look like part-XXXX-{feed} instead of part-XXXX
Is it possible to dynamically name the partition files based on the value of the column feed used to partition the dataset?
Background:
I found this answer which mentions a saveAsNewAPIHadoopFile() method, where I can extend some relevant classes for my own file naming implementation.
Can anybody help me understand this method, how to access it from a Dataset, and tell me whether it's possible to project the required information (feed) into my implementation to dynamically name the partitions?
I was trying to do it the wrong way:
dataset.repartition($"colName").write.format("json").save(path)
The correct way to do this is:
dataset.write.partitionBy("colName").format("json").save(path)
The difference is that you should call .partitionBy after .write. The resulting directories look like: colName=value/part-XXXX.
See here for more info.

There is very little documentation on NSTextCheckingKey, what is it and how is it used?

I've come across some code that I'm trying to decipher.
case addressComponents([NSTextCheckingKey: String]?)
However there is very little documentation on NSTextCheckingKey. Can anyone be of assistance?
This structure is used to select the kind of information that you want to extract from a string using NSDataDetector.
In your code I suppose that addressComponents is a dictionary whose values make up a human-readable address.
Maybe it contains the keys .city, .country, .name, .state, .street and .zip.
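To illustrate where such a dictionary might come from, here is a small sketch that asks NSDataDetector for addresses on Apple platforms. The function name addressComponents(in:) is made up for the example, and which keys are present depends on what the detector recognises in the text:

```swift
import Foundation

/// Returns one [NSTextCheckingKey: String] dictionary per address detected in `text`.
func addressComponents(in text: String) -> [[NSTextCheckingKey: String]] {
    let types = NSTextCheckingResult.CheckingType.address.rawValue
    guard let detector = try? NSDataDetector(types: types) else { return [] }
    let range = NSRange(text.startIndex..., in: text)
    // Each address match carries its parts in `addressComponents`.
    return detector.matches(in: text, range: range).compactMap { $0.addressComponents }
}
```

Individual parts can then be looked up by key, for example components[.city] or components[.zip], each of which is an optional String.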

Spreadsheet::ParseExcel module in perl

I always get confused when I deal with classes and objects. As I am trying to understand the Spreadsheet::ParseExcel module, I have some doubts about its classes and objects:
My doubt is:
With $parser= Spreadsheet::ParseExcel->new();, we are creating an object for Spreadsheet::ParseExcel and after this we shall create the object for Spreadsheet::ParseExcel::Workbook.
Why can not we create the object directly for Spreadsheet::ParseExcel::Workbook and start parsing ?
Thanks
Why can not we create the object directly for Spreadsheet::ParseExcel::Workbook and start parsing
That is a reasonable question and in older versions of Spreadsheet::ParseExcel there was a Spreadsheet::ParseExcel::Workbook->Parse() method that did just that. (*)
Users tend to see an Excel file only as a workbook. However, the file format also contains data such as metadata (author, creation date, etc.) and VBA macros that are separate from the workbook data.
As such the logical division of the parser from the workbook probably occurred due to the physical division of the data in the file.
Or it may have been to allow reporting of file parsing errors rather than just returning an undefined workbook object.
Either way, other people may have chosen to model the interface differently but that is what the original author chose. It is not completely intuitive but it works.
(*) This method is now deprecated since it doesn't allow error checking on the file.
Think about Spreadsheet::ParseExcel and Spreadsheet::ParseExcel::Workbook like they are just of different types, like integer and string, which are both scalar, but you cannot, say, multiply them, although they can interact in some cases. E.g. length() applied to string gives you integer length of string. The same way, Spreadsheet::ParseExcel::parse() gives you Spreadsheet::ParseExcel::Workbook. They are bound by common namespace but they are completely different, Spreadsheet::ParseExcel is a parser and Spreadsheet::ParseExcel::Workbook is a workbook.

Jackrabbit / JCR organisation of text content data

I was thinking about how to organize "normal" text content (i.e. a String, HTML code ...) in Jackrabbit.
Are there any recommended structures for plain text content (like for files)?
Should I store each text content as a binary (like I do with files)?
Node(nt:folder)--> Node(nt:file) --> Node(jcr:content with a jcr:data property which holds the binary)
Or is it better to have something like
Node(nt:folder)--> Node(nt:unstructured with a jcr:message property which holds the string)
My third idea was to create a separate name space for text content
Node(nt:folder)--> Node(my:text with a jcr:message property which holds the string)
Node(nt:folder)--> Node(my:html with a jcr:message property which holds the string)
...
What do you think is the best solution?
It would be great to discuss this.
Storing text and html content as nt:file structures makes it visible via WebDAV and other tools that understand those structures. That can be useful depending on your application.
If you don't need this, you can just store your textual content as properties. In this case, using standard property names: jcr:title, jcr:description etc. as defined in the Standard Application Node Types section of the JSR-283 spec helps make things consistent.
See also http://wiki.apache.org/jackrabbit/DavidsModel which has some related recommendations.
I would store regular text in a string property, unless it's a large (multi-kilobyte) text. This is similar to VARCHAR in a relational database.
For really large texts that are not 'files', I would use a binary property (a stream). Such properties are stored in the DataStore, which is slower to write and access than a string property, but will not load the whole item in memory, and will only store the same data once. This is similar to BLOB / CLOB in a relational database.
For files, I would use nt:folder / nt:file. This is similar to a file in a file system.