Reading files with encoding OEM 850 in Swift - swift

I have to read text files in Swift/Cocoa, which are encoded as OEM 850. Does anybody know how to do this?

You can first read the file in as raw data and then convert that data to a string value according to your encoding. A small wrinkle in your case:
There are two types which represent the known string encodings, NSStringEncoding (String.Encoding in Swift) and CFStringEncoding. Apple only directly defines a subset of the known encodings as NSStringEncoding/String.Encoding values. The remaining known encodings have CFStringEncoding values and the function CFStringConvertEncodingToNSStringEncoding() is provided to map these to NSStringEncoding. Unfortunately for you OEM 850 is only directly provided by CFStringEncoding...
That sounds worse than it is. In Objective-C you can get the encoding you require using:
NSStringEncoding dosLatin1 = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingDOSLatin1);
Note: “DOS Latin 1” is one of the names for the same encoding that “OEM 850” refers to (see Wikipedia for a list), and it is the name Apple chose, hence kCFStringEncodingDOSLatin1.
In Swift this is messier:
let dosLatin1 = String.Encoding(rawValue: CFStringConvertEncodingToNSStringEncoding(CFStringEncoding(CFStringEncodings.dosLatin1.rawValue)))
Once you have the encoding the rest is straightforward. Without error checking, an outline is:
let sourceURL = ...
let rawData = try Data(contentsOf: sourceURL)
let convertedString = String(data: rawData, encoding: dosLatin1)
In real code you must check that both the file read and the conversion succeed. Reading raw data from a URL in Swift throws if the read fails, and converting the data to a string produces an optional (String?) because the conversion may fail.
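As a minimal sketch of the error-checked version described above, the read and the conversion can be wrapped in one throwing function. The names OEM850ReadError and readOEM850File(at:) are hypothetical and only used for illustration; the sketch assumes an Apple platform where the Core Foundation encoding constants are visible through Foundation.

```swift
import Foundation

// Hypothetical error type for this sketch; it is not part of Foundation.
enum OEM850ReadError: Error {
    case notDecodable
}

// Map Core Foundation's "DOS Latin 1" (OEM 850) constant to a String.Encoding.
let dosLatin1 = String.Encoding(
    rawValue: CFStringConvertEncodingToNSStringEncoding(
        CFStringEncoding(CFStringEncodings.dosLatin1.rawValue)))

// Reads the file at `url` and decodes its bytes as OEM 850.
// Throws if the read fails or if the bytes cannot be decoded.
func readOEM850File(at url: URL) throws -> String {
    let rawData = try Data(contentsOf: url)
    guard let text = String(data: rawData, encoding: dosLatin1) else {
        throw OEM850ReadError.notDecodable
    }
    return text
}
```

A caller would typically wrap readOEM850File(at:) in do/catch so that a missing file and an undecodable file can be reported separately.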
HTH

Related

How to match a letter with accent (Hex code format) and its similar English letter in Swift?

I am creating an app which needs to retrieve player info from an XML file. I've basically finished the parsing part, but I have run into this problem.
An example of the format is:
<P id="4336" f="Luka" s="MODRI&#x106;" d="1985-09-09" h="174" w="65" i="4336.png"/>
(My logic is to compare the first name and last name from the data with the user input and pull the matching data out of the XML file.)
I need to retrieve the data when the user inputs "Luka Modric", but I know it is hard to compare "MODRIC" with "MODRI&#x106;" or "MODRIĆ". Is there a better way to achieve my goal? Thanks.
Assuming you have already handled converting the HTML escape sequence (&#x106;) to a string, you can transform the diacritics in the player name to Latin characters and compare based on those:
let playerName = "Luka Modrić"
let normalizedPlayerName = playerName.folding(options: .diacriticInsensitive, locale: nil)
let isLukaModric = normalizedPlayerName.caseInsensitiveCompare("luka modric")
print(normalizedPlayerName, isLukaModric.rawValue)
The above code prints "Luka Modric 0", 0 standing for equal strings.
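The same comparison can also be done in one step with string compare options instead of folding first. This is only a sketch; the helper name namesMatch is hypothetical.

```swift
import Foundation

// Returns true when the two names are equal once case and diacritics are ignored.
func namesMatch(_ lhs: String, _ rhs: String) -> Bool {
    return lhs.compare(rhs, options: [.caseInsensitive, .diacriticInsensitive]) == .orderedSame
}

// Compare the name stored in the XML with what the user typed.
let found = namesMatch("Luka MODRIĆ", "Luka Modric")
print(found)
```

This avoids creating a normalized copy of every player name when only a yes/no answer is needed.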

Why is MD5 hashing so hard and in Swift 3?

Ok, so every now and then you come across problems that you've solved before using various frameworks and libraries and whatnot found on the internet and your problem is solved relatively quick and easy and you also learn why your problem was a problem in the first place.
However, sometimes you come across problems that make absolute 0 sense, and even worse when the solutions make negative sense.
My problem is that I want to take Data and make an MD5 hash out of it.
I find all kinds of solutions but none of them work.
What's really bugging me is how unnecessarily complicated the solutions seem to be for a task as trivial as getting an MD5 hash out of anything.
I am trying to use the Crypto and CommonCrypto frameworks by Soffes and they seem fairly easy, right? Right?
Yes!
But why am I still getting the error fatal error: unexpectedly found nil while unwrapping an Optional value?
From what I understand, the data returned by myData.md5 in Soffes' Crypto extension seems to be "optional". But why?
The code I am trying to execute is:
print(" md5 result: " + String(data: myData.md5, encoding: .utf8)!)
where myData has data in it 100% because after the above line of code, I send that data to a server, and the data exists.
On top of that, printing the count of myData.md5.count by print(String(myData.md5.count)) works perfectly.
So my question is basically: How do I MD5 hash a Data and print it as a string?
Edit:
What I have tried
That works
MD5:ing the string test in a PHP script gives me 098f6bcd4621d373cade4e832627b4f6
and the Swift code "test".md5() also gives me 098f6bcd4621d373cade4e832627b4f6
That doesn't work
Converting the UInt8 byte array from Data.md5() to a string that represents the correct MD5 value.
The different tests I've done are the following:
var hash = ""
for byte in myData.data.md5() {
hash += String(format: "%02x", byte)
}
print("loop = " + hash) //test 1
print("myData.md5().toHexString() = " + myData.md5().toHexString()) //test 2
print("CryptoSwift.Digest.md5([UInt8](myData)) = " + CryptoSwift.Digest.md5([UInt8](myData)).toHexString()) //test 3
All three tests with the 500 byte test data give me the MD5 value 56f6955d148ad6b6abbc9088b4ae334d
while my PHP script gives me 6081d190b3ec6de47a74d34f6316ac6b
Test Sample (64 bytes):
Raw data:
FFD8FFE0 00104A46 49460001 01010048 00480000 FFE13572 45786966 00004D4D
002A0000 0008000B 01060003 00000001 00020000 010F0002 00000012 00000092
Test 1, 2 and 3 MD5: 7f0a012239d9fde5a46071640d2d8c83
PHP MD5: 06eb0c71d8839a4ac91ee42c129b8ba3
PHP Code: echo md5($_FILES["file"]["tmp_name"])
The simple answer to your question is:
String(data: someData, encoding: .utf8)
returns nil if someData is not properly UTF8 encoded data. If you try to unwrap nil like this:
String(data: someData, encoding: .utf8)!
you get:
fatal error: unexpectedly found nil while unwrapping an Optional value
So at its core, this has nothing to do with hashing or crypto.
Both the input and the output of MD5 (or any hash algorithm, for that matter) are binary data, not text or strings. The output of MD5 is therefore not UTF-8 encoded data, which is why the String initializer above fails.
If you want to display binary data in your console, you need to convert it to a readable representation. The most common ones are hexadecimal digits or Base 64 encoding.
Note: Some crypto libraries allow you to feed a string into their hash functions. They silently convert the string to a binary representation using some character encoding. If the encodings do not match, the hash values will not match across systems and programming languages. So you had better understand what they really do in the background.
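To illustrate the two readable representations mentioned above, here is a small sketch that uses the first four bytes of the question's test sample. It only relies on Foundation; the variable names are chosen for illustration.

```swift
import Foundation

// First four bytes of the question's 64-byte test sample (FFD8FFE0).
let sample = Data([0xFF, 0xD8, 0xFF, 0xE0])

// These bytes are not valid UTF-8, so the initializer returns nil instead of a string.
let asUTF8 = String(data: sample, encoding: .utf8)

// Hexadecimal: two lowercase hex digits per byte.
let hex = sample.map { String(format: "%02x", $0) }.joined()

// Base64: built into Data.
let base64 = sample.base64EncodedString()

print(asUTF8 ?? "not valid UTF-8", hex, base64)
```

The same two conversions apply unchanged to the 16 bytes produced by an MD5 function.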
I use a library called 'CryptoSwift' for creating hashes, as well as encrypting data before I send it/store it. It's very easy to use.
It can be found here https://github.com/krzyzanowskim/CryptoSwift and you can even install it with CocoaPods by adding pod 'CryptoSwift' to your podfile.
Once installed, hashing a Data object is as simple as calling Data.md5()! It really is that easy. It also supports other hashing algorithms such as SHA.
You can then just print the MD5 object and CryptoSwift will convert it to a String for you.
The full docs on creating digests can be found here: https://github.com/krzyzanowskim/CryptoSwift#calculate-digest
Thanks to Jacob King I tried a much simpler MD5 framework called CryptoSwift.
The user Codo inspired me to look deeper in to my PHP script as he suggested that I am not in fact hashing the content of my data, but instead the filename, which is correct.
The original question, however, was not about which framework to use, nor about why my app and my PHP script return different MD5 values.
The question was originally about why I get the error
fatal error: unexpectedly found nil while unwrapping an Optional value
at the line of code saying
print(" md5 result: " + String(data: myData.md5, encoding: .utf8)!)
So the answer is that I should not try to convert the 16-byte output of md5() into a UTF-8 string, but instead call toHexString() on that output.
So the proper line of code should look like the following:
print("md5 result: " + myData.md5().toHexString())
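For comparison, here is a sketch of the same hex output without a third-party library. It is not what the answer above uses: it swaps in Apple's CryptoKit (Insecure.MD5) and therefore assumes a platform where CryptoKit is available, such as macOS 10.15 or iOS 13 and later. The helper name md5Hex is hypothetical.

```swift
import Foundation
import CryptoKit

// Hashes `data` with MD5 and returns the digest as lowercase hexadecimal.
func md5Hex(of data: Data) -> String {
    let digest = Insecure.MD5.hash(data: data)
    return digest.map { String(format: "%02x", $0) }.joined()
}

// Hashing the UTF-8 bytes of "test" should match the value quoted in the question.
print("md5 result: " + md5Hex(of: Data("test".utf8)))
// prints "md5 result: 098f6bcd4621d373cade4e832627b4f6"
```

Note that the string is converted to bytes explicitly with .utf8 before hashing, so there is no hidden character-encoding step.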
BONUS
My PHP script now contains the following code:
move_uploaded_file($_FILES["file"]["tmp_name"], $target_dir); //save data to disk
$md5_of_data = md5_file ($target_dir); //get MD5 of saved data
BONUS-BONUS
The problem and solution is part of a small framework called AssetManager that I'm working on, which can be found here: https://github.com/aidv/AssetManager

NSData encoding

Currently, I'm trying to parse an NSData in my iOS app. The problem is that I can't seem to find a proper Hebrew encoding for parsing. I must decode the data using the Windows-1255 encoding (the Hebrew encoding for Windows) or the ISO 8859-8 encoding, or I'll get plain gibberish. The closest I've got to solving the issue was using
CFStringConvertEncodingToNSStringEncoding(CFStringEncodings.ISOLatinHebrew)
yet it throws 'CFStringEncodings' is not convertible to 'CFStringEncoding' (notice Encodings vs Encoding).
What can I do in order to encode the data correctly?
Thanks!
The problem is that CFStringEncodings is an enumeration based on CFIndex
(which in turn is a type alias for Int), whereas CFStringEncoding is a type
alias for UInt32. Therefore you have to convert the .ISOLatinHebrew
value explicitly to a CFStringEncoding:
let cfEnc = CFStringEncodings.ISOLatinHebrew
let enc = CFStringConvertEncodingToNSStringEncoding(CFStringEncoding(cfEnc.rawValue))
Turns out I needed to get my hands a bit dirty.
I saw that CFStringEncodings is related to the file CFStringEncodingExt.h, so I searched that file for help. There I came across a huge CF_ENUM that included exactly what I needed: all of the CFStringEncodings with their UInt32 values!
It turned out that kCFStringEncodingISOLatinHebrew = 0x0208, /* ISO 8859-8 */
I encourage everyone who is facing this encoding issue to open that file and search for the encoding they need.
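Putting the two answers together, a minimal sketch of the decoding step in current Swift could look like this. It uses the raw value 0x0208 quoted above, assumes an Apple platform, and the sample bytes and variable names are made up for illustration.

```swift
import Foundation

// 0x0208 is kCFStringEncodingISOLatinHebrew (ISO 8859-8), as listed in the header.
let isoLatinHebrew = String.Encoding(
    rawValue: CFStringConvertEncodingToNSStringEncoding(CFStringEncoding(0x0208)))

// Hypothetical sample data: the ISO 8859-8 bytes for the word "שלום".
let hebrewData = Data([0xF9, 0xEC, 0xE5, 0xED])

// The conversion returns nil if the bytes cannot be decoded with this encoding.
if let text = String(data: hebrewData, encoding: isoLatinHebrew) {
    print(text)
} else {
    print("could not decode data as ISO 8859-8")
}
```

For Windows-1255 the same pattern applies with that encoding's constant from the same header.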

Encoding in Pig

When I load data that contains certain characters (for example À, ° and others) with Pig Latin and store it in a .txt file, these symbols are displayed in the file as � and ï characters. That happens because of the UTF-8 substitution character.
I would like to ask whether it is possible to avoid this somehow, maybe with some Pig commands, so that the result (in the txt file) contains, for example, À instead of �?
In Pig we have built-in dynamic invokers that allow a Pig programmer to refer to Java functions without having to wrap them in custom Pig UDFs. So you can load the data as UTF-8 encoded strings, decode it, perform all your operations on it, and then store it back as UTF-8. I guess this should work for the first part:
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
The java code responsible for doing this is:
import java.io.IOException;
import java.net.URLDecoder;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UrlDecode extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // First field: the encoded text; second field: the charset name, e.g. "UTF-8".
        String encoded = (String) input.get(0);
        String encoding = (String) input.get(1);
        return URLDecoder.decode(encoded, encoding);
    }
}
Now modify this code to return UTF-8 encoded strings from normal strings and store it to your text file. Hope it works.
You are correct: this happens because Text (http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/io/Text.html) converts incoming data (bytes) to UTF-8 automatically. To avoid this you should not work with Text.
That means you should use the bytearray type instead of chararray (bytearray does not use Text, so no conversion is done). Since you didn't post any code, I'll provide an example for illustration:
this is what (likely) you did:
converted_to_utf = LOAD 'strangeEncodingdata' using TextLoader AS (line:chararray);
this is what you wanted to do:
no_conversion = LOAD 'strangeEncodingdata' using TextLoader AS (line:bytearray);

How to determine string encoding in cocoa?

Recently I've been working on a radio player. Sometimes the ID3 tag text is garbled.
Here is my code:
CFDictionaryRef audioInfoDictionary;
UInt32 size = sizeof(audioInfoDictionary);
result = AudioFileGetProperty(fileID, kAudioFilePropertyInfoDictionary, &size, &audioInfoDictionary);
The ID3 info is in audioInfoDictionary. Sometimes the ID3 tag doesn't use UTF-8 encoding, and the title and artist name are garbled.
Is there any way to determine what encoding a string uses?
Special thx!
While it's an NSString object, there's no specific encoding since it's guaranteed to represent whatever is put into it using the encoding determined when it was created. See the Working With Encodings section of the docs.
From where are you getting the ID3 tags? The time you "receive" this information is the best time to determine its encoding. See Creating and Initializing Strings and the next few sections (for file and url creation) for a list of initializers. Some of them let you set the encoding and others pass back (by reference) the "best guess" encoding the system determined when creating the string. Look for methods with "usedEncoding:" for the system's reported guess.
All of this really depends on exactly what is handing you that string. Are you reading it from a file (an MP3) or a web service (Internet Radio)? If the latter, the server's response should include the encoding and if that's wrong, there's not much to do but guess.
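As an illustration of the "usedEncoding" family of initializers mentioned above, here is a minimal Swift sketch that reads a text file and reports which encoding the system guessed. The function name readTextGuessingEncoding is hypothetical, and the guess is only a best effort: it does not recover text that was already decoded with the wrong encoding.

```swift
import Foundation

// Reads a text file and returns both the text and the encoding Foundation guessed.
// Throws if the file cannot be read or no encoding could be determined.
func readTextGuessingEncoding(at url: URL) throws -> (text: String, encoding: String.Encoding) {
    // Initial value only; it is overwritten with the detected encoding.
    var usedEncoding = String.Encoding.utf8
    let text = try String(contentsOf: url, usedEncoding: &usedEncoding)
    return (text, usedEncoding)
}
```

A caller can log the returned encoding to see what the system decided, which helps when deciding whether raw bytes need to be re-decoded with an explicit encoding instead.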