AWS Polly - Highlighting special characters - swift

I am using the AWS Polly service for text to speech. But if the text contains special characters, it returns the wrong start and end values.
For example, if the text is "Böylelikle", it returns: {"time":6,"type":"word","start":0,"end":11,"value":"Böylelikle"}
But it should start at 0 and end at 10.
I've searched the AWS documentation, and it says that the start and end values are offsets in bytes, not characters.
My question is: how can I convert these byte offsets to character offsets?
My code is:
builder.continueOnSuccessWith { (awsTask: AWSTask<NSURL>) -> Any? in
if builder.error == nil {
if let url = awsTask.result {
do {
let txtData = try Data(contentsOf: url as URL)
if let txtString = String(data: txtData, encoding: .utf8) {
let lines = txtString.components(separatedBy: .newlines)
for line in lines {
let jsonData = Data(line.utf8)
let pollyVoiceSentence = try JSONDecoder().decode(PollyVoiceSentence.self, from: jsonData)
voiceSentences.append(pollyVoiceSentence)
}
}
} catch {
print("Could not parse TXT file")
}
}
} else {
print("ParseJSON: \(builder.error!)")
}
completionHandler(voiceSentences)
return nil
}
And to highlight words:
let start = pollyVoiceSentence.start
var end = pollyVoiceSentence.end
let voiceRange = NSRange(location: start, length: end - start)
print("RANGE: \(voiceRange) - Word: \(pollyVoiceSentence.value)")
Thanks.

It looks like they are giving you String.utf8.count for the word. Swift strings are Unicode-aware, and not every character fits in a single UTF-8 byte: "ö", for example, takes two bytes, so byte offsets and character offsets drift apart.
You can read the official docs here -
String and Characters
There are a ton of useful details there. I would like to highlight the following for your use case -
Here's how it looks for your input as well -
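As a rough illustration of the difference for the word in the question (assuming the "ö" is the single precomposed scalar U+00F6, which matches the byte length Polly reports), the three counts can be compared like this:

```swift
// \u{00F6} is "ö" as one precomposed scalar (an assumption about the input text).
let word = "B\u{00F6}ylelikle"

print(word.count)        // 10 Characters (what a reader would count)
print(word.utf8.count)   // 11 UTF-8 bytes ("ö" takes two bytes)
print(word.utf16.count)  // 10 UTF-16 code units (what NSRange counts)
```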
What you can do in your case is -
Decode the PollyVoiceSentence the way you are today.
Create an extension on PollyVoiceSentence to account for this char count issue.
Iterate/account for all words in a sentence, because each previous word's char-count now affects start for all the subsequent words.
And you can't trust the start & end provided by the json, because it clearly doesn't fit best with Swift's String API.
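One possible alternative to recounting word by word is to map the byte offsets through the string's utf8 view. This is only a sketch: it assumes that start and end are UTF-8 byte offsets into exactly the text that was sent to Polly (as the documentation quoted in the question suggests), and the helper name nsRange(forUTF8Start:end:in:) is made up for this example.

```swift
import Foundation

/// Hypothetical helper: converts a UTF-8 byte range into an NSRange
/// (which counts UTF-16 code units) for the same text.
/// Returns nil when the offsets do not fit inside the text.
func nsRange(forUTF8Start start: Int, end: Int, in text: String) -> NSRange? {
    let utf8 = text.utf8
    guard start >= 0, start <= end, end <= utf8.count else { return nil }
    // Walk the UTF-8 view by byte offset to get String.Index values.
    let lower = utf8.index(utf8.startIndex, offsetBy: start)
    let upper = utf8.index(utf8.startIndex, offsetBy: end)
    // NSRange(_:in:) translates the indices into UTF-16 offsets.
    return NSRange(lower..<upper, in: text)
}

// "ö" written as the precomposed scalar U+00F6 (an assumption about the input).
let spokenText = "B\u{00F6}ylelikle geldi"
if let range = nsRange(forUTF8Start: 0, end: 11, in: spokenText) {
    print(range.location, range.length) // 0 10
}
```

The resulting NSRange can be used in place of NSRange(location: start, length: end - start) when highlighting, since it is expressed in the units that NSAttributedString expects.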

Related

Decoding strings including utf8-literals like '\xc3\xa6' in Swift?

Follow up question to my former thread about UTF-8 literals:
It was established that you can decode UTF-8 literals from string like this that exclusively includes UTF-8 literals:
let s = "\\xc3\\xa6"
let bytes = s
.components(separatedBy: "\\x")
// components(separatedBy:) would produce an empty string as the first element
// because the string starts with "\x". We drop this
.dropFirst()
.compactMap { UInt8($0, radix: 16) }
if let decoded = String(bytes: bytes, encoding: .utf8) {
print(decoded)
} else {
print("The UTF8 sequence was invalid!")
}
However this only works if the string only contains UTF-8 literals. As I am fetching a Wi-Fi list of names that has these UTF-8 literals within, how do I go about decoding the entire string?
Example:
let s = "This is a WiFi Name \\xc3\\xa6 including UTF-8 literals \\xc3\\xb8"
With the expected result:
print(s)
> This is a WiFi Name æ including UTF-8 literals ø
In Python there is a simple solution to this:
contents = source_file.read()
uni = contents.decode('unicode-escape')
enc = uni.encode('latin1')
dec = enc.decode('utf-8')
Is there a similar way to decode these strings in Swift 5?
To start with add the decoding code into a String extension as a computed property (or create a function)
extension String {
var decodeUTF8: String {
let bytes = self.components(separatedBy: "\\x")
.dropFirst()
.compactMap { UInt8($0, radix: 16) }
return String(bytes: bytes, encoding: .utf8) ?? self
}
}
Then use a regular expression in a while loop to replace every match (string must be declared with var, and this pattern only covers two-byte sequences):
while let range = string.range(of: #"(\\x[a-f0-9]{2}){2}"#, options: [.regularExpression, .caseInsensitive]) {
string.replaceSubrange(range, with: String(string[range]).decodeUTF8)
}
As far as I know there's no native Swift solution to this. To make it look as compact as the Python version at the call site you can build an extension on String to hide the complexity
extension String {
func replacingUtf8Literals() -> Self {
let regex = #"(\\x[a-fA-F0-9]{2})+"#
var str = self
while let range = str.range(of: regex, options: .regularExpression) {
let literalbytes = str[range]
.components(separatedBy: "\\x")
.dropFirst()
.compactMap{UInt8($0, radix: 16)}
guard let actuals = String(bytes: literalbytes, encoding: .utf8) else {
fatalError("Regex error")
}
str.replaceSubrange(range, with: actuals)
}
return str
}
}
This lets you call
print(s.replacingUtf8Literals())
//prints: This is a WiFi Name æ including UTF-8 literals ø
For convenience I'm trapping a failed conversion with fatalError. You may want to handle this in a better way in production code (a failure means the matched bytes are not valid UTF-8). There needs to be some form of break or thrown error here, otherwise you have an infinite loop.
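If trapping is too drastic, one way to avoid both the fatalError and the infinite loop is to collect all matches once and replace them from the end of the string backwards, leaving any run that is not valid UTF-8 untouched. This is a sketch, not a drop-in replacement for the code above; the method name decodingHexEscapes() is invented for the example.

```swift
import Foundation

extension String {
    /// Hypothetical helper: decodes runs of "\xNN" escapes as UTF-8.
    /// Runs that are not valid UTF-8 are left exactly as they were.
    func decodingHexEscapes() -> String {
        let pattern = #"(?:\\x[0-9a-fA-F]{2})+"#
        guard let regex = try? NSRegularExpression(pattern: pattern) else { return self }
        var result = self
        let whole = NSRange(startIndex..<endIndex, in: self)
        // Replace from the last match to the first so earlier ranges stay valid.
        for match in regex.matches(in: self, range: whole).reversed() {
            guard let range = Range(match.range, in: result) else { continue }
            let bytes = result[range]
                .components(separatedBy: "\\x")
                .dropFirst()
                .compactMap { UInt8($0, radix: 16) }
            if let decoded = String(bytes: bytes, encoding: .utf8) {
                result.replaceSubrange(range, with: decoded)
            }
        }
        return result
    }
}

let wifiName = "This is a WiFi Name \\xc3\\xa6 including UTF-8 literals \\xc3\\xb8"
print(wifiName.decodingHexEscapes())
// This is a WiFi Name æ including UTF-8 literals ø
```

Because each match is visited exactly once, a sequence that fails to decode cannot be matched again, so the loop always terminates.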

Check or validation Persian(Farsi) string swift

I searched web pages and Stack Overflow for how to validate a Persian (Farsi) string. Most answers only mention Arabic letters. I also want to know whether my string is entirely Persian (that is, contains nothing else).
for example, these strings are Persian:
"چهار راه"
"خیابان."
And These are not:
"خیابان 5"
"چرا copy کردی؟"
Also, only Persian or Arabic digits are allowed. The characters [.,-!] are exceptions (because keyboards do not support these characters in Persian).
UPDATE:
I explained a swift version of using regex and predicate in my answer.
Based on this extension found elsewhere:
extension String {
func matches(_ regex: String) -> Bool {
return self.range(of: regex, options: .regularExpression, range: nil, locale: nil) != nil
}
}
and construct your regex containing allowed characters like
let mystra = "چهار راه"
let mystrb = "خیابان."
let mystrc = "خیابان 5"
let mystrd = "چرا copy کردی؟" //and so on
for a in mystra {
if String(a).matches("[\u{600}-\u{6FF}\u{064b}\u{064d}\u{064c}\u{064e}\u{064f}\u{0650}\u{0651}\u{0020}]") { // add the Unicode values for dot, comma, and other needed punctuation marks; for now I added space
} else { // not in range
print("oh no--\(a)---zzzz")
break // or return false
}
}
Make sure you construct the Unicode needed using the above model.
Result for other strings
for a in mystrb ... etc
oh no--.---zzzz
oh no--5---zzzz
oh no--c---zzzz
Enjoy
After a while I found a better way:
extension String {
var isPersian: Bool {
let predicate = NSPredicate(format: "SELF MATCHES %@",
"([-.]*\\s*[-.]*\\p{Arabic}*[-.]*\\s*)*[-.]*")
return predicate.evaluate(with: self)
}
}
and you can use like this:
print("yourString".isPersian) //response: true or false
The key is combining a regex with a predicate. These links help you adapt the pattern to whatever you need:
https://nshipster.com/nspredicate/
https://nspredicate.xyz/
http://userguide.icu-project.org/strings/regexp
Feel free and ask whatever question about this topic :D
[EDIT] The following regex can be used to accept Latin numerics, as they are mostly accepted in Persian texts
"([-.]*\\s*[-.]*\\p{Arabic}*[0-9]*[-.]*\\s*)*[-.]*"
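For comparison, the rules from the question can also be sketched without a regex by checking each Unicode scalar. This is an illustrative variant under stated assumptions: the property name containsOnlyPersianText is made up, "Persian" is approximated by the Arabic block U+0600 to U+06FF (which also covers Arabic-Indic and Persian digits), and characters outside that block such as the zero-width non-joiner are rejected unless added to the allowed set.

```swift
import Foundation

extension String {
    /// Hypothetical check: true when every scalar is in the Arabic block
    /// (U+0600...U+06FF) or is one of the explicitly allowed extras.
    var containsOnlyPersianText: Bool {
        // Space plus the punctuation exceptions listed in the question.
        let extraAllowed: Set<Unicode.Scalar> = [" ", ".", ",", "-", "!"]
        return !isEmpty && unicodeScalars.allSatisfy { scalar in
            (0x0600...0x06FF).contains(scalar.value) || extraAllowed.contains(scalar)
        }
    }
}

print("copy".containsOnlyPersianText) // false
```

Unlike the regex version, this makes the allowed set easy to extend one character at a time.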

Convert Swift String to wchar_t

For context: I'm trying to use the very handy LibXL. I've used it with success in Obj-C and C++ but am now trying to port over to Swift. In order to better support Unicode, I need to send all strings to the LibXL API as wchar_t*.
So, for this purpose I've cobbled together this code:
extension String {
///Function to convert a String into a wchar_t buffer.
///Don't forget to free the buffer!
var wideChar: UnsafeMutablePointer<wchar_t>? {
get {
guard let _cString = self.cString(using: .utf16) else {
return nil
}
let buffer = UnsafeMutablePointer<wchar_t>.allocate(capacity: _cString.count)
memcpy(buffer, _cString, _cString.count)
return buffer
}
}
}
The calls to LibXL appear to be working (printing the error messages returns 'Ok'), except when I try to actually write to a cell in a test spreadsheet. Then I get "can't write row 0 in trial version":
if let name = "John Doe".wideChar, let passKey = "mac-f.....lots of characters...3".wideChar {
xlBookSetKeyW(book, name, passKey)
print(">: " + String.init(cString: xlBookErrorMessageW(book)))
}
if let sheetName = "Output".wideChar, let path = savePath.wideChar, let test = "Hello".wideChar {
let sheet: SheetHandle = xlBookAddSheetW(book, sheetName, nil)
xlSheetWriteStrW(sheet, 0, 0, test, sectionTitleFormat)
print(">: " + String.init(cString: xlBookErrorMessageW(book)))
let success = xlBookSaveW(book, path)
dump(success)
print(">: " + String.init(cString: xlBookErrorMessageW(book)))
}
I'm presuming that my code for converting to wchar_t* is incorrect. Can someone point me in the right direction for that..?
ADDENDUM: Thanks to @MartinR for the answer. It appears that the block 'consumes' any pointers that are used in it. So, for example, when writing a string using
"Hello".withWideChars({ wCharacters in
xlSheetWriteStrW(newSheet, destRow, destColumn, wCharacters, aFormatHandle)
})
the aFormatHandle becomes invalid after the writeStr line executes and isn't reusable. It's necessary to create a new FormatHandle for each write command.
There are several problems here. First, String.cString(using:) does not work well with multi-byte encodings:
print("ABC".cString(using: .utf16)!)
// [65, 0] ???
Second, wchar_t contains UTF-32 code points (it is a 32-bit type on macOS), not UTF-16.
Finally, in
let buffer = UnsafeMutablePointer<wchar_t>.allocate(capacity: _cString.count)
memcpy(buffer, _cString, _cString.count)
the allocation size does not include the trailing null character, and the copy copies _cString.count bytes, not characters.
All that can be fixed, but I would suggest a different API (similar to the String.withCString(_:) method):
extension String {
/// Calls the given closure with a pointer to the contents of the string,
/// represented as a null-terminated wchar_t array.
func withWideChars<Result>(_ body: (UnsafePointer<wchar_t>) -> Result) -> Result {
let u32 = self.unicodeScalars.map { wchar_t(bitPattern: $0.value) } + [0]
return u32.withUnsafeBufferPointer { body($0.baseAddress!) }
}
}
which can then be used like
let name = "John Doe"
let passKey = "secret"
name.withWideChars { wname in
passKey.withWideChars { wpass in
xlBookSetKeyW(book, wname, wpass)
}
}
and the clean-up is automatic.
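A quick way to sanity-check the conversion without LibXL is to hand the pointer to a standard C function such as wcslen. This sketch repeats the extension so that it is self-contained, and it assumes a platform where wchar_t is a 32-bit signed integer (as on macOS and Linux).

```swift
import Foundation

extension String {
    /// Same idea as above: a temporary, null-terminated wchar_t array.
    func withWideChars<Result>(_ body: (UnsafePointer<wchar_t>) -> Result) -> Result {
        let u32 = self.unicodeScalars.map { wchar_t(bitPattern: $0.value) } + [0]
        return u32.withUnsafeBufferPointer { body($0.baseAddress!) }
    }
}

// wcslen counts wide characters up to the terminating null.
// \u{00F6} is "ö" as one scalar, so the word has 10 wide characters.
let wideLength = "B\u{00F6}ylelikle".withWideChars { wcslen($0) }
print(wideLength) // 10
```

The pointer is only valid inside the closure, so anything that needs the wide string has to be called there rather than stored for later.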

Using Swift to write a character to a file

I'm trying to write a Swift program that writes a single character to a file. I've researched this but so far haven't figured out how to do this (note, I'm new to Swift). Note that the text file I'm reading and writing to can contain a series of characters, one per line. I want to read the last character and update the file so it only contains that last character.
Here's what I have so far:
let will_file = "/Users/willf/Dropbox/foo.txt"
do {
let statusStr = try String(contentsOfFile: will_file, encoding: .utf8)
// find the last character in the string
var strIndex = statusStr.index(statusStr.endIndex, offsetBy: -1)
if statusStr[strIndex] == "\n" {
// I need to access the character just before the last \n
strIndex = statusStr.index(statusStr.endIndex, offsetBy: -2)
}
if statusStr[strIndex] == "y" {
print("yes")
} else if statusStr[strIndex] == "n" {
print("no")
} else {
// XXX deal with error here
print("The char isn't y or n")
}
// writing
// I get: cannot invoke 'write' with an argument list of type '(to: String)'
try statusStr[strIndex].write(to: will_file)
}
I would appreciate advice on how to write the character returned by statusStr[strIndex].
I will further point out that I have read this Read and write a String from text file but I am still confused as to how to write to a text file under my Dropbox folder. I was hoping that there was a write method that could take an absolute path as a string argument but I have not found any doc or code sample showing how to do this that will compile in Xcode 9.2. I have also tried the following code which will not compile:
let dir = FileManager.default.urls(for: .userDirectory, in: .userDomainMask).first
let fileURL = dir?.appendingPathComponent("willf/Dropbox/foo.txt")
// The compiler complains about extra argument 'atomically' in call
try statusStr[strIndex].write(to: fileURL, atomically: false, encoding: .utf8)
I have figured out how to write a character as a string to a file thanks to a couple answers on stack overflow. The key is to coerce a character type to a string type because the string object supports the write method I want to use. Note that I used both the answers in Read and write a String from text file and in Swift Converting Character to String to come up with the solution. Here is the Swift code:
import Cocoa
let will_file = "/Users/willf/Dropbox/foo.txt"
do {
// Read data from will_file into String object
let statusStr = try String(contentsOfFile: will_file, encoding: .utf8)
// find the last character in the string
var strIndex = statusStr.index(statusStr.endIndex, offsetBy: -1)
if statusStr[strIndex] == "\n" {
// I need to access the character just before the last \n
strIndex = statusStr.index(statusStr.endIndex, offsetBy: -2)
}
if statusStr[strIndex] != "n" && statusStr[strIndex] != "y" {
// XXX deal with error here
print("The char isn't y or n")
}
// Update file so it contains only the last status char
do {
// String(statusStr[strIndex]) coerces the statusStr[strIndex] character to a string for writing
try String(statusStr[strIndex]).write(toFile: will_file, atomically: false, encoding: .utf8)
} catch {
print("There was a write error")
}
} catch {
print("there is an error!")
}
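The index arithmetic above crashes on an empty file and only skips a single trailing newline. A slightly more defensive variant, sketched here with a made-up function name keepOnlyLastCharacter(atPath:), looks for the last character that is not a newline and writes it back as a String:

```swift
import Foundation

/// Hypothetical helper: rewrites the file so that it contains only its last
/// non-newline character. Returns that character, or nil if the file has none
/// (in which case the file is left untouched).
@discardableResult
func keepOnlyLastCharacter(atPath path: String) throws -> Character? {
    let contents = try String(contentsOfFile: path, encoding: .utf8)
    // last(where:) scans from the end, so any number of trailing newlines is skipped.
    guard let last = contents.last(where: { !$0.isNewline }) else { return nil }
    // Character has no write method, so wrap it in a String first.
    try String(last).write(toFile: path, atomically: true, encoding: .utf8)
    return last
}
```

It can be called with the same absolute path string as above, inside a do/catch block, and the returned character can then be compared against "y" or "n".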

How to search array using unknown characters - Swift 3 for Mac

I am looking for a way to search an array of strings (containing filenames with extensions) for dots: if a string has the form characters, a dot, characters, print it. To do that I think I have to use something like wildcards (*.*).
So I tried this :
let testString = "*.*"
if Array[x].contains(testString)
{
print (Array[x])
}
or
if Array[x].range(of:testString) != nil
{
print (Array[x])
}
But it does not work. I guess I have to declare it differently but I don't know how and I have not found the right example.
Could someone show some examples? Thank you.
Using this helper method on String:
extension String {
func contains(regex: NSRegularExpression) -> Bool {
let length = self.utf16.count // NSRanges are UTF-16 based!
let wholeString = NSRange(location: 0, length: length)
let matchCount = regex.numberOfMatches(in: self, range: wholeString)
return matchCount > 0
}
}
Then try this:
let fileNameWithExtension = try! NSRegularExpression(pattern: "\\w+[.]\\w+")
if Array[x].contains(regex: fileNameWithExtension) {
print(Array[x])
}
You may need to tweak my pattern above in order to match all cases you have in mind. This NSRegularExpression cheat sheet might help you there ;-)
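If the goal is to collect every matching name rather than test one element at a time, the same idea can be sketched with filter and range(of:options:). The function name and the pattern here are illustrative assumptions: the pattern requires at least one character on each side of a dot, which is one reading of "characters, a dot, characters".

```swift
import Foundation

/// Hypothetical helper: keeps the names that look like "something.something".
func namesWithExtension(_ names: [String]) -> [String] {
    // ^.+   at least one character
    // \.    a literal dot
    // .+$   at least one more character
    let pattern = #"^.+\..+$"#
    return names.filter { $0.range(of: pattern, options: .regularExpression) != nil }
}

let files = ["report.pdf", "README", "photo.jpeg", ".hidden"]
print(namesWithExtension(files)) // ["report.pdf", "photo.jpeg"]
```

Wildcards such as `*.*` are shell glob syntax, which is why passing them to contains or range(of:) without the regular-expression option only searches for those literal characters.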