Carriage return character not being matched in Swift - swift

I'm trying to parse a file that (apparently) ends its lines with carriage returns, but they aren't being matched as such in Swift, despite having the same UTF8 value. I can see possible fixes for the problem, but I'm curious as to what these characters actually are.
Here's some sample code, with the output below. (CR is set using Character("\r"), although I've tried it using "\r" as well.
try f.forEach() { c in
print(c, terminator:" ") // DBG
if (c == "\r") {
print("Carriage return found!")
}
print(String(c).utf8.first!, terminator:" ")//DBG
print(String(describing:pstate)) // DBG
...
case .field:
switch c {
case CR,LF :
self.endline()
pstate = .eol
When it reaches the end of line (which shows up as such in my text editors), I get this:
. 46 field
0 48 field
13 field
I 73 field
It doesn't seem to be matching using == or in the switch statement. Is there another approach I should be using for this character?
(I'll note that the parsing works fine with files that terminate in newlines.)

I determined what the problem was. By looking at c.unicodeScalars I discovered that the end of line character was in fact "\r\n", not just "\r". As seen in my code I was only taking the first when printing it out as UTF-8. I don't know if that's something from String.forEach or in the file itself.
I know that there are tests to determine if something is a newline. Swift 5 has them directly (c.isNewline), and there is also the CharacterSet approach as noted by Bill Nattaner.
I'm happier with something that will work in my switch statement (and thus I'll define each one explicitly), but that might change if I expect to deal with a wider variety of files.

I'm a little hazy as to what the f.forEach represents, but if your variable c is of type Character then you could replace your if statement with:
if "\(c)".rangeOfCharacter( from: CharacterSet.newlines ) != nil
{
print("Carriage return found!")
}
That way you won't have to invent a list of all-possible new line characters.

Related

Converting numbers into timestamps (inserting colons at specific places)

I'm using AutoHotkey for this as the code is the most understandable to me. So I have a document with numbers and text, for example like this
120344 text text text
234000 text text
and the desired output is
12:03:44 text text text
23:40:00 text text
I'm sure StrReplace can be used to insert the colons in, but I'm not sure how to specify the position of the colons or ask AHK to 'find' specific strings of 6 digit numbers. Before, I would have highlighted the text I want to apply StrReplace to and then press a hotkey, but I was wondering if there is a more efficient way to do this that doesn't need my interaction. Even just pointing to the relevant functions I would need to look into to do this would be helpful! Thanks so much, I'm still very new to programming.
hfontanez's answer was very helpful in figuring out that for this problem, I had to use a loop and substring function. I'm sure there are much less messy ways to write this code, but this is the final version of what worked for my purposes:
Loop, read, C:\[location of input file]
{
{ If A_LoopReadLine = ;
Continue ; this part is to ignore the blank lines in the file
}
{
one := A_LoopReadLine
x := SubStr(one, 1, 2)
y := SubStr(one, 3, 2)
z := SubStr(one, 5)
two := x . ":" . y . ":" . z
FileAppend, %two%`r`n, C:\[location of output file]
}
}
return
Assuming that the "timestamp" component is always 6 characters long and always at the beginning of the string, this solution should work just fine.
String test = "012345 test test test";
test = test.substring(0, 2) + ":" + test.substring(2, 4) + ":" + test.substring(4, test.length());
This outputs 01:23:45 test test test
Why? Because you are temporarily creating a String object that it's two characters long and then you insert the colon before taking the next pair. Lastly, you append the rest of the String and assign it to whichever String variable you want. Remember, the substring method doesn't modify the String object you are calling the method on. This method returns a "new" String object. Therefore, the variable test is unmodified until the assignment operation kicks in at the end.
Alternatively, you can use a StringBuilder and append each component like this:
StringBuilder sbuff = new StringBuilder();
sbuff.append(test.substring(0,2));
sbuff.append(":");
sbuff.append(test.substring(2,4));
sbuff.append(":");
sbuff.append(test.substring(4,test.length()));
test = sbuff.toString();
You could also use a "fancy" loop to do this, but I think for something this simple, looping is just overkill. Oh, I almost forgot, this should work with both of your test strings because after the last colon insert, the code takes the substring from index position 4 all the way to the end of the string indiscriminately.

How to get non-escaped apostrophe from .components(separatedBy: CharacterSet)

How I can get components(separatedBy: CharacterSet) to return the substrings so that they do not contain escaped apostrophes or single quotes?
When I print the resulting array, I want it to not include the backslash character.
I am using a playground to manipulate text and produce output in the terminal that I can copy and use outside of Xcode, so I want to strip the escape character from the string representation produced in the terminal output.
var str = "can't,,, won't, , good-bye, Santa Claus"
var delimiters = CharacterSet.letters.inverted.subtracting(.whitespaces)
delimiters = delimiters.subtracting(CharacterSet(charactersIn: "-"))
delimiters = delimiters.subtracting(CharacterSet(charactersIn: "'"))
var result = str.components(separatedBy: delimiters)
.map({ $0.trimmingCharacters(in: .whitespaces) })
.filter({ !$0.isEmpty })
print(result) // ["can\'t", "won\'t", "good-bye", "Santa Claus"]
What you are asking for is a metaphysical impossibility. You cannot want anything about how print prints. It's only a representation in the log.
Your strings do not actually contain any backslashes, so what's the problem? How the print command output notates them is irrelevant. You might as well "want" the print command to translate your strings into French. No, that's not what it does. It just prints, and the way it prints is the way it prints.
Another way to look at it: An array doesn't contain square brackets at both ends. And a string doesn't contain double-quotes at both ends. Those are things you might write in order express those things as literals, but they are not real as part of the actual object. Well, I don't see you objecting to those!
Basically, if you want to control the output of something, you write an output routine. If you're doing to rely on print, just accept the funny old way it writes stuff and move on.

iText PDFSweep RegexBasedCleanupStrategy not work in some case

I'm trying to use iText PDFSweep RegexBasedCleanupStrategy to redact some words from pdf, however I only want to redact the word but not appear in other word, eg.
I want to redact "al" as single word, but I don't want to redact the "al" in "mineral".
So I add the word boundary("\b") in the Regex as parameter to RegexBasedCleanupStrategy,
new RegexBasedCleanupStrategy("\\bal\\b")
however the pdfAutoSweep.cleanUp not work if the word is at the end of line.
In short
The cause of this issue is that the routine that flattens the extracted text chunks into a single String for applying the regular expression does not insert any indicator for a line break. Thus, in that String the last letter from one line is immediately followed by the first letter of the next which hides the word boundary. One can fix the behavior by adding an appropriate character to the String in case of a line break.
The problematic code
The routine that flattens the extracted text chunks into a single String is CharacterRenderInfo.mapString(List<CharacterRenderInfo>) in the package com.itextpdf.kernel.pdf.canvas.parser.listener. In case of a merely horizontal gap this routine inserts a space character but in case of a vertical offset, i.e. a line break, it adds nothing extra to the StringBuilder in which the String representation is generated:
if (chunk.sameLine(lastChunk)) {
// we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
sb.append(' ');
}
indexMap.put(sb.length(), i);
sb.append(chunk.getText());
} else {
indexMap.put(sb.length(), i);
sb.append(chunk.getText());
}
A possible fix
One can extend the code above to insert a newline character in case of a line break:
if (chunk.sameLine(lastChunk)) {
// we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
sb.append(' ');
}
indexMap.put(sb.length(), i);
sb.append(chunk.getText());
} else {
sb.append('\n');
indexMap.put(sb.length(), i);
sb.append(chunk.getText());
}
This CharacterRenderInfo.mapString method is only called from the RegexBasedLocationExtractionStrategy method getResultantLocations() (package com.itextpdf.kernel.pdf.canvas.parser.listener), and only for the task mentioned, i.e. applying the regular expression in question. Thus, enabling it to properly allow recognition of word boundaries should not break anything but indeed should be considered a fix.
One merely might consider adding a different character for a line break, e.g. a plain space ' ' if one does not want to treat vertical gaps any different than horizontal ones. For a general fix one might, therefore, consider making this character a settable property of the strategy.
Versions
I tested with iText 7.1.4-SNAPSHOT and PDFSweep 2.0.3-SNAPSHOT.

Strange results when deleting all special characters from a string in Progress / OpenEdge

I have the code snippet below (as suggested in this previous Stack Overflow answer ... Deleting all special characters from a string in progress 4GL) which is attempting to remove all extended characters from a string so that I may transmit it to a customer's system which will not accept any extended characters.
do v-int = 128 to 255:
assign v-string = replace(v-string,chr(v-int),"").
end.
It is working perfectly with one exception (which makes me fear there may be others I have not caught). When it gets to 255, it will replace all 'y's in the string.
If I do the following ...
display chr(255) = chr(121). /* 121 is asc code of y */
I get true as the result.
And therefore, if I do the following ...
display replace("This is really strange",chr(255),"").
I get the following result:
This is reall strange
I have verified that 'y' is the only character affected by running the following:
def var v-string as char init "abcdefghijklmnopqrstuvwxyz".
def var v-int as int.
do v-int = 128 to 255:
assign v-string = replace(v-string,chr(v-int),"").
end.
display v-string.
Which results in the following:
abcdefghijklmnopqrstuvwxz
I know I can fix this by removing 255 from the range but I would like to understand why this is happening.
Is this a character collation set issue or am I missing something simpler?
Thanks for any help!
This is a bug. Here's a Progress Knowledge Base article about it:
http://knowledgebase.progress.com/articles/Article/000046181
The workaround is to specify the codepage in the CHR() statement, like this:
CHR(255, "UTF-8", "1252")
Here it is in your example:
def var v-string as char init "abcdefghijklmnopqrstuvwxyz". def var v-int as int.
do v-int = 128 to 255:
assign v-string = replace(v-string, chr(v-int, "UTF-8", "1252"), "").
end.
display v-string.
You should now see the 'y' in the output.
This seems to be a bug!
The REPLACE() function returns an unexpected result when replacing character CHR(255) (ÿ) in a String.
The REPLACE() function modifies the value of the target character, but additionally it changes any occurrence of characters 'Y' and 'y' present in the String.
This behavior seems to affect only the character ÿ. Other characters are correctly changed by REPLACE().
Using default codepage ISO-8859-1
Link to knowledgebase

Matching Unicode punctuation using LPeg

I am trying to create an LPeg pattern that would match any Unicode punctuation inside UTF-8 encoded input. I came up with the following marriage of Selene Unicode and LPeg:
local unicode = require("unicode")
local lpeg = require("lpeg")
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s,i,a)
local match = unicode.utf8.match(a, "^%p")
if match == nil
return false
else
return i+#match
end
end)
This appears to work, but it will miss punctuation characters that are a combination of several Unicode codepoints (if such characters exist), as I am reading only 4 bytes ahead, it probably kills the performance of the parser, and it is undefined what the library match function will do, when I feed it a string that contains a runt UTF-8 character (although it appears to work now).
I would like to know whether this is a correct approach or if there is a better way to achieve what I am trying to achieve.
The correct way to match UTF-8 characters is shown in an example in the LPeg homepage. The first byte of a UTF-8 character determines how many more bytes are a part of it:
local cont = lpeg.R("\128\191") -- continuation byte
local utf8 = lpeg.R("\0\127")
+ lpeg.R("\194\223") * cont
+ lpeg.R("\224\239") * cont * cont
+ lpeg.R("\240\244") * cont * cont * cont
Building on this utf8 pattern we can use lpeg.Cmt and the Selene Unicode match function kind of like you proposed:
local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
if unicode.utf8.match(c, "%p") then
return i
end
end)
Note that we return i, this is in accordance with what Cmt expects:
The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.
This means we should return the same number the function receives, that is the position immediately after the UTF-8 character.