Extract words in Lua split by Unicode spaces and control characters - unicode

I'm interested in a pure-Lua (i.e., no external Unicode library) solution to extracting the units of a string between certain Unicode control characters and spaces. The code points I would like to use as delimiters are:
0000-0020
007f-00a0
00ad
1680
2000-200a
2028-2029
202f
205f
3000
I know how to access the code points in a string, for example:
> for i,c in utf8.codes("é$ \tπ😃") do print(c) end
233
36
32
9
960
128515
but I am not sure how to "skip" the spaces and tabs and reconstitute the other codepoints into strings themselves. What I would like to do in the example above, is drop the 32 and 9, then perhaps use utf8.char(233, 36) and utf8.char(960, 128515) to somehow get ["é$", "π😃"].
It seems that putting everything into a table of numbers and painstakingly walking through the table with for-loops and if-statements would work, but is there a better way? I looked into string:gmatch but that seems to require making utf8 sequences out of each of the ranges I want, and it's not clear what that pattern would even look like.
Is there a idiomatic way to extract the strings between the spaces? Or must I manually hack tables of code points? gmatch does not look up to the task. Or is it?

would require painstakingly generating the utf8 encodings for all code points at each end of the range.
Yes. But of course not manually.
local function range(from, to)
assert(utf8.codepoint(from) // 64 == utf8.codepoint(to) // 64)
return from:sub(1,-2).."["..from:sub(-1).."-"..to:sub(-1).."]"
end
local function split_unicode(s)
for w in s
:gsub("[\0-\x1F\x7F]", " ")
:gsub("\u{00a0}", " ")
:gsub("\u{00ad}", " ")
:gsub("\u{1680}", " ")
:gsub(range("\u{2000}", "\u{200a}"), " ")
:gsub(range("\u{2028}", "\u{2029}"), " ")
:gsub("\u{202f}", " ")
:gsub("\u{205f}", " ")
:gsub("\u{3000}", " ")
:gmatch"%S+"
do
print(w)
end
end
Test:
split_unicode("#\0#\t#\x1F#\x7F#\u{00a0}#\u{00ad}#\u{1680}#\u{2000}#\u{2005}#\u{200a}#\u{2028}#\u{2029}#\u{202f}#\u{205f}#\u{3000}#")

Related

Converting numbers into timestamps (inserting colons at specific places)

I'm using AutoHotkey for this as the code is the most understandable to me. So I have a document with numbers and text, for example like this
120344 text text text
234000 text text
and the desired output is
12:03:44 text text text
23:40:00 text text
I'm sure StrReplace can be used to insert the colons in, but I'm not sure how to specify the position of the colons or ask AHK to 'find' specific strings of 6 digit numbers. Before, I would have highlighted the text I want to apply StrReplace to and then press a hotkey, but I was wondering if there is a more efficient way to do this that doesn't need my interaction. Even just pointing to the relevant functions I would need to look into to do this would be helpful! Thanks so much, I'm still very new to programming.
hfontanez's answer was very helpful in figuring out that for this problem, I had to use a loop and substring function. I'm sure there are much less messy ways to write this code, but this is the final version of what worked for my purposes:
Loop, read, C:\[location of input file]
{
{ If A_LoopReadLine = ;
Continue ; this part is to ignore the blank lines in the file
}
{
one := A_LoopReadLine
x := SubStr(one, 1, 2)
y := SubStr(one, 3, 2)
z := SubStr(one, 5)
two := x . ":" . y . ":" . z
FileAppend, %two%`r`n, C:\[location of output file]
}
}
return
Assuming that the "timestamp" component is always 6 characters long and always at the beginning of the string, this solution should work just fine.
String test = "012345 test test test";
test = test.substring(0, 2) + ":" + test.substring(2, 4) + ":" + test.substring(4, test.length());
This outputs 01:23:45 test test test
Why? Because you are temporarily creating a String object that it's two characters long and then you insert the colon before taking the next pair. Lastly, you append the rest of the String and assign it to whichever String variable you want. Remember, the substring method doesn't modify the String object you are calling the method on. This method returns a "new" String object. Therefore, the variable test is unmodified until the assignment operation kicks in at the end.
Alternatively, you can use a StringBuilder and append each component like this:
StringBuilder sbuff = new StringBuilder();
sbuff.append(test.substring(0,2));
sbuff.append(":");
sbuff.append(test.substring(2,4));
sbuff.append(":");
sbuff.append(test.substring(4,test.length()));
test = sbuff.toString();
You could also use a "fancy" loop to do this, but I think for something this simple, looping is just overkill. Oh, I almost forgot, this should work with both of your test strings because after the last colon insert, the code takes the substring from index position 4 all the way to the end of the string indiscriminately.

Swift string indexing combines "\r\n" as one char instead of two

I am dealing with strings containing \r\n with Swift 4.2. I ran into kind of strange behavior of Swift index, it appears \r\n will be treated as one character instead of two by Swift indexing methods. I wrote a piece of code to present this behavior:
var text = "ABC\r\n\r\nDEF"
func printChar(_ lower: Int, _ upper: Int) {
let start = text.index(text.startIndex, offsetBy: lower)
let end = text.index(text.startIndex, offsetBy: upper)
print("\"" + text[start..<end] + "\"")
}
printChar(0, 1) // "A"
printChar(1, 2) // "B"
printChar(2, 3) // "C"
printChar(3, 4) // new line
printChar(4, 5) // new line (okay, what's going on here?)
printChar(5, 6) // "D"
printChar(6, 7) // "E"
printChar(7, 8) // "F"
The print result will be
"A"
"B"
"C"
"
"
"
"
"D"
"E"
"F"
Any idea why it's like this?
TLDR: \r\n is a grapheme cluster and is treated as a single Character in Swift because Unicode.
Swift treats \r\n as one Character.
Objective-C NSString treats it as two characters (in terms of the result from length).
On the swift-users forum someone wrote:
– "\r\n" is a single Character. Is this the correct behaviour?
– Yes, a Character corresponds to a Unicode grapheme cluster, and "\r\n" is considered a single grapheme cluster.
And the subsequent response posted a link to Unicode documentation, check out this table which officially states CRLF is a grapheme cluster.
Take a look at the Apple documentation on Characters and Grapheme Clusters.
It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.
The Swift documentation on Strings and Characters is also worth reading.
This overview from objc.io is interesting as well.
NSString represents UTF-16-encoded text. Length, indices, and ranges are all based on UTF-16 code units.
Another example of this is an emoji like 👍🏻. This single character is actually %uD83D%uDC4D%uD83C%uDFFB, four different unicode scalars. But if you called count on a string with just that emoji you'd (correctly) get 1.
If you wanted to see the scalars you could iterate them as follows:
for scalar in text.unicodeScalars {
print("\(scalar.value) ", terminator: "")
}
Which for "\r\n" would give you 13 10
In the Swift documentation you'll find why NSString is different:
The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.
Thus this isn't really "strange" behaviour of Swift string indexing, but rather a result of how Unicode treats these characters and how String in Swift is designed. Swift string indexing goes by Character and \r\n is a single Character.

Matlab: Function that returns a string with the first n characters of the alphabet

I'd like to have a function generate(n) that generates the first n lowercase characters of the alphabet appended in a string (therefore: 1<=n<=26)
For example:
generate(3) --> 'abc'
generate(5) --> 'abcde'
generate(9) --> 'abcdefghi'
I'm new to Matlab and I'd be happy if someone could show me an approach of how to write the function. For sure this will involve doing arithmetic with the ASCII-codes of the characters - but I've no idea how to do this and which types that Matlab provides to do this.
I would rely on ASCII codes for this. You can convert an integer to a character using char.
So for example if we want an "e", we could look up the ASCII code for "e" (101) and write:
char(101)
'e'
This also works for arrays:
char([101, 102])
'ef'
The nice thing in your case is that in ASCII, the lowercase letters are all the numbers between 97 ("a") and 122 ("z"). Thus the following code works by taking ASCII "a" (97) and creating an array of length n starting at 97. These numbers are then converted using char to strings. As an added bonus, the version below ensures that the array can only go to 122 (ASCII for "z").
function output = generate(n)
output = char(97:min(96 + n, 122));
end
Note: For the upper limit we use 96 + n because if n were 1, then we want 97:97 rather than 97:98 as the second would return "ab". This could be written as 97:(97 + n - 1) but the way I've written it, I've simply pulled the "-1" into the constant.
You could also make this a simple anonymous function.
generate = #(n)char(97:min(96 + n, 122));
generate(3)
'abc'
To write the most portable and robust code, I would probably not want those hard-coded ASCII codes, so I would use something like the following:
output = 'a':char(min('a' + n - 1, 'z'));
...or, you can just generate the entire alphabet and take the part you want:
function str = generate(n)
alphabet = 'a':'z';
str = alphabet(1:n);
end
Note that this will fail with an index out of bounds error for n > 26, so you might want to check for that.
You can use the char built-in function which converts an interger value (or array) into a character array.
EDIT
Bug fixed (ref. Suever's comment)
function [str]=generate(n)
a=97;
% str=char(a:a+n)
str=char(a:a+n-1)
Hope this helps.
Qapla'

AutoHotKey Script - String Splitting

I have a string that looks like this:
17/07/2013 TEXTT TEXR 1 Text 1234567 456.78 987654
I need to separate this so I only end up with 2 values (in this example it's 1234567 and 456.78). The rest is unneeded.
I tried using string split with %A_Space% but as the whole middle area between values is filled with spaces, it doesn't really work.
Anyone got an idea?
src:="17/07/2013 TEXTT TEXR 1 Text "
. " 1234567 456.78 987654", pattern:="([\d\.]+)\s+([\d\.]+)"
RegexMatch(src, pattern, match)
MsgBox, 262144, % "result", % match1 "`n"match2
You should look at RegExMatch() and RegexReplace().
So, you will need to build a regex needle (I'm not an expert regexer, but this will work)
First, remove all of the string up to the end of "1 Text" since "1 Text" as you say, is constant. That will leave you with the three number values.
Something like this should find just the numbers you want:
needle:= "iO)1\s+Text"
partialstring := RegexMatch(completestring, needle, results)
lenOfFrontToRemove := results.pos() + results.len()
lastthreenumbers := substr(completestring, lenOfFrontToRemove, strlen(completestring) )
lastthreenumbers := trim(lastthreenumbers)
msgbox % lastthreenumbers
To explain the regex needle:
- the i means case insensitive
- the O stands for options - it lets us use results.pos and results.len
- the \s means to look for whitespace; the + means to look for more than one if present.
Now you have just the last three numbers.
1234567 456.78 987654
But you get the idea, right? You should able to parse it from here.
Some hints: in a regex needle, use \d to find any digit, and the + to make it look for more than one in a row. If you want to find the period, use \.

Format a variable in iReport in a string with multiple fields

I have a text field that has the following expression:
$F{casNo} + " Total " + $P{chosenUom} + ": " + $V{total_COUNT}
casNo is a string, chosenUom is a string. total_COUNT is a sum variable of doubles. The total_COUNT variable displays, but it's got 8 - 10 decimal places (1.34324255234), all I need is something along the lines of 1.34.
Here's what I tried already:
$F{casNo} + " Total " + $P{chosenUom} + ": " + new DecimalFormat("0.00").format($V{total_COUNT}).toString()
Any help would be appreciated
For now I'm just doing basic math, but I'm hoping for a real solution, not a workaround
((int)($V{total_COUNT}*100.0))/100.0
You can format the in lline numbers by using:
new DecimalFormat("###0.00").format(YOUR NUMBER)
You might split the text field into two, one containing everything but the $V{total_COUNT}, and the second containing only $V{total_COUNT}, but with the Pattern property set to something like "#0.00".
You'd have to get a bit creative with layout, though, to prevent unwanted word-wrapping and spacing; for example, first text field could be wide and right-aligned, while text field containing the count could be left-aligned and wide enough to accommodate the formatted number.