Calculating upper case letters, lowercase letters, and other characters - JES

Write a program that accepts a sentence as console input and calculates the number of upper case letters, lower case letters and other characters.
Suppose the following input is supplied to the program:
Hello World;!#

Since this question sounds like a programming assignment, I've written this in a more wordy manner. This is standard Python 3, not JES.
#! /usr/bin/env python3
import sys
upper_case_chars = 0
lower_case_chars = 0
total_chars = 0
found_eof = False
# Read character after character from stdin, processing it in turn
# Stop if an error is encountered, or End-Of-File happens.
while (not found_eof):
    try:
        letter = str(sys.stdin.read(1))
    except:
        # handle any I/O error somewhat cleanly
        break
    if (letter != ''):
        total_chars += 1
        if (letter >= 'A' and letter <= 'Z'):
            upper_case_chars += 1
        elif (letter >= 'a' and letter <= 'z'):
            lower_case_chars += 1
    else:
        found_eof = True
# write the results to the console
print("Upper-case Letters: %3u" % (upper_case_chars))
print("Lower-case Letters: %3u" % (lower_case_chars))
print("Other Letters: %3u" % (total_chars - (upper_case_chars + lower_case_chars)))
Note that you should modify the code to handle end-of-line characters yourself; currently they're counted as "other". I've also left out handling of binary input, which will probably fail to decode.
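For comparison, here is a shorter sketch of the same counting logic that works on a whole string at once. The function name `count_chars` is made up for illustration. Note that it uses `str.isupper()` and `str.islower()`, which also accept non-ASCII letters, so it is not strictly equivalent to the A-Z / a-z range checks above.

```python
def count_chars(text):
    """Return (upper, lower, other) counts for the given string."""
    upper = sum(1 for ch in text if ch.isupper())
    lower = sum(1 for ch in text if ch.islower())
    # Everything that is neither upper nor lower case counts as "other",
    # including spaces, digits, punctuation and line endings.
    other = len(text) - upper - lower
    return upper, lower, other


if __name__ == "__main__":
    print(count_chars("Hello World;!#"))
```

To read from the console instead, something like `count_chars(input())` should do, which also drops the trailing newline.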


Are NFC normalization boundaries also extended grapheme cluster boundaries?

This question is related to text editing. Say you have a piece of text in normalization form NFC, and a cursor that points to an extended grapheme cluster boundary within this text. You want to insert another piece of text at the cursor location, and make sure that the resulting text is also in NFC. You also want to move the cursor on the first grapheme boundary that immediately follows the inserted text.
Now, since concatenating two strings that are both in NFC doesn't necessarily produce a string that is also in NFC, you might have to emend the text around the insertion point. For instance, if you have a string that contains 4 code points like so:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E
[2] COMBINING MACRON BELOW
--- Cursor location
[3] LATIN SMALL LETTER A
Suppose you want to insert the two-code-point string {COMBINING ACUTE ACCENT, COMBINING DOT ABOVE} at the cursor location. The result will then be:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E WITH ACUTE
[2] COMBINING MACRON BELOW
[3] COMBINING DOT ABOVE
--- Cursor location
[4] LATIN SMALL LETTER A
Now my question is: how do you figure out at which offset you should place the cursor after inserting the string, in such a way that the cursor ends up after the inserted string and also on a grapheme boundary? In this particular case, the text that follows the cursor location cannot possibly interact, during normalization, with what precedes. So the following sample Python code would work:
import unicodedata
def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    if new_cursor_pos == 0:
        # grapheme_break_after is a function that
        # returns the offset of the first grapheme
        # boundary after the given index
        new_cursor_pos = grapheme_break_after(new_text, 0)
    return new_text, new_cursor_pos
But does this approach necessarily work? To be more explicit: is it necessarily the case that the text that follows a grapheme boundary doesn't interact with what precedes it during normalization, such that NFC(text[:grapheme_break]) + NFC(text[grapheme_break:]) == NFC(text) is always true?
Update
@nwellnhof's excellent analysis below motivated me to investigate things
further. So I followed the "When in doubt, use brute force" mantra and wrote a
small script that parses grapheme break properties and examines each code point
that can appear at the beginning of a grapheme, to test whether it can
possibly interact with preceding code points during normalization. Here's the
script:
from urllib.request import urlopen
import icu, unicodedata
URL = "http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt"
break_props = {}
with urlopen(URL) as f:
    for line in f:
        line = line.decode()
        # Strip trailing comments and skip empty lines.
        p = line.find("#")
        if p >= 0:
            line = line[:p]
        line = line.strip()
        if not line:
            continue
        fields = [x.strip() for x in line.split(";")]
        codes = [int(x, 16) for x in fields[0].split("..")]
        if len(codes) == 2:
            start, end = codes
        else:
            assert(len(codes) == 1)
            start, end = codes[0], codes[0]
        category = fields[1]
        break_props.setdefault(category, []).extend(range(start, end + 1))
# The only code points that can't appear at the beginning of a grapheme cluster
# are those that appear in the following categories. See the regexps in
# UAX #29 Tables 1b and 1c.
to_ignore = set(c for name in ("Extend", "ZWJ", "SpacingMark") for c in break_props[name])
nfc = icu.Normalizer2.getNFCInstance()
for c in range(0x10FFFF + 1):
    if c in to_ignore:
        continue
    if not nfc.hasBoundaryBefore(chr(c)):
        print("U+%04X %s" % (c, unicodedata.name(chr(c))))
Looking at the output, it appears that there are about 40 code points that are
grapheme starters but still compose with preceding code points in NFC.
Basically, they are non-precomposed Hangul jamo of type V
(U+1161..U+1175) and T (U+11A8..U+11C2). Things make sense when you examine
the regular expressions in UAX #29, Table 1c together with what
the standard says about Jamo composition (section 3.12, p. 147 of
version 13 of the standard).
The gist of it is that Hangul sequences of the form {L, V} can compose to a
Hangul syllable of type LV, and similarly sequences of the form {LV, T} can
compose to a syllable of type LVT.
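To see the Hangul case concretely, here is a small sketch that uses only the standard `unicodedata` module. The three jamo were picked for illustration (U+1100 of type L, U+1161 of type V, U+11A8 of type T), and `nfc` is just a local helper.

```python
import unicodedata

L = "\u1100"  # HANGUL CHOSEONG KIYEOK (type L)
V = "\u1161"  # HANGUL JUNGSEONG A (type V)
T = "\u11A8"  # HANGUL JONGSEONG KIYEOK (type T)


def nfc(s):
    return unicodedata.normalize("NFC", s)


# {L, V} composes to a precomposed syllable of type LV.
lv = nfc(L + V)
# {LV, T} composes to a precomposed syllable of type LVT.
lvt = nfc(lv + T)

# Normalizing the two halves separately misses the composition, so
# NFC(a) + NFC(b) is not NFC(a + b) here.
separately = nfc(L) + nfc(V)

print([hex(ord(c)) for c in lv], [hex(ord(c)) for c in lvt])
print(separately == nfc(L + V))
```

So a boundary placed before a V or T jamo is exactly where normalizing the two sides independently can go wrong.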
To sum up, and assuming I'm not mistaken, the above Python code could
be corrected as follows:
import unicodedata
import icu # pip3 install PyICU
def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    new_text = unicodedata.normalize("NFC", new_text)
    break_iter = icu.BreakIterator.createCharacterInstance(icu.Locale())
    break_iter.setText(new_text)
    if new_cursor_pos == 0:
        # Move the cursor to the first grapheme boundary > 0.
        new_cursor_pos = break_iter.nextBoundary()
    elif new_cursor_pos > len(new_text):
        new_cursor_pos = len(new_text)
    elif not break_iter.isBoundary(new_cursor_pos):
        # isBoundary() moves the cursor on the first boundary >= the given
        # position.
        new_cursor_pos = break_iter.current()
    return new_text, new_cursor_pos
The (possibly) pointless test new_cursor_pos > len(new_text) is there to
catch the case len(NFC(x)) > len(NFC(x + y)). I'm not sure whether this can
actually happen with the current Unicode database (more tests would be needed to prove it), but it is theoretically quite possible. If, say, you have
a set of three code points A, B and C and two precomposed forms A+B and
A+B+C (but not A+C), then you could very well have NFC({A, C} + {B}) = {A+B+C}.
If this case doesn't occur in practice (which is very likely, especially with
"real" texts), then the above Python code will necessarily locate the first
grapheme boundary after the end of the inserted text. Otherwise, it will merely
locate some grapheme boundary after the inserted text, but not necessarily the
first one. I don't yet see how it could be possible to improve the second case (assuming it isn't merely theoretical), so I think I'll leave
my investigation at that for now.
As mentioned in my comment, the actual boundaries can differ slightly. But AFAICS, there should be no meaningful interaction. UAX #29 states:
6.1 Normalization
[...] the grapheme cluster boundary specification has the following features:
There is never a break within a sequence of nonspacing marks.
There is never a break between a base character and subsequent nonspacing marks.
This only mentions nonspacing marks. But with extended grapheme clusters (as opposed to legacy ones), I'm pretty sure these statements also apply to "non-starter" spacing marks[1]. This would cover all normalization non-starters (which must be either nonspacing (Mn) or spacing (Mc) marks). So there's never an extended grapheme cluster boundary before a non-starter[2] which should give you the guarantee you need.
Note that it's possible to have multiple runs of starters and non-starters ("normalization boundaries") within a single grapheme cluster, for example with U+034F COMBINING GRAPHEME JOINER.
[1] Some spacing marks are excluded, but these should all be starters.
[2] Except at the start of text.
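A quick way to look at the COMBINING GRAPHEME JOINER case is to print canonical combining classes with the standard `unicodedata` module. This is only a sketch: the sample strings are made up, and `ccc` is a local helper, not a library function.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def ccc(s):
    """Canonical combining class of each code point (0 means starter)."""
    return [unicodedata.combining(ch) for ch in s]


# a + COMBINING ACUTE ACCENT (230) + COMBINING DOT BELOW (220)
without_cgj = "a\u0301\u0323"
# Same marks, but with CGJ (a starter) between them.
with_cgj = "a\u0301" + CGJ + "\u0323"

nfc_without = unicodedata.normalize("NFC", without_cgj)
nfc_with = unicodedata.normalize("NFC", with_cgj)

print(ccc(with_cgj))
print([hex(ord(c)) for c in nfc_without])
print([hex(ord(c)) for c in nfc_with])
```

Without CGJ the two marks are reordered and the dot below composes with the base; with CGJ in between, the second mark belongs to a separate run and is left where it is, even though everything is still one grapheme cluster.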

MATLAB only runs first case in a switch-case block

I am trying to separate data from a csv file into "blocks" of data that I then put into 10 different categories. Each block has a set of spaces at the top of it. Each category contains 660 blocks. Currently, my code successfully puts in the first block, but only the first block. It does correctly count the number of blocks though. I do not understand why it only puts in the first block when the block count works correctly, and any help would be appreciated.
The csv file can be downloaded from here.
https://archive.ics.uci.edu/ml/machine-learning-databases/00195/
fid = fopen('Train_Arabic_Digit.txt','rt');
traindata = textscan(fid, '%f%f%f%f%f%f%f%f%f%f%f%f%f', 'MultipleDelimsAsOne',true, 'Delimiter','[;', 'HeaderLines',1);
fclose(fid);
% Each line in Train_Arabic_Digit.txt or Test_Arabic_Digit.txt represents
% 13 MFCCs coefficients in the increasing order separated by spaces. This
% corresponds to one analysis frame. Lines are organized into blocks, which
% are a set of 4-93 lines separated by blank lines and corresponds to a
% single speech utterance of an spoken Arabic digit with 4-93 frames.
% Each spoken digit is a set of consecutive blocks.
% TO DO: how get blocks...split? with /n?
% In Train_Arabic_Digit.txt there are 660 blocks for each spoken digit. The
% first 330 blocks represent male speakers and the second 330 blocks
% represent the female speakers. Blocks 1-660 represent the spoken digit
% "0" (10 utterances of /0/ from 66 speakers), blocks 661-1320 represent
% the spoken digit "1" (10 utterances of /1/ from the same 66 speakers
% 33 males and 33 females), and so on up to digit 9.
content = fileread( 'Train_Arabic_Digit.txt' ) ;
default = regexp(content,'\n','split');
digit0=[];
digit1=[];
digit2=[];
digit3=[];
digit4=[];
digit5=[];
digit6=[];
digit7=[];
digit8=[];
digit9=[];
blockcount=0;
a=0;
for i=1:1:length(default)
    if strcmp(default{i},' ')
        blockcount=blockcount+1;
    else
        switch blockcount % currently only works for blockcount=1 even though
            %it does pick up the number of blocks...
            case blockcount>0 && blockcount<=660 %why does it not recognize 2 as being<660
                a=a+1;
                digit0=[digit0 newline default{i}];
            case blockcount>660 && blockcount<=1320
                digit1=[digit1 newline default{i}];
            case blockcount<=1980 && blockcount>1320
                digit2=[digit2 newline default{i}];
            case blockcount<=2640 && blockcount>1980
                digit3=[digit3 newline default{i}];
            case blockcount<=3300 && blockcount>2640
                digit4=[digit4 newline default{i}];
            case blockcount<=3960 && blockcount>3300
                digit5=[digit5 newline default{i}];
            case blockcount<=4620 && blockcount>3960
                digit6=[digit6 newline default{i}];
            case blockcount<=5280 && blockcount>4620
                digit7=[digit7 newline default{i}];
            case blockcount<=5940 && blockcount>5280
                digit8=[digit8 newline default{i}];
            case blockcount<=6600 && blockcount>5940
                digit9=[digit9 newline default{i}];
        end
    end
end
That's because you have confused if-else syntax with switch-case. An expression like blockcount>0 && blockcount<=660 always returns a logical value, that is, either 0 or 1. When blockcount is equal to 1, the first case expression evaluates to 1 and the rest evaluate to 0, so 1==1 and the first case runs. But when blockcount becomes 2, the first case expression still evaluates to 1, and 2~=1, so nothing happens!
You can either use if-else or change your case expressions to cell arrays containing ranges of values. According to docs:
The switch block tests each case until one of the case expressions is
true. A case is true when:
For numbers, case_expression == switch_expression.
For character vectors, strcmp(case_expression,switch_expression) == 1.
For objects that support the eq function, case_expression ==
switch_expression. The output of the overloaded eq function must be
either a logical value or convertible to a logical value.
For a cell array case_expression, at least one of the elements of the
cell array matches switch_expression, as defined above for numbers,
character vectors, and objects.
It should be something like:
switch blockcount
    case num2cell(0:660)
        digit0 ...
    case num2cell(661:1320)
        digit1 ...
    ...
end
BUT this block of code will take a very long time to complete. Avoid a = [a b] in loops, because resizing matrices is time-consuming. Preallocate a and do a(i) = b instead.
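The equality-based matching is easy to reproduce outside MATLAB. The following Python sketch is only a toy model: it relies on Python also treating `True == 1` as true, and the function names are made up. It also shows an alternative that avoids the ten cases entirely by computing the digit index with integer division, assuming 660 blocks per digit as in the question.

```python
def switch_with_logical_cases(blockcount):
    """Toy model of `switch blockcount` with logical case expressions."""
    cases = [
        (0 < blockcount <= 660, "digit0"),
        (660 < blockcount <= 1320, "digit1"),
    ]
    for case_value, label in cases:
        # switch-case tests case_expression == switch_expression,
        # it does not test whether case_expression is true.
        if case_value == blockcount:
            return label
    return None


def digit_for_block(blockcount, blocks_per_digit=660):
    """Map a 1-based block number to its digit (0..9)."""
    return (blockcount - 1) // blocks_per_digit


if __name__ == "__main__":
    for n in (1, 2, 661):
        print(n, switch_with_logical_cases(n), digit_for_block(n))
```

With a digit index in hand, the lines could be appended to one container indexed by digit rather than to ten separately named variables.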

When using fscanf, what is happening in between lines being scanned?

I have an input file that I scan line by line until the end. I use the first character of each line as an indicator of what I want to do: 1. pause for x cycles, 2. write a 16-bit word serially, 3. write one bit at a time, 4. end of file.
The issue is that I see an extra mode during the first p-to-w transition.
I tried printing out the string value of my variable "mode", and what I see on the waveform between the first p and w is an additional mode not specified in my case statement.
at time = 0: mode equals " " (blank, nothing, all fine)
at time = A: mode now equals "p" (paused 4 cycles long, sure fine, I can fix this later)
at time = B: mode now equals "[]" (ERROR! this is not the next line)
at time = C: mode now equals "w" (back to normal)
input file:
p 10
w DEAD
w BEEF
p 5
b HHLL
p 100
eol
Here is my SystemVerilog code that is supposed to scan the input file:
$fscanf(fd, "%c %h", mode, mystr);
case(mode)
    "p": begin
        wait_v = mystr.atoi();
        repeat ( wait_v) begin
            //turns bits on or off, but only modifies wire outputs and not mode
        end
    end
    "w": begin
        data = mystr.atohex();
        // writes data, each character is 4 bits. each word is 16 cycles
    end
    "b": begin
        lastsix = mystr.atobin();
        // writes data outputs either high or low, each character is 1 cycle long
    end
    "eol": begin
        $fclose(fn);
        $stop;
    end
Expected:
time => 0: mode equals " " (blank, nothing, all fine)
time => A: mode now equals "p" (paused for 3 cycles)
time => C: mode now equals "w" (back to normal)
Actual:
time => 0: mode equals " " (blank, nothing, all fine)
time => A: mode now equals "p" (paused 4 cycles long, sure fine, I can fix this later)
time => B: mode now equals "[]" (ERROR! this is not the next line)
time => C: mode now equals "w" (back to normal)
When you use %c in scanf it will read the very next character. When you use %h it will read a hex value, stopping after the last valid hex digit, and not reading what is next. So after your first fscanf call, the input will be pointing at the newline at the end of the first line, and the next fscanf call will read that newline with %c, and you'll get mode == "\n".
What you probably want is to use " %c%h" as your format -- note the (space) before the %c. The space causes fscanf to read and discard whitespace. Since %h automatically skips whitespace before reading a number, you don't need the space before it.
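If it helps to see the consumption order, here is a toy Python model of the two format strings. It is not how $fscanf is implemented; the `Scanner` class and the `scan` and `modes` helpers are invented for illustration and only mimic the rules described above (%c takes the next character as-is, %h skips whitespace and then stops after the last hex digit).

```python
import string


class Scanner:
    """Toy model of scanf-style input consumption."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def skip_ws(self):
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1

    def read_char(self):
        # %c: take the very next character, whitespace included.
        if self.pos >= len(self.text):
            return None
        ch = self.text[self.pos]
        self.pos += 1
        return ch

    def read_hex(self):
        # %h: skip leading whitespace, then consume hex digits only.
        self.skip_ws()
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos] in string.hexdigits:
            self.pos += 1
        return self.text[start:self.pos] or None


def scan(scanner, leading_space):
    if leading_space:  # models the space in " %c%h"
        scanner.skip_ws()
    mode = scanner.read_char()
    value = scanner.read_hex()
    return mode, value


def modes(text, leading_space, calls):
    scanner = Scanner(text)
    return [scan(scanner, leading_space)[0] for _ in range(calls)]


if __name__ == "__main__":
    data = "p 10\nw DEAD\n"
    print(modes(data, leading_space=False, calls=3))
    print(modes(data, leading_space=True, calls=2))
```

Without the leading space the second call picks up the newline as the mode, which matches the extra mode seen between p and w.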

Unicode character transformation in SPSS

I have a string variable. I need to convert all non-digit characters to spaces (" "). I have a problem with Unicode characters: characters outside the basic charset are converted to invalid characters. See the code below for an example.
Is there another way to achieve the same result with a procedure that does not choke on special Unicode characters?
new file.
set unicode = yes.
show unicode.
data list free
/T (a10).
begin data
1234
5678
absd
12as
12(a
12(vi
12(vī
12āčž
end data.
string Z (a10).
comp Z = T.
loop #k = 1 to char.len(Z).
if ~range(char.sub(Z, #k, 1), "0", "9") sub(Z, #k, 1) = " ".
end loop.
comp Z = normalize(Z).
comp len = char.len(Z).
list var = all.
exe.
The result:
T          Z          len
1234       1234       4
5678       5678       4
absd                  0
12as       12         2
12(a       12         2
12(vi      12         2
12(vī      12 �       6
>Warning # 649
>The first argument to the CHAR.SUBSTR function contains invalid characters.
>Command line: 1939 Current case: 8 Current splitfile group: 1
12āčž      12 �ž      7
Number of cases read: 8 Number of cases listed: 8
The substr function should not be used on the left hand side of an expression in Unicode mode, because the replacement character may not be the same number of bytes as the character(s) being replaced. Instead, use the replace function on the right hand side.
The corrupted characters you are seeing are due to this size mismatch.
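The size mismatch can be reproduced in a few lines of Python. This is only an illustration of the idea, under the assumption that the in-place replacement overwrites a single byte of a two-byte UTF-8 character; it is not a description of SPSS internals.

```python
import re

text = "12(vī"

# Byte-level replacement: "ī" (U+012B) takes two bytes in UTF-8.
# Overwriting only its first byte with a space leaves a stray
# continuation byte behind, which is no longer valid UTF-8.
raw = bytearray(text.encode("utf-8"))
raw[4:5] = b" "
broken = bytes(raw).decode("utf-8", errors="replace")

# Character-level replacement: every non-digit character becomes one
# space, regardless of how many bytes it occupied.
clean = re.sub(r"[^0-9]", " ", text)

print(repr(broken))
print(repr(clean))
```

The first result ends in the U+FFFD replacement character, much like the corrupted output in the question, while the second is simply the digits followed by spaces.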
How about, instead of replacing non-numeric characters, you cycle through and pull out the numeric characters and rebuild Z? (Note that my version here predates the CHAR. string functions.)
data list free
/T (a10).
begin data
1234
5678
absd
12as
12(a
12(vi
12(vī
12āčž
12as23
end data.
STRING Z (a10).
STRING #temp (A1).
COMPUTE #len = LENGTH(RTRIM(T)).
LOOP #i = 1 to #len.
  COMPUTE #temp = SUBSTR(T,#i,1).
  DO IF INDEX('0123456789',#temp) > 0.
    COMPUTE Z = CONCAT(SUBSTR(Z,1,#i-1),#temp).
  ELSE.
    COMPUTE Z = CONCAT(SUBSTR(Z,1,#i-1)," ").
  END IF.
END LOOP.
EXECUTE.

Code Golf - Word Scrambler

Please answer with the shortest possible source code for a program that converts an arbitrary plaintext to its corresponding ciphertext, following the sample input and output I have given below. Bonus points* for the least CPU time or the least amount of memory used.
Example 1:
Plaintext: The quick brown fox jumps over the lazy dog. Supercalifragilisticexpialidocious!
Ciphertext: eTh kiquc nobrw xfo smjup rvoe eth yalz .odg !uioiapeislgriarpSueclfaiitcxildcos
Example 2:
Plaintext: 123 1234 12345 123456 1234567 12345678 123456789
Ciphertext: 312 4213 53124 642135 7531246 86421357 975312468
Rules:
Punctuation is defined to be included with the word it is closest to.
The center of a word is defined to be ceiling((strlen(word)+1)/2).
Whitespace is ignored (or collapsed).
Odd words move to the right first. Even words move to the left first.
You can think of it as reading every other character backwards (starting from the end of the word), followed by the remaining characters forwards. Corporation => XoXpXrXtXoX => niaorCoprto.
Thank you to those who pointed out the inconsistency in my description. This has led many of you down the wrong path, for which I apologize. Rule #4 should clear things up.
*Bonus points will only be awarded if Jeff Atwood decides to do so. Since I haven't checked with him, the chances are slim. Sorry.
Python, 50 characters
For input in i:
' '.join(x[::-2]+x[len(x)%2::2]for x in i.split())
Alternate version that handles its own IO:
print ' '.join(x[::-2]+x[len(x)%2::2]for x in raw_input().split())
A total of 66 characters if including whitespace. (Technically, the print could be omitted if running from a command line, since the evaluated value of the code is displayed as output by default.)
Alternate version using reduce:
' '.join(reduce(lambda x,y:y+x[::-1],x) for x in i.split())
59 characters.
Original version (both even and odd go right first) for an input in i:
' '.join(x[::2][::-1]+x[1::2]for x in i.split())
48 characters including whitespace.
Another alternate version which (while slightly longer) is slightly more efficient:
' '.join(x[len(x)%2-2::-2]+x[1::2]for x in i.split())
(53 characters)
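For readers who want to check the slicing, here is an ungolfed Python 3 sketch of the 50-character version. The function names are mine, and it is meant for readability rather than size.

```python
def scramble_word(word):
    # Every other character, read backwards starting from the last one.
    backwards = word[::-2]
    # The remaining characters, read forwards. For odd lengths the
    # backwards pass already took index 0, so start at index 1.
    forwards = word[len(word) % 2::2]
    return backwards + forwards


def scramble(text):
    # split() with no arguments also collapses runs of whitespace.
    return " ".join(scramble_word(w) for w in text.split())


if __name__ == "__main__":
    print(scramble("The quick brown fox jumps over the lazy dog."))
    print(scramble("123 1234 12345 123456 1234567 12345678 123456789"))
```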
J, 58 characters
>,&.>/({~(,~(>:#+:#i.#-#<.,+:#i.#>.)#-:)#<:##)&.><;.2,&' '
Haskell, 64 characters
unwords.map(map snd.sort.zip(zipWith(*)[0..]$cycle[-1,1])).words
Well, okay, 76 if you add in the requisite "import List".
Python - 69 chars
(including whitespace and linebreaks)
This handles all I/O.
for w in raw_input().split():
 o=""
 for c in w:o=c+o[::-1]
 print o,
Perl, 78 characters
For input in $_. If that's not acceptable, add six characters for either $_=<>; or $_=$s; at the beginning. The newline is for readability only.
for(split){$i=length;print substr$_,$i--,1,''while$i-->0;
print"$_ ";}print $/
C, 140 characters
Nicely formatted:
main(c, v)
char **v;
{
    for( ; *++v; )
    {
        char *e = *v + strlen(*v), *x;
        for(x = e-1; x >= *v; x -= 2)
            putchar(*x);
        for(x = *v + (x < *v-1); x < e; x += 2)
            putchar(*x);
        putchar(' ');
    }
}
Compressed:
main(c,v)char**v;{for(;*++v;){char*e=*v+strlen(*v),*x;for(x=e-1;x>=*v;x-=2)putchar(*x);for(x=*v+(x<*v-1);x<e;x+=2)putchar(*x);putchar(32);}}
Lua
130 char function, 147 char functioning program
Lua doesn't get enough love in code golf -- maybe because it's hard to write a short program when you have long keywords like function/end, if/then/end, etc.
First I write the function in a verbose manner with explanations, then I rewrite it as a compressed, standalone function, then I call that function on the single argument specified at the command line.
I had to format the code with <pre></pre> tags because Markdown does a horrible job of formatting Lua.
Technically you could get a smaller running program by inlining the function, but it's more modular this way :)
t = "The quick brown fox jumps over the lazy dog. Supercalifragilisticexpialidocious!"
T = t:gsub("%S+", -- for each word in t...
  function(w) -- argument: current word in t
    W = "" -- initialize new Word
    for i = 1,#w do -- iterate over each character in word
      c = w:sub(i,i) -- extract current character
      -- determine whether letter goes on right or left end
      W = (#w % 2 ~= i % 2) and W .. c or c .. W
    end
    return W -- swap word in t with inverted Word
  end)
-- code-golf unit test
assert(T == "eTh kiquc nobrw xfo smjup rvoe eth yalz .odg !uioiapeislgriarpSueclfaiitcxildcos")
-- need to assign to a variable and return it,
-- because gsub returns a pair and we only want the first element
f=function(s)c=s:gsub("%S+",function(w)W=""for i=1,#w do c=w:sub(i,i)W=(#w%2~=i%2)and W ..c or c ..W end return W end)return c end
-- 1 2 3 4 5 6 7 8 9 10 11 12 13
--34567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
-- 130 chars, compressed and written as a proper function
print(f(arg[1]))
--34567890123456
-- 16 (+1 whitespace needed) chars to make it a functioning Lua program,
-- operating on command line argument
Output:
$ lua insideout.lua 'The quick brown fox jumps over the lazy dog. Supercalifragilisticexpialidocious!'
eTh kiquc nobrw xfo smjup rvoe eth yalz .odg !uioiapeislgriarpSueclfaiitcxildcos
I'm still pretty new at Lua so I'd like to see a shorter solution if there is one.
For a minimal cipher on all args, written to stdout, we can do 111 chars:
for _,w in ipairs(arg)do W=""for i=1,#w do c=w:sub(i,i)W=(#w%2~=i%2)and W ..c or c ..W end io.write(W ..' ')end
But this approach does output a trailing space like some of the other solutions.
For an input in s:
f=lambda t,r="":t and f(t[1:],len(t)&1and t[0]+r or r+t[0])or r
" ".join(map(f,s.split()))
Python, 90 characters including whitespace.
TCL
125 characters
set s set f foreach l {}
$f w [gets stdin] {$s r {}
$f c [split $w {}] {$s r $c[string reverse $r]}
$s l "$l $r"}
puts $l
Bash - 133, assuming input is in $w variable
Pretty
for x in $w; do
    z="";
    for l in `echo $x|sed 's/\(.\)/ \1/g'`; do
        if ((${#z}%2)); then
            z=$z$l;
        else
            z=$l$z;
        fi;
    done;
    echo -n "$z ";
done;
echo
Compressed
for x in $w;do z="";for l in `echo $x|sed 's/\(.\)/ \1/g'`;do if ((${#z}%2));then z=$z$l;else z=$l$z;fi;done;echo -n "$z ";done;echo
Ok, so it outputs a trailing space.