I am coming across more and more situations in Scratch where I have to convert a number to its ASCII character or vice versa. There is no built-in function in the blocks for this.
My solution is to create a list of size 26 and append the letters A-Z in sequence, using a variable called alphabet = "abcdefghijklmnopqrstuvwxyz", iterating over it with a Repeat block and appending LETTER COUNT of ALPHABET to the list. The result is a list data structure with the letters A-Z at locations 1 to 26, in effect creating my own ASCII table.
To do a conversion, say from the number 26 to 'Z', I have to iterate over the list to get the correct character value. This really slows down a program that is heavily dependent on the CHR() feature. Is there a better or more efficient solution?
My solution is to create a list of size 26 and append letters A-Z into each sequence using a variable called alphabet = "abcdefghijklmnopqrstuvwxyz"
Stop right there. You don't even need to convert it into a list. You can just look it up directly from the string.
Getting a character from an index is easy: just use the (letter () of []) block.
Getting the index of a character is more complex. Unfortunately, Scratch doesn't have a built-in way to do that. What I would do here is define an index of [] in [] custom pseudo-reporter block:
define index of (char) in (str)
set [i v] to [1]
repeat until <<(i) = (length of (str))> or <(letter (i) of (str)) = (char)>>
change [i v] by (1)
You can then call the block as index of [a] in (alphabet) and it will set the i variable to 1.
This code doesn't handle the case where the character isn't found, but you can add a check for that after the loop if you need it.
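For comparison outside Scratch, both directions are cheap in a general-purpose language. Here is a minimal sketch in Go; the names charAt and indexOf are just illustrative:

package main

import (
	"fmt"
	"strings"
)

const alphabet = "abcdefghijklmnopqrstuvwxyz"

// charAt mirrors Scratch's (letter (i) of (alphabet)): 1-based index to letter.
func charAt(i int) string {
	return string(alphabet[i-1])
}

// indexOf mirrors the custom block above: letter to 1-based index (0 if absent).
func indexOf(char string) int {
	return strings.Index(alphabet, char) + 1
}

func main() {
	fmt.Println(charAt(26))   // z
	fmt.Println(indexOf("a")) // 1
}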
You could also use Snap!, which is similar to Scratch but has more blocks. Snap! has a unicode block that will convert a character to its ASCII or Unicode value.
I've looked into the underscore for drop/cut, but this only seems to remove the first or last n entries, not characters. Any ideas?
Depends on what you're applying drop/cut to.
Can you provide an example of your values?
Below shows how cut can be used on a string and then on a list of strings.
It uses each-right (/:) to drop a value from each item.
http://code.kx.com/q/ref/adverbs/#each-right
q)1_"12456789"
"2456789"
q)
q)1_("12456789";"12456789")
"12456789"
q)
q)1_/:("12456789";"12456789")
"2456789"
"2456789"
@Connor Gervin had almost what I wanted, but if you want to cast back to a symbol, you can use `$(-3)_'string sym from tab
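For readers less familiar with q: n_ drops the first n elements, and /: (each-right) applies the operation to each item on its right. A rough Go analogue of the each-right example above (dropFirst is an illustrative name, and this sketch only handles strings, whereas q's drop works on lists generally):

package main

import "fmt"

// dropFirst returns s without its first n runes, like q's n_s on a string.
func dropFirst(n int, s string) string {
	r := []rune(s)
	if n >= len(r) {
		return ""
	}
	return string(r[n:])
}

func main() {
	items := []string{"12456789", "12456789"}
	// Equivalent of q's 1_/:items: apply the drop to each item.
	for i, s := range items {
		items[i] = dropFirst(1, s)
	}
	fmt.Println(items) // [2456789 2456789]
}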
What would be the best way to match a word such as "Hi", or a name like "dön" with a special character in it, using a pattern? Such characters would be optional, so the pattern should presumably use a '?', but I don't know what character class to use to match them.
I basically want to make sure that I am getting words with possible Unicode characters in them but nothing else. So "dön" would be fine, but no other special characters or numbers, and nothing like brackets.
According to the Lua guide on Unicode, "Lua's pattern matching facilities work byte by byte. In general, this will not work for Unicode pattern matching, although some things will work as you want". This means the best option is probably to iterate over each character and work out if it is a valid letter. To loop over each unicode character in a string:
for character in string.gmatch(myString, "([%z\1-\127\194-\244][\128-\191]*)") do
-- Do something with the character
end
Note that this method will not work if myString isn't valid UTF-8. To check whether a character is one that you want, it's probably best to simply have a list of all the characters you don't want in your strings and then exclude them:
local notAllowed = ":()[]{}+_-=\\|`~,.<>/?!@#$%^&*"
local isValid = true
for character in string.gmatch(myString, "([%z\1-\127\194-\244][\128-\191]*)") do
    -- Plain find (fourth argument true), so magic characters such as "("
    -- are not treated as a pattern.
    if notAllowed:find(character, 1, true) then
        isValid = false
        break
    end
end
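If you can rely on Unicode-aware library support, an allow-list is simpler than the deny-list above. Here is a sketch of the same check in Go using the standard unicode package (isWord is an illustrative name):

package main

import (
	"fmt"
	"unicode"
)

// isWord reports whether s consists entirely of letters, so "Hi" and "dön"
// pass but strings containing digits or brackets do not.
func isWord(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if !unicode.IsLetter(r) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isWord("dön"), isWord("Hi"), isWord("d[n")) // true true false
}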
Hope this helped.
I have this kind of symbols in a db table (Наиме), and I don't know who inserted this data into the table. Is there any way to convert them to Cyrillic?
Yes, you can do the conversion. Since you haven't mentioned any language, here is just the logic:
1. Assuming the string length is even, take the next two characters.
2. The two visible characters are really the two bytes of one multi-byte Cyrillic character. Combine their underlying byte values into a two-byte sequence and decode it using the proper format, such as UTF-8.
3. Repeat steps 1 and 2 for the next pair of characters until the end of the string.
If you want, you can implement it in any language of your choice.
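For example, here is a sketch in Go, assuming the original UTF-8 bytes were misdecoded as Windows-1251 (the sample string and the codepage are assumptions; adjust to however your data was mangled). Re-encoding the garbled text back to single bytes recovers the original UTF-8:

package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// "РќР°РёРјРµ" is what UTF-8 "Наиме" looks like when misread as Windows-1251.
	garbled := "РќР°РёРјРµ"
	// Map each visible character back to its single-byte Windows-1251 value...
	raw, err := charmap.Windows1251.NewEncoder().String(garbled)
	if err != nil {
		panic(err)
	}
	// ...and interpret those bytes as UTF-8 again.
	fmt.Println(raw) // Наиме
}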
I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table and then calculate the difference between a given text and known corpora.
This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ASCII characters, but I would very much like to have it working with multi-byte Unicode support. This is where I'm doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Due to the way I am building the ngram sequence, and due to the fact that these specific runes are two bytes long, two ngrams appear in which the first byte is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am over analysing this problem.
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a missing key from a map returns the zero value (which for int is 0), so the unknown-key check in your code is redundant.
func Parse(text string, n int) map[string]int {
	// Circular buffer of runes: each rune is mirrored n slots ahead, so
	// chars[k:k+n] is always a contiguous window of the last n runes
	// without any shifting.
	chars := make([]rune, 2*n)
	table := make(map[string]int)
	k := 0
	// Ranging over a string yields runes, so multi-byte UTF-8 characters
	// are handled correctly; whitespace runs are collapsed to single spaces.
	for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
		chars[n+k] = chars[k] // mirror the rune so the window never wraps
		k = (k + 1) % n
		table[string(chars[k:k+n])]++
	}
	return table
}
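The property this relies on is that ranging over a string in Go decodes UTF-8 and yields runes, not bytes; the loop index is a byte offset that can advance by more than one. A quick self-contained demonstration:

package main

import "fmt"

func main() {
	// "ü" is two bytes in UTF-8, so the byte offset jumps from 2 to 4.
	for i, r := range "grün" {
		fmt.Printf("byte offset %d: %c (U+%04X)\n", i, r, r)
	}
}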
I'm searching for a sort of hash function to index similar text. For example, if we have two very long texts called "A" and "B", where A and B differ only slightly, then the hash function (call it H) applied to A and B should return the same number.
So H(A) = H(B) where A and B are similar text.
I tried "DoubleMetaphone" (I use Italian-language text), but I saw that it depends very strongly on the string prefix. For example:
A = "This is the very long text that I want to hash"
B = "This is the very"
==> doubleMetaPhone(A) = doubleMetaPhone(B)
And this is not good for me, because strings with merely the same prefix would be treated as similar, and I don't want that.
Could anyone suggest another way?
see http://en.wikipedia.org/wiki/Locality_sensitive_hashing
Your problem is (close to) insoluble for many distance functions between strings.
Most distance functions (e.g. edit distance) allow you to transform any string into any other via a sequence of distance-1 transformations:
"AAAA" -> "AAAB" -> "AAABC"
According to your requirements, the first and second strings should have the same hash value. But so must the second and the third, and so on. So all strings would have to share the same hash if we allow any pair at distance 1 to have the same hash value.
Even if we impose a higher threshold on the distance (maybe in relation to string length), we'll end up with a messy result.
A better (IMO) approach is to find an equivalence relation on the set of strings, such that every string in an equivalence class has the same hash. One possibility is to define classes by their distance to a predefined string (e.g. edit distance from "AAAAA"), with that distance itself serving as the hash value. This approach is probably not the best in your case, but with some extra info on the problem we might come up with a better equivalence relation.
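A minimal sketch of that last idea in Go: hash a string as its edit distance from a fixed anchor. The anchor "AAAAA" and the function names are illustrative, and plain Levenshtein distance stands in for whatever metric fits your data:

package main

import "fmt"

// levenshtein computes the edit distance between two strings over runes,
// using the classic two-row dynamic programming table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = minInt(prev[j]+1, minInt(cur[j-1]+1, prev[j-1]+cost))
		}
		prev = cur
	}
	return prev[len(rb)]
}

func minInt(x, y int) int {
	if x < y {
		return x
	}
	return y
}

// hash puts all strings at the same distance from the anchor into the
// same equivalence class.
func hash(s string) int {
	return levenshtein(s, "AAAAA")
}

func main() {
	fmt.Println(hash("AAAA"), hash("AAAB"), hash("AAABC"))
}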