Casting a column from hexadecimal string to uint64? - python-polars

As part of the Kaggle competition (https://www.kaggle.com/competitions/amex-default-prediction/overview), I'm trying to take advantage of a trick where other competitors sharing their solutions reduce the size of a column by interpreting a hexadecimal string as a base-16 uint64. I'm trying to work out if this is possible in polars/rust:
# The python approach - this is used via .apply in pandas.
string = "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a"

def func(x):
    return int(x[-16:], 16)

func(string)
# 13914591055249847850
My attempt at a solution in polars yields nearly the right answer, but the final digits are off, which is a bit confusing:
import polars as pl

def func(x: str) -> int:
    return int(x[-16:], 16)

strings = [
    "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a",
    "00000fd6641609c6ece5454664794f0340ad84dddce9a267a310b5ae68e9d8e5",
]
df = pl.DataFrame({"id": strings})
result_polars = df.with_column(pl.col("id").apply(func).cast(pl.UInt64)).to_series().to_list()
result_python = [func(x) for x in strings]
result_polars, result_python
# ([13914591055249848320, 11750091188498716672],
# [13914591055249847850, 11750091188498716901])
I've also tried casting directly from Utf8 to UInt64, but I get the following error (and nulls if I pass strict=False).
df.with_column(pl.col("id").str.slice(-16).cast(pl.UInt64)).to_series().to_list()
###
ComputeError: strict conversion of cast from Utf8 to UInt64 failed. consider non-strict cast.
If you were trying to cast Utf8 to Date,Time,Datetime, consider using `strptime`

The values you return from func are:
13914591055249847850
11750091188498716901
These values are larger than can be represented by a pl.Int64, which is what polars uses for Python's int type. If a value overflows, polars falls back to Float64 instead, but this comes with loss of precision.
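You can reproduce that rounding in plain Python (a quick check, not part of the original answer): at this magnitude a 64-bit float has only 53 bits of precision, so the exact integer is rounded to the nearest representable value, which is exactly what polars returned.
exact = 13914591055249847850   # larger than 2**63 - 1, so it does not fit in an Int64
print(int(float(exact)))       # 13914591055249848320 -- the value polars produced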
A better solution
Taking just the last 16 characters of a string throws away a lot of information, meaning you can easily have collisions. It's better to use a hash function that tries to avoid collisions.
You could use the hash expression. This gives you a better-quality hash, and it will be much faster as you don't run any Python code.
df.with_columns([
    pl.col("id").hash(seed=0)
])
shape: (2, 1)
┌─────────────────────┐
│ id                  │
│ ---                 │
│ u64                 │
╞═════════════════════╡
│ 478697168017298650  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7596707240263070258 │
└─────────────────────┘

Related

Fold operation in Scala Vs Java 8

How to perform the below Scala operation to find the most frequent character in a string in Java 8?
val tst = "Scala is awesomestttttts"
val op = tst.foldLeft(Map[Char, Int]())((a, b) => {
  a + (b -> (a.getOrElse(b, 0) + 1))
}).maxBy(f => f._2)
Here the output is
(Char, Int) = (t,6)
I was able to get a stream of characters in Java 8 like this:
Stream<Character> sch = tst.chars().mapToObj(i -> (char)i);
but I'm not able to figure out what the fold/foldLeft/foldRight alternative is in Java 8.
Can someone please help?
Something like this seems to match with the Scala code you provided (if I understand it correctly):
String tst = "Java is awesomestttttts";
Optional<Map.Entry<Character, Long>> max =
        tst.chars()
           .mapToObj(i -> (char) i)
           .collect(Collectors.groupingBy(Function.identity(),
                                          Collectors.counting()))
           .entrySet()
           .stream()
           .max(Comparator.comparing(Map.Entry::getValue));
System.out.println(max.orElse(null));
If you don't mind using a third-party library, Eclipse Collections has a Bag type that can keep track of the character counts. I've provided two examples below that use Bags. Unfortunately there is no maxByOccurrences available on Bag today, but the same result can be achieved by using topOccurrences(1), which is available. You can also use forEachWithOccurrences to find the max, but it takes a little more code.
The following example uses a CharAdapter, which is also included in Eclipse Collections.
MutableBag<Character> characters =
        CharAdapter.adapt("Scala is awesomestttttts")
                   .collect(Character::toLowerCase)
                   .toBag();
MutableList<ObjectIntPair<Character>> charIntPairs = characters.topOccurrences(2);
Assert.assertEquals(
        PrimitiveTuples.pair(Character.valueOf('t'), 6), charIntPairs.get(0));
Assert.assertEquals(
        PrimitiveTuples.pair(Character.valueOf('s'), 5), charIntPairs.get(1));
The second example uses the chars() method available on String which returns an IntStream. It feels a bit awkward that something called chars() does not return a CharStream, but this is because CharStream is not available in JDK 8.
MutableBag<Character> characters =
        "Scala is awesomestttttts"
                .toLowerCase()
                .chars()
                .mapToObj(i -> (char) i)
                .collect(Collectors.toCollection(Bags.mutable::empty));
MutableList<ObjectIntPair<Character>> charIntPairs = characters.topOccurrences(2);
Assert.assertEquals(
        PrimitiveTuples.pair(Character.valueOf('t'), 6), charIntPairs.get(0));
Assert.assertEquals(
        PrimitiveTuples.pair(Character.valueOf('s'), 5), charIntPairs.get(1));
In both examples, I converted the characters to lowercase first, so there are 5 occurrences of 's'. If you want uppercase and lowercase letters to be distinct then just drop the lowercase code in both examples.
Note: I am a committer for Eclipse Collections.
Here is a sample using the Stream in abacus-common:
String str = "Scala is awesomestttttts";
CharStream.from(str).boxed().groupBy(t -> t, Collectors.counting())
          .max(Comparator.comparing(Map.Entry::getValue)).get();
But I think the simplest way is with a Multiset:
CharStream.from(str).toMultiset().maxOccurrences().get();

Converting number in scientific notation to int

Could someone explain why I cannot use int() to convert an integer number represented as a string in scientific notation into a Python int?
For example this does not work:
print int('1e1')
But this does:
print int(float('1e1'))
print int(1e1) # Works
Why does int not recognise the string as an integer? Surely it's as simple as checking the sign of the exponent?
Behind the scenes, a number in scientific notation is always represented internally as a float. The reason is the varying value range: an integer only maps to a fixed range, say 2^32 values. The scientific representation is similar to the floating-point representation, with a significand and an exponent. You can look up further details at https://en.wikipedia.org/wiki/Floating_point.
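For instance (a small illustration in Python, not from the original answer), you can see that significand/exponent split directly:
import math

print(math.frexp(10.0))  # (0.625, 4): 10.0 is stored as 0.625 * 2**4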
You cannot cast a string containing a number in scientific notation directly to an integer.
print int(1e1) # Works
Works because 1e1 as a number is already a float.
>>> type(1e1)
<type 'float'>
Back to your question: we want to get an integer from a float or a scientific-notation string. Details: https://docs.python.org/2/reference/lexical_analysis.html#integers
>>> int("13.37")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '13.37'
For float or scientific-notation strings you have to take the intermediate step through float.
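For example, applying that intermediate step to the strings above:
>>> int(float('1e1'))
10
>>> int(float('13.37'))
13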
Very simple solution:
print(int(float('1e1')))
Steps:
1. First convert the scientific-notation value to a float.
2. Convert that float value to an int.
3. You now have the int you were after.
Enjoy.
Because in Python (at least in 2.x, since I do not use Python 3.x), int() behaves differently on strings and on numeric values. If you pass in a string, Python will try to parse it as a base-10 int:
int("077")
>> 77
But if you pass in a valid numeric value, Python will first interpret it according to its base and type and then convert it to a base-10 int. Below, Python first interprets 077 as a base-8 literal and evaluates it to 63 in base 10; int() then just returns it.
int(077) # Leading 0 defines a base 8 number.
>> 63
077
>> 63
So int('1e1') tries to parse '1e1' as a base-10 string and throws a ValueError. But the literal 1e1 is a numeric value (a mathematical expression):
1e1
>> 10.0
So int() handles it as a numeric value: Python first evaluates the literal 1e1 to the float 10.0, and int() then converts that float to an integer.
So when calling int() with a string value, you must be sure the string is a valid base-10 integer.
int(float('1e+001')) will work.
Whereas, as others have mentioned, 1e1 is already a float.

How to strip everything except digits from a string in Scala (quick one liners)

This is driving me nuts... there must be a way to strip out all non-digit characters (or perform other simple filtering) in a String.
Example: I want to turn a phone number ("+72 (93) 2342-7772" or "+1 310-777-2341") into a simple numeric String (not an Int), such as "729323427772" or "13107772341".
I tried "[\\d]+".r.findAllIn(phoneNumber) which returns an Iteratee and then I would have to recombine them into a String somehow... seems horribly wasteful.
I also came up with: phoneNumber.filter("0123456789".contains(_)) but that becomes tedious for other situations. For instance, removing all punctuation... I'm really after something that works with a regular expression so it has wider application than just filtering out digits.
Anyone have a fancy Scala one-liner for this that is more direct?
You can use filter, treating the string as a character sequence and testing the character with isDigit:
"+72 (93) 2342-7772".filter(_.isDigit) // res0: String = 729323427772
You can use replaceAll and Regex.
"+72 (93) 2342-7772".replaceAll("[^0-9]", "") // res1: String = 729323427772
Another approach: define the collection of valid characters, in this case
val d = '0' to '9'
and so, for val a = "+72 (93) 2342-7772", filter on collection inclusion, for instance with any of these:
for (c <- a if d.contains(c)) yield c
a.filter(d.contains)
a.collect{ case c if d.contains(c) => c }

Different results from Murmur3 from Scala and Guava

I am trying to generate hashes using the Murmur3 algorithm. The hashes are consistent, but Scala and Guava return different values.
class package$Test extends FunSuite {
  test("Generate hashes") {
    println(s"Seed = ${MurmurHash3.stringSeed}")
    val vs = Set("abc", "test", "bucket", 111.toString)
    vs.foreach { x =>
      println(s"[SCALA] Hash for $x = ${MurmurHash3.stringHash(x).abs % 1000}")
      println(s"[GUAVA] Hash for $x = ${Hashing.murmur3_32().hashString(x).asInt().abs % 1000}")
      println(s"[GUAVA with seed] Hash for $x = ${Hashing.murmur3_32(MurmurHash3.stringSeed).hashString(x).asInt().abs % 1000}")
      println()
    }
  }
}
Seed = -137723950
[SCALA] Hash for abc = 174
[GUAVA] Hash for abc = 419
[GUAVA with seed] Hash for abc = 195
[SCALA] Hash for test = 588
[GUAVA] Hash for test = 292
[GUAVA with seed] Hash for test = 714
[SCALA] Hash for bucket = 413
[GUAVA] Hash for bucket = 22
[GUAVA with seed] Hash for bucket = 414
[SCALA] Hash for 111 = 250
[GUAVA] Hash for 111 = 317
[GUAVA with seed] Hash for 111 = 958
Why am I getting different hashes?
It looks to me like Scala's hashString converts pairs of UTF-16 chars to ints differently than Guava's hashUnencodedChars (hashString with no Charset was renamed to that).
Scala:
val data = (str.charAt(i) << 16) + str.charAt(i + 1)
Guava:
int k1 = input.charAt(i - 1) | (input.charAt(i) << 16);
In Guava, the char at an index i becomes the 16 least significant bits of the int and the char at i + 1 becomes the 16 most significant bits. In the Scala implementation, that's reversed: the char at i is the most significant while the char at i + 1 is the least significant. (The fact that the Scala implementation uses + rather than | could also be significant, I imagine.)
Note that the Guava implementation is equivalent to using ByteBuffer.putChar(c) twice to put two characters into a little endian ByteBuffer, then using ByteBuffer.getInt() to get an int value back out. The Guava implementation is also equivalent to encoding the characters to bytes using UTF-16LE and hashing those bytes. The Scala implementation is not equivalent to encoding the string in any of the standard charsets that JVMs are required to support. In general, I'm not sure what precedent (if any) Scala has for doing it the way it does.
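To make the packing difference concrete, here is a small sketch (in Python, purely for illustration; it is not part of either library) of how a pair of UTF-16 code units 'a' (0x0061) and 'b' (0x0062) would be combined under each scheme:
a, b = 0x61, 0x62            # code units for 'a' and 'b'
scala_style = (a << 16) + b  # first char in the high 16 bits
guava_style = a | (b << 16)  # first char in the low 16 bits
print(hex(scala_style))      # 0x610062
print(hex(guava_style))      # 0x620061
Since these different 32-bit blocks are fed into the same mixing steps, the resulting hashes differ.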
Edit:
The Scala implementation does another thing differently from the Guava implementation as well: it passes the number of chars being hashed to the finalizeHash method, where Guava's implementation passes the number of bytes to the equivalent fmix method.
I believe hashString(x, StandardCharsets.UTF_16BE) should match Scala's behavior. Let us know.
(Also, please upgrade your Guava to something newer!)

Breaking Hexadecimal Value to multiple lines in CoffeeScript

How do I break a long hexadecimal value in CoffeeScript so that it spans multiple lines?
authKey = 0xe6b86ae8bdf696009c90e0e650a92c63d52a4b3232cca36e0ff2f5911e93bd0067df904dc21ba87d29c32bf17dc88da3cc20ba65c6c63f21eaab5bdb29036b83
to something like
authKey = 0xe6b86ae8bdf696009c90e0e650a92c63d52a4b323\
2cca36e0ff2f5911e93bd0067df904dc21ba87d29c3\
2bf17dc88da3cc20ba65c6c63f21eaab5bdb29036b83
Using \ results in an Unexpected 'NUMBER' error;
using a line break results in an Unexpected 'INDENT' error.
There's actually no point in doing this in CoffeeScript, because numbers are stored as 64-bit IEEE 754 values and your key needs far more bits of precision than such a value can hold.
If you write
authKey = 0xe6b86ae8bdf696009c90e0e650a92c63d52a4b3232cca36e0ff2f5911e93bd0067df904dc21ba87d29c32bf17dc88da3cc20ba65c6c63f21eaab5bdb29036b83
console.log(authKey)
then the value logged is
1.2083806867379407e+154
You want to store your authKey as a string or byte array, both of which are trivial to write across multiple lines.
Like others have said, this doesn't really make a whole lot of sense to be stored in a number, as opposed to a string; however, I decided to throw something together to allow it anyway:
stringToNumber = ( str ) -> parseInt( str.replace( /\n/g, '' ) )
authKey = stringToNumber """
0xe6b86ae8bdf696009c90e0e650a92c63d52a4b323
2cca36e0ff2f5911e93bd0067df904dc21ba87d29c3
2bf17dc88da3cc20ba65c6c63f21eaab5bdb29036b83
"""
Like Ray said, this will just result in:
1.2083806867379407e+154