Generate Random Hexidecimal in Scala? - scala

How can I generate a random hexadecimal in scala?
The purpose it to essentially to use this as a UDF to generate random 64 char hexadecimals per column in a DataFrame.
I understand one can utilize below for an Int and etc:
val r = scala.util.Random
println(r.nextInt)
Is there an equivalent or another method simple method for a hexadecimal? particularly with 64 chars? Ex) 6e89f0c4c8a86812ef594229e5f4d997cb38aadc8a694f1b3be24a543b7699de

Since a Byte is 2 hex digits, one can generate an array of 32 random bytes, render them as hex, and join them into a string:
def randomHex256(): String = {
val arr = Array[Byte](32)
scala.util.Random.nextBytes(arr)
// iterator avoids creating a strict intermediate collection
arr.iterator.map(b => String.format("%02x", Byte.box(b))).mkString("")
}

Below is a sample code(Scala) for base64 where concept for the hex generation will be similar, the difference is as explained below:
base64 has less overhead (base64 produces 4 characters for every 3 bytes of original data while hex produces 2 characters for every byte of original data).
import java.util.Base64
def encodeToBase64String(bytes: Array[Byte]): String = Base64.getEncoder.encodeToString(bytes)
val dm_with_clsr_two =(inputString:String) => encodeToBase64String(inputString.getBytes("UTF-8"))
spark.udf.register("DATA_MASK_TWO", dm_with_clsr_two)
spark.sql("select id,DATA_MASK_TWO(id), gender, birthdate, maiden_name, lname, fname, address, city, state, zip, cc_number, DATA_MASK_TWO(cc_number), cc_cvc, cc_expiredate from sample_ssn_data").show(5,false)
+-----------+--------------------------------+------+----------+-----------+------+--------+--------------------+-----------+-----+-----+-------------------+--------------------------------+------+-------------+
|id |UDF:DATA_MASK_ONE(id) |gender|birthdate |maiden_name|lname |fname |address |city |state|zip |cc_number |UDF:DATA_MASK_TWO(cc_number) |cc_cvc|cc_expiredate|
+-----------+--------------------------------+------+----------+-----------+------+--------+--------------------+-----------+-----+-----+-------------------+--------------------------------+------+-------------+
|2022-25-005|4DDA8A5D35947B12B948EFF6EF14579A|m |1958/04/21|Smooth |White |John |10932 California Rd |Calfornia creek |CA|94025|5270 2020 2022 5516|4F88DDF6489891710B9C5A5D8412129E|123 |2010/06/25 |

Related

convert ByteArray to String to ByteArray

I want to convert ByteArray to string and then convert the string to ByteArray,But while converting values changed. someone help to solve this problem.
person.proto:
syntax = "proto3";
message Person{
string name = 1;
int32 age = 2;
}
After sbt compile it gives case class Person (created by google protobuf while compiling)
My MainClass:
val newPerson = Person(
name = "John Cena",
age = 44 //output
)
println(newPerson.toByteArray) //[B#50da041d
val l = newPerson.toByteArray.toString
println(l) //[B#7709e969
val l1 = l.getBytes
println(l1) //[B#f44b405
why the values changed?? how to convert correctly??
[B#... is the format that a JVM byte array's .toString returns, and is just [B (which means "byte array") and a hex-string which is analogous to the memory address at which the array resides (I'm deliberately not calling it a pointer but it's similar; the precise mapping of that hex-string to a memory address is JVM-dependent and could be affected by things like which garbage collector is in use). The important thing is that two different arrays with the same bytes in them will have different .toStrings. Note that in some places (e.g. the REPL), Scala will instead print something like Array(-127, 0, 0, 1) instead of calling .toString: this may cause confusion.
It appears that toByteArray emits a new array each time it's called. So the first time you call newPerson.toByteArray, you get an array at a location corresponding to 50da041d. The second time you call it you get a byte array with the same contents at a location corresponding to 7709e969 and you save the string [B#7709e969 into the variable l. When you then call getBytes on that string (saving it in l1), you get a byte array which is an encoding of the string "[B#7709e969" at the location corresponding to f44b405.
So at the locations corresponding to 50da041d and 7709e969 you have two different byte arrays which happen to contain the same elements (those elements being the bytes in the proto representation of newPerson). At the location corresponding to f44b405 you have a byte array where the bytes encode (in some character set, probably UTF-16?) [B#7709e969.
Because a proto isn't really a string, there's no general way to get a useful string (depending on what definition of useful you're dealing with). You could try interpreting a byte array from toByteArray as a string with a given character encoding, but there's no guarantee that any given proto will be valid in an arbitrary character encoding.
An encoding which is purely 8-bit, like ISO-8859-1 is guaranteed to at least be decodable from a byte array, but there could be non-printable or control characters, so it's not likely to that useful:
val iso88591Representation = new String(newPerson.toByteArray, java.nio.charset.StandardCharsets.ISO_8859_1)
Alternatively, you might want a representation like how the Scala REPL will (sometimes) render it:
"Array(" + newPerson.toByteArray.mkString(", ") + ")"

How do i replace whitespace with underscore and encode values in scala array / list

I have a spark scala dataframe which has column "Name"
I have extracted the values of that column in to scala array[string]
org_name: Array[String] = Array(SARATOGA SENIOR HIGH SCHOOL)
I want to replace whitespaces with _ and encode that value in to utf-8 (any encoding is fine as long as it replaces special chars with something else)
so if there are any special chars those will be removed. later i want to use those in file path .
var org_name = orgsFlatDF.rdd.collect
.map( _.getString(2))
This is how i am extracting those vals ^^. I haven't found any method which I can use to do that. Replace or replaceall doesn't work on array
I tried this :
org_name.replace("\\s", "")
That didn't work .
Expected output : SARATOGA_SENIOR_HIGH_SCHOOL
if name is : new $ high school it should gets converted to new_$_high_school then encoded to new_%24_high_school
There are a couple of issues with what you are asking.
Java/Scala Arrays don't have a replace method. Even if they did have a replace method, would they replace the values they hold or the characters in a String they hold?
Let's assume this line org_name.replace("\\s", "") didn't compiled and org_name is indeed a an Array[String] holding one element.
scala> val org_name=Array("SARATOGA SENIOR HIGH SCHOOL")
val org_name: Array[String] = Array(SARATOGA SENIOR HIGH SCHOOL)
scala> org_name(0).replace(" ","_")
val res15: String = SARATOGA_SENIOR_HIGH_SCHOOL
replace("\\s","_") wouldn't work because it represents a \s string. "\" represents \. That's only way you'd be able to define strings containing other escape codes like \n or \t.
PS: to transform all the string in the array use org_name.map(_.replace(" ","_")), this gives you back another another array.

Most efficient way to format a string in scala with leading Zero's and a comma instead of a decimal point

Trying to create a simple function whereby a String value is passed in i.e. "1" and the formatter should return the value with leading zeros and 5 decimal points however instead of a dot '.' I'm trying to return it with a comma ','
This is what I have attempted however its not working because the decimalFormatter can only handle numbers and not a string. The end goal is to get from "1" to "000000001,00000" - character length is 14 in total. 5 0's after the comma and the remaining before the comma should be padded out 0's to fill the 9 digit requirement.
Another example would be going from "913" to "000000913,00000"
def numberFormatter (value: String): String =
{
import java.text.DecimalFormat
val decimalFormat = new DecimalFormat("%09d,00000")
val formattedValue = decimalFormat.format(value)
return formattedValue
}
Any help would be much appreciated. Thank you in advance.
It's easy enough to format a String padded with spaces, but with zeros not so much. Still, it's not so hard to roll your own.
def numberFormatter(value :String) :String =
("0" * (9 - value.length)) + value + ",00000"
numberFormatter("1") //res0: String = 000000001,00000
numberFormatter("913") //res1: String = 000000913,00000
Note that this won't truncate the input String. So if value is longer than 9 characters then the result will be longer than the desired 15 characters.
def f(value:String) = new java.text.DecimalFormat("000000000.00000").format(value.toFloat).replace(".", ",")
scala> f(913f)
res5: String = 000000913,00000
// Edit: Use .toFloat (or .toInt, .toLong, etc.) to convert your string to a number first.

How to convert string in UTF-8 to ASCII ignoring errors and removing non ASCII characters

I am new to Scala.
Please advise how to convert strings in UTF-8 to ASCII ignoring errors and removing non ASCII characters in output.
For example, how to remove non ASCII character \uc382 from result string: "hello���", so that "hello" is printed in output.
scala.io.Source.fromBytes("hello\uc382".getBytes ("UTF-8"), "US-ASCII").mkString
val str = "hello\uc382"
str.filter(_ <= 0x7f) // keep only valid ASCII characters
If you had text in UTF-8 as bytes that is now in a String then it was converted.
If you have text in a String and you want it in ASCII as bytes, you can convert it later.
It seems that you just want to filter for only the UTF-16 code units for the C0 Controls and Basic Latin codepoints. Fortunately, such codepoints take only one code unit so we can filter them directly without converting them to codepoints.
"hello\uC382"
.filter(Character.UnicodeBlock.of(_) == Character.UnicodeBlock.BASIC_LATIN)
.getBytes(StandardCharsets.US_ASCII)
.foreach {
println }
With the question generalized to an arbitrary, known character encoding, filtering doesn't do the job. Instead, the feature of the encoder to ignore characters that are not present in the target Charset can be used. An Encoder requires a bit more wrapping and unwrapping. (The API design is based on streaming and reusing the buffer within the same stream and even other streams.) So, with ISO_8859_1 as an example:
val encoder = StandardCharsets.ISO_8859_1
.newEncoder()
.onMalformedInput(CodingErrorAction.IGNORE)
.onUnmappableCharacter(CodingErrorAction.IGNORE)
val string = "ñhello\uc382"
println(string)
val chars = CharBuffer.allocate(string.length())
.put(string)
chars.rewind()
val buffer = encoder.encode(chars)
val bytes = Array.ofDim[Byte](buffer.remaining())
buffer.get(bytes)
println(bytes)
bytes
.foreach {
println }

I was wondering if someone could explain to me .decode and .encode in hashlib?

I understand that you have a hex string and perform SHA256 on it twice and then byte-swap the final hex string. The goal of this code is to find a Merkle Root by concatenating two transactions. I would like to understand what's going on in the background a bit more. What exactly are you decoding and encoding?
import hashlib
transaction_hex = "93a05cac6ae03dd55172534c53be0738a50257bb3be69fff2c7595d677ad53666e344634584d07b8d8bc017680f342bc6aad523da31bc2b19e1ec0921078e872"
transaction_bin = transaction_hex.decode('hex')
hash = hashlib.sha256(hashlib.sha256(transaction_bin).digest()).digest()
hash.encode('hex_codec')
'38805219c8ac7e9a96416d706dc1d8f638b12f46b94dfd1362b5d16cf62e68ff'
hash[::-1].encode('hex_codec')
'ff682ef66cd1b56213fd4db9462fb138f6d8c16d706d41969a7eacc819528038'
header_hex is a regular string of lower case ASCII characters and the decode() method with 'hex' argument changes it to a (binary) string (or bytes object in Python 3) with bytes 0x93 0xa0 etc. In C it would be an array of unsigned char of length 64 in this case.
This array/byte string of length 64 is then hashed with SHA256 and its result (another binary string of size 32) is again hashed. So hash is a string of length 32, or a bytes object of that length in Python 3. Then encode('hex_codec') is a synomym for encode('hex') (in Python 2); in Python 3, it replaces it (so maybe this code is meant to work in both versions). It outputs an ASCII (lower hex) string again that replaces each raw byte (which is just a small integer) with a two character string that is its hexadecimal representation. So the final bit reverses the double hash and outputs it as hexadecimal, to a form which I usually call "lowercase hex ASCII".