Encrypting values in a JSON array in data frame - pyspark

We have a DataFrame column named json that holds a struct. Inside a nested struct there is an array-typed field called address. We want to encrypt some of the fields in the array's elements using Fernet. So far I have only got this far:
df = df.withColumn("json", col("json").withField(arrayPath, transform("json." + arrayPath, lambda x : x.withField(field, encrypt("json." + arrayPath + "." + field, lit(encryptKey))))))
arrayPath is the path to the array inside the struct, and the field variable holds the name of the field we want to encrypt. I get the error 'TypeError: encoding without a string argument' because I can't drill down to the string value in the encrypt function call. Any suggestions?
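One possible direction, offered as a sketch under assumptions rather than a confirmed fix: inside transform, the lambda's x is a Column expression rather than a Python string, and the PySpark documentation notes that Python UDFs are not supported inside higher-order functions, so a Python-level encrypt cannot reach the value there. An alternative is to process the whole array for each row in ordinary Python, for example from a UDF that takes the array column. The per-row logic might look like the following. The names encrypt_array_field and fake_encrypt are hypothetical, and fake_encrypt is a base64 stand-in that is NOT encryption; it only marks where a Fernet call would go.

```python
import base64


def fake_encrypt(value: str, key: str) -> str:
    # Stand-in only: base64 is NOT encryption and the key is ignored.
    # A real version would call Fernet here.
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def encrypt_array_field(elements, field, key, encrypt=fake_encrypt):
    """Return a copy of a list of dicts with `field` encrypted in each element."""
    if elements is None:
        return None
    result = []
    for element in elements:
        updated = dict(element)  # copy so the input is not mutated
        if updated.get(field) is not None:
            # Convert to str first: encoding to bytes requires a string argument.
            updated[field] = encrypt(str(updated[field]), key)
        result.append(updated)
    return result


addresses = [{"street": "1 Main St", "city": "Leeds"}, {"street": None, "city": "York"}]
encrypted = encrypt_array_field(addresses, "street", "k")
```

Inside a UDF, each array element would arrive as a Row, which can be converted with row.asDict() before being passed to a function like this.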

convert ByteArray to String to ByteArray

I want to convert a ByteArray to a String and then convert the String back to a ByteArray, but the values change during the conversion. Can someone help me solve this problem?
person.proto:
syntax = "proto3";
message Person {
  string name = 1;
  int32 age = 2;
}
After sbt compile, this produces a case class Person (generated by Google protobuf during compilation).
My MainClass:
val newPerson = Person(
  name = "John Cena",
  age = 44 //output
)
println(newPerson.toByteArray) //[B#50da041d
val l = newPerson.toByteArray.toString
println(l) //[B#7709e969
val l1 = l.getBytes
println(l1) //[B#f44b405
Why did the values change? How do I convert correctly?
[B#... is the format that a JVM byte array's .toString returns (a real JVM prints @ rather than # as the separator, e.g. [B@50da041d). [B means "byte array", and the hex string is the array's identity hash code, which is analogous to the memory address at which the array resides (I'm deliberately not calling it a pointer but it's similar; the precise mapping of that hex string to a memory address is JVM-dependent and could be affected by things like which garbage collector is in use). The important thing is that two different arrays with the same bytes in them will generally have different .toStrings. Note that in some places (e.g. the REPL), Scala will print something like Array(-127, 0, 0, 1) instead of calling .toString, which may cause confusion.
It appears that toByteArray emits a new array each time it's called. So the first time you call newPerson.toByteArray, you get an array at a location corresponding to 50da041d. The second time you call it you get a byte array with the same contents at a location corresponding to 7709e969 and you save the string [B#7709e969 into the variable l. When you then call getBytes on that string (saving it in l1), you get a byte array which is an encoding of the string "[B#7709e969" at the location corresponding to f44b405.
So at the locations corresponding to 50da041d and 7709e969 you have two different byte arrays which happen to contain the same elements (those elements being the bytes in the proto representation of newPerson). At the location corresponding to f44b405 you have a byte array whose bytes encode the text [B#7709e969 in the platform's default character set (typically UTF-8).
Because a proto isn't really a string, there's no general way to get a useful string (depending on what definition of useful you're dealing with). You could try interpreting a byte array from toByteArray as a string with a given character encoding, but there's no guarantee that any given proto will be valid in an arbitrary character encoding.
An encoding which is purely 8-bit, like ISO-8859-1, is guaranteed to at least be decodable from a byte array, but there could be non-printable or control characters, so it's not likely to be that useful:
val iso88591Representation = new String(newPerson.toByteArray, java.nio.charset.StandardCharsets.ISO_8859_1)
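The same idea can be sketched in Python (used here only for illustration; the question itself is Scala). The default string form of a byte array describes the object rather than encoding its contents, so a round trip needs an explicit reversible encoding such as ISO-8859-1 or Base64. The payload bytes below are made up and are not the actual proto serialization of newPerson.

```python
import base64

payload = bytes([10, 9, 74, 111, 104, 110, 16, 44, 255, 0])  # arbitrary example bytes

# ISO-8859-1 maps every byte value 0-255 to exactly one character, so it round-trips.
as_latin1 = payload.decode("iso-8859-1")
latin1_round_trip = as_latin1.encode("iso-8859-1")

# Base64 also round-trips and yields printable ASCII only.
as_base64 = base64.b64encode(payload).decode("ascii")
base64_round_trip = base64.b64decode(as_base64)

# Decoding as UTF-8 with replacement is lossy: byte 255 is not valid UTF-8.
lossy = payload.decode("utf-8", errors="replace").encode("utf-8")
round_trips_via_utf8 = lossy == payload
```

Base64 is usually the more practical choice when the text has to be printed, logged, or stored, since the ISO-8859-1 form can contain control characters.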
Alternatively, you might want a representation like how the Scala REPL will (sometimes) render it:
"Array(" + newPerson.toByteArray.mkString(", ") + ")"

How do I parse out a number from this returned XML string in python?

I have the following string:
{\"Id\":\"135\",\"Type\":0}
The number in the Id field will vary, but will always be an integer with no comma separator. I'm not sure how to get just that value from that string given that it's string data type and not real "XML". I was toying with the replace() function, but the special characters are making it more complex than it seems it needs to be.
Is there a way to convert that to XML, or to something that lets me reference the Id value directly?
Maybe use a regular expression, e.g.
import re
txt = "{\"Id\":\"135\",\"Type\":0}"
x = re.search('"Id":"([0-9]+)"', txt)
if x:
    print(x.group(1))
gives
135
It is assumed here that the ids are numeric and consist of at least one digit.
Non-regex answer as you asked
\" is an escape sequence in python.
So if {\"Id\":\"135\",\"Type\":0} is typed as a string literal and assigned to a Python variable like
a = '{\"Id\":\"135\",\"Type\":0}'
then the backslashes are consumed as escapes and the variable holds
>>> a
'{"Id":"135","Type":0}'
OR
If the string already is a Python string that contains literal backslashes before the quotes, call a = a.replace("\\","") to get the string without the \ characters.
Now load this string into a dict and access the Id element as shown below.
import json
d = json.loads(a)
d['Id']
Output:
'135'
Note that the value is a string; use int(d['Id']) if you need a number.
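Both cases above can be folded into one small helper. This is only a sketch: extract_id is a hypothetical name, and it assumes the only backslashes in the input are the ones escaping the quotes.

```python
import json


def extract_id(text: str) -> int:
    """Parse the JSON text and return its Id as an int."""
    # If the text still contains literal backslashes before the quotes, drop them.
    cleaned = text.replace("\\", "")
    return int(json.loads(cleaned)["Id"])


plain = '{"Id":"135","Type":0}'
escaped = r'{\"Id\":\"135\",\"Type\":0}'  # raw string: backslashes are kept literally
```

Calling extract_id on either form returns the integer 135.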

MATLAB: extract numerical data from alphanumerical table and save as double

I created a list of names of data files, e.g. abc123.xml, abc456.xml, via
list = dir('folder/*.xml').
MATLAB returns this as a 10x1 struct with 5 fields, where the first one is the name. I extracted the needed data with struct2table, so I now have a 10x1 table. I only need the numerical values as a 10x1 double. How can I get rid of the alphanumerical parts and change the data type?
I tried regexp (Undefined function 'regexp' for input arguments of type 'table') and strfind (Conversion to double from table is not possible). Couldn't come up with anything else, as I'm very new to Matlab.
You can extract the name fields and place them in a cell array, use regexp to capture the first string of digits it finds in each name, then use str2double to convert those to numeric values:
strs = regexp({list.name}, '(\d+)', 'once', 'tokens');
nums = str2double([strs{:}]);

How do I find a partial string in a Mongo database using a superset string?

If my database contains entries with the following string values for the "key" field:
"a,b,c"
"a,b,z"
"a,b,c,d,e,f,z"
"d,e,f,g"
"d,e,f,g,z"
"h,i"
And I have a string like this:
"a,b,c,d,e,f,g,h"
How do I find the entries where the value of the key field matches the start of my string? E.g. I want to find the entry where the value of the key field is "a,b,c".
How do I find the entries where the value of the key field matches any part of my string? E.g. I want to find the entries where the value of the key field is "a,b,c" and "d,e,f,g".
To give some context in case anyone thinks this is a pointless task, I want to do stack matching. I will have entries in a database that identify bugs by the first N frames of the stack and then I want to identify bug(s) by the stack obtained from a core dump.
One way is to use the $where operator. The example below, in Python, covers the second case (a key that matches any part of the string); search_string is the string to match against:
search_string = 'a,b,c,d,e,f,g,h'
js_check = 'function () { var search_string=\'' + search_string + '\'; return search_string.indexOf(this.key) >= 0; }'
matches = my_collection.find({'$where': js_check})
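A slightly more defensive way to build the JavaScript is sketched below, under assumptions. build_where is a hypothetical helper. It uses json.dumps to quote the search string so that quotes or backslashes in it cannot break the script, and it handles the prefix case by relying on indexOf returning 0 when the key sits at the start of the string. The matches function is a pure-Python stand-in for what the JavaScript checks, useful only for reasoning about the expected results.

```python
import json


def build_where(search_string: str, prefix_only: bool = False) -> str:
    """Build a $where function body as a string."""
    comparison = "=== 0" if prefix_only else ">= 0"
    # json.dumps produces a quoted, escaped literal for the search string.
    return (
        "function () { var s = " + json.dumps(search_string) + "; "
        "return s.indexOf(this.key) " + comparison + "; }"
    )


def matches(search_string: str, key: str, prefix_only: bool = False) -> bool:
    """Python equivalent of the JavaScript check, for illustration only."""
    index = search_string.find(key)
    return index == 0 if prefix_only else index >= 0


keys = ["a,b,c", "a,b,z", "a,b,c,d,e,f,z", "d,e,f,g", "d,e,f,g,z", "h,i"]
search = "a,b,c,d,e,f,g,h"
prefix_hits = [k for k in keys if matches(search, k, prefix_only=True)]
any_hits = [k for k in keys if matches(search, k)]
```

The resulting string would be passed the same way as js_check above, i.e. my_collection.find({'$where': build_where(search)}).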

get_schema multiple primary keys

I am trying the following:
from pandas.io.sql import get_schema
tbl_schema = get_schema(contracts, 'my_contracts', keys=['country', 'contract_id'], con=db_engine)
I am getting this error:
ArgumentError: Element ['country', 'contract_id'] is not a string name or column element
which seems to come from this:
def _to_schema_column_or_string(element):
    if hasattr(element, '__clause_element__'):
        element = element.__clause_element__()
    if not isinstance(element, util.string_types + (ColumnElement, )):
        msg = "Element %r is not a string name or column element"
        raise exc.ArgumentError(msg % element)
    return element
I am not sure how multiple primary keys should be formatted so that they are parsed properly, and I don't really understand this expression: util.string_types + (ColumnElement, ). I was hoping I could just point to the frame's columns without having to define the whole SQLAlchemy schema.