I would like to migrate hash generation to BigQuery, which has SHA256 but does not take a salt (key) as a parameter.
For example in R I can do something like this:
library(openssl)
sha256("test#gmail.com", key = "111")
# [1] "172f052058445afd9fe3afce05bfec573b5bb4c659bfd4cfc69a59d1597a0031"
Update
The same with Python, based on an answer here:
import hmac
import hashlib
print(hmac.new(b"111", b"test#gmail.com", hashlib.sha256).hexdigest())
# 172f052058445afd9fe3afce05bfec573b5bb4c659bfd4cfc69a59d1597a0031
I hope by "migrate", you mean to migrate the logic not the exact byte-wise output from R Sha256() function.
R is using hmacsha256 and looking at Microsoft's HMACSHA256 class, it can be roughly expressed as:
The HMAC process mixes a secret key with the message data, hashes the result with the hash function, mixes that hash value with the secret key again, and then applies the hash function a second time. The output hash is 256 bits in length.
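For reference, that mix-hash-mix-hash process is the standard HMAC construction (RFC 2104). A minimal Python sketch of it, which should reproduce the value from the hmac example above:
import hashlib

def hmac_sha256(key: bytes, message: bytes) -> str:
    block_size = 64  # SHA-256 processes 64-byte blocks
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()
    key = key.ljust(block_size, b"\x00")     # pad the key to the block size
    ipad = bytes(b ^ 0x36 for b in key)      # inner padded key
    opad = bytes(b ^ 0x5C for b in key)      # outer padded key
    inner = hashlib.sha256(ipad + message).digest()
    return hashlib.sha256(opad + inner).hexdigest()

print(hmac_sha256(b"111", b"test@gmail.com"))
# should match the value produced by hmac.new(...) above
The BigQuery function below only approximates this construction (it does not pad or XOR the key), so its output will not match the R/Python value byte for byte: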
create temp function hmacsha256(content STRING, key STRING)
AS (SHA256(
  CONCAT(
    TO_HEX(SHA256(CONCAT(content, key))), key)
));
SELECT TO_HEX(hmacsha256("test@gmail.com", "111"));
Output:
+------------------------------------------------------------------+
| f0_ |
+------------------------------------------------------------------+
| 4010f74e5c69ddbe1e36975f7cb8be64bcfd1203dbc8e009b29d7a12a8bf5fef |
+------------------------------------------------------------------+
With the help of @Yun I have managed to solve this.
To apply HMAC you will need to include an external library file, as in the example function below.
CREATE TEMP FUNCTION USER_HASH(message STRING, secret STRING)
RETURNS STRING
LANGUAGE js
OPTIONS (
-- copy this Forge library file to Storage:
-- https://cdn.jsdelivr.net/npm/node-forge@0.7.0/dist/forge.min.js
-- @see https://github.com/digitalbazaar/forge
library=["gs://.../forge.min.js"]
)
AS
"""
var hmac = forge.hmac.create();
hmac.start('sha256', secret);
hmac.update(message);
return hmac.digest().toHex();
""";
SELECT USER_HASH("test@gmail.com", "111");
-- Row f0_
-- 1 172f052058445afd9fe3afce05bfec573b5bb4c659bfd4cfc69a59d1597a0031
Related
I'm trying to add a column to a dataframe which will contain the hash of another column.
I've found this piece of documentation:
https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
And tried this:
import org.apache.spark.sql.functions._
val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))
But what is the hash function used by that hash()? Is that murmur, sha, md5, something else?
The value I get in this column is an integer, so the range of values here is probably [-2^31 ... 2^31 - 1].
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm for that?
Can I use a custom hash function?
It is Murmur3, based on the source code:
/**
 * Calculates the hash code of given columns, and returns the result as an int column.
 *
 * @group misc_funcs
 * @since 2.0.0
 */
@scala.annotation.varargs
def hash(cols: Column*): Column = withExpr {
  new Murmur3Hash(cols.map(_.expr))
}
If you want a Long hash, Spark 3 has the xxhash64 function: https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64.
If you want only positive numbers, you can take hash and add Int.MaxValue:
import org.apache.spark.sql.types.LongType

df.withColumn("hashID", hash($"value").cast(LongType) + Int.MaxValue).show()
Can I retrieve a record from an Aerospike database by a previously saved hash digest?
Here's an example of how to do it with the Aerospike client for Python. Client.get() needs a valid key tuple, which can be (namespace, set, None, digest) instead of the more standard (namespace, set, primary-key).
>>> client = aerospike.client(config).connect()
>>> client.put(('test','demo','oof'), {'id':0, 'a':1})
>>> (key, meta, bins) = client.get(('test','demo','oof'))
>>> key
('test', 'demo', None, bytearray(b'\ti\xcb\xb9\xb6V#V\xecI#\xealu\x05\x00H\x98\xe4='))
>>> (key2, meta2, bins2) = client.get(key)
>>> bins2
{'a': 1, 'id': 0}
>>> client.close()
You need three things to locate a record in Aerospike: the namespace, the set name (if used; it can be null) and your key (the one you used initially, say a string or integer). The "Key" object you pass to the get call comprises these three entities. The client library computes the hash from set + your key, then additionally uses the namespace to get the record. Aerospike only stores the hash (unless sendKey is set to true), but you still need the namespace. So in your case, you can create the Key object passed to get() by specifying a namespace and the hash, but you cannot call get() with just the hash and no namespace.
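A minimal sketch of that with the Python client, assuming a local server and that aerospike.calc_digest is available in the client version used (in practice you would load the previously saved 20-byte digest instead of recomputing it):
import aerospike

config = {'hosts': [('127.0.0.1', 3000)]}   # assumed local server
client = aerospike.client(config).connect()
client.put(('test', 'demo', 'oof'), {'id': 0, 'a': 1})

# The 20-byte digest you would have saved earlier
digest = aerospike.calc_digest('test', 'demo', 'oof')

# Key tuple with no primary key: (namespace, set, None, digest)
(key, meta, bins) = client.get(('test', 'demo', None, digest))
print(bins)   # {'a': 1, 'id': 0}
client.close()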
Using psycopg2 and PostgreSQL 9.3, a row is inserted into a table using the following syntax:
cur.execute(
    "INSERT INTO customer (name, address) VALUES ('Herman M', '1313 mockingbird lane')")
If the data comes in a dictionary {'name': 'Herman M', 'address': '1313 mockingbird lane'}, is there a better, more Pythonic way to extract the keys and values from the dictionary, in order, than this:
fields, values = '', []
for k, v in dictionary.items():
    fields = ','.join((fields, k))
    values.append(v)
In order to do this:
cur.execute(
    "INSERT INTO {} ({}) VALUES {}".format(
        tablename, fields[1:], tuple(values)))
It works, but after watching Raymond Hettinger's talk on transforming code into beautiful, idiomatic Python, I am sensitive to the fact that it is ugly and that I am copying data. Is there a better way?
Use the dictionary in the cursor.execute method
insert_query = """
insert into customer (name, address)
values (%(name)s, %(address)s)
"""
insert_dict = {
'name': 'Herman M',
'address': '1313 mockingbird lane'
}
cursor.execute(insert_query, insert_dict)
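If the table or column names themselves need to be dynamic, as in the question's format() version, one option is the psycopg2.sql module (psycopg2 2.7+). A sketch, where insert_row and its arguments are illustrative names:
from psycopg2 import sql

def insert_row(cur, table, row):
    # Quote identifiers safely and bind values as named parameters
    columns = list(row)
    query = sql.SQL("INSERT INTO {} ({}) VALUES ({})").format(
        sql.Identifier(table),
        sql.SQL(", ").join(map(sql.Identifier, columns)),
        sql.SQL(", ").join(map(sql.Placeholder, columns)))
    cur.execute(query, row)

insert_row(cursor, 'customer', {'name': 'Herman M', 'address': '1313 mockingbird lane'})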
We have a table in Cassandra 1.2.0 that has a varint key. When we search for keys we can see that they exist.
Table description:
CREATE TABLE u (
  key varint PRIMARY KEY
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=1.000000 AND
  replicate_on_write='true' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};
Select key from u limit 10;
key
12040911
60619595
3220132
4602232
3997404
6312372
1128185
1507755
1778092
4701841
When I try and get the row for key 60619595 it works fine.
cqlsh:users> select key from u where key = 60619595;
key
60619595
cqlsh:users> select key from u where key = 3997404;
When I use pycassa to get the whole table I can access the row.
import pycassa
from struct import *
from pycassa.types import *
from urlparse import urlparse
import operator

userspool = pycassa.ConnectionPool('users')
userscf = pycassa.ColumnFamily(userspool, 'u')
users = {}
u = list(userscf.get_range())
for r in u:
    users[r[0]] = r[1]

print users[3997404]
returns the correct result.
What am I doing wrong? I cannot see what the error is.
Any help would be appreciated,
Regards
Michael.
PS:
I should say that in pycassa when I try:
userscf.get(3997404)
File "test.py", line 10, in
userscf.get(3997404)
File "/usr/local/lib/python2.7/dist-packages/pycassa/columnfamily.py", line 655, in get
raise NotFoundException()
pycassa.cassandra.ttypes.NotFoundException: NotFoundException(_message=None)
It seems to happen with ints that are smaller than average.
You are mixing CQL and Thrift-based queries, which do not always mix well. CQL abstracts the underlying storage rows, whereas Thrift deals directly with them.
This is a problem we are having in our project. I should have added that
select key from u where key = 3997404;
cqlsh:users>
returns 0 results, even though when we select * from u in cqlsh, or get the whole table in pycassa, we see the row with the key 3997404.
Sorry for the confusion.
Regards
D.
I created a url shortener algorithm with Ruby + MongoMapper
It's a simple URL shortener algorithm with a maximum of 3 digits:
http://pablocantero.com/###
Where each # can be [a-z] or [A-Z] or [0-9]
For this algorithm, I need to persist four attributes on MongoDB (through
MongoMapper)
class ShortenerData
  include MongoMapper::Document

  VALUES = ('a'..'z').to_a + ('A'..'Z').to_a + (0..9).to_a

  key :col_a, Integer
  key :col_b, Integer
  key :col_c, Integer
  key :index, Integer
end
I created another class to manage ShortenerData and to generate the unique
identifier
class Shortener
  include Singleton

  def get_unique
    unique = nil
    @shortener_data.reload
    # some operations that can increment the attributes col_a, col_b, col_c and index
    # ...
    @shortener_data.save
    unique
  end
end
The Shortener usage
Shortener.instance.get_unique
My question is how I can make get_unique synchronized. My app will be deployed on Heroku, and concurrent requests can call Shortener.instance.get_unique.
I changed the behaviour to get the base62 id. I created an auto-increment gem for MongoMapper.
With the auto-incremented id I encode to base62.
The gem is available on GitHub https://github.com/phstc/mongomapper_id2
# app/models/movie.rb
class Movie
  include MongoMapper::Document

  key :title, String

  # Here is the mongomapper_id2
  auto_increment!
end
Usage
movie = Movie.create(:title => 'Tropa de Elite')
movie.id # BSON::ObjectId('4d1d150d30f2246bc6000001')
movie.id2 # 3
movie.to_base62 # d
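The base62 step itself is straightforward; a minimal Python sketch using the same [a-z][A-Z][0-9] alphabet as VALUES above (the gem's exact implementation may differ, but this reproduces the example above, where 3 encodes to 'd'):
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def to_base62(n):
    # Encode a non-negative integer with the 62-character alphabet above
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return ''.join(reversed(digits))

print(to_base62(3))   # 'd'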
Short url
# app/helpers/application_helper.rb
def get_short_url(model)
  "http://pablocantero.com/#{model.class.name.downcase}/#{model.to_base62}"
end
I solved the race condition with MongoDB find_and_modify http://www.mongodb.org/display/DOCS/findAndModify+Command
model = MongoMapper.database.collection(:incrementor).find_and_modify(
  :query  => {'model_name' => 'movies'},
  :update => {'$inc' => {:id2 => 1}},
  :new    => true)

model[:id2] # returns the auto-incremented id
With this new behaviour I solved the race condition problem!
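For readers on Python, the same atomic counter pattern can be sketched with pymongo (the database name is hypothetical; find_one_and_update needs pymongo 3.0+):
from pymongo import MongoClient, ReturnDocument

db = MongoClient().shortener_db   # hypothetical database name

doc = db.incrementor.find_one_and_update(
    {'model_name': 'movies'},
    {'$inc': {'id2': 1}},
    upsert=True,
    return_document=ReturnDocument.AFTER)
print(doc['id2'])   # the atomically incremented id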
If you liked this gem, please help to improve it. You're welcome to make contributions and send them as a pull request, or just send me a message: http://pablocantero.com/blog/contato