BigQuery SHA256 function - hash

I need to hash some strings using SHA256. Doing this in BigQuery gives me what I understand to be a Base64-encoded result, whereas I need a hex-encoded string.
For example, if I want to hash "def#gmail.com" the result should be:
c392e50ebeca7bea4405e9c545023451ac56620031f81263f681269bde14218b
But doing this in BigQuery:
SELECT SHA256("def#gmail.com") as sha256;
the result is:
w5LlDr7Ke+pEBenFRQI0UaxWYgAx+BJj9oEmm94UIYs=
It's the first result that I need. Is this possible in BigQuery? I'm trying to avoid having to use JavaScript for this.

If you're using Standard SQL in BigQuery, you can wrap the hash in TO_HEX. SHA256 returns BYTES, which the console displays Base64-encoded; TO_HEX converts those bytes to the hex string you're after:
SELECT TO_HEX(SHA256("def#gmail.com")) as sha256;
results:
| sha256 |
| c392e50ebeca7bea4405e9c545023451ac56620031f81263f681269bde14218b |
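If you want to double-check the hex output outside of BigQuery, a quick sketch with Python's standard hashlib module produces the same digest (hashing the raw UTF-8 bytes):
import hashlib

# SHA-256 of the UTF-8 bytes, printed as a lowercase hex string
digest = hashlib.sha256("def#gmail.com".encode("utf-8")).hexdigest()
print(digest)
# c392e50ebeca7bea4405e9c545023451ac56620031f81263f681269bde14218b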

Related

MD5/SHA Field Dataset in Data Fusion

I need to concatenate a few string values in order to obtain the SHA256 hash of the result. I've seen that Data Fusion has a plugin to do the job:
The documentation, however, is very poor and nothing I've tried seems to work. I created a table in BQ with the string fields I need to concatenate, but the output is the same as the input. Can anyone provide an example of how to use this plugin?
EDIT
Below is an example.
This is what the workflow looks like:
For testing purposes, I added one column with the following string:
2022-01-01T00:00:00+01:00
And here's the output:
You can use Wrangler to concatenate the string values.
I tried your scenario by adding a Wrangler to the pipeline:
Joining 2 columns:
I named the new column new_col, using , as the delimiter:
Output:
What you described can be achieved with 2 Wranglers:
The first Wrangler will be what @angela-b described. Use the merge directive to create a new column with the concatenation of two columns. Example directive that joins columns a and b using , as the delimiter and stores the result in column a_b:
merge a b a_b ,
The second Wrangler will use the hash directive which will hash the column in place using a specified algorithm. Example of a directive that hashes column a_b using MD5:
hash :a_b 'MD5' true
Remember to set the last parameter encode to true so that you get a string output instead of a byte array.
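As a sanity check on what those two directives should produce, here is a rough Python sketch of the same logic (comma-join, then hash, hex-encoded); the second column value is a made-up placeholder, and it assumes Wrangler hashes the UTF-8 bytes of the merged string:
import hashlib

# Placeholder values standing in for columns a and b
a = "2022-01-01T00:00:00+01:00"
b = "some_other_value"  # hypothetical second column

# merge a b a_b ,  ->  concatenate the two values with a comma
a_b = a + "," + b

# hash :a_b 'MD5' true  ->  MD5 of the merged string, encoded as hex
# (use hashlib.sha256 instead if you need SHA256, as in the original question)
print(hashlib.md5(a_b.encode("utf-8")).hexdigest())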

Amazon Redshift get all keys from JSON

I looked at the documentation of Amazon Redshift and I'm not able to see a function that will give me what I want.
https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html
I have a column in my database which contains JSON like this
{'en_IN-foo':'bla bla', 'en_US-foo':'bla bla'}
I want to extract all keys from the JSON that contain foo. So I want to extract
en_IN-foo
en_US-foo
How can I get what I want? The closest to my requirement is the JSON_EXTRACT_PATH_TEXT function, but that can only extract a key when you know the key name. In my case I want all keys that match a pattern, but I don't know the key names.
I also tried abandoning the JSON functions and going the regex route. I wrote this code:
select distinct regexp_substr('{en_in-foo:FOO, en_US-foo:BAR}','[^.]{5}-foo')
but this finds only the first match. I need all the matches.
Redshift is not flexible with JSON, so I don't think getting keys from an arbitrary JSON document is possible. You need to know the keys upfront.
Option 1
If possible change your JSON document to have a static schema:
{"locale":"en_IN", "foo": "bla bla"}
Or even
{"locale":"en_IN", "name": "foo", "value": "bla bla"}
Option 2
I can see that your prefix may be known to you, as it looks like the locale. What you could do is create a static table of locales and then CROSS JOIN it with your JSON table.
locales_table:
Id | locale
---+-------
 1 | en_US
 2 | en_IN
The query would look like this:
SELECT
    JSON_EXTRACT_PATH_TEXT(json_column, locale || '-foo', TRUE) AS foo_at_locale
FROM json_table
CROSS JOIN locales_table
WHERE foo_at_locale IS NOT NULL
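If pulling the raw JSON text into a client application is an option, a regex that returns all matches (rather than only the first one, which is what a single regexp_substr call gives you) will pick up every key ending in -foo. A rough Python sketch, assuming the column value looks like the example in the question:
import re

raw = "{'en_IN-foo':'bla bla', 'en_US-foo':'bla bla'}"

# findall returns every non-overlapping match, not just the first one
keys = re.findall(r"'([^']*-foo)'\s*:", raw)
print(keys)  # ['en_IN-foo', 'en_US-foo']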

Postgres: Are There Downsides to Using a JSON Column vs. an integer[] Column?

TLDR: If I want to save arrays of integers in a Postgres table, are there any pros or cons to using an array column (integer[]) vs. using a JSON column (e.g., does one perform better than the other)?
Backstory:
I'm using a PostgreSQL database, and Node/Knex to manage it. Knex doesn't have any way of directly defining a PostgreSQL integer[] column type, so someone filed a Knex bug asking for it ... but one of the Knex devs closed the ticket, essentially saying that there was no need to support PostgreSQL array column types when anyone can instead use the JSON column type.
My question is, what downsides (if any) are there to using a JSON column type to hold a simple array of integers? Are there any benefits, such as improved performance, to using a true array column, or am I equally well off by just storing my arrays inside a JSON column?
EDIT: Just to be clear, all I'm looking for in an answer is either of the following:
A) an explanation of how JSON columns and integer[] columns in PostgreSQL work, including either how one is better than the other or how the two are (at least roughly) equal.
B) no explanation, but at least a reference to some benchmarks that show that one column type or the other performs better (or that the two are equal)
An int[] is a lot more efficient in terms of the storage it requires. Consider the following query, which returns the size of an array with 500 elements:
select pg_column_size(array_agg(i)) as array_size,
       pg_column_size(jsonb_agg(i)) as jsonb_size,
       pg_column_size(json_agg(i)) as json_size
from generate_series(1,500) i;
returns:
 array_size | jsonb_size | json_size
------------+------------+-----------
       2024 |       6008 |      2396
(I am quite surprised that the JSON value is so much smaller than the JSONB, but that's a different topic)
If you always use the array as a single value, it does not really matter in terms of query performance. But if you do need to look into the array and search for specific value(s), that will be a lot more efficient with a native array.
There are a lot more functions and operators available for native arrays than there are for JSON arrays. You can easily search for a single value in a JSON array, but searching for multiple values requires workarounds.
The following query demonstrates that:
with array_test (id, int_array, json_array) as (
  values
    (1, array[1,2,3], '[1,2,3]'::jsonb)
)
select id,
       int_array @> array[1]        as array_single,
       json_array @> '1'            as json_single,
       int_array @> array[1,2]      as array_all,
       json_array ?& array['1','2'] as json_all,
       int_array && array[1,2]      as array_any,
       json_array ?| array['1','2'] as json_any
from array_test;
You can easily check whether an array contains one specific value. This also works for JSON arrays; those are the expressions array_single and json_single. With a native array you could also use 1 = any(int_array) instead.
But checking whether an array contains all values from a list, or any value from a list, does not work with JSON arrays.
The above test query returns:
 id | array_single | json_single | array_all | json_all | array_any | json_any
----+--------------+-------------+-----------+----------+-----------+----------
  1 | true         | true        | true      | false    | true      | false

How to save the predictions of YOLO (You Only Look Once) Object detection in a jsonb field in a database

I want to run Darknet (YOLO) on a number of images and store its predictions in a PostgreSQL database.
This is the structure of my table:
sample=> \d+ prediction2;
                    Table "public.prediction2"
   Column    | Type  | Modifiers | Storage  | Stats target | Description
-------------+-------+-----------+----------+--------------+-------------
 path        | text  | not null  | extended |              |
 pred_result | jsonb |           | extended |              |
Indexes:
    "prediction2_pkey" PRIMARY KEY, btree (path)
Darknet(YOLO)'s source files are written in C.
I have already stored Caffe's predictions in the database as follows. I have listed one of the rows of my database here as an example.
path | pred_result
-------------------------------------------------+------------------------------------------------------------------------------------------------------------------
/home/reena-mary/Pictures/predict/gTe5gy6xc.jpg | {"bow tie": 0.00631, "lab coat": 0.59257, "neck brace": 0.00428, "Windsor tie": 0.01155, "stethoscope": 0.36260}
I want to add YOLO's predictions to the jsonb data in pred_result, i.e. for each image path and Caffe prediction result already stored in the database, I would like to append Darknet's (YOLO's) predictions.
The reason I want to do this is to add search tags to each image. So, by running Caffe and Darknet on images, I want to be able to get enough labels that can help me make my image search better.
Kindly help me with how I should do this in Darknet.
This is an issue I also encountered. Actually YOLO does not provide a JSON output interface, so there is no way to get the same output as from Caffe.
However, there is a pull request that you can merge to get workable output here: https://github.com/pjreddie/darknet/pull/34/files. It outputs CSV data, which you can convert to JSON to store in the database.
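If you go the CSV route, converting that output to JSON before loading it is straightforward. A rough Python sketch; the file name and the image_path,label,confidence column layout are assumptions, so check them against what the patched Darknet actually writes:
import csv
import json

# Assumed layout: image_path,label,confidence -- verify against the real CSV output
predictions = {}
with open("yolo_output.csv", newline="") as f:
    for image_path, label, confidence in csv.reader(f):
        predictions.setdefault(image_path, {})[label] = float(confidence)

print(json.dumps(predictions, indent=2))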
You could of course also alter the source code of YOLO to make your own implementation that outputs JSON directly.
If you are able to use a TensorFlow implementation of YOLO, try this: https://github.com/thtrieu/darkflow
You can interact with darkflow directly from another Python application and then do whatever you like with the output data (or have it save JSON data to a file, whichever is easier).
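Whichever route you take, once the predictions are in a Python dict, appending them to the existing pred_result jsonb for an image path could look roughly like the sketch below. It assumes psycopg2 and PostgreSQL 9.5+ (for the jsonb || concatenation operator); the connection settings and the sample predictions are placeholders:
import json
import psycopg2

# Hypothetical YOLO output for one image: label -> confidence
yolo_preds = {"person": 0.87, "dog": 0.42}
image_path = "/home/reena-mary/Pictures/predict/gTe5gy6xc.jpg"

conn = psycopg2.connect(dbname="sample", user="postgres")  # placeholder connection settings
try:
    with conn, conn.cursor() as cur:
        # jsonb || jsonb merges the two objects, keeping the existing Caffe keys
        cur.execute(
            """
            UPDATE prediction2
            SET pred_result = COALESCE(pred_result, '{}'::jsonb) || %s::jsonb
            WHERE path = %s
            """,
            (json.dumps(yolo_preds), image_path),
        )
finally:
    conn.close()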

Using groupBy in Spark and getting back to a DataFrame

I'm having difficulty working with data frames in Spark with Scala. If I have a data frame from which I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event     | other_stuff
-----------+-----------+------------
  34131231 | thing     | stuff
  83423984 | notathing | notstuff
  34131231 | thing     | morestuff
and I would like the unique machine ids where event is thing stored in a new DataFrame, to allow me to do further filtering. Using
val machineId = logs
  .where($"event" === "thing")
  .select("machine_id")
  .groupBy("machine_id")
I get a val of GroupedData back, which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique ids from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived, but hopefully it explains what my issue is. It may be that I don't know enough about GroupedData objects, or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using Spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct, not groupBy:
val machineId = logs.where($"event" === "thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general; it can be used to apply aggregate functions and convert the result back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this:
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or an equivalent method.
A groupBy in Spark followed by an aggregation and then a select statement will return a data frame. For your example it should be something like:
val machineId = logs
  .where($"event" === "thing")
  .groupBy("machine_id", "event")
  .agg(max("other_stuff"))
  .select($"machine_id")