Efficiently matching a Hash/Map with multiple Hashes/Maps

I have a base Hashmap, something like the one below:
{
"group_reassigned_count": 3,
"source": "App",
"description": "Sample",
"company_id": 753606,
"group_id": 706200,
"custom_text": "Item 1",
"custom_number": 45,
"custom_decimal": 34.5,
"custom_date": "2019-04-18T00:00:00Z",
"custom_boolean": false,
"priority": 2,
"first_assigned_at": "2019-04-12T09:56:33Z",
"subject": "Sample subject",
"fwd_emails": [],
"reply_cc": [
"abc#test.com",
"def#test.com"
],
"urgent": false,
"first_assign_group_id": 706200,
"outbound_email": false,
"display_id": 84,
"customer_reply_count": null,
}
I need to match this against multiple Hashmaps that contain only a subset of the keys of the original Hashmap. As of now, my requirement is to return a boolean based on the match.
NOTE: The base Hashmap tends to change, but the list of subset Hashmaps will be constant.
Example:
# Hashmap 1 (Not matching)
{
"group_id": 706200,
"custom_text": "Dummy",
"custom_number": 45,
"custom_decimal": 31.8
}
# Hashmap 2 (Matching)
{
"subject": "Sample subject",
"fwd_emails": []
}
# Hashmap 3 (Not matching)
{
"source": "App",
"description": "Dummy",
"outbound_email": true,
"display_id": 4
}
# Hashmap 4 (Matching)
{
"custom_date": "2019-04-18T00:00:00Z",
"custom_boolean": false,
"priority": 2,
"first_assigned_at": "2019-04-12T09:56:33Z",
"subject": "Sample subject"
}
Most of the time, the subset Hashmaps I need to match against the base Hashmap will number in the thousands.
I don't want to iterate through each Hashmap and compare it with the base Hashmap. It's time-consuming.
Is there any efficient way to do this?
Something like converting the list of subset Hashmaps into a data structure that will help with efficient matching (the subset Hashmaps are always constant, and their keys and values will not change).
I haven't tried any regexp matching.
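One direction (sketched in Python here; the same idea carries over to Ruby Hashes), assuming the values can be made hashable, is to build an inverted index from (key, value) pairs to the subset maps that contain them, so each base Hashmap is matched with one pass over its own pairs instead of one pass per subset:
```python
from collections import defaultdict

def normalize(pair):
    # Make list values (e.g. fwd_emails) hashable so pairs can be dict keys.
    key, value = pair
    return (key, tuple(value) if isinstance(value, list) else value)

# Illustrative subset maps; in practice this is the constant list of thousands.
subsets = [
    {"group_id": 706200, "custom_text": "Dummy", "custom_number": 45, "custom_decimal": 31.8},
    {"subject": "Sample subject", "fwd_emails": []},
]

# Built once: (key, value) -> ids of subset maps containing that exact pair.
pair_index = defaultdict(list)
required = [len(s) for s in subsets]   # pairs each subset must match in full
for i, subset in enumerate(subsets):
    for pair in subset.items():
        pair_index[normalize(pair)].append(i)

def matching_subsets(base):
    """Return ids of all subset maps whose every pair appears in `base`."""
    hits = defaultdict(int)
    for pair in base.items():
        for i in pair_index.get(normalize(pair), ()):
            hits[i] += 1
    return [i for i, count in hits.items() if count == required[i]]
```
Each lookup touches only the subsets that actually share a pair with the base map, so the work is roughly proportional to the size of the base map plus the number of partial matches, rather than to the total number of subset maps.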

array_sort function sorting the data based on first numerical element in Array<Struct>

I need to sort an array<struct> based on a particular element of the struct. I am trying to use the array_sort function and can see that, by default, it sorts the array based on the first numerical element. Is this the expected behavior? Please find below sample code and output.
val jsonData = """
{
"topping":
[
{ "id": "5001", "id1": "5001", "type": "None" },
{ "id": "5002", "id1": "5008", "type": "Glazed" },
{ "id": "5005", "id1": "5007", "type": "Sugar" },
{ "id": "5007", "id1": "5002", "type": "Powdered Sugar" },
{ "id": "5006", "id1": "5005", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "id1": "5004", "type": "Chocolate" },
{ "id": "5004", "id1": "5003", "type": "Maple" }
]
}
"""
val json_df = spark.read.json(Seq(jsonData).toDS)
val sort_df = json_df.select(array_sort($"topping").as("sort_col"))
display(sort_df)
OUTPUT
As you can see, the above output is sorted based on the id element, which is the first numerical element in the struct.
Is there any way to specify the element based on which sorting can be done?
Is this the expected behavior?
Short answer, yes!
For arrays with struct-type elements, it compares the first fields to determine the order, and if they are equal it compares the second fields, and so on. You can see that easily by modifying your input data so that two rows have the same value in id; you'll then notice the order is determined by the second field.
The array_sort function uses the collection operation ArraySort. If you look into the code, you'll find how it handles complex DataTypes like StructType.
Is there any way to specify the element based on which sorting can be done?
One way is to use a transform function to change the positions of the struct fields so that the first field contains the values you want the ordering to be based on. Example: if you want to order by the field type:
val transform_expr = "TRANSFORM(topping, x -> struct(x.type as type, x.id as id, x.id1 as id1))"
val transform_df = json_df.select(expr(transform_expr).alias("topping_transform"))
transform_df.show(false)
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|topping_transform |
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|[[None, 5001, 5001], [Glazed, 5002, 5008], [Sugar, 5005, 5007], [Powdered Sugar, 5007, 5002], [Chocolate with Sprinkles, 5006, 5005], [Chocolate, 5003, 5004], [Maple, 5004, 5003]]|
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
val sort_df = transform_df.select(array_sort($"topping_transform").as("sort_col"))
sort_df.show(false)
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|sort_col |
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|[[Chocolate, 5003, 5004], [Chocolate with Sprinkles, 5006, 5005], [Glazed, 5002, 5008], [Maple, 5004, 5003], [None, 5001, 5001], [Powdered Sugar, 5007, 5002], [Sugar, 5005, 5007]]|
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
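If you are on Spark 3.0 or later, another option (not used in the answer above, so treat this as a hedged sketch, written here in PySpark) is the two-argument form of array_sort, which accepts a comparator lambda and avoids rebuilding the struct:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Two-row stand-in for the topping array from the question.
json_data = '{"topping":[{"id":"5002","id1":"5008","type":"Glazed"},{"id":"5001","id1":"5001","type":"None"}]}'
json_df = spark.read.json(spark.sparkContext.parallelize([json_data]))

# The comparator must return a negative, zero, or positive integer.
sort_df = json_df.select(
    expr("""array_sort(topping, (l, r) ->
              case when l.type < r.type then -1
                   when l.type > r.type then 1
                   else 0 end)""").alias("sort_col")
)
sort_df.show(truncate=False)
```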

Differentiate dropdown multi select (without options defined) with regular text columns

Is there any way to differentiate columns that are of type dropdown multi-select from regular text columns?
This is supposed to be a multi-select dropdown list without any options:
"id": 5414087443146628,
"version": 2,
"index": 2,
"title": "Column3",
"type": "TEXT_NUMBER",
"validation": false,
"width": 150
The same question goes for a multi-contact list without contact options defined.
If you think of multi-contact or multi-dropdown as new versions of the various GET requests, then it's easier to return the correct values. For multi-dropdown, you use a combination of the query parameters "level=3" and "include=objectValue"; then you'll see the column type change to MULTI_PICKLIST instead of TEXT. (The TEXT value is to maintain backwards compatibility.)
So, essentially, your request would look something like GET /sheets/{sheetId}?level=3&include=objectValue.
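For reference, a minimal sketch of that request using Python's requests library (the sheet id and token are placeholders, and level=3 is the value suggested above; the follow-up further down found level=2 works for sheets):
```python
import requests

SHEET_ID = "5831916227192708"   # placeholder sheet id
TOKEN = "your-access-token"     # placeholder API access token

# level + include=objectValue surface MULTI_PICKLIST / MULTI_CONTACT_LIST
# column types instead of the backwards-compatible TEXT_NUMBER.
resp = requests.get(
    f"https://api.smartsheet.com/2.0/sheets/{SHEET_ID}",
    params={"level": 3, "include": "objectValue"},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

for column in resp.json()["columns"]:
    print(column["title"], column["type"])
```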
To test the scenario you've described, I created a sheet structure in Smartsheet where the column names indicate the type of each column.
Then I used Postman to issue a Get Sheet request for that sheet:
GET https://api.smartsheet.com/2.0/sheets/5831916227192708
The columns portion of the API response looks like this:
{
"id": 5831916227192708,
...
"columns": [
{
"id": 1256050323154820,
"version": 0,
"index": 0,
"title": "Description",
"type": "TEXT_NUMBER",
"primary": true,
"validation": false,
"width": 124
},
{
"id": 5759649950525316,
"version": 0,
"index": 1,
"title": "Type=Text/Number",
"type": "TEXT_NUMBER",
"validation": false,
"width": 128
},
{
"id": 1323283741206404,
"version": 0,
"index": 2,
"title": "Type=Dropdown (single select)",
"type": "PICKLIST",
"validation": false,
"width": 111
},
{
"id": 7741495861110660,
"version": 2,
"index": 3,
"title": "Type=Dropdown (multiple select)",
"type": "TEXT_NUMBER",
"validation": false,
"width": 113
},
{
"id": 3048711514285956,
"version": 0,
"index": 4,
"title": "Type=Contact List (single select)",
"type": "CONTACT_LIST",
"validation": false,
"width": 122
},
{
"id": 3992195570132868,
"version": 1,
"index": 5,
"title": "Type=Contact List (multiple select)",
"type": "TEXT_NUMBER",
"validation": false,
"width": 125
}
],
...
}
In this response, we see the following:
If column type is specified as Text/Number, the type attribute value is TEXT_NUMBER
If column type is specified as Dropdown (single select), the type attribute value is PICKLIST
If column type is specified as Dropdown (multiple select), the type attribute value is TEXT_NUMBER
If column type is specified as Contact List (single select), the type attribute value is CONTACT_LIST
If column type is specified as Contact List (multiple select), the type attribute value is TEXT_NUMBER
Therefore, it doesn't seem possible to programmatically differentiate a Dropdown (multiple select) column from a Text/Number column or a Contact List (multiple select) column from a Text/Number column, based on column metadata alone. IMO, seems like a bug for the Dropdown (multiple select) column type and Contact List (multiple select) column type to return type: TEXT_NUMBER. Perhaps someone with Smartsheet can comment here to provide more insight into this behavior.
I did a few tests and level 3 isn't available for https://api.smartsheet.com/2.0/sheets/{sheetId}?level=3:
{
"errorCode": 1018,
"message": "The value '3' was not valid for the parameter 'level'.",
"refId": "1godowa5cigf1"
}
However, I tried with level 2 and got the info:
https://api.smartsheet.com/2.0/sheets/{sheetId}?level=2&include=objectValue
Results for a multi-dropdown list:
{
"id": 5414087443146628,
"version": 2,
"index": 2,
"title": "Column3",
"type": "MULTI_PICKLIST",
"options": [
"a",
"b"
],
"validation": false,
"width": 150
}

Parsing Really Messy Nested JSON Strings

I have a series of deeply nested JSON strings in a PySpark dataframe column. I need to explode and filter based on the contents of these strings and would like to add them as columns. I've tried defining the StructTypes, but each time it continues to return an empty DF.
I tried using json_tuple to parse, but there are no common keys to rejoin the dataframes, and the row numbers don't match up. I think it might have to do with some null fields.
The sub-fields can be nullable.
Sample JSON
{
"TIME": "datatime",
"SID": "yjhrtr",
"ID": {
"Source": "Person",
"AuthIFO": {
"Prov": "Abc",
"IOI": "123",
"DETAILS": {
"Id": "12345",
"SId": "ABCDE"
}
}
},
"Content": {
"User1": "AB878A",
"UserInfo": "False",
"D": "ghgf64G",
"T": "yjuyjtyfrZ6",
"Tname": "WE ARE THE WORLD",
"ST": null,
"TID": "BPV 1431: 1",
"src": "test",
"OT": "test2",
"OA": "test3",
"OP": "test34
},
"Test": false
}
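A minimal PySpark sketch of one way to approach this, assuming the JSON strings sit in a column called json_str and declaring only the fields needed (all nullable, so records with missing sub-objects parse to nulls instead of emptying the DF):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

spark = SparkSession.builder.getOrCreate()

# One-row stand-in for the real dataframe; in practice `raw_df` is the existing
# dataframe with the JSON strings in a column (called "json_str" here).
sample = '{"TIME":"datatime","SID":"yjhrtr","ID":{"Source":"Person","AuthIFO":{"Prov":"Abc","IOI":"123","DETAILS":{"Id":"12345","SId":"ABCDE"}}},"Test":false}'
raw_df = spark.createDataFrame([(sample,)], ["json_str"])

# Declare only the fields that are needed; nullable fields mean a missing
# sub-object becomes null rather than failing the whole record.
schema = StructType([
    StructField("TIME", StringType(), True),
    StructField("SID", StringType(), True),
    StructField("ID", StructType([
        StructField("Source", StringType(), True),
        StructField("AuthIFO", StructType([
            StructField("Prov", StringType(), True),
            StructField("IOI", StringType(), True),
            StructField("DETAILS", StructType([
                StructField("Id", StringType(), True),
                StructField("SId", StringType(), True),
            ]), True),
        ]), True),
    ]), True),
    StructField("Test", BooleanType(), True),
])

parsed = raw_df.withColumn("parsed", from_json(col("json_str"), schema))

# Pull the nested values out as ordinary columns and filter on them.
flat = parsed.select(
    col("parsed.TIME").alias("time"),
    col("parsed.ID.AuthIFO.DETAILS.Id").alias("detail_id"),
    col("parsed.Test").alias("test"),
)
flat.filter(col("detail_id").isNotNull()).show()
```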

How to create a JSONPath file to load data into Redshift

One of my sample JSON records:
{
"viewerId": "Ext-04835139",
"sid5": "269410578:2995631181:2211755370:3307088398:33879957",
"firstHbTimems": 1.506283958371E12,
"ipAddress": "74.58.57.31",
"streamUrl": "https://dc3-ll-livedazn-dznlivejp.hs.llnwd.net/live/channel/1007/all/stream.m3u8?event_id=61824040049&h=c912885e2a69ffa7ea84f45dc18c004d",
"asset": "[nlq9biy7trxl1cjceg70rogvd] Saints # Panthers",
"os": "IOS",
"osVersion": "10.3.3",
"deviceModel": "iPhone",
"geoInfo": {
"city": 63666,
"state": 3851,
"isp": 120,
"longitudeTimes1K": -73562,
"country": 37,
"dma": 0,
"asn": 5769,
"latitudeTimes1K": 45502,
"publicIP": 1245329695
},
"totalPlayingTime": 4.097,
"totalBufferingTime": 0.0,
"VST": 1.411,
"avgBitrate": 202.0,
"playStateSwitch": [
"{'seqNum': 0, 'eventNum': 0, 'sessionTimeMs': 7, 'startPlayState': 'eUnknown', 'endPlayState': 'eBuffering'}",
"{'seqNum': 1, 'eventNum': 5, 'sessionTimeMs': 1411, 'startPlayState': 'eBuffering', 'endPlayState': 'ePlaying'}"
],
"bitrateSwitch": [
],
"errorEvent": [
],
"tags": {
"LSsportName": "Football",
"c3.device.model": "iPhone+6+Plus",
"LSvideoType": "LIVE",
"c3.device.ua": "DAZN%2F5560+CFNetwork%2F811.5.4+Darwin%2F16.7.0",
"LSfixtureId": "5trxst8tv7slixckvawmtf949",
"genre": "Sport",
"LScompetitionName": "NFL+Game+Pass",
"show": "NFL+Game+Pass",
"c3.cmp.0._type": "DEVATLAS",
"c3.protocol.type": "cws",
"LSsportId": "9ita1e50vxttzd1xll3iyaulu",
"stageId": "8hm0ew6b8m7907ty8vy8tu4tl",
"LSvenueId": "na",
"syndicator": "None",
"applicationVersion": "2.0.8",
"deviceConnectionType": "wifi",
"c3.client.marketingName": "iPhone+6+Plus",
"playerVersion": "1.2.6.0",
"c3.cmp.0._id": "da",
"drmType": "AES128",
"c3.sh": "dc3-ll-livedazn-dznlivejp.hs.llnwd.net",
"c3.pt.ver": "10.3.3",
"applicationType": "ios",
"c3.viewer.id": "Ext-04835139",
"LSinterfaceLanguage": "en",
"c3.pt.os": "IOS",
"playerVendor": "Open+Source",
"c3.client.brand": "Apple",
"c3.cws.sf": "7",
"c3.cmp.0._ver": "1",
"c3.client.hwType": "Mobile+Phone",
"c3.pt.os.ver": "10.3.3",
"isAd": "false",
"c3.device.cver.bld": "2.124.0.33357",
"stageName": "Regular+Season",
"c3.client.osName": "iOS",
"contentType": "Live",
"c3.device.cver": "2.124.0",
"LScompetitionId": "wy3kluvb4efae1of0d8146c1",
"expireDate": "na",
"c3.client.model": "iPhone+6+Plus",
"c3.client.manufacturer": "Apple",
"LSproductionValue": "na",
"pubDate": "2017-09-23",
"c3.cluster.name": "production",
"accountType": "FreeTrial",
"c3.adaptor.type": "eCws1_7",
"c3.device.brand": "iPhone",
"c3.pt.br": "Non-Browser+Apps",
"contentId": "nlq9biy7trxl1cjceg70rogvd",
"streamingProtocol": "FairPlay",
"LSvenueName": "na",
"c3.device.type": "Mobile",
"c3.protocol.level": "2.4",
"c3.player.name": "AVPlayer",
"contentName": "Saints+%40+Panthers",
"c3.device.manufacturer": "Apple",
"c3.framework": "AVFoundation",
"c3.pt": "iOS",
"c3.device.ver": "6+Plus",
"c3.video.isLive": "T",
"c3.cmp.0._cfg_ver": "1504808821",
"c3.cws.clv": "2.124.0.33357",
"LScountryCode": "America%2FEl_Salvador"
},
"playername": "AVPlayer",
"isLive": "T",
"playerVersion": "1.2.6.0"
}
How do I create a JSONPath file to load it into Redshift?
Thanks
You have a nested array within your JSON, so a jsonpath will not expand that out for you.
You have a couple of choices on how to proceed:
You can load your data at the higher level (e.g. playStateSwitch rather than seqNum within it) and then try to use Redshift to process that data. This can be tricky, as you cannot explode JSON data from an array in Redshift.
You can preprocess the data using e.g. AWS Glue / Python / PySpark or some other ETL tool that can handle these nested arrays (a minimal sketch follows below).
It all depends on the end goal, which is not clear from the above description.
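For the preprocessing route, a minimal Python sketch (file names and the chosen fields are illustrative, and it assumes one JSON record per input line) that flattens each record into one output row per playStateSwitch entry before the COPY:
```python
import ast
import json

# Write one newline-delimited JSON record per playStateSwitch entry, which a
# straightforward jsonpaths file (or COPY with json 'auto') can then load.
with open("input.json") as src, open("flattened.json", "w") as out:
    for line in src:
        rec = json.loads(line)
        for raw_switch in rec.get("playStateSwitch", []):
            # Each entry is a string using single quotes, so parse it as a Python literal.
            event = ast.literal_eval(raw_switch)
            out.write(json.dumps({
                "viewerId": rec["viewerId"],
                "asset": rec["asset"],
                "seqNum": event["seqNum"],
                "sessionTimeMs": event["sessionTimeMs"],
                "startPlayState": event["startPlayState"],
                "endPlayState": event["endPlayState"],
            }) + "\n")
```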
I would approach the solution in the following order:
Define which fields and array values are required to be loaded into Redshift. If the need is to copy all the records, then the next check is how to handle the multiple array records.
If an array or key/value is missing from the JSON source, then JSONPath will not work as is, so it is better to update the JSON to add the missing array before COPYing the data set over to Redshift.
The JSON update can be done using Linux commands or external tools like JP, or refer to additional references.
If all the values in the nested arrays are required, then an alternative workaround is to use an external table.
Otherwise, the JSONPath file can be developed in this format:
{
"jsonpaths": [
"$.viewerId", // root-level fields
...
"$.geoInfo.city", // object hierarchy
...
"$.playStateSwitch[0].seqNum" // specify the required array element
...
]
}
Hope this helps.

MongoDB data model for fast reads using array data

I have a dataset which returns an array named "data_arr" that contains anywhere from 5 to 200 sub-items, each of which has a labelspace and a key-value pair, as follows.
{
"other_fields": "other_values",
...
"data_arr":[
{
"labelspace": "A",
"color": "red"
},
{
"labelspace": "A",
"size": 500
},
{
"labelspace": "B",
"shape": "round"
},
]
}
The question is, within MongoDB, how to store this data optimized for fast reads. Specifically, there would be queries:
Comparing key-values (i.e. the average size of objects which are both red and round).
Returning all documents which meet a criterion (i.e. red objects larger than 300).
Label space is important because some key names are reused.
I've contemplated indexing with the existing structure by indexing labelspace.
I've considered grouping all labelspace key/values into a single sub-document as follows:
{
"other_fields": "other_values",
...
"data_a":
{
"color": "red",
"size": 500
},
"data_b":
{
"shape": "round"
}
}
Or modeling it as follows with a multikey index:
{
"other_fields": "other_values",
...
"data_arr":[
{
"labelspace": "A",
"key": "color",
"value": "red"
},
{
"labelspace": "A",
"key": "size",
"value": 500
},
{
"labelspace": "B",
"key": "shape",
"value": "round"
},
]
}
This is a new data set that needs to be collected. So it's difficult for me to build up enough of a sample only to discover I've ventured down the wrong path.
I think the last one is best suited for indexing, so possibly the best approach?
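To illustrate why that last layout indexes well, here is a hedged pymongo sketch (database and collection names are made up) of the compound multikey index and the two example queries from above:
```python
from pymongo import ASCENDING, MongoClient

coll = MongoClient()["mydb"]["objects"]   # illustrative names

# One compound multikey index covers every labelspace/key/value combination.
coll.create_index([
    ("data_arr.labelspace", ASCENDING),
    ("data_arr.key", ASCENDING),
    ("data_arr.value", ASCENDING),
])

# Documents that are both red (labelspace A) and round (labelspace B).
red_and_round = coll.find({
    "data_arr": {"$all": [
        {"$elemMatch": {"labelspace": "A", "key": "color", "value": "red"}},
        {"$elemMatch": {"labelspace": "B", "key": "shape", "value": "round"}},
    ]}
})

# Red objects larger than 300.
big_red = coll.find({
    "data_arr": {"$all": [
        {"$elemMatch": {"labelspace": "A", "key": "color", "value": "red"}},
        {"$elemMatch": {"labelspace": "A", "key": "size", "value": {"$gt": 300}}},
    ]}
})
```
Because the field names key and value are fixed, a single index serves every attribute, which is the usual argument for this attribute-pattern layout over one sub-document per labelspace.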