I have a dataset from video streaming logs. Each video is identified by a FileGUID. The log entries record the FileGUID, the fragment of the video watched and the bandwidth it was watched at.
I would like to write a MapReduce job that outputs, for each video, a count of fragments watched, both in total and per bandwidth. Ideally the output would look like this:
{"FileGUID":"50acb3a5796634df0e073285",
{
"1":{"total":76, "0832":34, "1028":42},
"2":{"total":42, "0832":28, "1028":14},
...
}
}
Is this possible with one mapreduce or is it a multi-step process, or should I use a different method?
Here is a sample of the data.
{
"_id": ObjectId("50acb3a5796634df0e073285"),
"IP": "46.7.1.88",
"DateTime": ISODate("2012-10-24T22:59:57.0Z"),
"FileGUID": "8cdde821fb934a6da7c125a012a26612",
"Bandwidth": NumberInt(1028),
"Segment": NumberInt(1),
"Fragment": NumberInt(237),
"Status": NumberInt(200),
"Size": NumberInt(576790),
"UserAgent": "Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko\/20100101 Firefox\/16.0"
}
{
"_id": ObjectId("50acb3a5796634df0e073284"),
"IP": "46.7.1.88",
"DateTime": ISODate("2012-10-24T22:59:52.0Z"),
"FileGUID": "8cdde821fb934a6da7c125a012a26612",
"Bandwidth": NumberInt(1028),
"Segment": NumberInt(1),
"Fragment": NumberInt(236),
"Status": NumberInt(200),
"Size": NumberInt(577100),
"UserAgent": "Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko\/20100101 Firefox\/16.0"
}
{
"_id": ObjectId("50acb3a5796634df0e073283"),
"IP": "46.7.1.88",
"DateTime": ISODate("2012-10-24T22:59:47.0Z"),
"FileGUID": "8cdde821fb934a6da7c125a012a26612",
"Bandwidth": NumberInt(0832),
"Segment": NumberInt(1),
"Fragment": NumberInt(234),
"Status": NumberInt(200),
"Size": NumberInt(576664),
"UserAgent": "Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko\/20100101 Firefox\/16.0"
}
{
"_id": ObjectId("50acb3a5796634df0e073282"),
"IP": "46.7.1.88",
"DateTime": ISODate("2012-10-24T22:59:42.0Z"),
"FileGUID": "8cdde821fb934a6da7c125a012a26612",
"Bandwidth": NumberInt(0832),
"Segment": NumberInt(1),
"Fragment": NumberInt(233),
"Status": NumberInt(200),
"Size": NumberInt(575692),
"UserAgent": "Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko\/20100101 Firefox\/16.0"
}
You can do this with a single MapReduce job.
The map function emits the video ID (FileGUID) as the key and, as the value, an object with a single field. The field name is the bandwidth and the field value is the running time of the current entry.
The reduce function sums up the objects passed to it. It iterates over the values array, loops over the fields of each entry, and adds each field's value to the field of the same name in the returned object.
The finalize function loops over the resulting object and calculates the sum of all its fields. It then stores that sum in the object as the field "total". Add the field only after the loop has finished; never modify an object while you are looping over it.
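The three functions described above could be sketched roughly as follows. This is a minimal sketch, not a tested production job: the function names are made up for illustration, and it emits 1 per log entry (an assumption, since the question asks for counts and the sample data has no running-time field). It counts per bandwidth for each video; a per-fragment breakdown as in the desired output would need a nested value object.

```javascript
// Map: MongoDB calls this once per log document, with the document as `this`.
// Emits the FileGUID as the key and a one-field object { <bandwidth>: 1 }.
function mapFragment() {
  var value = {};
  value[this.Bandwidth] = 1; // assumption: each log entry counts as 1
  emit(this.FileGUID, value);
}

// Reduce: adds up the per-bandwidth fields of all values for one key.
// The result has the same shape as the inputs, so re-reducing is safe.
function reduceCounts(key, values) {
  var result = {};
  values.forEach(function (value) {
    for (var field in value) {
      result[field] = (result[field] || 0) + value[field];
    }
  });
  return result;
}

// Finalize: computes the sum over all bandwidth fields first,
// and only then adds the "total" field to the object.
function finalizeTotal(key, reduced) {
  var total = 0;
  for (var field in reduced) {
    total += reduced[field];
  }
  reduced.total = total;
  return reduced;
}
```

In the mongo shell these would be passed along the lines of `db.logs.mapReduce(mapFragment, reduceCounts, { out: { inline: 1 }, finalize: finalizeTotal })`, where the collection name `logs` is hypothetical.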
I am sending an event to the Facebook Conversions API. It sends fine when using 'test_event_code' => 'TEST11781', and I receive it without problems in Events Manager. However, when I remove the test_event_code attribute, i.e. to make the event live, I do not receive the events in Events Manager. Here is the JSON event being sent. Any ideas why it arrives as a test event but not as a live event? Thanks.
{
"data": [
{
"event_name": "ViewContent",
"event_source_url": "https://www.domain.co.uk/",
"event_time": 1638201564,
"action_source": "website",
"user_data": {
"client_ip_address": "78.xxx.xx.79",
"client_user_agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"em": "bee8327d3be983b71f5c3665225dd24b305362943e81738ec639e2b418c64d68",
"ph": "3e3d1c86840255608723f4a7e667f7482de0a877376347232336da4dec0fc6a1",
"ge": "62c66a7a5dd70c3146618063c344e531e6d4b59e379808443ce962b3abd63c5a",
"ln": "a9cca5a75f06071ab13f05f8b017500014117c4022414cdfb7439ca332ccecbd",
"fn": "dbadfc88144b0c153a2d1bdf154681c857a237eb79d58df24e918bca6e17db05",
"ct": "b11c0468e61792336d6c0fca278f850aee8ab9e9d7d0df03e070df82b02188c8",
"st": "1de408006eb5a879c99e8ccda85bf7bcc72f1a22c9693fa7ef9fafec80e21b82",
"zp": "102ee65281e730945c8de2678fdb0659ddcf732c2a638c7f80f5322168726dad",
"country": "b4043b0b8297e379bc559ab33b6ae9c7a9b4ef6519d3baee53270f0c0dd3d960"
},
"custom_data": { "currency": "GBP", "value": "30", "content_ids": ["3268"], "num_items": "1", "content_type": "product", "content_name": "BASEBALL CAP" }
}
]
}
I am using the MS Graph sign-in REST API to retrieve the guest user sign-ins in my tenant. However, some of the sign-ins I retrieved show internal users as Guest in the userType attribute.
I also observed that homeTenantId and resourceTenantId differ in these records.
Sometimes, when logging in to the Azure AD portal, the user is signed in to the directory of the previously used tenant. In those cases the tenant ID may differ and the userType attribute is shown as Guest. But for SharePoint I am not sure why the user type is guest.
This is a bit confusing. Any idea why internal users are shown as guest users?
Request : https://graph.microsoft.com/beta/auditLogs/signIns
Sample Response:
{ "id": "$$$$$$",
"createdDateTime": "2021-08-29T10:22:06Z",
"userDisplayName": "user",
"userPrincipalName": "user#cortana.onmicrosoft.com",
"userId": "$$$$$",
"appId": "08e18876-6177-487e-b8b5-cf950c1e598c",
"appDisplayName": "SharePoint Online Web Client Extensibility",
"ipAddress": "$$$$$$",
"ipAddressFromResourceProvider": null,
"clientAppUsed": "",
"userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36",
"correlationId": "*********",
"conditionalAccessStatus": "notApplied",
"originalRequestId": "",
"isInteractive": true,
"tokenIssuerName": "",
"tokenIssuerType": "AzureAD",
"processingTimeInMilliseconds": 173,
"riskDetail": "none",
"riskLevelAggregated": "none",
"riskLevelDuringSignIn": "none",
"riskState": "none",
"riskEventTypes": [],
"riskEventTypes_v2": [],
"resourceDisplayName": "Office 365 SharePoint Online",
"resourceId": "$$$$$$$",
"resourceTenantId": "$$$$$$$$$",
"homeTenantId": "#########",
"authenticationMethodsUsed": [],
"authenticationRequirement": "singleFactorAuthentication",
"alternateSignInName": "", "signInIdentifier": "",
"signInIdentifierType": null,
"servicePrincipalName": null,
"signInEventTypes": ["interactiveUser"],
"servicePrincipalId": "",
"userType": "guest",
"flaggedForReview": false,
"isTenantRestricted": false,
"autonomousSystemNumber": 45609,
"crossTenantAccessType": "b2bCollaboration",
"servicePrincipalCredentialKeyId": null,
"servicePrincipalCredentialThumbprint": "",
"mfaDetail": null,
"status": {
"errorCode": 0,
"failureReason": "Other.",
"additionalDetails": null },
"deviceDetail": {
"deviceId": "",
"displayName": "",
"operatingSystem": "Windows 10",
"browser": "Chrome 92.0.4515",
"isCompliant": false,
"isManaged": false,
"trustType": ""
}, "location": {
"city": "Kallimandayam",
"state": "Tamil Nadu",
"countryOrRegion": "IN",
"geoCoordinates": {
"altitude": null,
"latitude": "",
"longitude": ""
}}, "appliedConditionalAccessPolicies": [],
"authenticationProcessingDetails": [{
"key": "Login Hint Present",
"value": "True" },
{
"key": "User certificate authentication level",
"value": "singleFactorAuthentication" } ],
"networkLocationDetails": [],
"authenticationDetails": [],
"authenticationRequirementPolicies": [],
"sessionLifetimePolicies": [],
"privateLinkDetails": {
"policyId": "",
"policyName": "",
"resourceId": "",
"policyTenantId": "" } }
This is by design. It is expected behavior: when one of your users accesses a tenant where they are a guest (the inviting tenant), the authentication is logged on your side as well. When a user is invited to another directory as a guest, the user authenticates with the credentials from their home tenant, as explained in the link below.
Reference - Authentication is performed by the guest user's identity provider - https://learn.microsoft.com/en-us/azure/active-directory/external-identities/user-properties
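To tell these records apart when processing the sign-in log, one option is to compare homeTenantId and resourceTenantId against your own tenant ID. The following is only a sketch under assumptions: the function name and the returned labels are hypothetical, `myTenantId` is a value you supply, and the interpretation of the two tenant fields follows the explanation above rather than any documented rule.

```javascript
// Hypothetical helper: classifies one sign-in record from the sample response.
// myTenantId is the ID of the tenant whose sign-in log is being read.
function classifySignIn(signIn, myTenantId) {
  if (signIn.userType !== "guest") {
    return "member";
  }
  // Guest record whose home tenant is ours but whose resource tenant is not:
  // one of our own users signing in to another tenant as a guest.
  if (signIn.homeTenantId === myTenantId &&
      signIn.resourceTenantId !== myTenantId) {
    return "own-user-guest-elsewhere";
  }
  // Any other guest record: treated here as an external user.
  return "external-guest";
}
```

Filtering out the "own-user-guest-elsewhere" records would leave only the sign-ins of external guests.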
I have a document like below:
{
"_id": "1.0",
files: [
{"name": "file_1", "size": 1024, "create_ts": 1570862776426},
{"name": "file_2", "size": 2048, "create_ts": 1570862778426}
]
}
And I want to upsert "file_x" into "files":
1. If "file_x" is already in "files", update it. For example, "file_x" is:
{"name": "file_2", "size": 4096, "create_ts": 1570862779426}
After the upsert, the document is:
{
"_id": "1.0",
files: [
{"name": "file_1", "size": 1024, "create_ts": 1570862776426},
{"name": "file_2", "size": 4096, "create_ts": 1570862779426}
]
}
2. If "file_x" is not in "files", insert it. For example, "file_x" is:
{"name": "file_3", "size": 4096, "create_ts": 1570862779426}
After the upsert, the document is:
{
"_id": "1.0",
files: [
{"name": "file_1", "size": 1024, "create_ts": 1570862776426},
{"name": "file_2", "size": 2048, "create_ts": 1570862778426},
{"name": "file_3", "size": 4096, "create_ts": 1570862779426}
]
}
So can I use one function to achieve this?
You will need to do this manually. There's no upsert mechanism for embedded structures inside a document.
First fetch the document and check whether file_x is in the files list. If it is, update it; if not, insert it. Then save the document back.
You should make sure that at any given time, only one program / goroutine is trying to do this, otherwise you will run into race conditions and file_x might get inserted multiple times.
There is no single update operation in MongoDB's update language that will do what you want. You can get close with $addToSet, which adds an item to an array only if the item is not already there, but it will not update an item based on a match on a subset of its fields. Your best option is to read the document, update it in memory, and write it back.
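The in-memory step of that read, update, write cycle might look like the sketch below. It assumes that "name" is the field that identifies a file (as in the question's examples); the function name is made up, and fetching and saving the document are left out.

```javascript
// Hypothetical helper: returns a new "files" array with fileX upserted.
// An entry with the same name is replaced; otherwise fileX is appended.
function upsertFile(files, fileX) {
  var index = files.findIndex(function (file) {
    return file.name === fileX.name;
  });
  if (index === -1) {
    return files.concat([fileX]); // not present: insert
  }
  var copy = files.slice(); // present: replace without mutating the input
  copy[index] = fileX;
  return copy;
}
```

The caller would assign the returned array to the document's files field before saving it back, while making sure only one writer does this at a time.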
I have a collection with documents like this
"_id": ObjectId('55f02a779e6efb8'),
"msgId": "5fdf509c-5229-4e7c-87ff",
"statuses": [
{
"state": "QUEUED",
"timestamp": ISODate('2013-10-08T13:13:38.000Z')
},
{
"state": "PENDING",
"timestamp": ISODate('2013-10-08T13:13:49.000Z')
},
{
"state": "DELIVERED",
"timestamp": ISODate('2013-10-08T13:13:57.000Z')
}
]
I want to use $project to pick out the last two elements (embedded documents) of the nested array (the array size is not fixed) so that I can use them in a group operation in a later step.
I want something like $slice (aggregation), but it is not supported in the version of MongoDB I use (3.0.6).
Is there any way to access the last two elements of the array by index? If not, is there any other solution?
Where can I find a full list of the Tritium variables (such as $host, $path, $content_type, etc.)?
You can find all the accessible environment variables by looking at your tmp/messages/.../final-env.json file.
According to the doc:
All the environment variables available are listed in the
final-env.json file. This file can be found in the {Moovweb Project
Path}/tmp/messages/{Folder ID} directory. To use one of these
environment variables in Tritium, you need to add a dollar sign before
it: $variable.
They provide a sample final-env.json file there:
{
"0": "https://www.dropbox.com",
"1": "https://www.dropbox.com",
"2": "SAMEORIGIN",
"3": "HTTP/1.0",
"Content-Type-Charset": "UTF-8",
"__catch_all__": ".moovapp.com",
"accept_encoding": "gzip,deflate",
"asset_host": "http://localhost:3003/",
"body": "true",
"body_length": "1195",
"cache_control": "no-cache",
"canonical_found": "false",
"charset_determined": "UTF-8",
"connection": "close",
"content_type": "text/html; charset=utf-8",
"cookie": "gvc=Mjg1NjE0NTk0MjAxMDUyNjY4MTc1NjYyMDE3OTAxNjU0NDk4NTc2",
"date": "Fri, 07 Sep 2012 01:57:35 GMT",
"device_stylesheet": "main",
"found_conn": "true",
"header_hh": "Host: ",
"host": "mlocal.dropbox.com",
"host_hh": "https://mlocal.dropbox.com",
"key": "x_frame_options",
"location": "https://www.dropbox.com/",
"method": "GET",
"path": "/",
"pragma": "no-cache",
"rewriter_url": "false",
"secure": "false",
"server": "nginx/1.0.14",
"set_cookie": "flash=; Domain=dropbox.com; expires=Fri, 07-Sep-2012 01:57:35 GMT; Path=/; httponly",
"slash_path": "/",
"source_host": "www.dropbox.com",
"status": "302",
"use_global_replace_vars": "true",
"user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.54.16 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1",
"value": "SAMEORIGIN",
"x_frame_options": "SAMEORIGIN"
}
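If you want the list as Tritium-style names, the keys of the parsed file can simply be prefixed with a dollar sign, as the quoted documentation describes. This is a small sketch with a made-up function name; it takes the already parsed JSON object, and it does not check whether every key (for example "0" or "Content-Type-Charset") is actually usable as a Tritium variable.

```javascript
// Hypothetical helper: turns the keys of a parsed final-env.json object
// into "$name" strings, following the "add a dollar sign" rule above.
function tritiumVariableNames(finalEnv) {
  return Object.keys(finalEnv).map(function (name) {
    return "$" + name;
  });
}
```

Called with the sample above, the result would include entries such as "$host", "$path" and "$content_type".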