Add day to a date in Vega-Lite - date

I'm trying to add a day to my dates, which look like "2020-11-20" for November 20, 2020. However, I am having difficulty doing that - do I need to use the offset function? The reason I am doing this is that Vega-Lite is automatically offsetting my dates back 1 day through its GMT conversion and I cannot get it to stop. Please help!
Here is an example. If you look at the timeline graph, it ends at 2020-11-19, but the final date in my data is 2020-11-20, and I need to make it so that 2020-11-20 is the last date on my timeline graph.

This issue comes from an unfortunate "feature" of how JavaScript parses dates. Here is a minimal example of the problem you're seeing (open in editor):
{
  "data": {
    "values": [
      {"date": "2020-11-17", "value": 5},
      {"date": "2020-11-18", "value": 6},
      {"date": "2020-11-19", "value": 7},
      {"date": "2020-11-20", "value": 8}
    ]
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "value", "type": "quantitative"},
    "y": {
      "field": "date",
      "timeUnit": "yearmonthdate",
      "type": "ordinal"
    },
    "tooltip": [
      {
        "field": "date",
        "timeUnit": "yearmonthdate",
        "type": "temporal"
      }
    ]
  }
}
Each of the days in the chart is off by one compared to the input. So why is this happening?
Well, it turns out that Vega-Lite's renderer makes use of JavaScript's built-in date parsing, and JavaScript's date parsing treats inputs differently depending on how they're formatted. In particular, JavaScript parses date-only strings like "2020-11-20" as UTC, but parses full ISO 8601 date-time strings without a timezone, like "2020-11-20T00:00:00", as local time - a fact you can confirm in your browser's JavaScript console (I executed this on a computer set to PST):
> new Date('2020-11-20')
Thu Nov 19 2020 16:00:00 GMT-0800 (Pacific Standard Time)
> new Date('2020-11-20T00:00:00')
Fri Nov 20 2020 00:00:00 GMT-0800 (Pacific Standard Time)
The Vega-Lite docs recommend using UTC timeUnits and scales to work around this, but I tend to find that approach a bit clunky. Instead, I try to always specify dates in Vega-Lite via full ISO 8601 timestamps.
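For reference, the UTC-based workaround would look roughly like this - a sketch that swaps the UTC variant of the time unit into the minimal example above, not something taken from your actual spec:
"y": {
  "field": "date",
  "timeUnit": "utcyearmonthdate",
  "type": "ordinal"
},
"tooltip": [
  {
    "field": "date",
    "timeUnit": "utcyearmonthdate",
    "type": "temporal"
  }
]
With UTC time units both parsing and display happen in UTC, so the dates line up, but you have to remember the utc prefix on every time unit and scale that touches the field.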
In your case, the best approach is probably to use a calculate transform to regularize your dates, and proceed from there. Modifying the simplified example above, it might look something like this (open in editor):
{
  "data": {
    "values": [
      {"date": "2020-11-17", "value": 5},
      {"date": "2020-11-18", "value": 6},
      {"date": "2020-11-19", "value": 7},
      {"date": "2020-11-20", "value": 8}
    ]
  },
  "transform": [
    {"calculate": "toDate(datum.date + 'T00:00:00')", "as": "date"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "value", "type": "quantitative"},
    "y": {
      "field": "date",
      "timeUnit": "yearmonthdate",
      "type": "ordinal"
    },
    "tooltip": [
      {
        "field": "date",
        "timeUnit": "yearmonthdate",
        "type": "temporal"
      }
    ]
  }
}

Related

MongoDB - MongoImport of JSON (jsonl) - Rename, change types and add fields

I'm new to MongoDB and have 4 different problems importing a big (16 GB) JSONL file into my MongoDB (simple PSA cluster).
Attached below you will find a sample entry from the mentioned JSON dump.
With this file, which I get from an external provider, I have 4 problems:
"hotel_id" is the key and should normally be (re-)named "_id"
"hotel_id" should be treated as a Number rather than as a string
"location" is not properly formatted as GeoJSON (if I understood the MongoDB manual correctly); it should look like
"location": {
"type": "Point",
"coordinates": [-93.26838,37.15845]
}
instead of
"location": {
"coordinates": {
"latitude": 37.15845,
"longitude": -93.26838
}
}
"dates" can this be used to efficiently update just the records which needs to be updated?
So my challenge is now to transform the data according to my needs before importing the data or at time of import, but in both cases of course as quickly as possible.
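To make the target shape concrete, here is roughly what each document should look like after the transformation, based on the sample entry below (only the affected fields are shown; everything else stays as-is):
{
  "_id": 12345,
  "name": "Test Hotel",
  "location": {
    "type": "Point",
    "coordinates": [-90.11838, 22.54845]
  },
  ...
}
Note that GeoJSON expects coordinates in [longitude, latitude] order.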
I have searched a lot for hints and best practices, but I have not been able to find a solution yet, maybe due to the fact that I'm a beginner with MongoDB.
I played around with "jq" to adjust the data, for example to add the type which seems to be necessary for the location (point 3), but wasn't really successful.
cat dump.jsonl | ./bin/jq --arg typeOfField Point '.location + {type: $typeOfField}'
Besides that, I imported a sample dump of roughly 500 MB, which took 1.5 minutes the first time (into an empty database). If I run it in "upsert" mode it takes roughly 12 hours. So I was also wondering what the best practice is for importing such a big JSON dump?
Any help is appreciated!! :-)
Kind regards,
Lumpy
{
"hotel_id": "12345",
"name": "Test Hotel",
"address": {
"line_1": "123 Test St",
"line_2": "Apt A",
"city": "Test City",
},
"ratings": {
"property": {
"rating": "3.5",
"type": "Star"
},
"guest": {
"count": 48382,
"average": "3.1"
}
},
"location": {
"coordinates": {
"latitude": 22.54845,
"longitude": -90.11838
}
},
"phone": "555-0153",
"fax": "555-7249",
"category": {
"id": 1,
"name": "Hotel"
},
"rank": 42,
"dates": {
"added": "1998-07-19T05:00:00.000Z",
"updated": "2018-03-22T07:23:14.000Z"
},
"statistics": {
"11": {
"id": 11,
"name": "Total number of rooms - 220",
"value": "220"
},
"12": {
"id": 12,
"name": "Number of floors - 7",
"value": "7"
}
},
"chain": {
"id": -2,
"name": "Test Hotels"
},
"brand": {
"id": 2,
"name": "Test Brand"
}
}

Azure Data Factory copy day before data from slicestart date

Can somebody let me know how to get the previous day's data (i.e. 2017-07-28, etc.) from my on-premises file system if my pipeline start and end dates are
"start": "2017-07-29T00:00:00Z",
"end": "2017-08-03T00:00:00Z"
My pipeline's input is "FileSystemSource" and output is "AzureDataLakeStore". I have tried the below JSON in my copy pipeline as input:
"inputs": [
{
"name": "OnPremisesFileInput2"
"startTime": "Date.AddDays(SliceStart, -1)",
"endTime": "Date.AddDays(SliceEnd, -1)"
}
]
I have also tried defining "offset" in the input and output datasets and in the pipeline as follows
"availability": {
"frequency": "Day",
"interval": 1,
"offset": "-1.00:00:00",
"style": "StartOfInterval"
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "-1.00:00:00",
"style": "StartOfInterval"
},
None of the above seems to work. Could someone please help?
I think a good strategy here is to think of yesterday's output as today's input. Azure Data Factory lets you run activities one after another in sequence, using different data sources.
There's good documentation on chaining activities in the Data Factory docs. With that approach you can either have temporary storage in between the two activities, or use your main input data source with a filter to get only yesterday's slice.
Your offset should be positive.
"availability": {
"frequency": "Day",
"interval": 1,
"offset": "01:00:00",
"style": "EndOfInterval"
}
In this case it will run, for example, on September 7th at 1:00 AM UTC and will process the slice from September 6th 00:00 UTC to September 7th 00:00 UTC, which is yesterday's slice.
Your input dataset should be configured to use SliceStart for the naming of the file:
"partitionedBy": [
  {
    "name": "Slice",
    "value": {
      "type": "DateTime",
      "date": "SliceStart",
      "format": "yyyyMMdd"
    }
  }
],
"typeProperties": {
  "fileName": "{Slice}.csv"
}
It would look for the file 20170906.csv when executed on September 7th.
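Putting the pieces together, the input dataset would look roughly like the sketch below. This only combines the fragments above; the dataset type, linked service name, and folder path are placeholders you'd replace with whatever your on-premises file dataset actually uses, and note that partitionedBy is nested under typeProperties here, which is where I believe it lives in a v1 dataset definition (double-check against the Data Factory docs):
{
  "name": "OnPremisesFileInput2",
  "properties": {
    "type": "FileShare",
    "linkedServiceName": "OnPremisesFileServerLinkedService",
    "typeProperties": {
      "folderPath": "yourfolder/",
      "fileName": "{Slice}.csv",
      "partitionedBy": [
        {
          "name": "Slice",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "yyyyMMdd"
          }
        }
      ]
    },
    "availability": {
      "frequency": "Day",
      "interval": 1,
      "offset": "01:00:00",
      "style": "EndOfInterval"
    }
  }
}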

Is there a possibility to have another timestamp as dimension in Druid?

Is it possible to have a Druid datasource with 2 (or multiple) timestamps in it?
I know that Druid is a time-based DB and I have no problem with that concept, but I'd like to add another dimension that I can work with like a timestamp.
e.g. user retention: the metric is of course tied to a certain date, but I also need to create cohorts based on users' registration dates and roll those dates up to weeks or months, or filter to only certain time periods...
If the functionality is not supported, are there any plug-ins? Any dirty solutions?
Although I'd rather wait for an official implementation of full timestamp-dimension support in Druid, I've found the 'dirty' hack I've been looking for.
DataSource Schema
First things first, I wanted to know how many users logged in on each day, while being able to aggregate by date/month/year cohorts.
Here's the data schema I used:
"dataSchema": {
"dataSource": "ds1",
"parser": {
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "iso"
},
"dimensionsSpec": {
"dimensions": [
"user_id",
"platform",
"register_time"
],
"dimensionExclusions": [],
"spatialDimensions": []
}
}
},
"metricsSpec": [
{ "type" : "hyperUnique", "name" : "users", "fieldName" : "user_id" }
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "DAY",
"intervals": ["2015-01-01/2017-01-01"]
}
},
So the sample data should look something like this (each record is a login event):
{"user_id": 4151948, "platform": "portal", "register_time": "2016-05-29T00:45:36.000Z", "timestamp": "2016-06-29T22:18:11.000Z"}
{"user_id": 2871923, "platform": "portal", "register_time": "2014-05-24T10:28:57.000Z", "timestamp": "2016-06-29T22:18:25.000Z"}
As you can see, the "main" timestamp against which I calculate these metrics is the timestamp field, while register_time is just a dimension stored as a string in ISO 8601 UTC format.
Aggregating
And now, for the fun part: I've been able to aggregate by timestamp (date) and register_time (date again) thanks to the timeFormat extraction function.
The query looks like this:
{
  "intervals": "2016-01-20/2016-07-01",
  "dimensions": [
    {
      "type": "extraction",
      "dimension": "register_time",
      "outputName": "reg_date",
      "extractionFn": {
        "type": "timeFormat",
        "format": "yyyy-MM-dd",
        "timeZone": "Europe/Bratislava",
        "locale": "sk-SK"
      }
    }
  ],
  "granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
  "aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
  "dataSource": "ds1",
  "queryType": "groupBy"
}
Filtering
The solution for filtering is based on the JavaScript extraction function, with which I can transform the date to UNIX time and use it inside, for example, a bound filter:
{
  "intervals": "2016-01-20/2016-07-01",
  "dimensions": [
    "platform",
    {
      "type": "extraction",
      "dimension": "register_time",
      "outputName": "reg_date",
      "extractionFn": {
        "type": "javascript",
        "function": "function(x) {return Date.parse(x)/1000}"
      }
    }
  ],
  "granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
  "aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
  "dataSource": "ds1",
  "queryType": "groupBy",
  "filter": {
    "type": "bound",
    "dimension": "register_time",
    "outputName": "reg_date",
    "alphaNumeric": "true",
    "extractionFn": {
      "type": "javascript",
      "function": "function(x) {return Date.parse(x)/1000}"
    }
  }
}
I've tried to filter it 'directly' with a JavaScript filter, but I haven't been able to convince Druid to return the correct records, although I've double-checked the function with various JavaScript REPLs - but hey, I'm no JavaScript expert.
Unfortunately, Druid has only one timestamp column that can be used for rollup. Currently Druid treats all other columns as strings (except metrics, of course), so you can add another string column with timestamp values, but the only thing you can do with it is filter.
I guess you might be able to hack it that way.
Hopefully in the future Druid will allow different column types, and maybe timestamp will be one of them.
Another solution is to add a longMin sort of metric for the timestamp and store the epoch time in that field, or to convert the datetime to a number and store it (e.g. 31st March 2021 08:00 becomes 310320210800).
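As a rough illustration of that idea - assuming the events already carry a numeric register_time_epoch field holding the registration time in epoch milliseconds, which is not part of the schema above - the metricsSpec could be extended like this:
"metricsSpec": [
  { "type": "hyperUnique", "name": "users", "fieldName": "user_id" },
  { "type": "longMin", "name": "register_time_epoch", "fieldName": "register_time_epoch" }
]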
As of Druid 0.22 it is stated in the documentation that secondary timestamps should be handled/parsed as dimensions of type long. Secondary timestamps can be parsed to longs at ingestion time with a transformSpec and transformed back if needed at query time.
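A minimal sketch of that approach, reusing the register_time field from the schema above (the expression transform and timestamp_parse are standard Druid ingestion features, but treat the exact spelling as something to verify against your Druid version):
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "register_time_long",
      "expression": "timestamp_parse(register_time)"
    }
  ]
},
"dimensionsSpec": {
  "dimensions": [
    "user_id",
    "platform",
    { "type": "long", "name": "register_time_long" }
  ]
}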

Parts of Wunderground API Inaccessible In Swift

I am using the Wunderground API in my project, and the part of the API I want to use looks like this:
"history": {
"dailysummary": [
{ "date": {
"pretty": "12:00 PM PDT on August 12, 2015",
"year": "2015",
"mon": "08",
"mday": "12",
"hour": "12",
"min": "00",
"tzname": "America/Los_Angeles"
},
"fog":"0","rain":"0","snow":"0","snowfallm":"0.00", "snowfalli":"0.00","monthtodatesnowfallm":"", "monthtodatesnowfalli":"","since1julsnowfallm":"", "since1julsnowfalli":"","snowdepthm":"", "snowdepthi":"","hail":"0","thunder":"0","tornado":"0","meantempm":"26", "meantempi":"79","meandewptm":"16", "meandewpti":"60","meanpressurem":"1014", "meanpressurei":"29.94","meanwindspdm":"9", "meanwindspdi":"5","meanwdire":"","meanwdird":"331","meanvism":"16", "meanvisi":"10","humidity":"","maxtempm":"33", "maxtempi":"91","mintempm":"19", "mintempi":"66","maxhumidity":"78","minhumidity":"34","maxdewptm":"17", "maxdewpti":"62","mindewptm":"15", "mindewpti":"59","maxpressurem":"1016", "maxpressurei":"30.01","minpressurem":"1012", "minpressurei":"29.88","maxwspdm":"24", "maxwspdi":"15","minwspdm":"0", "minwspdi":"0","maxvism":"16", "maxvisi":"10","minvism":"16", "minvisi":"10","gdegreedays":"28","heatingdegreedays":"0","coolingdegreedays":"14","precipm":"0.00", "precipi":"0.00","precipsource":"","heatingdegreedaysnormal":"0","monthtodateheatingdegreedays":"0","monthtodateheatingdegreedaysnormal":"0","since1sepheatingdegreedays":"","since1sepheatingdegreedaysnormal":"","since1julheatingdegreedays":"0","since1julheatingdegreedaysnormal":"17","coolingdegreedaysnormal":"5","monthtodatecoolingdegreedays":"106","monthtodatecoolingdegreedaysnormal":"69","since1sepcoolingdegreedays":"","since1sepcoolingdegreedaysnormal":"","since1jancoolingdegreedays":"600","since1jancoolingdegreedaysnormal":"280" }
]
}
For some reason, dailysummary, which has both "{}" brackets and "[]" brackets, cannot be accessed the way I would normally, like this:
var jsonData = json["history"]["dailysummary"]["fog"]
which, if this worked normally, would return the fog value in my function. The function works fine; I've tested it with other parts of the API. Is there something specific that needs to be done for dailysummary?
You're misinterpreting how to index into the JSON. It might help to beautify the JSON so it looks like the below:
{
  "history": {
    "dailysummary": [{
      "date": {
        "pretty": "12:00 PM PDT on August 12, 2015",
        "year": "2015",
        "mon": "08",
        "mday": "12",
        "hour": "12",
        "min": "00",
        "tzname": "America/Los_Angeles"
      },
      "fog": "0",
      ...
    }]
  }
}
With the above it's pretty self-explanatory that dailysummary is an array containing a single object. To access 'fog' you first index into the array, i.e. json.history.dailysummary[0].fog (or, in the subscript style from your question, json["history"]["dailysummary"][0]["fog"]).

MongoDB Database Structure and Best Practices Help

I'm in the process of developing Route Tracking/Optimization software for my refuse collection company and would like some feedback on my current data structure/situation.
Here is a simplified version of my MongoDB structure:
Database: data
Collections:
“customers” - data collection containing all customer data.
[
{
"cust_id": "1001",
"name": "Customer 1",
"address": "123 Fake St",
"city": "Boston"
},
{
"cust_id": "1002",
"name": "Customer 2",
"address": "123 Real St",
"city": "Boston"
},
{
"cust_id": "1003",
"name": "Customer 3",
"address": "12 Elm St",
"city": "Boston"
},
{
"cust_id": "1004",
"name": "Customer 4",
"address": "16 Union St",
"city": "Boston"
},
{
"cust_id": "1005",
"name": "Customer 5",
"address": "13 Massachusetts Ave",
"city": "Boston"
}, { ... }, { ... }, ...
]
“trucks” - data collection containing all truck data.
[
{
"truckid": "21",
"type": "Refuse",
"year": "2011",
"make": "Mack",
"model": "TerraPro Cabover",
"body": "Mcneilus Rear Loader XC",
"capacity": "25 cubic yards"
},
{
"truckid": "22",
"type": "Refuse",
"year": "2009",
"make": "Mack",
"model": "TerraPro Cabover",
"body": "Mcneilus Rear Loader XC",
"capacity": "25 cubic yards"
},
{
"truckid": "12",
"type": "Dump",
"year": "2006",
"make": "Chevrolet",
"model": "C3500 HD",
"body": "Rugby Hydraulic Dump",
"capacity": "15 cubic yards"
}
]
“drivers” - data collection containing all driver data.
[
{
"driverid": "1234",
"name": "John Doe"
},
{
"driverid": "4321",
"name": "Jack Smith"
},
{
"driverid": "3421",
"name": "Don Johnson"
}
]
“route-lists” - data collection containing all predetermined route lists.
[
{
"route_name": "monday_1",
"day": "monday",
"truck": "21",
"stops": [
{
"cust_id": "1001"
},
{
"cust_id": "1010"
},
{
"cust_id": "1002"
}
]
},
{
"route_name": "friday_1",
"day": "friday",
"truck": "12",
"stops": [
{
"cust_id": "1003"
},
{
"cust_id": "1004"
},
{
"cust_id": "1012"
}
]
}
]
"routes" - data collections containing data for all active and completed routes.
[
{
"routeid": "1",
"route_name": "monday1",
"start_time": "04:31 AM",
"status": "active",
"stops": [
{
"customerid": "1001",
"status": "complete",
"start_time": "04:45 AM",
"finish_time": "04:48 AM",
"elapsed_time": "3"
},
{
"customerid": "1010",
"status": "complete",
"start_time": "04:50 AM",
"finish_time": "04:52 AM",
"elapsed_time": "2"
},
{
"customerid": "1002",
"status": "incomplete",
"start_time": "",
"finish_time": "",
"elapsed_time": ""
},
{
"customerid": "1005",
"status": "incomplete",
"start_time": "",
"finish_time": "",
"elapsed_time": ""
}
]
}
]
Here is the process thus far:
Each day drivers begin by Starting a New Route. Before starting a new route drivers must first input data:
driverid
date
truck
Once all data is entered correctly, the Start a New Route process will begin:
Create new object in collection “routes”
Query collection “route-lists” for “day” + “truck” match and return "stops"
Insert “route-lists” data into “routes” collection
As the driver proceeds with his daily stops/tasks, the “routes” collection will update accordingly.
On completion of all tasks the driver will then have the ability to Complete the Route process by simply changing the “status” field from “active” to “complete” in the "routes" collection.
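In MongoDB terms that last step is just an update on the "routes" collection whose query document and update modifier look roughly like this (field names taken from the documents above):
query: { "routeid": "1" }
update: { "$set": { "status": "complete" } }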
That about sums it up. Any feedback, opinions, comments, links, optimization tactics are greatly appreciated.
Thanks in advance for your time.
Your database schema looks to me like a 'classic' relational database schema. MongoDB is a good fit for data denormalization. I guess that when you display routes you are loading all the related customers, the driver, and the truck.
If you want to make your system really fast, you may embed everything in the route collection.
So I suggest the following modifications to your schema:
customers - as-is
trucks - as-is
drivers - as-is
route-list:
Embed the customer data inside stops instead of a reference. Also embed the truck. In this case the schema will be:
{
  "route_name": "monday_1",
  "day": "monday",
  "truck": {
    "_id": 1
    // here will be all truck data
  },
  "stops": [
    {
      "customer": {
        "_id": 1
        // here will be all customer data
      }
    },
    {
      "customer": {
        "_id": 2
        // here will be all customer data
      }
    }
  ]
}
routes:
When the driver starts a new route, copy the route from route-list and in addition embed the driver information:
{
  // copy all route-list data (just make a new id for the current route and keep a reference
  // to the route-list; this way you will be able to sync the route with its route-list)
  "_id": "1",
  "route_list_id": 1,
  "start_time": "04:31 AM",
  "status": "active",
  "driver": {
    // embed all driver data here
  },
  "stops": [
    {
      "customer": {
        // all customer data
      },
      "status": "complete",
      "start_time": "04:45 AM",
      "finish_time": "04:48 AM",
      "elapsed_time": "3"
    }
  ]
}
I guess you're asking yourself what to do if a driver, a customer, or other denormalized data changes in the main collection. Yes, you need to update all the denormalized data in the other collections. You will probably need to update billions of documents (depending on your system size), and that's okay. You can do it asynchronously if it takes a long time.
What are the benefits of the above data structure?
Each document contains all the data that you may need to display in your application. So, for instance, you don't need to load the related customers, driver, and truck when you need to display routes.
You can run more complex queries against your database. For example, with this schema you can build a query that returns all routes containing a stop for a customer with name = "Bill" (in your current schema you would need to load the customer by name first, get the id, and then look up routes by customer id).
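With the embedded schema above, that query becomes a single dot-notation filter on the routes collection, roughly:
{ "stops.customer.name": "Bill" }
passed directly to find() on routes, with no preliminary lookup in the customers collection.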
You are probably thinking that your data can become unsynchronized in some cases; to address this you just need to build a few unit tests to ensure that you update your denormalized data correctly.
Hope the above helps you see the world from the non-relational side, from a document database point of view.