MongoDB: storing varying-interval time series data in one collection

I have weather data coming from various sources (API, IoT measurements), which arrive at different granularities (daily, hourly, per-minute).
You can imagine my data to look like:
2022-01-01 | source:api | data..
2022-01-01 01:00 | source:api | data..
2022-01-01 02:00 | source:api | data..
2022-01-01 00:30 | source:iot | data..
2022-01-01 01:00 | source:iot | data..
2022-01-01 01:30 | source:iot | data..
2022-01-02 | source:api | data..
Depending on the service, I sometimes need my data in a daily resolution, sometimes hourly.
My initial ideas were to store them in one of two ways. The first is time buckets grouped by day, e.g.:
2022-01-01
[d]
[h1, h2,..]
[m1, m2, m3 ...]
2022-01-02
[d]
[h1, h2,..]
[m1, m2, m3 ...]
The second is to save a resolution variable (daily, hourly, minute) on every document.
I wonder what the best data design strategy would be that would also work long term.
Some additional things to consider:
The data is also used by user-facing services (e.g. an API), which can make many requests a day. However, these requests target a specific resolution and source. Calls that need a combination of sources or resolutions happen only a few times a day.
Sometimes the source is chosen by precedence, based on what data is present. E.g. use minute IoT data if it exists, otherwise fall back to daily API data.

It is important to let the data in the database "explain itself" clearly without special hidden knowledge. I suggest you keep it clear and simple by storing each incoming item with a true datetime type (not an ISO-8601 string) and a "resolution enum" to indicate what the timestamp actually means. 2020-06-22T00:00:00Z without the enum field is ambiguous.
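As a rough illustration of that suggestion, here is a minimal Python sketch of what such documents could look like, together with the "use minute IoT data, otherwise daily API data" precedence from the question. The field names (ts, source, resolution, data), the Resolution enum and the helper functions are all hypothetical; the timestamps follow the question's example and the data payloads are left empty as placeholders.

```python
from datetime import date, datetime, timezone
from enum import Enum


class Resolution(Enum):
    # Hypothetical enum; the value names are illustrative.
    DAILY = "daily"
    HOURLY = "hourly"
    MINUTE = "minute"


def utc(*args):
    """Build a timezone-aware UTC datetime (a real datetime, not a string)."""
    return datetime(*args, tzinfo=timezone.utc)


def make_reading(ts, source, resolution, data):
    """Build one document: a real datetime plus an explicit resolution."""
    return {"ts": ts, "source": source, "resolution": resolution.value, "data": data}


# Timestamps follow the question's example; the payloads are empty placeholders.
readings = [
    make_reading(utc(2022, 1, 1), "api", Resolution.DAILY, {}),
    make_reading(utc(2022, 1, 1, 1, 0), "api", Resolution.HOURLY, {}),
    make_reading(utc(2022, 1, 1, 0, 30), "iot", Resolution.MINUTE, {}),
    make_reading(utc(2022, 1, 2), "api", Resolution.DAILY, {}),
]


def pick_by_precedence(docs, day, precedence):
    """Return the docs of the first (source, resolution) pair that has data for `day`."""
    for source, resolution in precedence:
        hits = [
            d for d in docs
            if d["source"] == source
            and d["resolution"] == resolution.value
            and d["ts"].date() == day
        ]
        if hits:
            return hits
    return []


preferred = [("iot", Resolution.MINUTE), ("api", Resolution.DAILY)]
jan1 = pick_by_precedence(readings, date(2022, 1, 1), preferred)  # IoT minute data exists
jan2 = pick_by_precedence(readings, date(2022, 1, 2), preferred)  # falls back to API daily
```

Because the midnight timestamp carries resolution "daily", it can no longer be confused with an hourly or minute reading taken at 00:00. Drivers such as PyMongo store Python datetime values as BSON dates, so documents shaped like this keep a true date type in the collection.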

Related

How to get the date difference between rows for every 100th instance in PostgreSQL

I have a table where my product subscription data is recorded with columns such as date, amount, product, plan. I want to show the difference in days for every 100 subscriptions.
Subscription Range | Days
1-100 | 10 days
101-200 | 7 days
201-300 | 8 days
Please help me with the query to achieve this.

How to load DataFrame from semi-structured textfile?

I have a semi-structured text file which I want to convert to a DataFrame in Spark. I have a schema in mind, shown below. However, I am finding it challenging to parse the text file and assign the schema.
Following is my sample text file:
"good service"
Tom Martin (USA) 17th October 2015
4
Long review..
Type Of Traveller Couple Leisure
Cabin Flown Economy
Route Miami to Chicago
Date Flown September 2015
Seat Comfort 12345
Cabin Staff Service 12345
Ground Service 12345
Value For Money 12345
Recommended no
"not bad"
M Muller (Canada) 22nd September 2015
6
Yet another long review..
Aircraft TXT-101
Type Of Customer Couple Leisure
Cabin Flown FirstClass
Route IND to CHI
Date Flown September 2015
Seat Comfort 12345
Cabin Staff Service 12345
Food & Beverages 12345
Inflight Entertainment 12345
Ground Service 12345
Value For Money 12345
Recommended yes
.
.
The schema and result that I expect are as follows:
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| Review_Header | User_Name | User_Country | User_Review_Date | Overall Score | Review | Aircraft | Type of Traveler | Cabin Flown | Route_Source | Route_Destination | Date Flown | Seat Comfort | Cabin Staff Service | Food & Beverage | Inflight Entertainment | Ground Service | Wifi & Connectivity | Value for Money |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| "good service" | Tom Martin | USA | 17th October 2015 | 4 | Long review.. | | Couple Leisure | Economy | Miami | Chicago | September 2015 | 12345 | 12345 | | | 12345 | | 12345 |
| "not bad" | M Muller | Canada | 22nd September 2015 | 6 | Yet another long review.. | TXT-101 | Couple Leisure | FirstClass | IND | CHI | September 2015 | 12345 | 12345 | 12345 | 12345 | 12345 | | 12345 |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
As you may notice, for each block of data in the text file, the first four lines are mapped to user-defined columns such as Review_Header, User_Name, User_Country and User_Review_Date, whereas each of the remaining lines names its own column.
What would be the best way to use schema inference in this scenario, rather than writing verbose code?
UPDATE: I would like to make this problem a little trickier. What if "Long review.." and "Yet another long review.." can themselves span multiple lines? How can I parse a review that runs over several lines within each block?
If you can guarantee that the semi-structured text file has records separated by two newlines, and that those two newlines never appear in the "Long review..." section, you may be able to use textFile with a modified record delimiter ("\n\n") and then process the lines without writing a custom file format.
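// Treat a blank line (two newlines) as the record separator instead of a single newline.
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n\n")
// sc.textFile returns an RDD[String]; each element is now one whole review block.
val blocks = sc.textFile("sample-file.txt")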
Then you can do further splitting on "\n" and "\t" to create your fields and columns.
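To make that splitting concrete, here is a small plain-Python sketch (no Spark needed) of the same idea: blocks separated by "\n\n", lines by "\n", and attribute names separated from values by "\t". The helper names split_blocks and split_attribute are hypothetical, and the tab separator is an assumption about the real file, since the sample as posted does not show which whitespace is used.

```python
def split_blocks(text):
    """Split raw file text into review blocks, then each block into lines."""
    blocks = [b for b in text.split("\n\n") if b.strip()]
    return [block.split("\n") for block in blocks]


def split_attribute(line):
    """Split 'Name<TAB>Value' into a (name, value) pair (tab separator is assumed)."""
    name, _, value = line.partition("\t")
    return name.strip(), value.strip()


# Shortened version of the question's sample, with assumed tabs and a blank line between blocks.
sample = (
    '"good service"\nTom Martin (USA) 17th October 2015\n4\nLong review..\n'
    "Cabin Flown\tEconomy\nRecommended\tno\n"
    "\n"
    '"not bad"\nM Muller (Canada) 22nd September 2015\n6\nYet another long review..\n'
    "Aircraft\tTXT-101\nRecommended\tyes"
)

blocks = split_blocks(sample)
first_attribute = split_attribute(blocks[1][4])  # the Aircraft line of the second block
```

In Spark the same two functions could be applied per record with a map over the RDD of blocks; that wiring is left out here.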
Seeing your update, it's kind of a difficult problem. You have to ask yourself what identifying info is in the attributes that's not in the review. Or what is guaranteed to be in a specific format. E.g.
Can you guarantee there's not two newlines in the long review? This is important if we're splitting on "\n\n" to generate the blocks.
Can you guarantee there's no tabs in the long review?
Is Aircraft, Cabin Flown, Cabin Staff Service, Date Flown, Food & Beverages, Ground Service, ... the full list of attributes? Do you have a full list of possible attributes?
As well as some meta questions:
Where is this data coming from?
Can we request it in a better format?
Can we find this data, or the aspects we're looking for from a better source?
With those known, you'll have a better idea on how to proceed. E.g. if there are no tabs in the review text, (or they're escaped as "\t" or something):
Extract lines[0] - first line "good service"
Extract lines[1] - split to user name, country, review date
Filter lines[2:] containing tabs, get lowest index i - split into attributes
Join lines[2:i] with "\n" - this is the review
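The steps above can be sketched in Python roughly as follows. This assumes attribute lines contain a tab and review lines never do. The function name parse_block and the regular expression for the user line are hypothetical. One deliberate difference from the steps: the sketch treats lines[2] as the overall score and joins lines[3:i] as the review, because the sample blocks carry a score line before the review text.

```python
import re

# Assumed shape of the second line: "Name (Country) review date".
USER_LINE = re.compile(r"^(?P<name>.+?) \((?P<country>[^)]+)\) (?P<date>.+)$")


def parse_block(lines):
    """Parse one review block whose review text may span several lines."""
    user = USER_LINE.match(lines[1])
    # Lowest index (from 2 onwards) of a line containing a tab: start of the attributes.
    first_attr = next(
        (i for i in range(2, len(lines)) if "\t" in lines[i]), len(lines)
    )
    record = {
        "Review_Header": lines[0],
        "User_Name": user.group("name") if user else None,
        "User_Country": user.group("country") if user else None,
        "User_Review_Date": user.group("date") if user else None,
        "Overall Score": lines[2],
        # Everything between the score and the first attribute line is the review.
        "Review": "\n".join(lines[3:first_attr]),
    }
    for line in lines[first_attr:]:
        name, _, value = line.partition("\t")
        record[name.strip()] = value.strip()
    return record


example = parse_block([
    '"good service"',
    "Tom Martin (USA) 17th October 2015",
    "4",
    "Long review..",
    "second line of the same review",
    "Cabin Flown\tEconomy",
    "Recommended\tno",
])
```

Each block yields a dictionary with only the attributes it actually has, so missing columns such as Aircraft would still need to be filled with nulls when the records are turned into rows.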
What could be the best possible way to use schema inference technique in such scenario rather than writing verbose code?
You don't have much choice: you have to write verbose code or a custom FileFormat (which would hide the complexity of loading such files into a DataFrame).
Use DataFrameReader.textFile to load the file and transform it accordingly.
textFile(path: String): Dataset[String] Loads text files and returns a Dataset of String. See the documentation on the other overloaded textFile() method for more details.

How can I aggregate metrics per day in a Grafana table?

I am charting data with a Grafana table, and I want to aggregate all data points from a single day into one row in the table. My current setup displays the values on a per-minute basis.
Question: How can I make a Grafana table that displays values aggregated by day?
| Day | ReCaptcha | T & C |
|-------------------|------------|-------|
| February 21, 2017 | 9,001 | 8,999 |
| February 20, 2017 | 42 | 17 |
| February 19, 2017 | ... | ... |
You can use the summarize function on the metrics panel.
Change the query by pressing the + then selecting transform
summarize(24h, sum, false): this aggregates the data points in each 24-hour interval into a single point by summing them.
http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.summarize
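To show what summing into 24-hour buckets means, here is a minimal Python sketch. It is not Graphite's implementation, only an illustration of the idea; the function name summarize_sum and the (epoch_seconds, value) input format are assumptions, and buckets are aligned to multiples of the interval.

```python
from collections import defaultdict

DAY = 24 * 60 * 60  # length of the 24h interval in seconds


def summarize_sum(points, interval=DAY):
    """Sum (epoch_seconds, value) points into buckets aligned to multiples of `interval`."""
    buckets = defaultdict(float)
    for ts, value in points:
        bucket_start = ts - ts % interval  # start of the bucket this point falls into
        buckets[bucket_start] += value
    return sorted(buckets.items())


# Two per-minute points on each of two consecutive days (made-up values).
daily = summarize_sum([(0, 1), (60, 2), (86400, 5), (86460, 7)])
```

Each resulting pair is one row of the table: the start of the day and the total for that day.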

Joining time series events with daily 'shift' data?

What is the best practice for joining 'shift' data and other time series data in Tableau? I am working with data from multiple geographies (from LA to India, UK, NY, Malaysia, Australia, China etc.), and a lot of employees work past midnight.
For example, an employee has a shift from 9 PM to 6 AM on 2016-07-31. The 'report date' is 2016-07-31 but no time zone information is provided.
This employee does work, and there are events (timestamps in UTC) between 2016-07-31 21:00 and 2016-08-01 06:00. When I look at the events though, 7/31 will only have the events between 21:00 and 23:59. If I filter for just July, my calculations will be skewed (the event data will be cut off at midnight even though the shift extended to 6 AM).
I need to make calculations based upon the total time an employee was actually engaged with work (productive) and the total time they were paid. The request is for this to be daily/weekly/monthly.
If anyone can help me out here or give me some talking points to explain this to my superiors, it would be appreciated. This seems like it must be a common scenario. Do I need to request a new raw data format, or is there something I can do on my end?
The shift data only looks like this:
id date regular_hours overtime_hours total_hours
abc 2016-06-17 8 0.52 8.52
abc 2016-06-18 7.64 0.83 8.47
abc 2016-06-19 7.87 0.23 8.1
The event data is more detailed (30-minute interval data on events handled and the time it took to complete those events, in seconds):
id date interval events event_duration
abc 2016-06-17 01:30:00 4 688
abc 2016-06-17 02:00:00 6 924
abc 2016-06-17 02:30:00 10 1320
So, you sum up the event_duration for an entire day and you get a number of seconds which was actually spent doing work. You can then compare this to amount of time that the employee was paid to see how efficient the staffing is.
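For reference, the date-only comparison described above can be sketched in Python as follows, using rows copied from the samples. The function names and the tuple layout are hypothetical. This is the naive version that groups purely by the date column, so it has exactly the midnight cut-off problem the question is about.

```python
from collections import defaultdict

# (id, date, total_hours), taken from the shift sample.
shifts = [("abc", "2016-06-17", 8.52)]

# (id, date, interval, events, event_duration in seconds), taken from the event sample.
events = [
    ("abc", "2016-06-17", "01:30:00", 4, 688),
    ("abc", "2016-06-17", "02:00:00", 6, 924),
    ("abc", "2016-06-17", "02:30:00", 10, 1320),
]


def productive_seconds(event_rows):
    """Total event_duration per (id, date), using the date column as-is."""
    totals = defaultdict(int)
    for emp_id, day, _interval, _events, duration in event_rows:
        totals[(emp_id, day)] += duration
    return dict(totals)


def utilisation(shift_rows, event_rows):
    """Productive seconds divided by paid seconds, per (id, date)."""
    worked = productive_seconds(event_rows)
    return {
        (emp_id, day): worked.get((emp_id, day), 0) / (hours * 3600)
        for emp_id, day, hours in shift_rows
        if hours > 0
    }


worked = productive_seconds(events)
ratios = utilisation(shifts, events)
```

With only these three sample intervals, the productive time for 2016-06-17 is 688 + 924 + 1320 = 2932 seconds against 8.52 paid hours. Any events that belong to the same shift but fall on the next UTC date would be missing from that total.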
My concern is that the event data has the date and the time (UTC). The payroll data only has a date without any time zone information. This causes inaccuracies when blending data in Tableau because some shifts cross midnight. Is there a way around this or do I need to propose new data requirements?
(FYI - people have most likely been calculating this based on the date alone for years, without considering time zones. My assumption is that they just did not realize that this could cause inaccurate results.)

Merge periods in Java

I am working with Spring Batch. The batch reads several records at a time from the database, which look like this:
personId |fromDate| toDate | someCode
*100 | 05-05-2011 | 31-12-2011 | A
*100 | 01-01-2012 | 31-12-2012 | A
100 | 01-01-2013 | 03-03-2013 | B
101 | 05-05-2011 | 31-12-2011 | A
* periods to be merged.
What I want to do is merge the periods which have the same code and the same personId, but not those with a different code or personId.
The first question is: can I chunk this step? The problem is that commit intervals are static, so I might not get all the periods for a person in one chunk. Is it possible to have dynamic chunks based on how many records a person has in the table?
The next question is: what is the best way to merge the periods? Periods should be merged if the toDate is 31-12 and the next period starts on 01-01 of the next year.
I solved the problem by using two pointers in each object: one that points to the previous period and one that points to the next period.
For chunking, I needed to read all rows with the same person id and aggregate them.
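The merge rule itself can also be expressed without previous/next pointers. The following is a minimal sketch of that alternative, written in Python rather than Java to keep it short; it is not the two-pointer solution described above. The function names and the tuple layout (personId, fromDate, toDate, someCode) are assumptions for illustration.

```python
from datetime import date, timedelta
from itertools import groupby


def is_year_boundary(prev_to, next_from):
    """True when prev ends on 31-12 and next starts on 01-01 of the following year."""
    return (prev_to.month, prev_to.day) == (12, 31) and next_from == prev_to + timedelta(days=1)


def merge_periods(rows):
    """Merge consecutive periods that share personId and code across a year boundary."""
    merged = []
    # Sort so that all periods of one (person, code) pair are adjacent and in date order.
    ordered = sorted(rows, key=lambda r: (r[0], r[3], r[1]))
    for _key, group in groupby(ordered, key=lambda r: (r[0], r[3])):
        current = None
        for person_id, start, end, code in group:
            if current is not None and is_year_boundary(current[2], start):
                # Extend the running period instead of starting a new one.
                current = (current[0], current[1], end, code)
            else:
                if current is not None:
                    merged.append(current)
                current = (person_id, start, end, code)
        if current is not None:
            merged.append(current)
    return merged


# The rows from the question.
rows = [
    (100, date(2011, 5, 5), date(2011, 12, 31), "A"),
    (100, date(2012, 1, 1), date(2012, 12, 31), "A"),
    (100, date(2013, 1, 1), date(2013, 3, 3), "B"),
    (101, date(2011, 5, 5), date(2011, 12, 31), "A"),
]
result = merge_periods(rows)
```

Grouping by person before merging mirrors the chunking approach of reading all rows with the same person id together.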