Filtering a DocumentDB collection from Data Factory using a Unix timestamp

I am trying to select documents from a DocumentDB collection in an incremental way, so that every slice selects based on the collection's "timeCreated" field.
The problem is that this field (timeCreated) is in seconds since the epoch (1970-01-01), and I could not find the proper format here.
As a project constraint, we are working with the Azure portal and without any programming interface, so the only solution I could think of is creating a UDF in DocumentDB that transforms the seconds field into a dateTime field; any approach that involves only DocumentDB SQL would be much better.
This is the date data in the documentDB:
"serverTimestamp": {
"$date": 1446130451707
},
This is the way to use the slice's start and end dates in the pipeline (from the Azure documentation):
"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
Is there another way, besides UDF, to format WindowStart/WindowEnd to seconds?
Thanks!

As you mentioned, we can do that easily with a UDF; we could then use the function in the SQL query. DocumentDB now supports range indexes on both strings and numbers. In my opinion, we can format the field that we want to filter on instead of formatting WindowStart/WindowEnd to seconds. The following are the detailed test steps:
1. Set up the indexes correctly for this to work.
2. Create the UDF from the Azure portal:
function epochToDate (ts) {
    return new Date(ts * 1000);
}
Note: the ts value is in seconds, so it needs to be converted to milliseconds.
3. Validate it from the Azure Data Factory Copy Data wizard, using a source query along the lines of the sketch below.
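For reference, the copy activity's DocumentDB source could then call the UDF on the stored field and compare it against the formatted slice boundaries, along the lines of the following sketch. The source type name, the field name (timeCreated), and the exact date format string are assumptions here, and the comparison relies on the UDF's returned Date serializing to an ISO 8601 string, which is exactly what step 3 should confirm:
"source": {
    "type": "DocumentDbCollectionSource",
    "query": "$$Text.Format('SELECT * FROM c WHERE udf.epochToDate(c.timeCreated) >= \\'{0:yyyy-MM-ddTHH:mm:ss.fffZ}\\' AND udf.epochToDate(c.timeCreated) < \\'{1:yyyy-MM-ddTHH:mm:ss.fffZ}\\'', WindowStart, WindowEnd)"
},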

Related

Elastic Search Date Range Filter Not Working

Context
I have an index with a field called "date" which contains dates. I need an elasticsearch query that returns records where date is greater than a specific date value.
Issue
Running the following query with a range filter does not work as expected: records with earlier dates are returned in the result set.
{
    "size": 1000,
    "query": {
        "filtered": {
            "filter": {
                "range": {
                    "date": {
                        "gt": "2014-02-23T00:00:00"
                    }
                }
            }
        }
    }
}
Questions
- What is the correct query to pull data where date is greater than a specific value?
- If my query is syntactically correct, is there something else I can go check (e.g. whether the datatype of the field is actually date)?
- How should I go about root causing this?
- etc.
Solution
In lieu of implementing mapping, I came up with a partial solution. I used Chrome to analyze some of the Kibana traffic. I noticed Kibana is passing date filters as int values. So, I converted the dates to ints using Unix timestamp conversion and things are working now.
(Reference http://www.epochconverter.com/)
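For illustration, the converted filter ends up looking something like the query below, where 1393113600000 is the epoch-millisecond value for 2014-02-23T00:00:00 UTC (everything else is unchanged from the query above):
{
    "size": 1000,
    "query": {
        "filtered": {
            "filter": {
                "range": {
                    "date": {
                        "gt": 1393113600000
                    }
                }
            }
        }
    }
}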
What about mapping?
I looked at the mappings earlier. On my index they don't exist. I seem to recall reading that mappings will be inferred for known types that have strong consistency.
My date data is consistent:
- no nulls
- dates are passed through consistently from SQL, to C#, to Elastic
I guess I could implement a mapping, but I'm going with the epoch conversion for now, until I have a compelling reason to map the field.
Your query is syntactically correct.
Use get mapping API to see the document mapping:
curl -XGET 'http://localhost:9200/twitter/_mapping/tweet'
It's hard to say what is going wrong. Most likely the date field is not actually mapped as a date type.
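If the field does turn out to be mapped as a string, it cannot simply be changed in place; the usual route is to create a new index with an explicit date mapping and reindex into it. A sketch of such a mapping (the index and type names are placeholders, and the format shown is the Elasticsearch 1.x default):
curl -XPUT 'http://localhost:9200/myindex' -d '
{
  "mappings": {
    "mytype": {
      "properties": {
        "date": { "type": "date", "format": "dateOptionalTime" }
      }
    }
  }
}'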

Should I use the timestamp in "_id"?

I need to monitor when records are created, for further querying and modification.
The first thing that flashed into my mind was to give the document a "createDateTime" field with a default value of "new Date()", but MongoDB's documentation says the document _id has a timestamp embedded in it, generated when the document was created, so it seems redundant to add a new field for that.
Many times I've seen people set a "createDateTime" on their data, and I don't know whether they know about the details of MongoDB's _id.
Should I use the _id as a "createDateTime" field? What is the best practice, and what are the pros and cons?
Thanks for any tips.
I'd actually say it depends on how you want to use the date.
For example, the _id is not usable with the aggregation framework's date operators.
This will fail for example:
db.test.aggregate( { $group : { _id: { $year: "$_id" } } })
The following error occurs:
"errmsg" : "exception: can't convert from BSON type OID to Date"
(The date cannot be extracted from the ObjectId.)
So, operations that are normally simple date operations become much more complex if you want to do any sort of date math in an aggregation. It would be far easier to have a createDateTime stamp: counting the number of documents created in a particular year and month, for example, is simple using aggregation with a dedicated createDateTime field.
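For example, a per-year-and-month count using such a field is a single $group stage (a sketch; the collection and field names follow the question):
db.test.aggregate([
    { $group: {
        _id: { year: { $year: "$createDateTime" }, month: { $month: "$createDateTime" } },
        count: { $sum: 1 }
    }}
])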
You can sort on an ObjectId to some degree: the leading 4 bytes are a timestamp, but the remaining 8 bytes aren't sortable in a meaningful way. Also, most MongoDB drivers default to creating the ObjectId within the driver and not on the database, so if you've got multiple clients (like web servers, for example) creating new documents (and new ObjectIds), the time stamps will only be as accurate as the clocks of those various servers.
Also, depending on the precision you need: an ISODate value is stored using 8 bytes (millisecond precision), rather than the 4 bytes used for the timestamp portion of an ObjectId (second precision).
Yes, you should. There is no reason not to, besides a loss of human readability when looking directly into the database. See also here and here.
If you want to use the aggregation framework to group by the date within _id, this is not possible yet, as WiredPrairie correctly said. There is an open JIRA ticket for that which you might want to watch. But of course you can do this with map-reduce and ObjectId.getTimestamp(); an example of that can be found here.
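For completeness, reading the creation time back out of an existing _id in the shell is a one-liner (illustrative only; note it carries only second precision):
var doc = db.test.findOne();
// getTimestamp() decodes the leading 4 bytes of the ObjectId into an ISODate
doc._id.getTimestamp();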

get day of the week from mongodb datetime query

I am constructing a database where I might want to query a day of the week. Is it possible to use mongodb to query days in the week in a datetime (or utc timestamp) field?
Something like: get every object that has a datetime that falls on a Monday.
If it is not possible, the alternative seems to be to create dummy variables in the collection that record what day of the week it was. Preferably I would like to query only the datetime field for this, as that would keep the database smaller.
There are three solutions that I can think of:
1. Your solution: create an extra "day_of_week" field, either an int or string, and then query against this field rather than the datetime field.
2. Query for everything in your collection, and then filter the results by day of the week on the client side.
3. Use $where, passing a javascript function which calls date.getDay(). For example, {$where: function () { return this.date.getDay() == 5; }} for getting every date on a Friday.
Solution #2 would call datetime.date.weekday() in pymongo on the client side. The downside of this method is that every document in the collection will end up being sent over the wire, which could add unnecessary network load. It's better than #1, however, in that it's more space efficient and you don't have duplicated information to keep in sync. Solution #3 has neither of these problems, but $where is slow because it requires the server to create a JavaScript execution context and cannot make use of indexes.
PyMongo can return MongoDB BSON date/timestamp fields as Python datetimes: http://api.mongodb.org/python/current/api/bson/timestamp.html
From there you can call datetime.date.weekday()
http://docs.python.org/2/library/datetime.html#datetime.date
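A minimal PyMongo sketch of solution #2, assuming placeholder database/collection names and a stored datetime field called "date" (note that Python's weekday() returns 0 for Monday, unlike JavaScript's getDay()):
from pymongo import MongoClient

client = MongoClient()                    # assumes a local mongod on the default port
coll = client["mydb"]["mycollection"]     # placeholder names

# Solution #2: pull everything and keep only documents whose date falls on a Monday
mondays = [doc for doc in coll.find() if doc["date"].weekday() == 0]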

mongo query based on field calculation

I am looking for a way to query Mongo for documents where a value calculated from two fields satisfies a comparison against a supplied variable.
For example, overlapping date ranges. I have a document with the following schema:
{startDate : someDate, endDate : otherDate, restrictions : {daysBefore : 5, daysAfter : 5}}
My user will supply their own date range like
var userInfo = { from : Date, to : Date}
I need the documents that satisfy this condition:
startDate - restrictions.daysBefore <= userInfo.to && endDate + restrictions.daysAfter >= userInfo.from;
I tried using a $where clause, but I lose the context of to and from since they are defined outside the scope of the where function.
I would like to do this without pulling down all of the results, or creating another field upon insert.
Is there a simple way this query can be done?
The aggregation framework [AF] will do what you want. The AF backend is written in C++ and is therefore much faster than using JavaScript, as an added bonus. Besides being faster than JavaScript, there are a number of reasons we discourage the use of $where, some of which can be found in the $where docs.
The AF docs (i.e. the good stuff to use):
http://docs.mongodb.org/manual/reference/aggregation/
I am uncertain of the format of the data you are storing, and this will also have an effect on performance. For instance, if the date is the standard date of milliseconds since Jan 1st 1970 (Unix epoch) and daysBefore is stored as (milliseconds per day) * (number of days), you can use simple math, as the example below does. This is very fast. If not, there are date conversions available in the AF, but doing the conversions on top of getting the differences is of course more expensive.
In Python (your profile mentions Django), datetime.timedelta can be used for daysBefore. For instance, for 5 days:
import datetime
daysBefore=datetime.timedelta(5)
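If daysBefore is to be stored as milliseconds to enable the simple-math approach, the timedelta converts easily (a small sketch):
import datetime

daysBefore = datetime.timedelta(days=5)
daysBeforeMs = int(daysBefore.total_seconds() * 1000)   # 432000000 ms for 5 days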
There are two main ways to go about what you want in the AF: do the calculation directly and match on it, or project a new field and match against that. Testing against your specific use case will be necessary for complicated or large-scale deployments. An aggregate command from the shell that projects the adjusted start date (with the boundary supplied by your program) and matches on it:
toDate = <program provided>   // userInfo.to as a Date
db.collection.aggregate([
    { "$project": { "effectiveStart": { "$subtract": ["$startDate", "$restrictions.daysBefore"] } } },
    { "$match": { "effectiveStart": { "$lte": toDate } } }
])
To apply both halves of your condition, project both adjusted dates and put both comparisons in the same $match (an implicit $and), as sketched below. I omitted that above for clarity.
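Putting it together, a fuller sketch covering both conditions from the question might look like the following. Field names follow the question; daysBefore/daysAfter are assumed to already be stored in milliseconds as discussed above (if they are day counts, wrap them in a $multiply by 86400000), and any original fields you still need after the $project must be listed there as well:
fromDate = <program provided>   // userInfo.from as a Date
toDate = <program provided>     // userInfo.to as a Date
db.collection.aggregate([
    { "$project": {
        "startDate": 1, "endDate": 1, "restrictions": 1,   // keep the original fields we care about
        "effectiveStart": { "$subtract": ["$startDate", "$restrictions.daysBefore"] },
        "effectiveEnd": { "$add": ["$endDate", "$restrictions.daysAfter"] }
    }},
    { "$match": {
        "effectiveStart": { "$lte": toDate },
        "effectiveEnd": { "$gte": fromDate }
    }}
])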
Further aggregation documentation for the AF can be found at:
http://api.mongodb.org/python/current/examples/aggregation.html#aggregation-framework
Note that “aggregation” also includes map-reduce in Mongo, but in this case the AF should be able to do it all (and much more quickly).
If you need any further information about the AF or if there is anything the docs don’t make clear, please don’t hesitate to ask.
Best,
Charlie

Rally bulk query api with date placeholder

I am using the Rally bulk query API to pull data from multiple tables. My issue happens when I try to use a placeholder for the Iteration's StartDate and pass it along to a following query in the same bulk request, i.e.
"iteration": "/Iteration?fetch=ObjectID,StartDate&query=(Name = \"Sprint 1\")",
"started": "${iteration.StartDate}",
"other_queries": "...?query=(CreatedDate > $(iteration.StartDate))"
The bulk service seems to convert this field to a formatted string. Is there a way to prevent this from happening? I am attempting to use the placeholder to limit other queries by date without making several requests.
It looks like the iteration object comes back with the date correctly, but when it is used as a placeholder it is automatically converted to a string.
"started": ["Wed Jan 16 22:00:00 MST 2013"],
"iteration": {
"Results": [
....
"StartDate": "2013-01-17T05:00:00.000Z",
]}
Unfortunately, no; as this functionality is currently implemented, this is expected behavior. The placeholder is converted to a formatted string server-side, so it will be necessary to formulate a similar follow-up request if the same data is needed in another query.
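In practice that follow-up request would just inline the StartDate string obtained from the first response, mirroring the syntax already shown in the question (how the date literal needs to be quoted is an assumption to verify against your Rally workspace):
"other_queries": "...?query=(CreatedDate > \"2013-01-17T05:00:00.000Z\")"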