Azure Data Factory - querying Cosmos DB using dynamic timestamps

I want to create and maintain a snapshot of a collection in Cosmos DB.
Periodically, I want to retrieve only the delta (new or modified documents) from Cosmos and write it to the snapshot, which will be stored in an Azure Data Explorer cluster.
I want to get the delta using the _ts member of the documents. In other words, I will fetch only records whose _ts falls within some range.
The range is the time window I get from a tumbling window trigger in the Data Factory.
The issue is that if I print the dynamic timestamps that I create in the query and hard-code them into the query, it works. But if I let the query generate them, I get no results.
For example:
I'm using these values to simulate the trigger's window range.
I use this query to create timestamps in Unix time,
and I see that the timestamps created are correct.
If I run my query using those hardcoded timestamps, I get results.
But if I run the query using the code that creates those timestamps, I get no results.
This is the code to create the timestamps:
select
DateTimeToTimestamp('#{formatDateTime('2020-05-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000,
DateTimeToTimestamp('#{formatDateTime('2020-08-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000
Does anyone have a clue as to what might be the issue?
Any other way to achieve this is also welcome.
Thanks
EDIT: I managed to work around this by taking the other, simpler option:
where TimestampToDateTime(c._ts*1000)> "#{formatDateTime(pipeline().parameters.windowStart,'yyyy-MM-ddTHH:mm:ss.fffffffZ')}"

We are glad that you resolved this problem:
You managed to work around this by taking the other, simpler option:
where TimestampToDateTime(c._ts*1000)> "#{formatDateTime(pipeline().parameters.windowStart,'yyyy-MM-ddTHH:mm:ss.fffffffZ')}"
I think the error in the first option is most likely caused by a data type mismatch between c._ts and DateTimeToTimestamp('#{formatDateTime('2020-05-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000.
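For reference, a minimal sketch of a Cosmos DB source query that filters on both window bounds, assuming the tumbling window trigger's start and end are mapped to pipeline parameters named windowStart and windowEnd (the windowEnd parameter is an assumption for illustration):
-- Hypothetical ADF source query; both bounds come from pipeline parameters
-- mapped from the tumbling window trigger.
select * from c
where TimestampToDateTime(c._ts * 1000) > "#{formatDateTime(pipeline().parameters.windowStart,'yyyy-MM-ddTHH:mm:ss.fffffffZ')}"
and TimestampToDateTime(c._ts * 1000) <= "#{formatDateTime(pipeline().parameters.windowEnd,'yyyy-MM-ddTHH:mm:ss.fffffffZ')}"
Comparing on the DateTime string side, as in the workaround, avoids mixing the numeric _ts with a string expression.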

Related

How to combine data from PostgreSQL and dynamic JSON in Grafana

I have a Grafana dashboard where I want to use an Orchestra Cities map panel to show the status of some stations. The status is available as JSON from an HTTP server (using Nagios for this part), but the status data knows nothing about the locations of the stations; those I have in a PostGIS database.
I know I can set up a script that reads the status JSON and inserts the data into a table in the PostGIS database. This could run every five minutes or so, but it feels a bit kludgy, so I wonder if there are other ways of doing this.
Would it be possible to use a foreign data wrapper to fetch the JSON into PostGIS? The only JSON FDW I have found reads from a set of files; I would need to read from an HTTP server.
If not, is it possible to combine data from JSON and Postgres in one data set in Grafana? I can read data from both sources and present them, e.g. as time series in one panel, but here I need to be able to join the two, so that I can use some of the attributes from the JSON to categorize the points from PostGIS (or the other way around, if that is easier).
In theory you can do that in Grafana. You need two queries returning results from both sources (how to write the queries and configure the data sources for that is out of scope for this question), plus a key that can be used to join the two result sets (e.g. city_id).
Then you can use the join transformation to "join" both query results into a single dataset.

MongoDB into AWS Redshift

We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB's query capabilities (including the aggregation framework) for insight into the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
AWS Redshift
Hadoop + Hive
We want to be able to use a SQL-like syntax to analyze our data, and we want close to real-time access to it (a few minutes of latency is fine; we just don't want to wait for the whole of MongoDB to sync overnight).
As far as I can gather, for option 2 one can use https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone who has done something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding our own migrator using NodeJS.
I got a bit irritated with answers explaining what Redshift and MongoDB are, so I decided to take the time to share what I had to do in the end.
Timestamped data
Basically, we ensure that all the MongoDB collections we want migrated to tables in Redshift are timestamped, and indexed on that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a Mongo collection to a Redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it by the migrator engine) and only returns the data that has changed since the last successful migration for that plugin.
How the cursors are used
The migrator engine then uses this cursor and loops through each record.
It calls back to the plugin for each record to transform the document into an array, which the migrator uses to build a delimited line that it streams to a file on disk. We use tabs to delimit this file, as our data contains a lot of commas and pipes.
Delimited exports from S3 into a table in Redshift
The migrator then uploads the delimited file to S3 and runs the Redshift COPY command to load the file from S3 into a temp table, using the plugin configuration to get the table name and a naming convention to denote it as a temporary table.
So, for example, if I had a plugin configured with a table name of employees, it would create a temp table named temp_employees.
Now we've got data in this temp table, and the records in it keep their ids from the originating MongoDB collection. This allows us to run a delete against the target table (in our example, the employees table) wherever the id is present in the temp table. If any of the tables don't exist, they get created on the fly, based on a schema provided by the plugin. Then we insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so deleted rows are simply updated with an is_deleted flag in Redshift.
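A rough sketch of that load pattern in Redshift SQL, using the hypothetical employees example above (the S3 path, IAM role, and column layout are placeholders, not the poster's actual setup):
-- Load the tab-delimited export from S3 into the temp table.
COPY temp_employees FROM 's3://my-bucket/exports/employees.tsv'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
DELIMITER '\t';
-- Remove target rows that are about to be replaced, keyed on the original MongoDB id.
DELETE FROM employees USING temp_employees WHERE employees.id = temp_employees.id;
-- Re-insert everything from the temp table, covering both new and updated records.
INSERT INTO employees SELECT * FROM temp_employees;
DROP TABLE temp_employees;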
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a Redshift table, in order to keep track of when the migration last ran successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it provides to the engine.
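A minimal illustration of that bookkeeping (the table and column names here are assumptions, not the poster's actual schema):
-- One row per plugin, holding the timestamp of its last successful run.
CREATE TABLE IF NOT EXISTS migration_state (
  plugin_name   VARCHAR(128) NOT NULL,
  last_migrated TIMESTAMP    NOT NULL
);
-- Read before a run so the plugin's cursor only fetches newer documents...
SELECT last_migrated FROM migration_state WHERE plugin_name = 'employees';
-- ...and update after the run completes successfully.
UPDATE migration_state SET last_migrated = GETDATE() WHERE plugin_name = 'employees';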
So in summary, each plugin/migration provides the following to the engine:
A cursor, which optionally uses the last migrated date passed to it from the engine, in order to ensure that only deltas are moved across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file.
A schema file: a SQL file containing the schema for the table in Redshift.
Redshift is a data warehousing product and MongoDB is a NoSQL DB. Clearly, they are not replacements for each other; they can co-exist and serve different purposes. Now, how to save and update records in both places:
You can move all MongoDB data to Redshift as a one-time activity.
Redshift is not a good fit for real-time writes. For near-real-time sync to Redshift, you should modify the program that writes into MongoDB.
Have that program also write to S3 locations; the S3-to-Redshift movement can then be done at a regular interval.
MongoDB being a document storage engine, Apache Solr and Elasticsearch can be considered possible replacements. But they do not support SQL-type querying capabilities; they basically use a different filtering mechanism. For example, with Solr you might need to use the DisMax query parser.
In the cloud, Amazon CloudSearch or Azure Search would be compelling options to try as well.
You can now easily use AWS DMS to migrate data to Redshift, and you can also replicate ongoing changes with it in near real time.

Tableau - How to query large data sources efficiently?

I am new to Tableau, am having performance issues, and need some help. I have a query that joins several large tables. I am using a live data connection to a MySQL DB.
The issue I am having is that Tableau is not applying the filter criteria before asking MySQL for the data. So it is essentially doing a SELECT * for my query and not applying the filter criteria to the WHERE clause. It pulls all the data from the MySQL DB back to Tableau, then throws away the unneeded data based on my filter criteria. My two main filter criteria are on account_id and a date range.
I can cleanly get a list of the accounts by doing a select from my account table to populate the filter list; I then need to know how to apply that selection when Tableau pulls the data for the main query from MySQL.
To apply a filter at the data source first, try using context filters.
Performance can also be improved by using extracts.
I would personally use an extract: go into your MySQL back-end and run the query as a CREATE TABLE extract1 AS statement, or whatever you want to call your data table.
When you import this table into Tableau, it will already have a SELECT * of your aggregate data in the workbook. From here your query efficiency will be increased tenfold.
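A rough MySQL sketch of that approach (the table names, columns, and filter below are illustrative, not from the original question):
-- Pre-filter and pre-join in MySQL so Tableau only ever sees the rows it needs.
CREATE TABLE extract1 AS
SELECT a.account_id, t.transaction_date, t.amount
FROM accounts a
JOIN transactions t ON t.account_id = a.account_id
WHERE t.transaction_date >= '2020-01-01';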
Otherwise, unfortunately, it's going to take a while to process your data: Tableau processing time plus the MySQL back-end query time.
Try the extracts...
I've been struggling with the very same thing. I have found that Tableau extracts aren't any faster than pulling directly from a SQL table. What I have done is create tables within SQL that already contain the filtered data, so the SELECT * returns only the needed data. The downside is that this takes up more space on the server, but that isn't a problem on my side.
For large data sets, Tableau recommends using an extract.
An extract creates a snapshot of the data you are connected to, and processing that data will be faster than over a live connection.
All the charts and visualizations will load faster, saving you time each time you go to the dashboard.
The filters you use to filter the data set will also work faster over an extract connection. But to get the latest data you have to refresh the extract, or schedule a refresh on the server (if you are uploading the report to a server).
There are multiple types of filters available in Tableau, and which to use depends on your application; context filters and global filters can be used to filter the whole data set.

MongoDB input in Pentaho

I have a time field in a MySQL table. Based on this time field, I need to import data from a MongoDB collection.
So in my Pentaho transformation I first have a Table Input step which gets the required date.
Next I have a MongoDB Input step. Here, how do I filter records based on the output from the previous step?
I saw that the MongoDB Input query accepts parameters only if they are environment variables or defined in another transformation, but it does not recognize a variable from the previous step.
How do I load the value from the previous step? Please help me; I am new to Pentaho and have been trying to find a solution for a week.
Thank you,
Deepthi
You've already answered your own question:
I saw that the MongoDB Input query accepts parameters only if they are environment variables or defined in another transformation, but it does not recognize a variable from the previous step. How do I load the value from the previous step?
If there is no way for a step to accept an input stream, you'll have to do exactly what you describe. In one transformation, access the MySQL table to get the time and store it in a variable. Then in another transformation access that variable in your MongoDB step.
Note that you will have to do this in two transformations to ensure that the variable is set by the time the MongoDB step runs.
Take a look at Optiq. It is bleeding edge, but it allows SQL access to MongoDB, so in theory you could use it in a Table Input step rather than a MongoDB Input step:
http://julianhyde.blogspot.co.uk/2013/06/efficient-sql-queries-on-mongodb.html
It can be achieved by passing the query as a parameter.
In the transformation settings, add a parameter (e.g. MONGO_QRY).
Example
In the MongoDB Input step's Query expression (JSON):
${MONGO_QRY}
It works fine for us; try that. If not, let us know.

How to extract data from a Mongo collection for data warehouse use

My company is starting to use Mongo, and we are thinking about the best way to extract data from MongoDB and send it to our data warehouse.
My question focuses on the extract part of the process. As I see it, the best way is to expose an API on the service that is built on top of Mongo, which the ETL process (invoked by a job from the data warehouse) will call with some specific query, probably querying for a time range (i.e. a start date and end date for every record).
Does that sound right, am I missing something, or is there maybe a better way?
Initially I was thinking about running mongoexport every X duration, but according to the documentation it does not seem great performance-wise.
Thanks in advance!
Give Pentaho Kettle a try.
https://anonymousbi.wordpress.com/2012/07/25/creating-pentaho-reports-from-mongodb/
I am using Alteryx Designer to extract from MongoDB with the dedicated connector and load the data into Tableau, with optional data prep in between.
Works pretty well!
Alteryx can write to most DBs, though...