MongoDB input in Pentaho

I have a time field in a MySQL table. Based on this time field, I need to import data from a MongoDB collection.
So in my Pentaho transformation I first have a Table Input step which gets the required date.
Next I have a MongoDB Input step. How do I filter records here based on the output from the previous step?
I saw that the MongoDB Input query accepts parameters only if they are environment variables or are defined in another transformation, but it does not recognize fields from the previous step.
How do I load from the previous step? Please help me; I am a fresher in Pentaho and have been trying to find a solution for a week.
Thank you,
Deepthi

You've already answered your own question:
I saw that the MongoDB Input query accepts parameters only if they are environment variables or are defined in another transformation, but it does not recognize fields from the previous step.
If there is no way for the step to accept an input stream, you'll have to do exactly what you describe: in one transformation, read the MySQL table to get the time and store it in a variable; then, in another transformation, reference that variable in your MongoDB Input step.
Note that this has to be split across two transformations to ensure that the variable is already set by the time the MongoDB step runs.
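For example (a sketch only; the variable name IMPORT_DATE and the field name time_field are assumptions, not names from the question), the Query expression (Json) of the MongoDB Input step in the second transformation could reference the variable set by the first transformation:
{ "time_field" : { "$gte" : "${IMPORT_DATE}" } }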

Take a look at Optiq. It is bleeding edge, but it allows SQL access to MongoDB, so in theory you could use it in a Table Input step rather than a MongoDB Input step:
http://julianhyde.blogspot.co.uk/2013/06/efficient-sql-queries-on-mongodb.html

It can be achieved by passing the query as a parameter.
In the transformation settings, add a parameter (e.g. MONGO_QRY).
Example
In the MongoDB Query expression (json) field, enter:
${MONGO_QRY}
It works fine for us; try that, and if it doesn't work, let us know.
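For instance (a sketch only; the field name and date value are assumptions), the calling job or transformation could set MONGO_QRY to a complete query document, and the step itself would then contain nothing but ${MONGO_QRY}:
{ "TimeStamp" : { "$gte" : "2020-01-01T00:00:00Z" } }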

Related

Extract the SQL query that was run in the Oracle connector from a DataStage job in order to add it to a file/table

I have a parallel DataStage job that uses a particular SQL query with some parameters. Once the job has run, I can see in the Director log the exact SQL query that was executed against the database.
My question is: is there any way to get this SQL query, with all parameters replaced, inside the Designer job, so that I can add the code to a column in a table (a metadata column that will contain the exact query used for that particular run)? In my job I can have a Transformer that puts the query from the Oracle connector into a derivation for a column in the target table.
Thank you!
You can retrieve the query from the log using log-reading functions (in a server routine) or the options of the dsjob command. Design the writing part however you want.
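As a rough sketch of the dsjob route (the project name, job name, and event id are placeholders; the exact options available can vary by version):
# List the log entries of the last run, then dump the full text of one entry,
# which contains the SQL generated by the Oracle connector.
dsjob -logsum MyProject MyJob
dsjob -logdetail MyProject MyJob 42
The retrieved text can then be written to the metadata column by whatever mechanism suits the job design.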

Azure Data Factory - querying Cosmos DB using dynamic timestamps

I want to create and maintain a snapshot of a collection in Cosmos DB.
Periodically, I want to retrieve only the delta (new or modified documents) from Cosmos and write it to the snapshot, which will be stored in an Azure Data Explorer cluster.
I wish to get the delta using the _ts member of the documents. In other words, I will fetch only records whose _ts falls within some range.
The range will be the range of a time window, which I get using a tumbling window trigger in the data factory.
The issue is that if I print the dynamic timestamps that I create in the query and hard-code them into the query, it works; but if I let the query generate them, I don't get any results.
For example, I am using fixed values to simulate the window range of the trigger. I use a query to create the timestamps in Unix time, and I can see that the timestamps it creates are correct. If I run my query with those hardcoded timestamps, I get results; but if I run the query using the code that creates those timestamps, I get no results.
This is the code to create the timestamps:
select
DateTimeToTimestamp('#{formatDateTime('2020-05-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000,
DateTimeToTimestamp('#{formatDateTime('2020-08-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000
Does anyone have a clue as to what might be the issue?
Any other way to achieve this is also welcome.
Thanks
EDIT: I managed to work around this by taking the other, simpler option:
where TimestampToDateTime(c._ts*1000)> "#{formatDateTime(pipeline().parameters.windowStart,'yyyy-MM-ddTHH:mm:ss.fffffffZ')}"
We are glad that you resolved this problem:
You managed to work around this by taking the other, simpler option:
where TimestampToDateTime(c._ts*1000)> "#{formatDateTime(pipeline().parameters.windowStart,'yyyy-MM-ddTHH:mm:ss.fffffffZ')}"
I think the error in the first option is most likely caused by a data-type mismatch between c._ts and DateTimeToTimestamp('#{formatDateTime('2020-05-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000.
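In other words (a sketch only, not the exact query from the question), c._ts is stored as a number of seconds since the epoch, so the comparison has to be numeric on both sides; with the question's window boundaries expressed as Unix seconds, the working form would look like:
SELECT * FROM c
WHERE c._ts >= 1589976000 AND c._ts < 1597924800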

Unix Shell Script to Remove MongoDB Documents Based on Many Inputs

I am in a position where I need to write a Unix shell script (.sh) to remove MongoDB collection documents.
I know how to remove based on a condition like the one below, and it works for me.
eval 'db.Collection.remove({TimeStamp:{$lte: "'$var'"}})'
But I need to change the remove statement based on a new parameter, let's say PID, which will receive a bunch of inputs (many PIDs).
I don't need to remove documents based on the TimeStamp field; instead, the condition has to change as described above.
I went through many forums, but I am not able to find the solution.
Please help me resolve this in my Unix shell script.
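A minimal sketch of one way to do this (the field name PID, the database name mydb, and the collection name Collection are assumptions based on the question; the PIDs are passed as script arguments and turned into a JSON array for an $in match):
#!/bin/sh
# Usage: ./remove_by_pid.sh PID1 PID2 PID3 ...
# Build a JSON array such as ["PID1","PID2","PID3"] from the script arguments.
pid_list=""
for pid in "$@"; do
  pid_list="$pid_list\"$pid\","
done
pid_list="[${pid_list%,}]"

# Remove every document whose PID is in the list (db and collection names are assumptions).
mongo mydb --eval 'db.Collection.remove({PID: {$in: '"$pid_list"'}})'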

Assigning an SQL result to a Job Parameter in DataStage

I just started using DataStage (version 11.5) and I am trying to assign the result of a simple SQL query (select max(date_col) from Table) to a job parameter so that I can use it as part of a file produced by the job.
Can anyone point out a simple approach to this? I am rather lost on how to get SQL query results into parameter values.
Thanks in advance.
There are some options to do this. The one I recommend is (a rough sketch of the Execute Command part follows below):
Write the result of your query into a sequential file.
Use an Execute Command stage (in a sequence) to read the file.
Use its output in one of the following Job Activity stages (as a job parameter).
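For the Execute Command activity, the command could be as simple as the following (the file path is a placeholder for wherever the parallel job wrote the max(date_col) result; the trailing newline is stripped so the value can be used cleanly as a parameter):
# Read the single value written by the parallel job and strip the newline.
cat /tmp/max_date.txt | tr -d '\n'
The activity's output can then be referenced in the downstream Job Activity's parameter value, typically via the activity's $CommandOutput property (the exact expression depends on the activity name and DataStage version).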
An alternative could be to use Parameter Sets with value files. These value files are real files in the OS and their structure is simple, so they could be written by the DataStage job itself. In this case, though, the value cannot be used for conditions in the sequence.

How to extract data from a Mongo collection for data warehouse use

My company is starting to use Mongo and we are beginning to think about the best way to extract data from MongoDB and send it to our data warehouse.
My question focuses on the extract part of the process. As I see it, the best way is to expose an API on the service that is built on top of Mongo, which the ETL process (invoked by a job from the data warehouse) will call with some specific query, probably for a time range (i.e. a start date and end date for every record).
Does that sound right, or am I missing something, or is there maybe a better way?
Initially I was thinking about running mongoexport every X duration, but according to the documentation that does not seem good performance-wise.
Thanks in advance!
Give Pentaho Kettle a try.
https://anonymousbi.wordpress.com/2012/07/25/creating-pentaho-reports-from-mongodb/
I am using Alteryx Designer to extract from MongoDB with the dedicated connector and to load into Tableau, with optional data prep in between.
Works pretty well!
Alteryx can write to most DBs, though...
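For reference, the mongoexport route mentioned in the question would look roughly like the following (the database, collection, field name, and date range are placeholders, not values from the question); whether it performs well enough depends on collection size and indexing:
# Export only documents whose updated_at falls inside the extraction window.
mongoexport --db mydb --collection events \
  --query '{"updated_at": {"$gte": {"$date": "2024-01-01T00:00:00Z"}, "$lt": {"$date": "2024-01-02T00:00:00Z"}}}' \
  --out events_delta.json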