The following are areas where task scheduling in MarkLogic can be used:
1. Loading content. For example, periodically checking for new content from an external data source, such as a web site or web service.
2. Synchronizing content. For example, when MarkLogic is used as a metadata repository, you might want to periodically check for changed data.
3. Delivering batches of content. For example, initiating an RSS feed hourly or daily.
4. Delivering aggregated alerts, either hourly or daily.
5. Delivering reports, either daily, weekly, or monthly.
6. Polling for the completion of an asynchronous process, such as the creation of a PDF file.
My requirement is to schedule a task for bulk loading data from the local file system into a MarkLogic database using any of the data loading options available in MarkLogic, such as:
1. MLCP
2. XQuery
3. REST API
4. Java API
5. WebDAV
So, is there any option to execute this programmatically? I prefer MLCP since I need to perform a bulk load of data from the local file system.
Similar to your question at Execute MLCP Content Load Command as a schedule task in Marklogic, I would start with a tool like Apache Camel. There are other options - Mule, Spring Integration, and plenty of commercial/graphical ETL tools - but I've found Camel to be very easy to get started with, and you can use the mlcp Camel component at https://github.com/rjrudin/ml-camel-mlcp.
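If you want something lighter-weight than a Camel route, another option is to wrap the mlcp command line in a small scheduled script. Below is a minimal sketch in Python (not the Camel approach recommended above); the host, port, credentials, and input path are placeholders, and it assumes mlcp.sh is on the PATH and the third-party schedule package is installed.

```python
import subprocess
import time

import schedule  # third-party scheduler: pip install schedule

# Placeholder connection details and input path; replace with your own.
MLCP_ARGS = [
    "mlcp.sh", "import",
    "-host", "localhost",
    "-port", "8000",
    "-username", "admin",
    "-password", "admin",
    "-input_file_path", "/data/incoming",
    "-input_file_type", "documents",
]

def run_bulk_load():
    """Invoke mlcp and raise if the load fails."""
    subprocess.run(MLCP_ARGS, check=True)

# Run the bulk load once an hour.
schedule.every().hour.do(run_bulk_load)

while True:
    schedule.run_pending()
    time.sleep(60)
```

In production you would more likely let cron, Windows Task Scheduler, or an integration framework like Camel trigger the load, but the idea is the same: mlcp is just a command you can invoke programmatically.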
Related
I use Metabase to generate dashboards and reports. I need to generate files using the scheduler and, instead of sending them by e-mail, make them available over SFTP. Do you have any suggestions on how to automate this process?
I use PostgreSQL as a database source.
I can also try other open source tools if needed.
I haven't found much information on how to do this yet.
I have a large Mongo database (5M documents). I edit the database from an offline application, so I store the database on my local computer. However, I want to be able to maintain an online copy of the database, so that my website can access it.
How can I update the online copy regularly, without having to upload multiple GBs of data every time?
Is there some way to "track changes" and upload only the diff, like in Git?
Following up on my comment:
Can't you store the commands you used on your offline DB, and then apply them on the online DB, through a script running over SSH for instance? Or even better, upload a file with all the commands you ran on your offline base to your server and then execute them with a cron job or a bash script? (The only requirement would be for your bases to have the same starting point, and same state, when you execute the script.)
I would recommend storing all the queries you execute on your offline base. To do this you have many options; the one I can think of is the following: you can set the profiling level to log all your queries.
(Here is a more detailed thread on the matter: MongoDB logging all queries)
Then you would have to extract them somehow (grep?), or store them directly in another file on the fly, as they are executed.
For uploading the script, it depends on what you would like to use, but I suppose you would need to do it during low-usage hours, and you could automate the task with a cron job and an SSH tunnel.
I guess it all depends on your constraints (security, downtime, etc.).
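As a concrete illustration of the profiling approach mentioned above, here is a minimal sketch using pymongo; the connection string and database name are placeholders.

```python
from pymongo import MongoClient

# Placeholder connection string and database name; replace with your own.
client = MongoClient("mongodb://localhost:27017")
db = client["mydatabase"]

# Profiling level 2 records every operation in the db.system.profile collection.
db.command("profile", 2)

# ... run your offline edits here ...

# Read back the recorded write operations so they can be replayed on the online copy.
for entry in db["system.profile"].find({"op": {"$in": ["insert", "update", "remove"]}}):
    print(entry["op"], entry.get("ns"), entry.get("command"))
```

Note that system.profile is a capped collection (about 1 MB by default), so for a long editing session you would need to enlarge it or copy the entries out on the fly, as suggested above.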
Currently we're monitoring our SQL Servers running on the Windows platform via MS SQL Server Reporting Services (SSRS) using shared data sources. To clarify what I mean: we don't store data on a centralized server to monitor the 500+ target servers. We keep monitoring data on the local SQL database servers and use shared data sources in SSRS to create dashboards.
Now our firm is encouraging us to use Grafana for dashboards, since they have purchased, or are running, some Grafana server licensing. What I know is that a Grafana instance can be given to us to monitor SQL Servers as described above.
My question is: how would Grafana dynamically connect to those 500-plus servers? I see that it creates a data source once, but how will I change or create multiple data sources when I have around 1,000 servers to monitor?
Any suggestions or guidance would be appreciated.
You may have to code a bit and use data source provisioning and/or the Grafana data source API for it to pick up the new data sources.
If you can set up a system (user data / init script / IaC) where this API is called every time a new server comes up, then you will be able to maintain the data sources without manual maintenance.
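For example, creating a data source through the HTTP API is a single POST to /api/datasources. Here is a minimal sketch in Python; the Grafana URL, API key, and SQL Server connection details are placeholders, and the payload uses the standard mssql data source type.

```python
import requests

# Placeholders: your Grafana URL and an admin-level API key.
GRAFANA_URL = "https://grafana.example.com"
API_KEY = "REPLACE_ME"

def create_mssql_datasource(name, host, database, user, password):
    """Register one SQL Server instance as a Grafana data source."""
    payload = {
        "name": name,
        "type": "mssql",
        "access": "proxy",
        "url": host,            # e.g. "sqlserver01:1433"
        "database": database,
        "user": user,
        "secureJsonData": {"password": password},
    }
    resp = requests.post(
        f"{GRAFANA_URL}/api/datasources",
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    return resp.json()

# Call this from your provisioning/init script whenever a new server comes up.
create_mssql_datasource("sqlserver01", "sqlserver01:1433", "monitoring",
                        "grafana_reader", "secret")
```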
I'm building a customer management system using Rails that requires CSV files containing customer information to be imported into, and diffed with, a Postgres database. I'm hosting the application on Heroku. I moved the import processing to the background with Sidekiq, but I need advice on where to upload the file in the first place for importing. Is hosting the file on S3 really the best solution, or is there a simpler solution that doesn't use a third-party storage service? The application will be used daily by up to 10 employees, and the largest CSV file being uploaded is around 100,000 rows.
Thanks.
Yes, I do think S3 is the best solution.
We faced the same problem at Storemapper (we use Resque instead of Sidekiq, but that doesn't matter here). The limiting factor is the Heroku request timeout: you only have 30s to finish your upload to Heroku, which puts a hard limit on how big your CSV can be. This is where S3 comes in. Basically, what we do is:
1. The user uploads the CSV directly to S3 via JavaScript, bypassing our app server on Heroku.
2. Once the upload completes, the JavaScript makes a request to the app server that launches a background worker, telling the worker where the file is on S3.
3. The worker downloads the CSV from S3, then processes it as necessary.
I found the carrierwave_direct gem to be very helpful for steps 1 and 2. For step 3, I use the smarter_csv gem. Check out our complete story here:
https://tylertringas.com/very-large-csv-import-in-rails-on-heroku/
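The stack in that answer is Ruby (carrierwave_direct + smarter_csv), but the worker step itself is language-agnostic. As a rough illustration only, here is what step 3 looks like as a sketch in Python with boto3; the bucket name, key, and per-row handling are hypothetical.

```python
import csv
import io

import boto3  # AWS SDK for Python

def process_csv_from_s3(bucket, key):
    """Download an uploaded CSV from S3 and iterate over its rows."""
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    text = obj["Body"].read().decode("utf-8")  # ~100k rows fits comfortably in memory
    for row in csv.DictReader(io.StringIO(text)):
        # Hypothetical per-row handling: diff/upsert the customer record here.
        print(row)

# The background job receives the S3 bucket/key from the post-upload callback.
process_csv_from_s3("my-upload-bucket", "uploads/customers.csv")
```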
I would like to export the list of jobs that are created as tasks under Job Conductor in TAC, along with their configurations. Is this possible? If yes, please advise. I am using the Enterprise edition of Talend v5.6.
There is an API to query the TAC called MetaServletCaller. Using this API, you can send a command to get the list of the tasks deployed in Job Conductor.
The API can be called from a URL in a browser, or by calling the MetaServletCaller.sh (or .bat) script from your Talend installation.
The command to get this list is listTasks.
Here is a tutorial on how to do that: http://edwardost.github.io/talend/di/2015/05/28/Using-the-TAC-API/
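For reference, the MetaServlet can also be called over HTTP by base64-encoding the JSON command into the URL, as the tutorial above describes. Here is a minimal sketch in Python; the TAC host, port, and credentials are placeholders.

```python
import base64
import json

import requests

# Placeholders: your TAC host/port and an admin account.
TAC_URL = "http://tac-host:8080/org.talend.administrator/metaServlet"

command = {
    "actionName": "listTasks",
    "authUser": "admin@company.com",
    "authPass": "admin",
}

# The MetaServlet expects the JSON command base64-encoded in the query string.
encoded = base64.b64encode(json.dumps(command).encode("utf-8")).decode("ascii")
response = requests.get(f"{TAC_URL}?{encoded}")
print(response.json())  # tasks deployed in Job Conductor, with their configurations
```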
All TAC configuration is stored in DB tables. Just connect to this database (you can get its name through TAC, in the Configuration > Database menu) and have a look at the "executiontask" table; it contains all jobs deployed in Job Conductor, with context, job version, etc.