Retrieve Large XML file from HTTP and copy to Azure SQL - azure-data-factory

I'm currently using an API call to download a 600 MB XML file and copy it to Azure SQL using the Copy activity. Because the file is so large, no run has ever succeeded; the debug run just keeps running and never completes.
I have tried splitting the source file using Data Flow, Lookup, and Get Metadata activities; however, these activities don't support an HTTP source. Is there a way to copy a large XML file from an HTTP request to Azure SQL?

Related

Executing Batch service in Azure Data Factory using a Python script

Hi, I've been trying to execute a custom activity in ADF that receives a CSV file from a container (A); after further transformation of the data set, the transformed DataFrame is stored as another CSV file in the same container (A).
I've written the transformation logic in Python and have it stored in the same container (A).
The error arises here: when I execute the pipeline, it returns the error *can't find the specified file*.
There's nothing wrong with the connections. Is something wrong with the Batch account or pools?
Can anyone tell me where to place the Python script?
Install Azure Batch Explorer and make sure to choose the proper virtual machine configuration (dsvm-windows), which ensures Python is already in place on the virtual machine where your code is run.
This video explains the steps:
https://youtu.be/_3_eiHX3RKE

Move file after transaction commit

I just started using Spring Batch and I don't know how I can implement my business need.
The behavior is quite simple: I have a directory where files are saved. My batch should detect those files, import them into my database, and move each file to a backup directory (or to an error directory if the data can't be saved).
So I create chunks of 1 file. The reader retrieves them and the processor imports the data.
I read that Spring Batch creates a global transaction for the whole chunk, and that only the ChunkListener is called outside the transaction. That seems to be OK, but its input parameter is a ChunkContext. How can I retrieve the file handled in the chunk? I don't see where it is stored in the ChunkContext.
I need to be sure the DB accepts the insertions before choosing where the file must be moved. That's why I need to do that after the commit.
Here is how you can proceed:
Create a service (based on a file system watching API, or something like Spring Integration directory polling) that launches a batch job for each new file.
The batch job can use a chunk-oriented step to read data and write it to the database. In that job, you can use a job/step execution listener or a separate step to move files to the backup/error directory according to the success or failure of the previous step.
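As a minimal sketch of that last idea (assuming the file path is passed to the job as an inputFile job parameter, and using made-up directory names), a JobExecutionListener could move the file once the job, and therefore the chunk transaction, has finished:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

// Moves the imported file to a backup or error directory after the job finishes.
public class MoveFileListener implements JobExecutionListener {

    // Target directories are assumptions for this sketch.
    private static final Path BACKUP_DIR = Paths.get("/data/backup");
    private static final Path ERROR_DIR = Paths.get("/data/error");

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // Nothing to do before the job starts.
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // The launching service passes the file location as a job parameter, e.g.
        // new JobParametersBuilder().addString("inputFile", path).toJobParameters()
        Path inputFile = Paths.get(jobExecution.getJobParameters().getString("inputFile"));

        // COMPLETED means every chunk committed; anything else goes to the error directory.
        Path targetDir = jobExecution.getStatus() == BatchStatus.COMPLETED ? BACKUP_DIR : ERROR_DIR;
        try {
            Files.createDirectories(targetDir);
            Files.move(inputFile, targetDir.resolve(inputFile.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new IllegalStateException("Could not move " + inputFile + " to " + targetDir, e);
        }
    }
}
```

The listener would then be registered on the job (for example via the job builder's listener(...) method), so it only runs after the commit or rollback has happened.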

Some questions about Google Data Fusion

I am discovering the tool and I have some questions:
- What exactly do you mean by the type File in (Source, Sink)?
- Is it also possible to send the result of the pipeline directly to an FTP server?
I checked the documentation, but I did not find this information.
Thank you.
Short answer: File refers to the filesystem where the pipelines run. In the Data Fusion context, if you are using a File sink, the contents will be written to HDFS on the Dataproc cluster.
Data Fusion has an SFTP Put action that can be used to write to SFTP. Here is a simple pipeline showing how to write to SFTP from GCS.
Step 1: GCS source to File sink. This writes the contents of GCS to HDFS on the Dataproc cluster when the pipeline is run.
Step 2: SFTP Put action that takes the output of the File sink and uploads it to SFTP.
You need to configure the output path of the File sink to be the same as the source path in the SFTP Put action.

CSV Importing With Rails, Postgres, and Sidekiq

I'm building a customer management system using Rails that requires CSV files containing customer information to be imported into, and diffed with, a Postgres database. I'm hosting the application on Heroku. I moved the database import to a background job with Sidekiq, but I need advice on where to upload the file in the first place for importing. Is hosting the file on S3 really the best solution, or is there a simpler solution that doesn't use a third-party storage service? The application will be used daily by up to 10 employees, and the largest CSV file being uploaded is around 100,000 rows.
Thanks.
Yes, I do think S3 is the best solution.
We faced the same problem at Storemapper (we use Resque instead of Sidekiq, but that makes no difference here). The limiting factor is the Heroku request timeout: you only have 30s to finish your upload through Heroku, which puts a hard limit on how big your CSV can be. This is where S3 comes in. Basically, what we do is:
The user uploads the CSV directly to S3 via JavaScript, bypassing our app server on Heroku.
Once the upload completes, the JavaScript makes a request to the app server, which launches a background worker and tells it where the file is on S3.
The worker downloads the CSV from S3, then processes it as necessary.
I found the carrierwave_direct gem very helpful for steps 1 and 2. For step 3, I use the smarter_csv gem. Check out our complete story here:
https://tylertringas.com/very-large-csv-import-in-rails-on-heroku/

How to Schedule Tasks using MarkLogic

These are the areas where scheduled tasks in MarkLogic are typically used:
1. Loading content. For example, periodically checking for new content from an external data source, such as a web site, web service, etc.
2. Synchronizing content. For example, when MarkLogic is used as a metadata repository, you might want to periodically check for changed data.
3. Delivering batches of content. For example, initiating an RSS feed, hourly or daily.
4. Delivering aggregated alerts, either hourly or daily.
5. Delivering reports, either daily, weekly, or monthly.
6. Polling for the completion of an asynchronous process, such as the creation of a PDF file.
My requirement is to schedule a task for bulk loading data from the local file system into a MarkLogic database using any of the data loading options available in MarkLogic, such as:
1. MLCP
2. XQuery
3. REST API
4. Java API
5. WebDAV
So is there any option to execute this programmatically? I prefer MLCP, since I need to perform a bulk load of data from the local file system.
Similar to your question at Execute MLCP Content Load Command as a scheduled task in MarkLogic, I would start with a tool like Apache Camel. There are other options (Mule, Spring Integration, and plenty of commercial/graphical ETL tools), but I've found Camel to be very easy to get started with, and you can utilize the mlcp Camel component at https://github.com/rjrudin/ml-camel-mlcp.
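For illustration only, here is a rough sketch of a standalone Camel application that polls a local directory and bulk-loads each new file by shelling out to mlcp with the camel-exec component (a simpler stand-in for the ml-camel-mlcp component linked above). The directory, mlcp path, host, port, and credentials are all assumed values:

```java
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

// Requires camel-core, camel-file and camel-exec (Camel 3.x) on the classpath,
// plus an mlcp installation on the same machine.
public class MlcpLoader {

    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Poll /data/inbox every 30 seconds; loaded files are moved to a .done subfolder.
                from("file:/data/inbox?move=.done&delay=30000")
                    .log("Loading ${file:name} into MarkLogic via mlcp")
                    // Build the mlcp import arguments for the current file.
                    .setHeader("CamelExecCommandArgs", simple(
                        "import -host localhost -port 8000 "
                      + "-username admin -password admin "
                      + "-input_file_path ${file:absolute.path}"))
                    .to("exec:/opt/mlcp/bin/mlcp.sh")
                    .log("mlcp exited with code ${header.CamelExecExitValue}");
            }
        });
        context.start();
        Thread.sleep(Long.MAX_VALUE); // keep the JVM alive so the route keeps polling
    }
}
```

The polling delay could be replaced with a real scheduler (for example camel-quartz), and the exec step could be swapped for the ml-camel-mlcp component once its options are configured.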