Slow performance when copying from HTTP source to blob sink - azure-data-factory

I use a copy activity to call an HTTP API and store the JSON response as a file in Azure Blob storage. The copy activity is executed in a ForEach loop, and each activity run takes 16 seconds, but when I look at the run details it says the copy duration is only 3 seconds. Why, then, does the activity take 16 seconds to complete? The source dataset is an Http File with an HttpServer linked service, and the sink dataset is a blob storage JSON file. Both the source and sink datasets are configured with Binary Copy, and it's a GET request to an HTTPS URL with anonymous authentication.
I would like to speed up this activity since it is run multiple times inside the ForEach loop. Is there some way to improve the performance?

There are always a few seconds of overhead when starting an activity. Also consider that the HTTP server itself may be responsible for some of the seconds you are seeing.
If you are using a ForEach loop and want to speed up the process, you can uncheck the Sequential checkbox in the Settings tab of the ForEach activity so that the iterations run in parallel.
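For reference, a minimal sketch of how those settings appear in the pipeline JSON (the parameter name, batch size, and inner activity here are placeholders, not taken from your pipeline):

```json
{
  "name": "CopyPerUrl",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 20,
    "items": {
      "value": "@pipeline().parameters.urls",
      "type": "Expression"
    },
    "activities": [
      { "name": "CopyFromHttpToBlob", "type": "Copy" }
    ]
  }
}
```

With isSequential set to false, batchCount caps how many iterations run in parallel.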
Hope this helped!

Related

KSQL query adds too much delay to my request

I have a system that saves (X,Y) coordinates to a SQL table, and an endpoint that, when called, returns those (X,Y) coordinates.
However, my system can take up to 30 minutes to process and store an (X,Y) coordinate in the SQL table, so I am using KSQL to get at that data faster.
I have added the KSQL call to the backend endpoint I mentioned. The problem is that this call adds 6 extra seconds to my request.
My endpoint includes a query that looks like this:
SELECT feature_a, feature_b FROM ksql_table;
The ksql_table has already been pre-processed by two previous streams. In my understanding, this query should be straightforward and cheap to compute, yet it is taking 6 seconds.
When a KSQL query runs, it instantiates a Kafka Streams application that will build the table state requested. This is going to have a "spin-up" time, which doesn't matter when it's the stream processing application itself (since once it's running it stays running). However, if you're repeatedly calling it via the REST API as part of your application's flow then you are going to see this delay.
I think a better way to work with the stream of data in Kafka would be to use Kafka Streams to build and persist the required state in a KTable, and then serve it through Interactive Queries and a custom API that your nodejs application can call, such as described here. Further examples are here and here.
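To make that concrete, here is a minimal sketch of the KTable plus interactive-query idea in Scala (the topic name, store name, and key/value types are assumptions, and the exact store() signature varies between Kafka Streams versions):

```scala
import java.util.Properties

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.state.{KeyValueStore, QueryableStoreTypes}
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object CoordinatesStateService extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "coordinates-state-service")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  // Materialize the pre-processed topic into a local, queryable KTable once.
  // The Streams app keeps running, so there is no per-request "spin-up" cost.
  val builder = new StreamsBuilder()
  builder.table[String, String](
    "coordinates",
    Materialized.as[String, String, KeyValueStore[Bytes, Array[Byte]]]("coordinates-store"))

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())

  // Once the instance is RUNNING, an HTTP endpoint (which the nodejs app would call)
  // can answer lookups straight from the local state store.
  def lookup(key: String): Option[String] = {
    val store = streams.store("coordinates-store",
      QueryableStoreTypes.keyValueStore[String, String]())
    Option(store.get(key))
  }
}
```

Each request then costs roughly a local store read instead of spinning up a new KSQL query.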
There is also a nodejs Kafka Streams library, which I have not used but might be worth checking out.

Graceful custom activity timeout in data factory

I'm looking to impose a timeout on custom activities in data factory via the policy.timeout property in the activity json.
However, I haven't seen any documentation that explains how the timeout operates on Azure Batch. I assume the Batch task is forcibly terminated somehow.
But is the task -> custom activity informed so it can tidy up?
The reason I ask is that I could be mid-copy to Data Lake Store, and I neither want to let it run indefinitely nor stop it without some sort of clean-up (I can't see a way of doing transactions as such using the Data Lake Store SDK).
I'm considering putting the timeout within the custom activity, but it would be a shame to have timeouts defined at 2 different levels (I'd probably still want the overall timeout).
I feel your pain.
ADF simply terminates the activity when its own timeout is reached, regardless of what state the invoked service is in.
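For reference, that timeout is the one set in the activity's policy block of the pipeline JSON; a sketch for a v1-style custom activity (the names and values here are illustrative only):

```json
{
  "name": "MyCustomActivity",
  "type": "DotNetActivity",
  "policy": {
    "timeout": "01:00:00",
    "concurrency": 1,
    "retry": 2
  },
  "linkedServiceName": "AzureBatchLinkedService"
}
```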
I have the same issue with U-SQL processing calls. It takes a lot of proactive monitoring via PowerShell to ensure Data Lake or Batch jobs have enough compute to complete, given naturally increasing data volumes, before the ADF timeout kill occurs.
I'm not aware of any graceful way for ADF to handle this because it would differ for each activity type.
Time to create another feedback article for Microsoft!

Async POST requests on a REST API by multiple users, waiting for them all to complete, in JMeter

I'm submitting multiple POST requests to a REST API using the same input JSON. That is, many users (e.g. 10000) submit the same POST with the same JSON to measure the performance of the POST request, but I also need to capture the completion result of each submission using a GET method and measure the performance of the GET as well. This is an asynchronous process, as follows:
POST submit
generates an ID1
wait for processing
in next step another ID2 will be generated
wait for processing
in next step another ID3 will be generated
wait for processing
final step is completion.
So I need to create a JMeter test plan that can issue these asynchronous POST submissions from multiple users, wait for them to be processed, and finally capture the completion of each submission. I need to generate a report, in graph and table format, that shows me latency and throughput. Sorry for my lengthy question. Thanks, Santana.
Based on your clarification in the comment, it looks to me like you have a fairly straightforward script, which could be expressed like this:
Thread Group
HTTP Sampler 1 (POST)
Post-processor: save ID1 as a variable ${ID1}
Timer: wait for next step to be available
HTTP Sampler 2 (GET, uses ${ID1})
Post-processor: save ID2 as a variable ${ID2}
Timer: wait for next step to be available
HTTP Sampler 3 (GET, uses ${ID1} and ${ID2})
Post-Processor: extract completion status
(Optional) Assertion: check completion status
I cannot say which Timer or which Post-processor specifically to use; that depends on the specific requests you have.
You don't need to worry about multiple users from JMeter's perspective (the variables are always independent per user), but of course you need to make sure that multiple initial POSTs do not conflict with each other from the application's perspective (i.e. each POST should process independent data).
Latency is part of the standard set of fields saved to the results file. But as JMeter's own documentation states, latency measurement in JMeter is a bit limited:
JMeter measures the latency from just before sending the request to just after the first response has been received. Thus the time includes all the processing needed to assemble the request as well as assembling the first part of the response, which in general will be longer than one byte. Protocol analysers (such as Wireshark) measure the time when bytes are actually sent/received over the interface. The JMeter time should be closer to that which is experienced by a browser or other application client.
Throughput is available in some UI listeners, but can also be calculated in the same way as JMeter calculates it:
Throughput = (number of requests) / (total time)
using raw data in the file.
If you are planning to run 100-200 users (or for debugging purposes), use UI listeners; with higher load, use JMeter's non-UI mode and save results to CSV, which you can analyze later. I'd say get your test to pass in UI mode first with 100 users, and then set up a more robust multi-machine 10K-user test.

Retry failed writing operations without delaying other steps in Spring Batch application

I am maintaining a legacy application written using Spring Batch and need to tweak it to never lose data.
I have to read from various web services (one for each step) and then write to a remote database. Things go bad when the connection to the DB drops, because all items read from the web service are discarded (the same item can't be read twice) and the data is lost because it cannot be written.
I need to set up Spring Batch to keep the already-read data for a step and retry the write operation the next time the step runs. The same step must not read more data until the write operation has completed successfully.
When it cannot write, the step should keep the read data and pass execution on to the next step; later, when it is time for the failed step to run again, it should not read another item but retry the failed write instead.
The batch application should run in an infinite loop, and each step should gather data from a different source. Failed write operations should be skipped for the moment (keeping the read data) so as not to delay the other steps, but should resume from the write operation the next time the step is called.
I have been researching various web sources besides the official docs, but Spring Batch doesn't have the most intuitive documentation I have come across.
Can this be achieved? If yes, how?
You can write the data you need to persist, in case the job fails, to the batch step's ExecutionContext, and then restart the job with this data:
Step executions are represented by objects of the StepExecution class. Each execution contains a reference to its corresponding step and JobExecution, and transaction related data such as commit and rollback count and start and end times. Additionally, each step execution will contain an ExecutionContext, which contains any data a developer needs persisted across batch runs, such as statistics or state information needed to restart.
More from: http://static.springsource.org/spring-batch/reference/html/domain.html#domainStepExecution
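As a minimal sketch of that idea (all names here are hypothetical, and this assumes the Spring Batch 4-style ItemWriter signature), a writer can stash items it could not write into the step's ExecutionContext so a restarted run retries them instead of reading new data:

```scala
import java.util

import org.springframework.batch.core.{ExitStatus, StepExecution, StepExecutionListener}
import org.springframework.batch.item.ItemWriter

// Delegates to the real DB writer; on failure it keeps the unwritten items in the
// ExecutionContext, which the job repository persists, so a restart can retry them.
class RetainingWriter(dbWriter: ItemWriter[String]) extends ItemWriter[String] with StepExecutionListener {

  private var stepExecution: StepExecution = _

  override def beforeStep(stepExecution: StepExecution): Unit = {
    this.stepExecution = stepExecution
    // On a restart, "pendingItems" from the previous failed run is still available here
    // and can be retried before anything new is read from the web service.
  }

  override def write(items: util.List[_ <: String]): Unit = {
    try {
      dbWriter.write(items)
    } catch {
      case e: Exception =>
        stepExecution.getExecutionContext.put("pendingItems", new util.ArrayList[String](items))
        throw e // fail the step so it can be restarted and retried later
    }
  }

  override def afterStep(stepExecution: StepExecution): ExitStatus = stepExecution.getExitStatus
}
```

Register the same object as both the step's writer and a listener so it receives the StepExecution.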
I do not know if this will be OK for you, but here are my thoughts on your configuration.
Since you have two remote sources that are open to failure, let us partition the overall system into two jobs (not two steps):
JOB A
Step 1: Tasklet
Check a shared folder for files. If files exist, do not proceed to the next step. This will make more sense in the description of JOB B; a minimal tasklet sketch appears at the end of this answer.
Step 2: Webservice to files
Read from your web service and write the results to flat files in the shared folder. Since you would be using flat files for the output, you avoid the situation where "all items read from the web service are discarded and the data is lost because it cannot be written."
Use Quartz or equivalent for the scheduling of this job.
JOB B
Poll the shared folder for generated files and launch the job with the file (for example, its location as a job parameter). The Spring Integration project may help with this polling.
Step 1:
Read from the file, write the items to the remote DB, and move/delete the file if the DB write is successful.
No scheduling is needed since job launching originates from the polled files.
Sample Execution
Time 0: No file in the shared folder
Time 1: Read from web service and write to shared folder
Time 2: Job B's file polling occurs and tries to write to the DB.
If successful, the system continues to execute.
If not, when Job A next runs at its scheduled time, it will skip reading from the web service since files still exist in the shared folder. It keeps skipping until Job B consumes the files.
I did not want to go into implementation specifics, but Spring Batch can handle all of these situations. Hope this helps.
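As a rough illustration of JOB A, step 1, a gatekeeper tasklet could look something like this (the folder, exit-status code, and flow wiring are assumptions on my part):

```scala
import java.io.File

import org.springframework.batch.core.{ExitStatus, StepContribution}
import org.springframework.batch.core.scope.context.ChunkContext
import org.springframework.batch.core.step.tasklet.Tasklet
import org.springframework.batch.repeat.RepeatStatus

// If unconsumed files are still waiting in the shared folder, flag the execution so a
// conditional transition in the job definition (e.g. on("FILES_PENDING").end()) can
// skip the web-service read until Job B has drained the folder.
class SharedFolderGuardTasklet(sharedFolder: File) extends Tasklet {
  override def execute(contribution: StepContribution, chunkContext: ChunkContext): RepeatStatus = {
    val pendingFiles = Option(sharedFolder.listFiles()).getOrElse(Array.empty[File])
    if (pendingFiles.nonEmpty) {
      contribution.setExitStatus(new ExitStatus("FILES_PENDING"))
    }
    RepeatStatus.FINISHED
  }
}
```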

Multiple actors writing to the same file + rotation

I have written a very simple webserver in Scala (based on Actors). Its purpose is to log events from our frontend server (such as a user clicking a button or a page being loaded). The file will need to be rotated every 64-100 MB or so, and it will be sent to S3 for later analysis with Hadoop. The amount of traffic will be about 50-100 calls/s.
Some questions that pop into my mind:
How do I make sure that all actors can write to one file in a thread-safe way?
What is the best way to rotate the file after X MB? Should I do this in my code or from the filesystem? (If I do it from the filesystem, how do I verify that the file isn't in the middle of a write, or that the buffer has been flushed?)
One simple method would be to have a single file writer actor that serialized all writes to the disk. You could then have multiple request handler actors that fed it updates as they processed logging events from the frontend server. You'd get concurrency in request handling while still serializing writes to your log file. Having more than a single actor would open the possibility of concurrent writes, which would at best corrupt your log file. Basically, if you want something to be thread-safe in the actor model, it should be executed on a single actor. Unfortunately, your task is inherently serial at the point you write to disk. You could do something more involved like merge log files coming from multiple actors at rotation time but that seems like overkill. Unless you're generating that 64-100MB in a second or two, I'd be surprised if the extra threads doing I/O bought you anything.
Assuming a single writing actor, it's pretty trivial to calculate the amount that has been written since the last rotation and I don't think tracking in the actor's internal state versus polling the filesystem would make a difference one way or the other.
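A minimal sketch of that single-writer idea, here using Akka classic actors (the message type, size threshold, and file naming are my own assumptions):

```scala
import java.io.{BufferedWriter, File, FileWriter}

import akka.actor.{Actor, Props}

final case class LogEvent(line: String)

// Only this actor ever touches the log file, so writes are serialized and rotation
// can never race with a write. Request-handler actors just send it LogEvent messages.
class LogFileWriter(dir: File, maxBytes: Long) extends Actor {
  private var out: BufferedWriter = open()
  private var written: Long = 0L

  private def open(): BufferedWriter = {
    val file = new File(dir, s"events-${System.currentTimeMillis()}.log")
    new BufferedWriter(new FileWriter(file, true))
  }

  def receive: Receive = {
    case LogEvent(line) =>
      out.write(line)
      out.newLine()
      written += line.getBytes("UTF-8").length + 1
      if (written >= maxBytes) rotate()
  }

  // The closed file is complete and flushed, so it is safe to ship to S3.
  private def rotate(): Unit = {
    out.flush(); out.close()
    written = 0L
    out = open()
  }

  override def postStop(): Unit = { out.flush(); out.close() }
}

object LogFileWriter {
  def props(dir: File, maxBytes: Long): Props = Props(new LogFileWriter(dir, maxBytes))
}

// Usage (assuming an ActorSystem named system):
//   val writer = system.actorOf(LogFileWriter.props(new File("/var/log/frontend"), 64L * 1024 * 1024))
//   writer ! LogEvent("""{"event":"click","button":"signup"}""")
```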
You can use only one actor to write all the requests coming from different threads; since all of the requests go through this actor, there will be no concurrency problems.
As for file rolling, if your writes can be logged line by line, you can use log4j's or logback's RollingFileAppender. Otherwise, you can write your own, which is easy as long as you remember to lock the file before performing any delete or update operations.
Rolling usually means you rename the older files and the current file to other names and then create a new file with the current file name, so you always write to the file with the current file name.