I am working on the logging part in Talend. I followed this tutorial, https://www.talendforge.org/tutorials/tutorial.php?idTuto=33, and can successfully log errors as well as job stats (i.e. the begin and end time of the job), but I also want to capture in my logs
the information/messages of subjobs, such as "3 rows in 0.01s 375 rows/s" for a CSV file. How can I record or capture this information?
See Stats & Logs in job properties. There you can store this information into files or into a database.
Keep in mind you might need to activate "Monitor this connection" in the row settings.
In the premium version there is also an Activity Monitoring Console (AMC) available, which can be used to visualize those logs from a database.
I need to send data from a databricks delta table into azure event hubs.
The data will be selected with a sql select
spark.sql("SELECT [columns] FROM table WHERE [where clause]")
This select will return many many rows and after it, I will apply some transformation (mainly to be in accordance to the event hub event data message).
At the end I will send it to event hub.
As far as I can tell, at the moment of writing, I need to use "writeStream" but is this enough? How can I control how many messages are sent per batch? Do I even need to care about it or does the lib handle it?
Another question I have is, from the moment I use "writeStream" the command hangs in a running/streaming state for eternity. Is this correct or am I not being patient enough? If I'm correct, then how can I stop it (in a non-manual way) after sending all data?
Notes:
This will be running in a job that is to be triggered manually
The lib I use for the event hub connection is com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.14.1
Once you have the final records you want to save to the event hub, call the .start() method after your write command; this starts the stream that writes the data to the event hub.
Also, if your job fails, you need to stop your SparkContext using sc.stop() or spark.sparkContext.stop().
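As a hedged alternative sketch: a DataFrame returned by spark.sql(...) is a batch DataFrame, and writeStream only applies to streaming DataFrames. The Event Hubs connector also documents a batch sink, so a plain df.write may be enough here, and it returns once the write has finished instead of staying in a running state. The snippet assumes the DataFrame already has a `body` column (string or binary) holding the event payload, and that the connector version in use accepts a plain connection string (newer versions expect it to be encrypted first). The function name is illustrative and not from the original posts.

```python
def send_batch_to_event_hub(df, connection_string):
    """Write a batch DataFrame to Event Hubs and return when the write is done.

    Assumptions (not from the original posts): `df` is a non-streaming Spark
    DataFrame that already has a `body` column, and the connector accepts
    the connection string as given.
    """
    eh_conf = {"eventhubs.connectionString": connection_string}
    # Batch write: no .start() and no long-running streaming query.
    df.write.format("eventhubs").options(**eh_conf).save()
    return eh_conf
```

If the source really is a streaming DataFrame, a one-shot trigger (`writeStream.trigger(once=True)` followed by `awaitTermination()` on the returned query) is the usual way to let the query stop on its own after processing the available data.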
I have been trying to get druid to fire a kill task periodically to clean up unused segments.
These are the configuration variables responsible for it
druid.coordinator.kill.on=true
druid.coordinator.kill.period=PT45M
druid.coordinator.kill.durationToRetain=PT45M
druid.coordinator.kill.maxSegments=10
From the above configuration, my mental model is: once ingested data is marked unused, a kill task will fire and delete the segments that are older than 45 minutes while retaining 45 minutes' worth of data. period and durationToRetain are the config variables that confuse me; I am not quite sure how to use them. Any help would be appreciated.
The caveat for druid.coordinator.kill.on=true is that segments are deleted from whitelisted datasources. The whitelist is empty by default.
To populate the whitelist with all datasources, set killAllDataSources to true. Once I did that, the kill task fired as expected and deleted the segments from s3 (COS). This was tested for Druid version 0.18.1.
Now, while the above configuration properties can be set when you build your image, the killAllDataSources needs to be set through an API. This can be set via the druid UI too.
When you click the option, a modal appears that has Kill All Data Sources. Click on True and you should see a kill task (Ingestion ---> Tasks below) firing at the specified interval. It would be really nice to have this as part of runtime.properties or some sort of common configuration file, so the value could be set when building the Druid image.
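The API route mentioned above can be scripted. The following is a minimal sketch that posts `killAllDataSources` to the coordinator's dynamic configuration endpoint (`/druid/coordinator/v1/config`). The coordinator URL is an assumption, and it is worth checking on your Druid version that posting a partial config leaves the other dynamic settings untouched.

```python
import json
import urllib.request


def build_kill_all_request(coordinator_url):
    """Build (but do not send) a POST that enables killAllDataSources."""
    payload = json.dumps({"killAllDataSources": True}).encode("utf-8")
    return urllib.request.Request(
        url=coordinator_url.rstrip("/") + "/druid/coordinator/v1/config",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def enable_kill_all(coordinator_url):
    """Send the request; raises urllib.error.URLError if the call fails."""
    with urllib.request.urlopen(build_kill_all_request(coordinator_url)) as resp:
        return resp.status
```

Calling `enable_kill_all("http://localhost:8081")` (a hypothetical address) once after the cluster starts would stand in for the manual click in the UI.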
Use crontab; it works quite well for us.
If you want to control segment removal from outside Druid, you must use a scheduled task that runs at your desired interval and registers kill tasks in Druid. This gives you more control over your segments, which matters because once they are gone you cannot recover them. You can use this script to help:
https://github.com/mostafatalebi/druid-kill-task
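For illustration only (this is not the linked script), a scheduled job could build a kill task spec and submit it to the Overlord's task endpoint (`/druid/indexer/v1/task`). The datasource name, the Overlord URL, and the choice of 1970-01-01 as the interval start are assumptions made for this sketch.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_kill_task(datasource, retain, now=None):
    """Return a kill task spec covering everything older than `retain`."""
    now = now or datetime.now(timezone.utc)
    end = (now - retain).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "type": "kill",
        "dataSource": datasource,
        "interval": "1970-01-01T00:00:00Z/" + end,
    }


def submit_task(overlord_url, task):
    """POST the task spec to the Overlord and return the HTTP status."""
    request = urllib.request.Request(
        url=overlord_url.rstrip("/") + "/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return resp.status
```

A crontab entry would then run a small script that calls `submit_task(...)` with `build_kill_task("my_datasource", timedelta(minutes=45))`. Only segments already marked unused are removed by a kill task, so used data inside the interval is not affected.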
I use a copy activity to call an HTTP API and store the json response as a file in Azure blob storage. The copy activity is executed in a ForEach loop and each activity run takes 16 seconds, but when I look at the run details it says the copy duration is only 3 seconds. Then why does the activity take 16 seconds to complete? The source dataset is an Http File with an HttpServer linked service and the sink dataset is a blob storage json file. Both the source and sink datasets are configured with Binary Copy and it's a GET request to an HTTPS URL with anonymous authentication.
I would like to speed up this activity since it is run multiple times inside the ForEach loop. Is there some way to improve the performance?
There are always a few seconds of overhead when starting an activity. Also consider that the HTTP server might be responsible for some of the seconds you are seeing.
If you are using a ForEach loop and want to speed up the process, you can uncheck the Sequential checkbox in the Settings tab of the ForEach activity.
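In the pipeline JSON, the Sequential checkbox corresponds to the `isSequential` property of the ForEach activity, and `batchCount` caps how many iterations run in parallel. As a rough sketch, the relevant part of the definition can be generated like this; the activity name, the items expression, and the empty inner activity list are placeholders, not taken from the original pipeline.

```python
import json


def foreach_activity(name, items_expression, inner_activities, batch_count=20):
    """Return a ForEach activity definition that runs iterations in parallel."""
    return {
        "name": name,
        "type": "ForEach",
        "typeProperties": {
            "isSequential": False,       # same as unchecking "Sequential"
            "batchCount": batch_count,   # upper bound on parallel iterations
            "items": {"value": items_expression, "type": "Expression"},
            "activities": inner_activities,
        },
    }


if __name__ == "__main__":
    activity = foreach_activity("CopyEachUrl", "@pipeline().parameters.urls", [])
    print(json.dumps(activity, indent=2))
```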
Hope this helped!
I'm looking to impose a timeout on custom activities in data factory via the policy.timeout property in the activity json.
However I haven't seen any documentation to suggest how the timeout operates upon Azure batch? I assume that the batch task is forcibly terminated somehow.
But is the task -> custom activity informed so it can tidy up?
The reason I ask is that I could be mid-copying to data lake store and I neither want to let it run indefinitely nor stop it without some sort of clean up (I can't see a way of doing transactions as such using the data lake store SDK).
I'm considering putting the timeout within the custom activity, but it would be a shame to have timeouts defined at 2 different levels (I'd probably still want the overall timeout).
I feel your pain.
ADF simply terminates the activity when its own timeout is reached, regardless of what state the invoked service is in.
I have the same issue with U-SQL processing calls. It takes a lot of proactive monitoring via PowerShell to ensure data lake or batch jobs have enough compute to complete jobs with naturally increasing data volumes before the ADF timeout kill occurs.
I'm not aware of any graceful way for ADF to handle this because it would differ for each activity type.
Time to create another feedback article for Microsoft!
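Until something like that exists, the inner timeout the question considers is probably the only place where clean-up can happen. A minimal sketch of the idea, using local files as a stand-in for Data Lake Store (the function and exception names are hypothetical): copy in chunks, check a deadline between chunks, and remove the partial output if the deadline passes or the copy fails.

```python
import os
import time


class CopyTimedOut(Exception):
    """Raised when the inner deadline passes before the copy completes."""


def copy_with_deadline(src_path, dst_path, timeout_seconds, chunk_size=1024 * 1024):
    """Copy src to dst in chunks; on timeout or error, delete the partial dst."""
    deadline = time.monotonic() + timeout_seconds
    try:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while True:
                if time.monotonic() > deadline:
                    raise CopyTimedOut(dst_path)
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)
    except BaseException:
        # Tidy up so no half-written file is left behind.
        if os.path.exists(dst_path):
            os.remove(dst_path)
        raise
```

Setting the inner timeout somewhat below policy.timeout leaves the activity time to clean up before ADF terminates it, at the cost of defining the timeout in two places.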
I am maintaining a legacy application written using Spring Batch and need to tweak it to never lose data.
I have to read from various web services (one for each step) and then write to a remote database. Things go bad when the connection to the DB drops, because all items read from the web service are discarded (the same item can't be read twice) and the data is lost because it cannot be written.
I need to set up Spring Batch to keep the data already read in a step so the write operation can be retried the next time the step runs. The same step must not read more data until the write operation has completed successfully.
When a write fails, the step should keep the read data and pass execution to the next step. After a while, when it is time for the failed step to run again, it should not read another item but retry the failed write operation instead.
The batch application should run in an infinite loop, and each step should gather data from a different source. Failed write operations should be skipped for the moment (keeping the read data) so they do not delay other steps, but should resume from the write operation the next time they are called.
I have been researching various web sources besides the official docs, but Spring Batch's docs are not the most intuitive I have come across.
Can this be achieved? If yes, how?
You can store the data that must survive a job failure in the step's ExecutionContext, and then restart the job with that data:
Step executions are represented by objects of the StepExecution class.
Each execution contains a reference to its corresponding step and
JobExecution, and transaction related data such as commit and rollback
count and start and end times. Additionally, each step execution will
contain an ExecutionContext, which contains any data a developer needs
persisted across batch runs, such as statistics or state information
needed to restart
More from: http://static.springsource.org/spring-batch/reference/html/domain.html#domainStepExecution
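To make the idea concrete, here is a language-neutral sketch in plain Python, not Spring Batch code. A dict stands in for the ExecutionContext (which Spring Batch would persist through its job repository), and the class and callables are hypothetical. The step retries a kept item before reading a new one.

```python
class RetryingStep:
    """One step: read an item, try to write it, keep it if the write fails."""

    def __init__(self, read_item, write_item, context):
        self.read_item = read_item      # callable returning the next item
        self.write_item = write_item    # callable that may raise on failure
        self.context = context          # dict standing in for ExecutionContext

    def run_once(self):
        # Retry a previously failed item before reading anything new.
        if "pending" not in self.context:
            self.context["pending"] = self.read_item()
        try:
            self.write_item(self.context["pending"])
        except ConnectionError:
            return False                # keep the item, let other steps run
        del self.context["pending"]
        return True
```

A scheduler can call `run_once()` on each step in turn: a step that returns False has kept its item and will retry the write on its next turn without reading again.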
I do not know if this will be ok with you, but here are my thoughts on your configuration.
Since you have two remote sources that are open to failure, let us partition the overall system with two jobs (not two steps)
JOB A
Step 1: Tasklet
Check a shared folder for files. If files exist, do not proceed to the next step. The reason becomes clearer in the description of JOB B.
Step 2: Webservice to files
Read from your web service and write the results to flat files in the shared folder. Since the output goes to flat files, this solves your problem that "all items read from webservice are discarded and the data is lost because can not be written."
Use Quartz or equivalent for the scheduling of this job.
JOB B
Poll the shared folder for generated files and create a joblauncher with the file (file.getWhere as a jobparameter). The Spring Integration project may help with this polling.
Step 1:
Read from the file, write them to remote db and move/delete file if writing to db is successful.
No scheduling will be needed since job launching originates from polled in files.
Sample Execution
Time 0: No file in the shared folder
Time 1: Read from web service and write to shared folder
Time 2: Job B file polling occurs, tries to write to db.
If successful, the system continues to execute.
If not, when Job A tries to execute on its scheduled time, it will skip reading from web service since files still exist in the shared folder. It will skip until Job B consumes the files.
I did not want to go into implementation specifics but Spring Batch can handle all of these situations. Hope that this helps.
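For readers who want to see the hand-off in motion, the two jobs can be simulated in a few lines of plain Python. This is only a sketch of the control flow, not Spring Batch code; the function names, the file name, and the callables passed in are all hypothetical.

```python
import os


def job_a(shared_dir, read_from_webservice):
    """Skip if unconsumed files exist; otherwise fetch and stage one file."""
    if os.listdir(shared_dir):
        return False                      # Step 1: files pending, do not read
    path = os.path.join(shared_dir, "batch.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(read_from_webservice()))
    return True


def job_b(shared_dir, write_to_db):
    """Load each staged file into the DB; delete it only after success."""
    for name in sorted(os.listdir(shared_dir)):
        path = os.path.join(shared_dir, name)
        with open(path, encoding="utf-8") as f:
            rows = f.read().splitlines()
        try:
            write_to_db(rows)
        except ConnectionError:
            return False                  # keep the file for the next poll
        os.remove(path)
    return True
```

While the database is down, `job_b` keeps returning False and the staged file stays in place, so `job_a` keeps skipping the web service read and nothing is lost.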