How to prevent timeout in Dataflow in Azure Data Factory when ingesting data into Kusto? - azure-data-factory

My source is a directory in Azure Data Lake with roughly 70,000 JSON files.
Each file has a number of properties, one of which is an array of complex Node elements.
In total, the JSON files contain around 13 million such Node elements.
Using an Azure Data Factory Data Flow, I want to flatten the array in each JSON file and insert the elements as rows into a Kusto database.
On a "General purpose" integration runtime I get an out-of-memory error. On a memory-optimized instance the job runs for about 40 minutes and then fails with the message below.
What I tried:
Increased the timeout on the Kusto sink; it is set to 36,000 seconds (10 hours).
Increased the compute size to 8+8 cores, memory optimized.
The Kusto cluster is configured with a minimum instance count of 2 and a maximum of 4, but I cannot see any scale-out events.
What are my options to optimize the ingestion?
If other information is required, please let me know in the comments.
Operation on target ExportNodesToKusto failed:
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to
reason: at Sink 'KustoNodesSink': Timed out trying to ingest
requestId:
'a23025d4-f1e7-48cd-a5f9-a4d8dbaec64e'","Details":"shaded.msdataflow.com.microsoft.kusto.spark.exceptions.TimeoutAwaitingPendingOperationException:
Timed out trying to ingest requestId:
'a23025d4-f1e7-48cd-a5f9-a4d8dbaec64e'\n\tat
shaded.msdataflow.com.microsoft.kusto.spark.datasink.KustoWriter$$anonfun$ingestRowsIntoKusto$1.apply(KustoWriter.scala:201)\n\tat
shaded.msdataflow.com.microsoft.kusto.spark.datasink.KustoWriter$$anonfun$ingestRowsIntoKusto$1.apply(KustoWriter.scala:198)\n\tat
scala.collection.Iterator$class.foreach(Iterator.scala:891)\n\tat
scala.collection.AbstractIterator.foreach(Iterator.scala:1334)\n\tat
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)\n\tat
scala.collection.AbstractIterable.foreach(Iterable.scala:54)\n\tat
shaded.msdataflow.com.microsoft.kusto.spark.datasink.KustoWriter$.ingestRowsIntoKusto(KustoWriter.scala:198)\n\tat
shaded.msdataflow.com.microsoft.kusto.spark.datasink.KustoWriter$.ingestToTemporaryTableByWorkers(KustoWriter.scala:247)\n\tat
shaded.msdataflow.com.microsoft.kusto.spark.datasink.KustoWriter$.ingestRowsIntoTem"}
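For context, Data Flows execute on Spark (which is why the error comes from the Kusto Spark connector), and the flatten step corresponds roughly to the job below. This is only a minimal sketch, not the code the Data Flow generates; the path and the column names (Nodes, fileProperty1) are placeholders for the actual schema.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Minimal sketch of the flatten step; path and column names are placeholders.
val spark = SparkSession.builder().appName("flatten-nodes").getOrCreate()

val raw = spark.read
  .option("multiLine", "true")   // each source file is a single JSON document
  .json("abfss://container@account.dfs.core.windows.net/source/*.json")

// Produce one output row per element of the Nodes array, keeping a parent property.
val flattened = raw
  .withColumn("Node", explode(col("Nodes")))
  .selectExpr("fileProperty1", "Node.*")

// The Data Flow's Kusto sink then ingests `flattened` into the target table.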

Related

Single Batch job performing heavy database reads

I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of the data, and sends it to a RESTful API which is our BI platform. Spring Batch breaks this data down into chunks by date rather than by size, so on a particular day one chunk could consist of a million rows. We run the complete end-to-end flow as follows:
Control-M sends a trigger to the load balancer at a scheduled time
Through the load balancer, the request lands on an instance of the Spring Batch app
Spring Batch reads data for that day in chunks from the Oracle database
Chunks are then sent to the target API
My problems are:
The chunks can get heavy. If a chunk contains a million rows, the instance's heap usage grows and at some point the chunks are processed at a trickling pace
One instance bears the load of the entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable, and if so, how can I make sure the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
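For illustration, a date-based Partitioner for the worker steps could look roughly like this. This is a minimal sketch (shown in Scala); the loadDate key and the idea of slicing by date are assumptions about your data, and each worker's reader would use loadDate in its WHERE clause.
import java.util.{Map => JMap}
import org.springframework.batch.core.partition.support.Partitioner
import org.springframework.batch.item.ExecutionContext
import scala.jdk.CollectionConverters._

// Hypothetical partitioner: each worker step gets one date slice of the data.
class DateRangePartitioner(dates: Seq[String]) extends Partitioner {
  override def partition(gridSize: Int): JMap[String, ExecutionContext] = {
    dates.zipWithIndex.map { case (date, i) =>
      val ctx = new ExecutionContext()
      ctx.putString("loadDate", date)   // the worker's reader filters on this value
      s"partition$i" -> ctx
    }.toMap.asJava
  }
}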

Is it possible to configure batch size for SingleStore Kafka pipeline?

I'm using SingleStore to load events from Kafka. I created a Kafka pipeline with the following script:
create pipeline `events_stream`
as load data kafka 'kafka-all-broker:29092/events_stream'
batch_interval 10000
max_partitions_per_batch 6
into procedure `proc_events_stream`
fields terminated by '\t' enclosed by '' escaped by '\\'
lines terminated by '\n' starting by '';
And SingleStore fails with an OOM error like the following:
Memory used by MemSQL (4537.88 Mb) has reached the 'maximum_memory' setting (4915 Mb) on this node. Possible causes include (1) available query execution memory has been used up for table memory (in use table memory: 71.50 Mb) and (2) the query is large and complex and requires more query execution memory than is available
I'm quite confused why 4 GB is not enough to read from Kafka in batches.
Is it possible to configure batch_size for the pipeline to avoid memory issues and make the pipeline more predictable?
Unfortunately, the current version of SingleStore's pipelines only has a global batch size; it cannot be set individually per pipeline or in stored procedures.
However, in general each pipeline batch has some overhead, so running 1 batch for 10,000 messages should be better in terms of total resources than 100 or even 10 batches for the same number of messages. If the stored procedure is relatively light, the time delay you are experiencing is likely dominated by the network download of 500 MB.
It is worth checking whether the 10,000 large messages are arriving on a single Kafka partition or on several. In SingleStore's pipelines, each database partition downloads messages from a different Kafka partition, which helps parallelize the workload. However, if the messages all arrive on just one partition, you are not getting the benefits of parallel execution.
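To check how the messages are spread, a quick look at the per-partition offsets of the topic helps. This is a minimal sketch using the plain Kafka consumer API; the broker address and topic name are taken from the pipeline definition above.
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.jdk.CollectionConverters._

// Compare beginning and end offsets per partition to see where the messages land.
val props = new Properties()
props.put("bootstrap.servers", "kafka-all-broker:29092")
props.put("group.id", "partition-check")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer   = new KafkaConsumer[String, String](props)
val partitions = consumer.partitionsFor("events_stream").asScala
  .map(p => new TopicPartition(p.topic, p.partition))
val begin = consumer.beginningOffsets(partitions.asJava).asScala
val end   = consumer.endOffsets(partitions.asJava).asScala
partitions.foreach { tp =>
  println(s"$tp: ${end(tp).longValue - begin(tp).longValue} messages")
}
consumer.close()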

Spark How to write to parquet file from data using synchronous API

I have a use case which I am trying to solve using Spark. I have to call an API which expects a batchSize and a token, and it returns the token for the next page along with a list of JSON objects. I have to keep calling this API until all the results are returned and write them all to S3 in Parquet format. The number of returned objects can range from 0 to 100 million.
My approach is to first fetch, say, a batch of 1 million objects, convert them into a Dataset, and then write them to Parquet using
dataSet.repartition(1).write.mode(SaveMode.Append)
.option("mapreduce.fileoutputcommitter.algorithm.version", "2")
.parquet(s"s3a://somepath/")
and then repeat the process until the API says there is no more data, i.e. the token is null.
So those API calls have to run sequentially on the driver, and once I have a million objects I write them to S3.
I have been seeing memory issues on the driver:
Application application_1580165903122_19411 failed 1 times due to AM Container for appattempt_1580165903122_19411_000001 exited with exitCode: -104
Diagnostics: Container [pid=28727,containerID=container_1580165903122_19411_01_000001] is running beyond physical memory limits. Current usage: 6.6 GB of 6.6 GB physical memory used; 16.5 GB of 13.9 GB virtual memory used. Killing container.
Dump of the process-tree for container_1580165903122_19411_01_000001 :
I have seen some odd behavior: sometimes 30 million records work fine and sometimes the job fails with this error. Even 1 million fails sometimes.
I am wondering if I am making a silly mistake, or is there a better approach for this?
This design is not scalable and puts a lot of pressure on the driver, so it is expected to crash. Additionally, a lot of data accumulates in memory before it is written to S3.
I recommend using Spark Streaming to read the data from the API. That way many executors do the work and the solution scales much better. Here is an example: RestAPI service call from Spark Streaming.
In those executors you can accumulate the API responses in a balanced way, say 20,000 records at a time rather than waiting for 5M records. After, say, 20,000 records, write them to S3 in "append" mode. The "append" mode lets multiple processes work in tandem without stepping on each other.
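As a rough illustration of that idea (not a drop-in solution): split the work into independent ranges, let each executor pull its own pages, and write in append mode. fetchRange, Record, and the number of ranges are hypothetical placeholders for the real REST client and schema.
import org.apache.spark.sql.{SaveMode, SparkSession}

// Sketch: move the API paging onto executors and flush batches in append mode.
case class Record(id: String, payload: String)   // placeholder schema

def fetchRange(rangeId: Int): Iterator[Record] =
  Iterator.empty                                 // hypothetical: pull this range's pages from the API here

val spark = SparkSession.builder().appName("api-to-parquet").getOrCreate()
import spark.implicits._

// Assumes the API's data can be split into independent ranges (e.g. by key or time window).
val ranges  = spark.sparkContext.parallelize(0 until 100, numSlices = 100)
val records = ranges.flatMap(fetchRange)         // API calls now run on executors, not the driver

records.toDF()
  .write
  .mode(SaveMode.Append)                         // append lets concurrent batches coexist
  .parquet("s3a://somepath/")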

Batch processing from N producers to 1 consumer

I have built a distributed (Celery-based) parser that deals with about 30K files per day. Each file (an EDI file) is parsed as JSON and saved to a file. The goal is to populate a BigQuery dataset.
The generated JSON is BigQuery schema compliant and can be loaded as is into our dataset (table), but we are limited to 1,000 load jobs per day. The incoming messages must be loaded into BQ as fast as possible.
So the target is: every message is parsed by a Celery task, and each result is buffered in a 300-item (distributed) buffer. When the buffer reaches the limit, all the JSON data is aggregated and pushed into BigQuery.
I found Celery Batch, which is the closest out-of-the-box solution I found, but I need something suitable for a production environment.
Note: RabbitMQ is the message broker and the application is shipped with Docker.
Thanks,
Use streaming inserts; the limit there is 100,000 rows per second.
https://cloud.google.com/bigquery/streaming-data-into-bigquery
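For reference, a streaming insert flushed from a worker once the 300-item buffer fills could look roughly like this. This is a minimal sketch with the google-cloud-bigquery Java client; the dataset and table names are placeholders, and the Python client used alongside Celery has a similar insert_rows_json call.
import com.google.cloud.bigquery.{BigQueryOptions, InsertAllRequest, TableId}
import scala.jdk.CollectionConverters._

// Push one buffered batch into BigQuery via the streaming insert API.
val bigquery = BigQueryOptions.getDefaultInstance.getService
val table    = TableId.of("my_dataset", "my_table")   // placeholder names

def flush(buffer: Seq[Map[String, AnyRef]]): Unit = {
  val request = buffer
    .foldLeft(InsertAllRequest.newBuilder(table)) { (builder, row) => builder.addRow(row.asJava) }
    .build()
  val response = bigquery.insertAll(request)
  if (response.hasErrors)
    println(s"failed rows: ${response.getInsertErrors}")   // retry or dead-letter as needed
}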

Spark throwing Out of Memory error

I have a single test node with 8 GB of RAM on which I am loading barely 10 MB of data (from CSV files) into Cassandra (running on the same node). I'm trying to process this data using Spark (also running on the same node).
Please note that I'm allocating 1 GB of RAM for SPARK_MEM and the same for SPARK_WORKER_MEMORY. Allocating any more memory results in Spark throwing a "Check if all workers are registered and have sufficient memory" error, which is usually a sign that Spark is asking for more memory (per the SPARK_MEM and SPARK_WORKER_MEMORY properties) than is available.
When I try to load and process all the data in the Cassandra table using the Spark context object, I get an error during processing. So I'm trying to use a looping mechanism to read chunks of data at a time from one table, process them, and put them in another table.
My source code has the following structure:
// Read the chunk where value = 1, transform it, and save it.
var data = sc.cassandraTable("keyspacename", "tablename").where("value=?", 1)
data.map(x => tranformFunction(x)).saveToCassandra("keyspacename", "tablename")

// Repeat for the remaining values: read one chunk per iteration, transform, save.
for (i <- 2 to 50000) {
  data = sc.cassandraTable("keyspacename", "tablename").where("value=?", i)
  data.map(x => tranformFunction(x)).saveToCassandra("keyspacename", "tablename")
}
Now, this works for a while, for around 200 loops, and then this throws an error: java.lang.OutOfMemoryError: unable to create a new native thread.
I've got two questions:
Is this the right way to deal with data?
How can processing just 10 MB of data do this to a cluster?
You are running a query inside the for loop. If the 'value' column is not a key/indexed column, Spark will load the whole table into memory on every iteration and then filter on the value. That will certainly cause an OOM.
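If the transformation does not actually need to run one value at a time, a single pass over the table avoids the 50,000 repeated scans entirely. A minimal sketch, assuming the whole table can be processed in one job (the target table name is a placeholder):
// Read once, transform, and write, letting the connector partition the work.
sc.cassandraTable("keyspacename", "tablename")
  .map(x => tranformFunction(x))
  .saveToCassandra("keyspacename", "target_tablename")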