Apache NiFi missing files when using ListGCSBucket and FetchGCSObject processors - google-cloud-storage

We have a NiFi process group to transfer files between two Google Cloud Storage buckets, say BucketA and BucketB (in different Google Cloud projects). For this we use the following NiFi processors, in order:
ListGCSBucket
FetchGCSObject
PutGCSObject
At the end of the day we found that some files are not present in BucketB. Files are uploaded to BucketA every minute, and the NiFi ListGCSBucket processor is scheduled to run every 30 seconds.
Could this be due to some race condition, e.g. a file is uploaded to BucketA while ListGCSBucket is running, is missed by that listing, and is not picked up in the next run either?

Related

What happens when network connection to GCP is lost?

Imagine I have a GCS bucket mounted on my local Linux file system. Imagine I have an app that is writing new files into a Linux directory that is mounted to GCS. My goal is to have those locally written files eventually show up in GCS.
I understand that the writes on Linux happen "locally" until the file is closed ... what happens if I lose network connectivity and hence can't write to GCS? Will the local file eventually end up in GCS? Do retries and re-attempts happen?
Based on the gcsfuse repository documentation, file upload retries are already built into the utility; they happen when there are problems accessing the mounted storage bucket. You can adjust the retry backoff with the --max-retry-sleep flag, which controls the maximum time to wait between retries; once the backoff reaches this limit, retrying stops. The flag accepts a duration, for example a number of minutes.
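For illustration, a mount invocation with an adjusted retry ceiling might look like the lines below; the bucket name and mount point are placeholders, and the exact flag syntax may differ between gcsfuse versions:
# Cap the exponential-backoff sleep between retries at 5 minutes.
# my-bucket and /mnt/gcs are hypothetical; substitute your own values.
gcsfuse --max-retry-sleep 5m my-bucket /mnt/gcs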
This doc page is also relevant if you would like to know more about specific characteristics of gcsfuse.

Google Data Fusion: Loading multiple small tables daily

I want to load about 100 small tables (between 5 and 10,000 records each) from SQL Server into Google BigQuery on a daily basis. We have created 100 Data Fusion pipelines, one pipeline per source table. When we start one pipeline it takes about 7 minutes to execute: it starts Dataproc, connects to SQL Server, and sinks the data into BigQuery. Run sequentially, that would take about 700 minutes. When we try to run pipelines in parallel we are limited by the network range, which allows roughly 256/3 pipelines, since one pipeline starts 3 VMs (one master and 2 workers). We tried, but performance goes down when we start more than 10 pipelines in parallel.
Question: is this the right approach?
When multiple pipelines are running at the same time, there are multiple Dataproc clusters running behind the scenes, which means more VMs and more disk. There are plugins to help you with multiple source tables: the correct plugin to use is the CDAP/Google plugin called Multiple Table Plugins, as it allows a single pipeline to read from multiple source tables.
In the Data Fusion studio, you can find it in Hub -> Plugins.
To see the full list of available plugins, please visit the official documentation.
Multiple Data Fusion pipelines can use the same pre-provisioned Dataproc cluster. You need to create the Remote Hadoop Provisioner compute profile for the Data Fusion instance.
This feature is only available in the Enterprise edition.
See how to set up a compute profile for the Data Fusion instance.

Apache Spark standalone settings

I have an Apache Spark standalone setup.
I wish to start 3 workers to run in parallel, using the commands below:
./start-master.sh
SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./start-slaves.sh
I ran a few jobs and below are the Spark UI results:
Ignore the last three applications that failed. Below are my questions:
Why do I have just one worker displayed in the UI despite asking Spark to start 3, each with 2 cores?
I want to partition my input RDD for better performance. For the first two jobs, with no partitioning specified, I had a time of 2.7 minutes. Here my Scala source code had the following:
val tweets = sc.textFile("/Users/soft/Downloads/tweets").map(parseTweet).persist()
In my third job (4.3 min) I had the below:
val tweets = sc.textFile("/Users/soft/Downloads/tweets",8).map(parseTweet).persist()
I expected a shorter time with more partitions (8). Why was the result the opposite of what I expected?
Apparently you have only one active worker; you need to investigate why the other workers are not reported, by checking the Spark logs.
More partitions doesn't always mean that the application runs faster. You need to check how you are creating the partitions from the source data, how much data is in each partition, how much data is being shuffled, and so on.
In case you are running on a local machine, it is quite normal to just start a single worker with several CPUs, as shown in the output. It will still split your task over the available CPUs in the machine.
Partitioning your file happens automatically depending on the amount of available resources, and it works quite well most of the time. Spark (and partitioning the files) comes with some overhead, so often, especially on a single machine, Spark adds so much overhead that it slows down your process. The added value comes with large amounts of data on a cluster of machines.
Assuming that you are starting a standalone cluster, I would suggest using the configuration files to set up the cluster and using start-all.sh to start it.
First, in spark/conf/slaves (copied from spark/conf/slaves.template), add the IPs (or server names) of your worker nodes.
Configure spark/conf/spark-defaults.conf (copied from spark/conf/spark-defaults.conf.template). Set at least the master URL to the server that runs your master.
Use spark-env.sh (copied from spark-env.sh.template) to configure the cores per worker, memory, etc.:
export SPARK_WORKER_CORES="2"
export SPARK_WORKER_MEMORY="6g"
export SPARK_DRIVER_MEMORY="4g"
export SPARK_REPL_MEM="4g"
Since it is standalone (and not hosted on a Hadoop environment), you need to share (or copy) the configuration, or rather the complete Spark directory, to all nodes in your cluster. The data you are processing also needs to be available on all nodes, e.g. directly from a bucket or a shared drive.
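A rough sketch of those steps, assuming a master named master-host and two workers named worker-1 and worker-2 (all hypothetical names, with SPARK_HOME pointing at the Spark installation), might look like this:
# List the worker hosts in conf/slaves (one per line)
cp $SPARK_HOME/conf/slaves.template $SPARK_HOME/conf/slaves
echo "worker-1" >> $SPARK_HOME/conf/slaves
echo "worker-2" >> $SPARK_HOME/conf/slaves
# Point every node at the master in conf/spark-defaults.conf
cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
echo "spark.master spark://master-host:7077" >> $SPARK_HOME/conf/spark-defaults.conf
# Copy the whole Spark directory (including conf/) to each worker,
# then start the master and all workers in one go
rsync -a $SPARK_HOME/ worker-1:$SPARK_HOME/
rsync -a $SPARK_HOME/ worker-2:$SPARK_HOME/
$SPARK_HOME/sbin/start-all.sh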
As suggested by @skjagini, check out the various log files in spark/logs/ to see what's going on. Each node writes its own log files.
See https://spark.apache.org/docs/latest/spark-standalone.html for all options.
(we have a setup like this running for several years and it works great!)

How to reduce the time for file copy to S3 using Talend

I have created a small job to copy a CSV file of 3 million records (350 MB, in zip format) to Amazon S3 via Talend Data Integration, using the tS3Put component. The job took around 2 hours 20 minutes to complete. But when I copy the same file via the AWS CLI or Informatica, it completes within an hour.
Does anyone have an idea how to reduce the copy time to S3 using the Talend Data Integration tool?

Intermittent file not found using Google Cloud Storage from Dataproc - flushing writes?

I have a series of Dataproc jobs that run to import some data received each morning. The process creates a cluster, runs four jobs in sequence, then shuts down the cluster. The input file is read from Google Cloud Storage, the intermediate results are also saved in Avro form in GCS, and the final output goes to Cloud SQL.
Fairly often the jobs will fail trying to read the Avro written by the previous job. It appears that GCS hasn't "caught up" and the results from the previous job haven't been fully written. I was getting failures trying to read files that appeared to be from the previous day's run, and partway through, those files would disappear and be replaced by the new ones. I have changed the script that runs the jobs to clear the work area before starting, but I still have problems where sometimes a job starts reading before all the parts have been fully written.
I could change the code to simply store the intermediate files on the cluster, though I like having them available outside the cluster for diagnosing other problems. I could also just write to both locations, with the cluster copy used for the work and the GCS copy for diagnostics.
But assuming this is some kind of sync issue, is there a way to force GCS to flush writes / be fully synced between jobs? Or is there some check I can do to make sure everything has been written before starting the next job in my chain?
EDIT: To answer the comment below, the sequence of jobs all runs on the same cluster. The cluster is started, each job is run in turn on that cluster, and then the cluster is shut down.
For now, I have worked around this by having the jobs write to HDFS on the cluster in addition to GCS, with the subsequent jobs reading from the cluster. The GCS output is now strictly for diagnostics in case of a problem. But even though my immediate problem is (I believe) fixed, I would still like to know what is happening and why GCS seems out of sync for a bit.
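For what it's worth, the workaround described above can be sketched as a driver script like the one below; the cluster name, region, bucket, jar, and job classes are all hypothetical, and the gsutil check on the _SUCCESS marker is just one possible way to confirm a step's output is visible before the next job starts:
# All names (cluster, bucket, jar, classes, paths) are placeholders.
gcloud dataproc clusters create import-cluster --region=us-central1
# Each step writes its intermediate Avro to HDFS (read by the next step)
# and to GCS purely for diagnostics.
gcloud dataproc jobs submit spark --cluster=import-cluster --region=us-central1 \
  --class=com.example.ImportStep1 --jars=gs://my-bucket/jobs/import.jar \
  -- hdfs:///work/step1 gs://my-bucket/diagnostics/step1
# Optional check: only continue once the previous step's success marker
# is visible in GCS.
gsutil -q stat gs://my-bucket/diagnostics/step1/_SUCCESS || exit 1
gcloud dataproc jobs submit spark --cluster=import-cluster --region=us-central1 \
  --class=com.example.ImportStep2 --jars=gs://my-bucket/jobs/import.jar \
  -- hdfs:///work/step1 hdfs:///work/step2 gs://my-bucket/diagnostics/step2
gcloud dataproc clusters delete import-cluster --region=us-central1 --quiet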