Google Datafusion : Loading multiple small tables daily - google-cloud-data-fusion

I want to load about 100 small tables (min 5 records, max 10000 records) from SQL Server into Google BigQuery on a daily basis. We have created 100 Datafusion pipelines, one pipeline per source table. When we start one pipeline it takes about 7 minutes to execute. Offcourse its starts DataProc, connects to SQL server and sinks the data into Google BigQuery. When we have to run this sequentially it will take 700 minutes? When we try to run in pipelines in parallel we are limited by the network range which would be 256/3. 1 pipeline starts 3 VM's one master 2 slaves. We tried but the performance is going down when we start more than 10 pipelines in parallel.
Questions. Is this the right approach?

When multiple pipelines are running at the same time, there are multiple Dataproc clusters running behind the scenes with more VMs and require more disk. There are some plugins to help you out with multiple source tables. Correct plugin to use should be CDAP/Google plugin called Multiple Table Plugins as it allows for multiple source tables.
In the Data Fusion studio, you can find it in Hub -> Plugins.
To see full lists of available plugins, please visit official documentation.

Multiple Data Fusion pipelines can use the same pre-provisioned Dataproc cluster. You need to create the Remote Hadoop Provisioner compute profile for the Data Fusion instance.
This feature is only available in Enterprise edition.
How setup compute profile for the Data Fusion instance.

Related

Big Latency using Azure Database for PostgreSQL flexible server

I have a PostgreSQL database deployed on Azure using Azure Database for PostgreSQL flexible server.
I have the same database created locally and queries that takes 70ms locally, takes 800ms on the database deployed on Azure.
The latency happens if I query the database from my local machine or from an app deployed in Azure Service.
Any clue what may be the issue or how this can be improved?
The compute tier I am using in Azure is the following (I know it's the worst one but I still think it shouldn't take almost a second to query a table that has 50 records in it and two columns):
Burstable (1-2 vCores) - Best for workloads that don’t need the full
CPU continuously
Standard_B1ms (1 vCore, 2 GiB memory, 640 max iops)
This is what Azure is showing me in the overview (so that's why I am guessing the problem should be network related and not CPU/memory related):
Burstable sku's are for testing and small workloads changing to General purpose will improve.

Can anyone please tell me how efficient Azure Databricks Jobs are?

like If I schedule a databricks notebook to run via Jobs and Azure Data Factory, which one would be more efficient and why?
There are few cases when Databricks Workflows (former Jobs) are more efficient to use than ADF:
ADF still uses Jobs API 2.0 for submission of ephemeral jobs that doesn't support setting of default access control lists
If you have a job consisting of several tasks, Databricks has an option of cluster reuse that allow to use the same cluster(s) to run multiple subtasks, and don't wait to creation of new clusters as in case when subtasks are scheduled from ADF
You can more efficiently share a context between subtasks when using Databricks Workflows

Apache Spark standalone settings

I have an Apache spark standalone set up.
I wish to start 3 workers to run in parallel:
I use the commands below.
./start-master.sh
SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./start-slaves.sh
I tried to run a few jobs and below are the apache UI results:
Ignore the last three applications that failed: Below are my questions:
Why do I have just one worker displayed in the UI despite asking spark to start 3 each with 2 cores?
I want to partition my input RDD for better performance. So for the first two jobs with no partions, I had a time of 2.7 mins. Here my Scala source code had the following.
val tweets = sc.textFile("/Users/soft/Downloads/tweets").map(parseTweet).persist()
In my third job (4.3 min) I had the below:
val tweets = sc.textFile("/Users/soft/Downloads/tweets",8).map(parseTweet).persist()
I expected a shorter time with more partitions(8). Why was this the opposite of what was expected?
Apparently you have only one active worker, which you need to investigate why other workers are not reported by checking the spark logs.
More partitions doesn't always mean that the application runs faster, you need to check how you are creating partitions from source data, the amount of data parition'd and how much data is being shuffled, etc.
In case you are running on a local machine it is quite normal to just start a single worker with several CPU's as shown in the output. It will still split you task of the available CPU's in the machine.
Partitioning your file will happen automatically depending on the amount of available resources, it works quite well most of the time. Spark (and partitioning the files) comes with some overhead, so often, especially on a single machine Spark adds so much overhead it will slowdown you process. The added values comes with large amounts of data on a cluster of machines.
Assuming that you are starting a stand-alone cluster, I would suggest using the configuration files to setup a the cluster and use start-all.sh to start a cluster.
first in your spark/conf/slaves (copied from spark/conf/slaves.template add the IP's (or server names) of you worker nodes.
configure the spark/conf/spark-defaults.conf (copied from spark/conf/spark-defaults.conf.template Set at least the master node to the server that runs your master.
Use the spark-env.sh (copied from spark-env.sh.template) to configure the cores per worker, memory etc:
export SPARK_WORKER_CORES="2"
export SPARK_WORKER_MEMORY="6g"
export SPARK_DRIVER_MEMORY="4g"
export SPARK_REPL_MEM="4g"
Since it is standalone (and not hosted on a Hadoop environment) you need to share (or copy) the configuration (or rather the complete spark directory) to all nodes in your cluster. Also the data you are processing needs to be available on all nodes e.g. directly from a bucket or a shared drive.
As suggested by the #skjagini checkout the various log files in spark/logs/ to see what's going on. Each node will write their own log files.
See https://spark.apache.org/docs/latest/spark-standalone.html for all options.
(we have a setup like this running for several years and it works great!)

Azure Data Factory v2 and data processing in custom activity

I am migrating (extract-load) a large dataset to a LOB service, and would like to use Azure Data Factory v2 (ADF v2). This would be the cloud version of the same kind of orchestration typically implemented in SSIS. My source database and dataset, as well as the target platform are on Azure. That lead me to ADFv2 with Batch Service (ABS) and creating a custom activity.
https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity
However, I am unable to read from the documentation or samples provided by Microsoft how ADF v2 can create the job and tasks needed by the batch service.
As an example, lets say I have dataset with 10 million records, and batch service with 10 cores in a pool. How do I submit 1/10, or even row-for-row, to my command line app running on each of the cores in the pool? How do I distribute the work? Following the default guide at the docs for ADF v2, I just get a datasets.json file, and it is the same for all my pool nodes, no "slice" or subset information.
If ADF v2 was not involved I would create a job in ABS and for each row or for each X rows, create a task. The nodes would then execute task for task. How do I achieve something similar with ADF v2?

Tool to create mongodb sharded cluster

I need a tool to manage a cluster of mongodbs. With an increasing number of machines, it is hard to maintain each machine without a tool.
More details:
The database grows around 50 MB per day, so they are approximately 1.5 GB per month. The mongodb is great for this because just increase a machine in your cluster resolve the size problem. The problem is that this change requires entering the host configuration and make the changes manually. I'd like to optimize the time of the team with a tool that allows remote execution, for example, run and save scripts or similar.
You can use Juju to create mongodb cluster :
https://github.com/charms/mongodb
http://www.youtube.com/playlist?list=PLyZVZdGTRf8g-5E7ppGGpxrraStyi493V
http://www.jorgecastro.org/2014/03/17/introducing-juju-bundles/
You need a provisioning tool, like Vagrant. Or SSH wrapper like Fabric.
MongoDB Cloud Manager has a Public API that you can use for cluster deployment automation.
Here's the link to the (very well-written) official docs.