Stitch part files into one with a custom name - google-cloud-data-fusion

A Data Fusion pipeline gives us one or more part files at the output when the sink is a GCS bucket. My question is: how can we combine those part files into one and also give them a meaningful name?

Data Fusion transformations run in Dataproc clusters, executing either Spark or MapReduce jobs. Your final output is split into many files because the jobs partition your data based on the HDFS partitions (this is the default behavior for Spark/Hadoop).
When writing a Spark script yourself you can override this default behavior and produce a different output. However, Data Fusion was built to abstract away the code layer and give you the experience of a fully managed data integrator. Split files should not be a problem, but if you really need to merge them I suggest the following approach:
At the top of your Pipeline Studio, click Hub -> Plugins, search for the Dynamic Spark Plugin, click Deploy and then Finish (you can ignore the JAR file).
Back in your pipeline, select Spark in the sink section.
Replace your GCS plugin with the Spark plugin.
In your Spark plugin, set Compile at Deployment Time to false and replace the code with Spark code that does what you want. The code below, for example, is hardcoded but works:
def sink(df: DataFrame): Unit = {
  // Reduce to a single partition so only one part file is written
  val mergedDf = df.coalesce(1)
  mergedDf.write.format("csv").save("gs://your/path/")
}
This function receives the data from your pipeline as a DataFrame. The coalesce call reduces the number of partitions to 1, and the last line writes it to GCS.
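If you also need the output to have a meaningful name instead of the default part-xxxxx file, one option is to rename the single part file after writing it. This is only a sketch assuming the same sink signature as above; the Hadoop FileSystem calls are standard, but the output path and the final name output.csv are placeholders, not something Data Fusion provides for you:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

def sink(df: DataFrame): Unit = {
  val outputDir = "gs://your/path/"  // placeholder bucket path
  df.coalesce(1).write.format("csv").save(outputDir)

  // Locate the single part file Spark produced and rename it to something meaningful.
  val fs = FileSystem.get(new URI(outputDir), df.sparkSession.sparkContext.hadoopConfiguration)
  val partFile = fs.globStatus(new Path(outputDir + "part-*"))(0).getPath
  fs.rename(partFile, new Path(outputDir + "output.csv"))  // illustrative final name
}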
Deploy your pipeline and it will be ready to run

Related

Concurrent file processing in data flow activity Azure Data Factory

When using control flow, it's possible to use a Get Metadata activity to retrieve a list of files in a blob storage account and then pass that list to a ForEach activity. With the Sequential flag set to false, all files are processed concurrently (in parallel), up to the max batch size, by the activities defined inside the ForEach loop.
However, when reading about data flows in the following article from Microsoft (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern), they indicate the following:
A mapping data flow will execute better when the Source transformation iterates over multiple files instead of looping via the For Each activity. We recommend using wildcards or file lists in your source transformation. The Data Flow process will execute faster by allowing the looping to occur inside the Spark cluster. For more information, see Wildcarding in Source Transformation.
For example, if you have a list of data files from July 2019 that you wish to process in a folder in Blob Storage, below is a wildcard you can use in your Source transformation.
DateFiles/*_201907.txt
By using wildcarding, your pipeline will only contain one Data Flow activity. This will perform better than a Lookup against the Blob Store that then iterates across all matched files using a ForEach with an Execute Data Flow activity inside.
Based on this finding, I have configured a data flow task where the source is a blob directory of files, and it processes all files in that directory with no control loops. However, I don't see any option to process files concurrently within the data flow; I only see an Optimize tab where you can set the partitioning option.
Is this option only for processing a single large file across multiple threads, or does it control how many files are processed concurrently within the directory the source is pointing at?
Is the documentation assuming the ForEach control loop is set to "Sequential"? (I can see why the claim would be true in that case, but I have a hard time believing it if the data flow processes one file at a time.)
Inside a data flow, each source transformation reads all of the files indicated by the folder or wildcard path and stores their contents in data frames in memory for processing.
Setting the partitioning manually from the Optimize tab tells ADF which partitioning scheme you wish to use inside Spark.
To process each file individually, one at a time, use the control flow capabilities in the pipeline.
Iterate over each file you wish to process in a ForEach activity with execution set to Sequential, and pass the file name into the data flow via the iterator parameter.

Change spark _temporary directory path to avoid deletion of parquets

When two or more Spark jobs have the same output directory, mutual deletion of files will be inevitable.
I'm writing a dataframe in append mode with Spark 2.4.4, and I want to add a timestamp to Spark's temporary directory to avoid these deletions.
Example:
my Spark job writes to hdfs:/outputFile/0/tmp/file1.parquet
the same Spark job, called with other data, writes to hdfs:/outputFile/0/tmp/file2.parquet
I want job 1 to write to hdfs:/outputFile/0/tmp+(timestamp)/file1.parquet
and the other job to write to hdfs:/outputFile/0/tmp+(timestamp)/file2.parquet, and then move the parquet files to hdfs:/outputFile/
df
.write
.option("mapreduce.fileoutputcommitter.algorithm.version", "2")
.partitionBy("XXXXXXXX")
.mode(SaveMode.Append)
.format(fileFormat)
.save(path)
When Spark appends data to an existing dataset, Spark uses FileOutputCommitter to manage staging output files and final output files. The behavior of FileOutputCommitter has direct impact on the performance of jobs that write data.
A FileOutputCommitter has two methods, commitTask and commitJob. Apache Spark 2.0 and higher versions use Apache Hadoop 2, which uses the value of mapreduce.fileoutputcommitter.algorithm.version to control how commitTask and commitJob work. In Hadoop 2, the default value of mapreduce.fileoutputcommitter.algorithm.version is 1. With this version, commitTask moves data generated by a task from the task temporary directory to the job temporary directory, and when all tasks complete, commitJob moves data from the job temporary directory to the final destination.
Because the driver does the work of commitJob, this operation can take a long time on cloud storage, and you may often think that your job is "hanging". However, when the value of mapreduce.fileoutputcommitter.algorithm.version is 2, commitTask moves data generated by a task directly to the final destination and commitJob is basically a no-op.
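If you want to apply the v2 algorithm to the whole job rather than as a per-write option, a minimal sketch (assuming you control the SparkSession configuration) looks like this:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("append-job")
  // With algorithm version 2, each task commits its output directly to the
  // final destination, so commitJob is essentially a no-op.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// Or, on an already running session:
spark.sparkContext.hadoopConfiguration
  .setInt("mapreduce.fileoutputcommitter.algorithm.version", 2)
Keep in mind that with version 2 a failed job can leave partial data in the final destination, since task output is no longer staged in a job temporary directory.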

Loading many tables in Cloud Data Fusion fails with DAG error

I have an MS SQL Server data source with around 1000 tables, which I need to put into BigQuery. I was hoping to use Data Fusion to load them all into staging tables in BigQuery, and then perform transformations on them afterwards. However, as soon as I create a pipeline with two "islands" it gives a DAG error. Is that by design, or just something I'm doing wrong? I can't find anything in the documentation. My pipeline looks like this:
And the error I get when I try to deploy is: "Invalid DAG. There is an island made up of stages BigTest,BigQuery BigTest (no other stages connect to them)."
Each pipeline is a single DAG (directed acyclic graph), and all sources and sinks must be connected for the configuration to be valid. You can use a multi-table source plugin that brings in multiple tables at once to landing tables in BQ.
You can use the Multi Table plugins and the BQ Multi Table sink for your use case.

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance.
I have also posted this question in the AWS Glue forum; here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports the pushdown predicates feature; however, it currently works with partitioned data on S3 only. There is a feature request to support it for JDBC connections, though.
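As a workaround for Q1, you can push the filter into the JDBC query itself with Spark's plain JDBC reader, which is also available inside a Glue Spark job. This is only a sketch; the connection details, table, and column names are hypothetical:
// Load only rows from a certain date onwards by embedding the filter in the JDBC query.
val historical = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://your-host:5432/yourdb")
  .option("dbtable", "(SELECT * FROM history WHERE created_at >= DATE '2019-07-01') AS filtered")
  .option("user", "your_user")
  .option("password", "your_password")
  .load()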
It's not possible to specify the name of the output files. However, it looks like you can rename the files after they are written (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of the output files, but you can control the number of files using coalesce. Also, starting from Spark 2.2, you can set the maximum number of records per file with the config spark.sql.files.maxRecordsPerFile.
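A minimal sketch of both options, assuming df is the DataFrame being written and the bucket path is a placeholder:
// Cap the number of records per output file (available since Spark 2.2).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 100000L)

// coalesce reduces the number of partitions, and therefore the number of output files.
df.coalesce(5)
  .write
  .mode("append")
  .parquet("s3://your-bucket/processed/")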

Spark saveAsTextFile to Azure Blob creates a blob instead of a text file

I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob storage.
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
//find the rows which have only one digit in the 7th column in the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
rdd1.saveAsTextFile("wasb:///HVACOut")
When I look at the output, it is not as a single text file but as a series of application/octet-stream files in a folder called HVACOut.
How can I output it as a single text file instead?
Well I am not sure you can get just one file without a directory. If you do
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")
you will get one file inside a directory called "HVACOut"; the file will be named something like part-00000. This is because your RDD is distributed across your cluster in what are called partitions. When you call save (any of the save functions), it creates one file per partition. So by calling coalesce(1) you're telling Spark you want 1 partition.
Hope this helps.
After you have finished provisioning an Apache Spark cluster on Azure HDInsight, you can go to the built-in Jupyter notebooks for your cluster at https://YOURCLUSTERNAME.azurehdinsight.net/jupyter.
There you will find a sample notebook with an example of how to do this.
Specifically, for scala, you can go to the notebook named "02 - Read and write data from Azure Storage Blobs (WASB) (Scala)".
Copying some of the code and comments here:
Note:
Because CSV is not natively supported by Spark, there is no built-in way to write an RDD to a CSV file. However, you can work around this if you want to save your data as CSV.
Code:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
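In that notebook, csvFile is presumably an RDD of already split fields (for example from a split(",") over each line), so mkString(",") rejoins the fields into a CSV line. On newer Spark versions (2.0+), CSV is supported natively through the DataFrame API, so a rough sketch with placeholder paths would be:
// CSV is natively supported by the DataFrame reader/writer in Spark 2.0+.
val df = spark.read.csv("wasb:///example/data/input.csv")  // placeholder input path
df.coalesce(1)                         // single partition, hence a single part file
  .write
  .csv("wasb:///example/data/output")  // the output is still a directory of part files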
Hope this helps!