Generate data from Azure Data Lake Store - Scala

I am new to Scala and Spark. I would like to load files from Azure Data Lake Store.
I want to load all the files whose names start with test.
I tried the following pattern:
folder/test[0-10000]*.csv
Can someone suggest a solution?
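For context, a minimal sketch of the usual Spark approach: the path given to the reader accepts Hadoop glob patterns, so a name prefix can be matched with test*.csv instead of a numeric range. The account name, folder, and credential setup below are placeholders for your environment.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LoadTestFiles")
  .getOrCreate()

// "test*.csv" matches every CSV in the folder whose name starts with "test".
// Replace the account and folder with your own lake path (adl:// for Gen1,
// abfss:// for Gen2); authentication must already be configured on the cluster.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("adl://<account>.azuredatalakestore.net/folder/test*.csv")

df.show(5)
```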

Related

Move Entire Azure Data Lake Folders using Data Factory?

I'm currently using Azure Data Factory to load flat file data from our Gen 2 data lake into Synapse database tables. Unfortunately, we receive (many) thousands of files into timestamped folders for each feed. I'm currently using Synapse external tables to copy this data into standard heap tables.
Since each folder contains so many files, I'd like to move (or Copy/Delete) the entire folder (after processing) somewhere else in the lake. Is there some practical way to do that with Azure Data Factory?
Yes, you can use a Copy activity with a wildcard. I reproduced this in my environment and got the results below.
First, add the source dataset and select the wildcard path with the folder name. In my scenario, the folder is named pool.
Then select the sink dataset with the target file path.
The pipeline run was successful: it transferred the files from one location to another with the required name.
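If the processing already runs on a Synapse Spark pool, an alternative to the Copy + Delete pattern is to move the whole folder with the Hadoop FileSystem API, which renames the folder rather than copying each file. A rough sketch, with hypothetical container and folder names:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical source and archive locations; adjust to your own container layout.
val src  = new Path("abfss://data@<account>.dfs.core.windows.net/feeds/pool")
val dest = new Path("abfss://data@<account>.dfs.core.windows.net/archive/pool")

// Get a FileSystem handle for the lake using the active Hadoop configuration.
val fs = FileSystem.get(src.toUri, spark.sparkContext.hadoopConfiguration)

// rename() moves the folder and everything under it in one operation,
// so the thousands of files are not copied and deleted one by one.
if (fs.exists(src)) fs.rename(src, dest)
```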

Merging data in Datalake

I'm working on a project where we need to bring data from a SQL Server database into a data lake.
I succeeded with a pipeline that ingests data from the source and loads it into the data lake in Parquet format.
My question is how to merge (upsert) new data from the data source into the existing file in the data lake.
You can use Azure Data Factory data flows, in which you can map the source file against other sources and overwrite the existing file. Unlike for databases, there is no upsert activity for files directly in ADF.
Reference:
https://learn.microsoft.com/en-us/answers/questions/542994/azure-data-factory-merge-2-csv-files-with-differen.html
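If the data lands as plain Parquet and you can run Spark against the lake, one common alternative to a data flow is to rewrite the file while keeping the newest version of each key. A minimal Scala sketch; the paths and the id key column are assumptions about your data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

val spark = SparkSession.builder().getOrCreate()

// Hypothetical paths and key column ("id"); adjust to your feed.
val existingPath = "abfss://lake@<account>.dfs.core.windows.net/sales/current"
val incomingPath = "abfss://lake@<account>.dfs.core.windows.net/sales/staging"
val mergedPath   = "abfss://lake@<account>.dfs.core.windows.net/sales/merged"

val existing = spark.read.parquet(existingPath).withColumn("src", lit(0))
val incoming = spark.read.parquet(incomingPath).withColumn("src", lit(1))

// For each key keep exactly one row, preferring the incoming (src = 1) version:
// this is the upsert.
val newestFirst = Window.partitionBy("id").orderBy(col("src").desc)
val merged = existing.unionByName(incoming)
  .withColumn("rn", row_number().over(newestFirst))
  .filter(col("rn") === 1)
  .drop("src", "rn")

// Parquet files are immutable, so the "merge" is really a rewrite: write the new
// snapshot to a fresh location (or swap folders afterwards) rather than in place.
merged.write.mode("overwrite").parquet(mergedPath)
```

If rewriting full snapshots becomes too slow, table formats such as Delta Lake add a true MERGE on top of Parquet.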

How to remove extra files when sinking CSV files to Azure Data Lake Gen2 with Azure Data Factory data flow?

I have completed the data flow tutorial. The sink currently creates 4 files in Azure Data Lake Gen2.
I suppose this is related to the HDFS file system.
Is it possible to save the output without the success, committed, and started files?
What is best practice? Should they be removed after saving to Data Lake Gen2?
Are they needed in further data processing?
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-data-flow
There are a couple of options available.
You can specify the output file name in the sink transformation settings.
Select Output to single file from the file name option dropdown and provide the output file name.
You could also parameterize the output file name as required. Refer to this SO thread.
You can also add a Delete activity after the data flow activity in the pipeline and delete the extra files from the folder.
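For comparison, if the CSV were written from a Spark notebook rather than a data flow sink, the same two ideas look roughly like this: a single output file via coalesce(1), and suppressing the _SUCCESS marker (the committed/started files come from the managed commit protocol underneath data flows and are not controlled by this setting). Paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Stop the Hadoop output committer from writing a _SUCCESS marker next to the data.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

val df = spark.read.option("header", "true")
  .csv("abfss://lake@<account>.dfs.core.windows.net/input/movies")  // placeholder input

// coalesce(1) mirrors the "Output to single file" sink option: one CSV part file
// instead of one file per partition (suitable for small outputs only).
df.coalesce(1)
  .write.mode("overwrite")
  .option("header", "true")
  .csv("abfss://lake@<account>.dfs.core.windows.net/output/movies")
```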

Copying files from Azure Blob Storage to Azure Data Lake Store

I am copying files from Azure Blob Storage to Azure Data Lake Store. I need to pick files from a year(folder)\month(folder)\day structure (the txt files arrive on a daily basis). I am able to copy one file with a hardcoded path, but I am not able to pick the file for each day and copy it to Azure Data Lake Store. Can anyone please help me?
I am using ADF V2 and the UI designer to create my connections, datasets, and pipeline. My steps, which are working fine, are:
copy the file from Blob Storage to Data Lake Store
pick that file from Data Lake Store and process it through U-SQL to transform the data
save that transformed data to Azure SQL DB
Please give me an answer; I am not able to get any help because all the help I find is in JSON, and I am looking for how to define and pass parameters in the UI designer.
Thanks
For the partitioned file path part, you could take a look at this post.
You could use the Copy Data tool to handle it.
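The same partitioned-path idea, sketched in Scala for anyone doing the daily pick-up from Spark instead of the Copy Data tool; in the ADF UI designer itself this is expressed with dataset parameters and an expression that formats the run date into the folder path. The account and folder names are placeholders.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Build today's partition folder in the year/month/day layout described above.
val today     = LocalDate.now()
val partition = today.format(DateTimeFormatter.ofPattern("yyyy/MM/dd"))

// Placeholder blob container and account; the * picks up all of the day's txt files.
val inputPath = s"wasbs://input@<account>.blob.core.windows.net/feed/$partition/*.txt"

val df = spark.read.text(inputPath)
df.show(5)
```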

Copy empty folders from an Azure Data Lake Store using Data Factory

I am using Data Factory v1 to copy folders from a source Data Lake Store to a destination Data Lake Store for backup purposes.
Unfortunately, it does not copy empty folders. I think this is by design, if I read this article correctly:
Note when recursive is set to true and sink is file-based store, empty folder/sub-folder will not be copied/created at sink
But for my backup this is not an option. Is it possible to also copy empty folders using Data Factory?