How to convert a list of dictionaries into a byte stream and load it into the database - PostgreSQL

My standard way to bulk-load CSV files into a Postgres database is to use psycopg2's copy_expert()
method:
cursor.copy_expert("copy %s from STDIN CSV HEADER NULL 'NULL' QUOTE '\"';" % (table_name), file=f)
Very often, before loading a file into the database, I run some pre-processing of the CSV file.
The results are always kept in a list of dictionaries.
Following my standard approach, I have to offload the list of dictionaries to a temporary CSV file and
only then hand that file to copy_expert() as the data source.
What I would like to do is replace the file source with some stream of bytes built
from the list of dictionaries, and then pass that to copy_expert() as the source.
In other words, I would like to skip the step of writing a temporary CSV file and load the data straight from memory.
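One way to do this is to build the CSV in memory with csv.DictWriter on top of io.StringIO and hand that buffer straight to copy_expert(). A sketch (table name, column names and connection string are placeholders; psycopg2 accepts any file-like object here):

import csv
import io

import psycopg2

rows = [
    {"id": 1, "name": "Alice", "city": None},
    {"id": 2, "name": "Bob", "city": "Berlin"},
]

# Write the pre-processed rows to an in-memory buffer instead of a temp file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "city"])
writer.writeheader()
for row in rows:
    # Emit missing values as the literal NULL string the COPY command expects.
    writer.writerow({k: ("NULL" if v is None else v) for k, v in row.items()})
buf.seek(0)  # rewind so copy_expert() reads from the start

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn, conn.cursor() as cursor:
    cursor.copy_expert(
        "copy my_table from STDIN CSV HEADER NULL 'NULL' QUOTE '\"';",
        buf,
    )

If you really want bytes rather than text, the same approach works with io.BytesIO, e.g. io.BytesIO(buf.getvalue().encode('utf-8')); copy_expert() reads from either kind of buffer.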

Related

aws_s3.query_export_to_s3 PostgreSQL RDS extension exporting all multi-part CSV files to S3 with a header

I'm using the aws_s3.query_export_to_s3 function to export data from an Amazon Aurora Postgresql database to S3 in CSV format with a header row.
This works.
However, when the export is large and outputs to multiple part files, the first part file has the CSV header row, and subsequent part files do not.
SELECT * FROM aws_s3.query_export_to_s3(
'SELECT ...',
aws_commons.create_s3_uri(...),
options:='format csv, HEADER true'
);
How can I make this export add the header row to all CSV file parts?
I'm using Apache Spark to load this CSV data and it expects a header row in each individual part file.
It's not possible, unfortunately.
The aws_s3.query_export_to_s3 function uses the PostgreSQL COPY command under the hood and then chunks the output into part files depending on size.
Unless the extension picks up on the HEADER true option, caches the header and then provides an option to apply it to every CSV file generated, you're out of luck.
The expectation is that the files are combined at the destination when downloaded, or that the file processor can read files in parts, or that the file processor only needs the header once.
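If you control the post-processing, one workaround is to prepend the header to the parts that lack it after the export. A rough sketch using boto3 (bucket name and prefix are placeholders; it assumes the parts sort so that the header-bearing file comes first, that the listing fits in one page, and that each part fits in memory):

import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "exports/my_query"  # placeholders

# List the exported part files (first page only); lexicographic order is
# assumed to put the header-bearing first part at the front.
keys = sorted(
    obj["Key"]
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
)

first = s3.get_object(Bucket=bucket, Key=keys[0])["Body"].read()
header = first.split(b"\n", 1)[0] + b"\n"

for key in keys[1:]:
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    if not body.startswith(header):
        # Rewrite the part with the header line prepended.
        s3.put_object(Bucket=bucket, Key=key, Body=header + body)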

How to create unique list of codes from multiple files in multiple subfolders in ADF?

We have a folder structure in data lake like this:
folder1/subfolderA/parquet files
folder1/subfolderB/parquet files
folder1/subfolderC/parquet files
folder1/subfolderD/parquet files
etc.
All the parquet files have the same schema, and all the parquet files have, amongst other fields, a code field, let's call it code_XX.
Now I want the distinct values of code_XX from all parquet files in all folders.
So if code_XX value 'A345' appears multiple times in the parquet files in subfolderA and subfolderC, I only want it once.
Output must be a Parquet file with all unique codes.
Is this doable in Azure Data Factory, and how?
If not, can it be done in Databricks?
You can try the approach below.
Set the source folder path to recursively look for all parquet files, and choose a column to store the file names.
As it seems you only need the file names in the output parquet file, use a Select transformation to forward only that column.
Use an expression in a Derived Column to get the file names from the path string.
distinct(array(right(fileNames,locate('/',reverse(fileNames))-1)))
If you have access to SQL, it can be done with two copy activities, no need for data flows.
Copy Activity 1 (Parquet to SQL): Ingest all files into a staging table.
Copy Activity 2 (SQL to Parquet): Select DISTINCT code_XX from the staging table.
NOTE:
Use Mapping to only extract the column you need.
Use a wildcard file path with the recursive option enabled to copy all files from subfolders. https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage?tabs=data-factory#blob-storage-as-a-source-type
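As for the Databricks part of the question: yes, it can be done there too. A minimal PySpark sketch (the abfss paths are placeholders; recursiveFileLookup needs Spark 3.0+):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every parquet file under folder1, across all subfolders.
df = (
    spark.read
    .option("recursiveFileLookup", "true")
    .parquet("abfss://container@account.dfs.core.windows.net/folder1/")
)

# Keep only the code column, de-duplicate, and write the result as parquet.
(
    df.select("code_XX")
    .distinct()
    .coalesce(1)
    .write.mode("overwrite")
    .parquet("abfss://container@account.dfs.core.windows.net/output/unique_codes")
)

coalesce(1) forces a single output part file, which matches the requirement of one Parquet file with all unique codes.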

ADF / Dataflow - Convert Multiple CSV to Parquet

In ADLS Gen2, the TextFiles folder has 3 CSV files. Column names are different in each file.
We need to convert all 3 CSV files to 3 parquet files and put them in the ParquetFiles folder.
I tried to use Copy Activity and it fails because the column names have spaces in them and Parquet doesn't allow that.
To remove the spaces, I used a Data flow: Source -> Select (replace space with underscore in column names) -> Sink. This worked for a single file. When I tried to do it for all 3 files, it tried to merge the 3 files and generated a single file with incorrect data.
How can I solve this, mainly removing the spaces from the column names in all files? What would the other options be here?
Pipeline: ForEach activity (loop over the CSV files in the folder and pass the current iteration item to the data flow as a parameter) -> Data Flow activity with a source that points to that folder (parameterize the file name in the source path).
I created 2 datasets, one in CSV with a wildcard format, the other in Parquet. I used the Copy Data activity with the parquet dataset as sink and the csv dataset as source, and set the copy behavior to Merge files.
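Another option, if Databricks is available, is a small PySpark job that fixes the column names and writes one Parquet output per CSV. A sketch with placeholder paths and file names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src_root = "abfss://container@account.dfs.core.windows.net/TextFiles"
dst_root = "abfss://container@account.dfs.core.windows.net/ParquetFiles"

for name in ["file1.csv", "file2.csv", "file3.csv"]:  # placeholder file names
    df = spark.read.option("header", True).csv(f"{src_root}/{name}")
    # Parquet rejects spaces in column names, so replace them with underscores.
    for col in df.columns:
        df = df.withColumnRenamed(col, col.replace(" ", "_"))
    df.write.mode("overwrite").parquet(f"{dst_root}/{name.rsplit('.', 1)[0]}")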

Filename as Column using Data Factory V2

I have a lot of JSON files in Blob Storage and what I would like to do is load the JSON files via Data Factory V2 into SQL Data Warehouse, with the filename in a column for each JSON file. I know how to
do this in SSIS but I am not sure how to replicate it in Data Factory.
e.g. File Name: CornerShop.csv becomes CornerShop in the filename column in SQL Data Warehouse.
Firstly, please see the limitations of copy activity column mapping:
Source data store query result does not have a column name that is specified in the input dataset "structure" section.
Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
Either fewer columns or more columns in the "structure" of sink dataset than specified in the mapping.
Duplicate mapping.
So, I don't think you can do the data transfer plus the file name in one go. My idea is:
1. First use a GetMetadata activity. It should get the file paths of each file you want to copy. Use "Child Items" in the Field list.
2. On success of the GetMetadata activity, run a ForEach activity. For the ForEach activity's Items, pass the list of file paths.
3. Inside the ForEach activity's Activities, place the Copy activity. Reference the iterated item by @item() or @item().name in the blob storage source file name.
4. Meanwhile, pass the filename as a parameter to a stored procedure. In the stored procedure, merge the filename into the fileName column.

SAS macro to read multiple rawdata files and create multiple SAS dataset for each raw data file

Hi there
My name is Chandra. I am not very good at SAS macros, especially the looping part and resolving && etc. Here is my problem statement.
Problem statement:
I have a large number of raw data files (.dat files) stored in a folder on a SAS server. I need a macro that can read each of these raw data files, create a SAS data set for each one, and store them in a separate target folder on the SAS server. All these raw data files have the same file layout. I need to automate this so that every week the macro reads the raw data files from the source folder, creates the corresponding SAS datasets and stores them in the target folder on the SAS server. For example, if there are 200 raw data files in the source folder, I want to read them and create 200 SAS datasets, one for each raw data file, and save them in the target folder. I am not very good at constructing looping statements or at resolving && or &&& etc. How do I do it?
I would highly appreciate your kind assistance in this regard.
Respectfully
Chandra
You don't necessarily need to use a macro or a loop if your files have the same fields. You can try the pipe option and the filename keyword. Here is the link
You do not need a macro for this type of processing.
The INFILE statement will accept a file specification that includes operating system wildcards.
This example creates 200 text files in the work folder and then reads them back in in a single step.
I highly recommend not creating 200 separate data sets. Instead keep the filename, or a unique portion thereof, as a categorical variable that can be used later in a CLASS or BY statement, or as part of the criteria of a sub-setting WHERE.
%let workpath = %sysfunc(pathname(WORK));
* create something to input;
data _null_;
  do i = 0 to 1999;
    if mod(i,10) = 0 then filename = cats("&workpath./",'sample',i/10,'.txt');
    file sample filevar=filename;
    x = i; y = x**2;
    put i x y;
  end;
run;

* input data from 200 different files that have the same layout;
data samples;
  length filename $250;
  infile "&workpath.\*.txt" filename=filename; %* <-- Here be the wildcards;
  input i x y;
  source = filename;
run;