Snowflake: COPY command not generating constant size for multiple files while unloading - snowflake-schema

copy into #elasticsearch/product/s3file
from (select object_construct(*) from mytable)
file_format = (type = json, COMPRESSION=NONE),
overwrite=TRUE, single = False,
max_file_size=5368709120;
The table has 2 GB of data.
I want to split it into 100 MB files to be stored in S3, but S3 ends up with files of uneven sizes.
The expectation is to have multiple files of roughly 100 MB each.
I need this for a performance improvement when indexing into Elasticsearch; I'm using smart_open for multiprocessing, so evenly sized files will be convenient to handle.
Thanks

You would only get identical file sizes if every value in each column was exactly the same size.
For example, if your table had firstname and lastname columns and one record had values of "John" "Smith" and another record had values of "Michael" "Gardner" then, if each record was written to a different file, the resulting JSON files would be different sizes as John is a different size to Michael and Smith is a different size to Gardner.
You can also only control the maximum size of a file, not the actual file size. If you had written 10 records to a file, resulting in a file size of 950 MB, and the next record would be 100 MB in size, then it would be written to a new file and the original file would remain at 950 MB.
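As a hedged illustration (reusing the stage and table names from the question, and assuming the target is a named stage referenced with @), you can lower MAX_FILE_SIZE to roughly 100 MB, but it remains an upper bound per file rather than a guaranteed size:
copy into @elasticsearch/product/s3file
from (select object_construct(*) from mytable)
file_format = (type = json compression = none)
overwrite = true
single = false
max_file_size = 104857600;  -- ~100 MB cap per file; actual sizes will still vary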

It's not S3 that splits the files; it's the Snowflake warehouse size that determines the split when you use SINGLE=FALSE in the COPY command. As the warehouse size grows, the number of files increases.
Example
Suppose you run your query with an XS warehouse and it produces 8 files on S3; if you use an M warehouse it will create 16 files on S3. The split happens in parallel, so the size of each file may vary. It does not fill a file up to the max limit given in the COPY command and then start another one.

Related

How to split data into multiple output files based on the value of a given column

Using Talend Open Studio for Data Integration.
How can I split one Excel file into multiple outputs based on the values of a given column?
Example
Example of data in input.xlsx:
ID; Category
1; AAA
2; AAA
3; BBB
4; CCC
Example of output files:
AAA.xlsx contains IDs 1 and 2
BBB.xlsx contains ID 3
CCC.xlsx contains ID 4
What I tried:
tfilelist-->tinputexcel-->tuniqrows-->tflowtoiterate-->tfileinputexcel-->tfilterow-->tlogrow
In order to perform these actions:
Browse a folder of Excel files
Iterate to open each Excel file
Get the unique values in the Excel files (in the column used for the split)
Iterate to generate the split files from the unique values, with tfilterow filtering the Excel file, and that's where I get a Garbage Collector error
Exception in component tFileInputExcel_4 (automatisation_premed)
java.io.IOException: GC overhead limit exceeded
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Talend's job diagram
Does someone have an idea to optimize this Talend workflow and solve the GC error? Thanks for the support.
Finally, I think we must not iterate on an Excel input, as opening the same file twice is a problem both on Windows and in the designed job, so a workaround should be:
Talend diagram for the job
There are multiple ways to tackle this type of thing in Talend. One approach is to store the Excel file somewhere after loading (Database, CSV, Hash, etc).
An alternative approach is to aggregate -> iterate -> normalize the data like so:
In tAggregateRow you want to group by the field containing the 'base' of your file name (Category in this case):
The aggregate function should be 'list' (with an appropriate delimiter not already contained in your Id column):
Feed the aggregated output into a tFlowToIterate to loop over each Category:
tFixedFlow can be used to output each of the aggregates to an independent flow:
Use tNormalize to dump the single Category row into one row per Id by normalizing the 'list' column:
Set the tFileOutputExcel file name to be the current iteration's Category as defined in tFlowToIterate:
Final result is one file per Category with one row per Id:

Update a serialized table on disk in kdb

I have a serialized table on disk which I want to update based on a condition.
One way of doing so is by loading the table into memory, updating it, and then serializing it again to disk.
Eg:
q)`:file set ([] id:10 20; l1:("Blue hor";"Antop"); l2:("Malad"; "KC"); pcd:("NCD";"FRB") );
q)t:get `:file;
q)update l1:enlist "Chin", l2:enlist "Gor" from `t where id=10;
q)`:file set t;
I tried updating the table directly on disk but received type error:
q)update l1:enlist "Chin", l2:enlist "Gor" from `:file where id=10
'type
[0] update l1:enlist "Chin", l2:enlist "Gor" from `:file where id=10
Is there a way in which we can update the serialized table directly on disk? (In some cases we don't have enough primary memory to load the serialized table.)
If you save your table as one flat file, then the whole table has to be loaded in, updated and then written down, requiring enough memory to hold the full table.
To avoid this you can splay your table by adding a trailing / in your filepath, ie
`:file/ set ([] id:10 20; l1:("Blue hor";"Antop"); l2:("Malad"; "KC"); pcd:("NCD";"FRB") );
If symbol columns are present, they will need to be enumerated using .Q.en.
This will split the table vertically and save your columns as individual files, under a directory called file. Having the columns saved as individual files allows you to only load in the columns required as opposed to the entire table, resulting in smaller memory requirements. You only need to specify the required columns in a select query.
https://code.kx.com/q4m3/14_Introduction_to_Kdb%2B/#142-splayed-tables
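For example (a minimal sketch, reusing the splayed `:file/ written above), selecting only the columns you need from the mapped table reads only those columns from disk:
q)t:get`:file/                       / memory-maps the splayed table
q)select l1,l2 from t where id=10    / only the id, l1 and l2 columns are read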
You can further split your data horizontally by partitioning, resulting in smaller subsets again if individual columns are too big.
https://code.kx.com/q4m3/14_Introduction_to_Kdb%2B/#143-partitioned-tables
When you run
get`:splayedTable
this memory-maps the table (assuming your columns are mappable) rather than copying it into memory, which you can see from .Q.w[].
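For instance (a small check, assuming the splayed table written above):
q).Q.w[]`used        / note the memory currently used
q)t:get`:file/       / map the splayed table
q).Q.w[]`used        / barely changes, since the columns are mapped rather than copied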
You could do
tab:update l1:enlist "Chin", l2:enlist "Gor" from (select id, l1,l2 from get`:file) where id=10
`:file/l1 set tab`l1
`:file/l2 set tab`l2
If loading only the required columns for your query into memory is still too big, you can load them one at a time. First load id and find the required indices (where id=10), delete id from memory, load in l1 and modify using the indices,
@[get`:file/l1;indices;:;enlist"Chin"]
write it down and then delete it from memory. Then do the same with l2. This way you would have at most one column in memory. Ideally your table would be appropriately partitioned so you can hold the data in memory.
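A rough sketch of that one-column-at-a-time flow (assuming the splayed `:file/ from above and, as in the example, a single row matching id=10):
ind:where 10=get`:file/id                / load only the id column and find matching rows
l1:@[get`:file/l1;ind;:;enlist"Chin"]    / load l1 and amend it in memory
`:file/l1 set l1                         / write it back down
delete l1 from `.                        / free l1 before loading the next column
l2:@[get`:file/l2;ind;:;enlist"Gor"]
`:file/l2 set l2
delete l2 from `.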
You can also directly update vectors on disk, which avoids having to rewrite the whole file. For example:
ind:exec i from get`:file where id=10
@[`:file/l1;ind;:;enlist"Chin"]
Though there are some restrictions on the file, which are mentioned in the link below:
https://code.kx.com/q/ref/amend/#functional-amend

Check the details of columns which have attributes applied to them

In splayed tables we can find the details/order of the columns in the .d file.
I was searching for whether there is any file which maintains the attribute information for the columns of a table.
How can we find the details of the attributes in the file system?
t:([] a:1 2 3; b:4 5 6; c:`a`b`c)
`:/home/st set .Q.en[`:/home/st;t]
get `:/home/st/.d / Output - `a`b`c
@[`:/home/st/;`a;`s#] / Is there any place in the file system where we can find the attribute applied to a column?
meta get `:/home/st/ / Shows that attribute s is applied to column a
Attribute details are stored in the column file itself. For example, in your case the file /home/st/a will contain the sorted-attribute information.
But since these files are serialized data (binary format), and the structure of the splayed binary data is not open, we cannot get the attribute information directly from the file.
You can actually read the attributes from columns on disk; it's just not recommended (and potentially subject to change):
q){(0x0001020304!``s`u`p`g)first read1(x;3;1)}`:st/a
`s
q){(0x0001020304!``s`u`p`g)first read1(x;3;1)}`:st/b
`

Copying untitled columns from a TSV file to PostgreSQL?

By TSV I mean a file delimited by tabs. I have a pretty large (6 GB) data file that I have to import into a PostgreSQL database. Out of 56 columns, the first 8 are meaningful; of the other 48, several columns (7 or so) have 1's sparsely distributed, with the rest being 0's.
Is there a way to specify which columns in the file you want to copy into the table? If not, then I am fine with importing the whole file and just extracting the desired columns to use as data for my project, but I am concerned about allocating excessively large memory to a table in which less than 1/4 of the data is meaningful. Will this pose an issue, or will I be fine accommodating the meaningful columns in my table? I have considered using that table as a temp table and then importing the meaningful columns into another table, but I have been instructed to try to avoid an intermediary cleaning step, so I should be fine directly using the large table if it won't cause any problems in PostgreSQL.
With PostgreSQL 9.3 or newer, COPY accepts a program as input. This option is precisely meant for that kind of pre-processing. For instance, to keep only the tab-separated fields 1 to 4 and 7 from a TSV file, you could run:
COPY destination_table FROM PROGRAM 'cut -f1-4,7 /path/to/file' (format csv, delimiter E'\t');
This also works with \copy in psql, in which case the program is executed client-side.
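Applied to the layout described in the question (56 tab-separated columns, of which the first 8 matter plus a few of the sparse ones), a sketch might look like the following; the field positions 10 and 15 are placeholders, not taken from the question:
COPY destination_table FROM PROGRAM 'cut -f1-8,10,15 /path/to/file' (format csv, delimiter E'\t');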

load data to db2 in a single row (cell)

I need to load an entire file (containing only ASCII text) into the database (DB2 Express edition). The table has only two columns (ID, TEXT). The ID column is the PK, with auto-generated data, whereas TEXT is CLOB(5): I have no idea about the parameter 5; it was entered by default in Data Studio.
Now I need to use the load utility to save a text file (containing 5 MB of data) in a single row, namely in the column TEXT. I do not want the text to be broken into different rows.
Thanks in advance for your answers!
Firstly, you may want to redefine your table: CLOB(5) means you expect 5 bytes in the column, which is hardly enough for a 5 MB file. After that you can use the DB2 IMPORT or LOAD commands with the lobsinfile modifier.
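As a hedged sketch (using the placeholder names from the IMPORT example further down, and a CLOB size large enough for a 5 MB file), the redefined table could look like:
CREATE TABLE yourtable (
  id             INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
  yourclobcolumn CLOB(10M),
  PRIMARY KEY (id)
);
With GENERATED ALWAYS on the identity column, the identitymissing modifier mentioned below lets IMPORT fill in the IDs for you.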
Create a text file and place LOB Location Specifiers (LLS) for each file you want to import, one per line.
An LLS is a way to tell IMPORT where to find LOB data. It has this format: <file path>[.<offset>.<length>/], e.g. /tmp/lobsource.dta.0.100/ to indicate that the first 100 bytes of the file /tmp/lobsource.dta should be loaded into the particular LOB column. Notice also the trailing slash. If you want to import the entire file, skip the offset and length part. LLSes are placed in the input file instead of the actual data for each row and LOB column.
So, for example:
echo "/home/you/yourfile.txt" > /tmp/import.dat
Since you said the IDs will be auto-generated, you don't need to include them in the input file; just don't forget to use the appropriate command modifier: identitymissing or generatedmissing, depending on how the ID column is defined.
Now you can connect to the database and run the IMPORT command, e.g.
db2 "import from /tmp/import.dat of del
modified by lobsinfile identitymissing
method p (1)
insert into yourtable (yourclobcolumn)"
I split the command onto multiple lines for readability, but you should type it on a single line.
method p (1) means parse the input file and read the column in position 1.
More info in the manual