Check the details of columns which have attributes applied to them - kdb

In splayed tables, the details/order of columns can be found in the .d file.
I was trying to find out whether there is any file that maintains the attribute information for the columns of a table.
How can we find the attribute details in the file system?
t:([] a:1 2 3; b:4 5 6; c:`a`b`c)
`:/home/st/ set .Q.en[`:/home/st;t]
get `:/home/st/.d / Output - `a`b`c
@[`:/home/st/;`a;`s#] / Is there any place in the file system where we can find the attribute applied to a column?
meta get `:/home/st/ / Shows that the s attribute is applied to column a

Attribute details are stored in the column file itself. For example, in your case the file /home/st/a will contain the sorted-attribute information.
But since these files hold serialized data (a binary format), and the structure of splayed binary data is not publicly documented, we cannot get the attribute information directly from the file.

You can actually read the attributes from columns on disk; it's just not recommended (and potentially subject to change):
q){(0x0001020304!``s`u`p`g)first read1(x;3;1)}`:/home/st/a
`s
q){(0x0001020304!``s`u`p`g)first read1(x;3;1)}`:/home/st/b
`
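Extending that, a minimal sketch (same caveat: the on-disk layout is undocumented and may change) that reads the attribute byte of every column listed in the .d file, using the paths from the question:
q)readattr:{(0x0001020304!``s`u`p`g)first read1(x;3;1)} / byte at offset 3 of a column file holds the attribute code
q)c:get `:/home/st/.d / column order from the .d file
q)c!readattr each ` sv/: `:/home/st,/:c / expect a -> `s, b and c -> `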

How to split data into multiple output files based on the value of a given column

Using Talend Open Studio for Data Integration.
How can I split one Excel file into multiple output files based on the values of a given column?
Example of data in input.xlsx:
ID; Category
1; AAA
2; AAA
3; BBB
4; CCC
Example of output files:
AAA.xlsx contains ID 1 and 2
BBB.xlsx contains ID 3
CCC.xlsx contains ID 4
What I tried:
tFileList-->tFileInputExcel-->tUniqRow-->tFlowToIterate-->tFileInputExcel-->tFilterRow-->tLogRow
In order to perform these actions:
Browse a folder of Excel files
Iterate to open each Excel file
Get the unique values in the Excel file (on the column used for the split)
Iterate to generate the split files from the unique values, using tFilterRow to filter the Excel file, and that's where I get an error about the Garbage Collector
Exception in component tFileInputExcel_4 (automatisation_premed)
java.io.IOException: GC overhead limit exceeded
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Talend's job diagram
Does someone have an idea how to optimize this Talend workflow and solve the GC error? Thanks for the support.
Finally, I think we must not iterate over an Excel input, as opening the same file twice is a problem both on Windows and in the designed job, so a workaround should be:
Talend diagram for the job
There are multiple ways to tackle this type of thing in Talend. One approach is to store the Excel file somewhere after loading (Database, CSV, Hash, etc).
An alternative approach is to aggregate -> iterate -> normalize the data like so:
In tAggregateRow you want to group by the field containing the 'base' of your file name (Category in this case):
The aggregate function should be 'list' (with an appropriate delimiter not already contained in your Id column):
Feed the aggregated output into a tFlowToIterate to loop over each Category:
tFixedFlow can be used to output each of the aggregates to an independent flow:
Use tNormalize to dump the single Category row into one row per Id by normalizing the 'list' column:
Set the tFileOutputExcel file name to be the current iteration's Category as defined in tFlowToIterate (see the sketch below):
Final result is one file per Category with one row per Id.
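As an illustration only, the "File Name" expression in tFileOutputExcel typically looks something like the line below; the output directory and the row2 flow name are assumptions that depend on how the flows in your job are named:
"/path/to/output/" + ((String) globalMap.get("row2.Category")) + ".xlsx"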

Google Cloud Data Fusion is appending a column to original data

When I am loading encrypted data from a GCS source to a GCS sink, there is one additional column getting added.
Original data
Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
1,Vinay,Argekar,01/01/2017,India
2,Thirukkumaran,Haridass,02/02/2017,USA
3,David,Wu,03/04/2000,Canada
4,Vinod,Kumar,04/02/2002,India
5,Joshua,Abraham,04/15/2010,France
6,Allaudin,Dastigar,09/24/2012,UK
7,Senthil,Kumar,08/15/2009,Germany
8,Sudha,Narayanan,12/14/2016,India
9,Ravi,Prasad,11/11/2011,Costa Rica
Data that came to the file after running the pipeline:
0,Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
91,1,Vinay,Argekar,01/01/2017,India
124,2,Thirukkumaran,Haridass,02/02/2017,US
164,3,David,Wu,03/04/2000,Canada
193,4,Vinod,Kumar,04/02/2002,India
224,5,Joshua,Abraham,04/15/2010,France
259,6,Allaudin,Dastigar,09/24/2012,UK
293,7,Senthil,Kumar,08/15/2009,Germany
328,8,Sudha,Narayanan,12/14/2016,India
363,9,Ravi,Prasad,11/11/2011,Costa Rica
The first column, 0, was not present in the original file.
When you are configuring the GCS source, did you specify the Format to be CSV or was it left as Text? When the Format is Text, the output schema actually contains an offset, which is the first column that you see in the output data. When you specify the format to be CSV, you have to specify the output schema of the file.

NiFi: How to move CSV content and its metadata to a single table in a Postgres database using NiFi

I have CSV files and I want to move the content of the files along with their metadata (file name, source (to be hard coded), and control number (part of the file name, to be extracted from the file name itself)) using NiFi. So here is the sample file name and layout -
File name - 12345_user_data.csv (control_number_user_data.csv)
source - Newyork
CSV File Content/columns - 
Fields - abc1, abc2, abc3, abc4 
values - 1,2,3,4
Postgres Database table layout
Table name - User_Education
fields name  -
control_number, file_name, source, abc1, abc2, abc3, abc4
Values - 
12345, 12345_user_data.csv, Newyork, 1,2,3,4
I am planning to use the processors below -
ListFile
FetchFile
UpdateAttribute
PutDatabaseRecord
LogAttribute
But I am not sure how to combine the actual content with the metadata to load into one single table. Please help.
You can use UpdateRecord before PutDatabaseRecord to add the control_number, file_name, and source fields to each record, setting the "Replacement Value Strategy" property to "Literal Value" and using Expression Language to set the values to the corresponding attributes.
For example, you could have a user-defined property /file_name set to ${filename}, which will add the file_name field to each record and set its value to whatever is in the "filename" attribute of the FlowFile.
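A hedged sketch of what the three user-defined properties on UpdateRecord could look like (the substringBefore expression for the control number is an assumption about extracting the prefix from names like 12345_user_data.csv):
/control_number = ${filename:substringBefore('_')}
/file_name = ${filename}
/source = Newyork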

load data to db2 in a single row (cell)

I need to load an entire file (it contains only ASCII text) into the database (DB2 Express edition). The table has only two columns (ID, TEXT). The ID column is the PK, with auto-generated data, whereas TEXT is CLOB(5): I have no idea about the length parameter 5, it was entered by default by Data Studio.
Now I need to use the load utility to save a text file (containing 5 MB of data) in a single row, namely in the column TEXT. I do not want the text to be broken into different rows.
Thanks for your answer in advance!
Firstly, you may want to redefine your table: CLOB(5) means you expect 5 bytes in the column, which is hardly enough for a 5 MB file. After that you can use the DB2 IMPORT or LOAD commands with the lobsinfile modifier.
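For illustration, a hedged sketch of such a redefinition (the 10M size and exact options are assumptions; the names match the IMPORT example further down):
CREATE TABLE yourtable (
  id INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
  yourclobcolumn CLOB(10M),
  PRIMARY KEY (id)
)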
Create a text file and place LOB Location Specifiers (LLS) for each file you want to import, one per line.
An LLS is a way to tell IMPORT where to find LOB data. It has this format: <file path>[.<offset>.<length>/], e.g. /tmp/lobsource.dta.0.100/ to indicate that the first 100 bytes of the file /tmp/lobsource.dta should be loaded into the particular LOB column. Notice also the trailing slash. If you want to import the entire file, skip the offset and length part. LLSes are placed in the input file instead of the actual data for each row and LOB column.
So, for example:
echo "/home/you/yourfile.txt" > /tmp/import.dat
Since you said the IDs will be generated automatically, you don't need to enter them in the input file; just don't forget to use the appropriate command modifier: identitymissing or generatedmissing, depending on how the ID column is defined.
Now you can connect to the database and run the IMPORT command, e.g.
db2 "import from /tmp/import.dat of del
modified by lobsinfile identitymissing
method p (1)
insert into yourtable (yourclobcolumn)"
I split the command onto multiple lines for readability, but you should type it on a single line.
method p (1) means parse the input file and read the column in position 1.
More info is in the manual.

Splitting a column into multiple columns in MongoDB

I have a dictionary field in a MongoDB document which contains values that are separated by a semicolon. Is there any query that I could use to split the column into multiple columns?
The scenario is that I load in contents from a CSV file which sometimes has columns that are delimited by characters like a semicolon. Since I will have to support any kind of input CSV file, I cannot fix anything in the schema. Thus I have a dictionary field called "content" that stores the document contents as a dictionary. Now I need to be able to perform splits on columns that have multiple values.
E.g. the Author Names column has entries like Author1;Author2;Author3. The user should be able to split this into 3 columns, one for each author.
Edit: For now, I have implemented this by means of a process on the server side. Ideally it would be great if I could do this in MongoDB itself (speed constraints).
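For illustration, a hedged sketch of what such a split could look like with the aggregation framework (assuming MongoDB 3.4+ for $split; the docs collection name and the content.authors field path are placeholders for your actual names):
db.docs.aggregate([
  // split the semicolon-delimited string into an array
  { $addFields: { authorList: { $split: [ "$content.authors", ";" ] } } },
  // lift each array element into its own field
  { $addFields: {
      author1: { $arrayElemAt: [ "$authorList", 0 ] },
      author2: { $arrayElemAt: [ "$authorList", 1 ] },
      author3: { $arrayElemAt: [ "$authorList", 2 ] }
  } }
])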