How do I use a shared embedding for one-hot encoded data in a CSV? - tf.keras

I have multiple one-hot encoded columns in a CSV file
(my one-hot encoded columns)
I had to one-hot encode them manually because each cell contained comma-separated values, as shown here (the column that was one-hot encoded). How do I create one embedding using tensorflow.keras.layers.Embedding for all the columns?
What should be the input shape for that embedding? It would be helpful if I could get some example code.
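Here is a minimal sketch of one common approach, assuming all the columns share a single vocabulary (vocab_size and embedding_dim below are made-up numbers). The input shape is simply (vocab_size,), i.e. the width of the multi-hot vector, and multiplying a multi-hot row by the shared embedding matrix sums the embeddings of its active categories:

import tensorflow as tf

vocab_size = 10     # assumed: total distinct categories across all columns
embedding_dim = 4   # assumed embedding size

# Two example rows of manually multi-hot encoded data, shape (batch, vocab_size).
multi_hot = tf.constant([[0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
                         [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=tf.float32)

# One Embedding layer shared by all columns; its weight matrix has
# shape (vocab_size, embedding_dim).
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)
embedding.build((None,))  # create the weights without doing an index lookup

# Multiplying the multi-hot matrix by the embedding matrix sums the embedding
# vectors of the active categories in each row, so every row yields one
# fixed-size vector no matter how many categories are set.
pooled = tf.matmul(multi_hot, embedding.embeddings)
print(pooled.shape)  # (2, 4)

An equivalent formulation is a Dense(embedding_dim, use_bias=False) layer applied to the multi-hot input, which learns the same shared matrix.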

Related

How to add header to file in Azure Data Factory

I am storing the header in a CSV file and concatenating it with the data file using a mapping data flow.
I am using a Union activity to combine these two files. While combining the header file and the data file, I can see the data, but the header row is not at the top; it appears at a random position in the sink file.
How can I force the header to be at the top?
1,Company_Reference_ID,PH2_TIC,USD,PH2_Actuals,2021-03-12,PH2_X_VB_V3,85738,V3Premium
1,Company_Reference_ID,PH2_TIC,USD,PH2_Actuals,2021-03-12,PH2_X_VB_V3,85738,V3Premium
Journey,CompanyReferenceIDType,CompanyReferenceID,Currency,LedgerType,AccountingDate,JournalSource
1,Company_Reference_ID,PH2_TIC,USD,PH2_Actuals,2021-03-12,PH2_X_VB_V3,85738,V3Commission
Update:
My debug result is as follows, I think it is what you want:
I created a simple test to merge two csv files: one header.csv and another values.csv.
As Mark Kromer MSFT said, we can use a Surrogate Key and then sort the rows. The Row_No of header.csv will start from 1 and that of values.csv will start from 2.
Set the header source to header.csv and don't select "First row as header".
Set the values source to values.csv and don't select "First row as header".
At the SurrogateKey1 activity, enter Row_No as the key column and 1 as the start value.
At the SurrogateKey2 activity, enter Row_No as the key column and 2 as the start value.
Then we can union the SurrogateKey1 stream and the SurrogateKey2 stream at the Union1 activity.
Then we can sort the rows by Row_No at the Sort1 activity.
Finally, we can use the Select1 activity to remove the Row_No column.
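Outside ADF, the same surrogate-key-then-sort idea can be sketched in a few lines of pandas (file names are hypothetical), which shows why sorting on the key puts the header first:

import pandas as pd

header = pd.read_csv("header.csv", header=None)   # hypothetical header file
values = pd.read_csv("values.csv", header=None)   # hypothetical data file

header["Row_No"] = 1                          # surrogate key starting at 1
values["Row_No"] = range(2, len(values) + 2)  # surrogate key starting at 2

merged = (pd.concat([header, values])   # union the two streams
            .sort_values("Row_No")      # sort so the header row comes first
            .drop(columns="Row_No"))    # select away the key column
merged.to_csv("sink.csv", index=False, header=False)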
For now, you would need to use a Surrogate Key for the different streams, make sure the header row has 1 for the surrogate key value, and sort by that column.
We are working on a feature for adding a header to the delimited text sink as a property in the data flow Sink. That will make this much easier and should light up in the UI soon.

Azure ADF Copy Activity with Trailing Column Delimiter

I have a strange source CSV file where it contains a trailing column delimiter at the end of each record just before the carriage return/new line.
When ADF previews this data, it displays only 2 columns and all the data rows without issue. However, the copy activity fails with the following exception.
ErrorCode=DelimitedTextColumnNameNotAllowNull,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The name of column index 3 is empty. Make sure column name is properly specified in the header
Now I understand why it's complaining about this due to the trailing delimiter, but my question is whether there is a way to deal with this condition. I've tried including the trailing comma in the record delimiter (,\r\n), but then it just pivots the data so that all the columns become rows.
Is there a way to address this condition in the copy activity?
When previewing the data in the dataset, it seems correct. But in the copy activity, the data is actually split into 3 columns by the column delimiter ",", and the third column is empty or NULL. This causes the error.
If you use Data Flow and import the projection from the source, you can see the third column.
For now, the copy activity doesn't support modifying the data schema. You must use a Data Flow Derived Column transformation to create a new schema for the source. Mapping the new columns/schema to the sink will then solve the problem.
HTH.
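If pre-processing the file before the copy activity is an option, a minimal Python sketch (file names are made up) that strips the trailing delimiter from every record would also avoid the empty third column:

with open("source.csv") as src, open("clean.csv", "w") as dst:
    for line in src:
        # Drop the line ending, then the trailing comma, so only the
        # two real columns remain.
        dst.write(line.rstrip("\r\n").rstrip(",") + "\n")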
Use a different encoding for your CSV. CSV UTF-8 will do the trick.

Copy text file using postgres with custom delimiter by character size

I need to copy a text file which has a confusing delimiter. I believe the delimiter is a space. However, some of the column values are empty, and I cannot tell which column they belong to, which makes it harder to load the data into the database since the spaces don't indicate anything. Thus, when I try to COPY, the mapping is not right and I get ERROR: extra data after last expected column.
I have tried changing the delimiter to a comma and such, but I still get the same error. The command below works when I load some dummy data with a proper delimiter.
COPY usm00070219(HEADREC_ID,YEAR,MONTH,DAY,HOUR,RELTIME,NUMLEV,P_SRC,NP_SRC,LAT,LON) FROM 'D:\....\USM00070219-data.txt' DELIMITER ' ';
This is example data:
It should have 11 columns, but the first row has only 10 values and there is no way to identify the empty column. The spacing is not helpful at all!
Is there any way I can separate the columns by character size as delimiter and force the data to be divided by the size given?
COPY is not made to handle fixed-width text files. I can think of two options:
Load the file as it is into a table with a single text column using COPY. Then use regexp_split_to_array to split it into its components and insert these into another table.
You can use file_fdw to create a foreign table with a single text column like above and operate on that. That saves loading the file into the database.
There is a foreign data wrapper for fixed-width text files that you can try.
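If pre-processing outside the database is acceptable, pandas can parse fixed-width files directly; here is a minimal sketch, with made-up column widths that would need to be replaced by the real field sizes:

import pandas as pd

# Made-up widths, one per column; replace with the actual fixed field sizes.
widths = [12, 5, 3, 3, 3, 5, 7, 6, 7, 8, 9]
cols = ["HEADREC_ID", "YEAR", "MONTH", "DAY", "HOUR", "RELTIME",
        "NUMLEV", "P_SRC", "NP_SRC", "LAT", "LON"]

df = pd.read_fwf("USM00070219-data.txt", widths=widths, names=cols)
# Write a comma-delimited copy that COPY ... DELIMITER ',' can load cleanly.
df.to_csv("USM00070219-clean.csv", index=False)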

load data to db2 in a single row (cell)

I need to load an entire file (containing only ASCII text) into the database (DB2 Express edition). The table has only two columns (ID, TEXT). The ID column is the PK with auto-generated data, whereas TEXT is CLOB(5); I have no idea about the parameter 5, it was entered by default in Data Studio.
Now I need to use the load utility to load a text file (containing 5 MB of data) into a single row, namely into the TEXT column. I do not want the text to be broken across different rows.
Thanks in advance for your answers!
Firstly, you may want to redefine your table: CLOB(5) means you expect at most 5 bytes in the column, which is hardly enough for a 5 MB file. After that you can use the DB2 IMPORT or LOAD commands with the lobsinfile modifier.
Create a text file and place LOB Location Specifiers (LLS) for each file you want to import, one per line.
An LLS is a way to tell IMPORT where to find LOB data. It has this format: <file path>[.<offset>.<length>/], e.g. /tmp/lobsource.dta.0.100/ to indicate that the first 100 bytes of the file /tmp/lobsource.dta should be loaded into the particular LOB column. Notice also the trailing slash. If you want to import the entire file, skip the offset and length part. LLSes are placed in the input file instead of the actual data for each row and LOB column.
So, for example:
echo "/home/you/yourfile.txt" > /tmp/import.dat
Since you said the IDs are auto-generated, you don't need to include them in the input file; just don't forget to use the appropriate command modifier, identitymissing or generatedmissing, depending on how the ID column is defined.
Now you can connect to the database and run the IMPORT command, e.g.
db2 "import from /tmp/import.dat of del
modified by lobsinfile identitymissing
method p (1)
insert into yourtable (yourclobcolumn)"
I split the command onto multiple lines for readability, but you should type it on a single line.
method p (1) means parse the input file and read the column in position 1.
More info in the manual.

Splitting a column into multiple columns in MongoDB

I have a dictionary field in a MongoDB document which contains values that are separated by a semicolon. Is there any query that I could use to split the column into multiple columns?
The scenario is that I load in contents from a CSV file which sometimes has columns that are delimited by characters like a semicolon. Since I will have to support any kind of input CSV file, I cannot fix anything in the schema. Thus I have a dictionary field called "content" that stores the document contents as a dictionary. Now I need to be able to perform splits on columns that have multiple values.
Eg: Author Names column has entries like Author1;Author2;Author3. The user should be able to split this into 3 columns - one for each author.
Edit: For now, I have implemented this by means of a process on the server side. Ideally it would be great if I could do this in MongoDB itself (speed constraints).
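One possible approach, sketched with pymongo and assuming MongoDB 3.4+ (database, collection, and field names below are made up): the $split aggregation operator turns the semicolon-joined string into an array, and $arrayElemAt exposes each element as its own field.

from pymongo import MongoClient

client = MongoClient()
pipeline = [
    # Split "Author1;Author2;Author3" into ["Author1", "Author2", "Author3"].
    {"$addFields": {"authors": {"$split": ["$content.Author Names", ";"]}}},
    # Expose each array element as its own field; a fixed number of
    # output columns is assumed here for simplicity.
    {"$addFields": {
        "Author1": {"$arrayElemAt": ["$authors", 0]},
        "Author2": {"$arrayElemAt": ["$authors", 1]},
        "Author3": {"$arrayElemAt": ["$authors", 2]},
    }},
]
for doc in client.mydb.mycollection.aggregate(pipeline):
    print(doc)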