Facing issues with properly formatting CSV data - PostgreSQL

Before I begin my question and the background information, I'd like to state that I realize many people have asked similar questions, but none of the answers to those questions applied to my situation.
Background info: I'm trying to properly format a very large CSV file so that I can import it into a table in my PostgreSQL database. This CSV file only contains two fields, and the delimiter is ;
Problems encountered/attempted solutions
Problem #1: The delimiter is a semicolon, and many of the values in one of the fields contain semicolons. PostgreSQL obviously doesn't like this.
Solution #1: I used sed to change the delimiter to a string of characters that I knew would only occur as a delimiter.
Problem #2: The delimiter can only be a single character.
Solution #2: I changed the delimiter to a unicode character that I knew wouldn't occur as anything other than a delimiter.
Problem #3: The delimiter can only be a single-byte character.
Solution #3: I decided to backtrack, and rather than mess with the delimiter, I tried using sed to enclose all field values in double quotes in order to avoid the problem of some values containing the delimiter character. More specifically, I tried the command found in the answer to this question - sed statement to change/modify CSV separators and delimiters
Problem #4: This resulted in many data errors: any time the delimiter appeared inside one of the values, double quotes were placed around it as well, which caused PostgreSQL to attempt to copy values that were far too long and were simply not individual values. This row is a perfect example of that -
"m[redacted]#[redacted].com";"mk,l.";"/'"
This row in particular made PostgreSQL think that it was copying 3 columns. Not to mention this row -
"[redacted]&#39";"of&#39";"all&#39";"your&#39";"[redacted]#[redacted].com";"[redacted]#[redacted].com:hapa[redacted]hoha"
Which made PostgreSQL attempt to copy the entire rest of the file into the second field as a single value.
Question
With all of that having been said, my final question is this - how can I enclose every value in the CSV file in double quotes in such a way that it will be properly imported into PostgreSQL?
Right now I'm up against a wall and would appreciate any advice, even if it isn't a definitive answer. I've tried everything I can think of. If an answer is even possible, I'd like one that applies to CSV files containing more than two fields, as I have many more CSV files to import after this one.

You state that one of the two fields can contain semicolons. If so (the other field never contains any), then the one semicolon adjoining the other field is the delimiter. If the field containing semicolons as data comes first, you need to find the last semicolon on the line; otherwise, the first.
I've never used sed, but regex allows you to match the first or last occurrence of a character, so you can replace this single semicolon with a temporary character or pattern, then put quotes around the fields, and finally change the temporary field delimiter back.
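For example, assuming GNU sed and that the semicolon-containing field is the second of the two (so the first semicolon on each line is the real delimiter), a minimal sketch might look like this; using \x01 as the temporary character is an assumption, on the premise that it never occurs in the data:
sed -e 's/"/""/g' -e 's/;/\x01/' -e 's/^\(.*\)\x01\(.*\)$/"\1";"\2"/' input.csv > output.csv
The first expression doubles any pre-existing double quotes (CSV escaping), the second replaces only the first semicolon on each line, and the third wraps both fields in quotes while restoring the semicolon as the delimiter. The result should then load with something like COPY my_table FROM '/path/to/output.csv' WITH (FORMAT csv, DELIMITER ';'); where my_table and the path are placeholders.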

Related

PostgreSQL: Cannot use numbers as tags for dollar quoted strings

How do we use numbers as tags for dollar quoted strings?
INSERT INTO table(user_id,user_data)
values (22176,to_jsonb($123${"name": "Helo. $ what is this $"}$123$::jsonb))
The above query fails; however, if I replace the numeric tag with an alphabetic one, it works. I didn't find anything in the documentation against using numbers in tags.
I need to make my tags as unique as possible, since I'm trying to avoid a situation where user content inside the jsonb matches my tags, for example
$abc${"name": "hello $abc$"}$abc$
I was trying to use UUIDs as tags, but numbers aren't accepted.
Note: this is an example query; I have a lot of ' characters in my JSON values.
You cannot use $123$ because PostgreSQL uses $1, $2, etc. as placeholders in prepared statements; $a1$ would be OK.
To get a rare string that avoids collisions, mash the keyboard a few times and make sure not to hit a digit first.
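For instance, a minimal sketch of a working variant (the table and tag names here are illustrative, not from the original schema):
-- $123$ fails because the scanner reads $1 as a parameter placeholder;
-- a dollar-quote tag must not start with a digit. A letter prefix works:
INSERT INTO user_content (user_id, user_data)
VALUES (22176, $k9q${"name": "Helo. $ what is this $"}$k9q$::jsonb);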

DB2 UnLOAD in unicode with two chardelimiter

I have to create an UNLOAD job for a DB2 table and save the unload in Unicode. That's no problem.
But unfortunately the table columns contain values that collide with the separators.
For example, I would like to use the combination #! as a separator, but I can't do that with a Unicode unload.
Can someone tell me how to do this?
Now my statement looks like this:
DELIMITED COLDEL X'3B' CHARDEL X'24' DECPT X'2E'
UNICODE
Thanks a lot for your help.
The delimiter can be a single character (not two characters, as you want).
In this case the chosen solution was to find a single character that did not appear in the data.
When that is not possible, consider a non-delimited output format, or a different technique to get the data to the external system (for example via federation or another SQL-based interchange, XML, etc.).
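For instance, a sketch that keeps the statement from the question but swaps in a rarely used single-byte control character as the column delimiter (X'07', the BEL character, chosen here on the assumption that it never appears in the data):
DELIMITED COLDEL X'07' CHARDEL X'24' DECPT X'2E'
UNICODE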

Azure ADF Copy Activity with Trailing Column Delimiter

I have a strange source CSV file where it contains a trailing column delimiter at the end of each record just before the carriage return/new line.
When ADF previews this data, it displays only 2 columns and all the data rows without issue. However, when using the Copy activity, it fails with the following exception.
ErrorCode=DelimitedTextColumnNameNotAllowNull,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The name of column index 3 is empty. Make sure column name is properly specified in the header
Now I understand why it's complaining about this due to the trailing delimiter, but my question is whether there is a way to deal with this condition. I've tried including the trailing comma in the record delimiter (,\r\n), but then it just pivots the data so that all the columns become rows.
Is there a way to address this condition in copy activity?
When previewing the data in the dataset, it seems correct.
But in the Copy activity, the data is actually split into 3 columns by the column delimiter ",", and the third column is empty or NULL. This causes the error.
If you use Data Flow and import the projection from the source, you can see the third column.
For now, the Copy activity doesn't support modifying the data schema. You must use a Data Flow Derived Column transformation to create a new schema for the source.
Then map the new columns/schema to the sink to solve the problem.
HTH.
Use a different encoding for your CSV. CSV UTF-8 will do the trick.

Copy text file using postgres with custom delimiter by character size

I need to copy a text file that has a confusing delimiter. I believe the delimiter is a space; however, some of the column values are empty, and I cannot tell which column is which, making it harder to load the data into the database since a space doesn't indicate anything. Thus, when I try to COPY, the mapping is wrong and I am getting ERROR: extra data after last expected column
I have tried changing the delimiter to a comma and the like, but I am still getting the same error. The command below works when I load some dummy data with a proper delimiter:
COPY usm00070219(HEADREC_ID,YEAR,MONTH,DAY,HOUR,RELTIME,NUMLEV,P_SRC,NP_SRC,LAT,LON) FROM 'D:\....\USM00070219-data.txt' DELIMITER ' ';
This is example data: it should have 11 columns, but the first row contains only 10 values, and the empty-value column cannot be identified. The spacing is not helpful at all!
Is there any way I can separate the columns by character size, forcing the data to be divided at the given widths?
COPY is not made to handle fixed-width text files. I can think of two options:
Load the file as it is into a table with a single text column using COPY. Then use regexp_split_to_array to split it into its components and insert these into another table (see the sketch after this list).
You can use file_fdw to create a foreign table with a single text column like above and operate on that. That saves you from loading the file into the database.
There is a foreign data wrapper for fixed-width text files that you can try.
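A minimal sketch of the first option, using substr() rather than regexp_split_to_array since the goal here is splitting at fixed character positions; the staging table, target columns, and positions are assumptions, as the real layout isn't shown:
-- 1. Load each raw line as a single value (default text format;
--    this assumes the file contains no tabs or backslashes).
CREATE TABLE staging (line text);
COPY staging (line) FROM 'D:\....\USM00070219-data.txt';
-- 2. Cut fields by character position and insert them into the target
--    table (the positions are placeholders; adjust to the real layout).
INSERT INTO usm00070219 (headrec_id, year, month, day)
SELECT trim(substr(line, 1, 12)),
       trim(substr(line, 14, 4))::int,
       trim(substr(line, 19, 2))::int,
       trim(substr(line, 22, 2))::int
FROM staging;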

When do Postgres column or table names need quotes and when don't they?

Let's consider the following postgres query:
SELECT *
FROM "MY_TABLE"
WHERE "bool_var"=FALSE
AND "str_var"='something';
The query fails to respond properly when I remove quotes around "str_var" but not when I do the same around "bool_var". Why? What is the proper way to write the query in that case, no quotes around the boolean column and quotes around the text column? Something else?
PostgreSQL converts all names (table names, column names, etc.) to lowercase unless you prevent it by double-quoting them, as in create table "My_Table_ABC" ( "My_Very_Upper_and_Lowercasy_Column" numeric, ...). If you have names like this, you must always double-quote them in SELECTs and other references.
I would recommend not creating tables like this, and also not using characters outside a-z, 0-9 and _ in names. You cannot guarantee that every piece of software or library ever used against your database will support case sensitivity, and all that double quoting is tedious to remember and to type.
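A minimal sketch of the folding behavior (the names are illustrative):
CREATE TABLE "MixedCase" (id int);  -- stored exactly as MixedCase
SELECT * FROM MixedCase;            -- ERROR: relation "mixedcase" does not exist
SELECT * FROM "MixedCase";          -- works: the quoted name matches
CREATE TABLE plain_table (id int);  -- stored as plain_table
SELECT * FROM PLAIN_TABLE;          -- works: folded to plain_table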
Thanks to @TimBiegeleisen's comment, I was able to pinpoint the problem: I had used a reserved keyword ("user") as a column name.
Link to reserved keywords in the doc: https://www.postgresql.org/docs/current/sql-keywords-appendix.html.
Now I know the answer is not to quote column names in queries, but rather to avoid reserved keywords as column names in the first place.
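A minimal sketch of the pitfall (the table name is illustrative). Unquoted, user is the SQL keyword for the current role, so the query runs but returns the wrong thing:
CREATE TABLE accounts ("user" text, str_var text);
SELECT user FROM accounts;    -- returns the current role name for every row, not the column
SELECT "user" FROM accounts;  -- quoted: actually reads the column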