Usage of Escape char and Text Enclosure - talend

What is the use of Escape char and Text Enclosure in the tFileOutputDelimited component, and how can we use them?
Thanks in Advance...

To answer your question, consider the following example CSV file:
bookId,bookname,description,authorname
1,Grammer,book that has details about grammer,author1
2,Special Characters, book that describes about some escape characters like \", punctuations and special characters ,etc.,author2
3,Mathematics, book that has mathematical operations like addition +, subtraction -, multiplication *, division / etc, author3
I have created a simple job like below
In the above sample, the comma "," is the delimiter, but some of the field values also contain commas.
The data that is written to the CSV file will look like below,
Now when I read the data back from that file, I get the following:
.------+------------------+-------------------------------------------------------+-------------------------------------.
|                                                       tLogRow_3                                                       |
|=-----+------------------+-------------------------------------------------------+------------------------------------=|
|bookId|bookName          |description                                            |author                               |
|=-----+------------------+-------------------------------------------------------+------------------------------------=|
|1     |Grammer           |book that has details about grammer                    |author1                              |
|2     |Special Characters|book that describes about some escape characters like "| punctuations and special characters |
|3     |Mathematics       |book that has mathematical operations like addition +  | subtraction -                       |
'------+------------------+-------------------------------------------------------+-------------------------------------'
If you notice, the "author" column has the wrong values for rows 2 and 3: the description was split at its embedded commas, and the real author was lost.
This is because of the commas inside the data. To avoid it, the Text Enclosure option is used: each field is wrapped in the enclosure character, so a delimiter inside the enclosure is read as data. There is also an escaped character in the data, \", which is written to the file as ". If Text Enclosure is set to """, then the " inside the data has to be escaped, otherwise it would be read as the end of the field. To do this, use the Escape char option, like below
Now the output that I got is
When I read this data back, I get the following:
.------+------------------+-------------------------------------------------------------------------------------------------------+-------.
|                                                                tLogRow_3                                                                |
|=-----+------------------+-------------------------------------------------------------------------------------------------------+------=|
|bookId|bookName          |description                                                                                            |author |
|=-----+------------------+-------------------------------------------------------------------------------------------------------+------=|
|1     |Grammer           |book that has details about grammer                                                                    |author1|
|2     |Special Characters|book that describes about some escape characters like ", punctuations and special characters ,etc.     |author2|
|3     |Mathematics       |book that has mathematical operations like addition +, subtraction -, multiplication *, division / etc.|author3|
'------+------------------+-------------------------------------------------------------------------------------------------------+-------'
If you notice, no data is lost.
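The same behaviour can be reproduced outside Talend. As a rough sketch, Python's csv module has comparable settings: quotechar plays the role of Text Enclosure and escapechar the role of Escape char. That mapping is an assumption made for illustration, not a description of how Talend is implemented. The row used is the second record from the sample above.

```python
import csv
import io

# Second record from the sample; the description contains commas and a double quote.
row = [
    "2",
    "Special Characters",
    'book that describes about some escape characters like ", '
    "punctuations and special characters ,etc.",
    "author2",
]

# Without an enclosure, a reader that splits on "," cannot tell data commas
# from delimiters, so the row falls apart into more than four fields.
naive_fields = ",".join(row).split(",")

# With an enclosure (") and an escape character (\), every field is wrapped
# and any " inside the data is written as \".
fmt = dict(delimiter=",", quotechar='"', escapechar="\\",
           doublequote=False, quoting=csv.QUOTE_ALL)

buf = io.StringIO()
csv.writer(buf, lineterminator="\n", **fmt).writerow(row)
written = buf.getvalue()

# Reading with the same settings restores the original four fields.
read_back = next(csv.reader(io.StringIO(written), **fmt))

print(len(naive_fields), "fields without enclosure")
print(written, end="")
print("round trip ok:", read_back == row)
```

The writer and the reader must agree on both settings; reading the enclosed file without the escape character would end the description at the embedded ".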
Hope this would help you out.

Related

Postgres escape double quotes

I am working with a malformed database which seems to have double quotes as part of the column names.
Example:
|"Market" |
|---------|
|Japan |
|UK |
|USA |
And I want to select like below
SELECT "\"Market\"" FROM mytable; /* Does not work */
How does one select such a thing?
The documentation says
[A] delimited identifier or quoted identifier […] is formed by enclosing an arbitrary sequence of characters in double-quotes ("). […]
Quoted identifiers can contain any character, except the character with code zero. (To include a double quote, write two double quotes.)
So you'll want to use
SELECT """Market""" AS "Market" FROM mytable;
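When such names have to be generated by a program, the doubling rule quoted above is easy to apply mechanically. A minimal sketch in Python follows; quote_ident is a hypothetical helper written for this example, not a function from any PostgreSQL driver.

```python
def quote_ident(name: str) -> str:
    """Return name as a double-quoted SQL identifier (hypothetical helper)."""
    if "\x00" in name:
        raise ValueError("identifiers cannot contain the character with code zero")
    # Per the quoted rule: a double quote inside the identifier is written twice.
    return '"' + name.replace('"', '""') + '"'


print("SELECT " + quote_ident('"Market"') + " FROM mytable;")
# prints: SELECT """Market""" FROM mytable;
```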
An alternative would be
A variant of quoted identifiers allows including escaped Unicode characters identified by their code points. This variant starts with U& (upper or lower case U followed by ampersand) immediately before the opening double quote, without any spaces in between, for example U&"foo". […] Inside the quotes, Unicode characters can be specified in escaped form by writing a backslash followed by the four-digit hexadecimal code point number or alternatively a backslash followed by a plus sign followed by a six-digit hexadecimal code point number.
which in your case would mean
SELECT U&"\0022Market\0022" AS "Market" FROM mytable;
SELECT U&"\+000022Market\+000022" AS "Market" FROM mytable;
Disclaimer: your database may not actually have double quotes as part of the name itself. As mentioned in the comments, this might just be how the tool you are using displays a column named Market (not market), since
Quoting an identifier also makes it case-sensitive
So all you might need could be
SELECT "Market" FROM mytable;

Column names with line breaks

I know that for text strings in PostgreSQL, line breaks can be written in a unified way by putting the symbol E or e in front of the text:
SELECT E'first\nsecond'
results in:
first
second
But PostgreSQL also supports line breaks within column names. I'm not sure why, or how evil this practice is, but one can do the following:
CREATE TABLE One("first\nsecond" text);
CREATE TABLE Two("first
second" text);
When you are unfortunate enough to run into one of these, you would find that while these queries work:
SELECT "first\nsecond" from One;
SELECT "first
second" from Two;
these ones do not:
SELECT "first
second" from One;
SELECT "first\nsecond" from Two;
My question is: Is there a way in PostgreSQL that unifies such differences, similar to the situation with the column values?
I have tried putting E in front of "first\nsecond" column names, but it is not supported. Trying to put \r\n instead (I'm using Windows) gave me a third type of column names, one that can only be queried as:
SELECT "first\r\nsecond" FROM Third
Column names are identifiers, and the gory details of the syntax for identifiers are described at:
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS
TL;DR: use the U&"..." syntax to inject non-printable characters into identifiers through their Unicode codepoints, and there's no way to unify CR,LF with LF alone.
How to refer to the column in a single line
We're allowed to use Unicode escape sequences in identifiers, so per documentation, the following does work:
select U&"first\000asecond" from Two;
if it's just a newline character between the two words.
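If such identifiers need to be produced programmatically, the escaped spelling can be generated from the raw name. The following is only a sketch: unicode_ident is a hypothetical helper, and escaping every character that is not an ASCII letter or digit is a simplifying choice made here, not something the syntax requires.

```python
def unicode_ident(name: str) -> str:
    """Spell an identifier in the U&"..." form (hypothetical helper)."""
    parts = []
    for ch in name:
        code = ord(ch)
        if ch.isascii() and ch.isalnum():
            parts.append(ch)                 # plain ASCII letters and digits stay readable
        elif code <= 0xFFFF:
            parts.append(f"\\{code:04x}")    # backslash followed by four hex digits
        else:
            parts.append(f"\\+{code:06x}")   # backslash, plus sign, six hex digits
    return 'U&"' + "".join(parts) + '"'


print("select " + unicode_ident("first\nsecond") + " from Two;")
# prints: select U&"first\000asecond" from Two;
```

Because double quotes and backslashes are escaped by code point as well, the result needs no further quoting rules.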
What happens with the queries on the first table
The table is created with:
CREATE TABLE One("first\nsecond" text);
As the backslash character has no special meaning here, this column does not contain any newline.
It contains first followed by \ followed by n followed by second.
So:
SELECT "first\nsecond" from One;
does work because it's the same as what's in the CREATE TABLE
whereas
SELECT "first
second" from One;
fails because there's a newline in that SELECT, whereas the actual column name in the table has a backslash followed by an n.
What happens with the queries on the second table
This is the opposite of "One".
CREATE TABLE Two("first
second" text);
The newline is taken verbatim and is part of the column.
So
SELECT "first
second" from Two;
works because the newline is there exactly as in the CREATE TABLE,
with an embedded newline,
whereas
SELECT "first\nsecond" from Two;
fails because, as before, \n in this context does not mean a newline.
Carriage Return followed by Newline, or anything weirder
As mentioned in comments and your edit, this could be carriage return and newline instead, in which case the following should do:
select U&"first\000d\000asecond" from Two;
although in my test, hitting Enter in the middle of a column with psql on Unix and Windows has the same effect: a single newline in the column's name.
To check what exact characters ended up in a column name, we can inspect them in hexadecimal.
When applied to your create table example, from inside psql under Unix:
CREATE TABLE Two("first
second" text);
select convert_to(column_name::text,'UTF-8')
from information_schema.columns
where table_schema='public'
and table_name='two';
The result is:
convert_to
----------------------------
\x66697273740a7365636f6e64
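To read that hex dump without decoding it by eye, the bytes can be turned back into text on the client side. A small Python sketch, using the exact value shown above:

```python
# The convert_to output shown above, without the leading \x.
hex_dump = "66697273740a7365636f6e64"

name = bytes.fromhex(hex_dump).decode("utf-8")

# repr() makes the control character visible instead of printing a line break.
print(repr(name))
# prints: 'first\nsecond'

print("contains CR:", "\r" in name)
print("contains LF:", "\n" in name)
```

The single 0a byte between "first" and "second" is a line feed with no carriage return in front of it.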
For more complex cases (e.g. non-ascii characters with several bytes in UTF-8), a more advanced query might help, for easy-to-read codepoints:
select c,lpad(to_hex(ascii(c)),4,'0') from (
select regexp_split_to_table(column_name::text,'') as c
from information_schema.columns
where table_schema='public'
and table_name='two'
) as g;
c | lpad
---+------
f | 0066
i | 0069
r | 0072
s | 0073
t | 0074
+| 000a
|
s | 0073
e | 0065
c | 0063
o | 006f
n | 006e
d | 0064

Can some explain in details about one sed command

I didn't understand what the command sed 's/^.*\(.\{4\}\)$/\1/' is doing. If someone could explain it character by character, that would be great and I could understand it much better. I only know the basics of sed and am just starting to learn it.
You have two things going on, understanding sed and understanding the regular expression passed into the sed substitution command.
Let's start with the overall command:
sed 's/^.*\(.\{4\}\)$/\1/'
^    ^ ^              ^
|    | |              |- what you want to replace found text with
|    | |
|    | |- what you're looking for
|    |
|    |- tell sed you want to substitute the text we find
|       between the first two '/' with the contents between
|       the last two '/'
|
|- call the sed application
Next up is understanding the regular expression. https://regex101.com/ is a great resource for this. First, let's look at the regular expression:
^.*\(.\{4\}\)$
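The backslashes here are not shell escaping: the single quotes already pass the text to sed untouched. By default sed uses basic regular expressions, in which groups and repetition counts are written \( \) and \{ \}. Written in the more familiar extended syntax (what sed -E accepts), the regex is: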
^.*(.{4})$
Now this is a bit more clear. This regular expression:
matches the beginning of the line: ^
followed by zero or more characters: .*
and captures the last 4 characters of the line: (.{4})$
    the parentheses create the capture group
    . matches any character
    {4} four times
    $ anchored at the end of the line
Lastly we have the /\1/ portion of the sed command. This tells sed to replace whatever matched ^.*(.{4})$ with the contents of the capture group created by (.{4}).
So basically, this command replaces each line in a file with the last four characters of that line. Lines shorter than four characters do not match and are left unchanged.
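To experiment with the pattern outside sed, the same substitution can be tried with Python's re module, whose syntax matches the unescaped form shown above. This is a sketch for exploration only; last_four is a name made up for the example.

```python
import re


def last_four(line: str) -> str:
    # Same pattern as the sed command; in Python ( ) and { } need no backslashes.
    return re.sub(r"^.*(.{4})$", r"\1", line)


print(last_four("hello world"))  # prints: orld
print(last_four("abc"))          # prints: abc (too short to match, left as is)
```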

Load PostgreSQL table from CSV with data with commas between brackets

I've been given some CSV data I need to load into a PostgreSQL table with the following format:
7227, {text with, commas}, 10.0, 3.0, text with no commas
I want my table rows to appear as:
7227 | text with, commas | 10.0 | 3.0 | text with no commas
How can I use COPY and get it to ignore commas between the brackets?
I'm trying to avoid pre-processing the file, although that is an option.
I'm afraid you'll have to edit the file. With COPY in CSV format, this should be ok:
7227, "text with, commas", 10.0, 3.0, text with no commas
Alternatively, for COPY's text format (with the delimiter set to a comma)
7227, text with\, commas, 10.0, 3.0, text with no commas
according to this principle in the documentation:
Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters.
In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself,
newline, carriage return, and the current delimiter character.
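If the file does have to be edited, the rewrite can be scripted. The sketch below shows both target formats in Python; the helper names are made up, and it assumes that braces never nest and never appear as ordinary data.

```python
import re

# One {...} group with no nested braces (an assumption about the input).
BRACED = re.compile(r"\{([^{}]*)\}")


def braces_to_quotes(line: str) -> str:
    """Rewrite {a, b} as "a, b", doubling any quotes inside (CSV style)."""
    return BRACED.sub(lambda m: '"' + m.group(1).replace('"', '""') + '"', line)


def braces_to_backslashes(line: str) -> str:
    """Rewrite {a, b} with its backslashes and commas backslash-escaped."""
    def escape(m):
        return m.group(1).replace("\\", "\\\\").replace(",", "\\,")
    return BRACED.sub(escape, line)


sample = "7227, {text with, commas}, 10.0, 3.0, text with no commas"
print(braces_to_quotes(sample))
print(braces_to_backslashes(sample))
```

Running either function over each line before calling COPY yields the two formats shown above.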

How to escape a pipe char in a code statement in a markdown table?

On GitHub I want to build a table containing pieces of code in Markdown. It works fine except when I put a pipe char (i.e. | ) between the backtick (i.e. ` ) chars.
Here is what I want:
a | r
------------|-----
`a += x;` | r1
`a |= y;` | r2
The problem is that the vertical bar in the code statement of the second line is interpreted as a column delimiter. Then the table rendering looks pretty ugly. How could I avoid that?
Note that I already tried to use the &#124; HTML code, but it produces a &#124;= y;.
As of March 2017 using escaped pipes is much easier: \| See other answers.
If you remove the backticks (`), using the &#124; hack works
a | r
------------|-----
`a += x;` | r1
a &#124;= y; | r2
and produces the following output
Alternatively, you can replace the backticks (`) with a <code></code> markup which fixes the issues more nicely by preserving the rendering
a | r
------------|-----
`a += x;` | r1
<code>a &#124;= y;</code> | r2
generating the following output
As of mid-2017, the pipe may simply be escaped with a backslash, like so: \|
This works both inside and outside of backticks.
The &#124; HTML code may now be used again, too, but only outside of backticks.
Previous answer:
As of March 2017, the accepted answer stopped working because GitHub changed their markdown parser.
Using another unicode symbol that resembles a pipe seems to be the only option right now, e.g.:
ǀ (U+01C0, Latin letter dental click)
∣ (U+2223, Symbol divides)
⎮ (U+23AE, Integral Extension)
You can escape the | in a table in GFM with a \ like so:
| a | r
|------------|-----
| `a += x;` | r1
| `a \|= y;` | r2
See https://github.github.com/gfm/#example-191 or https://github.com/dotnet/csharplang/pull/743 for an example.
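When table rows are generated by a script, the escaping can be applied to every cell automatically. A minimal Python sketch; escape_cell and table_row are hypothetical helpers, and the input is assumed to contain only unescaped pipes.

```python
def escape_cell(text: str) -> str:
    """Escape pipes so the text stays inside one GFM table cell."""
    return text.replace("|", "\\|")


def table_row(*cells: str) -> str:
    """Join already-formatted cells into one table row."""
    return "| " + " | ".join(escape_cell(c) for c in cells) + " |"


print(table_row("a", "r"))
print(table_row("------------", "-----"))
print(table_row("`a += x;`", "r1"))
print(table_row("`a |= y;`", "r2"))
# last line printed: | `a \|= y;` | r2 |
```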
This works fine in GitHub markdown:
| a | r
| ------------|-----
| `a += x;` | r1
| `a \|= y;` | r2
Very similar to https://stackoverflow.com/a/45122039/1426932 but with an added | in the first column (it didn't render well in comments, so I'm adding an answer here).
Note that outside a table cell, `a \|= y;` will render the backslash, but inside a table cell, it won't.