How to remove multiple characters between 2 special characters in a column in SSIS expression - tsql

I want to remove the multiple characters starting from '#' till the ';' in derived column expression in SSIS.
For example,
my input column values are,
and want the output as,
Note: Length after '#' is not fixed.
Already tried in SQL but want to do it via SSIS derived column expression.

First of all: Please do not post pictures. We prefer copy-and-pastable sample data. And please try to provide a minimal, complete and reproducible example, best served as DDL, INSERT and code as I do it here for you.
And just to mention this: If you control the input, you should not mix information within one string... If this is needed, try to use a "real" text container like XML or JSON.
SQL-Server is not meant for string manipulation. There is no RegEx or repeated/nested pattern matching. So we would have to use a recursive / procedural / looping approach. But - if performance is not so important - you might use a XML hack.
--DDL and INSERT
DECLARE #tbl TABLE(ID INT IDENTITY,YourString VARCHAR(1000));
INSERT INTO #tbl VALUES('Here is one without')
,('One#some comment;in here')
,('Two comments#some comment;in here#here is the second;and some more text')
--The query
SELECT t.ID
,t.YourString
,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML) SeeTheIntermediateXML
,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML).value('.','nvarchar(max)') CleanedValue
FROM #tbl t
The result
+----+-------------------------------------------------------------------------+-----------------------------------------+
| ID | YourString | CleanedValue |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 1 | Here is one without | Here is one without |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 2 | One#some comment;in here | One in here |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 3 | Two comments#some comment;in here#here is the second;and some more text | Two comments in here and some more text |
+----+-------------------------------------------------------------------------+-----------------------------------------+
The idea in short:
Using some string methods we can wrap your unwanted text in XML comments.
Look at this
Two comments<!--some comment--> in here<!--here is the second--> and some more text
Reading this XML with .value() the content will be returned without the comments.
Hint 1: Use '-->;' in your replacement to keep the semi-colon as delimiter.
Hint 2: If there might be a semi-colon ; somewhere else in your string, you would see the --> in the result. In this case you'd need a third REPLACE() against the resulting string.

Related

Is there a way to see raw string values using SQL / presto SQL / athena?

Edit after asked to better specify my need:
TL;DR: How to show whitespace escaped characters (such as /r) in the Athena console when performing a query? So this: "abcdef/r" instead of this "abcdef ".
I have a dataset with a column that contains some strings of variable length, all of them with a trailing whitespace.
Now, since I had analyzed this data before, using python, I know that this whitespace is a \r; however, if in Athena I SELECT my_column, it obviously doesn't show the escaped whitespace.
Essentially, what I'm trying to achieve:
my_column | ..
----------+--------
abcdef\r | ..
ghijkl\r | ..
What I'm getting instead:
my_column | ..
----------+--------
abcdef | ..
ghijkl | ..
If you're asking why would I want that, it's just to avoid having to parse this data through python if I ever incur in this situation again, so that I can immediately know if there's any weird escaped characters in my strings.
Any help is much appreciated.

Postgres Full-Text Search with Hyphen and Numerals

I have observed what seems to me an odd behavior of Postgres' to_tsvector function.
SELECT to_tsvector('english', 'abc-xyz');
returns
'abc':2 'abc-xyz':1 'xyz':3
However,
SELECT to_tsvector('english', 'abc-001');
returns
'-001':2 'abc':1
Why not something like this?
'abc':2 'abc-001':1 '001':3
And what should I do to be able to search by the numeric portion alone, without the hyphen?
Seems the text search parser identifies the hyphen followed by digits to be the sign of a signed integer. Debug with ts_debug():
SELECT * FROM ts_debug('english', 'abc-001');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+--------------+------------+---------
asciiword | Word, all ASCII | abc | {simple} | simple | {abc}
int | Signed integer | -001 | {simple} | simple | {-001}
Other text search configurations (like 'simple' instead of 'english') won't help as the parser itself is "at fault" here (debatable).
A simple way around it (other than modifying the parser, which I never tried) would to pre-process strings and replace hyphens with m-dash (—) or just blanks to make sure those are identified as "Space symbols". (Actual signed integers lose their negative sign in the process.)
SELECT to_tsvector('english', translate('abc-001', '-', '—'))
## to_tsquery ('english', '001'); -- true now
db<>fiddle here
This can be circumvented with PG13's dict-int addon's absval option. See the official documentation.
But in case you're stuck with an earlier PG version, here's the generalized version of a "number or negative number" workaround in a query.
select regexp_replace($$'test' & '1':* & '2'$$::tsquery::text,
'''([.\d]+''(:\*)?)', '(''\1 | ''-\1)', 'g')::tsquery;
This results in:
'test' & ( '1':* | '-1':* ) & ( '2' | '-2' )
It replaces lexemes that look like positive numbers with "number or negative number" kind of subqueries.
The double cast ::tsquery::text is just there to show how you would pass a tsquery casted to text.
Note that it handles prefix matching numeric lexemes as well.

Stop PostgreSQL from spliting value(s) in multiple lines?

I'm converting a binary data (bytea) data type to string using encode(foo::bytea, 'base64') but the output is being split in multiple lines:
-[ RECORD 1 ]-+-----------------------------------------------------------------------
req_id | 132675
b_string | d4IF4jCCBd4GCSqGSIb3DQEHAqCCBc8wggXLAgEDMQ0wCwYJBAIBMIGYBgZngQgBAQGg+
| gY0EgYowgYcCAQAwCwYJYIZIAQAwCwYJYIZIAWUDBAIHUwUdH0JybzpY2evf+v9Xg86b+
| HSGTGYBIb/QwJQIBAgQg1M6/cJ+S39XY1lm43oenxJNLrYcc3hVw7fgwJQIBDgQgIAil+
| 1JnYbdS0p4pK07kMkb/dbMcxryx6mqbLTzx+YJ6gggQbMI2gAwIBAgIESS7vwTAKBggq+
| LUxRjUXbTgfGwUKOFwemsc4KXbsLZ13MkbNfAQ==
How can get a single string instead?
UPDATE: based on the solution by #LaurenzAlbe
Just for the completeness, this is what I end up doing the gave me the result I wanted:
translate(encode(foo::bytea, 'base64'), E'\n', '')
psql doesn't split your string in multiple lines.
It's the string that contains new-line characters (ASCII 10), and psql displays them accurately. The + at the end of every line is psql's way of telling you that the value is continued in the next line.
You can use the unaligned mode (psql option -A) to get rid of the +, but the output format is less attractive then.
You can get rid of the newlines in a string with
SELECT translate(..., E'\n', '');
decode will be able to handle such a string.

PostgreSQL tuple format

Is there any document describing the tuple format that PostgreSQL server adheres to? The official documentation appears arcane about this.
A single tuple seems simple enough to figure out, but when it comes to arrays of tuples, arrays of composite tuples, and finally nested arrays of composite tuples, it is impossible to be certain about the format simply by looking at the output.
I am asking this following my initial attempt at implementing pg-tuple, a parser that's still missing today, to be able to parse PostgreSQL tuples within Node.js
Examples
create type type_A as (
a int,
b text
);
with a simple text: (1,hello)
with a complex text: (1,"hello world!")
create type type_B as (
c type_A,
d type_A[]
);
simple-value array: {"(2,two)","(3,three)"}
for type_B[] we can get:
{"(\"(7,inner)\",\"{\"\"(88,eight-1)\"\",\"\"(99,nine-2)\"\"}\")","(\"(77,inner)\",\"{\"\"(888,eight-3)\"\",\"\"(999,nine-4)\"\"}\")"}
It gets even more complex for multi-dimensional arrays of composite types.
UPDATE
Since it feels like there is no specification at all, I have started working on reversing it. Not sure if it can be done fully though, because from some initial examples it is often unclear what formatting rules are applied.
As Nick posted, according to docs:
the whitespace will be ignored if the field type is integer, but not
if it is text.
and
The composite output routine will put double quotes around field
values if they are empty strings or contain parentheses, commas,
double quotes, backslashes, or white space.
and
Double quotes and backslashes embedded in field values will be
doubled.
and now quoting Nick himself:
nested elements are converted to strings, and then quoted / escaped
like any other string
I give shorted example below, comfortably compared against its nested value:
a=# create table playground (t text, ta text[],f float,fa float[]);
CREATE TABLE
a=# insert into playground select 'space here',array['','bs\'],8.0,array[null,8.1];
INSERT 0 1
a=# insert into playground select 'no_space',array[null,'nospace'],9.0,array[9.1,8.0];
INSERT 0 1
a=# select playground,* from playground;
playground | t | ta | f | fa
---------------------------------------------------+------------+----------------+---+------------
("space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | space here | {"","bs\\"} | 8 | {NULL,8.1}
(no_space,"{NULL,nospace}",9,"{9.1,8}") | no_space | {NULL,nospace} | 9 | {9.1,8}
(2 rows)
If you go for deeper nested quoting, look at:
a=# select nested,* from (select playground,* from playground) nested;
nested | playground | t | ta | f | fa
-------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+------------+----------------+---+------------
("(""space here"",""{"""""""",""""bs\\\\\\\\""""}"",8,""{NULL,8.1}"")","space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | ("space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | space here | {"","bs\\"} | 8 | {NULL,8.1}
("(no_space,""{NULL,nospace}"",9,""{9.1,8}"")",no_space,"{NULL,nospace}",9,"{9.1,8}") | (no_space,"{NULL,nospace}",9,"{9.1,8}") | no_space | {NULL,nospace} | 9 | {9.1,8}
(2 rows)
As you can see, the output again follows rules the above.
This way in short answers to your questions would be:
why array is normally presented inside double-quotes, while an empty array is suddenly an open value? (text representation of empty array does not contain comma or space or etc)
why a single " is suddenly presented as \""? (text representation of 'one\ two', according to rules above is "one\\ two", and text representation of the last is ""one\\\\two"" and it is just what you get)
why unicode-formatted text is changing the escaping for \? How can we tell the difference then? (According to docs,
PostgreSQL also accepts "escape" string constants, which are an
extension to the SQL standard. An escape string constant is specified
by writing the letter E (upper or lower case) just before the opening
single quote
), so it is not unicode text, but the the way you tell postgres that it should interpret escapes in text not as symbols, but as escapes. Eg E'\'' will be interpreted as ' and '\'' will make it wait for closing ' to be interpreted. In you example E'\\ text' the text represent of it will be "\\ text" - we add backslsh for backslash and take value in double quotes - all as described in online docs.
the way that { and } are escaped is not always clear (I could not anwer this question, because it was not clear itself)

Search text between symbol

I have this text (taken from concatenated field row)
Astronomic Event 2013/1434H - Aceh ....
How do We search it by 2013 or 1434h keywords?
I have tried below code but it resulting no row.
to_tsvector result:
'2013/1434h':8,12 'aceh':1 'bin.....
Sample Case:
WITH sample_table as
(SELECT to_tsvector('Astronomic Event 2013/1434H - Aceh') sample_content)
SELECT *
FROM sample_table, to_tsquery('2013') q
WHERE sample_content ## q
How do We search it by 2013 or 1434h keywords?
It seems like you want to replace:
to_tsquery('1434h') q
with:
to_tsquery('1434h | 2013') q
http://www.postgresql.org/docs/current/static/functions-textsearch.html
Side note: the to_tsquery() syntax is extremely capricious. It doesn't allow for much if any fantasy, and many of the assumptions in Postgres are everything but end-user friendly.
More often than not, you'll be better off using plainto_tsquery(), which allows any amount of garbage to be thrown at it. Thus, consider pre-processing the string before issuing the query. For instance, you could split the string, and OR the original parts together:
where sc.text_index ## (plainto_tsquery('1434h') || plainto_tsquery('2013'))
Doing so will make your code a bit more complex, but it won't rely on your users needing to understand that (contrary to what they're accustomed to in Google) they should enter 'quick & brown & fox & jumps & lazy & dog' instead of plain 'The quick brown fox jumps over the lazy dog'.
Edit: I ended up actually trying your sample query, and it seems you're actually running into a parser issue:
# SELECT alias, description, token FROM ts_debug('Astronomic Event 2013/1434H - Aceh');
alias | description | token
-----------+-------------------+------------
asciiword | Word, all ASCII | Astronomic
blank | Space symbols |
asciiword | Word, all ASCII | Event
blank | Space symbols |
file | File or path name | 2013/1434H
blank | Space symbols |
blank | Space symbols | -
asciiword | Word, all ASCII | Aceh
(8 rows)
http://www.postgresql.org/docs/current/static/textsearch-parsers.html
It looks like you might need to write (or find) and configure an app-specific parser. Having never done this personally, the best I can do is to highlight that Postgres allows this and includes a sample:
http://www.postgresql.org/docs/current/static/test-parser.html
Alternatively, change your tsvector-related trigger so that it matches e.g. \d{4}/\d+[a-zA-Z] or whatever seems most appropriate, and adds spaces accordingly, before converting it to a tsvector. Something as simple as the following might do the trick if you never need to store file names:
SELECT alias, description, token
FROM ts_debug(replace('Astronomic Event 2013/1434H - Aceh', '/', ' / '));