Postgres regexp_replace: inability to replace source text with first captured group - postgresql

Using PostgreSQL, I am unable to design the correct regex pattern to achieve the desired output of an SQL statement that uses regexp_replace.
My source text consists of several scattered blocks of text of the form 'PU*' followed by a date string in the form 'YYYY-MM', for example 'PU*2020-11'. These blocks are surrounded by strings of unpredictable, arbitrary text (including other instances of 'PU*' followed by the same date format, such as 'PU*2017-07'), white space, and line feeds.
My desire is to replace the entire source text with the FIRST instance of the 'YYYY-MM' text pattern. In the above example, the desired output would be '2020-11'.
Currently, my search pattern results in the correct replacement text in place of the first capturing group, but unfortunately, all of the text AFTER the first capturing group also inadvertently appears in the output, which is not the desired output.
Specifically:
Version: postgres (PostgreSQL) 13.0
A more complex example of source text:
First line
Exec committee
PU*2020-08
PU*2019-09--cancelled
PU*2017-10
added by Terranze
My pattern so far:
(\s|\S)*?PU\*(\d{4}-\d{2})(\s|\S*)*
Current SQL statement:
select regexp_replace('First line\nExec committee; PU*2020-08\nPU*2019-09\nPU*2017-10\n\nadded by Terranze\n', '(\s|\S)*?PU\*(\d{4}-\d{2})(\s|\S*)*', '\2') as _regex;
Current output on https://regex101.com/
2020-08
Current output on psql
_regex
───────────────────────────────────────────────────────────────────
2020-08\nPU*2019-09--cancelled\nPU*2017-10\n\nadded by Terranze\n
(1 row)
Desired output:
2020-08
Any help appreciated. Thanks--

How about this expression:
'^.*?PU\*(\d{4}-\d{2}).*$'
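A note on why the original pattern leaves the trailing text behind, offered as a sketch rather than a definitive diagnosis: in PostgreSQL's regex engine, a pattern with no top-level | takes its greediness from its first quantified atom. The original pattern starts with (\s|\S)*?, so the whole match is as short as possible and the trailing group consumes nothing. Anchoring with ^ and $ forces the match to cover the entire input. Since this pattern has only one capturing group, the replacement is '\1' rather than '\2'. The input below is an assumption modeled on the question's sample, written as an E'' literal so that \n is a real line feed. The substring() variant is an alternative that is not part of the answer above.

```sql
-- Assumes standard_conforming_strings = on (the default), so the
-- backslashes in the pattern and replacement are passed through as-is.
-- In PostgreSQL, . also matches line feeds by default, so ^.*? and .*$
-- span the whole multi-line input.
SELECT regexp_replace(
    E'First line\nExec committee\nPU*2020-08\nPU*2019-09--cancelled\nPU*2017-10\n\nadded by Terranze\n',
    '^.*?PU\*(\d{4}-\d{2}).*$',
    '\1'
) AS _regex;
-- expected: 2020-08

-- Alternative: substring(... from pattern) returns the text matched by the
-- first parenthesized group of the first match, so nothing needs replacing.
SELECT substring(
    E'First line\nExec committee\nPU*2020-08\nPU*2019-09--cancelled\nPU*2017-10\n\nadded by Terranze\n'
    from 'PU\*(\d{4}-\d{2})'
) AS _regex;
-- expected: 2020-08
```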

Related

Dealing With a Weird delimited data format in Talend or other tool?

I have a delimited format that I am not familiar with. It is the output of a chat-related application. Can anyone tell me what this format is, whether it is standard, and whether there is a way to convert it to CSV with text qualifiers?
"NumValue1|""TextValue2""|""TextValue3""|""TextValue"""
My assumptions about this data format are that each row is enclosed in " ... ",
the text qualifiers are "" text "",
and the delimiter is |.
Also, what is the value of delimiting in this format as opposed to, say, CSV with text qualifiers? The text values don't seem to have " in them.
Talend is my preferred tool, but I am open to using anything to solve this problem.
I think this is a nested structure. I think the original data was a pipe delimited quote enclosed CSV file.
NumValue1|"TextValue2"|"TextValue3"|"TextValue"
Then they wanted to enclose the whole line in quotes, so the original quotes had to be escaped. They did that by doubling them (a common technique in SQL).
My quick and dirty suggestion would be to create a workflow in Talend that does:
tFileInputFullRow -> tJavaRow -> tFileOutputDelimited (by default tFileOutputDelimited is buggy, so it will leave your line intact; at least in Talend 5 it was like that)
row2.line = row1.line.substring(1, row1.line.length() - 1).replace("\"\"", "\"");
Then you can do a tFileInputDelimited with | and "

Can COMMENTS in Postgres contain line breaks?

I have a very long comment I want to add to a Postgres table.
Since I do not want a very long single line as a comment I want to split it into several lines.
Is this possible? \n does not work since Postgres does not use the backslash as an escape character.
Just write a multi-line string:
COMMENT ON TABLE foo IS 'This
comment
is stored
in multiple lines';
You can also embed \n escape sequences in “extended” string constants that start with E:
COMMENT ON TABLE foo IS E'A comment\nwith three\nlines.';
You can use automatic concatenation of adjacent string literals together with E'\n' escape sequences for linebreaks:
COMMENT ON TABLE foo IS E''
'This comment is stored in multiple lines. But only some'
'end with linebreaks like this one.\n'
'You can even create empty lines to simulate paragraphs:'
'\n\n'
'This would be the second paragraph, then.';
Details:
Note the initial E'' at the end of the first line. This is essential to make all the adjacent string literals that follow it use the extended string literal syntax, providing us with the option to write \n for a linebreak. Of course, that E could also be placed into the second line instead, at the start of the real string: E'This comment …'. Me putting it into the first line is just source code aesthetics … character alignment and stuff.
I consider this solution slightly better than multi-line strings (proposed in another answer here) because it lets the comment fit within the typical line width limit and the indentation requirements of source files. This is useful when you keep your SQL in well-formatted files under version control, that is, when you treat it like any other source code. Indenting a multi-line string, on the other hand, puts lots of additional whitespace into the live table comment.
Note for OP: When you say "I do not want a very long single line as a comment", it is not clear if you don't want that long line in your .sql source code file, or if you don't want it in the table comment of the live table, such as when seen in a database admin tool. It does not really matter, as this solution gives you tools for both purposes: use adjacent string literals to fit your line into the source code file, without affecting line breaks in the live table comment; and use \n to create line breaks and empty lines in the live table comment.
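To check what actually ends up stored, the comment can be read back with obj_description. A minimal sketch, assuming a throwaway table named foo as in the examples above (the column definition is made up for illustration):

```sql
-- Hypothetical table, created only for this illustration.
CREATE TABLE foo (id integer);

-- Adjacent literals are concatenated; only the \n escapes produce
-- line breaks in the stored comment.
COMMENT ON TABLE foo IS E''
    'First paragraph, wrapped in the source file '
    'but stored as a single line.\n'
    '\n'
    'Second paragraph.';

-- Read the stored comment back; 'pg_class' is the catalog that holds tables.
SELECT obj_description('foo'::regclass, 'pg_class');
```

The stored comment should contain three lines: the first paragraph, an empty line, and the second paragraph, with none of the source file's indentation.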

How to extract string from sentence using sub-string and position function?

I have to extract a value from a string, and I am working in a Cognos application that doesn't support regex. It has some built-in functions like substring and position.
My string is similar to
/content/folder[#name='ab_Salary Reports']/folder[#name='INT Salary Reports']/folder[#name='INT Sal Sche']/jobDefinition[#name='Salary Rep R025']
I have to extract Salary Rep R025, i.e. the last name value.
A static substring will not work because the string is variable.
Use the position function to locate the starting and ending point of your target substring. Try
position('/jobDefinition', [pathstring])
combined with substring:
substring( [pathstring], position('/jobDefinition', [pathstring]) + 22, length([pathstring]) - (position('/jobDefinition', [pathstring]) + 22) )
This will start 22 characters after where it finds /jobDefinition, meaning it will start just past '/jobDefinition[#name='', and will proceed for the remaining length of the string, determined by subtracting the starting point from the full length.
You may need to adjust by +1 or -1 in order to include or exclude your quotes.
Also note that this is using Report Studio functions. The source for Cognos reports is queries on tables, so you may have native functions available depending on your source. For example, most of the reports I work with come out of an Oracle database, so I can use oracle string functions instead of Report Studio functions. They work better, and are processed on the database side rather than on the Cognos Dispatcher, which is always faster.
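If the report's data source is a database, the same position-plus-substring arithmetic can be pushed down to it. A sketch in PostgreSQL syntax, which is an assumption since the question is about Cognos functions. The CTE t and the column name pathstring are hypothetical. The offset 22 is the length of /jobDefinition[#name=' and the length is shortened by two more characters to drop the closing '].

```sql
-- Dollar quoting ($$...$$) avoids having to double the single quotes
-- that appear inside the path.
WITH t(pathstring) AS (
    VALUES ($$/content/folder[#name='ab_Salary Reports']/folder[#name='INT Salary Reports']/folder[#name='INT Sal Sche']/jobDefinition[#name='Salary Rep R025']$$)
)
SELECT substr(
           pathstring,
           -- first character after /jobDefinition[#name='
           position('/jobDefinition' in pathstring) + 22,
           -- remaining length, minus the trailing ']
           length(pathstring) - position('/jobDefinition' in pathstring) - 23
       ) AS job_name
FROM t;
-- expected: Salary Rep R025
```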

PostgreSQL Trimming Leading and Trailing Characters: = and "

I'm working to build an import tool that utilizes a quoted CSV file. However, several of the fields in the CSV file are reported as such:
"=""38000"""
Where 38000 is the data I need. The data integration software I use (Talend 6.11) already strips the leading and trailing double quotes for me (so, "38000" becomes 38000), but I can't find a way to get rid of those others.
So, essentially, I need "=""38000""" to become "38000" where the leading "=" is removed and the trailing "" is removed.
Is there a TRIM function that can accomplish this for me? Perhaps there is a method in Talend that can do this?
As the other answer stated, you could do that operation in SQL. Or, you could do it in Java, Groovy, etc, within Talend. However, if there is an existing Talend component which does the job, my preference is to use it. That leads to faster development, potentially less testing, and easier maintenance. Having said that, it is important to review all the components which are available, so you know what's available to you.
You can use the Talend component tReplace to inspect each of the input columns you want to trim of quotes and equal signs. A single tReplace component can do search and replace operations on multiple input columns. If all of the replacements are related to each other, I would keep them within a single tReplace. When it gets to the point of doing unrelated replacements, I might place those within a new tReplace so that logical operations are organized and grouped together.
tReplace
For a given Input Column
search for "=", replace with ""
search for "\"", replace with ""
Something like that:
SELECT format( '"%s"', trim( both '"=' from '"=""38000"""' ) );
-[ RECORD 1 ]---
format | "38000"
First, trim() removes the leading and trailing " and = characters, so the result is simply 38000.
Second, format() adds the double quotes back to get the desired end result.
Alternatively, you can use a regexp or other Postgres string functions.
See more:
https://www.postgresql.org/docs/current/static/functions-string.html
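As a sketch of the regexp alternative mentioned above (the pattern is my own assumption, not taken from the answer): strip any run of = or " anchored at either end of the value, then wrap the result in double quotes again.

```sql
-- ^[="]+ matches the leading run, [="]+$ the trailing run;
-- the 'g' flag makes regexp_replace remove both matches, not just the first.
SELECT format('"%s"',
              regexp_replace('"=""38000"""', '^[="]+|[="]+$', '', 'g')
       ) AS cleaned;
-- expected: "38000"
```

Unlike replacing every " and = in the field, this leaves any such characters in the middle of the value untouched.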

Find all references to a specific table column in Oracle (10g) [duplicate]

Is there a query I can run to search all packages to see if a particular table and/or column is used in the package? There are too many packages to open each one and do a find on the value(s) I'm looking for.
You can do this:
select *
from user_source
where upper(text) like upper('%SOMETEXT%');
Alternatively, SQL Developer has a built-in report to do this under:
View > Reports > Data Dictionary Reports > PLSQL > Search Source Code
The 11G docs for USER_SOURCE are here
You can also use the *_DEPENDENCIES views, for example:
SELECT owner, NAME
FROM dba_dependencies
WHERE referenced_owner = :table_owner
AND referenced_name = :table_name
AND TYPE IN ('PACKAGE', 'PACKAGE BODY');
Sometimes the column you are looking for may be part of the name of many other things that you are not interested in.
For example I was recently looking for a column called "BQR", which also forms part of many other columns such as "BQR_OWNER", "PROP_BQR", etc.
So I would like to have the checkbox that word processors have to indicate "Whole words only".
Unfortunately LIKE has no such functionality, but REGEXP_LIKE can help.
SELECT *
FROM user_source
WHERE regexp_like(text, '(\s|\.|,|^)bqr(\s|,|$)');
This is the regular expression to find this column and exclude the other columns with "BQR" as part of the name:
(\s|\.|,|^)bqr(\s|,|$)
The regular expression matches white-space (\s), or (|) period (.), or (|) comma (,), or (|) start-of-line (^), followed by "bqr", followed by white-space, comma or end-of-line ($).
By the way, if you need to add other characters such as "(" or ")" because the column may be used as "UPPER(bqr)", then those options can be added to the lists of before and after characters.
(\s|\(|\.|,|^)bqr(\s|,|\)|$)
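One caveat: regexp_like is usually case-sensitive (its default follows the session's NLS_SORT setting), while source code often mixes upper and lower case. A sketch, assuming the same example column name bqr, that passes the 'i' match parameter and returns the object name and line number so the hits are easy to locate:

```sql
-- 'i' requests case-insensitive matching, so BQR, Bqr and bqr all match.
SELECT name, type, line, text
FROM user_source
WHERE regexp_like(text, '(\s|\(|\.|,|^)bqr(\s|,|\)|$)', 'i')
ORDER BY name, type, line;
```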