postgres substring split text using regex - postgresql

I have the following string pattern and I want to split the text into 4 fields.
NIFTY21JUN11100CE --> NIFTY, 21JUN, 11100, CE
In the above string, only two parts have a fixed format. For example, 21JUN represents the year and month and is always exactly 5 characters. The part before it is the name, which can be any number of characters. I think the regex for the year and month will be something like (([1-2][0-9]))(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)
The last 2 characters are also fixed, and their value can be either PE or CE. The value between 21JUN and CE|PE is the strike price; it is always numeric but can be any number of digits.
I want to split the string into these 4 fields and am struggling to get the regex right. Is anyone familiar with a Postgres command for this requirement?

You can use SELECT regexp_match('NIFTY21JUN11100CE','^(\D+)(\d{2}[A-Z]{3})(\d+)(PE|CE)$');
Step by step:
^ Beginning of the string
( start capture
\D+ one or more non-digit chars
) end capture
( start capture
\d{2} exactly 2 digits
[A-Z]{3} exactly 3 chars in the range from A to Z
) end capture
( start capture
\d+ one or more digit chars
) end capture
( start capture
PE|CE one of 'PE' or 'CE'
) end capture
$ end of the string
The year-month regexes from your question using character classes [1-2][0-9] and alternations (JAN|FEB|...) are a little bit more strict and could also be used.
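For anyone who wants to experiment with the pattern outside the database, here is a minimal sketch of the same split in Python rather than PostgreSQL. The names split_symbol, LOOSE and STRICT are made up for this illustration; the first pattern is the one from the answer, and the second swaps in the stricter year-month classes from the question.

```python
import re

# Pattern from the answer: name, 2 digits + 3 letters, strike price, PE/CE.
LOOSE = re.compile(r"(\D+)(\d{2}[A-Z]{3})(\d+)(PE|CE)")

# Stricter year-month part, built from the classes suggested in the question.
STRICT = re.compile(
    r"(\D+)"
    r"([1-2][0-9](?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC))"
    r"(\d+)(PE|CE)"
)

def split_symbol(symbol, pattern=LOOSE):
    """Return (name, year_month, strike, option_type), or None if no match."""
    m = pattern.fullmatch(symbol)  # fullmatch plays the role of ^ ... $
    return m.groups() if m else None

print(split_symbol("NIFTY21JUN11100CE"))
# ('NIFTY', '21JUN', '11100', 'CE')
```

With STRICT, a string such as NIFTY21XYZ11100CE is rejected, whereas LOOSE accepts any three capital letters as the month.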

Related

How to format decimal merge field to display numbers without extra padding?

I'm having trouble formatting a decimal number to be displayed with a thousand separator and a decimal separator if needed. Number can have up to three decimal digits. I have a feeling that I'm missing something very obvious here.
Basically:
1 -> 1
1.11 -> 1.11
1.111 -> 1.111
I have been using numeric picture codes to format the number, and so far I've tried these combinations:
\# ,0
\# ,0.000
\# ,#
\# ,#.###
\# #,#
\# #,#.###
Basically, for a value of 1.11 I get either 1 or 1.110 as the result.
Given your most recent comment (i.e. "The database returns a padded number 1.110 for example"), your data don't have 'up to three decimal digits'; they have exactly three decimal digits. If this is for a mailmerge, you could obtain both the thousands separator and suppression of trailing 0s with a field coded as:
{={MERGEFIELD MyField} \# ,0.###}
Where the number ends with a trailing decimal 0, you'll end up with a space where the 0 would otherwise be and, where it's an integer, you'll end up with a trailing decimal point. If you really want to omit all this trailing stuff, you should modify the data so it's stored in the appropriate format; otherwise some fairly complex field coding will be required to achieve your desired outcome:
{QUOTE{SET Val {MERGEFIELD MyField}}{IF{=INT(Val) \# 0.000}= {REF Val} {=INT(Val) \# ,0} {IF{=INT(Val*10)/10 \# 0.000}= {REF Val} {=Val \# ,0.0} {IF{=INT(Val*100)/100 \# 0.000}= {REF Val} {=Val \# ,0.00} {=Val \# ,0.000}}}}}
Note: The field brace pairs (i.e. '{ }') for the above examples are all created in the document itself, via Ctrl-F9 (Cmd-F9 on a Mac); you can't simply type them or copy & paste them from this message. Nor is it practical to add them via any of the standard Word dialogues. The spaces represented in the field constructions are all required.
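To make the decision chain in the nested IF field easier to follow, here is a rough sketch of the same logic in Python. This is not Word field code, and the function name format_trimmed is hypothetical. It checks whether the value is unchanged when truncated to 0, 1 or 2 decimals and picks the shortest format that loses nothing, falling back to 3 decimals.

```python
from decimal import Decimal

def format_trimmed(value: str) -> str:
    """Format with a thousands separator and no trailing decimal zeros."""
    val = Decimal(value)
    for places in range(3):
        scale = 10 ** places
        # Mirrors the field test INT(Val*scale)/scale = Val.
        if Decimal(int(val * scale)) / scale == val:
            return f"{val:,.{places}f}"
    return f"{val:,.3f}"

for sample in ("1.000", "1.110", "1.111", "1234.500"):
    print(sample, "->", format_trimmed(sample))
```

Decimal is used instead of float so that the equality tests are exact for values such as 1.110.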

how to remove # character from national data type in cobol

I am facing an issue while converting Unicode data into national characters.
When I convert the Unicode data to national using the NATIONAL-OF function, a junk character such as # is appended after the string.
E.g.
Ws-unicode pic X(200)
Ws-national pic N(600)
-- say the value in Ws-unicode is これらの変更は, received from the Java side
move function national-of ( Ws-unicode ,1208 ) to Ws-national.
-- after converting, the value looks like これらの変更は #
I do not want the extra # character added after conversion.
Please help me find a possible solution. I have tried replacing N'#' with a space using the INSPECT statement.
It worked well but failed in one specific scenario: if the input from the user itself contains #, the genuine # is also converted to a space.
Below is a snippet of code I used to convert EBCDIC to UTF. Before I was capturing string lengths, I was also getting # symbols:
STRING
FUNCTION DISPLAY-OF (
FUNCTION NATIONAL-OF (
WS-EBCDIC-STRING(1:WS-XML-EBCDIC-LENGTH)
WS-EBCDIC-CCSID
)
WS-UTF8-CCSID
)
DELIMITED BY SIZE
INTO WS-UTF8-STRING
WITH POINTER WS-XML-UTF8-LENGTH
END-STRING
SUBTRACT 1 FROM WS-XML-UTF8-LENGTH
What this code does is string the UTF-8 representation of the EBCDIC string into another variable. The WITH POINTER clause captures the new length of the string + 1 (+ 1 because the pointer is positioned at the next position after the end of the string).
Using this method, you should be able to know exactly how long the second string is and use that string with the exact length.
That should remove the unwanted #s.
EDIT:
One thing I forgot to mention, in my case, the # signs were actually EBCDIC low values when viewing the actual hex on the mainframe
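The length-tracking idea is language independent, so here is a small sketch of it in Python instead of COBOL. It assumes the padding in the fixed-size field is low values (NUL bytes), as in the EDIT above; a real field may be padded differently. The helper name decode_fixed and the 200-byte field size (taken from the PIC X(200) in the question) are illustrative only.

```python
def decode_fixed(buffer: bytes, length: int, encoding: str = "utf-8") -> str:
    """Decode only the first `length` bytes of a fixed-size buffer."""
    return buffer[:length].decode(encoding)

text = "これらの変更は"
payload = text.encode("utf-8")

# Simulate a fixed 200-byte field: real data followed by low-value padding.
field = payload.ljust(200, b"\x00")

whole = field.decode("utf-8")               # the padding is converted too
exact = decode_fixed(field, len(payload))   # only the real data is converted

print(repr(whole[-3:]))  # trailing padding characters are still present
print(exact == text)
```

Converting only the populated part means a genuine # inside the user's data is never touched, which was the problem with the blanket replace.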
Use INSPECT with REVERSE and stop after the first occurrence of #.

finding a comma in string

I receive values such as [23567,0,0,0,0,0] and [452221,0,0,0,0,0]. About 100 of these values are displayed continuously, and I want to display only the sensor value, i.e. 23567 for the first sample and 452221 for the second. For that I have written this code:
value = str2double(str(2:7));
So I want to find the comma in the output and display only the value before the first comma.
As proposed in a comment by excaza, MATLAB has dedicated functions, such as sscanf for such purposes.
sscanf(str,'[%d')
which matches but ignores the first [, and returns the next (i.e. the first) number as a double variable, and not as a string.
Still, I like the idea of using regular expressions to match the numbers. Instead of matching all zeros and commas, and replacing them by '' as proposed by Sardar_Usama, I would suggest directly matching the numbers using regexp.
You can return all numbers in str (still as string!) with
nums = regexp(str,'\d*','match')
and convert the first number to a double variable with
str2double(nums{1})
To match only the first number in str, we can use the regexp
nums = regexp(str,'\[(\d*),','tokens')
which finds a [ (escaped as \[ because [ otherwise opens a character class), then takes an arbitrary number of digits (0-9), and stops when it finds a ,. By enclosing the \d* in parentheses, only the parenthesized part is returned as a token, i.e. only the number without the [ and the ,.
Final Note: if you continue working with strings, you could/should consider the regexp solution. If you convert it to a double anyways, using sscanf is probably faster and easier.
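As a cross-check of the token pattern, here is the same idea sketched in Python rather than MATLAB. The function name first_value is invented for the example, and it requires at least one digit (\d+ instead of \d*), which is a small deviation from the pattern above.

```python
import re

def first_value(sample: str) -> int:
    """Return the number between the opening [ and the first comma."""
    m = re.match(r"\[(\d+),", sample)
    if m is None:
        raise ValueError(f"unexpected sample format: {sample!r}")
    return int(m.group(1))

print(first_value("[23567,0,0,0,0,0]"))
print(first_value("[452221,0,0,0,0,0]"))
```

Because the pattern does not depend on the number of digits, it avoids the fixed str(2:7) slice from the question, which only works for six-digit values.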
You can use regexprep as follows:
str='[23567,0,0,0,0,0]' ;
required=regexprep(str(2:end-1),',0','')
%Taking str(2:end-1) to exclude brackets, and then removing all ,0
If there can be values other than 0 after , , you can use the following more general approach instead:
required=regexprep(str(2:end-1),',[-+]?\d*\.?\d*','')

Encode a Date and a four digit number into a string with max 8 characters

I have a datetime and a four-digit number, and I need to encode them into an 8-character, case-insensitive ASCII string.
The four-digit number is not actually arbitrary: only certain values (about 20 or so) occur, such as 2513, 2595, 2579, ...
My current approach is to use Base36 encoding. Further, I have a dictionary for the four-digit numbers that maps like this:
2513 -> '00'
2595 -> '01'
...
The first two characters of the resulting string are used for this. The remaining six characters encode a Unix timestamp with the seconds stripped (I only need minute resolution) in Base36.
So, (2513, 07.01.2015) maps to '000E3HEU'.
My question is whether anyone can think of an even more compact encoding.
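For reference, the current scheme can be sketched in Python as follows. Only the two dictionary entries shown above are known, so CODES is a partial, hypothetical table, and the names encode and to_base36 are invented for the example. The question gives only the date 07.01.2015; the time 11:18 UTC used below is what decoding '0E3HEU' as minutes since the Unix epoch yields, so treat it as an assumption.

```python
from datetime import datetime, timezone

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Hypothetical lookup table; only these two entries appear in the question.
CODES = {2513: "00", 2595: "01"}

def to_base36(n: int, width: int) -> str:
    """Encode a non-negative integer in Base36, left-padded with zeros."""
    digits = ""
    while n:
        n, r = divmod(n, 36)
        digits = ALPHABET[r] + digits
    return digits.rjust(width, "0")

def encode(number: int, when: datetime) -> str:
    """Two characters for the number, six for the timestamp in minutes."""
    minutes = int(when.timestamp()) // 60  # strip the seconds
    return CODES[number] + to_base36(minutes, 6)

print(encode(2513, datetime(2015, 1, 7, 11, 18, tzinfo=timezone.utc)))
# 000E3HEU
```

Six Base36 characters hold values up to 36^6 - 1 = 2,176,782,335, far more than the roughly 24 million minutes elapsed since 1970, and two Base36 characters allow 1,296 codes for about 20 numbers, so both fields have spare capacity.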

Formatting dates [duplicate]

April 9, 2012 can be written in any of these ways:
4912
4/9/12
4-9-12
4 9 12
04-9-12
04-09-12
4 9 2012
4 09 2012
(I think you get the point)
For those of you that don't understand, the rules are:
1. Dates may or may not have ` `, `-` or `/` between them
2. The year can be written as 2 digits (assumed to be dates in the range of [2000, 2099] inclusive) or 4 digits
3. One digit month/days may or may not have leading zeroes.
How would you go about problem solving this to format the dates into 04/09/12?
I know the dates can be ambiguous, i.e., 12112 can be 12/1/12 or 1/21/12, but assume the smallest month possible.
This actually is something that regexes are good at; making an assumption, moving forward with it, then backtracking if necessary to get a successful match.
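s{
    \A
    ( 1[0-2] | 0?[1-9] )              # month, two-digit forms tried first
    [-/ ]?                            # optional separator
    ( 3[01] | [12][0-9] | 0?[1-9] )   # day, two-digit forms tried first
    [-/ ]?                            # optional separator
    ( (?: [0-9]{2} ){1,2} )           # year, 2 or 4 digits
    \z
}
{
    sprintf '%02u/%02u/%04u', $1, $2, ( length $3 == 4 ? $3 : 2000+$3 )
}xe;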
The range checks present, while not determined by the value of the month, should be sufficient to pick a good date from the ambiguous cases (where there is a good date).
Note that it is important to try two digit month and days first; otherwise 111111 becomes 1-1-1111, not the presumably intended 11-11-11. But this means 11111 will prefer to be 11-1-11, not 1-11-11.
If a valid day of month check is needed, it should be performed after reformatting.
Notes:
s{}{} is a substitution using curly braces instead of / to delimit the parts of the regex to avoid having to escape the /, and also because using paired delimiters allows opening and closing both the pattern and replacement parts, which looks nice to me.
\A matches the start of the string being matched; \z matches the end. ^ and $ are often used for this, but can have slightly different meanings in some cases; I prefer these since they always only mean one thing.
The x flag on the end says this is an extended regex that can have extra whitespace or comments that are ignored, so that it is more readable. (Whitespace inside a character class isn't ignored.) The e flag says the replacement part isn't a string, it is code to execute.
'%02u/%02u/%04u' is a printf format, used for taking values and formatting them in a particular way (here a two-digit month, a two-digit day and a four-digit year); see http://perldoc.perl.org/functions/sprintf.html.
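For comparison, here is a rough port of the same substitution to Python, under the assumption that the output should carry a four-digit year as the sprintf call in the Perl code produces. The names DATE and normalize are made up for this sketch; re.VERBOSE plays the role of the x flag and fullmatch replaces \A and \z.

```python
import re

DATE = re.compile(
    r"""
    ( 1[0-2] | 0?[1-9] )              # month, two-digit forms tried first
    [-/ ]?                            # optional separator
    ( 3[01] | [12][0-9] | 0?[1-9] )   # day, two-digit forms tried first
    [-/ ]?                            # optional separator
    ( (?: [0-9]{2} ){1,2} )           # year, 2 or 4 digits
    """,
    re.VERBOSE,
)

def normalize(text):
    """Return the date as MM/DD/YYYY, or None if the text does not match."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    month, day, year = m.groups()
    full_year = int(year) if len(year) == 4 else 2000 + int(year)
    return f"{int(month):02d}/{int(day):02d}/{full_year:04d}"

for sample in ("4912", "4/9/12", "04-09-12", "4 9 2012", "11111"):
    print(sample, "->", normalize(sample))
```

As in the Perl version, backtracking resolves the ambiguous inputs: 11111 comes out as 11/01/2011 because the two-digit month is tried first.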
Install Date::Calc.
On Ubuntu, this is the libdate-calc-perl package.
It should be able to read all of those dates (except 4912, 4 9 2012 and 4 09 2012) and then output them in a common format.