substring based on symbols and letters - postgresql

I'm trying to extract a substring from some values in Postgres, starting after certain letters and ending at a symbol. Please see the example below.
A sample row for the column:
'value: 12423, store: target, date: 2010-08-22'
How do I extract target from the column, so that the new column becomes 'target'?

substring('value: 12423, store: target, date: 2010-08-22' from 'store: ([^,]*),')

First split the string on the comma ',' producing a vector/array of 3 items.
Then split the second item in the vector on the colon ':' and read off the second part of that split. Consider the functions string_to_array() and split_part().
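Both suggestions can be sketched outside the database to check the logic. The Python below is only an illustration, not PostgreSQL code: the function names store_by_regex and store_by_split are made up for this example, and it assumes the key is always written as 'store: ' and is followed by a comma.

```python
import re

row = "value: 12423, store: target, date: 2010-08-22"

def store_by_regex(text):
    """Capture everything between 'store: ' and the next comma."""
    match = re.search(r"store: ([^,]*),", text)
    return match.group(1) if match else None

def store_by_split(text):
    """Split on commas, then split the second item on the colon."""
    second_item = text.split(",")[1]          # ' store: target'
    return second_item.split(":")[1].strip()  # 'target'

by_regex = store_by_regex(row)
by_split = store_by_split(row)
```

The split version depends on the store entry always being the second item, while the regex version only depends on the 'store: ' prefix.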


Azure Data Factory - Dynamic Skip Lines Expression

I am attempting to import a CSV into ADF; however, the file header is not the first line of the file. Its position is dynamic, so I need to match it based on the first column (e.g. "TestID,"), which is a string.
Example Data (Header is on Line 4)
Date:,01/05/2022
Time:,00:30:25
Test Temperature:,25C
TestID,StartTime,EndTime,Result
TID12345-01,00:45:30,00:47:12,Pass
TID12345-02,00:46:50,00:49:12,Fail
TID12345-03,00:48:20,00:52:17,Pass
TID12345-04,00:49:12,00:49:45,Pass
TID12345-05,00:50:22,00:51:55,Fail
I found this article, which addresses this issue; however, I am struggling to rewrite the expression so that it matches a string instead of an integer.
https://kromerbigdata.com/2019/09/28/adf-dynamic-skip-lines-find-data-with-variable-headers
First Expression
iif(!isNull(toInteger(left(toString(byPosition(1)),1))),toInteger(rownum),toInteger(0))
As the article states, this expression looks at the first character of each row and, if it is an integer, returns the row number (rownum).
How do I perform this check for a string (e.g. "TestID,")?
Many Thanks
Jonny
I think you want to treat the first line that starts with a string as your header, and not treat the preceding lines that start with numbers as the header. You can use the isNan function to check whether the first character is not a number (i.e. a string), as in the modified expression below:
iif(isNan(left(toString(byPosition(1)),1))
,toInteger(rownum)
,toInteger(0)
)
Following is a breakdown of the above expression:
left(toString(byPosition(1)),1): gets the first character from the left side of the first column.
isNan: checks whether the character is "not a number".
iif: if the character is not a number (true), returns rownum; otherwise (false), returns 0.
Alternatively, you can use a function such as isInteger() to check whether the first character is an integer and act accordingly.
Later, as explained in the cited article, you need to find the minimum rownum to determine how many lines to skip.
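In the sample data from the question every line begins with a letter, so a check on the first character alone may not single out the header there. A minimal sketch of the intended logic, matching the known first-column value "TestID" and then deriving the number of lines to skip, could look like the Python below. It is not an ADF expression; header_row_number is a hypothetical helper name, and the sketch assumes the header's first column is exactly "TestID".

```python
def header_row_number(lines, header_key="TestID"):
    """Return the 1-based row number of the first line whose first column
    equals header_key, or 0 if no line matches."""
    for rownum, line in enumerate(lines, start=1):
        first_column = line.split(",", 1)[0]
        if first_column == header_key:
            return rownum
    return 0

sample = [
    "Date:,01/05/2022",
    "Time:,00:30:25",
    "Test Temperature:,25C",
    "TestID,StartTime,EndTime,Result",
    "TID12345-01,00:45:30,00:47:12,Pass",
]

header_row = header_row_number(sample)
# Everything before the header row is skipped.
lines_to_skip = header_row - 1 if header_row else 0
```

Comparing the whole first column rather than a prefix keeps data rows such as TID12345-01 from being mistaken for the header.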
Hope it helps.

how to return empty field in Spark

I am trying to check for incomplete records and identify the bad records in Spark.
E.g. a sample test.txt file in record format, with columns separated by \t:
L1C1 L1C2 L1C3 L1C4
L2C1 L2C2 L2C3
L3C1 L3C2 L3C3 L3C4
scala> sc.textFile("test.txt").filter(_.split("\t").length < 4).collect.foreach(println)
L2C1 L2C2 L2C3
The second line is printed because it appears to have fewer columns.
How should I parse the file without ignoring the empty column at the end of the second line?
This happens because splitting a string in Scala removes trailing empty substrings.
The behavior is the same as in Java. To keep all the substrings, including trailing empty ones, pass a negative limit:
"L2C1\tL2C2\tL2C3\t".split("\t", -1)

How to delete space in character text?

I wrote code that automatically pulls time-related information from the system. As the table definition indicates, month names in t247 are fixed at 10 characters in length, which looks bad on the report screen.
I print this way:
WRITE : 'Bugün', t_month_names-ltx, ' ayının'.
CONCATENATE gv_words-word '''nci günü' INTO date.
CONCATENATE date ',' INTO date.
CONCATENATE date gv_year INTO date SEPARATED BY space.
TRANSLATE date TO LOWER CASE.
I tried CONDENSE t_month_names-ltx NO-GAPS. to delete the spaces, but it was not enough.
After WRITE, I was able to position the text statically by hard-coding the output column:
WRITE : 'Bugün', t_month_names-ltx.
WRITE : 14 'ayının'.
CONCATENATE gv_words-word '''nci günü' INTO date.
CONCATENATE date ',' INTO date.
CONCATENATE date gv_year INTO date SEPARATED BY space.
TRANSLATE date TO LOWER CASE.
But this is not the right approach. How do I achieve this dynamically?
You could use a temporary field of type STRING:
DATA l_month TYPE STRING.
l_month = t_month_names-ltx.
WRITE : 'Bugün', l_month.
WRITE : 14 'ayının'.
CONCATENATE gv_words-word '''nci günü' INTO date.
CONCATENATE date ',' INTO date.
CONCATENATE date gv_year INTO date SEPARATED BY space.
TRANSLATE date TO LOWER CASE.
You cannot delete trailing spaces from a TYPE C field, because it has a fixed length. The unused length is always filled with spaces.
But after you have assembled your string, you can use CONDENSE without NO-GAPS to collapse any run of more than one space within the string into a single space.
Add CONDENSE date. below the code you wrote and you should get the results you want.
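The effect described here can be pictured with a small Python sketch (not ABAP). It models a fixed-length field by padding with spaces and models CONDENSE without NO-GAPS as trimming and collapsing runs of spaces. The helpers pad_fixed and condense and the month value 'Mart' are made up for this illustration.

```python
def pad_fixed(value, length=10):
    """Model a fixed-length character field: pad with spaces to `length`."""
    return value.ljust(length)[:length]

def condense(text):
    """Trim the ends and collapse every run of spaces into one space."""
    return " ".join(text.split())

month = pad_fixed("Mart")              # 'Mart      ' (10 characters)
line = "Bugün " + month + " ayının"    # padding leaves a wide gap
condensed = condense(line)             # gap reduced to a single space
```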
Another option is to abandon CONCATENATE and use string templates (string literals within | symbols) for string assembly instead, which do not have the annoying habit of including trailing spaces of TYPE C fields:
DATA long_char TYPE C LENGTH 128.
long_char = 'long character field'.
WRITE |this is a { long_char } inserted without spaces|.
Output:
this is a long character field inserted without spaces

USQL Escape Quotes

I am new to Azure Data Lake Analytics. I am trying to load a CSV in which strings are double-quoted, and some rows have quotes inside a column.
For example
ID, BookName
1, "Life of Pi"
2, "Story about "Mr X""
When I try loading it, it fails on the second record and throws an error message.
1. I wonder if there is a way to fix this in the CSV file. Unfortunately, we cannot re-extract it from the source, as these are log files.
2. Is it possible to have ADLA ignore the bad rows and proceed with the rest of the records?
Execution failed with error '1_SV1_Extract Error :
'{"diagnosticCode":195887146,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_ROW_ERROR","message":"Error
occurred while extracting row after processing 9045 record(s) in the
vertex' input split. Column index: 9, column name:
'instancename'.","description":"","resolution":"","helpLink":"","details":"","internalDiagnostics":"","innerError":{"diagnosticCode":195887144,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_EXTRACT_INVALID_CHARACTER_AFTER_QUOTED_FIELD","message":"Invalid
character following the ending quote character in a quoted
field.","description":"Invalid character is detected following the
ending quote character in a quoted field. A column delimiter, row
delimiter or EOF is expected.\nThis error can occur if double-quotes
within the field are not correctly escaped as two
double-quotes.","resolution":"Column should be fully surrounded with
double-quotes and double-quotes within the field escaped as two
double-quotes."
As per the error message, if you are importing a quoted CSV that has quotes within some of the columns, then these need to be escaped as two double-quotes. In your particular example, your second row needs to be:
..."Life after death and ""good death"" models - a qualitative study",...
So one option is to fix up the original file on output. If you are not able to do this, then you can import all the columns as one column, use RegEx to fix up the quotes and output the file again, e.g.:
// Import records as one row then use RegEx to clean columns
@input =
EXTRACT oneCol string
FROM "/input/input132.csv"
USING Extractors.Text( '|', quoting: false );
// Fix up the quotes using RegEx
@output =
SELECT Regex.Replace(oneCol, "([^,])\"([^,])", "$1\"\"$2") AS cleanCol
FROM @input;
OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv(quoting : false);
The file will now import successfully.
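The regular expression used in the script can be tried on a single line outside U-SQL. The Python sketch below applies the same pattern and then parses the result with the standard csv module to confirm the quotes are balanced. It assumes there is no space between a comma and the opening quote of a field and that the inner quotes are not already doubled; escape_inner_quotes is a hypothetical helper name.

```python
import csv
import re

# A double quote that has a non-comma character on both sides is treated
# as a quote inside a field rather than a field boundary.
QUOTE_INSIDE_FIELD = re.compile(r'([^,])"([^,])')

def escape_inner_quotes(line):
    """Double every quote that sits inside a field."""
    return QUOTE_INSIDE_FIELD.sub(r'\1""\2', line)

raw = '2,"Story about "Mr X""'
cleaned = escape_inner_quotes(raw)

# Parse the cleaned line as CSV to check that it now reads as two fields.
fields = next(csv.reader([cleaned]))
```

A line without inner quotes, such as 1,"Life of Pi", passes through unchanged, because its quotes are adjacent to a comma or to the end of the line.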

reading via matlab a number after a specific string in a txt file

To re-explain my problem: in a large a.txt file I have
Amount of Food is 1
Desired Travel is 5
I need to read the 1 after the 'Amount of Food is ' expression and the 5 after the 'Desired Travel is' expression. Thanks again.
With regexpi you can simply look for numbers in your strings.
The syntax is as simple as this:
startIndex = regexpi(str,expression)
where the expression parameter is a regular expression (e.g. '\d*' to retrieve consecutive digits).
In your specific case a way to perform this with regular expressions would be:
First you have to decide what strings are valid in your search
for example:
firstpar = 'First parameter is [0-9]+';
means that you are looking for a string 'First parameter is '
that ends with a sequence of digits.
Then you could use regexp or regexpi in the following way:
results = regexp(mystring, firstpar, 'match');
Where mystring is the text you perform the search on and 'match' means that you want parts of the text as output, not indexes.
Now, results is a cell array with each cell containing a string that appeared in your text and fulfilled your firstpar definition. In order to extract just the numbers from a cell array of strings you could use regexp again, but now with the help of cellfun, which applies your command to each cell of a cell array in turn:
numbers = cellfun(@(x) str2num(regexp(x, '[0-9]+', 'match', 'once')), results);
numbers is an array of numbers that you were looking for.
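The same two-step idea (find the labelled text, then pull out the digits) can be sketched in Python for the two lines from the question. This is an illustration of the pattern rather than MATLAB code; number_after is a hypothetical helper name, and the sketch assumes each label is followed by a non-negative integer.

```python
import re

def number_after(text, label):
    """Return the integer that follows `label` in `text`, or None."""
    match = re.search(re.escape(label) + r"\s*([0-9]+)", text)
    return int(match.group(1)) if match else None

contents = "Amount of Food is 1\nDesired Travel is 5\n"

food = number_after(contents, "Amount of Food is")
travel = number_after(contents, "Desired Travel is")
```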
You can do the same for different string patterns. If you want more general string definitions (instead of the straightforward firstpar that we used here), read the MATLAB documentation about regular expressions (alexcasalboni pasted it in his comment), scroll down to Input Arguments and expand 'expressions'.
The difference between regexp and regexpi is that the latter is case insensitive.