Azure Data Factory - Dynamic Skip Lines Expression - azure-data-factory

I am attempting to import a CSV into ADF however the file header is not the first line of the file. It is dynamic therefore I need to match it based on the first column (e.g "TestID,") which is a string.
Example Data (Header is on Line 4)
Date:,01/05/2022
Time:,00:30:25
Test Temperature:,25C
TestID,StartTime,EndTime,Result
TID12345-01,00:45:30,00:47:12,Pass
TID12345-02,00:46:50,00:49:12,Fail
TID12345-03,00:48:20,00:52:17,Pass
TID12345-04,00:49:12,00:49:45,Pass
TID12345-05,00:50:22,00:51:55,Fail
I found this article which addresses this issue however I am struggling to rewrite the expression from using an integer to using a string.
https://kromerbigdata.com/2019/09/28/adf-dynamic-skip-lines-find-data-with-variable-headers
First Expression
iif(!isNull(toInteger(left(toString(byPosition(1)),1))),toInteger(rownum),toInteger(0))
As the article states, this expression looks at the first character of each row and if it is an integer it will return the row number (rownum)
How do I perform this action for a string (e.g "TestID,")
Many Thanks
Jonny

I think you want to consider first line that starts with string as your header and preceding lines that starts with numbers should not be considered as header. You can use isNan function to check if the first character is Not a number(i.e. string) as seen in the below modified expression:
iif(isNan(left(toString(byPosition(1)),1))
,toInteger(rownum)
,toInteger(0)
)
Following is a breakdown of the above expression:
left(toString(byPosition(1)),1): gets first character fron left side of the first column.
isNan: checks if the character is "not a number".
iif: not a number, true then return rownum, false then return 0.
Or you can also use functions like isInteger() to check if the first character is an integer or not and perform actions accordingly.
Later on as explained in the cited article you need to find minimum rownum to skip.
Hope it helps.

Related

how to return empty field in Spark

I am trying to check incomplete record and identify the bad record in Spark.
eg. sample test.txt file, it is in record format, columns separated by \t
L1C1 L1C2 L1C3 L1C4
L2C1 L2C2 L2C3
L3C1 L3C2 L3C3 L3C4
scala> sc.textFile("test.txt").filter(_.split("\t").length < 4).collect.foreach(println)
L2C1 L2C2 L2C3
The second line is printing as having less number of columns.
How should i parse without ignoring the empty column after in second line
It is the split string in scala removes trailing empty substrings.
The behavior is similar to Java, to let all the substrings checked we can call as
"L2C1 L2C2 L2C3 ".split("\t",-1)

Can I write a PCRE conditional that only needs the no-match part?

I am trying to create a regular expression to determine if a string contains a number for an SQL statement. If the value is numeric, then I want to add 1 to it. If the number is not numeric, I want to return a 1. More or less. Here is the SQL:
SELECT
field,
CASE
WHEN regexp_like(field, '^ *\d*\.?\d* *$') THEN dec(field) + 1
ELSE 1
END nextnumber
FROM mytable
This actually works, and returns something like this:
INVALID 1
00000 1
00001E 1
00379 380
00013 14
99904 99905
But to push the envelope of understanding, what if I wanted to cover negative numbers, or those with a positive sign. The sign would have to immediately precede or follow the number, but not both, and I would not want to allow white space between the sign and the number.
I came up with a conditional expression with a capture group to capture the sign on the front of the number to determine if a sign was allowed on the end, but it seems a little awkward to handle given I don't really need a yes-pattern.
Here is the modified regex: ^ ([+-]?)*\d*\.?\d*(?(1) *|[+-]? *)$
This works at regex101.com, but in order for it to work I need to have something before the pipe, so I have to duplicate the next pattern in both the yes-pattern and the no-pattern.
All that background for this question: How can I avoid that duplication?
EDIT: DB2 for i uses International Components for Unicode to provide regular expression processing. It turns out that this library does not support conditionals like PRCE, so I changed the tags on this question. The answer given by Wiktor Stribiżew provides a working alternative to the conditional by using a negative lookahead.
You do not have to duplicate the end pattern, just move it outside the conditional:
^ *([+-])?\d*\.?\d*(?(1)|[+-]?) *$
See the regex demo. So, the yes-part is empty, and the no-part has an optional pattern.
You may also solve it with a mere negative lookahead:
^ *([+-](?!.*[-+]))?\d*\.?\d*[+-]? *$
See another regex demo. Here, ([+-](?!.*[-+]))? matches (optionally) a + or - that are not followed with any 0+ char followed with another + or -.

how to remove # character from national data type in cobol

i am facing issue while converting unicode data into national characters.
When i convert the Unicode data into national using national-of function, some junk character like # is appended after the string.
E.g
Ws-unicode pic X(200)
Ws-national pic N(600)
--let the value in Ws-Unicode is これらの変更は. getting from java end.
move function national-of ( Ws-unicode ,1208 ) to Ws-national.
--after converting value is like これらの変更は #.
i do not want the extra # character added after conversion.
please help me to find out the possible solution, i have tried to replace N'#' with space using inspect clause.
it worked well but failed in some specific scenario like if we have # in input from user end. in that case genuine # also converted to space.
Below is a snippet of code I used to convert EBCDIC to UTF. Before I was capturing string lengths, I was also getting # symbols:
STRING
FUNCTION DISPLAY-OF (
FUNCTION NATIONAL-OF (
WS-EBCDIC-STRING(1:WS-XML-EBCDIC-LENGTH)
WS-EBCDIC-CCSID
)
WS-UTF8-CCSID
)
DELIMITED BY SIZE
INTO WS-UTF8-STRING
WITH POINTER WS-XML-UTF8-LENGTH
END-STRING
SUBTRACT 1 FROM WS-XML-UTF8-LENGTH
What this code does is string the UTF8 representation of the EBCIDIC string into another variable. The WITH POINTER clause will capture the new length of the string + 1 (+ 1 because the pointer is positioned to the next position after the string ended).
Using this method, you should be able to know exactly how long second string is and use that string with the exact length.
That should remove the unwanted #s.
EDIT:
One thing I forgot to mention, in my case, the # signs were actually EBCDIC low values when viewing the actual hex on the mainframe
Use inspect with reverse and stop after first occurence of #

finding a comma in string

[23567,0,0,0,0,0] and other value is [452221,0,0,0,0,0] and the value should be contineously displaying about 100 values and then i want to display only the sensor value like in first sample 23567 and in second sample 452221 , only the these values have to display . For that I have written a code
value = str2double(str(2:7));see here my attempt
so I want to find the comma in the output and only display the value before first comma
As proposed in a comment by excaza, MATLAB has dedicated functions, such as sscanf for such purposes.
sscanf(str,'[%d')
which matches but ignores the first [, and returns the next (i.e. the first) number as a double variable, and not as a string.
Still, I like the idea of using regular expressions to match the numbers. Instead of matching all zeros and commas, and replacing them by '' as proposed by Sardar_Usama, I would suggest directly matching the numbers using regexp.
You can return all numbers in str (still as string!) with
nums = regexp(str,'\d*','match')
and convert the first number to a double variable with
str2double(nums{1})
To match only the first number in str, we can use the regexp
nums = regexp(str,'[(\d*),','tokens')
which finds a [, then takes an arbitrary number of decimals (0-9), and stops when it finds a ,. By enclosing the \d* in brackets, only the parts in brackets are returned, i.e. only the numbers without [ and ,.
Final Note: if you continue working with strings, you could/should consider the regexp solution. If you convert it to a double anyways, using sscanf is probably faster and easier.
You can use regexprep as follows:
str='[23567,0,0,0,0,0]' ;
required=regexprep(str(2:end-1),',0','')
%Taking str(2:end-1) to exclude brackets, and then removing all ,0
If there can be values other than 0 after , , you can use the following more general approach instead:
required=regexprep(str(2:end-1),',[-+]?\d*\.?\d*','')

Checking the format of a string in Matlab

So I'm reading multiple text files in Matlab that have, in their first columns, a column of "times". These times are either in the format 'MM:SS.milliseconds' (sorry if that's not the proper way to express it) where for example the string '29:59.9' would be (29*60)+(59)+(.9) = 1799.9 seconds, or in the format of straight seconds.milliseconds, where '29.9' would mean 29.9 seconds. The format is the same for a single file, but varies across different files. Since I would like the times to be in the second format, I would like to check if the format of the strings match the first format. If it doesn't match, then convert it, otherwise, continue. The code below is my code to convert, so my question is how do I approach checking the format of the string? In otherwords, I need some condition for an if statement to check if the format is wrong.
%% Modify the textdata to convert time to seconds
timearray = textdata(2:end, 1);
if (timearray(1, 1) %{has format 'MM.SS.millisecond}%)
datev = datevec(timearray);
newtime = (datev(:, 5)*60) + (datev(:, 6));
elseif(timearray(1, 1) %{has format 'SS.millisecond}%)
newtime = timearray;
You can use regular expressions to help you out. Regular expressions are methods of specifying how to search for particular patterns in strings. As such, you want to find if a string follows the formats of either:
xx:xx.x
or:
xx.x
The regular expression syntax for each of these is defined as the following:
^[0-9]+:[0-9]+\.[0-9]+
^[0-9]+\.[0-9]+
Let's step through how each of these work.
For the first one, the ^[0-9]+ means that the string should start with any number (^[0-9]) and the + means that there should be at least one number. As such, 1, 2, ... 10, ... 20, ... etc. is valid syntax for this beginning. After the number should be separated by a :, followed by another sequence of numbers of at least one or more. After, there is a . that separates them, then this is followed by another sequence of numbers. Notice how I used \. to specify the . character. Using . by itself means that the character is a wildcard. This is obviously not what you want, so if you want to specify the actual . character, you need to prepend a \ to the ..
For the second one, it's almost the same as the first one. However, there is no : delimiter, and we only have the . to work with.
To invoke regular expressions, use the regexp command in MATLAB. It is done using:
ind = regexp(str, expression);
str represents the string you want to check, and expression is a regular expression that we talked about above. You need to make sure you encapsulate your expression using single quotes. The regular expression is taken in as a string. ind would this return the starting index of your string of where the match was found. As such, when we search for a particular format, ind should either be 1 indicating that we found this search at the beginning of the string, or it returns empty ([]) if it didn't find a match. Here's a reproducible example for you:
B = {'29:59.9', '29.9', '45:56.8', '24.5'};
for k = 1 : numel(B)
if (regexp(B{k}, '^[0-9]+:[0-9]+\.[0-9]+') == 1)
disp('I''m the first case!');
elseif (regexp(B{k}, '^[0-9]+\.[0-9]+') == 1)
disp('I''m the second case!');
end
end
As such, the code should print out I'm the first case! if it follows the format of the first case, and it should print I'm the second case! if it follows the format of the second case. As such, by running this code, we get:
I'm the first case!
I'm the second case!
I'm the first case!
I'm the second case!
Without knowing how your strings are formatted, I can't do the rest of it for you, but this should be a good start for you.