OpenRefine - replacing string value in one column based on the value in another - data-cleaning

I have a large CSV which contains information about how a collection is divided up. For example one column contains information about the top level category, another about the sub-category and there can be quite a few of these depending on sub-classifications.
In OpenRefine these look like this (for example):
||field 1 || field 2 || field 3
||I am a section || I am a section with a subsection || I am a section with a subsection with another subsection
In order to be able to correctly split these out into top level and subsections I thought perhaps I could use the replace function to remove the value of field1 from the value of field 2 and onwards. This would leave me with
||field 1 || field 2 || field 3
||I am a section || with a subsection || with another subsection
My questions are:
Is this the right approach or is there something more elegant?
If it is, how do I use the replace function to do this dynamically across the entire CSV?

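You can reference another column with a GREL expression of the form: cells['field 2'].value
For example, to remove the text of field 2 from field 3, apply a transform on the field 3 column that replaces the field 2 value with nothing (using ''). The expression is: value.replace(cells['field 2'].value,'')
The same pattern applied on the field 2 column with cells['field 1'].value removes the field 1 text from field 2.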
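The same idea as a minimal Python sketch, outside OpenRefine: for each row, the text of one column is removed from the column to its right. The names rows and strip_parent are hypothetical, and the trailing strip() is an extra step that is not part of the GREL expression; it is only there to drop the leftover leading space.

```python
def strip_parent(row):
    """Remove each column's text from the column immediately to its right."""
    cleaned = [row[0]]
    for parent, child in zip(row, row[1:]):
        # Same spirit as GREL value.replace(cells[...].value, ''),
        # always using the original (uncleaned) value of the left column.
        cleaned.append(child.replace(parent, "").strip())
    return cleaned


rows = [[
    "I am a section",
    "I am a section with a subsection",
    "I am a section with a subsection with another subsection",
]]
result = [strip_parent(r) for r in rows]
```

In OpenRefine itself the equivalent would be one transform per column, each referencing the column to its left.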

Related

Azure Data Factory - Dynamic Skip Lines Expression

I am attempting to import a CSV into ADF; however, the file header is not the first line of the file. Its position is dynamic, so I need to match it based on the first column (e.g. "TestID,"), which is a string.
Example Data (Header is on Line 4)
Date:,01/05/2022
Time:,00:30:25
Test Temperature:,25C
TestID,StartTime,EndTime,Result
TID12345-01,00:45:30,00:47:12,Pass
TID12345-02,00:46:50,00:49:12,Fail
TID12345-03,00:48:20,00:52:17,Pass
TID12345-04,00:49:12,00:49:45,Pass
TID12345-05,00:50:22,00:51:55,Fail
I found this article, which addresses this issue; however, I am struggling to rewrite the expression to use a string instead of an integer.
https://kromerbigdata.com/2019/09/28/adf-dynamic-skip-lines-find-data-with-variable-headers
First Expression
iif(!isNull(toInteger(left(toString(byPosition(1)),1))),toInteger(rownum),toInteger(0))
As the article states, this expression looks at the first character of each row and if it is an integer it will return the row number (rownum)
How do I perform this action for a string (e.g. "TestID,")?
Many Thanks
Jonny
I think you want to treat the first line that starts with a string as your header, and not treat the preceding lines that start with numbers as headers. You can use the isNan function to check whether the first character is not a number (i.e. a string), as in the modified expression below:
iif(isNan(left(toString(byPosition(1)),1))
,toInteger(rownum)
,toInteger(0)
)
Following is a breakdown of the above expression:
left(toString(byPosition(1)),1): gets the first character from the left side of the first column.
isNan: checks whether that character is "not a number".
iif: if the character is not a number (true), return rownum; otherwise (false), return 0.
Or you can also use functions like isInteger() to check if the first character is an integer or not and perform actions accordingly.
Later on as explained in the cited article you need to find minimum rownum to skip.
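As a rough illustration of the flag-then-take-the-minimum pattern, here is a minimal Python sketch (not ADF expression syntax). It assumes the header is identified by the first column being exactly "TestID", as in the question's sample; header_row_number, SAMPLE and lines_to_skip are hypothetical names.

```python
import csv
import io

# Shortened copy of the sample data from the question.
SAMPLE = """Date:,01/05/2022
Time:,00:30:25
Test Temperature:,25C
TestID,StartTime,EndTime,Result
TID12345-01,00:45:30,00:47:12,Pass
TID12345-02,00:46:50,00:49:12,Fail
"""


def header_row_number(text, marker="TestID"):
    """Return the 1-based number of the first row whose first column equals marker, or 0."""
    flagged = []
    for rownum, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        # Mirror the iif(...) pattern: the row number on a match, 0 otherwise.
        flagged.append(rownum if row and row[0] == marker else 0)
    matches = [n for n in flagged if n > 0]
    # The smallest non-zero flag is the header row.
    return min(matches) if matches else 0


header_at = header_row_number(SAMPLE)
lines_to_skip = header_at - 1 if header_at else 0
```

With the sample above the header is on row 4, so three lines precede it.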
Hope it helps.

Find rows where string contains certain character at specific place

I have a field in my database that contains 10 characters:
E.g.: 1234567891
I want to find the rows where the field has, for example, the numbers 8 and 9 in places 5 and 6.
So for example,
if the rows are
a) 1234567891
b) 1234897891
c) 1234877891
I only want b) returned in my select.
The type of the field is string/character varying.
I have tried using:
where field like '%89%'
but that won't work, because I need it to be 89 at a specific place in the string.
The fastest solution would be
WHERE substr(field, 5, 2) = '89'
If the positions are not adjacent, you end up with two conditions joined with AND.
You should be able to evaluate the single character using the underscore(_) character. So you should be able to use it as follows.
where field like '____89%'
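Both predicates for "89 in places 5 and 6" can be tried against the three example rows with a small Python sketch. It uses an in-memory SQLite database as a stand-in for the real database, which is an assumption; the table name t is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (field TEXT)")
conn.executemany(
    "INSERT INTO t (field) VALUES (?)",
    [("1234567891",), ("1234897891",), ("1234877891",)],
)

# substr is 1-based: start at position 5 and take 2 characters.
by_substr = [r[0] for r in conn.execute(
    "SELECT field FROM t WHERE substr(field, 5, 2) = '89'")]

# Each underscore matches exactly one character, so four of them cover positions 1-4.
by_like = [r[0] for r in conn.execute(
    "SELECT field FROM t WHERE field LIKE '____89%'")]

conn.close()
```

Both queries select only row b) from the example.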

converting 1x1 matrix to a variable

I read the data from a CSV which contains two columns: id, which is text/string, and cancer, which is 1/0. Please see the code below.
M = readtable('data.csv');
I try to access the very first value using
row = M(n,1); % It's from the ID column, which is text
But it comes back in the form of a 1x1 matrix, and I am unable to put it in a single variable.
For example, after the above line runs I want row to contain a string, like row = 'patientID'. Is there any way to convert it into a single value?
Use row = M{n,1}. Note the curly braces.
The curly braces say "get the contents of the table", as opposed to the parentheses you had been using, which say "get me a portion of the table, as a table".

Matlab split text column in a table

I have a table object in MATLAB with a text column. This text column is a "tag" and contains underscores to split the tag.
I'd like to create a column with the second element of the tag. I used strsplit but it didn't work. I also tried regexp, but it gives me a cell object with 126 cell objects inside, and I don't know how to extract the second element of every cell.
Any suggestion?
Example:
a = {'a_b'; 'a_c';'a_n';'a_t'}
t = table(a)
I just want a vector with the second element.
Thanks.
How about
t=[t rowfun(@(x) x{1}(3),t)]
with 1 being the column and 3 being the element you want. For string parts of undefined length it gets a little bit more tricky:
t=[t rowfun(@(X) X{1}(strfind(X{1},'_')+1:end),t,'OutputFormat','cell')];
strfind() finds the position of the '_' element, so (find+1:end) is the rest of the string. As the parts can be of different lengths, the output has to be a cell, which is then added to the table. If the column changes, you have to adapt the code in both {1}.
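For comparison, the "everything after the underscore" logic can be written as a short Python sketch (Python, not MATLAB). It assumes every tag contains at least one underscore; tags and second_parts are hypothetical names.

```python
# Same example tags as in the question.
tags = ["a_b", "a_c", "a_n", "a_t"]

# Split once on the first underscore and keep what follows it.
second_parts = [tag.split("_", 1)[1] for tag in tags]
```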

How to show several fields values in one textField

Can anyone help me show multiple DB field values in one field?
Say I have 3 DB fields:
Name
Address
Age
I want to display all 3 fields in the same field:
John Peter 28.
I tried placing 3 fields next to each other and it did work, but when the text wraps it looks really bad:
Name
Jo.pe.28
hn te
r
My requirement is to show the data in one text field, for example: John.Peter.26
If you want to put them on one line (which I guess is the case), it's straightforward.
Put this expression in a text field: $F{Name} + "." + $F{Address} + "." + $F{Age}.toString()
Or you can use concat (I personally don't like the syntax; it takes more effort to read): $F{Name}.concat(".").concat($F{Address}).concat(".").concat($F{Age}.toString())
The SQL Method
Why not concatenate all 3 fields in the query itself, like this (assuming you are on Postgres):
select (name || '.' || address || '.' || age::text) as data from my_table
In iReport
As suggested,
$F{Name} + "." + $F{Address} + "." + $F{Age}.toString()
also works if you need to do it from the report.
Make sure that all your fields are of the same data type.
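The query-side concatenation can be tried out with a small Python sketch. It uses an in-memory SQLite database in place of Postgres, which is an assumption, and so writes the cast as CAST(age AS TEXT). The row values come from the John.Peter.26 example in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT, address TEXT, age INTEGER)")
conn.execute("INSERT INTO my_table VALUES ('John', 'Peter', 26)")

# Cast the numeric column to text so that all three operands are strings.
row = conn.execute(
    "SELECT name || '.' || address || '.' || CAST(age AS TEXT) AS data FROM my_table"
).fetchone()
data = row[0]

conn.close()
```

The report then only needs a single field bound to data.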