OpenRefine: Fill down with increasing counter - autofill

Is it possible in OpenRefine to fill down blank cells with a counter instead of copying the top non-blank value?
Here is an example as typed text - imagine this as a column from top to bottom:
1
1
blank
1
blank
blank
blank
blank
blank
1
I would like to see the column filled as follows (again, imagine top to bottom):
1
1
2
1
2
3
4
5
6
1
Thanks, help is very much appreciated.

It's not really simple. You have to:
1. Replace the blanks with something else, such as an "x".
2. Create a single record for the entire dataset.
3. Use this Jython script:
import itertools
data = row['record']['cells']['YOUR COLUMN NAME']['value']
x = itertools.count(2)
liste = []
for i, el in enumerate(data):
    if el == "x":
        # blank placeholder: emit the next counter value
        liste.append(x.next())
    else:
        # non-blank cell: restart the counter and keep the value
        x = itertools.count(2)
        liste.append(el)
return ",".join([str(v) for v in liste])
4. Use "Blank down" to clear the duplicates.
5. Split the first multivalued cell.
Here is a screencast of the operations described above.
If you know a little Python, you can also transform your file using pandas. I do not know the most elegant way to do it, but this script should work.
import itertools
import pandas as pd

x = itertools.count(2)

def set_x():
    # restart the counter at 2
    global x
    x = itertools.count(2)

set_x()

def increase(value):
    if not value:
        # blank cell: emit the next counter value
        return next(x)
    else:
        # non-blank cell: restart the counter and keep the value
        set_x()
        return value

data = pd.read_csv("your_file.csv", na_values=['nan'], keep_default_na=False)
data['column 1'] = data['column 1'].apply(increase)
print(data)
data.to_csv("final_file.csv")

Here are two simple solutions using GREL.
Use records
You could move the column to the beginning so that OpenRefine uses the numbers as record boundaries. You might need to transform the column to text to really convince OpenRefine to treat it as records.
Then either add a new column or transform the existing one with the following expression.
1 + row.index - row.record.fromRowIndex
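To see what this expression computes, here is a minimal Python sketch of the same arithmetic (plain Python rather than GREL; the sample list is the question's data, with None standing in for a blank cell):
column = ["1", "1", None, "1", None, None]

from_row_index = 0
for row_index, value in enumerate(column):
    if value is not None:
        # a non-blank cell starts a new record, like record.fromRowIndex
        from_row_index = row_index
    print(1 + row_index - from_row_index)  # prints 1, 1, 2, 1, 2, 3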
Use record markers
In case you don't want to use records or don't have a static number, you can create a similar setup. Imagine you have an incomplete counter like in the following table and want to fill it.
Origin | Desired
1      | 1
       | 2
1      | 1
2      | 2
       | 3
1      | 1
To fill the missing cells, first add a new column based on your original column using the following expression, and name it record_row_index.
if(isNonBlank(value), row.index, "")
After that fill down the original column and the new column record_row_index.
Then create a new column based on the original filled column using the following expression.
value + row.index - cells["record_row_index"].value
Hint: the expression expects both columns to be of type number.
If one of them is of type text, you can either transform the column beforehand or use toNumber() in the expression.
The following table shows how these operations are working together.
Origin | Origin filled | row.index | record_row_index | Desired
1      | 1             | 0         | 0                | 1 + 0 - 0 = 1
       | 1             | 1         | 0                | 1 + 1 - 0 = 2
1      | 1             | 2         | 2                | 1 + 2 - 2 = 1
2      | 2             | 3         | 3                | 2 + 3 - 3 = 2
       | 2             | 4         | 3                | 2 + 4 - 3 = 3
1      | 1             | 5         | 5                | 1 + 5 - 5 = 1
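If you want to double-check this bookkeeping outside OpenRefine, here is a rough pandas sketch of the same record-marker recipe (a sketch only; the column name "counter" is illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({"counter": [1, None, 1, 2, None, 1]})
# record_row_index: the row index wherever the counter is non-blank
df["record_row_index"] = np.where(df["counter"].notna(), df.index, np.nan)
# fill down both columns, then apply value + row.index - record_row_index
df[["counter", "record_row_index"]] = df[["counter", "record_row_index"]].ffill()
df["desired"] = (df["counter"] + df.index - df["record_row_index"]).astype(int)
print(df["desired"].tolist())  # [1, 2, 1, 2, 3, 1]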

Related

str_detect for multiple patterns

I am using str_detect within the stringr package and I am having trouble searching a string with more than one pattern.
Here is the code I am using, however it is not returning anything even though my vector ("Notes-Title") contains these patterns.
filter(str_detect(`Notes-Title`, c("quantity","single")))
The logic I want to code is:
Search each row and filter it if it contains the string "quantity" or "single".
You need to use the | separator in your search pattern, all within a single quoted string.
> words <- c("quantity", "single", "double", "triple", "awful")
> set.seed(1234)
> df = tibble(col = sample(words,10, replace = TRUE))
> df
# A tibble: 10 x 1
col
<chr>
1 triple
2 single
3 awful
4 triple
5 quantity
6 awful
7 triple
8 single
9 single
10 triple
> df %>% filter(str_detect(col, "quantity|single"))
# A tibble: 4 x 1
col
<chr>
1 single
2 quantity
3 single
4 single
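As an aside, the same alternation trick carries over to other tools; for example, a rough pandas equivalent (the column name is illustrative):
import pandas as pd

df = pd.DataFrame({"col": ["triple", "single", "awful", "quantity"]})
# str.contains interprets the pattern as a regex, so | means OR here too
print(df[df["col"].str.contains("quantity|single")])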

Detect contiguous numbers - MATLAB

I coded a program that creates a bunch of binary numbers like this:
out = [0,1,1,0,1,1,1,0,0,0,1,0];
I want to check for the existence of nine 1s in a row in the above out; for example, when we have this in our output:
out_2 = [0,0,0,0,1,1,1,1,1,1,1,1,1];
or
out_3 = [0,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,0,0,0,1,1,0];
the condition variable should be set to 1. We don't know the exact position where the run of ones starts in the out variable; it is random. I only want to detect whether such a run exists (one occurrence or more).
PS: We are looking for a general answer that can also find runs of other repeated values (not only 1, and not only binary data; this is just an example).
You can use convolution to solve such r-contiguous detection cases.
Case #1 : To find contiguous 1s in a binary array -
check = any(conv(double(input_arr),ones(r,1))>=r)
Sample run -
input_arr =
0 0 0 0 1 1 1 1 1 1 1 1 1
r =
9
check =
1
Case #2 : For detecting any number as contiguous, you could modify it a bit, like so -
check = any(conv(double(diff(input_arr)==0),ones(1,r-1))>=r-1)
Sample run -
input_arr =
3 5 2 4 4 4 5 5 2 2
r =
3
check =
1
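If it helps to sanity-check the logic outside MATLAB, the same convolution idea translates to NumPy almost verbatim (a sketch; the function names are mine):
import numpy as np

def has_run_of_ones(arr, r):
    # a window of r ones sums to r exactly where r contiguous 1s occur
    return bool(np.any(np.convolve(arr, np.ones(r, dtype=int)) >= r))

def has_run_of_equal(arr, r):
    # diff(arr) == 0 marks equal neighbours; r-1 consecutive marks = r equal values
    eq = (np.diff(arr) == 0).astype(int)
    return bool(np.any(np.convolve(eq, np.ones(r - 1, dtype=int)) >= r - 1))

print(has_run_of_ones([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 9))  # True
print(has_run_of_equal([3, 5, 2, 4, 4, 4, 5, 5, 2, 2], 3))          # True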
To save Stack Overflow from further duplicates, also feel free to look into related problems -
Fast r-contiguous matching (based on location similarities).
r-contiguous matching, MATLAB.

How can I quickly extract rows based on content in a specific column in a fixed-width text file using command line tools?

I have a large text file (> 4 GB) in fixed-width format. I want to extract a subset of that file based on the content of specific columns. What would be the fastest way to do this?
For example the file will have the following format:
Column width 1 = 3
Column width 2 = 3
Column width 3 = 2
Column width 4 = 2
Column width 5 = 1
Column width 6 = 2
Column width 7 = 2
Column width 8 = 2
Column width 9 = 2
And a line of the file might look like:
150-9912 17 7 1 0 0
If I wanted to search based on the values of column 2 (e.g. where the value of column 2 == -99), what would be the most efficient way to do this? I have multiple files of ~4 GB, with close to 10 million lines in each. Appreciate the help!
Using GNU awk:
awk 'BEGIN{FIELDWIDTHS="3 3 2 2 1 2 2 2 2"} $2==-99'
The above will get you well on the way.
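If you would rather do the filtering in Python, a rough pandas sketch using read_fwf should also work (file names are illustrative; reading in chunks keeps a >4 GB file out of memory):
import pandas as pd

widths = [3, 3, 2, 2, 1, 2, 2, 2, 2]
chunks = pd.read_fwf("your_file.txt", widths=widths, header=None,
                     chunksize=1_000_000)
# keep only the rows whose second column equals -99
subset = pd.concat(chunk[chunk[1] == -99] for chunk in chunks)
subset.to_csv("subset.csv", index=False)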

select rows by comparing columns using HDFStore

How can I select rows by comparing two columns of an HDF5 file using pandas? The HDF5 file is too big to load into memory. For example, I want to select the rows where column A and column B are equal. The dataframe is saved in the file 'mydata.hdf5'. Thanks.
import pandas as pd
store = pd.HDFStore('mydata.hdf5')
df = store.select('mydf', where='A=B')
This doesn't work. I know that store.select('mydf', where='A==12') will work, but I want to compare columns A and B. The example data looks like this:
A B C
1 1 3
1 2 4
. . .
2 2 5
1 3 3
You cannot do this directly, but the following will work:
In [23]: df = pd.DataFrame({'A' : [1,2,3], 'B' : [2,2,2]})
In [24]: store = pd.HDFStore('test.h5',mode='w')
In [26]: store.append('df',df,data_columns=True)
In [27]: store.select('df')
Out[27]:
A B
0 1 2
1 2 2
2 3 2
In [28]: store.select_column('df','A') == store.select_column('df','B')
Out[28]:
0 False
1 True
2 False
dtype: bool
This should be pretty efficient.
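To pull only the matching rows back out of the store, one possibility (a sketch building on the session above; it relies on where accepting a list of row coordinates) is:
mask = store.select_column('df', 'A') == store.select_column('df', 'B')
matching = store.select('df', where=mask[mask].index)  # only rows where A == B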

Using SUM and UNIQUE to count occurrences of value within subset of a matrix

So, presume a matrix like so:
20 2
20 2
30 2
30 1
40 1
40 1
I want to count the number of times 1 occurs for each unique value of column 1. I could do this the long way by [sum(x(1:2,2)==1)] for each value, but I think this would be the perfect use for the UNIQUE function. How could I fix it so that I could get an output like this:
20 0
30 1
40 2
Sorry if the solution seems obvious, my grasp of loops is very poor.
Indeed unique is a good option:
u=unique(x(:,1))
res=arrayfun(@(y)length(x(x(:,1)==y & x(:,2)==1)),u)
Taking apart that last line:
arrayfun(fun,array) applies fun to each element in the array, and puts it in a new array, which it returns.
This function is the anonymous function @(y)length(x(x(:,1)==y & x(:,2)==1)), which finds the number of entries of x where the condition x(:,1)==y & x(:,2)==1 holds (this is called logical indexing). So for each of the unique elements, it counts the rows of x where the first column is the unique element and the second column is one.
Try this (as specified in this answer):
>> [c,~,d] = unique(a(a(:,2)==1))
c =
30
40
d =
1
3
>> counts = accumarray(d(:),1,[],@sum)
counts =
1
2
>> res = [c,counts]
Suppose you have an array of various integers in 'array'. The tabulate function will sort the unique values and count the occurrences:
table = tabulate(array)
Look for your counts in column 2 of table.
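For comparison, the same per-group count is short in pandas as well (a sketch; column names are made up):
import pandas as pd

x = pd.DataFrame([[20, 2], [20, 2], [30, 2], [30, 1], [40, 1], [40, 1]],
                 columns=["group", "val"])
# for each unique group value, count the rows where val == 1
print(x.groupby("group")["val"].apply(lambda s: int((s == 1).sum())))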