How can I quickly extract rows based on content in specific columns of a fixed-width text file using command line tools? - command-line

I have a large text file (> 4 GB) which is in fixed-width format. I want to get a subset of that file based on the content of specific columns. What would be the fastest way to do this?
For example the file will have the following format:
Column width 1 = 3
Column width 2 = 3
Column width 3 = 2
Column width 4 = 2
Column width 5 = 1
Column width 6 = 2
Column width 7 = 2
Column width 8 = 2
Column width 9 = 2
And a line of the file might look like:
150-9912 17 7 1 0 0
If I wanted to search based on the values of column 2 (e.g. where the value of column 2 == -99), what would be the most efficient way to do this? I have multiple files ~4 GB in size with close to 10 million lines in each file. Appreciate the help!

Using GNU awk (FIELDWIDTHS is a GNU awk extension):
awk 'BEGIN{FIELDWIDTHS="3 3 2 2 1 2 2 2 2"} $2==-99' file
The above will get you well on the way.
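If GNU awk is not available, the same filter is easy to sketch in plain Python using string slicing (a minimal sketch; the file names input.txt and subset.txt are made up):

# Column 2 occupies characters 3..5 (0-based), given the widths 3 3 2 2 1 2 2 2 2.
with open("input.txt") as src, open("subset.txt", "w") as dst:
    for line in src:
        if line[3:6] == "-99":   # value of column 2
            dst.write(line)

Reading line by line keeps memory use flat, which matters for 4 GB files.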

str_detect for multiple patterns

I am using str_detect within the stringr package and I am having trouble searching a string with more than one pattern.
Here is the code I am using; however, it is not returning anything even though my vector ("Notes-Title") contains these patterns.
filter(str_detect(`Notes-Title`, c("quantity","single")))
The logic I want to code is:
Search each row and filter it if it contains the string "quantity" or "single".
You need to use the | (alternation) separator in your search pattern, all within one set of "".
> library(tidyverse)  # provides tibble(), filter() and str_detect()
> words <- c("quantity", "single", "double", "triple", "awful")
> set.seed(1234)
> df = tibble(col = sample(words,10, replace = TRUE))
> df
# A tibble: 10 x 1
col
<chr>
1 triple
2 single
3 awful
4 triple
5 quantity
6 awful
7 triple
8 single
9 single
10 triple
> df %>% filter(str_detect(col, "quantity|single"))
# A tibble: 4 x 1
col
<chr>
1 single
2 quantity
3 single
4 single

OpenRefine: Fill down with increasing counter

Is it possible in OpenRefine to fill down blank cells with a counter instead of copying the top non-blank value?
Here is the example as typed text - imagine this as a column from top to bottom:
1
1
blank
1
blank
blank
blank
blank
blank
1
I would like to see the column filled as follows (again, imagine top to bottom):
1
1
2
1
2
3
4
5
6
1
Thanks, help is very much appreciated.
It's not really simple. You have to:
1. Replace the blanks with something else, such as an "x".
2. Create a unique record for the entire dataset.
3. Use this Jython script:
import itertools
data = row['record']['cells']['YOUR COLUMN NAME']['value']
x = itertools.count(2)           # counter starting at 2
liste = []
for i, el in enumerate(data):
    if el == "x":
        liste.append(x.next())   # former blank: take the next counter value
    else:
        x = itertools.count(2)   # real value: restart the counter...
        liste.append(el)         # ...and keep the value itself
return ",".join([str(v) for v in liste])
4. Use Blank down to clear duplicates.
5. Split the first multivalued cell.
Here is a screencast of the operations described above.
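To see the counter logic in isolation, here is a plain-Python sketch of what the Jython step computes, using the example column from above ("x" standing in for a former blank):

import itertools

data = ["1", "1", "x", "1", "x", "x", "x", "x", "x", "1"]
counter = itertools.count(2)
out = []
for el in data:
    if el == "x":
        out.append(str(next(counter)))    # blank: keep counting
    else:
        counter = itertools.count(2)      # real value: restart at 2
        out.append(el)
print(out)  # ['1', '1', '2', '1', '2', '3', '4', '5', '6', '1']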
If you know a little Python, you can also transform your file using pandas. I do not know the most elegant way to do it, but this script should work.
import itertools
import pandas as pd

def set_x():
    # (re)start the global counter at 2
    global x
    x = itertools.count(2)

set_x()

def increase(value):
    # empty cells get the next counter value; real values restart the counter
    if not value:
        return next(x)
    else:
        set_x()
        return value

# keep_default_na=False makes empty cells arrive as "" rather than NaN
data = pd.read_csv("your_file.csv", na_values=['nan'], keep_default_na=False)
data['column 1'] = data['column 1'].apply(lambda row: increase(row))
print(data)
data.to_csv("final_file.csv")
Here are two simple solutions using GREL.
Use records
You could move the column to the beginning, telling OpenRefine to use the numbers as records. You might need to transform the column to text to really convince OpenRefine to use it as records.
Then either add a new column or transform the existing one with the following expression.
1 + row.index - row.record.fromRowIndex
Use record markers
In case you don't want to use records or don't have a static number, you can create a similar setup. Imagine you have an incomplete counter like in the following table and want to fill it.
Origin    Desired
1         1
(blank)   2
1         1
2         2
(blank)   3
1         1
To fill the missing cells, first add a new column based on your original column using the following expression and name it record_row_index.
if(isNonBlank(value), row.index, "")
After that fill down the original column and the new column record_row_index.
Then create a new column based on the original filled column using the following expression.
value + row.index - cells["record_row_index"].value
Hint: the expression is expecting both columns to be of type number.
If one of them is of type text, you can either transform the column beforehand or use toNumber() in the expression.
The following table shows how these operations work together.
Origin    Origin filled   row.index   record_row_index   Desired
1         1               0           0                  1 + 0 - 0 = 1
(blank)   1               1           0                  1 + 1 - 0 = 2
1         1               2           2                  1 + 2 - 2 = 1
2         2               3           3                  2 + 3 - 3 = 2
(blank)   2               4           3                  2 + 4 - 3 = 3
1         1               5           5                  1 + 5 - 5 = 1
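For a quick sanity check of this arithmetic outside OpenRefine, the same steps can be sketched in pandas (the sample values mirror the table above):

import pandas as pd

origin = pd.Series([1, None, 1, 2, None, 1])
filled = origin.ffill()                  # "fill down" of the original column
marker = origin.notna()                  # rows where a record starts
record_row_index = pd.Series(origin.index).where(marker).ffill()
desired = filled + origin.index - record_row_index
print(desired.tolist())                  # [1.0, 2.0, 1.0, 2.0, 3.0, 1.0]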

Google Sheets IMPORTDATA() with no header

I'm looking over the documentation for IMPORTDATA() to remove the header row, but it looks like I cannot do that.
How would I remove a header row from an external data source I'm bringing into Google Sheets using IMPORTDATA()?
Example data:
A B C
1 2 3
4 5 6
7 8 9
But I only want:
1 2 3
4 5 6
7 8 9
Thanks
=QUERY(IMPORTDATA(<myUrl>), "SELECT * OFFSET 1", 0)
This selects all but the first row of the CSV. It works because the 0 argument makes QUERY treat the header as an ordinary data row, which OFFSET 1 then skips.
=QUERY(IMPORTDATA(<myUrl>), "SELECT * OFFSET 1", 0)
This works.
But this one does not:
=QUERY(IMPORTDATA(<myUrl>), "SELECT *", 1)
Clearly the optional [headers] parameter in QUERY() has a specialized meaning that has nothing to do with eliminating the header row(s).
From: https://support.google.com/docs/answer/3093343?hl=en:
headers - [ OPTIONAL ] - The number of header rows at the top of data. If omitted or set to -1, the value is guessed based on the content of data.
The docs should clarify what that parameter does to the data, i.e., that it will merge the specified number of rows into one header row.

create a new matrix from values obtained iterating through other matrices

In Matlab I have 4 matrices which are all 1 (row) by 4 (columns): ABCD, EFGH, IJKL, MNOP.
Their names are also stored in a list:
Stock_List2 = {'ABCD' 'EFGH' 'IJKL' 'MNOP'}, which is a 1 by 4 cell.
I want to iterate through the list and create a new matrix called "display" which takes the values of the individual matrices and places them underneath each other.
I am trying something like:
for e = 1:length(Stock_List2)
    display(e) = eval(strcat(Stock_List2)(e))
end
However, I am getting the following error, which truthfully may well just mean that I'm way off the mark:
Error: ()-indexing must appear last in an index expression.
As an example, if the original matrices are as follows:
ABCD 1 2 3 4
EFGH 5 6 7 8
IJKL 9 8 7 6
MNOP 5 4 3 2
I would like the final output, i.e. the 'display' matrix, to be a 4 by 4 matrix looking like:
display
1 2 3 4
5 6 7 8
9 8 7 6
5 4 3 2
If I understood right, you want to vertically concatenate the matrices ABCD, EFGH, IJKL and MNOP, saving them in the matrix "display".
You could do:
display = [ABCD; EFGH; IJKL; MNOP]
or:
for i = 1:length(Stock_List2)
    display(i,:) = eval(Stock_List2{i});  % look up each matrix by the name stored in the cell
end
(Note that "display" shadows a built-in MATLAB function of the same name, so a different variable name is safer in real code.)
Apologies if what I wanted wasn't clear - I've got the following from a colleague which achieves the desired result:
for e = 1:length(Stock_List2)
    eval(strcat('display_mat(e,:) = ', Stock_List2{e}));
end
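As an aside, eval-style name lookup is usually avoidable by keeping the data in an explicit name-to-array mapping. Here is a sketch of that pattern in Python/numpy, with values mirroring the example above:

import numpy as np

# a dict plays the role that eval plays above: look up an array by its name
arrays = {
    "ABCD": np.array([1, 2, 3, 4]),
    "EFGH": np.array([5, 6, 7, 8]),
    "IJKL": np.array([9, 8, 7, 6]),
    "MNOP": np.array([5, 4, 3, 2]),
}
stock_list = ["ABCD", "EFGH", "IJKL", "MNOP"]
display_mat = np.vstack([arrays[name] for name in stock_list])
print(display_mat)   # 4 x 4 matrix, one source array per row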

Problems Using Textscan to Read Multiple Lines

I'm a bit new to data import using Matlab.
Basically, I have an ASCII file. It has 13 header lines, along with 765 columns and ~3500 rows of data. I am attempting to import the data into a 3500 x 765 matrix in Matlab. I've tried the following:
fileID = fopen('filename');
formatspec = [repmat('%f ', [1,765])];
raw_data=textscan(fileID,formatspec, 'Headerlines',13,'delimiter','\t');
It successfully skips the 13 header lines. However, it gives me only a 1 x 765 matrix containing the data from the first row.
Perhaps I have misunderstood just how I am supposed to use textscan, so any help in getting my other ~3499 rows of data would be very well appreciated.
~Thank You
NOTE
The data file itself is formatted as follows. The first 13 lines do not contain the data itself. All lines after that contain sets of data similar to what is pasted below, extending for 700+ columns and 3000+ rows.
Wyko ASCII Data File Format 0 1 1
X Size 3571
Y Size 765
Block Name Type Length Value
Wavelength 7 4 72.482628
Aspect 7 4 1
Pixel_size 7 4 0.00196
StageY 7 4 -0.048055
Magnification 8 8 5.05
StageX 7 4 0.214484
ScannerPosition 7 4 3490.000732
ScannerSpeed 7 4 3.165393
RAW_DATA 3 10927260
-10976.61035 -10977.07324 -10981.07422 -10985.6084 ...
-10967.41309 -10963.31836 -10966.75195 -10980.40723 ...
-10969.08496 -10976.03711 -10976.62988 -10964.23731 ...
-10974.12695 -10976.61133 -10979.2627 -10973.57813 ...
-10969.21094 -10966.56543 -10973.74512 -10983.41797 ...
-10970.18359 -10980.82715 -10968.00195 -10975.58594 ...
-10980.41016 -10982.39356 -10982.74316 -10974.51563 ...
-10972.31641 -10984.00488 -10987.89453 -10976.23633 ...
I think the following should work, but I don't have Matlab on this machine to test it out.
fileID = fopen('filename');
formatspec = [repmat('%f ', [1,765])];
raw_data = textscan(fileID, formatspec, 'Headerlines', 13, 'delimiter', '\t');
while ~feof(fileID)
    new_data = textscan(fileID, formatspec, 'delimiter', '\t');
    raw_data = [raw_data; new_data];  % append the next block of rows
end
fclose(fileID);
Note that this is not a particularly efficient way to do it. If your header lines give you the size of your array, you may want to use zeros to create an array of the appropriate size and then read the data into your array.
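For comparison, in case Python is an option, the preallocate-then-fill idea looks roughly like this in numpy (a rough sketch: the sizes come from the X Size / Y Size header lines, tab-separated values are assumed, and the filename is made up):

import numpy as np

nrows, ncols = 3571, 765             # from the "X Size" / "Y Size" header lines
data = np.zeros((nrows, ncols))      # preallocate instead of growing row by row

with open("filename.asc") as f:
    for _ in range(13):              # skip the 13 header lines
        next(f)
    for i, line in enumerate(f):
        data[i, :] = np.array(line.split("\t"), dtype=float)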