split a line into its components - perl

I need to split the lines of an input file into its columns.
ATOM 0 HB3 ALA C 999 28.811 -7.680 12.279 1.00 57.53 H
ATOM 7637 N PRO C1000 27.299 -5.667 10.647 1.00216.82 N
The code I have works fine, as long as the 6th column is <1000, or shorter than 4 digits:
($ATOM, $atom_num, $atom_type, $res, $chain, $res_num) = split(" ", $pdb)
However as soon as column 6 reaches 1000, it will no longer discriminate the two columns. I am no expert in perl, but the code I am dealing with is perl, so I need to figure out how to split this e.g. by the number of digits of each column.
Any suggestions?

I solved it by using unpack and defining the length of each column.
$format = 'A6 A6 A5 A4 A1 A5';
($ATOM, $atom_num, $atom_type, $res, $chain, $res_num) = unpack($format, $pdb);
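For readers who want to check the fixed-width idea outside Perl, here is a minimal Python sketch of the same approach. The widths are copied from the unpack format above; the helper name `split_fixed` and the sample line are made up for illustration, and the sample is padded to exactly those widths rather than taken from a real PDB file.

```python
def split_fixed(line, widths):
    """Cut `line` into consecutive fields of the given widths and trim padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# Widths copied from the unpack format 'A6 A6 A5 A4 A1 A5'.
PDB_WIDTHS = [6, 6, 5, 4, 1, 5]

# Hypothetical sample line, padded here to exactly those widths.
sample = "ATOM  " + "7637  " + "N    " + "PRO " + "C" + "1000 "

print(split_fixed(sample, PDB_WIDTHS))
```

Because the cut positions come from the widths and not from whitespace, the chain letter and a four-digit residue number stay separate even when they touch.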

Related

How to convert spaces into new line using perl?

I have the following input which is stored in single scalar variable named as $var1.
Input(i.e stored in $var1)
Gain Dead_coverage Export_control Functional_coverage Function_logic top dac_decoder Datapath System_Level Black_DV Sync_logic temp1 temp2 temp3 temp4 temp5 temp6 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263
Expected output:
Gain
Dead_coverage
Export_control
Functional_coverage
Function_logic
top
dac_decoder
Datapath
System_Level
Black_DV
Sync_logic
temp1
temp2
temp3
temp4
temp5
temp6
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
My code:
I had tried the following regular expression.
$var1=tr{\s}{\n};
The above expression does not produce my expected output.
Note: the numbers may range up to n, and the words may start or end with upper-case or lower-case letters. Either way, I need output like the expected output above. Which regular expression can I use for that?
Requirements:
1. Replace each space with a newline.
2. Numbers (i.e. 123456789101112.....) should be split as follows:
1
2
3
4
5
6
7
8
9
10
11
12
.
.
.
so on,...
After 9, the remaining numbers should be read as two-digit values.
tr is a transliteration. That only works with individual characters, not patterns. You need to use s/// with the /g modifier.
$var1 =~ s/\s/\n/g;
You can also do this with split and join.
$var1 = join "\n", split / /, $var1;
It shouldn't make a difference in terms of performance, even if there are a lot of strings.
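The substitutions above cover the first requirement only. As a rough sketch of both requirements together, here is one way it could look in Python. It assumes the digit run always counts up from 1 (1, 2, 3, ..., 9, 10, 11, ...), which is how the question describes it; the function names are invented for this example.

```python
def split_counting_run(digits):
    """Split '123...91011...' into '1', '2', ..., assuming it counts up from 1."""
    parts = []
    pos = 0
    expected = 1
    while pos < len(digits):
        token = str(expected)
        if not digits.startswith(token, pos):
            # Not a clean counting sequence: keep the remainder as one piece.
            parts.append(digits[pos:])
            break
        parts.append(token)
        pos += len(token)
        expected += 1
    return parts


def to_lines(text):
    """Put each whitespace-separated token, and each counted number, on its own line."""
    lines = []
    for token in text.split():
        if token.isdigit():
            lines.extend(split_counting_run(token))
        else:
            lines.append(token)
    return "\n".join(lines)


print(to_lines("Gain Dead_coverage temp6 123456789101112"))
```

A plain regular expression cannot know where 9 ends and 10 begins, which is why the sketch walks the digit run with an explicit counter.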

MATLAB writing csv with mixed alphanumeric strings, scalar arrays, nans

Ok, coming from Python and never having used MATLAB before, it seems like it is unnecessarily hard to write data to a csv using MATLAB...
So my data looks like this:
col1 A2A B2 CC3 D5
asd189 123 33 71119 18291
as33d 1311 31 NaN 1011
asd189 NaN 44 79 191
It has N header columns that are made of alphanumeric strings.
It has a leftmost column of length M which is made of alphanumeric strings.
It has an (M-1) x (N-1) array of NUMERIC data, with possible NaNs.
Can you please provide code to write this to a csv? I cannot use the xlswrite function because I'm on a cluster without Excel installed. Really just want to get on with the actual data analysis. Thanks
You can only write matrices (not cell arrays) directly using csvwrite, and as you say you need Excel installed for xlswrite, so that leaves you with low level operations. You can see a walkthrough for writing to text files here, and code for your example below:
% Initialise example cell array
M = {'col1', 'A2A', 'B2', 'CC3', 'D5'
'asd189', 123, 33, 71119, 18291
'as33d', 1311, 31, NaN, 1011
'asd189', NaN, 44, 79, 191};
% Open a file for writing to (doesn't have to already exist, can specify full directory)
fID = fopen('test.csv','w');
% Write header line, formatted as strings with comma delimiter. Note \r\n for new line
fprintf(fID, [repmat('%s, ', 1, size(M,2)-1),'%s\r\n'], M{1,:});
% Loop through other rows
for row = 2:size(M,1)
% Write each line of cell array, with first column formatted as string
% and other columns formatted as floats
fprintf(fID, ['%s, ', repmat('%f, ', 1, size(M,2)-2),'%f\r\n'], M{row,:});
end
% Close file after writing
fclose(fID);
Use writetable. It makes writing to CSV (or to an Excel file, or to other text-delimited file formats) much easier than using csvwrite, or xlswrite, or low-level commands such as fprintf.
>> t = table({'asd189';'as33d';'asd189'},[123;1311;NaN],[33;31;44],[71119;NaN;79],[18291;1011;191]);
>> t.Properties.VariableNames = {'col1','A2A','B2','CC3','D5'}
t =
col1 A2A B2 CC3 D5
________ ____ __ _____ _____
'asd189' 123 33 71119 18291
'as33d' 1311 31 NaN 1011
'asd189' NaN 44 79 191
>> writetable(t,'myfile.csv')
If your data is currently not stored as a table (maybe it's in an array or cell array), it's pretty easy to convert to a table using utility functions such as array2table or cell2table. You will only pay a small time penalty for doing this.
PS - you don't need Excel to be installed in order to write to an Excel file. You may not be able to read them afterwards, but MATLAB can still write them. But it sounds like you'd prefer .csv anyway.
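Since the question mentions coming from Python, here is a small sketch of the same layout (string header, string first column, numeric body with NaNs) written with Python's standard csv module, for comparison only. The function name `rows_to_csv` is made up, and writing missing values as the literal text NaN is an assumption about the desired output.

```python
import csv
import io
import math


def rows_to_csv(header, rows):
    """Return CSV text: a header row, then rows of one label plus numeric cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\r\n")
    writer.writerow(header)
    for label, *values in rows:
        # Write missing values as the literal text NaN; other numbers as-is.
        cells = ["NaN" if math.isnan(v) else v for v in values]
        writer.writerow([label] + cells)
    return buffer.getvalue()


header = ["col1", "A2A", "B2", "CC3", "D5"]
rows = [
    ["asd189", 123, 33, 71119, 18291],
    ["as33d", 1311, 31, float("nan"), 1011],
    ["asd189", float("nan"), 44, 79, 191],
]
print(rows_to_csv(header, rows), end="")
```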

skip header in non-rectangular matrix

I have consecutive .dat files which I want to read and input into a single matrix by concatenating the files vertically. The code I have so far works fine for simple numeric files with only tabs as delimiter.
import=[];
data=[];
for i = 1:32
data1=[import dlmread(sprintf('%d.dat',i))];
data=vertcat(data, data1);
clear data1;
end
and I take the correct output into the data matrix. But my file format is as follows:
first second third
0 11/15 08:57:42.000 54 67 82
1 11/15 09:48:47.010 49 32 31
...
As you can see I have three delimiters (: \t /) and headers only in the last three columns which are essentially the ones I want to read, that is I want a matrix:
54 67 82
49 32 31
...
I tried specifying the delimiters in dlmread, along with how many rows/columns to skip, but an error occurs in sprintf ('delimiter = sprintf(delimiter); % Interpret \t (if necessary)'). Does anyone have any idea how to go about this?
UPDATE:
I managed to get a little further
data=[];
for i = 1:32
filename = sprintf( '%d.dat',i );
data1=importdata(filename);%creates a cell array
data2=cell2mat(data1(3:end,:));%converts it to char
%The data, without the header, start from the 3rd row.
data=vertcat(data, data2); %concatenate vertically all the files
clear data1; clear data2;
end
%the data
a1=str2num(data(1:end,20:25));%the first data column is in char 20-25
a2=str2num(data(1:end,30:35));%the second data column is in char 30-35
The thing is that the last part takes too much time, over an hour has passed until I manually stopped it. Does anyone know a simpler and faster way to do this?
I managed to solve this myself so I post it here for future reference:
data = [];%start with an empty matrix so vertcat has something to append to
for i = 1:32
filename = sprintf( '%d.dat',i );
data1 = dlmread(filename,'',2,3);%skip 2 rows and 3 columns (offsets are zero-based)
data=vertcat(data, data1);
clear data1;
end
Now the data matrix contains only my data columns and it runs in a few seconds.
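The idea of skipping a fixed number of leading rows and columns, then stacking the files, can be sketched in Python as follows. This is only an illustration: the helper name `read_numeric_block` is invented, the sample text is a guess at the file layout (two header rows, then the rows shown in the question), and the list of two identical chunks stands in for the 32 files.

```python
def read_numeric_block(text, skip_rows, skip_cols):
    """Parse whitespace-separated text, dropping leading rows and columns."""
    matrix = []
    for line in text.splitlines()[skip_rows:]:
        fields = line.split()
        if len(fields) <= skip_cols:
            continue  # blank or short line: nothing numeric to keep
        matrix.append([float(f) for f in fields[skip_cols:]])
    return matrix


# Hypothetical file contents: two header rows, then the rows from the question.
sample = (
    "first second third\n"
    "\n"
    "0 11/15 08:57:42.000 54 67 82\n"
    "1 11/15 09:48:47.010 49 32 31\n"
)

data = []
for chunk in [sample, sample]:  # stands in for looping over 1.dat ... 32.dat
    data.extend(read_numeric_block(chunk, skip_rows=2, skip_cols=3))
print(data)
```

Skipping the date and time columns before any number conversion is what avoids the slow character-by-character parsing from the earlier attempt.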

Loading text data in Octave with specific format

I have a data set that I would like to store and be able to load in Octave
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
14.0 8 454.0 220.0 4354. 9.0 70 1 "chevrolet impala"
14.0 8 440.0 215.0 4312. 8.5 70 1 "plymouth fury iii"
14.0 8 455.0 225.0 4425. 10.0 70 1 "pontiac catalina"
15.0 8 390.0 190.0 3850. 8.5 70 1 "amc ambassador dpl"
It does not work immediately when I try to use:
data = load('auto.txt')
Is there a way to load from a text files with the given format or do I need to convert it to e.g
18.0,8,307.0,130.0,3504.0,12.0,70,1
...
EDIT:
After deleting the last row and fixing the 'half' numbers (e.g. 3504. -> 3504.0), I used:
data = load('-ascii','autocleaned.txt');
This loaded the data into a matrix in Octave as wanted.
load is usually meant for loading Octave and MATLAB binary files, but it can also load textual data like yours with the "-ascii" option. Even with that option you have to reformat the file slightly first: use a consistent column separator (i.e. just a tab or a comma), write full numbers (3850.0 rather than 3850.), and don't include strings.
Then you can do something like this to get it to work
DATA = load("-ascii", "auto.txt");
If the final string field is removed from each line, the file can be read with:
filename='stack25148040_1.txt'
fid = fopen(filename, 'r');
% fscanf fills the result column by column, so read 8 values per column and transpose
[x, count] = fscanf(fid, '%f', [8, Inf]);
x = x';
fclose(fid);
Alternatively, the whole file could be read in as one column and reshaped.
I haven't figured out how to read both the numeric fields and the string field. For that I've had to fall back on Python with more general purpose file reading tools.
Here is a Python script that reads the file, creates a numpy structured array, writes that to a .mat file, which Octave can then read:
import csv
import numpy as np
data=[]
with open('stack25148040.txt','rb') as f:
    r = csv.reader(f, delimiter=' ')
    # csv handles quoted strings with white space
    for l in r:
        # remove empty strings from the split on ' '
        data.append([x for x in l if x])
print data[0]
for dd in data:
    # convert 8 of the strings (per line) to float
    dd[:]=[float(d) for d in dd[:8]]+dd[-1:]
data=data[:-1] # remove empty last line
print data[0]
print
# make a structured array, with numbers and a string
dt=np.dtype("f8,i4,f8,f8,f8,f8,i4,i4,|S25")
A=np.array([tuple(d) for d in data],dtype=dt)
print A
from scipy.io import savemat
savemat('stack25148040.mat',{'A':A})
In Octave this can be read with
load stack25148040.mat
A
# A = 1x10 struct array containing the fields:
# f0 f1 ... f8
A.f8 # string field
A(1) # 1st row
# scalar structure containing the fields:
# f0 = 18
# f1 = 8
...
# f8 = chevrolet chevelle malibu
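As a smaller variant of the parsing step in the script above, the quoted field can also be handled per line with Python 3's standard shlex module, which keeps quoted text together. This is only a sketch: the function name `parse_auto_line` is made up, and it assumes every line has exactly 8 numeric fields followed by one quoted name.

```python
import shlex


def parse_auto_line(line):
    """Split one record into 8 floats and the quoted name."""
    fields = shlex.split(line)  # keeps "quoted text" together, drops the quotes
    numbers = [float(x) for x in fields[:8]]  # float() accepts '3504.'
    name = fields[8]
    return numbers, name


sample = '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"'
numbers, name = parse_auto_line(sample)
print(numbers, name)
```

The numeric part could then be written out as a plain matrix file for Octave, with the names kept in a separate list.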
Newer Octave (3.8) has an importdata function. It handles the original data file without any extra arguments and returns a structure with 2 fields.
x.data is a (10,11) matrix. x.data(:,1:8) is the desired numerical data. x.data(:,9:11) is a mix of NA and random numbers; the NA values stand in for the words at the end of the lines. x.textdata is a (24,1) cell with those words. The quoted strings could be reassembled from those words, using the NA values and quotes to determine how many words belong to which line.
To read the numeric data it uses dlmread. Since the rest of importdata is written in Octave, it could be used as the starting point for a custom function that handles the string data properly.
dlmread ('stack25148040.txt')(:,1:8)
importdata ('stack25148040.txt').data(:,1:8)
textread ('stack25148040.txt','')(:,1:8)
https://octave.org/doc/v4.0.0/Simple-File-I_002fO.html
Try this,
data = importdata('Auto.data')

Code Golf - Word Scrambler

Please answer with the shortest possible source code for a program that converts an arbitrary plaintext to its corresponding ciphertext, following the sample input and output I have given below. Bonus points* for the least CPU time or the least amount of memory used.
Example 1:
Plaintext: The quick brown fox jumps over the lazy dog. Supercalifragilisticexpialidocious!
Ciphertext: eTh kiquc nobrw xfo smjup rvoe eth yalz .odg !uioiapeislgriarpSueclfaiitcxildcos
Example 2:
Plaintext: 123 1234 12345 123456 1234567 12345678 123456789
Ciphertext: 312 4213 53124 642135 7531246 86421357 975312468
Rules:
Punctuation is defined to be included with the word it is closest to.
The center of a word is defined to be ceiling((strlen(word)+1)/2).
Whitespace is ignored (or collapsed).
Odd words move to the right first. Even words move to the left first.
You can think of it as reading every other character backwards (starting from the end of the word), followed by the remaining characters forwards. Corporation => XoXpXrXtXoX => niaorCoprto.
Thank you to those who pointed out the inconsistency in my description. This has led many of you down the wrong path, for which I apologize. Rule #4 should clear things up.
*Bonus points will only be awarded if Jeff Atwood decides to do so. Since I haven't checked with him, the chances are slim. Sorry.
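For anyone checking an entry against the rules, here is an ungolfed Python 3 sketch of the transformation as described in the last rule: every other character read backwards from the end of the word, then the remaining characters forwards. The function names are invented for this example and it makes no attempt to be short.

```python
def scramble_word(word):
    """Every other character read backwards from the end, then the rest forwards."""
    backwards = word[::-2]
    # The forward pass starts at index 1 for odd lengths and index 0 for even ones.
    forwards = word[len(word) % 2::2]
    return backwards + forwards


def scramble(text):
    """Scramble each whitespace-separated word; runs of whitespace collapse."""
    return " ".join(scramble_word(w) for w in text.split())


print(scramble("123 1234 12345 123456 1234567 12345678 123456789"))
```

Splitting on whitespace means punctuation stays attached to its word, which matches the first rule.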
Python, 50 characters
For input in i:
' '.join(x[::-2]+x[len(x)%2::2]for x in i.split())
Alternate version that handles its own IO:
print ' '.join(x[::-2]+x[len(x)%2::2]for x in raw_input().split())
A total of 66 characters if including whitespace. (Technically, the print could be omitted if running from a command line, since the evaluated value of the code is displayed as output by default.)
Alternate version using reduce:
' '.join(reduce(lambda x,y:y+x[::-1],x) for x in i.split())
59 characters.
Original version (both even and odd go right first) for an input in i:
' '.join(x[::2][::-1]+x[1::2]for x in i.split())
48 characters including whitespace.
Another alternate version which (while slightly longer) is slightly more efficient:
' '.join(x[len(x)%2-2::-2]+x[1::2]for x in i.split())
(53 characters)
J, 58 characters
>,&.>/({~(,~(>:#+:#i.#-#<.,+:#i.#>.)#-:)#<:##)&.><;.2,&' '
Haskell, 64 characters
unwords.map(map snd.sort.zip(zipWith(*)[0..]$cycle[-1,1])).words
Well, okay, 76 if you add in the requisite "import List".
Python - 69 chars
(including whitespace and linebreaks)
This handles all I/O.
for w in raw_input().split():
 o=""
 for c in w:o=c+o[::-1]
 print o,
Perl, 78 characters
For input in $_. If that's not acceptable, add six characters for either $_=<>; or $_=$s; at the beginning. The newline is for readability only.
for(split){$i=length;print substr$_,$i--,1,''while$i-->0;
print"$_ ";}print $/
C, 140 characters
Nicely formatted:
main(c, v)
    char **v;
{
    for( ; *++v; )
    {
        char *e = *v + strlen(*v), *x;
        for(x = e-1; x >= *v; x -= 2)
            putchar(*x);
        for(x = *v + (x < *v-1); x < e; x += 2)
            putchar(*x);
        putchar(' ');
    }
}
Compressed:
main(c,v)char**v;{for(;*++v;){char*e=*v+strlen(*v),*x;for(x=e-1;x>=*v;x-=2)putchar(*x);for(x=*v+(x<*v-1);x<e;x+=2)putchar(*x);putchar(32);}}
Lua
130 char function, 147 char functioning program
Lua doesn't get enough love in code golf -- maybe because it's hard to write a short program when you have long keywords like function/end, if/then/end, etc.
First I write the function in a verbose manner with explanations, then I rewrite it as a compressed, standalone function, then I call that function on the single argument specified at the command line.
I had to format the code with <pre></pre> tags because Markdown does a horrible job of formatting Lua.
Technically you could get a smaller running program by inlining the function, but it's more modular this way :)
t = "The quick brown fox jumps over the lazy dog. Supercalifragilisticexpialidocious!"
T = t:gsub("%S+", -- for each word in t...
  function(w) -- argument: current word in t
    W = "" -- initialize new Word
    for i = 1,#w do -- iterate over each character in word
      c = w:sub(i,i) -- extract current character
      -- determine whether letter goes on right or left end
      W = (#w % 2 ~= i % 2) and W .. c or c .. W
    end
    return W -- swap word in t with inverted Word
  end)
-- code-golf unit test
assert(T == "eTh kiquc nobrw xfo smjup rvoe eth yalz .odg !uioiapeislgriarpSueclfaiitcxildcos")
-- need to assign to a variable and return it,
-- because gsub returns a pair and we only want the first element
f=function(s)c=s:gsub("%S+",function(w)W=""for i=1,#w do c=w:sub(i,i)W=(#w%2~=i%2)and W ..c or c ..W end return W end)return c end
-- 1 2 3 4 5 6 7 8 9 10 11 12 13
--34567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
-- 130 chars, compressed and written as a proper function
print(f(arg[1]))
--34567890123456
-- 16 (+1 whitespace needed) chars to make it a functioning Lua program,
-- operating on command line argument
Output:
$ lua insideout.lua 'The quick brown fox jumps over the lazy dog. Supercalifragilisticexpialidocious!'
eTh kiquc nobrw xfo smjup rvoe eth yalz .odg !uioiapeislgriarpSueclfaiitcxildcos
I'm still pretty new at Lua so I'd like to see a shorter solution if there is one.
For a minimal cipher on all args, written to stdout, we can do 111 chars:
for _,w in ipairs(arg)do W=""for i=1,#w do c=w:sub(i,i)W=(#w%2~=i%2)and W ..c or c ..W end io.write(W ..' ')end
But this approach does output a trailing space like some of the other solutions.
For an input in s:
f=lambda t,r="":t and f(t[1:],len(t)&1and t[0]+r or r+t[0])or r
" ".join(map(f,s.split()))
Python, 90 characters including whitespace.
TCL
125 characters
set s set f foreach l {}
$f w [gets stdin] {$s r {}
$f c [split $w {}] {$s r $c[string reverse $r]}
$s l "$l $r"}
puts $l
Bash - 133, assuming input is in $w variable
Pretty
for x in $w; do
    z="";
    for l in `echo $x|sed 's/\(.\)/ \1/g'`; do
        if ((${#z}%2)); then
            z=$z$l;
        else
            z=$l$z;
        fi;
    done;
    echo -n "$z ";
done;
echo
Compressed
for x in $w;do z="";for l in `echo $x|sed 's/\(.\)/ \1/g'`;do if ((${#z}%2));then z=$z$l;else z=$l$z;fi;done;echo -n "$z ";done;echo
Ok, so it outputs a trailing space.