My .emacs file has the following content:
$ cat ~/.emacs
(setq vc-handled-backends nil)
(global-linum-mode t)
$ od -xcb ~/.emacs
0000000 7328 7465 2071 6376 682d 6e61 6c64 6465
( s e t q v c - h a n d l e d
050 163 145 164 161 040 166 143 055 150 141 156 144 154 145 144
0000020 622d 6361 656b 646e 2073 696e 296c 280a
- b a c k e n d s n i l ) \n (
055 142 141 143 153 145 156 144 163 040 156 151 154 051 012 050
0000040 6c67 626f 6c61 6c2d 6e69 6d75 6d2d 646f
g l o b a l - l i n u m - m o d
147 154 157 142 141 154 055 154 151 156 165 155 055 155 157 144
0000060 2065 2974
e t )
145 040 164 051
0000064
These are perfectly valid Emacs Lisp expressions.
But recently, whenever I start Emacs, the line numbers no longer show up. Instead, an error comes up:
$ emacs --debug-init ~/.emacs
Debugger entered--Lisp error: (void-function global-linum-mode)
(global-linum-mode t)
eval-buffer(#<buffer *load*> nil "/Users/user/.emacs" nil t) ; Reading at buffer position 53
load-with-code-conversion("/Users/user/.emacs" "/Users/user/.emacs" t t)
load("~/.emacs" t t)
#[nil "^H\205\276^# \306=\203^Q^#\307^H\310Q\202A^# \311=\2033^#\312\307\313\314#\203#^#\315\202A^#\312\307\313\316$
command-line()
normal-top-level()
Version of emacs:
$ emacs --version
GNU Emacs 22.1.1
Copyright (C) 2007 Free Software Foundation, Inc.
GNU Emacs comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of Emacs
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.
Does anyone have an idea what might have caused this?
Thanks
Looks like linum mode was added to the Emacs distribution in version 23.1 (changelog). Because linum isn't distributed with Emacs 22, you're calling an undefined function and therefore getting an error.
Perhaps you used to run a more recent version of Emacs, which has since been clobbered. You could either:
- download the linum source, add it to your load path, then require it, or
- install a newer version of Emacs.
Edit: As mentioned in the comments above, you could have multiple Emacs binaries on your path, with different versions. Have a look in /usr/bin, /opt/local/bin, et al, to see if this is the case.
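To check for several Emacs binaries without walking the directories by hand, a short script can list every executable named emacs on the search path. This is only a sketch in Python; the helper name find_on_path is made up for illustration, and it assumes a Unix-like system where the executable bit is meaningful.

```python
import os


def find_on_path(name, path=None):
    """Return every executable called `name` on the search path, in lookup order."""
    # Fall back to the current PATH when no explicit search path is given.
    path = os.environ.get("PATH", "") if path is None else path
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Keep regular files that are executable, skipping duplicate PATH entries.
        if (os.path.isfile(candidate)
                and os.access(candidate, os.X_OK)
                and candidate not in hits):
            hits.append(candidate)
    return hits


# The first entry printed is the one the shell would run.
for binary in find_on_path("emacs"):
    print(binary)
```

Running each listed binary with --version then shows whether an old Emacs 22 is shadowing a newer install.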
The first thing I would do is check the contents of that file with something other than cat, such as the octal dump program od, if you're running under a UNIXy system:
od -xcb .emacs
Because 53 characters is about the size of those two lines, it might have some rubbish at the end of (or within) the file.
I want to delete N number of lines after the first match in a text file using sed.
(I know most of these questions have been answered with "use awk", but I want to use sed, regardless of how much more powerful awk may be. It's more a matter of which tool I'm most comfortable with at the moment, within a certain time constraint.)
The furthest I got is this:
sed -i "0,/pattern/{/pattern/,+Nd}" file.txt
The thought is that 0,/pattern/ selects everything up to the first occurrence, and that the commands in the curly brackets then find the pattern line and delete the N lines after that occurrence.
Try
sed '/pattern/{N;N;N;N;N;N;N;d;}' file.txt
The 0, construct and the relative line number addressing you tried to use are specific to GNU sed. Portable sed does not have these facilities.
The command above deletes each matching line together with the seven lines after it (one for each N), and it does so for every match. If you only want to remove the first occurrence and leave the rest of the file unchanged, maybe add a separate loop to simply print all remaining lines.
The problem with your attempt is that 0,/pattern/ restricts matching to the lines up through the first occurrence of /pattern/ but then that's the end of the range, so anything selected by this expression cannot operate on lines outside of that range.
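For comparison, here is the intended behaviour spelled out as a small state machine, which may help when checking what a sed script should produce. It is a sketch in Python rather than sed, and it assumes that the matching line itself is deleted along with the N lines after it; the function name and the sample data are made up for illustration.

```python
import re


def delete_after_first_match(lines, pattern, n):
    """Drop the first line matching `pattern` plus the `n` lines that follow it."""
    regex = re.compile(pattern)
    kept = []
    to_skip = 0      # lines still to be dropped after the match
    matched = False  # becomes True once the first match has been handled
    for line in lines:
        if to_skip > 0:
            to_skip -= 1
            continue
        if not matched and regex.search(line):
            matched = True
            to_skip = n
            continue
        kept.append(line)
    return kept


sample = ["a", "pattern", "b", "c", "pattern", "d"]
result = delete_after_first_match(sample, r"pattern", 2)
print(result)
```

Unlike a range such as 0,/pattern/, the skip counter keeps running after the matching line, and later matches are left alone because the matched flag is already set.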
Assuming your shell is bash (the question originally had a bash tag):
n=3
sed -f <(printf -v nsp '%*s' $n; printf '/%s/{x;/./!{s/^/./;h;%sd;};x;}\n' 'pattern' "${nsp// /N;}") file
Note that n is a variable (3 is just an example) and that the constructed sed script is not GNU-specific.
This might work for you (GNU sed):
sed '0,/pattern/{//{:a;N;s/\n/&/N;Ta;d}}' file
Deletes the line containing pattern and then N lines after it once only.
Alternative:
sed '/pattern/{x;//{x;b};x;h;:a;N;s/\n/&/N;Ta;d}' file
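N.B. The N following the substitution command is a placeholder for a number. As a flag it makes the substitution apply to the Nth newline in the pattern space, so the T command keeps looping until N extra lines have been appended.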
UPDATE 1: An example where the sed solution above does not behave the same way everywhere:
cmd='/5=P$/{N;N;N;N;N;N;d;}'
echo "\n input \${b} :: \n\n--------------\n" \
"${b}\n--------------\n\n sed " \
"commands :: \n\n--------------\n " \
"${cmd}\n--------------\n\n GNU sed "\
"::\n\n$( gsed "${cmd}" <<< "${b}" )" \
"\n\n BSD sed ::\n\n$( sed "${cmd}" <<< "${b}" )\n\n"
input ${b} ::
--------------
84 77138=48001=P
85 77138=48035=P
86 77138=78118=P
87 77138=79248=P
--------------
sed commands ::
--------------
/5=P$/{N;N;N;N;N;N;d;}
--------------
GNU sed ::
84 77138=48001=P
85 77138=48035=P
86 77138=78118=P
87 77138=79248=P
BSD sed ::
84 77138=48001=P
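When the input lacks sufficient rows past the pattern,
this solution works on BSD sed,
but fails on GNU sed.
The likely cause is the N command at end of input: when there is no next line, GNU sed prints the pattern space before exiting, whereas BSD sed exits without printing it.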
============================
Is sed a hard requirement? You can also do this as a one-liner with awk
(the version below is intentionally verbose, to show exactly which lines are matched and skipped):
# gawk profile, created Thu Apr 28 18:36:55 2022
# BEGIN rule(s)
BEGIN {
1 printf "\n\t N :: %.f :: FS i.e. "\
"pattern :: %*s\n\n", N = +N, ++__, FS = pattern
}
# Rule(s)
87 NF *= -(_+=(_= __<NF ? -__-N :_)^!__)<+_ { # 45
45 print
}
1 77138=501=A
2 77138=3413=A
3 77138=3414=A
4 77138=8624=A
5 77138=19572=A
6 77138=22220=A
7 77138=23670=A
8 77138=25413=A
9 77138=26351=A
10 77138=27340=A
11 77138=29288=A
12 77138=121060=A
13 77138=123028=A
14 77138=132081=A
15 77138=135789=A
16 77138=154341=A
17 77138=155876=A
18 77138=170871=A
19 77138=178562=A
skipped :: 20 77138=185367=A
skipped :: 21 77138=196718=A
skipped :: 22 77138=196985=A
skipped :: 23 77138=200012=A
skipped :: 24 77138=207162=A
skipped :: 25 77138=228289=A
skipped :: 26 77138=244747=A
skipped :: 27 77138=284795=A
skipped :: 28 77138=294579=A
skipped :: 29 77138=299765=A
skipped :: 30 77138=317856=A
skipped :: 31 77138=318815=A
32 77138=324570=A
33 77138=408049=A
34 77138=514403=A
35 77138=1647865=A
36 77138=1738771=A
37 77138=3217183=A
skipped :: 38 77138=3222837=A
skipped :: 39 77138=3235292=A
skipped :: 40 77138=14957980=I
skipped :: 41 77138=1159=M
skipped :: 42 77138=1196=M
skipped :: 43 77138=1251=M
44 77138=1252=M
45 77138=4951=M
46 77138=16740=M
47 77138=71501=M
skipped :: 48 77138=137=P
skipped :: 49 77138=348=P
skipped :: 50 77138=518=P
skipped :: 51 77138=519=P
skipped :: 52 77138=520=P
skipped :: 53 77138=925=P
54 77138=1363=P
55 77138=1483=P
56 77138=1814=P
57 77138=2692=P
58 77138=3540=P
59 77138=3594=P
60 77138=3682=P
61 77138=3869=P
62 77138=3940=P
skipped :: 63 77138=3977=P
skipped :: 64 77138=4025=P
skipped :: 65 77138=4252=P
skipped :: 66 77138=4396=P
skipped :: 67 77138=9501=P
skipped :: 68 77138=13006=P
69 77138=18113=P
skipped :: 70 77138=20907=P
skipped :: 71 77138=31936=P
skipped :: 72 77138=34954=P
skipped :: 73 77138=37126=P
skipped :: 74 77138=37482=P
skipped :: 75 77138=40135=P
76 77138=40206=P
77 77138=41279=P
78 77138=41280=P
79 77138=46140=P
skipped :: 80 77138=46157=P
skipped :: 81 77138=46173=P
skipped :: 82 77138=46218=P
skipped :: 83 77138=47592=P
skipped :: 84 77138=48001=P
skipped :: 85 77138=48035=P
86 77138=78118=P
87 77138=79248=P
N :: 5 :: FS i.e. pattern :: [7]=[AP]$
1 77138=501=A
2 77138=3413=A
3 77138=3414=A
4 77138=8624=A
5 77138=19572=A
6 77138=22220=A
7 77138=23670=A
8 77138=25413=A
9 77138=26351=A
10 77138=27340=A
11 77138=29288=A
12 77138=121060=A
13 77138=123028=A
14 77138=132081=A
15 77138=135789=A
16 77138=154341=A
17 77138=155876=A
18 77138=170871=A
19 77138=178562=A
32 77138=324570=A
33 77138=408049=A
34 77138=514403=A
35 77138=1647865=A
36 77138=1738771=A
37 77138=3217183=A
44 77138=1252=M
45 77138=4951=M
46 77138=16740=M
47 77138=71501=M
54 77138=1363=P
55 77138=1483=P
56 77138=1814=P
57 77138=2692=P
58 77138=3540=P
59 77138=3594=P
60 77138=3682=P
61 77138=3869=P
62 77138=3940=P
69 77138=18113=P
76 77138=40206=P
77 77138=41279=P
78 77138=41280=P
79 77138=46140=P
86 77138=78118=P
87 77138=79248=P
more concisely, it would be
mawk -v pattern='[7]=[AP]$' -v N='5' -- '
BEGIN {
++__
FS = pattern
} NF *= -(_+=(_=__<NF?-__-N:_)^!__) < +_'
or in awk one-liner style
mawk 'NF*=-(_+=(_=1<NF?-1-N:_)^0)<+_' FS='[7]=[AP]$' N=5
The man page of cat says:
-v, --show-nonprinting
use ^ and M- notation, except for LFD and TAB
What is the M- notation and where is it documented?
Example:
$ cat log -A
wrote 262144 bytes from file test.x in 9.853947s (25.979 KiB/s)^M$
^M> ^H^H ^H^H>
What do ^M and ^H mean?
I was wondering this too. I checked the source, but it seemed easier to create an input file to get the mapping.
I created a test input file with a Perl script:
for( my $i=0 ; $i < 256; $i++ ) { print ( sprintf( "%c is %d %x\n", $i, $i ,$i ) ); }
and then ran it through cat -v.
Also, if you see M-oM-;M-? at the start of a file, it is the UTF-8 byte order mark.
Scroll down through these to get to the M- values:
^@ is 0 0
^A is 1 1
^B is 2 2
^C is 3 3
^D is 4 4
^E is 5 5
^F is 6 6
^G is 7 7
^H is 8 8
(9 is tab)
(10 is NL)
^K is 11 b
^L is 12 c
^M is 13 d
^N is 14 e
^O is 15 f
^P is 16 10
^Q is 17 11
^R is 18 12
^S is 19 13
^T is 20 14
^U is 21 15
^V is 22 16
^W is 23 17
^X is 24 18
^Y is 25 19
^Z is 26 1a
^[ is 27 1b
^\ is 28 1c
^] is 29 1d
^^ is 30 1e
^_ is 31 1f
...printing chars removed...
^? is 127 7f
M-^@ is 128 80
M-^A is 129 81
M-^B is 130 82
M-^C is 131 83
M-^D is 132 84
M-^E is 133 85
M-^F is 134 86
M-^G is 135 87
M-^H is 136 88
M-^I is 137 89
M-^J is 138 8a
M-^K is 139 8b
M-^L is 140 8c
M-^M is 141 8d
M-^N is 142 8e
M-^O is 143 8f
M-^P is 144 90
M-^Q is 145 91
M-^R is 146 92
M-^S is 147 93
M-^T is 148 94
M-^U is 149 95
M-^V is 150 96
M-^W is 151 97
M-^X is 152 98
M-^Y is 153 99
M-^Z is 154 9a
M-^[ is 155 9b
M-^\ is 156 9c
M-^] is 157 9d
M-^^ is 158 9e
M-^_ is 159 9f
M- is 160 a0
M-! is 161 a1
M-" is 162 a2
M-# is 163 a3
M-$ is 164 a4
M-% is 165 a5
M-& is 166 a6
M-' is 167 a7
M-( is 168 a8
M-) is 169 a9
M-* is 170 aa
M-+ is 171 ab
M-, is 172 ac
M-- is 173 ad
M-. is 174 ae
M-/ is 175 af
M-0 is 176 b0
M-1 is 177 b1
M-2 is 178 b2
M-3 is 179 b3
M-4 is 180 b4
M-5 is 181 b5
M-6 is 182 b6
M-7 is 183 b7
M-8 is 184 b8
M-9 is 185 b9
M-: is 186 ba
M-; is 187 bb
M-< is 188 bc
M-= is 189 bd
M-> is 190 be
M-? is 191 bf
M-@ is 192 c0
M-A is 193 c1
M-B is 194 c2
M-C is 195 c3
M-D is 196 c4
M-E is 197 c5
M-F is 198 c6
M-G is 199 c7
M-H is 200 c8
M-I is 201 c9
M-J is 202 ca
M-K is 203 cb
M-L is 204 cc
M-M is 205 cd
M-N is 206 ce
M-O is 207 cf
M-P is 208 d0
M-Q is 209 d1
M-R is 210 d2
M-S is 211 d3
M-T is 212 d4
M-U is 213 d5
M-V is 214 d6
M-W is 215 d7
M-X is 216 d8
M-Y is 217 d9
M-Z is 218 da
M-[ is 219 db
M-\ is 220 dc
M-] is 221 dd
M-^ is 222 de
M-_ is 223 df
M-` is 224 e0
M-a is 225 e1
M-b is 226 e2
M-c is 227 e3
M-d is 228 e4
M-e is 229 e5
M-f is 230 e6
M-g is 231 e7
M-h is 232 e8
M-i is 233 e9
M-j is 234 ea
M-k is 235 eb
M-l is 236 ec
M-m is 237 ed
M-n is 238 ee
M-o is 239 ef
M-p is 240 f0
M-q is 241 f1
M-r is 242 f2
M-s is 243 f3
M-t is 244 f4
M-u is 245 f5
M-v is 246 f6
M-w is 247 f7
M-x is 248 f8
M-y is 249 f9
M-z is 250 fa
M-{ is 251 fb
M-| is 252 fc
M-} is 253 fd
M-~ is 254 fe
M-^? is 255 ff
^M is for Control-M (a carriage return), ^H for Control-H (a backspace). M-Something is Meta-Something (Meta- is what the Alt key does in some terminals).
I am not sure about the M- notation, but the ones involving ^ use caret notation:
Caret notation is a notation for control characters in ASCII.
In particular,
The digraph stands for the control character whose ASCII code is the
same as the character's ASCII code with the uppermost bit, in a 7-bit
encoding, reversed.
which you can verify by looking at the binary (or octal) representation in an ASCII table (image source: http://www.asciitable.com).
Because ASCII is such a limited character set (as you can see above), it's straightforward to list all control chars representable by the caret notation, e.g., http://xahlee.info/comp/unicode_character_representation.html.
You can see the definition in the key_name(3) manpage
Likewise, the meta(3X) function allows the caller to change the output of keyname, i.e., it determines whether to use the “M-” prefix for “meta” keys (codes in the range 128 to 255). Both use_legacy_coding(3X) and meta(3X) succeed only after curses is initialized. X/Open Curses does not document the treatment of codes 128 to 159. When treating them as “meta” keys (or if keyname is called before initializing curses), this implementation returns strings “M-^@”, “M-^A”, etc.
key_name(3X)
So basically the Meta version of a character has the keycode of the plain (or Ctrl) version plus 128. You can see that easily in Brian's table. Here's a slightly modified version for ease of comparison:
$ LC_ALL=C perl -e 'for( my $i=0 ; $i < 128; $i++ ) {
print ( sprintf( "%c is %d %x\t\t%c is %d %x\n",
$i, $i, $i, $i + 128, $i + 128, $i + 128 ) );
}' >bytes.txt
$ cat -v bytes.txt
^@ is 0 0 M-^@ is 128 80
^A is 1 1 M-^A is 129 81
^B is 2 2 M-^B is 130 82
^C is 3 3 M-^C is 131 83
...
^Y is 25 19 M-^Y is 153 99
^Z is 26 1a M-^Z is 154 9a
^[ is 27 1b M-^[ is 155 9b
^\ is 28 1c M-^\ is 156 9c
^] is 29 1d M-^] is 157 9d
^^ is 30 1e M-^^ is 158 9e
^_ is 31 1f M-^_ is 159 9f
is 32 20 M- is 160 a0
! is 33 21 M-! is 161 a1
" is 34 22 M-" is 162 a2
# is 35 23 M-# is 163 a3
$ is 36 24 M-$ is 164 a4
% is 37 25 M-% is 165 a5
& is 38 26 M-& is 166 a6
' is 39 27 M-' is 167 a7
( is 40 28 M-( is 168 a8
) is 41 29 M-) is 169 a9
* is 42 2a M-* is 170 aa
+ is 43 2b M-+ is 171 ab
, is 44 2c M-, is 172 ac
- is 45 2d M-- is 173 ad
. is 46 2e M-. is 174 ae
/ is 47 2f M-/ is 175 af
0 is 48 30 M-0 is 176 b0
1 is 49 31 M-1 is 177 b1
...
: is 58 3a M-: is 186 ba
; is 59 3b M-; is 187 bb
< is 60 3c M-< is 188 bc
= is 61 3d M-= is 189 bd
> is 62 3e M-> is 190 be
? is 63 3f M-? is 191 bf
@ is 64 40 M-@ is 192 c0
A is 65 41 M-A is 193 c1
B is 66 42 M-B is 194 c2
...
Z is 90 5a M-Z is 218 da
[ is 91 5b M-[ is 219 db
\ is 92 5c M-\ is 220 dc
] is 93 5d M-] is 221 dd
^ is 94 5e M-^ is 222 de
_ is 95 5f M-_ is 223 df
` is 96 60 M-` is 224 e0
a is 97 61 M-a is 225 e1
b is 98 62 M-b is 226 e2
...
z is 122 7a M-z is 250 fa
{ is 123 7b M-{ is 251 fb
| is 124 7c M-| is 252 fc
} is 125 7d M-} is 253 fd
~ is 126 7e M-~ is 254 fe
^? is 127 7f M-^? is 255 ff
The part after M- on the right is exactly the same as on the left; the keycodes differ by 128.
You can also check cat's source code: the basic expression for the Meta version in the show_nonprinting case is *bpout++ = ch - 128;
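The mapping in the tables above can be reproduced in a few lines. This is a sketch in Python of the rendering rule as I understand it from the tables, not a copy of cat's source; the names cat_v and _render_low are made up for illustration.

```python
def _render_low(code, keep_whitespace):
    """Render one 7-bit code (0-127) the way the tables above show it."""
    if keep_whitespace and code in (9, 10):
        return chr(code)              # TAB and LFD pass through unchanged
    if code < 32:
        return "^" + chr(code + 64)   # control characters: ^@ .. ^_
    if code == 127:
        return "^?"                   # DEL
    return chr(code)                  # printable ASCII


def cat_v(data):
    """Return `data` (bytes) in ^ and M- notation."""
    parts = []
    for byte in data:
        if byte >= 128:
            # High bit set: emit "M-" and render the byte minus 128.
            parts.append("M-" + _render_low(byte - 128, keep_whitespace=False))
        else:
            parts.append(_render_low(byte, keep_whitespace=True))
    return "".join(parts)


print(cat_v(b"\r"))            # carriage return
print(cat_v(b"\xef\xbb\xbf"))  # UTF-8 byte order mark
```

The two calls print ^M and M-oM-;M-?, matching the carriage return and byte order mark cases mentioned earlier in the thread.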
Answer from the book Unix Power Tools, section 25.7, "Show Non-Printing Characters with cat -v or od -c":
"cat -v has its own symbol for characters outside the ASCII range with their high bits set, also called metacharacters. cat -v prints those as M- followed by another character. There are two of them in the cat -v output: M-^? and M-a . To get a metacharacter, you add 200 octal. "Say what?" Let's look at M-a first. The octal value of the letter a is 141. When cat -v prints M-a , it means the character you get by adding 141+200, or 341 octal. You can decode the character cat prints as M-^? in the same way. The ^? stands for the DEL character, which is octal 177. Add 200+177 to get 377 octal. "
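The octal arithmetic in that passage can be checked by decoding a token back into its byte value. A minimal sketch in Python, assuming one token at a time such as M-a, ^M or M-^?; the function name decode_token is made up for illustration.

```python
def decode_token(token):
    """Return the byte value behind one cat -v token such as 'M-a' or '^M'."""
    value = 0
    if token.startswith("M-") and len(token) > 2:
        value = 0o200            # the meta prefix adds 200 octal
        token = token[2:]
    if len(token) == 2 and token[0] == "^":
        # Caret notation: flip bit 0o100 of the character after the caret.
        value += ord(token[1]) ^ 0o100
    elif len(token) == 1:
        value += ord(token)
    else:
        raise ValueError("not a single cat -v token: %r" % token)
    return value


print(oct(decode_token("M-a")))   # 141 + 200 octal
print(oct(decode_token("M-^?")))  # 177 + 200 octal
```

The two calls print 0o341 and 0o377, the same values the book arrives at.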
How to output word bounds using the tesseract command line with a config file?
So far I have been able to output chars using
tesseract image.png myBox makebox
This created a myBox.box file that looks like this:
N 51 1844 75 1874 0
o 80 1843 100 1867 0
S 113 1843 136 1875 0
I 140 1844 145 1874 0
M 151 1844 181 1874 0
c 197 1843 216 1867 0
a 219 1843 238 1867 0
r 243 1844 254 1867 0
d 256 1843 275 1876 0
However, those are only chars and I need words, so I combined it with the standard output:
tesseract image.png myBox
This creates a file like this:
no simcard
Combining those two outputs, I can get word bounds. However, I would prefer a method that does not require examining the same image twice. Please help.
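For reference, the combining step described above might look like the following. This is only a sketch in Python of the two-pass workaround, so it does not avoid the second OCR run. It assumes the box file has one line per non-space character in the same order as the text output, with the columns symbol, left, bottom, right, top, page; the function names are made up for illustration.

```python
def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top)."""
    symbol, left, bottom, right, top, _page = line.split()
    return symbol, int(left), int(bottom), int(right), int(top)


def word_boxes(box_lines, text):
    """Merge per-character boxes into one bounding box per word of `text`."""
    boxes = [parse_box_line(line) for line in box_lines if line.strip()]
    result = []
    position = 0
    for word in text.split():
        # Assumption: each character of the word has exactly one box.
        chunk = boxes[position:position + len(word)]
        if len(chunk) < len(word):
            raise ValueError("box file has fewer characters than the text")
        position += len(word)
        result.append((
            word,
            min(box[1] for box in chunk),  # left
            min(box[2] for box in chunk),  # bottom
            max(box[3] for box in chunk),  # right
            max(box[4] for box in chunk),  # top
        ))
    return result


box_lines = [
    "N 51 1844 75 1874 0", "o 80 1843 100 1867 0",
    "S 113 1843 136 1875 0", "I 140 1844 145 1874 0",
    "M 151 1844 181 1874 0", "c 197 1843 216 1867 0",
    "a 219 1843 238 1867 0", "r 243 1844 254 1867 0",
    "d 256 1843 275 1876 0",
]
words = word_boxes(box_lines, "no simcard")
print(words)
```

Characters are matched by position rather than by value, because the two outputs above differ in letter case.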
I have a problem getting rpy2 running in an IPython notebook.
If I load
%load_ext rpy2.ipython
in IPython 4.0.3 everything is fine. But if I do the same thing in an IPython notebook I get:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-3-a69f80d0128e> in <module>()
----> 1 get_ipython().magic('load_ext rpy2.ipython')
C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in magic(self, arg_s)
2334 magic_name, _, magic_arg_s = arg_s.partition(' ')
2335 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2336 return self.run_line_magic(magic_name, magic_arg_s)
2337
2338 #-------------------------------------------------------------------------
C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in run_line_magic(self, magic_name, line)
2255 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2256 with self.builtin_trap:
-> 2257 result = fn(*args,**kwargs)
2258 return result
2259
<decorator-gen-65> in load_ext(self, module_str)
C:\Anaconda3\lib\site-packages\IPython\core\magic.py in <lambda>(f, *a, **k)
191 # but it's overkill for just that one bit of state.
192 def magic_deco(arg):
--> 193 call = lambda f, *a, **k: f(*a, **k)
194
195 if callable(arg):
C:\Anaconda3\lib\site-packages\IPython\core\magics\extension.py in load_ext(self, module_str)
64 if not module_str:
65 raise UsageError('Missing module name.')
---> 66 res = self.shell.extension_manager.load_extension(module_str)
67
68 if res == 'already loaded':
C:\Anaconda3\lib\site-packages\IPython\core\extensions.py in load_extension(self, module_str)
82 if module_str not in sys.modules:
83 with prepended_to_syspath(self.ipython_extension_dir):
---> 84 __import__(module_str)
85 mod = sys.modules[module_str]
86 if self._call_load_ipython_extension(mod):
C:\Anaconda3\lib\site-packages\rpy2\ipython\__init__.py in <module>()
----> 1 from .rmagic import load_ipython_extension
C:\Anaconda3\lib\site-packages\rpy2\ipython\rmagic.py in <module>()
50 # numpy and rpy2 imports
51
---> 52 import rpy2.rinterface as ri
53 import rpy2.robjects as ro
54 import rpy2.robjects.packages as rpacks
C:\Anaconda3\lib\site-packages\rpy2\rinterface\__init__.py in <module>()
72 if not os.path.exists(Rlib):
73 continue
---> 74 ctypes.CDLL(Rlib)
75 _win_ok = True
76 break
C:\Anaconda3\lib\ctypes\__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error)
345
346 if handle is None:
--> 347 self._handle = _dlopen(self._name, mode)
348 else:
349 self._handle = handle
FileNotFoundError: [WinError 161] Der angegebene Pfadname ist ungültig
Is there some way to get both running? As rpy2 runs properly in IPython, I guess the installation should be correct.
Thanks,
Marv
There are likely more differences between the environment from which ipython is called and the one from which the notebook is called: the error Der angegebene Pfadname ist ungültig ("The specified path name is invalid") occurs while trying to load the R shared library.
You'd need to tell us a little more about how you start either ipython or the notebook.
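One way to gather that information is to run the same few lines in both the ipython console and the notebook and compare the output. This is a sketch only; the function name environment_report is made up, and R_HOME is included on the assumption that it is the variable used to locate R on your system.

```python
import os
import sys


def environment_report():
    """Collect the settings most likely to differ between the two sessions."""
    return {
        "executable": sys.executable,        # which Python is running
        "cwd": os.getcwd(),                  # working directory of the session
        "R_HOME": os.environ.get("R_HOME"),  # None if the variable is not set
        "PATH": os.environ.get("PATH", "").split(os.pathsep),
    }


for key, value in environment_report().items():
    print(key, "=", value)
```

Any entry that differs between the two sessions, in particular R_HOME or a PATH element pointing at R, is a good candidate for the invalid path.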
That said, you should also note that rpy2 is likely to work better on Linux or OS X. If the ipython notebook is your primary interest, running it through a Docker container could be a good solution.
I am stuck at one point in my project. I am a biomedical scientist, so I don't know much Perl programming.
I have a file that describes protein interactions with ligands. The file looks as shown below:
H P L A 82 SER 1290 N --> O12 1668 GSH 106 A 2.90
H P L A 83 SER 1301 N --> O12 1668 GSH 106 A 2.93
N P L A 19 LYS 302 NZ --- O31 1682 GSH 106 A 3.86
N P L A 22 CYS 348 CB --- CB2 1677 GSH 106 A 3.75
N P L A 22 CYS 348 CB --- SG2 1678 GSH 106 A 3.02
N P L A 22 CYS 349 SG --- CB2 1677 GSH 106 A 3.03
N P L A 22 CYS 349 SG --- SG2 1678 GSH 106 A 2.02
N P L A 24 TYR 372 CB --- CG1 1670 GSH 106 A 3.68
As you can see, O12 appears in two rows, and CB2 appears twice as well. O12 and CB2 are atom symbols; O12, for example, means oxygen 12.
I need to calculate how many different atom symbols there are in the file, and I have to use a Perl script to do that. I am reading the file line by line:
while (my $line = <MYFILE>) { };
How can I count the different atom symbols while reading the file line by line? I hope I have explained my problem clearly enough. Waiting for a kind reply.
How the problem is best solved depends on how your data is delimited. As it looks like fixed width, I'll present that solution first:
use strict;
use warnings;
my %atom;
while (<DATA>) {
my (undef,$atom) = unpack "A34A4 ", $_;
$atom{$atom}++;
}
print scalar keys %atom;
__DATA__
H P L A 82 SER 1290 N --> O12 1668 GSH 106 A 2.90
H P L A 83 SER 1301 N --> O12 1668 GSH 106 A 2.93
N P L A 19 LYS 302 NZ --- O31 1682 GSH 106 A 3.86
N P L A 22 CYS 348 CB --- CB2 1677 GSH 106 A 3.75
N P L A 22 CYS 348 CB --- SG2 1678 GSH 106 A 3.02
N P L A 22 CYS 349 SG --- CB2 1677 GSH 106 A 3.03
N P L A 22 CYS 349 SG --- SG2 1678 GSH 106 A 2.02
N P L A 24 TYR 372 CB --- CG1 1670 GSH 106 A 3.68
Note here that I estimated the offset used by unpack, so you may need to tweak that to fit your data.
If your data is tab-delimited, you'll need to split on tab, or better yet use Text::CSV to parse your data. Basic script is the same:
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
sep_char => "\t",
});
my %atom;
while (<DATA>) {
$csv->parse($_);
my $atom = ($csv->fields())[9];
next unless defined $atom;
$atom{$atom}++;
}
You can also use the loop condition while (my $aref = $csv->getline(*DATA)), which is more efficient, but also breaks if your csv data is not consistent.
A simpler and possibly as valid (depending on how complex your data can be) solution is using split:
while (<DATA>) {
my $atom = (split /\t/)[9]; # implicitly splits $_
$atom{$atom}++;
}
If your data is space delimited, simply remove /\t/ from the above.
Note that I assumed all spaces were tabs in your input, so if they are not, my count may need to be tweaked.
In command line (no perl):
cat yourfile | awk '{print $10}' | sort | uniq | wc -l
Works on your input.
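The same pipeline written out as a short Python sketch, in case it helps to see each step separately. It assumes whitespace-separated columns with the atom symbol in the tenth one, as in the awk command; the function name is made up for illustration.

```python
def distinct_atom_symbols(lines, column=9):
    """Return the set of values found in the given zero-based column."""
    symbols = set()
    for line in lines:
        fields = line.split()
        # Skip blank or short lines instead of failing on them.
        if len(fields) > column:
            symbols.add(fields[column])
    return symbols


data = """\
H P L A 82 SER 1290 N --> O12 1668 GSH 106 A 2.90
H P L A 83 SER 1301 N --> O12 1668 GSH 106 A 2.93
N P L A 19 LYS 302 NZ --- O31 1682 GSH 106 A 3.86
N P L A 22 CYS 348 CB --- CB2 1677 GSH 106 A 3.75
N P L A 22 CYS 348 CB --- SG2 1678 GSH 106 A 3.02
N P L A 22 CYS 349 SG --- CB2 1677 GSH 106 A 3.03
N P L A 22 CYS 349 SG --- SG2 1678 GSH 106 A 2.02
N P L A 24 TYR 372 CB --- CG1 1670 GSH 106 A 3.68
"""
symbols = distinct_atom_symbols(data.splitlines())
print(len(symbols), sorted(symbols))
```

On the sample rows this finds five distinct symbols: CB2, CG1, O12, O31 and SG2.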
Have a look at this Perl Cookbook recipe.
While you're reading the file line by line you want to split/extract the atom symbols and count them in a hash.
use strict;
use warnings;
# open FILE goes here...
my %seen; # we use this to count
while (<FILE>) {
# fetch the atom symbol after the arrow-thing; skip lines that have none
next unless m/--[>-]\s+(\w+)\s/;
$seen{$1}++;
}
close FILE;
print scalar(keys %seen), "\n"; # number of unique atom symbols
print join(', ', keys %seen), "\n"; # list as string
Or in perl:
#!/usr/bin/perl
use strict;
use warnings;
my %atoms; # atom symbol => number of occurrences
while(my $line = <DATA>){
my $atom = (split / +/, $line)[9];
$atoms{$atom}++;
}
print "$_: $atoms{$_}\n" for keys %atoms;
__DATA__
H P L A 82 SER 1290 N --> O12 1668 GSH 106 A 2.90
H P L A 83 SER 1301 N --> O12 1668 GSH 106 A 2.93
N P L A 19 LYS 302 NZ --- O31 1682 GSH 106 A 3.86
N P L A 22 CYS 348 CB --- CB2 1677 GSH 106 A 3.75
N P L A 22 CYS 348 CB --- SG2 1678 GSH 106 A 3.02
N P L A 22 CYS 349 SG --- CB2 1677 GSH 106 A 3.03
N P L A 22 CYS 349 SG --- SG2 1678 GSH 106 A 2.02
N P L A 24 TYR 372 CB --- CG1 1670 GSH 106 A 3.68