What is the fastest way to autobreak a line of gigabytes separated by keywords using bash shell? - perl

For example, given a line a11b12c22d322 e... the fields of break are the numbers or spaces, we want to transform it into
a
b
c
d
e
...
sed need to read the whole line into memory, for gigabytes a line, it would not be efficient, and the job could not be done if we don't have sufficient memory.
EDIT:
Could anyone please explain how do grep, tr, Awk, perl, and python manipulate the memory in reading a large file? What and how much content do they read into memory once a time?

If you use gawk (which is the default awk on Linux, I believe), you can use the RS parameter to specify that multi-digit numbers or spaces are recognized as line terminators instead of a new-line.
awk '{print}' RS="[[:digit:]]+| +" file.txt
As to your second question, all of these programs will need to read some fixed number of bytes and search for its idea of a line separator in an internal buffer to simulate the appearance of reading a single line at a time. To prevent it from reading too much data while searching for the end of the line, you need to change the programs idea of what terminates a line.
Most languages allow you to do this, but only allow you to specify a single character. gawk makes it easy by allowing you to specify a regular expression to recognize an end-of-line character. This saves you from having to implement the fixed-size buffer and end-of-line search yourself.

Fastest... You can do it with help of gcc, here's a version which reads data from given file name if given, otherwise from stdin. If this is still too slow, you can see if you can make it faster by replacing getchar() and putchar() (which may be macros and should optimize very well) with your own buffering code. If we want to get ridiculous, for even faster, you should have three threads, so kernel can copy next block of data with one core, while another core does processing, and third core copies processed output back to kernel.
#!/bin/bash
set -e
BINNAME=$(mktemp)
gcc -xc -O3 -o $BINNAME - <<"EOF"
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
int sep = 0;
/* speed is a requirement, so let's reduce io overhead */
const int bufsize = 1024*1024;
setvbuf(stdin, malloc(bufsize), _IOFBF, bufsize);
setvbuf(stdout, malloc(bufsize), _IOFBF, bufsize);
/* above buffers intentionally not freed, it doesn't really matter here */
int ch;
while((ch = getc(stdin)) >= 0) {
if (isdigit(ch) || isspace(ch)) {
if (!sep) {
if (putc('\n', stdout) == EOF) break;
sep = 1;
}
} else {
sep = 0;
if (putc(ch, stdout) == EOF) break;
}
}
/* flush should happen by on-exit handler, as buffer is not freed,
but this will detect write errors, for program exit code */
fflush(stdout);
return ferror(stdin) || ferror(stdout);
}
EOF
if [ -z "$1" ] ; then
$BINNAME <&0
else
$BINNAME <"$1"
fi
Edit: I happened too look at GNU/Linux stdio.h, some notes: putchar/getchar are not macros, but putc/getc are, so using those instead might be a slight optimization, probably avoiding one function call, changed code to reflect this. Also added checking return code of putc, while at it.

With grep:
$ grep -o '[^0-9 ]' <<< "a11b12c22d322 e"
a
b
c
d
e
With sed:
$ sed 's/[0-9 ]\+/\n/g' <<< "a11b12c22d322 e"
a
b
c
d
e
With awk:
$ awk 'gsub(/[0-9 ]+/,"\n")' <<< "a11b12c22d322 e"
a
b
c
d
e
I'll let you benchmark.

Try with tr:
tr -s '[:digit:][:space:]' '\n' <<< "a11b12c22d322e"
That yields:
a
b
c
d
e

Related

Extracting values from a single file

I have a file with multiple lines; but a specific line contains tons of information, with several repeated expressions. I'm trying to extract some specific values. I first tried some commands with sed, for instance, but with no success. So, I was wondering if you could give me some insights.
So, here you have one fraction of the unique line of the given document I mentioned:
[...]6[&length_range={0.19
[... a lot of more information here in between ...]
0.01},habitat.set.prob={0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01},DLOOP.rate_median=0.04131395026396427,length=
[...]
10[&length_range={0.19
[... a lot of more information here in between ...]
0.01},habitat.set.prob={0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61},DLOOP.rate_median=0.04131395026396427,length=
[...]
My aim here is first to extract all the values that is between the brackets, after "habitat.set.prob={". and put them in a single line in a text file.
Also, it would be important to extract the numbers that appears just before the expression "[&length_range=]", which in this case are "6" and "10". They are the label of the set of numbers after "prob={"
So the set of numbers I want to extract always appears between "habitat.set.prob={" and "},DLOOP.rate_median", while the other number (the label) is always rigth before "[&length_range="; but what is before the label is not the same expression; actually it is a random number.
The goal then is end up with a file with the following characteristcs:
6 0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
10 0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
and so on …
What do you think? Is this possible?
I started with this very basic command at least to try to extract the set of numbers, but it didn't work
sed -n "/habitat.set.prob={/,/},DLOOP.rate_median=/ p"
| Well... I got some improvement.
I was able to get the values at least:
awk '{gsub("habitat.set.prob={","\n");printf"%s",$0}' filename | awk -F'},' '{print $1"}"}' | grep -iv "TREE" > stats.txt
|
Many thanks in advance.
Cheers,
Luiz
Something like that:
sed -rn '/.*[0-9]+\[&length_range=\{/,/habitat.set.prob=\{/{s/.*\b([0-9]+)\[&length_range.*/\1/p; s/.*habitat.set.prob=\{([^D]+)\},DLOOP.rate.*/\1/p}' habitat
6
0.01,0.03,0.56,0.01,0.01,0.34,0.01,0.01,0.01
10
0.21,0.33,0.56,0.01,0.01,0.33,0.01,0.01,0.61
The first part '/.a./,/.b./' searches from pattern a to b, distributed over multiple lines. The -n told sed to do non-printing as default.
In '/.a./,/.b./{s/.c./.d./p; s/.e./.f./p}'
there are two substitution commands with p=print in curly braces.
I am not sure if you really digged a little, so not providing the complete answer, but let's hope this would help you:
for the first part: getting the no(which you call as label) you didn't mention if there is any specific pattern, so try this (data is the file which contains the actual input) - you need to work on how to get the number and tweak the RE a bit
sed -n 's/.*\([0-9][0-9]*\).*length_range.*/\1/p' data
For the other part which gives the numericals between habitat and DLOOP:
sed -n 's/.*habitat.set.prob=\(.*\),DLOOP.*/\1/pg' data | tr '{' ' ' | tr '}' ' '
Now, try to take this as a starter and work on your output to get your desired result!
To explain a bit:
In the first section - I am trying to capture the numericals between anything(.*) and (.*)length_range [you can escape the character [ and & by using \ in front of them]
In the second section: I am capturing pattern in between habitat.set.prob and DLOOP and then doin a tr to remove the brackets.
#include <iostream>
using namespace std;
int main()
{
string p = "1:2:3:4"; //input your string
int arr[4] = {}; //create a new empty integer array to put the integers in it
for(int i=0, j=0; i <p.length(); i++){//loop on the string to extract integers
if( p[i] == ':'){continue;}//if the value = ':' skip it and continue
arr[j]=(int)p[i]-48;j++;//put the integer in the array we created
}
cout << "String={"<<arr[0]<<" "<<arr[1]<<" "<<arr[2]<<" "<<arr[3]<<"}";//print the array
return 0;
}

sed add different lines in different files

I'm trying to find a way to run multiple sed commands that adds lines to the start of different files (on Mac OS).
This works when run from terminal.
sed -i '' '1i\
\\version \"2.19.65\"\
\\language \"english\"\
\\include \"dynamics-defs.ily\"\
altosaxINotes = \\transpose c ef {\
\\relative c {\
' altosaxI.ily
But I want to add slightly different text on a different file:
sed -i '' '1i\
\\version \"2.19.65\"\
\\language \"english\"\
\\include \"dynamics-defs.ily\"\
altosaxIINotes = \\transpose c ef {\
\\relative c {\
' altosaxII.ily
I have about 30 or 40 of these to run, all slightly different. Is it possible to combine them all into one terminal command, or perhaps use Mac's automator, or maybe a better solution?
This might work for you (GNU sed):
# create a function f with one parameter
f(){ cat <<! >tempFile && sed -i '1e cat tempFile' ${1}.ily; }
\\version "2.19.65"
\\language "english"
\\include "dynamics-defs.ily"
${1}Notes = \\transpose c ef {
\\relative c {
!
# call the function
f altosaxI
The function f can then be included in a for-loop or a script.

use sed to change a text report to csv

I have a report looks like this:
par_a
.xx
.yy
par_b
.zz
.tt
I wish to convert this format into csv format as below using sed 1 liner:
par_a,.xx
par_a,.yy
par_b,.zz
par_b,.tt
please help.
With awk:
awk '/^par_/{v=$0;next}/^ /{$0=v","$1;print}' File
Or to make it more generic:
awk '/^[^[:blank:]]/{v=$0;next} /^[[:blank:]]/{$0=v","$1;print}' File
When a line starts with par_, save the content to variable v. Now, when a line starts with space, change the line to content of v followed by , followed by the first field.
Output:
AMD$ awk '/^par_/{v=$0}/^ /{$0=v","$1;print}' File
par_a,.xx
par_a,.yy
par_b,.zz
par_b,.tt
With sed:
sed '/^par_/ { h; d; }; G; s/^[[:space:]]*//; s/\(.*\)\n\(.*\)/\2,\1/' filename
This works as follows:
/^par_/ { # if a new paragraph begins
h # remember it
d # but don't print anything yet
}
# otherwise:
G # fetch the remembered paragraph line to the pattern space
s/^[[:space:]]*// # remove leading whitespace
s/\(.*\)\n\(.*\)/\2,\1/ # rearrange to desired CSV format
Depending on your actual input data, you may want to replace the /^par_/ with, say, /^[^[:space:]]/. It just has to be a pattern that recognizes the beginning line of a paragraph.
Addendum: Shorter version that avoids regex repetition when using the space pattern to recognize paragraphs:
sed -r '/^\s+/! { h; d; }; s///; G; s/(.*)\n(.*)/\2,\1/' filename
Or, if you have to use BSD sed (as comes with Mac OS X):
sed '/^[[:space:]]\{1,\}/! { h; d; }; s///; G; s/\(.*\)\n\(.*\)/\2,\1/' filename
The latter should be portable to all seds, but as you can see, writing portable sed involves some pain.

Code Golf - Word Scrambler

Please answer with the shortest possible source code for a program that converts an arbitrary plaintext to its corresponding ciphertext, following the sample input and output I have given below. Bonus points* for the least CPU time or the least amount of memory used.
Example 1:
Plaintext: The quick brown fox jumps over the lazy dog. Supercalifragilisticexpialidocious!
Ciphertext: eTh kiquc nobrw xfo smjup rvoe eth yalz .odg !uioiapeislgriarpSueclfaiitcxildcos
Example 2:
Plaintext: 123 1234 12345 123456 1234567 12345678 123456789
Ciphertext: 312 4213 53124 642135 7531246 86421357 975312468
Rules:
Punctuation is defined to be included with the word it is closest to.
The center of a word is defined to be ceiling((strlen(word)+1)/2).
Whitespace is ignored (or collapsed).
Odd words move to the right first. Even words move to the left first.
You can think of it as reading every other character backwards (starting from the end of the word), followed by the remaining characters forwards. Corporation => XoXpXrXtXoX => niaorCoprto.
Thank you to those who pointed out the inconsistency in my description. This has lead many of you down the wrong path, which I apologize for. Rule #4 should clear things up.
*Bonus points will only be awarded if Jeff Atwood decides to do so. Since I haven't checked with him, the chances are slim. Sorry.
Python, 50 characters
For input in i:
' '.join(x[::-2]+x[len(x)%2::2]for x in i.split())
Alternate version that handles its own IO:
print ' '.join(x[::-2]+x[len(x)%2::2]for x in raw_input().split())
A total of 66 characters if including whitespace. (Technically, the print could be omitted if running from a command line, since the evaluated value of the code is displayed as output by default.)
Alternate version using reduce:
' '.join(reduce(lambda x,y:y+x[::-1],x) for x in i.split())
59 characters.
Original version (both even and odd go right first) for an input in i:
' '.join(x[::2][::-1]+x[1::2]for x in i.split())
48 characters including whitespace.
Another alternate version which (while slightly longer) is slightly more efficient:
' '.join(x[len(x)%2-2::-2]+x[1::2]for x in i.split())
(53 characters)
J, 58 characters
>,&.>/({~(,~(>:#+:#i.#-#<.,+:#i.#>.)#-:)#<:##)&.><;.2,&' '
Haskell, 64 characters
unwords.map(map snd.sort.zip(zipWith(*)[0..]$cycle[-1,1])).words
Well, okay, 76 if you add in the requisite "import List".
Python - 69 chars
(including whitespace and linebreaks)
This handles all I/O.
for w in raw_input().split():
o=""
for c in w:o=c+o[::-1]
print o,
Perl, 78 characters
For input in $_. If that's not acceptable, add six characters for either $_=<>; or $_=$s; at the beginning. The newline is for readability only.
for(split){$i=length;print substr$_,$i--,1,''while$i-->0;
print"$_ ";}print $/
C, 140 characters
Nicely formatted:
main(c, v)
char **v;
{
for( ; *++v; )
{
char *e = *v + strlen(*v), *x;
for(x = e-1; x >= *v; x -= 2)
putchar(*x);
for(x = *v + (x < *v-1); x < e; x += 2)
putchar(*x);
putchar(' ');
}
}
Compressed:
main(c,v)char**v;{for(;*++v;){char*e=*v+strlen(*v),*x;for(x=e-1;x>=*v;x-=2)putchar(*x);for(x=*v+(x<*v-1);x<e;x+=2)putchar(*x);putchar(32);}}
Lua
130 char function, 147 char functioning program
Lua doesn't get enough love in code golf -- maybe because it's hard to write a short program when you have long keywords like function/end, if/then/end, etc.
First I write the function in a verbose manner with explanations, then I rewrite it as a compressed, standalone function, then I call that function on the single argument specified at the command line.
I had to format the code with <pre></pre> tags because Markdown does a horrible job of formatting Lua.
Technically you could get a smaller running program by inlining the function, but it's more modular this way :)
t = "The quick brown fox jumps over the lazy dog. Supercalifragilisticexpialidocious!"
T = t:gsub("%S+", -- for each word in t...
function(w) -- argument: current word in t
W = "" -- initialize new Word
for i = 1,#w do -- iterate over each character in word
c = w:sub(i,i) -- extract current character
-- determine whether letter goes on right or left end
W = (#w % 2 ~= i % 2) and W .. c or c .. W
end
return W -- swap word in t with inverted Word
end)
-- code-golf unit test
assert(T == "eTh kiquc nobrw xfo smjup rvoe eth yalz .odg !uioiapeislgriarpSueclfaiitcxildcos")
-- need to assign to a variable and return it,
-- because gsub returns a pair and we only want the first element
f=function(s)c=s:gsub("%S+",function(w)W=""for i=1,#w do c=w:sub(i,i)W=(#w%2~=i%2)and W ..c or c ..W end return W end)return c end
-- 1 2 3 4 5 6 7 8 9 10 11 12 13
--34567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
-- 130 chars, compressed and written as a proper function
print(f(arg[1]))
--34567890123456
-- 16 (+1 whitespace needed) chars to make it a functioning Lua program,
-- operating on command line argument
Output:
$ lua insideout.lua 'The quick brown fox jumps over the lazy dog. Supercalifragilisticexpialidocious!'
eTh kiquc nobrw xfo smjup rvoe eth yalz .odg !uioiapeislgriarpSueclfaiitcxildcos
I'm still pretty new at Lua so I'd like to see a shorter solution if there is one.
For a minimal cipher on all args to stdin, we can do 111 chars:
for _,w in ipairs(arg)do W=""for i=1,#w do c=w:sub(i,i)W=(#w%2~=i%2)and W ..c or c ..W end io.write(W ..' ')end
But this approach does output a trailing space like some of the other solutions.
For an input in s:
f=lambda t,r="":t and f(t[1:],len(t)&1and t[0]+r or r+t[0])or r
" ".join(map(f,s.split()))
Python, 90 characters including whitespace.
TCL
125 characters
set s set f foreach l {}
$f w [gets stdin] {$s r {}
$f c [split $w {}] {$s r $c[string reverse $r]}
$s l "$l $r"}
puts $l
Bash - 133, assuming input is in $w variable
Pretty
for x in $w; do
z="";
for l in `echo $x|sed 's/\(.\)/ \1/g'`; do
if ((${#z}%2)); then
z=$z$l;
else
z=$l$z;
fi;
done;
echo -n "$z ";
done;
echo
Compressed
for x in $w;do z="";for l in `echo $x|sed 's/\(.\)/ \1/g'`;do if ((${#z}%2));then z=$z$l;else z=$l$z;fi;done;echo -n "$z ";done;echo
Ok, so it outputs a trailing space.

Insert double quotes multiple times into string

I have inherited a flat html file with a few hundred lines similar to this:
<blink>
<td class="pagetxt bordercolor="#666666 width="203 colspan="3 height="20>
</blink>
So far I have not been able to work out a sed way of inserting the closing double quotes for each element. Probably needs something other than sed to do this. Can anyone suggest an easy way to do this?
Thanks
sed -i 's/"\([^" >]\+\)\( \|>\)/"\1"\2/g' file.html
Explanation:
" - leading double quote
\([^" >]\+\) - non-quote-or-space-or-'>' chars, grouped (into group 1)
\( \|>\) - terminating space or '>', grouped (into group 2)
We replace it with '"<group1>"<group2>'.
One solution that pops out at me is to parse through each line of the file looking for the quote. When it finds one, activate a flag to keep track of being inside a quoted area, then continue parsing the line until it hits the first space or > it comes to and inserts an additional " just before it. Flip the flag off, then continue through the string looking for the next quote. Probably not a perfect solution, but a start perhaps.
If all lines share the same structure, you could use a simple texteditor to globally replace
' bordercolor'
with
'" bordercolor'
(without single-quotes). This is then independend from the field values and works similarly for the other fields. You still have to do some manual work, but if it's just one big file, I'd bite the bullet this time and not waste probably more time working out a sed-solution.
This should do if your file is simple - it won't work if you have whitespace which should be inside the quotes - in that case, a more complex code will be needed, but can be done along the same lines.
#!usr/bin/env python
#change the "utf-8" bellow to your files encoding
data = open("<myfile.html>").read().decode("utf-8")
new_data = []
inside_tag = False
inside_quotes = False
for char in data:
if char == "<":
inside_tag = True
if char == '"':
inside_quotes = True
if inside_tag and (char.isspace() or char==">") and inside_quotes:
new_data.append('"')
inside_quotes = False
if char == ">":
inside_tag = False
new_data.append(char)
outputfile = open("<mynewfile.html>", "wt")
outputfile.write("".join(new_data).encode("utf-8"))
outputfile.close()
with bash
for file in *
do
flag=0
while read -r line
do
case "$line" in
*"<blink>"*)
flag=1
;;
esac
if [ "$flag" -eq 1 ];then
case "$line" in
*class=\"pagetxt*">" )
line="${line%>}\">"
flag=0
;;
esac
fi
echo "${line}"
done <"file" > temp
mv temp "$file"
done
Regular expressions are your friend:
Find: (="[^" >]+)([ >])
Replace: \1"\2
After you've done that, make sure to run this one too:
Find: </?blink>
Replace: \n
(This won't fix more than one class on an element, like <element class="class1 class2 id="jimmy">)