use crunch to generate all the possible IATA code combinations - crunch

I don't really know how to formulate this, but I have a bunch of IATA codes, and I want to generate all the possible combinations, e.g. JFK/LAX, BOS/JFK, etc., separated by a character such as "/" or "|".

Here we assume your IATA codes are stored in the file file; one code per line.
crunch has the -q option which generates permutations of lines from a file. However, in this mode crunch ignores most of the other options like <max-len>, which would be important here to print only pairs of codes.
Therefore, it would be easier and faster to …
Use something different than crunch
For instance, try
join -j2 -t/ -o 1.1,2.1 file file | awk -F/ '$1!=$2'
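Why this works: the input lines contain no "/", so with -t/ the join field 2 is empty on every line; join therefore pairs every line with every line (a full cross join), and the awk filter drops the pairs of identical codes. For example, with a hypothetical three-line file:
cat file
JFK
LAX
BOS
join -j2 -t/ -o 1.1,2.1 file file | awk -F/ '$1!=$2'
JFK/LAX
JFK/BOS
LAX/JFK
LAX/BOS
BOS/JFK
BOS/LAX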
If you really, really, really want, you can …
Translate the input into something crunch can work with
We translate each line from file to a unique single character, supply that list of characters to crunch, and then translate the result back.
crunch supports Unicode characters, so files with more than 255 lines are totally fine. Here we enumerate the lines in file by characters in Unicode's Supplementary Private Use Area-A. Therefore, file may have at most 65'534 lines.
If you need more lines, you could combine multiple Unicode planes, but at some point you might run into ARG_MAX issues. Also, with 65'534 lines you would already generate (a bit less than) 65'534^2 = 4'294'705'156 pairs, occupying more than 34 GB when translated into pairs of IATA codes.
I suspect the back-translation is a huge slowdown, so the above alternative seems better in every respect (efficiency, brevity, maintainability, …).
# This assumes your locale is using any Unicode encoding,
# e.g. UTF-8, UTF-16, … (doesn't matter which one).
file=...
((offset=0xF0000))
# Build the charset: one private-use character per line of "$file".
# bc prints \U… escapes in hex; `echo -en` turns them into actual characters.
charset=$(
  echo -en "$(bc <<< "obase=16;
    max=$offset+$(wc -l < "$file");
    for (i=$offset; i<max; i++) {\"\U\"; i}" |
    tr -d \\n)"
)
crunch 2 2 "$charset" -d 1# --stdout |  # every ordered pair of charset characters
iconv -t UTF-32 |                       # fixed width: 4 bytes per character
od -j4 -tu4 -An -w12 -v |               # skip the BOM; 3 code points per line (pair + newline)
awk -v o="$offset" 'NR==FNR {a[o+NR-1]=$0; next} {print a[$1] "/" a[$2]}' "$file" -
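As a hypothetical sanity check: for the same three-line file as above, this pipeline should print the same six ordered pairs (the order follows the line order of file, since that is also the charset order):
JFK/LAX
JFK/BOS
LAX/JFK
LAX/BOS
BOS/JFK
BOS/LAX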


What's the correct usage of sed with parallel --jobs option?

parallel -a input --colsep ' ' --jobs 100 -I {} sed -i 's/{1}/{2}/g' file
input is a file delimited by space, where the first column is pattern and the second column is replacement.
The problem is that after I ran the command, not all patterns were replaced in file. Then I ran the same command again, more patterns were replaced, but still not all.
However, if I change --jobs 100 to --jobs 1, it will work as expected (but much slower).
Is there any necessary parameter missing from my command?
Sounds more like you have a race condition. If you have several sed processes writing to the file, one will win, and the other(s) will lose.
Having multiple processes process the same file is hugely suboptimal anyway; just generate a single sed script and then run it once. Or if you really want to parallelize, split the input file into smaller pieces, run the generated sed script on each in parallel, and then concatenate them back when you are done.
Parallel processing helps when your task is CPU bound, but this one is I/O bound; you are simply creating congestion by having several processes fight over access to the bytes on the disk, and in this case also over write access back to the same file.
There are many examples of how to generate a sed script; here's a quick and dirty one which will however not work on some platforms where sed -f - does not read the script from standard input.
sed 's%^\([^ ]*\) \([^ ]*\)$%s/\1/\2/g%' input |
sed -f - file >temp # or sed -f - -i file
I omitted the -i option so that you can check that this does what you want before plunging ahead and deploying it in production. The commented-out version is what you would use once you are satisfied that this really does what you want.
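If your sed does not read the script from standard input via -f -, a portable variant is to write the generated script to a file first (script.sed is just an illustrative name):
sed 's%^\([^ ]*\) \([^ ]*\)$%s/\1/\2/g%' input > script.sed
sed -f script.sed file > temp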
There is still the question of replacement precedence. If you have s/a/b/ and s/b/c/ then do you want effectively s/a/c/, or the opposite? If you have s/abc/x/ and s/abcdef/y/, should abcdef always become y, or is xdef what you expect? A common hack is to sort the replacements by length so that the longer ones always get executed before the shorter ones; then at least you know what to expect.
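Here's a sketch of that hack, assuming input has exactly two space-separated columns as above (the length prefix and the script.sed name are just illustrative):
awk '{ print length($1), $0 }' input |  # prefix each line with the pattern length
sort -rn |                              # longest patterns first
cut -d ' ' -f2- |                       # drop the prefix again
sed 's%^\([^ ]*\) \([^ ]*\)$%s/\1/\2/g%' > script.sed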
Let us assume that input is big and file is huge.
You really do not want to read file more than once.
First you need to convert input into a single big sed script.
cat input | parallel --colsep ' ' echo s/{1}/{2}/g >bigsed
As @tripleee says, you may need to sort this, so the longest source string is first.
Then you need to split file into one chunk per CPU thread, run the script on each chunk and finally append the replaced chunks back in order:
parallel --pipepart -a file -k sed -f bigsed > replaced
You will need /tmp to have enough free space to hold replaced, or you must set $TMPDIR to a directory that does.
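For example, assuming /scratch is a hypothetical filesystem with enough free space:
TMPDIR=/scratch parallel --pipepart -a file -k sed -f bigsed > replaced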

Using Sed to Delete multiple lines using a file with patterns

I am currently using sed to delete lines, plus the line that follows each of them, for various patterns in a file, using the following code:
sed -i -e "/String1/,+1d" -e "/String2/,+1d" filename.txt
This works very well; however, I have a lot of patterns, and they vary from time to time.
Is it possible to put all the patterns in another text file and make sed delete the matching entries for every pattern found in that file?
Thanks
Here is an awk version
awk 'NR==FNR {a[$0]++;next} {for (i in a) if ($0~i) f=2} --f<0' list yourfile
NR==FNR {a[$0]++;next} stores the patterns from the file list in array a
for (i in a) loops through all patterns in list for every input line
if ($0~i) f=2 sets flag f to 2 when a trigger line is found
--f<0 decrements flag f and tests whether it is less than 0; if so, the line is printed.
example
cat yourfile
one
two
three
four
five
six
seven
eight
nine
ten
eleven
cat list
three
eight
awk 'NR==FNR {a[$0]++;next} {for (i in a) if ($0~i) f=2} --f<0' list yourfile
one
two
five
six
seven
ten
eleven
Trying to stick with sed at all costs, and being creative :-)
Consider using sed itself to generate the sed script that will perform the substitutions, based on the patterns file.
It is important to note that this solution processes each input file in a single pass, which makes it usable on large files with many patterns.
Proposed Solution:
sed -i -e "$(sed -e '/\//d;s/^/\//;s/$/\/,+1d/' < patterns.txt)" filename.txt
The embedded sed program (sed -e '/\//d;s/^/\//;s/$/\/,+1d/ ...) will convert the patterns.txt to a small sed script:
pattern.txt:
three
eight
foo/bar
Output (notice that foo/bar is ignored - it contains '/'):
/three/,+1d
/eight/,+1d
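To preview the result before editing in place, you can run the same command without -i first:
sed -e "$(sed -e '/\//d;s/^/\//;s/$/\/,+1d/' < patterns.txt)" filename.txt | less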
Notes, Limitations, etc:
One limitation of the above implementation is the delimiter: the code removes any pattern containing '/' to simplify generation of the sed script and to avoid potential injection. It is possible to work around this limitation and allow for an alternate delimiter (by escaping special characters in the pattern, or by leveraging '\%' addresses); this may need additional testing.
The code assumes that the patterns are valid regular expressions.
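A hedged sketch of that workaround (the escaping idiom covers the usual BRE specials, including '/', but is not exhaustively tested): escape each pattern before wrapping it in an address, instead of dropping it:
sed -e 's/[]\/$*.^[]/\\&/g' -e 's/^/\//' -e 's/$/\/,+1d/' patterns.txt
With the pattern.txt above, foo/bar would then produce /foo\/bar/,+1d instead of being dropped.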

I want to print a text file in columns

I have a text file which looks something like this:
jdkjf
kjsdh
jksfs
lksfj
gkfdj
gdfjg
lkjsd
hsfda
gadfl
dfgad
[very many lines, that is]
but would rather like it to look like
jdkjf kjsdh
jksfs lksfj
gkfdj gdfjg
lkjsd hsfda
gadfl dfgad
[and so on]
so I can print the text file on a smaller number of pages.
Of course, this is not a difficult problem, but I'm wondering if there is some excellent tool out there for solving problems like these.
EDIT: I'm not looking for a way to remove every other newline from a text file, but rather a tool which interprets text as "pictures" and then lays these out on the page nicely (by writing the appropriate whitespace symbols).
You can use this Python code.
tables = input("Enter number of tables ")
matrix = []
file = open("test.txt")
for line in file:
    matrix.append(line.replace("\n", ""))
    if len(matrix) == int(tables):
        print(" ".join(matrix))  # join the collected lines with spaces instead of printing the list
        matrix = []
if matrix:  # don't lose leftover lines when the line count is not a multiple
    print(" ".join(matrix))
file.close()
(Since you don't name your operating system, I'll simply assume Linux, Mac OS X or some other Unix...)
Your example looks like it can also be described by the expression "joining 2 lines together".
This can be achieved in a shell (with the help of xargs and awk) -- but only for an input file that is structured like your example (the result always puts 2 words on a line, irrespective of how many words each one contains):
cat file.txt | xargs -n 2 | awk '{ print $1" "$2 }'
This can also be achieved with awk alone (this time it really joins 2 full lines, irrespective of how many words each one contains):
awk '{printf "%s ", $0; getline; print $0}' file.txt
Or use sed --
sed 'N;s#\n# #' < file.txt
Also, xargs could do it:
xargs -L 2 < file.txt
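paste can do it as well, and it handles an odd final line gracefully (the standard - - operands make it read two lines per output row from standard input):
paste -d ' ' - - < file.txt
And if you are after real page layout rather than just joining lines (as per the EDIT), pr may be closer to what you want; for example, pr -2 -a -t file.txt fills two columns across the page.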
I'm sure other people could come up with dozens of other, quite different methods and commandline combinations...
Caveats: You'll have to test files with an odd number of lines explicitly; the last input line may not be processed correctly in that case.

How can I remove all non-word characters except the newline?

I have a file like this:
my line - some words & text
oh lóok i've got some characters
I want to 'normalize' it and remove all the non-word characters. I want to end up with something like this:
mylinesomewordstext
ohlóokivegotsomecharacters
I'm using Linux on the command line at the moment, and I'm hoping there's some one-liner I can use.
I tried this:
cat file | perl -pe 's/\W//g'
But that removed all the newlines and put everything on one line. Is there some way I can tell Perl not to include newlines in \W? Or is there some other way?
This removes characters that don't match \w or \n:
cat file | perl -C -pe 's/[^\w\n]//g'
@sth's solution uses Perl, which is (at least on my system) not Unicode compatible, and thus loses the accented o character.
On the other hand, sed is Unicode compatible (according to the lists on this page), and gives a correct result:
$ sed 's/\W//g' a.txt
mylinesomewordstext
ohlóokivegotsomecharacters
In Perl, I'd just add the -l switch, which re-adds the newline by appending it to the end of every print():
perl -ple 's/\W//g' file
Notice that you don't need the cat.
The previous response doesn't preserve the "ó" character, at least in my case.
sed 's/\W//g' file
Best practices for shell scripting dictate that you should use the tr program instead of sed for replacing or deleting single characters, because it's faster. Obviously use sed if replacing longer strings.
tr -d '[:blank:][:punct:]' < file
When run with time I get:
real 0m0.003s
user 0m0.000s
sys 0m0.004s
When I run the sed answer (sed -e 's/\W//g' file) with time I get:
real 0m0.003s
user 0m0.004s
sys 0m0.004s
While not a "huge" difference, you'll notice the difference when running against larger data sets. Also please notice how I didn't pipe cat's output into tr, instead using I/O redirection (one less process to spawn).
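If you want tr to mirror the \W behaviour more closely (keep word characters and newlines, drop everything else, rather than just deleting blanks and punctuation), here is a sketch using the complement option:
tr -cd '[:alnum:]_\n' < file
One caveat: GNU tr operates on bytes, not multibyte characters, so in a UTF-8 locale this would strip the accented "ó" from the example; the Perl and sed answers handle that case better.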

Convert double-byte numbers and spaces in filenames to ASCII

Given a directory of filenames consisting of double-byte/full-width numbers and spaces (along with some half-width numbers and underscores), how can I convert all of the numbers and spaces to single-byte characters?
For example, this filename consists of a double-byte number, followed by a double-byte space, followed by some single-byte characters:
２　2_3.ext
and I'd like to change it to all single-byte like so:
2 2_3.ext
I've tried convmv to convert from utf8 to ascii, but the following message appears for all files:
"ascii doesn't cover all needed characters for: filename"
You need either (1) normalization from Java 1.6 (java.text.Normalizer), or (2) ICU, or (3, unlikely) a product sold by the place I work.
What tools do you have available? There are Unicode normalisation functions in several scripting languages, for example in Python:
import os
import unicodedata

for child in os.listdir(u'.'):
    normal = unicodedata.normalize('NFKC', child)
    if normal != child:
        os.rename(child, normal)
Thanks for your quick replies, bmargulies and bobince. I found a Perl module, Unicode::Japanese, that helped get the job done. Here is a bash script I made (with help from this example) to convert filenames in the current directory from full-width to half-width characters:
#!/bin/bash
for file in *; do
    newfile=$(echo "$file" | perl -MUnicode::Japanese -e 'print Unicode::Japanese->new(<>)->z2h->get;')
    test "$file" != "$newfile" && mv "$file" "$newfile"
done