Extracting specific variables from a |-delimited .txt into a new .txt - perl

I have a .txt that is, say, 100,000 rows (observations) by 50 columns (variables), and the variables are | delimited. I would like to extract the 8th and 9th variables (or 7 and 8 if the indexing starts at 0). In doing so, I'd like to create a new .txt that is 100,000 rows (the same observations) by 2 columns (these 2 variables) in which these 2 variables remain | delimited.
For example, the data in one row is formatted as:
var1|var2|var3|var4|var5|var6|var7|var8|var9|var10|var11 .........
I'd like to create a .txt with this row being:
var7|var8
I've tried:
$ perl -wplaF'|' -e'$_ = join "|", @F[7, 8]' fileoriginal.txt > filenew.txt
This output is just kind of gibberish, however.
Any help would be greatly appreciated!

The argument to -F is compiled into a regular expression, and | is a special character in regular expressions. To use a literal | char, you need to escape it on the command line.
One of
perl -F\\\| -wlape ...
perl -F'\|' -wlape ...
does the trick on Unix.

Related

Comm command issue

I'm trying to compare two gene lists and extract the common ones. I sorted my .txt files and used the comm command:
comm gene_list1.txt gene_list2.txt
Strangely, when I check the output, there are many common genes that are not printed in the third column. Here is part of the output:
As you can see, AAAS and AAGAB etc. exist in both files, but they are not printed as common lines! Any idea why this happens?
Thank you
$ comm file1.txt file2.txt
The output of the above command contains three columns: the first column, not indented, holds names present only in file1.txt.
The second column, indented by one tab, holds names present only in file2.txt.
The third column, indented by two tabs from the beginning of the line, holds names common to both files.
This is the default output of the comm command when no options are used.
Assuming both input files are sorted, the required command for your use case would be
$ comm -12 gene_list1.txt gene_list2.txt
The -1 and -2 flags suppress columns 1 and 2 (names unique to each file), since you are only interested in the elements common to both files.
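A minimal sketch with made-up gene names shows the effect of -12 (sorted input assumed, as comm requires):

```shell
# Two sorted stand-ins for gene_list1.txt and gene_list2.txt
printf 'AAAS\nAAGAB\nBRCA1\n' > gene_list1.txt
printf 'AAAS\nAAGAB\nTP53\n'  > gene_list2.txt

# Suppress columns 1 and 2; only lines common to both files remain
comm -12 gene_list1.txt gene_list2.txt    # AAAS and AAGAB
```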

Insert new line after any amount of numeric characters

I need to insert a new line, or a delimiter, in a text file after a "numeric" string consisting of 10 numbers, then a "-", then 1 to 4 numbers...
Example:
randomtext,1234567890-1234blahblah
Should be:
randomtext,1234567890-1234, blahblah
Or:
randomtext,1234567890-1234
blahblah
Note that the first set of numbers will always be 10 characters; the numbers after the - will be either 1, 2, 3 or 4 characters.
I've used sed a lot for similar tasks, but can't find a way to work with the last set of numbers which vary from 1 to 4 characters....
I really hope someone can help!
Many thanks!
$ echo randomtext,1234567890-1234blahblah |
sed -E 's/[0-9]{10}-[0-9]{1,4}/&\n/'
randomtext,1234567890-1234
blahblah
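If the comma-separated variant is preferred instead, the same match works with a different replacement (& stands for the whole matched text):

```shell
# Append ", " after the 10-digit + "-" + 1-4 digit string instead of a newline
echo 'randomtext,1234567890-1234blahblah' |
  sed -E 's/[0-9]{10}-[0-9]{1,4}/&, /'
# randomtext,1234567890-1234, blahblah
```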

adding delimiters to end of file

I am working on a TPT script to process some large files we have. Right now, each record in the file uses | as the field delimiter.
The problem is that not all fields are used by each record. For example, record 1 may have 100 fields and record 2 may have 260. For TPT to work, we need to have a delimiter for each field, so for records that have fewer than 261 fields populated, I need to append the appropriate number of pipes to the end of each record.
So, taking my example above, record one would have 161 pipes appended to the end and record two would have 1.
I have a perl script which will count the number of pipes in each record, but I am not sure how to take that info and accomplish the task of appending that many pipes to the record.
perl -ne 'print scalar(split(/\|/, $_)) . "\n"'
Any advice?
To get the number of pipe symbols, you can use the tr operator:
my $count = tr/|//;
Just subtract the number of pipe symbols from the maximum (260 delimiters for 261 fields) to get the number of pipes to add, and use the x (repetition) operator to generate them:
perl -lne 'print $_, "|" x (260 - tr/|//)'
I'm not sure the constant is correct; it depends on whether a pipe also starts or ends the line.

How to extract a string, number, or word from a line or database and save it to a variable? (script in bash)

My question can be split in 2. First I have a data file (file.dat) that looks like:
Parameter stuff number 1 (1029847) word index 2 (01293487), bla bla
Parameter stuff number 3 (134123) word index 4 (02983457), bla bla
Parameter stuff number 2 (109847) word index 3 (1029473), bla bla
etc...
I want to extract the numbers in brackets and save them to variables: for example, the first one in line one would be 'x1', the second on the same line 'y1'; for line 2, 'x2' and 'y2', and so on... The numbers change randomly line after line, but their position (in columns, if you like) stays the same. The number of lines is variable (0 to 'n'). How can I do this? Please.
I have searched for answers and I get lost with the many different commands one can use; moreover, those answers address particular examples where the word is at the end, or in brackets but only one per line, etc. Anyhow, here is what I have done so far (I am a newbie):
1) I get rid of the characters that are not part of the number in the string
sed -i 's/(//g' file.dat
sed -i 's/),//g' file.dat
2) Out of frustration I decided to output the whole lines to variables (getting closer?)
2.1) Get the number of lines to iterate for:
numlines=$(wc -l < file.dat)
2.2) Loop to numlines (I haven't tested this bit yet!)
for i in {1..$numlines}
do
line${!i}=$(sed -n "${numlines}p" file.dat)
done
2.3) I gave up here, any help appreciated.
The second question is similar and merely out of curiosity: imagine a database separated by spaces, or tabs, or commas, any separator; this database has a variable number of lines ('n') and the number of strings per line may vary too ('k'). How do I extract the value at the 'i'th line and the 'j'th string, and save it to a variable 'x'?
Here is a quick way to store the values in bash array variables.
x=("" $(awk -F"[()]" '{printf "%s ",$2}' file))
y=("" $(awk -F"[()]" '{printf "%s ",$4}' file))
echo ${x[2]}
134123
If you are going to use these data for further jobs, I would do it all in awk instead. Then you can use awk's internal arrays:
awk -F"[()]" '{x[NR]=$2;y[NR]=$4}' file
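For instance, an END block can print the stored pairs back out (the two-line sample below stands in for file.dat):

```shell
# Hypothetical sample in the question's format
printf 'stuff (1029847) word (01293487), bla\nstuff (134123) word (02983457), bla\n' > file.dat

# Split on brackets, store both numbers per line, report them in END
awk -F'[()]' '{x[NR]=$2; y[NR]=$4} END{for(i=1;i<=NR;i++) print x[i], y[i]}' file.dat
# 1029847 01293487
# 134123 02983457
```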
#!/usr/bin/env bash
x=()
y=()
while read -r line; do
x+=("$(sed 's/[^(]*(\([0-9]*\)).*/\1/' <<< "$line")")
y+=("$(sed 's/[^(]*([^(]*(\([0-9]*\)).*/\1/' <<< "$line")")
done < "data"
echo "${x[@]}"
echo "${y[@]}"
x and y are declared as arrays. Then you loop over the input file and invoke a sed command to every line in your input file.
x+=(data) appends the value data to the array x. Instead of writing the value directly, we use command substitution, $(command): rather than appending the literal text $(command) to the array, the shell executes the command and stores its output in the array.
Let's look at the sed commands:
's' is the substitute command. With [^(]* we match everything up to the first (, then the ( itself. The characters we want to capture are wrapped in \( and \) so we can refer to them later (with \1); the number itself is matched by [0-9]*. In the end we match the closing bracket ) and everything else with .*. Then we replace everything we matched (the whole line) with \1, the captured number.
If you are new to sed, this might be highly confusing, since it takes some time to read the sed syntax.
The second sed command is very similar.
How do I extract the value of the 'i'th line on the 'j'th string, and
save it to a variable 'x'?
Try using awk
x=$(awk -v i=$i -v j=$j ' NR==i {print $j; exit}' file.dat)
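For example, pulling line 2, field 3 from a whitespace-separated file (sample data assumed):

```shell
# Three-line sample standing in for file.dat
printf 'a b c\nd e f\ng h i\n' > file.dat

i=2; j=3
# NR==i selects the i-th line; $j prints its j-th field; exit stops early
x=$(awk -v i=$i -v j=$j 'NR==i {print $j; exit}' file.dat)
echo "$x"    # f
```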
I want to extract the number in brackets and save it to a variable for
example the first one in line one to be 'x1', the second on the same
line to be 'y1', for line 2 'x2' and 'y2', and so on...
Using awk
x=($(awk -F'[()]' '{print $2}' file.dat))
y=($(awk -F'[()]' '{print $4}' file.dat))
x1 can be accessed as ${x[0]} and y1 as ${y[0]}, likewise for other sequence of variables.
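A quick check with fabricated data (bash is assumed, since plain sh has no arrays):

```shell
# Two-line sample standing in for file.dat
printf 'p (111) q (222), r\np (333) q (444), r\n' > file.dat

x=($(awk -F'[()]' '{print $2}' file.dat))   # x[0]=111, x[1]=333
y=($(awk -F'[()]' '{print $4}' file.dat))   # y[0]=222, y[1]=444
echo "${x[0]} ${y[1]}"    # 111 444
```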

Ubuntu: splitting a file into three files with a third of the total lines in each

I have a simple ascii text file with a string in each line, something like
aa1
aa2
ab1
...
with a total of N lines. I know I can use the split command to split it into a fixed number of lines per file. How do I instead specify the number of files I want to split it into, and let split decide how many lines go into each? For example, if the file had 100 lines, I want to be able to specify
split 3 foo.txt
and it would write out three files xaa xab and xac each with 33, 33 and 34 lines. Is this even possible? Or do I write a custom Perl script for this?
Try doing this :
split -n 3 file
see
man split | less +/'^\s*-n'
There's no option for that[*].
You could use wc to get the number of lines and divide by 3, so it's a few lines of whatever scripting language you want to use.
([*] update: on Ubuntu there is, and that's what the question was about. -n does not seem to be available on all Linux systems, or on older ones.)
If your split implementation doesn't accept the -n parameter, you can use this bash function:
function split_n() { split -l $((($1+`wc -l <"$2"`-1)/$1)) "$2" "${3:-$2.}"; }
You can invoke it as
split_n 3 file.txt
or
split_n 3 file.txt prefix
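A quick sanity check on a 10-line file (the output names follow split's default suffixing under the given prefix):

```shell
# Same helper as above, repeated so the example is self-contained:
# ceiling division of line count by the requested number of pieces
split_n() { split -l $((($1+`wc -l <"$2"`-1)/$1)) "$2" "${3:-$2.}"; }

seq 10 > nums.txt        # 10 lines -> ceil(10/3) = 4 lines per piece
split_n 3 nums.txt part.
wc -l part.aa part.ab part.ac    # 4, 4, and 2 lines
```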
Given your comment that you do not have the -n option in your split, here is a slightly hackier approach you could take.
lines=`wc -l < foo.txt`
lines=$((lines/3+1))
split -l $lines foo.txt
If you do this often you could store it in a script by taking in the number of splits and filename as follows:
splits=$1
filename=$2
lines=`wc -l < "$filename"`
lines=$((lines/$splits+1))
split -l $lines "$filename"