How to split the data with splitting value? - perl

How can I split data on a particular letter while keeping that letter at the end of the preceding piece?
My Perl code:
$data ="abccddaabcdebb";
@split = split('b',"$data");
foreach (@split){
print "$_\n";
}
This code produces output, but the output I expect is:
ab
ccddaab
cdeb
b
How can I do this?

You can use lookbehind to keep the b:
$data ="abccddaabcdebb";
@split = split(/(?<=b)/, $data);
foreach (@split){
print "$_\n";
}
will print out
ab
ccddaab
cdeb
b

You'll need a positive lookbehind if you want to keep the letter b, because the delimiter itself is excluded from the resulting list.
my $data ="abccddaabcdebb";
my @split = split(/(?<=b)/, $data);
foreach (@split) {
print "$_\n";
}
From perldoc -f split
Anything in EXPR that matches PATTERN is taken to be a separator that separates the EXPR into substrings (called "fields") that do not include the separator.

The first parameter of split defines what separates the elements you want to extract. Here, b doesn't separate the elements you want, because it is part of what you want to keep.
You could specify the split after b using
my @parts = split /(?<=b)/, $s;
You could also use
my @parts = $s =~ /[^b]*b/g;
Side note:
split /(?<=b)/
splits
a b c b b
at three spots
a b|c b|b|
so it results in four strings
ab
cb
b
Empty string
Fortunately, split removes trailing blank strings from its result by default, so it results in the three desired strings instead.
ab
cb
b
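The trailing empty field can be made visible by passing a negative limit, which tells split to keep trailing empty fields. This is a minimal sketch using the sample string from the side note; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $s = "abcbb";

# Default behaviour: trailing empty fields are dropped.
my @default = split /(?<=b)/, $s;

# Negative limit: trailing empty fields are kept.
my @all = split /(?<=b)/, $s, -1;

print scalar(@default), " fields by default\n";
print scalar(@all), " fields with a limit of -1\n";
```

The extra field in the second call is the empty string that follows the final b.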

Related

Regular Expression Matching Perl for first case of pattern

I have multiple variables that have strings in the following format:
some_text_here__what__i__want_here__andthen_some
I want to be able to assign to a variable the what__i__want_here portion of the first variable. In other words, everything after the FIRST double underscore. There may be double underscores in the rest of the string but I only want to take the text after the FIRST pair of underscores.
Ex.
If I have $var = "some_text_here__what__i__want_here__andthen_some", I would like to assign to a new variable only the second part like $var2 = "what__i__want_here__andthen_some"
I'm not very good at matching so I'm not quite sure how to do it so it just takes everything after the first double underscore.
my $text = 'some_text_here__what__i__want_here';
# .*? # Match a minimal number of characters - see "man perlre"
# /s # Make . match also newline - see "man perlre"
my ($var) = $text =~ /^.*?__(.*)$/s;
# $var is not defined when there is no __ in the string
print "var=${var}\n" if defined($var);
You might consider this an example of where split's third parameter is useful. The third parameter to split constrains how many elements to return. Here is an example:
my @examples = (
'some_text_here__what__i_want_here',
'__keep_this__part',
'nothing_found_here',
'nothing_after__',
);
foreach my $string (@examples) {
my $want = (split /__/, $string, 2)[1];
print "$string => ", (defined $want ? $want : ''), "\n";
}
The output will look like this:
some_text_here__what__i_want_here => what__i_want_here
__keep_this__part => keep_this__part
nothing_found_here =>
nothing_after__ =>
This line is a little dense:
my $want = (split /__/, $string, 2)[1];
Let's break that down:
my ($prefix, $want) = split /__/, $string, 2;
The 2 parameter tells split that no matter how many times the pattern /__/ could match, we only want to split one time, the first time it's found. So as another example:
my (@parts) = split /#/, "foo#bar#baz#buzz", 3;
The @parts array will receive these elements: 'foo', 'bar', 'baz#buzz', because we told it to stop splitting after the second split, so that we get a total maximum of three elements in our result.
Back to your case, we set 2 as the maximum number of elements. We then go one step further by eliminating the need for my ($throwaway, $want) = .... We can tell Perl we only care about the second element in the list of things returned by split, by providing an index.
my $want = ('a', 'b', 'c', 'd')[2]; # c, the element at offset 2 in the list.
my $want = (split /__/, $string, 2)[1]; # The element at offset 1 in the list
# of two elements returned by split.
You can use parentheses to capture and then reorder parts of the string. The first set of parentheses () becomes $1 in the replacement part of the substitution, the second becomes $2, and so on.
my $string = "some_text_here__what__i__want_here";
(my $newstring = $string) =~ s/(some_text_here)(__)(what__i__want_here)/$3$2$1/;
print $newstring;
OUTPUT
what__i__want_here__some_text_here

Work around for split function when last character is a terminator

I have this line of data with 20 fields:
my $data = '54243|601|0|||0|N|0|0|0|0|0||||||99582|';
I'm using this to split the data:
my @data = split ('\|'), $data;
However, instead of 20 pieces of data, I only get 19:
print scalar @data;
I could manually push an empty string onto @data if the last character is a | but I'm wondering if there is a more perlish way.
Do
my @data = split /\|/, $data, -1;
The -1 tells split to include empty trailing fields.
(Your parentheses around the regex are incorrect, and lead to $data not being considered a parameter of split. Also, with one exception, the first argument of split is always a regex, so it is better to specify it as a regex not a string that will be interpreted as a regex.)
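The effect of the limit is easy to see on a short string. The sketch below uses a made-up four-field record rather than the data from the question, so the counts are easy to check by hand.

```perl
use strict;
use warnings;

# Hypothetical record: two filled fields followed by two empty ones.
my $record = 'a|b||';

my @trimmed = split /\|/, $record;      # trailing empty fields dropped
my @full    = split /\|/, $record, -1;  # trailing empty fields kept

printf "default: %d fields, limit -1: %d fields\n",
    scalar @trimmed, scalar @full;
```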

Split functions

I want to get the split characters themselves. I tried the code below, but I am only able to get the split text. Also, if the same split character occurs more than once, it should be returned only once.
For example if the string is "asa,agas,asa" then only , should be returned.
So in the below case I should get as "| : ;" (joined with space)
use strict;
use warnings;
my $str = "Welcome|a:g;v";
my @value = split /[,;:.%|]/, $str;
foreach my $final (@value) {
print $final, "\n";
}
split splits a string into elements when given what separates those elements, so split is not what you want. Instead, use:
my @punctuations = $str =~ /([,;:.%|])/g;
So you want to get the opposite of split
try:
my @value=split /[^,;:.%|]+/,$str;
It will split on anything but the delimiters you set.
Correction after comments:
my @value=split /[^,;:.%|]+/,$str;
shift @value;
This works, and returns each delimiter only once:
@value = ();
foreach(split('',",;:.%|")) { push @value,$_ if $str=~/\Q$_\E/; }
To extract all the separators only once, you need something more elaborate
my @punctuations = keys %{{ map { $_ => 1 } $str =~ /[,;:.%|]/g }};
Sounds like you call "split characters" what the rest of us call "delimiters" -- if so, the POSIX character class [:punct:] might prove valuable.
OTOH, if you have a defined list of delimiters, and all you want to do is list the ones present in the string, it's much more efficient to use m// rather than split.
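If the order in which the delimiters first appear matters, a hash of seen values can be combined with the global match. This is a sketch under that assumption; the sample string is the one from the question with an extra | appended to show the de-duplication, and %seen and @delims are illustrative names.

```perl
use strict;
use warnings;

my $str = "Welcome|a:g;v|x";

my %seen;
# Capture every delimiter, then keep only the first occurrence of each.
my @delims = grep { !$seen{$_}++ } $str =~ /([,;:.%|])/g;

print join(' ', @delims), "\n";
```

Unlike taking the keys of a hash, this keeps the delimiters in their order of first appearance.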

How to isolate a word that corresponds with a letter from a different column of a CSV file?

I have a CSV file, like this:
ACDB,this is a sentence
BECD,this is another sentence
BCAB,this is yet another
Each character in the first column corresponds to a word in the second column, e.g., in the first column, A corresponds with "this", C with "is", D with "a", and B, with sentence.
Given the variable character, which can be set to any of the characters appearing in the first column, I need to isolate the word which corresponds to the selected letter, e.g., if I set character="B", then the output of the above would be:
sentence
this
this another
If I set character="C", then the output of the above would be:
is
another
is
How can I output only those words which correspond to the position of the selected letter?
The file contains many UTF-8 characters.
For every character in column 1, there is always an equal number of words in column 2.
The words in column 2 are separated by spaces.
Here is the code I have so far:
while read line
do
characters="$(echo $line | awk -F, '{print $1}')"
words="$(echo $line | awk -F, '{print $2}')"
character="B"
done < ./file.csv
This might work for you:
x=B # set wanted key variable
sed '
:a;s/^\([^,]\)\(.*,\)\([^ \n]*\) *\(.*\)/\2\4\n\1 \3/;ta # pair keys with values
s/,// # delete ,
s/\n[^'$x'] [^\n]*//g # delete unwanted keys/values
s/\n.//g # delete wanted keys
s/ // # delete first space
/^$/d # delete empty lines
' file
sentence
this
this another
or in awk:
awk -F, -vx=B '{i=split($1,a,"");split($2,b," ");c=s="";for(n=1;n<=i;n++)if(a[n]==x){c=c s b[n];s=" "} if(length(c))print c}' file
sentence
this
this another
This seems to do the trick. It reads data from within the source file using the DATA file handle, whereas you will have to obtain it from your own source. You may also have to cater for there being no word corresponding to a given letter (as for 'A' in the second data line here).
use strict;
use warnings;
my @data;
while (<DATA>) {
my ($keys, $words) = split /,/;
my @keys = split //, $keys;
my @words = split ' ', $words;
my %index;
push @{ $index{shift @keys} }, shift @words while @keys;
push @data, \%index;
}
for my $character (qw/ B C /) {
print "character = $character\n";
print join(' ', @{$_->{$character}}), "\n" for @data;
print "\n";
}
__DATA__
ACDB,this is a sentence
BECD,this is another sentence
BCAB,this is yet another
output
character = B
sentence
this
this another
character = C
is
another
is
Here's a mostly-done rump answer.
Since SO is not a "Do my work for me" site, you will need to fill in some trivial blanks.
sub get_index_of_char {
my ($character, $charset) = @_;
# Homework: read about index() function
#http://perldoc.perl.org/functions/index.html
}
sub split_line {
my ($line) = @_;
# Separate the line into a charset (before comma),
# and whitespace separated word list.
# Two calls to split are enough for that
my ($charset, $rest) = split /,/, $line, 2;
my @words = split ' ', $rest;
return ($charset, \@words);
}
sub process_line {
my ($line, $character) = @_;
chomp($line);
my ($charset, $words) = split_line($line);
my $index = get_index_of_char($character, $charset);
print $words->[$index] . "\n"; # Could contain an off-by-one bug
}
# Here be the main loop calling process_line() for every line from input

PERL -- Regex incl all hash keys (sorted) + deleting empty fields from $_ in file read

I'm working on a program and I have a couple of questions, hope you can help:
First I need to access a file and retrieve specific information according to an index that is obtained from a previous step, in which the indexes to retrieve are found and store in a hash.
I've been looking for a way to include all array elements in a regex that I can use in the file search, but I haven't been able to make it work. Eventually I've found a way that works:
my @atoms = ();
my $natoms=0;
foreach my $atomi (keys %{$atome}){
push (@atoms,$atomi);
$natoms++;
}
@atoms = sort {$b cmp $a} @atoms;
and then I use it as a regex this way:
while (<IN_LIG>){
if (!$natoms) {last;}
......
if ($_ =~ m/^\s*$atoms[$natoms-1]\s+/){
$natoms--;
.....
}
Is there any way to create a regex expression that would include all hash keys? They are numeric and must be sorted. The keys refer to the line index in IN_LIG, whose content is something like this:
8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500
The key is to be found in column 0 (8). I have added ^ and \s+ to make sure it refers only to the first column.
My second problem is that the input files are not always identical and may contain whitespace before the index, so when I create an array from $_ I get column0 = " " instead of column0=8
I don't understand why this "empty column" is not eliminated by the split command, and I'm having some trouble removing it. This is what I have done:
@info = split (/[\s]+/,$_);
if ($info[0] eq " ") {splice (@info, 0,1);} # also tried $info[0] =~ m/\s+/
and when I print the array @info I get this:
Array:
Array: 8
Array: C5
Array: 9.9153
Array: 2.3814
.....
How can I get rid of the empty column?
Many thanks for your help
Merche
There is a special form of split where it will remove both leading and trailing spaces. It looks like this, try it:
my $line = ' begins with spaces and ends with spaces ';
my @tokens = split ' ', $line;
# This prints |begins:with:spaces:and:ends:with:spaces|
print "|", join(':', @tokens), "|\n";
See the documentation for split at http://p3rl.org/split (or with perldoc -f split)
Also, the first part of your program might be simpler as:
my @atoms = sort {$b cmp $a} keys %$atome;
my $natoms = @atoms;
But, what is your ultimate goal with the atoms? If you simply want to verify that the atoms you're given are indeed in the file, then you don't need to sort them, nor to count them:
my @atoms = keys %$atome;
while (<IN_LIG>){
# The atom ID on this line
my ($atom_id) = split ' ';
# Is this atom ID in the array of atom IDs that we are looking for
if (grep { $_ eq $atom_id } @atoms) {
# This line of the file has an atom that was in the array: $atom_id
}
}
Let's warm up by refining and correcting some of your code:
# If these are all numbers, do a numerical sort: <=> not cmp
my @atoms = ( sort { $b <=> $a } keys %{$atome} );
my $natoms = scalar @atoms;
No need to loop through the keys, you can insert them into the array right away. You can also sort them right away, and if they are numbers, the sort must be numerical, otherwise you will get a sort like: 1, 11, 111, 2, 22, 222, ...
$natoms can be assigned directly from the number of elements in @atoms.
while(<IN_LIG>) {
last unless $natoms;
my $key = (split)[0]; # split splits on whitespace and $_ by default
$natoms-- if ($key == $atoms[$natoms - 1]);
}
I'm not quite sure what you are doing here, and if it is the best way, but this code should work, whereas your regex would not. Inside a regex, [] are meta characters. Split by default splits $_ on whitespace, so you need not be explicit about that. This split will also definitely remove all whitespace. Your empty field is most likely an empty string, '', and not a space ' '.
The best way to compare two numbers is not by a regex, but with the equality operator ==.
Your empty field should be gone by splitting on whitespace. The default for split is split ' '.
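The difference between the two forms of split can be checked on a line with leading whitespace. This sketch reuses the sample line from the question with three spaces added in front; the array names are illustrative.

```perl
use strict;
use warnings;

my $line = "   8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500\n";

# A regex pattern: the leading whitespace is a separator, so an empty
# string comes before it as the first field.
my @by_regex = split /\s+/, $line;

# The special ' ' pattern: leading whitespace is skipped first.
my @by_space = split ' ', $line;

print "regex: first field is '$by_regex[0]'\n";
print "space: first field is '$by_space[0]'\n";
```

The first field from the regex form is the empty string '', not a space, which is why comparing it with " " never matched.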
Also, if you are not already doing it, you should use:
use strict;
use warnings;
It will save you a lot of headaches.
for your second question you could use this line:
@info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
in order to capture 9 items from each line (assuming they do not contain whitespace).
The first question I do not understand.
Update: I would read all the lines of the file and store them in a hash with $info[0] as the key and [@info[1..8]] as the value. Then you can look up the entries by your index.
my %details;
while (<IN_LIG>) {
@info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
$details{ $info[0] } = [ @info[1..$#info] ];
}
Later you can lookup details for the indices you are interested in and process as needed. This assumes the index is unique (has the property of keys).
Thanks for all your replies. I tried the split form with ' ' and it saved me several lines of code. Thanks!
As for the regex, I found something that could combine all the keys into one expression with join and quotemeta, but I couldn't make it work. Nevertheless, I found an alternative that works, though I liked the join/quotemeta solution better.
The atom indexes are obtained from a text file according to some energy threshold. Later, in the IN_LIG loop, I need to access the molecule file to obtain more information about the atoms selected, thus I use the atom "index" in the molecule to identify which lines of the file I have to read and process. This is a subroutine to which I send a hash with the atom index and some other information.
I tried this for the regex:
my $strings = join "|" map quotemeta,
sort { $hash->{$b} <=> $hash->{$a}} keys %($hash);
but I did something wrong cos it wouldn't take all keys
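For comparison, here is one way the join/quotemeta idea could be written so that it compiles. The attempt above lacks a comma after "|", and %($hash) would need to be %{$hash} or %$hash. This is only a sketch: the hash contents and the second input line are made up for illustration, and the keys are sorted numerically by key rather than by value.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash of atom indexes.
my $hash = { 8 => 1, 12 => 1, 103 => 1 };

# Build one alternation from all keys, sorted numerically.
my $alternation = join '|', map { quotemeta } sort { $a <=> $b } keys %{$hash};

# Anchor it to the first column: optional leading whitespace,
# one of the keys, then at least one whitespace character.
my $re = qr/^\s*(?:$alternation)\s+/;

my @lines = (
    "  8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500",
    " 18 C6 1.0000 2.0000 3.0000 C.ar 1 MLK 0.0000",   # made-up line
);
my @hits = grep { $_ =~ $re } @lines;
print "$_\n" for @hits;
```

The non-capturing group keeps the anchors applied to every alternative, so an index such as 18 does not match the key 8.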