I have a file with multiple blocks of test. FOR EACH block of test, I want to be able to extract what is in the square bracket, the line containing the FIRST instance of the word "area", and what is on the right of the square bracket. Everything will be a string. Essentially what I want to do is store each string into a variable in a hash so i can print it into a 3 column csv file.
Here's a sample of what the file looks like:
Student-[K-6] Exceptional in Math
/home/area/kinder/mathadvance.txt, 12
Students in grade K-12 shown to be exceptional in math.
Placed into special after school program.
See /home/area/overall/performance.txt, 200
Student-[Junior] Weak Performance
Students with overall weak performance.
Summer program services offered as shown in
"/home/area/services/summer.txt", 212
Student-[K-6] Physical Excerise Time Slots
/home/area/pe/schedule.txt, 303
Assigned time slots for PE based on student's grade level. Make reference to
/home/area/overall/classtimes.txt, 90
I want to to have a final csv file that looks like:
Grade,Topic,Path
K-6, Exceptional in Math, /home/area/kinder/mathadvance.txt, 12
K-6, Physical Exercise Time Slots, /home/area/pe/schedule.txt, 303
Junior, Weak Performance, "/home/area/services/summer.txt", 212
Since it's a csv file, I know it will also separate at the line number when exporting into excel but I'm fine with that.
I started off by putting the grade type into an array because I want to be able to add more strings to it for different grade levels.
My program looks like this so far:
#!/usr/bin/perl
use strict;
use warnings;
my #grades = ("K-6", "Junior", "Community-College", "PreK");
I was thinking that I will need to do some sort of system sed command to grab what is in the brackets and store it into a variable. Then I will grab everything to the right of the bracket on the line and store it into a variable. And then I will grep for a line containing "area" to get the path and I will store it as a string into a variable, put these in a hash, and then print into csv. I'm not sure if I'm thinking about this the right way. Also, I have NO IDEA how to do this for each BLOCK of text in the file. I need it by block because each block has its own corresponding grades, topics, and paths.
perl -000 -ne '($grade, $topic) = /\[(.*)\] (.*)/;
($path) = m{(.*/area/.*)};
print "$grade, $topic, $path\n"' -- file.txt
-000 turns on paragraph mode, -n won't read line by line, but paragraph by paragraph
/\[(.*)\] (.*)/ matches the square brackets and whatever follows them up to a newline. The inside of the square brackets and the following text are captured using the parentheses.
m{(.*/area/.*)} captures the line containing "area". It uses the m{} syntax instead of // so we don't have to backslash the slashes (avoiding so called "leaning toothpick syndrome")
I'm looking to create a macro in P6 which converts its argument to a string.
Here's my macro:
macro tfilter($expr) {
quasi {
my $str = Q ({{{$expr}}});
filter-sub $str;
};
}
And here is how I call it:
my #some = tfilter(age < 50);
However, when I run the program, I obtain the error:
Unable to parse expression in quote words; couldn't find final '>'
How do I fix this?
Your use case, converting some code to a string via a macro, is very reasonable. There isn't an established API for this yet (even in my head), although I have come across and thought about the same use case. It would be nice in cases such as:
assert a ** 2 + b ** 2 == c ** 2;
This assert statement macro could evaluate its expression, and if it fails, it could print it out. Printing it out requires stringifying it. (In fact, in this case, having file-and-line information would be a nice touch also.)
(Edit: 007 is a language laboratory to flesh out macros in Perl 6.)
Right now in 007 if you stringify a Q object (an AST), you get a condensed object representation of the AST itself, not the code it represents:
$ bin/007 -e='say(~quasi { 2 + 2 })'
Q::Infix::Addition {
identifier: Q::Identifier "infix:+",
lhs: Q::Literal::Int 2,
rhs: Q::Literal::Int 2
}
This is potentially more meaningful and immediate than outputting source code. Consider also the fact that it's possible to build ASTs that were never source code in the first place. (And people are expected to do this. And to mix such "synthetic Qtrees" with natural ones from programs.)
So maybe what we're looking at is a property on Q nodes called .source or something. Then we'd be able to do this:
$ bin/007 -e='say((quasi { 2 + 2 }).source)'
2 + 2
(Note: doesn't work yet.)
It's an interesting question what .source ought to output for synthetic Qtrees. Should it throw an exception? Or just output <black box source>? Or do a best-effort attempt to turn itself into stringified source?
Coming back to your original code, this line fascinates me:
my $str = Q ({{{$expr}}});
It's actually a really cogent attempt to express what you want to do (turn an AST into its string representation). But I doubt it'll ever work as-is. In the end, it's still kind of based on a source-code-as-strings kind of thinking à la C. The fundamental issue with it is that the place where you put your {{{$expr}}} (inside of a string quote environment) is not a place where an expression AST is able to go. From an AST node type perspective, it doesn't typecheck because expressions are not a subtype of quote environments.
Hope that helps!
(PS: Taking a step back, I think you're doing yourself a disservice by making filter-sub accept a string argument. What will you do with the string inside of this function? Parse it for information? In that case you'd be better off analyzing the AST, not the string.)
(PPS: Moritz++ on #perl6 points out that there's an unrelated syntax error in age < 50 that needs to be addressed. Perl 6 is picky about things being defined before they are used; macros do not change this equation much. Therefore, the Perl 6 parser is going to assume that age is a function you haven't declared yet. Then it's going to consider the < an opening quote character. Eventually it'll be disappointed that there's no >. Again, macros don't rescue you from needing to declare your variables up-front. (Though see #159 for further discussion.))
So, I happened to notice that last.fm is hiring in my area, and since I've known a few people who worked there, I though of applying.
But I thought I'd better take a look at the current staff first.
Everyone on that page has a cute/clever/dumb strapline, like "Is life not a thousand times too short for us to bore ourselves?". In fact, it was quite amusing, until I got to this:
perl -e'print+pack+q,c*,,map$.+=$_,74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34'
Which I couldn't resist pasting into my terminal (kind of a stupid thing to do, maybe), but it printed:
Just another Last.fm hacker,
I thought it would be relatively easy to figure out how that Perl one-liner works. But I couldn't really make sense of the documentation, and I don't know Perl, so I wasn't even sure I was reading the relevant documentation.
So I tried modifying the numbers, which got me nowhere. So I decided it was genuinely interesting and worth figuring out.
So, 'how does it work' being a bit vague, my question is mainly,
What are those numbers? Why are there negative numbers and positive numbers, and does the negativity or positivity matter?
What does the combination of operators +=$_ do?
What's pack+q,c*,, doing?
This is a variant on “Just another Perl hacker”, a Perl meme. As JAPHs go, this one is relatively tame.
The first thing you need to do is figure out how to parse the perl program. It lacks parentheses around function calls and uses the + and quote-like operators in interesting ways. The original program is this:
print+pack+q,c*,,map$.+=$_,74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34
pack is a function, whereas print and map are list operators. Either way, a function or non-nullary operator name immediately followed by a plus sign can't be using + as a binary operator, so both + signs at the beginning are unary operators. This oddity is described in the manual.
If we add parentheses, use the block syntax for map, and add a bit of whitespace, we get:
print(+pack(+q,c*,,
map{$.+=$_} (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21,
18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34)))
The next tricky bit is that q here is the q quote-like operator. It's more commonly written with single quotes:
print(+pack(+'c*',
map{$.+=$_} (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21,
18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34)))
Remember that the unary plus is a no-op (apart from forcing a scalar context), so things should now be looking more familiar. This is a call to the pack function, with a format of c*, meaning “any number of characters, specified by their number in the current character set”. An alternate way to write this is
print(join("", map {chr($.+=$_)} (74, …, -34)))
The map function applies the supplied block to the elements of the argument list in order. For each element, $_ is set to the element value, and the result of the map call is the list of values returned by executing the block on the successive elements. A longer way to write this program would be
#list_accumulator = ();
for $n in (74, …, -34) {
$. += $n;
push #list_accumulator, chr($.)
}
print(join("", #list_accumulator))
The $. variable contains a running total of the numbers. The numbers are chosen so that the running total is the ASCII codes of the characters the author wants to print: 74=J, 74+43=117=u, 74+43-2=115=s, etc. They are negative or positive depending on whether each character is before or after the previous one in ASCII order.
For your next task, explain this JAPH (produced by EyesDrop).
''=~('(?{'.('-)#.)#_*([]#!#/)(#)#-#),#(##+#)'
^'][)#]`}`]()`#.#]#%[`}%[#`#!##%[').',"})')
Don't use any of this in production code.
The basic idea behind this is quite simple. You have an array containing the ASCII values of the characters. To make things a little bit more complicated you don't use absolute values, but relative ones except for the first one. So the idea is to add the specific value to the previous one, for example:
74 -> J
74 + 43 -> u
74 + 42 + (-2 ) -> s
Even though $. is a special variable in Perl it does not mean anything special in this case. It is just used to save the previous value and add the current element:
map($.+=$_, ARRAY)
Basically it means add the current list element ($_) to the variable $.. This will return a new array with the correct ASCII values for the new sentence.
The q function in Perl is used for single quoted, literal strings. E.g. you can use something like
q/Literal $1 String/
q!Another literal String!
q,Third literal string,
This means that pack+q,c*,, is basically pack 'c*', ARRAY. The c* modifier in pack interprets the value as characters. For example, it will use the value and interpret it as a character.
It basically boils down to this:
#!/usr/bin/perl
use strict;
use warnings;
my $prev_value = 0;
my #relative = (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34);
my #absolute = map($prev_value += $_, #relative);
print pack("c*", #absolute);
So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now, but I have to read each sequence, compute relative amounts of 'G' and 'C' within each sequence, and then I'm to write it to a TAB-delimited file the names of the genes, and their respective 'G' and 'C' content.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' is.. but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimite" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character. As quick google or stackoverflow search would tell you..
Here is an approach using 'awk' utility which can be used from the command line. The following program is executed by specifying its path and using awk -f <path> <sequence file>
#NR>1 means only look at lines above 1 because you said the sequence starts on line 2
NR>1{
#this for-loop goes through all bases in the line and then performs operations below:
for (i=1;i<=length;i++)
#for each position encountered, the variable "total" is increased by 1 for total bases
total++
}
{
for (i=1;i<=length;i++)
#if the "substring" i.e. position in a line == c or g upper or lower (some bases are
#lowercase in some fasta files), it will carry out the following instructions:
if(substr($0,i,1)=="c" || substr($0,i,1)=="C")
#this increments the c count by one for every c or C encountered, the next if statement does
#the same thing for g and G:
c++; else
if(substr($0,i,1)=="g" || substr($0,i,1)=="G")
g++
}
END{
#this "END-block" prints the gene name and C, G content in percentage, separated by tabs
print "Gene name\tG content:\t"(100*g/total)"%\tC content:\t"(100*c/total)"%"
}
I've spent entirely way too long trying to figure this out. I'm using XML: RSS and Perl to read / parse an Ebay RSS feed. Within the <item></item> area, I see these entries:
<rx:BuyItNowPrice xmlns:rx="urn:ebay:apis:eBLBaseComponents">1395</rx:BuyItNowPrice>
<rx:CurrentPrice xmlns:rx="urn:ebay:apis:eBLBaseComponents">1255</rx:CurrentPrice>
However, I can't figure out how to grab the details during the loop. I wrote a regex to grab them:
#current_price = $item =~ m/\<rx\:CurrentPrice.*\>(\d+)\<\/rx\:CurrentPrice\>/g;
Which works if you place the above 'CurrentPrice' entry into a standalone string, but not while the script is reading through the RSS feed.
I can grab most of the information I want out of the item->description area (# bids, auction end time, BIN price, thumbnail image, etc.), but it would be nicer if I could grab the info from the feed without me having to deal with grabbing all that information manually.
How to grab custom fields from an RSS feed (short of writing regexes to parse the entire feed w/o a module)?
Here's the code I'm working with:
$my_limit = 0;
use LWP::Simple;
use XML::RSS;
$rss = XML::RSS->new();
$data = get( $mylink );
$rss->parse( $data );
$channel = $rss->{channel};
$NumItems = 0;
foreach $item (#{$rss->{'items'}}) {
if($NumItems > $my_limit){
last;
}
#current_price = $item =~ m/\<rx\:CurrentPrice.*\>(\d+)\<\/rx\:CurrentPrice\>/g;
print "$current_price[0]";
}
If you have the rss/xml document and want specific data you could use XPATH:
Perl CPAN XPATH
XPath Introduction
What is the way in which "it doesn't work" from an RSS feed? Do you mean no matches when there should be matches? Or one match where there should be several matches?
One thing that jumps out at me about your regular expression is that you use .*, which can sometimes be greedier than you want. That is, if $item contained the expression
<rx:BuyItNowPrice xmlns:rx="urn:...nts">1395</rx:BuyItNowPrice>
<rx:CurrentPrice xmlns:rx="urn:...nts">1255</rx:CurrentPrice>
<rx:BuyItNowPrice xmlns:rx="urn:...nts">1395</rx:BuyItNowPrice>
<rx:SomeMoreStuff xmlns:rx="urn:...nts">zzz</rx:BuyItNowPrice>
<rx:CurrentPrice xmlns:rx="urn:...nts">1255</rx:CurrentPrice>
then the first part of your regular expression (\<rx\:CurrentPrice.*\>) will wind up matching everything on lines 2, 3, and 4, plus the first part of line 5 (up to the >). Instead, you might want to use the regular expression1
m/\<rx:CurrentPrice[^>]*>(\d+)\<\/rx:CurrentPrice\>/
which will only match up to the closing </rx:CurrentPrice> tag after a single instance of an opening <rx:CurrentPrice> tag.
1 The other obvious answer is that you really don't want to use a regular expression at all, that regular expressions are inferior tools for parsing XML compared to customized parsing modules, and that all the special cases you will have to deal with using regular expressions will eventually render you unconscious from having repeatedly beaten your head against your desk. See Salgar's answer, for example.