parsing rss feed, description field - perl

I use SimpleXml in Perl to extract the data in this tag:
<description><strong>CUSIP:</strong> 912828UC2<br /><strong>Term and Type:</strong> 3-Year Note<br /><strong>Offering Amount:</strong> $32,000,000,000<br /><strong>Auction Date:</strong> 12/11/2012<br /><strong>Issue Date:</strong> 12/17/2012<br /><strong>Maturity Date:</strong> 12/15/2015<br /><a href="http://www.treasurydirect.gov/instit/annceresult/press/preanre/2012/A_20121206_6.pdf">PDF version of the announcement</a><br /><a href="http://www.treasurydirect.gov/xml/A_20121206_6.xml">XML version of the announcement</a><br /></description>
I now have trouble extracting the individual fields. For example, for Auction Date I use:
if ($desc=~m/Auction\sDate:<\/strong>\s+(\d\d\/\d\d\/\d\d\d\d)<br/)
{
}
but I feel it's not robust enough. What is the standard way to extract fields?

As Dan1111 points out in his answer, if you're already using an XML parser (XML::Simple?), you should stick with a parser for the data inside your description tags. It isn't a good idea to pick data out of an XML/HTML feed with regular expressions; use a parser built for that purpose.
Because of the formatting of the data in your post, I am assuming that you don't have valid HTML that a parser can help you with. In this case, there's no 'standard' way to extract fields, but here's the way I'd approach this problem:
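print "$desc\n";
# Split the description into one record per line break
my @parts = split(/;br /, $desc);
my %dates;
foreach my $part (@parts) {
    # Capture the label ("Auction Date", ...) and the date that follows it
    if ($part =~ m/(\w+\s+Date).+(\d{2}\/\d{2}\/\d{4})/) {
        $dates{$1} = $2;
    }
}
foreach my $label (keys %dates) {
    printf "%-16s%12s\n", "${label}:", $dates{$label};
}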
Looking at the original string, I can see that there are 3 dates, and several other records, so the first thing to do is to split them up. I found that each record in the string is delimited by the characters ';br ', so I used that for the split:
my @parts = split(/;br /, $desc);
After doing that, I have an array that contains each of the different data parts from your string. Now I just need to parse each part. Because your question is about the Auction Date value, I wrote a regular expression that captures the date. Anticipating that the other dates might be valuable as well, I modified the regex so that it also captures the label (Auction, Issue, Maturity), and I stored each label-date pair in a hash (%dates):
foreach my $part (@parts) {
    if ($part =~ m/(\w+\s+Date).+(\d{2}\/\d{2}\/\d{4})/) {
        $dates{$1} = $2;
    }
}
Finally, I just printed out my hash:
foreach my $label (keys %dates) {
    printf "%-16s%12s\n", "${label}:", $dates{$label};
}
Make sense?

What is more robust depends on your expected input and what you are looking for. However, here is something that you might find helpful.
I used XML::Twig for this. XML::Simple (which I assume is what you are using now) is not recommended for new development due to various quirks.
use Modern::Perl;
use XML::Twig;
my $twig = XML::Twig->new();
$twig->parse(<DATA>);
my %params;
my $key;
for my $child (map {$_->text} $twig->root->children)
{
    if ($child =~ /(.*):/)
    {
        $key = $1;
    }
    else
    {
        $params{$key} = $child if (defined $key);
        undef $key;
    }
}
say "$_ is $params{$_}" foreach (keys %params);
__DATA__
<description><strong>CUSIP:</strong> 912828UC2<br /><strong>Term and Type:</strong> 3-Year Note<br /><strong>Offering Amount:</strong> $32,000,000,000<br /><strong>Auction Date:</strong> 12/11/2012<br /><strong>Issue Date:</strong> 12/17/2012<br /><strong>Maturity Date:</strong> 12/15/2015<br />PDF version of the announcement<br />XML version of the announcement<br /></description>
This takes any element that ends with a colon as a key, then assumes the next element in the tree is the value. Obviously that makes some assumptions of what kind of input you will get, but it is pretty robust as long as all of the "key" elements will be enclosed in tags.
Another approach would be stripping out all of the tags first, then searching for key-value pairs in just the text. You can do this with XML::Twig as well; simply calling $twig->root->text will get the text from the entire element. However, in this approach it would be tricky to determine where one key ends and another value begins.
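To make the "strip the tags, then look for key-value pairs" idea concrete, here is a minimal sketch in Python rather than Perl (the function name description_fields and the shortened sample string are mine, not from the feed). It assumes the fragment is well-formed XML and that each record ends with a <br />, which is what keeps one value from running into the next key once the tags are gone:

```python
import xml.etree.ElementTree as ET


def description_fields(fragment):
    """Return {label: value} for a '<strong>Label:</strong> value<br />' fragment."""
    # Wrap in a dummy root because the fragment has several top-level nodes.
    root = ET.fromstring(f"<root>{fragment}</root>")
    # Turn every <br/> into a newline so records stay separated as plain text.
    for br in root.iter("br"):
        br.text = "\n"
    text = "".join(root.itertext())
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; lines without "label: value" are skipped.
        label, sep, value = line.partition(":")
        if sep and value.strip():
            fields[label.strip()] = value.strip()
    return fields


sample = (
    "<strong>CUSIP:</strong> 912828UC2<br />"
    "<strong>Auction Date:</strong> 12/11/2012<br />"
    "<strong>Offering Amount:</strong> $32,000,000,000<br />"
    "PDF version of the announcement<br />"
)
fields = description_fields(sample)
print(fields["Auction Date"])  # 12/11/2012
```

The trailing "PDF version of the announcement" line has no colon, so it is ignored rather than mistaken for a key.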

The <description> elements in the RSS feed you show contain valid XHTML fragments as PCDATA. This solution extracts those elements and decodes them, and parses them in turn to access the text of the <strong> elements and their corresponding values.
Note that the XHTML contains multiple top-level elements, and as an XML document is allowed only a single root element, I have wrapped it in a dummy <root> element in $twig->parse("<root>$desc</root>").
Hopefully you will be able to extrapolate from this to access the data you require.
use strict;
use warnings;
use LWP::Simple;
use XML::Twig;
# Fetch and parse the RSS feed itself
my $xml = get 'http://www.treasurydirect.gov/RI/TreasuryOfferingAnnouncements.rss';
defined $xml or die "Could not fetch the feed\n";
my $twig = XML::Twig->new;
$twig->parse($xml);
for my $desc ($twig->get_xpath('/rss/channel/item/description')) {
    # Decode the description text, then parse it as XHTML inside a dummy root
    my $xhtml = $desc->text;
    my $inner = XML::Twig->new;
    $inner->parse("<root>$xhtml</root>");
    for my $strong ($inner->get_xpath('/root/strong')) {
        # The value is the text node that follows each <strong> label
        my ($key, $val) = ($strong->trimmed_text, $strong->next_sibling->trimmed_text);
        $key =~ s/:$//;
        print "$key -> $val\n";
    }
    print "\n";
}
output
CUSIP -> 912810QY7
Term and Type -> 29-Year 11-Month Bond
Offering Amount -> $13,000,000,000
Auction Date -> 12/13/2012
Issue Date -> 12/17/2012
Maturity Date -> 11/15/2042
CUSIP -> 912796DT3
Term and Type -> 3-Day Bill
Offering Amount -> $10,000,000,000
Auction Date -> 12/13/2012
Issue Date -> 12/14/2012
Maturity Date -> 12/17/2012
CUSIP -> 912828UE8
Term and Type -> 5-Year Note
Offering Amount -> $35,000,000,000
Auction Date -> 12/18/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2017
CUSIP -> 912828UD0
Term and Type -> 2-Year Note
Offering Amount -> $35,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2014
CUSIP -> 912796AM1
Term and Type -> 26-Week Bill
Offering Amount -> $28,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/20/2012
Maturity Date -> 06/20/2013
CUSIP -> 912828UF5
Term and Type -> 7-Year Note
Offering Amount -> $29,000,000,000
Auction Date -> 12/19/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2019
CUSIP -> 912828SQ4
Term and Type -> 4-Year 4-Month TIPS
Offering Amount -> $14,000,000,000
Auction Date -> 12/20/2012
Issue Date -> 12/31/2012
Maturity Date -> 04/15/2017
CUSIP -> 9127957M7
Term and Type -> 13-Week Bill
Offering Amount -> $32,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/20/2012
Maturity Date -> 03/21/2013
CUSIP -> 912828TY6
Term and Type -> 9-Year 11-Month Note
Offering Amount -> $21,000,000,000
Auction Date -> 12/12/2012
Issue Date -> 12/17/2012
Maturity Date -> 11/15/2022
CUSIP -> 912828UC2
Term and Type -> 3-Year Note
Offering Amount -> $32,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/17/2012
Maturity Date -> 12/15/2015
CUSIP -> 912796AK5
Term and Type -> 52-Week Bill
Offering Amount -> $25,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/13/2012
Maturity Date -> 12/12/2013
CUSIP -> 9127955V9
Term and Type -> 4-Week Bill
Offering Amount -> $40,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/13/2012
Maturity Date -> 01/10/2013
CUSIP -> 912796AL3
Term and Type -> 26-Week Bill
Offering Amount -> $28,000,000,000
Auction Date -> 12/10/2012
Issue Date -> 12/13/2012
Maturity Date -> 06/13/2013
CUSIP -> 9127957L9
Term and Type -> 13-Week Bill
Offering Amount -> $32,000,000,000
Auction Date -> 12/10/2012
Issue Date -> 12/13/2012
Maturity Date -> 03/14/2013
CUSIP -> 912796DT3
Term and Type -> 11-Day Bill
Offering Amount -> $25,000,000,000
Auction Date -> 12/04/2012
Issue Date -> 12/06/2012
Maturity Date -> 12/17/2012
CUSIP -> 9127956Z9
Term and Type -> 4-Week Bill
Offering Amount -> $40,000,000,000
Auction Date -> 12/04/2012
Issue Date -> 12/06/2012
Maturity Date -> 01/03/2013


How to convert datetime to day name and month name in erlang?

How to get following DateTime in Erlang?
Fri Jul 13 19:12:59 IST 2018

TL;DR
Use the exceptional qdate for all your date/time formatting, converting, and timezone handling. Look at the Demonstration section in particular to get the gist, and adjust to your needs.
Erlang's date handling, in my opinion, is convoluted and lacking in the major functionality that's needed for proper date handling. It's getting better, but not quite there. Moreover, timezone handling is primitive at best.
qdate's functions will take (almost) any date format and convert to any date format, while using either an implicit timezone (setting the timezone on a per-process basis), or by setting a specific timezone.
In any case, if you go custom, you will end up with something similar to this:
1> {{Year, Month, Day}, {Hour, Minute, Second}} = calendar:now_to_datetime(erlang:timestamp()).
{{2018,7,13},{14,39,45}}
2> lists:flatten(io_lib:format("~4..0w-~2..0w-~2..0wT~2..0w:~2..0w:~2..0w",[Year,Month,Day,Hour,Minute,Second])).
"2018-07-13T14:39:45"
...not good ;)
Those are my two cents. Cheers!

I found the solution.
A = calendar:universal_time().
qdate:to_string(<<"D M j G:i:s T Y">> , <<"IST">>, A).
See http://uk3.php.net/manual/en/function.date.php for the format characters you can use for different formatting. I would advise using this only if you have to support a legacy system, because the function call seems expensive.

date_time() ->
    {{Year, Month, Day}, {Hour, Minute, Second}} = calendar:local_time(),
    DayOfWeek = calendar:day_of_the_week({Year, Month, Day}),
    DayName = day_check(DayOfWeek),
    MonthName = month_check(Month),
    lists:flatten(io_lib:format("~3..0s ~3..0s ~2..0w ~2..0w:~2..0w:~2..0w IST ~4..0w", [DayName, MonthName, Day, Hour, Minute, Second, Year])).
day_check(1) -> 'Mon';
day_check(2) -> 'Tue';
day_check(3) -> 'Wed';
day_check(4) -> 'Thu';
day_check(5) -> 'Fri';
day_check(6) -> 'Sat';
day_check(7) -> 'Sun'.
month_check(1) -> 'Jan';
month_check(2) -> 'Feb';
month_check(3) -> 'Mar';
month_check(4) -> 'Apr';
month_check(5) -> 'May';
month_check(6) -> 'Jun';
month_check(7) -> 'Jul';
month_check(8) -> 'Aug';
month_check(9) -> 'Sep';
month_check(10) -> 'Oct';
month_check(11) -> 'Nov';
month_check(12) -> 'Dec'.

Dates between today and another date

I'm racking my brains trying to write a function that gives me the number of days between today's date and a given date.
A possible today function:
today = fmap (formatTime defaultTimeLocale "%Y-%m-%d") getCurrentTime
I thought of using diffDays, but I can't make it work with a date of type Day.
Any ideas?

Your formatTime version returns a string, but you want a Day (which looks like your string when you inspect it, but is a different type entirely). Here's one way to write a today function, using utctDay to get a Day out of a UTCTime:
import Data.Time.Calendar
import Data.Time.Clock
today :: IO Day
today = fmap utctDay getCurrentTime
And here's a days-from-today function (which I gave the shorter name daysAway) that uses it:
daysAway :: Day -> IO Integer
daysAway day = fmap (diffDays day) today
If you're always specifying the target as a calendar date, you can do that easily enough:
daysToDate :: Integer -> Int -> Int -> IO Integer
daysToDate year month day = daysAway $ fromGregorian year month day
Given a shorthand function for a commonly-needed relative day:
tomorrow :: IO Day
tomorrow = fmap (addDays 1) today
We can demonstrate the correctness of Annie's Thesis:
ghci> tomorrow >>= daysAway
1

How to print only 2 result in reactivemongo

In reactivemongo my query looks like this:
val result = collName.find(BSONDocument("loc" -> BSONDocument("$near" ->
  BSONArray(51, -114)))).cursor[BSONDocument].enumerate()
result.apply(Iteratee.foreach { doc => println(BSONDocument.pretty(doc)) })
I want to print only the top 2 results, so I pass the maxDocs value to enumerate, and the query becomes:
val result = collName.find(BSONDocument("loc" -> BSONDocument("$near" ->
  BSONArray(51, -114)))).cursor[BSONDocument].enumerate(2)
result.apply(Iteratee.foreach { doc => println(BSONDocument.pretty(doc)) })
But it's not working; it prints every document the query returns.
How can I print only the top 2 results?

I basically stumbled over the same thing.
It turns out that the ReactiveMongo driver transfers the result documents in batches and takes the maxDocs setting into account only when it is about to load the next batch of documents.
You can configure the batch size to be equal to the maxDocs limit or to a proper divisor thereof:
val result = collName.
find(BSONDocument("loc" -> BSONDocument("$near" -> BSONArray(51,-114)))).
options(QueryOpts(batchSizeN = 2)).
cursor[BSONDocument].enumerate(2)
Or, alternatively, let MongoDB choose the batch size and limit the documents you process using an Enumeratee:
val result = collName.
find(BSONDocument("loc" -> BSONDocument("$near" -> BSONArray(51,-114)))).
cursor[BSONDocument].
enumerate(2) &> Enumeratee.take(2)
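To see why a limit that is only checked between batches overshoots, here is a toy model in Python. It is not ReactiveMongo code: batched_cursor and its parameters are made up for illustration, and it only mimics the behaviour described above (whole batches are delivered, and the limit is consulted before fetching the next one).

```python
from itertools import islice


def batched_cursor(docs, batch_size, max_docs):
    """Toy cursor: yields whole batches, checking max_docs only between batches."""
    sent = 0
    for start in range(0, len(docs), batch_size):
        if sent >= max_docs:
            break  # the limit is noticed only here, before the next batch
        batch = docs[start:start + batch_size]
        yield from batch
        sent += len(batch)


docs = [f"doc{i}" for i in range(10)]

overshoot = list(batched_cursor(docs, batch_size=5, max_docs=2))
aligned = list(batched_cursor(docs, batch_size=2, max_docs=2))
taken = list(islice(batched_cursor(docs, batch_size=5, max_docs=2), 2))

print(len(overshoot))  # 5
print(aligned)         # ['doc0', 'doc1']
print(taken)           # ['doc0', 'doc1']
```

The second call corresponds to making the batch size match the limit, and the third to cutting the stream off on the consumer side, which is the role Enumeratee.take(2) plays in the Scala version.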

Zend_Date : expanding a 2 digit year value to 4 digit year

I have dates of birth in the format 'MM/dd/yy' for people born in the 1900s. I'm using Zend_Date to parse and convert the string value:
$date = new Zend_Date();
$logger->info(sprintf('Convert DOB %s -> %s',$dateOfBirth,$date->toString('yyyy-M-dd')));
I get
2010-06-24T16:55:50+00:00 INFO (6): DOB 9/13/57
2010-06-24T16:55:50+00:00 INFO (6): Convert DOB : 9/13/57 -> 2057-9-13
I expected
2010-06-24T16:55:50+00:00 INFO (6): Convert 9/13/57 -> 1957-9-13
What am I missing? I don't think this is related to the real year 'yyyy' / ISO year 'YYYY' handling in Zend_Date.
My current horrible hack:
$formattedDate = $date->toString('dd/M').'/19'.$date->toString('YY');

The short (1- or 2-digit) version of the year is always interpreted as being in the current century, so use:
$dob = '9/13/57';
$date = new Zend_Date($dob, 'M/d/yy');
echo $date->subYear(100)->toString('yyyy-MM-d');

Apparently, it's a bit more complicated. According to this site, 2-digit years greater than or equal to 70 become 1970-1999 whereas those less than 70 become 2000-2069.
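The pivot rule described here is easy to state in code. This is a language-neutral sketch in Python, not Zend_Date's API; expand_two_digit_year and its pivot parameter are names I made up. With the default pivot of 70 it follows the rule quoted above, and with a pivot of 0 it puts every value in the 1900s, which is what the dates of birth in the question need.

```python
def expand_two_digit_year(yy, pivot=70):
    """Expand a two-digit year: values >= pivot map to 19xx, the rest to 20xx."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    return 1900 + yy if yy >= pivot else 2000 + yy


print(expand_two_digit_year(57))           # 2057
print(expand_two_digit_year(70))           # 1970
print(expand_two_digit_year(57, pivot=0))  # 1957
```

If the application knows that no date can lie in the future, choosing the pivot from that knowledge is safer than shifting the parsed date by a fixed 100 years afterwards.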

NSNumberFormatter to display custom labels for 10^n (10000 -> 10k)

I need to display numbers on a plot axis. The values can change, but I want to avoid overly long numbers that would ruin the readability of the graph.
My thought was to group every 3 characters and substitute them with K, M and so on (or a custom character).
So:
1 -> 1,
999 -> 999,
1.000 -> 1k,
1.200 -> 1.2k,
1.280 -> 1.2k,
12.800 -> 12.8k,
999.999 -> 999.9k,
1.000.000 -> 1M,
...
Note that I will probably only need to format round numbers (1, 10, 1000, 1500, 2000, 10000, 20000, 30000, 100000, ...).
Is that possible with NSNumberFormatter? I saw that it has a setFormat method, but I don't know how customizable it is.
I'm using NSNumberFormatter because the graph object I use requires one to set the label format, and I want to avoid changing my data just to set the label.

You can use this code:
let formatter = NSNumberFormatter()
formatter.multiplier = 0.001
formatter.positiveFormat = "#,###k"
formatter.zeroSymbol = "0"
return formatter
It helped me to convert the currency values:
2000 -> 2k
10000 -> 10k

No. The closest you can get is Scientific Notation. Have a look here for how to create a format for that. You could obviously quite easily do the k, M etc substitution yourself though.
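For the do-it-yourself substitution, the logic is small. Here is a sketch in Python rather than Objective-C or Swift; abbreviate and the suffix tuple are my own names, and truncating (not rounding) to one decimal is an assumption taken from the examples in the question, where 1.280 becomes 1.2k. In an app the same logic would have to live in whatever hook the graph library offers, for example a formatter subclass.

```python
def abbreviate(n, suffixes=("", "k", "M", "G", "T")):
    """Shorten an integer to at most one decimal plus a suffix, truncating."""
    sign = "-" if n < 0 else ""
    n = abs(n)
    scale, index = 1, 0
    # Move up one suffix for every full factor of 1000.
    while n >= scale * 1000 and index < len(suffixes) - 1:
        scale *= 1000
        index += 1
    whole, rest = divmod(n, scale)
    tenths = rest * 10 // scale  # integer arithmetic, so no float rounding
    text = f"{whole}.{tenths}" if tenths else str(whole)
    return sign + text + suffixes[index]


for value in (1, 999, 1000, 1200, 1280, 12800, 999999, 1000000):
    print(value, "->", abbreviate(value))
```

Passing a different suffixes tuple covers the "custom character" case from the question.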