Data exporting from Financial Times - matlab

I am new at Matlab and I am currently working with financial data exporting from financial times website. I would like to know how can I get, for example, share price forecast information from this page
http://markets.ft.com/research/Markets/Tearsheets/Forecasts?s=DIS:NYQ
High +34.7 % 85.00
Med +15.7 % 73.00
Low -9.6 % 57.00
And save this information as a variables.

Here's a simple solution using urlread and regexpi:
% Create URL string and read in HTML
ftbaseurl = 'http://markets.ft.com/research/Markets/Tearsheets/Forecasts?s=';
ticksym = 'DIS:NYQ';
s = urlread([ftbaseurl ticksym]);
% Create pattern string for regular expression matching
trspan = '<tr><td class="text"><span class="';
tdspan1 = '</span></td><td><span class="\w\w\w color ">'; % \w\w\w matchs pos or neg
matchstr1 = '(?<percent>[\+|\-]*\d+.\d+)'; % percent: match (+or-)(1+ digits).(1+ digits)
tdspan2 = ' %</span></td><td>';
matchstr2 = '(?<price>\d+\.\d\d)</td></tr>'; % price: match (1+ digits) . 2 digits
pat = [trspan 'high">High' tdspan1 matchstr1 tdspan2 matchstr2 '|' ...
trspan 'med">Med' tdspan1 matchstr1 tdspan2 matchstr2 '|' ...
trspan 'low">Low' tdspan1 matchstr1 tdspan2 matchstr2];
% Match patterns in HTML, case insensitive, put results in struct array
forecasts = regexpi(s,pat,'names');
The result is a 1-by-3 struct array where each element has two fields, 'percent' and 'price', that each contain strings extracted by the regular expression parser. For example
>> forecasts(3)
ans = percent: '-10.3'
price: '57.00'
>> str2double(forecasts(3).percent)
-10.3000
I'll leave it to you to convert the strings to numbers (note that financial software usually stores prices in integer cents (or what ever the lowest denomination is) rather than floating point dollars to avoid numerical issues) and to turn this into a general function. Here's some more information on regular expressions in Matlab.
My comment above still stands. This is very inefficient. You're downloading the entire webpage HTML and parsing it in order to find a few small bits of data. This is fine if this doesn't update very often or if you don't need it to be very fast. Also, this scheme is fragile. If the Financial Times updates their website, it may break the code. And if you try downloading their regular webpages very often they may also have means of blocking you.

Related

Efficiently calculate mean value from files

From a Monte-Carlo simulation I have a range of files, say: file_1.mat, file_2.mat,...,file_n.mat, where n is large. Each file contains one or several (maximum 3 if it matters) large 1D arrays in time of interest, say var1, var2, var3.
I am now as always interested in finding the mean value of these variables. My question is now, how do I do this in the most efficient way? The keyword here is efficiency. Below you will find the MWE which is done the standard way, but this is quite time consuming as the files are large and there are many.
I am programming in Matlab, however ideas presented in pseudo code is also very well received.
MWE:(The standard way)
meanVar1 = zeros(1,1e6); %I do not remember the exact size, just use 1e6
meanVar2 = zeros(1,1e6);
meanVar3 = zeros(1,1e6);
for i 1=1:n
load(strcat('file_',int2str(i)),'var1','var2','var3')
meanVar1 = meanVar1 + var1;
meanVar2 = meanVar2 + var2;
meanVar3 = meanVar3 + var3;
end
meanVar1 = meanVar1/n;
meanVar2 = meanVar2/n;
meanVar3 = meanVar3/n;

Using For and While Loops to Determine Who to Hire MATLAB

It's that time of the week where I realize just how little I understand in MATLAB. This week, we have homework on iteration, so using for-loops and while-loops. The problem I am currently experiencing difficulties with is one where I have to write a function that decides who to hire somebody. I'm given a list of names, a list of GPAs and a logical vector that tells me whether or not a student stayed to talk. What I have to output is the names of people to hire and the time they spent chatting with the recruiter.
function[candidates_hire, time_spent] = CFRecruiter(names, GPAs, stays_to_talk)
In order to be hired, a canidate must have a GPA that is higher than 2.5 (not inclusive). In order to be hired, the student must stick around to talk, if they don't talk, they don't get hired. The names are separated by a ', ' and the GPAs is a vector. The time spent talking is determined by:
Time in minutes = (GPA - 2.5) * 4;
My code so far:
function[candidates_hire, time_spent] = CFRecruiter(names, GPAs, stays_to_talk)
candidates = strsplit(names, ', ');
%// My attempt to split up the candidates names.
%// I get a 1x3 cell array though
for i = 1:length(GPAs)
%// This is where I ran into trouble, I need to separate the GPAs
student_GPA = (GPAs(1:length(GPAs)));
%// The length is unknown, but this isn't working out quite yet.
%// Not too sure how to fix that
return
end
time_spent = (student_GPA - 2.5) * 4; %My second output
while stays_to_talk == 1 %// My first attempt at a while-loop!
if student_GPA > 2.5
%// If the student has a high enough GPA and talks, yay for them
student = 'hired';
else
student = 'nothired'; %If not, sadface
return
end
end
hired = 'hired';
%// Here was my attempt to get it to realize how was hired, but I need
%// to concatenate the names that qualify into a string for the end
nothired = 'nothired';
canidates_hire = [hired];
What my main issue is here is figuring out how to let the function know them names(1) has the GPA of GPAs(1). It was recommended that I start a counter, and that I had to make sure my loops kept the names with them. Any suggestions with this problem? Please and thank you :)
Test Codes
[Names, Time] = CFRecruiter('Jack, Rose, Tom', [3.9, 2.3, 3.3],...
[false true true])
=> Name = 'Tom'
Time = 3.2000
[Names, Time] = CFRecruiter('Vatech, George Burdell, Barnes Noble',...
[4.0, 2.5, 3.6], [true true true])
=> Name = 'Vatech, Barnes Noble'
Time = 10.4000
I'm going to do away with for and while loops for this particular problem, mainly because you can solve this problem very elegantly in (I kid you not) three lines of code... well four if you count returning the candidate names. Also, the person who is teaching you MATLAB (absolutely no offense intended) hasn't the faintest idea of what they're talking about. The #1 rule in MATLAB is that if you can vectorize your code, do it. However, there are certain situations where a for loop is very suitable due to the performance enhancements of the JIT (Just-In-Time) accelerator. If you're curious, you can check out this link for more details on what JIT is about. However, I can guarantee that using loops in this case will be slow.
We can decompose your problem into three steps:
Determine who stuck around to talk.
For those who stuck around to talk, check their GPAs to see if they are > 2.5.
For those that have satisfied (1) and (2), determine the total time spent on talking by using the formula in your post for each person and add up the times.
We can use a logical vector to generate a Boolean array that simultaneously checks steps #1 and #2 so that we can index into our GPA array that you are specifying. Once we do this, we simply apply the formula to the filtered GPAs, then sum up the time spent. Therefore, your code is very simply:
function [candidates_hire, time_spent] = CFRecruiter(names, GPAs, stays_to_talk)
%// Pre-processing - split up the names
candidates = strsplit(names, ', ');
%// Steps #1 and #2
filtered_candidates = GPAs > 2.5 & stays_to_talk;
%// Return candidates who are hired
candidates_hire = strjoin(candidates(filtered_candidates), ', ');
%// Step #3
time_spent = sum((GPAs(filtered_candidates) - 2.5) * 4);
You had the right idea to split up the names based on the commas. strsplit splits up a string that has the token you're looking for (which is , in your case) into separate strings inside a cell array. As such, you will get a cell array where each element has the name of the person to be interviewed. Now, I combined steps #1 and #2 into a single step where I have a logical vector calculated that tells you which candidates satisfied the requirements. I then use this to index into our candidates cell array, then use strjoin to join all of the names together in a single string, where each name is separated by , as per your example output.
The final step would be to use the logical vector to index into the GPAs vector, grab those GPAs from those candidates who are successful, then apply the formula to each of these elements and sum them up. With this, here are the results using your sample inputs:
>> [Names, Time] = CFRecruiter('Jack, Rose, Tom', [3.9, 2.3, 3.3],...
[false true true])
Names =
Tom
Time =
3.2000
>> [Names, Time] = CFRecruiter('Vatech, George Burdell, Barnes Noble',...
[4.0, 2.5, 3.6], [true true true])
Names =
Vatech, Barnes Noble
Time =
10.4000
To satisfy the masses...
Now, if you're absolutely hell bent on using for loops, we can replace steps #1 and #2 by using a loop and an if condition, as well as a counter to keep track of the total amount of time spent so far. We will also need an additional cell array to keep track of those names that have passed the requirements. As such:
function [candidates_hire, time_spent] = CFRecruiter(names, GPAs, stays_to_talk)
%// Pre-processing - split up the names
candidates = strsplit(names, ', ');
final_names = [];
time_spent = 0;
for idx = 1 : length(candidates)
%// Steps #1 and #2
if GPAs(idx) > 2.5 && stays_to_talk(idx)
%// Step #3
time_spent = time_spent + (GPAs(idx) - 2.5)*4;
final_names = [final_names candidates(idx)];
end
end
%// Return candidates who are hired
candidates_hire = strjoin(final_names, ', ');
The trick with the above code is that we are keeping an additional cell array around that stores those candidates that have passed. We will then join all of the strings together with a , between each name as we did before. You'll also notice that there is a difference in checking for steps #1 and #2 between the two methods. In particular, there is a & in the first method and a && in the second method. The single & is for arrays and matrices while && is for single values. If you don't know what that symbol is, that is the symbol for logical AND. This means that something is true only if both the left side of the & and the right side of the & are both true. In your case, this means that someone who has a GPA of > 2.5 and stays to talk must both be true if they are to be hired.

How to extract year from a dates cell array in MATLAB?

i have a cell array as below, which are dates. I am wondering how can i extract the year at the last 4 digits? Could anyone teach me how to locate the year in the string? Thank you!
'31.12.2001'
'31.12.2000'
'31.12.2004'
'31.12.2003'
'31.12.2002'
'31.12.2000'
'31.12.1999'
'31.12.1998'
'31.12.1997'
'31.12.2005'
'31.12.2004'
'31.12.2003'
'31.12.2002'
'31.12.2001'
'31.12.2000'
'31.12.1999'
'31.12.1998'
'31.12.2005'
'31.12.2004'
'31.12.2003'
'31.12.2002'
'31.12.2005'
Example cell array:
A = {'31.12.2001'; '31.12.2002'; '31.12.2003'};
Apply some regular expressions:
B = regexp(A, '\d\d\d\d', 'match')
B = [B{:}];
EDIT: I never realized that matlab will "nest" an extra layer of cells until I tested this. I don't like this solution as much now that I know the second line is necessary. Here is an alternative approach that gets you the years in numeric form:
C = datevec(A, 'dd.mm.yyyy');
C = C(:, 1);
SECOND EDIT: Suprisingly, if your cell array has less than 10000 elements, the regexp approach is faster on my machine. But the output of it is another cell array (which takes up much more memory than a numeric matrix). You can use B = cell2mat(B) to get a character array instead, but this brings the two approaches to approximately equal efficiency.
Just to add a fun answer, designed to take the OP to the stranger regions of Matlab:
C = char(C);
y = (D(:,7:end)-'0') * 10.^(3:-1:0).'
which is an order of magnitude faster than anything posted in the other answers :)
Or, to stay a bit closer to home,
y = cellfun(#(x)str2double(x(7:end)),C);
or, yet another regexp variation:
y = str2num(char(regexprep(C, '\d+\.\d+\.','')));
Assuming your matrix with dates is M or a cell array C:
In case your data is in a cell array start with
M = cell2mat(C)
Then get the relevant part
Y=M(:,end-4:end)
If required you can even make the year a number
Year = str2num(Y)
Using regexp this will works also with dates with slightly different formats, like 1.1.2000, which can mess with you offsets
res = regexp(dates, '(?<=\d+\.\d+\.)\d+', 'match')

IDL and MatLab getting strange values from NetCDF file

I have a NetCDF file, which contains data representing total precipitation across the globe over several months (so it's stored in a three dimensional array). I first ensured that the data was sensible, and the way it was formed, both in XConv and ncdump. All looks sensible - values vary from very small (~10^-10 - this makes sense, as this is model data, and effectively represents zero) to about 5x10^-3.
The problems start when I try to handle this data in IDL or MatLab. The arrays generated in these programs are full of huge negative numbers such as -4x10^4, with occasional huge positive numbers, such as 5000. Strangely, looking at a plot of the data in MatLab with respect to latitude and longitude (at a specific time), the pattern of rainfall looks sensible, but the values are just completely wrong.
In IDL, I'm reading the file in to write it to a text file so it can be handled by some software that takes very basic text files. Here's the code I'm using:
PRO nao_heaps
address = '/Users/levyadmin/Downloads/'
file_base = 'output'
ncid = ncdf_open(address + file_base + '.nc')
MONTHS=['january','february','march','april','may','june','july','august','september','october','november','december']
varid_field = ncdf_varid(ncid, "tp")
varid_lon = ncdf_varid(ncid, "longitude")
varid_lat = ncdf_varid(ncid, "latitude")
varid_time = ncdf_varid(ncid, "time")
ncdf_varget,ncid, varid_field, total_precip
ncdf_varget,ncid, varid_lat, lats
ncdf_varget,ncid, varid_lon, lons
ncdf_varget,ncid, varid_time, time
ncdf_close,ncid
lats = reform(lats)
lons = reform(lons)
time = reform(time)
total_precip = reform(total_precip)
total_precip = total_precip*1000. ;put in mm
noLats=(size(lats))(1)
noLons=(size(lons))(1)
noMonths=(size(time))(1)
; the data may not be an integer number of years (otherwise we could make this next loop cleaner)
av_precip=fltarr(noLons,noLats,12)
for month=0, 11 do begin
year = 0
while ( (year*12) + month lt noMonths ) do begin
av_precip(*,*,month) = av_precip(*,*,month) + total_precip(*,*, (year*12)+month )
year++
endwhile
av_precip(*,*,month) = av_precip(*,*,month)/year
endfor
fname = address + file_base + '.dat'
OPENW,1,fname
PRINTF,1,'longitude'
PRINTF,1,lons
PRINTF,1,'latitude'
PRINTF,1,lats
for month=0,11 do begin
PRINTF,1,MONTHS(month)
PRINTF,1,av_precip(*,*,month)
endfor
CLOSE,1
END
Anyone have any ideas why I'm getting such strange values in MatLab and IDL?!
AH! Found the answer. NetCDF files use an offset, and a scale factor for the data to keep the size of the file to a minimum. To get the correct values, I simply need to:
total_precip = offset + (scale_factor * total_precip) ;put into correct range
At present I'm getting the scale factor and offset from ncdump, and hard coding them into my IDL program, but does anyone know how I can get them dynamically in my IDL code..?

Regarding BigDecimal

I have a csv file where amount and quantity fields are present in each detail record except header and trailer record. Trailer record has a total charge values which is the total sum of quantity multiplied by amount field in detail records . I need to check whether the trailer total charge value is equal to my calculated value of amount and quantity fields. I am using the double data type for all these calculations. When i browsed i am able to understand from the below web link that it might create an issue using double datatype while comparison with decimal points. It's suggesting to using BigDecimal
http://epramono.blogspot.com/2005/01/double-vs-bigdecimal.html
Will i get issues if i use double data type. How can i do the calculations using BigDecimal. Also i am not sure how many digits i will get after decimal points in csv file. Also amount can have a positive or negative value.
In csv file
H,ABC.....
"D",....,"1","12.23"
"D",.....,"3","-13.334"
"D",......,"2","12"
T,csd,123,12.345
------------------------------ While Validation i am having the below code --------------------
double detChargeCount =0;
//From csv file i am reading trailer records charge value
String totChargeValue = items[3].replaceAll("\"","").trim();
if (null != totChargeValue && !totChargeValue.equals("")) {
detChargeCount = new Double(totChargeValue).doubleValue();
if(detChargeCount==calChargeCount)
validflag=true;
-----------------------While reading CSV File i am having the below code
if (null != chargeQuan && !chargeQuan.equals("")) {
tmpChargeQuan=Long(chargeQuan).longValue();
}
if (null != chargeAmount && !chargeAmount.equals("")) {
tmpChargeAmt=new Double(chargeAmount).doubleValue();
calChargeCount=calChargeCount+(tmpChargeQuan*tmpChargeAmt);
}
I had declared the variables tmpChargeQuan, tmpChargeAmt, calChargeCount as double
Especially for anything with financial data, but in general for everything dealing with human readable numbers, BigDecimal is what you want to use instead of double, just as that source says.
The documentation on BigDecimal is pretty straight-forward, and should provide everything you need.
It has a int, double, and string constructors, so you can simply have:
BigDecimal detChargeCount = new BigDecimal(0);
...
detChargeCount = new BigDecimal(totChargeValue);
The operators are implemented as functions, so you'd have to do things like
tmpChargeQuan.multiply(tmpChargeAmt)
instead of simply tmpChargeQun * tmpChargeAmt, but that shouldn't be a big deal.
but they're all defined with all the overloads you could need as well.
It is very possible that you will have issues with doubles, by which I mean the precomputed value and the newly computed value may differ by .000001 or less.
If you don't know how the value you are comparing to was computed, I think the best solution is to define "equal" as having a difference of less than epsilon, where epsilon is a very small number such as .0001.
I.e. rather than using the test A == B, use abs(A - B) < .0001.