Total number of pages in a PDF document - matlab

MATLAB provides the extractFileText function which allows us to read text from PDF files, among other file formats, and save the extracted text as a string.
We can pass an extra argument to this function in order to extract text from specific pages of the document.
For example, to extract the text from pages 3, 5 and 7 from the sample exampleSonnets.pdf file:
str = extractFileText("exampleSonnets.pdf", 'Pages', [3 5 7]);
This function, however, does not provide a way of finding out the total number of pages that the PDF document contains beforehand.
So if we happen to do something like:
str = extractFileText("exampleSonnets.pdf", 'Pages', [99 100]);
The following error is thrown:
Error using extractFileText (line 95)
No page 100 in file. Maximum page number: 47.
Warning us that we have requested a page number that exceeds the actual total number of pages in the document.
This is fine.
However, how can I know the total number of pages in a PDF document beforehand, without triggering the error, so that I can safely narrow my searches to the maximum page number?
Is there a function for this purpose?

I'm not aware of a way that ould let you do this. But you can use try/catch to handle the situation directly without knowing the number of pages beforehand.
If you do need to know the number of pages beforehand you could just iterate through the pages until you hit an error that you do handle using try/catch (works for small pdfs) or implement e.g. a binary search in a similar way.

flawr's idea is actually very clever!
In fact, since the maximum page number is contained in the error message, we can trigger the exception (asking for any ridiculously big page number on purpose), catch it, and then parse the error message to recover the maximum page number.
No page 100 in file. Maximum page number: 47.
^
This is all we need
So we don't even need to iterate through every page of the document :)
I went ahead and made this simple numpages function:
function [num] = numpages(filename)
% Queried page number. Any big number should do.
bignum = 1e6;
try
extractFileText(filename, 'Pages', bignum);
catch ME
if strcmp(ME.identifier, 'textanalytics:extractFileText:NoSuchPage')
% Extract the Maximum page number from the exception message.
num = str2double(extractBetween(ME.message, "number: ", "."));
else
% Not the exception we are interested in. Rethrow it.
rethrow(ME);
end
end
end
Test case:
>> numpages("exampleSonnets.pdf")
ans =
47
It works!

Related

Scala Gatling keep inserting user until end of feed file is reached

I'm trying to inject users to a scenario in such a way that it will keep inserting user until every single entry of the feed file is used since the feed file contains log in information. I would like all the users in the feed file to log in. Right now all I could think of is two possible approaches.
Here I insert the number of rows in the feedfile at once.
scenario("Verified_Login")
.exec(LoginScenario.scn)
.inject(atOnceUsers(number_of_entries_in_feedfile))
Here I insert a very high time duration, for example, 100 seconds and then make the feedfile circular.
scenario("Verified_Login")
.exec(LoginScenario.scn)
.inject(atOnceUsers(1),constantUsersPerSec(1) during(100 seconds)
The problem with the first approach is I have to find the number of entries in the feed file which can be tedious as there could be thousands there. The problem with the second is that entries could and probably will be repeated. So is there a way to keep injecting users till feed file runs out of entries?
According to this source, from last year, Stéphane Landelle - who is the leading contributor of gatling, says that you must provide enough data for a simulation to complete using this method.
The post I linked from Stéphane does suggest to simply read the length of the file and use that to drive the amount of users, as you have already mentioned in your question.
I suggest you read the post as it will give you an alternate method to achieving what you want. Seems to be as close as you will ever get unless things have changed.
Here is their code.
val systemsIdentifier = jdbcFeeder(databaseUrl, databaseUser, databasePassword, sql_systemsIdentifier)
val count = new AtomicInteger(systemsIdentifier.records.size).asLongAs(_ => count.getAndIncrement < systemsIdentifier.records.size)
val comScn = scenario("My scenario")
.repeat(systemsIdentifier.records.size / count) {
feed(systemsIdentifier)
.exec(performActionsChain)
}
setUp(comScn.inject(rampUsers(count) over (60 seconds))).protocols(httpConf)

Find category of MATLAB mlint warning ID

I'm using the checkcode function in MATLAB to give me a struct of all error messages in a supplied filename along with their McCabe complexity and ID associated with that error. i.e;
info = checkcode(fileName, '-cyc','-id');
In MATLAB's preferences, there is a list of all possible errors, and they are broken down into categories. Such as "Aesthetics and Readability", "Syntax Errors", "Discouraged Function Usage" etc.
Is there a way to access these categories using the error ID gained from the above line of code?
I tossed around different ideas in my head for this question and was finally able to come up with a mostly elegant solution for how to handle this.
The Solution
The critical component of this solution is the undocumented -allmsg flag of checkcode (or mlint). If you supply this argument, then a full list of mlint IDs, severity codes, and descriptions are printed. More importantly, the categories are also printed in this list and all mlint IDs are listed underneath their respective mlint category.
The Execution
Now we can't simply call checkcode (or mlint) with only the -allmsg flag because that would be too easy. Instead, it requires an actual file to try to parse and check for errors. You can pass any valid m-file, but I have opted to pass the built-in sum.m because the actual file itself only contains help information (as it's real implementation is likely C++) and mlint is therefore able to parse it very rapidly with no warnings.
checkcode('sum.m', '-allmsg');
An excerpt of the output printed to the command window is:
INTER ========== Internal Message Fragments ==========
MSHHH 7 this is used for %#ok and should never be seen!
BAIL 7 done with run due to error
INTRN ========== Serious Internal Errors and Assertions ==========
NOLHS 3 Left side of an assignment is empty.
TMMSG 3 More than 50,000 Code Analyzer messages were generated, leading to some being deleted.
MXASET 4 Expression is too complex for code analysis to complete.
LIN2L 3 A source file line is too long for Code Analyzer.
QUIT 4 Earlier syntax errors confused Code Analyzer (or a possible Code Analyzer bug).
FILER ========== File Errors ==========
NOSPC 4 File <FILE> is too large or complex to analyze.
MBIG 4 File <FILE> is too big for Code Analyzer to handle.
NOFIL 4 File <FILE> cannot be opened for reading.
MDOTM 4 Filename <FILE> must be a valid MATLAB code file.
BDFIL 4 Filename <FILE> is not formed from a valid MATLAB identifier.
RDERR 4 Unable to read file <FILE>.
MCDIR 2 Class name <name> and #directory name do not agree: <FILE>.
MCFIL 2 Class name <name> and file name do not agree: <file>.
CFERR 1 Cannot open or read the Code Analyzer settings from file <FILE>. Using default settings instead.
...
MCLL 1 MCC does not allow C++ files to be read directly using LOADLIBRARY.
MCWBF 1 MCC requires that the first argument of WEBFIGURE not come from FIGURE(n).
MCWFL 1 MCC requires that the first argument of WEBFIGURE not come from FIGURE(n) (line <line #>).
NITS ========== Aesthetics and Readability ==========
DSPS 1 DISP(SPRINTF(...)) can usually be replaced by FPRINTF(...).
SEPEX 0 For better readability, use newline, semicolon, or comma before this statement.
NBRAK 0 Use of brackets [] is unnecessary. Use parentheses to group, if needed.
...
The first column is clearly the mlint ID, the second column is actually a severity number (0 = mostly harmless, 1 = warning, 2 = error, 4-7 = more serious internal issues), and the third column is the message that is displayed.
As you can see, all categories also have an identifier but no severity, and their message format is ===== Category Name =====.
So now we can just parse this information and create some data structure that allows us to easily look up the severity and category for a given mlint ID.
Again, though, it can't always be so easy. Unfortunately, checkcode (or mlint) simply prints this information out to the command window and doesn't assign it to any of our output variables. Because of this, it is necessary to use evalc (shudder) to capture the output and store it as a string. We can then easily parse this string to get the category and severity associated with each mlint ID.
An Example Parser
I have put all of the pieces I discussed previously together into a little function which will generate a struct where all of the fields are the mlint IDs. Within each field you will receive the following information:
warnings = mlintCatalog();
warnings.DWVRD
id: 'DWVRD'
severity: 2
message: 'WAVREAD has been removed. Use AUDIOREAD instead.'
category: 'Discouraged Function Usage'
category_id: 17
And here's the little function if you're interested.
function [warnings, categories] = mlintCatalog()
% Get a list of all categories, mlint IDs, and severity rankings
output = evalc('checkcode sum.m -allmsg');
% Break each line into it's components
lines = regexp(output, '\n', 'split').';
pattern = '^\s*(?<id>[^\s]*)\s*(?<severity>\d*)\s*(?<message>.*?\s*$)';
warnings = regexp(lines, pattern, 'names');
warnings = cat(1, warnings{:});
% Determine which ones are category names
isCategory = cellfun(#isempty, {warnings.severity});
categories = warnings(isCategory);
% Fix up the category names
pattern = '(^\s*=*\s*|\s*=*\s*$)';
messages = {categories.message};
categoryNames = cellfun(#(x)regexprep(x, pattern, ''), messages, 'uni', 0);
[categories.message] = categoryNames{:};
% Now pair each mlint ID with it's category
comp = bsxfun(#gt, 1:numel(warnings), find(isCategory).');
[category_id, ~] = find(diff(comp, [], 1) == -1);
category_id(end+1:numel(warnings)) = numel(categories);
% Assign a category field to each mlint ID
[warnings.category] = categoryNames{category_id};
category_id = num2cell(category_id);
[warnings.category_id] = category_id{:};
% Remove the categories from the warnings list
warnings = warnings(~isCategory);
% Convert warning severity to a number
severity = num2cell(str2double({warnings.severity}));
[warnings.severity] = severity{:};
% Save just the categories
categories = rmfield(categories, 'severity');
% Convert array of structs to a struct where the MLINT ID is the field
warnings = orderfields(cell2struct(num2cell(warnings), {warnings.id}));
end
Summary
This is a completely undocumented but fairly robust way of getting the category and severity associated with a given mlint ID. This functionality existed in 2010 and maybe even before that, so it should work with any version of MATLAB that you have to deal with. This approach is also a lot more flexible than simply noting what categories a given mlint ID is in because the category (and severity) will change from release to release as new functions are added and old functions are deprecated.
Thanks for asking this challenging question, and I hope that this answer provides a little help and insight!
Just to close this issue off. I've managed to extract the data from a few different places and piece it together. I now have an excel spreadsheet of all matlab's warnings and errors with columns for their corresponding ID codes, category, and severity (ie, if it is a warning or error). I can now read this file in, look up ID codes I get from using the 'checkcode' function and draw out any information required. This can now be used to create analysis tools to look at the quality of written scripts/classes etc.
If anyone would like a copy of this file then drop me a message and I'll be happy to provide it.
Darren.

Is it possible to create an RHQ plugin that collects historic measurements from files?

I'm trying to create an RHQ plugin to gather some measurements. It seems relativity easy to create a plugin that return a value for the present moment. However, I need to collect these measurements from files. These files are created on a schedule, for example one per hour, but they contain much finer measurements, for example a measurement for every minute. The file may look something like below:
18:00 20
18:01 42
18:02 39
...
18:58 12
18:59 15
Is it possible to create a RHQ plugin that can return many values with timestamps for a measurement?
I think you can within org.rhq.core.pluginapi.measurement.MeasurementFacet#getValues return as many values as you want within the MeasurementReport.
So basically open the file, seek to the last known position (if the file is always appended to), read from there and for each line you go
MeasurementData data = new MeasurementDataNumeric(timeInFile, request, valueFromFile);
report.add(data);
Of course alerting on this (historical) data is sort of questionable, as if you only read the file one hour later, the alert can not be retroactively fired at the time the bad value happened :->
Yes it is surely possible .
#Override
public void getValues(MeasurementReport report, Set<MeasurementScheduleRequest> metrics) throws Exception {
for (MeasurementScheduleRequest request : metrics) {
Double result = SomeReadUtilClass.readValueFromFile();
MeasurementData data = new MeasurementDataNumeric(request, result)
report.addData(data );
}
}
SomeReadUtilClass is a utility class to read the file and readValueFromFile is the function, you can write you login to read the value from file.
result is the Double variable that is more important, this result value you can calculate from database or read file. And then this result value you have to provide to MeasurementDataNumeric function MeasurementDataNumeric(request, result));

Find a Name in an Email (Low-Level I/O)

Round 2: Picking out leaders in an email
Alrighty, so my next problem is trying to figure out who the leader is in a project. In order to determine this, we are given an email and have to find who says "Do you want..." (capitalization may vary). I feel like my code should work for the most part, but I really have an issue figuring out how to correctly populate my cell array. I can get it to create the cell array, but it just puts the email in it over over again. So each cell is basically the name.
function[Leader_Name] = teamPowerHolder(email)
email = fopen(email, 'r'); %// Opens my file
lines = fgets(email); %// Reads the first line
conversations = {lines}; %// Creates my cell array
while ischar(lines) %// Populates my cell array, just not correct
Convo = fgets(email);
if Convo == -1 %// Prevents it from just logging -1 into my cell array like a jerk
break; %// Returns to function
end
conversations = [conversations {lines}]; %// Populates my list
end
Sentences = strfind(conversations,'Do you want'); %// Locates the leader position
Leader_Name = Sentences{1}; %// Indexes that position
fclose(email);
end
What I ideally need it to do is find the '/n' character (hence why I used fgets) but I'm not sure how to make it do that. I tried to have my while loop be like:
while lines == '/n'
but that's incorrect. I feel like I know how to do the '/n' bit, I just can't think of it. So I'd appreciate some hints or tips to do that. I could always try to strsplit or strtok the function, but I need to then populate my cell array so that might get messy.
Please and thanks for help :)
Test Case:
Anna: Hey guys, so I know that he just assigned this project, but I want to go ahead and get started on it.
Can you guys please respond and let me know a weekly meeting time that will work for you?
Wiley: Ummmmm no because ain't nobody got time for that.
John: Wiley? What kind of a name is that? .-.
Wiley: It's better than john. >.>
Anna: Hey boys, let's grow up and talk about a meeting time.
Do you want to have a weekly meeting, or not?
Wiley: I'll just skip all of them and not end up doing anything for the project anyway.
So I really don't care so much.
John: Yes, Anna, I'd like to have a weekly meeting.
Thank you for actually being a good teammate and doing this. :)
out2 = teamPowerHolder('teamPowerHolder_convo2.txt')
=> 'Anna'
The main reason why it isn't working is because you're supposed to update the lines variable in your loop, but you're creating a new variable called Convo that is updating instead. This is why every time you put lines in your cell array, it just puts in the first line repeatedly and never quits the loop.
However, what I would suggest you do is read in each line, then look for the : character, then extract the string up until the first time you encounter this character minus 1 because you don't want to include the actual : character itself. This will most likely correspond to the name of the person that is speaking. If we are missing this occurrence, then that person is still talking. As such, you would have to keep a variable that keeps track of who is still currently talking, until you find the "do you want" string. Whoever says this, we return the person who is currently talking, breaking out of the loop of course! To ensure that the line is case insensitive, you'll want to convert the string to lower.
There may be a case where no leader is found. In that case, you'll probably want to return the empty string. As such, initialize Leader_Name to the empty string. In this case, that would be []. That way, should we go through the e-mail and find no leader, MATLAB will return [].
The logic that you have is pretty much correct, but I wouldn't even bother storing stuff into a cell array. Just examine each line in your text file, and keep track of who is currently speaking until we encounter a sentence that has another : character. We can use strfind to facilitate this. However, one small caveat I'll mention is that if the person speaking includes a : in their conversation, then this method will break.
Judging from the conversation that I'm seeing your test case, this probably won't be the case so we're OK. As such, borrowing from your current code, simply do this:
function[Leader_Name] = teamPowerHolder(email)
Leader_Name = []; %// Initialize leader name to empty
name = [];
email = fopen(email, 'r'); %// Opens my file
lines = fgets(email); %// Reads the first line
while ischar(lines)
% // Get a line in your e-mail
lines = fgets(email);
% // Quit like a boss if you see a -1
if lines == -1
break;
end
% // Check if this line has a ':' character.
% // If we do, then another person is talking.
% // Extract the characters just before the first ':' character
% // as we don't want the ':' character in the name
% // If we don't encounter a ':' character, then the same person is
% // talking so don't change the current name
idxs = strfind(lines, ':');
if ~isempty(idxs)
name = lines(1:idxs(1)-1);
end
% // If we find "do you want" in this sentence, then the leader
% // is found, so quit.
if ~isempty(strfind(lower(lines), 'do you want'))
Leader_Name = name;
break;
end
end
By running the above code with your test case, this is what I get:
out2 = teamPowerHolder('teamPowerHolder_convo2.txt')
out2 =
Anna

updating existing data on google spreadsheet using a form?

I want to build kind of an automatic system to update some race results for a championship. I have an automated spreadsheet were all the results are shown but it takes me a lot to update all of them so I was wondering if it would be possible to make a form in order to update them more easily.
In the form I will enter the driver name and the number o points he won on a race. The championship has 4 races each month so yea, my question is if you guys know a way to update an existing data (stored in a spreadsheet) using a form. Lets say that in the first race, the driver 'X' won 10 points. I will insert this data in a form and then call it from the spreadsheet to show it up, that's right. The problem comes when I want to update the second race results and so on. If the driver 'X' gets on the second race 12 points, is there a way to update the previous 10 points of that driver and put 22 points instead? Or can I add the second race result to the first one automatically? I mean, if I insert on the form the second race results can it look for the driver 'X' entry and add this points to the ones that it previously had. Dunno if it's possible or not.
Maybe I can do it in another way. Any help will be much appreciated!
Thanks.
Maybe I missed something in your question but I don't really understand Harold's answer...
Here is a code that does strictly what you asked for, it counts the total cumulative value of 4 numbers entered in a form and shows it on a Spreadsheet.
I called the 4 questions "race number 1", "race number 2" ... and the result comes on row 2 so you can setup headers.
I striped out any non numeric character so you can type responses more freely, only numbers will be retained.
form here and SS here (raw results in sheet1 and count in Sheet2)
script goes in spreadsheet and is triggered by an onFormSubmit trigger.
function onFormSubmit(e) {
var sh = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Sheet2');
var responses = []
responses[0] = Number(e.namedValues['race number 1'].toString().replace(/\D/g,''));
responses[1] = Number(e.namedValues['race number 2'].toString().replace(/\D/g,''));
responses[2] = Number(e.namedValues['race number 3'].toString().replace(/\D/g,''));
responses[3] = Number(e.namedValues['race number 4'].toString().replace(/\D/g,''));
var totals = sh.getRange(2,1,1,responses.length).getValues();
for(var n in responses){
totals[0][n]+=responses[n];
}
sh.getRange(2,1,1,responses.length).setValues(totals);
}
edit : I changed the code to allow you to change easily the number of responses... range will update automatically.
EDIT 2 : a version that accepts empty responses using an "if" condition on result:
function onFormSubmit(e) {
var sh = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Sheet2');
var responses = []
responses[0] = Number((e.namedValues['race number 1']==null ? 0 :e.namedValues['race number 1']).toString().replace(/\D/g,''));
responses[1] = Number((e.namedValues['race number 2']==null ? 0 :e.namedValues['race number 2']).toString().replace(/\D/g,''));
responses[2] = Number((e.namedValues['race number 3']==null ? 0 :e.namedValues['race number 3']).toString().replace(/\D/g,''));
responses[3] = Number((e.namedValues['race number 4']==null ? 0 :e.namedValues['race number 4']).toString().replace(/\D/g,''));
var totals = sh.getRange(2,1,1,responses.length).getValues();
for(var n in responses){
totals[0][n]+=responses[n];
}
sh.getRange(2,1,1,responses.length).setValues(totals);
}
I believe you can found everything you want here.
It's a form url, when you answer this form you'll have the url of the spreadsheet where the data are stored. One of the information stored is the url to modify your response, if you follow the link it will open the form again and update the spreadsheet in consequence. the code to do this trick is in the second sheet of the spreadsheet.
It's a google apps script code that need to be associated within the form and triggered with an onFormSubmit trigger.
It may be too late now. I believe we need a few things (I have not tried it)
A unique key to map each submitted response, such as User's ID or email.
Two Google Forms:
a. To request the unique key
b. To retrieve relevant data with that unique key
Create a pre-filled URL (See http://www.cagrimmett.com/til/2016/07/07/autofill-google-forms.html)
Open the URL from your form (See Google Apps Script to open a URL)