How to process a simple loop in Perl's WWW::Mechanize?

Especially interesting for me as a PHP/Perl beginner is this site in Switzerland:
http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=1308
It has a dataset of 2700 foundations. All the data are free to use, with no limitations or copyrights on it.
What we have so far: the harvesting task should be no problem if I take WWW::Mechanize, particularly for doing the form-based search and selecting the individual entries. I guess the algorithm would be basically two nested loops: the outer loop runs the form-based search, and the inner loop processes the search results.
The outer loop would use the select() and submit_form() functions on the second search form on the page. Can we use DOM processing here? And how can we get the selection values?
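As a rough, untested sketch (the form position is taken from the description above; the 'Kanton' field name is a made-up placeholder, so read the real field name from the page source), the outer loop could look like this:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de');

# The search form is assumed to be the second form on the page;
# HTML::Form's possible_values() lists the options of a select field.
my $form   = $mech->form_number(2);
my @values = $form->find_input('Kanton')->possible_values;

for my $value (@values) {
    $mech->submit_form(
        form_number => 2,
        fields      => { Kanton => $value },
    );
    # ... inner loop: follow each result link, extract, back() ...
    $mech->back;    # return to the search form for the next value
}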
The inner loop through the results would use the follow_link() function to get to the actual entries, using the following call:
$mech->follow_link(
    url_regex => qr/webgrab_path=http:\/\/evs2000.*\?Id=\d+$/,
    n         => $result_nbr,
);
This would forward our Mechanize browser to the entry page. Basically, the URL regex looks for links that match the webgrab_path-to-Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which one of the results it should follow next.
If we have several result pages, we would use the same trick to traverse through them. For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it gives you some powerful DOM selection (XPath) methods.
The actual looping through the pages should be doable in a few lines of Perl (20 lines at most, likely fewer).
But wait: the processing of the entry pages will then be the most complex part
of the script.
Approaches: In principle we could do the same algorithm with a single while loop
if we use the back() function smartly.
Can you give me a hint for getting started on the processing of the entry pages with WWW::Mechanize?

"Which has a dataset of 2700 foundations. All the data are free to use with no limitations copyrights on it."
Not true. See http://perlmonks.org/?node_id=905767
"The data is copyrighted even though it is made available freely: "Downloading or copying of texts, illustrations, photos or any other data does not entail any transfer of rights on the content." (and again, in German, as you've been scraping some other German list to spam before)."

Related

How do I reduce the number of 301 redirect entries using wildcards and variables in Squarespace?

I recently renamed all of the URLs that make up my blog and have written redirects for almost every page, using wildcards where I can. Keep in mind that all I know is the * wildcard at this time.
Here is an example of what I have...
/season-1/2017/1/1/snl-s01e01-host-george-carlin -> /season-1/snl-s01e01-george-carlin 301
I want to write a catch-all that will redirect all 38 seasons of reviews with one redirect entry, but I can't figure out how to get rid of just the word "host" between s01e01- and -george-carlin. I was thinking it would work something like this:
/season-*/*/*/*/snl-s*e*-host-*-* -> /season-*/snl-s*e*[code to remove the word "host"]-*-* 301
Is that even close to being correct? Do I need that many *s?
Thanks in advance for any help...
Unfortunately, you won't be able to reduce the number of individual redirect entries using the redirect features that Squarespace has to offer, namely the wildcard (*) and a single variable ([name]). Multiple variables would be needed, but only [name] is supported.
The closest you can get is:
/season-1/*/*/*/snl-s01e01-host-[name] -> /season-1/snl-s01e01-[name] 301
But, if I'm understanding things, while the above redirect appears more general, it would still need to be copy/pasted for each post individually. So although it demonstrates the best that could be achieved, it is not a technical improvement.
Therefore, there are only two alternatives:
Create a Google Sheet (or other spreadsheet) where the old URLs are copy/pasted into column one, a formula using arrayformula and regular expressions to parse the old URL and generate the new URL is added in column two, and in column three a formula is written to join the two cells with -> and 301. With that done, you could click, drag and highlight all cells in column three, copy, and paste them into the "URL Shortcuts" text area in Squarespace.
It can be quite time consuming to figure out, write and test the correct formula, but it does avoid having to manually type out every redirect. Whether it is less time/effort in total depends on the number of redirects and one's proficiency with writing spreadsheet formulas. It could be that using the redirect code above would simplify the formula that'd need to be written in the spreadsheet, which may save some time.
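Alternatively, the same column-three output could be generated offline by a short script instead of a spreadsheet. A minimal Perl sketch, assuming every old URL follows the pattern from the example above:

use strict;
use warnings;

# Read old URLs (one per line) and print Squarespace redirect lines:
# /season-N/YYYY/M/D/snl-sXXeYY-host-rest -> /season-N/snl-sXXeYY-rest 301
while ( my $old = <DATA> ) {
    chomp $old;
    if ( $old =~ m{^(/season-\d+)/\d+/\d+/\d+/(snl-s\d+e\d+)-host-(.+)$} ) {
        print "$old -> $1/$2-$3 301\n";
    }
}

__DATA__
/season-1/2017/1/1/snl-s01e01-host-george-carlin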
Another alternative would be to remove your redirects and instead handle the redirect via JavaScript added to the 404/Page-not-found page. Because it sounds like you already have all of the redirects in place but are simply trying to reduce the overall number, I wouldn't recommend changing to a JavaScript-based approach. There are other drawbacks to using JavaScript, in any case.

How to filter by both text and property in Chrome DevTool's network panel?

I want to filter Chrome DevTools' network panel by both the method property and text in the URL; for example, searching for the text chromequestion in the URL while showing only HTTP GET requests (ignoring PUT, POST, DELETE, etc.).
I am able to filter by text or by method:
I am not able to combine the filter to search by both text and method:
I read the documentation at https://developers.google.com/web/tools/chrome-devtools/network-performance/reference#filters and I am able to filter by multiple properties (e.g., domain:*.com method:GET). However, I am unable to filter by text and a property together (e.g., method:GET chromequestion).
Unfortunately, it's not possible to do this currently. I played around in DevTools originally, but couldn't find a way. I later had a look into how the filtering was implemented, and can confirm there's a limitation preventing you from mixing the pre-defined filters and text filters.
Implementation details
This is a bit long, but I thought it might be interesting for some to see how it's implemented. I will probably look into improving the implementation, either myself or by logging the limitation.
There's a _parseFilterQuery function that parses the input field and categorises the entries into two arrays. The first, called filters, holds the pre-defined filtering options, such as method:GET. The second is an array of text filters, split on spaces. The parser distinguishes the two fairly naively, by checking for the occurrence of : (and of - at the start, for negation).
Scenario 1
You only input a pre-defined filter, or multiple filters. For each filter, the specific filter function, which looks at the different properties of the request object, is pushed to a network module filters array (this._filters). Later on, for each request, the function is called on it, and a match returns true, otherwise false. This will determine whether the request is shown. There's obviously a requirement for ALL filters to return true for the row to show.
Scenario 2
This is the interesting one, where you input both a pre-defined filter and a bit of text. This covers the Stack Overflow question. The _parseFilterQuery function looks at the text filters first, before the pre-defined ones. In Scenario 1, this was empty, so it was skipped.
We pass each text word to _createTextFilter, and push each of the resulting filters to the network module filters array. However, the implementation of this is questionable. The only time the actual word passed in is used is to check whether it's a negation filter for a bit of text. If the first character is -, the user doesn't want to see any request with the following word in the name; for example, -icon means don't show any request with that in the name/path. If there is no negation, it simply returns the WHOLE input text as a regular expression, NOT the word passed in. In my case, it returns /method:GET icon/i, as the sketch below illustrates.
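To illustrate (a sketch in Perl for brevity, not DevTools' actual JavaScript): because the text filter is built from the whole query rather than the single word, it can never match a request name once a pre-defined filter is mixed in.

use strict;
use warnings;

# The whole query string becomes one case-insensitive pattern ...
my $query  = 'method:GET icon';
my $filter = qr/\Q$query\E/i;    # effectively /method:GET icon/i

# ... which is then tested against request names, so nothing matches.
for my $name ( 'icon.png', 'favicon.ico' ) {
    printf "%-12s matches: %s\n", $name,
        ( $name =~ $filter ? 'yes' : 'no' );    # prints 'no' for both
}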
The pre-defined filters are looked at next. In this case, method:GET is pushed.
Finally, it loops over the requests calling each filter on it. However, since the first filter is /method:GET icon/i, it makes ALL other filters redundant because it will NEVER pass. The text filters only apply to name and path, so method:GET in a text filter will be invalid.

Sum of DOM elements using XPath

I am using MSXML v3.0 in a VB 6.0 application. The application calculates the sum of the Amount value of all Transaction nodes using a For Each loop, as shown below.
Set subNodes = docXML.selectNodes("//Transaction")
For Each subNode In subNodes
total = total + Val(subNode.selectSingleNode("Amount").nodeTypedValue)
Next
This loop is taking too much time; sometimes it takes 15-20 minutes for 60 thousand nodes.
I am looking for an XPath/DOM solution to eliminate this loop, something like:
docXML.selectNodes("//Transaction").Sum("Amount")
or
docXML.selectNodes("Sum(//Transaction/Amount)")
Any suggestion to get this sum faster is welcome.
MSXML's selectNodes() only accepts XPath expressions that return node-sets, so an expression such as sum(//Transaction/Amount) cannot be evaluated through it directly. With .NET's XPathNavigator it can (C#):
// Requires: using System; using System.Xml.XPath;
// Open the XML.
XPathDocument docNav = new XPathDocument(@"c:\books.xml");
// Create a navigator to query with XPath.
XPathNavigator nav = docNav.CreateNavigator();
// Find the sum. This expression uses standard XPath syntax.
string strExpression = "sum(/bookstore/book/price)";
// Use the Evaluate method to return the evaluated expression.
Console.WriteLine("The price sum of the books is {0}", nav.Evaluate(strExpression));
source: http://support.microsoft.com/kb/308333
Any solution that uses the XPath // pseudo-operator on an XML document with 60000+ nodes is going to be quite slow, because //x causes a complete traversal of the tree starting at the root of the document.
The solution can be sped up significantly if a more exact XPath expression is used, one that doesn't include the // pseudo-operator.
If you know the structure of the XML document, always use a specific chain of location steps -- never //.
If you provide a small example, showing the specific structure of the document, then many people will be able to provide a faster solution than any solution that uses //.
For example, if it is known that all Transaction elements can be selected using this XPath expression:
/x/y/Transaction
then the evaluation of
sum(/x/y/Transaction/Amount)
is likely to be significantly faster than sum(//Transaction/Amount).
Update:
The OP has revealed in a comment that the structure of the XML file is quite simple.
Accordingly, with an XML document of 60,000 Transaction nodes, I tried the following expression:
sum(/*/*/Amount)
With .NET's XslCompiledTransform (yes, I used XSLT as the host for the XPath engine) this took 220 ms, i.e. 0.22 seconds, to produce the sum.
With MSXML3 it takes 334 seconds.
With MSXML6 it takes 76 seconds -- still quite slow.
Conclusion: This is a bug in MSXML3 -- try to upgrade to another XPath engine, such as the one offered by .NET.
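For example, here is a minimal sketch of the same sum with one such engine, Perl's XML::LibXML (a wrapper around libxml2). It assumes the /x/y/Transaction structure used above and an illustrative file name:

use strict;
use warnings;
use XML::LibXML;

# Load the document and evaluate the XPath sum() directly;
# findvalue() returns the expression's string value.
my $doc   = XML::LibXML->load_xml( location => 'transactions.xml' );
my $total = $doc->findvalue('sum(/x/y/Transaction/Amount)');
print "Total: $total\n";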

Dom-Processing with Perl-Mechanize: finalizing a little programme

I'm currently working on a little harvester, using this dataset of 2700 foundations. All the data are free to use, with no limitations or copyright issues.
What I have so far: The harvesting task should be no problem if I take WWW::Mechanize — particularly for doing the form based search and selecting the individual entries. Hmm — I guess that the algorithm would be basically two nested loops: the outer loop runs the form-based search, the inner loop processes the search results.
The outer loop would use the select() and submit_form() functions on the second search form on the page. Can we use DOM processing here? And how can we get the selection values?
The inner loop through the results would use the follow_link() function to get to the actual entries, using the following call:
$mech->follow_link(
    url_regex => qr/webgrab_path=http:\/\/evs2000.*\?Id=\d+$/,
    n         => $result_nbr,
);
This would forward our Mechanize browser to the entry page. Basically, the URL regex looks for links that match the webgrab_path-to-Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which one of the results it should follow next.
If we have several result pages, we would use the same trick to traverse through them. For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it gives you some powerful DOM selection (XPath) methods.
The actual looping through the pages should be doable in a few lines of Perl (20 lines at most, likely fewer).
But wait: the processing of the entry pages will then be the most complex part
of the script.
Approaches: In principle we could do the same algorithm with a single while loop
if we use the back() function smartly.
Can you give me a hint for getting started on the processing of the entry pages with WWW::Mechanize?
Here's what I have:
use strict;
use warnings;
use WWW::Mechanize;

GetThePage($starting_url);    # seed with the search-results URL

sub GetThePage {
    my ($url) = @_;
    my $mech  = WWW::Mechanize->new( autocheck => 1 );
    my @pages = ($url);
    while (@pages) {
        my $page = shift @pages;
        $mech->get($page);
        push @pages, GetMorePages($mech);
        SomethingImportant($mech);
        SomethingXPATH($mech);
    }
}
The question is how to find the DOM-paths.
Use Firebug, Opera Dragonfly, Chromium Developer tools.
Call the context menu on the indicated element to copy an XPath expression or CSS selector (useful for Web::Query) to clipboard.
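Once you have such an expression, here is a minimal sketch of the extraction step, using XML::LibXML's HTML parser as suggested in the question. $mech is the WWW::Mechanize object from the loop above, and '//table//tr/td' is a placeholder for whatever expression you copied:

use XML::LibXML;

# Parse the fetched entry page with libxml2's forgiving HTML parser,
# then pull out text with the XPath expression from the inspector.
my $dom = XML::LibXML->load_html(
    string  => $mech->content,
    recover => 2,    # tolerate real-world, non-well-formed HTML
);
for my $cell ( $dom->findnodes('//table//tr/td') ) {
    print $cell->textContent, "\n";
}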
Really you want to use Web::Scraper for this kind of thing.

RESTful URL design for search

I'm looking for a reasonable way to represent searches as a RESTful URLs.
The setup: I have two models, Cars and Garages, where Cars can be in Garages. So my URLs look like:
/car/xxx
xxx = car id
returns the car with the given id
/garage/yyy
yyy = garage id
returns the garage with the given id
A Car can exist on its own (hence the /car), or it can exist in a garage. What's the right way to represent, say, all the cars in a given garage? Something like:
/garage/yyy/cars ?
How about the union of cars in garage yyy and zzz?
What's the right way to represent a search for cars with certain attributes? Say: show me all blue sedans with 4 doors:
/car/search?color=blue&type=sedan&doors=4
or should it be /cars instead?
The use of "search" seems inappropriate there - what's a better way / term? Should it just be:
/cars/?color=blue&type=sedan&doors=4
Should the search parameters be part of the PATHINFO or QUERYSTRING?
In short, I'm looking for guidance for cross-model REST url design, and for search.
[Update] I like Justin's answer, but he doesn't cover the multi-field search case:
/cars/color:blue/type:sedan/doors:4
or something like that. How do we go from
/cars/color/blue
to the multiple field case?
For the searching, use querystrings. This is perfectly RESTful:
/cars?color=blue&type=sedan&doors=4
An advantage of regular query strings is that they are standard, widely understood, and can be generated from a GET form.
RESTful pretty-URL design is about displaying a resource based on a structure (a directory-like structure, a date such as articles/2005/5/13, an object and its attributes, ...); the slash / indicates hierarchical structure, so use -id (as in /garage-id below) rather than a new path segment for identifiers.
Hierarchical structure
I would personally prefer:
/garage-id/cars/car-id
/cars/car-id #for cars not in garages
If a user removes the /car-id part, it brings up the cars preview; intuitive. The user knows exactly where in the tree he is and what he is looking at. He knows at first glance that garages and cars are related. /car-id also denotes that it belongs together, unlike /car/id.
Searching
The search query is OK as it is; it's only a matter of preference what should be taken into account. The fun part comes when joining searches (see below).
/cars?color=blue;type=sedan #most preferred by me
/cars;color-blue+doors-4+type-sedan #looks good when using car-id
/cars?color=blue&doors=4&type=sedan #also possible, but & blends in with text
Or basically anything that isn't a slash, as explained above.
The formula: /cars[?;]color[=-:]blue[,;+&], though I wouldn't use the & sign, as it is hard to distinguish from the surrounding text at first glance.
Did you know that passing a JSON object in the URI is RESTful?
Lists of options
/cars?color=black,blue,red;doors=3,5;type=sedan #most preferred by me
/cars?color:black:blue:red;doors:3:5;type:sedan
/cars?color(black,blue,red);doors(3,5);type(sedan) #does not look bad at all
/cars?color:(black,blue,red);doors:(3,5);type:sedan #little difference
Possible features
Negate search strings (!)
To search for any cars except black and red ones:
?color=!black,!red
color:(!black,!red)
Joined searches
Search for red, blue, or black cars with 3 doors, in garages with ids 1..20, 101..103, or 999, but not 5:
/garage[id=1-20,101-103,999,!5]/cars[color=red,blue,black;doors=3]
You can then construct more complex search queries. (Look at CSS3 attribute matching for the idea of matching substrings. E.g. searching users containing "bar" user*=bar.)
Conclusion
Anyway, this might be the most important part for you, because you can do it however you like after all; just keep in mind that a RESTful URI represents a structure which is easily understood, e.g. directory-like /directory/file, /collection/node/item, dates /articles/{year}/{month}/{day}, and so on. When you omit any of the last segments, you immediately know what you get.
So, all these characters are allowed unencoded:
unreserved: a-zA-Z0-9_.-~
Typically allowed both encoded and not, both uses are then equivalent.
special characters: $-_.+!*'(),
reserved: ;/?:@=&
May be used unencoded for the purpose they represent, otherwise they must be encoded.
unsafe: <>"#%{}|^~[]`
Why unsafe and why should rather be encoded: RFC 1738 see 2.2
Also see RFC 1738#page-20 for more character classes.
RFC 3986 see 2.2
Despite what I said previously, here is a common distinction of delimiters, meaning that some "are" more important than others.
generic delimiters: :/?#[]@
sub-delimiters: !$&'()*+,;=
More reading:
Hierarchy: see 2.3, see 1.2.3
url path parameter syntax
CSS3 attribute matching
IBM: RESTful Web services - The basics
Note: RFC 1738 was updated by RFC 3986
Although having the parameters in the path has some advantages, there are, IMO, some outweighing factors.
Not all characters needed for a search query are permitted in a URL. Most punctuation and Unicode characters would need to be URL encoded as a query string parameter. I'm wrestling with the same problem. I would like to use XPath in the URL, but not all XPath syntax is compatible with a URI path. So for simple paths, /cars/doors/driver/lock/combination would be appropriate to locate the 'combination' element in the driver's door XML document. But /car/doors[id='driver' and lock/combination='1234'] is not so friendly.
There is a difference between filtering a resource based on one of its attributes and specifying a resource.
For example, since
/cars/colors returns a list of all colors for all cars (the resource returned is a collection of color objects)
/cars/colors/red,blue,green would return a list of color objects that are red, blue or green, not a collection of cars.
To return cars, the path would be
/cars?color=red,blue,green or /cars/search?color=red,blue,green
Parameters in the path are more difficult to read because name/value pairs are not isolated from the rest of the path, which does not consist of name/value pairs.
One last comment. I prefer /garages/yyy/cars (always plural) to /garage/yyy/cars (perhaps it was a typo in the original answer) because it avoids changing the path between singular and plural. For words with an added 's', the change is not so bad, but changing /person/yyy/friends to /people/yyy seems cumbersome.
To expand on Peter's answer - you could make Search a first-class resource:
POST /searches # create a new search
GET /searches # list all searches (admin)
GET /searches/{id} # show the results of a previously-run search
DELETE /searches/{id} # delete a search (admin)
The Search resource would have fields for color, make, model, garaged status, etc., and could be specified in XML, JSON, or any other format. Like the Car and Garage resources, you could restrict access to Searches based on authentication. Users who frequently run the same Searches can store them in their profiles so that they don't need to be re-created. The URLs will be short enough that in many cases they can easily be traded via email. These stored Searches can be the basis of custom RSS feeds, and so on.
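For illustration, a hypothetical exchange (the field names and the id are made up) might look like:
POST /searches
{ "color": "blue", "type": "sedan", "doors": 4 }
HTTP/1.1 201 Created
Location: /searches/42
GET /searches/42 # returns the cars matching the stored criteria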
There are many possibilities for using Searches when you think of them as resources.
The idea is explained in more detail in this Railscast.
Justin's answer is probably the way to go, although in some applications it might make sense to consider a particular search as a resource in its own right, such as if you want to support named saved searches:
/search/{searchQuery}
or
/search/{savedSearchName}
I use two approaches to implement searches.
1) Simplest case, to query associated elements, and for navigation.
/cars?q.garage.id.eq=1
This means: query cars whose garage ID equals 1.
It is also possible to create more complex searches:
/cars?q.garage.street.eq=FirstStreet&q.color.ne=red&offset=300&max=100
Cars in all garages in FirstStreet that are not red (3rd page, 100 elements per page).
2) Complex queries are considered as regular resources that are created and can be recovered.
POST /searches => Create
GET /searches/1 => Recover search
GET /searches/1?offset=300&max=100 => pagination in search
The POST body for search creation is as follows:
{
"$class":"test.Car",
"$q":{
"$eq" : { "color" : "red" },
"garage" : {
"$ne" : { "street" : "FirstStreet" }
}
}
}
It is based on Grails' criteria DSL: http://grails.org/doc/2.4.3/ref/Domain%20Classes/createCriteria.html
This is not REST. You cannot define URIs for resources inside your API. Resource navigation must be hypertext-driven. It's fine if you want pretty URIs and heavy amounts of coupling, but just do not call it REST, because it directly violates the constraints of RESTful architecture.
See this article by the inventor of REST.
In addition, I would also suggest:
/cars/search/all{?color,model,year}
/cars/search/by-parameters{?color,model,year}
/cars/search/by-vendor{?vendor}
Here, Search is considered a child resource of the Cars resource.
There are a lot of good options for your case here. Still, you should consider using the POST body.
The query string is perfect for your example, but if you have something more complicated, e.g. an arbitrarily long list of items or boolean conditionals, you might want to define the search as a document that the client sends over POST.
This allows a more flexible description of the search, and it avoids the server's URL length limit.
REST does not recommend using verbs in URLs; /cars/search is not RESTful. The right way to filter/search/paginate your APIs is through query parameters. However, there might be cases when you have to break the norm, for example when searching across multiple resources; then you have to use something like /search?q=query.
You can go through http://saipraveenblog.wordpress.com/2014/09/29/rest-api-best-practices/ to understand the best practices for designing RESTful APIs.
Though I like Justin's response, I feel it more accurately represents a filter rather than a search. What if I want to know about cars with names that start with cam?
The way I see it, you could build it into the way you handle specific resources:
/cars/cam*
Or, you could simply add it into the filter:
/cars/doors/4/name/cam*/colors/red,blue,green
Personally, I prefer the latter; however, I am by no means an expert on REST (having first heard of it only two or so weeks ago...).
My advice would be this:
/garages
Returns list of garages (think JSON array here)
/garages/yyy
Returns specific garage
/garage/yyy/cars
Returns list of cars in garage
/garages/cars
Returns list of all cars in all garages (may not be practical of course)
/cars
Returns list of all cars
/cars/xxx
Returns specific car
/cars/colors
Returns a list of all possible colors for cars
/cars/colors/red,blue,green
Returns list of cars of the specific colors (yes commas are allowed :) )
Edit:
/cars/colors/red,blue,green/doors/2
Returns a list of all red, blue, and green cars with 2 doors.
/cars/type/hatchback,coupe/colors/red,blue,green/
Same idea as above, but a little more intuitive.
/cars/colors/red,blue,green/doors/two-door,four-door
All cars that are red, blue, or green and have either two or four doors.
Hopefully that gives you the idea. Essentially, your REST API should be easily discoverable and should enable you to browse through your data. Another advantage of using URLs rather than query strings is that you can take advantage of the native caching mechanisms that exist on the web server for HTTP traffic.
Here's a link to a page describing the evils of query strings in REST: http://web.archive.org/web/20070815111413/http://rest.blueoxen.net/cgi-bin/wiki.pl?QueryStringsConsideredHarmful
The original page wasn't working for me, so the link above points to an archived copy; here's the original link as well:
http://rest.blueoxen.net/cgi-bin/wiki.pl?QueryStringsConsideredHarmful