Is there a simple way to extract content from a webpage?


Our build software generates a webpage when the build fails, and lists the users who've committed since the last build. I'd like to have a way to parse the page for members of my team. For example:
Commit
18e1bc67b7e3123987daf8c219a4fbe2003de4
by bob.dole</b><pre>1112233- Description on header is not carried forward to BD doc after PCPROJBILL is ran<br></pre></div></td></tr><tr><td width="16"><img title="The file was modified" height="16" alt="The file was modified" width="16" src="/static/fbfd5d7f/images/16x16/document_edit.png" /></td><td><a>pcbatch/projbill.cpp</a></td></tr><tr class="pane"><td colspan="2" class="changeset"><a name="detail54"></a><div class="changeset-message"><b>
So the script would take a URL as input and search the file for 'bob.dole' and output to a file all of the details associated with him (commit hash, pre-data, etc.)
Could someone give me an idea of what would be the easiest way to accomplish this? I was thinking of using perl, but I'm not sure if there's something more straightforward.

If I understood your question correctly, you want to fetch the webpage content and parse it to find the user name. If that is the case, I would use PHP.
Use file_get_contents("your_website"); it returns the page as a string that you can parse.
Then you can use strpos() to find the indices of substrings. You can then pass those indices to substr() to extract the user name.
Hope it helps.

The Perl module you are looking for that helps you search based on nodes is Mojo::DOM.
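
For the build-failure page in the question, a small script may be enough. The following is a minimal sketch in Python rather than Perl, using only the standard library. It assumes every changeset follows the layout visible in the posted fragment (`<b>Commit HASH by USER</b><pre>MESSAGE</pre>`); the function name `commits_by` and the wrapped sample are illustrative, not part of the build tool.

```python
import re

# Assumed layout, inferred from the fragment in the question:
#   <b>Commit HASH by USER</b><pre>MESSAGE</pre>
COMMIT_RE = re.compile(
    r"Commit\s+(?P<hash>[0-9a-f]+)\s+by\s+(?P<user>[\w.-]+)\s*</b>\s*"
    r"<pre>(?P<message>.*?)</pre>",
    re.DOTALL,
)


def commits_by(html, user):
    """Return the hash and message of every commit attributed to `user`."""
    results = []
    for match in COMMIT_RE.finditer(html):
        if match.group("user") == user:
            # Turn <br> tags into newlines and trim the message.
            message = re.sub(r"<br\s*/?>", "\n", match.group("message")).strip()
            results.append({"hash": match.group("hash"), "message": message})
    return results


# The fragment from the question, wrapped in the div it appears to belong to.
sample = (
    '<div class="changeset-message"><b>Commit\n'
    "18e1bc67b7e3123987daf8c219a4fbe2003de4\n"
    "by bob.dole</b><pre>1112233- Description on header is not carried "
    "forward to BD doc after PCPROJBILL is ran<br></pre></div>"
)

for commit in commits_by(sample, "bob.dole"):
    print(commit["hash"], commit["message"])
```

To run this against the real page, the HTML could be fetched with `urllib.request.urlopen(url).read().decode()` and the results written to a file. If the markup turns out to be less regular than the fragment suggests, a real HTML parser (such as Mojo::DOM in Perl) is the safer choice.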

Related

Get a link to a specific line in a diff using GitHub API?

Using the GitHub API I'm looking for a way to generate a link to a specific line in a diff.
I can already construct a "compare between commits" URL, for example:
https://github.com/emmetog/feature-flags/compare/master...d8f9c29bfd0b87d26123b78b76feca8e4c87ad8
And visiting that url in a browser I can click on a specific line and I get this:
https://github.com/emmetog/feature-flags/compare/master...d8f9c29bfd0b87d26123b78b76feca8e4c87ad8#diff-21171d4ef87ca8e3591556dd18dfa456R26
However, I need to generate that last bit, the #diff-21171d4ef87ca8e3591556dd18dfa456R26 bit, programmatically through the GitHub API, or else find another way of linking to the specific line in the diff without going through the browser.
Is this possible?

It is impossible.
I read https://developer.github.com/v3/repos/commits/#compare-two-commits
I tried
curl https://github.com/emmetog/feature-flags/compare/master...d8f9c29bfd0b87d26123b78b76feca8e4c87ad8
Using the GitHub API, we cannot address line 26 of the diff between the new and old versions of the file src/Emmetog/FeatureFlag/Entity/FeatureFlag.php.
If the two revisions do not differ at line 26, or if the file src/Emmetog/FeatureFlag/Entity/FeatureFlag.php has only 10 lines of code, there is nothing at that line to link to.
In the HTML page, id="diff-21171d4ef87ca8e3591556dd18dfa456R26" is an auto-generated id. We cannot know it in advance, before executing the GitHub API request.

This may not be the best way to do it, but it looks like you can do some web scraping.
For example, in the link you provided, that line contains this element:
<td id="diff-21171d4ef87ca8e3591556dd18dfa456R26"
data-line-number="26" class="blob-num blob-num-addition
js-linkable-line-number selected-line"></td>
This element's id contains the diff hash, and you also have the line number (26). Now you just need the 'R' between the diff hash and the line number. That, I believe, is determined by whether the line was added or removed, which you can read from the CSS class. It looks like 'blob-num-addition' corresponds to 'R' and 'blob-num-deletion' corresponds to 'L'.
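
The scraping idea can be sketched as follows, in Python with only the standard library. This assumes the compare page still emits `<td>` elements whose `id` has the form `diff-<hash><L|R><line>`, as in the element quoted above; GitHub's markup is not a stable interface and may differ today. The names `DiffAnchorParser` and `line_url` are made up for this example.

```python
import re
from html.parser import HTMLParser

# Assumed id format, taken from the element quoted above: diff-<hash><L|R><line>
ANCHOR_RE = re.compile(r"^diff-(?P<file>[0-9a-f]+)(?P<side>[LR])(?P<line>\d+)$")


class DiffAnchorParser(HTMLParser):
    """Collect (file_hash, side, line) from every <td> whose id looks like a diff anchor."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        match = ANCHOR_RE.match(dict(attrs).get("id") or "")
        if match:
            self.anchors.append(
                (match.group("file"), match.group("side"), int(match.group("line")))
            )


def line_url(compare_url, html, side, line):
    """Return compare_url plus the anchor for the first cell matching side and line, else None."""
    parser = DiffAnchorParser()
    parser.feed(html)
    for file_hash, anchor_side, anchor_line in parser.anchors:
        if anchor_side == side and anchor_line == line:
            return f"{compare_url}#diff-{file_hash}{anchor_side}{anchor_line}"
    return None
```

When the comparison touches several files, the first matching cell wins, so a real script would also need to filter by file hash.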

Search string formatting in Eloqua API

I'm using the Eloqua REST API in an integration with another product and I want to implement a file browser. As part of this I want to get a list of the folders inside another folder. The API documents here say that a search string can be appended but don't give any clues as to the format of the search string. I've tried various things but so far I'm just getting empty results. An example is here:
/API/rest/1.0/assets/email/folders?search=folderId+%3D+250
I've tried with and without +'s and with and without url encoding the = sign, also various combinations of quote marks but so far nothing.

I believe what you want is a slightly different endpoint e.g.:
/API/rest/1.0/assets/email/folder/250/contents
Which would provide a list of the folders contained within folder 250.
If you wanted to search for a given folder name then you would use
/API/rest/1.0/assets/email/folders?search=foldername
Hope that helps!

RESTful urls for restore operation from a trash bin

I've been implementing a RESTful web service which has these operations:
List articles:
GET /articles
Delete articles (which should only move the selected articles to a trash bin):
DELETE /articles
List articles in the trash bin:
GET /trash/articles
I have to implement an operation for restoring "articles" from "/trash/articles" back to "/articles".
And here is the question. How do you usually do it? What URL do I have to use?
I came up with two ways of doing it. The first is:
DELETE /trash/articles
But it feels strange and a user can read it like "delete it permanently, don't restore".
And the second way is
PUT /trash/articles
Which is even odder, and a user will be confused about what this operation does.
I'm new to REST, so please advise how you normally do it. I tried searching on Google but I don't know how to phrase the question, so I didn't find anything useful.

Another option could be to use "query params" to define a "complementary action/verb" to cover this "special condition" you have (given that this is not very easily covered by the HTTP verbs). This then could be done for example by:
PUT /trash/articles?restore=true
This would keep the URI path compliant with REST guidelines (referring to a resource, and not encoding "actions" like "restore") and would shift the "extra semantics" of what you want to do (which is a very special situation) to the "query parameter". "Query params" are very commonly used for "filtering" resources in REST, not so much for this kind of situation... but maybe this is a reasonable compromise given your requirements.

I would recommend using
PUT /restore/articles
or
PUT /restore/trash/articles

Late answer but, in my opinion, the best way is to change the resource itself.
For instance:
<article is_in_trash="true">
<title>some title</title>
<body>the article body</body>
<date>1990-01-01</date>
</article>
So, in order to remove the article from the trash, you would simply PUT an updated version of the article, where is_in_trash="false".
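
A framework-free sketch of this idea in Python: the trash becomes a filtered view over a single collection, and a restore is just a PUT of the full representation with the flag cleared. The in-memory store and the function names are hypothetical stand-ins for whatever handlers the real service uses.

```python
# Hypothetical in-memory store standing in for the service's database.
articles = {
    1: {"title": "some title", "is_in_trash": False},
    2: {"title": "another title", "is_in_trash": True},
}


def list_articles(in_trash=False):
    """GET /articles (in_trash=False) or GET /trash/articles (in_trash=True)."""
    return [
        article_id
        for article_id, article in sorted(articles.items())
        if article["is_in_trash"] == in_trash
    ]


def put_article(article_id, representation):
    """PUT /articles/<id>: replace the stored article with the representation sent."""
    articles[article_id] = dict(representation)
    return articles[article_id]


# Restoring article 2 is an ordinary update with the flag set to False.
put_article(2, {"title": "another title", "is_in_trash": False})
print(list_articles())               # ids now outside the trash
print(list_articles(in_trash=True))  # ids still in the trash
```

With this design no special "restore" URL is needed, and moving an article to the trash is the same PUT with the flag set to true.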

Trying to figure out what {s: ;} tags mean and where they come from

I am working on migrating posts from the RightNow infrastructure to another service called ZenDesk. I noticed that whenever users added files or even URL links, when I pull the xml data from RightNow it gives me a lot of weird codes like this:
{s:3:""url"";s:45:""/files/56f5be6c1/MUG_presso.pdf"";s:4:""name"";s:27:""MUG presso.pdf"";s:4:""size"";s:5:""2.1MB"";}
It wasn't too hard to write something that parses them and makes normal URLs and links, but I was just wondering if this is something specific to the RightNow service, or if it is a standard tag system. I tried googling for this but got some weird results, so I thought Stack Overflow might have someone who has run into this one.
So, anyone know what these {s ;} tags are called and if there are any particular tools to use to read them?
Any answers appreciated!

This resembles partial PHP serialized data, as returned by the serialize() call. It looks like someone may have turned each " into "", which could prevent it from parsing properly. If it's wrapped with text like this before the {s: section, it's almost definitely PHP.
a:6:{i:1;a:10:{s:
These letters/numbers mean things like "an array with six elements follows", "a string of length 20 follows", etc.
You can use any PHP instance with unserialize() to handle the data. If those double-quotes are indeed returned by the API, you might need to replace :"" and ""; with " before parsing.
Parsing modules exist for other languages like Python. You can find more information in this answer.
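
For fragments like the one in the question, a tolerant sketch in Python could look like this. It is not a full PHP unserializer: it only handles a flat run of `s:N:"...";` string entries, undoes the doubled quotes, and pairs the strings up as key and value. It also ignores the declared lengths, because in the posted sample they do not match the visible strings (45 is declared for a 31-character path, presumably because the values were edited for posting). The function name `parse_string_pairs` is made up for this example.

```python
import re


def parse_string_pairs(fragment):
    """Parse a flat run of PHP-serialized strings into a dict of key/value pairs.

    Declared lengths are ignored, so a value containing the sequence '";'
    would be cut short; a real unserializer should read exactly N bytes.
    """
    # Undo the doubled quotes seen in the export.
    normalized = fragment.replace('""', '"')
    values = re.findall(r's:\d+:"(.*?)";', normalized)
    # Strings alternate key, value, key, value, ...
    return dict(zip(values[0::2], values[1::2]))


sample = (
    '{s:3:""url"";s:45:""/files/56f5be6c1/MUG_presso.pdf"";'
    's:4:""name"";s:27:""MUG presso.pdf"";s:4:""size"";s:5:""2.1MB"";}'
)
print(parse_string_pairs(sample))
# {'url': '/files/56f5be6c1/MUG_presso.pdf', 'name': 'MUG presso.pdf', 'size': '2.1MB'}
```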

How can I tell if two image files are the same in Perl?

I have a Perl script I wrote for my own personal use that fetches image files from a website periodically. It then saves these images to a folder. These image files are quite often the same from fetch to fetch, and I'd like to not save duplicates if I can get around it.
My question: What would be the best way to compare/check if they are the same?
My only real thought so far is to open a file handle to the existing file, md5 it, md5 the $response->content from the fetch and then compare them. Would that work?
Is there a better way?
EDIT:
Wow, already tons of great suggestions. Does it help if I tell you that this script runs daily via cron? I.e. it is guaranteed to always run at the exact same time every day? Also: I'm looking at the last-modified headers on some of these, and they don't look 100% accurate, i.e. there are some that have a last-modified of over a week ago when I know the image is more recent than that. I'm assuming that's because the image file itself hasn't been modified on the server since then... which doesn't help me much...

Don't open and hash the stored image each time - stash the hash alongside the image when you store it. Compare sizes as well.
Don't issue a GET request straight away, do a HEAD first and compare the size, last modification date and any Etags to what you got last time.

There are a number of HTTP headers you can use for this -- if you save the time that you last retrieved the file, you can do a conditional get with
If-Modified-Since: <date>
Or, if the server returns an Etag header with the response, you can store that with the image, (or a collection of all of the etags you have seen for that image), and do:
If-None-Match: <all of your etags here>
If the server supports conditional gets, then you will get a "304 Not Modified" response, with no body.
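
A conditional GET along these lines might look as follows in Python's standard library. This is a sketch that assumes the server honours `If-None-Match` or `If-Modified-Since`; the function names are illustrative. Note that `urllib` reports a 304 response by raising `HTTPError`, so the "not modified" case is handled in the `except` branch.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the request headers from validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when the server answers 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep the validators we already have.
            return None, etag, last_modified
        raise
```

The returned validators would be stored alongside the image and passed back in on the next run. Since the question notes that the Last-Modified values look unreliable on this site, the ETag, where present, is probably the better validator.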

Yep, that sounds right.
Depending on how you're getting the file and how frequently you might also be able to check for HTTP 304 Not Modified and save yourself the download.

md5 would work, but you'd still have to pull the file. Is there any useful metadata in the HTTP headers: Content-Length, Cache-Control directives, ETags, etc.?

There's also a nice fdupes tool for the purpose. Don't know what system you're using and what systems the tool can be built for.
