Fetching a redirect's target URL in OpenRefine

I have a CSV of ~2000 URLs that each return a 301 or 302 redirect when requested, and I'm trying to figure out whether OpenRefine can export to a new column the destination URL it actually retrieves the HTML from when I fetch the page (or capture it some other way).
e.g.
https://www-istp.gsfc.nasa.gov/stargaze/Ssolsys.htm
redirects to
https://pwg.gsfc.nasa.gov/stargaze/Ssolsys.htm
And I know that from clicking the link in my browser of choice. I've found a few answers suggesting that this can be done in various programming languages, but nothing so far suggesting how to do it in OpenRefine, even though I'm like 80% sure that it can be.
Does anyone out there know what I might be able to do to make this happen?

In OpenRefine you can write expressions in GREL, Jython (a Java implementation of Python 2), and Clojure.
As far as I know GREL does not support resolving the target of a redirect, so I would use Python for that.
In your OpenRefine project, go to the column containing the URLs and use "Edit column" > "Add column based on this column...".
In the corresponding dialog window, change the expression language to "Python / Jython" and use the following code snippet to retrieve the "real" URL of the request.
import urllib2

# urlopen follows redirects automatically; geturl() reports the final URL
response = urllib2.urlopen(value)
return response.geturl()
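If some of the ~2000 URLs time out or error out, the expression above will fail for those rows. A slightly more defensive variant (still Jython/Python 2, as used by OpenRefine; the 10-second timeout and the error prefix are arbitrary choices of mine) records the problem instead of aborting the transform:

import urllib2

# Resolve the redirect; on any network/HTTP error, return a marker
# value so the failing rows are easy to facet on afterwards.
try:
    response = urllib2.urlopen(value, timeout=10)
    return response.geturl()
except Exception as e:
    return "ERROR: " + str(e)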

Github Search API filename only returns 1 result

I'm trying to search for all projects (or at least several thousand) via the GitHub search API. I've gotten everything else to work except the filters on filename.
For example, sending the following request to the search API only returns 1 result:
https://api.github.com/search/code?q=django+in:requirements.txt+filename:requirements.txt+language:python+org:openmicroscopy
Likewise, sending the following
https://api.github.com/search/repositories?q=filename:Makefile&per_page=100
only returns 1 result as well. I'm willing to bet that there is more than one repo on GitHub with a Makefile or a dependency on Django. I must be doing something wrong, but I can't seem to figure out what it is.
According to a post on GitHub's developer site, to support the expected volume of requests they have added restrictions to code queries, which require you to specify a set of users, organizations, or repositories with the query. See GitHub's notes on considerations for code search for details.
Now, about your search API requests: in the first one, the in qualifier is given the file name requirements.txt, which is wrong.
The documentation states that in should be given file to restrict the search to file contents, path to restrict it to the file path, or both.
Like this: in:file, in:path, in:file,path
So, if you want to search in file contents the correct API call should be
https://api.github.com/search/code?q=django+in:file+filename:requirements.txt+org:openmicroscopy
I removed the language qualifier since you are searching in a .txt file, and doing this improved the results.
Check out this URL; it will produce the same results on the website:
https://github.com/search?utf8=%E2%9C%93&q=org%3Aopenmicroscopy+django+in%3Afile+filename%3Arequirements.txt&type=Code
Your second query is a repository search; it cannot take a filename qualifier. See GitHub's repository search documentation for the available qualifiers.
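To sanity-check the corrected query from a script, here is a minimal sketch using Python's requests library (the library choice is mine, not from the question; note that GitHub may require authentication for code search and rate-limits it aggressively):

import requests

# Corrected code search: look for "django" in the contents of
# requirements.txt files within the openmicroscopy organization.
url = "https://api.github.com/search/code"
params = {"q": "django in:file filename:requirements.txt org:openmicroscopy"}
# headers = {"Authorization": "token YOUR_TOKEN"}  # hypothetical token; raises the rate limit

resp = requests.get(url, params=params)
resp.raise_for_status()
print(resp.json()["total_count"])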

How to pack a variable into an HTTP GET request in socket.send() - Python 2.7

First off thanks for reading!
Second off, YES, I have tried to find the answer! :) Perhaps I haven't found it because I'm not using the right words to describe my problem, but it's been about 4 hours that I've been trying to figure it out now, and I'm getting a little loopy trying to piece it together on my own.
I am very new to programming. Python is my first language. I am on my third Python course. I have an assignment to use the socket library (not urllib library - I know how to do that) to make a socket and use GET to receive information. The problem is that the program needs to take raw input for the URL in question.
I have everything else the way I want it, but I need to know the syntax that I'm supposed to be using INSIDE my "GET" request in order for the HTTP message to include the requested document path.
I have tried (obviously not all together lol):
mysock.send('GET (url) HTTP/1.0\n\n')
mysock.send( ('GET (url) HTTP:/1.0\n\n'))
mysock.send(('GET (url) HTTP:/1.0\n\n'))
mysock.send("GET (url) HTTP/1.0\n\n")
mysock.send( ("'GET' (url) HTTP:/1.0\n\n"))
mysock.send(("'GET' (url) 'HTTP:/1.0\n\n'"))
and:
basically every other configuration of the above: (, ((, ( (, ', and '' combinations.
I have also tried:
-Creating a string using the 'url' variable first, and then including it inside mysock.send(string)
-Again with the "string-first" theory, but this time I used %r to refer to my user input (so 'GET %r HTTP/1.0\n\n' % url basically)
I've read questions here and on other programming websites, the whole chapter in the book, and the lectures/notes online; I've read articles on the socket library and .send(), and of course articles on GET requests... but I'm clearly missing something. It seems most people don't use the socket library when they can use urllib, and I don't blame them!!
Thank you again...
Someone from the university posted back to me that the url variable can be concatenated with the GET syntax and assigned to a string variable, which can then be sent with .send(concatenatedvariable). I had mentioned trying that, but I had missed that GET requires a space after the word 'GET', so my concatenation didn't include the space and that blew it. In case anyone else wants to know :)
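For anyone following along, here is a minimal sketch of what that concatenation looks like with the socket library (Python 2, matching the course setup; the naive host/path split assumes an http:// URL served on port 80):

import socket

url = raw_input('Enter URL: ')  # e.g. http://data.pr4e.org/romeo.txt
# Naive split of the URL into host and document path
host = url.split('/')[2]
path = '/' + '/'.join(url.split('/')[3:])

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((host, 80))
# Note the space after GET and the blank line that ends the request
request = 'GET ' + path + ' HTTP/1.0\r\nHost: ' + host + '\r\n\r\n'
mysock.send(request)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print data,
mysock.close()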
FYI: a fully qualified URL is only allowed in HTTP/1.1 requests. It is not the norm, though, as HTTP/1.1 requires setting the Host header anyway. The relevant reading would have been RFC 7230, sec. 3.1.1 and possibly RFC 3986. The syntax of the query parameters is largely borrowed from the CGI format; it is in no way enforced, however. In a nutshell, everything put together looks like this on the wire:
GET /path?param1=value1&param2=value2 HTTP/1.1
Host: example.com
As a final note: The line delimiter in HTTP is CRLF (\r\n). For robustness, a simple linefeed is acceptable as well but not recommended.

Pass REST resource output format in URL

AFAIK, every resource has a URL in a REST design. For example, /user/28 is the URL of the user with id 28, and /users returns all users.
There are several ways to represent the output format of the resource:
passing a query parameter like format
specifying it via an extension (changing the /users URL to /users.json to get the users in JSON format)
specifying the requested format (XML, JSON, XLS, ...) by setting the Accept HTTP header.
I searched the web, and it seems the correct way is setting the Accept header.
But if you want an HTTP link (specified by href) to download the list of users in XLS format, you can't! Also, if you want the browser to download the XLS, you will encounter many problems (you would have to fetch the XLS via Ajax, and so on).
If the Accept header is the best way, what is the solution for a download link? And if it's not, which approach is better?
The Accept header is considered 'more correct', but there are plenty of examples of all the options you mention, and as far as I can tell none of them is considered "bad". Personally, I'd say you should honor and prefer the Accept header, but a format query parameter should override it if present. The downside of the 'extension' method is that each format results in a different resource, which can get ugly.
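A minimal sketch of that precedence rule (a format parameter overriding the Accept header), written with Flask purely for illustration; the route, data, and field names are made up:

from flask import Flask, request, jsonify, Response

app = Flask(__name__)

USERS = [{"id": 28, "name": "Alice"}]

@app.route("/users")
def users():
    # A ?format= query parameter, if present, overrides the Accept header
    fmt = request.args.get("format")
    if fmt is None:
        # Fall back to content negotiation on the Accept header
        fmt = "xml" if request.accept_mimetypes.best == "application/xml" else "json"
    if fmt == "xml":
        body = "".join("<user id='%d'>%s</user>" % (u["id"], u["name"]) for u in USERS)
        return Response("<users>%s</users>" % body, mimetype="application/xml")
    return jsonify(USERS)

A plain link such as /users?format=xls then works as a download link, no Ajax required.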

How to create and implement a pixel tracking code

OK, here's a goal I've been chasing for a while.
As is well known, most advertising and analytics companies use a so-called "pixel" code to track website views, transactions, conversions, etc.
I have a general idea of how it works; the problem is how to implement it. The tracking code consists of a few parts.
The tracking code itself.
This is the code that the user inserts on his webpage in the <head> section. The main goal of this code is to set some customer-specific variables and to call the *.js file.
*.js file.
This file holds all the magic of CRUD (create/read/update/delete) cookies, tracking the user's events and interaction with the webpage.
The pixel code.
This is an <img> tag with the src attribute pointing to a *.gif image (for example) that takes all the parameters collected on the page and stores them in the database.
Example:
WordPress pixel code: <img id="wpstats" src="http://stats.wordpress.com/g.gif?host=www.hostname.com&list_of_cookies_value_pairs;" alt="">
Google Analytics:
http://www.google-analytics.com/__utm.gif?utmwv=4&utmn=769876874&etc
Now, it's obvious that the *.gif request has to reach a server side scripting language in order to read the parameters data and store them in a db.
Does anyone have an idea how to implement this in Zend?
UPDATE
Another thing I'm interested in: how do I prevent the user's browser from loading a cached *.gif? Will a random parameter value do the trick? Example: src="pixel.gif?nocache=random_number", where the nocache parameter value is different on every request.
As Zend is built using PHP, it might be worth reading the following question and answer: Developing a tracking pixel.
In addition, as you're looking for a way to avoid caching of the tracking image, the easiest approach is to append a unique/random string to its URL, generated at runtime.
For example, on the server side, when creating each image tag, you might append a random id to the URL:
<?php
// Generate a unique cache-busting value for this request
$rand_id = uniqid();
// Echo the image tag with the tracking parameters and the random string appended
echo "<img src='pixel.php?a=".$vara."&b=".$varb."&rand=".$rand_id."'>";
?>
Just adding my 2 cents to this thread, because I think an important and frequently used option is missing: you don't necessarily need a scripting language to capture the request. A more efficient approach is to let the web server's access log (the Apache access log, for instance) record the request, and then process that log with whatever tools you see fit, like the ELK stack.
This makes serving the requests much lighter, because no scripting language is loaded to prepare the response, just a native Apache response, which is typically much more efficient.
First of all, the *.gif doesn't need to be that file type; the only thing that matters is the Content-Type HTTP header. Set that to image/gif (or any other appropriate type) at the beginning, execute your code, and render some sort of image to the response body.
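To make that concrete, here is a minimal sketch of a pixel endpoint (Python's standard library is used here instead of Zend purely to show the mechanics; a real system would write the parameters to a database rather than print them):

import base64
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# A transparent 1x1 GIF, base64-encoded
PIXEL = base64.b64decode(b"R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

class PixelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the tracking parameters out of the query string
        params = parse_qs(urlparse(self.path).query)
        print(params)  # stand-in for the database insert
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.send_header("Cache-Control", "no-cache, no-store")
        self.end_headers()
        self.wfile.write(PIXEL)

if __name__ == "__main__":
    HTTPServer(("", 8000), PixelHandler).serve_forever()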
All of the above is correct and good, but to be specific about the "g.gif" mentioned above:
you can serve it from a simple PHP script that writes to SQL, or fwrite("file.txt", $opened),
where the variable $opened serves as a counter, incremented each time someone opens your mail; then save that script as "g.gif".
To do all of this, just add these lines:
<Files "g.gif">
AddType application/x-httpd-php .gif
</Files>
to your ".htaccess" file, but be sure to make a new directory for that g.gif (or whatever.gif) where the directory contains only g.gif and the .htaccess.

Use GET or POST for a search form

I have a couple of search forms, one with ~50 fields and the other with ~100. Typically, as the HTML spec suggests, I do searches using the GET method, since no data is changed. I haven't run into this problem yet, but I'm wondering if I will run out of URL space soon.
The URL limit of Internet Explorer is 2083 characters; other browsers have a much higher limit. I'm running Apache, where the limit is around 4000 characters, while IIS's is 16384 characters.
At 100 fields, say an average field name length of 10 characters plus values and separators, that's already around 5000 characters; amazingly, I haven't had any errors on the 100-field form yet. (25% of the fields are multiple selects, so the field length is much longer.)
So, I'm wondering what my options are. (Shortening the forms is not an option.) Here are my ideas:
Use POST. I don't like this as much because at the moment users can bookmark their searches and perform them again later--a really dang nice feature.
Have JavaScript loop through the form to determine which fields are different than default, populate another form and submit that one. The user would of course bookmark the shortened version.
Any other ideas?
Also, does anyone know if the length is the encoded length or just plain text?
I'm developing in PHP, but it probably doesn't make a difference.
Edit: I am unable to remove any fields; I am unable to shorten the form. This is what the client has asked for, and they often do use a range of fields across the different categories. I know that it's hard to think of a form that looks nice with this many fields, but the users don't have a problem understanding how it works.
Are your users actually going to be using all 50-100 fields to do their searches? If they're only using a few, why not POST the search to an "in between" page which header()-redirects them to the results page with only the user-changed fields in the URL? The results page would then use the default values for the fields that don't exist in the URL.
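A sketch of that "in between" redirect (Flask is used here only for illustration; the DEFAULTS map and the field names are made up):

from flask import Flask, request, redirect
from urllib.parse import urlencode

app = Flask(__name__)

# Assumed defaults for each search field; fields still at these values
# are dropped from the redirect URL.
DEFAULTS = {"q": "", "category": "all", "min_price": "0"}

@app.route("/search", methods=["POST"])
def search_post():
    # Keep only the fields the user actually changed
    changed = {k: v for k, v in request.form.items() if DEFAULTS.get(k) != v}
    return redirect("/results?" + urlencode(changed))

@app.route("/results")
def results():
    # Merge the defaults back in for anything missing from the URL
    params = dict(DEFAULTS, **request.args.to_dict())
    return "Searching with %r" % params

The resulting /results URL stays short and is still bookmarkable.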
To indirectly address your question, if I was faced with a 100-field form to fill in on one page, I'd most likely close my browser, it sounds like a complete usability nightmare.
My answer is, if there's a danger that I'm getting anywhere near that limit for normal usage of the form, I'm probably Doing It Wrong.
In order of preference, I would
Split the form up and use some server-side state retention
Switch to POST, and then generate and redirect to a shorter URL on POST that resolved to the same result
Give up ;)
You mention in a comment that many of the fields "are hidden and can be opened as required".
If you are willing to discard graceful degradation, you could always actually add and remove the fields from the form, rather than just hiding and showing them: the browser won't submit the ones that aren't included in the form.
This is a variant of the "Make and model" forms that online insurance etc. pages use -- select the make, submit back to the server and get the list of models for that manufacturer.
If you don't mind using JavaScript, you could have it work out the length of the query string and, if it is too long, switch to a POST. Then have some sort of URL mapper to allow them to bookmark these POSTed searches.
Use POST, and if the user bookmarks the search, save it in a database and give it a unique token; then redirect to the search page using GET, passing the token as a parameter.
TinyURL is a nice example: You give it a very long URL, it saves it to a DB, gives you a unique identifier for that URL and later you can request the long URL using that identifier.
In PHP it would be something along the lines of:
<?php
if (isset($_GET['token']))
{
    // Look up a previously saved search by its token
    $token = mysql_real_escape_string($_GET['token']);
    $qry = mysql_query("SELECT fields FROM searches WHERE token = '{$token}'");
    if ($row = mysql_fetch_assoc($qry))
    {
        performSearch(unserialize($row['fields']));
        exit;
    }
    showError('Your saved search has been removed because it hasn\'t been used in a while');
    exit;
}

// Save the POSTed search fields under a new unique token, then redirect
$fields = mysql_real_escape_string(serialize($_POST));
$token = sha1($_SERVER['REMOTE_ADDR'].rand());
mysql_query("INSERT INTO searches (token, fields, save_time) VALUES ('{$token}', '{$fields}', NOW())");
header('Location: ?token='.$token);
exit;
?>
And run a script daily:
<?php
mysql_query('DELETE FROM searches WHERE save_time < DATE_ADD(NOW(), INTERVAL -200 DAY)');
?>
"Also, does anyone know if the length is the encoded length or just plain text?"
My guess was the encoded length. I made a simple test: a textarea and a submit button posting to a simplistic PHP script.
I loaded the page in IE6 and pasted some French text into the textarea, 2000 characters. When I hit the submit button, nothing happened; I had to reduce the length of the text to be able to submit.
In other words, the 2083-character limit is exactly the maximum length of the URL found in the address bar after submitting the GET request.
I would go for the JavaScript solution: on submit, analyze the form, create a secondary form with hidden attributes, and submit that.
Some strategies on shortening the output:
As you point out, you can already skip all values left to default (no field, no value).
If you have a form like the one at the Processing forum search, you can group all checkbox states into one variable only, e.g. using letter encoding (a toy sketch follows after this list).
Use short value attributes (in select for example).
Note: if the search page is actually composed of several independent forms, where users fill in only one section or another, you can make several separate forms.
This might not apply to your case and might seem obvious, but it's worth mentioning for the record... ^_^
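As promised above, a toy sketch of packing many checkbox states into one short parameter value (the function names and the hex encoding are my own choices, purely illustrative):

# Treat each checkbox as one bit and encode the whole set as a hex string.
def pack_checkboxes(states):
    n = 0
    for i, checked in enumerate(states):
        if checked:
            n |= 1 << i
    return format(n, 'x')

def unpack_checkboxes(s, count):
    n = int(s, 16)
    return [bool(n >> i & 1) for i in range(count)]

# 8 checkboxes collapse to a 1-2 character query parameter value
states = [True, False, True, False, False, True, False, True]
packed = pack_checkboxes(states)   # -> 'a5'
assert unpack_checkboxes(packed, 8) == states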
One could philosophically look at the search-submission POST as the creation of a saved search (especially when a search is as complex an object as the ones your users are making). In this case, you could accept the POST for the creation of the search and then redirect with a GET to fetch the appropriate search results (post/redirect/get).
This would also allow users to bookmark the search results (GET) and come back at any time to re-run the search.
GET can have one advantage if your search results can be shared: with a POST request, if you send the link to someone, that person won't see any search results.