Getting date of first crawl of URL by Common Crawl? - common-crawl

In Common Crawl, the same URL can be harvested multiple times.
For instance, a Reddit post can be crawled when it is created and again when subsequent comments are added.
Is there a way to find out when a given URL was crawled for the first time by Common Crawl?

The URL indexes (CDX or columnar) include a field/column with the capture time. Just search for the URL, record all captures, and then inspect the page content of each capture for the addition of comments. The indexes also include the WARC file name, record offset, and length, which allow you to fetch the WARC record using an HTTP range request.
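For example, here is a minimal sketch in Python that walks every crawl collection via the public index server and keeps the earliest capture. The example URL is hypothetical; the field names (timestamp, filename, offset, length) are those of the CDX API's JSON output:

```python
import json
import requests

# List of all crawl collections, each with its own CDX endpoint.
collections = requests.get("https://index.commoncrawl.org/collinfo.json").json()

url = "reddit.com/r/example/comments/abc123/"  # hypothetical example URL
earliest = None

for coll in collections:
    resp = requests.get(
        coll["cdx-api"],  # e.g. https://index.commoncrawl.org/CC-MAIN-2024-10-index
        params={"url": url, "output": "json"},
        timeout=60,
    )
    if resp.status_code != 200:
        continue  # URL not captured in this crawl
    for line in resp.text.splitlines():
        record = json.loads(line)
        # "timestamp" is YYYYMMDDhhmmss, so string comparison sorts correctly.
        if earliest is None or record["timestamp"] < earliest["timestamp"]:
            earliest = record

if earliest:
    # Fetch the matching WARC record with an HTTP range request.
    start = int(earliest["offset"])
    end = start + int(earliest["length"]) - 1
    warc = requests.get(
        "https://data.commoncrawl.org/" + earliest["filename"],
        headers={"Range": f"bytes={start}-{end}"},
    )
    print(earliest["timestamp"], len(warc.content), "bytes (gzipped WARC record)")
```

Note that this issues one request per collection, and there are many collections, so for bulk lookups the columnar index queried with SQL tooling is usually the better fit.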

Related

Passing sizable data in a REST GET request

A REST question. Let's say I have a database with a million items in it. I want to retrieve, say, 10,000 of them via a REST GET, passing the IDs of the 10,000 items in the GET request. Using URL query parameters, this quickly exceeds the maximum length of a URL. How do people solve this? Use a POST instead and pass the IDs in the body? That seems hacky.
You should not pass this through URL parameters; the URL has a practical limit of about 2,000 characters:
Url limit
I guess what you are doing is something like this:
https://localhost/api/applicationcollections/(47495dde-67d2-4854-add0-7e25b2fe88e4,1b470799-cc8a-4284-9ca7-76dc34a5aebb)
If you are planning to get more than 10k records, you can pass the information in the body of the request, which has no such limit. Technically you would do this through a POST request, even though filtering is not really the intended semantics of the POST verb. Even a GET can include a body (see: HTTP GET with request body), but the body should not be considered part of the GET's semantics.
Normally you don't filter 10k elements by ID. Instead, you get 10k elements in one request, passing pagination parameters through the URL if you want. Even that can kill your app, especially considering that the DTO has more than one field, like:
ApplicationDto
  field1
  field2
  ...
  field15
Below is an example of how to pass pagination parameters and get the first 10k records:
https://localhost:44390/api/applications?pageNumber=1&pageSize=10000
Also, the API should return an extra header, let's call it X-Pagination, which tells you whether there are more pages to paginate through, as well as the total number of elements.
As an extra effort to reduce the size of the response, you can shape the data and only get the fields you need. ApplicationDto should then bring only field1 and field3; see below:
https://localhost:44390/api/applications?fields=field1,field3
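As a rough illustration, here is a sketch of such an endpoint in Python/Flask; the data set, DTO fields, and the contents of the X-Pagination header are hypothetical, following the names used above:

```python
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory data set standing in for the database.
APPLICATIONS = [{"field1": i, "field2": f"app-{i}", "field3": i % 7}
                for i in range(100_000)]

@app.route("/api/applications")
def list_applications():
    page_number = int(request.args.get("pageNumber", 1))
    page_size = min(int(request.args.get("pageSize", 50)), 10_000)
    fields = request.args.get("fields")  # e.g. "field1,field3"

    start = (page_number - 1) * page_size
    page = APPLICATIONS[start:start + page_size]

    if fields:  # shape the DTO down to the requested fields
        wanted = set(fields.split(","))
        page = [{k: v for k, v in item.items() if k in wanted} for item in page]

    resp = jsonify(page)
    # X-Pagination lets the client decide whether to keep paginating.
    resp.headers["X-Pagination"] = json.dumps({
        "totalCount": len(APPLICATIONS),
        "pageNumber": page_number,
        "pageSize": page_size,
        "hasNext": start + page_size < len(APPLICATIONS),
    })
    return resp
```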
See how Twitter addresses this problem as well:
Twitter cursor response
Hope this helps

Retrieve all of a user's playlists from SoundCloud limited to 50?

I'm trying to retrieve all the playlists from my account via
http://api.soundcloud.com/users/145295911/playlists?client_id=xxxxxx, as the API reference shows.
However, I can only retrieve the 50 most recent playlists instead of all of my playlists. I've been looking into this, but it seems like no one has had this issue before. Is there a way to get all of them?
Check out the section of their API on pagination.
Most results from our API are returned as a collection. The number of items in the collection returned is limited to 50 by default with a maximum value of 200. Most endpoints support a linked_partitioning parameter that will allow you to page through collections. When this parameter is passed, the response will contain a next_href property if there are additional results. To fetch the next page of results, simply follow that URI. If the response does not contain a next_href property, you have reached the end of the results.
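Based on the quoted documentation, a paging loop might look like this in Python; the client_id is a placeholder, as in the question, and the defensive check at the bottom exists because whether next_href already carries the client_id can vary:

```python
import requests

CLIENT_ID = "xxxxxx"  # placeholder client_id, as in the question

url = ("https://api.soundcloud.com/users/145295911/playlists"
       f"?client_id={CLIENT_ID}&limit=200&linked_partitioning=1")

playlists = []
while url:
    data = requests.get(url).json()
    playlists.extend(data["collection"])
    # next_href is present only while there are more pages.
    url = data.get("next_href")
    if url and "client_id=" not in url:
        url += f"&client_id={CLIENT_ID}"

print(len(playlists), "playlists fetched")
```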

HTTP request to search for multiple ObjectIds in a Mongo-based API?

I'm looking to add search functionality to an API, on a resource called Organizations. Organizations can have different Location and Audience ids tagged onto them (which I would like to use in searching). Since these ids are MongoDB ObjectIds, they are quite long, and I'm worried about reaching the browser's maximum query string length with a GET request. For example:
GET http://my-site.com/api/organizations?locations=5afa54e5516c5b57c0d43227,5afa54e5516c5b57c0d43226,5afa54e5516c5b57c0d43225,5afa54e5516c5b57c0d43224&audiences=5afa54e5516c5b57c0d43223,5afa54e5516c5b57c0d43222
That would probably be an average search; however, I don't want it to break if users select many Locations or Audiences.
Any advice on how I could handle this situation?
I've run into your situation before. You can change your method to POST.
For an input of locations and audiences, your resource is not already sitting there; you have to compute it.
By the definition of POST:
Perform resource-specific processing on the request payload.
Providing a block of data, such as the fields entered into an HTML form, to a data-handling process;
You have to compute and create a new resource for the response, so it's REST-compliant to do so.
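For example, a minimal sketch of the client side, with the filter moved into a POST body; the /api/organizations/search path is hypothetical:

```python
import requests

# The ObjectIds below are the ones from the question's example GET request.
payload = {
    "locations": [
        "5afa54e5516c5b57c0d43227",
        "5afa54e5516c5b57c0d43226",
        "5afa54e5516c5b57c0d43225",
        "5afa54e5516c5b57c0d43224",
    ],
    "audiences": [
        "5afa54e5516c5b57c0d43223",
        "5afa54e5516c5b57c0d43222",
    ],
}

# POST the filter as JSON; the body has no practical length limit.
resp = requests.post("http://my-site.com/api/organizations/search", json=payload)
organizations = resp.json()
```

Giving the endpoint its own /search path also makes it explicit that the POST computes a search result rather than creating an Organization.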

REST collections: pagination enabled by default?

I am trying to figure out the common way to respond to a GET operation that retrieves multiple resources. For example, if a user calls /books or /books?name=*foo*, what should a good REST API return?
[A] a list of all resources in the collection. Only if the user specifies a range (using start and limit, or in any other way) is a single page of results returned.
[B] always return the first page of resources, even when nothing is specified. The user may then continue with pagination, using the parameters (or any other mechanism).
[C] a document indicating that paging is involved, with the total number of resources in the collection, but without any actual resources, and with an appropriate status code set (something like 300, if I remember correctly). This response indicates to the user that he can start fetching the data using pagination parameters.
I like approach C, but could not find any APIs that do this.
This depends on whether the pagination parameters are mandatory or not. For most APIs they are mandatory, simply because /books could return millions of entries.
How about [D]: a redirect.
If the client accesses /books or /books?name=foobar, redirect it to /books?page=1&size=15 or /books?name=foobar&page=1&size=15 and return results according to those default parameters.
You could also include pagination links in your response (as per HATEOAS) with a rel attribute that specifies whether a link points to the next, previous, first, or last page, so the client can navigate back and forth between the result pages.
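A sketch of option [D] in Python/Flask, combining the redirect with HATEOAS-style pagination links; the data set and the default page size of 15 are hypothetical choices:

```python
from flask import Flask, jsonify, redirect, request, url_for

app = Flask(__name__)

BOOKS = [{"id": i, "name": f"book-{i}"} for i in range(1, 101)]  # hypothetical data

@app.route("/books")
def books():
    # Option [D]: redirect bare requests to explicit default pagination params.
    if "page" not in request.args:
        extra = {k: v for k, v in request.args.items() if k not in ("page", "size")}
        return redirect(url_for("books", page=1, size=15, **extra))

    page = int(request.args["page"])
    size = int(request.args.get("size", 15))
    start = (page - 1) * size
    has_next = start + size < len(BOOKS)

    return jsonify({
        "items": BOOKS[start:start + size],
        # HATEOAS-style navigation links keyed by rel.
        "links": {
            "self": url_for("books", page=page, size=size),
            "next": url_for("books", page=page + 1, size=size) if has_next else None,
            "prev": url_for("books", page=page - 1, size=size) if page > 1 else None,
        },
    })
```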

Is it RESTful to match a URI in a database and display associated content via request forwarding?

So, I'm building a Web site application that will comprise a small set of content files, each dedicated to a general purpose, rather than a large set of files, each dedicated to a particular piece of information. In other words, instead of /index.php, /aboutme.php, /contact.php, etc., there would just be /index.php (basically just a shell with HTML and ) and then content.php, error.php, 404.php, etc.
To deliver content, I plan to store "directory structures" and associated content in a data table, capture URIs, and then query the data table to see if a URI matches a stored "directory structure". If there's a match, the associated content is returned to the application, which uses PEAR's HTTP_Request2 to send a request to content.php. Then content.php displays the appropriate content returned from the database.
EDIT 1: For example:
a user types www.somesite.com/contact/ into their browser
index.php loads
a script upstream of the HTML header on index.php does the following:
submits a MySQL query looking for WHERE path = $_SERVER['REQUEST_URI']
if it finds a match, it sets $content = $dbResults->content and POSTs $content to /pages/content.php for display. The original URI is preserved, although /pages/content.php and /index.php actually deliver the content
if it finds no match, the contents of /pages/404.php are returned to the user. Here, too, the original URI is preserved even though index.php and /pages/404.php actually deliver the content. (A sketch of this lookup flow follows the list.)
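Sketched in Python for brevity (the site itself is PHP); the table and helper names are hypothetical, and the lookup is parameterized rather than interpolating $_SERVER['REQUEST_URI'] into the SQL, which also guards against injection:

```python
import sqlite3

def render(template: str, content: str = "") -> str:
    # Stand-in for handing $content off to the display page.
    return f"[{template}] {content}"

def route(request_uri: str) -> str:
    db = sqlite3.connect("site.db")  # hypothetical database file
    row = db.execute(
        "SELECT content FROM pages WHERE path = ?",  # parameterized lookup
        (request_uri,),
    ).fetchone()
    if row is None:
        return render("/pages/404.php")
    return render("/pages/content.php", row[0])
```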
Is this a RESTful approach? On the one hand, the URI is being used to access a particular resource (a tuple in a data table), but on the other hand, I guess I always thought of a "resource" as an actual file in an actual directory.
I'm not looking to be a REST purist, I'm just really delving into the more esoteric aspects of and approaches to working with HTTP and am looking to refine my knowledge and understanding...
OK, my conclusion is that there is nothing inherently unRESTful about my approach, but how I use the URIs to access the data tables seems to be critical. I don't think it's in the spirit of REST to store a resource's full path in that resource's row in a data table.
Instead, I think it's important to have a unique tuple for each "directory" referenced in a URI. The way I'm setting this up now is to create a "collection" table and a "resource" table. A collection is a group of resources, and a resource is a single entity (a content page, an image, etc., or even a collection, which allows for nesting and taxonomic structuring).
So, for instance, /portfolio would correspond with a portfolio entry in the collection table and the resource table, as would /portfolio/spec-ads; however, /portfolio/spec-ads/hersheys-spec-ad would correspond to an entry only in the resource table. That entry would contain, say, embed code for a Hershey's spec ad on YouTube.
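A sketch of that two-table layout and the corresponding lookup, again in Python for consistency; all names are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE collection (
    id   INTEGER PRIMARY KEY,
    slug TEXT UNIQUE                     -- e.g. 'portfolio', 'spec-ads'
);
CREATE TABLE resource (
    id            INTEGER PRIMARY KEY,
    slug          TEXT,                  -- e.g. 'hersheys-spec-ad'
    collection_id INTEGER REFERENCES collection(id),  -- NULL at top level
    content       TEXT                   -- e.g. a YouTube embed snippet
);
""")

def resolve(uri: str):
    """Treat the last URI segment as the resource and the segment before
    it, if any, as the collection that contains it."""
    segments = [s for s in uri.split("/") if s]
    leaf = segments[-1]
    parent = segments[-2] if len(segments) > 1 else None
    return db.execute(
        "SELECT r.content FROM resource r "
        "LEFT JOIN collection c ON r.collection_id = c.id "
        "WHERE r.slug = ? AND c.slug IS ?",  # IS matches NULL for top-level URIs
        (leaf, parent),
    ).fetchone()
```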
I'm still working out an efficient way to build a query from a parsed URI with multiple "directories," but I'm getting there. The next step will be to work the other way and build a query that constructs a nav system with RESTful URIs. Again, though, I think the approach I laid out in the original question is RESTful, so long as I properly correlate URIs, queries, and the data architecture.
The more I walk down this path, the more I like it...