I am preparing to write my first web crawler, and it looks like Anemone makes the most sense. There is built-in support for MongoDB storage, and I am already using MongoDB via Mongoid in my Rails application. My goal is to store the crawled results and then access them later via Rails. I have a couple of concerns:
1) At the end of this page, it says that "Note: Every storage engine will clear out existing Anemone data before beginning a new crawl." I would expect this to happen at the end of the crawl if I were using the default memory storage, but shouldn't the records be persisted to MongoDB indefinitely so that duplicate pages are not crawled next time the task is run? If they are wiped "before beginning a new crawl", then should I just run my Rails logic before the next crawl? If so, then I would end up having to check for duplicate records from the previous crawl.
2) This is the first time I have really thought about using MongoDB outside the context of Rails models. It looks like the records are created using the Page class, so can I later just query these as I normally would using Mongoid? I guess it is just considered a "model" once it has an ORM providing the fancy methods?
Great questions.
1) It depends on what your goal is.
In most cases this default makes sense: you do a crawl with Anemone and examine the data. When you start a new crawl, the old data is erased so that the new crawl's data can replace it.
You could point the storage engine at a new collection before starting the new crawl if you don't want that to happen.
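A rough sketch of that, assuming Anemone's MongoDB storage factory accepts a database handle and a collection name (check the signature in your Anemone version before relying on this):

require 'anemone'
require 'mongo'

# the old-style mongo driver, which is what Anemone's storage adapter uses
db = Mongo::Connection.new.db('anemone')

Anemone.crawl("http://www.example.com/") do |anemone|
  # give this crawl its own collection so earlier results survive;
  # the date-based naming scheme here is just an example
  anemone.storage = Anemone::Storage.MongoDB(db, "pages_#{Time.now.strftime('%Y%m%d')}")
end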
2) Mongoid won't create the model classes for you.
You need to define a model so that Mongoid knows which class maps to the collection, and optionally declare the fields each document has so that you get the dot accessors out of the box.
Something like:
class Page
  include Mongoid::Document

  field :url,     type: String # I'm guessing; check what kind of docs Anemone produces
  field :aliases, type: Array
  # ...declare the remaining fields listed below in the same way
end
It will probably need to include the following fields:
url - The URL of the page
aliases - Other URLs that redirected to this page, or the Page that this one redirects to
headers - The full HTTP response headers
code - The HTTP response code (e.g. 200, 301, 404)
body - The raw HTTP response body
doc - A Nokogiri::HTML::Document of the page body (if applicable)
links - An Array of all the URLs found on the page that point to the same domain
But please just take a look at what type (string, array, whatever) the storage engine is storing them as and don't make assumptions.
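Once that model exists, querying works like any other Mongoid model. A minimal sketch, assuming the fields above and that Mongoid's default collection name ("pages") matches the one Anemone writes to:

# find every crawled page that came back with a 404
Page.where(code: 404).each do |page|
  puts page.url
end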
Good luck!
Related
I'm looking for the best way to implement an endpoint of a RESTful application that will be responsible for creating new library orders. Let's assume that I have the following resources.
If I want to get all books of a particular author I can use the next endpoint:
HTTP GET
api/books/author/123
If I want to fetch all orders of a particular book I can use the endpoint provided below:
HTTP GET
api/books/456/orders
My question is what will be the most suitable URL and a request model for an endpoint that will create orders?
From my perspective it can be
HTTP POST
api/books/456/orders
And one more question. Is it good practice in REST to use request models like CreateOrder? If I want to create a RESTful web application, can I use the following request model:
class CreateOrder
{
    AuthorId: number;
    BookId: number;
    ClientId: number;
}
Sometimes it makes me confused. Should request models look like our resources or not?
Let's assume that I have the following resources.
Your "resources" look suspiciously like "tables". Resources are closer to (logical) documents about information.
what will be the most suitable URL and a request model for an endpoint that will create orders
For the most part, it doesn't matter what URL you use to create orders. In a hypermedia application (think HTML), I'm going to submit a "form", and the metadata associated with that form is going to describe for the client how to compose a request from the form data.
So the human, or the code, that is manipulating the form doesn't need to know anything about the URL (when is the last time that you looked to see where Google was actually sending your search?)
As far as general purpose web components are concerned, the URL/URI is just an opaque identifier - they don't care what the spelling means.
A thing they do care about is whether the spelling is the same as something that they have cached. One of the consequences of a successful POST /x message is that the cached representation(s) of /x are invalidated.
So if you like, you can think about which cached document should be refreshed when an order is created, and send the request to the identifier for that document.
Should request models look like our resources or not?
It's not necessary. Again, think about the web -- what would the representation of create order look like if you were POSTing form data?
clientId=1&bookId=2
or maybe
bookId=2&copies=3
If the "who is creating an order" is answered using the authorization headers.
In our HTTP requests and responses, we are fundamentally sending message representations - sequences of bytes that conform to some schema. There's no particular reason that those sequences of bytes must, or must not, be the same as those we use elsewhere in the implementation.
Your endpoint does not need to always start with /books. You can introduce another endpoint, /orders, for creating or getting orders. So, to create an order, you can use:
HTTP POST
api/orders
And by "request model", do you mean the HTTP request body structure? If yes, it does not need to be a 100% match with your back-end persistence/domain model. Just include the parameters that the server needs to know in order to create an order (e.g. include bookId rather than the whole book object).
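So a create request could carry just the identifiers the server needs; a sketch (the field names here are only an illustration, not a prescribed schema):

HTTP POST
api/orders

{ "bookId": 456, "clientId": 789 }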
BTW, to get all books for a particular author, it is more common to use a query parameter, such as:
HTTP GET
api/books?authorId=123
What you are doing is not REST, it is CRUD over HTTP. REST does not care about your URI structures, and resources are very far from database tables. If CRUD is all you need, then download a CRUD generator library https://github.com/search?q=crud+generator&type=Repositories, which will generate all of the above so you won't need to write it manually.
We are designing WebAPI for our software for managing ecommerce product information. We want to provide (among many others) two operations:
Simple one: allow the user to add to or modify existing product information:
don't create a new product if it doesn't exist
don't delete any information from an existing product that was not provided in this request
In my opinion the HTTP PATCH method is the proper way to handle this scenario (with json-patch or json-merge-patch), with a URL like this: /products/{ID}
Harder one: allow the user to add/modify an existing product, or create one:
create the product if it does not exist in the DB
don't delete any information from an existing product that was not provided in this request (same behaviour as in the first case)
I'm struggling with designing a REST endpoint for this second use case. I have a few options, but none of them fits the REST principles perfectly for me:
a) Add a custom HTTP header to the endpoint designed for the first case (PATCH) that lets the caller control the "not found" behaviour, e.g. create-entity-when-not-exists: true/false - but in my opinion PATCH shouldn't be used for creating resources.
b) Design a new endpoint using PUT with a special header "preserve-not-provided-data" - this, on the other hand, violates PUT semantics for me, because PUT is a create-or-replace method, not create-or-update.
c) Create a PATCH for the /products URL (without {ID} at the end) - in this case we are updating the whole collection (resource) of products, so if a product exists we can update it, or create a new one if it doesn't.
For now solution c) looks fine to me, with one exception: if in the future we want to support batch operations (for both use cases 1 and 2), we would want to use the /products URL, and it would conflict with the URL from solution c).
What do you think ? Do you have any other ideas ?
PUT and PATCH have differing message semantics, but the core context ("remote authoring") is the same. In both cases, the client request is "Please, server, make your representation of this resource match my local copy".
For example, I GET a JSON document from the server. I make local edits to it. Now I want to "save" my changes on the server. If the document is modest in size, I might just send the entire revised document over the network. If the document is very large and my changes are modest, then I might send just the patch instead.
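As a sketch, for one of your product documents (paths and media type are only illustrative):

PUT /products/123
(body: the entire revised product document)

versus

PATCH /products/123
Content-Type: application/merge-patch+json

{ "title": "New title" }

Either way, the server ends up holding the same revised representation.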
If you imagine using HTTP to publish edits of HTML web pages to a server, then you've got the right frame of reference. There's not a lot of practical difference between "please patch the title of your copy of the document" and "here is a complete new copy of the document, with my edit to the title". The bytes on disk are going to be the same in either case.
Given that, it would be very odd if those two methods for publishing a new revision of the document were to have vastly different side effects.
Your third approach, based on modifying /products, is potentially fine for both your individual and batch cases. The server gets the new representation of /products (or the patch document describing the changes), decides whether to accept the changes, and if so computes what it needs to do to its own database to make things work.
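For example, if the representation of /products happened to be a JSON object keyed by product id (an assumption on my part, not something REST requires), a merge patch would express "create or update without clobbering" directly:

PATCH /products
Content-Type: application/merge-patch+json

{ "456": { "price": 9.99 } }

Per the merge-patch rules, if product 456 exists only its price changes; if it doesn't exist, it is created with that field.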
Note:
A PUT request applied to the target resource can have side effects on other resources.
The HTTP specification is relatively strict about what the message means, but offers the server a lot of leeway in how it behaves in response.
I have the following URI: /articles/:id, where article is a resource on the web service with an associated model/class. Now I need to return only partial data for each resource (to save bandwidth and improve speed) when the collection is requested, but when a single item is requested from the collection I need to send full data. My question is: should I use two models/classes for the same resource on the server and instantiate a different one depending on whether a collection or a single resource is requested? Or should there be only one model/class, where not all fields are filled with data when a collection is requested? Or maybe there is another approach?
I suggest the approach described here, with a fields query parameter.
If the API is going to be open to everyone to use and client usage is going to be unpredictable, then by default you probably need to limit the fields that you return. Just make sure you document in some way all the possible fields that could be used, in case a client actually needs them.
If the API is going to be consumed only by an app or apps you made, then by default you could return all of the fields and then your app can pass that fields parameter to speed things up.
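Following the URL style from your question, that might look like (the field names are just an example):

HTTP GET
api/articles?fields=id,title

while the full representation stays available at api/articles/123.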
So, I'm building a Web site application that will comprise a small set of content files each dedicated to a general purpose, rather than a large set of files each dedicated to a particular piece of information. In other words, instead of /index.php, /aboutme.php, /contact.php, etc., there would just be /index.php (basically just a shell with HTML and PHP) and then content.php, error.php, 404.php, etc.
To deliver content, I plan to store "directory structures" and associated content in a data table, capture URIs, and then query the data table to see if the URI matches a stored "directory structure". If there's a match, the associated content will be returned to the application, which will use Pear's HTTP_Request2 to send a request to content.php. Then, content.php will display the appropriate content that was returned from the database.
EDIT 1: For example:
a user types www.somesite.com/contact/ into their browser
index.php loads
a script upstream of the HTML header on index.php does the following:
submits a MySQL query that looks for WHERE path = $_SERVER['REQUEST_URI']
if it finds a match, it sets $content = $dbResults->content and POSTs $content to /pages/content.php for display. The original URI is preserved, although /pages/content.php and /index.php are actually delivering the content
if it finds no match, the contents of /pages/404.php are returned to the user. Here, too, the original URI is preserved even though index.php and /pages/404.php are actually delivering the content.
Is this a RESTful approach? On the one hand, the URI is being used to access a particular resource (a tuple in a data table), but on the other hand, I guess I always thought of a "resource" as an actual file in an actual directory.
I'm not looking to be a REST purist, I'm just really delving into the more esoteric aspects of and approaches to working with HTTP and am looking to refine my knowledge and understanding...
OK, my conclusion is that there is nothing inherently unRESTful about my approach, but how I use the URIs to access the data tables seems to be critical. I don't think it's in the spirit of REST to store a resource's full path in that resource's row in a data table.
Instead, I think it's important to have a unique tuple for each "directory" referenced in a URI. The way I'm setting this up now is to create a "collection" table and a "resource" table. A collection is a group of resources, and a resource is a single entity (a content page, an image, etc., or even a collection, which allows for nesting and taxonomic structuring).
So, for instance, /portfolio would correspond with a portfolio entry in the collection table and the resource table, as would /portfolio/spec-ads; however, /portfolio/spec-ads/hersheys-spec-ad would correspond to an entry only in the resource table. That entry would contain, say, embed code for a Hershey's spec ad on YouTube.
I'm still working out an efficient way to build a query from a parsed URI with multiple "directories," but I'm getting there. The next step will be to work the other way and build a query that constructs a nav system with RESTful URIs. Again, though, I think the approach I laid out in the original question is RESTful, so long as I properly correlate URIs, queries, and the data architecture.
The more I walk down this path, the more I like it...
I have the basics of a REST service done, with "standard" list and GET/POST/PUT/DELETE verbs implemented around my nouns.
However, the client base I'm working with also wants to have more powerful operations. I'm using Mongo DB on the back-end, and it'd be easy to expose an "update" operation. This page describes how Mongo can do updates.
It'd be easy to write a page that takes a couple of JSON/XML/whatever arguments for the "criteria" and the "objNew" parts of the Mongo update function. Maybe I make a page like http://myserver.com/collection/update that takes a POST (or PUT?) request, with a request body that contains that data. Scrub the input for malicious querying and to enforce security, and we're done. Piece of cake.
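For illustration, the request body might look something like this, mirroring the criteria/objNew arguments of Mongo's update function (the names and values are hypothetical):

POST /collection/update

{
  "criteria": { "status": "draft" },
  "objNew": { "$set": { "status": "published" } }
}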
My question is: what's the "best" way to expose this in a RESTful manner? Obviously, the approach I described above isn't kosher because "update" isn't a noun. This sort of thing seems much more suitable for a SOAP/RPC method, but the rest of the service is already using REST over HTTP, and I don't want users to have to make two different types of calls.
Thoughts?
Typically, I would handle this as:
url/collection
url/collection/item
GET collection: Returns a representation of the collection resource
GET collection/item: Returns a representation of the item resource
(optional URI params for content-types: json, xml, txt, etc)
POST collection/: Creates a new item (if via XML, I use XSD to validate)
PUT collection/item: Update an existing item
DELETE collection/item: Delete an existing item
Does that help?
Since, as you're aware, it isn't a good fit for REST, you're just going to have to do your best and invent a standard to make it work. Mongo's update functionality is so far removed from REST that I'd actually allow PUTs on the collection. Ignore the parameters in my examples; I haven't thought too hard about them.
PUT collection?set={field:value}
PUT collection?pop={field:1}
Or:
PUT collection/pop?field=1
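Or, if the query string gets unwieldy, the same invented convention could carry the operation document in the request body instead:

PUT collection

{ "pop": { "field": 1 } }

Either way you are defining your own contract on top of HTTP, so document it clearly for your clients.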