Is it possible to stream documents from Elasticsearch? - streaming

Is it possible to have a persistent HTTP connection to an Elasticsearch cluster that simply outputs new documents as they are indexed?
For example, here I'm adding a hypothetical parameter called stream:
curl -X GET 'http://localhost:9200/documents/_search?stream'
{"_index":"documents", "_type":"doc", "field": "value #1"}
{"_index":"documents", "_type":"doc", "field": "value #2"}
{"_index":"documents", "_type":"doc", "field": "value #3"}
... which would keep the connection open, presumably using HTTP chunked transfer encoding, until the client disconnects.
The alternative I'm considering is to execute a GET request against the cluster every second with a time range of one second. I was hoping for a streaming mode to avoid that overhead.
There are Elasticsearch Rivers, which seem to be the opposite of this.
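For reference, the per-second polling workaround could look roughly like this. This is a sketch in Python using requests; the timestamp field (and its epoch-millisecond format) is an assumption about the index mapping:

import time
import requests

ES_URL = "http://localhost:9200/documents/_search"
last_seen = 0  # highest timestamp processed so far (assumed epoch millis)

while True:
    query = {
        "query": {"range": {"timestamp": {"gt": last_seen}}},
        "sort": [{"timestamp": "asc"}],
        "size": 100,
    }
    hits = requests.get(ES_URL, json=query).json()["hits"]["hits"]
    for hit in hits:
        print(hit["_source"])  # hand each new document to the consumer
        last_seen = hit["_source"]["timestamp"]
    time.sleep(1)  # one request per second -- the overhead the question wants to avoid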

It does not exist in Elasticsearch yet.
However, you can have a look at this plugin.

There is a websocket transport (https://github.com/jprante/elasticsearch-transport-websocket) that will give you streaming.

Related

What HTTP method should I use when I need to pass a lot of data to a single endpoint?

I'm currently asking whether the HTTP method I'm using on my REST API is the correct one for the occasion. My endpoint needs a lot of data to "match" a query, so I've made a POST endpoint where the user can send JSON with the parameters, e.g.:
{
  "product_id": 1,
  "partner": 99,
  "category": [{
    "id": 8,
    "subcategories": [
      {"id": "x"}, {"id": "y"}
    ]
  }],
  "brands": ["apple", "samsung"]
}
Note that brands and category are lists.
Reading the Mozilla page about HTTP methods, I found:
The POST method is used to submit an entity to the specified resource, often causing a change in state or side effects on the server.
My POST endpoint does not have any effect on my server/database, so in theory I'm using it wrong(?), but if I use a GET request, how can I make it more "readable", and how can I manage lists with this method?
What HTTP method should I use when I need to pass a lot of data to a single endpoint?
From RFC 7230
Various ad hoc limitations on request-line length are found in practice. It is RECOMMENDED that all HTTP senders and recipients support, at a minimum, request-line lengths of 8000 octets.
That effectively limits the amount of information you can encode into the target-uri. Your limit in practice may be lower still if you have to support clients or intermediaries that don't follow this recommendation.
If the server needs the information, and you cannot encode it into the URI, then you are basically stuck with encoding it into the message-body, which in turn means that GET -- however otherwise suitable the semantics might be for your setting -- is out of the picture:
A payload within a GET request message has no defined semantics
So that's that - you are stuck with POST, and you lose safe semantics, idempotent semantics, and caching.
A possible alternative to consider is creating a resource which the client can later GET to retrieve the current representation of the matches. That doesn't make things any better for the first ad hoc query, but it does give you nicer semantics for repeat queries.
You might, for example, copy the message-body into a document store, and encode the key to the document store (for example, a hash of the document) into the URI to be used by the client in subsequent GETs.
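A minimal sketch of that pattern, assuming Flask; the /queries endpoints, the in-memory store, and run_match are hypothetical names, not a prescribed API:

import hashlib
import json
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
queries = {}  # query-hash -> stored query document (stand-in for a real store)

def run_match(query):
    # stand-in for the actual matching logic against your data
    return {"query": query, "matches": []}

@app.route("/queries", methods=["POST"])
def create_query():
    body = request.get_data()
    key = hashlib.sha256(body).hexdigest()[:16]  # stable key derived from the document
    queries[key] = json.loads(body)
    # Point the client at a resource it (and caches) can GET from now on
    return "", 303, {"Location": f"/queries/{key}/results"}

@app.route("/queries/<key>/results", methods=["GET"])
def get_results(key):
    if key not in queries:
        abort(404)
    return jsonify(run_match(queries[key]))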
For cases where the boilerplate of the JSON document is large, and the encoding of the variation small, you might consider a family of resources that encode the variations into the URI. In effect, the variation is extracted from the URI and applied to the server's copy of the template, and then the fully reconstituted JSON document is used to achieve... whatever.
You should be using POST anyway. With GET you can only "upload" data via URL parameters or HTTP headers. Both are unsuitable for structured data like yours. Do use POST even though no "change" happens on the server.

Elasticsearch truly RESTful?

I am designing an API that will need to accept a large amount of data in order to return resources. I thought about using a POST request instead of a GET so I can pass a body with the request. That has been largely frowned upon in the REST community:
Switching to a POST request simply because there's too much data to fit in a GET request makes little sense
https://stackoverflow.com/a/812935/7489702
Another:
Switching to POST discards a number of very useful features though. POST is defined as a non-safe, non-idempotent method. This means that if a POST request fails, an intermediate (such as a proxy) cannot just assume they can make the same request again. https://evertpot.com/dropbox-post-api/
Another: HTTP GET with request body
But contrary to this, Elasticsearch uses POST methods to get around the issue of queries being too long to put in a url.
Both HTTP GET and HTTP POST can be used to execute search with body. Since not all clients support GET with body, POST is allowed as well. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html
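To make that concrete, both calls below carry the same body and return the same hits. A sketch using Python's requests; the index name is illustrative:

import requests

url = "http://localhost:9200/bookdb_index/_search"
query = {"query": {"match": {"title": "guide"}}}

r_get = requests.get(url, json=query)    # GET with a body; not every client allows this
r_post = requests.post(url, json=query)  # the POST fallback Elasticsearch also accepts
assert r_get.json()["hits"] == r_post.json()["hits"]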
So, is Elasticsearch not truly restful? Or, does the difference between POST and GET not matter as much in modern browsers?
Elasticsearch's intent is not to be RESTful but to provide a (pragmatic) web API that lets clients index documents and offers services like full-text search or aggregations to help the client with its needs.
Not everything that is exposed via HTTP is automatically RESTful. I claim that most of the so-called RESTful services aren't as RESTful as they think they are. In order to be RESTful, a service has to adhere to a couple of constraints which Fielding, the inventor of REST, made more precise in a blog post.
Basically, RESTful services should adhere to and not violate the underlying protocol and put a strong focus on resources and their representation via media types. Although REST is used via HTTP most of the time, it is not restricted to this protocol.
Clients, on the other hand, should not have initial knowledge of or assumptions about the available resources or their returned state ("typed" resources) in an API, but should learn them on the fly via issued requests and analyzed responses. This gives the server the opportunity to move around or rename resources easily without breaking a client implementation.
HATEOAS (Hypermedia as the engine of application state) enriches a resource's state with links a client can use to trigger further requests in order to update its knowledge base or perform some state changes. Here a client should determine the semantics of a URI from the given relation name rather than by parsing the URI, as the relation name should not change if the server moves a resource around for whatever reason.
The client should furthermore use the relation name to determine what content type a resource may have. A relation name like news could lead the client to request the resource as an application/atom+xml representation, while a contact relation might lead to a representation request with media type text/vcard, vcard+json, or vcard+xml.
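For illustration, a HATEOAS-style response might embed such links like this; the payload, relation names, and media types below are made up, not from any real API:

{
  "name": "Manning Publications",
  "_links": {
    "self": { "href": "/publishers/manning" },
    "news": { "href": "/publishers/manning/news", "type": "application/atom+xml" },
    "contact": { "href": "/publishers/manning/contact", "type": "text/vcard" }
  }
}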
If you look at an Elasticsearch sample I took from DZone, you will see that ES does not support HATEOAS at all:
Request
GET /bookdb_index/book/_search?q=guide
Response:
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 0.28168046,
"_source": {
"title": "Elasticsearch: The Definitive Guide",
"authors": [
"clinton gormley",
"zachary tong"
],
"summary": "A distibuted real-time search and analytics engine",
"publish_date": "2015-02-07",
"num_reviews": 20,
"publisher": "manning"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 0.24144039,
"_source": {
"title": "Solr in Action",
"authors": [
"trey grainger",
"timothy potter"
],
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"publish_date": "2014-04-05",
"num_reviews": 23,
"publisher": "manning"
}
}
]
The problem here is that the response contains Elasticsearch-related stuff that is obviously arbitrary metadata for the returned results. While this could be handled via special media types that teach a client what each field's semantics are, the actual payload kept in the _source element is still generic. Here you'd need further custom media-type extensions for each possible type.
If ES changes the response format in the future, clients which assume that _type will determine the type of a resource and _source will define the current state of some object of that type may break and hence stop working. Instead, a client should ask the server to return a resource in a format it understands. If the server does not know any of the requested representation formats, it will notify the client accordingly. If it knows at least one, it will transform the state of the requested resource into a representation the client understands.
Long story short, Elasticsearch is by no means RESTful, and it does not try to be. Instead, your "RESTful" service should use it and turn the results into a response that matches the representation requested by the client.
So, is Elasticsearch not truly restful? Or, does the difference between POST and GET not matter as much in modern browsers?
I think ES is not truly RESTful, because its queries are more complex than those of a normal web application.
REST proponents tend to favor URLs, such as
http://myserver.com/catalog/item/1729
but the REST architecture does not require these “pretty URLs”. A GET request with a parameter
http://myserver.com/catalog?item=1729 (Elasticsearch does this) is just as RESTful.
There is still a difference between POST and GET for the modern developer, though:
GET requests should be idempotent. That is, issuing a request twice should be no different from issuing it once. That’s what makes the requests cacheable. An “add to cart” request is not idempotent—issuing it twice adds two copies of the item to the cart. A POST request is clearly appropriate in this context. Thus, even a RESTful web application needs its share of POST requests.
Reference: What exactly is RESTful programming?

REST Best practice for sync log data in reverse order

Consider a backend system that stores logs of this form:
{"id": 541, "timestamp": 123, "status": "ok"}
{"id": 681, "timestamp": 124, "status": "waiting"}
...
Assuming that there are MANY logs, a client (e.g. an Android app) wants to sync the log data stored at a server to the client's device for presentation. Since the most recent logs are of more interest to a user, the GET request should be paged and start with the most recent logs and walk its way towards the oldest ones.
What is a proper design for this situation? What about the following design?
Let the server respond in reverse order, add parameters lastReceivedId and size to the request, and add a field more=true/false to the response that indicates whether there are more old logs available before the oldest log sent in the current response. On the first request, set lastReceivedId=-1 to indicate that the server should answer with the most recent logs.
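A client-side sketch of the scheme just described, in Python; the /logs endpoint is hypothetical, and the parameter and field names follow the design above:

import requests

def sync_logs(store):
    last_received_id = -1  # -1 asks the server for the most recent page first
    while True:
        resp = requests.get(
            "https://api.example.com/logs",
            params={"lastReceivedId": last_received_id, "size": 100},
        ).json()
        for log in resp["logs"]:  # newest first within each page
            store(log)
        if not resp["more"]:  # no older logs available before the oldest one sent
            break
        last_received_id = resp["logs"][-1]["id"]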
Ship 'em all, let the server sort them out. The endpoint simply doesn't care what order they show up in; the server will handle that detail for presentation.
On the client, the client can choose to send the latest logs first, but that's simply coincidence. There's no requirement one way or the other.
There's also no need to send them in any particular order. If the client has a thousand log entries (in chronological order), it can send back batches of 100 starting at 900-1000, then 800-899, etc. The server will figure it out in the end.
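A sketch of that batching in Python; the endpoint and batch size are illustrative:

import requests

def upload_logs(logs, batch_size=100):
    # logs is in chronological order; send the newest batch first
    for end in range(len(logs), 0, -batch_size):
        start = max(0, end - batch_size)
        requests.post("https://api.example.com/logs", json=logs[start:end])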

HTTP Response Status Code for bad data found on the DB

I expect the JSON response returned by one of my Web APIs to contain a minimum set of fields annotated as mandatory by the business.
Which HTTP status code fits better in case some bad data that does not respect the contract is found in the db?
At the moment we're using 500, but we probably need to improve on that (also because the Varnish instance in front of our service translates that 500 into a 503 Service Unavailable).
Example:
{
  "id": "123",
  "message": "500 - Exception during request processing. Cause: subtitles is a required field of class Movie and cannot be empty",
  "_links": {
    "self": {
      "href": "/products/movies/123"
    }
  }
}
Thanks
A 500 Internal Server Error seems to be enough here, to be honest, as the issue is essentially on the server side and doesn't reflect any wrongdoing by the client.
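A sketch of that approach with Flask; the route, load_movie, and the required-field list are illustrative stand-ins for the real service:

from flask import Flask, jsonify

app = Flask(__name__)
REQUIRED_FIELDS = ("id", "title", "subtitles")

def load_movie(movie_id):
    # stand-in for the real db access layer
    return {"id": movie_id, "title": "Some Movie", "subtitles": ""}

@app.route("/products/movies/<movie_id>")
def get_movie(movie_id):
    movie = load_movie(movie_id)
    missing = [f for f in REQUIRED_FIELDS if not movie.get(f)]
    if missing:
        # The request itself was fine; the stored data violates the contract,
        # so signal a server-side problem (500) rather than blaming the client (4xx)
        return jsonify({"id": movie_id,
                        "message": f"missing required fields: {missing}"}), 500
    return jsonify(movie)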
400 means bad request, but there is nothing wrong with the request, so don't use it.
500 means server error, as in something unexpected happened: the code imploded, the galaxy crashed. Bad data is not a cause of any of that.
You have a case of bad data so you have a number of approaches at your disposal.
The first question is what do you intend to do with the bad data?
You could delete it, in which case you would return a 204, which means the request is fine but there is no data to send back. I would not return a 404 for missing data, because 404 means the endpoint doesn't exist, not that the data doesn't exist.
You could move it out of the main db into a temporary store while someone fixes it.
Either way you would not return the bad data, and the client doesn't care why that data is bad. All the client is concerned with is whether there is any data or not.
Why should the client care that your data is missing a field you deem important?
You could take another approach and return the data you have.
Bottom line: decide how you're dealing with the bad data and don't put any kind of responsibility on the client.

data store with a "get or block" operation?

I'm looking for a data store that has a "get or block" operation. This operation would return the value associated with a key/query if that value exists or block until that value is created.
It's like a pub/sub message queue but with a memory to handle the case when the subscriber connects after the publisher has published the result.
This operation allows unrelated processes to rendezvous with each other, and it seems that it would be a very useful architectural building block to have, especially in a web environment: e.g. a web request comes in that kicks off a backend server process to do some work, and the web client can get the results via a future AJAX call.
Here is a blog post I found on how to accomplish this sort of operation with MongoDB:
http://blog.mongodb.org/post/29495793738/pub-sub-with-mongodb
What other solutions are in use today? Can I do the same thing with redis or rabbitmq? I've looked at the docs for both, but it's unclear exactly how it would work. Should I roll my own server with 0MQ? Is there something out there specifically tailored for this problem?
You are correct: both Redis [1] and RabbitMQ [2] have pub/sub capabilities.
[1] http://redis.io/topics/pubsub
[2] http://www.rabbitmq.com/tutorials/tutorial-three-python.html
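Beyond plain pub/sub, Redis lists can give you the "memory" the question asks for: BLPOP blocks until a value exists under a key, and a value pushed before the consumer connects is still there waiting. A sketch in Python (key names are illustrative):

import redis

r = redis.Redis()

def publish_result(key, value):
    r.rpush(key, value)  # producer side: park the result under the key

def get_or_block(key, timeout=0):
    # timeout=0 blocks indefinitely; blpop returns a (key, value) pair
    _, value = r.blpop(key, timeout=timeout)
    return value

Note that BLPOP removes the value, so this is a one-shot rendezvous between one producer and one consumer; if several readers need the same result, you'd store the value under a key and use pub/sub only as the wake-up signal.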