'Keep all fields' option creates duplicate records in HTTP Client origin in StreamSets

I have an HTTP Client origin that returns a JSON response. The pipeline uses pagination (by page number). When I enable the ‘Keep all fields’ option in the HTTP Client, it duplicates the first record of every page. Say a page has 10 records: it writes the first record 10 times, then the first record of the second page 10 times, and so on. Basically, it repeats the first record for the entire page. Is there any way to fix this? We need ‘Keep all fields’ enabled to get the page properties for processing within the job.

Related

explanation on HTTPS Records Query

Can anyone please explain how the first set, previous set, next set, and last set of records can be used to query HTTP REST message data? What exactly do they do?
I found some information on the ServiceNow website, but I was not able to understand it.
Can we use this instead of sysparm_limit/sysparm_offset technique to fetch the records?
Yes, it's there for pagination on your side.
To get the first 10 records you can set sysparm_limit=10 and sysparm_offset=0. To get the next 10 records you should set sysparm_limit=10 and sysparm_offset=10.
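A minimal sketch of the limit/offset arithmetic described above, in Python. The helper name offset_pages and the total of 25 records are assumptions made for illustration; only the sysparm_limit and sysparm_offset parameter names come from the answer.

```python
from urllib.parse import urlencode


def offset_pages(total_records, page_size):
    """Yield one query string per page, advancing the offset by page_size."""
    for offset in range(0, total_records, page_size):
        yield urlencode({"sysparm_limit": page_size, "sysparm_offset": offset})


# Hypothetical table of 25 records read 10 at a time.
queries = list(offset_pages(total_records=25, page_size=10))
for query in queries:
    print(query)
```

Each query string would be appended to the table URL; the last page simply returns fewer rows than the limit.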

MeteorJS - How to send two separate queries of the same collection from server to client

I am trying to send two separate sets of data from the same collection from server to client. Data is being inserted in the collection on a set interval of 30 seconds. One set of data sent to the client must return all documents over the course of the current day on an hourly basis, while the other set of data simply sends the most recent entry in the collection. I have a graph that needs to display hourly data, as well as fields that need to display the most recent record every 30 seconds, however, I cannot seem to decouple these two data sets. The query for the most recent entry seems to always overwrite the query for the hourly data when attempting to access the data on the client. So my question summed up is: How does one send two separate sets of data of the same collection from server to client, and then access these two separate sets independently on the client?
The answer is simple, you cannot!
The server is always answering to client with a result set that the client asked for. So, if the client needs two separate (different) result sets, then the client must fire up two different queries. Queries that request hourly data OR last (newest) entry.
Use added, changed, removed to modify the results from the two queries so that they are "transformed" into different fields. https://docs.meteor.com/api/pubsub.html#Subscription-added
However, this is probably not your issue. You are almost certainly using the same string as the name argument of your Meteor.publish call, or you are accidentally Meteor.subscribe-ing to the same Meteor.publish twice.
Make two separate Meteor.publish names, one for the most recent and one for the hourly data. Subscribe to each of them separately. The commenter is incorrect.

How to design a REST API to fetch a large (ephemeral) data stream?

Imagine a request that starts a long running process whose output is a large set of records.
We could start the process with a POST request:
POST /api/v1/long-computation
The output consists of a large sequence of numbered records that must be sent to the client. Since the output is large, the server does not store everything, and so maintains a window of records with an upper limit on the size of the window. Let's say that it stores up to 1000 records (and pauses computation whenever this many records are available). When the client fetches records, the server may subsequently delete those records and so continue generating more records (as more slots in the 1000-length window are free).
Let's say we fetch records with:
GET /api/v1/long-computation?ack=213
We can take this to mean that the server should return records starting from index 214. When the server receives this request, it can assume that the (well-behaved) client is acknowledging that records up to number 213 are received by the client and so it deletes them, and then returns records starting from number 214 to whatever is available at that time.
Next if the client requests:
GET /api/v1/long-computation?ack=214
the server would delete record 214 and return records starting from 215.
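To make the proposed behaviour concrete, here is a rough in-memory sketch in Python of the acknowledgement window. The class RecordWindow and its method names are hypothetical; it only models the bookkeeping, not the HTTP layer.

```python
class RecordWindow:
    """Bounded buffer of numbered records; the producer pauses when it is full."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.records = {}       # record number -> payload, in insertion order
        self.next_number = 1

    def has_room(self):
        return len(self.records) < self.capacity

    def produce(self, payload):
        if not self.has_room():
            raise RuntimeError("window full; the computation should pause")
        self.records[self.next_number] = payload
        self.next_number += 1

    def ack_and_fetch(self, ack):
        # Delete everything the client claims to have received...
        for number in [n for n in self.records if n <= ack]:
            del self.records[number]
        # ...and return whatever is currently buffered after that point.
        return list(self.records.items())


window = RecordWindow(capacity=3)
for payload in ("a", "b", "c"):
    window.produce(payload)
print(window.ack_and_fetch(0))   # nothing acknowledged yet
print(window.ack_and_fetch(2))   # records 1 and 2 are dropped
```

Note that ack_and_fetch both reads and deletes, which is exactly the side effect that makes GET a questionable fit.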
This seems like a reasonable design until it is noticed that GET requests need to be safe and idempotent (see section 9.1 in the HTTP RFC).
Questions:
Is there a better way to design this API?
Is it OK to keep it as GET even though it appears to violate the standard?
Would it be reasonable to make it a POST request such as:
POST /api/v1/long-computation/truncate-and-fetch?ack=213
One question I always feel needs to be asked is: are you sure that REST is the right approach for this problem? I'm a big fan and proponent of REST, but I try to apply it only to situations where it's applicable.
That being said, I don't think there's anything necessarily wrong with expiring resources after they have been used, but I think it's bad design to re-use the same url over and over again.
Instead, when I call the first set of results (maybe with):
GET /api/v1/long-computation
I'd expect that resource to give me a next link with the next set of results.
Although that particular url design does sort of tell me there's only 1 long-computation on the entire system going on at the same time. If this is not the case, I would also expect a bit more uniqueness in the url design.
The best solution here is to buy a bigger hard drive. I'm assuming you've pushed back and that's not in the cards.
I would consider your operation to be "unsafe" as defined by RFC 7231, so I would suggest not using GET. I would also strongly advise you to not delete records from the server without the client explicitly requesting it. One of the principles REST is built around is that the web is unreliable. Under your design, what happens if a response doesn't make it to the client for whatever reason? If they make another request, any records from the lost response will be destroyed.
I'm going to second @Evert's suggestion that if you absolutely must keep this design, you instead pick a technology that's built around reliable delivery of information, such as a message queue. If you're going to stick with REST, you need to allow clients to tell you when it's safe to delete records.
For instance, is it possible to page records? You could do something like:
POST /long-running-operations?recordsPerPage=10
202 Accepted
Location: "/long-running-operations/12"
{
"status": "building next page",
"retry-after-seconds": 120
}
GET /long-running-operations/12
200 OK
{
"status": "next page available",
"current-page": "/pages/123"
}
-- or --
GET /long-running-operations/12
200 OK
{
"status": "building next page",
"retry-after-seconds": 120
}
-- or --
GET /long-running-operations/12
200 OK
{
"status": "complete"
}
GET /pages/123
{
// a page of records
}
DELETE /pages/123
// remove this page so new records can be made
You'll need to cap the page size at the number of records you support. If the client requests fewer records than that limit, you can build more records in the background while they process the first page.
That's just spitballing, but maybe you can start there. No promises on quality - this is totally off the top of my head. This approach is a little chatty, but it saves you from returning a 404 if the new page isn't ready yet.
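The server-side bookkeeping for that flow might look roughly like the Python sketch below. PageStore and its methods are hypothetical names; the point is only that reading a page is repeatable and that space is freed solely by an explicit delete.

```python
class PageStore:
    """Holds finished pages until the client explicitly deletes them."""

    def __init__(self, max_pages=1):
        self.max_pages = max_pages
        self.pages = {}      # page id -> list of records
        self.next_id = 1

    def can_build(self):
        # The computation may only continue while a page slot is free.
        return len(self.pages) < self.max_pages

    def build_page(self, records):
        if not self.can_build():
            raise RuntimeError("no free slot; wait for the client to delete a page")
        page_id = self.next_id
        self.pages[page_id] = list(records)
        self.next_id += 1
        return page_id

    def get(self, page_id):
        # Safe and idempotent: repeating the read changes nothing.
        return self.pages[page_id]

    def delete(self, page_id):
        # Idempotent: deleting an already-deleted page is a no-op.
        self.pages.pop(page_id, None)


store = PageStore(max_pages=1)
page_id = store.build_page(["r1", "r2"])
store.get(page_id)      # the response is lost in transit...
store.get(page_id)      # ...so the client retries and gets the same records
store.delete(page_id)   # only now can the next page be built
```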

Paging in inbox threads/comments does not work?

I am trying to list all messages for a thread in the inbox. I notice that I get the 25 last messages by default by doing something like this:
https://graph.facebook.com/<threadID>/comments?access_token=<token>
I get data for the 25 last messages in the thread, in this case messages 4 to 28. The first message has a "created_time" of "2011-01-21", the last (newest) has a "created_time" of "2013-09-24".
The data returned for the "comments" connection has paging; the "next" and "previous" links are present and look like this:
"previous"
https://graph.facebook.com/<threadID>/comments?access_token=<token>&limit=25&since=1380049638&__paging_token=<threadID>_28
"next"
https://graph.facebook.com/<threadID>/comments?access_token=<token>&limit=25&until=1295625728&__paging_token=<threadID>_4
However, both return empty data sets!
How can I get this to work?
Another observation: when experimenting with "until", I noticed that when setting "until=2013-02-23" or earlier, the response is also an empty data set!
I have also noticed another thing: the default limit seems to be 25 messages; however, even when setting the limit to a high number (like "limit=100") you only get around 28-30 messages per request. So it seems that for the thread/comments connections there are two problems: 1) "limit=" does not work as expected 2) "until=" does not work as expected: going back before a certain date/time returns an empty data set (this is why the paging does not work, I guess).
Any ideas on how to get around this?
If you have a problem with next URL for the pagination, try using the offset along with the limit parameters in the URI.
For example, instead of making an API call to <threadID>/comments, make a call to /comments?limit=100&offset=0. This will start the list of the messages from an offset of 0 and will display a list of 100 messages on each page. The next URL will work just fine in this case. You can however increase the limit of the messages per page.
Also, there are some issues with the paging. Have a look at this post to learn how it works actually.

How to implement robust pagination with a RESTful API when the resultset can change?

I'm implementing a RESTful API which exposes Orders as a resource and supports pagination through the resultset:
GET /orders?start=1&end=30
where the orders to paginate are sorted by ordered_at timestamp, descending. This is basically approach #1 from the SO question Pagination in a REST web application.
If the user requests the second page of orders (GET /orders?start=31&end=60), the server simply re-queries the orders table, sorts by ordered_at DESC again and returns the records in positions 31 to 60.
The problem I have is: what happens if the resultset changes (e.g. a new order is added) while the user is viewing the records? In the case of a new order being added, the user would see the old order #30 in first position on the second page of results (because the same order is now #31). Worse, in the case of a deletion, the user sees the old order #32 in first position on the second page (#31) and wouldn't see the old order #31 (now #30) at all.
I can't see a solution to this without somehow making the RESTful server stateful (urg) or building some pagination intelligence into each client... What are some established techniques for dealing with this?
For completeness: my back-end is implemented in Scala/Spray/Squeryl/Postgres; I'm building two front-end clients, one in backbone.js and the other in Python Django.
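The drift described above can be reproduced with a few lines of Python. This is only an illustration: the list of 100 order ids and the page helper are made up, and a plain list stands in for the sorted orders table.

```python
def page(orders, start, end):
    """Return the orders at 1-based positions start..end (inclusive)."""
    return orders[start - 1:end]


# Order ids sorted by ordered_at DESC: id 100 is the newest.
orders = list(range(100, 0, -1))
first_page = page(orders, 1, 30)                    # ids 100 down to 71

# Case 1: a new order (id 101) arrives before page two is requested.
after_insert = [101] + orders
second_after_insert = page(after_insert, 31, 60)    # starts with id 71 again

# Case 2: instead, the newest order (id 100) is deleted.
after_delete = orders[1:]
second_after_delete = page(after_delete, 31, 60)    # starts with id 69; id 70 is never shown

print(first_page[-1], second_after_insert[0], second_after_delete[0])
```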
The way I'd do it, is to make the indices from old to new. So they never change. And then when querying without any start parameter, return the newest page. Also the response should contain an index indicating what elements are contained, so you can calculate the indices you need to request for the next older page. While this is not exactly what you want, it seems like the easiest and cleanest solution to me.
Initial request: GET /orders?count=30 returns:
{
"start": 1039,
"count": 30,
...//data
}
From this the consumer calculates that he wants to request:
Next requests: GET /orders?start=1009&count=30 which then returns:
{
"start": 1009,
"count": 30,
...//data
}
Instead of raw indices you could also return a link to the next page:
{
"next": "/orders?start=1009&count=30"
}
This approach breaks if items get inserted or deleted in the middle. In that case you should use some auto incrementing persistent value instead of an index.
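A small Python sketch of the oldest-to-newest indexing idea, under the assumption that orders are only ever appended. The function names fetch_by_index and older_page_request are hypothetical, and a list stands in for the table.

```python
def fetch_by_index(orders, count, start=None):
    """orders is stored oldest-first; 1-based positions never change on append."""
    if start is None:
        # No start parameter: return the newest page.
        start = max(1, len(orders) - count + 1)
    items = orders[start - 1:start - 1 + count]
    return {"start": start, "count": len(items), "data": items}


def older_page_request(page, count):
    """Compute (start, count) for the next older page, or None at the oldest."""
    if page["start"] <= 1:
        return None
    start = max(1, page["start"] - count)
    return start, page["start"] - start


orders = [f"order-{i}" for i in range(1, 1069)]   # 1068 existing orders
newest = fetch_by_index(orders, 30)               # start is 1039
orders.append("order-1069")                       # a new order arrives meanwhile
start, count = older_page_request(newest, 30)     # 1009 and 30
older = fetch_by_index(orders, count, start)      # unaffected by the append
```

Because new orders only extend the end of the list, the older page neither repeats nor skips anything the client has already seen.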
The sad truth is that all the sites I see have pagination "broken" in that sense, so there must not be an easy way to achieve that.
A quick workaround could be reversing the ordering, so the position of the items is absolute and unchanging with new additions. From your front page you can give the latest indices to ensure consistent navigation from up there.
Pros: same url gives the same results
Cons: there's no evident way to get the latest elements... Maybe you could use negative indices and redirect the result page to the absolute indices.
With a RESTful API, application state should be in the client. Here the application state should be some sort of timestamp or version number telling when you started looking at the data. On the server side, you will need some form of audit trail, which is properly server data, as it does not depend on whether there have been clients and what they have done. At the very least, it should know when the data last changed. No contradiction with REST here.
You could add a version parameter to your GET. When the client first requests a page, it normally does not send a version. The server's reply contains one. For instance, if there are links in the reply to next/other pages, those links contain &version=... The client should send the version when requesting another page.
When the server receives a request with a version, it should at least know whether the data have changed since the client started looking and, depending on what sort of audit trail you have, how they have changed. If they have not, it answers normally, transmitting the same version number. If they have, it may at least tell the client. And depending on how much it knows about how the data have changed, it may tailor the reply accordingly.
Just as an example, suppose you get a request with start, end, version, and that you know that since version was up to date, 3 rows coming before start have been deleted. You might send a redirect with start-3, end-3, new version.
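As a rough Python illustration of that example: the function below decides whether to answer normally or to redirect. The name redirect_for and the idea that the audit trail can report a count of rows deleted before start are assumptions, not part of any particular framework.

```python
def redirect_for(start, end, version, current_version, deleted_before_start):
    """Return None to answer normally, or the parameters of a redirect.

    deleted_before_start is the number of rows located before `start` that
    were deleted since `version`, as reported by a hypothetical audit trail.
    """
    if version == current_version:
        return None   # nothing changed: answer with the same version
    return {
        "start": start - deleted_before_start,
        "end": end - deleted_before_start,
        "version": current_version,
    }


print(redirect_for(31, 60, version=7, current_version=8, deleted_before_start=3))
```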
WebSockets can do this. You can use something like pusher.com to catch realtime changes to your database and pass the changes to the client. You can then bind different pusher events to work with models and collections.
Just Going to throw it out there. Please feel free to tell me if it's completely wrong and why so.
This approach is trying to use a left_off variable to sort through without using offsets.
Consider you need to make your result Ordered by timestamp order_at DESC.
So when I ask for first result set
it's
SELECT * FROM Orders ORDER BY order_at DESC LIMIT 25;
right?
This is the case when you ask for the first page (in terms of URL probably the request that doesn't have any
yoursomething.com/orders?limit=25&left_off=$timestamp
Then, when you receive your data set, just grab the timestamp of the last viewed item, e.g. 2015-12-21 13:00:49.
Now, to request the next 25 items, go to: yoursomething.com/orders?limit=25&left_off=2015-12-21 13:00:49 (the last viewed timestamp).
In SQL you would make the same query and add a condition that the timestamp is less than $left_off:
SELECT * FROM Orders
WHERE order_at < '2015-12-21 13:00:49' ORDER BY order_at DESC LIMIT 25;
You should get the next 25 items after the last seen item.
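Here is a small runnable sketch of the left_off idea using Python's built-in sqlite3 module. The table contents, the id column, and the helper name fetch_after are made up for the example, and it assumes every order_at value is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, order_at TEXT)")
conn.executemany(
    "INSERT INTO Orders (id, order_at) VALUES (?, ?)",
    [(i, f"2015-12-21 13:{i:02d}:00") for i in range(60)],
)


def fetch_after(conn, limit, left_off=None):
    """Return the next `limit` orders older than left_off (newest first)."""
    if left_off is None:
        cursor = conn.execute(
            "SELECT id, order_at FROM Orders ORDER BY order_at DESC LIMIT ?",
            (limit,),
        )
    else:
        cursor = conn.execute(
            "SELECT id, order_at FROM Orders WHERE order_at < ? "
            "ORDER BY order_at DESC LIMIT ?",
            (left_off, limit),
        )
    return cursor.fetchall()


first = fetch_after(conn, 25)
# A new order arrives while the user is still looking at the first page.
conn.execute("INSERT INTO Orders (id, order_at) VALUES (60, '2015-12-21 14:00:00')")
second = fetch_after(conn, 25, left_off=first[-1][1])
```

If two orders can share a timestamp, a strict comparison on order_at alone would skip the ties, so a tie-breaking column such as the id would need to be part of both the ordering and the comparison.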
For those who see this answer: please comment on whether this approach is relevant, or even possible in the first place. Thank you.