FQL LIMIT Returns Fewer Results Than Expected - Facebook

I am trying to get data from Facebook with an FQL query.
One of the things I want to do is fetch a limited number of records per call. I am trying to do this with the 'LIMIT [start], [end]' clause, which is supposed to return the records between those numbers. But instead of getting [end]-[start] records, which is the total number that should be returned, I get a random number of records. I have checked, and I can be sure that I am not trying to get more records than there are.
LIMIT example:
http://graph.facebook.com/fql?q=SELECT actor_id, message,description FROM stream WHERE source_id =5878435678 Limit 2,10
This is supposed to return 7 records (the count starts from 0), but I get only 3.
The funny thing is that when I wrote 50 instead of 10, I got 26 records.
Can someone help me find a way to get the exact number of records I asked for?
Thanks in advance!

This blog post by Facebook engineers explains this phenomenon.
http://developers.facebook.com/blog/post/478/
Here's the part that addresses your question...
You might notice that the number of results returned is not always
equal to the “limit” specified. This is expected behavior. Query
parameters are applied on our end before checking to see if the
results returned are visible to the viewer. Because of this, it is
possible that you might get fewer results than expected.
This next bit is the best part of the blog entry...
This also means when you are manually constructing your own queries,
you should be aware that with some tables and connections if you are
specifying an “offset” parameter, the Nth result you are pointing to
may not get returned in your results (like the 3rd result in step 2 of
the image above). This can make paging difficult and confusing.
Lol, you're killing me, Facebook!!! Why not make it straightforward and consistent, rather than "difficult and confusing"?!
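If you need exactly N visible rows, the practical workaround is to over-fetch and page: keep advancing the offset until enough rows survive the visibility check. A rough sketch in Python with the requests library (the batch size of 50 and the access-token plumbing are my assumptions, not anything Facebook prescribes):
import requests

def fetch_visible_rows(source_id, wanted, access_token, batch=50):
    # Collect `wanted` visible stream rows, compensating for rows the
    # privacy check strips out *after* LIMIT has been applied.
    rows, offset = [], 0
    while len(rows) < wanted:
        q = ("SELECT actor_id, message, description FROM stream "
             "WHERE source_id = %s LIMIT %d,%d" % (source_id, offset, batch))
        resp = requests.get("https://graph.facebook.com/fql",
                            params={"q": q, "access_token": access_token})
        data = resp.json().get("data", [])
        if not data:      # ran out of rows entirely
            break
        rows.extend(data)
        offset += batch   # advance past the rows already requested
    return rows[:wanted]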

Related

Do I have to loop through each 'page' of orders to get all orders in one WooCommerce REST API query?

I've built a KNIME workflow that helps me analyse (sales) data from numerous channels. In the past I exported all orders manually and used an XLSX or CSV reader, but I want to do it via WooCommerce's REST API to reduce manual labor.
I would like to receive all orders up until now from a single query. So far, I only get as many orders as the number I fill in for &per_page=X. But if I fill in something like 1000, it gives an error. This, plus my common sense, gives me the feeling I'm thinking about it the wrong way!
If it is not possible, is looping through all pages the second best thing?
I've managed to connect to the API via basic auth. The following query returns orders, but only 10:
https://XXXX.nl/wp-json/wc/v3/orders?consumer_key=XXXX&consumer_secret=XXXX
I've tried increasing per_page, but I do not think this is the right way to get all orders in one table.
My current mindset is to receive all orders up until now in a single query, but it also feels like this is not the common way to do it. Is looping through all pages the second-best option?
Thanks in advance for your responses. I am more of a data analyst than a data engineer or scientist, and I hope your answers will help me toward my goal of being more of a scientist :)
It's possible by passing the "per_page" param with the request:
per_page (integer): Maximum number of items to be returned in result set. Default is 10.
Try -1 as the value.
https://woocommerce.github.io/woocommerce-rest-api-docs/?php#list-all-orders
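If the server rejects -1 or large values (as it did with 1000 above; stock WooCommerce installs typically cap per_page at 100), looping through the pages is the dependable fallback. A minimal sketch in Python with requests, using the placeholder URL and keys from the question:
import requests

BASE = "https://XXXX.nl/wp-json/wc/v3/orders"
AUTH = {"consumer_key": "XXXX", "consumer_secret": "XXXX"}

def fetch_all_orders(per_page=100):
    # Page through /orders until a short (or empty) page signals the end.
    orders, page = [], 1
    while True:
        resp = requests.get(BASE, params=dict(AUTH, per_page=per_page, page=page))
        resp.raise_for_status()
        batch = resp.json()
        orders.extend(batch)
        if len(batch) < per_page:   # last page reached
            break
        page += 1
    return orders
The response also carries an X-WP-TotalPages header if you'd rather iterate over a known page count instead of probing for a short page.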

How to avoid redundant results in the Yahoo Answers API

I have a question about the Yahoo Answers API. I plan to use questionSearch, getByCategory, getQuestion, and getByUser. For example, I used getByCategory to query. Each time I call the function, I can query a maximum of 50 questions. However, many of the returned questions are the same ones that were already returned in previous calls. How can I remove this redundancy?
The API doesn't track what it has returned to you previously, as it's stateless.
This leaves you with two options that I can think of.
1) After you get your data back, filter out what you already have. This requires checking what is already displayed and then not displaying duplicate items.
2) Store all the IDs you are showing in a list, then adjust your YQL query so that it passes that list of IDs as ones not to return. Like:
select * from answers.getbycategory where category_id=2115500137 and type="resolved" and id not in ('20140216060544AA0tCLE', '20140215125452AAcNRTq', '20140215124804AAC1cQl');
The downside of this is that it could affect performance, since your YQL queries will start to take longer and longer to return.
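For what it's worth, both options look roughly like this in Python (the "Id" field name and the shape of the parsed results are my assumptions about how you deserialize the API response):
def dedupe(new_items, seen_ids):
    # Option 1: client-side filter against the ids already shown.
    fresh = [q for q in new_items if q["Id"] not in seen_ids]
    seen_ids.update(q["Id"] for q in fresh)
    return fresh

def exclusion_query(category_id, seen_ids):
    # Option 2: build the YQL query that excludes the ids already shown.
    id_list = ", ".join("'%s'" % i for i in sorted(seen_ids))
    return ("select * from answers.getbycategory "
            "where category_id=%s and type=\"resolved\" "
            "and id not in (%s)" % (category_id, id_list))

seen = set()
batch = [{"Id": "20140216060544AA0tCLE"}, {"Id": "20140215125452AAcNRTq"}]
print(len(dedupe(batch, seen)))   # -> 2, both new
print(len(dedupe(batch, seen)))   # -> 0, already seen
print(exclusion_query(2115500137, seen))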

Difference Between FOR FIRST and FIND FIRST in Progress 4GL

I'm not clear about the queries below and am curious to know what the difference between them is, even though both retrieve the same results. (Database used: sports2000.)
FOR EACH Customer WHERE State = "NH",
FIRST Order OF Customer:
DISPLAY Customer.Cust-Num NAME Order-Num Order-Date.
END.
FOR EACH Customer WHERE State = "NH":
FIND FIRST Order OF Customer NO-ERROR.
IF AVAILABLE Order THEN
DISPLAY Customer.Cust-Num NAME Order-Num Order-Date.
END.
Please explain this to me.
Regards,
Suga
As AquaAlex says, your first snippet is a join (the "," part of the syntax makes it a join) and has all of the pros and cons he mentions. There is, however, a significant additional "con": the join is being made with FIRST, and FOR ... FIRST should never be used.
FOR LAST - Query, giving wrong result
It will eventually bite you in the butt.
FIND FIRST is not much better.
The fundamental problem with both statements is that they imply that there is an order which your desired record is the FIRST instance of. But no part of the statement specifies that order. So in the event that there is more than one record that satisfies the query you have no idea which record you will actually get. That might be ok if the only reason that you are doing this is to probe to see if there is one or more records and you have no intention of actually using the record buffer. But if that is the case then CAN-FIND() would be a better statement to be using.
There is a myth that FIND FIRST is supposedly faster. If you believe this, or know someone who does, I urge you to test it. It is not true. It is true that in the case where FIND returns a large set of records adding FIRST is faster -- but that is not apples to apples. That is throwing away the bushel after randomly grabbing an apple. And if you code like that your apple now has magical properties which will lead to impossible to cure bugs.
OF is also problematic. OF implies a WHERE clause based on the compiler guessing that fields with the same name in both tables and which are part of a unique index can be used to join the tables. That may seem reasonable, and perhaps it is, but it obscures the code and makes the maintenance programmer's job much more difficult. It makes a good demo but should never be used in real life.
Your first statement is a join, which means less network traffic. And you will only receive records where both the customer and the order record exist, so you do not need to do any further checks. (MORE EFFICIENT)
The second statement will retrieve each customer and then, for each customer found, do a FIND on Order. Because there may not be an order, you need an additional statement (IF AVAILABLE) as well. This is a less efficient way to retrieve the records and will result in much more unwanted network traffic and more statements being executed.

How to fetch a continuous list with PostgreSQL on the web

I am making an API over HTTP that fetches many rows from PostgreSQL with pagination. Ordinarily, I would implement such pagination with a naive OFFSET/LIMIT clause. However, there are some special requirements in this case:
There are so many rows that I believe users cannot reach the end (imagine a Twitter timeline).
Pages do not have to be randomly accessible, only sequentially.
The API would return a URL containing a cursor token that points to the next page of the continuous chunks.
Cursor tokens do not have to exist permanently, only for some time.
The ordering fluctuates frequently (like Reddit rankings); however, existing cursors should keep their consistent ordering.
How can I achieve this? I am ready to change my whole database schema for it!
Assuming it's only the ordering of the results that fluctuates and not the data in the rows, Fredrik's answer makes sense. However, I'd suggest the following additions:
Store the id list in a PostgreSQL table using the array type rather than in memory. Doing it in memory, unless you carefully use something like Redis with auto-expiry and memory limits, is setting yourself up for a DoS memory-consumption attack. I imagine it would look something like this:
create table foo_paging_cursor (
    cursor_token uuid primary key, -- a uuid is probably best, or a timestamp (see below)
    result_ids integer[],          -- or text[] if you have non-integer ids
    expiry_time timestamp
);
You need to decide whether the cursor_token and result_ids can be shared between users to reduce your storage needs and the time needed to run the initial query per user. If they can be shared, choose a cache window, say 1 or 5 minutes, and then upon a new request create the cursor_token for that time period and check whether the result ids have already been calculated for that token. If not, add a new row for that token. You should probably add a lock around the check/insert code to handle concurrent requests for a new token.
Have a scheduled background job that purges old tokens/results and make sure your client code can handle any errors related to expired/invalid tokens.
Don't even consider using real db cursors for this.
Keeping the result ids in Redis lists is another way to handle this (see the LRANGE command), but be careful with expiry and memory usage if you go down that path. Your Redis key would be the cursor_token and the ids would be the members of the list.
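To make the table-based variant concrete, here is a sketch in Python with psycopg2, assuming the foo_paging_cursor table above (with cursor_token as a uuid), a placeholder DSN, an illustrative foo ranking query, and a one-hour expiry; pages here are 0-based:
import uuid
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN

def open_cursor(ranking_query):
    # Run the expensive ordered query once and freeze the ids under a token.
    token = str(uuid.uuid4())
    with conn, conn.cursor() as cur:
        cur.execute(ranking_query)  # e.g. "SELECT id FROM foo ORDER BY score DESC"
        ids = [row[0] for row in cur.fetchall()]
        cur.execute(
            "INSERT INTO foo_paging_cursor (cursor_token, result_ids, expiry_time)"
            " VALUES (%s, %s, now() + interval '1 hour')",
            (token, ids))
    return token

def fetch_page(token, page, page_size=50):
    # Slice the frozen id list; ordering stays stable however rankings move.
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT result_ids[%s:%s] FROM foo_paging_cursor"
            " WHERE cursor_token = %s AND expiry_time > now()",
            (page * page_size + 1, (page + 1) * page_size, token))
        row = cur.fetchone()
        return row[0] if row else None  # None -> expired or unknown token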
I know absolutely nothing about PostgreSQL, but I'm a pretty decent SQL Server developer, so I'd like to take a shot at this anyway :)
How many rows/pages do you expect a user to browse through per session, at most? For instance, if you expect a user to page through a maximum of 10 pages per session [each page containing 50 rows], you could take that max and set up the web service so that when the user requests the first page, you cache 10*50 rows (or just the ids of the rows, depending on how much memory/how many simultaneous users you've got).
This would certainly help speed up your web service, in more ways than one. And it's quite easy to implement, too. So:
When a user requests data for page #1, run a query (complete with order by, joins, etc.), store all the ids in an array (but a maximum of 500 ids), and return the data rows corresponding to the ids at positions 0-49.
When the user requests pages #2-10, return the data rows corresponding to the ids at positions (page-1)*50 through (page*50)-1.
You could also bump up the numbers; an array of 500 ints only occupies about 2 KB of memory, but it also depends on how fast you want your initial query/response to be.
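In Python terms, the position arithmetic amounts to the following (assuming 1-based page numbers and 50 rows per page):
PAGE_SIZE = 50

def page_slice(ids, page):
    # Page 1 -> positions 0-49, page 2 -> positions 50-99, and so on.
    start = (page - 1) * PAGE_SIZE
    return ids[start:start + PAGE_SIZE]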
I've used a similar technique on a live website, and when the user continued past page 10, I just switched to queries. I guess another solution would be to continue to expand/fill the array (running the query again, but excluding already-included ids).
Anyway, hope this helps!

How to get total number of potential results in Lucene

I'm using Lucene on a site of mine and I want to show the total result count from a query, for example:
Showing results x to y of z
But I can't find any method that returns the total number of potential results. I can only seem to find methods where you have to specify the number of results you want, and since I only want 10 per page, it seems logical to pass in 10 as the number of results.
Or am I doing this wrong, should I be passing in say 1000 and then just taking the 10 in the range that I require?
BTW, since I know you personally, I should point out for others that I already knew you were referring to Lucene.Net and not Lucene :) although the API would be the same.
In versions prior to 2.9.x you could call IndexSearcher.Search(Query query, Filter filter), which returned a Hits object, one of whose properties [methods, technically, due to the Java port] was Length().
This is now marked Obsolete since it will be removed in 3.0; the only remaining Search overloads return TopDocs or TopFieldDocs objects.
Your alternatives are
a) Use IndexSearcher.Search(Query query, int count), which will return a TopDocs object, so TopDocs.TotalHits will show you the total possible hits, but at the expense of actually creating <count> results.
b) A faster way is to implement your own Collector object (inherit from Lucene.Net.Search.Collector) and call IndexSearcher.Search(Query query, Collector collector). The search method will call Collect(int docId) on your collector for every match, so if you keep a count internally you have a way of tallying all the results.
It should be noted Lucene is not a total-resultset query environment and is designed to stream the most relevant results to you (the developer) as fast as possible. Any method which gives you a "total results" count is just a wrapper enumerating over all the matches (as with the Collector method).
The trick is to keep this enumeration as fast as possible. The most expensive part is deserialisation of Documents from the index, populating each field etc. At least with the newer API design, requiring you to write your own Collector, the principle is made clear by telling the developer to avoid deserialising each result from the index since only matching document Ids and a score are provided by default.
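To illustrate the collector pattern, here is a schematic sketch in Python (not the actual Lucene.Net API; CountingCollector and run_search are stand-ins for your Collector subclass and IndexSearcher.Search): the search calls collect() once per match, and the collector just increments a counter instead of materialising documents.
class CountingCollector:
    # Counts matches without ever deserialising a Document.
    def __init__(self):
        self.total_hits = 0

    def collect(self, doc_id):
        # The engine calls this once per matching document id.
        self.total_hits += 1

def run_search(matching_doc_ids, collector):
    # Stand-in for IndexSearcher.Search(query, collector): the engine
    # streams each matching doc id to the collector as it is found.
    for doc_id in matching_doc_ids:
        collector.collect(doc_id)

collector = CountingCollector()
run_search([4, 17, 42, 99], collector)  # pretend these ids matched the query
print(collector.total_hits)             # -> 4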
The top-docs collector does this for you. For example:
TopDocs topDocs = searcher.search(qry, 10);
int totalHits = topDocs.totalHits;
The above query will count all hits, but return only 10.