OSRM: -different distance/duration for same (sub)route in /route/v1/driving - osrm

I'm retrieving durations/distances for millions of short routes using OSRM. In order to do so efficiently, I batch the rides in the form 'http://localhost:5000/route/v1/driving/longitude_departure1,latitude_departure1;longitude_arrival1,latitude_arrival1;longitude_departure2,latitude_departure2;longitude_arrival2,latitude_arrival2; etc. etc.?overview=false'
I batch the requests to speed things up and then I seperate the Json results, every odd response leg is a result I need. We found out that this will go wrong if more than 500 points are queried (so for me 250 rides) so I cap the query to 250 rides. This was supposed to be okay, untill I discovered that calling an individual ride does not result in the same duration/distance as the batch request.
The examples I used (using localhost:5000):
single ride:
http://router.project-osrm.org/route/v1/driving/5.437881482356651,52.129306402910395;5.43154421722558,52.13201994238261?overview=false
3 batched rides:
http://router.project-osrm.org/route/v1/driving/5.43084097657677,52.14466625721029;5.4286627903353,52.1431390952977;5.437881482356651,52.129306402910395;5.43154421722558,52.13201994238261;5.43154421722558,52.13201994238261;5.4208318605057295,52.1339190601066?overview=false
The results in my local OSRM engine:
Single ride
Distance: 1583.5
Duration: 228.7
Batched ride
Distance: 1651.6
Duration: 268.3
The weird thing is, online the result is the same, but different from my offline results. the latter is probably because of settings (which I did not alter at all) but I do not understand why the batch and single call are different.
Online results
Distance: 1137.1
Duration:350.7
Is there a valid explanation for this behaviour or is this a bug? I can imagine OSRM does not evaluate all options in the batched call to optimize the speed of the request.
I'm not sure which versions I'm running, but I've installed my routing enginge on the 14th of May this year. and I've downloaded Dutch map material from geofrabrik

If someone has the same question, I've found the answer. It turned out it is not a bug, but a setting in de lua profile:
continue_straight_at_waypoint = true,
This line makes sure a car will not make a u-turn in 2 subsequent rides, in a single ride, OSRM always assumes the car is facing in the optimal direction to start, in a batched ride, this is not necessarily the case.
My fix is thus to alter the paramter to false.

Related

Firebase analytics - Unity - time spent on a level

is there any possibility to get exact time spent on a certain level in a game via firebase analytics? Thank you so much 🙏
I tried to use logEvents.
The best way to do so would be measuring the time on the level within your codebase, then have a very dedicated event for level completion, in which you would pass the time spent on the level.
Let's get to details. I will use Kotlin as an example, but it should be obvious what I'm doing here and you can see more language examples here.
firebaseAnalytics.setUserProperty("user_id", userId)
firebaseAnalytics.logEvent("level_completed") {
param("name", levelName)
param("difficulty", difficulty)
param("subscription_status", subscriptionStatus)
param("minutes", minutesSpentOnLevel)
param("score", score)
}
Now see how I have a bunch of parameters with the event? These parameters are important since they will allow you to conduct a more thorough and robust analysis later on, answer more questions. Like, Hey, what is the most difficult level? Do people still have troubles on it when the game difficulty is lower? How many times has this level been rage-quit or lost (for that you'd likely need a level_started event). What about our paid players, are they having similar troubles on this level as well? How many people have ragequit the game on this level and never played again? That would likely be easier answer with sql at this point, taking the latest value of the level name for the level_started, grouped by the user_id. Or, you could also have levelName as a UserProperty as well as the EventProperty, then it would be somewhat trivial to answer in the default analytics interface.
Note that you're limited in the number of event parameters you can send per event. The total number of unique parameter names is limited too. As well as the number of unique event names you're allowed to have. In our case, the event name would be level_completed. See the limits here.
Because of those limitations, it's important to name your event properties in somewhat generic way so that you would be able to efficiently reuse them elsewhere. For this reason, I named minutes and not something like minutes_spent_on_the_level. You could then reuse this property to send the minutes the player spent actively playing, minutes the player spent idling, minutes the player spent on any info page, minutes they spent choosing their upgrades, etc. Same idea about having name property rather than level_name. Could as well be id.
You need to carefully and thoughtfully stuff your event with event properties. I normally have a wrapper around the firebase sdk, in which I would enrich events with dimensions that I always want to be there, like the user_id or subscription_status to not have to add them manually every time I send an event. I also usually have some more adequate logging there Firebase Analytics default logging is completely awful. I also have some sanitizing there, lowercasing all values unless I'm passing something case-sensitive like base64 values, making sure I don't have double spaces (so replacing \s+ with " " (space)), maybe also adding the user's local timestamp as another parameter. The latter is very helpful to indicate time-cheating users, especially if your game is an idler.
Good. We're halfway there :) Bear with me.
Now You need to go to firebase and register your eps (event parameters) into cds (custom dimensions and metrics). If you don't register your eps, they won't be counted towards the global cd limit count (it's about 50 custom dimensions and 50 custom metrics). You register the cds in the Custom Definitions section of FB.
Now you need to know whether this is a dimension or a metric, as well as the scope of your dimension. It's much easier than it sounds. The rule of thumb is: if you want to be able to run mathematical aggregation functions on your dimension, then it's a metric. Otherwise - it's a dimension. So:
firebaseAnalytics.setUserProperty("user_id", userId) <-- dimension
param("name", levelName) <-- dimension
param("difficulty", difficulty) <-- dimension (or can be a metric, depends)
param("subscription_status", subscriptionStatus) <-- dimension (can be a metric too, but even less likely)
param("minutes", minutesSpentOnLevel) <-- metric
param("score", score) <-- metric
Now another important thing to understand is the scope. Because Firebase and GA4 are still, essentially just in Beta being actively worked on, you only have user or hit scope for the dimensions and only hit for the metrics. The scope basically just indicates how the value persists. In my example, we only need the user_id as a user-scoped cd. Because user_id is the user-level dimension, it is set separately form the logEvent function. Although I suspect you can do it there too. Haven't tried tho.
Now, we're almost there.
Finally, you don't want to use Firebase to look at your data. It's horrible at data presentation. It's good at debugging though. Cuz that's what it was intended for initially. Because of how horrible it is, it's always advised to link it to GA4. Now GA4 will allow you to look at the Firebase values much more efficiently. Note that you will likely need to re-register your custom dimensions from Firebase in GA4. Because GA4 is capable of getting multiple data streams, of which firebase would be just one data source. But GA4's CDs limits are very close to Firebase's. Ok, let's be frank. GA4's data model is almost exactly copied from that of Firebase's. But GA4 has a much better analytics capabilities.
Good, you've moved to GA4. Now, GA4 is a very raw not-officially-beta product as well as Firebase Analytics. Because of that, it's advised to first change your data retention to 12 months and only use the explorer for analysis, pretty much ignoring the pre-generated reports. They are just not very reliable at this point.
Finally, you may find it easier to just use SQL to get your analysis done. For that, you can easily copy your data from GA4 to a sandbox instance of BQ. It's very easy to do.This is the best, most reliable known method of using GA4 at this moment. I mean, advanced analysts do the export into BQ, then ETL the data from BQ into a proper storage like Snowflake or even s3, or Aurora, or whatever you prefer and then on top of that, use a proper BI tool like Looker, PowerBI, Tableau, etc. A lot of people just stay in BQ though, it's fine. Lots of BI tools have BQ connectors, it's just BQ gets expensive quickly if you do a lot of analysis.
Whew, I hope you'll enjoy analyzing your game's data. Data-driven decisions rock in games. Well... They rock everywhere, to be honest.

Do I have to loop through each 'page' of orders to get all orders in one WooComerce REST Api query?

I've built a KNIME workflow that helps me analyse (sales) data from numerous channels. In the past I used to export all orders manually and use an XSLX or CSV reader but I want to do it via WooCommerce's REST API to reduce manual labor.
I would like to be able to receive all orders up until now from a single query. So far, I only get as many as the # I fill in for &per_page=X. But if I fill in like 1000, it gives an error. This + my common sense give me the feeling I'm thinking the wrong way!
If it is not possible, is looping through all pages the second best thing?
I've managed to connect to the api via basic auth. The following query returns orders, but only 10:
I've tried increasing the number per_page but I do not think this is the right way to get all orders in one table.
https://XXXX.nl/wp-json/wc/v3/orders?consumer_key=XXXX&consumer_secret=XXXX
My current mindset would like to be able to receive all orders up until now from a single query. But it personally also feels like that this is not the common way to do it. Is looping through all pages the second best thing?
Thanks in advance for your responses. I am more of a data analist than a data engineer or scientist and I hope your answers will help me towards my goal of being more of a scientist :)
It's possible by passing params "per_page" with the request
per_page integer Maximum number of items to be returned in result set. Default is 10.
Try -1 as the value
https://woocommerce.github.io/woocommerce-rest-api-docs/?php#list-all-orders

Why Google Analytics API v3 is triggering ALWAYS sampling at 50%?

I have build a very simple crawler for Google Analytics (v3) and it used to work well until this week that I started to get sampled data in all queries.
I used to overcome sampling by simply reducing the date range of the queries, but now I get 50% of all sessions (aprox.), even for sample spaces of less than 100 sessions.
It seems like that something is triggering sampling, but I cannot realize what can be. Anyone has suffered similar issues?
EDITED
We are also suffering sampling when querying the "Users Overview" standard report from GA web interface (along with others), even when there are only 883 sessions and we are asking for a single day.
A sample query is below, where we are querying several metrics over 3 dimensions, with a sample size of 883 sessions and a sampling or around 50% (query URL is cropped, but parameters are listed on "query" key).
It seems that the reason could be related with querying ga:users metric with several dimensions, including ga:appId.
I have tried different combinations and only ga:users is returning sampled data when queried with more dimensions than ga:date.
In summary, if I query any other metric from the example with the same 3 dimensions it returns full space data.
Two weeks ago this was not happening, so I suppose that Google has changed the way ga:users is computed recently.
Moreover, as a side-effect I realized that querying users on batches is somehow misleading if you plan to compute the total number of users, because you cannot simply sum them. That is, ga:users is similar to ga:1dayUsers when queried with ga:date, and then you cannot aggregate data. Also weird is the fact that you cannot use ga:appId with ga:1dayUsers, but you can with ga:users.
We have also detected another problem after discarding ga:users in crawler. The issue is related with segment parameter, that it is also triggering sampling when used in combination with the remaining metrics and dimensions.
We collect data from several apps in the same view (not recommendable, but it is there for legacy reasons). Therefore we use a segment defined on-the-fly like "sessions::condition::ga:appId=#com.xxx.yyy.zzz".
The fact is that when we filter that way we suffer sampling, but if we use a common filter like "ga:appId=com.xxx.yyy.zzz" we do not get sampled results.
Probably the question is why we use the segment-based filter instead of standard filter, and the reason is because we need it for some specific metrics like ga:7dayUsers and related, which cannot be combined with ga:appId as dimension and so you cannot either use ga:appId in filters. Confusingly, for those metrics, when we use the segment-based filter we do not get sampled results.
Now it seems that all our API calls are returning real data.
Not sure yet however, why a default report in web interface like "Users Overview" is returning sampled data for a single day with less than 1000 sessions.
Hope this information could help someone else if having similar issues with sampling.

How to fetch the continuous list with PostgreSQL in web

I am making an API over HTTP that fetches many rows from PostgreSQL with pagination. In ordinary cases, I usually implement such pagination through naive OFFET/LIMIT clause. However, there are some special requirements in this case:
A lot of rows there are so that I believe users cannot reach the end (imagine Twitter timeline).
Pages does not have to be randomly accessible but only sequentially.
API would return a URL which contains a cursor token that directs to the page of continuous chunks.
Cursor tokens have not to exist permanently but for some time.
Its ordering has frequent fluctuating (like Reddit rankings), however continuous cursors should keep their consistent ordering.
How can I achieve the mission? I am ready to change my whole database schema for it!
Assuming it's only the ordering of the results that fluctuates and not the data in the rows, Fredrik's answer makes sense. However, I'd suggest the following additions:
store the id list in a postgresql table using the array type rather than in memory. Doing it in memory, unless you carefully use something like redis with auto expiry and memory limits, is setting yourself up for a DOS memory consumption attack. I imagine it would look something like this:
create table foo_paging_cursor (
cursor_token ..., -- probably a uuid is best or timestamp (see below)
result_ids integer[], -- or text[] if you have non-integer ids
expiry_time TIMESTAMP
);
You need to decide if the cursor_token and result_ids can be shared between users to reduce your storage needs and the time needed to run the initial query per user. If they can be shared, chose a cache window, say 1 or 5 minute(s), and then upon a new request create the cache_token for that time period and then check to see if the results ids have already been calculated for that token. If not, add a new row for that token. You should probably add a lock around the check/insert code to handle concurrent requests for a new token.
Have a scheduled background job that purges old tokens/results and make sure your client code can handle any errors related to expired/invalid tokens.
Don't even consider using real db cursors for this.
Keeping the result ids in Redis lists is another way to handle this (see the LRANGE command), but be careful with expiry and memory usage if you go down that path. Your Redis key would be the cursor_token and the ids would be the members of the list.
I know absolutely nothing about PostgreSQL, but I'm a pretty decent SQL Server developer, so I'd like to take a shot at this anyway :)
How many rows/pages do you expect a user would maximally browse through per session? For instance, if you expect a user to page through a maximum of 10 pages for each session [each page containing 50 rows], you could make take that max, and setup the webservice so that when the user requests the first page, you cache 10*50 rows (or just the Id:s for the rows, depends on how much memory/simultaneous users you got).
This would certainly help speed up your webservice, in more ways than one. And it's quite easy to implement to. So:
When a user requests data from page #1. Run a query (complete with order by, join checks, etc), store all the id:s into an array (but a maximum of 500 ids). Return datarows that corresponds to id:s in the array at positions 0-9.
When the user requests page #2-10. Return datarows that corresponds to id:s in the array at posisions (page-1)*50 - (page)*50-1.
You could also bump up the numbers, an array of 500 int:s would only occupy 2K of memory, but it also depends on how fast you want your initial query/response.
I've used a similar technique on a live website, and when the user continued past page 10, I just switched to queries. I guess another solution would be to continue to expand/fill the array. (Running the query again, but excluding already included id:s).
Anyway, hope this helps!

What ist a RESTful-resource in the context of large data sets, i.E. weather data?

So I am working on a webservice to access our weather forecast data (10000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values).
So I read all about RESTful services and its ideology.
So I understand that an URL is adressing a ressource.
But what is a ressource in my case?
The common use case is that you want to get the data for a couple of parameters over a timespan at one or more location. So clearly giving every value its own URL is not pratical and would result in hundreds of requests. I have the feeling that my specific problem doesn't excactly fit into the RESTful pattern.
Update: To clarify: There are two usage patterns of the service. 1. Raw data; rows and rows of data for several locations and parameters.
Interpreted data; the raw data calculated into symbols (Suns & clouds, for example) and other parameters.
There is not one 'forecast'. Different clients have different needs for data.
The reason I think this doesn't fit into the REST-pattern is, that while I can actually have a 'forecast' ressource, I still have to submit a lot of request parameters. So a simple GET-request on a ressource doesn't work, I end up POSTing data all over the place.
So I am working on a webservice to access our weather forecast data (10000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values). ... But what is a ressource in my case?
That depends on the details of your problem domain. Simply having a large amount of data is not a good reason to avoid REST. There are smart ways and dumb ways to model and expose that data.
As you rightly see, your main goal at this point should be to understand what exactly a resource is. Knowing only enough about weather forecasting to follow the Weather Channel, I won't be much help here. It's for domain experts like yourself to make that call.
If you were to explain in a little more detail the major domain concepts you're working with, it might make it a little easier to give specific advice.
For example, one resource might be Forecast. When weatherpeople talk about Forecasts, what words keep coming up? When you think about breaking a forecast down into smaller elements, what words do you use to describe the pieces?
Do this process recursively, and you'll probably be able to make a list of important terms. Don't forget that these terms can describe things or actions. Think about what these terms really mean, what data you can use to model them, how they can be aggregated.
At this point you'll have the makings of something you can start building a RESTful system around - but not before.
Don't forget that a RESTful system is not a data dump wrapped in HTTP - it's a hypertext-driven system.
Also don't forget that media types are the point of contact between your server and its clients. A media type is only limited by your imagination and can model datasets of any size if you're clever about it. It can contain XML, JSON, YAML, binary elements such as a Bloom Filter, or whatever works for the problem.
Firstly, there is no once-and-for-all right answer.
Each valid url is something that makes sense to query, think of them as equivalents to providing query forms for people looking for your data - that might help you narrow down the scenarios.
It is a matter of personal taste and possibly the toolkit you use, as to what goes into the basic url path and what parameters are encoded. The debate is a bit like the XML debate over putting values in elements vs attributes. It is not always a rational or logically decided issue nor will everybody be kind in their comments on your decisions.
If you are using a backend like Rails, that implies certain conventions. Even if you're not using Rails, it makes sense to work in the same way unless you have a strong reason to change. That way, people writing clients to talk to Rails-based services will find yours easier to understand and it saves you on documentation time ;-)
Maybe you can use forecast as the ressource and go deeper to fine grained services with xlink.
Would it be possible to do something like this,Since you have so many parameters so i was thinking if somehow you can relate it to a mix of id / parameter combo to decrease the url size
/WeatherForeCastService//day/hour
www.weatherornot.com/today/days/x // (where x is number of days)
www.weatherornot.com/today/9am/hours/h // (where h is number of hours)