The proper way to represent ranges in semantic URLs - rest

Working on a little side project, I now have the opportunity to design my very own API. Even if it is not a business endeavor, it's an occasion for me to learn more about REST, resources, collections and URIs.
My service records data points organized in time series and will soon provide an API to easily query ranges of data points from specific series. Data points are immutable, and as such are very good candidates for caching. Time series can be updated only during a limited time window, after which they are archived and become read-only (making them cacheable as well).
I have been looking into the APIs of some companies that provide the same kind of services, and I found the following two patterns:
Define the series in the path and the range in the query:
/series/:id?from=2017-01-26&to=2017-01-27
This is pretty much what most services out there use. I understand it as
the series being the resources/collections, which are then sliced to a specific range. This seems very easy to use from a consumer's point of view, but from a data point of view, the dates in the query express some kind of organization or hierarchy and should, in that case, arguably be part of the path.
Define the series and coordinates in the path:
/series/:x/:y/:z
I didn't find examples of this for time series, but it is the kind of structure used for tile-based map services. To me, this means that each combination of x, y and z is a different collection, which may or may not contain the same resources. It also maps directly to a hierarchy: /series/:x contains all the series with a specific value of x and any value of y and z.
I really like the idea behind method 2, and I started with something like:
/series/:id (all data points from a specific series)
/series/:id/:year (all the data points from a specific series and year)
/series/:id/:year/:month
/series/:id/:year/:month/:day
...
This works pretty well for querying predefined ranges such as "all the data points from 2016" or "all the data points from January 2016". Issues arise when trying to query arbitrary ranges like "all the data points from January 2016 to March 2016".
My first attempt was simply to add the start and end to the path:
/series/:id/:year (all the data points from a specific year)
/series/:id/:fromyear/:toyear (all the data points between fromyear and toyear)
But:
It becomes very long, very quickly. Example: /series/:id/:fromyear/:frommonth/:fromday/:toyear/:tomonth/:today, and potentially very cumbersome depending on the chosen structure: /series/:id/:fromyear/:toyear/:frommonth/:tomonth/:fromday/:today
It doesn't make any sense from a hierarchy or structure point of view. In /series/1/2014/2016, 2016 is not a subset of 2014, and this collection is actually going to return data points from 2014, 2015 and 2016.
It is tricky to handle on the server side. Is /series/1/2016/01/02 supposed to return all the data points for January 2nd, or for the whole January-to-February range?
After noticing the way GitHub references specific lines or ranges of lines in its URL fragments (e.g. #L10-L20), I played with the idea of defining ranges as distinct collections, such as:
/series/:id/:year/:month (same as before)
/series/:id/:year/:frommonth-:tomonth (to get a specific range)
/series/:id/:year/-:tomonth (to get everything from the beginning of the year to tomonth)
/series/:id/:year/:frommonth- (to get everything from frommonth to the end of the year)
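To sanity-check that these segments parse unambiguously (including the open-ended forms), here is a minimal sketch of the idea; the function name and the half-open-interval convention are mine, not from any framework:

import re
from datetime import date

# Hypothetical parser for the month segment of /series/:id/:year/<segment>.
# Accepts "06" (single month), "01-03" (range), "-03" (start of year to March)
# and "06-" (June to end of year), mirroring the URL scheme above.
SEGMENT = re.compile(r"^(?P<frm>\d{1,2})?(?P<dash>-)?(?P<to>\d{1,2})?$")

def month_range(year: int, segment: str) -> tuple[date, date]:
    m = SEGMENT.match(segment)
    if not m or (m["frm"] is None and m["to"] is None):
        raise ValueError(f"bad month segment: {segment!r}")
    frm = int(m["frm"] or 1)                              # "-03" -> start at January
    to = int(m["to"] or (12 if m["dash"] else m["frm"]))  # "06-" -> run to December
    end = date(year + 1, 1, 1) if to == 12 else date(year, to + 1, 1)
    return date(year, frm, 1), end                        # half-open [start, end)

assert month_range(2016, "01-03") == (date(2016, 1, 1), date(2016, 4, 1))
assert month_range(2016, "-03") == (date(2016, 1, 1), date(2016, 4, 1))
assert month_range(2016, "06-") == (date(2016, 6, 1), date(2017, 1, 1))
assert month_range(2016, "06") == (date(2016, 6, 1), date(2016, 7, 1))

Because the dash is part of the segment, a plain "06" can never be confused with a range, which sidesteps the ambiguity noted for /series/1/2016/01/02 above.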
Now, the questions:
Does my solution break any REST or Semantic URL rules/notions/ideas?
Does it improve caching in any way compared to using ranges in the query?
Does it hurt usability for consumers?
Is it unnatural, or does it go against unwritten rules of some frontend frameworks?

Related

Problems with having 2 dim dates in SSAS/Power BI?

I have a basic model with a fact table and 2 dimensions (one of them is a Date dimension).
Now, a new column with a date has been added to the fact table… Therefore I have created a second ‘Dim Date’ and connected the new column to it.
I have the following doubts:
Can I run into any problems in my .pbix or cube if I use 2 date dimensions?
Should I also mark this new ‘Dim Date’ with ‘Mark as date table’? Can I have 2 tables marked as date tables?
This new 'Dim Date' will be used only as a filter in the .pbix; I don't plan on using any time intelligence on it.
It depends:
The Analysis Services Tabular engine that Power BI runs on supports multiple relationships between tables. I would generally recommend using this together with the USERELATIONSHIP() function, so that your measures give the right context to the report.
However, I have found there are situations where using USERELATIONSHIP() in many measures can introduce unnecessary complexity in your model. You can end up with far too many measures, and it can get confusing when two measures that use two different relationships appear in the same visual.
In short: there is nothing inherently wrong with duplicating a dimension, but for data storage optimization and model cleanliness, I would make sure that USERELATIONSHIP() with multiple relationships between fact and dimension will NOT work before duplicating the dimension.

Why is the Google Analytics API v3 ALWAYS triggering sampling at 50%?

I have built a very simple crawler for Google Analytics (v3), and it used to work well until this week, when I started to get sampled data in all queries.
I used to overcome sampling by simply reducing the date range of the queries, but now I get roughly 50% of all sessions, even for sample spaces of less than 100 sessions.
It seems that something is triggering sampling, but I cannot figure out what it could be. Has anyone suffered similar issues?
EDITED
We are also suffering sampling when querying the "Users Overview" standard report from the GA web interface (along with others), even when there are only 883 sessions and we are asking for a single day.
A sample query is below, where we query several metrics over 3 dimensions, with a sample size of 883 sessions and a sampling rate of around 50% (the query URL is cropped, but the parameters are listed under the "query" key).
It seems that the reason could be related to querying the ga:users metric with several dimensions, including ga:appId.
I have tried different combinations, and only ga:users returns sampled data when queried with more dimensions than ga:date.
In summary, if I query any other metric from the example with the same 3 dimensions, it returns full-space data.
Two weeks ago this was not happening, so I suppose Google has recently changed the way ga:users is computed.
Moreover, as a side effect, I realized that querying users in batches is somewhat misleading if you plan to compute the total number of users, because you cannot simply sum them. That is, ga:users behaves like ga:1dayUsers when queried with ga:date, and then you cannot aggregate the data. Also weird is the fact that you cannot use ga:appId with ga:1dayUsers, but you can with ga:users.
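The non-additivity of unique users is easy to see with a toy example (made-up visitor ids, nothing from our property):

day1 = {"u1", "u2", "u3"}       # users active on day 1
day2 = {"u2", "u3", "u4"}       # users active on day 2; u2 and u3 return

print(len(day1) + len(day2))    # 6 -> what summing per-day user counts gives
print(len(day1 | day2))         # 4 -> the actual number of distinct users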
We have also detected another problem after discarding ga:users in the crawler. The issue is related to the segment parameter, which also triggers sampling when used in combination with the remaining metrics and dimensions.
We collect data from several apps in the same view (not recommended, but it is there for legacy reasons). Therefore we use a segment defined on the fly, like "sessions::condition::ga:appId=#com.xxx.yyy.zzz".
The fact is that when we filter that way we suffer sampling, but if we use a common filter like "ga:appId=com.xxx.yyy.zzz" we do not get sampled results.
You may wonder why we use the segment-based filter instead of a standard filter; the reason is that we need it for some specific metrics like ga:7dayUsers and related ones, which cannot be combined with ga:appId as a dimension, so ga:appId cannot be used in filters either. Confusingly, for those metrics, the segment-based filter does not return sampled results.
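For reference, here is a hedged sketch (not our production crawler) of the two filtering styles, using google-api-python-client against the v3 Core Reporting API; the view id, app id, date and the creds object are placeholders you would replace with your own:

from googleapiclient.discovery import build

# `creds` is assumed to be an already-authorized OAuth or service-account
# credentials object; obtaining it is omitted here.
service = build("analytics", "v3", credentials=creds)

common = dict(
    ids="ga:12345678",                 # placeholder view (profile) id
    start_date="2017-01-26",
    end_date="2017-01-26",
    metrics="ga:sessions",
    dimensions="ga:date,ga:appId",
    samplingLevel="HIGHER_PRECISION",  # request the least-sampled response
)

# Standard filter (exact match uses "==") -- did not trigger sampling for us.
filtered = service.data().ga().get(
    filters="ga:appId==com.xxx.yyy.zzz", **common).execute()

# Segment-based filter, as quoted above -- this one did trigger sampling.
segmented = service.data().ga().get(
    segment="sessions::condition::ga:appId=#com.xxx.yyy.zzz", **common).execute()

for name, resp in (("filter", filtered), ("segment", segmented)):
    print(name, resp.get("containsSampledData", False),
          resp.get("sampleSize"), resp.get("sampleSpace"))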
Now it seems that all our API calls are returning real data.
I am still not sure, however, why a default report in the web interface like "Users Overview" returns sampled data for a single day with less than 1000 sessions.
Hope this information helps someone else having similar issues with sampling.

PostgreSQL Modeling Ranges

I'm looking to model a group of users that provide various services that take various amounts of time, and I'm hoping to build on the relatively new range data types that PostgreSQL supports to make things a lot cleaner.
Bookings:
user_id|integer
time|tsrange
service_id|integer
Services:
user_id|integer
time_required|integer #in hours
Users:
id|integer
Services vary between users; some might be identical but take one user 2 hours and another just 1 hour.
Searching for bookings that occur within, or not within, a given time period is easy. I'm having trouble figuring out how best to get all the users who have time available on a given day to perform one or more of their services.
I think I need to find the inverse of their booked ranges, bounded by 6am/8pm on a given day, and then see if the largest range within that set will fit their smallest service, but figuring out how to express that in SQL is eluding me.
Is it possible to take the inverse of ranges, or would there be a better way to approach this?
The question isn't entirely clear, but if I get it right, you're looking for the lead() or lag() window functions to compute available time slots. If so, this should put you on the right track:
select bookings.*,
       -- start of the same user's next booking; null for the last one
       lead("time") over (partition by user_id order by "time") as next_booking_time
from bookings;
http://www.postgresql.org/docs/current/static/tutorial-window.html
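Building on that, here is a hedged sketch (mine, not a tested solution) of the full idea from the question: restrict to a 06:00-20:00 day window, measure each user's gaps with lead(), and keep users whose largest gap fits their smallest service. Edge cases (the gap before the first booking, users with no bookings at all, bookings crossing the window bounds) are left out for brevity; table and column names follow the question's schema:

from datetime import datetime
import psycopg2

SQL = """
with gaps as (
    select b.user_id,
           upper(b."time") as gap_start,
           -- next booking's start for the same user; the end of the day
           -- window stands in as the default after the last booking
           lead(lower(b."time"), 1, %(day_end)s)
               over (partition by b.user_id order by lower(b."time")) as gap_end
    from bookings b
    where b."time" && tsrange(%(day_start)s, %(day_end)s)
)
select g.user_id
from gaps g
join services s on s.user_id = g.user_id
group by g.user_id
having max(extract(epoch from g.gap_end - g.gap_start)) / 3600.0
       >= min(s.time_required);
"""

with psycopg2.connect("dbname=demo") as conn, conn.cursor() as cur:  # placeholder DSN
    cur.execute(SQL, {"day_start": datetime(2017, 1, 26, 6, 0),
                      "day_end": datetime(2017, 1, 26, 20, 0)})
    available = [user_id for (user_id,) in cur.fetchall()]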

What is a RESTful resource in the context of large data sets, e.g. weather data?

So I am working on a web service to access our weather forecast data (10000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values).
I read all about RESTful services and their ideology, so I understand that a URL addresses a resource.
But what is a resource in my case?
The common use case is that you want to get the data for a couple of parameters over a timespan at one or more locations. Clearly, giving every value its own URL is not practical and would result in hundreds of requests. I have the feeling that my specific problem doesn't exactly fit into the RESTful pattern.
Update: To clarify, there are two usage patterns for the service:
1. Raw data: rows and rows of data for several locations and parameters.
2. Interpreted data: the raw data calculated into symbols (suns & clouds, for example) and other parameters.
There is not one 'forecast'. Different clients have different needs for data.
The reason I think this doesn't fit into the REST pattern is that while I can actually have a 'forecast' resource, I still have to submit a lot of request parameters. A simple GET request on a resource doesn't work, so I end up POSTing data all over the place.
So I am working on a web service to access our weather forecast data (10000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values). ... But what is a resource in my case?
That depends on the details of your problem domain. Simply having a large amount of data is not a good reason to avoid REST. There are smart ways and dumb ways to model and expose that data.
As you rightly see, your main goal at this point should be to understand what exactly a resource is. Knowing only enough about weather forecasting to follow the Weather Channel, I won't be much help here. It's for domain experts like yourself to make that call.
If you were to explain in a little more detail the major domain concepts you're working with, it might make it a little easier to give specific advice.
For example, one resource might be Forecast. When weatherpeople talk about Forecasts, what words keep coming up? When you think about breaking a forecast down into smaller elements, what words do you use to describe the pieces?
Do this process recursively, and you'll probably be able to make a list of important terms. Don't forget that these terms can describe things or actions. Think about what these terms really mean, what data you can use to model them, how they can be aggregated.
At this point you'll have the makings of something you can start building a RESTful system around - but not before.
Don't forget that a RESTful system is not a data dump wrapped in HTTP - it's a hypertext-driven system.
Also don't forget that media types are the point of contact between your server and its clients. A media type is only limited by your imagination and can model datasets of any size if you're clever about it. It can contain XML, JSON, YAML, binary elements such as a Bloom Filter, or whatever works for the problem.
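To make the media-type point concrete, here is a small, entirely hypothetical sketch of what one "forecast slice" representation could look like (all field names, values and URLs are invented), with hypermedia links so clients can walk to neighbouring slices instead of constructing URLs themselves:

import json

forecast_slice = {
    "location": "10637",                   # invented station id
    "start": "2017-01-26T00:00Z",
    "hours": 24,
    "parameters": ["temperature_c", "wind_speed_ms"],
    "values": {
        "temperature_c": [3.1, 2.8, 2.5],  # truncated for brevity
        "wind_speed_ms": [5.0, 5.2, 4.9],
    },
    "_links": {
        "self": "/locations/10637/forecast?start=2017-01-26T00:00Z&hours=24",
        "next": "/locations/10637/forecast?start=2017-01-27T00:00Z&hours=24",
    },
}
print(json.dumps(forecast_slice, indent=2))

A single GET with a handful of query parameters like this covers the "couple of parameters over a timespan at one or more locations" use case without any POSTing.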
Firstly, there is no once-and-for-all right answer.
Each valid URL is something that makes sense to query; think of them as equivalents to providing query forms for people looking for your data - that might help you narrow down the scenarios.
It is a matter of personal taste, and possibly of the toolkit you use, what goes into the basic URL path and what is encoded as parameters. The debate is a bit like the XML debate over putting values in elements vs. attributes. It is not always a rational or logically decided issue, nor will everybody be kind in their comments on your decisions.
If you are using a backend like Rails, that implies certain conventions. Even if you're not using Rails, it makes sense to work in the same way unless you have a strong reason to change. That way, people writing clients to talk to Rails-based services will find yours easier to understand and it saves you on documentation time ;-)
Maybe you can use the forecast as the resource and go deeper to fine-grained services with XLink.
Would it be possible to do something like this? Since you have so many parameters, I was thinking you could relate them to a mix of id/parameter combos to decrease the URL size:
/WeatherForeCastService//day/hour
www.weatherornot.com/today/days/x // (where x is number of days)
www.weatherornot.com/today/9am/hours/h // (where h is number of hours)

What's a good data structure for periodic or recurring dates?

Is there a published data structure for storing periodic or recurring dates? Something that can handle:
The pump need recycling every five days.
Payday is every second Friday.
Thanksgiving Day is the second Monday in October (US: the fourth Thursday in November).
Valentine's Day is every February 14th.
Solstice is (usually) every June 21st and December 21st.
Easter is the Sunday after the first full moon on or after the day of the vernal equinox (okay, this one's a bit of a stretch).
I reckon cron's internal data structure can handle #1, #4, #5 (two rules), and maybe #2, but I haven't had a look at it. MS Outlook and other calendars seem to be able to handle the first five, but I don't have that source code lying around.
Use an iCalendar implementation library, like these: ruby, java, php, python, .net and java, and then add support for calculating special dates.
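For example, python-dateutil's rrule module implements the iCalendar (RFC 5545) recurrence rules, and its easter module covers the hardest case directly; a minimal sketch (the dtstart values are arbitrary examples, not from the question):

from datetime import datetime
from dateutil.rrule import rrule, DAILY, WEEKLY, YEARLY, FR, MO
from dateutil.easter import easter

pump = rrule(DAILY, interval=5, dtstart=datetime(2017, 1, 2))    # 1. every five days
payday = rrule(WEEKLY, interval=2, byweekday=FR,                 # 2. every second Friday
               dtstart=datetime(2017, 1, 6))
thanksgiving = rrule(YEARLY, bymonth=10, byweekday=MO(2),        # 3. 2nd Monday in October
                     dtstart=datetime(2017, 1, 1))
valentines = rrule(YEARLY, bymonth=2, bymonthday=14,             # 4. every February 14th
                   dtstart=datetime(2017, 1, 1))
solstice = rrule(YEARLY, bymonth=(6, 12), bymonthday=21,         # 5. ~Jun 21 and Dec 21
                 dtstart=datetime(2017, 1, 1))
print(easter(2017))                                              # 6. -> 2017-04-16
print(payday.between(datetime(2017, 1, 1), datetime(2017, 3, 1)))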
With all these variations in the way you specify the recurrence, I would shy away from one single data structure implementation to accommodate all 5 scenarios.
Instead, I would (and have for a previous project) build simple structures that address each type of recurrence. You could wrap them all up so that it feels like a single data structure, but under the hood they could do whatever they like. By implementing an interface, I was able to treat each type of recurrence similarly so it felt like a one-size-fits-all data structure. I could ask any instance for all the dates of recurrence within a certain time frame, and that did the trick.
I'd also want to know more about how these dates need to be used before settling on a specific implementation.
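The interface idea might look like this minimal sketch (all names invented, not that project's actual code); each recurrence type answers the same "dates within a time frame" question:

from datetime import date, timedelta
from typing import Iterator, Protocol

class Recurrence(Protocol):
    def between(self, start: date, end: date) -> Iterator[date]: ...

class EveryNDays:
    # e.g. "the pump needs recycling every five days"
    def __init__(self, anchor: date, n: int) -> None:
        self.anchor, self.n = anchor, n

    def between(self, start: date, end: date) -> Iterator[date]:
        delta = (start - self.anchor).days
        k = max(0, -(-delta // self.n))               # ceiling division
        d = self.anchor + timedelta(days=k * self.n)  # first occurrence >= start
        while d < end:
            yield d
            d += timedelta(days=self.n)

pump = EveryNDays(date(2017, 1, 2), 5)
print(list(pump.between(date(2017, 1, 1), date(2017, 1, 20))))
# four dates: Jan 2, 7, 12, 17

Other types (NthWeekdayOfMonth, FixedDate, ...) would implement the same method, so callers can mix them freely.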
If you want to hand-build a data structure, I'd recommend a hash table where the holidays or events are keys and the next occurrence date is the value. If each event has multiple occurrences, the hashed value could point to a node in a linked list that holds all the occurrences (this keeps lookup as well as insertion at O(1)).