Caching strategy for location requests - REST

I am building a REST API that returns data (let's say events) in a particular area. The request is a simple GET:
/api/v1/events?lat=<lat>&lng=<lng>&radius=<radius>
The parameters are lat, lng and radius (10 miles by default); the latitude and longitude are whatever the device or browser APIs return. Needless to say, lat and lng change continuously as the user moves, and two users can be in the same vicinity with different lat/lng values. What is the best way to cache such requests on the server so that I don't have to dip into business logic every time? The URL is not going to be unique, since lat/lng change.
Thanks

I'm assuming you have some sort of "grid", and when a user requests a specific coordinate, you return the grid tile(s) around that location. So you have an infinite URL space (coordinates) that maps to a finite number of tiles. One solution is to redirect every request to the canonical, cache-friendly URL for that tile, e.g.
GET /api/v1/events?lat=123&lng=456
=>
302 Found
Location: /api/v1/events?tile=abc
Or, if you want to retain the lat/long info in the URL, you could use the location of the center of the tile.
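To make this concrete, here is a minimal sketch of the redirect idea in TypeScript/Express. The tile size, the tile-id scheme and eventsForTile are illustrative assumptions, not a prescription:

import express from "express";

const app = express();
const TILE_SIZE_DEG = 0.1; // roughly 11 km of latitude per tile; tune to your radius

// Snap a coordinate to the grid cell that contains it.
function tileId(lat: number, lng: number): string {
    return `${Math.floor(lat / TILE_SIZE_DEG)}_${Math.floor(lng / TILE_SIZE_DEG)}`;
}

// Placeholder for your actual business logic, computed once per tile.
function eventsForTile(tile: string): unknown {
    return { tile, events: [] };
}

app.get("/api/v1/events", (req, res) => {
    if (req.query.tile) {
        // Canonical, cache-friendly request: one URL per tile.
        res.json(eventsForTile(String(req.query.tile)));
        return;
    }
    const lat = Number(req.query.lat);
    const lng = Number(req.query.lng);
    if (Number.isNaN(lat) || Number.isNaN(lng)) {
        res.status(400).send("lat and lng are required");
        return;
    }
    // Every nearby coordinate collapses onto the same canonical URL.
    res.redirect(302, `/api/v1/events?tile=${tileId(lat, lng)}`);
});

Any HTTP cache in front of the app (or a simple in-process map keyed by tile id) then only ever sees a small, finite set of URLs.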

I think the best approach is to store the results in a cache keyed by the center coordinates, and then, for a new request, query for cached points within the requested circle.
I'm not aware of any cache engines that allow you to perform spatial queries, so I think you'll have to use a database that allows easy querying and indexing of spatial data. You may use that database for caching your results, or at least store a key to each result in a cache engine somewhere else, and later query them by spatial coordinates, asking for all cached points within a threshold distance of the new request.
There's PostGIS for PostgreSQL, which should be quite straightforward since it has full support for latitude/longitude distance computations. Once you have it set up with proper indexes, it should be as easy as:
SELECT * FROM your_cache_table
WHERE ST_Distance_Sphere(the_geom, ST_MakePoint(new_lon, new_lat)) <= 160.934; -- threshold in meters (about 0.1 mile)
MySQL has some support for the OpenGIS extensions; however, it doesn't have support for latitude/longitude distance computations. You may need to do some calculations yourself; maybe the simple Cartesian distance works for you. Check the MySQL documentation on spatial extensions for the details.
I also believe that even MySQL 5.6 supports spatial indexes only in MyISAM tables, but that shouldn't be an issue since you're using them only for caching.
Managing the cache may be a little more complicated than usual. If you need expiration, you should probably store only keys in the database and set an expire parameter on the cache server; when you hit a database point for which there's no longer a valid key, you clean it from the database. You'll also need a way to invalidate the cache when the primary data changes, removing entries from both the database and the cache server.
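To illustrate the lookup flow (a sketch, not a definitive implementation): here it is with node-postgres against a hypothetical events_cache table. The schema, the names and the 160.934 m reuse threshold are assumptions:

import { Pool } from "pg";

const pool = new Pool();

// Assumed schema, with a spatial index so the lookup stays fast:
//   CREATE TABLE events_cache (
//       cache_key  text PRIMARY KEY,       -- key into your cache server (e.g. Redis)
//       center     geography(Point, 4326), -- center of the cached request
//       expires_at timestamptz
//   );
//   CREATE INDEX events_cache_center_idx ON events_cache USING GIST (center);

// Returns the key of a cached result whose center is within the threshold
// of the new request, or null if the business logic has to run.
async function findCachedKey(lat: number, lng: number): Promise<string | null> {
    const { rows } = await pool.query(
        `SELECT cache_key
           FROM events_cache
          WHERE expires_at > now()
            AND ST_DWithin(center, ST_SetSRID(ST_MakePoint($1, $2), 4326)::geography, 160.934)
          LIMIT 1`,
        [lng, lat] // PostGIS points are (lon, lat)
    );
    return rows.length > 0 ? rows[0].cache_key : null;
}

A stale row found here, or a key that has already expired on the cache server, is exactly the case described above: delete it and fall through to the database.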

I have been data modelling a hobby application that also needs to deal with geolocation data and have had to solve similar problems. The solution will, of course, depend on the constraints that you have and the actual use cases that are crucial to your application's purpose. I will assume that you have the design flexibility to change all aspects of your application, i.e. the technology stack.
Note:
... so that I don't have to dip into business logic every time. The URL is not
going to be unique since lat/lng change.
The above is an ambiguous statement, since not dipping into business logic can mean many things. I will assume it means you don't want to make any database queries to retrieve new data that might be similar to the data you already have in the cache. Assuming also that you only want to cache on the server, the following approaches come to mind:
Cache the data between the database and the application.
Database ---> Cache --> App ---> User
In this approach, your application processes all the REST API calls and then decides whether the results in the cache can be used or another database access is required.
Cache the data between the application and the user.
Database --> App ---> Cache ---> User
This approach is a little tricky, considering that the URL is always changing. So you might need a 'smart caching mechanism' that processes the incoming URL and then decides whether the cached data is relevant. The smart caching can be done in a number of ways, depending on how your application is implemented. Something like MongoDB could be used as a JSON cache, and each incoming request can be preprocessed to see if the data can be returned from this cache or must be passed on to the main application. The structure might look like this:
Database --> App ---> (Some logic(e.g NodeJS app) + Mongodb) ---> User
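A rough sketch of what that preprocessing layer could look like, with Express and MongoDB; the rounding precision and the collection layout are assumptions:

import express from "express";
import { MongoClient } from "mongodb";

interface CachedResponse {
    _id: string;      // rounded "lat:lng" bucket key
    payload: unknown; // JSON previously produced by the main application
}

const client = new MongoClient("mongodb://localhost:27017");
const cache = client.db("api_cache").collection<CachedResponse>("events");

const app = express();

app.get("/api/v1/events", async (req, res, next) => {
    // Two users in the same vicinity round to the same key (~1 km at 2 decimals).
    const key = `${Number(req.query.lat).toFixed(2)}:${Number(req.query.lng).toFixed(2)}`;
    const hit = await cache.findOne({ _id: key });
    if (hit) {
        res.json(hit.payload); // answered from the cache, no business logic
        return;
    }
    next(); // cache miss: hand the request over to the main application,
            // which writes its response back under the same key
});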
Conclusion:
Without knowing the architecture of your solution, the critical use cases and the full design constraints, one cannot really suggest a complete solution to this problem. You might have to rethink certain features you are trying to provide and make tough compromises to get things working. Hopefully, the suggestions provided by different people here will be helpful.

Firebase analytics - Unity - time spent on a level

Is there any way to get the exact time spent on a certain level in a game via Firebase Analytics? Thank you so much 🙏
I tried using logEvent.
The best way to do this is to measure the time on the level within your codebase, then have a dedicated event for level completion, in which you pass the time spent on the level.
Let's get to the details. I will use Kotlin as an example, but it should be obvious what I'm doing here, and the Firebase documentation has examples in other languages.
firebaseAnalytics.setUserProperty("user_id", userId)
firebaseAnalytics.logEvent("level_completed") {
param("name", levelName)
param("difficulty", difficulty)
param("subscription_status", subscriptionStatus)
param("minutes", minutesSpentOnLevel)
param("score", score)
}
Now see how I have a bunch of parameters with the event? These parameters are important, since they will allow you to conduct a more thorough and robust analysis later on and answer more questions. Like: Hey, what is the most difficult level? Do people still have trouble with it when the game difficulty is lower? How many times has this level been rage-quit or lost (for that you'd likely need a level_started event)? What about our paying players, are they having similar trouble with this level as well? How many people have rage-quit the game on this level and never played again? That last one would likely be easier to answer with SQL at this point, taking the latest value of the level name for level_started, grouped by user_id. Or you could have levelName as a user property as well as an event property; then it would be somewhat trivial to answer in the default analytics interface.
Note that you're limited in the number of event parameters you can send per event. The total number of unique parameter names is limited too, as is the number of unique event names you're allowed to have. In our case, the event name would be level_completed. The exact limits are listed in the Firebase documentation.
Because of those limitations, it's important to name your event properties in a somewhat generic way, so that you can efficiently reuse them elsewhere. For this reason, I named the parameter minutes and not something like minutes_spent_on_the_level. You could then reuse this property to send the minutes the player spent actively playing, the minutes spent idling, the minutes spent on any info page, the minutes spent choosing upgrades, etc. Same idea with having a name property rather than level_name. It could just as well be id.
You need to carefully and thoughtfully stuff your event with event properties. I normally have a wrapper around the Firebase SDK in which I enrich events with dimensions that I always want to be there, like user_id or subscription_status, so I don't have to add them manually every time I send an event. I also usually add some more adequate logging there, since Firebase Analytics' default logging is completely awful, plus some sanitizing: lowercasing all values unless I'm passing something case-sensitive like base64 values, making sure I don't have double spaces (replacing \s+ with a single space), and maybe adding the user's local timestamp as another parameter. The latter is very helpful for spotting time-cheating users, especially if your game is an idler.
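The example above is Kotlin, but for illustration here is what such a wrapper might look like with the Firebase Web SDK; the always-on dimensions and the sanitizing rules are just the ones described above, and the user/subscription values are placeholders:

import { initializeApp } from "firebase/app";
import { getAnalytics, logEvent } from "firebase/analytics";

initializeApp({ /* your Firebase config */ });
const analytics = getAnalytics();

// Placeholders; wire these to your real session state.
const currentUserId = "user-123";
const subscriptionStatus = "free";

// Lowercase and collapse whitespace, per the sanitizing rules above.
function sanitize(value: string): string {
    return value.toLowerCase().replace(/\s+/g, " ").trim();
}

export function track(name: string, params: Record<string, string | number>): void {
    const enriched: Record<string, string | number> = {
        user_id: currentUserId,                  // always-on dimensions added once,
        subscription_status: subscriptionStatus, // not at every call site
        client_ts: Date.now(),                   // helps spot time-cheating users
    };
    for (const [key, value] of Object.entries(params)) {
        enriched[key] = typeof value === "string" ? sanitize(value) : value;
    }
    console.debug("analytics event:", name, enriched); // saner logging than the default
    logEvent(analytics, name, enriched);
}

// Usage: track("level_completed", { name: "Forest 2", difficulty: "hard", minutes: 4, score: 1200 });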
Good. We're halfway there :) Bear with me.
Now you need to go to Firebase and register your eps (event parameters) as cds (custom dimensions and metrics). If you don't register your eps, they won't count towards the global cd limit, but they also won't be usable in your reports (the limit is about 50 custom dimensions and 50 custom metrics). You register the cds in the Custom Definitions section of Firebase.
Now you need to know whether this is a dimension or a metric, as well as the scope of your dimension. It's much easier than it sounds. The rule of thumb is: if you want to be able to run mathematical aggregation functions on your dimension, then it's a metric. Otherwise - it's a dimension. So:
firebaseAnalytics.setUserProperty("user_id", userId) <-- dimension
param("name", levelName) <-- dimension
param("difficulty", difficulty) <-- dimension (or can be a metric, depends)
param("subscription_status", subscriptionStatus) <-- dimension (can be a metric too, but even less likely)
param("minutes", minutesSpentOnLevel) <-- metric
param("score", score) <-- metric
Now another important thing to understand is the scope. Because Firebase and GA4 are still essentially in beta and being actively worked on, you only have user or hit scope for the dimensions and only hit scope for the metrics. The scope basically indicates how the value persists. In my example, we only need user_id as a user-scoped cd. Because user_id is a user-level dimension, it is set separately from the logEvent function. Although I suspect you can do it there too. Haven't tried, though.
Now, we're almost there.
Finally, you don't want to use Firebase to look at your data. It's horrible at data presentation, though it's good at debugging, because that's what it was intended for initially. Because of how horrible it is, it's always advised to link it to GA4, which will let you look at the Firebase values much more efficiently. Note that you will likely need to re-register your custom dimensions from Firebase in GA4, because GA4 is capable of receiving multiple data streams, of which Firebase is just one source. GA4's cd limits are very close to Firebase's, though. Ok, let's be frank: GA4's data model is almost exactly copied from Firebase's, but GA4 has much better analytics capabilities.
Good, you've moved to GA4. Now, GA4 is a very raw, not-officially-beta product, just like Firebase Analytics. Because of that, it's advised to first change your data retention to the maximum (14 months) and to use only the Explorer for analysis, pretty much ignoring the pre-generated reports. They are just not very reliable at this point.
Finally, you may find it easier to just use SQL to get your analysis done. For that, you can easily copy your data from GA4 to a sandbox instance of BigQuery. It's very easy to do, and it's the most reliable known method of using GA4 at this moment. I mean, advanced analysts do the export into BQ, then ETL the data from BQ into a proper storage like Snowflake, or even S3, or Aurora, or whatever you prefer, and then, on top of that, use a proper BI tool like Looker, Power BI, Tableau, etc. A lot of people just stay in BQ though; that's fine. Lots of BI tools have BQ connectors; it's just that BQ gets expensive quickly if you do a lot of analysis.
Whew, I hope you'll enjoy analyzing your game's data. Data-driven decisions rock in games. Well... They rock everywhere, to be honest.

Where should data be transformed for the database?

Where should data be transformed for the database? I believe this is called data normalization/sanitization.
In my database I have user created shops- like an Etsy. Let's say a user inputs a price for an item in their shop as "1,000.00". But my database stores the prices as an integer/pennies- "100000". Where should "1,000.00" be converted to "100000"?
These are the two ways I thought of.
In the frontend: The input data is converted from "1,000.00" to "100000" in the frontend, before the HTTP request. In this case, the backend would validate that the price is an integer.
In the backend: "1,000.00" is sent to the backend as is; the backend validates that it is in a valid price format, then converts it to the integer "100000" before storing it in the database.
It seems either would work but is one way better than the other or is there a standard way? I would think the second way is best to reduce code duplication since there is likely to be multiple frontends - mobile, web, etc- and one backend. But the first way also seems cleaner- just send what the api needs.
I'm working with a MERN application if that makes any difference, but I would think this would be language agnostic.
I would go with a mix between the two (which is a best practice, AFAIK).
The frontend has to do some kind of validation anyway, because you don't want to wait for the backend to get a validation response. I would add the actual conversion here as well.
The validation code will have to be adapted to each frontend, I guess, because each one uses a different framework / language, so I don't necessarily see the code duplication here (the logic, yes, but not the actual code). As a last resort, you can create a common validation library.
The backend should validate the converted value sent by the frontends again, just as a double check, and then store it in the database. You never know whether the backend will be integrated with other components in the future, and input sanitization is always a best practice.
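As a concrete sketch of that split (the names and locale rules are simplified assumptions: "," as thousands separator, "." as decimal point):

// Frontend: convert the display string to integer pennies before the request.
function priceToPennies(input: string): number {
    const normalized = input.replace(/,/g, "");
    if (!/^\d+(\.\d{1,2})?$/.test(normalized)) {
        throw new Error(`Not a valid price: ${input}`);
    }
    return Math.round(parseFloat(normalized) * 100);
}

// Backend: never trust the client; re-check the converted value on arrival.
function assertPennies(value: unknown): number {
    if (typeof value !== "number" || !Number.isInteger(value) || value < 0) {
        throw new Error("price must be a non-negative integer number of pennies");
    }
    return value;
}

priceToPennies("1,000.00"); // 100000, which is what the database stores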
Short version
Both ways would work and I think there is no standard way. I prefer formatting in the frontend.
Long version
What I do in such cases is to look at my expected business requirements and make a list of pros and cons. Here are just a few thoughts about this.
If you decide to do every transformation of the price (formatting, normalization, sanitization) in the frontend, your backend will stay smaller and you will have fewer endpoints, or endpoints with fewer options. Depending on the frontend, you can choose the perfect fit for your end user. The amount of code delivered by the backend stays smaller too, because the frontend application can be cached and does all the formatting work.
If you implement everything in the backend, you have full control over which format is delivered to your users. Especially when dealing with a lot of different languages, it can be helpful to get the correct display value directly from the server.
Furthermore, it can be helpful to take a look at some different APIs of well-known providers and how these handle prices.
The Paypal API uses an amount object to transfer prices as decimals together with a currency code.
"amount": {
"value": "9.87",
"currency": "USD"
}
It's up to you how to handle it in the frontend. Here is a link with an example request from the docs:
https://developer.paypal.com/docs/api/payments.payouts-batch/v1#payouts_post
Stripe uses a slightly different model.
{
    unit_amount: 1600,
    currency: 'usd',
}
It uses integer values in the smallest unit of the currency as the amount, together with a currency code, to describe prices. Here are two examples to make it clearer:
https://stripe.com/docs/api/prices/create?lang=node
https://stripe.com/docs/checkout/integration-builder
In both cases, normalization and sanitization have to be done before making requests. The response will also need formatting before showing it to the user. Of course, most of these requests are made by backend code. But if you look at the prebuilt checkout pages from Stripe or PayPal, these also use normalized and sanitized values for their frontend integrations: https://developer.paypal.com/docs/business/checkout/configure-payments/single-page-app
Conclusion/My opinion
I would always prefer keeping the backend as simple as possible, for security reasons. Less code (i.e. fewer endpoints) means a smaller attack surface. More configuration possibilities mean a lot more effort to make the application secure. Furthermore, you could write another (micro)service which takes over some transformation tasks, if you have a business requirement to deliver everything perfectly formatted from the backend. Example use cases: if you have a preference for backend code over frontend code (think about your/your team's skills), if you want to deploy a lot of different frontends and want to make sure that they all use a single source of truth for their display values, or if you need to fulfill regulatory requirements to know exactly what is delivered to your user.
Thank you for your question and I hope I have given you some guidance for your own decision. In the end, you will always wish you had chosen a different approach anyway... ;)
For sanitisation, it has to be on the back end for security reasons. Sanitisation is concerned with ensuring that only certain fields' values from your web form are even entertained. It's like a guest list at an exclusive club: it's concerned not just with the field (e.g. username) but also the value (e.g. bobbycat, or "); DROP TABLE users;"). So it's about ensuring the security of your database.
Normalisation, on the other hand, is concerned with transforming the data before storing it in the database. It's the example you brought up: "1,000.00" to "100000", because you are storing prices as integers without decimals in the database.
Where does this code belong? I think there's no clear winner because it depends on your use case.
If it's a simple matter, like making sure the value is an integer and not a string, you should offload that to the web form (i.e. the front end), since forms already have a "type" attribute to enforce this.
But imagine a more complicated scenario. Let's say you're building an app that allows users to construct a Facebook ads campaign (your app being the third party developer app, like Smartly.io). That means there will be something like 30 form fields that must be filled out before the user hits "create campaign". And the value in some form fields affect the validity of other parts of the form.
In such a situation, it might make sense to put at least some of the validation in the back end, because there is a series of operations your back end needs to run (like creating the Post, then creating the Ad) in order to ensure validity. It wouldn't make sense to code those validations and normalisations in the front end.
So in short, it's a balance you'll need to strike. Offload the easy stuff to the front end, leveraging web APIs and form validations. Leave the more complex normalisation steps to the back end.
On a related note, there's a broader concept of ETL (extract, transform, load) that you'd use if you were trying to consume data from another service and then transforming it to fit the way you store info in your own database. In that case, it's usually a good idea to keep it as a repository on its own - using something like Apache Airflow to manage and visualise the cron jobs.

REST design principles: Referencing related objects vs Nesting objects

My team and I are refactoring a REST API and I have come to a question.
For brevity, let us assume that we have an SQL database with four tables: Teachers, Students, Courses and Classrooms.
Right now, all the relations between the items are represented in the REST API through referencing the URL of the related item. For example, for a course we could have the following:
{ "id":"Course1", "teacher": "http://server.com/teacher1", ... }
In addition, if I ask for a list of courses through a GET call to /courses, I get a list of references, as shown below:
{
    ... // pagination details
    "items": [
        {"href": "http://server1.com/course1"},
        {"href": "http://server1.com/course2"},
        ...
    ]
}
All this is nice and clean, but if I want a list of all the course titles with the teachers' names, and I have 2000 courses and 500 teachers, I have to do the following:
Make approximately 2500 queries just to read the data.
Implement the join between the teachers and the courses myself.
Optimize with caching etc., so that it all runs as fast as possible.
My problem is that this method creates a lot of network traffic, with thousands of REST API calls, and that I have to re-implement the natural join that the database would do far more efficiently. Colleagues say that this approach is the standard way of implementing a REST API, but then a relatively simple query becomes a big hassle.
My question therefore is:
1. Is it wrong if we nest the teacher information in the courses?
2. Should the listing of items, e.g. GET /courses, return a list of references or a list of items?
Edit: After some research I would say the model I have in mind corresponds mainly to the one shown in jsonapi.org. Is this a good approach?
My problem is that this method creates a lot of network traffic, with thousands of REST API calls, and that I have to re-implement the natural join that the database would do far more efficiently. Colleagues say that this approach is the standard way of implementing a REST API, but then a relatively simple query becomes a big hassle.
Your colleagues have lost the plot.
Here's your heuristic - how would you support this use case on a web site?
You would probably do it by defining a new web page that produces the report you need. You'd run the query, use the result set to generate a bunch of HTML, and ta-da! The client has the information they need in a standardized representation.
A REST API is the same thing, with more emphasis on machine readability. Create a new document, with a schema so that your clients can understand the semantics of the document you return to them, tell the clients how to find the target URI for that document, and voilà.
Creating new resources to handle new use cases is the normal approach to REST.
Yes, I totally think you should design something similar to jsonapi.org. As a rule of thumb, I would say "prefer a solution that requires fewer network calls". That's especially true if the number of network calls drops by an order of magnitude.
Of course, it doesn't eliminate the need to limit the request/response size if it becomes unreasonable.
Real-life solutions must strike a proper balance. A clean API is nice as long as it works.
So in your case I would do something like:
GET /courses?include=teachers
Or
GET /courses?includeTeacher=true
Or
GET /courses?includeTeacher=brief|full
In the last one, the response can contain only the teacher's id for brief and full teacher details for full.
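For illustration, here is a hypothetical Express handler for the first variant; the table and column names are assumptions. The point is that one SQL join on the server replaces the ~2500 client-side requests:

import express from "express";
import { Pool } from "pg";

const pool = new Pool();
const app = express();

app.get("/courses", async (req, res) => {
    if (req.query.include === "teachers") {
        // One join server-side instead of N follow-up requests client-side.
        const { rows } = await pool.query(
            `SELECT c.id, c.title, t.id AS teacher_id, t.name AS teacher_name
               FROM courses c
               JOIN teachers t ON t.id = c.teacher_id`
        );
        res.json({
            items: rows.map((r) => ({
                id: r.id,
                title: r.title,
                teacher: { id: r.teacher_id, name: r.teacher_name }, // embedded
            })),
        });
        return;
    }
    // Default behaviour: a plain list of references, as before.
    const { rows } = await pool.query(`SELECT id FROM courses`);
    res.json({ items: rows.map((r) => ({ href: `/courses/${r.id}` })) });
});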
My problem is that this method creates a lot of network traffic, with thousands of REST API calls, and that I have to re-implement the natural join that the database would do far more efficiently. Colleagues say that this approach is the standard way of implementing a REST API, but then a relatively simple query becomes a big hassle.
Have you actually measured the overhead generated by each request? If not, how do you know that the overhead will be too high? From an object-oriented programmer's perspective it may sound bad to perform each call on its own; your design, however, lacks one important asset which helped the Web grow to its current size: caching.
Caching can occur at multiple levels. You can do it at the API level, the client might do something, or an intermediary server might do it. Fielding even made it a constraint of REST! So, if you want to comply with the REST architectural philosophy, you should also support caching of responses. Caching helps to reduce the number of requests that have to be calculated or even processed by a single server. With the help of stateless communication you might even introduce a multitude of servers that all perform calculations for billions of requests and act as one cohesive system to the client. An intermediary cache may further reduce the number of requests that actually reach the server significantly.
A URI as a whole (including any path, matrix or query parameters) is actually a key for a cache. Upon receiving a GET request, a cache checks whether it already contains a stored response for that URI and, if the stored data is "fresh enough", returns it on behalf of the server directly to the client. If the stored data has exceeded its freshness threshold, the cache throws it away and routes the request to the next hop in line (which might be the actual server, or a further intermediary).
Spotting resources that are ideal for caching is not always easy, though the majority of data doesn't change quickly enough to justify neglecting caching completely. Thus it should be, at least, of general interest to introduce caching, especially the more traffic your API produces.
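For completeness, opting in to that machinery mostly means sending freshness hints with the response; a hedged sketch (the lifetime and the validator value are arbitrary):

import express from "express";

const app = express();

app.get("/courses/:id", (req, res) => {
    // Any intermediary (or the browser) may now answer repeat GETs for this
    // URI for 5 minutes without the request ever reaching this server.
    res.set("Cache-Control", "public, max-age=300");
    res.set("ETag", `"course-${req.params.id}-v1"`); // illustrative validator
    res.json({ id: req.params.id, title: "Databases 101" });
});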
While certain media types such as HAL JSON, JSON:API, ... allow you to embed content gathered from related resources into the response, embedding content has some potential drawbacks, such as:
Utilization of the cache might be low, due to mixing data that changes quickly with data that is more static
The server might calculate data the client won't need
One server calculates the whole response
If related resources are only linked to instead of directly embedded, a client certainly has to fire off a further request to obtain that data, but that request is more likely to be (partly) served by a cache, which, as mentioned a couple of times now throughout this post, reduces the workload on the server. Besides that, a positive side effect could be that you gain more insight into what clients are actually interested in (e.g. if you run an intermediary cache yourself).
Is it wrong if we nest the teacher information in the courses?
It is not wrong, but it might not be ideal, as explained above.
Should the listing of items, e.g. GET /courses, return a list of references or a list of items?
It depends. There is no right or wrong.
As REST is just a generalization of the interaction model used on the Web, basically the same concepts apply to REST as well. Depending on the size of the "item", it might be beneficial to return a short summary of the item's content and add a link to the item itself; similar things are done on the Web as well. For a list of students enrolled in a course, this might be the name and matriculation number, plus a link under which further details of that student can be requested, accompanied by a link-relation name that gives the link some semantic context, which a client can use to decide whether invoking that URI makes sense or not.
Such link-relation names are either standardized by IANA, taken from common approaches such as Dublin Core or schema.org, or defined as custom extensions as per RFC 8288 (Web Linking). For the above-mentioned list of students enrolled in a course, you could e.g. make use of the about relation name to hint to a client that further information on the current item can be found by following the link. If you want to enable pagination, the relations first, next, prev and last can, and probably should, be used as well, and so forth.
This is actually what HATEOAS is all about: linking data together and giving the links meaningful relation names to span a kind of semantic net between resources. By simply embedding things into a response, such semantic graphs might be harder to build and maintain.
In the end, it basically boils down to an implementation choice whether you want to embed or reference resources. I hope I could shed some light on the usefulness of caching and the benefits it can yield, especially in large-scale systems, as well as on the benefit of providing link-relation names for URIs, which enhance the semantic context of the relations used within your API.

How to fetch the continuous list with PostgreSQL in web

I am making an API over HTTP that fetches many rows from PostgreSQL with pagination. In ordinary cases, I usually implement such pagination through a naive OFFSET/LIMIT clause. However, there are some special requirements in this case:
There are so many rows that I believe users cannot reach the end (imagine a Twitter timeline).
Pages do not have to be randomly accessible, only sequentially.
The API would return a URL which contains a cursor token that points to the page of continuous chunks.
Cursor tokens do not have to exist permanently, only for some time.
The ordering fluctuates frequently (like Reddit rankings), but existing cursors should keep their ordering consistent.
How can I achieve the mission? I am ready to change my whole database schema for it!
Assuming it's only the ordering of the results that fluctuates and not the data in the rows, Fredrik's answer makes sense. However, I'd suggest the following additions:
Store the id list in a PostgreSQL table using the array type rather than in memory. Doing it in memory, unless you carefully use something like Redis with auto-expiry and memory limits, is setting yourself up for a memory-consumption DoS attack. I imagine it would look something like this:
create table foo_paging_cursor (
    cursor_token uuid primary key, -- a uuid is probably best, or a timestamp (see below)
    result_ids integer[],          -- or text[] if you have non-integer ids
    expiry_time timestamp
);
You need to decide whether the cursor_token and result_ids can be shared between users, to reduce your storage needs and the time needed to run the initial query per user. If they can be shared, choose a cache window, say 1 or 5 minutes, and upon a new request create the cache_token for that time period and check whether the result ids have already been calculated for that token. If not, add a new row for that token. You should probably add a lock around the check/insert code to handle concurrent requests for a new token.
Have a scheduled background job that purges old tokens/results and make sure your client code can handle any errors related to expired/invalid tokens.
Don't even consider using real db cursors for this.
Keeping the result ids in Redis lists is another way to handle this (see the LRANGE command), but be careful with expiry and memory usage if you go down that path. Your Redis key would be the cursor_token and the ids would be the members of the list.
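To tie this together, here is a sketch of both steps with node-postgres; the posts table, the ranking column, the page size and the 15-minute expiry are illustrative assumptions:

import { randomUUID } from "crypto";
import { Pool } from "pg";

const pool = new Pool();
const PAGE_SIZE = 50;

// First request: run the expensive ordered query once and freeze the id list.
async function openCursor(): Promise<string> {
    const token = randomUUID();
    await pool.query(
        `INSERT INTO foo_paging_cursor (cursor_token, result_ids, expiry_time)
         SELECT $1::uuid, array_agg(id ORDER BY rank DESC), now() + interval '15 minutes'
           FROM posts`,
        [token]
    );
    return token;
}

// Later requests: slice the frozen list, then fetch those rows in list order.
async function fetchPage(token: string, page: number) {
    const { rows } = await pool.query(
        `SELECT result_ids[$2:$3] AS ids
           FROM foo_paging_cursor
          WHERE cursor_token = $1::uuid AND expiry_time > now()`,
        [token, page * PAGE_SIZE + 1, (page + 1) * PAGE_SIZE]
    );
    if (rows.length === 0) throw new Error("expired or unknown cursor token");
    // WITH ORDINALITY preserves the frozen ordering even though rankings moved on.
    const { rows: posts } = await pool.query(
        `SELECT p.*
           FROM unnest($1::int[]) WITH ORDINALITY AS u(id, ord)
           JOIN posts p ON p.id = u.id
          ORDER BY u.ord`,
        [rows[0].ids]
    );
    return posts;
}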
I know absolutely nothing about PostgreSQL, but I'm a pretty decent SQL Server developer, so I'd like to take a shot at this anyway :)
How many rows/pages do you expect a user would maximally browse through per session? For instance, if you expect a user to page through a maximum of 10 pages per session [each page containing 50 rows], you could take that max and set up the web service so that when the user requests the first page, you cache 10*50 rows (or just the ids of the rows, depending on how much memory / how many simultaneous users you've got).
This would certainly help speed up your web service, in more ways than one. And it's quite easy to implement, too. So:
When a user requests data from page #1, run a query (complete with order by, joins, etc.), store all the ids in an array (but a maximum of 500 ids), and return the data rows that correspond to the ids at positions 0-49 in the array.
When the user requests pages #2-10, return the data rows that correspond to the ids at positions (page-1)*50 through page*50-1 in the array.
You could also bump up the numbers; an array of 500 ints only occupies about 2 KB of memory, but it also depends on how fast you want your initial query/response to be.
I've used a similar technique on a live website, and when the user continued past page 10, I just switched to plain queries. I guess another solution would be to continue expanding/filling the array (running the query again, but excluding already included ids).
Anyway, hope this helps!

What is a RESTful resource in the context of large data sets, e.g. weather data?

So I am working on a web service to access our weather forecast data (10000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values).
So I read all about RESTful services and their ideology.
I understand that a URL addresses a resource.
But what is a resource in my case?
The common use case is that you want to get the data for a couple of parameters over a timespan at one or more locations. So clearly giving every value its own URL is not practical and would result in hundreds of requests. I have the feeling that my specific problem doesn't exactly fit into the RESTful pattern.
Update: To clarify, there are two usage patterns of the service:
1. Raw data: rows and rows of data for several locations and parameters.
2. Interpreted data: the raw data calculated into symbols (suns & clouds, for example) and other parameters.
There is not one 'forecast'. Different clients have different needs for data.
The reason I think this doesn't fit into the REST pattern is that while I can actually have a 'forecast' resource, I still have to submit a lot of request parameters. So a simple GET request on a resource doesn't work; I end up POSTing data all over the place.
So I am working on a web service to access our weather forecast data (10000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values). ... But what is a resource in my case?
That depends on the details of your problem domain. Simply having a large amount of data is not a good reason to avoid REST. There are smart ways and dumb ways to model and expose that data.
As you rightly see, your main goal at this point should be to understand what exactly a resource is. Knowing only enough about weather forecasting to follow the Weather Channel, I won't be much help here. It's for domain experts like yourself to make that call.
If you were to explain in a little more detail the major domain concepts you're working with, it might make it a little easier to give specific advice.
For example, one resource might be Forecast. When weatherpeople talk about Forecasts, what words keep coming up? When you think about breaking a forecast down into smaller elements, what words do you use to describe the pieces?
Do this process recursively, and you'll probably be able to make a list of important terms. Don't forget that these terms can describe things or actions. Think about what these terms really mean, what data you can use to model them, how they can be aggregated.
At this point you'll have the makings of something you can start building a RESTful system around - but not before.
Don't forget that a RESTful system is not a data dump wrapped in HTTP - it's a hypertext-driven system.
Also don't forget that media types are the point of contact between your server and its clients. A media type is only limited by your imagination and can model datasets of any size if you're clever about it. It can contain XML, JSON, YAML, binary elements such as a Bloom Filter, or whatever works for the problem.
Firstly, there is no once-and-for-all right answer.
Each valid URL is something that makes sense to query. Think of them as equivalents to providing query forms for people looking for your data; that might help you narrow down the scenarios.
It is a matter of personal taste, and possibly of the toolkit you use, what goes into the basic URL path and what is encoded as parameters. The debate is a bit like the XML debate over putting values in elements vs. attributes. It is not always a rational or logically decided issue, nor will everybody be kind in their comments on your decisions.
If you are using a backend like Rails, that implies certain conventions. Even if you're not using Rails, it makes sense to work in the same way unless you have a strong reason to change. That way, people writing clients to talk to Rails-based services will find yours easier to understand and it saves you on documentation time ;-)
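As a purely hypothetical illustration of the query-form idea (the resource layout, parameter names and cache lifetime are all assumptions):

import express from "express";

const app = express();

// One forecast resource per location; query parameters narrow it down, e.g.
//   GET /locations/10382/forecast?parameters=temperature,wind_speed&from=2024-06-01T00:00Z&to=2024-06-03T00:00Z
app.get("/locations/:id/forecast", (req, res) => {
    const parameters = String(req.query.parameters ?? "").split(",").filter(Boolean);
    const from = String(req.query.from ?? "");
    const to = String(req.query.to ?? "");
    res.set("Cache-Control", "public, max-age=3600"); // hourly values: cache for an hour
    res.json(loadForecast(req.params.id, parameters, from, to));
});

// Placeholder for the actual data access.
function loadForecast(location: string, parameters: string[], from: string, to: string) {
    return { location, parameters, from, to, values: [] };
}

Note that a GET with query parameters stays cacheable, which a POSTed query body would not be; that addresses the "POSTing data all over the place" concern in the question.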
Maybe you can use forecast as the resource and go deeper to fine-grained services with XLink.
Would it be possible to do something like this? Since you have so many parameters, I was thinking you could relate them to a mix of id / parameter combinations to decrease the URL size:
/WeatherForeCastService//day/hour
www.weatherornot.com/today/days/x // (where x is number of days)
www.weatherornot.com/today/9am/hours/h // (where h is number of hours)