machine learning forecasting / asking for input parameter - prediction

I've been working on a weekly web traffic forecast based on time series. The primary business driver is marketing investment (spend).
I have historical spending data but not future spending data. In this case, how can future web traffic be predicted?
I'd like to have an input parameter I could change, so that varying the spend shows the resulting future traffic. (If you spend this much, you will receive this many visits; if you spend more, you will get more, and vice versa.)
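
Something like the sketch below is what I mean by the input switch: a minimal, purely illustrative linear model in Kotlin that is fit on historical (spend, visits) pairs and then queried with any hypothetical future spend. The data, the variable names and the assumption that visits respond linearly to spend are all made up for illustration; a real model would add seasonality and other time-series features.

// Fit visits ~ spend by ordinary least squares on historical weekly data (illustrative only).
fun fitLine(spend: DoubleArray, visits: DoubleArray): Pair<Double, Double> {
    require(spend.size == visits.size && spend.isNotEmpty())
    val meanX = spend.average()
    val meanY = visits.average()
    var num = 0.0
    var den = 0.0
    for (i in spend.indices) {
        num += (spend[i] - meanX) * (visits[i] - meanY)
        den += (spend[i] - meanX) * (spend[i] - meanX)
    }
    val slope = num / den
    return (meanY - slope * meanX) to slope // intercept to slope
}

fun main() {
    // Made-up historical weekly spend and visits
    val spend = doubleArrayOf(1000.0, 1500.0, 2000.0, 2500.0, 3000.0)
    val visits = doubleArrayOf(4200.0, 5900.0, 7800.0, 9500.0, 11600.0)
    val (intercept, slope) = fitLine(spend, visits)

    // The "input switch": pass whatever future spend you want to evaluate
    val plannedSpend = 2200.0
    println("Planned spend $plannedSpend -> forecast visits ${"%.0f".format(intercept + slope * plannedSpend)}")
}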

Firebase analytics - Unity - time spent on a level

Is there any possibility of getting the exact time spent on a certain level in a game via Firebase Analytics? Thank you so much 🙏
I tried using logEvent.
The best way to do this would be to measure the time on the level within your codebase, then have a dedicated event for level completion, in which you pass the time spent on the level.
Let's get into the details. I will use Kotlin as an example, but it should be obvious what I'm doing here, and you can see more language examples here.
firebaseAnalytics.setUserProperty("user_id", userId)
firebaseAnalytics.logEvent("level_completed") {
    param("name", levelName)
    param("difficulty", difficulty)
    param("subscription_status", subscriptionStatus)
    param("minutes", minutesSpentOnLevel)
    param("score", score)
}
Now see how I have a bunch of parameters with the event? These parameters are important since they will allow you to conduct a more thorough and robust analysis later on and answer more questions. Like: hey, what is the most difficult level? Do people still have trouble with it when the game difficulty is lower? How many times has this level been rage-quit or lost (for that you'd likely need a level_started event)? What about our paid players, are they having similar trouble on this level as well? How many people have rage-quit the game on this level and never played again? That last one would likely be easier to answer with SQL at this point, taking the latest value of the level name from level_started, grouped by user_id. Or you could have levelName as a UserProperty as well as an EventProperty; then it would be somewhat trivial to answer in the default analytics interface.
Note that you're limited in the number of event parameters you can send per event. The total number of unique parameter names is limited too, as is the number of unique event names you're allowed to have. In our case, the event name would be level_completed. See the limits here.
Because of those limitations, it's important to name your event properties in a somewhat generic way so that you can efficiently reuse them elsewhere. For this reason, I named the parameter minutes and not something like minutes_spent_on_the_level. You could then reuse this property to send the minutes the player spent actively playing, minutes the player spent idling, minutes the player spent on any info page, minutes they spent choosing their upgrades, etc. The same idea applies to having a name property rather than level_name. It could just as well be id.
You need to carefully and thoughtfully stuff your event with event properties. I normally have a wrapper around the Firebase SDK, in which I enrich events with dimensions that I always want to be there, like the user_id or subscription_status, so I don't have to add them manually every time I send an event. I also usually have some more adequate logging there, since Firebase Analytics' default logging is completely awful. I also have some sanitizing there: lowercasing all values unless I'm passing something case-sensitive like base64 values, making sure I don't have double spaces (so replacing \s+ with " " (a single space)), and maybe also adding the user's local timestamp as another parameter. The latter is very helpful for spotting time-cheating users, especially if your game is an idler.
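
To make that concrete, here's a minimal sketch of such a wrapper, assuming the firebase-analytics-ktx dependency; the always-on dimensions and the sanitizing rules are just the ones described above and are purely illustrative.

import com.google.firebase.analytics.FirebaseAnalytics
import com.google.firebase.analytics.ktx.logEvent

// Hypothetical wrapper that enriches every event with always-on dimensions (user_id,
// subscription_status), sanitizes string values and adds the user's local timestamp.
class AnalyticsTracker(
    private val firebaseAnalytics: FirebaseAnalytics,
    private val userId: String,
    private val subscriptionStatus: String
) {
    fun track(eventName: String, params: Map<String, Any>) {
        firebaseAnalytics.logEvent(sanitize(eventName)) {
            // Dimensions I always want on every event
            param("user_id", sanitize(userId))
            param("subscription_status", sanitize(subscriptionStatus))
            // Local timestamp helps spot time-cheating users
            param("local_timestamp", System.currentTimeMillis())
            for ((key, value) in params) {
                when (value) {
                    is Long -> param(sanitize(key), value)
                    is Int -> param(sanitize(key), value.toLong())
                    is Double -> param(sanitize(key), value)
                    else -> param(sanitize(key), sanitize(value.toString()))
                }
            }
        }
    }

    // Lowercase and collapse whitespace, as described above
    private fun sanitize(value: String): String =
        value.lowercase().replace(Regex("\\s+"), " ").trim()
}

Level completion then becomes a one-liner: tracker.track("level_completed", mapOf("name" to levelName, "difficulty" to difficulty, "minutes" to minutesSpentOnLevel, "score" to score)).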
Good. We're halfway there :) Bear with me.
Now you need to go to Firebase and register your eps (event parameters) as cds (custom dimensions and metrics). If you don't register your eps, you won't be able to use them in reports; registered eps count towards the global cd limit (it's about 50 custom dimensions and 50 custom metrics). You register the cds in the Custom Definitions section of Firebase.
Now you need to know whether this is a dimension or a metric, as well as the scope of your dimension. It's much easier than it sounds. The rule of thumb is: if you want to be able to run mathematical aggregation functions on your dimension, then it's a metric. Otherwise - it's a dimension. So:
firebaseAnalytics.setUserProperty("user_id", userId) <-- dimension
param("name", levelName) <-- dimension
param("difficulty", difficulty) <-- dimension (or can be a metric, depends)
param("subscription_status", subscriptionStatus) <-- dimension (can be a metric too, but even less likely)
param("minutes", minutesSpentOnLevel) <-- metric
param("score", score) <-- metric
Now another important thing to understand is the scope. Because Firebase and GA4 are still essentially in beta and being actively worked on, you only have user or hit scope for dimensions, and only hit scope for metrics. The scope basically indicates how the value persists. In my example, we only need user_id as a user-scoped cd. Because user_id is a user-level dimension, it is set separately from the logEvent function. Although I suspect you can do it there too. Haven't tried, though.
Now, we're almost there.
Finally, you don't want to use Firebase to look at your data. It's horrible at data presentation. It's good at debugging though, because that's what it was intended for initially. Because of how horrible it is, it's always advised to link it to GA4. GA4 will let you look at the Firebase values much more efficiently. Note that you will likely need to re-register your custom dimensions from Firebase in GA4, because GA4 can receive multiple data streams, of which Firebase would be just one data source. GA4's cd limits are very close to Firebase's, though. Ok, let's be frank: GA4's data model is almost exactly copied from Firebase's. But GA4 has much better analytics capabilities.
Good, you've moved to GA4. Now, GA4 is a very raw, not-officially-beta product, just like Firebase Analytics. Because of that, it's advised to first change your data retention to the maximum of 14 months and to only use the Explore section for analysis, pretty much ignoring the pre-generated reports. They are just not very reliable at this point.
Finally, you may find it easier to just use SQL to get your analysis done. For that, you can easily copy your data from GA4 to a sandbox instance of BigQuery; it's very easy to do, and it's the best, most reliable known method of using GA4 at this moment. I mean, advanced analysts do the export into BQ, then ETL the data from BQ into a proper storage like Snowflake, or even S3, or Aurora, or whatever you prefer, and then, on top of that, use a proper BI tool like Looker, Power BI, Tableau, etc. A lot of people just stay in BQ though, and that's fine. Lots of BI tools have BQ connectors; it's just that BQ gets expensive quickly if you do a lot of analysis.
Whew, I hope you'll enjoy analyzing your game's data. Data-driven decisions rock in games. Well... They rock everywhere, to be honest.

Where should data be transformed for the database?

Where should data be transformed for the database? I believe this is called data normalization/sanitization.
In my database I have user-created shops, like an Etsy. Let's say a user inputs the price for an item in their shop as "1,000.00", but my database stores prices as integer pennies: "100000". Where should "1,000.00" be converted to "100000"?
These are the two ways I thought of.
In the frontend: the input data is converted from "1,000.00" to "100000" in the frontend before the HTTP request. In this case, the backend would validate that the price is an integer.
In the backend: "1,000.00" is sent to the backend as is, then the backend validates that it is in a valid price format and converts it to the integer "100000" before it is stored in the database.
It seems either would work, but is one way better than the other, or is there a standard way? I would think the second way is best to reduce code duplication, since there are likely to be multiple frontends (mobile, web, etc.) and one backend. But the first way also seems cleaner: just send what the API needs.
I'm working with a MERN application if that makes any difference, but I would think this would be language agnostic.
I would go with a mix between the two (which is a best practice, AFAIK).
The frontend has to do some kind of validation anyway, because you don't want to wait for the backend to get the validation response. I would add the actual conversion here as well.
The validation code will be adapted to each frontend, I guess, because each one uses a different framework / language, so I don't necessarily see the code duplication here (the logic, yes, but not the actual code). As a last resort, you can create a common validation library.
The backend should validate the converted value sent by the frontends again, just as a double check, and then store it in the database. You never know whether the backend will be integrated with other components in the future, and input sanitization is always a best practice.
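
As a rough illustration of the conversion and the backend's double check (Kotlin here purely for illustration; in a MERN stack the same logic would be written in JS/TS on both sides, and the accepted input format is an assumption):

// Hypothetical conversion of a user-entered price like "1,000.00" into integer pennies (100000).
fun priceToCents(input: String): Long {
    val cleaned = input.trim().replace(",", "")
    require(Regex("""^\d+(\.\d{1,2})?$""").matches(cleaned)) { "Not a valid price: $input" }
    val parts = cleaned.split(".")
    val dollars = parts[0].toLong()
    val cents = if (parts.size == 2) parts[1].padEnd(2, '0').toLong() else 0L
    return dollars * 100 + cents
}

// The backend's double check on what a frontend sends: a non-negative integer amount.
fun isValidCents(amount: Long): Boolean = amount >= 0

fun main() {
    println(priceToCents("1,000.00")) // 100000
    println(isValidCents(priceToCents("1,000.00"))) // true
}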
Short version
Both ways would work and I think there is no standard way. I prefer formatting in the frontend.
Long version
What I do in such cases is to look at my expected business requirements and make a list of pros and cons. Here are just a few thoughts about this.
If you decide to do every transformation of the price (formatting, normalization, sanitization) in the frontend, your backend will stay smaller and you will have fewer endpoints, or endpoints with fewer options. Depending on the frontend, you can choose the perfect fit for your end users. The amount of code that is delivered also stays smaller, because the application can be cached and does all the formatting work.
If you implement everything in the backend, you have full control over which format is delivered to your users. Especially when dealing with a lot of different languages, it can be helpful to get the correct display value directly from the server.
Furthermore, it can be helpful to take a look at some different APIs of well-known providers and how these handle prices.
The Paypal API uses an amount object to transfer prices as decimals together with a currency code.
"amount": {
"value": "9.87",
"currency": "USD"
}
It's up to you how to handle it in the frontend. Here is a link with an example request from the docs:
https://developer.paypal.com/docs/api/payments.payouts-batch/v1#payouts_post
Stripe uses a slightly different model.
{
  unit_amount: 1600,
  currency: 'usd',
}
It uses integer values in the smallest unit of the currency as the amount, together with a currency code, to describe prices. Here are two examples to make it clearer:
https://stripe.com/docs/api/prices/create?lang=node
https://stripe.com/docs/checkout/integration-builder
In both cases, the normalization and sanitization have to be done before making requests, and the response will also need formatting before showing it to the user. Of course, most of these requests are made by backend code. But if you look at the prebuilt checkout pages from Stripe or PayPal, these also use normalized and sanitized values for their frontend integrations: https://developer.paypal.com/docs/business/checkout/configure-payments/single-page-app
Conclusion/My opinion
I would always prefer keeping the backend as simple as possible, for security reasons. Less code (i.e. fewer endpoints) means a smaller attack surface. More configuration possibilities mean a lot more effort to make the application secure. Furthermore, you could write another (micro)service which takes over some transformation tasks if you have a business requirement to deliver everything perfectly formatted from the backend. Example use cases might be: if you have a preference for backend code over frontend code (think about your/your team's skills), if you want to deploy a lot of different frontends and want to make sure that they all use a single source of truth for their display values, or maybe if you need to fulfill regulatory requirements to know exactly what is delivered to your users.
Thank you for your question and I hope I have given you some guidance for your own decision. In the end, you will always wish you had chosen a different approach anyway... ;)
For sanitisation, it has to be on the back end for security reasons. Sanitisation is concerned with ensuring that only certain fields' values from your web form are even entertained. It's like a guest list at an exclusive club. It's concerned not just with the field (e.g. username) but also the value (e.g. bobbycat, or ); DROP TABLE users;). So it's about ensuring the security of your database.
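
A rough sketch of that guest-list idea (Kotlin purely for illustration; the allowed field names are made up), keeping only known fields and leaving the values to be passed to the database through parameterized queries rather than string concatenation:

// Hypothetical back-end sanitisation: only fields on the guest list are entertained.
val allowedFields = setOf("username", "shop_name", "price")

fun sanitise(formData: Map<String, String>): Map<String, String> =
    formData.filterKeys { it in allowedFields }

fun main() {
    val incoming = mapOf(
        "username" to "bobbycat",
        "price" to "1,000.00",
        "is_admin" to "true" // unexpected field -> dropped
    )
    println(sanitise(incoming)) // {username=bobbycat, price=1,000.00}
}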
Normalisation, on the other hand, is concerned with transforming the data before storing it in the database. It's the example you brought up: "1,000.00" to "100000", because you are storing prices as integers without decimals in the database.
Where does this code belong? I think there's no clear winner because it depends on your use case.
If it's a simple matter like making sure the value is an integer and not a string, you should offload that to the web form (i.e. the front end), since form inputs already have a "type" attribute to enforce this.
But imagine a more complicated scenario. Let's say you're building an app that allows users to construct a Facebook ads campaign (your app being a third-party developer app, like Smartly.io). That means there will be something like 30 form fields that must be filled out before the user hits "create campaign", and the values in some form fields affect the validity of other parts of the form.
In such a situation, it might make sense to put at least some of the validation in the back end because there is a series of operations your back end needs to run (like create the Post, then create the Ad) in order to ensure validity. It wouldn't make sense for you to code those validations and normalisations in the front end.
So in short, it's a balance you'll need to strike. Offload the easy stuff to the front end, leveraging web APIs and form validations. Leave the more complex normalisation steps to the back end.
On a related note, there's a broader concept of ETL (extract, transform, load) that you'd use if you were trying to consume data from another service and then transforming it to fit the way you store info in your own database. In that case, it's usually a good idea to keep it as a repository on its own - using something like Apache Airflow to manage and visualise the cron jobs.

With Bluemix Retrieve&Rank, How do we implement a system to continuously learn?

With reference to the Web page below, we are creating a bot that can respond to inquiries using the Retrieve & Rank service of IBM Bluemix.
Question:
After training the ranker once, how can we construct a mechanism that continuously learns from users' responses to inquiries and improves response accuracy?
Assumption:
Because the R&R service has no API for continuously learning from users' responses to inquiries, I suppose it is necessary to periodically tune the GroundTruth file and train the ranker again.
Assumed tuning of the GT file:
If there is a new question, add a question-and-answer pair.
If an existing question could not be answered well, increase or decrease the relevance scores of the responses
(if the bot answered incorrectly, lower the score; if there is a useful answer, raise the score).
In order to continuously learn, you will want to do the following:
capture new examples i.e. each user input and corresponding result
review those examples and create new ranker examples, adjust relevance scores, etc
add those new examples to the ranker
retrain the ranker using your new and existing examples
NOTE: Be sure to validate that new updates to the ranker data improves the overall system performance. k-fold validation is a great way to measure this.
All in all, learning is a continuous process that should be repeated indefinitely, or until system performance is deemed sufficient.
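
Regarding the k-fold note, here's a minimal, generic sketch of the validation loop (Kotlin purely for illustration; the GroundTruthExample fields and the evaluate function are placeholders for however you train and score your ranker):

// Placeholder shape for one labelled example (question, candidate answer, relevance label).
data class GroundTruthExample(val question: String, val answerId: String, val relevance: Int)

// Generic k-fold loop: train on k-1 folds, score on the held-out fold, average the scores.
// `evaluate` stands in for "retrain a ranker on train and measure its quality on test".
fun <T> crossValidatedScore(
    examples: List<T>,
    k: Int = 5,
    evaluate: (train: List<T>, test: List<T>) -> Double
): Double {
    val shuffled = examples.shuffled()
    val folds = (0 until k).map { i -> shuffled.filterIndexed { idx, _ -> idx % k == i } }
    return (0 until k).map { i ->
        val test = folds[i]
        val train = (0 until k).filter { it != i }.flatMap { folds[it] }
        evaluate(train, test)
    }.average()
}

Compare the cross-validated score of your current ground truth against the score with the new examples merged in, and only push the retrained ranker when it improves.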

Why is my identifier collision rate increasing?

I'm using a hash of IP + User Agent as a unique identifier for every user that visits a website. This is a simple scheme with a pretty clear pitfall: identifier collisions. Multiple individuals browse the internet with the same IP + user agent combination. Unique users identified by the same hash will be recognized as a single user. I want to know how frequently this identifier error will be made.
To calculate the frequency, I've created a two-step funnel that should theoretically convert at zero percent: publish.click > signup.complete. (Users have to signup before they publish.) Running this funnel for 1 day gives me a conversion rate of 0.37%. That figure is, I figured, my unique identifier collision probability for that funnel. Looking at the raw data (a table about 10,000 rows long), I confirmed this hypothesis. 37 signups were completed by new users identified by the same hash as old users who completed publish.click during the funnel period (1 day). (I know this because hashes matched up across the funnel, while UIDs, which are assigned at signup, did not.)
I thought I had it all figured out...
But then I ran the funnel for 1 week, and the conversion rate increased to 0.78%. For 5 months, the conversion rate jumped to 1.71%.
What could be at play here? Why is my conversion (collision) rate increasing with widening experiment period?
I think it may have something to do with the fact that unique users typically only fire signup.complete once, while they may fire publish.click multiple times over the course of a period. I'm struggling however to put this hypothesis into words.
Any help would be appreciated.
Possible explanations starting with the simplest:
The collision rate is relatively stable, but your initial measurement isn't significant because of the low volume of positives that you got. 37 isn't very many. In this case, you've got two decent data points.
The collision rate isn't very stable and changes over time as usage changes (at work, at home, on mobile, etc.). The fact that you got three data points showing an upward trend is just a coincidence. This wouldn't surprise me, as funnel conversion rates change significantly over time, especially on a weekly basis. There are also bots that haven't been caught.
If you really get multiple publishes, and sign-ups are absolutely a one-time thing, then your collision rate would increase as users who only signed up and didn't publish eventually publish. That won't increase their funnel conversion, but it will provide an extra publish for somebody else to convert on. Essentially, every additional publish raises the probability that I, as a new user, am going to get confused with a previous publish event.
Note from OP. Hypothesis 3 turned out to be the correct hypothesis.
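
To put some (entirely invented) numbers on hypothesis 3: in the toy model below, publish events accumulate in a set of IP + user agent hash buckets as the window widens, while signups arrive at a fixed daily rate, so the chance that a new signup's hash has already published grows with the window. The bucket count and event rates are made up, and real publishes concentrate in repeat users, so the growth here is exaggerated, but the direction is the point.

import kotlin.random.Random

// Toy model: `buckets` distinct IP+UA hashes shared across users. A signup "converts" in the
// publish.click > signup.complete funnel whenever its hash has already published in the window.
fun apparentConversion(
    days: Int,
    buckets: Int = 1_000_000,
    publishesPerDay: Int = 3_000,
    signupsPerDay: Int = 500,
    rng: Random = Random(42)
): Double {
    val bucketsThatPublished = HashSet<Int>()
    var signups = 0
    var collisions = 0
    repeat(days) {
        repeat(publishesPerDay) { bucketsThatPublished.add(rng.nextInt(buckets)) }
        repeat(signupsPerDay) {
            signups++
            if (rng.nextInt(buckets) in bucketsThatPublished) collisions++
        }
    }
    return collisions.toDouble() / signups
}

fun main() {
    for (windowDays in listOf(1, 7, 150)) {
        val rate = apparentConversion(windowDays) * 100
        println("window of $windowDays day(s) -> apparent conversion ${"%.2f".format(rate)}%")
    }
}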

SOA Service Boundary and production support

Our development group is working towards building up a service catalog.
Right now, we have two groups: one to sell a product, another to service that product.
We have one particular service that calculates whether the price of the product is profitable. When a sale occurs, the sale can be overridden by a manager. This sale must also be represented in another system that tracks various sales, and the numbers must match. The parameters of profitability also change and are different from month to month, but a sale may be based on a previous set of parameters.
Right now the sale profitability service only calculates the profit; it is exposed via a RESTful URI.
One group of developers has suggested that the profitability service also support these "manager overrides", as well as a date parameter to calculate based on a previous date. Of course, the sales group of developers disagrees. If the service won't support this, the servicing developers will have to do an ETL between the two systems for each product, instead of just calling the profitability service. Right now, since we do not have a set of services to handle this, production support gets the request and then has to update the one or more systems associated with that product.
So, if a service works for a narrow slice, but an exception-based business process breaks it, does that mean the boundaries of the service are incorrect and need to account for the change in business process?
Two, does adding a date parameter extend the boundary of the service too much, or should it be expected that, if the service already has the parameters, it would also keep a history of those parameters? At this moment, we do not have a service that only stores the parameters, as no one has needed one.
If there is any clarification needed before an answer can be given, please let me know.
I think the key here is: how much pain would be incurred by both teams if an ETL was introduced between the two?
Not that I think you're over-analysing this, but if I may, you probably have an opinion that adding a date parameter into the service contract is bad, but also dislike the idea of the ETL.
Well, strategy aside, I find these days my overriding instinct is to focus less on the technical ins and outs and more on the purely practical.
At the end of the day, ETL is simple, reliable, and relatively pain-free to implement; however, it comes with side effects. The main one is that you'll be coupling changes to your service's db schema to an outside party, which will limit your options for evolving the service in the future.
On the other hand, allowing consumer demand to dictate service evolution is easy and low-friction, but also a rocky road, as you may become beholden to that consumer at the expense of others.
Another possibility is to allow the extra service parameters to be delivered to the consumer via a message, rather than through the service itself. This would allow you to keep your service boundary intact and let the consumer hold the necessary parameters locally.
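
To illustrate that last option: the message could be as simple as an event carrying the month's profitability parameters, which the consumer persists locally (all names below are invented for the sketch).

import java.math.BigDecimal
import java.time.YearMonth

// Hypothetical event published by the profitability service whenever its parameters change.
data class ProfitabilityParametersChanged(
    val effectiveMonth: YearMonth,
    val minimumMarginPercent: BigDecimal,
    val overheadRate: BigDecimal
)

// Consumer-side store, keyed by month, so "calculate based on a previous date" becomes a
// local concern of the consumer rather than a new parameter on the service's contract.
class ParameterStore {
    private val byMonth = mutableMapOf<YearMonth, ProfitabilityParametersChanged>()
    fun apply(event: ProfitabilityParametersChanged) { byMonth[event.effectiveMonth] = event }
    fun forMonth(month: YearMonth): ProfitabilityParametersChanged? = byMonth[month]
}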
Apologies if this does not address your question directly.