I have a dataset that contains date variables, quantitative and qualitative predictor variables, and a binary dependent variable. The goal of my analysis is to find the percentage of successes in CORRECT and to learn more about the relationship between CORRECT and the independent variables.
There are people we can call trackers (the RECORDERs in the dataset) who live throughout the US. Each one has the job of keeping track of the addresses of the participants of our program in their location. The problem is that some of these trackers are not regularly updating the addresses for the group of participants they are responsible for. Some of the addresses in their database can be outdated or incorrect in some other way. I'm looking to explore these correct/incorrect addresses and their relationship with the other variables. Below are some of the variables in my dataset:
CORRECT: a binary variable to indicate whether or not the RECORDER entered the correct address
RECORDER_ADDRESS: the address that the RECORDER put into their database for the participant
ACTUAL_ADDRESS: the address where the participant is actually located
ZIP_CODE: the zip code for the participant
PARTICIPANT_ID: The unique ID for the participant
CREATED_DATE: date when the initial address for the participant was recorded
MODIFIED_DATE: dates when any variable was modified
PARTICIPANT_START_DATE: the start date for the participant on the job
PARTICIPANT_END_DATE: the date when this participant's duty will end
RECORDER: name of the recorder that is in charge of tracking this entry
TRAINING: the type of training that the participant received
I've figured out the accuracy of the RECORDERs: they were correct about 56% of the time. Now I'm trying to look further into these incorrect and correct addresses. I've tried logistic regression to predict CORRECT, but none of the predictor variables were significant. I made a stacked bar plot of CORRECT by STATE and another of CORRECT by RECORDER. Now I'm resorting to using the 4 date variables along with ZIP_CODE, RECORDER_ADDRESS, and ACTUAL_ADDRESS to learn about the success and failure of the RECORDERs. Are there visualization ideas or analyses that can use the date variables and/or address variables to gain insight into the correct/incorrect recordings?
One idea is to create a new variable that is the time difference between CREATED_DATE and MODIFIED_DATE, and another for the difference between PARTICIPANT_START_DATE and MODIFIED_DATE.
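Whatever tool ends up being used for the analysis, each derived variable is just the number of days between two dates. A minimal Java sketch with made-up example values (the variable names simply mirror the columns listed above):

import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class DateGaps {
    public static void main(String[] args) {
        // Hypothetical example values; in practice these come from one row of the dataset.
        LocalDate createdDate = LocalDate.parse("2023-01-10");          // CREATED_DATE
        LocalDate modifiedDate = LocalDate.parse("2023-03-02");         // MODIFIED_DATE
        LocalDate participantStartDate = LocalDate.parse("2023-02-01"); // PARTICIPANT_START_DATE

        // Derived variable 1: days between the initial recording and the last modification
        long createdToModified = ChronoUnit.DAYS.between(createdDate, modifiedDate);
        // Derived variable 2: days between the participant's start date and the last modification
        long startToModified = ChronoUnit.DAYS.between(participantStartDate, modifiedDate);

        System.out.println("CREATED_DATE -> MODIFIED_DATE: " + createdToModified + " days");
        System.out.println("PARTICIPANT_START_DATE -> MODIFIED_DATE: " + startToModified + " days");
    }
}

These day counts can then be compared against CORRECT, for example with boxplots of the gap for correct versus incorrect records.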
I'm using Graphite + Grafana to monitor (by sampling) queue lengths in a test system at work. The data that gets uploaded to Graphite is grouped into different series/metrics by properties of the payloads in the queue. These properties can be somewhat arbitrary, at least to the point where they are not all known at the time when the data collection script is run.
For example, a property could be the project that the payload belongs to and this could be uploaded as a separate series/metric so that we can monitor the queues broken down by the different projects.
This has the consequence that Graphite sends a lot of null values for certain metrics if the queues in the test system did not contain any payloads with properties that would group them into that specific series/metric.
For example, if a certain project did not have any payloads in the queue at the time when the data collection was run.
In Grafana this is not so nice, as the line graphs don't show up as connected lines and gauges will show either null or the last non-null value.
For line graphs I can just choose to connect null values in Grafana, but for gauges that's not possible.
I know about the keepLastValue function in Graphite. It includes a limit for how long to keep the value, which I like very much, since I would like to keep the last value only until the next time data collection is run. Data collection is run periodically at known intervals.
The problem with keepLastValue is that it expects a number of points as this limit. I would rather give it a time period instead. In Grafana the relationship between time and data points is very dynamic, so it's not easy to hard-code a good limit for keepLastValue.
Thus, my question is: Is there a way to tell Graphite to keep the last value for a given time instead of a given number of points?
In my Source block I want the number of agents to be based on two different factors, namely the number of beds and the number of visitors per bed. The visitors per bed is just a variable (e.g. visitors=3) and the number of beds is loaded from a database table which comes from an Excel file (see first image). Now I want to code this in the code block as shown in the example in image 2, but I do not know the correct code and do not know if it is even possible.
The simplest solution is just to do the pre-calculations in the input file and have them in the dbase.
The more complex solution is to set the Source arrivals to be defined by calls of the inject() function.
Now, you read your dbase at the start of the model using SQL (i.e. the query constructor). Make the necessary computations and create a Dynamic Event for each arrival when you want it to happen, relative to the model start. Each dynamic event then calls the source.inject(1) method.
Better still is to not use Source at all but a simple Enter block. The dynamic event creates the agent with all relevant properties from your dbase and pushes it into the Enter block using enter.take(myNewAgent)
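A minimal sketch of those two approaches, as it might go in Main's startup code. Every name here (beds_table, num_beds, visitorsPerBed, the InjectVisitors dynamic event, the source and enter blocks) is an assumption for illustration, not taken from your model:

// "On startup" of Main: read the bed table from the AnyLogic database and schedule the arrivals.
List<Integer> bedsPerRow = selectFrom(beds_table).list(beds_table.num_beds);
for (int beds : bedsPerRow) {
    int agentsToCreate = beds * visitorsPerBed;         // visitorsPerBed is the plain variable, e.g. 3
    double arrivalTime = 0;                             // or whatever offset from model start you compute here
    create_InjectVisitors(arrivalTime, agentsToCreate); // schedules the dynamic event below
}

// Action of the dynamic event InjectVisitors, which has an int parameter "count":
source.inject(count);             // Source variant (arrivals defined by calls of inject())
// ...or, for the Enter-block variant, build the agent yourself and push it in:
// Visitor v = new Visitor();     // set any properties read from the dbase here
// enter.take(v);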
But as I said: this is not trivial
I am trying to use parameter variation in AnyLogic. My inputs are 3 parameters, each varying over 5 values. My output is water demand. What I need from parameter variation is the way in which demand changes according to the different combinations of the three parameters. I imagine something like: there are 10,950 rows (one for each day), the first column is time (in days), the second column has the values for the first combination, the third column the second combination, and so on and so forth. What would be the best way to track this output so that I can then export it to Excel? I have added a dataset to my Main to track demand through each simulation, but I am not sure what to add to the parameter variation experiment interface to track the output across the different iterations. It would also be helpful to have a way to know which combination of inputs produced a given output (for example, have the combination be the name of each column). I see that there are Java Actions, but I haven't been able to figure out the code to do what I need. I appreciate any help with this matter.
The easiest approach is just to track this in output database tables which are then exported to Excel at the end of your run. As long as these tables include outputs from multiple runs (and are, for example, only cleared at the start of the experiment, not of each run), your Parameter Variation experiment will end up with an Excel file containing outcomes from all the runs. (You will probably need to turn off parallel execution in the PV experiment so you don't run into issues trying to write to the same Excel file in parallel.)
So, for example, you might have tables:
run_details with columns id, parm1, parm2 and parm3 (with proper column names given your actual parameters and some unique ID generated for each run)
output_demand with columns run_id, sim_time_hrs and demand_value (if, say, you're storing some demand value each hour of simulated time) where run_id cross-references the run's ID in run_details
(There is extra complexity in how you could allocate a unique run ID and how and when you write to/clear those tables, but I'm just presenting the core design. You can also get round the need-serial-execution point by programmatically controlling when you export to Excel, rather than using the built-in "Export tables at the end of model execution" capability, but that's also more complicated.)
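A minimal sketch of the write side using AnyLogic's built-in database API. The table and column names match the example above; runId, parm1, parm2, parm3 and currentDemand are placeholders you would wire up yourself:

// Once per run (e.g. in Main's "On startup"): record which parameter combination this run used.
int runId = 1; // however you allocate the unique run id - see the caveat above
insertInto(run_details)
    .columns(run_details.id, run_details.parm1, run_details.parm2, run_details.parm3)
    .values(runId, parm1, parm2, parm3)
    .execute();

// Every simulated hour (e.g. from a cyclic event): log the current demand against that run id.
double currentDemand = 0; // placeholder for however you compute or read the demand value
insertInto(output_demand)
    .columns(output_demand.run_id, output_demand.sim_time_hrs, output_demand.demand_value)
    .values(runId, time(HOUR), currentDemand)
    .execute();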
Good day
I'm a new user trying to find my way with AnyLogic.
Any help with the following question will be appreciated.
Is it possible to start a model with initial values/quantities given to certain blocks/sections in the model? In other words, not have the model start from 0 but from the values given.
You can run a "warmup" period manually and save that as a model snapshot. In future runs, you can start off from that snapshot by loading it. See the help on model snapshots.
This is the general problem of model initialisation (e.g., if you're modelling a manufacturing facility, you may want the run to start with the facility at the state it would be at on 9am next Monday morning). There is no generic answer: what initialisation you need is 100% model-dependent (as is how easy/hard this is).
In particular, process models make this difficult because entities (agents) are expected to have flowed through the process up to the point they 'start' in. You can use things like extra initialisation-only Source/Enter blocks to 'inject' agents into the appropriate process points, but in most models it's not this easy: you will have all kinds of model state that needs to be made consistent with this (e.g., the agents flowing through the process might have attributes which have changed based on what's happened to them so far, so this would have to be made consistent).
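A minimal sketch of that injection idea, with every name being an assumption for illustration: on startup, create the agents that should already be mid-process, make their attributes consistent with where they 'are', and push them into an initialisation-only Enter block placed at that point in the flowchart.

// Main's "On startup" action (hypothetical names throughout)
int initialWip = 25;              // how many items are already in the system at "time zero"
for (int i = 0; i < initialWip; i++) {
    Part p = add_parts();         // generated add-function of a hypothetical "parts" population
    p.stepsCompleted = 3;         // hypothetical attribute set to match how far the agent has got
    enterAtMachining.take(p);     // initialisation-only Enter block at that process point
}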
That's why warm-up periods (letting the model run 'from empty' for a period until its state is qualitatively what you want as your starting point) are a common approach. Model snapshots can help you here (see Ben's answer) but they're not the only way of doing it. (You can also just 'reset' all your metrics/output gathering at the point when you determine the warm-up period has ended --- i.e., you are effectively establishing a new 'time zero' --- but, again, exactly what you need to do is 100% model-dependent.)
I have a problem I need to solve using Node. The question I have is about the best logical way to solve it. Any advice is appreciated.
Summary
You will build a tool that imports train schedules from an external data source and stores them in an internal database.
External Data Source
There is a service that provides a list of trains:
[{"id":1,"name":"A EXPRESS"},{"id":2,"name":"B EXPRESS"},{"id":3,"name":"C EXPRESS"},{"id":4,"name":"D EXPRESS"},{"id":5,"name":"E EXPRESS"}]
And a list of train stations:
Example for train "A EXPRESS":
[{"arrival":"2019-04-30T11:48:00.000Z","departure":"2019-05-01T05:42:00.000Z","service":"Loop 5","station":{"id":"ST1","name":"Waterloo"}},{"arrival":"2019-05-13T18:00:00.000Z","departure":"2019-05-14T05:00:00.000Z","service":"Loop 5","station":{"id":"ST2","name":"Paddington"}},{"arrival":"2019-05-15T04:00:00.000Z","departure":"2019-05-15T11:00:00.000Z","service":"Loop 5","station":{"id":"ST3","name":"Heathrow"}},{"arrival":"2019-05-16T20:00:00.000Z","departure":"2019-05-17T10:00:00.000Z","service":"Loop 5","station":{"id":"ST4","name":"Wimbledon"}},{"arrival":"2019-05-18T15:00:00.000Z","departure":"2019-05-19T21:00:00.000Z","service":"Loop 5","station":{"id":"ST5","name":"Reading"}},{"arrival":"2019-05-21T04:00:00.000Z","departure":"2019-05-21T21:00:00.000Z","service":"Loop 5","station":{"id":"ST6","name":"Algate"}},{"arrival":"2019-05-31T03:00:00.000Z","departure":"2019-05-31T15:00:00.000Z","service":"Loop 5","station":{"id":"ST1","name":"Waterloo"}}]
Note: this train stops at station "ST1" ("Waterloo") twice.
To do
For each imported station call, we want to maintain the latest information as well as the history of the station call:
-What station is the train calling at?
-What are the latest arrival and departure dates?
-When was the station call first imported?
-When was the station call last updated?
-How did the station call change over time? (evolution of arrival and departure dates with time) This kind of information is useful for us to understand how often trains are delayed, when schedule changes happen and whether there are patterns to these changes.
How it works
-The external data source is a simulation of train schedules forecasts
-The data covers a time range from January 1st 2019 to May 31st 2019
-This 5-month time window is compressed and simulated over a 24-hour cycle
-This 24-hour cycle restarts every day at 00:00 UTC
-The data source provides endpoints to request train schedules
-A train list endpoint provides a dynamic list of trains that you can import (see Data above)
-A train schedules endpoint provides a dynamic list of stations calls for a specific train (see Data above)
-A train schedule consists of a list of station calls with a varying number of past and future station calls
-This external data source does not provide a unique identifier for each station call
-This means that merging station calls is not straightforward. This is the crux of my question: reconciling external station calls with the existing ones in the database.
-Station calls' arrival and departure dates routinely change, sometimes by multiple days. Sometimes they swap, get deleted, or new ones appear
-Station calls can sometimes be deleted (the train will not stop at that specific station)
-New station calls can sometimes be created (the train will make an unscheduled stop)
-The train schedules endpoint changes the data returned every 15 minutes.
Specific requirements
You need to capture 24 hours of all train schedules starting at 00:00:00 UTC on one day and ending at 23:59:59 UTC on the same day.
Question
As you can see from the task above, there is a need to reconcile the new imported data with existing data as the new data changes.
There is no ID that can be used to match station calls, but these calls need to be updated when the external data changes.
We do have a train ID and a station ID, as well as the arrival and departure dates of each call.
What is the logic I can apply to keep the database data accurate and up to date?
Thank you
Answer?
My initial thoughts are to do the following, however I am not sure if this is the best solution. If you have a better solution, or see a problem with mine, please let me know.
The list of station calls retrieved from the external service will always be saved to the internal database under its retrieval timestamp. This will always be displayed as the current information in the UI (latest database entry).
Every 15 minutes a call is made to the external service and a new latest entry is added to the internal database, and reflected in the UI.
The status of the latest entry needs to reflect the difference between the latest entry and the previous entry, e.g. delayed by X min, cancelled, etc. This is the tricky part, because each current station call needs to be matched with a previous station call.
My thinking, for a specific train, is to just find the matching station ID.
If there is no previous matching station ID, then the status is "new".
If there is a previous station ID and no matching current station ID, the status is "cancelled".
If there is one matching station ID, then their times are compared, and the status is updated to "early" or "delayed".
If there is more than one matching station ID, the previous and current station calls with the closest timestamps are matched, and their statuses are updated accordingly to "early" or "delayed".
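Here is a rough sketch of that matching pass. The logic is language-agnostic and ports directly to Node; it is written in Java purely for illustration, and all class and field names are mine (the fields simply mirror the JSON above):

import java.time.Instant;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// One station call as imported; the fields mirror the JSON above (there is no external id).
class StationCall {
    String stationId;
    Instant arrival;
    Instant departure;
    String status; // "new", "cancelled", "early", "delayed" or "on time"

    StationCall(String stationId, Instant arrival, Instant departure) {
        this.stationId = stationId;
        this.arrival = arrival;
        this.departure = departure;
    }
}

class Reconciler {
    // Match the newly imported calls of one train against the previous snapshot for that train.
    static void reconcile(List<StationCall> previous, List<StationCall> current) {
        Set<StationCall> matchedPrevious = new HashSet<>();
        for (StationCall cur : current) {
            // Candidate matches: previous calls at the same station that are not yet matched;
            // if there are several, pick the one with the closest arrival time.
            StationCall best = null;
            long bestGap = Long.MAX_VALUE;
            for (StationCall prev : previous) {
                if (!prev.stationId.equals(cur.stationId) || matchedPrevious.contains(prev)) continue;
                long gap = Math.abs(prev.arrival.toEpochMilli() - cur.arrival.toEpochMilli());
                if (gap < bestGap) {
                    bestGap = gap;
                    best = prev;
                }
            }
            if (best == null) {
                cur.status = "new"; // no previous call at this station: unscheduled stop
            } else {
                matchedPrevious.add(best);
                if (cur.arrival.isAfter(best.arrival)) cur.status = "delayed";
                else if (cur.arrival.isBefore(best.arrival)) cur.status = "early";
                else cur.status = "on time";
            }
        }
        // Any previous call left unmatched has been dropped from the schedule.
        for (StationCall prev : previous) {
            if (!matchedPrevious.contains(prev)) {
                prev.status = "cancelled";
            }
        }
    }
}

The one refinement over plainly "finding the matching station ID" is that, when a train calls at the same station more than once (like Waterloo for "A EXPRESS"), matches are claimed greedily by closest arrival time so each previous call is used at most once.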
Is my logic correct?