I'm generating my agents in AnyLogic based on a database table that I've created. The DB holds some characteristics of each agent. This agent is supposed to be my "scheduling agent"; since my focus is on rescheduling, it is important that my production orders are stored as agents in a queue. My problem is that when generating the agents, I can't tell the system to generate all of them at once (i.e. "import" each line of my DB and turn it into an agent with its characteristics).
I tried working around it by adding a 1-second difference between every production order, but when the last date is reached my simulation throws an error and stops working. Could someone help me achieve this? Do you think there is a better solution?
I'm not 100% sure what you are trying to do, but I think I have solved a similar problem this way.
I have a database of batches that I want to load all at once.
[screenshot of the Source block properties described below]
This is going to load the batches one at a time with 0 interarrival time. This means that batches will flow continuously. Also important is the Limited number of arrivals option, which will stop the loading when the end of the database is reached.
Also, after the source, I added a queue with Maximum capacity set to infinite.
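If it helps to see the idea outside of the AnyLogic GUI, here is a rough Python analogy of what that Source-plus-Queue setup does (purely illustrative, not AnyLogic code; the table and column names are made up): read every row of the order table once, turn each row into an "agent" object carrying its characteristics, and park them all in an unlimited queue at time zero.

```python
# Purely illustrative (plain Python, not AnyLogic): the equivalent of
# "Source reads the table with 0 interarrival time and a limited number of
# arrivals, then a Queue with unlimited capacity holds everything".
# The table and column names (production_orders, order_id, due_date, quantity)
# are invented for the example.
import sqlite3
from collections import deque
from dataclasses import dataclass


@dataclass
class ProductionOrder:
    """The 'agent' carrying the characteristics read from one DB row."""
    order_id: int
    due_date: str
    quantity: int


def load_all_orders(db_path="orders.db"):
    queue = deque()  # unlimited capacity, like the Queue block
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT order_id, due_date, quantity FROM production_orders"
        )
        for order_id, due_date, quantity in rows:  # one agent per row
            queue.append(ProductionOrder(order_id, due_date, quantity))
    return queue  # all production orders are available at time 0
```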
Hope that helps
We've been experimenting with tweaking the autovacuum thresholds on some of our larger tables, because otherwise autovacuum never runs on them, even as they build up tens of thousands of dead tuples. Using a query I found somewhere on SO that looks at the pg_stat_user_tables view, I can see the last run time and the number of autovacuum runs, but I can't seem to find a history of the events. We're trying to keep track of how often they run to get some idea of where a good threshold is, so that sort of info would be useful. Is there another table available for this?
There is no table with the history (unless you have created one yourself, of course, or deployed a monitoring system that does it for you, but I don't know of such a system). You can set log_autovacuum_min_duration to zero; then you will have a record going forward in your log files.
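For illustration, here is a minimal sketch (assuming psycopg2 and a superuser connection; the connection string is a placeholder) of the two pieces: the per-table stats you can already query from pg_stat_user_tables, and enabling log_autovacuum_min_duration so a record of every run accumulates in the logs from now on.

```python
# A minimal sketch, assuming psycopg2 and a superuser connection (the
# connection string is a placeholder). It shows the stats that already exist
# per table, and enables logging of every autovacuum run going forward.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block

with conn.cursor() as cur:
    # Current counters and last run time per table (no history is kept here).
    cur.execute("""
        SELECT relname, last_autovacuum, autovacuum_count, n_dead_tup
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC;
    """)
    for relname, last_run, runs, dead in cur.fetchall():
        print(relname, last_run, runs, dead)

    # From now on, every autovacuum run is written to the server log,
    # which gives you the history you were looking for.
    cur.execute("ALTER SYSTEM SET log_autovacuum_min_duration = 0;")
    cur.execute("SELECT pg_reload_conf();")
```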
So this is a bit of a weird question as it isn't related to how to use the tool but more about why to use it.
I'm deploying a model and thinking of using Apache Beam to run the feature-processing tasks through its Python API. The documentation is pretty big and complex, but I went through most of it and even built a small working pipeline, and it is still not clear whether this is the right tool for me.
An example of what I need is the following:
Input data structure:
ID | Timestamp | category
output needed:
category | category count for last 30 minutes (feature example)
This process needs to run every 5 minutes and update the counts.
===> What I fail to understand is whether Beam can run this pipeline every 5 minutes, read whatever new input data has been generated, and update the counts from the previous run. And if so, can someone point me in the right direction?
Thank you!
When you run a Beam pipeline manually, it's expected to be started only once. It can be either a bounded (batch) or an unbounded (streaming) pipeline. In the first case, it stops after all of your bounded data has been processed; in the second case, it runs continuously and expects new data to arrive (until it is stopped manually).
Usually, the type of pipeline depends on the data source that you have (Beam IO connectors). For example, if you read from files, then by default it's assumed to be a bounded source (a limited number of files), but it could be an unbounded source as well if you expect new files to keep arriving and want to read them in the same pipeline.
Also, you can run your batch pipeline periodically with automated tools like Apache Airflow (or just a Unix crontab). So it all depends on your needs and the type of data source. I could probably give more specific advice if you share more details of your data pipeline: the type of data source and environment, an example of input and output results, how often your input data can be updated, and so on.
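To make the batch-plus-scheduler idea concrete, here is a rough sketch of a bounded Beam pipeline (Python SDK) that a scheduler could kick off every 5 minutes. The file names, the ISO timestamp format, and the 30-minute window are assumptions on my side, not something from your setup.

```python
# A rough sketch (not your exact pipeline) of a bounded Beam job that a
# scheduler could start every 5 minutes. The file names, the ISO timestamp
# format and the 30-minute window are assumptions.
import csv
from datetime import datetime, timedelta, timezone

import apache_beam as beam


def parse_line(line):
    # Expected layout per line: ID,Timestamp,category
    _record_id, ts, category = next(csv.reader([line]))
    return category, datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)


def run():
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=30)
    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("events.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(parse_line)
            | "Last30Minutes" >> beam.Filter(lambda kv: kv[1] >= cutoff)
            | "Categories" >> beam.Keys()
            | "CountPerCategory" >> beam.combiners.Count.PerElement()
            | "Format" >> beam.MapTuple(lambda cat, n: f"{cat},{n}")
            | "Write" >> beam.io.WriteToText("category_counts")
        )


if __name__ == "__main__":
    run()
```

A crontab entry such as */5 * * * * python count_categories.py (or an Airflow DAG with a 5-minute schedule) then simply re-runs the whole bounded pipeline each time; each run reads whatever input currently exists and rewrites the counts.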
During my Cloud Dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or joins with these sets, Dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that DataPrep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). Dataflow will then perform a Parallel Do (ParDo) function where it takes each element from a PCollection, stuffs it into a worker node, and then applies the side-input data for the transformation.
So I am pretty sure if the reference sets get too big (which can happen with Joins), the underlying code will take an element from dataset A, pass it to a function with side-input B...but if side-input B is very big, it won't be able to fit into the worker memory. Take a look at the Stackdriver logs for your job to investigate if this is the case. If you see 'GC (Allocation Failure)' in your logs this is a sign of not enough memory.
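As a minimal illustration of that pattern (this is not the code DataPrep actually generates), a side-input join in the Beam Python SDK looks roughly like this; the whole reference dict is materialized and shipped to every worker, which is exactly why a large reference set blows up worker memory:

```python
# A minimal illustration (not the code DataPrep actually generates) of the
# side-input pattern described above: the reference PCollection is turned
# into a dict and handed to every worker, so it must fit in worker memory.
import apache_beam as beam

with beam.Pipeline() as p:
    main = p | "Main" >> beam.Create([("a", 1), ("b", 2), ("c", 3)])
    reference = p | "Reference" >> beam.Create([("a", "alpha"), ("b", "beta")])

    enriched = main | "Enrich" >> beam.Map(
        # ref_dict is the side input; Beam ships the whole dict to each worker.
        lambda kv, ref_dict: (kv[0], kv[1], ref_dict.get(kv[0], "unknown")),
        ref_dict=beam.pvalue.AsDict(reference),
    )
    enriched | "Print" >> beam.Map(print)
```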
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of join, it will very quickly outgrow the worker memory and puke. If you can, see whether you can pre-process so that one of the files stays in the MB range and only the other file grows.
If your data structures don't lend themselves to that option, you could do what the systems engineers suggested: split one file up into many small chunks and then feed it to the recipe iteratively against the other, larger file.
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if the job finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon; they don't make it easy to ask them questions, that's for sure.
I want to check an item in my database every x minutes for y minutes after its creation.
I've come up with two ways to do this and I'm not sure which will result in better efficiency/speed.
The first is to store a Date field in the model and do something like
Model.find({ time_created: { $gt: new Date(Date.now() - y * 60 * 1000) } })
inside a cron job every x minutes.
The second is to keep a times_to_check field that tracks how many more times, based on x and y, the object should be checked.
Model.find({ times_to_check: { $gt: 0 } })
My thought on why these two might be comparable is that the Date comparison in the first would take longer, while the second requires a write to the database after each time the object is checked.
So either way you are going to have to check the database continuously to see if it is time to query your collection. In your "second solution" you do not have a way to run your background process, as you are only describing how you determine your collection delta.
Stick with running your Unix cron job, but make sure it is fault tolerant and have controls ensuring it is actually running whenever your application is up. Below is a pretty good answer for how to handle that.
How do I write a bash script to restart a process if it dies?
Based on that, I would ask: how does your application react if your cron job has not run for x minutes, hours, or days? How will your application recover if that happens?
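For what it's worth, a sketch of the cron-driven variant (Python with pymongo; the database, collection, and field names and the y window are placeholders) would look something like this, with cron responsible for the "every x minutes" part:

```python
# A sketch of the cron-driven approach (Python + pymongo). The database,
# collection and field names, and the y-minute window, are placeholders.
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

Y_MINUTES = 60  # check each item for y minutes after its creation


def check_recent_items():
    client = MongoClient("mongodb://localhost:27017")
    items = client["mydb"]["items"]

    cutoff = datetime.now(timezone.utc) - timedelta(minutes=Y_MINUTES)
    # No extra counter field needed: the Date comparison selects everything
    # still inside its checking window.
    for item in items.find({"time_created": {"$gt": cutoff}}):
        check(item)


def check(item):
    print("checking", item["_id"])  # whatever "checking" means for your app


if __name__ == "__main__":
    # crontab entry for "every x minutes", e.g. every 5:
    #   */5 * * * * /usr/bin/python3 check_items.py
    check_recent_items()
```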
I have to start moving transactional data into a reporting database, but would like to move towards a more warehouse/data mart design, eventually leveraging SQL Server Analysis Services.
The thing being measured is the time between points of a workflow on a piece of work. How would you model that when the things that can happen do not have a specific order? Also, some work won't have all the actions, or might have the same action multiple times.
It makes me want to put the data into a typical relational design, with one table for the key (the piece of work) and another table that has all the actions and times. Is that wrong? The business is going to try to use Tableau for report writing, and I know it can handle all kinds of sources, but again, I would like to move away from a transactional design toward warehousing.
Is the work the dimension, and are the actions and times the facts?
Are there any other good online resources for modeling questions?
Thanks
It may seem like splitting hairs, but you don't want to measure the time between points in a workflow; you need to measure the time within a point of a workflow. If you change your perspective, it can become much easier to model.
Your OLTP system will likely capture the timestamp of when the event occurred. When you convert that to OLAP, you should turn that into a start & stop time for each event. While you're at it, calculate the duration, in seconds or minutes, and the occurrence number for the event. If the task was sent to "Design" three times, you should have three design events, numbered 1,2,3.
If you want to know how much time a task spent in design, the cube will sum the duration of all three design events to present a total time. You can also add some calculated measures to determine first time in and last time out.
Having the start & stop times of the task allows you, for example, to find all of the tasks that finished design in January.
If you're looking for an average above the event grain, for example the average time in design across all tasks, you'll need a new calculated measure using total time in design / number of tasks (not events).
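To make that concrete, here is a small pandas sketch of the reshaping (illustrative only; the column names are my own): each OLTP event timestamp becomes a start time, the next event on the same piece of work supplies the stop time, and the duration, occurrence number, and task-level average fall out.

```python
# A small pandas sketch of that reshaping (column names are made up): each
# event timestamp becomes a start time, the next event on the same piece of
# work supplies the stop time, and duration and occurrence number are derived.
import pandas as pd

events = pd.DataFrame({
    "work_id": [1, 1, 1, 1],
    "action": ["Design", "Review", "Design", "Build"],
    "event_ts": pd.to_datetime([
        "2024-01-02 09:00", "2024-01-02 11:30",
        "2024-01-03 08:00", "2024-01-03 15:00",
    ]),
}).sort_values(["work_id", "event_ts"])

facts = events.rename(columns={"event_ts": "start_ts"})
# A state ends when the next event on the same piece of work begins.
facts["stop_ts"] = facts.groupby("work_id")["start_ts"].shift(-1)
facts["duration_min"] = (facts["stop_ts"] - facts["start_ts"]).dt.total_seconds() / 60
# "Design" occurs twice for this task -> occurrences 1 and 2.
facts["occurrence"] = facts.groupby(["work_id", "action"]).cumcount() + 1
print(facts)

# Average above the event grain: total time in Design divided by the number
# of distinct tasks, not by the number of Design events.
design = facts[facts["action"] == "Design"]
print(design["duration_min"].sum() / design["work_id"].nunique())
```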
Assuming you have more granular states, it is a good idea to define parent states for use in executive reporting. In my company, the operational teams have workflows with 60+ states, but management wanted them rolled up into five summary states. The rollup hierarchy should be part of your workflow states dimension.
Hope that helps.