Cube/Mongo: Custom metric resolutions (step)

According to the documentation, square/cube supports 5 metric resolutions (step values); the lowest is 10 seconds. I understand this is required in order to allow pyramidal reducers. Will cube work correctly (though less efficiently) with any arbitrary step value, or are there other problems? If it is just an efficiency issue, how bad would it be? Even with the built-in step values it takes time for the cache to fill for all options.

I faced a similar situation when creating horizon charts of stock data. Some stocks are not traded at all moments during the day.
In this situation, I "backfilled" the intermediate values and created a uniform distribution. Essentially, I took the latest data and carried it forward to each newer timestamp until new data was available.
For example, if I had the following prices for minute-by-minute data:
11:15 AM -> 112.0
11:18 AM -> 115.0
My program created the following "imaginary" intervals:
11:15 AM -> 112.0
11:16 AM -> 112.0
11:17 AM -> 112.0
11:18 AM -> 115.0
My program used a JSON data source, so manipulating these values was really easy. I have never used cube/mongo and so I don't know how easy it will be to do the same there.
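For what it's worth, here is a minimal sketch of that forward-fill idea in plain Python; the sample timestamps, prices, and one-minute step are made-up assumptions, not anything from cube/mongo:

from datetime import datetime, timedelta

# Sparse observations: timestamp -> price (hypothetical sample data)
prices = {
    datetime(2013, 1, 1, 11, 15): 112.0,
    datetime(2013, 1, 1, 11, 18): 115.0,
}

def backfill(observations, step=timedelta(minutes=1)):
    # Carry the last known value forward so every step has a data point.
    times = sorted(observations)
    filled = {}
    current, last_value = times[0], observations[times[0]]
    while current <= times[-1]:
        if current in observations:
            last_value = observations[current]
        filled[current] = last_value
        current += step
    return filled

for ts, price in sorted(backfill(prices).items()):
    print(ts.strftime("%I:%M %p"), "->", price)   # 11:15 AM -> 112.0, 11:16 AM -> 112.0, ...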
Does this answer your question?

Related

Measuring system time of specific agent in anylogic

I've got 3 different product types of agents, each of which follows its own path through the factory. How can I measure the average time each product type spends in the system?
My logic looks like this, and I wanted to start the measurement in the first Service block and complete it in the last Service block.
Now I get some really high numbers, which are absolutely wrong. The process itself works fine; if you run the measurement with the code "//agent.enteredSystemP1 = time()", you will get a mean of 24 minutes per product. But how can I get the mean per product type?
Just use the same if-elseif-else distinction in the 2nd Service block as well.
Currently, any agent leaving the system adds its time to every systemTimeDistribution, regardless of its product type.
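Purely as a toy illustration of that per-type bookkeeping (plain Python with made-up times, not AnyLogic code; in the model this logic lives in the first and last Service blocks):

# Record the entry time per agent in the first block, and add the elapsed
# time to that agent's own product-type statistic in the last block.
entered = {}
times_by_type = {"P1": [], "P2": [], "P3": []}

def on_enter(agent_id, product_type, now):
    entered[agent_id] = (product_type, now)

def on_exit(agent_id, now):
    product_type, t0 = entered.pop(agent_id)
    times_by_type[product_type].append(now - t0)   # only this agent's own type is updated

on_enter(1, "P1", 0.0); on_exit(1, 24.0)
on_enter(2, "P2", 5.0); on_exit(2, 35.0)
print({k: sum(v) / len(v) for k, v in times_by_type.items() if v})   # {'P1': 24.0, 'P2': 30.0}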

Utilization of a Resource

Is there any way to calculate the utilization of a given resource in a specific time frame? I have a machine that works 24 hours a day, but during daytime hours its utilization is higher than during nighttime hours.
In the final statistics, using the function "machine.utilization()" I get a low result, which is influenced by the night hours. How can I split the two statistics?
Utilization is calculated as (work time)/(available time excluding maintenance), which means that the measure described in your question can be achieved in 2 ways:
Make the machine 'unavailable' during the night, this way that time will be excluded in calculations
ResourcePool object has 2 properties on resource seize and on resource release which can be used to record specific instances of work time, sum it up and divide only by a period of (8hr * (num of days)) instead of total time from model start
For a little more detail and link to AnyLogic help please see the answer to another question here.
Update:
In ResourcePool's On seize and On release actions, AnyLogic provides a parameter called unit, which is the specific resource unit agent being seized or released. So getting the actual use time per unit requires the following:
2 collections of type LinkedHashMap that map Agent -> Double: one to store start times (let's call it col_start) and one to store total use times (let's call it col_worked)
On seize should contain this code: col_start.put(unit, time())
On release should contain:
double updated = col_worked.getOrDefault(unit, 0.0) + (time() - col_start.get(unit));
col_worked.put(unit, updated);
This way at any given point during model execution, col_worked will contain a mapping of resource unit Agent to the total sum of time it was utilised expressed as a double value in model time units.
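To illustrate the arithmetic those two maps perform, here is a hedged sketch in plain Python (the unit name and times are made up; the actual model uses the Java snippets above):

# Toy version of the col_start / col_worked bookkeeping.
col_start = {}
col_worked = {}

def on_seize(unit, now):
    col_start[unit] = now

def on_release(unit, now):
    col_worked[unit] = col_worked.get(unit, 0.0) + (now - col_start[unit])

on_seize("operator_1", 2.0); on_release("operator_1", 5.5)    # worked 3.5
on_seize("operator_1", 7.0); on_release("operator_1", 9.0)    # worked 2.0
print(col_worked["operator_1"])   # 5.5 model time units in total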

Two questions of TradeStation charts

I am evaluating which one to choose, TWS from IB or TS. TWS has a demo account but TS does not. I have two questions about TS stock chart.
Does premarket data in charts start from 4am, or later? TWS starts from 4am. I know TS only allows trades after 8am; I am just wondering if premarket data in the chart also starts late.
When premarket data is displayed along with regular-trading-hours data, is the HOURLY candle aligned with 9am or 9:30am? TWS has hourly candles aligned with 9am, not the market opening time. I honestly don't like it. I am just wondering if TS does the same thing.
If anyone can answer these two questions, I would really appreciate it!
Thanks,
Jay
You can set the sessions from which you want TS to show you data; if the data is there, it will find it. So yes, I can, for example, set my TS chart to show 'premarket' AAPL data from 0300 (or really any other time). I can't trade it before the exchange opens though, as that would be an OTC/off-exchange trade, which you have to be an institution to do.
Candles are aligned with 0930, but that doesn't matter; it can be changed in settings. Also, when backtesting, you should use LIBB (Look Intra Bar).
Hope this helps! I strongly recommend you take a close look at MultiCharts too, using IQFeed data.
Also remember, all these systems are buggy as all hell. Managing workflow with them is as much about learning and overcoming their eccentricities, as it is doing the work.

Anylogic Error: Can't Convert from Population to Resource Pool

I am a new Anylogic user and I am trying to set up a simple simulation like so:
I have one ResourcePool named People.
People uses a Schedule named People_Schedule that is sourced from a DB table with a date column and the number of resources (employees) available for that specific interval.
2020-07-09 10:30 32
2020-07-09 11:00 35
2020-07-09 11:30 40
I have a pair of Sources named G_Task and M_TASK.
Both of these Sources use Schedules that are sourced from a DB table that has a date column, a work type column, and the number of work items (tasks) expected for that specific interval.
2020-07-09 10:30 M 100
2020-07-09 10:30 G 50
2020-07-09 11:00 M 125
I have a specific Service for each of my Sources, but my issue is that when selecting the Resource pool setting in Properties, there is no dropdown entry for my ResourcePool named People. I can type in People, which then highlights both items at the same time on the grid as if they are linked, but when I hover over the error on the Resource pool setting, I get this message:
Type mismatch: cannot convert from Main._People_Population to ResourcePool
What am I missing? I have tried searching through the Anylogic help and through Google, but I am finding nothing.
As a side question, are there any Anylogic users that have gotten past the learning curve and are comfortable with the level of troubleshooting out there on the web (or lack thereof)? I am used to being able to search SQL/Excel/Python questions and finding tons of resources almost immediately.
Thanks,
Ray

Massive time increase when segmenting column and writing parquet file

I work with clinical data, so I apologize that I can't display any output, as it is HIPAA regulated, but I'll do my best to fill in any gaps.
I am a recent graduate in data science, and I never really spent much time working with any Spark system, but I do now in my new role. We are collecting output from a UDF that I will call udf_function, which wraps the Python function call_function: it takes a clinical note (report) from a physician and returns the output described below. Here is the code that I use to complete this task:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def call_function(report):
    # Python code that generates the lists a, b, and c, which I
    # join together to return strings of the combined list items
    a = ",".join(a)
    b = ",".join(b)
    c = ",".join(c)
    return [a, b, c]

udf_function = udf(lambda y: call_function(y), ArrayType(StringType()))

mid_frame = df.select('report',
                      udf_function('report').alias('udf_output')
                      )
This returns an array of length 3, which contains strings with the information returned from the function. On a selection of 25,000 records, I was able to complete the run on a 30-node cluster on GCP (20 workers, 10 preemptible) in just a little over 3 hours the other day.
I changed my code a bit to parse the three objects out of the array, as the three objects contain different types of information that we want to analyze further, which I'll call a, b, and c (again, sorry if this is vague; I'm trying to keep the actual data as surface level as possible). The previous 3-hour run didn't write out any files, as I was testing how long the system would take.
output_frame = mid_frame.select('report',
                                mid_frame['udf_output'].getItem(0).alias('a'),
                                mid_frame['udf_output'].getItem(1).alias('b'),
                                mid_frame['udf_output'].getItem(2).alias('c')
                                )
output_frame.show()
output_frame.write.parquet(data_bucket)
This task of parsing the output and writing the files took an additional 48 hours. I think I could stomach the time lost if I were dealing with HUGE files, but the output is 4 parquet files which come to 24.06 MB total. Looking at the job logs, the writing process itself took just about 20 hours.
Obviously I have introduced some extreme inefficiency, but since I'm new to this system and style of work, I'm not sure where I have gone awry.
Thank you to all that can offer some advice or guidance on this!
EDIT
Here is an example of what report might be and what the return would be from the function
This is a sentence I wrote myself, and thus it is not pulled from any real record:
report = 'The patient showed up to the hospital, presenting with a heart attack and diabetes'
# code
return ['heart attack, diabetes', 'myocardial infarction, diabetes mellitus', 'X88989,B898232']
where the first item is the actual string in the sentence that is tagged by the code, the second item is the professional medical equivalent, and the third item is simply a code which helps us find the diagnosis hierarchy among other codes.
If you only have 4 parquet files as output, that suggests your partition count is too low; try repartitioning before you write out. For example:
output_frame = output_frame.repartition(500)
output_frame.write.parquet(data_bucket)
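One extra caveat, purely as an assumption on my part rather than something stated in the question: Spark evaluates lazily, so show() and the parquet write can each trigger the expensive Python UDF again. Caching the UDF output before splitting it into columns may also help. A rough sketch, reusing the names from the question:

# Cache the UDF output so show() and the parquet write don't both re-run the UDF.
mid_frame = df.select('report', udf_function('report').alias('udf_output')).cache()

output_frame = mid_frame.select(
    'report',
    mid_frame['udf_output'].getItem(0).alias('a'),
    mid_frame['udf_output'].getItem(1).alias('b'),
    mid_frame['udf_output'].getItem(2).alias('c'),
).repartition(500)          # spread the write across more partitions, as suggested above

output_frame.write.parquet(data_bucket)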