Massive time increase when segmenting column and writing parquet file - pyspark

I work with clinical data, so I apologize that I can't display any output, as it is HIPAA regulated, but I'll do my best to fill in any gaps.
I am a recent graduate in data science and never really spent much time working with any Spark system, but I do now in my new role. We are collecting output from a function that I will call udf_function, which takes a clinical note (report) from a physician and returns output produced by the Python function call_function. Here is the code that I use to complete this task:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def call_function(report):
    # python code that generates the lists a, b, c, which I
    # join together to return strings of the combined list items
    a = ",".join(a)
    b = ",".join(b)
    c = ",".join(c)
    return [a, b, c]

udf_function = udf(lambda y: call_function(y), ArrayType(StringType()))

mid_frame = df.select(
    'report',
    udf_function('report').alias('udf_output')
)
This returns an array of length 3 containing strings built from the information the function returns. On a selection of 25,000 records, I was able to complete the run on a 30-node cluster on GCP (20 workers, 10 preemptible) in just a little over 3 hours the other day.
I changed my code a bit to parse the three objects out of the array, as they contain different types of information that we want to analyze further, which I'll call a, b, c (again, sorry if this is vague; I'm trying to keep the actual data as surface level as possible). The previous 3-hour run didn't write out any files, as I was just testing how long the system would take.
output_frame = mid_frame.select(
    'report',
    mid_frame['udf_output'].getItem(0).alias('a'),
    mid_frame['udf_output'].getItem(1).alias('b'),
    mid_frame['udf_output'].getItem(2).alias('c')
)
output_frame.show()
output_frame.write.parquet(data_bucket)
This task of parsing the output and writing the files took an additional 48 hours. I think I could stomach the time lost if I were dealing with HUGE files, but the output is 4 parquet files that come to 24.06 MB total. Looking at the job logs, the writing process itself took just about 20 hours.
Obviously I have introduced some extreme inefficiency, but since I'm new to this system and style of work, I'm not sure where I have gone awry.
Thank you to all that can offer some advice or guidance on this!
EDIT
Here is an example of what report might be and what the function would return (this is a sentence I wrote myself, and thus is not pulled from any real record):
report = 'The patient showed up to the hospital, presenting with a heart attack and diabetes'
# code
return ['heart attack, diabetes', 'myocardial infarction, diabetes mellitus', 'X88989,B898232']
where the first item is the actual string in the sentence that is tagged by the code, the second item is the professional medical equivalent, and the third item is simply a code that helps us place the diagnosis in a hierarchy relative to other codes.

If you only end up with 4 parquet output files, that suggests your DataFrame has only 4 partitions, which is too few. Try repartitioning before you write out. For example:
output_frame = output_frame.repartition(500)
output_frame.write.parquet(data_bucket)
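To make the reasoning concrete: at write time Spark produces roughly one file per partition, and the expensive UDF runs in only as many parallel tasks as there are partitions, however large the cluster is. Below is a hedged sketch (not the poster's exact job) that also repartitions before the UDF is applied, so the per-report work is spread out as well; df, call_function and data_bucket are the names from the question.

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Sketch only: df, call_function and data_bucket come from the question above.
udf_function = udf(lambda y: call_function(y), ArrayType(StringType()))

mid_frame = (
    df.repartition(500)   # spread the rows across many tasks before the UDF runs
      .select('report', udf_function('report').alias('udf_output'))
)

output_frame = mid_frame.select(
    'report',
    mid_frame['udf_output'].getItem(0).alias('a'),
    mid_frame['udf_output'].getItem(1).alias('b'),
    mid_frame['udf_output'].getItem(2).alias('c')
)

output_frame.write.parquet(data_bucket)   # roughly one file per partition

With only 25,000 rows the partition count mainly controls how many tasks run the UDF at once; 500 is simply the number used above, not a tuned value.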

Related

Parameter Variation: Tracking the Metadata

I am trying to use parameter variation in AnyLogic. My inputs are 3 parameters, each varying 5 times. My output is water demand. What I need from parameter variation is the way in which demand changes according to the different combinations of the three parameters. I imagine something like: there are 10,950 rows (one for each day), the first column is time (in days), the second column contains the values for the first combination, the third column the second combination, and so on and so forth. What would be the best way to track this metadata so I can then export it to Excel? I have added a "dataset" to my main to track demand through each simulation, but I am not sure what to add to the parameter variation experiment interface to track the output across the different iterations. It would also be helpful to have a way to know which combination of inputs produced a given output (for example, having the combination be the name of each column). I see that there are Java Actions, but I haven't been able to figure out the code to do what I need. I appreciate any help with this matter.
The easiest approach is just to track this in output database tables that are then exported to Excel at the end of your run. As long as those tables include outputs from multiple runs (and are, for example, only cleared at the start of the experiment, not of each run), your Parameter Variation experiment will end up with an Excel file containing the outcomes of all the runs. (You will probably need to turn off parallel execution in the PV experiment so you don't run into issues trying to write to the same Excel file in parallel.)
So, for example, you might have tables:
run_details with columns id, parm1, parm2 and parm3 (with proper column names given your actual parameters and some unique ID generated for each run)
output_demand with columns run_id, sim_time_hrs and demand_value (if, say, you're storing some demand value each hour of simulated time) where run_id cross-references the run's ID in run_details
(There is extra complexity in how you could allocate a unique run ID and how and when you write to/clear those tables, but I'm just presenting the core design. You can also get round the need-serial-execution point by programmatically controlling when you export to Excel, rather than using the built-in "Export tables at the end of model execution" capability, but that's also more complicated.)
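Not strictly part of the AnyLogic side, but if you later want the wide layout described in the question (one row per time step, one column per parameter combination), that reshaping is easy to do on the exported file afterwards. A minimal Python/pandas sketch, assuming the export contains the two tables above; the file and sheet names are hypothetical:

import pandas as pd

# Hypothetical file/sheet names; adjust to match your AnyLogic export.
runs = pd.read_excel("experiment_output.xlsx", sheet_name="run_details")
demand = pd.read_excel("experiment_output.xlsx", sheet_name="output_demand")

# Label each run by its parameter combination, e.g. "p1=3_p2=7_p3=1".
runs["combo"] = (
    "p1=" + runs["parm1"].astype(str)
    + "_p2=" + runs["parm2"].astype(str)
    + "_p3=" + runs["parm3"].astype(str)
)

# Attach the combination label to every demand row, then pivot:
# one row per time step, one column per parameter combination.
merged = demand.merge(runs, left_on="run_id", right_on="id")
wide = merged.pivot(index="sim_time_hrs", columns="combo", values="demand_value")

wide.to_excel("demand_by_combination.xlsx")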

Data processing of Dymola's results during simulation

I am working on a complex Modelica model that contains a large set of data, and I need the simulation to keep going until I terminate the simulation process, maybe even for days, so the .mat file could get very large and I'm having trouble with the data processing. So I'd like to ask if there are any methods that allow me to:
1. output the data I need at a fixed interval during the simulation, rather than reading the .mat file after the simulation. I am considering using the Modelica.Utilities.Streams.print function to print the data I need into a CSV file, but I would have to write a huge amount of code to print every variable I need, so I think there should be a better solution.
2. delete the .mat file at a fixed interval, so the .mat file stored on my PC doesn't get too large and doesn't affect the normal simulation in Dymola.
A long time ago I wrote a small C program that runs the Dymola executable with two threads. One of them is responsible for terminating the whole simulation once an input time limit is exceeded. I used the executable of this C program within the standard m-files provided by Dymola. I think with some hacking capabilities, one would be able to meet the mentioned requirements.
Have a look at https://github.com/Mathemodica/dymmat, though I need to warn you that the associated m-files were for a particular type of model and the software has not been maintained for a long time. However, the idea of the C program would be reproducible.
I didn't fully test this, so please think of it more as a "source of inspiration" than a full answer:
In Section "4.3.6 Saving periodic snapshots during simulation" of the Dymola 2021 Release Notes you'll find a description of the following:
The simulator can be instructed to print the simulation result file “dsfinal.txt” snapshots during simulation.
This can be done periodically using the Simulation Setup options "Complete result snapshots", but I think for your case it could be more useful to trigger it from the model using the function Dymola.Simulation.TriggerResultSnapshot(). A simple example is given as well:
when x > 0 then
  Dymola.Simulation.TriggerResultSnapshot();
end when;
Also one property of this function could help, as it by default creates multiple files without overwriting them:
By default, a time stamp is added to the snapshot file name, e.g.: “dsfinal_0.1.txt”.
The format of the created dsfinal_[TIMESTAMP].txt is a bit overwhelming at first, as it contains all the information needed to initialize the model, but everything you need should be in there...
So some effort is shifted to the post processing, as you will likely need to read multiple files, but I think this is an acceptable trade-off.
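If it helps as a starting point for that post-processing, here is a rough Python sketch; parse_dsfinal is a placeholder for whatever reader you end up using for the dsfinal text format, and the file name pattern is assumed from the naming convention quoted above:

import glob
import re

def parse_dsfinal(path):
    # Placeholder: parse the dsfinal text format and return only the
    # variables you actually need; the format is described in the
    # Dymola documentation.
    raise NotImplementedError

# Collect all snapshots, sorted by the time stamp in the file name,
# e.g. dsfinal_0.1.txt, dsfinal_0.2.txt, ...
snapshots = sorted(
    glob.glob("dsfinal_*.txt"),
    key=lambda p: float(re.search(r"dsfinal_(.+)\.txt", p).group(1)),
)

for path in snapshots:
    values = parse_dsfinal(path)
    # ... append whatever you need to a CSV file, database, etc.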

How to capture the last record in a file

I have a requirement to split a sequential file into 3 parts: Header, Data, Trailer. I have the Header and Data worked out.
Is there a way, in a Transformer stage, to determine whether you have the last record in a sequential file? I tried using LastRow(), but that gives me the last row on each node. I need to keep parallel execution on.
Thanks in advance for any help.
You have no a priori knowledge about which node the trailer row will come through on. There is therefore no solution in a Transformer stage if you want to retain parallel execution.
One way to do it is to have a reject link on the Sequential File stage. This will capture any row that does not match the defined metadata. Set up the stage with the metadata for your Data rows, then the Header and Trailer will be captured onto the reject link. It should be pretty obvious from their data which is which, and you can process them further and perhaps even rejoin them to your Data rows.
You could also capture the last row separately (e.g. via tail -1 filename) and compare it against every row processed to determine whether it's the last one. That's computationally heavy for very little gain.

Writing different experiment output runs to different cells in a sheet (Excel file)

I am simultaneously running a model with different input values, and it produces different output on each run. I am trying to write code that will get AnyLogic to write each experiment run's output to a different cell in an Excel sheet, i.e. throughput vs. time. I am using a dataset. I'm wondering if there is any script or hint that can help solve the issue?
Currently I am using the following commands. They keep overwriting the output in the same cells.
Out_excelFile1.setCellValue("Sink1 Out",2,2,2);
Out_excelFile1.writeDataSet(Sink1_D,2,3,2);
Best if you actually use the built-in database for outputs and only write to Excel at the end of all runs, tbh.
But in your case, you need to change the row number based on your replication/iteration number. Use getCurrentIteration() or getCurrentReplication() in your "After simulation run", "After replication" or "After iteration" experiment code sections to get this right.
Then it would look something like Out_excelFile1.setCellValue("Sink1 Out", 2, getCurrentIteration(), 2);
(Details depend on your actual implementation; check the help for further info on replications, iterations and those functions.)

Discrepancy in Date Created attribute between Powershell script output and Windows Explorer

I wrote a simple PowerShell script that recursively walks a file tree and returns the path of each node along with its creation time, in tab-separated form, so that I can write it out to a text file and use it for statistical analysis:
echo "PATH CREATEDATE"
get-childitem -recurse | foreach-object {
$filepath = $_.FullName
$datecreated = $_.CreationTime
echo "$filepath $datecreated"
}
Once I had done this, however, I noticed that the creation times produced by the script are exactly one hour ahead of what Windows Explorer shows for the same attribute of the same files. Based on inspecting the rest of my dataset (which recorded surrounding events in a different format), it's clear that the results I get from Explorer are the only ones that fit the overall narrative, which leads me to believe that something in the PowerShell script makes it write out the incorrect time. Does anyone have a sense of why that might be?
Problem background:
I'm trying to correct for a problem in the design of some XML log files, which logged when the users started and stopped using an application when it was actually supposed to log how long it took the users to get through different stages of the workflow. I found a possible way to overcome this problem, by pulling date information from some backup files that the users sent along with the XML logs. The backups are generated by our end-user application at the exact moment when a user transitions between stages in the workflow, so I'm trying to bring information from those files' timestamps together with the contents of the original XML log to figure out what I wanted to know about the workflow steps.
Summary of points that have come out in comment discussion:
The files are located on the same machine as the script I'm running (not a network store)
Correcting for daylight saving time and time zones improved the data quality, but did not resolve the specific issue posed in the original question.
I never found the ultimate technical reason for the discrepancy between the timestamps from PowerShell vs. Explorer, but I was able to correct for it by simply subtracting an hour from all the timestamps I got from the PowerShell script. After doing that, however, there was still a large amount of disagreement between the timestamps I got out of my XML log files and the ones I pulled from the filesystem using the PowerShell script. Reasoning that the end users probably stayed in the same time zone while they were generating the files, I wrote a little algorithm to estimate the time zone of each user by evaluating the median amount of time between steps 1 and 2 in the workflow and between steps 2 and 3. If there was a problem with the user's time zone, one of those two timespans would be negative (since the time of the step 2 event was estimated and the times of the step 1 and 3 events were known from the XML logs). I then rounded the positive value down to the nearest hour and applied that number of hours as an offset to that user's step 2 times. Overall, this took the amount of bad data in my dataset from 20% down to 0.01%, so I'm happy with the results.
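For anyone curious what that estimation step looks like in code, here is a rough Python sketch of the idea (not the code I actually used; the function and argument names are made up for illustration):

from statistics import median

def estimate_offset_hours(step1_times, step2_times, step3_times):
    # step2_times are the estimated values from the backup-file timestamps;
    # step 1 and step 3 times are known from the XML logs.
    span_1_to_2 = median((t2 - t1).total_seconds() for t1, t2 in zip(step1_times, step2_times))
    span_2_to_3 = median((t3 - t2).total_seconds() for t2, t3 in zip(step2_times, step3_times))

    # A wrong time zone shows up as one of the two median spans being negative.
    if span_1_to_2 >= 0 and span_2_to_3 >= 0:
        return 0  # time zone looks consistent; no correction needed

    # Round the positive span down to the nearest whole hour; that many hours
    # is then applied as an offset to the user's step 2 times (with whatever
    # sign is appropriate for your data).
    positive_span = max(span_1_to_2, span_2_to_3)
    return int(positive_span // 3600)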
In case anyone needs it, here's the code I used to make the hour offset in the timestamps (not powershell code, this was in a C# script that handled another part of data processing):
// Parse the (estimated) step 2 time from the backup file's timestamp
DateTime step2time = DateTime.Parse(LastModifyDate);
// Rebuild the time of day with one hour subtracted
TimeSpan shenanigansCorrection = new TimeSpan(step2time.Hour - 1, step2time.Minute, step2time.Second);
step2time = step2time.Date + shenanigansCorrection;
The reason for reassigning the step2time variable is that DateTime values aren't mutable in .NET.