I'm using Get-ChildItem to read the files in a folder, then I get the LastWriteTime for each file and sort on it.
Do I have to close the files after getting the LastWriteTime?
No. This is just a list of information about the files; no streams are opened and no locks are taken.
As has already been mentioned, the answer is no.
The reason is that when you use Get-ChildItem against the file system it doesn't actually open any files - it queries an underlying API which returns metadata about the files in the file system - so there's no file handle to be "closed".
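For example, the sorting task from the question runs purely on that metadata (the folder path below is just a placeholder):

# Sort the files in a folder by last write time - nothing is opened, so nothing needs closing.
Get-ChildItem -Path C:\SomeFolder |
    Sort-Object -Property LastWriteTime |
    Select-Object -Property Name, LastWriteTime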
From the comments on your question, I sense some confusion as to "why don't I need to manage system resource allocation in PowerShell?"
PowerShell runs on .NET, and the .NET runtime is garbage-collected. At some (undefined) point in time after a block of memory is no longer referenced by any pointers, the garbage collector will take care of freeing it, and you don't need to worry about managing this process yourself.
Of course there are situations in which resource allocation is external to the runtime and has to be managed, but the usual pattern in .NET is to implement the IDisposable interface when defining classes that depend on unmanaged resources. An example is a StreamReader (with which you could read a text file). In C# you can use the using statement to automatically dispose of such an object once execution leaves the scope in which it's required:
using (StreamReader reader = File.OpenText(@"C:\path\to\file.txt"))
{
    // use reader in here
}
// at this point, reader.Dispose() has been called automatically
In PowerShell, there is no such semantic construct. What I usually do when allocating many disposable objects is wrap them in a try/finally block:
try {
    $FileReader = [System.IO.File]::OpenText("C:\path\to\file.txt")
    # use $FileReader here
}
finally {
    if ($FileReader -ne $null) {
        $FileReader.Dispose()
    }
}
Of course all of this is hidden away from you when invoking Get-Content, for example - the developer of the underlying file system provider has already taken care of disposing the objects by the time the pipeline stops running. It's really only needed when you write your own cmdlets or interact with more "primitive" types directly.
I hope this sheds some light on your confusion.
I'm working on a C++ program that uses boost::python to provide a python wrapper/API for the user. The program tracks and limits its own memory usage by opening /proc/self/statm using a file descriptor. Every timestep it seeks to the beginning of that file and reads the vmsize from it.
proc_self_statm_fd = open( "/proc/self/statm", O_RDONLY );
However, this causes a problem when calling fork(). In particular, when a user writes a python script that does something like this:
proc = multiprocessing.Process(name="bkg_process",target=bkg_process,daemon=True)
The problem is that the forked process gets the file descriptor pointing to /proc/self/statm from the parent process, not its own, and this reports the wrong memory usage. Even worse, if the parent process exits, the child process will fail when trying to read from the file descriptor.
What's the correct solution for this? It needs to be handled at the C++ level because we don't have control over the user's Python scripts. Is there a way to have the class auto-detect that a fork has happened and grab a new file descriptor? In the worst case I can have it re-open the file for every update, but I'm worried that would add runtime overhead.
You could store the PID in the class, and check it against the value of getpid() on each call, and then reopen the file if the PID has changed. getpid() is typically much cheaper than open - on some systems it doesn't even need a context switch (it just fetches the PID from a magic location in the process's own memory).
That said, you may also want to actually measure the cost of reopening the file each time - it may not actually be significant.
I would like to come up with a mechanism by which I can share data between different PowerShell processes, in order to implement a kind of job system: a function runs in one PowerShell process, completes, and then somehow communicates its status to a function run from another (distinct) PowerShell process...
I guess what I'd ideally like is for PSJob results to be shareable between sessions, but this does not seem to be possible.
I can think of a few dirty ways of achieving this (like O/S environment variables), but am I missing a semi-elegant way?
For example:
Function giveMeNumber
{
    $return_value = Get-Random -Minimum -100 -Maximum 100
    Return $return_value
}
What are some ways I could get this function to store its return value somewhere and then grab it from another PowerShell session (without using a database)?
Cheers.
The Q&A mentioned by Keith refers to using MSMQ, a message queuing feature optionally available on Microsoft desktop, mobile and server OSes.
It doesn't run by default on desktop OSes, so you would have to ensure that the appropriate service is started. Seems like serious overkill to me unless you want something pretty beefy.
Of course, the most common choice for this type of task would be a simple shared file.
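As a sketch (the path and the choice of Export-Clixml/Import-Clixml are mine, nothing special): one session could dump the result of the question's giveMeNumber function to a file, and a second session could read it back later:

# One session serializes the result to a shared path...
$resultPath = Join-Path $env:TEMP 'giveMeNumber.result.xml'
giveMeNumber | Export-Clixml -Path $resultPath

# ...and another session deserializes it.
$value = Import-Clixml -Path $resultPath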
Alternatively, you could create a TCP listener in each of the jobs that you want to have accept external info. I haven't done this myself in PowerShell, though I know it is possible; Node.js or Python would be a more familiar environment for it. Seems like overkill if a shared file would do the job!
Another way would be to use the registry. Though you might consider that cheating since it is actually a database (of a very broken and simplistic sort).
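A rough sketch of that option (the key name below is made up): one session parks the value under HKCU and another reads it back:

# Writer session: create a key and store the function's result.
New-Item -Path HKCU:\Software\MyJobResults -Force | Out-Null
Set-ItemProperty -Path HKCU:\Software\MyJobResults -Name giveMeNumber -Value (giveMeNumber)

# Reader session: pick the value up later.
(Get-ItemProperty -Path HKCU:\Software\MyJobResults -Name giveMeNumber).giveMeNumber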
I'm actually not sure that environment variables would work, since they can be picky about scope (for example, setting an environment variable in a cmd session doesn't make it available outside of that session by default).
UPDATE: Doh, missed a few! Some of them very obvious. Microsoft has a list:
Clipboard
COM
Data Copy
DDE
File Mapping
Mailslots
Pipes
RPC
Windows Sockets
Pipes was the one I was trying to remember. Windows sockets would be similar to a TCP listener.
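A minimal named-pipe sketch using the .NET System.IO.Pipes classes from PowerShell (the pipe name is made up) - run the first half in one session and the second half in another:

# Server session: wait for a client and read one line.
$server = New-Object System.IO.Pipes.NamedPipeServerStream -ArgumentList 'MyJobStatusPipe'
$server.WaitForConnection()
$reader = New-Object System.IO.StreamReader -ArgumentList $server
$reader.ReadLine()      # receives the message sent below
$reader.Dispose()
$server.Dispose()

# Client session: connect and send a result across.
$client = New-Object System.IO.Pipes.NamedPipeClientStream -ArgumentList '.', 'MyJobStatusPipe'
$client.Connect()
$writer = New-Object System.IO.StreamWriter -ArgumentList $client
$writer.AutoFlush = $true
$writer.WriteLine('42')
$writer.Dispose()
$client.Dispose()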
I have the following questions:
How is global code executed and how are global variables initialized in Perl?
If I write use package_name; in multiple packages, does the global code execute each time?
Are global variables defined this way thread-safe?
Perl makes a complete copy of all code and variables for each thread. Communication between threads is via specially marked shared variables (which in fact are not shared - there is still a copy in each thread, but all the copies get updated). This is a significantly different threading model than many other languages have, so the thread-safety concerns are different - mostly centering around what happens when objects are copied to make a new thread and those objects have some form of resource to something outside the program (e.g. a database connection).
Your question about use isn't really related to threads, as far as I can tell. use does several things; one is loading the specified module and running any top-level code in it, and this happens only once per module, not once per use statement.
I created a PowerShell script which loops over a large number of XML Schema (.xsd) files, and for each creates a .NET XmlSchemaSet object, calls Add() and Compile() to add a schema to it, and prints out all validation errors.
This script works correctly, but there is a memory leak somewhere, causing it to consume gigabytes of memory if run on 100s of files.
What I essentially do in a loop is the following:
$schemaSet = New-Object -TypeName System.Xml.Schema.XmlSchemaSet
Register-ObjectEvent $schemaSet ValidationEventHandler -Action {
    ...write-host the event details...
}
$reader = [System.Xml.XmlReader]::Create($schemaFileName)
[void] $schemaSet.Add($null_for_dotnet_string, $reader)
$reader.Close()
$schemaSet.Compile()
(A full script to reproduce this problem can be found in this gist: https://gist.github.com/3002649. Just run it, and watch the memory usage increase in Task Manager or Process Explorer.)
Inspired by some blog posts, I tried adding
Remove-Variable reader, schemaSet
I also tried picking up the $schema from Add() and doing
[void] $schemaSet.RemoveRecursive($schema)
These seem to have some effect, but still there is a leak. I'm presuming that older instances of XmlSchemaSet are still using memory without being garbage collected.
The question: How do I properly teach the garbage collector that it can reclaim all memory used in the code above? Or more generally: how can I achieve my goal with a bounded amount of memory?
Microsoft has confirmed that this is a bug in PowerShell 2.0, and they state that this has been resolved in PowerShell 3.0.
The problem is that an event handler registered using Register-ObjectEvent is not garbage collected. In response to a support call, Microsoft said that
"we’re dealing with a bug in PowerShell v.2. The issue is caused actually by the fact that the .NET object instances are no longer released due to the event handlers not being released themselves. The issue is no longer reproducible with PowerShell v.3".
The best solution, as far as I can see, is to interface between PowerShell and .NET at a different level: do the validation completely in C# code (embedded in the PowerShell script), and just pass back a list of ValidationEventArgs objects. See the fixed reproduction script at https://gist.github.com/3697081: that script is functionally correct and leaks no memory.
(Thanks to Microsoft Support for helping me find this solution.)
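The gist above has the full fixed script; the sketch below only shows the shape of that approach (the SchemaChecker type and Validate method are names I made up for illustration): the C# side collects the ValidationEventArgs in a list and returns it, so no PowerShell event subscription is involved:

Add-Type -ReferencedAssemblies System.Xml -TypeDefinition @'
using System.Collections.Generic;
using System.Xml;
using System.Xml.Schema;

public static class SchemaChecker
{
    // Compiles a single schema and returns every validation event it raised.
    public static List<ValidationEventArgs> Validate(string schemaPath)
    {
        List<ValidationEventArgs> events = new List<ValidationEventArgs>();
        XmlSchemaSet schemaSet = new XmlSchemaSet();
        schemaSet.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e) { events.Add(e); };
        using (XmlReader reader = XmlReader.Create(schemaPath))
        {
            schemaSet.Add(null, reader);
        }
        schemaSet.Compile();
        return events;
    }
}
'@

Get-ChildItem -Path .\schemas -Filter *.xsd | ForEach-Object {
    foreach ($e in [SchemaChecker]::Validate($_.FullName)) {
        Write-Host "$($_.Name): $($e.Severity) $($e.Message)"
    }
}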
Initially Microsoft offered another workaround, which is to use $xyzzy = Register-ObjectEvent -SourceIdentifier XYZZY, and then at the end do the following:
Unregister-Event XYZZY
Remove-Job $xyzzy -Force
However, this workaround is functionally incorrect. Any events that are still 'in flight' are lost at the time these two additional statements are executed. In my case, that means that I miss validation errors, so the output of my script is incomplete.
After the Remove-Variable you can try to force a GC collection:
[GC]::Collect()
My first real-world Python project is to write a simple framework (or reuse/adapt an existing one) which can wrap small Python scripts (used to gather custom data for a monitoring tool) in a "container" that handles boilerplate tasks like:
fetching a script's configuration from a file (keeping that info up to date if the file changes, and handling decryption of sensitive config data)
running multiple instances of the same script in different threads instead of spinning up a new process for each one
exposing an API for caching expensive data and storing persistent state from one script invocation to the next
Today, script authors must handle the issues above, which usually means that most script authors don't handle them correctly, causing bugs and performance problems. In addition to avoiding bugs, we want a solution which lowers the bar to create and maintain scripts, especially given that many script authors may not be trained programmers.
Below are examples of the API I've been thinking of, and which I'm looking to get your feedback about.
A scripter would need to build a single method which takes (as input) the configuration that the script needs to do its job, and either returns a Python object or calls a method to stream back data in chunks. Optionally, a scripter could supply methods to handle startup and/or shutdown tasks.
HTTP-fetching script example (in pseudocode, omitting the actual data-fetching details to focus on the container's API):
def run(config, context, cache):
    results = http_library_call(config.url, config.http_method, config.username, config.password, ...)
    return {'html': results.html, 'status_code': results.status, 'headers': results.response_headers}
def init(config, context, cache):
    config.max_threads = 20                 # up to 20 URLs at one time (per process)
    config.max_processes = 3                # launch up to 3 concurrent processes
    config.keepalive = 1200                 # keep the process alive for 20 minutes without another call
    config.process_recycle.requests = 1000  # restart the process every 1000 requests (to avoid leaks)
    config.kill_timeout = 600               # kill the process if any call lasts longer than 10 minutes
A database-data fetching script might look like this (in pseudocode):
def run(config, context, cache):
    expensive = cache["something_expensive"]
    for record in db_library_call(expensive, context.checkpoint, config.connection_string):
        context.log(record, "logDate")  # log all properties, optionally specify name of timestamp property
        last_date = record["logDate"]
    context.checkpoint = last_date      # persistent checkpoint, used next time through

def init(config, context, cache):
    cache["something_expensive"] = get_expensive_thing()

def shutdown(config, context, cache):
    expensive = cache["something_expensive"]
    expensive.release_me()
Is this API appropriately "pythonic", or are there things I should do to make this more natural to the Python scripter? (I'm more familiar with building C++/C#/Java APIs so I suspect I'm missing useful Python idioms.)
Specific questions:
is it natural to pass a "config" object into a method and ask the callee to set various configuration options? Or is there another preferred way to do this?
when a callee needs to stream data back to its caller, is a method like context.log() (see above) appropriate, or should I be using yield instead? (yield seems natural, but I worry it'd be over the head of most scripters)
My approach requires scripts to define functions with predefined names (e.g. "run", "init", "shutdown"). Is this a good way to do it? If not, what other mechanism would be more natural?
I'm passing the same config, context, cache parameters into every method. Would it be better to use a single "context" parameter instead? Would it be better to use global variables instead?
Finally, are there existing libraries you'd recommend to make this kind of simple "script-running container" easier to write?
Have a look at SQLAlchemy for dealing with database stuff in Python. Also, to make script writing easier when dealing with concurrency, look into Stackless Python.