Garbage collection in Perl

Unlike Java, Perl uses reference counting for garbage collection. I have searched previous questions that discuss C++ RAII, smart pointers, and Java GC, but I still do not understand how Perl deals with the circular reference problem.
Can anyone explain how Perl's garbage collector deals with circular references? Is there any way to reclaim circularly referenced memory that is no longer used by the program, or does Perl just ignore this problem altogether?

According to my copy of Programming Perl 3rd ed., on exit Perl 5 does an "expensive mark and sweep" to reclaim circular references. You'll want to avoid circular references as much as possible because otherwise they won't be reclaimed until the program exits.
Perl 5 does offer weak references through the Scalar::Util module.
Perl 6 will move to a pluggable garbage collected scheme (well, the underlying VM will have multiple garbage collection options and the behavior of those options can have an effect on Perl). That is, you'll be able to choose between various garbage collectors, or implement your own. Want a copying collector? Sure. Want a coloring collector? You got it. Mark/sweep, compacting, etc? Why not?

The quick answer is that Perl 5 does not handle circular references automatically. Unless you take explicit measures in your code, any of your data structures which include circular references will not be reclaimed until the thread that created them dies. This is considered to be an acceptable tradeoff in that it avoids the need for runtime garbage collection that would slow down execution.
If your code creates data structures with circular references (e.g. a tree whose nodes contain references back to the root), you will want to use the Scalar::Util module to "weaken" the references that point back toward the root node. These weak references will not add to the reference count of whatever they point to, so the entire data structure will be automatically deallocated when the last external reference vanishes.
Example:
use Scalar::Util qw(weaken);
...
my $new_node = { content => $content, root => $root_node };
weaken $new_node->{root};
push @{$root_node->{children}}, $new_node;
If you use code like this whenever you add new nodes to your data structure, then the only references to the root that are actually counted are those from outside of the structure. This is exactly what you want. Then the root, and recursively all of its children, will be reclaimed as soon as the last external reference to it vanishes.
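Here is a minimal, self-contained sketch of the tree idea described above. The Node class, its DESTROY counter, and the add_child helper are hypothetical names made up for illustration; only weaken and isweak come from Scalar::Util.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken isweak);

# Hypothetical class that counts destructor calls so reclamation is observable.
package Node;
our $destroyed = 0;
sub new     { my ($class, %args) = @_; return bless { children => [], %args }, $class }
sub DESTROY { $destroyed++ }

package main;

# Hypothetical helper: attach a child whose back-reference to the root is weak.
sub add_child {
    my ($root, $content) = @_;
    my $node = Node->new(content => $content, root => $root);
    weaken($node->{root});              # back-reference no longer counts
    push @{ $root->{children} }, $node;
    return $node;
}

{
    my $root = Node->new(content => 'root');
    add_child($root, "child $_") for 1 .. 3;
    print "back-reference is weak\n" if isweak($root->{children}[0]{root});
}   # the last external reference to the root vanishes here

print "nodes destroyed: $Node::destroyed\n";
```

Because the only counted reference to the root is the lexical $root, the root and its three children should all be destroyed when the block ends, well before the program exits.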

Have a look at Proxy Objects.

Perl applies an alternate mark-and-sweep GC on some occasions (when a thread dies, I think) in order to reclaim circular references. Note that the "every value is a string" Perl stance makes it difficult to create true circular references; it is feasible, but "normal" Perl code does not do it, which is why reference counting works well with Perl.

Related

Querying Python runtime for all objects in existence

I'm working on a C++ Python wrapper that attempts to encapsulate the awkwardness of reference counting, retaining, and releasing.
It has a set of unit tests.
However I want to ensure that after each test, everything has been cleared away properly. i.e. every object created during that test has had its reference count taken down to 0, and has consequently been removed.
Is there any way of querying the Python runtime for this information?
If I could just get the number of objects being stored, that would do. I could then make sure it doesn't change between tests.
EDIT: I believe it is possible to compile Python with a special flag producing a binary that has functions for monitoring reference counting. But this is as much as I know. Maybe more...
That depends on which implementation you use. I'm assuming you're using CPython. Since you're fiddling with the reference counting mechanism, I will further assume that using the garbage collector to find the remaining objects won't be sufficiently reliable for your purpose. (Otherwise, see here.)
The build flag you were thinking about is this one:
It is best to define these options in the EXTRA_CFLAGS make variable:
make EXTRA_CFLAGS="-DPy_REF_DEBUG".
Py_REF_DEBUG introduced in 1.4
named REF_DEBUG before 1.4
Turn on aggregate reference counting. This arranges that extern
_Py_RefTotal hold a count of all references, the sum of ob_refcnt across
all objects. [..]
Special gimmicks:
sys.gettotalrefcount()
Return current total of all refcounts.
(Source: Python git, SpecialBuilds.txt, Debugging builds from the C API reference.)
If you need a list of all pointers to live objects, use Py_TRACE_REFS, directly below that one in the SpecialBuilds file.

Should I prefer hashes or hashrefs in Perl?

I'm still learning perl.
For me it feels more "natural" to use references to hashes rather than to access them directly, because it is easier to pass references to a sub (one variable can be passed instead of a list). Generally I prefer this approach to the one where one directly accesses %hashes.
The question is: where (in what situations) is it better to use plain %hashes, so
$hash{key} = $value;
instead of
$href->{key} = $value
Is there any speed difference, or any other "thing", that favours %hashes over $hashrefs? Or is it only a matter of personal taste and TIMTOWTDI? Are there some examples where it is better to use a %hash?
I think this kind of question is very legitimate: Programming languages such as Perl or C++ have come a long way and accumulated a lot of historical baggage, but people typically learn them from ahistorical, synchronous exposés. Hence they keep wondering why TIMTOWTDI, and WTF all these choices, and what is better and what should be preferred.
So, before version 5, Perl didn't have references. It only had value types. References are an add-on to Perl 4, enabling lots more stuff to be written. Value types had to be retained, of course, to keep backward compatibility; and also, for simplicity's sake, because frequently you don't need the indirection that references are.
To answer your question:
Don't waste time thinking about the speed of Perl hash lists. They're fast. They're memory access. Accessing a database or the filesystem or the net, that is where your program will typically spend time.
In theory, a dereference operation should take a tiny bit of time, so tiny it shouldn't matter.
If you're curious, then benchmark. Don't draw too many conclusions from differences you might see. Things could look different on another release.
So there is no speed reason to favour references over value types or vice versa.
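For anyone who does want to measure it, here is a rough sketch using the core Benchmark module. The sub names and the data size are arbitrary choices for illustration, and the numbers it prints will vary by machine and Perl release, so treat them as a curiosity rather than a verdict.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my %hash = map { $_ => $_ * 2 } 1 .. 1000;
my $href = \%hash;    # same data, reached through a reference

# Hypothetical workloads: sum all values, once directly and once via the reference.
sub sum_direct { my $sum = 0; $sum += $hash{$_}   for 1 .. 1000; return $sum }
sub sum_deref  { my $sum = 0; $sum += $href->{$_} for 1 .. 1000; return $sum }

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, {
    direct => \&sum_direct,
    deref  => \&sum_deref,
});
```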
Is there any other reason? I'd say it's a question of style and taste. Personally, I prefer the syntax without -> accessors.
If you can use a plain hash to describe your data, use a plain hash. However, when your data structure gets a bit more complex, you will need to use references.
Imagine a program where I'm storing information about inventory items, and how many I have in stock. A simple hash works quite well:
$item{XP232} = 324;
$item{BV348} = 145;
$item{ZZ310} = 485;
If all you're doing is creating quick programs that can read a file and store simple information for a report, there's no need to use references at all.
However, when things get more complex, you need references. For example, my program isn't just tracking my stock, I'm tracking all aspects of my inventory. Inventory items also have names, the company that creates them, etc. In this case, I'll want to have my hashes not pointing to a single data point (the number of items I have in stock), but a reference to a hash:
$item{XP232}->{DESCRIPTION} = "Blue Widget";
$item{XP232}->{IN_STOCK} = 324;
$item{XP232}->{MANUFACTURER} = "The Great American Widget Company";
$item{BV348}->{DESCRIPTION} = "A Small Purple Whatzit";
$item{BV348}->{IN_STOCK} = 145;
$item{BV348}->{MANUFACTURER} = "Acme Whatzit Company";
You can do all sorts of wacky things to do something like this (like have separate hashes for each field or put all fields in a single value separated by colons), but it's simply easier to use references to store these more complex structures.
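As a small sketch of how the nested structure above might be used, the following builds the same kind of inventory with anonymous hashes and passes it to a sub as a single reference. The total_stock helper is a hypothetical name, and only two of the fields are filled in to keep it short.

```perl
use strict;
use warnings;

my %item = (
    XP232 => { DESCRIPTION => 'Blue Widget',            IN_STOCK => 324 },
    BV348 => { DESCRIPTION => 'A Small Purple Whatzit', IN_STOCK => 145 },
);

# Hypothetical helper: takes one hash reference instead of a flattened list.
sub total_stock {
    my ($items) = @_;
    my $total = 0;
    $total += $_->{IN_STOCK} for values %$items;
    return $total;
}

# Print one report line per item, using a hash slice on the inner hash.
for my $id (sort keys %item) {
    printf "%-6s %-24s %4d\n", $id, @{ $item{$id} }{qw(DESCRIPTION IN_STOCK)};
}
print "Total in stock: ", total_stock(\%item), "\n";
```

The outer container is still a plain %item hash; references only appear where the data actually nests or where it is handed to a sub.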
For me the main reason to prefer $hashrefs to %hashes is the ability to give them meaningful names (a related idea would be to name the references to an anonymous hash), which can help you separate data structures from program logic and make things easier to read and maintain.
If you end up with multiple levels of references (refs to refs?!) you start to lose this clean and readable advantage, though. As well, for short programs or modules, or at earlier stages of development where you are retesting things as you go, directly accessing the %hash can make things easier for simple debugging (print statements and the like) and help you avoid accidental "action at a distance" issues, so you can focus on "iterating" through your design, using references where appropriate.
In general though I think this is a great question because TIMTOWTDI and TIMTOCWTDI where C = "correct". Thanks for asking it and thanks for the answers.

recursive reference in Perl

$a=\$a;
The book I'm reading says that in this case $a will NEVER be freed. My question is: why doesn't the Perl interpreter fix it at compile time? When it finds that a variable is pointing at itself, it could simply not increase the refcount.
Why doesn't Perl do it?
Some garbage collectors have cycle detection; Perl, for performance and historical reasons, does not. If you want a reference that doesn't affect the reference count, you can use Scalar::Util::weaken to obtain a weak reference, which removes the need for cycle detection in most situations where you would need to rely on it. There would need to be cycle-detection built into the interpreter to automatically detect whether \$a should be a weak or strong reference, so you just have to do it explicitly.
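To make the difference visible, here is a small sketch that contrasts a strong self-reference with a weakened one. It uses a blessed hash with a DESTROY counter instead of a bare $a = \$a so that destruction can be observed; the Tracked class is a hypothetical name for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical class that counts how many of its objects have been destroyed.
package Tracked;
our $destroyed = 0;
sub new     { return bless {}, shift }
sub DESTROY { $destroyed++ }

package main;

{
    my $strong = Tracked->new;
    $strong->{self} = $strong;      # cycle: the count never reaches zero
}
my $after_strong = $Tracked::destroyed;

{
    my $weak = Tracked->new;
    $weak->{self} = $weak;
    weaken($weak->{self});          # the self-reference no longer counts
}
my $after_weak = $Tracked::destroyed;

print "destroyed after strong cycle: $after_strong\n";
print "destroyed after weak cycle:   ", $after_weak - $after_strong, "\n";
```

The object with the strong cycle lingers until the interpreter shuts down, while the weakened one is destroyed as soon as its scope ends.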

MATLAB weak references to handle class objects

While thinking about the possibility of a handle class based ORM in MATLAB, the issue of caching instances came up. I could not immediately think of a way to make weak references or a weak map, though I'm guessing that something could be contrived with event listeners. Any ideas?
More Info
In MATLAB, a handle class (as opposed to a value class) has reference semantics. An example included with MATLAB is the containers.Map class. If you instantiate one and pass it to a function, any modifications the function makes to the object will be visible via the original reference. That is, it works like a Java or Python object reference.
Like Java and Python, MATLAB keeps track in one way or another of how many things are referencing each object of a handle class. When there aren't any more, MATLAB knows it is safe to delete the object.
A weak reference is one that refers to the object but does not count as a reference for purposes of garbage collection. So if the only remaining references to the object are weak, then it can be thrown away. Generally an event or callback can be supplied to the weak reference - when the object is thrown away, the weak references to it will be notified, allowing cleanup code to run.
For instance, a weak value map is like a normal map, except that the values (as opposed to the keys) are implemented as weak references. The weak map class can arrange a callback or event on each of these weak references so that when the referenced object is deleted, the key/value entry in the map is removed, keeping the map nice and tidy.
These special reference types are really a language-level feature, something you need the VM and GC to do. Trying to implement it in user code will likely end in tears, especially if you lean on undocumented behavior. (Sorry to be a stick in the mud.)
There's a couple ways you could do something similar. These are just ideas, not endorsements; I haven't actually done them.
Perhaps instead of caching Matlab object instances per se, you could cache expensive computational results using a real Java weak ref map in the JVM embedded inside Matlab. If you can convert your Matlab values to and from Java relatively quickly, this could be a win. If it's relatively flat numeric data, primitives like double[] or double[][] convert quickly using Matlab's implicit conversion.
Or you could make a regular LRU object cache in the Matlab level (maybe using a containers.Map keyed by hashcodes) that explicitly removes the objects inside it when new ones are added. Either use it directly, or add an onCleanup() behavior to your objects that has them automatically add a copy of themselves to a global "recently deleted objects" LRU cache of fixed size, keyed by an externally meaningful id, and mark the instances in the cache so your onCleanup() method doesn't try to re-add them when they're deleted due to expiration from the cache. Then you could have a factory method or other lookup method "resurrect" instances from the cache instead of constructing brand new ones the expensive way. This sounds like a lot of work, though, and really not idiomatic Matlab.
This is not an answer to your question but just my 2 cents.
A weak reference is a feature of a garbage collector. In Java and .NET the garbage collector is called when memory pressure is high and is therefore nondeterministic.
This MATLAB Digest post says that MATLAB does not use a (nondeterministic) garbage collector. In MATLAB, references are deleted from memory (deterministically) on each stack pop, i.e. on leaving each function.
Thus I do not think that weak references belong to the MATLAB reference handling concept. But MATLAB has always had tons of undocumented features, so I cannot exclude that it is buried somewhere.
In this SO post I asked about MATLAB garbage collector implementation and got no real answer. One MathWorks staff member, instead of answering my question, accused me of trying to construct a Python vs. MATLAB argument. Another MathWorks staff member wrote something that looked reasonable but was in substance a clever deception, a purposeful distraction from the problem I asked about. And the best answer has been:
if you ask this question then MATLAB
is not the right language for you!

How can I represent sets in Perl?

I would like to represent a set in Perl. What I usually do is use a hash with some dummy value, e.g.:
my %hash=();
$hash{"element1"}=1;
$hash{"element5"}=1;
Then use if (defined $hash{$element_name}) to decide whether an element is in the set.
Is this a common practice? Any suggestions on improving this?
Also, should I use defined or exists?
Thank you
Yes, building hash sets that way is a common idiom. Note that:
my @keys = qw/a b c d/;
my %hash;
@hash{@keys} = ();
is preferable to using 1 as the value because undef takes up significantly less space. This also forces you to use exists (which is the right choice anyway).
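A short sketch of why exists matters once the values are undef, plus a simple intersection built on the same idiom. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my @keys = qw/a b c d/;
my %set;
@set{@keys} = ();                              # every value is undef

my $via_exists  = exists($set{a})  ? 1 : 0;    # the key is present
my $via_defined = defined($set{a}) ? 1 : 0;    # the value is undef, so this says "no"

# Intersection with a second list: keep the elements that are in the set.
my @other  = qw/c d e/;
my @common = sort grep { exists $set{$_} } @other;

print "exists: $via_exists, defined: $via_defined, common: @common\n";
```

With undef values, defined reports every member as absent, so membership tests have to go through exists.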
Use one of the many Set modules on CPAN. Judging from your example, Set::Light or Set::Scalar seem appropriate.
I can defend this advice with the usual arguments pro CPAN (disregarding possible synergy effects).
How can we know that look-up is all that is needed, both now and in the future? Experience teaches that even the simplest programs expand and sprawl. Using a module would anticipate that.
An API is much nicer for maintenance, or for people who need to read and understand the code in general, than an ad-hoc implementation, as it allows one to think about partial problems at different levels of abstraction.
Related to that, if it turns out that the overhead is undesirable, it is easy to go from a module to a simple hash by removing indirections or paring down data structures and source code. On the other hand, if one needs more features, going the other way around is moderately more difficult.
CPAN modules are already tested and to some extent thoroughly debugged, and perhaps the API has also been improved over time, whereas with ad-hoc code, programmers usually implement the first design that comes to mind.
It rarely turns out that picking a module at the beginning is the wrong choice.
That's how I've always done it. I would tend to use exists rather than defined but they should both work in this context.