The purpose of a hash is to take a unique blob and create a short unique identifier for it. Essentially, we want to avoid collisions:
hash('test1') => 4r3oi34
hash('test2') => xkfo302
What would the opposite be - namely trying to achieve collision? Is there a name for mapping things in that way? Is it still considered a hash?
mapthingy('test1') => 12342
mapthingy('test2') => 12342
P.S. I tried asking the question at softwareengineering exchange but apparently I'm banned from asking questions.
What I want is roughly equivalent to
df.where(<condition>).count() != 0
But I'm pretty sure it's not quite smart enough to stop once it finds any such violation. I would expect some sort of aggregator to be able to do this, but I haven't found one. I could do it with a max and some sort of conversion, but again I don't think it would necessarily know to quit (since max is not specific to bool, I'm not sure it understands that no value is larger than true).
More specifically, I want to check whether a column contains only a single distinct value. Right now my best idea is to grab the first value and compare everything else against it.
I would try this option, it should be much faster:
df.where(<condition>).head(1).isEmpty
You can also try to define your conditions on a row together with Scala's exists (which stops at the first occurrence of true):
df.mapPartitions(rows => if(rows.exists(row => <condition>)) Iterator(1) else Iterator.empty).isEmpty
In the end, you should benchmark the alternatives.
More specifically can a patricia trie:
http://search.cpan.org/~plonka/Net-Patricia-1.014/Patricia.pm
Contain duplicate IP addresses (duplicate key IP with different values)
and if so, how does it handle returning the values for that key?
ie:
123.456.789.0-> {
'value' => 'hai'
}
123.456.789.0->{
'value' => 'hey there'
}
patricia-trie->match_string(123.456.789.0)
returns ???
Edit:
Yes, I know I can implement my own that supports this behavior. I am asking how this specific trie implementation handles it. The documentation is extremely limited and manual testing appears to show overwrites, but I was hoping someone has a definitive answer.
A trie (or Patricia tree) is generally a map between keys and values. To make a map into a multimap, use a linked list or array as the value.
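The multimap idea above can be sketched as follows. This is only an illustration: a plain Perl hash stands in for the trie (both map one key to one value), add_value is a hypothetical helper, and the key is a made-up documentation prefix. With Net::Patricia, the same array reference could presumably be stored as the node's user data, but that is an assumption and is not shown or tested here.

```perl
use strict;
use warnings;

# A plain hash stands in for the trie: one key, one value.
# Making that value an array reference turns the map into a multimap.
my %multimap;

# Hypothetical helper: append to the key's list instead of overwriting.
sub add_value {
    my ($map, $key, $value) = @_;
    push @{ $map->{$key} }, $value;    # autovivifies the array ref on first use
    return;
}

add_value(\%multimap, '192.0.2.0/24', 'hai');
add_value(\%multimap, '192.0.2.0/24', 'hey there');

# A lookup now yields every value stored under the key, in insertion order.
my @values = @{ $multimap{'192.0.2.0/24'} || [] };
print join(', ', @values), "\n";
```

Whether the caller wants all values, the first, or the last then becomes an explicit decision in the calling code rather than a side effect of overwriting.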
Cross-posted from perlmonks:
I have to clean up some gross, ancient code at $work,
and before I try to make a new module I'd love to use an existing one if anyone knows of something appropriate.
At runtime I am parsing a file to determine what processing I need to do on a set of data.
If I were to write a module I would try to do it more generically (non-DBI-specific), but my exact use case is this:
I read a SQL file to determine the query to run against the database.
I parse comments at the top and determine that
column A needs to have a s/// applied,
column B needs to be transformed to look like a date of given format,
column C gets a sort of tr///.
Additionally things can be chained so that column D might s///, then say if it isn't 1 or 2, set it to 3.
So when fetching from the db the program applies the various (possibly stacked) transformations before returning the data.
Currently the code is a disgustingly large and difficult series of if clauses
processing hideously difficult to read or maintain arrays of instructions.
So what I'm imagining is perhaps an object that will parse those lines
(and additionally expose a functional interface),
stack up the list of processors to apply,
then be able to execute it on a passed piece of data.
Optionally there could be a name/category option,
so that one object could be used dynamically to stack processors only for the given name/category/column.
A traditionally contrived example:
$obj = $module->new();
$obj->parse("-- greeting:gsub: /hi/hello"); # don't say "hi"
$obj->parse('-- numbers:gsub: /\D//'); # digits only (single quotes keep the backslash in \D)
$obj->parse("-- numbers:exchange: 1,2,3 one,two,three"); # then spell out the numbers
$obj->parse("-- when:date: %Y-%m-%d 08:00:00"); # format like a date, force to 8am
$obj->stack(action => 'gsub', name => 'when', format => '/1995/1996/'); # my company does not recognize the year 1995.
$cleaned = $obj->apply({greeting => "good morning", numbers => "t2", when => "2010116"});
Each processor (gsub, date, exchange) would be a separate subroutine.
Plugins could be defined to add more by name.
$obj->define("chew", \&CookieMonster::chew);
$obj->parse("column:chew: 3x"); # chew the column 3 times
So the obvious first question is, does anybody know of a module out there that I could use?
About the only thing I was able to find so far is Hash::Transform,
but since I would be determining which processing to do dynamically at runtime
I would always end up using the "complex" option and I'd still have to build the parser/stacker.
Is anybody aware of any similar modules or even a mildly related module that I might want to utilize/wrap?
If there's nothing generic out there for public consumption (surely mine is not the only one in the darkpan),
does anybody have any advice for things to keep in mind or interface suggestions or even other possible uses
besides munging the return of data from DBI, Text::CSV, etc?
If I end up writing a new module, does anybody have namespace suggestions?
I think something under Data:: is probably appropriate...
the word "pluggable" keeps coming to mind because my use case reminds me of PAM,
but I really don't have any good ideas...
Data::Processor::Pluggable ?
Data::Munging::Configurable ?
I::Chew::Data ?
First, I'd try to place as much of the formatting as possible in the SQL queries.
Things like date format etc. definitely should be handled in SQL.
Off the top of my head, a module I know that could be used for your purpose is Data::FormValidator. Although it is mainly aimed at validating CGI parameters, it has the functionality you need: you can define filters and constraints and chain them in various ways. That doesn't mean there are no other modules for your purpose; I just don't know of any.
Or you can do something like what you already hinted at. You could define some sort of command classes and chain them on the various data inputs. I'd do something along these lines:
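package MyDataProcessor;
use Moose;

has 'Transformations' => (
    traits  => ['Array'],
    is      => 'rw',
    isa     => 'ArrayRef[MyTransformer]',
    default => sub { [] },    # 'push' needs an existing array ref
    handles => {
        add_transformer  => 'push',
        all_transformers => 'elements',
    },
);
has 'input' => (is => 'rw', isa => 'Str');

# Run every transformer in order, feeding each one the previous result.
sub apply_transforms {
    my ($self) = @_;
    my $data = $self->input;
    $data = $_->transform($data) for $self->all_transformers;
    return $data;
}

package MyTransformer;
use Moose;

# Base class: subclasses override transform() and return the new value.
sub transform { my ($self, $data) = @_; return $data }

package MyRegexTransformer;
use Moose;
extends 'MyTransformer';

has 'Regex'       => (is => 'rw', isa => 'Str');
has 'Replacement' => (is => 'rw', isa => 'Str');

sub transform {
    my ($self, $data) = @_;
    my ($regex, $replacement) = ($self->Regex, $self->Replacement);
    $data =~ s/$regex/$replacement/g;
    return $data;
}

# some other transformers

package main;

# somewhere else
my $processor = MyDataProcessor->new(input => 'Hello transform me');
my $tr = MyRegexTransformer->new(Regex => 'Hello', Replacement => 'Hi');
$processor->add_transformer($tr);
#...
my $result = $processor->apply_transforms;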
I'm not aware of any data transform CPAN modules, so I've had to roll my own for work. It was significantly more complicated than this, but operated under a similar principle; it was basically a poor man's implementation of Informatica-style ETL sans the fancy GUI... the configuration was Perl hashes (Perl instead of XML since it allowed me to implement certain complex rules as subroutine references).
As for the namespace, I'd go for Data::Transform::*
Thanks to everyone for their thoughts.
The short version:
After trying to adapt a few existing modules I ended up abstracting my own: Sub::Chain.
It needs some work, but is doing what I need so far.
The long version:
(an excerpt from the POD)
=head1 RATIONALE
This module started out as Data::Transform::Named,
a named wrapper (like Sub::Chain::Named) around
Data::Transform (and specifically Data::Transform::Map).
As the module was nearly finished I realized I was using very little
of Data::Transform (and its documentation suggested that
I probably wouldn't want to use the only part that I was using).
I also found that the output was not always what I expected.
I decided that it seemed reasonable according to the likely purpose
of Data::Transform, and this module simply needed to be different.
So I attempted to think more abstractly
and realized that the essence of the module was not tied to
data transformation, but merely the succession of simple subroutine calls.
I then found and considered Sub::Pipeline
but needed to be able to use the same
named subroutine with different arguments in a single chain,
so it seemed easier to me to stick with the code I had written
and just rename it and abstract it a bit further.
I also looked into Rule::Engine which was beginning development
at the time I was searching.
However, like Data::Transform, it seemed more complex than what I needed.
When I saw that Rule::Engine was using [the very excellent] Moose
I decided to pass since I was doing work on a number of very old machines
with old distros and old perls and constrained resources.
Again, it just seemed to be much more than what I was looking for.
=cut
As for the "parse" method in my original idea/example,
I haven't found that to be necessary, and am currently using syntax like
$chain->append($sub, \@arguments, \%options)
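To illustrate the idea of "a succession of simple subroutine calls", here is a minimal sketch in plain Perl. It is not Sub::Chain's actual implementation or API: gsub, exchange and run_chain are hypothetical helpers named after the processors in the original question. It does show the requirement mentioned in the rationale, namely using the same subroutine twice in one chain with different arguments.

```perl
use strict;
use warnings;

# Hypothetical processors, named after the ones in the question.
sub gsub {
    my ($value, $pattern, $replacement) = @_;
    $value =~ s/$pattern/$replacement/g;
    return $value;
}

sub exchange {
    my ($value, $map) = @_;
    return exists $map->{$value} ? $map->{$value} : $value;
}

# A chain is an ordered list of [ code ref, extra arguments ] pairs.
my @chain = (
    [ \&gsub,     [ qr/\D/,  '' ] ],    # digits only
    [ \&gsub,     [ qr/^0+/, '' ] ],    # same sub again, different arguments
    [ \&exchange, [ { 1 => 'one', 2 => 'two', 3 => 'three' } ] ],
);

# Feed the value through each link, passing the result on to the next one.
sub run_chain {
    my ($value, @links) = @_;
    for my $link (@links) {
        my ($code, $args) = @$link;
        $value = $code->($value, @$args);
    }
    return $value;
}

my $result = run_chain('t02', @chain);
print "$result\n";    # prints "two"
```

Keeping each link as a code reference plus its own argument list is what makes repeated use of one subroutine straightforward.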
I would like to be able to store objects in a hash structure so I can work with the name of the object as a variable.
Could someone help me make a
sub new{
...
}
routine that creates a new object as member of a hash? I am not exactly sure how to go about doing this or how to refer to and/or use the object when it is stored like this. I just want to be able to use and refer to the name of the object for other subroutines.
See my comment in How can I get name of an object in Perl? for why I want to do this.
Thank you
Objects don't really have names. Why are you trying to give them names? One of the fundamental points of references is that you don't need to know a name, or even what class it is, to work with it.
There's probably a much better way to achieve your task.
However, since objects are just references, and references are just scalars, the object can be a hash value:
my %hash = (
some_name => Class->new( ... ),
other_name => Class->new( ... ),
);
You might want to check out a book such as Intermediate Perl to learn how references and objects work.
Don't quite understand what you are trying to do. Perhaps you can provide some concrete examples?
You can store objects into hashes just like any other variable in perl.
my %hash = ( );
$hash{'foo'} = Foo->new(...);
$hash{'bar'} = Bar->new(...);
Assuming you know the object stored at 'foo' is a Foo object and the one at 'bar' is a Bar object, you can retrieve the objects from the hash and use them.
$hash{'foo'}->foo_method();
$hash{'bar'}->bar_method();
You may want to programmatically determine this behavior at run time. That's assuming that you are sticking with this naming scheme.
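One way to make that run-time decision is sketched below, under assumptions: Foo and Bar are stand-in classes, and call_if_able is a hypothetical helper. It uses the built-in ref to find an object's class and the universal can method to check whether a method exists before calling it.

```perl
use strict;
use warnings;

# Stand-in classes for the Foo and Bar objects discussed above.
package Foo {
    sub new        { return bless {}, shift }
    sub foo_method { return 'foo called' }
}
package Bar {
    sub new        { return bless {}, shift }
    sub bar_method { return 'bar called' }
}

package main;

my %hash = (
    foo => Foo->new,
    bar => Bar->new,
);

# Hypothetical helper: call a method by name only if the object supports it.
sub call_if_able {
    my ($name, $method) = @_;
    my $object = $hash{$name} or return;
    return unless $object->can($method);
    return $object->$method();
}

my $class  = ref $hash{foo};                       # class name of the stored object
my $called = call_if_able('foo', 'foo_method');    # method exists, so it runs
my $missed = call_if_able('foo', 'bar_method');    # undef: Foo has no bar_method
print "$class: $called\n";
```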
Can someone suggest a good module in perl which can be used to store collection of objects?
Or is ARRAY a good enough substitute for most of the needs?
Update:
I am looking for a collections class because I want to be able to do an operation like compute collection level property from each element.
Since I need to perform many such operations, I might as well write a class which can be extended by individual objects. This class will obviously work with arrays (or maybe hashes).
There are collection modules for more complex structures, but it is common style in Perl to use Arrays for arrays, stacks and lists. Perl has built in functions for using the array as a stack or list : push/pop, shift/unshift, splice (inserting or removing in the middle) and the foreach form for iteration.
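As a quick illustration of those built-ins, here is a small sketch using made-up sample values:

```perl
use strict;
use warnings;

my @items = ('b', 'c');

push    @items, 'd';                  # append to the end:     b c d
unshift @items, 'a';                  # prepend to the front:  a b c d
my $last  = pop   @items;             # take from the end:     'd', leaving a b c
my $first = shift @items;             # take from the front:   'a', leaving b c

splice @items, 1, 0, 'x';             # insert at index 1:     b x c
my @removed = splice @items, 0, 1;    # remove one element at index 0: leaves x c

# Iterate over whatever is left.
foreach my $item (@items) {
    print "$item\n";
}
```

Using push with pop gives stack behaviour, while push with shift gives a queue.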
Perl also has a map, called a hashmap which is the equivalent to a Dictionary in Python - allowing you to have an association between a single key and a single value.
Perl developers often compose these two data structures to build what they need. Need multiple values? Store array references in the value part of the hashtable (Map). Trees can be built in a similar manner: if you need unique keys, use multiple levels of hashmaps; if you don't, use nested array references.
These two primitive collection types in Perl don't have an Object Oriented api, but they still are collections.
If you look on CPAN you'll likely find modules that provide other Object Oriented data structures, it really depends on your need. Is there a particular data structure you need besides a List, Stack or Map? You might get a more precise answer (eg a specific module) if you're asking about a particular data structure.
Forgot to mention, if you're looking for small code examples across a variety of languages, PLEAC (Programming Language Examples Alike Cookbook) is a decent resource.
I would second Michael Carman's comment: please do not use the term "Hashmap" or "map" when you mean a hash or associative array. Especially when Perl has a map function; that just confuses things.
Having said that, Kyle Burton's response is fundamentally sound: either a hash or an array, or a complex structure composed of a mixture of the two, is usually enough. Perl groks OO, but doesn't enforce it; chances are that a loosely-defined data structure may be good enough for what you need.
Failing that, please define more exactly what you mean by "compute collection level property from each element". And bear in mind that Perl has keywords like map and grep that let you do functional programming things such as:
my $record = get_complex_structure();
# $record = {
# 'widgets' => {
# name => 'ACME Widgets',
# skus => [ 'WIDG01', 'WIDG02', 'WIDG03' ],
# sales => {
# WIDG01 => { num => 25, value => 105.24 },
# WIDG02 => { num => 10, value => 80.02 },
# WIDG03 => { num => 8, value => 205.80 },
# },
# },
# ### and so on for 'grommets', 'nuts', 'bolts' etc.
# }
my @standouts =
    map { $_->[0] }
    sort {
        $b->[2] <=> $a->[2]
        || $b->[1] <=> $a->[1]
        || $record->{$a->[0]}->{name} cmp $record->{$b->[0]}->{name}
    }
    map {
        my ($num, $value);
        for my $sku (@{$record->{$_}{skus}}) {
            $num += $record->{$_}{sales}{$sku}{num};
            $value += $record->{$_}{sales}{$sku}{value};
        }
        [ $_, $num, $value ];
    }
    keys %$record;
Reading from back to front, this particular Schwartzian transform does three things:
3) It takes a key to $record, goes through the SKUs defined in this arbitrary structure, and works out the aggregate number and total value of transactions. It returns an anonymous array containing the key, the number of transactions and the total value.
2) The next block takes in a number of arrayrefs and sorts them a) first of all by comparing the total value, numerically, in descending order; b) if the values are equal, by comparing the number of transactions, numerically, in descending order; and c) if those are equal too, by sorting asciibetically on the name associated with this order.
1) Finally, we take the key to $record from the sorted data structure, and return that.
It may well be that you don't need to set up a separate class to do what you want.
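For simpler "collection level" properties, a plain array plus the core List::Util module may already be enough. The sketch below is an assumption about what such a property could look like (totals, a maximum, a filtered list); the records reuse the SKU figures from the example structure above, flattened into an array of hashes for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum max);

# Sample records; in practice these could be objects with accessor methods.
my @widgets = (
    { sku => 'WIDG01', num => 25, value => 105.24 },
    { sku => 'WIDG02', num => 10, value => 80.02  },
    { sku => 'WIDG03', num => 8,  value => 205.80 },
);

# Properties of the collection as a whole, derived from each element.
my $total_units = sum map { $_->{num} }   @widgets;    # 25 + 10 + 8
my $best_value  = max map { $_->{value} } @widgets;    # largest single value
my @big_sellers = map  { $_->{sku} }
                  grep { $_->{num} >= 10 } @widgets;   # SKUs with at least 10 units

print "units: $total_units, best: $best_value, big sellers: @big_sellers\n";
```

If many such calculations accumulate, they can later be wrapped in a class without changing how the underlying array is stored.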
I would normally use an @array or a %hash.
What features are you looking for that aren't provided by those?
Base your decision on how you need to access the objects. If pushing them onto an array, indexing into, popping/shifting them off works, then use an array. Otherwise hash them by some key or organize them into a tree of objects that meets your needs. A hash of objects is a very simple, powerful, and highly-optimized way of doing things in Perl.
Since Perl arrays can easily be appended to, resized, sorted, etc., they are good enough for most "collection" needs. In cases where you need something more advanced, a hash will generally do. I wouldn't recommend that you go looking for a collection module until you actually need it.
Either an array or a hash can store a collection of objects. A class might be better if you want to work with the class in certain ways but you'd have to tell us what those ways are before we could make any good recommendations.
I would stick with an ARRAY or a HASH.
my @names = ('Paul','Michael','Jessica','Megan');
and
my %petsounds = ("cat" => "meow",
"dog" => "woof",
"snake" => "hiss");
It depends a lot; there are sparse matrix modules, some forms of persistence, a new style of OO, etc.
Most people just read the perldata, perllol and perldsc man pages to answer their specific question about a data structure.