Version 1.3 of ais/ai-00293.txt

Unformatted version of ais/ai-00293.txt version 1.3
Other versions for file ais/ai-00293.txt

!standard 13.14 (00)          02-04-23 AI95-00293/00
!class amendment 02-04-23
!status No Action (9-0-1) 03-10-03
!status received 02-04-23
!priority Medium
!difficulty Medium
!subject Built-in hash function
!summary
!problem
!proposal
!wording
!example
!discussion
!ACATS Test
ACATS test(s) need to be created.
!appendix

From: Ted Baker
Sent: Friday, April 19, 2002  7:03 AM

A really useful package to have widely available would be an implementation of
a generic dynamic associative mapping, from some domain of key values to some
range of objects. I have in mind something like SNOBOL tables, but able to live
within Ada's type structure. It would need to be reasonably fast for sparse key
domains, and support reasonably fast updates. In other words, it would amount
to some form of generic dynamic hash tables.

I can't count the number of times I could have saved a lot of time
and effort if I could just have instantiated such a structure.

It can be used to implement a set (i.e., the subset set of the
domain for which the map is currently defined), or a function.

With sets and functions, you can program anything.

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  8:51 AM

Really the language feature that would be valuable here would be someting
like

   x'Hash (val)

which would yield an Integer hash value from the given value. This should
apply to all types.

****************************************************************

From: Tucker Taft
Sent: Friday, April 19, 2002  9:38 AM

I would think this would want to be related to the "=" operator,
implying that it would be analogous to stream attributes, defined
for all types, but only predefined for non-limited types.  And of course
it should be user-specifiable, like streams.

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  9:47 AM

No, I think to be useful it needs to be all types. If the restricted stream
model is acceptable, you can just hash the stream of the item.

****************************************************************

From: Tucker Taft
Sent: Friday, April 19, 2002  12:45 PM

I suppose for "inherently" limited types, <lim_type>'hash(x) could be
roughly equivalent to System.Address'hash(x'Address) since each object of
an inherently limited type is essentially unique.

For types that are not inherently limited, the basic requirement
for the predefined 'hash function would be that if x=y then
<non_lim_type>'hash(x) = <non_lim_type>'hash(y).

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  1:41 PM

I would think that the criterion should be that the hashes are equal if
the underlying representations are equal, i.e. that the hash should simply
go to the underlying type always.

****************************************************************

From: Florian Weimer
Sent: Friday, April 19, 2002  12:51 PM

The address of an object can change during its lifetime, can't it?

****************************************************************

From: Tucker Taft
Sent: Friday, April 19, 2002  2:58 PM

That's why I said "roughly equivalent" ;-).  The point is that each object
of an inherently limited type is unique, so they can each receive
a different hash value.  I suppose a bigger question is whether the hash
value of a given inherently-limited object must remain the same throughout
it's life.  That would be nice, I *think*.

The fundamental problem is that non-limited types are "value oriented"
whereas limited types are "identity" or "object" oriented.

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  3:08 PM

Absolutely, the hash must not change, otherwise it is useless

****************************************************************

From: Tucker Taft
Sent: Friday, April 19, 2002  3:31 PM

That makes it pretty different from the hash value of an object of
a non-limited type, since clearly the hash value of such an object
will likely change if the value of the object changes.

So perhaps what we are saying for an inherently-limited type is
that value = identity (at least for the predefined 'hash function).

I suppose this is what makes me feel that 'hash for an inherently limited
type is pretty different from the meaning of 'hash for a nonlimited type...

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  4:23 PM

No I did not mean that, the hash should depend on the value, as for any
other type. Mappings of the kind we are discussing here are really value
oriented it seems to me.

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  4:41 PM

>>Absolutely, the hash must not change, otherwise it is useless

Incidentally, I realize this caused confusion, I mean that it cannot change
just because the address changes, it can only change if the value changes.

****************************************************************

From: Michael Erdmann
Sent: Friday, April 19, 2002  4:22 PM

Now i get the idea, you want to assign to each instance some kind of
reference value (not a address of access)  which can be used as
key for the object?

It seems i dont understand what your intention are!

****************************************************************

From: Simon Wright
Sent: Monday, April 22, 2002  3:02 AM

> I suppose this is what makes me feel that 'hash for an inherently limited
> type is pretty different from the meaning of 'hash for a nonlimited type...

Sometimes a type is only limited because of a language restriction
rather than the application semantics (eg, the type has an access
discriminant). I feel that 'Hash ought to depend on application
semantics?

****************************************************************

From: Alexandre E. Kopilovitch
Sent: Friday, April 19, 2002  5:42 PM

It seems that the difference for an application is whether the hash values for
the limited types may be used as a persistent (or otherwise shared) data or
their usage is restricted to a particular run of a program.

****************************************************************

From: Michael Erdmann
Sent: Friday, April 19, 2002 12:23 PM

Do i understand this correctly, for example i could
do something like this:

x : array( .....)  of  some_type;
z : some_type;

z  :=  x'Hash(val);

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  12:29 PM

No, that's wrong, because as with any attribute function of this type
x is the type, so you need to say

Also the result is always an integer. The proper example is

   type a is array (....) of some_type;
   x : a;
   z : integer

   z := a'hash(x);

****************************************************************

From: Michael Erdmann
Sent: Friday, April 19, 2002  1:06 PM

What this could proably would do is

             H(x) =  ord(x) mod (lenght of array)

where ord the ordinal number of the key would by in this case  ord(x) = x ?

The compiler would have to derive the correct ord and length from the code ?

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  1:43 PM

I don't know what "ordinary number of the key" might mean

An array is typically hashed by combining the hashes of its components

****************************************************************

From: Michael Erdmann
Sent: Friday, April 19, 2002  3:33 PM

>I don't know what "ordinary number of the key" might mean

Just a function mapping  the keys 1:1  onto a number.

****************************************************************

From: Randy Brukardt
Sent: Friday, April 19, 2002  4:45 PM

Would someone care to show an example of how this "hash" value would be used
in an actual program? That would help decide what the semantics ought to be.
(I don't quite understand what the intent of this is. It seems to be an
attempt at a one-size-fits-all hash function, but that seems odd, as
generally the hash function needs to be tuned both to the type of data it is
hashing, and also the hash table algorithm.) How would this be implemented
(say, in GNAT)?

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  7:31 PM

Think of a language like SNOBOL-4 or SETL with general associative arrays
or mappings. YOu need conveniently implemented hashing functions for this
purpose. You would allo them to be overridden by a user, but it would indeed
by a huge convenience to have a "pretty good" hash function for free.

In GNAT, we would provide hash functoins for all the primitive data, and
then some obvious aggregation rules.

****************************************************************

From: Tucker Taft
Sent: Friday, April 19, 2002  5:48 PM

If you create a generic hashtable, then you need to "hash" values of the
key type.

The usual Ada approach is to have the instantiator pass in a "hash"
function.  This standard attribute would provide a default hash function.
The properties of the hash function would generally be that it returns
a full size integer, distributed as widely and pseudo-randomly as
possible over the full integer range.  Hence to use the value you
would generally need to "mod" it by the size of the hash table.

> ... That would help decide what the semantics ought to be.
> (I don't quite understand what the intent of this is. It seems to be an
> attempt at a one-size-fits-all hash function, but that seems odd, as
> generally the hash function needs to be tuned both to the type of data it is
> hashing, and also the hash table algorithm.)

Generally if the hash value is widely distributed across the range,
then that is adequate for almost all purposes.

> ... How would this be implemented (say, in GNAT)?

Scalars and access values are pretty straightforward.
Arrays and records would be the xor or equivalent of all
or some of their components.  Or something approximating that...

****************************************************************

From: Alexandre E. Kopilovitch
Sent: Friday, April 19, 2002  6:58 PM

Do you mean that a string, being an array of characters, should be processed
in such a way? And generally, any permutation of an array will never change
its hash value?

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  7:57 PM

That's fairly reasonable in practice (experiecne coming from the SETL
implementation).

Remember that *all* hash functions are "defective" in the sense that some
different values have the same hash. The fact that it is easy to see some
such cases says nothing about the efficiency of the hash. Remember that
this is not a case where we need cryptographically sound hashing :-)

****************************************************************

From: Alexandre E. Kopilovitch
Sent: Saturday, April 20, 2002  8:29 AM

I have no intention to contest the Spitbol implementor's authority in the
practical matters of string hashing, but here we are talking about the future
standard specification, to which all Ada compilers will be required to comply
(to be validated).

> Remember that *all* hash functions are "defective" in the sense that some
> different values have the same hash. The fact that it is easy to see some
> such cases says nothing about the efficiency of the hash. Remember that
> this is not a case where we need cryptographically sound hashing :-)

Well, all that may be true, but why on earth should we severily _restrict_
the quality of the standard hash function?
  I think that there exists a solution that provides sufficient freedom of
choice: suppose we have the standard hash functions for all scalar types, and
also a standard hash function for Integer arrays. Then we may treat an
aggregate (either array or record) with the following way:

1) if a source aggregate isn't an Integer array then _reduce_ the aggregate
   to an Integer array using the hash functions for its components;
2) apply ("secondary") hash function to that Integer array.

  Note, that with this specification you still may use XOR-accumulator for the
Integer array hash function, if you wish. But you are no longer _required_ to
do so, and you are permitted to provide something better.

****************************************************************

From: Robert Dewar
Sent: Saturday, April 20, 2002  9:23 AM

<<I have no intention to contest the Spitbol implementor's authority in the
practical matters of string hashing, but here we are talking about the future
standard specification, to which all Ada compilers will be required to comply
(to be validated).>>

Spitbol has nothing to do with strings here, all data types are hashable
in SNOBOL-4 that's why it is a very good model for what we are talking about.

<<  Note, that with this specification you still may use XOR-accumulator for the
Integer array hash function, if you wish. But you are no longer _required_ to
do so, and you are permitted to provide something better.>>

The choice of the hash function should clearly be implementation defined.
No one even hinted that "XOR-accumulator" would be "required", it is just
one simple approach that is perfectly adequate in practice.

It's harder than you think to know what is a good hash algorithm in practice.
Of course if you use something like MD-5 you are pretty much assured of
reasonable performance but that's massive overkill in this case. The only
purpose of a good hash algorithm is to improve time performance of lookup,
so you quickly reach diminishing returns if you spend too much time
computing the hash.

Anyway, the language feature would simply define the attribute and say
NOTHING AT ALL about how it should be implemented. Once you start discussing
how it should be implemented, you fall into a useless discussion, and you
will end up with nothing.

Since this is such a trivial issue, it is not worth any more of my time.
So this is my last message on the subject.

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  7:50 PM

Tuck said

<<The usual Ada approach is to have the instantiator pass in a "hash"
function.  This standard attribute would provide a default hash function.
The properties of the hash function would generally be that it returns
a full size integer, distributed as widely and pseudo-randomly as
possible over the full integer range.  Hence to use the value you
would generally need to "mod" it by the size of the hash table.
>>

Exactly, we are on the same wavelength :-)

****************************************************************

From: Randy Brukardt
Sent: Friday, April 19, 2002  7:10 PM

Tuck said (replying to me):

> If you create a generic hashtable, then you need to "hash"
> values of the key type.
>
> The usual Ada approach is to have the instantiator pass in a "hash"
> function.  This standard attribute would provide a default hash function.
> The properties of the hash function would generally be that it returns
> a full size integer, distributed as widely and pseudo-randomly as
> possible over the full integer range.  Hence to use the value you
> would generally need to "mod" it by the size of the hash table.

Thanks. It helps a lot when evaluating a proposal to know what problem it is
intending to solve...

> > ... That would help decide what the semantics ought to be.
> > (I don't quite understand what the intent of this is. It seems to be an
> > attempt at a one-size-fits-all hash function, but that seems odd, as
> > generally the hash function needs to be tuned both to the type of data
it is
> > hashing, and also the hash table algorithm.)
>
> Generally if the hash value is widely distributed across the range,
> then that is adequate for almost all purposes.

But that seems like quite a trick in the general case.

> > ... How would this be implemented (say, in GNAT)?
>
> Scalars and access values are pretty straightforward.
> Arrays and records would be the xor or equivalent of all
> or some of their components.  Or something approximating that...

"Pretty straightforward" maybe, but that would provide an exceedingly bad
hash function in many cases. If the hash value for a discrete type has
simply the value itself, it would work fine for a single object, but it
would compose horribly. Consider an array of Booleans: the hash values are
all 0 and 1. If you xor them together, you get 0 or 1. That doesn't seem to
meet the requirement of "widely distributed".

Indeed, any subtype which was mostly empty space would cause trouble. No
matter what values were chosen for the hash of Boolean, there could only be
two such values. Composition of those values is unlikely to provide a
"widely distributed" function.

Since Robert seems to have abdicated his usual role as the "amendment
proposal skeptic" on this one, let me take it up. :-)

A fundamental property of a useful hash function is the following: given
that X and Y are objects of type T, X = Y implies that T'Hash(X) =
T'Hash(Y). This has a long list of implications:

It has to be possible to specify a hash function. If "=" is overridden by a
user function, it has to be possible to specify T'Hash to keep the invariant
true.

Ada.Finalization.Controlled'Hash returns a constant. ("=" always returns
True for this type).

For a tagged type, where the predefined "=" composes, T'Hash must compose.
(If the parent has a user-defined hash function, the extension must use this
for the parent part.)

If "=" does a component-by-component comparison in order to avoid padding
and implementation defined components, then T'Hash must do a
component-by-component hash. As I showed above, a straightforward
implementation of that can provide an exceedingly bad hash function. You can
improve the function somewhat by tossing in a rotate, but doing that well
requires knowing how many bits each component contributes to the hash. But
that is unknowable for dynamic subtypes.

These requirements look a lot like the stream attributes, and there is
nothing simple about using or implementing the stream attributes. And I am
very skeptical that a decent ("good enough") hash function can be created
with the tiny bit of knowledge that the compiler has.

The extension to inherently limited types makes these even more fun. One
presumes that the "limited" components (the task TCB, the protected lock and
queue) would not participate in the hash (otherwise, the hash value would
depend on what entries were called, the priority of a task, etc.), but that
other components would. This would have to be nailed down enough that the
result could be depended on, but that sounds like a can of worms.

Does any modern programming language have anything like this? Does it work?
:-)

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  7:57 PM

>>But that seems like quite a trick in the general case.

This is *very* well known technology.

>>Does any modern programming language have anything like this? Does it work?

yes of course, I gave two examples, SETL and SNOBOL4

****************************************************************

From: Randy Brukardt
Sent: Friday, April 19, 2002  8:38 PM

> >>But that seems like quite a trick in the general case.
>
> This is *very* well known technology.

It can't be "very well known" (which I would apply only to things known by
almost all programmers), 'cause I don't know it.

But I think (based on your other messages), that you are willing to have a
much worse hash function than I would expect would be usable. That is not
hard, I know exactly how to create a lousy hash function. :-) But I don't
think a "pretty good" one is possible *in general* for Ada's set of data
types, given the other requirements of the Ada model. Both Alexandre and I
gave examples of cases where it wouldn't work well at all, and would not
come close to "pretty good" in my opinion.

> >>Does any modern programming language have anything like this? Does it
work?
>
> yes of course, I gave two examples, SETL and SNOBOL4

Well these (hardly modern) programming languages have associative arrays. I
don't doubt the need for these or that they can be implemented, especially
in a language with a fairly limited set of types. I was referring
specifically to the 'Hash proposal: does any modern programming language
have a generalized hash function like this? How do maps components get a
hash function in Java or C++ (compiled languages with rich type systems)?

I'm just trying to justify the (considerable) baggage of this proposal. I
can see the value in making a map component easier to create, but that
already can be done (less conveniently) -- and I wonder how useful the
result actually will be.

I'm pretty much expecting that Ada 0y is going to get a "scope reduction"
similar to Ada 9x, and a proposal like this that will take a page of RM text
to fill one, very specific need is likely to push out something else
important.

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  9:52 PM

Well one person's important is another persons frill. Personally I find
most proposals for complexifying the typing system to be frills, but
others seem to regard them as important.

The hash function is reasonably easy to implement, perhaps a day or two's
work for GNAT, not more.

****************************************************************

From: Randy Brukardt
Sent: Friday, April 19, 2002  10:14 PM

Amazing. Certainly the actual function is relatively easy, but handling
redefinition and composition is roughly like handling a stream attribute in
Janus/Ada. And that is not easy. I would guess that it would take a week, might
be faster if some of the stream attribute code could be reused. It would
probably take the better part of a day just to add the needed additional symbol
table components (because every aggregate would need to be adjusted; we use
aggregates to insure that we don't fail to handle a component, but of course
that means we have to handle added components in each place even when it is
irrelevant).

****************************************************************

From: Robert Dewar
Sent: Friday, April 19, 2002  10:25 PM

It takes about 15 minutes to just add the apparatus for a new attribute (we
have done it quite often :-)

THe reason that the actual implementation would be relatively easy is that
it is quite similar to the stream functions. For example, the requirement
that x'class'hash work as expected would be just like the stream functions.

But I am not at all pressing furiously for this feature it was just a
passing thought in response to people saying they wanted general mappings.
Note that the unit GNAT.Spitbol has:

   -------------------
   -- Table Support --
   -------------------

   --  So far, we only provide support for tables whose indexing data values
   --  are strings (or unbounded strings). The values stored may be of any
   --  type, as supplied by the generic formal parameter.

   generic

      type Value_Type is private;
      --  Any non-limited type can be used as the value type in the table

      Null_Value : Value_Type;
      --  Value used to represent a value that is not present in the table.

      with function Img (A : Value_Type) return String;
      --  Used to provide image of value in Dump procedure

      with function "=" (A, B : Value_Type) return Boolean is <>;
      --  This allows a user-defined equality function to override the
      --  predefined equality function.

   package Table is

      ------------------------
      -- Table Declarations --
      ------------------------

      type Table (N : Unsigned_32) is private;
      --  This is the table type itself. A table is a mapping from string
      --  values to values of Value_Type. The discriminant is an estimate of
      --  the number of values in the table. If the estimate is much too
      --  high, some space is wasted, if the estimate is too low, access to
      --  table elements is slowed down. The type Table has copy semantics,
      --  not reference semantics. This means that if a table is copied
      --  using simple assignment, then the two copies refer to entirely
      --  separate tables.

      -----------------------------
      -- Table Access Operations --
      -----------------------------

      function Get (T : Table; Name : VString)   return Value_Type;
      function Get (T : Table; Name : Character) return Value_Type;
      pragma Inline (Get);
      function Get (T : Table; Name : String)    return Value_Type;

      --  If an entry with the given name exists in the table, then the
      --  corresponding Value_Type value is returned. Otherwise Null_Value
      --  is returned.

      function Present (T : Table; Name : VString)   return Boolean;
      function Present (T : Table; Name : Character) return Boolean;
      pragma Inline (Present);
      function Present (T : Table; Name : String)    return Boolean;
      --  Determines if an entry with the given name is present in the table.
      --  A returned value of True means that it is in the table, otherwise
      --  False indicates that it is not in the table.

      procedure Delete (T : in out Table; Name : VString);
      procedure Delete (T : in out Table; Name : Character);
      pragma Inline (Delete);
      procedure Delete (T : in out Table; Name : String);
      --  Deletes the table element with the given name from the table. If
      --  no element in the table has this name, then the call has no effect.

      procedure Set (T : in out Table; Name  : VString;   Value : Value_Type);
      procedure Set (T : in out Table; Name  : Character; Value : Value_Type);
      pragma Inline (Set);
      procedure Set (T : in out Table; Name  : String;    Value : Value_Type);
      --  Sets the value of the element with the given name to the given
      --  value. If Value is equal to Null_Value, the effect is to remove
      --  the entry from the table. If no element with the given name is
      --  currently in the table, then a new element with the given value
      --  is created.

      ----------------------------
      -- Allocation and Copying --
      ----------------------------

      --  Table is a controlled type, so that all storage associated with
      --  tables is properly reclaimed when a Table value is abandoned.
      --  Tables have value semantics rather than reference semantics as
      --  in Spitbol, i.e. when you assign a copy you end up with two
      --  distinct copies of the table, as though COPY had been used in
      --  Spitbol. It seems clearly more appropriate in Ada to require
      --  the use of explicit pointers for reference semantics.

      procedure Clear (T : in out Table);
      --  Clears all the elements of the given table, freeing associated
      --  storage. On return T is an empty table with no elements.

      procedure Copy (From : in Table; To : in out Table);
      --  First all the elements of table To are cleared (as described for
      --  the Clear procedure above), then all the elements of table From
      --  are copied into To. In the case where the tables From and To have
      --  the same declared size (i.e. the same discriminant), the call to
      --  Copy has the same effect as the assignment of From to To. The
      --  difference is that, unlike the assignment statement, which will
      --  cause a Constraint_Error if the source and target are of different
      --  sizes, Copy works fine with different sized tables.

      ----------------
      -- Conversion --
      ----------------

      type Table_Entry is record
         Name  : VString;
         Value : Value_Type;
      end record;

      type Table_Array is array (Positive range <>) of Table_Entry;

      function Convert_To_Array (T : Table) return Table_Array;
      --  Returns a Table_Array value with a low bound of 1, and a length
      --  corresponding to the number of elements in the table. The elements
      --  of the array give the elements of the table in unsorted order.

      ---------------
      -- Debugging --
      ---------------

      procedure Dump (T : Table; Str : String := "Table");
      --  Dump contents of given table to the standard output file. The
      --  string value Str is used as the name of the table in the dump.

      procedure Dump (T : Table_Array; Str : String := "Table_Array");
      --  Dump contents of given table array to the current output file. The
      --  string value Str is used as the name of the table array in the dump.


****************************************************************

From: Florian Weimer
Sent: Friday, April 19, 2002  10:25 PM

"Randy Brukardt" <Randy@RRSoftware.Com> writes:

> Well these (hardly modern) programming languages have associative arrays. I
> don't doubt the need for these or that they can be implemented, especially
> in a language with a fairly limited set of types. I was referring
> specifically to the 'Hash proposal: does any modern programming language
> have a generalized hash function like this? How do maps components get a
> hash function in Java or C++ (compiled languages with rich type systems)?

C++ hasn't got hash support in the container library, the mappings are
based on balanced trees (or something like that).

Java has got hash function support, in its Object baseclass:

http://java.sun.com/docs/books/tutorial/java/javaOO/objectclass.html

****************************************************************

From: Bernard Maudry
Sent: Friday, April 19, 2002  6:20 AM

> How would this be implemented (say, in GNAT)?

We have such hashing function used to speed up the comparison of values in
polymorphic containers. The algorithm is the following:
- we designed a 'write only' stream type, which takes the byte flow of its
write procedure and accumulate it in a modular member through a prime
multiplicand (to provide dispersion and influence of each part of the value)
- to get the hash value of a variable, we just write it into this stream and
return the modular member.

This algorithm is able to handle any type and appears to work very well.

If such an algorithm is to be used for the future 'hash' attribute, the prime
multiplicand should be tunable, for example by a 'hash_multiplicand' attribute.

****************************************************************

From: Ted Baker
Sent: Monday, April 22, 2002  6:35 AM

| What this could proably would do is
|              H(x) =  ord(x) mod (lenght of array)
| where ord the ordinal number of the key would by in this case  ord(x) = x ?
| The compiler would have to derive the correct ord and length from the code ?

If you are talking about implementations, I don't see a need to be specific,
but would certainly avoid requiring the use of "mod" in any case.
It would be faster and at least as good to use word-wise exclusive-or
to collapse long structures.

****************************************************************

From: Ted Baker
Sent: Monday, April 22, 2002  7:04 AM

Regarding the limited/address versus non-limited value discussion,
for the uses I had in mind I would rarely expect to need to hash a
limited objects, and if I did I would not mind hashing its address
explicitly.  Moreover, I believe it would be conceptually simpler
for everyone if the semantics and implementation of T'hash would
be uniform across all data types.

By the way, there are other critical properties of a good hash
function, beyond the obvious requirement that equal values hash to
the same hash value.  I would expect it to preserve or increase
the "appearance of randomness" (which I put in quotes and
purposely leave undefined, because of my inexpert status on hash
functions).  For example, I would also expect that the number of
domain values that hash to each range value should be as nearly as
possible equal, i.e., if X is a random variable for the given
domain with uniform distribution, I would expect that P(H(X)=Y) is
very close to some constant for every element Y of the range.

--Ted

PS:

Originally, I was more interested in just having the convenience
of an array-like abstraction -- which I'll call a table -- that
could be used to store sparse mappings of some key domain (e.g.,
unconstrained strings) to values of some other domain (e.g.,
pointers to record objects).

I see that implementing a generic table type in portable way would
require such an attribute, but I would have been happy if the
*interface* to the table type were portable, and each
implementation did whatever bit-hacking works for a given compiler
and target architecture.

I find it interesting that this discussion has focussed
exclusively on the portable computation of the hash value.

****************************************************************

From: Ted Baker
Sent: Monday, April 22, 2002  7:28 AM

| Does any modern programming language have anything like this? Does it work?
| :-)

My original inspiration was Snobol, but for more modern language
you can look at Perl.  The following is from the first Perl tutorial
Google found on the web:

| "Associative arrays, also frequently called hashes, are the third
| major data type in Perl after scalars and arrays. Hashes are named
| as such because they work very similarly to a common data
| structure that programmers use in other languages--hash
| tables. However, hashes in Perl are actually a direct language
| supported data type.

| Accessing a hash works very similar to accessing arrays. However,
| hashes are not subscripted by numbers. They can be subscripted by
| an arbitrary scalar value. You simply use the {} to subscript the
| value instead of [] as you did with arrays. Here is an example:

| use strict;
| my %table;
| $table{'schmoe'} = 'joe';
| $table{7.5}  = 2.6;

| In this example, our hash, called, %table, has two entries. The
| key 'schmoe' is associated with the value 'joe', and the key 7.5
| is associated with the value 2.6.

| Just like with array elements, hash elements can be used anywhere
| a scalar variable is permitted. Thus, given a @hash{%table} built
| with the code above, we can do the following:

| print "$table{'schmoe'}\n";    # outputs "joe\n"
| --$table{7.5};                 # $table{7.5} now contains 1.6

| Another interesting fact is that all hash variables can be
| evaluated in the list context. When done, this gives a list whose
| odd elements are the keys of the hash, and whose even elements are
| the corresponding values. Thus, assuming we have the same %table
| from above, we can execute:

| my @tableListed = %table;  # @tableListed is qw/schmoe joe 7.5 1.6/

Another one of the examples I originally had in mind the C++
Standard Template Library associative container "map", but that is
less interesting to me because it is based on a user-specified
comparison operator (and implemented as an ordered tree rather
than a hash function).

****************************************************************

From: Ted Baker
Sent: Monday, April 22, 2002  7:45 AM

| Well these (hardly modern) programming languages have associative arrays. I
| don't doubt the need for these or that they can be implemented, especially
| in a language with a fairly limited set of types. I was referring
| specifically to the 'Hash proposal: does any modern programming language
| have a generalized hash function like this? How do maps components get a
| hash function in Java or C++ (compiled languages with rich type systems)?

C++ seems to be lagging behind a little bit on this.  The C++ STL
associative map container template relies on an ordering
comparision operator, and so does not fit a hashing
implementation, though there is a proposal (see
http://www.sgi.com/tech/stl/hash_set.html) for an extension.  The
proposed extension relies on a binary equality-test operatar as
parameter.

On the other hand, Java (like Perl) seems to have built-in what I
was hoping for.  See below.

--Ted

http://java.sun.com/products/jdk/1.1/docs/api/java.lang.Object.html#hashCode()

| hashCode

| public native int hashCode()

|     Returns a hash code value for the object. This method is supported for the benefit of hashtables such as those provided by java.util.Hashtable.

|     The general contract of hashCode is:

|         * Whenever it is invoked on the same object more than once
| during an execution of a Java application, the hashCode method
| must consistently return the same integer. This integer need not
| remain consistent from one execution of an application to another
| execution of the same application.
|         * If two objects are equal according to the equals method,
| then calling the hashCode method on each of the two objects must
| produce the same integer result.

|     Returns:
|     a hash code value for this object. See Also:
|     equals, Hashtable

http://java.sun.com/products/jdk/1.1/docs/api/java.util.Hashtable.html#_top_

| public class Hashtable
| extends Dictionary
| implements Cloneable, Serializable

| This class implements a hashtable, which maps keys to
| values. Any non-null object can be used as a key or as a value.

| To successfully store and retrieve objects from a hashtable, the
| objects used as keys must implement the hashCode method and the
| equals method.

| An instance of Hashtable has two parameters that affect its
| efficiency: its capacity and its load factor. The load factor
| should be between 0.0 and 1.0. When the number of entries in the
| hashtable exceeds the product of the load factor and the current
| capacity, the capacity is increased by calling the rehash
| method. Larger load factors use memory more efficiently, at the
| expense of larger expected time per lookup.

| If many entries are to be made into a Hashtable, creating it
| with a sufficiently large capacity may allow the entries to be
| inserted more efficiently than letting it perform automatic
| rehashing as needed to grow the table.

| This example creates a hashtable of numbers. It uses the names of
| the numbers as keys:

| Hashtable numbers = new Hashtable();
|      numbers.put("one", new Integer(1));
|      numbers.put("two", new Integer(2));
|      numbers.put("three", new Integer(3));
|

| To retrieve a number, use the following code:

| Integer n = (Integer)numbers.get("two");
|      if (n != null) {
|          System.out.println("two = " + n);
|      }
|

| See Also:
| equals, hashCode, rehash


****************************************************************

From: Pascal Leroy
Sent: Monday, April 22, 2002  8:47 AM

> The choice of the hash function should clearly be implementation defined.
> No one even hinted that "XOR-accumulator" would be "required", it is just
> one simple approach that is perfectly adequate in practice.

I think the situation here is analogous to that of the random number
generator.  The generator is defined by the language, and is required to
behave reasonably well (see RM95 G.2.5).  Other than that, it is entirely
implementation-defined, and if an application has particular needs (eg
cryptography), it can just use its own generator.

Similarly, it would be generally useful to have a hashing capability, but
the details should be left to the implementation.  The RM could possibly
impose a minimum level of quality and some documentation requirements.

****************************************************************

From: Robert Dewar
Sent: Monday, April 22, 2002  6:17 PM

<Similarly, it would be generally useful to have a hashing capability, but
the details should be left to the implementation.  The RM could possibly
impose a minimum level of quality and some documentation requirements.>>

I disagree in part. Any attempt to impose "quality" will be misguided,
especially since quality in this regard involves an appropriate balance
between pseudo-randomness and speed.

****************************************************************

From: Alexandre E. Kopilovitch
Sent: Monday, April 22, 2002  6:55 PM

Not at all. A notion of quality may say, for example, that the hash function
should be generally useful for all types (perhaps, with some restrictions),
including, for instance, Boolean arrays (Randy's example). Such a requirement
does not involve a balance between pseudo-randomness and speed.

****************************************************************

From: Pascal Leroy
Sent: Tuesday, April 23, 2002  1:44 AM

> I disagree in part. Any attempt to impose "quality" will be misguided,
> especially since quality in this regard involves an appropriate balance
> between pseudo-randomness and speed.

It depends very much on how quality requirements are phrased.  Take for
example the requirements for random numbers in G.2.5: they are perfectly
reasonable and do not cause any significant speed problem, but they ensure
that the random number generators are useful for most applications.

A hash function that would always return 0 would be fast but useless.  On
the other hand, we don't want to phrase requirements in such a way that only
SHA-1 or MD-5 are acceptable hash.  So there has to be a balance somewhere.
The kind of requirements that I had in mind would be things like:

- For a scalar type with a sufficiently small number of values, the hash
function must return distinct values for all values of the type.

- For random uniformly distributed input, the output of the hash function
must be uniformly distributed.

(Lots of hand waving in the above two rules.)  As far as I can tell,
requirements like that would not constrain the implementation in ways that
would cause speed problems, but they would guarantee that the hash is useful
in practice.  A hash function that would not be uniformly distributed would
be a nuisance, because it could silently affect the complexity of algorithms
that depend on it without the user realizing it.

****************************************************************

From: Robert Dewar
Sent: Tuesday, April 23, 2002  6:09 AM

> - For random uniformly distributed input, the output of the hash function
> must be uniformly distributed.

This is very ill-conceived, and is far too over-constraining. No usable
hash function will meet this criterion. In fact MD-5 does not meet this
criterion. If you take a set of numbers that all hash to the same value
using MD-5, the input will meet all possible tests for randomness.

Indeed I don't think your condition is well formed at all.

This discussion certainly convinces me that the process of discussing
standardized packages is never going to succeed in people reaching
agreement. I think you are doomed to a repeat of the absurd nonsense
that happened with the directory operations package :-)

****************************************************************

From: Chistoph Grein
Sent: Monday, April 22, 2002  12:15 AM

> Tuck said
>
> <<The usual Ada approach is to have the instantiator pass in a "hash"
> function.  This standard attribute would provide a default hash function.
> The properties of the hash function would generally be that it returns
> a full size integer, distributed as widely and pseudo-randomly as
> possible over the full integer range.  Hence to use the value you
> would generally need to "mod" it by the size of the hash table.
> >>
>
> Exactly, we are on the same wavelength :-)

For arrays, you could take the indexes into account when hashing. Of course a
given array value's hash must not depend on its current range, so it should be
slided to a standard index range first.

****************************************************************

From: Robert Dewar
Sent: Monday, April 22, 2002  6:32 PM

No of course about it here, this needs thinking about.

Also, please change the subject, again we should not be discussing all
enhancements under one umbrella subject

****************************************************************

From: Randy Brukardt
Sent: Monday, April 22, 2002  6:45 PM

> Also, please change the subject, again we should not be discussing all
> enhancements under one umbrella subject

Yes, please do. Your editor has 71 messages in his Ada comment in box
labeled "Enhancements of the Predefined Language environment". Determining
how to file these is going to be fun...

****************************************************************

From: Robert Dewar
Sent: Monday, April 22, 2002  6:53 PM

Circular file for the lot of them?

****************************************************************

From: Robert Dewar
Sent: Monday, April 22, 2002  7:47 PM

Oops, meant to send this joke only to Randy. But seriously, unless we get
these discussions organized, they will go nowhere.

****************************************************************

From: Nick Roberts
Sent: Wednesday, April 24, 2002  7:31 PM

I'm sorry to come to the hashing debate late. I blinked and nearly missed it!

The first thing I must point out is that it is very likely to be extremely
useful for a 'dope' number to be included in the hash function. This means that
the function profile would be something like:

   function T'Hash (X: in T; N: in Natural) return Natural;

where N is the dope number. This number allows the implementation of a
hash-indexed array to try dope number 0 first, and if that clashes, to 'try
again' with number 1, and so on. [Therefore a good characteristic is to
approach #(x,n) /= #(x,m) forall n /= m.]

Second, it might be worth adding a third 'Scale' parameter to the function.
This may be best expressed in the form of a binary power, and would suggest the
range of the function. However, there are various pros and cons to having this
parameter.

Third, I suggest that if T'Hash were defined (and predefined) only for definite
non-limited types, this would be sufficient. For all practical purposes, any
limited or indefinite type could be effectively hashed by declaring an access
type which referred to it, and using the hash of the access type.

Finally, I like the idea, and I think it would be worth adding in the next
revision, if practicable.

****************************************************************

From: Robert Dewar
Sent: Thursday, April 25, 2002  9:46 PM

> where N is the dope number. This number allows the implementation of a
> hash-indexed array to try dope number 0 first, and if that clashes, to 'try
> again' with number 1, and so on. [Therefore a good characteristic is to
> approach #(x,n) /= #(x,m) forall n /= m.]

No point in this, no one uses linear hashing in situations like this except
text books :-)

The only sensible data structures for maps are balanced trees based on
sorting, or chained hash implementations.

****************************************************************


Questions? Ask the ACAA Technical Agent