!standard 13.14 (00) 02-04-23 AI95-00293/00 !class amendment 02-04-23 !status received 02-04-23 !priority Medium !difficulty Medium !subject Built-in hash function !summary !problem !proposal !wording !example !discussion !ACATS Test ACATS test(s) need to be created. !appendix From: Ted Baker Sent: Friday, April 19, 2002 7:03 AM A really useful package to have widely available would be an implementation of a generic dynamic associative mapping, from some domain of key values to some range of objects. I have in mind something like SNOBOL tables, but able to live within Ada's type structure. It would need to be reasonably fast for sparse key domains, and support reasonably fast updates. In other words, it would amount to some form of generic dynamic hash tables. I can't count the number of times I could have saved a lot of time and effort if I could just have instantiated such a structure. It can be used to implement a set (i.e., the subset set of the domain for which the map is currently defined), or a function. With sets and functions, you can program anything. **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 8:51 AM Really the language feature that would be valuable here would be someting like x'Hash (val) which would yield an Integer hash value from the given value. This should apply to all types. **************************************************************** From: Tucker Taft Sent: Friday, April 19, 2002 9:38 AM I would think this would want to be related to the "=" operator, implying that it would be analogous to stream attributes, defined for all types, but only predefined for non-limited types. And of course it should be user-specifiable, like streams. **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 9:47 AM No, I think to be useful it needs to be all types. If the restricted stream model is acceptable, you can just hash the stream of the item. **************************************************************** From: Tucker Taft Sent: Friday, April 19, 2002 12:45 PM I suppose for "inherently" limited types, 'hash(x) could be roughly equivalent to System.Address'hash(x'Address) since each object of an inherently limited type is essentially unique. For types that are not inherently limited, the basic requirement for the predefined 'hash function would be that if x=y then 'hash(x) = 'hash(y). **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 1:41 PM I would think that the criterion should be that the hashes are equal if the underlying representations are equal, i.e. that the hash should simply go to the underlying type always. **************************************************************** From: Florian Weimer Sent: Friday, April 19, 2002 12:51 PM The address of an object can change during its lifetime, can't it? **************************************************************** From: Tucker Taft Sent: Friday, April 19, 2002 2:58 PM That's why I said "roughly equivalent" ;-). The point is that each object of an inherently limited type is unique, so they can each receive a different hash value. I suppose a bigger question is whether the hash value of a given inherently-limited object must remain the same throughout it's life. That would be nice, I *think*. The fundamental problem is that non-limited types are "value oriented" whereas limited types are "identity" or "object" oriented. **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 3:08 PM Absolutely, the hash must not change, otherwise it is useless **************************************************************** From: Tucker Taft Sent: Friday, April 19, 2002 3:31 PM That makes it pretty different from the hash value of an object of a non-limited type, since clearly the hash value of such an object will likely change if the value of the object changes. So perhaps what we are saying for an inherently-limited type is that value = identity (at least for the predefined 'hash function). I suppose this is what makes me feel that 'hash for an inherently limited type is pretty different from the meaning of 'hash for a nonlimited type... **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 4:23 PM No I did not mean that, the hash should depend on the value, as for any other type. Mappings of the kind we are discussing here are really value oriented it seems to me. **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 4:41 PM >>Absolutely, the hash must not change, otherwise it is useless Incidentally, I realize this caused confusion, I mean that it cannot change just because the address changes, it can only change if the value changes. **************************************************************** From: Michael Erdmann Sent: Friday, April 19, 2002 4:22 PM Now i get the idea, you want to assign to each instance some kind of reference value (not a address of access) which can be used as key for the object? It seems i dont understand what your intention are! **************************************************************** From: Simon Wright Sent: Monday, April 22, 2002 3:02 AM > I suppose this is what makes me feel that 'hash for an inherently limited > type is pretty different from the meaning of 'hash for a nonlimited type... Sometimes a type is only limited because of a language restriction rather than the application semantics (eg, the type has an access discriminant). I feel that 'Hash ought to depend on application semantics? **************************************************************** From: Alexandre E. Kopilovitch Sent: Friday, April 19, 2002 5:42 PM It seems that the difference for an application is whether the hash values for the limited types may be used as a persistent (or otherwise shared) data or their usage is restricted to a particular run of a program. **************************************************************** From: Michael Erdmann Sent: Friday, April 19, 2002 12:23 PM Do i understand this correctly, for example i could do something like this: x : array( .....) of some_type; z : some_type; z := x'Hash(val); **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 12:29 PM No, that's wrong, because as with any attribute function of this type x is the type, so you need to say Also the result is always an integer. The proper example is type a is array (....) of some_type; x : a; z : integer z := a'hash(x); **************************************************************** From: Michael Erdmann Sent: Friday, April 19, 2002 1:06 PM What this could proably would do is H(x) = ord(x) mod (lenght of array) where ord the ordinal number of the key would by in this case ord(x) = x ? The compiler would have to derive the correct ord and length from the code ? **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 1:43 PM I don't know what "ordinary number of the key" might mean An array is typically hashed by combining the hashes of its components **************************************************************** From: Michael Erdmann Sent: Friday, April 19, 2002 3:33 PM >I don't know what "ordinary number of the key" might mean Just a function mapping the keys 1:1 onto a number. **************************************************************** From: Randy Brukardt Sent: Friday, April 19, 2002 4:45 PM Would someone care to show an example of how this "hash" value would be used in an actual program? That would help decide what the semantics ought to be. (I don't quite understand what the intent of this is. It seems to be an attempt at a one-size-fits-all hash function, but that seems odd, as generally the hash function needs to be tuned both to the type of data it is hashing, and also the hash table algorithm.) How would this be implemented (say, in GNAT)? **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 7:31 PM Think of a language like SNOBOL-4 or SETL with general associative arrays or mappings. YOu need conveniently implemented hashing functions for this purpose. You would allo them to be overridden by a user, but it would indeed by a huge convenience to have a "pretty good" hash function for free. In GNAT, we would provide hash functoins for all the primitive data, and then some obvious aggregation rules. **************************************************************** From: Tucker Taft Sent: Friday, April 19, 2002 5:48 PM If you create a generic hashtable, then you need to "hash" values of the key type. The usual Ada approach is to have the instantiator pass in a "hash" function. This standard attribute would provide a default hash function. The properties of the hash function would generally be that it returns a full size integer, distributed as widely and pseudo-randomly as possible over the full integer range. Hence to use the value you would generally need to "mod" it by the size of the hash table. > ... That would help decide what the semantics ought to be. > (I don't quite understand what the intent of this is. It seems to be an > attempt at a one-size-fits-all hash function, but that seems odd, as > generally the hash function needs to be tuned both to the type of data it is > hashing, and also the hash table algorithm.) Generally if the hash value is widely distributed across the range, then that is adequate for almost all purposes. > ... How would this be implemented (say, in GNAT)? Scalars and access values are pretty straightforward. Arrays and records would be the xor or equivalent of all or some of their components. Or something approximating that... **************************************************************** From: Alexandre E. Kopilovitch Sent: Friday, April 19, 2002 6:58 PM Do you mean that a string, being an array of characters, should be processed in such a way? And generally, any permutation of an array will never change its hash value? **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 7:57 PM That's fairly reasonable in practice (experiecne coming from the SETL implementation). Remember that *all* hash functions are "defective" in the sense that some different values have the same hash. The fact that it is easy to see some such cases says nothing about the efficiency of the hash. Remember that this is not a case where we need cryptographically sound hashing :-) **************************************************************** From: Alexandre E. Kopilovitch Sent: Saturday, April 20, 2002 8:29 AM I have no intention to contest the Spitbol implementor's authority in the practical matters of string hashing, but here we are talking about the future standard specification, to which all Ada compilers will be required to comply (to be validated). > Remember that *all* hash functions are "defective" in the sense that some > different values have the same hash. The fact that it is easy to see some > such cases says nothing about the efficiency of the hash. Remember that > this is not a case where we need cryptographically sound hashing :-) Well, all that may be true, but why on earth should we severily _restrict_ the quality of the standard hash function? I think that there exists a solution that provides sufficient freedom of choice: suppose we have the standard hash functions for all scalar types, and also a standard hash function for Integer arrays. Then we may treat an aggregate (either array or record) with the following way: 1) if a source aggregate isn't an Integer array then _reduce_ the aggregate to an Integer array using the hash functions for its components; 2) apply ("secondary") hash function to that Integer array. Note, that with this specification you still may use XOR-accumulator for the Integer array hash function, if you wish. But you are no longer _required_ to do so, and you are permitted to provide something better. **************************************************************** From: Robert Dewar Sent: Saturday, April 20, 2002 9:23 AM <> Spitbol has nothing to do with strings here, all data types are hashable in SNOBOL-4 that's why it is a very good model for what we are talking about. << Note, that with this specification you still may use XOR-accumulator for the Integer array hash function, if you wish. But you are no longer _required_ to do so, and you are permitted to provide something better.>> The choice of the hash function should clearly be implementation defined. No one even hinted that "XOR-accumulator" would be "required", it is just one simple approach that is perfectly adequate in practice. It's harder than you think to know what is a good hash algorithm in practice. Of course if you use something like MD-5 you are pretty much assured of reasonable performance but that's massive overkill in this case. The only purpose of a good hash algorithm is to improve time performance of lookup, so you quickly reach diminishing returns if you spend too much time computing the hash. Anyway, the language feature would simply define the attribute and say NOTHING AT ALL about how it should be implemented. Once you start discussing how it should be implemented, you fall into a useless discussion, and you will end up with nothing. Since this is such a trivial issue, it is not worth any more of my time. So this is my last message on the subject. **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 7:50 PM Tuck said <> Exactly, we are on the same wavelength :-) **************************************************************** From: Randy Brukardt Sent: Friday, April 19, 2002 7:10 PM Tuck said (replying to me): > If you create a generic hashtable, then you need to "hash" > values of the key type. > > The usual Ada approach is to have the instantiator pass in a "hash" > function. This standard attribute would provide a default hash function. > The properties of the hash function would generally be that it returns > a full size integer, distributed as widely and pseudo-randomly as > possible over the full integer range. Hence to use the value you > would generally need to "mod" it by the size of the hash table. Thanks. It helps a lot when evaluating a proposal to know what problem it is intending to solve... > > ... That would help decide what the semantics ought to be. > > (I don't quite understand what the intent of this is. It seems to be an > > attempt at a one-size-fits-all hash function, but that seems odd, as > > generally the hash function needs to be tuned both to the type of data it is > > hashing, and also the hash table algorithm.) > > Generally if the hash value is widely distributed across the range, > then that is adequate for almost all purposes. But that seems like quite a trick in the general case. > > ... How would this be implemented (say, in GNAT)? > > Scalars and access values are pretty straightforward. > Arrays and records would be the xor or equivalent of all > or some of their components. Or something approximating that... "Pretty straightforward" maybe, but that would provide an exceedingly bad hash function in many cases. If the hash value for a discrete type has simply the value itself, it would work fine for a single object, but it would compose horribly. Consider an array of Booleans: the hash values are all 0 and 1. If you xor them together, you get 0 or 1. That doesn't seem to meet the requirement of "widely distributed". Indeed, any subtype which was mostly empty space would cause trouble. No matter what values were chosen for the hash of Boolean, there could only be two such values. Composition of those values is unlikely to provide a "widely distributed" function. Since Robert seems to have abdicated his usual role as the "amendment proposal skeptic" on this one, let me take it up. :-) A fundamental property of a useful hash function is the following: given that X and Y are objects of type T, X = Y implies that T'Hash(X) = T'Hash(Y). This has a long list of implications: It has to be possible to specify a hash function. If "=" is overridden by a user function, it has to be possible to specify T'Hash to keep the invariant true. Ada.Finalization.Controlled'Hash returns a constant. ("=" always returns True for this type). For a tagged type, where the predefined "=" composes, T'Hash must compose. (If the parent has a user-defined hash function, the extension must use this for the parent part.) If "=" does a component-by-component comparison in order to avoid padding and implementation defined components, then T'Hash must do a component-by-component hash. As I showed above, a straightforward implementation of that can provide an exceedingly bad hash function. You can improve the function somewhat by tossing in a rotate, but doing that well requires knowing how many bits each component contributes to the hash. But that is unknowable for dynamic subtypes. These requirements look a lot like the stream attributes, and there is nothing simple about using or implementing the stream attributes. And I am very skeptical that a decent ("good enough") hash function can be created with the tiny bit of knowledge that the compiler has. The extension to inherently limited types makes these even more fun. One presumes that the "limited" components (the task TCB, the protected lock and queue) would not participate in the hash (otherwise, the hash value would depend on what entries were called, the priority of a task, etc.), but that other components would. This would have to be nailed down enough that the result could be depended on, but that sounds like a can of worms. Does any modern programming language have anything like this? Does it work? :-) **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 7:57 PM >>But that seems like quite a trick in the general case. This is *very* well known technology. >>Does any modern programming language have anything like this? Does it work? yes of course, I gave two examples, SETL and SNOBOL4 **************************************************************** From: Randy Brukardt Sent: Friday, April 19, 2002 8:38 PM > >>But that seems like quite a trick in the general case. > > This is *very* well known technology. It can't be "very well known" (which I would apply only to things known by almost all programmers), 'cause I don't know it. But I think (based on your other messages), that you are willing to have a much worse hash function than I would expect would be usable. That is not hard, I know exactly how to create a lousy hash function. :-) But I don't think a "pretty good" one is possible *in general* for Ada's set of data types, given the other requirements of the Ada model. Both Alexandre and I gave examples of cases where it wouldn't work well at all, and would not come close to "pretty good" in my opinion. > >>Does any modern programming language have anything like this? Does it work? > > yes of course, I gave two examples, SETL and SNOBOL4 Well these (hardly modern) programming languages have associative arrays. I don't doubt the need for these or that they can be implemented, especially in a language with a fairly limited set of types. I was referring specifically to the 'Hash proposal: does any modern programming language have a generalized hash function like this? How do maps components get a hash function in Java or C++ (compiled languages with rich type systems)? I'm just trying to justify the (considerable) baggage of this proposal. I can see the value in making a map component easier to create, but that already can be done (less conveniently) -- and I wonder how useful the result actually will be. I'm pretty much expecting that Ada 0y is going to get a "scope reduction" similar to Ada 9x, and a proposal like this that will take a page of RM text to fill one, very specific need is likely to push out something else important. **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 9:52 PM Well one person's important is another persons frill. Personally I find most proposals for complexifying the typing system to be frills, but others seem to regard them as important. The hash function is reasonably easy to implement, perhaps a day or two's work for GNAT, not more. **************************************************************** From: Randy Brukardt Sent: Friday, April 19, 2002 10:14 PM Amazing. Certainly the actual function is relatively easy, but handling redefinition and composition is roughly like handling a stream attribute in Janus/Ada. And that is not easy. I would guess that it would take a week, might be faster if some of the stream attribute code could be reused. It would probably take the better part of a day just to add the needed additional symbol table components (because every aggregate would need to be adjusted; we use aggregates to insure that we don't fail to handle a component, but of course that means we have to handle added components in each place even when it is irrelevant). **************************************************************** From: Robert Dewar Sent: Friday, April 19, 2002 10:25 PM It takes about 15 minutes to just add the apparatus for a new attribute (we have done it quite often :-) THe reason that the actual implementation would be relatively easy is that it is quite similar to the stream functions. For example, the requirement that x'class'hash work as expected would be just like the stream functions. But I am not at all pressing furiously for this feature it was just a passing thought in response to people saying they wanted general mappings. Note that the unit GNAT.Spitbol has: ------------------- -- Table Support -- ------------------- -- So far, we only provide support for tables whose indexing data values -- are strings (or unbounded strings). The values stored may be of any -- type, as supplied by the generic formal parameter. generic type Value_Type is private; -- Any non-limited type can be used as the value type in the table Null_Value : Value_Type; -- Value used to represent a value that is not present in the table. with function Img (A : Value_Type) return String; -- Used to provide image of value in Dump procedure with function "=" (A, B : Value_Type) return Boolean is <>; -- This allows a user-defined equality function to override the -- predefined equality function. package Table is ------------------------ -- Table Declarations -- ------------------------ type Table (N : Unsigned_32) is private; -- This is the table type itself. A table is a mapping from string -- values to values of Value_Type. The discriminant is an estimate of -- the number of values in the table. If the estimate is much too -- high, some space is wasted, if the estimate is too low, access to -- table elements is slowed down. The type Table has copy semantics, -- not reference semantics. This means that if a table is copied -- using simple assignment, then the two copies refer to entirely -- separate tables. ----------------------------- -- Table Access Operations -- ----------------------------- function Get (T : Table; Name : VString) return Value_Type; function Get (T : Table; Name : Character) return Value_Type; pragma Inline (Get); function Get (T : Table; Name : String) return Value_Type; -- If an entry with the given name exists in the table, then the -- corresponding Value_Type value is returned. Otherwise Null_Value -- is returned. function Present (T : Table; Name : VString) return Boolean; function Present (T : Table; Name : Character) return Boolean; pragma Inline (Present); function Present (T : Table; Name : String) return Boolean; -- Determines if an entry with the given name is present in the table. -- A returned value of True means that it is in the table, otherwise -- False indicates that it is not in the table. procedure Delete (T : in out Table; Name : VString); procedure Delete (T : in out Table; Name : Character); pragma Inline (Delete); procedure Delete (T : in out Table; Name : String); -- Deletes the table element with the given name from the table. If -- no element in the table has this name, then the call has no effect. procedure Set (T : in out Table; Name : VString; Value : Value_Type); procedure Set (T : in out Table; Name : Character; Value : Value_Type); pragma Inline (Set); procedure Set (T : in out Table; Name : String; Value : Value_Type); -- Sets the value of the element with the given name to the given -- value. If Value is equal to Null_Value, the effect is to remove -- the entry from the table. If no element with the given name is -- currently in the table, then a new element with the given value -- is created. ---------------------------- -- Allocation and Copying -- ---------------------------- -- Table is a controlled type, so that all storage associated with -- tables is properly reclaimed when a Table value is abandoned. -- Tables have value semantics rather than reference semantics as -- in Spitbol, i.e. when you assign a copy you end up with two -- distinct copies of the table, as though COPY had been used in -- Spitbol. It seems clearly more appropriate in Ada to require -- the use of explicit pointers for reference semantics. procedure Clear (T : in out Table); -- Clears all the elements of the given table, freeing associated -- storage. On return T is an empty table with no elements. procedure Copy (From : in Table; To : in out Table); -- First all the elements of table To are cleared (as described for -- the Clear procedure above), then all the elements of table From -- are copied into To. In the case where the tables From and To have -- the same declared size (i.e. the same discriminant), the call to -- Copy has the same effect as the assignment of From to To. The -- difference is that, unlike the assignment statement, which will -- cause a Constraint_Error if the source and target are of different -- sizes, Copy works fine with different sized tables. ---------------- -- Conversion -- ---------------- type Table_Entry is record Name : VString; Value : Value_Type; end record; type Table_Array is array (Positive range <>) of Table_Entry; function Convert_To_Array (T : Table) return Table_Array; -- Returns a Table_Array value with a low bound of 1, and a length -- corresponding to the number of elements in the table. The elements -- of the array give the elements of the table in unsorted order. --------------- -- Debugging -- --------------- procedure Dump (T : Table; Str : String := "Table"); -- Dump contents of given table to the standard output file. The -- string value Str is used as the name of the table in the dump. procedure Dump (T : Table_Array; Str : String := "Table_Array"); -- Dump contents of given table array to the current output file. The -- string value Str is used as the name of the table array in the dump. **************************************************************** From: Florian Weimer Sent: Friday, April 19, 2002 10:25 PM "Randy Brukardt" writes: > Well these (hardly modern) programming languages have associative arrays. I > don't doubt the need for these or that they can be implemented, especially > in a language with a fairly limited set of types. I was referring > specifically to the 'Hash proposal: does any modern programming language > have a generalized hash function like this? How do maps components get a > hash function in Java or C++ (compiled languages with rich type systems)? C++ hasn't got hash support in the container library, the mappings are based on balanced trees (or something like that). Java has got hash function support, in its Object baseclass: http://java.sun.com/docs/books/tutorial/java/javaOO/objectclass.html **************************************************************** From: Bernard Maudry Sent: Friday, April 19, 2002 6:20 AM > How would this be implemented (say, in GNAT)? We have such hashing function used to speed up the comparison of values in polymorphic containers. The algorithm is the following: - we designed a 'write only' stream type, which takes the byte flow of its write procedure and accumulate it in a modular member through a prime multiplicand (to provide dispersion and influence of each part of the value) - to get the hash value of a variable, we just write it into this stream and return the modular member. This algorithm is able to handle any type and appears to work very well. If such an algorithm is to be used for the future 'hash' attribute, the prime multiplicand should be tunable, for example by a 'hash_multiplicand' attribute. **************************************************************** From: Ted Baker Sent: Monday, April 22, 2002 6:35 AM | What this could proably would do is | H(x) = ord(x) mod (lenght of array) | where ord the ordinal number of the key would by in this case ord(x) = x ? | The compiler would have to derive the correct ord and length from the code ? If you are talking about implementations, I don't see a need to be specific, but would certainly avoid requiring the use of "mod" in any case. It would be faster and at least as good to use word-wise exclusive-or to collapse long structures. **************************************************************** From: Ted Baker Sent: Monday, April 22, 2002 7:04 AM Regarding the limited/address versus non-limited value discussion, for the uses I had in mind I would rarely expect to need to hash a limited objects, and if I did I would not mind hashing its address explicitly. Moreover, I believe it would be conceptually simpler for everyone if the semantics and implementation of T'hash would be uniform across all data types. By the way, there are other critical properties of a good hash function, beyond the obvious requirement that equal values hash to the same hash value. I would expect it to preserve or increase the "appearance of randomness" (which I put in quotes and purposely leave undefined, because of my inexpert status on hash functions). For example, I would also expect that the number of domain values that hash to each range value should be as nearly as possible equal, i.e., if X is a random variable for the given domain with uniform distribution, I would expect that P(H(X)=Y) is very close to some constant for every element Y of the range. --Ted PS: Originally, I was more interested in just having the convenience of an array-like abstraction -- which I'll call a table -- that could be used to store sparse mappings of some key domain (e.g., unconstrained strings) to values of some other domain (e.g., pointers to record objects). I see that implementing a generic table type in portable way would require such an attribute, but I would have been happy if the *interface* to the table type were portable, and each implementation did whatever bit-hacking works for a given compiler and target architecture. I find it interesting that this discussion has focussed exclusively on the portable computation of the hash value. **************************************************************** From: Ted Baker Sent: Monday, April 22, 2002 7:28 AM | Does any modern programming language have anything like this? Does it work? | :-) My original inspiration was Snobol, but for more modern language you can look at Perl. The following is from the first Perl tutorial Google found on the web: | "Associative arrays, also frequently called hashes, are the third | major data type in Perl after scalars and arrays. Hashes are named | as such because they work very similarly to a common data | structure that programmers use in other languages--hash | tables. However, hashes in Perl are actually a direct language | supported data type. | Accessing a hash works very similar to accessing arrays. However, | hashes are not subscripted by numbers. They can be subscripted by | an arbitrary scalar value. You simply use the {} to subscript the | value instead of [] as you did with arrays. Here is an example: | use strict; | my %table; | $table{'schmoe'} = 'joe'; | $table{7.5} = 2.6; | In this example, our hash, called, %table, has two entries. The | key 'schmoe' is associated with the value 'joe', and the key 7.5 | is associated with the value 2.6. | Just like with array elements, hash elements can be used anywhere | a scalar variable is permitted. Thus, given a @hash{%table} built | with the code above, we can do the following: | print "$table{'schmoe'}\n"; # outputs "joe\n" | --$table{7.5}; # $table{7.5} now contains 1.6 | Another interesting fact is that all hash variables can be | evaluated in the list context. When done, this gives a list whose | odd elements are the keys of the hash, and whose even elements are | the corresponding values. Thus, assuming we have the same %table | from above, we can execute: | my @tableListed = %table; # @tableListed is qw/schmoe joe 7.5 1.6/ Another one of the examples I originally had in mind the C++ Standard Template Library associative container "map", but that is less interesting to me because it is based on a user-specified comparison operator (and implemented as an ordered tree rather than a hash function). **************************************************************** From: Ted Baker Sent: Monday, April 22, 2002 7:45 AM | Well these (hardly modern) programming languages have associative arrays. I | don't doubt the need for these or that they can be implemented, especially | in a language with a fairly limited set of types. I was referring | specifically to the 'Hash proposal: does any modern programming language | have a generalized hash function like this? How do maps components get a | hash function in Java or C++ (compiled languages with rich type systems)? C++ seems to be lagging behind a little bit on this. The C++ STL associative map container template relies on an ordering comparision operator, and so does not fit a hashing implementation, though there is a proposal (see http://www.sgi.com/tech/stl/hash_set.html) for an extension. The proposed extension relies on a binary equality-test operatar as parameter. On the other hand, Java (like Perl) seems to have built-in what I was hoping for. See below. --Ted http://java.sun.com/products/jdk/1.1/docs/api/java.lang.Object.html#hashCode() | hashCode | public native int hashCode() | Returns a hash code value for the object. This method is supported for the benefit of hashtables such as those provided by java.util.Hashtable. | The general contract of hashCode is: | * Whenever it is invoked on the same object more than once | during an execution of a Java application, the hashCode method | must consistently return the same integer. This integer need not | remain consistent from one execution of an application to another | execution of the same application. | * If two objects are equal according to the equals method, | then calling the hashCode method on each of the two objects must | produce the same integer result. | Returns: | a hash code value for this object. See Also: | equals, Hashtable http://java.sun.com/products/jdk/1.1/docs/api/java.util.Hashtable.html#_top_ | public class Hashtable | extends Dictionary | implements Cloneable, Serializable | This class implements a hashtable, which maps keys to | values. Any non-null object can be used as a key or as a value. | To successfully store and retrieve objects from a hashtable, the | objects used as keys must implement the hashCode method and the | equals method. | An instance of Hashtable has two parameters that affect its | efficiency: its capacity and its load factor. The load factor | should be between 0.0 and 1.0. When the number of entries in the | hashtable exceeds the product of the load factor and the current | capacity, the capacity is increased by calling the rehash | method. Larger load factors use memory more efficiently, at the | expense of larger expected time per lookup. | If many entries are to be made into a Hashtable, creating it | with a sufficiently large capacity may allow the entries to be | inserted more efficiently than letting it perform automatic | rehashing as needed to grow the table. | This example creates a hashtable of numbers. It uses the names of | the numbers as keys: | Hashtable numbers = new Hashtable(); | numbers.put("one", new Integer(1)); | numbers.put("two", new Integer(2)); | numbers.put("three", new Integer(3)); | | To retrieve a number, use the following code: | Integer n = (Integer)numbers.get("two"); | if (n != null) { | System.out.println("two = " + n); | } | | See Also: | equals, hashCode, rehash **************************************************************** From: Pascal Leroy Sent: Monday, April 22, 2002 8:47 AM > The choice of the hash function should clearly be implementation defined. > No one even hinted that "XOR-accumulator" would be "required", it is just > one simple approach that is perfectly adequate in practice. I think the situation here is analogous to that of the random number generator. The generator is defined by the language, and is required to behave reasonably well (see RM95 G.2.5). Other than that, it is entirely implementation-defined, and if an application has particular needs (eg cryptography), it can just use its own generator. Similarly, it would be generally useful to have a hashing capability, but the details should be left to the implementation. The RM could possibly impose a minimum level of quality and some documentation requirements. **************************************************************** From: Robert Dewar Sent: Monday, April 22, 2002 6:17 PM > I disagree in part. Any attempt to impose "quality" will be misguided, especially since quality in this regard involves an appropriate balance between pseudo-randomness and speed. **************************************************************** From: Alexandre E. Kopilovitch Sent: Monday, April 22, 2002 6:55 PM Not at all. A notion of quality may say, for example, that the hash function should be generally useful for all types (perhaps, with some restrictions), including, for instance, Boolean arrays (Randy's example). Such a requirement does not involve a balance between pseudo-randomness and speed. **************************************************************** From: Pascal Leroy Sent: Tuesday, April 23, 2002 1:44 AM > I disagree in part. Any attempt to impose "quality" will be misguided, > especially since quality in this regard involves an appropriate balance > between pseudo-randomness and speed. It depends very much on how quality requirements are phrased. Take for example the requirements for random numbers in G.2.5: they are perfectly reasonable and do not cause any significant speed problem, but they ensure that the random number generators are useful for most applications. A hash function that would always return 0 would be fast but useless. On the other hand, we don't want to phrase requirements in such a way that only SHA-1 or MD-5 are acceptable hash. So there has to be a balance somewhere. The kind of requirements that I had in mind would be things like: - For a scalar type with a sufficiently small number of values, the hash function must return distinct values for all values of the type. - For random uniformly distributed input, the output of the hash function must be uniformly distributed. (Lots of hand waving in the above two rules.) As far as I can tell, requirements like that would not constrain the implementation in ways that would cause speed problems, but they would guarantee that the hash is useful in practice. A hash function that would not be uniformly distributed would be a nuisance, because it could silently affect the complexity of algorithms that depend on it without the user realizing it. **************************************************************** From: Robert Dewar Sent: Tuesday, April 23, 2002 6:09 AM > - For random uniformly distributed input, the output of the hash function > must be uniformly distributed. This is very ill-conceived, and is far too over-constraining. No usable hash function will meet this criterion. In fact MD-5 does not meet this criterion. If you take a set of numbers that all hash to the same value using MD-5, the input will meet all possible tests for randomness. Indeed I don't think your condition is well formed at all. This discussion certainly convinces me that the process of discussing standardized packages is never going to succeed in people reaching agreement. I think you are doomed to a repeat of the absurd nonsense that happened with the directory operations package :-) **************************************************************** From: Chistoph Grein Sent: Monday, April 22, 2002 12:15 AM > Tuck said > > < function. This standard attribute would provide a default hash function. > The properties of the hash function would generally be that it returns > a full size integer, distributed as widely and pseudo-randomly as > possible over the full integer range. Hence to use the value you > would generally need to "mod" it by the size of the hash table. > >> > > Exactly, we are on the same wavelength :-) For arrays, you could take the indexes into account when hashing. Of course a given array value's hash must not depend on its current range, so it should be slided to a standard index range first. **************************************************************** From: Robert Dewar Sent: Monday, April 22, 2002 6:32 PM No of course about it here, this needs thinking about. Also, please change the subject, again we should not be discussing all enhancements under one umbrella subject **************************************************************** From: Randy Brukardt Sent: Monday, April 22, 2002 6:45 PM > Also, please change the subject, again we should not be discussing all > enhancements under one umbrella subject Yes, please do. Your editor has 71 messages in his Ada comment in box labeled "Enhancements of the Predefined Language environment". Determining how to file these is going to be fun... **************************************************************** From: Robert Dewar Sent: Monday, April 22, 2002 6:53 PM Circular file for the lot of them? **************************************************************** From: Robert Dewar Sent: Monday, April 22, 2002 7:47 PM Oops, meant to send this joke only to Randy. But seriously, unless we get these discussions organized, they will go nowhere. **************************************************************** From: Nick Roberts Sent: Wednesday, April 24, 2002 7:31 PM I'm sorry to come to the hashing debate late. I blinked and nearly missed it! The first thing I must point out is that it is very likely to be extremely useful for a 'dope' number to be included in the hash function. This means that the function profile would be something like: function T'Hash (X: in T; N: in Natural) return Natural; where N is the dope number. This number allows the implementation of a hash-indexed array to try dope number 0 first, and if that clashes, to 'try again' with number 1, and so on. [Therefore a good characteristic is to approach #(x,n) /= #(x,m) forall n /= m.] Second, it might be worth adding a third 'Scale' parameter to the function. This may be best expressed in the form of a binary power, and would suggest the range of the function. However, there are various pros and cons to having this parameter. Third, I suggest that if T'Hash were defined (and predefined) only for definite non-limited types, this would be sufficient. For all practical purposes, any limited or indefinite type could be effectively hashed by declaring an access type which referred to it, and using the hash of the access type. Finally, I like the idea, and I think it would be worth adding in the next revision, if practicable. **************************************************************** From: Robert Dewar Sent: Thursday, April 25, 2002 9:46 PM > where N is the dope number. This number allows the implementation of a > hash-indexed array to try dope number 0 first, and if that clashes, to 'try > again' with number 1, and so on. [Therefore a good characteristic is to > approach #(x,n) /= #(x,m) forall n /= m.] No point in this, no one uses linear hashing in situations like this except text books :-) The only sensible data structures for maps are balanced trees based on sorting, or chained hash implementations. ****************************************************************