CVS difference for ai12s/ai12-0242-1.txt

Differences between 1.5 and version 1.6
Log of other versions for file ai12s/ai12-0242-1.txt

--- ai12s/ai12-0242-1.txt	2018/02/27 07:04:06	1.5
+++ ai12s/ai12-0242-1.txt	2018/03/01 07:24:04	1.6
@@ -7307,3 +7307,594 @@
 
 ****************************************************************
 
+From: Randy Brukardt
+Sent: Wednesday, February 14, 2018  8:56 PM
+
+Tucker and I have been having a lengthy private back-and-forth on whether it
+is necessary to explicitly have a separate "parallel_reduce" attribute.
+(We've also hashed out some other issues, mostly covered elsewhere.) I think 
+we understand each other's position (not that either of us has convinced the
+other of much). Tucker is clearly much more optimistic about how much parallelism
+can be automatically inserted into "normal" Ada code.
+
+
+Here is a quick summary of my position for the record. I believe that it is
+useful/necessary to be able to include "parallel" for the following reasons:
+
+(1) As-if optimizations have to have a strong probability of making the code
+faster. It is very difficult to prove that for parallelism, especially if the
+overhead is relatively high (as it is on most normal OSes).
+
+(2) Because of (1), compiler writers have little incentive to provide any
+parallelism for a construct like this. Including a keyword/separate attribute
+prods them to support it.
+
+(3) Marketing: we are trying to expand Ada's marketplace with easy-to-use and
+safe parallel constructs. It helps to show even the casual reader that this is
+happening by having the word "parallel" somewhere. (I realize that not everyone
+subscribes to this one much, recalling Ichbiah's shoe. :-)
+
+(4) Insurance: if the user wants to force parallel behavior, they will get an
+error if that behavior cannot be achieved.
+
+(5) Level of parallelism part 1: At least on conventional targets,
+coarse-grained parallelism is better than fine-grained parallelism, simply
+because there is much less overhead. Brad had noted this effect, saying that
+you want to put the parallelism at the highest-level possible (that is, the
+highest level where parallelism is possible). If the programmer has explicitly
+introduced parallelism at a high-level, the compiler introducing it
+automatically at a lower level just adds overhead.
+
+(6) Level of parallelism part 2: If the user has determined that the proper
+place to introduce parallelism is a Reduce attribute, they want to add
+parallelism there (and only there). That requires some sort of marking.
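+
+For concreteness, this is roughly what the two markings would look like
+(the attribute syntax is still being tweaked, so take this as approximate;
+Data is just a made-up array of Float):
+
+   Data : constant array (1 .. 1_000) of Float := (others => 1.0);
+
+   --  Sequential form: the compiler may still parallelize this, but
+   --  only as an as-if optimization (see (1)).
+   Total_Seq : constant Float := Data'Reduce ("+", 0.0);
+
+   --  Explicitly parallel form: the programmer asks for parallel
+   --  execution, and gets an error if the reduction cannot legally be
+   --  run in parallel (see (4)).
+   Total_Par : constant Float := Data'Parallel_Reduce ("+", 0.0);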
+
+----
+
+Some explanation of some of these:
+
+(1) was explained in great detail in a previous ARG message. Basically, it is
+necessary to ensure that the overhead of adding parallelism is not bigger than
+the possible gain from parallel execution. This requires a relatively
+expensive combiner function (or expensive iterator target expression in the
+iterator/aggregate form), or a large number of iterations. However, most
+complex functions are separately compiled, so one can't look into the body to
+find out the complexity. And most iteration counts are dynamic in some way, so
+the compiler doesn't really know how many iterations there will be, either. It is much
+more likely that the compiler could be sure that parallelization does NOT pay
+(the combiner could be a predefined operator or a simple expression function).
+
+Tucker suggests that a compiler that would do such automatic parallelization
+would only do so under control of a compiler switch. In such a case, you could
+apply parallelization always unless you know it doesn't work -- secure in
+knowing that the user can turn off that silly switch when it turns their
+program from slow into a slug. :-)
+
+(2), (3), and (4) are self-explanatory.
+
+(5) comes from the notion that many problems are naturally parallel.
+
+Let me start with a real-world example. I have a program that tries to find
+"optimal" coefficients that matches observed results to estimated results.
+Essentially, there are three low-level estimated values that are combined into
+a score, and I would like the estimated score to be as close as possible to
+the observed scores for some tested possibilities. (The ultimate idea being
+to use these estimates to find new possibilities to test.)
+
+It does this by testing various coefficients using a "rolling ball"
+algorithm. A single set of coefficients can be considered a trial; the program
+uses those to make a full set of estimates corresponding to the observed data
+and then calculates statistical information on it (mean/variance/correlation)
+-- the idea is to produce a score such that the differences of all three of
+these parameters from those of the observed data are minimized.
+
+The program starts with a survey pass (since there is no good idea of what
+"good" coefficients might look like). The survey pass runs about 1.5 million
+trials on the 7 coefficients. I keep the top 100 scores (and the associated
+coefficients) in an array, the rest are just discarded. This takes about 2
+hours per pass using one core on my current computer.
+
+The statistics gathering code could be written as a series of reduction
+expressions (this is the code example I previously gave Brad). For the sake
+of discussion, let's assume they were written that way (even though that would
+make the code run quite a bit slower because it would have a lot of additional
+memory usage).
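+
+For concreteness, the per-trial statistics would look something like this
+(approximate syntax; Estimate and N stand in for the real program's names):
+
+   Mean : constant Float :=
+     [for I in 1 .. N => Estimate (I)]'Reduce ("+", 0.0) / Float (N);
+
+   Variance : constant Float :=
+     [for I in 1 .. N => (Estimate (I) - Mean) ** 2]'Reduce ("+", 0.0)
+       / Float (N - 1);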
+
+Since each trial is logically and programmatically separate from any other,
+parallelism would be best introduced at the point of the trials loop. If that
+was replaced by a parallel loop (and presumably, the results managed by a
+protected object), one could easily use all of the processing elements
+available to the program. (I am not expecting to have a machine with 2 million
+cores on my desktop anytime soon. ;-)
+
+Note that the processing needed for each trial is almost exactly the same;
+there is almost no conditional code in trials (and most of what is there is
+part of the process of saving the result, providing progress indications, and
+for debugging). So each chunk would take about the same amount of time to
+execute; no load-balancing is needed for this problem.
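+
+In outline, the parallelism belongs at the trials loop, something like this
+(a sketch only, using the parallel-loop syntax from the related AI;
+Trial_Count, Coefficient_Set, Nth_Coefficients, Run_Trial, and Best are
+stand-ins for the real program's names):
+
+   parallel
+   for T in 1 .. Trial_Count loop
+      declare
+         C     : constant Coefficient_Set := Nth_Coefficients (T);
+         Score : constant Float           := Run_Trial (C);
+      begin
+         --  Best is a protected object keeping only the top 100 scores
+         --  and their coefficients; contention on it is rare.
+         Best.Record_Result (Score, C);
+      end;
+   end loop;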
+
+Therefore, any parallelism introduced for the reduction expressions would
+simply add overhead -- all of the processing cores are already busy nearly
+100% of the time. As an as-if optimization, parallel execution is a complete
+failure in this case -- there is no chance of it providing any benefit at all.
+Since the statistics could easily be in a separate unit from the parallel
+loop, in general the compiler cannot know whether or not explicit
+parallelism is in use.
+
+Essentially, if the parallelism is introduced manually as part of modeling the
+problem, any further parallelization has little value at absolute best, and in
+most cases will just make the program run slower.
+
+Both (1) and (5) can be mitigated somewhat by various techniques:
+controlling this behavior with compilation switches, or perhaps better with
+Ada-specific bind-time program generation. (At bind-time, the bodies of all
+Ada functions will be available, and the presence or absence of explicit
+tasking/parallelism will be easily determinable.)
+
+Parallelism is relatively natural in many problems. Some examples from my
+experience: the web server and e-mail filter both use pools of worker tasks 
+that are assigned jobs as they come in (from the public internet). Each job is
+independent of any other, just depending on some global settings. One could
+have used a parallel loop in Ada 2020 to get a similar effect (although that
+would complicate logging and debugging, as well as settings changes). In the
+Janus/Ada compiler, the main optimizer works on chunks of intermediate code
+never larger than a subprogram body. It's easy to imagine optimizing several
+chunks at once, as each chunk is independent of any other.
+
+Auto-generation of parallelism seems better used on algorithms where the
+parallelism is not obvious. Tucker claims (probably correctly) that a
+compiler/runtime pair knows a lot more about the target, load balancing, and
+the like than any programmer could. So if the problem itself isn't naturally
+parallel, trying to manually do parts in parallel probably won't work out very
+well. OTOH, a lot of Ada code (and code in general) is naturally sequential,
+adding dependencies even where they need not have existed. It's
+hard for me to imagine a compiler smart enough to be able to untangle needed
+sequential behavior from that which is accidentally introduced by the natural
+organization of the programming language. That necessarily will reduce the
+possible implicit parallelism.
+
+Finally, point (6). If a problem is naturally parallel, the programmer will want
+to write parallel code for that high-level parallelism -- and only that code.
+If that code turns out to be best expressed as a Reduce attribute, they
+certainly will not be happy to have to turn it into a parallel loop instead 
+in order to require the generation of parallel behavior.
+
+Tucker does point out that the introduction of "parallel" anywhere really 
+doesn't require any parallel execution; a compiler could run everything on a 
+single processor (and might be forced to do that by OS-level allocations).
+But it does require the safety checks that everything is allowed to be run in
+parallel (point 4); the user will not be writing code that can only be executed
+sequentially when they intended parallel execution.
+
+---
+
+While I find none of these points compelling by themselves, the combination
+makes it pretty clear to me that we do want a separate parallel version of 
+the Reduce attribute. But we don't want the regular Reduce to have any rules
+that would prevent automatic parallelization in the future (presuming that it
+could have been a Parallel_Reduce).
+
+****************************************************************
+
+From: Jeff Cousins
+Sent: Saturday, February 17, 2018  2:46 PM
+
+A subjective opinion, but I find it hard to believe that a compiler vendor
+would provide a difficult optimisation if it were optional, so I think we need
+both 'Reduce and 'Parallel_Reduce.
+
+I visited the IET (Institution of Engineering and Technology) yesterday and looked
+in its library for books on parallelisation.  I'm not sure if I reached any
+conclusion other than C++ is a horrible language, but I found TBB quite
+interesting.
+
+****************************************************************
+
+From: Bob Duff
+Sent: Saturday, February 17, 2018  5:17 PM
+
+> A subjective opinion, but I find it hard to believe that a compiler 
+> vendor would provide a difficult optimisation if it were optional, I 
+> think we need both 'Reduce and 'Parallel_Reduce.
+
+But optimizations are always optional.  What's to stop an implementer from
+making 'Reduce and 'Parallel_Reduce generate exactly the same code?  Money,
+that's what.  ARG has no say in the matter.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Saturday, February 17, 2018  7:52 PM
+
+The problem I see is that parallel execution is hardly an optimization in many
+circumstances. The performance of it is just different (might be better, might
+be worse). I'm not in the business of making my customers' code worse, so the
+only way (money or not) that I could even consider such a code generation strategy is
+if the customer somehow asks for it.
+
+But whether or not it is a good idea is going to vary a lot for each usage, so
+one needs to be able to ask for each usage. (A compiler switch -- that is,
+something that is per unit -- is likely to be too coarse.) Perhaps one could
+say something like, maybe, 'Reduce and 'Parallel_Reduce. ;-)
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Saturday, February 17, 2018  8:22 PM
+
+Any "optimization" is always subject to specific conditions. Putting a given
+variable in a register is an example, in that if the variable is too long
+lived, it can cause too much register pressure, ultimately making surrounding
+code less efficient.  I have given up on this battle, but I predict that in
+the long run this distinction between 'Reduce and 'Parallel_Reduce will be 
+seen in the same light as C's old "register" annotation on local variables.
+It represents a claim that there are no bad usages (e.g. 'Address on a
+"register," data dependencies for 'Parallel_Reduce), but other than that, the
+compiler will ignore the distinction since it can see the bad usages itself
+(presuming Global annotations on the "combiner" in the latter case).
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Saturday, February 17, 2018  8:38 PM
+
+Of course, one can say the same thing about any construct. Therefore, we don't
+need parallel blocks or loops either. Indeed, we don't need to do anything at
+all to enable parallelism, since any sufficiently intelligent compiler can
+always introduce it just where needed. So Ada 2012 is good enough and we can
+all do something else. :-)
+
+If "parallel_reduce" ends up like "register", you certainly would
+have to say the same about parallel loops, and probably parallel blocks and
+even tasks. I'm not too worried; maybe Ada 2051 will put these guys into Annex
+J, but I think it will be a long time before compilers are that good.
+(I think it is much more likely we're not using compilers at all by then. :-)
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Saturday, February 17, 2018  11:57 PM
+
+I was drawing the distinction between individual expressions, where the
+compiler does not generally expect direction for how to do register
+allocation, instruction scheduling, etc. vs. multi-statement constructs like
+loops where some programmer control can be useful.  For this particular case,
+the more appropriate place to put guidance would be on the "combiner"
+operation rather than on each use of it, vaguely analogous to putting a pragma
+Inline on a subprogram rather than on each call.  
+
+Many combiners are going to be builtins like Integer "+" about which the
+compiler presumably knows what it needs to know.  For user-defined combiners
+(e.g. some kind of set "union" operation), some measure of typical execution
+cost might be helpful, as might some measure of the expense of the Next
+operation of an iterator type.  
+
+Perhaps some kind of subprogram "Cost" aspect could be defined, ranging from,
+say, 1 = cheap to 10 = very expensive (probably not a linear scale!  Perhaps
+something like the Richter scale ;-).  Then a more global specification could
+be used to indicate at what cost parallelization becomes interesting, with
+lower numbers meaning more aggressive parallelization, and bigger numbers
+meaning minimal parallelization.  This global number could be interpreted as
+the relative cost of spawning a tasklet.  The Cost numbers wouldn't need to 
+have any absolute meaning -- they are merely a way to rank subprograms, and a
+way to indicate at what level the expense of executing the subprogram dwarfs 
+the overhead of spawning a tasklet.  Each increment in Cost might represent a
+factor of, say, 10 in relative cost, so 1 = 1 instruction, 2 = 10 instructions,
+10 = 10**9 instructions.  If spawning a tasklet takes, say, 100 instructions
+(Cost=3), then you might want the compiler to consider parallelizing when the
+set of subprograms involved has a Cost of 4 or more.
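+
+To make the arithmetic concrete (Cost here is purely hypothetical, not a
+worked-out aspect proposal, and Union/Set are made-up names):
+
+   function Union (Left, Right : Set) return Set
+     with Cost => 5;  --  roughly 10**4 instructions per call (log-10 scale)
+
+   --  With tasklet spawning at roughly 100 instructions (Cost = 3) and a
+   --  global threshold of "consider parallelizing at Cost >= 4", a
+   --  reduction whose combiner is Union would be a candidate, while one
+   --  whose combiner is Integer "+" (Cost 1) would not.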
+
+This sort of "Cost" aspect could also be informed by profiling of the
+execution of the program, and conceivably a tool could provide the cost aspects
+to the compiler via a separate file of some format, which the programmer could
+tweak as appropriate.
+
+The main point of all of this is that in my view we don't want programmers
+deciding at the individual reduction expression whether to parallelize or not.
+Based on my experience, there will be a lot of these scattered about, and you 
+really don't want to get bogged down in deciding about 'Reduce vs.
+'Parallel_Reduce each time you use one of these, nor going back and editing
+individual expressions to do your tuning.
+
+****************************************************************
+
+From: Brad Moore
+Sent: Saturday, February 17, 2018  1:50 PM
+
+> I was drawing the distinction between individual expressions, where 
+> the compiler does not generally expect direction for how to do 
+> register allocation, instruction scheduling, etc. vs. multi-statement 
+> constructs like loops where some programmer control can be useful.  
+> For this particular case, the more appropriate place to put guidance 
+> would be on the "combiner" operation rather than on each use of it, 
+> vaguely analogous to putting a pragma Inline on a subprogram rather than
+> on each call.
+
+An interesting idea. The Cost aspect does appeal to me, but one problem I see
+with that is that if the cost value is hardcoded as an annotation on a subprogram,
+then it would need to be updated any time a programmer makes a change to the
+application in that area of code, or to anything that area of code depends on.
+Having to modify the sources every time one does a compile does not seem
+reasonable to me, so I would expect that the cost would have to be determined
+by the compiler, and the cost value would not appear as an aspect or
+annotation in the source code. Even with such a compiler generated cost value,
+I think there are a lot of concerns that Randy raises where a compiler would
+have a difficult task, such as analysing the overall cost on all dependent
+units which might be separately compiled. Also if cores are already busy due
+to other tasking or parallelism being applied higher up in the code, it might
+be difficult for the implementation to know if the parallelism is worthwhile
+at a particular place in the code.
+
+One of my concerns is that at some time in the future, there may exist
+exceptionally smart compilers that can apply sophisticated analysis to
+determine whether parallelism should be applied or not, but then we'd be
+raising the bar so that all compiler vendors would
+need to have this sophisticated analysis capability. If there is a way for the
+programmer to explicitly state the desire for parallelism in a simple
+straightforward manner, then perhaps compiler implementations that are not as
+smart can still play the parallelism game.
+
+I think typically programmers would start by using the 'Reduce attribute, just as
+they would write loops without the parallel keyword. If performance problems
+are discovered, the optimization phase of the development cycle would look for
+places where things can be sped up. If a programmer identifies a
+'Reduce as a likely candidate for parallelization, he can try changing it to a
+'Parallel_Reduce and see if that makes a beneficial difference or not.
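+
+That is, the tuning edit is a single identifier at the call site
+(approximate syntax; Samples is a made-up array of Integer):
+
+   Total := Samples'Reduce ("+", 0);            -- as first written
+   Total := Samples'Parallel_Reduce ("+", 0);   -- after profiling flags it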
+
+If, down the road, all Ada compilers are smart enough to determine whether
+'Reduce usages should be parallel or not, then it doesn't seem to be that big
+a deal at that time to move 'Parallel_Reduce into Annex J, if it truly isn't
+needed anymore.
+
+Having 'Parallel_Reduce now, however, means that existing compilers can claim
+better support for parallelism now, rather than having to wait for predictions
+of future compiler technology to come true. From a marketing perspective, it
+may be important to be able to make these claims now. If we have to wait for
+some unknown period of time into the future, the rest of the world interested
+in parallelism may have abandoned Ada for other options by then.
+
+If one did decide they wanted to eliminate all uses of 'Parallel_Reduce from
+the sources, a simple search-and-replace operation would suffice, just as
+such an operation could delete occurrences of the "register" keyword in C.
+
+I'd be much more worried about doing this in C, however, since the C
+preprocessor might cause strange interactions with such an automated change to
+the sources; Ada doesn't have that problem.
+
+I think a main point to consider is that parallel reduction is really quite a
+different algorithm from sequential reduction. Yes, it is an optimization, but
+it is also a change of algorithm, which may make it worth indicating the specific
+places in the source code where this alternate algorithm is desired. The
+default usage I think would be 'Reduce, where the programmer wants the
+compiler to decide as best it can which algorithm should be applied. But in
+specific places where the programmer wants the parallel algorithm, he
+should be able to see that by looking at the call site in the code, in my
+opinion.
+
+****************************************************************
+
+From: Erhard Ploedereder
+Sent: Sunday, February 18, 2018  4:22 PM
+
+> The main point of all of this is that in my view we don't want 
+> programmers deciding at the individual reduction expression whether to 
+> parallelize or not.  Based on my experience, there will be a lot of 
+> these scattered about, and you really don't want to get bogged down in 
+> deciding about 'Reduce vs. 'Parallel_Reduce each time you use one of 
+> these, nor going back and editing individual expressions to do your 
+> tuning.
+
+For >50 years, we as a community have tried to create compilers that
+parallelize sequential code based on "as if"-rules. We have not succeeded.
+"As if" is simply too limiting. What makes anybody believe that we as Ada 
+folks can succeed where others have failed?
+
+The reasonably successful models are all based on "damn it, parallelize and
+ignore the consequences vis-a-vis sequential code"-semantics. Of course,
+if for some reason you know that the parallel code is slower than the
+sequential version, then you can optimize by generating sequential code.
+(Seriously.)
+
+So, in short, I disagree. There is a need for a 'Parallel_Reduce, less to
+indicate that you wish parallelization, but rather to absolve from a
+requirement of "as if"-semantics of parallelized code.
+
+****************************************************************
+
+From: Edmond Schonberg
+Sent: Sunday, February 18, 2018  5:35 PM
+
+Indeed, I don't think Ada can bring anything that will suddenly make
+parallelization easy and effective. For a user concerned with performance,
+tuning machinery is indispensable:  that means profiling tools and
+annotations, which will invariably have to be target-dependent. The two
+best-known language-independent (kind of) models of distribution and parallel
+computation in use today, OpenMP and OpenACC, both choose to use a pragma-like
+syntax to annotate a program that uses the standard syntax of a sequential 
+language (Fortran, C, C++). This makes the inescapably iterative process of
+tuning a program easier, because only the annotations need to be modified.
+Those annotations typically carry target-specific information (number of
+threads, chunk size, etc.). Concerning reduction, this suggests that a single
+attribute is sufficient, and that a suitably placed pragma can be used to
+indicate whether and how it should be parallelized. The compiler can warn on
+the applicability of such a pragma the way it can warn on older optimization
+pragmas.
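+
+For instance, something in this spirit (illustrative only; no such pragma
+exists or is being proposed here, and Data, Union, and Empty_Set are made
+up):
+
+   --  The target-specific tuning lives in the annotation, which can be
+   --  added, changed, or removed without touching the expression:
+   pragma Parallelize (Num_Threads => 8, Chunk_Size => 1_000);
+
+   --  The source itself carries only the single attribute:
+   Total := Data'Reduce (Union, Empty_Set);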
+
+From an implementation point of view there will be a big advantage in being 
+close to OpenMP and/or OpenACC given that several compiler frameworks (GCC, 
+LLVM) support these annotations. 
+
+As an aside, something that the discussion on parallelism has omitted 
+completely so far is memory placement, and the literature on parallel 
+programming makes it clear that without a way of specifying where things go
+(which processor, which GPU, and when to transfer data from one to the other)
+there is no hope of getting the full performance that the hardware could
+supply.  Here also the pragmas proposed by Open… might be a good model to
+emulate.
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Sunday, February 18, 2018  6:41 PM
+
+> For >50 years, we as a community have tried to create compilers that 
+> parallelize sequential code based on "as if"-rules. We have not 
+> succeeded. "As if" is simply too limiting. What makes anybody believe 
+> that we as Ada folks can succeed where others have failed?
+
+I think a reduction expression might be a special case, since the entire 
+computation is fundamentally side-effect free.
+
+> The reasonably successful models all based on "damn it, parallelize 
+> and ignore the consequences vis-a-vis sequential code"-semantics. Of 
+> course, if for some reason you know that the parallel code is slower 
+> than the sequential version, then you can optimize by generating sequential code.
+> (Seriously.)
+> 
+> So, in short, I disagree. There is a need for a 'Parallel_Reduce, less 
+> to indicate that you wish parallelization, but rather to absolve from 
+> a requirement of "as if"-semantics of parallelized code.
+
+But note (I believe) that our proposal is that 'Parallel_Reduce is illegal 
+(not erroneous) if it is not data-race free, and depends on the Global
+annotations.  So I am not sure we are doing what you suggest.
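+
+For example (a sketch; Set, Union, Results, and Empty_Set are made up, and
+the Global syntax is per the contract AIs):
+
+   function Union (Left, Right : Set) return Set
+     with Global => null;   --  the combiner touches no global state
+
+   All_Items : constant Set :=
+     Results'Parallel_Reduce (Union, Empty_Set);
+   --  Legal: the compiler can see the reduction is data-race free.  If
+   --  Union instead read or wrote unsynchronized global data, the
+   --  'Parallel_Reduce would be rejected at compile time rather than
+   --  left to become erroneous at run time.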
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Monday, February 19, 2018  9:07 PM
+
+... 
+> > For >50 years, we as a community have tried to create compilers that 
+> > parallelize sequential code based on "as if"-rules. We have not 
+> > succeeded. "As if" is simply too limiting. What makes anybody 
+> > believe that we as Ada folks can succeed where others have failed?
+> 
+> I think a reduction expression might be a special case, since the 
+> entire computation is fundamentally side-effect free.
+
+Two problems with that:
+(1) Ada 2020 is going to provide the tools to the compiler so that it can
+analyze any code for parallelizability. I don't see any reason that a
+compiler doing such as-if optimization would want to limit that to just
+'Reduce. It could apply parallelization to any code that met the conditions,
+including loops and probably even sequential statements. So IF this is
+practical (I'm skeptical), it has nothing really to do with 'Reduce, but
+rather with all Ada constructs (meaning that parallel loops, parallel blocks, and
+even tasks are all redundant in the same way).
+
+(2) 'Reduce is likely to be rare in most programmers' code, as the conditions
+for making it useful aren't going to be easily visible to most programmers.
+With the exception of a few problems that are "obviously" reductions, most of the
+examples that Tucker provided are verging on "tricky" (using a reduce to
+create a debugging string, for example).
+
+> > The reasonably successful models all based on "damn it, parallelize 
+> > and ignore the consequences vis-a-vis sequential code"-semantics. Of 
+> > course, if for some reason you know that the parallel code is slower 
+> > than the sequential version, then you can optimize by generating 
+> > sequential code. (Seriously.)
+> > 
+> > So, in short, I disagree. There is a need for a 'Parallel_Reduce, 
+> > less to indicate that you wish parallelization, but rather to 
+> > absolve from a requirement of "as if"-semantics of parallelized code.
+> 
+> But note (I believe) that our proposal is that 'Parallel_Reduce is 
+> illegal (not erroneous) if it is not data-race free, and depends on 
+> the Global annotations.  So I am not sure we are doing what you 
+> suggest.
+
+Any Ada parallelism will require that. Otherwise, the code has to be executed 
+purely sequentially (Ada has always said that). One could imagine a switch to 
+allow everything to be executed in parallel, but that by definition couldn't
+follow the canonical Ada semantics.
+
+In my view, the only parallelism that is likely to be successful for Ada is
+that applied at a fairly high-level (much like one would do with Ada tasks 
+today). For that to work, the programmer has to identify where parallelism 
+can best be applied, with the compiler helping to show when there are problems
+with that application. Such places should be just a handful in any given
+program (ideally, just one).
+
+Programmers sticking "parallel" all over hoping for better performance are
+going to be rather disappointed, regardless of what rules we adopt. (Robert
+always used to say that "most optimizations are disappointing", and that
+surely is going to be the case here.) Real performance improvement is all
+about finding the "hottest" code and then replacing that with a better 
+algorithm. Parallel execution has a place in that replacement, but it is not
+any sort of panacea.
+
+****************************************************************
+
+From: Erhard Ploedereder
+Sent: Tuesday, February 20, 2018  12:42 PM
+
+>> For >50 years, we as a community have tried to create compilers that 
+>> parallelize sequential code based on "as if"-rules. We have not 
+>> succeeded. "As if" is simply too limiting. What makes anybody believe 
+>> that we as Ada folks can succeed where others have failed?
+
+> I think a reduction expression might be a special case, since the 
+> entire computation is fundamentally side-effect free.
+
+Then I misunderstood the purpose of 'Parallel_Reduce.
+I thought that the parallelization also applied to the production of the
+values to be reduced, e.g., applying the filters and evaluations of the 
+container's content values.
+ 
+>> The reasonably successful models all based on "damn it, parallelize 
+>> and ignore the consequences vis-a-vis sequential code"-semantics.
+>> Of course, if for some reason you know that the parallel code is 
+>> slower than the sequential version, then you can optimize by 
+>> generating sequential code. (Seriously.)
+>> 
+>> So, in short, I disagree. There is a need for a 'Parallel_Reduce, 
+>> less to indicate that you wish parallelization, but rather to absolve 
+>> from a requirement of "as if"-semantics of parallelized code.
+
+> But note (I believe) that our proposal is that 'Parallel_Reduce is 
+> illegal (not erroneous) if it is not data-race free, and depends on 
+> the Global annotations.  So I am not sure we are doing what you 
+> suggest.
+
+Yes, I am afraid of that. So, any program that dares to use 'Parallel_Reduce
+will be illegal until
+ - compilers are up to snuff to show data-race-freeness
+ - sufficient and minutely exact Global specs are provided and
+   subsequently maintained to allow said compilers to prove
+   absence of races (at which point compilers will KNOW that
+   parallelization is possible, so that after this check there
+   is no difference between 'Reduce and 'Parallel_Reduce....
+   -- No, there is: the 'Parallel-version is the "bloody nuisance
+   checks"-version).
+
+Prepare for another heart of darkness in defining this well, with a net
+benefit of zilch.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Tuesday, February 20, 2018  3:23 PM
+
+"zilch"? I can agree, but only because that's the usual benefit of low-level
+optimizations. For the vast majority of programs, they're irrelevant, and the
+same is certainly true of fine-grained parallelism. As an as-if optimization,
+it is impossible even in the rare cases where it might help (because there is
+a substantial possibility that the code would be slower -- the compiler knows
+little about the function bodies and little about the number of iterations).
+
+But we *have* to do this for marketing reasons. And it can help when it is 
+fairly coarse-grained. But *no one* can find data races by hand. (I know I
+can't -- the web server goes catatonic periodically and I have been unable to
+find any reason why.) Making it easy to have parallel execution without any
+sort of checking is a guarantee of unreliable code.
+
+I thought that the intent was that the parallelism checking would be a 
+"suppressible error", such that one could disable it with a pragma. (That of 
+course assumes that we define the concept.) So if you really think you can get 
+it right on your own, feel free to do so. I just hope no one does that in any 
+auto or aircraft software...
+
+****************************************************************
+
