Version 1.3 of ai05s/ai05-0094-1.txt


!standard D.15(15/2)          08-10-18 AI05-0094-1/03
!class binding interpretation 08-05-16
!status ARG Approved 8-0-0 08-06-21
!status work item 08-05-16
!status received 08-03-28
!priority Medium
!difficulty Medium
!qualifier Error
!subject Timing_Events should not require deadlock
!summary
D.15(15/2) should only require that the handler be executed as soon as possible.
!question
There seems to be a nasty bug in the rules for Ada.Real_Time.Timing_Events.
D.15(15/2) says:
15/2 {AI95-00297-01} If a procedure Set_Handler is called with zero or negative In_Time or with At_Time indicating a time in the past then the handler is executed immediately by the task executing the call of Set_Handler. The timing event Event is cleared.
"The handler is executed" means that Handler.all is called in the normal way for a protected procedure, locking the protected object. The problem occurs if the task in question is already inside that same protected object. Deadlock or other bad behavior is required by the above paragraph. The "task executing the call of Set_Handler" is exactly the _wrong_ task to be calling the handler.
The scenario is:
    T : constant Time_Span := Zilliseconds (10); -- Some small amount of time

    protected body ... is
        procedure The_Handler (Event : in out Timing_Event) is
        begin
            Set_Handler (Event, At_Time => Clock + T, Handler => The_Handler'Access);
            ...
        end The_Handler;
In this example, we get a flaky deadlock. Perhaps it works most of the time, but if some unrelated process steals a little time, such that Clock + T has passed by the time Set_Handler does its thing, it deadlocks.
How should this be fixed?
!wording
Modify D.15(15/2) as follows:
If a procedure Set_Handler is called with zero or negative In_Time or with At_Time indicating a time in the past then the handler is executed {as soon as possible after the completion of}[immediately by the task executing] the call of Set_Handler. [The timing event Event is cleared.]
AARM Ramification: The handler will still be executed. Under no circumstances is a scheduled call of a handler lost.
AARM Discussion: We say "as soon as possible" so that we do not deadlock if we are executing the handler when Set_Handler is called. In that case, the current invocation of the handler must complete before the new handler can start executing.
!discussion
Avoiding the loss of events is important to eliminate the need to program around race conditions. Otherwise, if something takes longer than expected, it might set an event with a time in the past, and then it would be necessary to be able to continue working even if some events are lost. That would complicate programs for no good reason.
We delete "The timing event is cleared." from D.15(15/2), as the execution of the handler already clears the event (D.15(13/2)). And we don't want the event cleared before we start executing the handler (otherwise we again would be at risk of losing events).
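As an illustration only (not part of the wording), the following sketch shows the kind of self-rearming handler that the revised rule is meant to support. The names Periodic_Ticker, Ticker, Tick, Tick_Event, Period, Start and Count are hypothetical, and the 10 ms period is an arbitrary assumption. If Next has already passed when Tick calls Set_Handler, the revised D.15(15/2) has the handler run again as soon as possible after the current protected action completes, so no tick is lost and no deadlock occurs.

```ada
with Ada.Real_Time;               use Ada.Real_Time;
with Ada.Real_Time.Timing_Events; use Ada.Real_Time.Timing_Events;
with System;

package Periodic_Ticker is

   Period : constant Time_Span := Milliseconds (10);  -- assumed period

   protected Ticker is
      --  Handlers of timing events need this ceiling under Ceiling_Locking.
      pragma Interrupt_Priority (System.Interrupt_Priority'Last);

      procedure Start;                               -- arm the first event
      procedure Tick (Event : in out Timing_Event);  -- the handler
      function Count return Natural;                 -- ticks seen so far
   private
      Next  : Time;
      Ticks : Natural := 0;
   end Ticker;

   --  Library-level event, so that Tick'Access passes the accessibility check.
   Tick_Event : Timing_Event;

end Periodic_Ticker;

package body Periodic_Ticker is

   protected body Ticker is

      procedure Start is
      begin
         Next := Clock + Period;
         Set_Handler (Tick_Event, At_Time => Next, Handler => Tick'Access);
      end Start;

      procedure Tick (Event : in out Timing_Event) is
      begin
         Ticks := Ticks + 1;
         Next  := Next + Period;
         --  On a loaded system Next may already be in the past here.
         --  Set_Handler is not potentially blocking, so the call is legal;
         --  the next execution of Tick is deferred until this one completes.
         Set_Handler (Event, At_Time => Next, Handler => Tick'Access);
      end Tick;

      function Count return Natural is
      begin
         return Ticks;
      end Count;

   end Ticker;

end Periodic_Ticker;
```

A main program would call Periodic_Ticker.Ticker.Start once; Count can then be read from any task. Advancing Next by a fixed Period, rather than recomputing it from Clock, is a design choice of this sketch that lets the handler catch up after an overrun.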
!corrigendum D.15(15/2)
Replace the paragraph:
If a procedure Set_Handler is called with zero or negative In_Time or with At_Time indicating a time in the past then the handler is executed immediately by the task executing the call of Set_Handler. The timing event Event is cleared.
by:
If a procedure Set_Handler is called with zero or negative In_Time or with At_Time indicating a time in the past then the handler is executed as soon as possible after the completion of the call of Set_Handler.
!ACATS Test
The ACATS tests for this feature should be adjusted to follow this semantics change; no further tests should be needed.
!appendix

From: Robert A. Duff
Sent: Friday, March 28, 2008  8:02 PM

There seems to be a nasty bug in the rules for Ada.Real_Time.Timing_Events.

D.15(15/2) says:

 15/2  {AI95-00297-01} If a procedure Set_Handler is called with zero or
 negative In_Time or with At_Time indicating a time in the past then the
 handler is executed immediately by the task executing the call of Set_Handler.
 The timing event Event is cleared.

I presume "the handler is executed" means that Handler.all is called in the
normal way for a protected procedure, locking the protected object.
The problem occurs if the task in question is already inside that same
protected object. Deadlock or other bad behavior is required by the above
paragraph. The "task executing the call of Set_Handler" is exactly
the _wrong_ task to be calling the handler.

A NOTE indicates that such a scenario makes sense:

 26/2  45  {AI95-00297-01} Since a call of Set_Handler is not a potentially
       blocking operation, it can be called from within a handler.

...and in particular it can be called within the _same_ handler.

AI95-00297-01 has a couple of examples that call Set_Handler from within
the protected object of the handler.

The scenario is:

    T : constant Time_Span := Zilliseconds (10); -- some small amount of time

    protected body ... is
        procedure Handler (Event : in out Timing_Event) is
        begin
            Set_Handler (Event, At_Time => Clock + T, Handler => Handler'Access);
            ...
        end Handler;

We get a flaky deadlock. Perhaps it works most of the time, but if some
unrelated process steals a little time, such that Clock + T has passed by
the time Set_Handler does its thing, it deadlocks.

I think the solution is to delete D.15(15/2), and rely on:

 13/2  {AI95-00297-01} As soon as possible after the time set for the event,
       the handler is executed, passing the event as parameter. ...

in all cases.

By the way, this issue comes from real customer code.  It took me most of the
day to debug it. The customer reported that it ran fine for hours on an
unloaded system, but on a heavily loaded system, it would sometimes hang.

Yikes!

****************************************************************

From: Robert A. Duff
Sent: Sunday, March 30, 2008  2:26 PM

Robert A. Duff writes:

> There seems to be a nasty bug in the rules for Ada.Real_Time.Timing_Events.

I had a discussion with Ed about this, and he asked me to forward it here.
Here are the relevant excerpts:


From: Bob Duff <duff@adacore.com>

Edmond Schonberg wrote:

> Can't we recognize in that case that this is an internal call?  We 
> must be able to query the identity of the current protected object and 
> compare it with the target. We would have to generate code for this, 
> one branch calling the body of the unprotected operation, and the 
> other making the standard external call. ????

I thought about that, but it seems inappropriate.  First, this would be
the only place where Ada does "nested locking" (i.e. lock-if-not-already-locked).
Second, it would require searching a set of PO's currently locked (we could be
inside more than one).  And we might be in some procedure called from a PO --
we don't statically know whether we're inside a PO.  It all seems way too heavy
for a language feature described like this:

  1/2   {AI95-00297-01} This clause describes a language-defined package to
  allow user-defined protected procedures to be executed at a specified time
  WITHOUT the need for a TASK OR A DELAY statement.
  ...
                              Implementation Advice
  25/2  {AI95-00297-01} The protected handler procedure should be executed
  DIRECTLY by the real-time clock INTERRUPT mechanism.

(emphasis added).

I mean, if you want "heavy", something like "loop ... delay ...; ... end loop"
seems more appropriate.

Anyway, what's the point of the requirement to do it "immediately" and "by the
same task"?  I say, erase that requirement, since we already have a requirement
"as soon as possible".

From schonberg@adacore.com  Sun Mar 30 11:22:49 2008

On Mar 30, 2008, at 11:11 AM, Bob Duff wrote:

> Edmond Schonberg wrote:
...
>                               Implementation Advice
>   25/2  {AI95-00297-01} The protected handler procedure should be executed
>   DIRECTLY by the real-time clock INTERRUPT mechanism.

But this would indicate that there is no locking involved, precisely:
just go and do it, this is urgent, no?
...

> P.S. Did you see my message to ARG about it?  I'm not sure it got 
> through...

I saw it, and it seemed reasonable at the time, but rereading your message
I had the impression that if we can determine that it is an internal call
there is no additional locking. If it is an external call it's potentially
blocking and we certainly don't have to do anything special given that it's wrong.


From duff@adacore.com  Sun Mar 30 12:25:35 2008

Edmond Schonberg wrote:

> But this would indicate that there is no locking involved, precisely:
> just go and do it, this is urgent, no?

No, not if by "interrupt" we mean the model described in the SP Annex
(attaching protected procedures to interrupts and so forth).  That
analogy seems apt.

If an interrupt occurs while the interrupt handler is running, then it does
not immediately cause the handler to start running again reentrantly.
Instead, either the interrupt is lost, or the handler is triggered when
the current invocation of the handler finishes.  It's just like a normal
protected object, except this level of "locking" happens in hardware (and the
hardware is allowed to lose interrupts).

Seems like we want the same semantics for the timing event handlers.

By the way, the "bug" (or feature?) that I "fixed" is on our Linux version
(and the same is used on most non-embedded systems).  It makes no attempt to
properly implement the intended real-time semantics, and is far from "directly"
attached to interrupts.  I suppose the MaRTE version may try to do it "right".

> I saw it, ...

OK, good.

> ...and it seemed reasonable at the time, but rereading your message I
> had the impression that if we can determine that it is an internal
> call there is no additional locking. If it is an external call it's
> potentially blocking and we certainly don't have to do anything
> special given that it's wrong.

But Ada always distinguishes internal vs. external statically.
I object to making that distinction at run time in one corner of the language,
while the rest of the language is unchanged.

Formally, the call to the handler (the one I deleted) is always an external
call, since it is indirect -- Handler.all(Event) -- and indirect calls clearly
need to be considered external, in general.

P.S. Perhaps we should have this discussion on the arg list?

From schonberg@adacore.com  Sun Mar 30 13:09:54 2008

On Mar 30, 2008, at 12:25 PM, Bob Duff wrote:

> Edmond Schonberg wrote:
>
>> But this would indicate that there is no locking involved, precisely:
>> just go and do it, this is urgent, no?
>
> No, not if by "interrupt" we mean the model described in the SP Annex 
> (attaching protected procedures to interrupts and so forth).  That 
> analogy seems apt.
>
> If an interrupt occurs while the interrupt handler is running, then it
> does not immediately cause the handler to start running again
> reentrantly.  Instead, either the interrupt is lost, or the handler is
> triggered when the current invocation of the handler finishes.  It's
> just like a normal protected object, except this level of "locking"
> happens in hardware (and the hardware is allowed to lose interrupts).
>
> Seems like we want the same semantics for the timing event handlers.

Not so sure. The description is that the completion of the Attach action
is the invocation of the handler. It's not two separate actions (if I
understand the description properly, which I guess is the issue!).

>> ...and it seemed reasonable at the time, but rereading your message  
>> I had the impression that if we can determine that it is an internal 
>> call there is no additional locking. If it is an external call it's 
>> potentially blocking and we certainly don't have to do anything 
>> special given that it's wrong.
>
> But Ada always distinguishes internal vs. external statically.
> I object to making that distinction at run time in one corner of the
> language, while the rest of the language is unchanged.

Agreed, unless it's simple to do. And annex D is a corner of the
language with lots of special characteristics.

...
> P.S. Perhaps we should have this discussion on the arg list?

Yes, B&W's opinion would be very useful here. Can you forward?

****************************************************************

From: Robert A. Duff
Sent: Sunday, March 30, 2008  2:36 PM

If we have:

    protected body ... is
        procedure Handler (Event : in out Timing_Event) is
        begin
            Do_This;
            Set_Handler (Event, At_Time => TT, Handler => Handler'Access);
            Do_That;
        end Handler;

If time TT is during Do_This, or during Do_That, or during Set_Handler,
I claim we want the same behavior in all these cases: as soon as Handler
is done, Handler should be executed again.  But certainly not in the middle of
Handler -- that would defeat the purpose of "protected" objects.

Furthermore, the task performing the next call to Handler need not be the
same task that calls it this time.  I don't understand why the RM talks
about which task, in a particular case, but leaves the task (if any!)
unspecified in other cases.

Another possibility, given that the Impl Advice suggests that the handler
should be "directly" attached to the timer interrupt mechanism (whatever
that means), is that we should allow interrupts to be lost.  (Allow,
not require.) Because that's how normal interrupts work in the SP annex.

****************************************************************

From: Pascal Leroy
Sent: Monday, March 31, 2008  2:06 AM

>I think the solution is to delete D.15(15/2), and rely on:
>
> 13/2  {AI95-00297-01} As soon as possible after the time set for the event,
> the handler is executed, passing the event as parameter. ...
>
>in all cases.

I am uncomfortable with this.  I think it's a good idea to precisely
specify what happens when the given time is in the past.  After all, one
possible definition would be that nothing happens in this case (the event
is not set, the handler is not called).  My preference would be to keep
D.15(15/2), but to replace "immediately by the task executing the call of
Set_Handler" with "as soon as possible".  This would remove the
overspecification, and the deadlock.

****************************************************************

From: Tucker Taft
Sent: Monday, March 31, 2008  1:00 PM

I basically agree with Pascal, though I think you also need to address
the last sentence of D.15(15/2):
    ... The timing event Event is cleared.

Do we believe that on return from Set_Handler the timing event is cleared?
I would say not.  My model of how this would be handled is that a "pseudo-interrupt"
would be triggered, whose handling would potentially be deferred if the caller
is already "inside" an interrupt, and once the caller returns to the
non-interrupt level, the timing event's handler would be called, and that is
when the timing event would be cleared.

****************************************************************

From: Robert A. Duff
Sent: Monday, March 31, 2008  1:37 PM

What Pascal says, with Tuck's suggested modification, seems fine to me.

Question: Should it be guaranteed that no timing events are lost?
Or is it like the interrupts in section C.3, where if an interrupt is
"generated" during the handler, it is impl-def whether the new one is lost?

****************************************************************

From: Tucker Taft
Sent: Monday, March 31, 2008  2:04 PM

That's a tough question.  I'll leave it to Alan, Andy, and friends to answer that one.

****************************************************************

From: Alan Burns
Sent: Tuesday, April 1, 2008  3:11 AM

I'm just catching up with emails after being away for a few days.
I'll read through these and get back - but perhaps not for a few days.

****************************************************************

From: Robert I. Eachus
Sent: Saturday, April 5, 2008  7:22 PM

>What Pascal says, with Tuck's suggested modification, seems fine to me.
>
>Question: Should it be guaranteed that no timing events are lost?
>Or is it like the interrupts in section C.3, where if an interrupt is 
>"generated" during the handler, it is impl-def whether the new one is 
>lost?

You have real customer code, so check it out. My feeling is that
losing the event would be just as bad, from the user's point of view,
as a deadlock.  Think of it this way, how should the user write the code
so that there is no deadlock, and no scheduled events are lost?  We want
that to be the easiest case.  If the user wants to skip past events, it
is not difficult to write code to do so.  But we don't want to define
timing races where the effect of a call to Set_Handler is unpredictable.

****************************************************************

From: Alan Burns
Sent: Wednesday, April 9, 2008  6:29 AM

The Timing Events issue has been discussed over the last few days by the IRTAW group.

It concludes

1) yes there does seem to be a problem with the current definition!
(which I think I wrote -sorry)

2) timing events should not be lost - ie even if the time specified is in
the past the handler should still be executed - if this were not the case
then it would be difficult/impossible to write code that was not subject
to race conditions.

3) the solution to just say that 'the execution of the handler should occur
at the earliest opportunity' seems the right approach (ie just keep with
the current overall requirement of 'immediately'.)

We had some discussion about returning an exception or a flag to indicate
that the 'time' specified was in the past, but these ideas seemed to
produce more problems than they solved!

So, to conclude, agreement with what was proposed by ARG

PS I am now out of the office for a couple of weeks, so sorry if anyone
needs clarification on this.

****************************************************************

From: Alan Burns
Sent: Tuesday, April 22, 2008  4:13 AM

Just to add a (final) point, the two implementations MaRTE and ORK, both
do what we now think is the right thing to do:

> In MaRTE we are in the same situation as in ORK (and for the same
> reason: I didn't realize about that point in the RM).
>
> We set the hardware timer to expire "now" but, since interrupts are 
> disabled while inside the PO (Interrupt_Priority ceiling), interrupt 
> is not served until the protected action finishes.
>
> Consequently, and just by chance, we are already using the "now+delta" 
> approach 

****************************************************************

From: Robert A. Duff
Sent: Tuesday, April 22, 2008  9:40 AM

> Just to add a (final) point, the two implementations MaRTE and ORK, 
> both do what we now think is the right thing to do:

Oh, good.  Thanks for letting us know.

I fixed the problem in our normal run-time system, but I wasn't sure
about MaRTE.  I have an item on my to-do list somewhere to worry about
whether MaRTE is doing the right thing...

****************************************************************
