ChimeraTK-ApplicationCore  04.01.00
Technical specification: Exception handling for device runtime errors V1.0

This version is identical to V1.0RC2WIP.

NOTICE FOR FUTURE RELEASES: AVOID CHANGING THE NUMBERING! The tests refer to the sections, incl. links and unlinked references from tests or other parts of the specification. These break, or even worse become wrong, when they are not changed consistently!

A. Introduction

  • 1. Exceptions are handled by ApplicationCore in a way that the application developer does not need to care much about it.
  • 2. ChimeraTK::runtime_error exceptions are caught by the framework and are reported to the DeviceModule.
  • 3. The DeviceModule handles this exception and periodically tries to reopen the device.
  • 4. Communication with the faulty device is skipped, frozen or delayed until the device is functional again (see A.9).
  • 5. In case of several devices only the faulty device is affected.
  • 6. Faulty devices do not prevent the application from starting. Only the parts of the application that depend on the faulty device wait for the device to come up.
  • 7. Input variables of ApplicationModules which cannot be read due to a faulty device will set and propagate the DataValidity::faulty flag (see also the Technical specification: data validity propagation).
  • 8. When the device becomes functional, it will be (re)initialised by using application-defined initialisation handlers and also recover the last known values of its process variables.

A.9 Special terminology used in this document

  • 9.1 A read operation might be skipped. It means there will be no new data because the operation will not take place at all. Instead, the function called returns immediately and data is marked as DataValidity::faulty. Note: This term is also used if there is no new data because a running operation is interrupted by an exception.
  • 9.2 A read operation might be frozen. This means, the function called will not return until the fault state is resolved and the operation is executed. Freezing only happens on operations with a pre-existing fault state (*).
  • 9.3 A write operation might be delayed. This means, the operation will not be executed immediately and the calling thread continues. The operation will be asynchronously executed when the fault state is resolved. Note that the VersionNumber specified in the write operation will be retained and also used for the delayed write operation.
  • 9.4 Whenever a write operation or a call to write() is mentioned, destructive writes via writeDestructively() are included. The destructive write optimisation makes no difference for the exception handling.

(*) Comments

  • 9.2 If the device was ok and an exception occurs during the operation, it will be skipped.

B. Behavioural description

  • 1. All ChimeraTK::runtime_error exceptions thrown by device register accessors are handled by the framework and are never exposed to user code in ApplicationModules.
    • 1.1 ChimeraTK::logic_error exceptions are left unhandled and will terminate the application. These errors may only occur in the (re-)initialisation phase (up to the point where all devices are opened and initialised) and point to a severe configuration error which is not recoverable. (*)
    • 1.2 Exception handling and DataValidity flag propagation is implemented such that it is transparent to a module whether it is directly connected to a device, or whether a fanout or another application module is in between. This is the central requirement from which most other requirements are derived.
      • 1.2.1 The only exception to this rule can occur if the application buffer is changed by the user code between two reads to a TransferElement, see 2.2.6.
    • 1.3 boost::numeric::bad_numeric_cast exceptions are treated like ChimeraTK::logic_error. They originate from picking the wrong data type in the program code or the configuration and are also not recoverable by re-opening the device.
    • 1.4 The only other exception allowed by the DeviceAccess::TransferElement specification is boost::thread_interrupted. It must not be caught by the exception handling decorator because it is used to cleanly shut down the application.

Runtime error handling

  • 2. When a ChimeraTK::runtime_error has been received by the framework (thrown by a device register accessor):
    • 2.1 The exception status is published as a process variable together with an error message. [T]
      • 2.1.1 The variable Devices/<alias>/status contains a boolean flag whether the device is in an error state.
      • 2.1.2 The variable Devices/<alias>/message contains an error message, if the device is in an error state, or an empty string otherwise.
    • 2.2 Read operations will propagate the DataValidity::faulty flag to the owning module / fan out:
      • 2.2.1 The normal module algorithm code will be continued, to allow this flag to propagate to the outputs in the same way as if it had been received through the process variable itself (cf. 1.2). [no test, just an intro]
      • 2.2.2 The DataValidity::faulty flag resulting from the fault state is propagated once, even if the variable had a DataValidity::faulty flag already set previously for another reason. [T, T]
      • 2.2.3 Read operations without AccessMode::wait_for_new_data are skipped until the device is fully recovered again (cf. 3.1). The first skipped read operation will have a new VersionNumber. [T, T]
      • 2.2.4 Read operations with AccessMode::wait_for_new_data will be skipped once for each accessor to propagate the DataValidity::faulty flag (which counts as new data, i.e. readNonBlocking()/readLatest() will return true (= hasNewData), and a new VersionNumber is obtained) [T, T, T, T, T]. Subsequently:
        • 2.2.4.1 non-blocking read operations (readNonBlocking() and readLatest()) are skipped and return false (= no new data), until the device is recovered [T, T], and
        • 2.2.4.2 blocking read operations (read()) will be frozen until the device is recovered. [T]
        • 2.2.4.3 After the device is fully recovered (cf. 3.1), the current value is (synchronously) read from the device. This is the first value received by the accessor after an exception. [T]
      • 2.2.5 The VersionNumbers returned in case of an exception are the same for the same exception, even across variables and modules. It will be generated in the moment the exception is reported. (*) [T]
      • 2.2.6 The data buffer is not updated. This guarantees that the data buffer stays on the last known value if the user code has not modified it since the last read. [T]
        • 2.2.6.1 This is different to a working device or an implementation without exception handling, where a returning read() has overwritten the data content of the buffer. If an application requires the last read value in the data buffer, it must not change it in the user code. This is the only exception to the golden rule 1.2. [No test, as out of scope]
    • 2.3 Write operations will be delayed until the device is fully recovered again (cf. 3.1).
      • 2.3.1 In case of a fault state (new or persisting), the actual write operation will take place asynchronously when the device is recovering. [tested by 3.1.2]
      • 2.3.2 The same mechanism as used for 3.1.2 is used here, hence the order of write operations is guaranteed across accessors, but only the latest written value of each accessor prevails. (*) [tested by 3.1.2]
      • 2.3.3 The return value of write() indicates whether data was lost in the transfer. If the write has to be delayed due to an exception, the return value will be true (= data lost) if a previously delayed and not-yet written value is discarded in the process, false (= no data lost) otherwise. [T]
      • 2.3.4 When the delayed value is finally written to the device during the recovery procedure, the return value of the write() is ignored. (*) [not testable]
      • 2.3.5 It is guaranteed that the write takes place before the device is considered fully recovered again and other transfers are allowed (cf. 3.1). [T]
      • 2.3.6 Write operations to registers of the type ChimeraTK::Void are not delayed. (*) [tested by 3.1.2]
    • 2.4 In case of exceptions, there is no guaranteed realtime behaviour, not even for "non-blocking" transfers. (*) [not testable]
    • 2.5 TransferElement::isReadable(), TransferElement::isWriteable() and TransferElement::isReadOnly() return with values as if reading and writing would be allowed. (*) [T]
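
The rules in 2.2.3, 2.2.4 and 2.3 can be condensed into a small decision function. The following is a minimal sketch, not ApplicationCore code: the types Outcome, Operation and AccessorState and the function outcomeDuringFault() are hypothetical names introduced only for illustration. It covers only the time while the device is in a fault state, and it leaves out writes to ChimeraTK::Void registers (2.3.6).

```cpp
#include <cassert>

// Hypothetical illustration types; none of these names exist in ApplicationCore.
enum class Outcome { Skipped, Frozen, Delayed };
enum class Operation { Read, ReadNonBlocking, ReadLatest, Write };

struct AccessorState {
  bool waitForNewData;         // accessor uses AccessMode::wait_for_new_data
  bool faultAlreadyPropagated; // the one-time skipped read of 2.2.4 has already happened
};

// What happens to an operation while the device is in a fault state.
Outcome outcomeDuringFault(Operation op, AccessorState const& state) {
  if(op == Operation::Write) return Outcome::Delayed;        // 2.3
  if(!state.waitForNewData) return Outcome::Skipped;         // 2.2.3
  if(!state.faultAlreadyPropagated) return Outcome::Skipped; // 2.2.4: skipped once, counts as new data
  if(op == Operation::Read) return Outcome::Frozen;          // 2.2.4.2
  return Outcome::Skipped;                                   // 2.2.4.1: returns false (no new data)
}
```

In this model a blocking read() on a push-type accessor is the only operation that ever freezes, and only after the faulty flag has been propagated once.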

Recovery

  • 3. The framework tries to resolve an exception state by periodically re-opening the faulty device.
    • 3.1 After successfully re-opening the device, a recovery procedure is executed before allowing any read/write operations from the ApplicationModules and FanOuts again. This recovery procedure involves:
      • 3.1.1 the execution of so-called initialisation handlers (see 3.2) [T], and
      • 3.1.2 restoring all registers that have been written since the start of the application with their latest values. The register values are restored in the same order they were written. Registers of the type ChimeraTK::Void are not written. (*) [T]
      • 3.1.3 The asynchronous read transfers of the device are (re-)activated by calling Device::activateAsyncRead(). [T]
      • 3.1.4 Finally, Devices/<alias>/deviceBecameFunctional is written to inform any module subscribing to this variable about the finished recovery. (*) [T]
    • 3.2 Any number of initialisation handlers can be added to the DeviceModule in the user code. Initialisation handlers are callback functions which will be executed when a device is opened for the first time and after a device recovers from an exception, before any application-initiated transfers are executed (including delayed write transfers). See DeviceModule::addInitialisationHandler(). [T]
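
The order of the recovery steps in 3.1, together with the return value of a delayed write (2.3.3), can be illustrated with a reduced model. This is a sketch under assumptions: RecoveryEntry and RecoveryModel are hypothetical stand-ins, register values are plain integers, and the device is replaced by a log of the actions taken. That a write to a Void register reports "no data lost" is an assumption of the model, not something the specification states.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one recovery helper.
struct RecoveryEntry {
  std::string registerName;
  bool isVoid{false};
  int value{0};
  std::uint64_t writeOrder{0}; // 0 = never written since application start
  bool wasWritten{true};       // false while a delayed value is still pending
};

class RecoveryModel {
 public:
  std::size_t addRegister(std::string name, bool isVoid = false) {
    RecoveryEntry e;
    e.registerName = std::move(name);
    e.isVoid = isVoid;
    _entries.push_back(e);
    return _entries.size() - 1;
  }

  // Write attempted during a fault state. Returns true (= data lost) if a
  // previously delayed, not-yet-written value is discarded (2.3.3).
  bool delayWrite(std::size_t index, int value) {
    RecoveryEntry& e = _entries.at(index);
    if(e.isVoid) return false; // 2.3.6: not delayed; the return value is an assumption
    bool dataLost = !e.wasWritten;
    e.value = value;
    e.writeOrder = ++_writeCounter;
    e.wasWritten = false;
    return dataLost;
  }

  // Recovery after re-opening: returns the sequence of actions (3.1.1 to 3.1.4).
  std::vector<std::string> recover(std::vector<std::function<void()>> const& initHandlers) {
    std::vector<std::string> log;
    for(auto const& handler : initHandlers) handler(); // 3.1.1
    std::vector<RecoveryEntry*> pending;
    for(auto& e : _entries) {
      if(e.writeOrder != 0) pending.push_back(&e);
    }
    std::sort(pending.begin(), pending.end(),
        [](RecoveryEntry const* a, RecoveryEntry const* b) { return a->writeOrder < b->writeOrder; });
    for(auto* e : pending) { // 3.1.2: restore in the order the registers were written
      log.push_back(e->registerName + "=" + std::to_string(e->value));
      e->wasWritten = true;
    }
    log.push_back("activateAsyncRead");      // 3.1.3
    log.push_back("deviceBecameFunctional"); // 3.1.4
    return log;
  }

 private:
  std::vector<RecoveryEntry> _entries;
  std::uint64_t _writeCounter{0};
};
```

Writing B, then A, then B again in this model restores A before B, because only the latest write of each register keeps its place in the order (cf. 2.3.2).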

Startup

  • 4. The behaviour at application start (at which all devices are still closed at first) is similar to the case of a later received exception. The only differences are mentioned in 4.2.
    • 4.1 Even if some devices are initially in a persisting error state, the part of the application which does not interact with the faulty devices starts and works normally. [T]
    • 4.2 Initial values are correctly propagated after a device is opened. See the Technical specification: propagation of initial values. Especially, all read operations (even readNonBlocking/readLatest or without AccessMode::wait_for_new_data) will be frozen until an initial value has been successfully read. (*) [test in other spec]

Forced Recovery

  • 5. Any ApplicationModule can explicitly report a problem with the device by calling DeviceModule::reportException(). This allows the reinitialisation of a device e.g. after a reboot of the device which didn't result in an exception (e.g. because it was too quick to be noticed, or rebooting the device takes place without interrupting the communication).

(*) Comments

  • 1.1 In future, maybe logic_errors are also handled, so configuration errors can nicely be presented to the control system. This may be important especially since logic_errors may depend also on the configuration of external components (devices). If e.g. a device is changed (e.g. device is another control system application which has been modified), logic_errors may be thrown in the recovery phase, even though the device had been successfully initialised previously.
  • 2.2.5 Without changing the VersionNumber, the faulty-marked data might get correlated with good data (e.g. a trigger number which is also used as a trigger to read data from the device), resulting in marking the originally good data as faulty, just because an exception has been received after the good data was processed. Using a VersionNumber generated when reporting the exception ensures that the VersionNumber is older than any data read from the device after recovery. There might still be a race condition if a trigger is delayed for some reason for the entire time of detecting and reporting an exception and recovering the device, in which case the trigger number is older than the exception, but the data is still newer and shouldn't really be correlated with the trigger any more. Since an ApplicationModule will always use the newest VersionNumber of its inputs, in this case the VersionNumber from the exception will still be used, which is not ideal but should merely prevent the correlation of the data with other data.
  • 2.3.2 / 3.1.4 If timing is important for write operations (e.g. must not write a sequence of registers too fast), or if multiple values need to be written to the same register in sequence, the application cannot fully rely on the framework's recovery procedure. The framework hence provides the process variable Devices/<alias>/deviceBecameFunctional for each device, which will be written each time the recovery procedure is completed (cf. 3.1.4). ApplicationModules which implement such a timed sequence need to receive this variable and restart the entire sequence after the recovery.
  • 2.3.4 The TransferElement specification B.7.2 guarantees that only old data may be lost in a write transfer, hence the latest data is guaranteed to be written to the device during recovery.
  • 2.3.6 Void-typed registers trigger actions and do not carry data. Hence no value can be restored, but instead an action would be triggered which is usually unwanted at the time of recovery (e.g. board reset). If the action is explicitly wanted during recovery, it can be triggered in the recovery handler instead.
  • 2.4 Even read without wait_for_new_data and write operations are not truly non-blocking, since they are still synchronous. The "non-blocking" guarantee only means that the operation does not block until new data has arrived, and that it is not frozen until the device is recovered. For the duration of the recovery procedure and of course for timeout periods these operations may still block. readNonBlocking() and readLatest() with wait_for_new_data could in theory be truly lock-free and wait-free, but the synchronisation mechanisms in case of exceptions are not implemented as such. In case of exceptions, the application usually does not behave normally any more anyway. If needed, this limitation could be lifted with a more complicated implementation in the future.
  • 2.5 These functions can throw runtime errors if the behaviour has to be determined from the running device. In this case readability and writeability can change on the device (cf. TransferElement specification C.5.3). Suppressing the exception and allowing the operation does not pose the risk of getting a ChimeraTK::logic_error in the preXxx() phase of the operation because all transfer elements are tested for this during device recovery (cf. C.3.3.3).
  • 3.1.2 For some applications, the order of writes may be important, e.g. if firmware expects this. Please note that the VersionNumber is insufficient as a sorting criterion, since many writes may have been done with the same VersionNumber (in an ApplicationModule, the VersionNumber used for the writes is determined by the largest VersionNumber of the inputs).
  • 4.2 DataValidity::faulty is initially set by default, so there is no need to propagate this flag initially. To prevent race conditions and undefined behaviour (especially in automated tests), it even needs to be made sure that the flag is not propagated unnecessarily. The behaviour of non-blocking reads presents a slight asymmetry between the initial device opening and a later recovery. This will in particular be visible when restarting a server while a device is offline. If a module only uses readLatest()/readNonBlocking() (= read() for poll-type inputs) for the offline device, the module was still running before the server restart using the last known values for the dysfunctional registers (and flagging all outputs as faulty). After the restart, the module has to wait for the initial value and hence will not run until the device becomes functional again. To make this behaviour symmetric, one would need to persist the values of device inputs. Since this only affects a corner case in which likely no usable output is produced anyway, this slight inconsistency is considered acceptable.

C. Implementation

A so-called ExceptionHandlingDecorator is placed around all device register accessors (used in ApplicationModules and FanOuts). It is responsible for catching the exceptions and implementing most of the behaviour described in B.2, and its implementation is described in C.2. It has to work closely with the DeviceModule and there is a complex synchronisation and locking scheme, which is described in C.1. The sequence executed in the DeviceModule is described in C.3.

C.1 Internal interface between ExceptionHandlingDecorator and DeviceModule

Note: This section defines the internal interface on a low level. Helper functions, like getters and setters, are intentionally not mentioned here, since those are (in this context) unimportant details which can be chosen at will to structure the code conveniently. The entire interface between the ExceptionHandlingDecorator and the DeviceModule should be protected and the two classes should be friends, to prevent interference with the interface from other entities. Only DeviceModule::reportException() is public, see B.5.

  • 1.1 The boolean flag DeviceModule::deviceHasError
    • 1.1.1 is used by the ExceptionHandlingDecorator to detect prevailing error conditions, to know when transfers have to be skipped or delayed (cf. 2.4).
    • 1.1.2 The access is protected by the DeviceModule::errorMutex:
      • shared lock allows to read
      • unique lock allows to read and write
  • 1.2 The atomic DeviceModule::synchronousTransferCounter (*)
    • 1.2.1 tracks the number of on-going synchronous transfers, and
    • 1.2.2 is used by the DeviceModule to wait until they are all terminated (3.3.15).
  • 1.3 The elements of the DeviceModule::recoveryHelpers list
    • 1.3.1 are used to delay write operations and to restore the last-written values during recovery.
    • 1.3.2 are protected by the DeviceModule::recoveryMutex:
      • shared lock allows to update the application buffer of RecoveryHelper::accessor and to update the other members of the RecoveryHelper structure (*)
      • unique lock allows to call RecoveryHelper::accessor.write() and to read/write the other members of the RecoveryHelper structure
  • 1.4 The cppext::future_queue DeviceModule::errorQueue
  • 1.5 DeviceModule::listOfReadRegisters resp. DeviceModule::listOfWriteRegisters
    • 1.5.1 are used to check that all used registers are existing and have the right direction after (re-)opening the device.
    • 1.5.2 No lock for accessing is required, since the lists are filled in the constructors of the ExceptionHandlingDecorator and in the following only used by the DeviceModule thread.
  • 1.6 The following mutexes govern critical sections (besides variable access listed above):
    • 1.6.1 DeviceModule::errorMutex protects (*)
      • the (positive) decision to start a transfer followed by incrementing the DeviceModule::synchronousTransferCounter in 2.4.3 to 2.4.5, against
      • setting DeviceModule::deviceHasError flag in 2.7.1.
    • 1.6.2 DeviceModule::recoveryMutex protects (*)
      • writing the DeviceModule::recoveryHelpers to the device and clearing the DeviceModule::deviceHasError flag in 3.3.6 to 3.3.7, against
      • updating the DeviceModule::recoveryHelpers in 2.2 and deciding whether to skip the write operation in 2.4.
    • 1.6.3 DeviceModule::initialValueMutex protects (*)
      • the start of a read operation of an initial value in 2.3, against
      • the setup phase of a device until it has been opened and recovered for the very first time in 3.1 to 3.3.10.
  • 1.7 The DeviceModule::exceptionVersionNumber
    • 1.7.1 is generated by DeviceModule::reportException(), and
    • 1.7.2 is used by the ExceptionHandlingDecorator as VersionNumber for the propagation of the DataValidity::faulty flag after an exception.
    • 1.7.3 The access is protected by the DeviceModule::errorMutex:
      • shared lock allows to read
      • unique lock allows to read and write
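
The interplay of the error flag, the transfer counter and the exception version number (1.1, 1.2, 1.6.1 and 1.7) can be sketched with standard library primitives. This is a reduced model, not the DeviceModule class: DeviceStateModel and its member functions are illustrative names, the VersionNumber is replaced by an integer, and ignoring repeated reports during the same fault state is an assumption made to keep the version number stable (cf. B.2.2.5).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <shared_mutex>

// Reduced, hypothetical model of the shared state described in C.1.
struct DeviceStateModel {
  std::shared_mutex errorMutex;
  bool deviceHasError{false};                     // guarded by errorMutex (1.1.2)
  std::uint64_t exceptionVersionNumber{0};        // guarded by errorMutex (1.7.3)
  std::uint64_t nextVersion{1};                   // stand-in for creating a new VersionNumber
  std::atomic<int> synchronousTransferCounter{0}; // 1.2, deliberately not behind a lock

  // Decorator side: the decision to start a transfer and the increment of the
  // counter form one critical section under a shared lock (1.6.1).
  bool tryStartTransfer() {
    std::shared_lock<std::shared_mutex> lock(errorMutex);
    if(deviceHasError) return false;
    ++synchronousTransferCounter;
    return true;
  }

  // Decorator side: called after every transfer that was started.
  void finishTransfer() { --synchronousTransferCounter; }

  // Setting the flag needs the unique lock, so it cannot interleave with the
  // decision in tryStartTransfer().
  void reportException() {
    std::unique_lock<std::shared_mutex> lock(errorMutex);
    if(deviceHasError) return; // assumption: same fault state, keep the version number
    deviceHasError = true;
    exceptionVersionNumber = nextVersion++;
  }

  // DeviceModule side: called at the end of a successful recovery.
  void clearError() {
    std::unique_lock<std::shared_mutex> lock(errorMutex);
    deviceHasError = false;
  }
};
```

Concurrent transfers only take the shared lock, so they do not exclude each other in the no-exception case. Once reportException() has returned, no new transfer can pass tryStartTransfer(), and the counter can only fall.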

(*) Comments

  • 1.2 Reason for not using an (exclusive) lock: Incrementing and decrementing the counter is done in the ExceptionHandlingDecorator for each operation, even if there is no exception or error state. Concurrent operations must not exclude each other, to allow lockfree operation in the no-exception case (if the backend supports it) and to avoid priority inversion, if different application threads have different priorities.
  • 1.3.2 A shared lock (in contrast to an exclusive lock) is used for the same reasons as in 1.2. It might be confusing that in this case the shared lock is used for writing, while the exclusive lock is used for reading. The reason is that here each 'producer thread' is holding its own buffer, so the producers don't interfere with each other. A single, separate reader thread however must access all buffers at once, and must lock out the producers with the exclusive lock (in contrast to 1.2, where the mutex protects a shared resource from concurrent writes).
  • 1.6.1 This prevents a race condition in 3.3.15. If a (synchronous) transfer might be started after DeviceModule::deviceHasError has been set, the barrier for new transfers in 3.3.15 would not be effective and the transfer might be even executed only after the device has been re-opened (3.3.1) but before the recovery is complete.
  • 1.6.2 This prevents data loss due to a race condition. If the ExceptionHandlingDecorator would update the corresponding DeviceModule::recoveryHelpers list entry only after it has been written to the device by the DeviceModule thread in 3.3.6, but the ExceptionHandlingDecorator would decide not to execute the write operation (2.4) because the DeviceModule thread has not yet cleared the error flag in 3.3.7, the data would not be written to the device at all.
  • 1.6.3 This implements freezing reads until the initial value can be read, cf. B.4.2.
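
The inverted lock usage from comment 1.3.2 may be easier to see in code. The sketch below is a simplified illustration with hypothetical names (SlotsModel, update(), snapshot()): every producer writes only its own slot while holding a shared lock, and the single reader takes the exclusive lock to see all slots in a consistent state.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical model: one slot per producer, comparable to one RecoveryHelper per accessor.
struct SlotsModel {
  std::shared_mutex recoveryMutex;
  std::vector<int> slots;

  explicit SlotsModel(std::size_t numberOfProducers) : slots(numberOfProducers, 0) {}

  // Producer side: a shared lock is sufficient, because each producer only
  // touches slots[own] and therefore never conflicts with another producer.
  void update(std::size_t own, int value) {
    std::shared_lock<std::shared_mutex> lock(recoveryMutex);
    slots.at(own) = value;
  }

  // Reader side: the exclusive lock keeps all producers out while every slot is read.
  std::vector<int> snapshot() {
    std::unique_lock<std::shared_mutex> lock(recoveryMutex);
    return slots;
  }
};
```

The design choice is that producers never wait for each other. They can only be held up while the reader owns the exclusive lock, which in the specification corresponds to the recovery procedure.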

C.2 ExceptionHandlingDecorator

Structure

  • 2.1 A second, undecorated copy of each writeable device register accessor (*), the so-called recovery accessor, is stored in the DeviceModule::recoveryHelpers. These recoveryHelpers are used to set the initial values of registers when the device is opened for the first time and to recover the last written values during the recovery procedure.

Behaviour

  • 2.2 In doPreWrite() the RecoveryHelper is updated while holding a shared lock on DeviceModule::recoveryMutex:
  • 2.3 In doPreRead() it is checked if the transfer element has seen an initial value by checking whether the current version number is still {nullptr} (cf. B.4.2)
    • 2.3.1 This is done as the first thing unconditionally for all read types, as no read must return with the "value after construction". (For further details, see the initial value propagation specification)
    • 2.3.2 If there has not been an initial value yet, the read is frozen by acquiring a shared lock on the DeviceModule::initialValueMutex. (*)
    • 2.3.3 As soon as the lock has been acquired it can be released immediately. The device should now be functional and an initial value can be read. (*)
    • 2.3.4 A check whether to freeze for a recovery of asynchronous transfers as described in B.2.2.4 is not done in doPreRead(). The backend takes care of this and the operation automatically freezes when waiting for data from the decorated transfer element, and resumes once the backend starts sending data again. There is nothing extra to do for the ExceptionHandlingDecorator in this case.
    • 2.3.5 The lock on the DeviceModule::errorMutex must not be held in this step to prevent dead-lock with the DeviceModule::initialValueMutex. (*)
  • 2.4 In doPreRead()/doPreWrite(), it is decided whether to execute the target's transfer.
    • 2.4.1 This is only applicable to read operations without AccessMode::wait_for_new_data, and to write operations (*).
    • 2.4.2 This part requires a shared lock on the DeviceModule::errorMutex.
    • 2.4.3 Transfers are only executed if DeviceModule::deviceHasError == false (cf. B.2.3 and B.2.2.3).
    • 2.4.4 If a transfer is not executed, none of the pre/transfer/post functions must be delegated to the target accessor.
    • 2.4.5 If the transfer is executed, the DeviceModule::synchronousTransferCounter must be incremented.
    • 2.4.6 To prevent the execution of the transfer, a ChimeraTK::runtime_error is thrown before calling _target::preXxx() (*).
  • 2.5 deleted
  • 2.6 In doPostRead()/doPostWrite():
    • 2.6.1 Delegate to postRead() / postWrite() (see 2.7), if there was no exception raised by the ExceptionHandling decorator itself (see 2.4.6.1).
    • 2.6.2 In doPostWrite() the RecoveryHelper::wasWritten flag is set (while holding a shared lock on DeviceModule::recoveryMutex) if the write was successful (no exception thrown; data lost flag does not matter here). (*)
    • 2.6.3 If the DeviceModule::synchronousTransferCounter was incremented in 2.4.5, decrement it. (*)
    • 2.6.4 In doPostRead(), _dataValidity and _versionNumber are set to
      • DataValidity::faulty and DeviceModule::exceptionVersionNumber, respectively, if an exception was thrown in 2.4.6 to prevent the transfer, or caught from the delegated postXxx() (see 2.7)
      • the target's data validity and version number, respectively, in all other cases
  • 2.7 In doPostRead()/doPostWrite(), any ChimeraTK::runtime_error exception thrown by the delegated postRead()/postWrite() is caught (*). The following actions are executed in case of a ChimeraTK::runtime_error:
  • 2.8 The constructor of the decorator
    • 2.8.1 receives the VariableNetworkNode for the device variable, to enable it to create additional, undecorated copies of the register accessor,
    • 2.8.2 puts the name of the register (from the VariableNetworkNode) to DeviceModule::listOfReadRegisters resp. DeviceModule::listOfWriteRegisters depending on the direction the accessor is used, and
    • 2.8.3 creates the recovery accessor and initialises the RecoveryHelper object.
    • 2.8.4 Note: The alias name of the device can be obtained from the VariableNetworkNode, which allows to obtain the corresponding DeviceModule via Application::deviceModuleList (change the list into a map).
    • 2.8.5 The code instantiating the decorator (Application::createDeviceVariable()) makes sure that the ExceptionHandlingDecorator is "inside" the MetaDataPropagatingRegisterDecorator, so in case of an exception the dataValidity flag is properly propagated to the owning module/fan out (cf. 2.6.4).
  • 2.9 When a ChimeraTK::runtime_error is caught in isReadable(), isWriteable() or isReadOnly(), the DeviceModule is informed via DeviceModule::reportException().
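
The control flow of 2.4 and 2.6 for a read without AccessMode::wait_for_new_data can be sketched as follows. This is a simplified model under assumptions: DeviceModel, TargetModel and PollReadModel are hypothetical names, the pre/transfer/post phases of the target are collapsed into one assignment, and skipping is modelled with a boolean instead of the ChimeraTK::runtime_error described in 2.4.6.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <shared_mutex>

enum class Validity { ok, faulty };

// Hypothetical stand-ins for the DeviceModule state and the decorated accessor.
struct DeviceModel {
  std::shared_mutex errorMutex;
  bool deviceHasError{false};
  std::uint64_t exceptionVersionNumber{0};
  std::atomic<int> synchronousTransferCounter{0};
};

struct TargetModel {
  int value;
  std::uint64_t version;
};

class PollReadModel {
 public:
  PollReadModel(DeviceModel& device, TargetModel& target) : _device(device), _target(target) {}

  int buffer{0};
  Validity validity{Validity::faulty};
  std::uint64_t version{0};

  void read() {
    bool transferStarted = false;
    {
      std::shared_lock<std::shared_mutex> lock(_device.errorMutex); // 2.4.2
      if(!_device.deviceHasError) {                                 // 2.4.3
        ++_device.synchronousTransferCounter;                       // 2.4.5
        transferStarted = true;
      }
    }
    if(transferStarted) {
      buffer = _target.value; // delegated pre/transfer/post of the target
      validity = Validity::ok;
      version = _target.version;
      --_device.synchronousTransferCounter; // 2.6.3
    }
    else {
      // 2.4.4: nothing is delegated, so the buffer keeps its old content (B.2.2.6)
      std::shared_lock<std::shared_mutex> lock(_device.errorMutex);
      validity = Validity::faulty; // 2.6.4
      version = _device.exceptionVersionNumber;
    }
  }

 private:
  DeviceModel& _device;
  TargetModel& _target;
};
```

A skipped read in this model changes only the validity and the version number. The counter is balanced in both branches, because it is only decremented when it was incremented before.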

(*) Comments

  • 2.1 Possible future change: Output accessors can have the option not to have a RecoveryHelper. This is needed for instance for "trigger registers" which start an operation on the hardware. Also void registers don't have a RecoveryHelper (once the void data type is supported by ChimeraTK).
  • 2.1.1 The written flag cannot be replaced by comparing RecoveryHelper::accessor.getCurrentVersion() and RecoveryHelper::versionNumber, because normal writes (without exceptions) would not update the version number of the RecoveryHelper::accessor. The written flag could also be made atomic to avoid acquiring the shared lock in postWrite(), but since the shared lock will never block (if acquired before counting down the DeviceModule::synchronousTransferCounter) there is probably no benefit in using an atomic here.
  • 2.1.2 The ordering guarantee cannot work across DeviceModules anyway. Different devices may go offline and recover at different times. Even in case of two DeviceModules which actually refer to the same hardware device there is no synchronisation mechanism which ensures the recovering procedure is done in a defined order.
  • 2.2.1 Updating the recoveryHelper first ensures that no data is lost, even if the write operation attempt is concurrent with a recovery. See 1.6.2.
  • 2.2.4 Extending the duration of the lock until the decision whether to skip the transfer will prevent unnecessary duplicate writes, which otherwise could occur if the DeviceModule went through the whole critical section 3.3.5 to 3.3.8 in between. Two mutexes have to be shared-locked in 2.4 then at the same time (DeviceModule::recoveryMutex and DeviceModule::errorMutex, which is acquired second). This does not present any risk of dead locks, since the only place where the DeviceModule::errorMutex is unique-locked (see DeviceModule::reportException()) no other mutex is acquired.
  • 2.3.2 In principle just getting and releasing the shared lock on DeviceModule::initialValueMutex unconditionally would be a sufficient implementation. The version number cannot be valid if the lock cannot be acquired yet, and after this the exclusive lock is never acquired again after it has been released in 3.3.10. However, checking the version number is probably cheaper than acquiring the lock in each doPreRead().
  • 2.3.3 There is one situation where the data content of the "value after construction" is propagated: If the device was functional when leaving 2.3 but is already broken in 2.4, or if an exception is received while getting the initial value, the operation is skipped. It returns with the data invalid flag, but there never was a valid initial value before. This can only happen if there are exceptions on the device, never at the normal start of the application with working devices.
  • 2.3.5 If the shared lock on the DeviceModule::errorMutex were held while waiting for the shared lock for the DeviceModule::initialValueMutex, it would dead-lock with the DeviceModule, which needs the exclusive lock of DeviceModule::errorMutex to release the DeviceModule::initialValueMutex in 3.3.10. The only thing that can happen by not having the DeviceModule::errorMutex is that in case of a device error the transfer is already skipped in 2.4 and not by an exception in the transfer.
  • 2.4.1 In case of read operations with AccessMode::wait_for_new_data, there is no doXxxTransferYyy() called by the TransferElement. The requirement in B.2.2.4 is fulfilled by the backend implementations, see the TransferElement specification in DeviceAccess.
  • 2.4.6 The actual implementation to skip the transfer is done in the TransferElement and the TransferGroup. If the ExceptionHandlingDecorator would implement it by overriding doXxxTransferYyy() it would not work for the TransferGroup, which instead calls the transfer function of the LowLevelTransferElement.
  • 2.6.2 The RecoveryHelper::wasWritten flag is used to report loss of data. If the loss of data is already reported directly, it should not later be reported again. Hence the written flag is set even if there was a loss of data in this context. Setting the flag is ideally done before decrementing the DeviceModule::synchronousTransferCounter in 2.6.3, because this eliminates the possibility that acquiring the shared lock on the DeviceModule::recoveryMutex could block (exclusive lock is only acquired during recovery, which cannot start before DeviceModule::synchronousTransferCounter == 0)
  • 2.6.3 The state of DeviceModule::deviceHasError does not matter here. The counter always MUST be decreased after a transfer (if it has been incremented in the corresponding preXxx()), whether the transfer failed or not.
  • 2.7 Remember: exceptions from other phases are redirected to the post phase by the TransferElement base class.
  • 2.7.1 No transfers will be started in any of the accessors of the device, including this one. This is important to avoid the race condition described in the comment to 1.6.1

C.3 DeviceModule

  • 3.1 The application always starts with all devices as closed. For each device, the initial value for Devices/<alias>/status is set to 1 and the initial value for Devices/<alias>/message is set to an error that the device has not been opened yet (the message will be overwritten with the real error message if the first attempt to open fails, see 3.3.1).
  • 3.2 The DeviceModule locks the DeviceModule::initialValueMutex (cf. 2.3). This happens before launching any module and fan out threads.
  • 3.3 In the DeviceModule thread, the following procedure is executed (in a loop until termination):
    • 3.3.1 The DeviceModule tries to open the device until it succeeds and Device::isFunctional() returns true.
      • 3.3.1.1 If the very first attempt to open the device after the application start fails, the error message of the exception is used to overwrite the content of Devices/<alias>/message. Otherwise error messages of exceptions thrown by Device::open() are not visible.
    • 3.3.2 The queue of reported exceptions is cleared. (*)
    • 3.3.3 Check that all registers in DeviceModule::listOfReadRegisters are readable (isReadable()) and all registers in DeviceModule::listOfWriteRegisters are writeable (isWriteable()).
      • 3.3.3.1 This involves obtaining an accessor for the register first, which is discarded after the check.
      • 3.3.3.2 If there is an exception, update Devices/<alias>/message with the error message and go back to 3.3.1.
      • 3.3.3.3 If one of the accessors does not meet this condition, throw a ChimeraTK::logic_error.
    • 3.3.4 The device is initialised by iterating DeviceModule::initialisationHandlers list and executing the functors.
      • 3.3.4.1 If there is an exception, update Devices/<alias>/message with the error message and go back to 3.3.1.
    • 3.3.5 Obtain unique lock on DeviceModule::recoveryMutex.
    • 3.3.6 Call write() on all valid RecoveryHelper::accessor using RecoveryHelper::versionNumber, in the ascending order of the RecoveryHelper::writeOrder.
    • 3.3.7 While holding the DeviceModule::errorMutex: Clear the DeviceModule::deviceHasError flag to allow the ExceptionHandlingDecorator to execute read/write operations again (cf. 3.3.13).
    • 3.3.8 Release lock on DeviceModule::recoveryMutex (was obtained in 3.3.5).
    • 3.3.9 (Re-)activate the asynchronous read transfers of the device by calling Device::activateAsyncRead().
    • 3.3.10 Release the DeviceModule::initialValueMutex, if this point is passed for the very first time (was obtained in 3.2, cf. 2.3). (*)
    • 3.3.11 Devices/<alias>/status is set to 0 and Devices/<alias>/message is set to an empty string. Devices/<alias>/deviceBecameFunctional is written.
    • 3.3.12 The DeviceModule thread waits for the next reported exception.
    • 3.3.13 An exception is received. The call to reportException (cf. C.4) in the other thread has already set deviceHasError to true (*). From this point on, no new transfers will be started.
    • 3.3.14 Devices/<alias>/status is set to 1 and Devices/<alias>/message is set to the first received exception message.
    • 3.3.15 The device module waits until all running synchronous read and write operations of ExceptionHandlingDecorators have ended (wait until DeviceModule::synchronousTransferCounter == 0). (*)
    • 3.3.16 The thread goes back to 3.3.1 and tries to re-open the device.
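
One pass through steps 3.3.1 to 3.3.11 can be sketched as follows. This is a simplified model, not the actual DeviceModule: RecoveryLoopSketch, RecoveryEntrySketch, tryRecover() and the openDevice hook are hypothetical, the status and message members stand in for Devices/<alias>/status and Devices/<alias>/message, and the initial message text is made up. The register check (3.3.3), the asynchronous read activation (3.3.9), the initialValueMutex (3.3.10), deviceBecameFunctional and the handling of exceptions during the recovery writes are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <shared_mutex>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a RecoveryHelper: write() replays the last known value.
struct RecoveryEntrySketch {
  std::function<void()> write;
  std::size_t writeOrder;
};

struct RecoveryLoopSketch {
  // Hypothetical hook: opens the device, throws std::runtime_error on failure.
  std::function<void()> openDevice;
  std::vector<std::function<void()>> initialisationHandlers;
  std::vector<RecoveryEntrySketch> recoveryHelpers;

  std::shared_mutex recoveryMutex;
  std::mutex errorMutex;
  std::queue<std::string> errorQueue;
  bool deviceHasError{true};
  bool firstOpenAttempt{true};
  int status{1};                                         // 3.1
  std::string message{"Device has not been opened yet"}; // 3.1, text is made up

  // Returns false if the loop has to go back to 3.3.1.
  bool tryRecover() {
    assert(openDevice);
    const bool isFirst = firstOpenAttempt;
    firstOpenAttempt = false;
    try {
      openDevice(); // 3.3.1
    }
    catch(const std::runtime_error& e) {
      if(isFirst) message = e.what(); // 3.3.1.1: only the very first failure is shown
      return false;
    }
    {
      std::lock_guard<std::mutex> errorLock(errorMutex);
      errorQueue = std::queue<std::string>(); // 3.3.2
    }
    try {
      for(auto& handler : initialisationHandlers) handler(); // 3.3.4
    }
    catch(const std::runtime_error& e) {
      message = e.what(); // 3.3.4.1
      return false;
    }
    {
      std::unique_lock<std::shared_mutex> recoveryLock(recoveryMutex); // 3.3.5
      std::stable_sort(recoveryHelpers.begin(), recoveryHelpers.end(),
          [](const RecoveryEntrySketch& a, const RecoveryEntrySketch& b) {
            return a.writeOrder < b.writeOrder;
          });
      for(auto& helper : recoveryHelpers) helper.write(); // 3.3.6
      std::lock_guard<std::mutex> errorLock(errorMutex);
      deviceHasError = false; // 3.3.7
    } // 3.3.8: both locks are released here
    status = 0; // 3.3.11
    message.clear();
    return true;
  }
};
```

A caller would invoke tryRecover() repeatedly until it returns true and then wait for the next reported exception (3.3.12).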

(*) Comments

  • 3.3.2 The exact point at which this is done does not matter, as long as it is done after 3.3.15 (no ongoing synchronous transfers) and before 3.3.7 (resetting DeviceModule::deviceHasError). As soon as DeviceModule::deviceHasError is cleared, new exceptions can be reported, which would be lost if the queue was cleared afterwards. As DeviceModule::reportException() only writes to the exception queue if DeviceModule::deviceHasError is false (cf. 4.2), and then sets DeviceModule::deviceHasError to true while holding a lock, there will only be one exception in the queue anyway. A race condition remains if exceptions reported by the backend for the same error arrive late: this can trigger a second, unnecessary recovery. But an exception cannot be missed if the error queue is cleared before resetting DeviceModule::deviceHasError.
  • 3.3.10 Releasing the DeviceModule::initialValueMutex has to happen after 3.3.7 (clearing DeviceModule::deviceHasError) to prevent the ExceptionHandlingDecorator from erroneously detecting a device error in 2.4.3 after waiting for the DeviceModule::initialValueMutex in 2.3.
  • 3.3.13 Setting the DeviceModule::deviceHasError flag has to be done in the application thread which has caught the exception. If that thread only sent a message and the DeviceModule did both the setting and the clearing of the flag, there would be a race condition: another accessor could still start a transfer until the DeviceModule has woken up and set the flag. Setting the flag in the reporting thread avoids this. Note that the original, severe race condition that led to this design (the same thread would not freeze because the decision to do so was made in pre-read) does not exist any more since the backend has taken over the responsibility not to send any new data to the queue after an exception has been reported.
  • 3.3.15 The backend takes care that after an exception all transfer elements with AccessMode::wait_for_new_data will not start new asynchronous transfers until they have been re-activated with Device::activateAsyncRead() (see DeviceAccess TransferElement specification).
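
The ordering argument in the comment on 3.3.2 can be illustrated with a small single-threaded model. It is only a sketch: ErrorStateSketch and both helper functions are hypothetical, locking is omitted on purpose, and the otherThread callback simulates an application thread that reports a new exception right after the flag has been cleared.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>

// Hypothetical, single-threaded model of the error state of one device.
struct ErrorStateSketch {
  bool deviceHasError{true};
  std::queue<std::string> errorQueue;

  // Same decision logic as C.4, without the lock and the version number.
  void reportException(const std::string& message) {
    if(deviceHasError) return;
    deviceHasError = true;
    errorQueue.push(message);
  }
};

// Order required by the comment on 3.3.2: clear the queue, then the flag.
inline void clearQueueThenFlag(ErrorStateSketch& s, const std::function<void()>& otherThread) {
  s.errorQueue = std::queue<std::string>();
  s.deviceHasError = false;
  otherThread(); // a new exception reported here stays in the queue
}

// Wrong order: clear the flag, then the queue.
inline void clearFlagThenQueue(ErrorStateSketch& s, const std::function<void()>& otherThread) {
  s.deviceHasError = false;
  otherThread();
  s.errorQueue = std::queue<std::string>(); // wipes the exception reported above
}

// Usage: compare both orders with the same simulated report.
inline void usageExample() {
  ErrorStateSketch good, bad;
  clearQueueThenFlag(good, [&] { good.reportException("new fault"); });
  clearFlagThenQueue(bad, [&] { bad.reportException("new fault"); });
  assert(good.deviceHasError && good.errorQueue.size() == 1); // exception is kept
  assert(bad.deviceHasError && bad.errorQueue.empty());       // exception is lost
}
```

In the wrong order of this model, deviceHasError stays true while the queue is empty. Every later call then returns early as in 4.2, so a thread waiting as in 3.3.12 would never see the exception.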

C.4 DeviceModule::reportException()

  • 4.1 Acquire unique lock on DeviceModule::errorMutex (keep until function returns).
  • 4.2 Just return, if DeviceModule::deviceHasError is already true.
  • 4.3 Set DeviceModule::deviceHasError to true (*).
  • 4.4 Generate a new VersionNumber and store in DeviceModule::exceptionVersionNumber.
  • 4.5 Write exception message to DeviceModule::errorQueue.
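
Steps 4.1 to 4.5 can be sketched as follows. This is a minimal model under stated assumptions, not the ApplicationCore code: ExceptionReportSketch is a hypothetical class, a plain integer counter stands in for the VersionNumber, and std::queue stands in for the real queue type.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <queue>
#include <string>

struct ExceptionReportSketch {
  std::mutex errorMutex;
  bool deviceHasError{false};
  std::uint64_t lastVersion{0};            // assumption: stand-in for version number generation
  std::uint64_t exceptionVersionNumber{0}; // assumption: stand-in for a VersionNumber
  std::queue<std::string> errorQueue;      // assumption: stand-in for the real queue type

  void reportException(const std::string& message) {
    std::unique_lock<std::mutex> lock(errorMutex); // 4.1: held until the function returns
    if(deviceHasError) return;                     // 4.2
    deviceHasError = true;                         // 4.3
    exceptionVersionNumber = ++lastVersion;        // 4.4
    errorQueue.push(message);                      // 4.5
  }
};

// Usage: two reports for the same fault produce a single queue entry.
inline void usageExample() {
  ExceptionReportSketch dm;
  dm.reportException("first");
  dm.reportException("second");
  assert(dm.errorQueue.size() == 1);
}
```

Because the check in 4.2 and the update in 4.3 happen under the same lock, only the first report after a recovery reaches the queue in this sketch.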

(*) Comments

D. Known issues

TODO