We have a requirement to develop an audit procedure that ensures the
integrity of our data. We have a hierarchical, embedded, real-time
system, and our data is distributed over multiple cards. Each
higher-level card stores the data and executable image of the cards
below it. All requests to update the data originate from external
management systems and arrive at the highest-level card, where some
basic validation is performed.

Management update requests are processed in a trickle-down manner. An
update request is routed to the target card, where it is processed.
The target card gets an opportunity to validate the request against
the run-time conditions on that card (which only the target card
knows about, since run-time data is not maintained at higher levels).
If the target card explicitly rejects the request, and the reject
response is not lost before it reaches the top level, the update is
rejected and the rejection is sent back to the external management
entity that issued the request. If the request times out (due to link
failure or card failure), the higher-level card goes ahead and updates
its own copy of the data anyway.
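
The decision rule described above can be sketched as follows. This is
only an illustration in Python, not the real implementation: the names
Outcome and handle_update, and the request layout with "key" and
"value" fields, are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    """Possible results of forwarding a request to the target card (hypothetical names)."""
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    TIMED_OUT = "timed_out"


def handle_update(store, request, outcome):
    """Decide what the higher-level card does with one update request.

    store   -- this card's own copy of the target card's data (a dict here)
    request -- assumed to carry a "key" and a "value"
    Returns True if the update was committed locally, False if rejected.
    """
    if outcome is Outcome.REJECTED:
        # An explicit reject reached this card: leave the local copy
        # untouched so the rejection can be passed back upstream.
        return False
    # ACCEPTED or TIMED_OUT: commit locally. After a time-out the target
    # card may not have applied the update, which the audit must detect.
    store[request["key"]] = request["value"]
    return True
```

The time-out branch is where the two copies can drift apart: the
higher-level card has committed an update that the target card may
never have seen.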

Our requirement is to audit the management data, since that is the only
data that survives a process restart or a card reset. We have looked
at two approaches:

(1) Periodically obtain the checksum of the files at all levels and
compare them. In case of a discrepancy, we always defer to the
higher-level card. While this seems reasonable, the cards have
different processors, and they may not compute checksums consistently
for the same data file.

(2) When the highest-level card successfully updates the data
pertaining to the target card, it logs the management request. As
mentioned above, the highest-level card always updates its version
of the data as long as the basic validation succeeds and no explicit
rejection arrives. The target card also logs the update requests that
caused it to update some data. Periodically, the top-level card sends
the target card a message listing the updates that it (the top-level
card) has made on the target card's behalf. The target card compares
this list with its own log of update commands, matching entries by a
correlation tag that the top-level card generates.
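
The comparison step in approach (2) amounts to a set difference keyed
on the correlation tag. A minimal sketch, assuming each entry in the
top-level card's list is a record with a "tag" field and the target
card can produce the tags it has logged (find_missed_updates and the
field names are hypothetical):

```python
def find_missed_updates(master_entries, target_log_tags):
    """Return the master's entries whose tag is absent from the target's log.

    master_entries  -- list of dicts, each with a "tag" key, in the order
                       the top-level card committed them
    target_log_tags -- iterable of tags the target card has logged
    The result keeps the master's order so it can be replayed in sequence.
    """
    seen = set(target_log_tags)
    return [entry for entry in master_entries if entry["tag"] not in seen]
```

For example, if the top-level card reports tags 101, 102 and 103 and
the target card has logged only 101 and 103, the function returns the
single entry tagged 102.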

If the top-level card's list of successful updates since the last
audit cycle has more entries than the target card's log, the
top-level card processed more update commands than the target card
did. The target card can tell exactly which commands it missed, and it
executes those commands on itself (albeit after a delay). Upon success,
both the top-level card and the target card delete these entries from
their log files. On failure, the entries are not deleted, and they are
left for the next audit cycle to reconcile.
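
One audit cycle of this reconciliation could look roughly like the
sketch below. It makes several assumptions that are not part of the
description above: both logs are modelled as dicts in a single
function (in the real system they live on different cards),
correlation tags increase in the order updates were issued, entries
that both sides already agree on are pruned too, and replay stops at
the first failure so that later updates are not applied out of order.
The names reconcile and apply_command are hypothetical.

```python
def reconcile(master_log, target_log, apply_command):
    """Replay updates the target card missed, then prune reconciled entries.

    master_log, target_log -- dicts mapping correlation tag -> command
    apply_command          -- callable that executes a command on the target
                              card and returns True on success
    Returns the tags still left in master_log for the next audit cycle.
    """
    for tag in sorted(master_log):  # sorted() copies the keys, so deleting is safe
        missed = tag not in target_log
        if missed and not apply_command(master_log[tag]):
            # Failed replay: keep this entry and everything after it.
            break
        # The entry is now consistent on both sides: drop it from both logs.
        del master_log[tag]
        target_log.pop(tag, None)
    return sorted(master_log)
```

An empty return value means the cycle fully reconciled the two logs;
anything else is carried over unchanged to the next cycle.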

After reconciliation, if there are any intermediate cards between the
top-level card and the target card, the top-level card unconditionally
overwrites the data held on those cards. This minimizes the risk of
the top-level card (which is really the data master) and the
intermediate cards getting out of sync.
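
On the checksum concern in approach (1): a checksum such as CRC-32 is
defined over a sequence of bytes, so cards with different processors
should get the same value as long as they checksum identical bytes.
A mismatch is more likely to come from each card writing records in
its native byte order or structure layout. The sketch below assumes
the data can be encoded in one fixed layout before checksumming; the
record format (an unsigned 32-bit id and a signed 32-bit value) and
the function names are purely illustrative.

```python
import struct
import zlib


def encode_record(record_id, value):
    """Serialize one record in a fixed layout: big-endian, no padding."""
    return struct.pack(">Ii", record_id, value)


def file_checksum(records):
    """CRC-32 over the canonical encoding of all records, in order."""
    crc = 0
    for record_id, value in records:
        # zlib.crc32 accepts a running value, so records can be fed one at a time.
        crc = zlib.crc32(encode_record(record_id, value), crc)
    return crc
```

With this kind of encoding the checksum depends only on the record
contents and their order, not on the processor that computes it.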

If anyone can think of other approaches that we can consider, kindly
post them here.

Thanks,
Zahid
