Software Tools and Libraries for Fault Tolerance

Yennun Huang, Chandra Kintala and Yi-Min Wang
AT&T Bell Laboratories
Murray Hill, NJ 07974

Abstract: Software fault tolerance is the task of detecting and recovering from faults that are not handled in the underlying hardware or operating system layers of an application. We describe four reusable software components, watchd, libft, libckp and REPL, that perform automatic detection and restart of failed processes, checkpointing and recovery of in-memory data, and replication and synchronization of files in an application. These components have been ported to a number of UNIX platforms and can be used in any application with minimal programming effort. Several telecommunications products in AT&T have enhanced their fault-tolerance capability using these components. Experience with those products to date indicates that these modules provide efficient and economical means to increase the level of fault tolerance in an application.