Update the troubleshooting for easier rocksdb recovery/repair.

Signed-off-by: Jason Volk <jason@zemos.net>
2025-08-30 00:46:11 +00:00
parent 7bc47e58d0
commit 4bdc260015
1 changed files with 66 additions and 35 deletions
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -93,48 +93,79 @@ reliability at a slight performance cost due to TCP overhead.
 #### Database corruption
-If your database is corrupted *and* is failing to start (e.g. checksum
+There are many causes and varieties of database corruption. There are several
-mismatch), it may be recoverable but careful steps must be taken, and there is
+methods for mitigation, each with outcomes ranging from a recovered state down
-no guarantee it may be recoverable.
+to a salvage state. This guide has been simplified into a set of universal steps
 which everyone can follow from the top until they have recovered or reach the
 end. The details and implications will be explained within each step.
-The first thing that can be done is launching Tuwunel with the
+> [!NOTE]
-`rocksdb_repair` config option set to true. This will tell RocksDB to attempt to
+> All command-line `-O` options can be expressed as environment variables or in
-repair itself at launch. If this does not work, disable the option and continue
+> the config file based on your deployment's requirements. Note that
-reading.
+> `--maintenance` is only available on the command-line, but is equivalent to
 > configuring `startup_netburst = false` and `listening = false`.
-RocksDB has the following recovery modes:
+0. Start the server with the following options:
- `TolerateCorruptedTailRecords`
+`tuwunel --maintenance -O rocksdb_recovery_mode=0`
 - `AbsoluteConsistency`
 - `PointInTime`
 - `SkipAnyCorruptedRecord`
-By default, Tuwunel uses `TolerateCorruptedTailRecords` as generally these may
+This is actually a "control" and not a method of recovery. If the server starts
-be due to bad federation and we can re-fetch the correct data over federation.
+you either do not have corruption or have deep corruption indicated by very
-The RocksDB default is `PointInTime` which will attempt to restore a "snapshot"
+specific errors from rocksdb citing corruption during runtime. If you are
-of the data when it was last known to be good. This data can be either a few
+certain there is deep corruption skip to step 4, otherwise you are finished
-seconds old, or multiple minutes prior. `PointInTime` may not be suitable for
+without any modifications.
 default usage due to clients and servers possibly not being able to handle
 sudden "backwards time travels", and `AbsoluteConsistency` may be too strict.
-`AbsoluteConsistency` will fail to start the database if any sign of corruption
+1. Start the server in Tolerate-Corrupted-Tail-Records mode:
 is detected. `SkipAnyCorruptedRecord` will skip all forms of corruption unless
 it forbids the database from opening (e.g. too severe). Usage of
 `SkipAnyCorruptedRecord` voids any support as this may cause more damage and/or
 leave your database in a permanently inconsistent state, but it may do something
 if `PointInTime` does not work as a last ditch effort.
-With this in mind:
+`tuwunel --maintenance -O rocksdb_recovery_mode=1`
- First start Tuwunel with the `PointInTime` recovery method. See the [example
+The most common corruption scenario is from a loss of power to the hardware
-config](configuration/examples.md) for how to do this using
+(not an application crash, though it is still possible). This is remediated
-`rocksdb_recovery_mode`
+by dropping the most recently written record. It is highly unlikely there will
- If your database successfully opens, clients are recommended to clear their
+be any impact on the application from this loss. In the best case the same data
-client cache to account for the rollback
+is often re-requested over the federation or replaced by a client. In the
- Leave your Tuwunel running in `PointInTime` for at least 30-60 minutes so as
+worst case, clients may need to clear-cache & reload to guarantee correctness.
-much possible corruption is restored
+If the server starts you are finished.
- If all goes will, you should be able to restore back to using
+
-`TolerateCorruptedTailRecords` and you have successfully recovered your database
+2. Start the server in Point-In-Time mode:
 `tuwunel --maintenance -O rocksdb_recovery_mode=2`
 Similar to the corruption scenario above but for more severe cases. The most
 recent records are discarded back to the point where there is no corruption.
 It is highly unlikely there will be any impact on the application from this
 loss, but it is more likely than above that clients may need to clear-cache
 & reload to correctly resynchronize with the server.
 3. Start the server in Skip-Any-Corrupted-Record mode:
 > [!CAUTION]
 > Salvage mode potentially impacts the application's ability to function.
 > We cannot provide any further support for users who have entered this mode.
 `tuwunel --maintenance -O rocksdb_recovery_mode=3`
 Similar to the prior corruption scenarios but for the most severe cases.
 The database will be inconsistent. It is theoretically possible for the
 server to continue functioning without notable issue in the best case, but
 it is completely uncertain what the effect of this operation will be. If
 the server starts you should immediately export your messages, encryption
 keys, etc., in a salvage effort and prepare to reinstall.
 4. Start the server in repair mode:
 > [!CAUTION]
 > Salvage mode potentially impacts the application's ability to function.
 > We cannot provide any further support for users who have entered this mode.
 `tuwunel --maintenance -O rocksdb_repair=true`
 For corruption affecting the bulk database tables not covered by any journal.
 This will leave the database in an inconsistent and unpredictable state. It
 is theoretically possible to continue operating the server depending on which
 records were dropped, such as some historical records which are no longer
 essential. Nevertheless the impact of this operation is impossible to assess
 and a successful recovery should be used to salvage data prior to reinstall.
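
The escalation order of steps 0 through 4 can be summarized as a small lookup. This is only a sketch: the function name `recovery_command` is hypothetical and not part of Tuwunel, and the script merely prints the command lines quoted in the steps above without running anything. It assumes `tuwunel` is invoked exactly as shown in this guide.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Tuwunel): prints the command for a
# given step of the guide. It never starts the server itself.
recovery_command() {
  case "$1" in
    # Steps 0-3 differ only in the recovery mode number.
    0|1|2|3) printf 'tuwunel --maintenance -O rocksdb_recovery_mode=%s\n' "$1" ;;
    # Step 4 uses repair mode instead of a recovery mode.
    4) printf 'tuwunel --maintenance -O rocksdb_repair=true\n' ;;
    *) printf 'unknown step: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Print the whole ladder in the order the guide recommends trying it.
for step in 0 1 2 3 4; do
  printf 'step %s: %s\n' "$step" "$(recovery_command "$step")"
done
```

The listing does not encode the decision points: stopping as soon as the server starts, or skipping from step 0 straight to step 4 when deep corruption is certain, remains a manual judgment, and steps 3 and 4 carry the cautions stated above.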
 ## Debugging