Update the troubleshooting for easier rocksdb recovery/repair.

Signed-off-by: Jason Volk <jason@zemos.net>
This commit is contained in:
Jason Volk
2025-08-30 00:46:11 +00:00
parent 7bc47e58d0
commit 4bdc260015

View File

@@ -93,48 +93,79 @@ reliability at a slight performance cost due to TCP overhead.
#### Database corruption
If your database is corrupted *and* is failing to start (e.g. checksum
mismatch), it may be recoverable but careful steps must be taken, and there is
no guarantee it may be recoverable.
There are many causes and varieties of database corruption. There are several
methods for mitigation, each with outcomes ranging from a recovered state down
to a savage state. This guide has been simplified into a set of universal steps
which everyone can follow from the top until they have recovered or reach the
end. The details and implications will be explained within each step.
The first thing that can be done is launching Tuwunel with the
`rocksdb_repair` config option set to true. This will tell RocksDB to attempt to
repair itself at launch. If this does not work, disable the option and continue
reading.
> [!NOTE]
> All command-line `-O` options can be expressed as environment variables or in
> the config file based on your deployment's requirements. Note that
> `--maintenance` is only available on the command-line, but is equivalent to
> configuring `startup_netburst = false` and `listening = false`.
RocksDB has the following recovery modes:
0. Start the server with the following options:
- `TolerateCorruptedTailRecords`
- `AbsoluteConsistency`
- `PointInTime`
- `SkipAnyCorruptedRecord`
`tuwunel --maintenance -O rocksdb_recovery_mode=0`
By default, Tuwunel uses `TolerateCorruptedTailRecords` as generally these may
be due to bad federation and we can re-fetch the correct data over federation.
The RocksDB default is `PointInTime` which will attempt to restore a "snapshot"
of the data when it was last known to be good. This data can be either a few
seconds old, or multiple minutes prior. `PointInTime` may not be suitable for
default usage due to clients and servers possibly not being able to handle
sudden "backwards time travels", and `AbsoluteConsistency` may be too strict.
This is actually a "control" and not a method of recovery. If the server starts
you either do not have corruption or have deep corruption indicated by very
specific errors from rocksdb citing corruption during runtime. If you are
certain there is deep corruption skip to step 4, otherwise you are finished
without any modifications.
`AbsoluteConsistency` will fail to start the database if any sign of corruption
is detected. `SkipAnyCorruptedRecord` will skip all forms of corruption unless
it forbids the database from opening (e.g. too severe). Usage of
`SkipAnyCorruptedRecord` voids any support as this may cause more damage and/or
leave your database in a permanently inconsistent state, but it may do something
if `PointInTime` does not work as a last ditch effort.
1. Start the server in Tolerate-Corrupted-Tail-Records mode:
With this in mind:
`tuwunel --maintenance -O rocksdb_recovery_mode=1`
- First start Tuwunel with the `PointInTime` recovery method. See the [example
config](configuration/examples.md) for how to do this using
`rocksdb_recovery_mode`
- If your database successfully opens, clients are recommended to clear their
client cache to account for the rollback
- Leave your Tuwunel running in `PointInTime` for at least 30-60 minutes so as
much possible corruption is restored
- If all goes will, you should be able to restore back to using
`TolerateCorruptedTailRecords` and you have successfully recovered your database
The most common corruption scenario is from a loss of power to the hardware
(not an application crash, though it is still possible). This is remediated
by dropping the most recently written record. It is highly unlikely there will
be any impact on the application from this loss. In the best-case the same data
is often re-requested over the federation or replaced by a client. In the
worst-case clients may need to clear-cache & reload to guarantee correctness.
If the server starts you are finished.
2. Start the server in Point-In-Time mode:
`tuwunel --maintenance -O rocksdb_recovery_mode=2`
Similar to the corruption scenario above but for more severe cases. The most
recent records are discarded back to the point where there is no corruption.
It is highly unlikely there will be any impact on the application from this
loss, but it is more likely than above that clients may need to clear-cache
& reload to correctly resynchronize with the server.
3. Start the server in Skip-Any-Corrupted-Record mode:
> [!CAUTION]
> Salvage mode potentially impacting the application's ability to function.
> We cannot provide any further support for users who have entered this mode.
`tuwunel --maintenance -O rocksdb_recovery_mode=3`
Similar to the prior corruption scenarios but for the most severe cases.
The database will be inconsistent. It is theoretically possible for the
server to continue functioning without notable issue in the best case, but
it is completely uncertain what the effect of this operation will be. If
the server starts you should immediately export your messages, encryption
keys, etc, in a salvage effort and prepare to reinstall.
4. Start the server in repair mode.
> [!CAUTION]
> Salvage mode potentially impacting the application's ability to function.
> We cannot provide any further support for users who have entered this mode.
`tuwunel --maintenance -O rocksdb_repair=true`
For corruption affecting the bulk database tables not covered by any journal.
This will leave the database in an inconsistent and unpredictable state. It
is theoretically possible to continue operating the server depending on which
records were dropped, such as some historical records which are no longer
essential. Nevertheless the impact of this operation is impossible to assess
and a successful recovery should be used to salvage data prior to reinstall.
## Debugging