Update the troubleshooting for easier rocksdb recovery/repair.

Signed-off-by: Jason Volk <jason@zemos.net>
Jason Volk
2025-08-30 00:46:11 +00:00
parent 7bc47e58d0
commit 4bdc260015

@@ -93,48 +93,79 @@ reliability at a slight performance cost due to TCP overhead.

 #### Database corruption

-If your database is corrupted *and* is failing to start (e.g. checksum
-mismatch), it may be recoverable but careful steps must be taken, and there is
-no guarantee it may be recoverable.
-
-The first thing that can be done is launching Tuwunel with the
-`rocksdb_repair` config option set to true. This will tell RocksDB to attempt to
-repair itself at launch. If this does not work, disable the option and continue
-reading.
-
-RocksDB has the following recovery modes:
-
-- `TolerateCorruptedTailRecords`
-- `AbsoluteConsistency`
-- `PointInTime`
-- `SkipAnyCorruptedRecord`
-
-By default, Tuwunel uses `TolerateCorruptedTailRecords` as generally these may
-be due to bad federation and we can re-fetch the correct data over federation.
-The RocksDB default is `PointInTime` which will attempt to restore a "snapshot"
-of the data when it was last known to be good. This data can be either a few
-seconds old, or multiple minutes prior. `PointInTime` may not be suitable for
-default usage due to clients and servers possibly not being able to handle
-sudden "backwards time travels", and `AbsoluteConsistency` may be too strict.
-
-`AbsoluteConsistency` will fail to start the database if any sign of corruption
-is detected. `SkipAnyCorruptedRecord` will skip all forms of corruption unless
-it forbids the database from opening (e.g. too severe). Usage of
-`SkipAnyCorruptedRecord` voids any support as this may cause more damage and/or
-leave your database in a permanently inconsistent state, but it may do something
-if `PointInTime` does not work as a last ditch effort.
-
-With this in mind:
-
-- First start Tuwunel with the `PointInTime` recovery method. See the [example
-config](configuration/examples.md) for how to do this using
-`rocksdb_recovery_mode`
-- If your database successfully opens, clients are recommended to clear their
-client cache to account for the rollback
-- Leave your Tuwunel running in `PointInTime` for at least 30-60 minutes so as
-much possible corruption is restored
-- If all goes will, you should be able to restore back to using
-`TolerateCorruptedTailRecords` and you have successfully recovered your database
+There are many causes and varieties of database corruption. There are several
+methods for mitigation, each with outcomes ranging from a recovered state down
+to a savage state. This guide has been simplified into a set of universal steps
+which everyone can follow from the top until they have recovered or reach the
+end. The details and implications will be explained within each step.
+
+> [!NOTE]
+> All command-line `-O` options can be expressed as environment variables or in
+> the config file based on your deployment's requirements. Note that
+> `--maintenance` is only available on the command-line, but is equivalent to
+> configuring `startup_netburst = false` and `listening = false`.
+
+0. Start the server with the following options:
+
+`tuwunel --maintenance -O rocksdb_recovery_mode=0`
+
+This is actually a "control" and not a method of recovery. If the server starts
+you either do not have corruption or have deep corruption indicated by very
+specific errors from rocksdb citing corruption during runtime. If you are
+certain there is deep corruption skip to step 4, otherwise you are finished
+without any modifications.
+
+1. Start the server in Tolerate-Corrupted-Tail-Records mode:
+
+`tuwunel --maintenance -O rocksdb_recovery_mode=1`
+
+The most common corruption scenario is from a loss of power to the hardware
+(not an application crash, though it is still possible). This is remediated
+by dropping the most recently written record. It is highly unlikely there will
+be any impact on the application from this loss. In the best-case the same data
+is often re-requested over the federation or replaced by a client. In the
+worst-case clients may need to clear-cache & reload to guarantee correctness.
+If the server starts you are finished.
+
+2. Start the server in Point-In-Time mode:
+
+`tuwunel --maintenance -O rocksdb_recovery_mode=2`
+
+Similar to the corruption scenario above but for more severe cases. The most
+recent records are discarded back to the point where there is no corruption.
+It is highly unlikely there will be any impact on the application from this
+loss, but it is more likely than above that clients may need to clear-cache
+& reload to correctly resynchronize with the server.
+
+3. Start the server in Skip-Any-Corrupted-Record mode:
+
+> [!CAUTION]
+> Salvage mode potentially impacting the application's ability to function.
+> We cannot provide any further support for users who have entered this mode.
+
+`tuwunel --maintenance -O rocksdb_recovery_mode=3`
+
+Similar to the prior corruption scenarios but for the most severe cases.
+The database will be inconsistent. It is theoretically possible for the
+server to continue functioning without notable issue in the best case, but
+it is completely uncertain what the effect of this operation will be. If
+the server starts you should immediately export your messages, encryption
+keys, etc, in a salvage effort and prepare to reinstall.
+
+4. Start the server in repair mode.
+
+> [!CAUTION]
+> Salvage mode potentially impacting the application's ability to function.
+> We cannot provide any further support for users who have entered this mode.
+
+`tuwunel --maintenance -O rocksdb_repair=true`
+
+For corruption affecting the bulk database tables not covered by any journal.
+This will leave the database in an inconsistent and unpredictable state. It
+is theoretically possible to continue operating the server depending on which
+records were dropped, such as some historical records which are no longer
+essential. Nevertheless the impact of this operation is impossible to assess
+and a successful recovery should be used to salvage data prior to reinstall.

 ## Debugging

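
For quick reference, the five steps introduced by this change map to the invocations below. This is only a minimal sketch: the helper name `recovery_command` is hypothetical and not part of Tuwunel, and the script prints each command instead of running it. It uses only the flags quoted in the guide (`--maintenance`, `-O rocksdb_recovery_mode=N`, `-O rocksdb_repair=true`).

```shell
#!/bin/sh
# Hypothetical helper: print the command for one step (0-4) of the recovery guide.
# Nothing is executed; the operator still decides when to move to the next step.
recovery_command() {
    case "$1" in
        0|1|2|3) echo "tuwunel --maintenance -O rocksdb_recovery_mode=$1" ;;
        4)       echo "tuwunel --maintenance -O rocksdb_repair=true" ;;
        *)       echo "unknown step: $1" >&2; return 1 ;;
    esac
}

# List the steps in the order the guide prescribes, top to bottom.
for step in 0 1 2 3 4; do
    printf 'step %s: %s\n' "$step" "$(recovery_command "$step")"
done
```

Step 0 is the control run rather than a recovery method, and steps 3 and 4 are the salvage modes covered by the caution notes, so a printed list like this is meant to be worked through by hand, stopping at the first step where the server starts.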