Update the troubleshooting for easier rocksdb recovery/repair.
Signed-off-by: Jason Volk <jason@zemos.net>
@@ -93,48 +93,79 @@ reliability at a slight performance cost due to TCP overhead.
 
 #### Database corruption
 
-If your database is corrupted *and* is failing to start (e.g. checksum
-mismatch), it may be recoverable but careful steps must be taken, and there is
-no guarantee it may be recoverable.
+There are many causes and varieties of database corruption. There are several
+methods for mitigation, each with outcomes ranging from a recovered state down
+to a salvage state. This guide has been simplified into a set of universal steps
+which everyone can follow from the top until they have recovered or reach the
+end. The details and implications will be explained within each step.
 
-The first thing that can be done is launching Tuwunel with the
-`rocksdb_repair` config option set to true. This will tell RocksDB to attempt to
-repair itself at launch. If this does not work, disable the option and continue
-reading.
+> [!NOTE]
+> All command-line `-O` options can be expressed as environment variables or in
+> the config file based on your deployment's requirements. Note that
+> `--maintenance` is only available on the command-line, but is equivalent to
+> configuring `startup_netburst = false` and `listening = false`.
 
-RocksDB has the following recovery modes:
+0. Start the server with the following options:
 
-- `TolerateCorruptedTailRecords`
-- `AbsoluteConsistency`
-- `PointInTime`
-- `SkipAnyCorruptedRecord`
+`tuwunel --maintenance -O rocksdb_recovery_mode=0`
 
-By default, Tuwunel uses `TolerateCorruptedTailRecords` as generally these may
-be due to bad federation and we can re-fetch the correct data over federation.
-The RocksDB default is `PointInTime` which will attempt to restore a "snapshot"
-of the data when it was last known to be good. This data can be either a few
-seconds old, or multiple minutes prior. `PointInTime` may not be suitable for
-default usage due to clients and servers possibly not being able to handle
-sudden "backwards time travels", and `AbsoluteConsistency` may be too strict.
+This is actually a "control" and not a method of recovery. If the server starts,
+you either do not have corruption or have deep corruption indicated by very
+specific errors from RocksDB citing corruption during runtime. If you are
+certain there is deep corruption, skip to step 4; otherwise you are finished
+without any modifications.
 
-`AbsoluteConsistency` will fail to start the database if any sign of corruption
-is detected. `SkipAnyCorruptedRecord` will skip all forms of corruption unless
-it forbids the database from opening (e.g. too severe). Usage of
-`SkipAnyCorruptedRecord` voids any support as this may cause more damage and/or
-leave your database in a permanently inconsistent state, but it may do something
-if `PointInTime` does not work as a last ditch effort.
+1. Start the server in Tolerate-Corrupted-Tail-Records mode:
 
-With this in mind:
+`tuwunel --maintenance -O rocksdb_recovery_mode=1`
 
-- First start Tuwunel with the `PointInTime` recovery method. See the [example
-config](configuration/examples.md) for how to do this using
-`rocksdb_recovery_mode`
-- If your database successfully opens, clients are recommended to clear their
-client cache to account for the rollback
-- Leave your Tuwunel running in `PointInTime` for at least 30-60 minutes so as
-much possible corruption is restored
-- If all goes well, you should be able to restore back to using
-`TolerateCorruptedTailRecords` and you have successfully recovered your database
+The most common corruption scenario is from a loss of power to the hardware
+(not an application crash, though it is still possible). This is remediated
+by dropping the most recently written record. It is highly unlikely there will
+be any impact on the application from this loss. In the best case, the same data
+is often re-requested over the federation or replaced by a client. In the
+worst case, clients may need to clear-cache & reload to guarantee correctness.
+If the server starts, you are finished.
+
+2. Start the server in Point-In-Time mode:
 
+`tuwunel --maintenance -O rocksdb_recovery_mode=2`
+
+Similar to the corruption scenario above but for more severe cases. The most
+recent records are discarded back to the point where there is no corruption.
+It is highly unlikely there will be any impact on the application from this
+loss, but it is more likely than above that clients may need to clear-cache
+& reload to correctly resynchronize with the server.
+
+3. Start the server in Skip-Any-Corrupted-Record mode:
+
+> [!CAUTION]
+> Salvage mode potentially impacts the application's ability to function.
+> We cannot provide any further support for users who have entered this mode.
+
+`tuwunel --maintenance -O rocksdb_recovery_mode=3`
+
+Similar to the prior corruption scenarios but for the most severe cases.
+The database will be inconsistent. It is theoretically possible for the
+server to continue functioning without notable issue in the best case, but
+it is completely uncertain what the effect of this operation will be. If
+the server starts, you should immediately export your messages, encryption
+keys, etc., in a salvage effort and prepare to reinstall.
+
+4. Start the server in repair mode:
+
+> [!CAUTION]
+> Salvage mode potentially impacts the application's ability to function.
+> We cannot provide any further support for users who have entered this mode.
+
+`tuwunel --maintenance -O rocksdb_repair=true`
+
+For corruption affecting the bulk database tables not covered by any journal.
+This will leave the database in an inconsistent and unpredictable state. It
+is theoretically possible to continue operating the server depending on which
+records were dropped, such as some historical records which are no longer
+essential. Nevertheless, the impact of this operation is impossible to assess
+and a successful recovery should be used to salvage data prior to reinstall.
+
 ## Debugging
 
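For review purposes, the escalation order of the new recovery steps can be summarized as a small shell sketch. It only prints the command for a given step and never launches the server. The function name `recovery_command` is hypothetical and not part of Tuwunel; the command lines themselves are copied from the steps in the diff.

```shell
#!/bin/sh
# Sketch only: map a step number from the guide to the command it tells
# you to run. `recovery_command` is a made-up helper name, not a Tuwunel tool.
recovery_command() {
    case "$1" in
        # Step 0 is the "control"; steps 1-3 are Tolerate-Corrupted-Tail-Records,
        # Point-In-Time and Skip-Any-Corrupted-Record, as named in the guide.
        0|1|2|3)
            printf 'tuwunel --maintenance -O rocksdb_recovery_mode=%s\n' "$1"
            ;;
        # Step 4 is repair mode, the last resort.
        4)
            printf 'tuwunel --maintenance -O rocksdb_repair=true\n'
            ;;
        *)
            printf 'unknown step: %s\n' "$1" >&2
            return 1
            ;;
    esac
}

# Print the commands in the order the guide walks through them.
for step in 0 1 2 3 4; do
    recovery_command "$step"
done
```

The sketch deliberately stops at printing: the guide expects an operator to judge after each step whether the server started, and steps 3 and 4 carry the salvage-mode caution, so automatic escalation is not something this example attempts.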