Operational Guide

Deploying to hg.mozilla.org

All code running on the servers behind hg.mozilla.org is in the version-control-tools repository.

Deployment of new code to hg.mozilla.org is performed via an Ansible playbook defined in the version-control-tools repository. To deploy new code, simply run:

$ ./deploy hgmo

Important

Files from your local machine’s version-control-tools working copy will be installed on servers. Be aware of any local changes you have performed, as they might be reflected on the server.
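Before deploying, it can be worth confirming that the working copy you are deploying from is clean; a minimal sketch, run from the root of your version-control-tools clone (wherever that lives on your machine):

# Any output from these commands represents local changes that would be
# shipped to the servers by ./deploy hgmo.
$ hg status
$ hg diff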

Scheduled Outage Policy

For outages expected to last no more than one hour, no advance notice to Sheriffs or Developers is required; however, it is recommended that a notification be posted on the relevant Slack/Matrix channels at the time of outage commencement.

Should an issue occur that makes it likely for the outage to extend beyond the one hour mark, if possible the work should be rolled back and rescheduled for another date.

For outages expected to last more than one hour, Sheriffs and Developers require notice at least one business day prior to the scheduled outage time. A notification must be posted on the relevant Slack/Matrix channels at the time of outage commencement.

Outage scheduling should be mindful of peak development times; where possible, outages should occur during North American off-peak periods: before 9am US/Eastern, over lunch (noon US/Eastern or US/Pacific), or after 5pm US/Pacific.

Operational Secrets

Secrets can be checked in to version-control-tools by encrypting and decrypting them with sops. To encrypt/decrypt variables checked in to this repo, you will need to:

  1. Have your Auth0 email be given permission to access the version-control-tools secrets key. Ask about this in #vcs, or file a bug under hg.mozilla.org.
  2. Install the GCP SDK.
  3. Install sops and go through the GCP setup steps.

With sops installed, you can encrypt and decrypt files with:

$ sops --decrypt path/to/file/configfile.yaml
$ sops --encrypt path/to/file/configfile.yaml

After running these commands, only keys in the configuration that are suffixed with “_encrypted” will have their values encrypted. All other values are left as cleartext.

sops will automatically use the correct key configuration using the .sops.yaml file in the root of the repository. When you run ./deploy hgmo, all configuration files containing secrets will be decrypted automatically.
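As a convenience, sops can also decrypt a file into your editor and re-encrypt it on save, which avoids leaving decrypted copies on disk. A minimal sketch, assuming a hypothetical secrets file path and that your GCP credentials are already configured:

# Print decrypted contents to stdout; the file on disk stays encrypted
$ sops --decrypt ansible/roles/example/vars/secrets.yaml
# Open the file in $EDITOR with the "_encrypted" values decrypted;
# sops re-encrypts them when the editor exits
$ sops ansible/roles/example/vars/secrets.yaml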

Deployment Gotchas

Not all processes are restarted as part of upgrades. Notably, the httpd + WSGI process trees are not restarted. This means that Python or Mercurial upgrades may not be seen until the next httpd service restart. For this reason, deployments of Mercurial upgrades should be followed by manually restarting httpd when each server is out of the load balancer.

Restarting httpd/wsgi Processes

Note

This should be codified in an Ansible playbook.

If a restart of the httpd and WSGI process tree is needed, perform the following:

  1. Log in to the Zeus load balancer at https://zlb1.external.ops.mdc1.mozilla.com:9090
  2. Find the hgweb-http pool.
  3. Mark a host as draining, then click Update.
  4. Poll the draining host for active connections by SSHing into the host and curling http://localhost/server-status?auto. If you see more than 1 active connection (the connection performing server-status itself), repeat until the extra connections go away (a sketch of this polling loop follows the list).
  5. systemctl restart httpd.service
  6. Put the host back in service in Zeus.
  7. Repeat 3 to 6 until done with all hosts.
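For steps 4 and 5, a minimal polling sketch run on the draining host (this assumes the mod_status ?auto output is available at the URL above; BusyWorkers counts active httpd workers, including the one serving the status request itself):

$ until [ "$(curl -s 'http://localhost/server-status?auto' | awk '/^BusyWorkers/ {print $2}')" -le 1 ]; do \
    sleep 5; \
  done
$ sudo systemctl restart httpd.service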

Dealing with high memory on hgweb

Sometimes memory usage on hgweb can spike to very high levels as a result of too many connections coming in at once, and the memory usage may not decrease without manual intervention. If memory usage has spiked as a result of increased number of connections, administrators can run sudo apachectl graceful to gracefully restart httpd.

Zeus Rate Limiting

Zeus has some rate limiting rules defined for hgweb.

  • Rates are defined in the “Rates” section in Zeus.
    • Catalogs > Rates > “hg-rate”
    • A value of 0 for a rate means “do not use this rate”.
  • Rates are used in “Rules”.
    • Services > hg.mozilla.org-https > Rules > Edit > hgweb-rate-rule
    • There you can see the content of the rule, which references the rate.
  • Rules are applied to the virtual server.
    • Use the “Enable”/”Disabled” checkbox to activate/deactivate the rule.
  • Viewing the effect of the rate limiting rule is possible from the activity view.
    • Activity > Current Activity > Chart data > “Rate Limit Queue HG” > Plot

Forcing a hgweb Repository Re-clone

It is sometimes desirable to force a re-clone of a repository to each hgweb node. For example, if a new Mercurial release offers newer features in the repository store, you may need to perform a fresh clone in order to upgrade the repository on-disk format.

To perform a re-clone of a repository on hgweb nodes, the hgmo-reclone-repos deploy target can be used:

$ ./deploy hgmo-reclone-repos mozilla-central releases/mozilla-beta

The underlying Ansible playbook integrates with the load balancers and will automatically take machines out of service and wait for active connections to be served before performing a re-clone. The re-clone should thus complete without any client-perceived downtime.

Repository Mirroring

The replication/mirroring of repositories is initiated on the master/SSH server. An event is written into a distributed replication log and it is replayed on all available mirrors. See Replication for more.

Most repository interactions result in replication occurring automatically. In a few scenarios, you’ll need to manually trigger replication.

The vcsreplicator Mercurial extension defines commands for creating replication messages. For a full list of available commands run hg help -e vcsreplicator. The following commands are common.

hg replicatehgrc
Replicate the hgrc file for the repository. The .hg/hgrc file will effectively be copied to mirrors verbatim.
hg replicatesync

Force mirrors to synchronize against the master. This ensures the repo is present on mirrors, the hgrc is in sync, and all repository data from the master is present.

Run this if mirrors ever get out of sync with the master. It should be harmless to run this on any repo at any time.

hg -R <repo> replicatedelete
Atomically delete this repo from the ssh master and all mirrors. The repo will be moved to a non-public location and removed.

Important

You will need to run /var/hg/venv_tools/bin/hg instead of /usr/bin/hg so Python package dependencies required for replication are loaded.
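For example, to force a full synchronization of a single repository from the SSH master (the repository path is illustrative; run as the user that owns the repository):

$ sudo -u hg /var/hg/venv_tools/bin/hg -R /repo/hg/mozilla/mozilla-central replicatesync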

Marking Repositories as Read-only

Repositories can be marked as read-only. When a repository is read-only, pushes are denied with a message saying the repository is read-only.

To mark an individual repository as read-only, create a .hg/readonlyreason file. If the file has content, it will be printed to the user as the reason the repository is read-only.

To mark all repositories on hg.mozilla.org as read-only, create the /repo/hg/readonlyreason file. If the file has content, it will be printed to the user.
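A minimal sketch of marking a single repository read-only with a custom message and later re-enabling pushes (the repository path and message are illustrative; adjust file ownership to match the repository):

$ cd /repo/hg/mozilla/example-repo
$ echo 'Read-only for planned maintenance.' | sudo -u hg tee .hg/readonlyreason
# To re-enable pushes:
$ sudo -u hg rm .hg/readonlyreason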

Retiring Repositories

Users can delete their own repositories - this section applies only to non-user repositories.

The convention is to retire (i.e. delete) repositories by moving them out of the user-accessible space on the master and deleting them from the webheads.

This can be done via an Ansible playbook in the version-control-tools repository:

$ cd ansible
$ ansible-playbook -i hosts -e repo=relative/path/on/server hgmo-retire-repo.yml

Managing Repository Hooks

It is somewhat common to have to update hooks on various repositories.

The procedure for doing this is pretty simple:

  1. Update a .hg/hgrc file on the SSH master
  2. Replicate hgrc to mirrors

Generally speaking, sudo vim to edit .hg/hgrc files is sufficient. Ideally, you should use sudo -u hg vim .hg/hgrc.

To replicate hgrc changes to mirrors after updating an hgrc, simply run:

$ /var/hg/venv_tools/bin/hg replicatehgrc

Note

hg replicatehgrc operates on the repo in the current directory.

Hook definitions are somewhat inconsistent across repositories. Generally speaking, hook entries are cargo-culted from another repo.
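When copying hook entries from another repository, its active [hooks] configuration can be inspected first; a minimal sketch (the repository path is illustrative):

$ sudo -u hg /var/hg/venv_tools/bin/hg -R /repo/hg/mozilla/mozilla-central config hooks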

Try Head Management

The Try repository continuously grows new heads as people push to it. There are some version control operations that scale with the number of heads. This means that the repository gets slower as the number of heads increases.

To work around this slowness, we periodically remove old heads. We do this by performing dummy merges. The procedure for this is as follows:

# Clone the Try repo. This will be very slow unless --uncompressed is used.
$ hg clone --uncompressed -U https://hg.mozilla.org/try
$ cd try
# Verify heads to merge (this could take a while on first run)
$ hg log -r 'head() and branch(default) and not public()'
# Capture the list of heads to merge
$ hg log -r 'head() and branch(default) and not public()' -T '{node}\n' > heads
# Update the working directory to the revision to be merged into. A recent
# mozilla-central revision is typically fine.
$ hg up <revision>
# Do the merge by invoking `hg debugsetparents` repeatedly
$ for p2 in `cat heads`; do echo $p2; hg debugsetparents . $p2; hg commit -m 'Merge try head'; done
# Push to try without scheduling any jobs
# You may wish to post in Matrix or Slack with a notice as well
$ hg push -r . ssh://hg.mozilla.org/try

Clonebundles Management

Various repositories have their content snapshotted and uploaded to S3. These snapshots (bundles in Mercurial parlance) are advertised via the Mercurial server to clients and are used to seed initial clones. See Cloning from Pre-Generated Bundles for more.

From an operational perspective, bundle generation is triggered by the hg-bundle-generate.service and hg-bundle-generate.timer systemd units on the master server. This essentially runs the generate-hg-s3-bundles script. Its configuration lives in the script itself as well as /repo/hg/bundles/repos (which lists the repos to operate on and their bundle generation settings).
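To see when bundle generation last ran and when it will next fire, the systemd units can be inspected directly on the master; a minimal sketch:

# Last/next activation of the timer
$ systemctl list-timers hg-bundle-generate.timer
# Output of the most recent generation run
$ journalctl --unit hg-bundle-generate.service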

The critical outputs of periodic bundle generation are the objects uploaded to S3 (to multiple buckets in various AWS regions) and the advertisement of their URLs in per-repo .hg/clonebundles.manifest files. Essentially, for each repo:

  1. Bundles are generated
  2. Bundles are uploaded to multiple S3 buckets
  3. clonebundles.manifest is updated to advertise newly-uploaded URLs
  4. clonebundles.manifest is replicated from hgssh to hgweb mirrors
  5. Clients access clonebundles.manifest as part of hg clone and start requesting referenced URLs.

If bundle generation fails, it isn’t the end of the world: the old bundles just aren’t as up to date as they could be.

Important

The S3 buckets have automatic 7 day expiration of objects. The assumption is that bundle generation completes successfully at least once a week. If bundle generation doesn’t run for 7 days, the objects referenced in clonebundles.manifest files will expire and clients will encounter HTTP 404 errors.

In the event that a bundle is corrupted, manual intervention may be required to mitigate the problem.

As a convenience, a backup of the .hg/clonebundles.manifest file is created during bundle generation. It lives at .hg/clonebundles.manifest.old. If a new bundle is corrupt but an old one is valid, the mitigation is to restore from backup:

$ cp .hg/clonebundles.manifest.old .hg/clonebundles.manifest
$ /var/hg/venv_tools/bin/hg replicatesync

If a single bundle or type of bundle is corrupted or causing problems, it can be removed from the clonebundles.manifest file so clients stop seeing it.

Inside the clonebundles.manifest file are N types of bundles uploaded to M S3 buckets (plus a CDN URL). The bundle types can be identified by the BUNDLESPEC value of each entry. For example, if stream clone bundles are causing problems, the entries with a BUNDLESPEC containing none-packed could be removed.
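For reference, each line in clonebundles.manifest is a URL followed by key=value attributes, with BUNDLESPEC identifying the bundle type. A trimmed, hypothetical excerpt (real manifests advertise several S3/CDN URLs per bundle type, and the exact attributes vary):

https://hg.cdn.mozilla.net/mozilla-central/<node>.packed1.hg BUNDLESPEC=none-packed1;requirements%3Dgeneraldelta%2Crevlogv1
https://hg.cdn.mozilla.net/mozilla-central/<node>.zstd-max.hg BUNDLESPEC=zstd-v2

In the stream clone scenario above, every line whose BUNDLESPEC contains none-packed (the first entry here) would be the one to remove.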

Danger

Removing entries from a clonebundles.manifest can be dangerous.

The removal of entries could shift a lot of traffic from S3/CDN to the hgweb servers themselves - possibly overloading them.

The removal of a particular entry type could have performance implications for Firefox CI. For example, removing stream clone bundles will make hg clone take several minutes longer. This is often acceptable as a short-term workaround and is preferred to removing clone bundles entirely.

Important

If modifying a .hg/clonebundles.manifest file, remember to run /var/hg/venv_tools/bin/hg replicatesync to trigger the replication of that file to hgweb mirrors. Otherwise clients won’t see the changes!

Corrupted fncache File

In rare circumstances, a .hg/store/fncache file can become corrupt. This file is essentially a cache of all known files in the .hg/store directory tree.

If this file becomes corrupt, symptoms often manifest as stream clones being unable to find a file. For example, during working directory update there will be an error:

abort: No such file or directory: '<path>'

You can test the theory that the fncache file is corrupt by grepping for the missing path in the .hg/store/fncache file. There should be a <path>.i entry in the fncache file. If it is missing, the fncache file is corrupt.
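For example, if clients report a missing path browser/example.py (an illustrative path), the check would look like the following; no output means the entry is absent and the fncache is likely corrupt:

$ grep -F 'browser/example.py.i' /repo/hg/mozilla/<repo>/.hg/store/fncache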

To rebuild the fncache file:

$ sudo -u <user> /var/hg/venv_tools/bin/hg -R <repo> debugrebuildfncache

Where <user> is the user that owns the repo (typically hg) and <repo> is the local filesystem path to the repo to repair.

hg debugrebuildfncache should be harmless to run at any time. Worst case, it effectively no-ops. If you are paranoid, make a backup copy of .hg/store/fncache before running the command.

Important

Under no circumstances should .hg/store/fncache be removed or altered by hand. Doing so may result in further repository damage.

Mirrors in pushdataaggregator_groups File

On the SSH servers, the /repo/hg/pushdataaggregator_groups file lists all hgweb mirrors that must have acknowledged replication of a message before that message is re-published to the replicatedpushdata Kafka topic. This topic is then used to publish events to Pulse, SNS, etc.

When adding or removing hgweb machines from active service, this file needs to be manually updated to reflect the current set of active mirrors.

If an hgweb machine is removed and the pushdataaggregator_groups file is not updated, messages won’t be re-published to the replicatedpushdata Kafka topic. This should eventually result in an alert for lag of that Kafka topic.

If an hgweb machine is added and the pushdataaggregator_groups file is not updated, messages could be re-published to the replicatedpushdata Kafka topic before the message has been acknowledged by all replicas. This could result in clients seeing inconsistent repository state depending on which hgweb server they access.

Verifying Replication Consistency

The replication service tries to ensure that repositories on multiple servers are as identical as possible. But testing for this using standard filesystem comparison tools is difficult because some bits on disk may vary even though Mercurial data is consistent.

The hg mozrepohash command can be used to display hashes of important Mercurial data. If the output from this command is identical across machines, then the underlying repository stores should be identical.

To mass collect hashes of all repositories, you can run something like the following on an hgssh host:

$ /var/hg/version-control-tools/scripts/find-hg-repos.py /repo/hg/mozilla/ | \
  sudo -u hg -g hg parallel --progress --res /var/tmp/repohash \
  /var/hg/venv_tools/bin/hg -R /repo/hg/mozilla/{} mozrepohash

or the following on an hgweb host:

$ /var/hg/version-control-tools/scripts/find-hg-repos.py /repo/hg/mozilla/ | \
  sudo -u hg -g hg parallel --progress --res /var/tmp/repohash \
  /var/hg/venv_replication/bin/hg -R /repo/hg/mozilla/{} mozrepohash

This command will use GNU parallel to run hg mozrepohash on all repositories found by the find-hg-repos.py script and write the results into /var/tmp/repohash.

You can then rsync those results to a central machine and compare output:

$ for h in hgweb{1,2,3,4}.dmz.mdc1.mozilla.com; do \
    rsync -avz --delete-after --exclude stderr $h:/var/tmp/repohash/ $h/; \
  done

$ diff -r hgweb1.dmz.mdc1.mozilla.com hgweb2.dmz.mdc1.mozilla.com

SSH Server Services

This section describes relevant services running on the SSH servers.

An SSH server can be in 1 of 2 states: master or standby. At any one time, only a single server should be in the master state.

Some services always run on the SSH servers. Some services only run on the active master.

The standby server is in a state where it is ready to become the master at any time (such as if the master crashes).

Important

The services that run on the active master are designed to only have a single global instance. Running multiple instances of these services can result in undefined behavior or even data corruption.

Master Server Management

The current active master server is denoted by the presence of a /repo/hg/master.<hostname> file. e.g. the presence of /repo/hg/master.hgssh1.dmz.mdc1.mozilla.com indicates that hgssh1.dmz.mdc1.mozilla.com is the active master.

All services that should have only a single instance (running on the master) have systemd unit configs that prevent the unit from starting if the master.<hostname> file for the current server does not exist. So, as long as only a single master.<hostname> file exists, it should not be possible to start these services on more than one server.

The hg-master.target systemd unit provides a common target for starting and stopping all systemd units that should only be running on the active master server. The unit only starts if the /repo/hg/master.<hostname> file is present.

Note

The hg-master.target unit only tracks units specific to the master. Services like the sshd daemon processing Mercurial connections are always running and aren’t tied to hg-master.target.

The /repo/hg/master.<hostname> file is monitored every few seconds by the hg-master-monitor.timer and associated /var/hg/version-control-tools/scripts/hg-master-start-stop script. This script looks at the status of the /repo/hg/master.<hostname> file and the hg-master.target unit and reconciles the state of hg-master.target with what is wanted.

For example, if /repo/hg/master.hgssh1.dmz.mdc1.mozilla.com exists and hg-master.target isn’t active, hg-master-start-stop will start hg-master.target. Similarly, if /repo/hg/master.hgssh1.dmz.mdc1.mozilla.com is deleted, hg-master-start-stop will ensure hg-master.target (and all associated services by extension) are stopped.

So, the process for transitioning master-only services from one machine to another is to delete the old master.<hostname> file and then create a master.<hostname> file for the new master.

Important

Since hg-master-monitor.timer only fires every few seconds and stopping services may take several seconds, one should wait at least 60s between removing one master.<hostname> file and creating a new one for another server. This limitation could be improved with more advanced service state tracking.
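A minimal sketch of the transition, assuming hgssh1 is the current master and a hypothetical hgssh2 host is taking over:

# On the current master: triggers hg-master-monitor to stop hg-master.target
$ rm /repo/hg/master.hgssh1.dmz.mdc1.mozilla.com
# Allow the monitor to run and master-only services to stop
$ sleep 60
# On the new master: allows hg-master.target to start on its next monitor run
$ touch /repo/hg/master.hgssh2.dmz.mdc1.mozilla.com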

sshd_hg.service

This systemd service provides the SSH server for accepting external SSH connections that connect to Mercurial.

This is different from the system’s SSH service (sshd.service). The differences from a typical SSH service are as follows:

  • The service is running on port 222 (not port 22)
  • SSH authorized keys are looked up in LDAP (not using the system auth)
  • All logins are processed via pash, a custom Python script that dispatches to Mercurial or performs other administrative tasks.

This service should always be running on all servers, even if they aren’t the master. This means that hg-master.target does not control this service.

hg-bundle-generate.timer and hg-bundle-generate.service

These systemd units are responsible for creating Mercurial bundles for popular repositories and uploading them to S3. The bundles it produces are also available on a CDN at https://hg.cdn.mozilla.net/.

These bundles are advertised by Mercurial repositories to facilitate bundle-based cloning, which drastically reduces the load on the hg.mozilla.org servers.

This service only runs on the master server.

pushdataaggregator-pending.service

This systemd service monitors the state of the replication mirrors and copies fully acknowledged/applied messages into a new Kafka topic (replicatedpushdatapending).

The replicatedpushdatapending topic is watched by the vcsreplicator-headsconsumer process on the hgweb machines.

This service only runs on the master server.

pushdataaggregator.service

This systemd service monitors the state of the replication mirrors and copies fully acknowledged/applied messages into a new Kafka topic (replicatedpushdata).

The replicatedpushdata topic is watched by other services to react to repository events. So if this service stops working, other services will likely sit idle.

This service only runs on the master server.

pulsenotifier.service

This systemd service monitors the replicatedpushdata Kafka topic and sends messages to Pulse to advertise repository events.

For more, see Change Notifications.

The Pulse notifications this service sends are relied upon by various applications at Mozilla. If it stops working, a lot of services don’t get notifications and things stop working.

This service only runs on the master server.

snsnotifier.service

This systemd service monitors the replicatedpushdata Kafka topic and sends messages to Amazon S3 and SNS to advertise repository events.

For more, see Change Notifications.

This service is essentially identical to pulsenotifier.service except it publishes to Amazon services, not Pulse.

unifyrepo.service

This systemd service periodically aggregates the contents of various repositories into other repositories.

This service and the repositories it writes to are currently experimental.

This service only runs on the master server.

Monitoring and Alerts

hg.mozilla.org is monitored by Nagios.

check_hg_bundle_generate_age

This check monitors the last generation time of clone bundles. The check is a simple wrapper around the check_file_age check. It monitors the age of the /repo/hg/bundles/lastrun file. This file should be touched every ~24h when the hg-bundle-generate.service unit completes.

Remediation

If this alert fires, it means the hg-bundle-generate.service unit hasn’t completed successfully in over a day. This failure is non-urgent, but it needs to be investigated within 5 days.

A bug against the hg.mozilla.org service operator should be filed. The alert can be acknowledged once a bug is on file.

If the alert turns critical and an hg.mozilla.org service operator has not acknowledged the alert’s existence, attempts should be made to page a service operator. The paging can be deferred until waking hours for the person being paged, as the alert does not represent an immediate issue. The important thing is that the appropriate people are made aware of the alert so they can fix it.

check_zookeeper

check_zookeeper monitors the health of the ZooKeeper ensemble running on various servers. The check is installed on each server running ZooKeeper.

The check verifies 2 distinct things: the health of an individual ZooKeeper node and the overall health of the ZooKeeper ensemble (cluster of nodes). Both types of checks should be configured where this check is running.

Expected Output

When everything is functioning as intended, the output of this check should be:

zookeeper node and ensemble OK

Failures of Individual Nodes

A series of checks will be performed against the individual ZooKeeper node. The following error conditions are possible:

NODE CRITICAL - not responding “imok”: <response>
The check sent a ruok request to ZooKeeper and the server failed to respond with imok. This typically means the node is in some kind of failure state.
NODE CRITICAL - not in read/write mode: <mode>
The check sent an isro request to ZooKeeper and the server did not respond with rw. This means the server is not accepting writes. This typically means the node is in some kind of failure state.
NODE WARNING - average latency higher than expected: <got> > <expected>
The average latency to service requests since the last query is higher than the configured limit. This node is possibly under higher-than-expected load.
NODE WARNING - open file descriptors above percentage limit: <value>

The underlying Java process is close to running out of available file descriptors.

We should never see this alert in production.

If any of these node errors is seen, notify #vcs and the on-call person for these servers.
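To inspect a node by hand, the same four-letter commands used by the check can be sent directly to the ZooKeeper client port (2181, matching the zookeeper.connect examples later in this document):

# A healthy node answers "imok"
$ echo ruok | nc localhost 2181
# A node accepting writes answers "rw"
$ echo isro | nc localhost 2181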

Failures of Overall Ensemble

A series of checks is performed against the ZooKeeper ensemble to check for overall health. These checks are installed on each server running ZooKeeper even though the check is seemingly redundant. The reason is each server may have a different perspective on ensemble state due to things like network partitions. It is therefore important for each server to perform the check from its own perspective.

The following error conditions are possible:

ENSEMBLE WARNING - node (HOST) not OK: <state>

A node in the ZooKeeper ensemble is not returning imok to an ruok request.

As long as this only occurs on a single node at a time, the overall availability of the ZooKeeper ensemble is not compromised and the service should continue to operate without interruption. If the operation of the ensemble is compromised, a different error condition with a critical failure should be raised.

ENSEMBLE WARNING - socket error connecting to HOST: <error>

We were unable to speak to a host in the ensemble.

This error can occur if ZooKeeper is not running on a node it should be running on.

As long as this only occurs on a single node at a time, the overall availability of the ZooKeeper ensemble is not compromised.

ENSEMBLE WARNING - node (HOST) is alive but not available

A ZooKeeper server is running but it isn’t healthy.

This likely only occurs when the ZooKeeper ensemble is not fully available.

ENSEMBLE CRITICAL - unable to find leader node; ensemble likely not writable

We were unable to identify a leader node in the ZooKeeper ensemble.

This error almost certainly means the ZooKeeper ensemble is down.

ENSEMBLE WARNING - only have X/Y expected followers

This warning occurs when one or more nodes in the ZooKeeper ensemble aren’t present and following the leader node.

As long as we still have a quorum of nodes in sync with the leader, the overall state of the ensemble should not be compromised.

ENSEMBLE WARNING - only have X/Y in sync followers

This warning occurs when one or more nodes in the ZooKeeper ensemble aren’t in sync with the leader node.

This warning likely occurs after a node was restarted or experienced some kind of event that caused it to get out of sync.

check_vcsreplicator_lag

check_vcsreplicator_lag monitors the replication log to see if consumers are in sync.

This check runs on every host that runs the replication log consumer daemon, which is every hgweb machine. The check is only monitoring the state of the host it runs on.

The replication log consists of N independent partitions. Each partition is its own log of replication events. There exist N daemon processes on each consumer host. Each daemon process consumes a specific partition. Events for any given repository are always routed to the same partition.

Consumers maintain an offset into the replication log marking how many messages they’ve consumed. When there are more messages in the log than the consumer has marked as applied, the log is said to be lagging. A lagging consumer is measured by the count of messages it has not yet consumed and by the elapsed time since the first unconsumed message was created. Time is the more important lag indicator because the replication log can contain many small messages that apply instantaneously and thus don’t really constitute a notable lag.

When the replication system is working correctly, messages written by producers are consumed within milliseconds on consumers. However, some messages may take several seconds to apply. Consumers do not mark a message as consumed until they have successfully applied it. Therefore, there is always a window between event production and marking it as consumed where consumers are out of sync.

Expected Output

When a host is fully in sync with the replication log, the check will output the following:

OK - 8/8 consumers completely in sync

OK - partition 0 is completely in sync (X/Y)
OK - partition 1 is completely in sync (W/Z)
...

This prints the count of partitions in the replication log and the consuming offset of each partition.

When a host has some partitions that are slightly out of sync with the replication log, we get a slightly different output:

OK - 2/8 consumers out of sync but within tolerances

OK - partition 0 is 1 messages behind (0/1)
OK - partition 0 is 1.232 seconds behind
OK - partition 1 is completely in sync (32/32)
...

Even though consumers are slightly behind replaying the replication log, the drift is within tolerances, so the check is reporting OK. However, the state of each partition’s lag is printed for forensic purposes.

Warning and Critical Output

The monitor alerts when the lag of any one partition of the replication log is too great. As mentioned above, lag is measured in message count and time since the first unconsumed message was created. Time is the more important lag indicator.

When a partition/consumer is too far behind, the monitor will issue a WARNING or CRITICAL alert depending on how far behind consumers are. The output will look like:

WARNING - 2/8 partitions out of sync

WARNING - partition 0 is 15 messages behind (10/25)
OK - partition 0 is 5.421 seconds behind
OK - partition 1 is completely in sync (34/34)
...

The first line will contain a summary of all partitions’ sync status. The following lines will print per-partition state.

The check will also emit a warning when there appears to be clock drift between the producer and the consumer:

WARNING - 0/8 partitions out of sync
OK - partition 0 is completely in sync (25/25)
WARNING - clock drift of -1.234s between producer and consumer
OK - partition 1 is completely in sync (34/34)
...

Remediation to Consumer Lag

If everything is functioning properly, a lagging consumer will self correct on its own: the consumer daemon is just behind (due to high load, slow network, etc) and it will catch up over time.

In some rare scenarios, there may be a bug in the consumer daemon that has caused it to crash or enter an endless loop or some such. To check for this, first look at systemd to see if all the consumer daemons are running:

$ systemctl status vcsreplicator@*.service

If any of the processes aren’t in the active (running) state, the consumer for that partition has crashed for some reason. Try to start it back up:

$ systemctl start vcsreplicator@*.service

You might want to take a look at the logs in the journal to make sure the process is happy:

$ journalctl -f --unit vcsreplicator@*.service

If there are errors starting the consumer process (including if the consumer process keeps restarting due to crashing applying the next available message), then we’ve encountered a scenario that will require a bit more human involvement.

Important

At this point, it might be a good idea to ping people in #vcs or page Developer Services on Call, as they are the domain experts.

If the consumer daemon is stuck in an endless loop trying to apply the replication log, there are generally two ways out:

  1. Fix the condition causing the endless loop.
  2. Skip the message.

We don’t yet know of correctable conditions causing endless loops. So, for now the best we can do is skip the message and hope the condition doesn’t come back:

$ /var/hg/venv_replication/bin/vcsreplicator-consumer /etc/mercurial/vcsreplicator.ini --skip --partition N

Note

The --partition argument is semi-important: it says which Kafka partition to pull the to-be-skipped message from. The number should be the value from the systemd service that is failing / reporting lag.
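For example, if vcsreplicator@2.service is the unit that is crashing or reported as lagging, the skip would target partition 2 (a sketch):

$ /var/hg/venv_replication/bin/vcsreplicator-consumer /etc/mercurial/vcsreplicator.ini --skip --partition 2
$ systemctl start vcsreplicator@2.service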

Important

Skipping messages could result in the repository replication state getting out of whack.

If this only occurred on a single machine, consider taking the machine out of the load balancer until the incident is investigated by someone in #vcs.

If this occurred globally, please raise awareness ASAP.

Important

If you skip a message, please file a bug in Developer Services :: hg.mozilla.org with details of the incident so the root cause can be tracked down and the underlying bug fixed.

check_vcsreplicator_pending_lag

check_vcsreplicator_pending_lag monitors the replication log to see whether the vcsreplicator-headsconsumer process has processed all available messages.

This check is similar to check_vcsreplicator_lag except it is monitoring the processing of the replicatedpushdatapending topic as performed by the vcsreplicator-headsconsumer process.

Expected Output

When a host is fully in sync with the replication log, the check will output the following:

OK - 1/1 consumers completely in sync

OK - partition 0 is completely in sync (X/Y)

When a host has some partitions that are slightly out of sync with the replication log, we get a slightly different output:

OK - 1/1 consumers out of sync but within tolerances

OK - partition 0 is 1 messages behind (0/1)
OK - partition 0 is 1.232 seconds behind

Even though consumers are slightly behind replaying the replication log, the drift is within tolerances, so the check is reporting OK. However, the state of each partition’s lag is printed for forensic purposes.

Warning and Critical Output

The monitor alerts when the lag of the replication log is too great. Lag is measured in message count and time since the first unconsumed message was created. Time is the more important lag indicator.

When a partition/consumer is too far behind, the monitor will issue a WARNING or CRITICAL alert depending on how far behind consumers are. The output will look like:

WARNING - 1/1 partitions out of sync

WARNING - partition 0 is 15 messages behind (10/25)
OK - partition 0 is 5.421 seconds behind

The check will also emit a warning when there appears to be clock drift between the producer and the consumer:

WARNING - 0/1 partitions out of sync
OK - partition 0 is completely in sync (25/25)
WARNING - clock drift of -1.234s between producer and consumer

Remediation to Consumer Lag

Because of the limited functionality performed by the vcsreplicator-headsconsumer process, this alert should never fire.

If this alert fires, the likely cause is the vcsreplicator-headsconsumer process / vcsreplicator-heads.service daemon has crashed. Since this process operates mostly identically across machines, it is expected that a failure will occur on all servers, not just 1.

First check the status of the daemon process:

$ systemctl status vcsreplicator-heads.service

If the service isn’t in the active (running) state, the consumer daemon has crashed for some reason. Try to start it:

$ systemctl start vcsreplicator-heads.service

You might want to take a look at the logs in the journal to make sure the process is happy:

$ journalctl -f --unit vcsreplicator-heads.service

If there are errors starting the consumer process (including if the consumer process keeps restarting due to crashing applying the next available message), then we’ve encountered a scenario that will require a bit more human involvement.

Important

If the service is not working properly after restart, escalate to VCS on call.

check_pushdataaggregator_pending_lag

check_pushdataaggregator_pending_lag monitors the lag of the aggregated replication log (the pushdataaggregator-pending.service systemd service).

The check verifies that the aggregator service has copied all fully replicated messages to the replicatedpushdatapending Kafka topic.

The check will alert if the number of outstanding ready-to-copy messages exceeds configured thresholds.

Important

If messages aren’t being copied into the aggregated message log, recently pushed changesets won’t be exposed on https://hg.mozilla.org/.

Expected Output

Normal output will say that all messages have been copied and all partitions are in sync or within thresholds:

OK - aggregator has copied all fully replicated messages

OK - partition 0 is completely in sync (1/1)
OK - partition 1 is completely in sync (1/1)
OK - partition 2 is completely in sync (1/1)
OK - partition 3 is completely in sync (1/1)
OK - partition 4 is completely in sync (1/1)
OK - partition 5 is completely in sync (1/1)
OK - partition 6 is completely in sync (1/1)
OK - partition 7 is completely in sync (1/1)

Failure Output

The check will print a summary line indicating total number of messages behind and a per-partition breakdown of where that lag is. e.g.:

CRITICAL - 2 messages from 2 partitions behind

CRITICAL - partition 0 is 1 messages behind (1/2)
OK - partition 1 is completely in sync (1/1)
CRITICAL - partition 2 is 1 messages behind (1/2)
OK - partition 3 is completely in sync (1/1)
OK - partition 4 is completely in sync (1/1)
OK - partition 5 is completely in sync (1/1)
OK - partition 6 is completely in sync (1/1)
OK - partition 7 is completely in sync (1/1)

See https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmo/ops.html
for details about this check.

Remediation to Check Failure

If the check is failing, first verify the Kafka cluster is operating as expected. If it isn’t, other alerts on the hg machines should be firing. Failures in this check can likely be ignored if the Kafka cluster is in a known bad state.

If there are no other alerts, there is a chance the daemon process has become wedged. Try bouncing the daemon:

$ systemctl restart pushdataaggregator-pending.service

Then wait a few minutes to see if the lag decreased. You can also look at the journal to see what the daemon is doing:

$ journalctl -f --unit pushdataaggregator-pending.service

If things are failing, escalate to VCS on call.

check_pushdataaggregator_lag

check_pushdataaggregator_lag monitors the lag of the aggregated replication log (the pushdataaggregator.service systemd service).

The check verifies that the aggregator service has copied all fully replicated messages to the unified, aggregate Kafka topic.

The check will alert if the number of outstanding ready-to-copy messages exceeds configured thresholds.

Important

If messages aren’t being copied into the aggregated message log, derived services such as Pulse notification won’t be writing data.

Expected Output

Normal output will say that all messages have been copied and all partitions are in sync or within thresholds:

OK - aggregator has copied all fully replicated messages

OK - partition 0 is completely in sync (1/1)

Failure Output

The check will print a summary line indicating total number of messages behind and a per-partition breakdown of where that lag is. e.g.:

CRITICAL - 1 messages from 1 partitions behind

CRITICAL - partition 0 is 1 messages behind (1/2)

See https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmo/ops.html
for details about this check.

Remediation to Check Failure

If the check is failing, first verify the Kafka cluster is operating as expected. If it isn’t, other alerts on the hg machines should be firing. Failures in this check can likely be ignored if the Kafka cluster is in a known bad state.

If there are no other alerts, there is a chance the daemon process has become wedged. Try bouncing the daemon:

$ systemctl restart pushdataaggregator.service

Then wait a few minutes to see if the lag decreased. You can also look at the journal to see what the daemon is doing:

$ journalctl -f --unit pushdataaggregator.service

If things are failing, escalate to VCS on call.

check_pulsenotifier_lag

check_pulsenotifier_lag monitors the lag of Pulse Change Notifications in reaction to server events.

The check is very similar to check_vcsreplicator_lag. It monitors the same class of thing under the hood: that a Kafka consumer has read and acknowledged all available messages.

For this check, the consumer daemon is the pulsenotifier service running on the master server. It is a systemd service (pulsenotifier.service). Its logs are in /var/log/pulsenotifier.log.

Expected Output

There is a single consumer and partition for the pulse notifier Kafka consumer. So, expected output is something like the following:

OK - 1/1 consumers completely in sync

OK - partition 0 is completely in sync (159580/159580)

See https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmo/ops.html
for details about this check.

Remediation to Check Failure

There are 3 main categories of check failure:

  1. pulse.mozilla.org is down
  2. The pulsenotifier daemon has crashed or wedged
  3. The hg.mozilla.org Kafka cluster is down

Looking at the last few lines of /var/log/pulsenotifier.log should indicate reasons for the check failure.

If Pulse is down, the check should be acked until Pulse service is restored. The Pulse notification daemon should recover on its own.

If the pulsenotifier daemon has crashed, try restarting it:

$ systemctl restart pulsenotifier.service

If the hg.mozilla.org Kafka cluster is down, lots of other alerts are likely firing. You should alert VCS on call.

In some cases, pulsenotifier may repeatedly crash due to a malformed input message, bad data, or some such. Essentially, the process encounters bad input, crashes, restarts via systemd, encounters the same message again, crashes, and the cycle repeats until systemd gives up. This scenario should be rare, which is why the daemon doesn’t ignore bad messages (ignoring messages could lead to data loss).

If the daemon becomes wedged on a specific message, you can tell the daemon to skip the next message by running:

$ /var/hg/venv_tools/bin/vcsreplicator-pulse-notifier --skip /etc/mercurial/notifications.ini

This command will print a message like:

skipped hg-repo-init-2 message in partition 0 for group pulsenotifier

The command then exits. If necessary, restart the daemon via:

$ systemctl start pulsenotifier.service

Repeat as many times as necessary to clear through the bad messages.

Important

If you skip messages, please file a bug against Developer Services :: hg.mozilla.org and include the systemd journal output for pulsenotifier.service showing the error messages.

check_snsnotifier_lag

check_snsnotifier_lag monitors the lag of Amazon SNS Change Notifications in reaction to server events.

This check is essentially identical to check_pulsenotifier_lag except it monitors the service that posts to Amazon SNS as opposed to Pulse. Both services share common code. So if one service is having problems, there’s a good chance the other service is as well.

The consumer daemon being monitored by this check is tied to the snsnotifier.service systemd service. Its logs are in /var/log/snsnotifier.log.

Expected Output

Output is essentially identical to check_pulsenotifier_lag.

Remediation to Check Failure

Remediation is essentially identical to check_pulsenotifier_lag.

The main differences are the names of the services impacted.

The systemd service is snsnotifier.service. The daemon process is /var/hg/venv_tools/bin/vcsreplicator-sns-notifier.

Adding/Removing Nodes from Zookeeper and Kafka

When new servers are added or removed, the Zookeeper and Kafka clusters may need to be rebalanced. This typically only happens when servers are replaced.

The process is complicated and requires a number of manual steps. It shouldn’t be performed frequently enough to justify automating it.

Adding a new server to Zookeeper and Kafka

The first step is to assign a Zookeeper ID in Ansible. See https://hg.mozilla.org/hgcustom/version-control-tools/rev/da8687458cd1 for an example commit. Find the next available integer that hasn’t been used before. This is typically N+1 where N is the last entry in that file.

Note

Assigning a Zookeeper ID has the side-effect of enabling Zookeeper and Kafka on the server. On the next deploy, Zookeeper and Kafka will be installed.

Deploy this change via ./deploy hgmo.

During the deploy, some Nagios alerts may fire saying the Zookeeper ensemble is missing followers. e.g.:

hg is WARNING: ENSEMBLE WARNING - only have 4/5 expected followers

This is because as the deploy is performed, we’re adding references to the new Zookeeper server before it is actually started. These warnings should be safe to ignore.

Once the deploy finishes, start Zookeeper on the new server:

$ systemctl start zookeeper.service

Nagios alerts for the Zookeeper ensemble should clear after Zookeeper has started on the new server.

Wait a minute or so then start Kafka on the new server:

$ systemctl start kafka.service

At this point, Zookeeper and Kafka are both running and part of their respective clusters. Everything is in a mostly stable state at this point.

Rebalancing Kafka Data to the New Server

When the new Kafka node comes online, it will be part of the Kafka cluster but it won’t have any data. In other words, it won’t really be used (unless a cluster event such as creation of a new topic causes data to be assigned to it).

To have the new server actually do something, we’ll need to run some Kafka tools to rebalance data.

The tool used to rebalance data is /opt/kafka/bin/kafka-reassign-partitions.sh. It has 3 modes of operation, all of which we’ll use:

  1. Generate a reassignment plan
  2. Execute a reassignment plan
  3. Verify reassignments have completed

All command invocations require a --zookeeper argument defining the Zookeeper servers to connect to. The value for this argument should be the zookeeper.connect variable from /etc/kafka/server.properties. e.g. localhost:2181/hgmoreplication. If this value doesn’t match exactly, the --generate step may emit empty output and other operations may fail.

The first step is to generate a JSON document that will be used to perform data reassignment. To do this, we need a list of broker IDs to move data to and a JSON file listing the topics to move.

The list of broker IDs is the set of Zookeeper IDs as defined in ansible/group_vars/hgmo (this is the file you changed earlier to add the new server). Simply select the servers you wish for data to exist on. e.g. 14,15,16,17,20.

The JSON file denotes which Kafka topics should be moved. Typically every known Kafka topic is moved. Use the following as a template:

{
  "topics": [
    {"topic": "pushdata"},
    {"topic": "replicatedpushdata"},
    {"topic": "replicatedpushdatapending"}
  ],
  "version": 1
}

Hint

You can find the set of active Kafka topics by doing an ls /var/lib/kafka/logs and looking at directory names.

Once you have all these pieces of data, you can run kafka-reassign-partitions.sh to generate a proposed reassignment plan:

$ /opt/kafka/bin/kafka-reassign-partitions.sh \
  --zookeeper <hosts> \
  --generate \
  --broker-list <list> \
  --topics-to-move-json-file topics.json

This will output 2 JSON blocks:

Current partition replica assignment

{...}
Proposed partition reassignment configuration

{...}

You’ll need to copy and paste the 2nd JSON block (the proposed reassignment) to a new file, let’s say reassignments.json.

Then we can execute the data reassignment:

$ /opt/kafka/bin/kafka-reassign-partitions.sh \
  --zookeeper <hosts> \
  --execute \
  --reassignment-json-file reassignments.json

Data reassignment can take up to several minutes. We can see the status of the reassignment by running:

$ /opt/kafka/bin/kafka-reassign-partitions.sh \
  --zookeeper <hosts> \
  --verify \
  --reassignment-json-file reassignments.json

If your intent was to move Kafka data off a server, you can verify data has been removed by looking in the /var/lib/kafka/logs data on that server. If there is no topic/partition data, there should be no sub-directories in that directory. If there are sub-directories (they have the form topic-<N>), adjust your topics.json file, generate a new reassignments.json file and execute a reassignment.

Removing an old Kafka Node

Once data has been removed from a Kafka node, it can safely be turned off.

The first step is to remove the server from the Zookeeper/Kafka list in Ansible. See https://hg.mozilla.org/hgcustom/version-control-tools/rev/adc5024917c7 for an example commit.

Deploy this change via ./deploy hgmo.

Next, stop Kafka and Zookeeper on the server:

$ systemctl stop kafka.service
$ systemctl stop zookeeper.service

At this point, the old Kafka/Zookeeper node is shut down and should no longer be referenced.

Clean up by disabling the systemd services:

$ systemctl disable kafka.service
$ systemctl disable zookeeper.service

Kafka Nuclear Option

If Kafka and/or Zookeeper lose quorum or the state of the cluster gets out of sync, it might be necessary to reset the cluster.

A hard reset of the cluster is the nuclear option: full data wipe and starting the cluster from scratch.

A full reset consists of the following steps:

  1. Stop all Kafka consumers and writers
  2. Stop all Kafka and Zookeeper processes
  3. Remove all Kafka and Zookeeper data
  4. Define Zookeeper ID on each node
  5. Start Zookeeper 1 node at a time
  6. Start Kafka 1 node at a time
  7. Start all Kafka consumers and writers

To stop all Kafka consumers and writers:

# hgweb*
$ systemctl stop vcsreplicator@*.service

# hgssh*
$ systemctl stop hg-master.target

You will also want to make all repositories read-only by creating the /repo/hg/readonlyreason file (and having the content say that pushes are disabled for maintenance reasons).

To stop all Kafka and Zookeeper processes:

$ systemctl stop kafka.service
$ systemctl stop zookeeper.service

To remove all Kafka and Zookeeper data:

$ rm -rf /var/lib/kafka /var/lib/zookeeper

To define the Zookeeper ID on each node (the /var/lib/zookeeper/myid file), perform an Ansible deploy:

$ ./deploy hgmo

Note

The deploy may fail to create some Kafka topics. This is OK.

Then, start Zookeeper one node at a time:

$ systemctl start zookeeper.service

Then, start Kafka one node at a time:

$ systemctl start kafka.service

At this point, the Kafka cluster should be running. Perform an Ansible deploy again to create necessary Kafka topics:

$ ./deploy hgmo

At this point, the Kafka cluster should be fully capable of handling hg.mo events. Nagios alerts related to Kafka and Zookeeper should clear.

You can now start consumer daemons:

# hgweb
$ systemctl start vcsreplicator@*.service

# hgssh
$ systemctl start hg-master.target

When starting the consumer daemons, look at the journal logs for any issues connecting to Kafka.

As soon as the daemons start running, all Nagios alerts for the systems should clear.

Finally, make repositories pushable again:

$ rm /repo/hg/readonlyreason