Mailcow post-migration DKIM issues (550-5.7.26)


Anyone who enters the realm of hosting their own email server knows there are a metric tonne of potential issues and downsides, but with enough research it really can pay off.

I've been using Mailcow to host my own email server since October 2019, and it has been smooth sailing. I noted four issues between October 2019 and June 2021, two of which were my fault. Only one of these issues caused downtime, and it was fixed in 5 minutes. That's pretty good uptime for something that most people say isn't worth the effort!

Mailcow is a 'dockerized' mail stack, containing everything you need to get an email server up and running. It includes a DNS resolver, anti-virus and the powerful anti-spam program Rspamd. One thing that tops off Mailcow is how easy it is to update - it has a built-in updater script which can be run via console/SSH. It pulls the latest containers and keeps everything up to date and secure with minimal downtime.

Initially I used a Scaleway low-cost development instance to host the server, but I decided to co-locate a physical server and migrate the Mailcow instance to it. The migration was so easy that I was shocked when everything just worked on the other server!

Roll on a few months, a friend tells me that they are getting a bit of spam but it's not being detected by the spam filter. "No big deal", he said. I hadn't seen an increase in spam on any other domain so I figure it's just a one-off.

A week later I try to send an email to three friends and get one bounce from a Microsoft Office 365 customer... Remote Server returned '550-5.7.26 Unauthenticated email from domain.com is not accepted due to domain's 550-5.7.26 DMARC policy.' A number of other emails make it through OK, so this receiver must be particularly paranoid.

A test email to check-auth@verifier.port25.com returns DKIM check: none.
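Another way to confirm the same thing locally is to check whether a received message carries a DKIM-Signature header at all. This is a minimal sketch, not part of Mailcow: has_dkim_signature is a made-up helper name, and message.eml is a placeholder for a raw message you've saved from the receiving mailbox.

```shell
#!/bin/sh
# has_dkim_signature: hypothetical helper. Reads a raw email on stdin and
# exits 0 if the header section contains a DKIM-Signature field, 1 otherwise.
has_dkim_signature() {
    awk '
        /^\r?$/ { exit }                                # first blank line ends the headers
        tolower($0) ~ /^dkim-signature:/ { found = 1 }  # header names are case-insensitive
        END { exit (found ? 0 : 1) }
    '
}

# Usage (message.eml is a placeholder file name):
#   has_dkim_signature < message.eml && echo "signed" || echo "not signed"
```

Only the header section is scanned, so a quoted DKIM-Signature line inside the message body doesn't give a false positive.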

Something is up!

A quick run of the update script doesn't fix it, so we're in for a late night of troubleshooting!

Old support posts (c.2017) for DKIM related issues with Mailcow aren't much use as the keys are now stored in redis, so that's a bust.

I ran git diff origin/master to check if I had a configuration issue - but it was pretty much standard.

Checking the docker-compose logs I found that Rspamd would crash a few times when an email was sent... rspamd_crash_sig_handler: caught fatal signal 11(Segmentation fault), plus Postfix would show lost connection after EHLO three times as an email was 'Sending' in Outlook.
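To get a feel for how often Rspamd is falling over, the segfault lines can be counted straight from the logs. This is a rough sketch with some assumptions: that the Rspamd service is called rspamd-mailcow in your docker-compose.yml (check yours), and that the crash line matches the one quoted above. count_rspamd_crashes is a made-up helper name.

```shell
#!/bin/sh
# count_rspamd_crashes: hypothetical helper. Reads log text on stdin and
# prints how many lines report a segfault caught by Rspamd.
count_rspamd_crashes() {
    # grep -c still prints 0 when nothing matches, but exits non-zero,
    # so mask the exit status to keep the function usable under `set -e`.
    grep -c 'caught fatal signal 11' || true
}

# Usage (service name is an assumption, adjust to your compose file):
#   docker-compose logs --no-color rspamd-mailcow | count_rspamd_crashes
```

Running it before and after a fix gives a quick sanity check that the crashes have actually stopped.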

Then, a bit of research on the Mailcow community forum led me to a thought... this system is containerised for a reason - let's just remove the Rspamd volume and start it again. It doesn't contain any data I can't live without.

I'll add a note here: always back up before doing something like this... some volume rm commands will render emails unreadable.

docker-compose down
docker volume rm mailcowdockerized_rspamd-vol-1
docker-compose up -d
Removing Rspamd volume and re-starting Mailcow

Once the container restarted I sent a test email to verifier.port25.com and saw DKIM check: pass.

Emails are now DKIM signed & because Rspamd isn't crashing, Outlook isn't losing a connection before the email is pushed to Postfix, resulting in emails sending much quicker.

Hopefully if you migrate your Mailcow instance and you start seeing issues, this post might help.