Mailcow post-migration DKIM issues (550-5.7.26)
Anyone who enters the realm of hosting their own email server knows there are a metric tonne of potential issues and downsides, but with enough research it really can pay off.
I've been using Mailcow to host my own email server since October 2019 and it has been smooth sailing. I have noted four issues between October 2019 and June 2021, two of which were my fault. Only one of these issues caused downtime and it was fixed in 5 minutes. That's pretty good uptime for something that most people say isn't worth the effort!
Mailcow is a 'dockerized' mail stack containing everything you need to get an email server up and running. It includes a DNS resolver, anti-virus and the powerful anti-spam program Rspamd. One thing that tops off Mailcow is how easy it is to update: it has a built-in updater script which can be run via console/SSH, and it pulls the latest containers and keeps everything up to date and secure with minimal downtime.
Initially I used a Scaleway low-cost development instance to host the server, but I decided to co-locate a physical server and migrate the Mailcow instance. The migration was so easy that I was shocked when everything just worked on the other server!
Roll on a few months and a friend tells me that he is getting a bit of spam that isn't being detected by the spam filter. "No big deal", he said. I hadn't seen an increase in spam on any other domain, so I figured it was just a one-off.
A week later I try to send an email to three friends and get one bounce from a Microsoft Office 365 customer... Remote Server returned '550-5.7.26 Unauthenticated email from domain.com is not accepted due to domain's 550-5.7.26 DMARC policy.'
A number of other emails make it through OK, so this receiver must be particularly paranoid.
A test email to check-auth@verifier.port25.com returns 'DKIM check: none'.
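Besides the verifier report, many receiving servers record their DKIM verdict in an Authentication-Results header on the delivered message. The sketch below is only an illustration of reading that header with Python's standard library; the sample message, the host names and both function names are made up for the example and are not part of Mailcow or the verifier service.

```python
import re
from email import message_from_string

# Hypothetical raw message, roughly as it might look in a recipient's mailbox.
SAMPLE = (
    "Authentication-Results: mx.example.net;\n"
    " dkim=none (message not signed);\n"
    " spf=pass smtp.mailfrom=example.com\n"
    "From: sender@example.com\n"
    "To: friend@example.net\n"
    "Subject: test\n"
    "\n"
    "Hello\n"
)


def dkim_results(raw_message):
    """Return every dkim= verdict found in Authentication-Results headers."""
    msg = message_from_string(raw_message)
    verdicts = []
    # A message can carry several such headers, one per server that checked it.
    for header in msg.get_all("Authentication-Results", []):
        verdicts.extend(re.findall(r"\bdkim=(\w+)", str(header)))
    return verdicts


def is_signed_and_passing(raw_message):
    """True if at least one receiving server reported dkim=pass."""
    return "pass" in dkim_results(raw_message)
```

Calling dkim_results(SAMPLE) on the unsigned sample yields a 'none' verdict, which corresponds to the symptom described above.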
Something is up!
A quick run of the update script doesn't fix it, so we're in for a late night of troubleshooting!
Old support posts (c. 2017) for DKIM-related issues with Mailcow aren't much use as the keys are now stored in Redis, so that's a bust.
I ran git diff origin/master to check whether I had a configuration issue, but it's pretty much standard.
Checking the docker-compose logs, I found that Rspamd would crash a few times when an email was sent ('rspamd_crash_sig_handler: caught fatal signal 11(Segmentation fault)'), and Postfix would show 'lost connection after EHLO' three times while an email was 'Sending' in Outlook.
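Spotting these two messages by eye in a busy log is tedious, so here is a rough sketch of counting them automatically. The two patterns come straight from the messages quoted above; the function name, the sample lines and their container prefixes are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

# Patterns taken from the two log messages quoted in this post.
PATTERNS = {
    "rspamd_segfault": re.compile(r"caught fatal signal 11"),
    "postfix_lost_connection": re.compile(r"lost connection after EHLO"),
}

# Hypothetical log excerpt; real prefixes and surrounding text will differ.
SAMPLE_LOG = [
    "rspamd_1 | rspamd_crash_sig_handler: caught fatal signal 11(Segmentation fault)",
    "postfix_1 | lost connection after EHLO from unknown[192.0.2.10]",
    "postfix_1 | connect from unknown[192.0.2.10]",
    "rspamd_1 | rspamd_crash_sig_handler: caught fatal signal 11(Segmentation fault)",
]


def count_symptoms(lines):
    """Count how many lines match each known symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Running count_symptoms over a saved copy of the logs before and after a fix gives a quick way to confirm that both symptoms have actually gone away.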
Then, a bit of research on the Mailcow community forum led me to a thought... this system is containerised for a reason - let's just remove the Rspamd volume and start it again. It doesn't contain any data I can't live without.
I'll add a note here: always back up before doing an action like this... some 'volume rm' commands will render emails unreadable.
Once the container restarted I sent a test email to check-auth@verifier.port25.com and saw 'DKIM check: pass'.
Emails are now DKIM-signed and, because Rspamd isn't crashing, Outlook no longer loses its connection before the email is handed to Postfix, so emails send much quicker.
Hopefully if you migrate your Mailcow instance and you start seeing issues, this post might help.