emailregex.com maintains a regular expression for validating email addresses, which the site says is based on the RFC 5322 specification. The regular expression contains several ASCII control characters. Their purpose wasn’t clear to me, and since SonarQube’s static code analysis flags them as unwanted, I did some research to find out why they’re there and how the regular expression works.
After several days of digging into various RFCs, especially Sections 4 and 4.1 of RFC 5322 (which cover obsolete syntax), I learned that non-printable control characters such as the following are in fact allowed in email addresses and messages:
| Code | Caret notation | Description |
|------|----------------|-------------|
| 01 | ^A | Start of heading (SOH) |
| 02 | ^B | Start of text (STX) |
| 03 | ^C | End of text (ETX) |
| 04 | ^D | End of transmission (EOT) |
Source: TechOnTheNet's ASCII Chart
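For reference, the caret notation in the table (^A, ^B, and so on) pairs each control code with the printable character 0x40 above it. A minimal sketch of that mapping, where the `toCaret` helper is a hypothetical name used only for illustration:

```javascript
// Hypothetical helper: converts a control code (0x01-0x1f) to caret notation
// by flipping bit 0x40, which turns 0x01 into 0x41 ('A'), 0x02 into 'B', etc.
const toCaret = (code) => '^' + String.fromCharCode(code ^ 0x40);

for (const code of [0x01, 0x02, 0x03, 0x04]) {
  // Prints the two-digit hex code next to its caret form, e.g. "01 ^A".
  console.log(code.toString(16).padStart(2, '0'), toCaret(code));
}
```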
To figure out why these matter, I needed a way to reproduce those characters in email addresses for a set of test cases. To see them in action, I started a Node.js REPL in the terminal and tried a simple example:
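A minimal sketch of such a statement, assuming the “Goodbye World” message quoted later in this post. The variable name and the exact position of the control character within the string are assumptions:

```javascript
// '\x07' is the ASCII BEL (bell) control character. It is not printed;
// most terminals respond to it with an audible or visual alert.
const bellMessage = 'Goodbye\x07 World';
console.log(bellMessage);
```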
When executing the above log statement, we hear a distinct beep, confirming that the terminal interprets the control character as advertised.
There are many other control characters as well. For instance, placing the ASCII control character code for a BACKSPACE in a console.log statement causes the character immediately preceding the BACKSPACE character to disappear in the output:
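As a sketch, again assuming the “Goodbye World” message and a hypothetical variable name, the BACKSPACE case looks like this:

```javascript
// '\x08' is the ASCII BACKSPACE control character. The string still holds
// all of its characters; only the terminal's rendering of it changes.
const backspaceMessage = 'Goodbye\x08 World';
console.log(backspaceMessage);
// In a typical terminal this is displayed as: Goodby World
```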
As you can see, the output reads “Goodby World”: the final ’e’ of “Goodbye” is missing.
To recap, when printing the ‘\x07’ BELL character, the full text of the message appears, “Goodbye World”, and we hear the bell sound. When the string contains the ‘\x08’ BACKSPACE character, the terminal moves the cursor back one position and the next character overwrites the preceding one, so that character appears to be deleted. This implies that these characters could do the same things if an email address or message containing them were printed in a terminal.
Now that we know how to represent these characters in strings, we can investigate exactly what emailregex.com’s regular expression does with non-printable control characters.
After analyzing the emailregex.com regular expression in a debugger, I discovered that it allows certain non-printable control characters in quoted strings in the local-part (the portion to the left of the @ symbol), and it allows those same characters in the domain part, within brackets, after an IP address followed by a colon. For instance, both of these email addresses are considered valid by the regular expression:
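As a sketch of what addresses of this shape look like in code: the two addresses and the `reveal` helper below are hypothetical, chosen only to fit the description above.

```javascript
const BEL = '\x07';

// Control character inside a quoted local-part.
const quotedLocalPart = `"john${BEL}doe"@example.com`;

// Control character after the colon in a bracketed domain literal.
const bracketedDomain = `john@[192.168.1.1:${BEL}]`;

// Hypothetical helper: replaces each control character with a visible \xNN
// escape so the otherwise invisible character shows up in the output.
const reveal = (s) =>
  s.replace(/[\x00-\x1f\x7f]/g, (c) =>
    '\\x' + c.charCodeAt(0).toString(16).padStart(2, '0'));

console.log(reveal(quotedLocalPart)); // "john\x07doe"@example.com
console.log(reveal(bracketedDomain)); // john@[192.168.1.1:\x07]
```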
The character between the two > < characters is not a space; it’s a non-printable control character. The same is true of the character after the colon in the domain part of the second email address. This character is the ASCII control character ‘\x07’, or BEL (bell), which produces a beep when passed to console.log in Node.js in the macOS terminal.
As mentioned previously, the control characters are only considered valid in the local-part of the email address when they appear in quotes, and they’re only considered valid in the domain part if it’s an IP address in square brackets. For example, the email address below is not valid:
But if the character is in the domain part in brackets, then the email is considered valid:
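The rule described here can be sketched with a much smaller pattern. This is a simplified illustration and not the emailregex.com expression itself: it keeps only the parts that decide where control characters are accepted, and every constant name, the `looksValid` helper, and the sample addresses are assumptions made for the example.

```javascript
// Control characters other than TAB (\x09), LF (\x0a) and CR (\x0d).
const CTL = String.raw`\x01-\x08\x0b\x0c\x0e-\x1f`;

// Unquoted local-part: printable characters only, in dot-separated runs.
const ATOM = String.raw`[a-z0-9!#$%&'*+/=?^_{|}~-]+`;
const DOT_ATOM = String.raw`${ATOM}(?:\.${ATOM})*`;

// Quoted local-part: control characters are accepted between the quotes.
const QUOTED = String.raw`"(?:[${CTL}\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*"`;

// Ordinary host name: letters, digits and hyphens only.
const LABEL = String.raw`[a-z0-9](?:[a-z0-9-]*[a-z0-9])?`;
const HOSTNAME = String.raw`(?:${LABEL}\.)+${LABEL}`;

// Bracketed literal: an IPv4 address, or three octets followed by a tag,
// a colon, and content that may include control characters.
const OCTET = String.raw`(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)`;
const LITERAL = String.raw`\[(?:${OCTET}\.){3}(?:${OCTET}|[a-z0-9-]*[a-z0-9]:(?:[${CTL}\x21-\x5a\x5e-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]`;

// Anchored so that the whole address must match, not just a substring.
const EMAIL = new RegExp(
  `^(?:${DOT_ATOM}|${QUOTED})@(?:${HOSTNAME}|${LITERAL})$`, 'i');

const looksValid = (address) => EMAIL.test(address);

console.log(looksValid('john\x07doe@example.com'));   // false: unquoted local-part
console.log(looksValid('john@exam\x07ple.com'));      // false: unbracketed domain
console.log(looksValid('"john\x07doe"@example.com')); // true: quoted local-part
console.log(looksValid('john@[192.168.1.1:\x07]'));   // true: bracketed domain
```

The anchors (`^` and `$`) matter in this sketch: without them, `test` would succeed on the part of `john\x07doe@example.com` that follows the control character.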
You can run these console.log commands in the Node.js REPL to see for yourself and try other examples.
I added more tests on regexr.com: emailregex.com RFC5322.
Thanks to David’s link to section 4.1 of the RFC5322 spec, and thanks to Wiktor Stribiżew’s answer on user6410654’s question. Their research contributed to answering these questions.
Last Modified on 2023-06-17
Author James Mortensen