Introduction
Deduplicating email is a complex technical challenge and a core component of Sedna’s product offering. Sedna deduplicates messages as they are ingested into your Sedna instance, so that teams can collaborate efficiently around a single message as the source of truth.
Understanding Deduplication
What is Email Deduplication?
Email deduplication is the process of identifying instances of the same email copied to multiple teams and consolidating them into a single Sedna message, so that activity (e.g. tagging, comments, read-by status) can be shared across multiple inboxes and teams.
Elements of Deduplication
Our approach to deduplication involves analysing various email components:
Message-ID: Each email carries a unique header value known as the Message-ID. Created at the time of origin, this ID is used to distinguish between individual messages, aiding our system in identifying duplicates accurately.
Only when a Message-ID matches between messages do we perform the following additional checks:
Subject Line: We compare the subject lines. If they do not match, we treat the messages as distinct.
Attachments: We sort and checksum the attachments. If the checksums do not match, we treat the messages as distinct.
Plaintext and HTML Body: We compare both the plaintext body and the HTML body. If either does not match, we treat the messages as distinct.
By combining these specific identifiers and components in our deduplication process, we ensure a robust and efficient system that consolidates duplicate emails accurately, streamlining your communication workflow.
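The checks above can be sketched in a few lines of Python. This is a minimal illustration only, not Sedna's implementation: the Email class, its field names, the choice of SHA-256, and the helper functions are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Email:
    # Hypothetical, minimal model of an ingested message.
    message_id: str
    subject: str
    text_body: str
    html_body: str
    attachments: list[bytes] = field(default_factory=list)


def attachment_fingerprint(attachments: list[bytes]) -> str:
    """Checksum each attachment, sort the digests, then checksum the result.

    Sorting makes the fingerprint independent of attachment order.
    SHA-256 is an assumption; the article does not name an algorithm.
    """
    digests = sorted(hashlib.sha256(a).hexdigest() for a in attachments)
    return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()


def is_duplicate(a: Email, b: Email) -> bool:
    # The Message-ID acts as a gate: without a match, nothing else is compared.
    if a.message_id != b.message_id:
        return False
    # Any difference in subject, attachments, or either body means "distinct".
    return (
        a.subject == b.subject
        and attachment_fingerprint(a.attachments) == attachment_fingerprint(b.attachments)
        and a.text_body == b.text_body
        and a.html_body == b.html_body
    )
```

In this sketch, two copies with the same Message-ID, subject, and bodies are treated as duplicates even if their attachments arrive in a different order, while a change to any one of those components makes them distinct.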
Supported Gateways
Click-Tracking Scheme Deanonymization:
Some emails contain unique 'click-tracking' links that record the identity of the email recipient. Deduplication disrupts this tracking by sharing a single copy of the email (and thus a single instance of the link) with all recipients.
Mimecast, Proofpoint, Microsoft ATP Safe Links, Cisco Secure Web:
To secure your email, many email gateways rewrite the hyperlinks in each message so that clicks are routed through the gateway's own system, which prevents users from accidentally opening infected websites.
Unfortunately, this causes originally identical emails to become distinct, because the rewritten hyperlinks contain additional information unique to each copy of the email.
The Sedna email-ingestion system analyses all hyperlinks for this type of adjustment and, for the purpose of deduplication alone, will ignore these differences.
This process ensures that we do not interfere with the core functionality of the email gateways, but can result in some challenges as and when email gateways change their methods of hyperlink modification.
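As a rough illustration of ignoring link rewriting for comparison purposes only, the sketch below unwraps links before comparing HTML bodies. Everything here is an assumption: the host gateway.example, the "url" query parameter, and the function names are hypothetical, and real gateways each use their own hosts and encodings. The stored email is not modified; only the comparison copy is.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical rewriting host; real gateways use their own hosts and formats.
WRAPPER_HOSTS = {"gateway.example"}

# Deliberately simple matcher for double-quoted href attributes.
HREF_RE = re.compile(r'href="([^"]*)"')


def unwrap(url: str) -> str:
    """Return the original target if `url` is a wrapped link, else `url` unchanged."""
    parts = urlsplit(url)
    if parts.hostname in WRAPPER_HOSTS:
        # Assumes the original link is carried in a "url" query parameter.
        target = parse_qs(parts.query).get("url")
        if target:
            return target[0]
    return url


def comparison_form(html: str) -> str:
    """Return a copy of `html` with wrapped links unwrapped, for comparison only."""
    return HREF_RE.sub(lambda m: f'href="{unwrap(m.group(1))}"', html)


copy_a = '<a href="https://gateway.example/?url=https%3A%2F%2Fexample.com%2Fdoc&rcpt=alice">Doc</a>'
copy_b = '<a href="https://gateway.example/?url=https%3A%2F%2Fexample.com%2Fdoc&rcpt=bob">Doc</a>'
same_for_dedup = comparison_form(copy_a) == comparison_form(copy_b)
```

The two copies differ as delivered, but their comparison forms are equal, so the per-recipient rewriting no longer blocks deduplication. A rule set like this has to be updated whenever a gateway changes its rewriting format.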
Front Tracking Pixels:
Within Front emails, tracking pixels are replaced with the identifier "Front," intentionally disrupting tracking mechanisms. Although this action impacts Front's tracking, it significantly aids in email consolidation and management.
Current Limitations
Sedna’s deduplication technology has a set of known limitations.
Gateway Dependency: While we support select email gateways (as noted in ‘Supported Gateways’), unsupported email gateways, or changes made by supported email gateways, may prevent us from identifying duplicate sets of emails. We aim to keep improving our handling of email gateways, so we encourage you to report instances of duplicate emails so that we can analyse the cause and improve our algorithms.
Routing Uniformity: Successful deduplication requires that all copies of an email follow the same path through your email gateways. For example, if we receive one copy that passed through your email gateway and had some of its content rewritten, and another copy that bypassed the gateway, we cannot deduplicate them, because they have become distinct in ways that we are unable to detect.
Attachment Modifications: Header modifications within attached .eml files can prevent deduplication, because the checksums of the attachments are no longer identical.
Date Mismatch: Deduplication is only applied to emails that are received within 5 days of each other. If two copies of the same email are received more than 5 days apart, they will not be checked for deduplication.
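The 5-day rule can be expressed as a simple time-window check. This is a sketch under assumptions: the function and constant names are hypothetical, and treating a gap of exactly 5 days as inside the window is one reading of the rule above.

```python
from datetime import datetime, timedelta

# Window taken from the limitation described above.
DEDUP_WINDOW = timedelta(days=5)


def within_dedup_window(received_a: datetime, received_b: datetime) -> bool:
    """True if the two receipt times are no more than 5 days apart."""
    return abs(received_a - received_b) <= DEDUP_WINDOW
```

Under this sketch, copies received on 1 January and 5 January would be compared for deduplication, while copies received on 1 January and 7 January would not.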