
Understanding Exact Data Match (EDM) Classifier in Microsoft Purview - Part 2


Introduction 

Welcome back to my series on Exact Data Match (EDM) classifiers in Microsoft Purview! In Part 1, we explored the basics of EDM classifiers, how they compare to traditional sensitive information types, and their effectiveness in reducing false positives. We also discussed practical applications in managing sensitive data like client, patient, and employee records through auto-labelling and data loss prevention (DLP) policies.

In this second part, we'll dive deeper into the key advantages of using EDM classifiers and guide you through the step-by-step process of setting one up in Microsoft Purview.

 



Step-by-Step Guide: Setting Up an EDM Classifier in Microsoft Purview

Alright, let’s get into the nitty-gritty of setting up an EDM classifier in Microsoft Purview. No need to hold your breath: we’re going step by step, and I promise it's easier than it sounds.


Step 1 - Prepare Your Dataset

Let’s say you want to upload some client data. Imagine you’ve got all your client info hanging out in your CRM system - each client has their own unique Client ID, plus their first name, last name, company address, phone number, email... the whole shebang.

You need to export that data and clean it up. Create a nice, neat .csv file with just the info you want to work with. If you’re not a data-wrangling ninja, don't worry! Microsoft gives you a sample .csv file, or you can head to dlptest.com, click Sample Data, and grab a format that works.



Now, take your sample data (you know, the “Name/SSN/DOB” kind of thing), clean it up, and replace it with your actual client data. Boom. Step 1, done!
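If you'd rather script the clean-up than do it by hand, here is a minimal Python sketch. The column names (ClientID, FirstName, LastName, Email) and the sample row are made up for illustration, so swap in the headers from your own CRM export. It keeps only the columns you list, trims stray whitespace, and drops rows that are completely blank.

```python
import csv
import io

# Hypothetical column names; replace with the headers from your own export.
KEEP_COLUMNS = ["ClientID", "FirstName", "LastName", "Email"]


def clean_rows(source, keep_columns=KEEP_COLUMNS):
    """Yield rows limited to keep_columns, whitespace-trimmed, skipping blank rows."""
    for row in csv.DictReader(source):
        cleaned = {col: (row.get(col) or "").strip() for col in keep_columns}
        if any(cleaned.values()):  # drop rows where every kept column is empty
            yield cleaned


def write_clean_csv(source, target, keep_columns=KEEP_COLUMNS):
    """Write the cleaned rows to target and return how many rows were written."""
    writer = csv.DictWriter(target, fieldnames=keep_columns, lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in clean_rows(source, keep_columns):
        writer.writerow(row)
        count += 1
    return count


# Small in-memory demo with invented data; with real files you would pass
# open(..., newline="") handles instead of StringIO objects.
raw = io.StringIO(
    "ClientID,FirstName,LastName,Email,InternalNotes\n"
    "ABC123456789, Ada ,Lovelace,ada@example.com,vip\n"
    ",,,,\n"
)
out = io.StringIO()
print(write_clean_csv(raw, out), "row(s) kept")
print(out.getvalue())
```

The InternalNotes column and the blank row are dropped, which is the kind of tidy file you want before moving on.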


This was my sample data file from dlptest.com:

And this is the client data file that I'll be using during the setup:


Step 2 - Custom Sensitive Information Type (SIT) Creation

Since I need to search for client data, including Client IDs that follow a pattern specific to my organisation, none of the built-in classifiers work for me. So, I’m going to create a custom Sensitive Information Type (SIT) using a regular expression for my Client IDs. I go through the wizard to create the custom SIT and set a low confidence level for the pattern. Basing an EDM classifier on a low-confidence SIT is a common choice: the SIT only has to flag candidate values broadly (which avoids false negatives), and the EDM lookup against your hashed data then confirms which candidates are real matches. Next, I define the regular expression to match the structure of my Client IDs: three uppercase letters followed by nine digits. My regex looks like this: [A-Z]{3}[0-9]{9}. After that, I just keep clicking "Next" until the SIT is fully set up.
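Before pasting a pattern into the wizard, it can be worth a quick sanity check on your own machine. The sketch below uses Python's re module as a stand-in (Purview's regex engine is not Python, so treat this as a rough check only), and the sample IDs are invented.

```python
import re

# The Client ID pattern from this step: three uppercase letters, nine digits.
CLIENT_ID = re.compile(r"[A-Z]{3}[0-9]{9}")


def is_client_id(value):
    """True if the whole value is a Client ID and nothing else."""
    return CLIENT_ID.fullmatch(value) is not None


def find_client_ids(text):
    """Return every Client ID-shaped substring found in a block of text."""
    return CLIENT_ID.findall(text)


# Invented sample values for illustration.
for candidate in ["ABC123456789", "AB123456789", "abc123456789"]:
    print(candidate, is_client_id(candidate))

print(find_client_ids("Invoice for ABC123456789 and XYZ000000001."))
```

One thing this makes visible: the pattern has no word boundaries, so it will also match inside a longer string of letters and digits. That is usually fine for a low-confidence SIT feeding an EDM classifier, but it is good to know.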




Click Next, Next, Next, and boom - your SIT is ready


Step 3 - Create a Security Group

Before we get too far, you’ll need to create a security group for anyone hashing and uploading data. It needs to be called "EDM_DataUploaders". Add anyone who’ll be working with the data, and you’re ready to roll.



Step 4 - Create the EDM Classifier

Since I’m using the new Purview portal, I head over to purview.microsoft.com and navigate to Solutions > Information Protection > Classifiers > EDM Classifiers. From there, I click Create EDM Classifier and follow the wizard.





When it comes to defining the schema, I go with the recommended option of uploading a file that contains some sample data.



This file needs to have exactly the same column names as the client, employee, or patient .csv file that I'll be hashing and uploading later.

Sample files should be under 2.5 MB, while the final dataset can include:

  • Up to 100 million rows of sensitive data.

  • Up to 32 columns per data source.

  • Up to 10 searchable columns.
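If you want to catch a mismatch before the wizard does, a rough pre-flight check along these lines can help. It is only a sketch: the function name and messages are my own, the row and column limits are the ones quoted above, and the searchable-column limit isn't checked because that is something you choose in the schema rather than in the file.

```python
import csv
import io

MAX_ROWS = 100_000_000  # row limit quoted above
MAX_COLUMNS = 32        # column limit quoted above


def check_dataset(sample, data, max_rows=MAX_ROWS, max_columns=MAX_COLUMNS):
    """Compare the data file's header with the sample file and check size limits.

    Returns a list of human-readable problems (empty if none were found).
    """
    sample_header = next(csv.reader(sample), [])
    reader = csv.reader(data)
    data_header = next(reader, [])

    problems = []
    if data_header != sample_header:
        problems.append("column names differ from the sample file")
    if len(data_header) > max_columns:
        problems.append(f"{len(data_header)} columns (limit {max_columns})")
    row_count = sum(1 for _ in reader)  # streams the file, one row at a time
    if row_count > max_rows:
        problems.append(f"{row_count} rows (limit {max_rows})")
    return problems


# In-memory demo with invented headers; pass open(..., newline="") for real files.
sample_file = io.StringIO("ClientID,FirstName,LastName\nABC123456789,Ada,Lovelace\n")
data_file = io.StringIO("ClientID,First Name,LastName\nXYZ000000001,Alan,Turing\n")
print(check_dataset(sample_file, data_file))
```

In the demo, "First Name" versus "FirstName" is enough to trigger the warning, which is exactly the sort of small difference that is easy to miss by eye.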


Once the file is uploaded, I select my primary elements and map my columns to sensitive information types (SITs). Since I previously created a custom SIT, I can get a full match validation here. While the system automatically tries to map your columns to SITs or tokens, you can use the drop-down menu under the Match Mode column to adjust anything that doesn’t quite fit.


In my case, I choose ClientID as my primary element because each client has a unique ID, making it the key field to match.



For columns that don’t need a direct SIT mapping, you can choose between single-token (one word) and multi-token (multiple words). For example, State (like "North Carolina") would be multi-token, whereas Email Address would be single-token.
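If you're unsure which mode a column needs, you can eyeball a few values, or use a tiny helper like the one below. This is purely my own illustration of the one-word versus multiple-words idea, not how Purview decides anything, and the sample values are made up.

```python
def suggest_match_mode(values):
    """Suggest 'multi-token' if any sample value contains more than one word."""
    has_multiple_words = any(len(value.split()) > 1 for value in values)
    return "multi-token" if has_multiple_words else "single-token"


# Invented sample values for two columns.
print(suggest_match_mode(["North Carolina", "Texas"]))
print(suggest_match_mode(["ada@example.com", "alan@example.com"]))
```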


Once I've adjusted the match modes for all my columns, I click Next.

Here, I get the option to apply the same settings across all columns or customise things like case insensitivity and whether to ignore delimiters and punctuation. This is helpful if your data, like phone numbers or ZIP codes, has been entered inconsistently (e.g., +1 00-000-000 vs. 100000000).
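To see why these settings matter, here is a small sketch of the idea in Python. It is an illustration of the concept only, not Purview's actual implementation, and the set of ignored characters is an assumption I picked for the demo.

```python
# Characters to ignore when comparing values (assumed set, for illustration).
IGNORED_CHARACTERS = set(" +-()./")


def normalise(value, ignored=IGNORED_CHARACTERS, case_insensitive=True):
    """Strip ignored delimiters and optionally fold case before comparing."""
    stripped = "".join(ch for ch in value if ch not in ignored)
    return stripped.casefold() if case_insensitive else stripped


# The two phone number spellings from the example above.
print(normalise("+1 00-000-000") == normalise("100000000"))
```

With delimiters ignored, both spellings reduce to the same string, so an inconsistently typed phone number can still match the value in your table.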

On the next page, I configure detection rules for the primary elements. Each one can have up to three rules with different confidence levels, helping define how closely the detected sensitive info matches the primary element. I tweak the rules for medium confidence and also adjust the proximity of supporting elements.


Once that’s sorted, I click Next and Submit.


At this point, I’ve successfully created an EDM classifier, and I can see the schema name for my EDM. Save this info! But if you forget, no worries - you can always click on the classifier's name to retrieve the schema name.


Back on the EDM classifiers page, the status will show as "Source file not uploaded yet", with a link to guide you through the upload process.


Step 5 - Create Directories for EDM

Time to get your house in order. Open File Explorer and create these directories:

  • C:\EDM

  • C:\EDM\Hash (for the hashed data)

  • C:\EDM\Data (for the plain text data)
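If you prefer scripting to clicking around in File Explorer, the same folders can be created with a few lines of Python. This is just a sketch: the function name is my own, and the demo runs against a temporary folder so it doesn't touch your C: drive until you point it at C:\EDM yourself.

```python
import tempfile
from pathlib import Path


def create_edm_dirs(root):
    """Create <root>\\Hash and <root>\\Data (and root itself) if they don't exist."""
    root = Path(root)
    created = []
    for name in ("Hash", "Data"):
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created


# Demo against a throwaway folder; on your machine call create_edm_dirs(r"C:\EDM").
with tempfile.TemporaryDirectory() as demo_root:
    for folder in create_edm_dirs(demo_root):
        print(folder.name, folder.is_dir())
```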


Step 6 - Download and Install the EDM Upload Agent

There are two methods to hash and upload your data:

  1. Two-computer method: In this approach, your sensitive information source table is in clear-text format. You use one computer to perform the hashing and another computer for uploading. This method prevents your clear-text data from being exposed on the computer with direct access to your Microsoft 365 tenant.

  2. Single-computer method: If you choose to hash and upload using the same computer, it must have direct connectivity to your Microsoft 365 tenant. Additionally, the clear-text sensitive information source table must be stored on this computer to complete the hashing process.

Microsoft's documentation on hashing and uploading the sensitive information source table describes both methods in more detail.


I am using the single-computer method.


Now, grab the EDM Upload Agent that fits your organisation:

  • Commercial + GCC (Most commercial users should use this option)

  • GCC-High (This option is specifically for high-security government cloud subscribers.)

  • DoD (This option is specifically for United States Department of Defense cloud customers.)


Once ready, double-click the download to run it, and when asked, change the installation directory to C:\EDM\Data. Wait for the install to finish. (It’s not coffee-break long, don’t worry.)


Step 7 - Authorise the EDM Upload Agent

Next up, open Command Prompt or PowerShell as an admin.

Navigate to the directory where you installed the EDM Upload Agent:

cd C:\EDM\Data

Run the command to authorise:

EdmUploadAgent.exe /Authorize

Sign in with the account you added to the EDM_DataUploaders group.



Step 8 - Download Your Schema

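Now, to grab your schema (you’ll need this to hash your data), run the /SaveSchema command with your schema name and an output directory:

EdmUploadAgent.exe /SaveSchema /DataStoreName <schema name> /OutputDir <output directory>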


For me, it was:

EdmUploadAgent.exe /SaveSchema /DataStoreName demoedm01Schema /OutputDir C:\EDM\Data


You should be able to see your schema file saved in C:\EDM\Data (unless you changed the OutputDir).


Step 9 - Validate Your Data

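Before jumping into hashing, let’s make sure your table won’t choke on any special characters. Run the validation command against your data file and schema file. For me, it was:

EdmUploadAgent.exe /ValidateData /DataFile C:\EDM\Data\clients.csv /Schema C:\EDM\Data\demoedm01Schema.xml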


Step 10 - Hash and Upload Your Sensitive Data

Ready to go? Make sure you’ve got:

  • Data Store Name: The name of your EDM schema.

  • Data File: The path to your .csv (e.g., C:\EDM\Data\clients.csv).

  • Hash File Location: Where you’ll store the hash (e.g., C:\EDM\Hash).

  • Schema File: The path to your schema file (e.g., C:\EDM\Data\demoedm01Schema.xml).

  • AllowedBadLinesPercentage: The percentage of bad or blank rows you're willing to tolerate (e.g., 5).


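Now run the command:

EdmUploadAgent.exe /UploadData /DataStoreName <schema name> /DataFile <data file path> /HashLocation <hash directory> /Schema <schema file path> /AllowedBadLinesPercentage <value>

For me, it was: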

EdmUploadAgent.exe /UploadData /DataStoreName demoedm01Schema /DataFile C:\EDM\Data\clients.csv /HashLocation C:\EDM\Hash /Schema C:\EDM\Data\demoedm01Schema.xml /AllowedBadLinesPercentage 5
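If you end up re-running the upload regularly (say, after every CRM export), you might wrap the command in a small script. The sketch below only assembles the argument list using the same flags as the command above; the function name is my own, and the line that actually runs the agent is left commented out so nothing is uploaded by accident.

```python
import subprocess  # only needed if you uncomment the run() call below


def build_upload_command(data_store, data_file, hash_location, schema_file,
                         allowed_bad_lines=5):
    """Build the EdmUploadAgent.exe /UploadData argument list shown above."""
    return [
        "EdmUploadAgent.exe", "/UploadData",
        "/DataStoreName", data_store,
        "/DataFile", data_file,
        "/HashLocation", hash_location,
        "/Schema", schema_file,
        "/AllowedBadLinesPercentage", str(allowed_bad_lines),
    ]


command = build_upload_command(
    "demoedm01Schema",
    r"C:\EDM\Data\clients.csv",
    r"C:\EDM\Hash",
    r"C:\EDM\Data\demoedm01Schema.xml",
)
print(" ".join(command))

# To actually run it from the agent's install directory (assumed C:\EDM\Data):
# subprocess.run(command, cwd=r"C:\EDM\Data", check=True)
```

Passing the arguments as a list rather than one long string means paths with spaces don't need extra quoting.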


Step 11 - Verify and Troubleshoot

Did everything go smoothly? If yes, congrats! If not, adjust the AllowedBadLinesPercentage and try again. If you still get errors, check out the message for more clues - it usually gives you a percentage of bad lines.
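Rather than guessing a percentage, you can get a rough estimate from the file itself. The sketch below counts rows that are blank or have the wrong number of columns. That definition of a "bad line" is my own assumption, and the agent may count things differently, so treat the result as a starting point rather than the exact figure the agent will report.

```python
import csv
import io


def bad_line_percentage(data):
    """Estimate the share of rows that are blank or have the wrong column count."""
    reader = csv.reader(data)
    header = next(reader, None)
    if header is None:
        return 0.0
    total = bad = 0
    for row in reader:
        total += 1
        is_blank = not any(cell.strip() for cell in row)
        if is_blank or len(row) != len(header):
            bad += 1
    return 100.0 * bad / total if total else 0.0


# In-memory demo with invented rows: one blank line and one short row.
demo = io.StringIO(
    "ClientID,Email\n"
    "ABC123456789,ada@example.com\n"
    "\n"
    "XYZ000000001\n"
    "DEF000000002,alan@example.com\n"
)
print(f"{bad_line_percentage(demo):.1f}% bad lines")
```

If the estimate is high, it's usually better to go back and clean the export than to keep raising the tolerance.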


Step 12 - Refresh and Check the Status

Go back to your EDM classifiers page in Purview and hit refresh. You’ll probably see Indexing with a progress percentage. Grab a snack and check back in a few minutes. When it's done, it'll say Index complete.


Step 13 - Your Custom SIT is Ready!

Finally, head over to Classifiers > Sensitive Info Types. You'll now see your shiny new EDM classifier sitting there as a custom SIT, ready to be used in Data Loss Prevention (DLP) policies, Insider Risk Management (IRM), or whatever your heart desires. 🎉


Conclusion

And there you have it! Today, we walked through the process of setting up an EDM classifier in Microsoft Purview, from preparing your dataset and creating a custom sensitive information type to configuring your EDM schema and then hashing and uploading your data securely. Once indexing completes, your data is ready for exact data matching.

But we’re not done yet! In the next and final part, I’ll show you how to test your newly created EDM classifier and, most importantly, how to put it to work within different Microsoft Purview policies like Data Loss Prevention and Insider Risk Management. So stay tuned!


Thank you for reading :)
