Digitization is on the rise as many companies are looking for better ways to process and store documents. Traditional archives are being moved into the cloud, and more documents are processed in digital workflows.
While digitization has fantastic benefits, there are some challenges to consider. The most important one is complying with the strict General Data Protection Regulation (GDPR), which took effect in May 2018. Although this regulation improves the protection of data and clarifies the responsibilities of organizations, it doesn’t prevent data breaches altogether.
In fact, the average total cost of a data breach increased from US$3.86 million to US$4.24 million, the highest average recorded in 17 years. As cybercriminals become more sophisticated, companies must find better ways to protect the data they store. An excellent way to minimize data breach risks and support GDPR compliance is data masking.
This blog will cover what data masking is, how it works, and how Klippa can automate data masking for you.
What is Data Masking?
Data masking, also known as data anonymization, data redaction, or data obfuscation, is a security technique for hiding sensitive data such as social security numbers or payment card numbers.
Data masking is applied to avoid compromising the data and reduce security risks while complying with data privacy regulations.
For example, many organizations need to perform Know Your Customer (KYC) checks within customer onboarding processes. To perform these checks and validate customers’ identities, organizations need to process identity documents.
However, some information such as social security numbers cannot be stored under the GDPR. Although there are exceptions, the majority of organizations need to anonymize or obfuscate the data to ensure compliance.
Currently, data masking is gaining more traction, and the industry is estimated to grow from US$483.90 million in 2020 to US$1,044.93 million by 2026.
Types of Sensitive Data
Data masking can be used to protect many types of data. The most common types include:
- Personally Identifiable Information (PII)
- Protected Health Information (PHI)
- Payment Card Information (covered by PCI-DSS)
- Health data regulated under the Health Insurance Portability and Accountability Act (HIPAA)
It is essential to know how data masking works and identify which types and techniques are suited for your business purposes. Only then will it be easier to utilize data masking to safeguard privacy-sensitive data.
Let’s have a look at how data masking works.
How Does Data Masking Work?
The starting point is to identify all sensitive data that your organization holds or processes. It is essential to keep in mind that data can come in many forms: emails, faxes, Excel sheets, database information, and scanned documents such as passports, to name a few.
Once the identification of data is completed, data masking algorithms and techniques should be applied. Organizations can blackline, replace, encrypt, or remove sensitive data depending on the use case and legal requirements.
Let’s take an Excel sheet with customer data, including sensitive information like bank account numbers, as an example. When storing these types of information, data masking can help increase the security of your data.
For instance, instead of revealing sensitive data, each character of a bank account number can be replaced by an “x,” with only the last four digits shown.
Even with only the last four digits shown, your back-office staff is still able to verify the bank account ownership. This way, fraudsters cannot use the bank account number, even if they get the information.
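To make this concrete, here is a minimal Python sketch of the "show only the last four digits" approach. The function name `mask_account_number` and its parameters are our own illustration, not part of any Klippa product, and the account number is a commonly used sample value.

```python
def mask_account_number(value: str, visible: int = 4, mask_char: str = "x") -> str:
    """Replace every letter or digit except the last `visible` ones with `mask_char`."""
    # Count only letters and digits so separators such as spaces stay in place.
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - visible, 0)
    masked = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


print(mask_account_number("NL91 ABNA 0417 1643 00"))  # xxxx xxxx xxxx xx43 00
```

Keeping the separators in place preserves the familiar layout of the number, which makes it easier for back-office staff to compare the visible digits.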
Another example might be masking information on the scan of an identity document from a KYC process. Below you can see a before and after of a masked passport to ensure GDPR compliance.
A similar data masking approach can be applied to insurance numbers, payment card numbers, or social security numbers to name a few.
Now that we have explained how data masking works, let’s look at two different types.
Data Masking Types
There are several types of data masking, and which one to use depends primarily on the available resources, the use case, and the provider. The two most common types are static and dynamic data masking.
We will elaborate on the difference in the following paragraphs.
Static Data Masking
Static Data Masking (SDM) is often used in software testing. It replaces sensitive data by altering data at rest, for example data stored on a laptop, on a hard drive, or in a database. With static data masking, organizations can comply with data and privacy regulations such as GDPR, PCI-DSS, ITAR, and HIPAA.
This data masking architecture starts with the original copy, from which sensitive data is masked before sending it further to be processed (in a database, software, etc.).
With this approach, sensitive information is permanently replaced to ensure compliance with data privacy regulations and protection against data breaches.
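As a rough illustration of the static approach, the Python sketch below creates a masked copy of a small data set before it is passed on, for example to a test environment. The record layout, the `mask_static` helper, and the `***` placeholder are assumptions made for this example.

```python
import copy


def mask_static(records, sensitive_fields, placeholder="***"):
    """Return a masked copy of `records`; the masked copy no longer holds the real values."""
    masked = copy.deepcopy(records)  # work on a copy so the source records are not touched
    for row in masked:
        for field in sensitive_fields:
            if field in row:
                row[field] = placeholder  # permanent replacement inside the copy
    return masked


production = [
    {"name": "Alice Example", "ssn": "123-45-6789", "city": "Groningen"},
    {"name": "Bob Example", "ssn": "987-65-4321", "city": "Utrecht"},
]
test_data = mask_static(production, sensitive_fields={"ssn"})
print(test_data)
```

Only the masked copy is handed to downstream systems, so nothing in it can be used to recover the original values.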
Dynamic Data Masking
Dynamic Data Masking (DDM) architecture differs from the static one. It is used to mask sensitive data in transit (i.e., actively used), leaving the original copy unaltered. With this approach, the unmasked data remains visible only in the original database.
DDM is mainly used to process customer inquiries and handle medical records within role-based security applications. Hiding sensitive data from specific users is necessary for some industries.
With DDM, organizations can use modified queries (i.e., requests for data) coming to the original database to dynamically mask the data and pass it on to the party requesting it.
This type of data masking is often used when organizations send data to third-party vendors or internal stakeholders who are not authorized to see sensitive data, such as Social Security Numbers (SSNs) or payment card numbers.
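The query-time behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular database implements DDM: the in-memory `CUSTOMERS` list stands in for the original database, and the role names and masking rules are hypothetical.

```python
# The "original database": values are stored unmasked and are never altered.
CUSTOMERS = [
    {"name": "Alice Example", "ssn": "123-45-6789", "card": "4111111111111111"},
]


def _last_four(value, mask_char="*"):
    """Mask everything except the last four characters."""
    return mask_char * (len(value) - 4) + value[-4:]


# Hypothetical configuration: which fields to mask and who may see them unmasked.
MASKING_RULES = {"ssn": _last_four, "card": _last_four}
AUTHORIZED_ROLES = {"compliance_officer"}


def query_customers(role):
    """Return customer rows, masking sensitive fields on the fly for unauthorized roles."""
    if role in AUTHORIZED_ROLES:
        return [dict(row) for row in CUSTOMERS]
    return [
        {key: MASKING_RULES[key](value) if key in MASKING_RULES else value
         for key, value in row.items()}
        for row in CUSTOMERS
    ]


print(query_customers("support_agent"))
print(query_customers("compliance_officer"))
```

The masking happens in the response, so two users running the same query can receive different views of the same unaltered record.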
Now that the most common types of data masking are covered, let’s look at data masking techniques.
Data Masking Techniques
Data masking comes with various techniques, which are explained below.
Substitution
Substitution, also referred to as pseudonymization, is a technique used to substitute the original data with random data from supplied or customized lookup files. It is useful when organizations need to preserve the authentic look of data while disguising sensitive data.
This technique can effectively protect data from breaches and help control internal access.
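Below is a minimal Python sketch of substitution. The `LOOKUP_NAMES` list plays the role of a supplied lookup file, and the `substitute` function is our own illustrative helper. We assume here that the same original value should always receive the same substitute within one run, which keeps records consistent.

```python
import random

# Stand-in for a supplied lookup file with realistic-looking replacement values.
LOOKUP_NAMES = ["Alex Morgan", "Sam Taylor", "Robin Lee", "Chris Jordan"]


def substitute(values, lookup, seed=None):
    """Replace each value with a random entry from `lookup`."""
    rng = random.Random(seed)
    mapping = {}  # original value -> chosen substitute, so repeats stay consistent
    result = []
    for value in values:
        if value not in mapping:
            mapping[value] = rng.choice(lookup)
        result.append(mapping[value])
    return result


names = ["Jane Doe", "John Smith", "Jane Doe"]
print(substitute(names, LOOKUP_NAMES, seed=42))
```

With a small lookup list, two different originals can end up with the same substitute. A larger lookup file reduces the chance of that happening.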
Shuffling
Shuffling is a technique similar to substitution. It is also used to substitute original data with other data that looks authentic. The difference is that the entities in the same column are randomly shuffled.
For instance, organizations can use this technique to shuffle employee name columns of multiple employee records randomly. This technique can be prone to reverse engineering if anyone gets their hands on the shuffling algorithm.
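As a sketch of the idea, the Python snippet below shuffles one column across a list of employee records while leaving the other columns in place. The record layout and the `shuffle_column` helper are assumptions made for this example.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Return new rows in which the values of `column` are randomly reassigned."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)  # shuffles the list of column values in place
    return [{**row, column: value} for row, value in zip(rows, values)]


employees = [
    {"id": 1, "name": "Jane Doe", "department": "Finance"},
    {"id": 2, "name": "John Smith", "department": "Legal"},
    {"id": 3, "name": "Ann Lee", "department": "HR"},
]
print(shuffle_column(employees, "name", seed=7))
```

Every name still appears exactly once, which is why the result looks authentic. It is also why anyone who learns the seed or algorithm could reverse the shuffle.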
Averaging
Averaging is a method to replace original values with an average value of the table columns. For instance, instead of showing the salaries or account balances of individuals, the initiator shows only the average value of wages or account balances.
This method helps maintain the aggregate value and is commonly used for statistical analysis or data collection purposes of financial institutions.
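A short Python sketch shows how averaging hides individual values while keeping the column total intact. The `average_column` helper and the sample salaries are illustrative only.

```python
from statistics import mean


def average_column(rows, column):
    """Replace every value in `column` with the column average."""
    avg = mean(row[column] for row in rows)
    return [{**row, column: avg} for row in rows]


salaries = [
    {"employee": "A", "salary": 3000},
    {"employee": "B", "salary": 4000},
    {"employee": "C", "salary": 5000},
]
masked = average_column(salaries, "salary")
print(masked)

# The aggregate is preserved even though individual salaries are hidden.
print(sum(r["salary"] for r in masked) == sum(r["salary"] for r in salaries))
```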
Nulling out (Deletion)
Nulling out is a technique to replace sensitive data with a null value to prevent unauthorized users from seeing the original data. It simply means removing the information or replacing it with an empty value on documents.
In some use cases, information on certain documents is left out entirely, such as the date of birth on resumes. Often, this is done to eliminate the risks of unethical hiring practices.
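In code, nulling out can be as simple as the Python sketch below. The `null_out` helper and the resume fields are hypothetical and serve only to illustrate the technique.

```python
def null_out(rows, fields):
    """Return new rows in which every field listed in `fields` is set to None."""
    return [
        {key: (None if key in fields else value) for key, value in row.items()}
        for row in rows
    ]


resumes = [
    {"name": "Jane Doe", "date_of_birth": "1990-04-12", "skills": "Python, SQL"},
]
print(null_out(resumes, {"date_of_birth"}))
```

Unlike substitution or shuffling, the result is obviously incomplete, so this technique fits best where the masked field is not needed downstream.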
Data Redaction (blacklining)
Data redaction, also known as blacklining, is a method similar to nulling out. The difference is that only part of the original data is masked.
For example, only the last four digits of the payment card number are shown to customers in the online shopping environment to prevent fraud.
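The Python sketch below shows one way to redact card numbers inside free text. It is a simplified example: the pattern assumes 16-digit numbers written in groups of four, optionally separated by spaces or hyphens, and the `redact_card_numbers` helper is our own illustration.

```python
import re

# Assumption: 16-digit card numbers in groups of four, optionally separated by " " or "-".
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def redact_card_numbers(text, mask_char="X"):
    """Mask every digit of a matched card number except the last four."""
    def _redact(match):
        number = match.group(0)
        head, tail = number[:-4], number[-4:]
        return re.sub(r"\d", mask_char, head) + tail

    return CARD_PATTERN.sub(_redact, text)


print(redact_card_numbers("Paid with 4111 1111 1111 1111 on Monday."))
```

Real card numbers vary in length and layout, so a production solution would need broader patterns and additional validation.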
The same method can be applied to any document containing privacy-sensitive information. Below you can see the example with a passport, where several fields are redacted.
Data Scrambling
The data scrambling technique is used to alter data by randomly rearranging the order of characters or numbers with a specific algorithm.
The original data can no longer be obtained after completing the process as the data is scrambled.
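A minimal Python sketch of scrambling is shown below. The `scramble` helper is illustrative. It uses a seeded random generator only so that the example is repeatable, and in practice the seed would have to be kept secret or discarded.

```python
import random


def scramble(value, seed=None):
    """Randomly rearrange the characters of `value`."""
    rng = random.Random(seed)
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


print(scramble("AB-1234567", seed=3))
```

The scrambled value contains exactly the same characters as the original, only in a different order, so this technique is best suited to values where the order carries the meaning.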
Data Encryption
Data encryption is a technique that allows access to data only with the decryption key.
It is generally considered the most complex data masking technique and one of the most secure. In addition to the complexity, it requires proper encryption key management to ensure security.
Why is Data Masking Important?
Since the GDPR was imposed, data protection has become the top priority for many businesses. As a result, organizations have found it essential to implement data masking as one of the tools to protect their sensitive data.
So why is data masking necessary? In principle, data masking offers organizations a safe path to create alternative versions of usable and well-secured data.
With the use of data masking, organizations can reap the following benefits.
Data Masking for GDPR Compliance
Data masking helps organizations comply with data privacy laws and regulations. With various data masking techniques available, numerous organizations can eliminate the exposure of sensitive data.
Yet, not all organizations use data masking to comply with the GDPR.
For instance, in 2020, large fashion clothing retailer H&M was fined €35 million due to GDPR violations. The incident involved the management accessing sensitive data such as religious beliefs and family issues through meeting recordings. These recordings were used as a base to evaluate employee performance.
This incident could have been avoided by redacting all sensitive data from the documented recordings of these meetings.
Protection Against Data Breach
One of the main benefits of data masking is to make data useless for cyber attackers while preserving its usability for the organization. Even if the data is breached due to cyber attacks, many data masking techniques can prevent intruders from gaining sensitive information.
In 2018, Panera Bread was reported to have leaked at least 37 million customer records due to a lack of access control and security measures. Data such as personal emails, addresses, and credit card information were accessible through crawling.
Yet again, this scenario could have been avoided with various data masking techniques.
Reduced Data Security Risks
Many companies work together with third parties and vendors within their perimeter, to whom some data are being handed over. In addition to that, employees and other internal stakeholders may also have access to data.
To put it simply, there is always a threat of data being lost. Data masking can provide the means to secure the data against people or parties who are not authorized to see it.
Only false data can be seen unless authorization to unmask the data has been granted. Thus, data anonymization can reduce internal data security risks and data leaks.
Overall, data masking provides impressive benefits, which can help businesses gain a competitive advantage. But what are the common use cases? Let’s have a look at some of them.
Use Cases of Data Masking
There are many use cases for data masking, including the following ones:
- Blacklining payment card numbers
- Anonymizing social security numbers
- Resume redaction
- Masking data for digital archiving
- Redaction or encryption of personal health information
- Redaction or encryption of government documents
- Redaction of legal documents and public court cases
- Anonymization of meeting recordings
- Internal access control
- Encryption of intellectual property documents
- Data sharing with third-party vendors
Below we dive into the first four in more detail.
Blacklining Payment Card Numbers
Under some circumstances, a member of your organization might need access to credit card or payment card information. In those cases, using data masking to blackline all but the last four digits of the card number can prevent exposure of sensitive components such as the full payment card number.
It is a widespread method for banks and other financial institutions to handle their customers’ payment information. By blacklining the payment card number, organizations can ensure compliance with the PCI-DSS.
Anonymizing Social Security Numbers
The SSN shown on identity documents such as passports and ID cards is highly sensitive. Often, organizations outside governmental institutions are not allowed to store the SSN in their database.
In the Netherlands, the burgerservicenummer (BSN) is equivalent to the SSN. The BSN is a unique personal citizen service number used to identify each registered citizen. For instance, government institutions use the BSN to look up data on citizens, often for tax purposes.
The processing of national identification numbers such as SSNs and BSNs is tightly restricted under the GDPR and the national laws that implement it. Of course, there are cases where storing such data is allowed, but only with a special legal exception or consent from the person.
Therefore, it is common to anonymize SSN or BSN numbers using various data masking techniques.
Resume Redaction
Despite all the training to reduce bias in the hiring process, many recruiters still let biases influence their decisions. Unfortunately, it is still common that when two candidates have similar skill sets and experience, the more attractive one is hired.
Although it is illegal to discriminate in any way in the recruitment process, many companies are still doing it. In fact, 20% of companies in the U.S. account for half of the discrimination cases.
Organizations have started to redact resumes to eliminate biases and discrimination in the early stage of the recruitment process. According to the report from HRO Today, the most common fields that are redacted from resumes include:
- Home address
- Name
- Picture (attractiveness, gender, race)
With data masking, recruiters are better supported in evaluating candidates solely based on their skills and experience. It is important to note that recruiters are only human, after all.
Masking Data for Digital Archiving
Storing paper-based data is no longer an option for many organizations. The reasons for organizations to move towards digitization with the advancement of technology include:
- An enormous backlog of unorganized data
- Internal access control
- Time and cost savings
- Environmental friendliness
- GDPR compliance
- Easy data accessibility
While data archiving can be highly beneficial, its challenge is to meet legal obligations concerning data privacy laws. In this respect, data masking is a secure and solid solution to ensure GDPR compliance.
Before archiving, companies can simply use data masking to redact all sensitive parts such as names, patient numbers, and social security numbers from documents or substitute them with structurally identical data (same amount of numbers or characters).
Organizations have adopted this method in industries such as legal and healthcare, to name a few.
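Substituting a value with structurally identical data can be sketched in Python as follows. The `substitute_same_structure` helper and the sample patient number are hypothetical. Each digit is replaced by a random digit and each letter by a random letter of the same case, while separators are kept.

```python
import random
import string


def substitute_same_structure(value, seed=None):
    """Replace digits with random digits and letters with random letters of the same case."""
    rng = random.Random(seed)
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            alphabet = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(alphabet))
        else:
            out.append(ch)  # keep separators such as "-" or spaces
    return "".join(out)


print(substitute_same_structure("PT-2021-00457", seed=1))
```

Because the length and layout stay the same, archived documents and downstream systems that validate the format keep working with the masked values.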
Now that we have covered a few of the use cases, let’s take a look at the transformation of document redaction.
Transformation of Document Redaction
For as long as we can remember, manual document redaction has been used across various industries. It is a tedious task to perform and has many underlying issues. One of the major issues is scalability.
Human personnel struggle to maintain accuracy, efficiency, and consistency over time. This results in slow turnaround times, unsatisfied customers, and high costs.
Take the legal industry as an example. A typical workflow involves teams of lawyers and paralegals going through a large pile of documents for hundreds of hours.
Instead of using their knowledge and experience to do meaningful tasks, they are tasked with adding, modifying, and removing redactions from documents. Not to mention the costs of hiring this personnel.
Adding more personnel as the volume of documents increases quickly ramps up the costs. So let’s face it. Manually redacting documents is not a scalable option (at least not if you want to be cost-efficient).
Luckily, it is possible to automate document redaction with today’s technology. There are two approaches that organizations can capitalize on: fully automated data masking and human-assisted data masking automation.
Fully Automated Data Masking
In a fully automated data masking solution, human intervention is not needed. With technologies such as AI-powered Optical Character Recognition (OCR), it is possible to automatically recognize, locate, and redact the required information fields in documents.
All you need to do is feed the OCR engine documents that need to be masked, and it does the rest. This option frees up your human resources, which you can allocate for more complicated tasks. This way, you can maximize your organization’s efficiency.
Human-assisted Data Masking
The other solution is to use human-assisted automation, in other words, human-in-the-loop (HITL) automation. This solution uses AI for automation and enables human personnel to do the final checks to verify data masking completion.
The advantage of software with human-in-the-loop automation is that it allows higher document redaction accuracy. This is not a surprise as the HITL solution combines the best of artificial intelligence with the best of human intelligence.
Sometimes there might be issues with the technology (image quality, document quality, etc.), which restricts it from performing data masking assignments. Therefore, reviewing the data input or output can help reduce mistakes.
Creating either of these solutions from scratch is difficult, costly, and time-consuming. That is why we at Klippa decided to combine our OCR technology with data masking functionalities to help various organizations. We can help our clients automate document masking at scale.
So why should your organization automate data masking? Let’s dive into the benefits that come with it.
Benefits of Automated Data Masking
While organizations can safeguard data from leaks and ensure GDPR compliance with data masking, automating it adds many more benefits. These benefits include:
- Faster turnaround times – Automating document or data redaction enables your workforce to focus on more important tasks. Fewer people are needed to complete these tedious tasks, and turnaround times go down.
- Accuracy – With an automated data masking solution that uses AI, businesses can achieve higher accuracy simply because machines and computers don’t get tired.
- Speed – With an automated solution, the data redaction process can go up to 90 times faster. You can see a simplified calculation in the following section.
- Cost savings – With higher efficiency and accuracy achieved with AI, organizations can significantly save money (labor hours, reduced mistakes, etc.).
- Scalability – There is a limit to how many documents an average employee can redact. Data masking automation offers businesses a way to redact documents at scale without increasing operational costs.
It seems like there are many benefits that organizations can enjoy from an automated data masking solution. But what does it mean to you in terms of business?
To make it tangible for you, we have provided an example calculation of a potential return on investment (ROI) in the following section.
The ROI of Automated Data Masking Solution
Assume that you have 100,000 identity documents, from which you need to redact the social security numbers. Let’s also presume that, on average, an experienced office clerk can redact two identity documents manually every minute. That makes 120 identity documents in an hour. Let’s estimate the cost to hire an experienced clerk (including insurance, hourly salary, and other costs) to be €25.00 per hour.
To ensure that the redaction is done correctly, you would need to hire another office clerk to double-check redacted documents. Let’s assume that a second clerk, at the same hourly cost, can double-check each redaction for accuracy at the pace of 360 identity documents per hour.
The total cost of manually redacting 100,000 identity documents would go over €27,700. The employee hours that it would take to complete the project are over 1,110.
Comparing that to the Klippa solution, which can redact 10,000 documents per hour, you would be able to complete the project within 10 hours. That’s roughly 1,100 labor hours saved.
As an estimate, it would cost your organization €4,000 to complete this project with our solution (depending on document volume and type). You would be able to complete the project over 90 times faster, and the manual approach would cost about 6.9 times as much, which is the ROI of 6.9 in this example.
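The arithmetic behind this example can be reproduced with the short Python script below. All inputs are the assumptions stated above, and the variable names are our own. Note that the 6.9 figure corresponds to the manual cost divided by the cost of the automated solution.

```python
DOCUMENTS = 100_000
REDACT_RATE = 120        # documents per hour, manual redaction
REVIEW_RATE = 360        # documents per hour, manual double-check
HOURLY_COST = 25.00      # euros per clerk hour
AUTOMATED_RATE = 10_000  # documents per hour, automated solution
AUTOMATED_COST = 4_000   # euros, estimated cost of the automated project

# Manual work consists of redacting every document and then double-checking it.
manual_hours = DOCUMENTS / REDACT_RATE + DOCUMENTS / REVIEW_RATE
manual_cost = manual_hours * HOURLY_COST

automated_hours = DOCUMENTS / AUTOMATED_RATE
speedup = manual_hours / automated_hours
cost_ratio = manual_cost / AUTOMATED_COST

print(f"Manual hours:    {manual_hours:,.0f}")
print(f"Manual cost:     EUR {manual_cost:,.0f}")
print(f"Automated hours: {automated_hours:,.0f}")
print(f"Speed-up:        {speedup:.0f}x")
print(f"Cost ratio:      {cost_ratio:.1f}")
```

Changing the constants at the top lets you rerun the same estimate for your own document volumes and labor costs.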
Try our data masking ROI calculator yourself!
Masking Your Data with Klippa
Our Intelligent Document Processing (IDP) solution, Klippa DocHorizon, is designed to help organizations digitize, extract, classify, verify, and anonymize data from various documents.
With Klippa DocHorizon, your organization can reduce turnaround times, costs, and human error while safeguarding sensitive data.
While our AI-based OCR software includes data masking features, we have developed a data masking API to enable integrations with our clients’ existing document management, enterprise resource planning (ERP), or electronic health record (EHR) systems.
In addition to the API, we have developed a data masking SDK to enable companies to leverage our technology within their system.
Data Masking API
To help our clients get rid of repetitive labor in administrative processes, we have developed a data masking OCR API. It enables our clients to blackline certain fields and pictures from documents.
Our parsing engine can be trained to recognize specific fields that need to be blacklined. We process numerous out-of-the-box fields, but custom fields can be added or removed on request.
Several input formats can be fed to the parsing engine, such as JPG, PNG, and PDF. The default output that our clients receive is a JSON file, which can be forwarded to the desired destination, such as an Enterprise Resource Planning (ERP) system. However, the output can be customized to, for example, CSV, XLSX, or XML if needed. In addition to the structured JSON, it is also possible to get the masked documents in JPG, PDF, or similar file types.
Our data masking OCR API is currently made available through a RESTful API, enabling our clients to integrate it into web-based applications. To help our clients, we provide clear documentation.
Data Masking on Mobile
If you are in need of a mobile data masking solution, Klippa can also support you. We offer a mobile scanner SDK that includes data masking functionalities. Clients use this scanner SDK to mask certain information in identity documents, receipts, invoices, and many other types of documents.
Currently, our scanner SDK is available for both Android and iOS. Besides that, we offer wrappers for cross-platform frameworks like React Native, Flutter, Cordova, and NativeScript. Generally, it can be integrated into any mobile solution.
Watermarking Documents
In case one of the data masking approaches is not possible for you, Klippa also offers digital document watermarking as an alternative. This way, you can protect the copyright of your documents, enable your clients to share data more securely and reduce security risks while storing sensitive data.
Simply feed Klippa DocHorizon a document, and it will return the same document with a watermark. The watermark can contain whatever fits your needs, for example your company name and the scanning date.
Whether you are looking for an end-to-end solution or an API / SDK integration to automate your document processing, Klippa is here to help. Fill in the form below for a free demo, or contact our experts to see how Klippa can support you.