When studying for the CCSP exam, you must consider how to implement data security technologies and design data security strategies that fit your business and security needs. The following technologies are commonly applied as part of a comprehensive data security strategy in the cloud:
  • Encryption and key management
  • Hashing
  • Data loss prevention (DLP)
  • Data de-identification (by masking and data obfuscation)
  • Tokenization

Encryption and key management

When it comes to cloud data security, encryption and key management are critical topics that must be fully understood in order to pass the CCSP exam. With resource pooling (and multitenancy) being a key characteristic of cloud computing, it’s important to remember that physical separation and protections are not commonly available in cloud environments. As such, strategic use of encryption is crucial to ensuring secure data storage and use in the cloud.

When designing or implementing encryption technologies, remember that an encryption architecture has three basic components:

  • The data being secured
  • The encryption engine that performs all encryption operations
  • The encryption keys used to secure the data
While it would seem like encrypting everything would be the best way to ensure data security, it’s important to consider that encryption has a performance impact on systems: system resources are consumed to process encryption algorithms every time data is encrypted or decrypted, and that overhead can add up if encryption is used excessively. As a CCSP, it is up to you to implement encryption so that data is as secure as possible while minimizing the impact on system performance.
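
To make those three components concrete, here’s a minimal sketch in Python using the third-party cryptography package (an assumption made purely for illustration; a cloud key management service or any comparable library plays the same role). The key, the engine, and the data each appear explicitly:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Component: the encryption key used to secure the data
key = Fernet.generate_key()

# Component: the encryption engine that performs all encryption operations
engine = Fernet(key)

# Component: the data being secured
plaintext = b"Customer record: account 5551234"

ciphertext = engine.encrypt(plaintext)   # unreadable without the key
recovered = engine.decrypt(ciphertext)   # only a key holder can reverse it
assert recovered == plaintext
```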

Countless other challenges and considerations exist when implementing encryption technologies, both on-prem and in cloud environments. Some key cloud encryption challenges are

  • Almost all data processing requires that data be in an unencrypted state. If a cloud customer is using a CSP for data analysis or processing, then encryption can be challenging to implement.
  • Encryption keys are cached in memory when in use and often stay there for some time. This is a major point of concern in multitenant environments because memory is a shared resource between tenants. CSPs must implement protections to prevent one tenant’s keys from being accessed by other tenants who share the same resources.
  • Cloud data is often highly replicated (for availability purposes), which can make encryption and key management challenging. Most CSPs have mechanisms in place to ensure that any copies of encrypted data remain encrypted.
  • Throughout the entire data lifecycle, data can change states, locations, and formats, which can require different applications of encryption along the way. Managing these changes may be a challenge, but understanding the Cloud Secure Data Lifecycle can help you design complete end-to-end encryption solutions.
  • Encryption is a confidentiality control at heart. It does not address threats to integrity of data on its own. Other technologies discussed throughout this chapter should be implemented to address integrity concerns.
  • The effectiveness of an encryption solution is dependent upon how securely the encryption keys are stored and managed. As soon as an encryption key gets into the wrong hands, all data protected with that key is compromised. Keys that are managed by the CSP may potentially be accessed by malicious insiders, while customer-managed encryption keys are often mishandled or mismanaged.
As the last point indicates, key management is a huge factor in ensuring that encryption implementations effectively secure cloud data. Because of its importance and the challenges associated with key management in the cloud, this task is typically one of the most complicated ones associated with securing cloud data.

When developing your organization’s encryption and key management strategy, it’s important that you consider the following:

  • Key generation: Encryption keys should be generated within a trusted, secure cryptographic module. FIPS 140-3 validated modules have been tested and certified to meet certain requirements that demonstrate tamper resistance and integrity of encryption keys.
  • Key distribution: It’s important that encryption keys are distributed securely to prevent theft or compromise during transit. One best practice is to encrypt keys with a separate encryption key while distributing them to other parties (in PKI applications, for example); a minimal sketch of this key-wrapping approach follows this list. The worst thing that could happen is sending out a bunch of “secret” keys that get stolen by malicious eavesdroppers!
  • Key storage: Encryption keys must be protected at rest (both in volatile and persistent memory) and should never be stored in plaintext. Keys may be stored and managed internally on a virtual machine or other integrated application, externally and separate from the data itself, or managed by a trusted third party that provides key escrow services for secure key management. A Hardware Security Module (HSM) is a physical device that safeguards encryption keys. Many cloud providers offer HSM services, as well as software-based HSM capabilities.
  • Key destruction or deletion: At the end of the encryption key’s lifecycle, there will be a time that the key is no longer needed. Key destruction is the removal of an encryption key from its operational location. Key deletion takes it a step further and also removes any information that could be used to reconstruct that key. To prevent a Denial of Service due to unavailable keys, deletion should only occur after an archival period that includes substantial analysis to ensure that the key is in fact no longer needed.
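
As a rough illustration of the key generation and key distribution points above, the following sketch wraps a data encryption key (DEK) with a separate key encryption key (KEK) before the DEK is stored or shared. In a real deployment, the KEK would be generated and held inside an HSM or a cloud key management service rather than in application code; this local example (again using the cryptography package) only shows the shape of the workflow:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Key generation: ideally performed inside a FIPS-validated module or HSM
kek = Fernet.generate_key()   # key encryption key (stays in the key store)
dek = Fernet.generate_key()   # data encryption key (encrypts the actual data)

# Key distribution/storage: wrap (encrypt) the DEK with the KEK first
wrapped_dek = Fernet(kek).encrypt(dek)

# Later, an authorized party unwraps the DEK and uses it to decrypt data
unwrapped_dek = Fernet(kek).decrypt(wrapped_dek)
assert unwrapped_dek == dek
```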

Cloud environments rely heavily on encryption throughout the entire data lifecycle. While encryption itself is used for confidentiality, the widespread use of encryption means that availability of the encryption keys themselves is a major concern. Pay close attention to availability as you’re designing your key management systems and processes.

Hashing

Hashing is the process of taking an arbitrary piece of data and generating a unique, fixed-length string or number from it. Hashing can be applied to any type of data — documents, images, database files, virtual machines, and more.

Hashing provides a mechanism to ensure the integrity of data. Hashes are similar to human fingerprints, which can be used to uniquely identify the single person to whom that fingerprint belongs. Even the slightest change to a large text file will noticeably change the output of the hashing algorithm. Hashing is incredibly useful when you want to be sure that what you’re looking at now is the same as what you created before. In cloud environments, hashing helps verify that virtual machine instances haven’t been modified (maliciously or accidentally) without your knowledge. Simply hash your VM image before running it and compare it to the hash of the known-good VM image; the hash outputs should be identical.
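
Here’s a small sketch of that integrity check using Python’s built-in hashlib; the byte strings stand in for whatever artifact (a document, a VM image) you need to verify:

```python
import hashlib

original = b"known-good VM image contents"
modified = b"known-good VM image contents!"   # a one-byte change

# Record the digest of the known-good artifact
known_good_digest = hashlib.sha256(original).hexdigest()

# Before using the artifact, hash it again and compare to the known-good digest
assert hashlib.sha256(original).hexdigest() == known_good_digest   # unchanged: digests match
assert hashlib.sha256(modified).hexdigest() != known_good_digest   # tampered: digests differ completely
```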

The term hashing is sometimes used interchangeably with encryption, but they are very different! Encryption is a two-way function, meaning what can be encrypted can be decrypted. Conversely, hashing is a one-way function. You can only generate a hash of an object; you cannot retrieve an object from its hash. Encryption, again, is used to provide confidentiality, while hashing provides integrity checking. Be careful not to confuse these two terms!

Several hashing algorithms are available, but the SHA (Secure Hash Algorithm) family of algorithms is among the most popular. Specific algorithms are outside the scope of this book, but you can research SHA-1, SHA-2, and SHA-3 for additional context.

Data Loss Prevention (DLP)

Data loss prevention (DLP), also known as data leakage prevention, is the set of technologies and practices used to identify and classify sensitive data, while ensuring that sensitive data is not lost or accessed by unauthorized parties.

DLP can be applied to help restrict the flow of both structured and unstructured data to authorized locations and users. Effective use of DLP goes a long way to helping organizations safeguard their data’s confidentiality, both on-prem and in the cloud. To put it plainly, DLP analyzes data storage, identifies sensitive data components, and prevents users from accidentally or maliciously sending that sensitive data to the wrong party.

When designing a DLP strategy, organizations must consider how the technology fits in with their existing technologies, processes, and architecture. DLP controls need to be thoroughly understood and applied in a manner that aligns with the organization’s overall enterprise architecture in order to ensure that only the right type of data is blocked from being transmitted.

Hybrid cloud users, or users that utilize a combination of cloud-based and on-prem services, should pay extremely close attention to their enterprise security architecture while developing a DLP strategy. Because data traverses both cloud and noncloud environments, a poor DLP implementation can result in segmented data security policies that are hard to manage and ineffective.

DLP that is incorrectly implemented can lead to false positives (for example, blocking legitimate traffic) or false negatives (allowing sensitive data to be sent to unauthorized parties).

DLP implementations consist of three core components or stages:
  • Discovery and classification: The first stage of DLP is discovery and classification. Discovery is the process of finding all instances of data, and classification is the act of categorizing that data based on its sensitivity and other characteristics. Examples of classifications may include “credit card data,” “Social Security numbers,” “health records,” and so on. Comprehensive discovery and proper classification are crucial to success during the remaining DLP stages.
  • Monitoring: After data has been fully discovered and classified, it is able to be monitored. Monitoring is an essential component of the DLP implementation and involves watching data as it moves throughout the cloud data lifecycle. The monitoring stage is where the DLP implementation is looking to identify data that is being misused or handled outside of established usage policies. Effective DLP monitoring should happen on storage devices, networking devices, servers, workstations, and other endpoints — and it should evaluate traffic across all potential export routes (email, Internet browsers, and so on).
  • Enforcement: The final DLP stage, enforcement, is where action is taken on policy violations identified during the monitoring stage. These actions are configured based on the classification of data and the potential impact of its loss. Violations involving less sensitive data are traditionally logged and/or alerted on, while more sensitive data can actually be blocked from unauthorized exposure or loss. A common use case here is financial services companies that detect credit card numbers being emailed to unauthorized domains and are able to stop the email in its tracks, before it ever leaves the corporate network (a simplified detection sketch follows this list).
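
To illustrate the detection-and-enforcement idea from the list above, here is a deliberately simplified Python sketch that blocks outbound messages containing something that looks like a credit card number. Real DLP engines use far richer detection (many patterns, validation such as Luhn checks, and contextual analysis), so treat this only as a conceptual example:

```python
import re

# Simplified pattern: 16 digits, optionally separated by spaces or dashes
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def enforce(outbound_message: str) -> str:
    """Block the message if it appears to contain a card number; otherwise allow it."""
    if CARD_PATTERN.search(outbound_message):
        return "BLOCKED: possible credit card number detected"
    return "ALLOWED"

print(enforce("Invoice total is $42.00"))                  # ALLOWED
print(enforce("My card is 4111 1111 1111 1111, thanks"))   # BLOCKED
```
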
Always remember “Security follows the data” — and DLP technology is no different. When creating a DLP implementation strategy, it’s important that you consider techniques for monitoring activity in every data state. DLP data states are
  • DLP at rest: For data at rest, the DLP implementation is stored wherever the data is stored, such as a workstation, file server, or some other form of storage system. Although this DLP implementation is often the simplest, it may need to work in conjunction with other DLP implementations to be most effective.
  • DLP in transit: Network-based DLP is data loss prevention that involves monitoring outbound traffic near the network perimeter. This DLP implementation monitors traffic over Hypertext Transfer Protocol (HTTP), Hypertext Transfer Protocol Secure (HTTPS), File Transfer Protocol (FTP), Simple Mail Transfer Protocol (SMTP), and other protocols.

If the network traffic being monitored is encrypted, you will need to integrate encryption and key management technologies into your DLP solution. Standard DLP implementations cannot effectively monitor encrypted traffic, such as HTTPS.

  • DLP in use: Host-based, or endpoint-based, DLP is data loss prevention that involves installation of a DLP application on a workstation or other endpoint device. This DLP implementation allows monitoring of all data in use on the client device and provides insights that network-based DLP is not able to provide.

Because of the massive scale of many cloud environments, host-based DLP can be a major challenge. There are simply too many hosts and endpoints to monitor without a sophisticated strategy that involves automated deployment. Despite this challenge, host-based DLP is not impossible in the cloud, and CSPs continue to make monitoring easier as new cloud-native DLP features become available.

After you understand DLP and how it can be used to protect cloud data, there are a few considerations that cloud security professionals commonly face when implementing cloud-based DLP:
  • Cloud data is highly distributed and replicated across locations. Data can move between servers, from one data center to another, to and from backup storage, or between a customer and the cloud provider. This movement, along with the data replication that ensures availability, presents challenges that need to be worked through in a DLP strategy.
  • DLP technologies can impact performance. Host-based DLP scans all data access activities on an endpoint, and network-based DLP scans all outbound network traffic crossing a network boundary. This constant monitoring and scanning can impact system and network performance and must be considered while developing and testing your DLP strategy.
  • Cloud-based DLP can get expensive. The pay-for-what-you-use model is often a great savings to cloud customers, but when it comes to DLP, the constant resource utilization associated with monitoring traffic can quickly add up. It’s important to model and plan for resource consumption costs on top of the costs of the DLP solution itself.

Data de-identification

Confidentiality is incredibly important, especially in the cloud. While mechanisms like encryption and DLP go a long way to providing data confidentiality, they’re not always feasible. Data de-identification (or anonymization) is the process of removing information that can be used to identify a specific individual from a dataset. This technique is commonly used as a privacy measure to protect Personally Identifiable Information (PII) or other sensitive information from being exposed when an entire dataset is shared. In its purest form, data de-identification simply removes the identifying fields; for example, student names can be stripped from a grade report in order to protect the confidentiality of each student’s grades.
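
As a toy illustration of that purest form, the snippet below simply drops the identifying field before the dataset is shared (the records and field names are invented for the example):

```python
# Hypothetical grade records that contain an identifying field
grades = [
    {"student": "Alice Smith", "grade": "A"},
    {"student": "Bob Jones", "grade": "C"},
]

# De-identification in its purest form: remove the identifier before sharing
de_identified = [{"grade": record["grade"]} for record in grades]
print(de_identified)   # [{'grade': 'A'}, {'grade': 'C'}] -- no way to tell whose grade is whose
```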

Several techniques are available to de-identify sensitive information; masking (or obfuscation) and tokenization are two of the most commonly used methods.

Masking

Masking is the process of partially or completely replacing sensitive data with random characters or other nonsensitive data. Masking, or obfuscation, can happen in a number of ways; one of the most popular forms is partial masking, which is commonly used to protect credit card numbers and other sensitive financial information.

As a cloud security professional, you can use several techniques when masking or obfuscating data. Here are a few to remember:

  • Substitution: Substitution mimics the look of real data, but replaces (or appends) it with some unrelated value. Substitution can either be random or algorithmic, with the latter allowing two-way substitution — meaning if you have the algorithm, then you can retrieve the original data from the masked dataset.
  • Scrambling: Scrambling mimics the look of real data, but simply jumbles the characters into a random order. For example, a customer whose account number is #5551234 may be shown as #1552435 in a development environment. (For what it’s worth, my scrambled phone number is 0926381135.)
  • Deletion or nulling: This technique is just what it sounds like. When using this masking technique, data appears blank or empty to anyone who isn’t authorized to view it.
Aside from being used to comply with regulatory requirements (like HIPAA or PCI DSS), data masking is often used when organizations need to use production data in a test or development environment. By masking the data, development teams are able to work with realistic data without exposing sensitive data elements to unauthorized viewers or less secure environments.
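
Here’s a rough sketch of the three masking techniques described above; the helper functions are invented for illustration, and production masking tools are considerably more sophisticated:

```python
import random

def substitute(card_number: str) -> str:
    """Substitution: mimic the look of real data with unrelated random digits."""
    return "".join(random.choice("0123456789") for _ in card_number)

def scramble(account_number: str) -> str:
    """Scrambling: jumble the original characters into a random order."""
    chars = list(account_number)
    random.shuffle(chars)
    return "".join(chars)

def null_out(value: str) -> str:
    """Deletion or nulling: the field appears empty to unauthorized viewers."""
    return ""

print(substitute("4111111111111111"))   # e.g. 8302917465518824
print(scramble("5551234"))              # e.g. 1552435
print(repr(null_out("5551234")))        # ''
```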

Tokenization

Tokenization is the process of substituting a sensitive piece of data with a nonsensitive replacement, called a token. The token is merely a reference back to the sensitive data, but has no meaning or sensitivity on its own. The token maintains the look and feel of the original data and is mapped back to the original data by the tokenization engine or application. Tokenization allows code to continue to run seamlessly, even with randomized tokens in place of sensitive data.
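
The toy engine below shows the idea: the token mimics the format of the original value, while the mapping back to the sensitive data lives only inside the engine’s vault. (The class and its in-memory storage are invented for illustration; real tokenization services add access controls, durable vaults, and collision handling.)

```python
import secrets

class TokenizationEngine:
    """Toy engine: swaps sensitive values for random, format-preserving tokens."""

    def __init__(self):
        self._vault = {}   # token -> original sensitive value

    def tokenize(self, card_number: str) -> str:
        # Generate a random token with the same length and character set
        token = "".join(secrets.choice("0123456789") for _ in card_number)
        self._vault[token] = card_number
        return token

    def detokenize(self, token: str) -> str:
        # Only the engine can map the token back to the original value
        return self._vault[token]

engine = TokenizationEngine()
token = engine.tokenize("4111111111111111")
print(token)                      # looks like a card number but is meaningless on its own
print(engine.detokenize(token))   # 4111111111111111
```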

Tokenization can be outsourced to external, cloud-based tokenization services (referred to as tokenization-as-a-service). When using these services, it’s prudent to understand how the provider secures your data both at rest and in transit between you and their systems.
