IT Solutions: Data Deduplication – An Essential Component of your Data Protection Strategy
Organisations are facing many challenges when it comes to managing and protecting their business data. According to various industry experts, companies will experience a number of changes to their environment that will cause them to reconsider how they deploy data protection and business continuity solutions.
Virtualisation of servers continues to be high on the agenda for many companies. More companies are looking to drive efficiencies by deploying new applications on virtual platforms, and they are beginning to migrate the bulk of existing applications away from physical servers.
While server virtualisation promises to simplify the infrastructure, what many companies find is that it increases the complexity of management functions, especially with regard to data protection tasks.
Another challenge facing many organisations is merger and acquisition activity, which results in either a distributed workforce or a consolidation exercise. Either way, IT teams need to manage more users, with more data, spread over more locations. Sitting behind these challenges is the acceleration in the acquisition and generation of data. This at best complicates the other business challenges, and at worst introduces risk to an organisation's ability to continue delivering services and generating income.
When we look at these challenges from a data protection viewpoint we can see that it is simply a case of:
- Addressing expanding data generated by more users in more locations.
- Keeping the organisation running by ensuring high availability of key applications to maintain business processes.
- Supporting the ever-evolving environment. This spans everything from a move to virtualised servers, through hardware upgrades, to new versions of important business tools.
To help alleviate this problem, data reduction technologies are being deployed as part of the data protection strategy, allowing more data to fit onto the disk targets and helping organisations cope better with data growth. Over the last few years the data reduction technologies used to resolve the problem have evolved:
- Data Compression Algorithms. Most data protection products have included compression algorithms for many years, either as part of the core product or as an option. Compression features are used to reduce the consumption of resources such as backup destination capacity or network bandwidth. The data would often be compressed at the client before transmission to the destination media, incurring a processing cost which occasionally would be detrimental to the performance of the applications being protected. The algorithms and techniques used in the design of the data compression functions varied, and often involved trade-offs between the degree of compression and the processing resources required to compress and decompress the data.
- Single Instance Storage (SIS). This is perhaps best described as file-level deduplication: the process of keeping one copy of content that multiple users or computers share. It is a means to eliminate duplicate data and to increase the efficiency of storage systems. SIS is often found within file systems, e-mail server software, data backup and other storage-related solutions. One of the more common implementations of single instance storage is within email servers, where SIS is used to keep a single copy of a message within the database even when it has been sent to multiple users. This was implemented within email products to help them handle the dramatic increases in email volume, and it resolved problems associated with both the architectural limits on the size of the database and the performance impact of delivering one email to multiple recipients: it is quicker to set the pointers than it is to write many copies to disk. When used within a backup solution, single instance storage can help to reduce the amount of target media required, since it avoids storing duplicate copies of the same file. When protecting multiple servers or environments with many users of unstructured data, identical files are very common. For example, if an organization has not deployed a collaboration tool such as Microsoft SharePoint, many users will save the same document in their home directories, resulting in many duplicates consuming space on the backup media and causing longer backup processes (a minimal file-level sketch follows this list).
- Data Deduplication. Increasingly the term Data Deduplication refers to the technique of reducing data by breaking streams down into very granular components, such as blocks or bytes, storing only the first instance of each item on the destination media, and recording every subsequent occurrence as a reference in an index. Because it works at a more granular level than single instance storage, the resulting savings in space are much higher, delivering more cost-effective solutions. The savings in space translate directly to reduced acquisition, operation, and management costs.
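To make the single instance idea concrete, here is a minimal sketch of file-level deduplication in Python. It is illustrative only: the SHA-256 digest, the in-memory dictionaries, and the directory walk are assumptions for the example, not features of any particular product.

```python
import hashlib
import os

def file_digest(path, block_size=1 << 20):
    """Return the SHA-256 digest of a whole file, read in 1 MB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

def single_instance_backup(source_dir):
    """Keep one stored copy per unique file; duplicates become references."""
    store = {}       # digest -> path of the single stored instance
    references = {}  # path   -> digest it points to
    for root, _, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            digest = file_digest(path)
            if digest not in store:
                store[digest] = path      # first copy: store the content
            references[path] = digest     # every copy: record a pointer
    return store, references
```

Because identical files produce identical digests, the second and later copies of a widely circulated document become entries in the reference table rather than more full copies on the backup media.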
THE DATA DEDUPLICATION REVOLUTION
Data Deduplication technologies are deployed in many forms and in many places within the backup and recovery infrastructure. The technology has evolved from being delivered within specially designed disk appliances offering post-process deduplication to being a distributed capability integrated into backup and recovery software. Along the way, some solution suppliers have identified the good and bad points of each stage and developed what are today high-performance, efficient technologies.
How does Deduplication work?
As with many things in the world of IT, there are numerous techniques in use for deduplicating data. Some are unique to specific vendors, who guard their technology behind patents and copyrights, while others use more open methods. The goal of all of them is to identify the maximum amount of duplicate data using the minimum of resources. The most common technique in use is that of “chunking” the data.
Deduplication takes place by splitting the data stream into “chunks” and then comparing the chunks with each other. Some implementations use fixed chunk sizes; others use variable chunk sizes. The latter tends to offer a higher success rate in identifying duplicate data, as it is able to adapt to different data types and environments. The smaller the chunk size, the more duplicates will be found; however, the performance of the backup, and more importantly the restore, is affected. Vendors therefore spend a lot of time identifying the optimal size for different data types and environments, and the use of variable chunk sizes often allows tuning to occur, sometimes automatically.
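The two chunking styles can be sketched as follows. The chunk sizes, window, and divisor below are illustrative assumptions, and the simple sliding-window checksum stands in for the stronger rolling hashes (such as Rabin fingerprints) that real products use for content-defined boundaries.

```python
def fixed_chunks(data, size=4096):
    """Split a byte stream into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, window=48, divisor=2048, min_size=1024, max_size=8192):
    """Content-defined (variable) chunking: cut where a checksum of the last
    `window` bytes matches a pattern, so boundaries follow the content rather
    than fixed offsets. All parameters here are illustrative."""
    chunks, start, checksum = [], 0, 0
    for i, byte in enumerate(data):
        checksum += byte
        if i - start >= window:
            checksum -= data[i - window]      # keep a sliding-window sum
        length = i - start + 1
        if (length >= min_size and checksum % divisor == 0) or length >= max_size:
            chunks.append(data[start:i + 1])  # boundary found: emit the chunk
            start, checksum = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because the variable-size boundaries depend on the bytes themselves, inserting a few bytes near the start of a file shifts only the chunks around the edit, while fixed-size chunking would shift every subsequent boundary and defeat most duplicate matching.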
During deduplication every chunk of data is processed using a hash algorithm and assigned a unique identifier, which is then compared against an index. If that hash value is already in the index, the piece of data is considered a duplicate and does not need to be stored again; instead a link is made to the original data. Otherwise the new hash value is added to the index and the new data is stored on the disk. When the data is read back, if a link is found, the system simply replaces that link with the referenced data chunk. The deduplication process is intended to be transparent to end users and applications.
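The store-or-link decision can be illustrated as a small index keyed by chunk hashes. SHA-256 and the in-memory dictionaries below are assumptions made for the sketch; production systems use persistent, far more scalable index structures.

```python
import hashlib

class DedupStore:
    """Toy chunk store: keeps each unique chunk once and records every
    backup object as an ordered list of chunk identifiers (hashes)."""

    def __init__(self):
        self.index = {}    # chunk hash -> stored chunk bytes
        self.objects = {}  # object name -> ordered list of chunk hashes

    def write(self, name, chunks):
        refs = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index:   # new data: store it once
                self.index[digest] = chunk
            refs.append(digest)            # duplicate or not: keep a link
        self.objects[name] = refs

    def read(self, name):
        # Reassemble the original stream by following the links.
        return b"".join(self.index[d] for d in self.objects[name])
```

Writing two backups that share most of their chunks populates the index only once for the shared data, while each backup keeps its own list of references; reading simply replaces each reference with the stored chunk, which is why the process is transparent to the application restoring the data.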
When does Deduplication Occur?
Deduplication can occur in any of three places: “at the client”, where the source data sits; “in-line”, as the data travels to the target; or “on the target”, after the data has been written, the last of which is often referred to as “post process”. All three locations offer advantages and disadvantages, and one or more of these techniques will be found in the deduplication solutions available on the market today. The choice of which type of deduplication an organization deploys is governed by its infrastructure, budgets, and perhaps most importantly, its business process requirements.
Post Process Deduplication
This works by first capturing and storing all the data, and then processing it at a later time to look for the duplicate chunks. It requires a larger initial disk capacity than in-line solutions; however, because the processing of duplicate data happens after the backup is complete, there is no real performance hit on the data protection process.
The CPU and memory resources used by the deduplication process are consumed on the target, away from the original application, and therefore do not interfere with business operations. As the target device may be the destination for data from many file and application servers, post-process deduplication also offers the additional benefit of comparing data from all sources; this global deduplication increases the level of saved storage space even further.
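As a rough illustration of the post-process approach, the sketch below walks backup files that have already landed on the target and folds them into one shared index after the fact, which is where the “global” effect comes from. The landing-directory layout and the reuse of the toy DedupStore and chunking functions above are assumptions for the example only.

```python
import os

def post_process(landing_dir, store, chunker):
    """Deduplicate already-written backup files into one shared store.
    Because every source's data passes through the same index, duplicates
    are found across clients as well as within them (global deduplication)."""
    for name in sorted(os.listdir(landing_dir)):
        path = os.path.join(landing_dir, name)
        with open(path, "rb") as f:
            data = f.read()
        store.write(name, chunker(data))  # e.g. store = DedupStore(), chunker = variable_chunks
        os.remove(path)                   # reclaim the landing-zone space once deduplicated
```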
In-Line Deduplication
The analysis of the data, the calculation of the hash value, and the comparison with the index all take place as the data travels from source to target. The benefit is that less storage is required, because duplicate data is removed before it is written to the target disk; on the negative side, because so much processing has to occur, the movement of the data can be slowed down. In reality, the efficiency of in-line processing has increased to the point that the impact on the backup job is so small as to be inconsequential. Historically, the main issue with in-line deduplication was that it was often focused only on the data stream being transported, and did not always take into account data from other sources. This could result in a less “global” deduplication occurring and therefore more disk space being consumed than necessary.
Client Side Deduplication
Sometimes referred to as source deduplication, this takes place where the data resides. The deduplication hash calculations are initially created on the client (source) machines. Files that have identical hashes to files already on the target device are not sent; the target device simply creates the appropriate internal links to reference the duplicated data, which results in less data being transferred to the target. This efficiency does, however, incur a cost.
The CPU and memory resources required to analyze the data are also needed by the application being protected; therefore, application performance will most likely be negatively affected during the backup process.
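A rough sketch of the client-side exchange, reusing the toy chunking idea from earlier: the client hashes locally, asks the target which hashes it is missing, and sends only those chunks over the network. The two-step protocol and the names below are simplifications assumed for illustration, not any vendor's actual interface.

```python
import hashlib

class Target:
    """Target side: answers which hashes are unknown and links the rest."""

    def __init__(self):
        self.index = {}    # digest -> chunk bytes
        self.objects = {}  # backup name -> ordered list of digests

    def missing(self, digests):
        return {d for d in digests if d not in self.index}

    def store(self, name, digests, payload):
        self.index.update(payload)      # only the new chunks arrive here
        self.objects[name] = digests    # links cover the duplicates

def client_backup(name, data, target, chunker):
    """Client side: hash locally, then send only the chunks the target lacks."""
    chunks = chunker(data)
    digests = [hashlib.sha256(c).hexdigest() for c in chunks]  # CPU cost on the client
    missing = target.missing(digests)                          # small request over the wire
    payload = {d: c for d, c in zip(digests, chunks) if d in missing}
    target.store(name, digests, payload)                       # only new data travels
```

The bandwidth saving comes from the fact that `payload` contains only previously unseen chunks, while the processing cost of chunking and hashing lands on the same machine that runs the protected application.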
The data growth challenges that every organization is facing are pushing the implementation of new backup and recovery technologies to help meet service level agreements around availability and performance. Shrinking budgets are pulling IT departments in the opposite direction. Data Deduplication is a technology that helps organizations balance these opposing demands.
Disk-based backups can be rolled out in order to reduce the backup window and improve recovery time, and deduplication ensures the investment in those disk-based targets is maximized. Organizations should review the different deduplication technologies on the market and choose a solution that is able to integrate seamlessly into their backup environment in a cost-effective way, making sure that any investment does not tie them into a hardware solution that is difficult to expand as the organization's data grows.
For more information and a personalized IT Solutions business offer, please contact us.
Source: www.computerworld.com