Accelerate Data Deduplication Using Chunking and Hashing Functions

By Quoc-Thai Le

Abstract

This article covers the chunking and hashing functions found in the Intel® Intelligent Storage Acceleration Library (Intel® ISA-L). Intel ISA-L is an algorithmic library, optimized for Intel® architecture (IA), that addresses key storage market needs including data integrity (CRC), security/encryption (AES), erasure codes, compression, and more.

With so much data coming into cloud storage, the demand for storage space and data security is exploding. A technique called data deduplication can improve storage space utilization by eliminating duplicated data in a given set of files. During deduplication, a hashing function can be applied to each data chunk to generate a fingerprint that identifies it. Traditionally, these processes have been too CPU intensive for production environments. With Intel® processors, Intel ISA-L provides the tools to help accelerate everything from small office NAS appliances up to enterprise storage systems.

Sample Code

Note: For more details on how to build the executable, see the instructions in the Intel ISA-L package. The request form is provided in the resources section at the end of this article.

To address these specific needs in cloud storage, Intel ISA-L provides accelerated multi-buffer hashes and rolling hashes for data deduplication. The sample code demonstrates how these two functions can be combined efficiently.

Implementation Details

The sample code shows how variable-length chunking and multi-buffer hashing can be combined; a consolidated sketch of the full flow appears just before the conclusion. The process consists of:

1. Set the parameters including minimum, maximum, and mean chunk size.

w = 32;				// Set width of rolling hash window
min_chunk = 1024;		// Smallest chunk allowed
mean_chunk = 4 * 1024;		// Target for average chunk size
max_chunk = 32 * 1024;		// Largest allowed chunk size

A chunk boundary is set wherever the windowed hash meets the condition below; the choice of mask determines how frequently the condition hits, and therefore the average chunk size:

(hash & mask) == trigger

A helper function can pick a mask value that should, on average, produce the requested mean chunk size.

mask = rolling_hashx_mask_gen(mean_chunk, 0);
trigger = rand() & mask;
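
To see why this yields the requested average, note that each bit set in the mask halves the probability that a well-mixed hash matches the trigger at a given byte position. The helper below is a sketch for intuition only, not an Intel ISA-L API; it builds such a mask directly:

#include <stdint.h>

/* Illustration only, not part of Intel ISA-L: a mask with log2(mean_chunk)
 * low bits set makes (hash & mask) == trigger true with probability of
 * about 1/mean_chunk per position, so boundaries land roughly mean_chunk
 * bytes apart on average. */
static uint32_t approx_mask_for_mean(uint32_t mean_chunk)
{
      uint32_t mask = 0;

      while (mean_chunk > 1) {
            mask = (mask << 1) | 1;
            mean_chunk >>= 1;
      }
      return mask;		// mean_chunk = 4096 gives mask = 0xFFF
}

Since only the masked bits are compared, any trigger whose set bits lie within the mask works equally well, which is why the sample simply derives one with rand() & mask.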

This makes it fairly simple to run through the data a chunk at a time.

while (remain > max_chunk)
{

2. Skip ahead to the minimum chunk size, priming the rolling hash window on the w bytes just before it.

rolling_hash1_reset(&state, p + min_chunk - w);

3. Start at min chunk and run until a boundary is found or max chunk is hit.

rolling_hash1_run(&state, p + min_chunk, max_chunk - min_chunk, mask, trigger, &offset);

4. At the chunk boundary, process the chunk.

process_chunk(p, min_chunk + offset);

5. Then go to the next chunk.

p += offset + min_chunk;
remain -= (offset + min_chunk);
}
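
The loop above stops once remain is no larger than max_chunk, so the leftover fragment at the end of the buffer never reaches process_chunk() inside it. A minimal way to finish, assuming the same p and remain variables, is to hash that fragment as one final chunk:

if (remain > 0)
      process_chunk(p, remain);		// Hash the final, short fragment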

The Process

After chunk boundaries are identified, the next step is to generate a cryptographic hash of each entire chunk. The hash uniquely identifies the chunk's contents, so duplicate chunks can be detected by comparing hashes.

High-performance multi-buffer hashing uses the asynchronous hash submit() call. Two support functions, get_next_job_ctx() and put_next_job_ctx(), maintain a pool of multi-buffer hash job context structures and abstract their rotation and reuse.

void process_chunk(uint8_t * buff, int len)
{
      SHA256_HASH_CTX *ctx;

      // Grab a free context and submit the whole chunk as one hash job
      ctx = get_next_job_ctx();
      ctx = sha256_ctx_mgr_submit(&mb_hash_mgr, ctx, buff, len, HASH_ENTIRE);

      // submit() may return a previously submitted job that has just
      // completed, or NULL if no job finished on this call
      if (ctx) {
            process_finished_hash(ctx);
            put_next_job_ctx(ctx);
      }
}

This allows pre-allocation of the context pool and keeps alloc/free out of the processing loop.

Final processing of each hash is then performed on finished chunks as they are returned, using process_finished_hash(). The multi-buffer hashing interface works best with many jobs in the queue and can efficiently handle multiple hashes even when the submitted jobs have different lengths.
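
What process_finished_hash() does is left to the application. One possibility, sketched below, is to use the finished SHA-256 digest as the key into a deduplication index. The dedup_index_lookup() and dedup_index_insert() helpers and the duplicate_chunks counter are hypothetical stand-ins, and reading the digest from ctx->job.result_digest assumes the layout of the Intel ISA-L multi-buffer job structure:

/* Hypothetical sketch: treat the finished SHA-256 digest as the chunk's
 * fingerprint and consult a deduplication index. */
void process_finished_hash(SHA256_HASH_CTX * ctx)
{
      uint32_t *digest = ctx->job.result_digest;	// 8 x 32-bit words

      if (dedup_index_lookup(digest))
            duplicate_chunks++;			// Seen before: deduplicate
      else
            dedup_index_insert(digest);	// New chunk: record fingerprint
}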

When processing through the loop in steps 2 to 5 is complete, most chunks have been found and hashed; what remains is the leftover fragment at the end of the buffer and any jobs still sitting in the multi-buffer queue. The final step is to flush the queue and process these last chunks.

while ((ctx = sha256_ctx_mgr_flush(&mb_hash_mgr)) != NULL)
     process_finished_hash(ctx);

At this point all chunks have been processed. The computationally intensive functions in deduplication (variable-length chunking and generating a cryptographic hash of each chunk) are complete.

The functions below abstract the pooling of the multi-buffer hash context structures that keep track of hash jobs.

SHA256_HASH_CTX ctxpool[SHA256_MAX_LANES], *last_ctx;
SHA256_HASH_CTX_MGR mb_hash_mgr;

// One-time setup: initialize the multi-buffer manager and the context pool
void setup_chunk_processing(void)
{
      int i;

      sha256_ctx_mgr_init(&mb_hash_mgr);

      for (i = 0; i < HASH_POOL_SIZE; i++)
            hash_ctx_init(&ctxpool[i]);

      last_ctx = &ctxpool[0];
}

// Return a context that is free for a new submission: prefer the most
// recently recycled context, then scan the pool, and as a last resort
// flush the manager to force a queued job to completion
SHA256_HASH_CTX *get_next_job_ctx(void)
{
      int i;
      SHA256_HASH_CTX *ctx;

      if (last_ctx && hash_ctx_complete(last_ctx))
            return last_ctx;

      for (i = 0; i < HASH_POOL_SIZE; i++) {
            if (hash_ctx_complete(&ctxpool[i]))
                  return &ctxpool[i];
      }
      // No free context: flush to force one queued job to completion,
      // process its result, then hand its context back for reuse
      ctx = sha256_ctx_mgr_flush(&mb_hash_mgr);
      assert(ctx != NULL);
      process_finished_hash(ctx);
      return ctx;
}

// Remember a freshly completed context so the next call to
// get_next_job_ctx() can return it without scanning the pool
void put_next_job_ctx(SHA256_HASH_CTX * ctx)
{
      if (ctx && hash_ctx_complete(ctx))
            last_ctx = ctx;
}
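
Putting the pieces together, the complete flow might look like the consolidated sketch below. The function name dedup_buffer and its buf/len parameters are illustrative, and the sketch assumes the chunking parameters, mask, trigger, and rolling hash state from step 1 live at file scope, as in the earlier snippets:

// Consolidated sketch (illustrative): chunk a buffer, hash each chunk,
// then drain the multi-buffer queue
void dedup_buffer(uint8_t * buf, long len)
{
      uint8_t *p = buf;
      long remain = len;
      uint32_t offset;
      SHA256_HASH_CTX *ctx;

      setup_chunk_processing();

      while (remain > max_chunk) {
            // Prime the window on the w bytes before the minimum chunk size
            rolling_hash1_reset(&state, p + min_chunk - w);
            // Scan for a boundary between min_chunk and max_chunk
            rolling_hash1_run(&state, p + min_chunk, max_chunk - min_chunk,
                              mask, trigger, &offset);
            process_chunk(p, min_chunk + offset);
            p += offset + min_chunk;
            remain -= (offset + min_chunk);
      }

      if (remain > 0)
            process_chunk(p, remain);	// Leftover fragment

      // Flush any jobs still queued in the multi-buffer manager
      while ((ctx = sha256_ctx_mgr_flush(&mb_hash_mgr)) != NULL)
            process_finished_hash(ctx);
}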

Conclusion

This walkthrough of the C sample code shows how readily the chunking and hashing functions from Intel ISA-L can be combined. Storage developers can take what is presented here and quickly apply it to cloud storage environments running on IA.

Related Links and Resources

About the Author

The author would like to recognize Greg Tucker and Colleen Culberson for their contributions. 

Quoc-Thai Le is a software engineer and has been with Intel Corporation for over 20 years. His areas of focus have included software automation, server power and performance analysis, and software-defined storage.