Data Deduplication PPT - Tan Yujuan


Data Deduplication
Tan Yujuan, Huazhong University of Science and Technology

Data Deduplication
Motivation
Background
Research Topic
Use Cases
How to Use?

Motivation (1)
Global Storage
IDC: roughly 3/4 of the data in the digital world is a copy

Motivation (2)
Backup System

Motivation (3)
Data transfer bottleneck
Backing up 1 TB of data to Amazon S3
The average measured bandwidth is 800 KB/s
(1 × 10^12 bytes) / (800 × 10^3 bytes/second) = 1,250,000 seconds
more than 14 days: an unacceptable backup window
2003-2008: wide-area bandwidth grew 2.7x while computing grew 16x and disk storage 10x, so the network increasingly lags the data it must carry.

Motivation (4)

Motivation (5)

Dedup Background: Whole-File Deduplication
Diagram: files foo and bar have byte-identical contents (01101010….. ….110010101), so whole-file deduplication stores only a single copy.
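The idea fits in a few lines. A minimal sketch (my illustration, not from the slides; `store`, `catalog`, and `dedupe_file` are invented names): hash each file's full contents and keep one stored copy per distinct hash.

```python
import hashlib

store = {}    # content hash -> stored bytes (one copy per distinct content)
catalog = {}  # file name -> content hash

def dedupe_file(name: str, data: bytes) -> str:
    """Whole-file dedup: byte-identical files share one stored copy."""
    digest = hashlib.sha256(data).hexdigest()
    if digest not in store:
        store[digest] = data  # first occurrence: store the content
    catalog[name] = digest    # duplicates only add a catalog entry
    return digest

# foo and bar are byte-identical, so only one copy lands in the store.
dedupe_file("foo", b"01101010...110010101")
dedupe_file("bar", b"01101010...110010101")
assert len(store) == 1
```

Whole-file dedup is cheap but misses files that differ by even one byte, which is what the chunk-level schemes below address.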

Dedup Background: Fixed-Chunk Deduplication
Diagram: foo and bar are cut into fixed-size chunks; chunks with identical contents (01101010….., ….110010101) are stored only once even though the files differ elsewhere (…1110111111).
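A sketch of the same store at chunk granularity (again my illustration; real systems use kilobyte-range chunks): split each file into fixed-size pieces and store each distinct piece once.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; production systems use e.g. 4-8 KB

def fixed_chunks(data: bytes):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def dedupe(data: bytes, store: dict) -> list:
    """Store each distinct chunk once; return the file's chunk recipe."""
    recipe = []
    for chunk in fixed_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # only unseen chunks are stored
        recipe.append(digest)            # the recipe rebuilds the file later
    return recipe

store = {}
dedupe(b"0110101011001010", store)  # foo
dedupe(b"0110101011101111", store)  # bar shares foo's first two chunks
```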

Dedup Background
Fixed chunk vs. variable-sized chunk under data insertion:
With fixed chunks, inserted data shifts every subsequent chunk boundary, so previously identical chunks stop matching. With variable-sized chunks, boundaries follow the content and re-align after the insertion, as the sketch below demonstrates.
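A small demonstration of the problem (my example): with fixed-size chunks, a single inserted byte shifts every later boundary, so none of the old chunk hashes match afterwards.

```python
def fixed_chunks(data: bytes, size: int = 4):
    return [data[i:i + size] for i in range(0, len(data), size)]

old = b"ABCDEFGHIJKLMNOP"
new = b"X" + old  # insert one byte at the front

# Every boundary after the insertion point shifts, so nothing matches:
print(set(fixed_chunks(old)) & set(fixed_chunks(new)))  # set()
```

Content-defined chunking avoids this by deriving boundaries from the bytes themselves, as the Rabin fingerprinting sketch below shows.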


Dedup Background: Rabin Fingerprinting
Diagram: a Rabin fingerprint is computed over a sliding window of the files foo and bar; wherever the fingerprint matches a chosen pattern (e.g., 110101, 101010, 010100 in the figure), a chunk boundary is declared, so boundaries depend on the bytes themselves rather than their positions.
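A simplified content-defined chunker in the spirit of Rabin fingerprinting (a sketch: the BASE/MOD arithmetic below is an ordinary polynomial rolling hash, not true Rabin polynomial division): a boundary is declared wherever the rolling hash of the last few bytes matches a pattern, so boundaries follow content and re-align after an insertion.

```python
import os

WINDOW = 8                      # rolling-window size in bytes
MASK = 0x3F                     # boundary pattern -> ~64-byte average chunks
BASE, MOD = 257, 1 << 32
POW_W = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window

def cdc_chunks(data: bytes) -> list:
    """Cut a chunk wherever the hash of the last WINDOW bytes hits MASK."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW_W) % MOD  # slide the window
        if (h & MASK) == MASK and i + 1 - start >= WINDOW:
            chunks.append(data[start:i + 1])  # boundary chosen by content alone
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Positions shift after an insert, but contents do not, so most chunks survive:
blob = os.urandom(4096)
a, b = cdc_chunks(blob), cdc_chunks(b"XXXXXXXXXX" + blob)
print(len(set(a) & set(b)), "of", len(a), "chunks unchanged after the insert")
```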

The Deduplication Space

Algorithm           Parameters          Cost                        Dedup effectiveness
Whole-file          (none)              Low; seeks                  Lowest
Fixed chunk         Chunk size          CPU, complexity, seeks      Middle
Rabin fingerprints  Average chunk size  More CPU, more complexity   Highest

Deduplication vs. Compression

Deduplication:
- Lossless compression
- Granularity: file-level or chunk-level
- A system-level compression technology for large-scale storage systems

Conventional lossless compression:
- Granularity: byte-level
- Suited to small datasets
- A general-purpose data compression technology

Implementation (1)
Client-side deduplication: used to save bandwidth.
Diagram: the application server deduplicates the backup stream locally, so the stream that travels to the storage device has already been deduplicated.

Implementation (2)
Target-side deduplication: used to save storage.
Diagram: the application server sends the backup stream before deduplication; the storage device deduplicates it on arrival.
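A sketch of the client-side variant's bandwidth saving (hypothetical API; `client_backup` and the in-memory `server_index` are inventions for illustration): the client ships fingerprints first and uploads only the chunks the server does not already hold. A target-side system would instead send the whole stream and deduplicate on arrival.

```python
import hashlib

def client_backup(chunks: list, server_index: set) -> list:
    """Client-side dedup: exchange fingerprints, then upload only new chunks."""
    fps = [hashlib.sha256(c).hexdigest() for c in chunks]
    missing = {fp for fp in fps if fp not in server_index}  # server's answer
    uploads = [c for c, fp in zip(chunks, fps) if fp in missing]
    server_index |= missing  # server commits the new fingerprints
    return uploads           # only these chunks crossed the network

index = set()
client_backup([b"aaa", b"bbb"], index)                     # uploads both
assert client_backup([b"aaa", b"ccc"], index) == [b"ccc"]  # only the new chunk
```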

Deduplication Process
1. Chunking
2. Indexing: index lookup
3. Link data generation
4. Commit new data chunk
5. Update index table
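Combining the steps, a minimal in-memory pass over a backup stream (a sketch with invented names, not a real system's API):

```python
import hashlib

def deduplicate(stream: bytes, index: dict, chunk_size: int = 4096) -> list:
    """One pass of the pipeline; returns the file recipe (a list of links)."""
    recipe = []
    for i in range(0, len(stream), chunk_size):  # 1. chunking (fixed-size)
        chunk = stream[i:i + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()   # 2. indexing (fingerprint)
        if fp not in index:                      # 3. index lookup
            index[fp] = chunk                    # 4. commit the new chunk,
                                                 # 5. updating the index table
        recipe.append(fp)                        # link data generation
    return recipe

index = {}
recipe = deduplicate(b"A" * 8192 + b"B" * 4096, index)
print(len(recipe), "links,", len(index), "stored chunks")  # 3 links, 2 chunks
```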

Research: Disk Bottleneck
RAM is too small to hold the index entries for every data chunk, so most of the index must reside on disk.
8 TB of data needs about 20 GB of index; 800 TB of data needs about 2 TB of index. Too large!
During duplicate-chunk lookup, these index queries trigger a large number of disk accesses, creating the disk bottleneck of the deduplication process.
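Those figures are consistent with an 8 KB average chunk size and a 20-byte (SHA-1) fingerprint per index entry, an assumption of mine that the quick calculation below bears out:

```python
KB, TB = 1024, 1024 ** 4
CHUNK = 8 * KB  # assumed average chunk size
ENTRY = 20      # assumed bytes per index entry (one SHA-1 fingerprint)

for data in (8 * TB, 800 * TB):
    n_chunks = data // CHUNK
    index_bytes = n_chunks * ENTRY
    print(f"{data // TB:>4d} TB data -> {n_chunks:.1e} chunks, "
          f"{index_bytes / 1024 ** 3:,.0f} GiB of index")
# 8 TB -> ~1e9 chunks -> ~20 GiB; 800 TB -> ~1e11 chunks -> ~2 TiB
```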

Research: Reliability
Diagram: after deduplication, files 1, 2, and 3 share data chunks 1 through 6. A single chunk may be referenced by several files, so the loss of one shared chunk can corrupt every file that points to it, which is the reliability problem.
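A tiny sketch (hypothetical recipes, my illustration) of why this matters: inverting the file-to-chunk mapping shows every file that a single lost chunk would corrupt.

```python
# Hypothetical recipes: file -> the chunk ids it references after dedup.
files = {
    "file1": [1, 2, 3],
    "file2": [2, 3, 4, 5],
    "file3": [3, 5, 6],
}

# Invert the mapping: which files break if a given chunk is lost?
affected = {}
for name, chunks in files.items():
    for c in chunks:
        affected.setdefault(c, set()).add(name)

print(sorted(affected[3]))  # ['file1', 'file2', 'file3']: one chunk, three files
```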
