Facebook credited members of its data infrastructure team with the development of HDFS RAID — including Dikang Gu, Peter Knowles, and Guoqiang Jerry Chen — and it offered some background on the technology in its developer blog post:
The default replication of a file in HDFS is three, which can lead to a lot of space overhead. HDFS RAID reduces this space overhead by reducing the effective replication of data. The replication factor of the original file is reduced, but data safety guarantees are maintained by creating parity data.
There are two erasure codes implemented in HDFS RAID: XOR and Reed-Solomon. The XOR implementation sets the replication factor of the file to two. It maintains data safety by creating one parity block for every 10 blocks of source data. The parity blocks are stored in a parity file, also at replication two. Thus for 10 blocks of the source data, we have 20 replicas for the source file and two replicas for the parity block, which accounts for a total of 22 replicas. The effective replication of the XOR scheme is 22/10 = 2.2.
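The XOR scheme's recovery property can be sketched in a few lines: XOR-ing all the source blocks produces a parity block, and XOR-ing the survivors with that parity reconstructs any single lost block. This is a toy illustration with tiny byte strings standing in for HDFS blocks, not actual HDFS RAID code.

```python
# Toy sketch of XOR parity (hypothetical 4-byte "blocks", not HDFS code).
source_blocks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\x0a\x0b\x0c\x0d"]

def xor_parity(blocks):
    """XOR all blocks together byte-by-byte to form one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

parity = xor_parity(source_blocks)

# If any single block is lost, XOR-ing the surviving blocks with the
# parity block reconstructs it (XOR is its own inverse).
lost = source_blocks[1]
recovered = xor_parity([source_blocks[0], source_blocks[2], parity])
assert recovered == lost
```

Because a single parity block can only repair one missing block per stripe, the XOR scheme still keeps two replicas of everything, whereas Reed-Solomon's four parity blocks tolerate more failures and allow replication one.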
The Reed-Solomon implementation sets the replication factor of the file to one. It maintains data safety by creating four parity blocks for every 10 blocks of source data. The parity blocks are stored in a parity file at replication one. Thus for 10 blocks of the source data, we have 10 replicas for the source file and four replicas for the parity blocks, which accounts for a total of 14 replicas. The effective replication of the Reed-Solomon scheme is 14/10 = 1.4.
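The arithmetic behind both figures is the same: replicas stored per stripe divided by source blocks per stripe. A small sketch (the function name and parameters are illustrative, not HDFS RAID's API):

```python
def effective_replication(data_rep, stripe_len, parity_blocks, parity_rep):
    """Replicas stored per stripe divided by source blocks per stripe."""
    total_replicas = stripe_len * data_rep + parity_blocks * parity_rep
    return total_replicas / stripe_len

# XOR: source at replication 2, 1 parity block per 10 source blocks,
# parity also at replication 2 -> (10*2 + 1*2) / 10
print(effective_replication(2, 10, 1, 2))  # 2.2

# Reed-Solomon: source at replication 1, 4 parity blocks per 10,
# parity at replication 1 -> (10*1 + 4*1) / 10
print(effective_replication(1, 10, 4, 1))  # 1.4
```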
Theoretically we should be able to get a 2.2x replication factor for XOR RAID files and a 1.4x replication factor for Reed-Solomon RAID files.
For much more on HDFS RAID — including how Facebook addressed the challenges of data corruption, of implementing RAID across a large directory, and of handling directory changes — please see the engineering blog post.