Tuesday, April 28, 2015

Compression In Map Reduce
==================

* Compression reduces the number of bytes written to and read from HDFS.
* It saves disk space and makes better use of network bandwidth.
* It reduces the amount of data transferred from the map nodes to the reduce nodes.

LZO
===

LZO is a compression/decompression library.

Compression format : LZO
Hadoop CompressionCodec : com.hadoop.compression.lzo.LzopCodec

NOTE: A "codec" is the implementation of a compression-decompression algorithm. In Hadoop, a "codec" is represented by an implementation of the "CompressionCodec" interface.

LZO key characteristics:
----------------------------

1. Very fast decompression
2. Compression requires an additional buffer (8 KB or 64 KB, depending on the compression level).
3. Decompression requires no additional buffer beyond the source and destination buffers, which is one reason
    decompression is so fast with LZO.
4. The user can adjust the balance between compression ratio and compression speed without affecting the speed
    of decompression.

--------------------------------------------------------------

The following compression codecs are available for Hadoop:

- org.apache.hadoop.io.compress.DefaultCodec
- com.hadoop.compression.lzo.LzoCodec (from the separate hadoop-lzo library)
- org.apache.hadoop.io.compress.SnappyCodec
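
Because the LZO codec classes are not part of Hadoop itself, they usually have to be registered before a job can refer to them. The fragment below is a minimal sketch of that registration in core-site.xml. It assumes the hadoop-lzo jar and the native LZO libraries are already installed on every node, and the codec list shown is only illustrative; the right list depends on what the cluster actually has installed.

```xml
<!-- core-site.xml (sketch; assumes hadoop-lzo is installed on all nodes) -->
<property>
      <!-- comma-separated list of codec classes Hadoop may load; illustrative only -->
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
      <!-- class that implements the LZO codec -->
      <name>io.compression.codec.lzo.class</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```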


--------------------------------------------------------------------------------
Configurations Required for LZO compression - In "mapred-site.xml"
--------------------------------------------------------------------------------
By default, compression is not enabled in MapReduce (i.e. the value is false). To enable it, edit the property below in
mapred-site.xml:

<property>

      <name>mapred.output.compress</name>
      <value>false</value>

</property>

NOTE: To enable compression, set the value to "true".

----------------------------------------------------------------------

The next property selects the compression codec used for the job output. By default, "DefaultCodec" is used.
To use a codec other than the default (LZO or Snappy), replace the value with the corresponding codec class.

The property looks like this:

<property>

      <name>mapred.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec</value>

</property>

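NOTE: 1. In place of DefaultCodec, give the fully qualified class name of the LZO or Snappy codec
          (com.hadoop.compression.lzo.LzoCodec or org.apache.hadoop.io.compress.SnappyCodec).

          2. We can also specify which compression codec is used to compress the map outputs: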

<property>

      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec</value>

</property>  
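
Putting the settings of this section together for LZO might look like the sketch below. Note that map output compression has its own on/off switch, "mapred.compress.map.output", which is also false by default; setting only the codec does not turn it on. The fragment assumes hadoop-lzo is installed and registered on the cluster, and the choice of LzopCodec for the job output and LzoCodec for the map outputs is just one reasonable combination, not the only one.

```xml
<!-- mapred-site.xml (sketch; assumes hadoop-lzo is installed and registered) -->

<!-- compress the final job output with LZO -->
<property>
      <name>mapred.output.compress</name>
      <value>true</value>
</property>
<property>
      <name>mapred.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

<!-- compress the intermediate map outputs; this switch is false by default -->
<property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
</property>
<property>
      <name>mapred.map.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```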

-------------------------------------------------

Snappy Compression :
==============

   Snappy is also a compression/decompression library. It does not aim for maximum compression, or for compatibility with
other compression libraries.

   Snappy aims for very high speed and reasonable compression.
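
Because it favors speed over compression ratio, Snappy is often used for the intermediate map outputs. The fragment below is a sketch of that setup using the Hadoop 2.x property names, which replace the older "mapred.compress.map.output" and "mapred.map.output.compression.codec". It assumes the native Snappy library is available on every node.

```xml
<!-- mapred-site.xml, Hadoop 2.x property names (sketch; assumes native Snappy is available) -->
<property>
      <name>mapreduce.map.output.compress</name>
      <value>true</value>
</property>
<property>
      <name>mapreduce.map.output.compress.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```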



