Wednesday, May 13, 2015

Apache Hive : ORDER BY VS SORT BY

ORDER BY VS SORT BY
===================
ORDER BY
--------
1. The ORDER BY clause is familiar from other SQL dialects.
2. It performs a 'total ordering' of the query result set, i.e. all the data is passed through a single reducer, which
   can take a long time on larger datasets.
3. Hive requires a LIMIT clause with ORDER BY (since it can take a long time) when the property
   hive.mapred.mode is set to 'strict'; the default is 'nonstrict'.

NOTE: The property below can be found in /usr/lib/hive/conf/hive-default.xml:

<property>
  <name>hive.mapred.mode</name>
  <value>nonstrict</value>
  <description>The mode in which the hive operations are being performed. In strict mode, some risky queries are not allowed to run</description>
</property>
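As a quick illustration, here is a hypothetical session (using the testudf table shown later in this post) of what strict mode enforces; in strict mode Hive rejects an ORDER BY that has no LIMIT:

```sql
-- Sketch, assuming the testudf table from the examples below.
set hive.mapred.mode=strict;

-- Rejected in strict mode: ORDER BY without LIMIT.
select empid, ename, esal from testudf order by empid;

-- Accepted: LIMIT bounds the work the single reducer must do.
select empid, ename, esal from testudf order by empid limit 5;

-- Restore the default.
set hive.mapred.mode=nonstrict;
```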


SORT BY
-------
1. SORT BY (Hive's alternative to ORDER BY) orders the data only within each reducer, performing a 'local ordering' of the data: each REDUCER'S OUTPUT is SORTED.

2. Total ordering is traded for better performance.
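When more than one reducer runs, SORT BY alone does not control which reducer a row goes to. A common pattern (a sketch, not taken from the session below) is to pair it with DISTRIBUTE BY so that rows with the same key land on the same reducer before being locally sorted:

```sql
-- Sketch: route rows by empid to reducers, then sort within each reducer.
set mapred.reduce.tasks=2;

select empid, ename, esal
from testudf
distribute by empid
sort by empid asc, esal desc;
```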



hive> select * from testudf;
OK
100   Karan             24000
101   Gopal             25000
102   RaviKanth         26000
103   Ravi Prakash      270000
104   Gopi Prakash      28000
105   Ravi Prakash      29000
106   Vishnu Prakash    31000
107   Kasi Prakash      34000
Time taken: 0.419 seconds

hive> desc testudf;
OK
empid int  
ename string     
esal  int  
Time taken: 0.197 seconds

hive> select empid , ename , esal from testudf order by empid asc , esal desc;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201304160610_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201304160610_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201304160610_0001
2013-04-16 06:15:57,723 Stage-1 map = 0%,  reduce = 0%
2013-04-16 06:16:08,764 Stage-1 map = 100%,  reduce = 0%
2013-04-16 06:16:18,977 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201304160610_0001
OK
100   Karan             24000
101   Gopal             25000
102   RaviKanth         26000
103   Ravi Prakash      270000
104   Gopi Prakash      28000
105   Ravi Prakash      29000
106   Vishnu Prakash    31000
107   Kasi Prakash      34000
Time taken: 36.859 seconds

hive> select empid , ename , esal from testudf sort by empid asc , esal desc;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201304160610_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201304160610_0002
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201304160610_0002
2013-04-16 06:18:23,766 Stage-1 map = 0%,  reduce = 0%
2013-04-16 06:18:31,316 Stage-1 map = 100%,  reduce = 0%
2013-04-16 06:18:40,440 Stage-1 map = 100%,  reduce = 33%
2013-04-16 06:18:41,451 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201304160610_0002
OK
100   Karan             24000
101   Gopal             25000
102   RaviKanth         26000
103   Ravi Prakash      270000
104   Gopi Prakash      28000
105   Ravi Prakash      29000
106   Vishnu Prakash    31000
107   Kasi Prakash      34000
Time taken: 28.06 seconds
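Note that the two result sets above are identical: the SORT BY job also happened to run with a single reducer (estimated from the input data size), so its 'local' ordering is, in this case, also a total ordering. To actually observe per-reducer ordering, one could (hypothetically) force multiple reducers before rerunning the query:

```sql
-- Sketch: with two reducers, each reducer's output file is sorted
-- independently, so the combined result is no longer globally ordered.
set mapred.reduce.tasks=2;

select empid, ename, esal from testudf sort by empid asc, esal desc;
```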
hive>

