Celebrate the Big Data Problems – #2

How to identify the number of buckets for a Hive table while executing the HiveQL DDLs?

The dataottam team has come up with a blog-sharing initiative called “Celebrate the Big Data Problems”. In this series of blogs, we will share our big data problems using the CPS (Context, Problem, Solutions) framework.


Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table that uses the date as the top-level partition and the employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed into a user-defined number of buckets. Records with the same employee_id will always be stored in the same bucket. The challenge, however, is to identify the number of buckets for a given Hive table in the big data system. While creating the table, you can specify CLUSTERED BY (employee_id) INTO XX BUCKETS; where XX is the number of buckets.

Bucketing has several advantages. The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id, Hive can create a logically correct sample. Bucketing also aids in doing efficient map-side joins.
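For example, a bucketed employee table might be declared as follows (a hypothetical table definition; the column names, the partition column, and the bucket count of 32 are illustrative, not prescribed by this post):

```sql
-- Hypothetical employee table: partitioned by date at the top level,
-- bucketed by employee_id so that records with the same employee_id
-- always hash into the same bucket.
CREATE TABLE employee (
  employee_id INT,
  name        STRING,
  salary      DOUBLE
)
PARTITIONED BY (join_date STRING)
CLUSTERED BY (employee_id) INTO 32 BUCKETS
STORED AS RCFILE;
```

On older Hive versions you may also need to run SET hive.enforce.bucketing = true; before inserting, so that data is actually written into the declared buckets.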


How to identify the number of buckets for a Hive table while executing the HiveQL DDLs?


To identify the number of buckets, we need to work through a small exercise with the following steps:

Get the daily / run-wise record counts from the business, vertical, or domain teams.
Convert them into an average incremental percentage, taking at least five days’ (or a week’s) data.
Multiply the incremental percentage by 1024 to get the incremental size in megabytes.
Divide it by 192 or 128 for RCFile and HiveIO respectively.
Formulae:
Incremental Ratio = Total Records / Incremental Records (per run)
Incremental Records % of Total = (Incremental Ratio / Total Records) * 100
Incremental Size in MB = Incremental Records % of Total * 1024
# of Buckets = Incremental Size in MB / 192 for RCFile
# of Buckets = Incremental Size in MB / 128 for HiveIO
The reason for converting into MB is that Hadoop stores files in MB-sized blocks.
Example:
Suppose we initially have 100 records, the average increment is 5 records per run/day, and we are interested in using RCFile.
Incremental Ratio = 100 / 5 = 20
Incremental Records % of Total = (20 / 100) * 100 = 20%
Incremental Size in MB = 20 * 1024 = 20480
# of Buckets = 20480 / 192 ≈ 107 for RCFile
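The arithmetic above can be sketched in a few lines of Python (a minimal sketch of the formulae in this post; the function and variable names, and rounding up to a whole bucket count, are our own assumptions):

```python
import math

def estimate_buckets(total_records, incremental_records, block_size_mb):
    """Estimate the number of Hive buckets using the formulae above.

    block_size_mb: 192 for RCFile, 128 for HiveIO.
    """
    # Ratio of the total records to the per-run increment
    incremental_ratio = total_records / incremental_records
    # Express that ratio as a percentage of the total
    incremental_pct = (incremental_ratio / total_records) * 100
    # Convert the percentage into an incremental size in megabytes
    incremental_size_mb = incremental_pct * 1024
    # Divide by the block size and round up to a whole number of buckets
    return math.ceil(incremental_size_mb / block_size_mb)

print(estimate_buckets(100, 5, 192))  # RCFile  -> 107
print(estimate_buckets(100, 5, 128))  # HiveIO  -> 160
```

With the example numbers (100 records, 5 per run), RCFile gives 20480 / 192 ≈ 107 buckets and HiveIO gives 20480 / 128 = 160 buckets.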

