4. HIVE

4.1

  • Driver - Manages the lifecycle of a HiveQL statement. It also maintains a session handle and session statistics.

  • Compiler - Performs query parsing, type checking, and semantic analysis. The driver invokes the compiler when it receives a HiveQL statement.

  • Metastore - Stores the system catalog and metadata about tables, columns, partitions, etc. It is backed by a traditional RDBMS.

  • Execution Engine - Hive supports pluggable execution engines and can currently run queries via MapReduce, Tez, and Spark.

  • Optimizer - Produces an optimized logical plan in the form of a DAG of jobs.
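
The hand-off between these components can be pictured with a toy model. The sketch below is not Hive code: every name in it (ToyMetastore, compile_statement, optimize, execute, run_statement) is invented for illustration, the only supported statement is `SELECT <column> FROM <table>`, and the optimizer is a placeholder. It only shows the order in which a driver might call the other components.

```python
class ToyMetastore:
    """Stand-in for the metastore: maps table names to column lists."""

    def __init__(self, tables):
        self.tables = tables

    def columns(self, table):
        if table not in self.tables:
            raise ValueError(f"unknown table: {table}")
        return self.tables[table]


def compile_statement(statement, metastore):
    """Compiler: parse the statement and check it against the catalog."""
    tokens = statement.split()
    if len(tokens) != 4 or tokens[0].upper() != "SELECT" or tokens[2].upper() != "FROM":
        raise ValueError("only 'SELECT <column> FROM <table>' is supported")
    column, table = tokens[1], tokens[3]
    if column not in metastore.columns(table):
        raise ValueError(f"unknown column: {column}")
    # Logical plan: an ordered list of (operator, argument) steps.
    return [("scan", table), ("project", column)]


def optimize(plan):
    """Optimizer: placeholder that returns the plan unchanged."""
    return list(plan)


def execute(plan, data):
    """Execution engine: run each step of the plan over in-memory rows."""
    rows = []
    for op, arg in plan:
        if op == "scan":
            rows = data[arg]
        elif op == "project":
            rows = [row[arg] for row in rows]
    return rows


def run_statement(statement, metastore, data):
    """Driver: owns the statement lifecycle from compilation to results."""
    plan = compile_statement(statement, metastore)
    plan = optimize(plan)
    return execute(plan, data)


metastore = ToyMetastore({"logs": ["ts", "line"]})
data = {"logs": [{"ts": 1, "line": "a"}, {"ts": 2, "line": "b"}]}
print(run_statement("SELECT line FROM logs", metastore, data))
```

The compiler is the only stage that consults the metastore here, which mirrors the type checking and semantic analysis role described above.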

4.2

  • Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the value of a partition column, such as a date. Partitions make queries on slices of the data faster, because only the relevant partitions need to be read.

  • Tables or partitions may be subdivided further into buckets to give extra structure to the data that may be used for more efficient queries.

    For example, bucketing by user ID means we can quickly evaluate a user-based query by running it on a randomized sample of the total set of users.
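
The two levels of organization can be sketched in a few lines of Python. This is a simplified model, not Hive's storage code: the column names (dt, user_id, line), the bucket count, and the helper names are assumptions made for illustration, and the bucket is taken as the integer user ID modulo the bucket count, which stands in for whatever hash function a given Hive version applies.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count, chosen for illustration


def partition_key(row):
    """Partition by the value of the date column."""
    return f"dt={row['dt']}"


def bucket_id(user_id, num_buckets=NUM_BUCKETS):
    """Assign a bucket from the user ID (the integer ID acts as its own hash)."""
    return user_id % num_buckets


def layout(rows):
    """Group rows by (partition, bucket), mimicking a partitioned, bucketed table."""
    table = defaultdict(list)
    for row in rows:
        table[(partition_key(row), bucket_id(row["user_id"]))].append(row)
    return table


rows = [
    {"dt": "2024-01-01", "user_id": 1, "line": "a"},
    {"dt": "2024-01-01", "user_id": 5, "line": "b"},
    {"dt": "2024-01-01", "user_id": 2, "line": "c"},
    {"dt": "2024-01-02", "user_id": 1, "line": "d"},
]
table = layout(rows)

# Partition pruning: a query for one date touches only that partition's groups.
one_day = [r for (part, _), rs in table.items() if part == "dt=2024-01-01" for r in rs]

# Bucket sampling: reading a single bucket yields a sample of the users.
sample = [r for (_, bucket), rs in table.items() if bucket == 1 for r in rs]

print(len(one_day), sorted(r["line"] for r in sample))
```

Because all rows for a given user land in the same bucket, a sample built from one bucket contains complete histories for the users it includes.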

4.3

  • UDF - 1 -> 1 (e.g., math operations). The class must implement an evaluate method. To register it, add the JAR and run CREATE TEMPORARY FUNCTION <name> AS '<class>'.

  • UDAF - many -> 1 (aggregation functions). Requires a dedicated class, and registration is more complex.

  • UDTF - 1 -> many (table-generating functions); explode is an example.

  • ObjectInspector - Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
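
The three cardinalities can be illustrated with plain Python. Real Hive functions are Java classes packaged in a JAR, so nothing below can be registered in Hive; the names udf_sqrt, MaxUdaf, and udtf_explode are hypothetical. The UDAF method names loosely follow the iterate / terminatePartial / merge / terminate lifecycle of Hive's UDAFEvaluator interface, written here in Python style.

```python
import math


def udf_sqrt(value):
    """UDF shape: one input value -> one output value."""
    return math.sqrt(value)


class MaxUdaf:
    """UDAF shape: many input rows -> one value, built from partial aggregates."""

    def __init__(self):
        self.current = None

    def iterate(self, value):
        # Called once per input row.
        if value is not None and (self.current is None or value > self.current):
            self.current = value

    def terminate_partial(self):
        # Partial result handed from one task to another.
        return self.current

    def merge(self, partial):
        # Combine another task's partial result into this one.
        self.iterate(partial)

    def terminate(self):
        # Final result for the group.
        return self.current


def udtf_explode(values):
    """UDTF shape: one input row holding an array -> one output row per element."""
    for v in values:
        yield (v,)


print(udf_sqrt(9.0))

left, right = MaxUdaf(), MaxUdaf()
for v in [3, 7]:
    left.iterate(v)
right.iterate(5)
left.merge(right.terminate_partial())
print(left.terminate())

print(list(udtf_explode([1, 2, 3])))
```

The partial/merge split in the aggregate is what lets the work be spread across tasks and combined afterwards, which is one reason a UDAF needs more than a single evaluate method.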

4.4
