MAPREDUCE

Distributed data processing model and execution environment that runs on large clusters of commodity machines

  • Parallel processing of blocks of files stored in HDFS on multiple nodes (computers in the cluster)

  • MapReduce is one of several YARN applications; it runs batch jobs

In production MapReduce is rarely used directly anymore; Spark and Flink have largely replaced it, and it is heading toward deprecation. This topic is FYI.

Concepts

  • Automatic parallelization and distribution

  • Fault-tolerance

  • Status and monitoring tools

  • Clean abstraction for programmers

  • MapReduce programs are usually written in Java, but can also be written in any scripting language using Hadoop Streaming

  • All of Hadoop is written in Java

  • Developer can concentrate simply on writing the Map and Reduce functions

Scalable file processing

  • Map task

    • Process blocks of input files

    • Produce (key-value) pairs

    • Executed in parallel

  • Reduce task

    • Take (key-value) pairs sorted by keys

    • Combine/aggregate them to produce the final result

    • There can be zero, one, or more reduce tasks, executed in parallel (see the sketch below)
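
A minimal, Hadoop-free sketch of this flow, using plain Java collections to stand in for the framework's shuffle/sort and a made-up word-count input:

import java.util.*;

// Toy simulation of the map -> sort/group -> reduce flow for word counting.
// No Hadoop involved; the input lines are made up for illustration.
public class MapReduceFlowSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to see or not to see");

        // Map phase: each line produces (word, 1) pairs.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle/sort phase: group the values by key (a sorted map keeps keys ordered).
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: aggregate the value list of each key into the final result.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = entry.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}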

Map reduce architecture

  • Input data is stored in HDFS, spread across nodes and replicated.

  • The application submits a job (mapper, reducer, input) to the JobTracker (a driver sketch follows this list)

  • JobTracker

    • Splits input data

    • Schedules and monitors various map and reduce tasks

  • TaskTracker

    • Executes map and reduce tasks
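
As a sketch of the submission step, assuming the newer org.apache.hadoop.mapreduce API and Hadoop's bundled TokenCounterMapper/IntSumReducer library classes (the job name and the input/output paths passed in args are illustrative), a driver could look like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Driver: packages the mapper, reducer and input/output paths into a Job
// and submits it to the cluster (JobTracker in MRv1, ResourceManager under YARN).
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Bundled library classes: tokenize lines into (word, 1) and sum the counts.
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] = HDFS input directory, args[1] = HDFS output directory (must not exist yet).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit and wait; the framework splits the input and schedules the map/reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}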

Mapper

The purpose of the map phase is to organize the data in preparation for the processing done in the reduce phase.

The input to the map function is in the form of key-value pairs, even though the input to a MapReduce program is a file or files.

Mapper maps input key/value pairs to a set of intermediate key/value pairs.

map(inKey, inValue) -> list(intermediateKey, intermediateValue)

By default, the value is a data record and the key is generally the offset of the data record from the beginning of the data file.

The output consists of a collection of key-value pairs which are input for the reduce function.

The content of the key-value pairs depends on the specific mapper implementation; Hadoop ships several ready-made mappers:

  • IdentityMapper: implements Mapper and maps inputs directly to outputs

  • InverseMapper: implements Mapper and swaps the key and the value

  • RegexMapper: implements Mapper and generates a (match, 1) pair for every regular expression match

  • TokenCountMapper: implements Mapper and generates a (token, 1) pair for every token in the tokenized input (a word-count mapper sketch follows below)
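
A word-count style mapper, sketched against the org.apache.hadoop.mapreduce API (the class name WordCountMapper is illustrative): the input key is the byte offset of the line in the file, the input value is the line itself, and for every token it emits the intermediate pair (token, 1).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// map(offset, line) -> list of (word, 1) intermediate pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit one intermediate (key, value) pair per token
        }
    }
}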

Reducer

The framework sorts the incoming (key, value) pairs by key and groups together all values of the same key; reduce() is then called once per key and generates output pairs by iterating over the values associated with that key.

reduce(intermediateKey, list(intermediateValue)) -> list(outKey, outValue)

Each reduce function processes the intermediate values for a particular key generated by the map function and generates the output.

The output collector collects the output of the reducer and writes it to the output file.

The reporter provides a way to record extra information about the reducer and the task's progress.

  • IdentityReducer: implements Reducer and maps inputs directly to outputs

  • LongSumReducer: implements Reducer and sums all LongWritable values associated with a given key (a sum reducer sketch follows below)
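
A sum reducer in the spirit of LongSumReducer, sketched against the org.apache.hadoop.mapreduce API (the class name SumReducer is illustrative and it assumes a mapper that emits LongWritable counts):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// reduce(key, list of values) -> (key, sum) final output pair
public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable total = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();   // aggregate all values grouped under this key
        }
        total.set(sum);
        context.write(key, total);   // one output pair per distinct key
    }
}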

Overall MapReduce process

Hadoop vs. Spark
