# MapReduce

*Distributed data processing model and execution environment that runs on large clusters of commodity machines*

* Parallel processing of blocks of files stored in HDFS on multiple nodes (computers in the cluster)
* MapReduce is one of several YARN applications that run batch jobs

> MapReduce is almost **never used** in production today. Spark and Flink are mostly used instead, and MapReduce is expected to be deprecated. This topic is FYI.

## Concepts

* Automatic parallelization and distribution
* Fault-tolerance
* Status and monitoring tools
* Clean abstraction for programmers
* MapReduce programs are usually written in Java, but can also be written in any language that reads standard input and writes standard output by using **Hadoop Streaming**
* Hadoop itself is written mostly in Java
* Developers can concentrate on writing the map and reduce functions
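
To make the Hadoop Streaming point concrete, here is a minimal word-count sketch. It assumes the usual streaming convention of one record per line with key and value separated by a tab. The function names are hypothetical, and in a real job the mapper and reducer would be two separate scripts reading `sys.stdin` and printing to standard output.

```python
from itertools import groupby


def streaming_mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def streaming_reducer(sorted_lines):
    """Sum the counts per word. The input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

Locally, the framework's sort step between the two phases can be imitated with `sorted(streaming_mapper(lines))` before passing the result to `streaming_reducer`.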

## Scalable file processing

* Map task
  * Processes blocks of the input files
  * Produces (key-value) pairs
  * Executed in parallel
* Reduce task
  * Takes (key-value) pairs sorted by key
  * Combines/aggregates them to produce the final result
  * There can be zero, one, or more reduce tasks (executed in parallel)
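
The effect of the number of reduce tasks can be sketched in plain Python. This is an in-memory illustration, not Hadoop code: `run_job`, `partition`, `word_map`, and `word_reduce` are hypothetical names, and `zlib.crc32` stands in for whatever hash the framework applies to keys.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reduce task for a key (stable hash modulo reducer count)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(blocks, map_fn, reduce_fn, num_reducers):
    # Map phase: each block is independent, so these calls could run in parallel.
    map_output = [pair for block in blocks for pair in map_fn(block)]

    # Zero reducers: a map-only job, so the map output is the final result.
    if num_reducers == 0:
        return map_output

    # Route each pair to one reduce task; a given key always lands on the same one.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition(key, num_reducers)][key].append(value)

    # Reduce phase: each task sees its own keys in sorted order.
    results = []
    for bucket in buckets:
        for key in sorted(bucket):
            results.append(reduce_fn(key, bucket[key]))
    return results


def word_map(block):
    return [(word, 1) for word in block.split()]


def word_reduce(word, counts):
    return (word, sum(counts))
```

With one reducer the whole result is sorted by key. With several reducers each task's output is sorted, but the concatenation is not globally ordered.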

![What is it in HADOOP](https://3047264112-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LoWAdFshQklZGvmhPe2%2F-LogSwvG74nb20eIjjCB%2F-LogSxtsXHlwVdcsAUDE%2FIMG_9.png?generation=1568409246230445\&alt=media)

## MapReduce architecture

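* Input data is stored in HDFS, spread across nodes and replicated.
* The application submits a job (mapper, reducer, input) to the JobTracker.
* JobTracker
  * Splits the input data
  * Schedules and monitors the map and reduce tasks
* TaskTracker
  * Executes map and reduce tasks

> JobTracker and TaskTracker belong to classic MapReduce (Hadoop 1). Under YARN their duties are split among the ResourceManager, a per-job ApplicationMaster, and NodeManagers.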
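
The division of labour between job tracker and task trackers can be imitated on one machine. In this rough sketch a thread pool stands in for the task trackers, and all function names and the `split_size` and `num_workers` parameters are hypothetical. Monitoring and fault tolerance are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(records, split_size):
    """Job tracker role: cut the input into fixed-size splits."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def run_map_task(split):
    """Task tracker role: run the map function over one split."""
    return [(word, 1) for line in split for word in line.split()]


def run_reduce_task(item):
    """Task tracker role: reduce all values collected for one key."""
    key, values = item
    return key, sum(values)


def submit_job(records, split_size=2, num_workers=3):
    splits = split_input(records, split_size)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Schedule one map task per split on the pool of workers.
        map_outputs = list(pool.map(run_map_task, splits))
        # Collect the intermediate pairs by key before the reduce tasks run.
        grouped = {}
        for output in map_outputs:
            for key, value in output:
                grouped.setdefault(key, []).append(value)
        # Schedule one reduce task per key group.
        reduced = list(pool.map(run_reduce_task, sorted(grouped.items())))
    return dict(reduced)
```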

![Map reduce architecture](https://3047264112-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LoWAdFshQklZGvmhPe2%2F-LogSwvG74nb20eIjjCB%2F-LogSxturnKqGZg5UNGm%2FIMG_10.png?generation=1568409244937743\&alt=media)

## Mapper

The purpose of the map phase is to organize the data in preparation for the processing done in the reduce phase.

> The input to the map function is in the form of key-value pairs, even though the input to a MapReduce program is one or more files.

A mapper maps input key/value pairs to a set of intermediate key/value pairs.

```
map(inKey, inValue) -> list(intermediateKey, intermediateValue)
```

By default, the value is a data record and the key is generally the offset of the data record from the beginning of the data file.
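
As a small illustration of that default, the sketch below turns raw file bytes into (offset, record) pairs, treating each line as one record. `text_records` is a hypothetical helper, not a Hadoop API.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)
```

For the input `b"hello world\nfoo\nbar"` this yields the keys 0, 12, and 16.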

The output consists of a collection of key-value pairs which are input for the reduce function.

> The content of the key-value pairs depends on the specific implementation

* **Identity mapper**. Implements the mapper interface and passes inputs directly to outputs.
* **Inverse mapper**. Implements the mapper interface and swaps the key and value of each pair.
* **Regex mapper**. Implements the mapper interface and generates a (match, 1) pair for every regular expression match.
* **Token count mapper**. Implements the mapper interface, tokenizes the input, and generates a (token, 1) pair for each token.
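
The behaviour of these four mappers can be sketched as Python generators. These are illustrations of the idea rather than the Hadoop classes themselves; the function names and the `pattern` parameter are assumptions made for the example.

```python
import re


def identity_mapper(key, value):
    # Pass the input pair through unchanged.
    yield key, value


def inverse_mapper(key, value):
    # Swap key and value.
    yield value, key


def regex_mapper(key, value, pattern=r"\d+"):
    # Emit (match, 1) for every match of `pattern` in the value.
    for match in re.finditer(pattern, value):
        yield match.group(0), 1


def token_count_mapper(key, value):
    # Split the value on whitespace and emit (token, 1) per token.
    for token in value.split():
        yield token, 1
```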

## Reducer

The framework sorts the incoming \[key, value] pairs by key and groups together all values that share a key. The reduce function is then called once per key and generates output pairs by iterating over the values associated with that key.

```
reduce(intermediateKey, list(intermediateValue)) -> list(outKey, outValue)
```

Each reduce function processes the intermediate values for a particular key generated by the map function and generates the output.
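
The sort, group, and per-key reduce call can be sketched in memory as follows. The names `shuffle`, `reduce_max`, and `run_reduce_phase` are hypothetical, and `reduce_max` is just one possible reduce function chosen for the example.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(intermediate_pairs):
    """Sort pairs by key and group the values of each key together."""
    ordered = sorted(intermediate_pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_max(key, values):
    """Example reduce function: keep the largest value seen for each key."""
    yield key, max(values)


def run_reduce_phase(intermediate_pairs, reduce_fn):
    output = []
    for key, values in shuffle(intermediate_pairs):
        # One reduce call per distinct key, with all of that key's values.
        output.extend(reduce_fn(key, values))
    return output
```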

The output collector gathers the output of a reducer and writes it to the output file.

The reporter provides an option to record extra information about the reducer and the task processes.

* **Identity reducer**. Implements a reducer\[key, value] and passes inputs directly to outputs.
* **Long sum reducer**. Implements a reducer\[key, LongWritable] that outputs the sum of all values for a given key.
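
Both reducers are simple enough to sketch in a few lines of Python. The sketch assumes the long sum reducer adds up the integer values of a key, and the function names are hypothetical rather than Hadoop identifiers.

```python
def identity_reducer(key, values):
    # Emit every incoming value unchanged under its key.
    for value in values:
        yield key, value


def long_sum_reducer(key, values):
    # Emit a single pair holding the integer sum of the key's values.
    yield key, sum(int(value) for value in values)
```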

## Overall MapReduce process

![mapreduce process](https://3047264112-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LoWAdFshQklZGvmhPe2%2F-LogSwvG74nb20eIjjCB%2F-LogSxtwMCY2zUFMkkoO%2FIMG_11.png?generation=1568409117572492\&alt=media)

## Hadoop VS Spark

![Hadoop VS Spark](https://3047264112-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LoWAdFshQklZGvmhPe2%2F-LogSwvG74nb20eIjjCB%2F-LogSxty4_4JTSo6jzT7%2FIMG_12.png?generation=1568409117489871\&alt=media)
