MAP REDUCE
Distributed data processing model and execution environment that runs on large clusters of commodity machines
Parallel processing of blocks of files stored in HDFS on multiple nodes (computers in cluster)
MapReduce is one of several YARN applications; it runs batch jobs
In production MapReduce itself is almost never used any more; mostly Spark and Flink are used, and MapReduce is effectively being deprecated. This topic is FYI.
Concepts
Automatic parallelization and distribution
Fault-tolerance
Status and monitoring tools
Clean abstraction for programmers
MapReduce programs are usually written in Java, but can also be written in any scripting language using Hadoop Streaming
All of Hadoop is written in Java
Developer can concentrate simply on writing the Map and Reduce functions
Scalable file processing
Map task
Process input file blocks
Produce (key-value) pairs
Executed in parallel
Reduce task
Take (key-value) pairs sorted by keys
Combine/aggregate them to
Produce final result
Can be zero, one or more (executed in parallel)
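As a concrete illustration (word count, a standard example that is not part of these notes), the key-value pairs flow roughly like this:

map input:      (0, "to be or not to be")          key = byte offset, value = one line
map output:     ("to",1) ("be",1) ("or",1) ("not",1) ("to",1) ("be",1)
shuffle/sort:   ("be",[1,1]) ("not",[1]) ("or",[1]) ("to",[1,1])
reduce output:  ("be",2) ("not",1) ("or",1) ("to",2)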
MapReduce architecture
Input data is stored in HDFS, spread across nodes and replicated.
Application submits job (mapper, reducer, input) to job tracker
Job tracker
Splits input data
Schedules and monitors various map and reduce tasks
Task Tracker
Executes map and reduce tasks
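As a sketch of what job submission looks like from the client side, here is a minimal word-count driver written against the newer org.apache.hadoop.mapreduce API (the class names WordCountDriver, WordCountMapper and WordCountReducer are illustrative and not part of these notes; the mapper and reducer are sketched in the sections below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");            // job name is arbitrary
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);                // sketched under Mapper below
    job.setReducerClass(WordCountReducer.class);              // sketched under Reducer below
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output path (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);         // submit the job and wait for completion
  }
}

In the classic architecture described above the submitted job goes to the job tracker; on YARN the ResourceManager and an application master play that role, but the client-side code stays the same.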
Mapper
The purpose of the map phase is to organize the data in preparation for the processing done in the reduce phase.
The input to the map function is in the form of key-value pairs, even though the input to a MapReduce program is one or more files.
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
By default, the value is a data record and the key is generally the offset of the data record from the beginning of the data file.
The output consists of a collection of key-value pairs which are input for the reduce function.
The content of the key-value pairs depends on the specific implementation
Identity mapper. Implements Mapper and passes input key/value pairs directly through as output.
Inverse mapper. Implements Mapper and swaps the key and the value of each pair.
Regex mapper. Implements Mapper and generates a (match, 1) pair for every regular expression match.
Token count mapper. Implements Mapper and generates a (token, 1) pair for every token when the input is tokenized.
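A minimal mapper in the spirit of the token count mapper, written against the newer org.apache.hadoop.mapreduce API (a sketch; WordCountMapper is an illustrative name, not from these notes):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // key = byte offset of the record in the file, value = one line of text
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);        // emit (token, 1) for every token
    }
  }
}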
Reducer
The MapReduce framework sorts the incoming (key, value) pairs by key and groups together all values of the same key; reduce() is then called once per key and generates output pairs by iterating over the values associated with that key.
Each reduce function processes the intermediate values for a particular key generated by the map function and generates the output.
The output collector collects the output of the reduce function and writes it to the output file.
The reporter provides a way to record extra information about the reducer and the progress of the task.
Identity reducer. Implements Reducer and maps inputs directly to the outputs.
Long sum reducer. Implements Reducer and sums all the LongWritable values associated with a given key.
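A matching sum-style reducer, analogous to the long sum reducer above but using IntWritable (again a sketch; WordCountReducer is an illustrative name):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {     // all values grouped under this key by the framework
      sum += v.get();
    }
    total.set(sum);
    context.write(key, total);         // emit (word, total count)
  }
}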
Overall MapReduce process
Hadoop vs Spark