NiFi

Put simply, NiFi was built to automate the flow of data between systems. While the term 'dataflow' is used in a variety of contexts, we use it here to mean the automated and managed flow of information between systems. This problem space has existed for as long as enterprises have had more than one system, where some systems create data and others consume it. The problems and solution patterns that emerged have been discussed and articulated extensively.

A comprehensive and readily consumed form is found in the book Enterprise Integration Patterns. At a glance, Apache NiFi is:

  • A framework to build scalable directed graphs of data routing, transformation, and system mediation logic

  • Originally developed by the NSA (National Security Agency) and later open sourced

  • An Apache top-level project

  • The basis of Cloudera DataFlow (CDF), formerly Hortonworks DataFlow (HDF)

Apache NiFi Features

  • Web-based user interface

  • Highly configurable:

    • Loss tolerant vs guaranteed delivery

    • Low latency vs high throughput

    • Dynamic prioritization

    • Flow can be modified at runtime

    • Out-of-the-box back pressure mechanism (default thresholds shown in the properties snippet after this list)

  • Data provenance

    • Track dataflow from beginning to end

  • Designed for extension

    • Build your own processors and more

    • Enables rapid development and effective testing

  • Security (SSL, encrypted content, authentication and authorization)
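
As a concrete example of this configurability, recent NiFi releases let you set the default back pressure thresholds for newly created connections in conf/nifi.properties. A minimal sketch, using the shipped defaults:

```properties
# Default back pressure thresholds applied to new connections.
# When a queue exceeds either limit, NiFi stops scheduling the
# upstream component until the queue drains.
nifi.queue.backpressure.count=10000
nifi.queue.backpressure.size=1 GB
```

Per-connection thresholds can still be overridden in the UI; these properties only seed the defaults.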

What is a FlowFile?

  • A FlowFile is the unit of data transported through the system

  • NiFi is data agnostic, which means a FlowFile can carry any type of information in any format

  • Like an HTTP message, a FlowFile consists of two parts (see the sketch after this list):

    • The header (attributes), key/value metadata about the data

    • The body (content), the actual bytes of the payload
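
To make the header/body split concrete, below is a minimal sketch of a custom processor built on the standard nifi-api classes. The processor name, the attribute key, and the content written are illustrative assumptions, not part of NiFi:

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Hypothetical processor that touches both parts of a FlowFile.
public class AnnotateFlowFile extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return; // nothing queued on the incoming connection
        }

        // Header: attributes are key/value metadata that ride alongside the content.
        final String filename = flowFile.getAttribute("filename");
        flowFile = session.putAttribute(flowFile, "example.processed", "true");

        // Body: the content is an opaque byte stream; NiFi never interprets it.
        flowFile = session.write(flowFile, out ->
                out.write(("processed " + filename).getBytes(StandardCharsets.UTF_8)));

        session.transfer(flowFile, REL_SUCCESS);
    }
}
```

Note that FlowFile objects are immutable handles: each attribute or content change returns a new reference, which is why the result of every session call is reassigned.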

Components

NiFi executes within a JVM on a host operating system. The primary components of NiFi on the JVM are as follows:

  • Web Server. The purpose of the web server is to host NiFi’s HTTP-based command and control API.

  • Flow Controller. The flow controller is the brains of the operation. It provides threads for extensions to run on, and manages the schedule of when extensions receive resources to execute.

  • Extensions. There are various types of NiFi extensions which are described in other documents. The key point here is that extensions operate and execute within the JVM.

  • FlowFile Repository. The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is presently active in the flow. The implementation of the repository is pluggable. The default approach is a persistent Write-Ahead Log located on a specified disk partition.

  • Content Repository. The Content Repository is where the actual content bytes of a given FlowFile live. The implementation of the repository is pluggable. The default approach is a fairly simple mechanism, which stores blocks of data in the file system. More than one file system storage location can be specified so as to get different physical partitions engaged to reduce contention on any single volume (see the properties snippet after this list).

  • Provenance Repository. The Provenance Repository is where all provenance event data is stored. The repository construct is pluggable with the default implementation being to use one or more physical disk volumes. Within each location event data is indexed and searchable.
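
These components map directly onto entries in conf/nifi.properties. A minimal sketch of the relevant settings; the second content directory and all paths are illustrative assumptions:

```properties
# Web Server: port for the HTTP(S) command and control API
nifi.web.https.port=8443

# FlowFile Repository: persistent write-ahead log of FlowFile state
nifi.flowfile.repository.directory=./flowfile_repository

# Content Repository: add one property per storage location to spread
# I/O across different physical partitions
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.directory.disk2=/mnt/disk2/content_repository

# Provenance Repository: one or more indexed, searchable event stores
nifi.provenance.repository.directory.default=./provenance_repository
```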

Cluster

Apache NiFi can operate on a single machine as well as in a cluster to increase the reliability, availability, and throughput of a solution.

Starting with the NiFi 1.0 release, a Zero-Master Clustering paradigm is employed. Each node in a NiFi cluster performs the same tasks on the data, but each operates on a different set of data. Apache ZooKeeper elects a single node as the Cluster Coordinator, and failover is handled automatically by ZooKeeper.

All cluster nodes report heartbeat and status information to the Cluster Coordinator. The Cluster Coordinator is responsible for disconnecting and connecting nodes. Additionally, every cluster has one Primary Node, also elected by ZooKeeper.

As a DataFlow manager, you can interact with the NiFi cluster through the user interface (UI) of any node. Any change you make is replicated to all nodes in the cluster, allowing for multiple entry points.
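
A minimal sketch of the node-side clustering entries in conf/nifi.properties; the hostnames, ports, and ZooKeeper connect string are illustrative assumptions:

```properties
# Identify this instance as a cluster node
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node1.example.com
nifi.cluster.node.protocol.port=11443

# ZooKeeper ensemble used to elect the Cluster Coordinator and Primary Node
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```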
