StreamSets

Overview

  • StreamSets Data Collector (SDC) is a lightweight, powerful design and execution engine that streams data in real time.

  • SDC was released to the open source community in 2015.

  • SDC is supported by StreamSets (founded in 2014), a winner of the 2018 Cloudera Partner Impact Awards.

StreamSets Data Collector Features

  • Web-based user interface

  • Highly configurable:

    • “At least once” and “At most once” delivery guarantees are supported.

    • Deep integration with the Hadoop ecosystem, including connectors for HDFS, HBase, Kafka and Solr.

    • Flexible deployment of pipelines to edge servers or to clusters.

    • Can be deployed as a Spark Streaming application or as a MapReduce job.

    • Embedded monitoring that provides runtime visibility into data flow performance.

  • A key concept in SDC is the pipeline.
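
The two delivery guarantees listed among the features differ in when progress is recorded relative to when data is written. The sketch below illustrates that general idea only; it is not SDC code and makes no claim about how SDC implements it. The names `process`, `run_with_one_crash`, `SimulatedCrash` and the in-memory `state` offset are all hypothetical.

```python
class SimulatedCrash(Exception):
    """Stands in for a failure between the write step and the commit step."""


def process(batches, state, sink, guarantee, crash_at=None):
    """Resume from state["offset"] and process the remaining batches.

    crash_at is the index of the batch at which a crash is simulated
    between the two steps (write and commit).
    """
    while state["offset"] < len(batches):
        i = state["offset"]
        if guarantee == "at_least_once":
            sink.extend(batches[i])      # write first ...
            if i == crash_at:
                raise SimulatedCrash()
            state["offset"] = i + 1      # ... commit afterwards
        elif guarantee == "at_most_once":
            state["offset"] = i + 1      # commit first ...
            if i == crash_at:
                raise SimulatedCrash()
            sink.extend(batches[i])      # ... write afterwards
        else:
            raise ValueError(f"unknown guarantee: {guarantee}")


def run_with_one_crash(guarantee):
    """Run three batches, crash once on the second, then restart."""
    batches = [["a", "b"], ["c", "d"], ["e"]]
    state = {"offset": 0}
    sink = []
    try:
        process(batches, state, sink, guarantee, crash_at=1)
    except SimulatedCrash:
        pass
    process(batches, state, sink, guarantee)  # restart from the saved offset
    return sink


at_least_once = run_with_one_crash("at_least_once")
at_most_once = run_with_one_crash("at_most_once")
```

In this toy run, "at least once" replays the batch that was written but not committed, so its records can appear twice, while "at most once" skips the batch that was committed but not written, so its records can be lost.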

What is a pipeline?

  • A pipeline describes the flow of data from the origin system to destination systems and defines how to transform the data along the way.

  • A pipeline consists of a single origin stage that represents the origin system, any number of processor stages that transform the data, and one or more destination stages that represent destination systems.
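
The structure described above, a single origin feeding processor stages and then destination stages, can be sketched as a small data model. This is a simplified illustration, not the SDC API: the `Pipeline` class, the stage functions, and the dict-based record type are hypothetical, and the sketch runs processors in a straight line one record at a time.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

Record = dict  # hypothetical record type: field name -> value


@dataclass
class Pipeline:
    """Toy model of a pipeline: one origin, processors, destinations."""
    origin: Callable[[], Iterable[Record]]
    processors: List[Callable[[Record], Record]] = field(default_factory=list)
    destinations: List[Callable[[Record], None]] = field(default_factory=list)

    def run(self) -> int:
        # Assumption of this sketch: records must end up somewhere.
        if not self.destinations:
            raise ValueError("a pipeline needs at least one destination")
        count = 0
        for record in self.origin():
            for process in self.processors:    # transform along the way
                record = process(record)
            for write in self.destinations:    # deliver to every destination
                write(record)
            count += 1
        return count


def sample_origin():
    """Hypothetical origin system that yields two records."""
    yield {"name": "alice", "bytes": 120}
    yield {"name": "bob", "bytes": 80}


def uppercase_name(record):
    return {**record, "name": record["name"].upper()}


def add_kilobytes(record):
    return {**record, "kb": record["bytes"] / 1024}


archive = []    # first hypothetical destination
audit_log = []  # second hypothetical destination

pipeline = Pipeline(
    origin=sample_origin,
    processors=[uppercase_name, add_kilobytes],
    destinations=[archive.append, audit_log.append],
)
processed = pipeline.run()
```

Each record flows from the origin through both processors in order and is then written to both destinations, so `archive` and `audit_log` end up with the same transformed records.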
