Flink Official Documentation Notes 01: Flink from an Architectural Perspective
What is Flink, from an architectural perspective?
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
Flink has been designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.
Here, we explain important aspects of Flink’s architecture.
Processing unbounded and bounded data streams
Any kind of data is produced as a stream of events.
Credit card transactions, sensor measurements, machine logs, and user interactions on a website or mobile application are all generated as streams.
Data can be processed as unbounded or bounded streams.
What is an unbounded stream?
Unbounded streams have a start but no defined end.
They do not terminate, and they provide data as it is generated.
Unbounded streams must be continuously processed, i.e., events must be promptly handled after they have been ingested.
It is not possible to wait for all input data to arrive because the input is unbounded and will not be complete at any point in time.
Processing unbounded data often requires that events are ingested in a specific order, such as the order in which events occurred, to be able to reason about result completeness.
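To make the "continuous processing" point concrete, here is a minimal plain-Java sketch (it does not use the Flink API). The names `UnboundedSketch`, `RunningSum`, and `processFirst` are hypothetical and exist only for this illustration. The source never ends, so the operator updates its result after every event instead of waiting for the input to be complete.

```java
import java.util.Iterator;
import java.util.stream.Stream;

final class UnboundedSketch {
    /** Operator state that is updated after every single event. */
    static final class RunningSum {
        private long sum;

        /** Handles one event and returns the result as of this event. */
        long onEvent(long value) {
            sum += value;
            return sum;
        }
    }

    /**
     * Consumes at most `limit` events from a source that may never end
     * and returns the most recent result. The limit exists only so that
     * this demo terminates; a real streaming job would keep running.
     */
    static long processFirst(Iterator<Long> source, int limit) {
        RunningSum operator = new RunningSum();
        long latest = 0;
        for (int i = 0; i < limit && source.hasNext(); i++) {
            latest = operator.onEvent(source.next());
        }
        return latest;
    }

    public static void main(String[] args) {
        // An endless source: 1, 2, 3, ...
        Iterator<Long> endless = Stream.iterate(1L, n -> n + 1).iterator();
        System.out.println(processFirst(endless, 5)); // prints 15
    }
}
```

Each call to `onEvent` yields a usable intermediate result, which is the property that lets unbounded input be handled without ever seeing its end.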
What is a bounded stream?
Bounded streams have a defined start and end.
Bounded streams can be processed by ingesting all data before performing any computations.
Ordered ingestion is not required to process bounded streams because a bounded data set can always be sorted.
Processing of bounded streams is also known as batch processing.
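The claim that ordered ingestion is unnecessary for bounded input can be sketched in plain Java (again not the Flink API). `BoundedSketch`, `Event`, and `payloadsInEventOrder` are hypothetical names for this example. Because the whole data set is available before any computation starts, it can simply be sorted by timestamp first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class BoundedSketch {
    /** Hypothetical event type: an event-time timestamp plus a payload. */
    static final class Event {
        final long timestamp;
        final String payload;

        Event(long timestamp, String payload) {
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    /**
     * Batch-style processing: take the complete input, sort it by
     * timestamp, and only then produce the output.
     */
    static List<String> payloadsInEventOrder(List<Event> allEvents) {
        List<Event> sorted = new ArrayList<>(allEvents);
        sorted.sort(Comparator.comparingLong((Event e) -> e.timestamp));
        List<String> out = new ArrayList<>();
        for (Event e : sorted) {
            out.add(e.payload);
        }
        return out;
    }

    public static void main(String[] args) {
        // The events were ingested out of order.
        List<Event> ingested = Arrays.asList(
                new Event(3, "c"), new Event(1, "a"), new Event(2, "b"));
        System.out.println(payloadsInEventOrder(ingested)); // prints [a, b, c]
    }
}
```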
Apache Flink excels at processing unbounded and bounded data sets.
Precise control of time and state enables Flink’s runtime to run any kind of application on unbounded streams.
Bounded streams are internally processed by algorithms and data structures that are specifically designed for fixed-size data sets, yielding excellent performance.
Convince yourself by exploring the use cases that have been built on top of Flink.
You can deploy your application anywhere!
Apache Flink is a distributed system and requires compute resources in order to execute applications.
Flink integrates with all common cluster resource managers such as Hadoop YARN, Apache Mesos, and Kubernetes, but can also be set up to run as a stand-alone cluster.
Flink is designed to work well with each of the previously listed resource managers.
This is achieved by resource-manager-specific deployment modes that allow Flink to interact with each resource manager in its idiomatic way.
When deploying a Flink application, Flink automatically identifies the required resources based on the application’s configured parallelism and requests them from the resource manager.
In case of a failure, Flink replaces the failed container by requesting new resources.
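As a rough illustration of how parallelism translates into a resource request, the following plain-Java sketch uses a deliberately simplified model. It assumes that every parallel subtask occupies one slot and that every container offers a fixed number of slots; memory, CPU, and Flink's actual scheduling logic are ignored. `ResourceSketch`, `containersNeeded`, and `replacementsToRequest` are hypothetical names, not Flink APIs.

```java
final class ResourceSketch {
    /**
     * Number of containers needed so that the total slot count covers
     * the configured parallelism (ceiling division).
     */
    static int containersNeeded(int parallelism, int slotsPerContainer) {
        if (parallelism <= 0 || slotsPerContainer <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        return (parallelism + slotsPerContainer - 1) / slotsPerContainer;
    }

    /** Number of new containers to request after some of them have failed. */
    static int replacementsToRequest(int needed, int alive) {
        return Math.max(0, needed - alive);
    }

    public static void main(String[] args) {
        int needed = containersNeeded(10, 4);
        System.out.println(needed);                           // prints 3
        System.out.println(replacementsToRequest(needed, 2)); // prints 1
    }
}
```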
All communication to submit or control an application happens via REST calls.
This eases the integration of Flink in many environments.
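Because control goes through REST, any HTTP client can talk to a cluster. The sketch below uses the JDK's built-in `java.net.http` client (Java 11+) to query the job overview endpoint. It assumes a JobManager reachable at `http://localhost:8081`, which is the default REST port but may differ in your setup. `RestSketch` and `jobsOverviewRequest` are hypothetical names for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class RestSketch {
    /** Builds a GET request for the job overview of the cluster at baseUrl. */
    static HttpRequest jobsOverviewRequest(String baseUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/jobs/overview"))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: a JobManager is listening on localhost:8081.
        HttpRequest request = jobsOverviewRequest("http://localhost:8081");
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        } catch (IOException e) {
            // No cluster reachable at the assumed address.
            System.out.println("Could not reach the cluster: " + e);
        }
    }
}
```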
You can run applications at any scale!
Flink is designed to run stateful streaming applications at any scale.
Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster.
Therefore, an application can leverage virtually unlimited amounts of CPUs, main memory, disk and network IO. Moreover, Flink easily maintains very large application state.
Its asynchronous and incremental checkpointing algorithm ensures minimal impact on processing latencies while guaranteeing exactly-once state consistency.
Users have reported impressive scalability numbers for Flink applications running in their production environments, such as:
- applications processing multiple trillions of events per day,
- applications maintaining multiple terabytes of state, and
- applications running on thousands of cores.
Making full use of in-memory performance
Stateful Flink applications are optimized for local state access.
Task state is always maintained in memory or, if the state size exceeds the available memory, in access-efficient on-disk data structures.
Hence, tasks perform all computations by accessing local, often in-memory, state, yielding very low processing latencies.
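The idea of local state can be sketched in plain Java as follows. Each parallel task owns a private in-memory map, and events are routed by key so that every event for a given key reaches the same task and never needs remote state. This is a simplified model: the hash-modulo routing here is an assumption for illustration and is not how Flink actually assigns keys. `LocalStateSketch`, `CountingTask`, `taskIndexFor`, and `process` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LocalStateSketch {
    /** One parallel task with its own private, in-memory keyed state. */
    static final class CountingTask {
        private final Map<String, Long> countsByKey = new HashMap<>();

        /** Updates the local count for the key and returns the new value. */
        long onEvent(String key) {
            return countsByKey.merge(key, 1L, Long::sum);
        }
    }

    private final List<CountingTask> tasks = new ArrayList<>();

    LocalStateSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            tasks.add(new CountingTask());
        }
    }

    /** Simplified routing: the same key always maps to the same task. */
    int taskIndexFor(String key) {
        return Math.floorMod(key.hashCode(), tasks.size());
    }

    /** Sends the event to the task that owns the key's state. */
    long process(String key) {
        return tasks.get(taskIndexFor(key)).onEvent(key);
    }

    public static void main(String[] args) {
        LocalStateSketch job = new LocalStateSketch(4);
        System.out.println(job.process("a")); // prints 1
        System.out.println(job.process("b")); // prints 1
        System.out.println(job.process("a")); // prints 2
    }
}
```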
Flink guarantees exactly-once state consistency in case of failures by periodically and asynchronously checkpointing the local state to durable storage.
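The recovery idea behind checkpointing can be simulated in a few lines of plain Java. This sketch makes two simplifying assumptions: the source can be replayed from a stored offset, and snapshots are taken synchronously (Flink takes them asynchronously). A snapshot stores the input position together with the state at that position, so after a failure the task rewinds to the last snapshot and every event affects the final state exactly once. `CheckpointSketch`, `Snapshot`, and `runWithOneFailure` are hypothetical names.

```java
import java.util.List;

final class CheckpointSketch {
    /** Immutable snapshot: input position plus the state at that position. */
    static final class Snapshot {
        final int offset;
        final long sum;

        Snapshot(int offset, long sum) {
            this.offset = offset;
            this.sum = sum;
        }
    }

    private int offset; // next input position to read
    private long sum;   // local operator state

    void processNext(List<Long> input) {
        sum += input.get(offset);
        offset++;
    }

    Snapshot checkpoint() {
        return new Snapshot(offset, sum);
    }

    void restore(Snapshot snapshot) {
        offset = snapshot.offset;
        sum = snapshot.sum;
    }

    /**
     * Processes the whole input, checkpointing every `interval` events.
     * The task crashes once when it reaches position `failAt`, loses its
     * local state, and recovers from the latest durable snapshot.
     */
    static long runWithOneFailure(List<Long> input, int interval, int failAt) {
        CheckpointSketch task = new CheckpointSketch();
        Snapshot durable = task.checkpoint(); // stands in for durable storage
        boolean failed = false;
        while (task.offset < input.size()) {
            if (!failed && task.offset == failAt) {
                failed = true;
                task = new CheckpointSketch(); // local state is lost
                task.restore(durable);         // rewind to the snapshot
                continue;
            }
            task.processNext(input);
            if (task.offset % interval == 0) {
                durable = task.checkpoint();
            }
        }
        return task.sum;
    }

    public static void main(String[] args) {
        List<Long> input = List.of(1L, 2L, 3L, 4L, 5L);
        // Same result as a failure-free run: 1 + 2 + 3 + 4 + 5.
        System.out.println(runWithOneFailure(input, 2, 3)); // prints 15
    }
}
```

The event at position 2 is processed twice in this run, but its first effect is discarded together with the lost state, which is why the final sum still matches a failure-free run.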