
Flink Notes

Interview Questions

1. How is your Flink cluster set up?

  • Task Managers: We run 50 TMs, one for each of the 50 consumer nodes of that Kafka topic, so every node's messages are processed immediately (a source-side sketch follows this list).
  • Job Managers: 3. The Job Manager coordinates tasks and manages the cluster. For high availability, Flink supports running multiple JMs: if the active JM fails, a standby JM takes over so the job keeps running without interruption.
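
Below is a minimal, hypothetical sketch of the source side under these assumptions (the topic name "app-logs", broker address, and consumer group are placeholders, not our real values; the topic is assumed to have 50 partitions, one per consumer node). The source parallelism is set to 50 so each partition gets its own subtask, and the resulting stream is the one consumed by the transformations under question 2. Job Manager high availability itself is typically configured in flink-conf.yaml (e.g., ZooKeeper-based HA), not in the job code.

// Hypothetical Kafka source setup: one source subtask per partition / consumer node.
Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka:9092");  // placeholder broker address
props.setProperty("group.id", "flink-log-consumer");   // placeholder consumer group

FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("app-logs", new SimpleStringSchema(), props); // placeholder topic

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> stream = env
        .addSource(consumer)
        .setParallelism(50); // match the 50 partitions / Task Managers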

2. What data transformation operations did you use?

  • map parses the raw string logs into LogEvent Java objects for easier handling.
  • filter keeps only the entries we need (ERROR-level logs in the snippet below).
  • timeWindow aggregates the filtered entries over a fixed time window to produce a summary.
// Parse each raw log line into a Java object for easier handling
// ("stream" is the DataStream<String> read from Kafka).
DataStream<LogEvent> parsedStream = stream
        .map(new MapFunction<String, LogEvent>() {
            @Override
            public LogEvent map(String value) throws Exception {
                return LogParser.parse(value);
            }
        });

// Keep only the log entries we care about (ERROR level).
DataStream<LogEvent> filteredStream = parsedStream
        .filter(new FilterFunction<LogEvent>() {
            @Override
            public boolean filter(LogEvent logEvent) throws Exception {
                return logEvent.getLogLevel().equals("ERROR");
            }
        });

// Aggregate the entries per log type over a 5-minute time window.
DataStream<LogSummary> windowedSummary = filteredStream
        .keyBy("logType")
        .timeWindow(Time.minutes(5))
        .aggregate(new LogAggregationFunction());

// Emit the aggregated log summaries.
windowedSummary.print();

3. How much data do you process?

We process several million records per day, with each record around 5 KB in size.
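
As a rough back-of-envelope check (assuming "several million" means about 5 million records): 5,000,000 records/day × 5 KB ≈ 25 GB/day, which averages out to roughly 60 records/second and about 300 KB/s, a fairly modest steady-state load (peaks can of course be much higher).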

4. Imagine a case where your job has not been running for 2 hours for some reason. It's a real-time job, so we're missing the data from the last 2 hours. How are you going to handle this case?

  • Data Replay
    • When Flink reads from Kafka and a failure occurs, reset the consumer offset to a position before the failure and replay from there, so no data is lost (see the sketch after this list).
  • Checkpoints
    • A checkpoint is a snapshot of Flink's state, used for recovery.
    • If Flink hits an error, it can restart from the most recent checkpoint instead of reprocessing everything from scratch.
  • Sink (writing the results)
    • We must make sure data is not written twice after the system recovers.
    • Two-Phase Commit Protocol (2PC): for sinks that need exactly-once guarantees, Flink supports a two-phase commit, so data becomes visible downstream exactly once even after a failure and recovery.
    • Idempotent sink: some external systems or stores offer idempotent writes (e.g., upserts keyed by a primary key), so even if the same data is written multiple times, the final result is unchanged.
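
A minimal sketch of how these recovery pieces fit together, under the same assumptions as the source sketch above (topic name and props are placeholders; FlinkKafkaConsumer refers to the universal Kafka connector): checkpointing is enabled so Kafka offsets and operator state are restored automatically on restart, and for a manual backfill with no usable checkpoint the consumer can be told to start from a timestamp two hours in the past. For the sink itself, FlinkKafkaProducer configured with Semantic.EXACTLY_ONCE gives the two-phase-commit behavior described above, while an idempotent upsert into the target store is the simpler alternative.

// Checkpointing: snapshot operator state and Kafka offsets every 60 seconds,
// with exactly-once processing guarantees inside Flink.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

// Manual replay for the 2-hour gap: when the job is NOT restored from a
// checkpoint/savepoint, start reading from a timestamp before the outage.
// (props = the same Kafka Properties as in the source sketch under question 1.)
FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("app-logs", new SimpleStringSchema(), props); // placeholder topic
consumer.setStartFromTimestamp(System.currentTimeMillis() - 2 * 60 * 60 * 1000L);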