
9. Design Distributed File System

info

in progress...

Introduction to GFS

Problems with traditional file systems

Before Google File System (GFS), there were

  • single-node file systems
  • network-attached storage (NAS) systems
  • storage area networks (SAN)

These systems served well at the time they were designed. However, as the data storage and processing demands of large, distributed, data-intensive applications grew, they ran into limitations, some of which are discussed below.

Single-Node File System

A single-node file system runs on a single computer and manages the storage attached to it. A single server has limited resources: storage space, processing power, and the number of I/O operations its disks can perform per second.

We can attach substantial storage capacity to a single server, increase the RAM, and upgrade the processor, but there are limits to this type of vertical scaling. A single server also has limits on the number of reads and writes it can serve and on how quickly data can be stored and accessed. These limitations restrict the system's ability to process large datasets and serve a large number of clients simultaneously.

We also can't ignore the fact that a single-node system is a single point of failure: if it goes down, the system becomes unavailable to its users. Moreover, applications that process large datasets need high aggregate throughput rather than low latency, and a single server quickly becomes the bottleneck there.

Network-Attached Storage (NAS) System

The network-attached storage (NAS) system consists of a file-level storage server with multiple clients connected to it over a network running the network file system (NFS) protocol. Clients can store and access files on this remote storage server as if they were local files.

The NAS system has the same limitations as a single-node file system. It is easy to set up and manage but expensive to scale. It can also suffer from throughput issues when large files are accessed from a single server.

Storage Area Network (SAN) System

The storage area network (SAN) system consists of a cluster of commodity storage devices connected to each other, providing block-level data storage to the clients over the network. SAN systems are easy to scale by adding more storage devices.

However, these systems are difficult to manage because of the complexity of the second network, the Fibre Channel (FC) fabric. Setting it up requires dedicated host bus adapters (HBAs) on each host server, along with FC switches and specialized cabling. It is difficult to monitor where a failure has occurred in such a complex system, data inconsistency issues may appear among replicas, and rebalancing the load across the storage devices can also be hard to handle with this architecture.

Note: SAN deployments are special-purpose networks apart from the usual Ethernet networks. This duplicate network, while good for segregating storage traffic, is expensive in terms of dollar cost.

GFS

Google File System (GFS) is a distributed file system that stores and serves large amounts of data for distributed, data-intensive applications on a storage cluster of commodity servers.

Functional requirements

  • Data storage: The system should allow users to store their data on GFS.
  • Data retrieval: The system should be able to give data back to users when they request it.

Non-functional requirements

  • Scalability: The system should be able to store an increasing amount of data (hundreds of terabytes and beyond), and handle a large number of clients concurrently.
  • Availability: A file system is one of the main building blocks of many large systems used for data storage and retrieval. The unavailability of such a system disrupts the operation of the systems that rely on it. Therefore, the file system should be able to respond to the client’s requests all the time.
  • Fault tolerance: The system should handle hardware failures gracefully and should be able to recover from failures without losing data.
  • Durability: Once the system has acknowledged to the user that its data has been stored, the data shouldn't be lost unless the user deletes it themselves.
  • Easy operational management:
    • It should be easy for the system to handle data replication, re-replication, garbage collection, snapshots, and other system-wide management activities.
    • If some data becomes stale, there should be an easy mechanism to detect and fix it.
    • The system should allow multiple independent tenants to use GFS for safe data storage and retrieval.
  • Performance optimization:
    • The focus should be on high throughput rather than low latency for applications that process large datasets.
    • Additionally, Google's applications, for which the system was being built, most often append data to files instead of overwriting existing data. So, the system should be optimized for append operations.
    • For example, a logging application appends new log entries to its log files, and MapReduce outputs are written from beginning to end by appending key/value pairs to the output file(s).
  • Relaxed consistency model:
    • In GFS workloads, there are many more append operations than random writes. That's why GFS doesn't comply with POSIX and instead provides a relaxed consistency model that is optimized for append operations.
    • Data consistency in a distributed setting is hard, and GFS deliberately opts for a custom consistency model for better scalability and performance.

Architecture

A GFS cluster consists of two major types of components: a manager node and a large number of chunkservers. It stores a file's data in the form of chunks. The architecture is shown in the following illustration.

  • The client is a GFS application program interface through which the end users perform the directory or the file operations.
  • Each file is split into fixed-size chunks. The manager assigns each chunk a 64-bit globally unique ID and assigns the chunkservers where the chunk is stored and replicated. The manager acts as an administrator that manages the file system metadata, including namespaces, the file-to-chunk mapping, and chunk locations (a minimal sketch of these metadata structures follows this list).
  • The metadata is stored in the manager's memory for good performance. For a persistent record of the metadata, the manager logs metadata changes to an operation log placed on its hard disk, so that it can recover its state after a restart by replaying the operation log.
  • Besides managing metadata, the manager also handles the following tasks:
    • Data replication and rebalancing
    • Operational locks to ensure data consistency
    • Garbage collection of the deleted data
  • Chunkservers are commodity storage servers that store chunks as plain Linux files.
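
To make this concrete, here is a minimal Python sketch of the three kinds of metadata the manager keeps in memory. The class and field names are hypothetical; real GFS keeps this state in custom in-memory structures, not Python dictionaries.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List

CHUNK_SIZE = 64 * 1024 * 1024  # files are split into fixed-size 64 MB chunks

@dataclass
class ManagerMetadata:
    # Namespace: full pathname -> file/directory attributes.
    namespace: Dict[str, dict] = field(default_factory=dict)
    # File-to-chunk mapping: full pathname -> ordered list of chunk handles.
    file_chunks: Dict[str, List[int]] = field(default_factory=dict)
    # Chunk locations: chunk handle -> chunkservers holding a replica.
    chunk_locations: Dict[int, List[str]] = field(default_factory=dict)
    # Simple counter standing in for the 64-bit globally unique chunk IDs.
    _next_handle: count = field(default_factory=count)

    def allocate_chunk(self, path: str, replicas: List[str]) -> int:
        """Assign a new chunk handle to `path` and record where it is replicated."""
        handle = next(self._next_handle)
        self.file_chunks.setdefault(path, []).append(handle)
        self.chunk_locations[handle] = list(replicas)
        return handle
```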

Though Google calls this manager a “master” in their research paper on GFS, we will use the term “GFS manager”, “manager node” or simply the “manager” to refer to the same thing.

The client requests metadata information from the manager node, such as the location of the requested chunks. The manager looks into its metadata and responds to the client with the location of the requested chunks. The client then asks the chunkservers for the data operations. It is important to note that the data doesn't flow through the manager but directly between the client and the chunkserver.
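
A rough sketch of this two-step flow is shown below. The dictionaries are stand-ins for the manager's metadata and the chunkservers' storage, since the real interaction happens over RPC; the function and parameter names are illustrative, not GFS APIs.

```python
CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(metadata, chunkservers, path, offset, length):
    """Illustrative read path.

    `metadata` maps (path, chunk_index) -> (chunk_handle, [replica names]);
    `chunkservers` maps replica name -> {chunk_handle: bytes}.
    """
    # Control flow: translate the byte offset into a chunk index and ask the
    # manager only for metadata (chunk handle and replica locations).
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = metadata[(path, chunk_index)]

    # Data flow: fetch the bytes directly from a chunkserver; the file data
    # never passes through the manager.
    chunk_data = chunkservers[replicas[0]][handle]
    start = offset % CHUNK_SIZE
    return chunk_data[start:start + length]
```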

Note: The largest GFS cluster can store up to tens of petabytes of data and can be accessed by hundreds of clients concurrently.

GFS File Operations

The illustration below shows all of the operations that GFS clients can perform on files or directories.

Directory operations

GFS clients can perform the following simple directory-level operations. The semantics for these operations are the same as those of the Unix file system.

  • Create directory: Users can create multiple directories to organize their files.
  • Delete directory:
    • Users should be able to delete a directory. The system should ensure this operation doesn't leave dangling data on the chunkservers.
    • It should also ensure that all the files in the directory are deleted before deleting the directory itself. If the directory contains files, the system should ask the client to delete those files first.
  • List directory:
    • The users should be able to list what's inside a directory. GFS represents a file with its full pathname. A lookup table is maintained that maps the full pathname of a file to its metadata, as shown in the following table. This table logically represents the namespace.
    • The last component in the path can be a file or a directory; every earlier component is a directory.
    • In the table below, listing a directory with path /dir1/dir2 should list the pathnames that are one component longer than it, for example, /dir1/dir2/fileA, /dir1/dir2/dir4, and so on (a sketch of this lookup follows below).

If, after a user deletes a directory, the system does not correctly clean up the chunks associated with that directory on the chunkservers, dangling data results: the directory is gone, but the chunks of its files are still stored on the chunkservers.
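
The sketch below models the namespace as such a flat lookup table and shows how listing /dir1/dir2 returns only the pathnames one component longer. The entries and metadata values are illustrative.

```python
# Namespace as a flat lookup table: full pathname -> metadata (type tag only here).
namespace = {
    "/dir1":            {"type": "dir"},
    "/dir1/dir2":       {"type": "dir"},
    "/dir1/dir2/fileA": {"type": "file"},
    "/dir1/dir2/dir4":  {"type": "dir"},
}

def list_directory(path):
    """Return entries whose pathname is exactly one component longer than `path`."""
    prefix = path.rstrip("/") + "/"
    return [p for p in namespace
            if p.startswith(prefix) and "/" not in p[len(prefix):]]

print(list_directory("/dir1/dir2"))  # ['/dir1/dir2/fileA', '/dir1/dir2/dir4']
```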

File operations

The GFS client API allows users to perform the following file operations.

Create a file

Multiple users can concurrently create files in the same directory. During this operation, it must be ensured that the directory in which one user is creating a file is not deleted by another user who has access to that directory.

The create file operation also involves handling multiple files being created with the same name. If there are N concurrent requests to create a file with the same name in the same directory, only one of the requests succeeds, while the other N-1 clients receive an error. The operation is performed atomically.
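
A minimal sketch of this atomic check-and-create behavior is shown below. A single lock stands in for the manager's namespace locking, which in real GFS is finer-grained; the names are hypothetical.

```python
import threading

namespace = {"/dir1": {"type": "dir"}}
namespace_lock = threading.Lock()   # stand-in for the manager's namespace locks

def create_file(path):
    """Exactly one of N concurrent callers creating the same path succeeds."""
    parent = path.rsplit("/", 1)[0] or "/"
    with namespace_lock:                       # check-and-insert happens atomically
        if parent not in namespace:            # parent directory must still exist
            raise FileNotFoundError(parent)
        if path in namespace:                  # another client already created it
            return False                       # ...so this caller gets an error
        namespace[path] = {"type": "file"}
        return True
```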

Open file

GFS allows users to open the file containing the latest version of the data in order to perform read and write operations. When a client opens a file, the metadata is fetched from the manager and cached on the client side. While the client is reading data from that file, some of that data may become stale if another client performs a write operation on the file in the meantime.

Reopening the file should fetch the latest metadata from the manager instead of using the cached metadata.
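
The following sketch shows the intended caching behavior; `manager` is assumed to be any object exposing a get_metadata(path) call (a hypothetical helper, not the real GFS API). Reads between open and close may use the cached copy, but every open or reopen refetches from the manager.

```python
class GFSClient:
    """Toy client illustrating metadata caching behavior."""

    def __init__(self, manager):
        self.manager = manager          # hypothetical object with get_metadata(path)
        self.metadata_cache = {}

    def open(self, path):
        # Open/reopen always refetches, replacing any stale cached entry.
        self.metadata_cache[path] = self.manager.get_metadata(path)
        return self.metadata_cache[path]

    def cached_metadata(self, path):
        # Between open and close, the (possibly stale) cached copy may be used.
        return self.metadata_cache.get(path)
```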

Read file

GFS workloads often comprise two kinds of reads: small random reads and large streaming reads. Let's look at the difference between the two below:

  • Small random read: This kind of read accesses a part of the file data at a random position/offset. Examples of random reads include jumping/scrolling to a random page in a multi-page text document, skipping part of a video and starting to watch from an arbitrary frame, and so on.
  • Large streaming read: In streaming reads, contiguous data in a file is accessed by consecutive read requests from the same client, for example when a user watches a video without interruption. Each such request reads 1 MB or more of data. Large streaming reads are efficient because the client most probably already has the chunk location in its cache (see the sketch after this list).
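
The sketch below shows why streaming reads are cheap for the manager: with 1 MB requests and 64 MB chunks, only one metadata lookup is needed per 64 consecutive reads. The `client` object is assumed to expose hypothetical get_chunk(path, chunk_index) and read_chunk(replica, handle, offset, length) calls.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks
READ_SIZE = 1 * 1024 * 1024     # each streaming read request fetches ~1 MB

def stream_file(client, path, file_size):
    """Read the whole file sequentially, reusing cached chunk locations."""
    offset, cached = 0, None    # cached = (chunk_index, handle, replicas)
    while offset < file_size:
        chunk_index = offset // CHUNK_SIZE
        if cached is None or cached[0] != chunk_index:
            # One manager lookup per 64 MB of data, i.e. per 64 requests.
            handle, replicas = client.get_chunk(path, chunk_index)
            cached = (chunk_index, handle, replicas)
        yield client.read_chunk(cached[2][0], cached[1],
                                offset % CHUNK_SIZE, READ_SIZE)
        offset += READ_SIZE
```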

Write file

GFS was designed for large sequential writes (record appends) rather than random writes. The data, once written by these applications, is rarely modified. Knowing this, GFS handles random writes and record appends separately to optimize system performance for the latter operation.

  • Random write: This operation writes data at a random position in the file, specified by the client. It can be a data insertion or a data overwrite.
  • Record append: This is another name for a large sequential write. It appends data at the end of the file. If there are multiple appends to the same file, GFS places each of them at an offset of its own choosing (a toy contrast of the two operations is sketched below).
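
Below is a toy contrast of the two operations over an in-memory byte array. Real record appends go through the chunk's primary replica and may be padded at chunk boundaries, which this sketch ignores; the helper names are hypothetical.

```python
def random_write(chunk: bytearray, offset: int, data: bytes) -> None:
    # The client chooses the offset; existing bytes may be overwritten.
    chunk[offset:offset + len(data)] = data

def record_append(chunk: bytearray, data: bytes) -> int:
    # The system chooses the offset (the current end of data) and returns it,
    # so concurrent appenders each learn where their record landed.
    offset = len(chunk)
    chunk.extend(data)
    return offset

log = bytearray()
first = record_append(log, b"log entry 1\n")   # offset 0
second = record_append(log, b"log entry 2\n")  # offset 12
```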

Q&A

1. How does a single manager suffice to handle requests from hundreds of clients?

According to the architecture of GFS, the manager appears to have a tremendous amount of work to do, which could make it a bottleneck. Given the simplicity a single manager offers, it is worthwhile to make it lightweight rather than switch to a distributed manager. GFS optimizes the single manager by:

  • Minimizing manager involvement in read/write operations. First, it separates the data and control flows, so the data doesn't need to pass through the manager; the client interacts with the manager only for metadata. Second, the chunk size is kept large (64 MB) to reduce the number of metadata requests to the manager (see the back-of-the-envelope calculation after this list).
  • Keeping the metadata in memory and serving clients from there.
  • Caching the metadata on the client side.
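
As a rough back-of-the-envelope check of why large chunks keep the metadata small enough to fit in memory (the 64-bytes-per-chunk figure is an assumed approximation in the spirit of the GFS paper, not an exact number):

```python
CHUNK_SIZE = 64 * 1024 * 1024      # 64 MB chunks
BYTES_PER_CHUNK_METADATA = 64      # assumed metadata footprint per chunk

data_stored = 1 * 1024 ** 5        # 1 PB of file data in the cluster
num_chunks = data_stored // CHUNK_SIZE
metadata_size = num_chunks * BYTES_PER_CHUNK_METADATA

print(num_chunks)                  # 16777216 chunks
print(metadata_size / 1024 ** 3)   # ~1 GiB of chunk metadata, easily held in RAM
```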

2. How can we handle the single manager failure?

To handle manager failure, we store the metadata changes (mutations) on persistent storage on the manager's disk, as well as on a remote machine so that we can recover in case of disk failures. This persistent record of metadata changes is known as the operation log. A failed manager can recover its state by replaying the operation log placed on its hard disk.

In case the original manager faces a long-lasting hardware failure, we start a new instance of the manager on another machine and redirect the clients to it.
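
A toy sketch of the logging path is shown below; lists stand in for durable storage, and the class and function names are hypothetical. The essential point is the ordering: the mutation is logged locally and remotely before the in-memory metadata changes and before the client is acknowledged.

```python
import json

class OperationLog:
    """Toy operation log with a local and a remote replica (lists as stand-ins)."""
    def __init__(self):
        self.local_disk, self.remote_copy = [], []

    def append(self, record):
        line = json.dumps(record)
        self.local_disk.append(line)    # flush to the manager's own disk
        self.remote_copy.append(line)   # and to a remote machine

def apply_mutation(metadata, oplog, mutation):
    oplog.append(mutation)                            # make the change durable first...
    metadata[mutation["path"]] = mutation["value"]    # ...then mutate the in-memory state
    return "ack"                                      # now it is safe to acknowledge
```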

3. How can we reduce the amount of time required to replay the growing operation log to rebuild the manager state?

Log replay: if the system fails (for example, the manager restarts after a crash), the manager's state can be rebuilt by replaying the operation log. Starting from a known state (such as the initial state or a snapshot), the operations in the log are applied in order until the state is restored.

When a checkpoint is created, the system saves the current state, which already reflects the results of all operations applied up to that point. Therefore, when replaying the log starting from a checkpoint, only the operations recorded after that checkpoint need to be considered.

Checkpoint the manager state whenever the log grows beyond a certain size. On recovery, load the latest checkpoint and replay only the log records written after that checkpoint.
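
Continuing the toy model above (mutations that set a path to a value), the sketch below shows checkpointing the in-memory state and recovering by replaying only the records written after the latest checkpoint. The function names are illustrative.

```python
import json

def take_checkpoint(metadata, oplog):
    """Snapshot the current state and note how many log records it already covers."""
    return json.dumps(metadata), len(oplog.local_disk)

def recover(snapshot, covered, oplog):
    """Load the latest checkpoint, then replay only the records after it."""
    metadata = json.loads(snapshot)
    for line in oplog.local_disk[covered:]:
        mutation = json.loads(line)
        metadata[mutation["path"]] = mutation["value"]
    return metadata
```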

References

  • [GFS] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In Proceedings of the nineteenth ACM symposium on Operating systems principles (SOSP '03). Association for Computing Machinery, New York, NY, USA, 29–43.