SD - System Design Gap-Filling / Quick Notes / Review
This document is a quick review of the key points, common pitfalls, and interview focus areas for each system design question type.
02 Design Youtube
Step 1. Functional Requirements
- Users can upload videos.
- Users can watch (stream) videos.
- Users can view information about a video, such as view counts.
- Users can search for videos.
- Users can comment on videos.
- Users can see recommended videos.
- Users can subscribe to channels.
Step 2. Non-functional Requirements
- The system should be highly available (availability >> consistency).
- The system should support uploading and streaming large videos (10s of GBs).
- The system should allow for low latency streaming of videos, even in low bandwidth environments.
- The system should scale to a high number of videos uploaded and watched per day (~1M videos uploaded per day, 100M videos watched per day).
- The system should support resumable uploads.
- The system should protect against bad content in videos.
Step 3. Core Entities
- User
- Video
- Video Metadata
Step 4. APIs
- Upload a video
POST /upload {Video, Video Metadata}
- Watch a video
GET /videos/{videoId} -> Video & Video Metadata
Step 7. Deep Dives and Best Solutions
7.1 How can we handle processing a video to support adaptive bitrate streaming?
7.2 How do we support resumable uploads?
- 1. Create an upload session
  - POST /uploads (after auth): the server generates an upload_id (business side) and an s3_upload_id (the S3 Multipart UploadId), and returns a recommended part size (see "Part size strategy" below).
  - The response also tells the client how to request presigned upload URLs (see step 3).
  - Create the uploads and upload_parts tables in the DB (e.g., DynamoDB/Postgres) and write the initial state (see "Data model / state machine" below).
- 2. Submit the file and part fingerprints
  - The client can submit the whole-file SHA-256 and per-part checksums (S3-native checksums recommended: CRC32C/SHA256) to enable instant upload / dedup and integrity verification.
- 3. Request presigned URLs on demand, in batches (sketched below)
  - POST /uploads/{id}/presign requests PUT presigned URLs for a batch of partNumbers (100-500 per request; avoid signing too many at once).
  - The server scopes each URL to the UploadId + partNumber and keeps the expiry short (e.g., 30-60 minutes).
- 4. Upload parts in parallel (client → S3)
  - The client uploads in parallel using the presigned URLs, attaching x-amz-checksum-crc32c / x-amz-checksum-sha256 (S3 verifies them; strongly recommended), and records the ETag returned by S3.
  - After each successful part upload, the client reports back: POST /uploads/{id}/parts/{partNumber}/report {etag, checksum, size}.
  - The server marks the part as UPLOADED and stores its etag/checksum/bytes.
  - (Extra reliability) The server can sample, or past a threshold, call ListParts to reconcile against S3 and promote parts to VERIFIED.
- 5. Resuming an interrupted upload
  - At any time, GET /uploads/{id} returns the status of every part (NOT_UPLOADED/UPLOADING/UPLOADED/VERIFIED), so the client only re-uploads the missing parts.
  - The client can also request a fresh batch of presigned URLs and keep uploading.
  - Cross-device resume: the client only needs the upload_id (or can look up the most recent session via the whole-file fingerprint + user).
- 6. Complete the upload (server side)
  - When all parts are VERIFIED, the server calls CompleteMultipartUpload with the [partNumber, ETag] list.
  - S3 creates the final object and emits a single ObjectCreated event → this triggers the transcoding pipeline (SQS/SFN/MediaConvert/FFmpeg-on-EKS).
  - On the business side, VideoMetadata.status moves from UPLOADED → PROCESSING.
- 7. Cleanup and failure recovery
  - Timeout GC: a scheduled job (or Step Functions) scans the uploads table, calls AbortMultipartUpload on sessions unfinished past their TTL (e.g., 72h), and reclaims the DB records.
  - Abort: POST /uploads/{id}/abort when the user cancels; the server likewise calls AbortMultipartUpload.
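A minimal server-side sketch of steps 1, 3, and 6 above, assuming boto3 against S3. The bucket name, key layout, and function names are illustrative; session and part state is assumed to be persisted in the uploads / upload_parts tables described below.

```python
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "video-uploads"  # assumed bucket name

def create_upload_session(user_id: str, file_name: str, part_size: int) -> dict:
    """Step 1: create the business-side session and the S3 multipart upload."""
    key = f"raw/{user_id}/{uuid.uuid4()}/{file_name}"
    resp = s3.create_multipart_upload(Bucket=BUCKET, Key=key)
    return {
        "upload_id": str(uuid.uuid4()),   # business upload_id, persisted in the uploads table
        "s3_upload_id": resp["UploadId"],
        "object_key": key,
        "part_size": part_size,
    }

def presign_parts(object_key: str, s3_upload_id: str, part_numbers: list) -> dict:
    """Step 3: hand the client one short-lived PUT URL per requested partNumber."""
    return {
        n: s3.generate_presigned_url(
            "upload_part",
            Params={"Bucket": BUCKET, "Key": object_key,
                    "UploadId": s3_upload_id, "PartNumber": n},
            ExpiresIn=1800,               # 30-minute expiry, per the notes above
        )
        for n in part_numbers
    }

def complete_upload(object_key: str, s3_upload_id: str, parts: list) -> None:
    """Step 6: once every part is VERIFIED, stitch the final object together.

    `parts` is [{"PartNumber": 1, "ETag": "..."}, ...] collected from part reports."""
    s3.complete_multipart_upload(
        Bucket=BUCKET, Key=object_key, UploadId=s3_upload_id,
        MultipartUpload={"Parts": sorted(parts, key=lambda p: p["PartNumber"])},
    )
```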
Uploads
- id: business upload_id
- user_id
- file_sha256: whole-file fingerprint, enables instant upload / dedup
- bucket, object_key
- s3_upload_id
- part_size
- status: INIT → UPLOADING → READY_TO_COMPLETE → COMPLETING → UPLOADED → PROCESSING → DONE/FAILED
- created_at, expires_at
Upload_parts
- (upload_id, part_number) primary key
- fingerprint (optional per-part hash)
- etag, checksum_crc32c/sha256, bytes
- status: NOT_UPLOADED → UPLOADING → UPLOADED → VERIFIED
- updated_at
State-advance rules
- After a client report or a server-side ListParts reconciliation, a part moves to UPLOADED/VERIFIED.
- Once every part is VERIFIED ⇒ the session becomes READY_TO_COMPLETE.
- The session is set to COMPLETING while CompleteMultipartUpload runs; on success it is set to UPLOADED and transcoding kicks off (see the sketch below).
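A tiny sketch of these rules, assuming hypothetical `db` helpers (`mark_part`, `list_parts`, `set_upload_status`) over the tables above:

```python
def on_part_report(db, upload_id: str, part_number: int, etag: str, checksum: str, size: int):
    # Client reported a successful PUT: record etag/checksum/bytes and mark the part UPLOADED.
    db.mark_part(upload_id, part_number, status="UPLOADED",
                 etag=etag, checksum=checksum, bytes=size)

def on_reconcile_with_s3(db, upload_id: str, verified_part_numbers: list):
    # A ListParts reconciliation confirmed these parts: promote them to VERIFIED.
    for n in verified_part_numbers:
        db.mark_part(upload_id, n, status="VERIFIED")
    parts = db.list_parts(upload_id)
    if parts and all(p.status == "VERIFIED" for p in parts):
        db.set_upload_status(upload_id, "READY_TO_COMPLETE")
```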
Part size strategy - staying under the 10,000-part limit
- S3 limits: minimum 5MB per part (the last part may be smaller), maximum 10,000 parts.
- Recommendation: adapt the part size to the file size, targeting on the order of ~1k parts (sketched below): part_size = max(ceil(file_size / 10000), 8MB), typically 8-64MB; very large files (>50GB) can go up to 128-256MB.
- On mobile, where networks are flaky, 10-16MB parts give finer-grained retransmission and higher success rates.
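A quick sketch of the adaptive part-size rule (the 5MB minimum and 10,000-part maximum are S3 limits; the 8MB floor and 256MB cap are the recommendations above):

```python
import math

MiB = 1024 * 1024
MAX_PARTS = 10_000   # S3 hard limit on parts per multipart upload

def choose_part_size(file_size: int) -> int:
    """part_size = max(ceil(file_size / 10000), 8MB), capped for very large files."""
    part_size = max(math.ceil(file_size / MAX_PARTS), 8 * MiB)
    return min(part_size, 256 * MiB)   # >50GB files can use 128-256MB parts

# Example: a 50GB file -> ceil(size / 10,000) ≈ 5.4MB, floored up to 8MB,
# which yields ~6,400 parts, comfortably under the 10,000-part limit.
```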
Client concurrency and retries
- Concurrency window: 4-16 parallel part uploads recommended (3-6 on mobile), adapted to RTT and bandwidth.
- Retries: exponential backoff with jitter on 5xx/network errors (sketched below); a retried part must keep the same partNumber.
- URL expiry: request a new presigned URL whenever one expires.
- Idempotency: if a reportPart is repeated with the same (partNumber, etag/checksum), treat it as an idempotent success; if they differ, flag a conflict and ask the client to re-upload that part.
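A minimal client-side retry sketch for a single part, assuming a `put_part` callable that performs the presigned PUT and raises on 5xx/network errors:

```python
import random
import time

def upload_part_with_retry(put_part, part_number: int, data: bytes, max_attempts: int = 5):
    """Retry a single part with exponential backoff + full jitter.

    The same partNumber is reused on every attempt, so S3 simply overwrites the part."""
    for attempt in range(max_attempts):
        try:
            return put_part(part_number, data)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            backoff = min(2 ** attempt, 32)          # 1, 2, 4, 8, 16, ... capped at 32s
            time.sleep(random.uniform(0, backoff))   # full jitter
```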
Security and quotas
- Presigned URLs are scoped to a single UploadId + partNumber and expire quickly.
- The server enforces per-user / per-session rate limits and quotas (max concurrency, peak bandwidth, cap on unfinished sessions).
- Encrypt objects with SSE-KMS; note that with KMS the ETag is not an MD5, so rely on S3 checksum headers or self-computed fingerprints instead.
- Viruses / malicious content: after completion, route the object through a quarantine queue; only objects that pass scanning proceed to transcoding.
7.3 How do we scale to a large number of videos uploaded / watched per day?
Our system assumes that ~1M videos are uploaded per day and that 100M videos are watched per day. This is a lot of traffic and necessitates that all of our systems scale and ensure a solid experience for end users.
Let's walk through each major system component to analyze how it will scale:
- Video Service
- This service is stateless and is responsible for responding to HTTP requests for presigned URLs and video metadata point queries. It can be horizontally scaled and sits behind a load balancer.
- Video Metadata
- This is a Cassandra DB and will scale horizontally and efficiently thanks to Cassandra's leaderless replication and internal consistent hashing. Videos will be uniformly distributed because the data is partitioned by videoId. Of note, the node that houses a very popular video might become "hot", which could be a bottleneck.
- Video Processing Service
- This service can scale to a high number of videos and can have internal coordination around how it distributes DAG work across worker nodes. Of note, this service will likely have some internal queuing mechanism that we don't visualize. This queue system will allow it to handle bursts in video uploads. Additionally, the number of jobs in the queue might be a trigger for this system to elastically scale to have more worker nodes.
- S3
- S3 scales extremely well to high traffic / high file volumes. It is multi-region and can elastically scale. However, the data center that houses S3 might be far from some % of users streaming the video if the video is streamed by a wide audience, which might slow down the initial loading of a video or cause buffering for those users.
Addressing the "hot" video problem
To address the "hot" video problem, we can consider tuning Cassandra to replicate data to a few nodes that can share the burden of storing video metadata. This will mean that several nodes can service queries for video data. Additionally, we can add a cache that will store accessed video metadata. This cache can store popular video metadata to avoid having to query the DB for it. The cache can be distributed, use the least-recently-used (LRU) strategy, and partitioned on videoId as well. The cache would be a faster way to retrieve data for popular videos and would insulate the DB a bit.
- 我们可以考虑调整 Cassandra,将数据复制到几个可以分担存储视频元数据负担的节点。这意味着多个节点可以处理视频数据查询。
- 此外,我们可以添加一个缓存来存储访问过的视频元数据。
- 这个缓存可以存储热门视频元数据,避免查询数据库。
- 缓存可以采用分布式设计,使用最近最少使用 (LRU) 策略,并根据视频 ID 进行分区。
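A read-through cache sketch for video metadata, assuming redis-py; the key format, TTL, and `cassandra_lookup` callable are illustrative:

```python
import json
import redis

# Cache nodes would be configured for LRU eviction (maxmemory-policy allkeys-lru)
# and can be partitioned by videoId, mirroring the metadata partitioning.
cache = redis.Redis(host="metadata-cache", port=6379)

def get_video_metadata(video_id: str, cassandra_lookup) -> dict:
    key = f"video:meta:{video_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                 # hot videos are served from the cache
    metadata = cassandra_lookup(video_id)         # fall through to Cassandra on a miss
    cache.set(key, json.dumps(metadata), ex=300)  # short TTL keeps counts reasonably fresh
    return metadata
```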
Addressing streaming for users far from S3 data centers
To address the streaming issues for users who might be far away from data centers with S3, we can consider employing CDNs. CDNs could cache popular video files (both segments and manifest files) and would be geographically proximate to many users via a network of edge servers. This would mean that video data would have to travel a significantly shorter distance to reach users, reducing slowness / buffering. Also, if all the data (manifest files and segments) was in the CDN, then the service would never need to interact with the backend at all to continue streaming a video.
06 Design Instagram
Step 1. Functional Requirements
(placeholder)
Step 2. Non-functional Requirements
(placeholder)
Step 3. Core Entities
(placeholder)
Step 7. Deep Dives and Best Solutions
(placeholder)
07 Design Uber
Step 1. Functional Requirements
- Riders can input a start & destination and get a fee estimate.
- Riders can request a ride.
- The system matches the rider with a nearby available driver.
- Drivers can accept/decline rides and navigate.
Step 2. Non-functional Requirements
- Low-latency matching: match successfully or fall back with a failure within < 1 min.
- Strongly consistent assignment (a driver can never be dispatched more than one ride at the same time).
- High throughput: ~100k requests from the same area during peak periods.
Step 3. Core Entities
- Ride
{id, rider_id, driver_id, vehicle, status, pickupTS, dropoffTS, fee, route}
- Location
{id, driver_id, latitude, longitude, timestamp, ...}
- Fee
{id, pickup, dropoff, eta, fee_amount}
- Rider
{id, name, ...}
- Driver
{id, name, vehicle_info, current_location, status}
Step 4. APIs
- Request a fee estimate
POST /fee
{rider_id, pickup_location, destination}
rpc EstimateFee(EstimateFeeRequest)
- Request a ride
POST /ride
{fee_id}
rpc CreateRide(CreateRideRequest)
- Driver reports real-time location
POST /drivers/location
{lon, lat}
rpc StreamDriverLocation(stream Location)
- Driver accepts or declines a ride
PATCH /rides/{rideId}
{accept|deny}
rpc AcceptRide(AcceptRideRequest)
Step 5. Services
- Ride Service
- Ride Matching Service
- Map Service
- Location Service (H3, Uber's open-source hexagonal grid index over cities and areas)
- Notification Service (APNs, Firebase)
Step 6. Storage
- Ride DB / Location DB
Step 7. Deep Dives and Best Solutions
7.1 How do we handle frequent driver location updates and efficient proximity searches on location data?
- Use Redis or SSD storage with TTL to handle high-throughput location updates and queries.
- Ensure high availability and correctness with replicas and persistent storage.
- Build spatial indexes (e.g., H3) to speed up nearby-driver searches and reduce distance-calculation time (see the sketch below).
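A sketch of this approach, assuming the h3-py (v4) and redis-py libraries: drivers are indexed by H3 cell with a TTL, and a proximity search scans the rider's cell plus its neighboring rings. Key names, resolution, and TTL values are illustrative.

```python
import h3      # h3-py v4 API
import redis

r = redis.Redis()
H3_RES = 8           # ~0.7 km² hexagons; resolution is an assumption
LOCATION_TTL = 30    # seconds; cells that stop receiving heartbeats age out

def report_location(driver_id: str, lat: float, lng: float) -> None:
    """Driver heartbeat: add the driver to the set for the H3 cell covering its position."""
    cell = h3.latlng_to_cell(lat, lng, H3_RES)
    key = f"drivers:cell:{cell}"
    r.sadd(key, driver_id)
    r.expire(key, LOCATION_TTL)   # refresh the TTL so active cells stay warm

def nearby_drivers(lat: float, lng: float, rings: int = 1) -> set:
    """Proximity search: union the rider's cell with `rings` rings of neighboring cells."""
    origin = h3.latlng_to_cell(lat, lng, H3_RES)
    found = set()
    for cell in h3.grid_disk(origin, rings):
        found |= {d.decode() for d in r.smembers(f"drivers:cell:{cell}")}
    return found
```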
7.2 How can we manage system overload from frequent driver location updates while ensuring location accuracy?
- Adjust the frequency of location updates based on context factors like speed, routes, traffic, or info from other nearby drivers
- Make better use of mobile device capabilities and sensors to decide when an update actually needs to be sent
7.3 How do we prevent multiple ride requests from being sent to the same driver simultaneously?
- Use a distributed lock (e.g., Redis or ZooKeeper) with a 10-second TTL to lock the driver during assignment (sketched below).
- If the driver doesn’t respond within 10s, release the lock and forward the request to another nearby driver.
- Ensure the lock service is highly available and fault-tolerant to avoid race conditions or deadlocks.
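A minimal sketch of the driver lock with Redis SET NX EX; the key format is illustrative, and the token check on release prevents one request from deleting a lock that another request now holds.

```python
import uuid
from typing import Optional

import redis

r = redis.Redis()
LOCK_TTL = 10   # seconds, per the design above

def try_lock_driver(driver_id: str) -> Optional[str]:
    """Attempt to reserve a driver for one ride request; returns a token on success."""
    token = str(uuid.uuid4())
    if r.set(f"lock:driver:{driver_id}", token, nx=True, ex=LOCK_TTL):
        return token   # this request owns the driver for up to 10s
    return None        # driver already locked; try the next nearby driver

# Release only if we still own the lock (compare-and-delete must be atomic, hence Lua).
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
  return redis.call('del', KEYS[1])
else
  return 0
end
"""

def release_driver(driver_id: str, token: str) -> None:
    r.eval(RELEASE_SCRIPT, 1, f"lock:driver:{driver_id}", token)
```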
7.4 How can we ensure no ride requests are dropped during peak demand periods?
- Buffer ride requests in a message queue (Kafka) with auto-scaling; provision partitions and consumers appropriately (see the sketch below).
- Minimize how often consumer-group rebalancing happens.
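A hedged sketch of the producer side, assuming the confluent-kafka client; the topic name, region key, and payload shape are illustrative.

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})

def enqueue_ride_request(ride_request: dict) -> None:
    # Key by geo region so a regional burst spreads over that region's partitions
    # and per-region ordering is preserved; requests are buffered, not dropped.
    producer.produce(
        "ride-requests",
        key=ride_request["region_id"],
        value=json.dumps(ride_request).encode("utf-8"),
    )
    producer.poll(0)   # serve delivery callbacks without blocking

# The matching service consumes this topic as a consumer group; scaling is driven by
# consumer lag, and keeping group membership stable reduces rebalances.
```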
7.5 What happens if a driver fails to respond in a timely manner?
- As in 7.3: release the driver's lock after the 10-second TTL and forward the request to another nearby driver.
7.6 How can you further scale the system to reduce latency and improve throughput?
- Geo-shard read replicas to guarantee availability for hot regions and cities; avoid hot-key problems.
- Allocate more resources to hot areas and fewer to quieter areas.
12 Design Dropbox
Step 1. Functional Requirements
- Users should be able to upload a file from any device.
- Users should be able to download a file from any device.
- Users should be able to share a file with other users and view the files shared with them.
- Users can automatically sync files across devices.
- Users should be able to edit files.
- Users should be able to view files without downloading them.
Step 2. Non-functional Requirements
- The system should be highly available (prioritizing availability over consistency).
- availability >> consistency
- The system should support files as large as 50GB.
- The system should be secure and reliable. We should be able to recover files if they are lost or corrupted.
- The system should make upload, download, and sync times as fast as possible (low latency).
- The system should have a storage limit per user.
Step 3. Core Entities
- File
id, storage_path, file_name, total_chunks, file_size
- FileChunk
file_id, chunk_index, storage_path, size, checksum, version, ts, status
- File Metadata
id, owner_id, file_id, file_name, file_type, file_size, version, ts, is_deleted, checksum, shared_with, access_permission, parent_folder_id, can_preview
- User
id, name, email, hashed_password, created_at, storage_limit, used_storage, devices
Step 4. APIs
- Upload a file
POST /files/upload {bytes[] File, FileMetadata}
- Download a file
GET /files/{fileId} -> bytes[] File & FileMetadata
- The client can download all chunks in parallel (multi-threaded) and reassemble them in order
GET /files/{fileId}/chunks -> chunks metadata {index+storage_path+checksum}
- Share a file
POST /files/{fileId}/share {user_ids[], permission(read/write)}
- Revoke sharing:
DELETE /files/{fileId}/share {user_ids[]}
- Let the client query the remote server for changes to a single file
GET /files/{fileId}/changes -> FileMetadata[]
- List changes across all of the current user's files
GET /sync/changes?since=timestamp&device_id=xxx
- Response:
{ file_id, version, change_type: "modified" | "deleted", updated_at }
- Get a previewable link (PDF, doc, image)
GET /files/{fileId}/preview -> preview_url
- List all files the user can access
GET /files?filter=all|shared|owned
Step 5. Services
- User Service
- Register, login
- Devices add or delete
- Auth & Access Service
- Generate and validate token
- File Access Controller(owner/shared/roles)
- File Metadata Service
- Folder relationship, metadata
- File Storage Service
- Manage lifecycle of File and FileChunk, upload, download, sharding
- Connected to Object Storage like S3 or Google Cloud Storage
- Sync Service
- Keep files in sync and up to date across devices
- Preview and Editor Service
- Convert images/PDFs/documents to displayable formats
- Supports online editing of lightweight files such as markdown/text/JSON
- Notification Service
- Notifications when files finish uploading, or are shared, updated, or deleted
- Client device login notification
- CDN
- Fast distribution of static content and accelerated preview
Step 6. Storage
(placeholder: metadata DB, object storage, cache layer)
Step 7. Deep Dives and Best Solutions
7.1 How can you support large files?
Two core requirements
- Progress Indicator
- Users should be able to see the progress of their upload so that they know it's working and how long it will take.
- Resumable Uploads
- Users should be able to pause and resume uploads. If they lose their internet connection or close the browser, they should be able to pick up where they left off rather than re-uploading the 49GB that may have already been uploaded before the interruption.
Before diving into the solution, it's worth pausing to understand the limitations of uploading a large file via a single POST request.
- Timeouts
- Web servers and clients typically have timeout settings to prevent waiting indefinitely for a response. A single POST request for a 50GB file can easily exceed these timeouts. In an interview, this is a good moment for a quick back-of-the-envelope calculation: with a 50GB file and a 100 Mbps connection, how long does the upload take?
- 50GB × 8 bits/byte ÷ 100 Mbps = 4,000 seconds ≈ 1.1 hours
- Browser and Server Limitations
- In most cases, it's not even possible to upload a 50GB file via a single POST request due to limits configured in the browser or on the server. Both browsers and web servers often impose limits on the size of a request payload. For example, popular web servers like Apache and NGINX have configurable limits, but the defaults are typically under 2GB. Most modern managed services (such as Amazon API Gateway) have much lower defaults that cannot be raised; for the Amazon API Gateway used in this design, the default limit is just 10MB.
- Network Interruptions
- Large files are more susceptible to network interruptions. If a user is uploading a 50GB file and their internet connection drops, they have to start the upload from scratch.
- User Experience
- Users are effectively blind to the progress of their upload. They have no idea how long it will take or whether it's even working.
To address these limitations, we can use a technique called "chunking" to break the file into smaller pieces and upload them one at a time (or in parallel, depending on network bandwidth). Chunking needs to be done on the client so that the file can be broken into pieces before it is sent to the server (or S3 in our case). A very common mistake candidates make is to chunk the file on the server, which effectively defeats the purpose since you still upload the entire file at once to get it on the server in the first place. When we chunk, we typically break the file into 5-10 MB pieces, but this can be adjusted based on the network conditions and the size of the file.
With chunks, it's rather straightforward for us to show a progress indicator to the user. We can simply track the progress of each chunk and update the progress bar as each chunk is successfully uploaded.
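A small sketch of client-side chunking: split the file into fixed-size pieces and fingerprint each one so upload state can be tracked per chunk. The 8MB size and SHA-256 fingerprint are assumptions in line with the discussion above.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024   # 8MB, close to the 5-10MB range discussed above

def iter_chunks(path: str):
    """Yield fixed-size chunks of a local file along with a per-chunk fingerprint."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            yield {
                "index": index,
                "size": len(data),
                "checksum": hashlib.sha256(data).hexdigest(),  # lets the server dedupe / verify
                "data": data,
            }
            index += 1
```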
The next question is: how do we handle resumable uploads? We need to keep track of which chunks have been uploaded and which haven't.
- We can do this by saving the state of the upload in the database, specifically in our FileMetadata table. Let's update the FileMetadata schema to include a chunks field.
- When the user resumes the upload, we can check the chunks field to see which chunks have been uploaded and which haven't, and then upload only the chunks that are still missing. This way, the user doesn't have to start from scratch if they lose their internet connection or close the browser.
// for example
{
"id": "123",
"name": "file.txt",
"size": 1000,
"mimeType": "text/plain",
"uploadedBy": "user1",
"status": "uploading",
"chunks": [
{
"id": "chunk1",
"status": "uploaded"
},
{
"id": "chunk2",
"status": "uploading"
},
{
"id": "chunk3",
"status": "not-uploaded"
}
]
}
But how should we ensure this chunks field is kept in sync with the actual chunks that have been uploaded?
A better approach is to use S3 event notifications to keep the chunks field in sync with the actual chunks that have been uploaded. S3 event notifications are a feature of S3 that lets you trigger a Lambda function or send a message to an SNS topic when a file is uploaded to S3. We can use this feature to notify our backend when a chunk is successfully uploaded and then update the chunks field in the FileMetadata table without relying on the client.
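A hedged sketch of that flow: a Lambda function subscribed to ObjectCreated events marks the corresponding chunk as uploaded in FileMetadata. The key layout (`.../<file_id>/<chunk_index>`) and the `mark_chunk_uploaded` helper are assumptions for illustration.

```python
import urllib.parse

def handler(event, context):
    """Lambda entry point for S3 ObjectCreated events."""
    for record in event["Records"]:
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        file_id, chunk_index = key.split("/")[-2:]   # assumes keys like ".../<file_id>/<chunk_index>"
        mark_chunk_uploaded(file_id, int(chunk_index))

def mark_chunk_uploaded(file_id: str, chunk_index: int) -> None:
    # Illustrative placeholder: update the matching entry in FileMetadata.chunks
    # to "uploaded" (e.g., a conditional write keyed by file_id + chunk_index).
    ...
```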