1. Design Twitter

info

Design an Online social networking website, like Twitter.

Post Tweet
Timeline(主页)
Follow

来源: 花花酱

Interview Signals

System Design 考察的四方面

Work solution: 能否提出一个可以 work 的方案，是否熟悉常见的场景与设计模式.
Analysis and communication: 与面试官保持交流，对 Storage 与 Bandwidth 的分析.
Tradeoff Pros/Cons: 是否能够提出不同的解决方法并评估不同解决方案的优缺点，根据需求做取舍.
Knowledge Base: 知识面深度和广度.

Overview

Step 1: Clarify the requirements
Step 2: Capacity Estimation
Step 3: System APIs
Step 4: High-level System Design
Step 5: Data Storage
Step 6: Scalability

Step 1. Clarify the requirements

Clarify the requirements and goals of the system

Requirements
Traffic size (e.g., Daily Active User)

Discuss the funcitonalities, align with interviewers on components to focus.

Type 1: Functional Requirement

Tweet a. Create b. Delete
Timeline/Feed a. Home b. User
Follow a user
Like a tweet
Search tweets
...

Type 2: Non-Functional Requirement

在一个分布式计算系统中，只能同时满足下列的两点。即 CAP 理论

Consistency 一致性
- Every read receives the most recent write or an error
- Sacrifice: Eventual consistency

Early results of eventual consistency data queries may not have the most recent updates because it takes time for updates to reach database. 最终一致性保证在处理结束后能够保证一致性.

Availability 可用性
- Every request receives a (non-error) response, without the guarantee that it contains the most recent write
- Scalable
  - Performance: low Latency 低延迟
Partition tolerance (Fault Tolerance) 容错性
- The system continues to operate despite an arbitrary number of messgaes being dropped (or delayed) by the network between nodes

Step 2. Capacity Estimation

Handy conversion guide:

2.5 million seconds per month

86400 seconds per day

1B = 8 bit

1KB = 1000B = 10^3 B

1MB = 1000KB = 10^6 B

1GB = 1000MB = 10^9 B

1TB = 1000GB

Assumption

200 million DAU, 100 million new tweets.
Each user: visit home timeline 5 times; other user timeline 3 times.
Each timeline/pages has 20 tweets.
Each tweet has size 280 (140 characters) bytes, metadata 30 bytes.
- Per photo: 200KB, 20% tweets have images.
- Per video: 2MB, 10% tweets have video, 30% videos will be watched.

Storage Estimate

Write size daily

Text: 100M new tweets * (280 + 30) Bytes/tweet = 31GB/day
Image: 100M new tweets * 20% has image * 200KB per image = 4TB/day
Video: 100M new tweets * 10% has video * 2MB per video = 20TB/day
Total: 31GB + 4TB + 20TB = 24TB/day

Bandwidth Estimate

Read heavy system

Daily Read Tweets Volume

200M * (5 home visit + 3 user visit) * 20 tweets/pages = 32Billion tweets/day

Daily Read Bandwidth

Text: 32B * 280 bytes / 86400 = 100MB/s
Image: 32B * 20% tweets has image * 200KB per image / 86400 = 14GB/s
Video: 32B * 10% tweets has video * 30% got watched * 2MB per video / 86400 = 20GB/s
Total: 35GB/s

Step 3. System APIs

postTweet(userToken, string tweet)
deleteTweet(userToken, string tweetId)
likeOrUnlikeTweet(userToken, string tweetId, bool like)
readHomeTimeLine(userToken, int pageSize, opt string pageToken)
readUserTimeLine(userToken, int pageSize, opt string pageToken)

pageToken 用于指定当前已阅读的页面，用于翻页

Step 4. High-level System Design

在用户发布 tweets 的时候将数据存入 Cache, 例如当 Elon Musk post a tweet, Tweet Writer Server 在写 DB 的同时, 也 update Cache.

直接从数据库中读取 user's Home Timeline 是非常慢的，我们可以将所有的 user Home Timeline 存入缓存中

每当一个用户发布 tweet 的时候，writer 会更新这个用户的每个 follower 在 Cache 中的 Home Timeline。此步骤在 Twitter 中叫做 Fan out on write.

// to store in cache like:
{
    user: {
        user_id: 1001,
        tweet_id: 1,
        tweet: "...",
        followers: [...],
    }
}

Focus on Home Timeline

Naive solution: Pull mode 拉模型

How: Fetch tweets from N followers from DB, merge and return. 从数据库中获取 N 个 follower 的 tweets，合并并返回。
Pros: Write is fast: O(1)
Cons: Read is very slow: O(N) power to the number of tables to join. 大量的 join & combine 操作

Better solution: Push mode 推模型

How:
- Maintain a feed list in cache for each user. 维护每个用户的 feed list
- Fanout on write. 当用户发布 tweet 的时候，将数据写入到每个 follower 的 feed list 中
Pros:
- Read is fast: O(1) from the feed list in cache
Cons:
- Write need more efforts: O(N) write for each new tweet. use Async tasks like Queue 写延时对用户影响不大使用异步任务
- Delay in showing latest tweets (eventual consistency)
- Not efficient for users with huge amount of followers (>100k). 对于拥有大量关注者 (>10 万) 的用户来说效率不高。

试想一个场景，某千万粉大 v 发帖，writer server 的写入需要过程，在此期间有些人能先看到 tweet 并开始转发了，有些人还没能看到。

Hybrid solution

Non-hot users:
- fan out on write(push mode): write to user timeline cache. 写入到用户的 timeline cache。
- do not fanout on non-active users. 忽略不活跃用户
Hot users:
- Fan in on read(pull mode): read during timeline request from tweets cache, and aggregate with results from non-hot users. 从 tweets cache 中读取，与非热门用户的结果聚合。
- 当大 V 发布 tweets 的时候，不需要更新到每个 follower 的 timeline 里面。当 user 开始刷新并读取 timeline 的时候，再去 fetch 这些大 V 的 tweets。

Step 5. Data Storage

SQL database
- E.g., user table
NoSQL database
- E.g., timelines
File system
- Media file: image, audio, video

Redis 设计

Redis 的 value 支持很多种数据结构

UserId	TweetId
1	10,11,12,13,14...

Integer, String
double linked list导致内存碎片化，需要维护指针，每个 Id 需要 2 个指针，导致内存利用率很低.
Twitter 使用到的是 ziplist数据结构。即一个数据 block 中存储更多信息，指针数量被大幅减少，而且遍历速度更快. 更好的实现了 pagination 分页的功能. 当用户在读取 tweet 的时候，并不需要全部读取，而是分页面有页面读取上限, 我们只需要满足一页的数据就够了.
实际环境中，Redis 只为每个 user 存储800条TweetId. 并且只有 30 天内登录过的 user 才会被在 Redis 集群中创建<K, V> 作为 read 的优化.

Step 6. Scalability

Identify potential bottlenecks
Discussion solutions, focusing on tradeoffs
- Data sharding 数据分片
  - Data store, cache
- Load balancing 负载均衡
  - E.g., user <-> application server; application server <-> cache server; application <-> db
- Data caching 数据缓存
  - Read heavy 对读 >> 写的 app 特别有用

Sharding

Why: Impossible to store/process all data in a single machine
How: Break large tables into smaller shards on multiple servers 将大表水平拆分为不同的 shards 存储在不同的服务器上
Pros: Horizontal scaling
Cons: Complexity (distributed query, resharding)

Option 1: Shard by tweet's creation time

Pros: Limited shards to query
Cons:
- Hot/Cold data issue 当天的表会有非常高的 read，几年前的表无人访问
- New shards fill up quickly 新 shard 的 write qps 非常高
基于推文创建时间分片: 基于创建时间存储推文的好处是可以快速获取所有的最靠前推文，而且我们只需要查询很小一部分服务器。问题是流量负载不会被分布到多台服务器，例如当写操作时，所有的新推文都进入一台服务器，其余服务器都是闲置的。类似地，当读操作时，存储最近数据的服务器和存储老数据的服务器相比，将有非常高的负载。

Option 2: Shard by hash (userId): store all the data of a user on a single shard

Pros:
- Simple
- Query user timeline is straight forward
Cons:
- Home timeline still needs to query multiple shards
- Non-uniform distribution of storage
  - User data might not be able to fit into a single shard
  - 一些用户和其他用户相比，可能发布大量推文或者关注大量用户。维护增长用户数据的均匀分配是非常困难的
- Hot users: 如果用户成为热门用户，服务器上将会有大量关于该用户的查询。高负载将影响我们服务的性能。
- Availability
基于用户 ID 分片: 我们可以尝试将一个用户的所有数据存储在一台服务器上。存储时，我们可以将用户 ID 传给我们的哈希函数，哈希函数将用户映射到一台数据库服务器，数据库服务器上存储所有用户发布的推文、喜欢的推文、关注者等信息。当查询一个用户发布的推文/关注者/喜欢的推文时，我们可以使用寻找用户数据的哈希函数从数据库服务器中读取相应的信息

Option 3: Shard by hash (tweetId)

Pros:
- Uniform distribution
- High availability
Cons:
- Need to query all shards in order to generate user/home timeline
- 因为 tweets 分布在不同的 shards 里
基于 tweet ID 分片: 我们的哈希函数将每个 tweet ID 映射到随机的一台存储该推文的服务器。为了搜索推文，我们必须查询所有的服务器，每个服务器将返回一个推文集合。一个中心化服务器将聚集这些结果并返回给用户。我们用时间线的生成举例，以下是我们的系统生成用户时间线需要执行的操作:
- 1. app sever 将找到用户关注的所有用户。
- 1. app server 将查询发送到所有的 DB server，找到这些用户发布的 tweet。
- 1. 每个 DB server 将找到每个用户的 tweet，将推文按照发布时间由近及远排序并返回最靠前的推文。
- 1. app server 将所有的结果合并然后再次排序，将最靠前的结果返回给用户

Caching

由于查看 Timeline 的 tweets 需要使用到翻页，而一般情况下翻页次数不多，不会有用户翻到几百页后，这是一种基于前端的限制行为。因此我们只需要保存近期的 tweets 在 cache 中，当用户翻页到后面时，再从 DB 中 query 数据

Deep Dive - Design Timeline Feed

01 Intro

Feed 系统是一个信息分发系统。例如淘宝/头条/抖音/微博等 APP 都必须具有首页功能，首页通常要展示 Feed 列表。信息分发承载着用户留存和商业化的重要指标，从技术上对延迟和可用性有着极高的要求。

Feed 系统有两个要素，召回和排序。

Retrieve 召回: 决定有哪些内容需要分发给哪些用户。
Rank 排序: 由于手机屏幕的尺寸只有手掌大小，信息的展示位不足。所以必须要区分和展示信息的优先级。因此需要有 Rank 排序

Feed 系统的类型

Timeline Feed: 根据用户与用户之间的关注关系来召回 Feed，然后基于发布时间进行排序的简单信息流系统。
Top K Feed: 根据某些召回策略对 Feed 进行召回，然后基于模型排序的复杂推荐系统。这类型不是本次的重点。

02 Requirments 设计目标

用户发布 Feed
用户关注/取关其他用户
用户查看订阅频道，可以看到关注用户的 Feed，以发布时间排序
用户可以点击某个 Feed，进入其详情页
用户可以点击某个用户头像，进入其个人主页，展示他曾发布的 Feed
用户可以点赞、评论、转发 Feed，用户的 Feed 上也会展示标题/封面/点赞/评论/浏览数等信息
内容安全审核机制

03 APIs

发布 Item

1. 用户发布 item 时，写 item 到 item 服务，并调用 Relation Server 获取用户的粉丝关系列表。
1. 触发一系列异步任务
- 审核
- 视频/图文的转码与储存
- NLP 打标签
- ...
1. 异步将内容写入每个粉丝的 Inbox 列表中，并以 Timestamp 排序。

用户读取 Feed 列表

1. 当用户读取 feed 列表时，直接从 feed server 读取该用户的 Inbox 即可。
1. Inbox中存储的是 item_uuid，通过 item_uuid 从 item server 中读取 item 的详细信息 (作者/内容/关系等等)进行聚合。
1. 判断过滤规则，保证内容有效且合规。

1. Design Twitter

Interview Signals

Overview

Step 1. Clarify the requirements

Type 1: Functional Requirement

Type 2: Non-Functional Requirement

Step 2. Capacity Estimation

Assumption

Storage Estimate

Bandwidth Estimate

Step 3. System APIs

Step 4. High-level System Design

Focus on Home Timeline

Step 5. Data Storage

Redis 设计

Step 6. Scalability

Sharding

Caching

Deep Dive - Design Timeline Feed

01 Intro

02 Requirments 设计目标

03 APIs

发布 Item

用户读取 Feed 列表

点击详情页

04 优化 Feed 的发布与订阅

推拉结合 Push & Pull Combination

05 Item Server - 延迟/高可用/储存优化

06 Relation Server - 延迟/高可用/储存优化

Interview Signals​

Overview​

Step 1. Clarify the requirements​

Type 1: Functional Requirement​

Type 2: Non-Functional Requirement​

Step 2. Capacity Estimation​

Assumption​

Storage Estimate​

Bandwidth Estimate​

Step 3. System APIs​

Step 4. High-level System Design​

Focus on Home Timeline​

Step 5. Data Storage​

Redis 设计​

Step 6. Scalability​

Sharding​

Caching​

Deep Dive - Design Timeline Feed​

01 Intro​

02 Requirments 设计目标​

03 APIs​

发布 Item​

用户读取 Feed 列表​

点击详情页​

04 优化 Feed 的发布与订阅​

推拉结合 Push & Pull Combination​

05 Item Server - 延迟/高可用/储存 优化​

06 Relation Server - 延迟/高可用/储存 优化​

Interview Signals

Overview

Step 1. Clarify the requirements

Type 1: Functional Requirement

Type 2: Non-Functional Requirement

Step 2. Capacity Estimation

Assumption

Storage Estimate

Bandwidth Estimate

Step 3. System APIs

Step 4. High-level System Design

Focus on Home Timeline

Step 5. Data Storage

Redis 设计

Step 6. Scalability

Sharding

Caching

Deep Dive - Design Timeline Feed

01 Intro

02 Requirments 设计目标

03 APIs

发布 Item

用户读取 Feed 列表

点击详情页

04 优化 Feed 的发布与订阅

推拉结合 Push & Pull Combination

05 Item Server - 延迟/高可用/储存优化

06 Relation Server - 延迟/高可用/储存优化