3. Design Web Crawler
Source: 花花酱
Interview Signals
Four aspects evaluated in a system design interview:
- Working solution: can you propose a design that works, and are you familiar with common scenarios and design patterns?
- Analysis and communication: keep communicating with the interviewer; analyze the storage and bandwidth requirements.
- Tradeoffs (pros/cons): can you propose multiple solutions, evaluate their pros and cons, and make tradeoffs based on the requirements?
- Knowledge base: depth and breadth of knowledge.
Overview
- Step 1: Clarify the requirements
- Step 2: Capacity Estimation
- Step 3: High-level Design
- Step 4: Detailed Component Design
- Step 5: Fault Tolerance & Scalability
- Step 6: Extended topics
Step 1: Clarify the requirements
Clarify goals of the system
- Requirements
- Traffic size (e.g., Daily Active Users)
Discuss the functionalities and align with the interviewer on 2-3 components to focus on during the interview.
Type 1: Functional Requirements
Crawl the entire web
- Assumption 1: crawl HTML content only
  - The design should be extensible for downloading image/video content later
- Assumption 2: support the HTTP protocol only
  - The design should be extensible to support FTP and other protocols later
Type 2: Non-Functional Requirements
- Scalability
  - Can crawl the entire web and fetch millions of documents
- Extensibility
  - Easy to support new requirements, e.g., new document types (modular design so that audio/video can be supported in the future)
Step 2: Capacity Estimation
- Assumptions
  - 30 billion web pages (URLs)
  - Crawl within 30 days
  - Each page is 100KB (HTML only), plus 0.5KB of metadata per page
- Storage estimation
  - 30B * (100KB + 0.5KB) = 3.02PB
- Bandwidth
  - Crawl speed
    - 30B / (30 days * 86400 s/day) = 11574 pages/second
  - Download bandwidth
    - 11574 pages/second * (100KB + 0.5KB)/page = 1.16GB/s (i.e., 9.3Gbps)
  - GB/s = gigabyte/second, Gbps = gigabit/second; to convert gigabytes/second to gigabits/second, multiply by 8
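The arithmetic above can be sanity-checked with a few lines of Python (decimal units, matching the estimates):

```python
# Back-of-the-envelope check of the Step 2 estimates (decimal units).
pages = 30e9                 # 30 billion URLs
days = 30
page_kb = 100 + 0.5          # HTML + metadata per page, in KB

storage_pb = pages * page_kb / 1e12          # KB -> PB
crawl_rate = pages / (days * 86400)          # pages per second
bandwidth_gb_s = crawl_rate * page_kb / 1e6  # KB/s -> GB/s
bandwidth_gbps = bandwidth_gb_s * 8          # gigabytes/s -> gigabits/s

print(f"storage  = {storage_pb:.2f} PB")       # ~3.02 PB
print(f"rate     = {crawl_rate:.0f} pages/s")  # ~11574 pages/s
print(f"download = {bandwidth_gb_s:.2f} GB/s ({bandwidth_gbps:.1f} Gbps)")
```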
Step 3: High-level Design
Given a set of seed URLs as starting points:
- Pick a URL and fetch the web page
- Process the downloaded page, e.g., store and index the contents
- Parse the page to extract URLs, and add the unvisited ones to the URL Frontier (the queue of URLs yet to be crawled)
- Go back to step 1
Flow through the components: the seed URLs are added to the URL Frontier. The Fetcher pulls a URL from the URL Frontier and downloads its content from the Internet. The downloaded content goes to the Parser, which processes it and feeds unvisited URLs back into the URL Frontier; document contents are stored in the Content Store and URLs in the URL Store. Document Deduplication filters out duplicate content. The URL Filter filters and dedups URLs, dropping those that must not be crawled. The DNS Resolver translates each URL to the corresponding web server's IP address before it is fetched.
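A minimal, single-threaded sketch of this loop using only the Python standard library; the `frontier`, `visited`, and `content_store` names stand in for the URL Frontier, URL Store, and Content Store boxes:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (a stand-in for the Parser box)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)   # URL Frontier: URLs waiting to be crawled
    visited = set()               # URL Store / visited-URL dedup
    content_store = {}            # Content Store

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()                  # 1. pick a URL
        if url in visited:
            continue
        visited.add(url)
        try:                                      # 2. fetch the web page
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except (OSError, ValueError):
            continue
        content_store[url] = html                 # 3. store/index the contents
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:              # 4. extract URLs, enqueue unvisited ones
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)
    return content_store

# crawl(["https://example.com"])
```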
Document Deduplication
- Duplication
  - The same web content may be available under different URLs
- How to dedup
  - Document fingerprint: compute a 64-bit checksum (e.g., derived from an MD5 or SHA hash) for every processed document and store it in a database; when a new page arrives, compute its fingerprint and check whether the content has been seen before (see the sketch below)
  - Other approach: SimHash, used by Google, is more robust to minor changes: if a document changes only slightly, the algorithm can still recognize the two versions as the same document
- Storage
  - 30B * 8 Bytes = 240 GB (can use cache + disk storage); 64-bit = 8 Bytes
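A minimal sketch of checksum-based dedup; taking the first 8 bytes of a SHA-256 digest as the 64-bit fingerprint is an assumption, not the source's exact choice:

```python
import hashlib

seen_fingerprints = set()   # in production: cache + disk-backed store (~240 GB)

def fingerprint(document: str) -> int:
    """64-bit document fingerprint: first 8 bytes of a SHA-256 digest."""
    digest = hashlib.sha256(document.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def is_duplicate(document: str) -> bool:
    """Return True if this content has been seen before, else record it."""
    fp = fingerprint(document)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```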
URL Filters
- URL Filters
  - Disallow list, robots exclusion, deduplication: filter out URLs that may not be crawled and remove duplicates
  - Crawling consumes the target server's resources, so some sites do not want to be crawled
- Robots exclusion
  - Fetch the robots.txt file from a website to test whether a given URL can be crawled; the file states whether the site is entirely or only partially crawlable, and the crawler must respect these rules
  - A cache can hold a local copy of the exclusion rules so the crawler does not download the robots exclusion file from the same web server again and again
- URL deduplication
  - URL normalization/standardization, e.g., letter case, relative vs. absolute paths
  - URL fingerprint (dedup via a fingerprint, as with documents)
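A sketch of both filter steps using only the standard library's `urllib`; the per-host cache implements the local copy of the exclusion rules mentioned above, and the function names are assumptions:

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

_robots_cache: dict[str, RobotFileParser] = {}   # host -> cached exclusion rules

def is_allowed(url: str, user_agent: str = "MyCrawler") -> bool:
    """Robots exclusion: fetch robots.txt once per host, cache it, then test the URL."""
    parts = urlsplit(url)
    host = f"{parts.scheme}://{parts.netloc}"
    parser = _robots_cache.get(host)
    if parser is None:
        parser = RobotFileParser(host + "/robots.txt")
        try:
            parser.read()
        except OSError:
            pass   # robots.txt unreachable: can_fetch() then answers conservatively (False)
        _robots_cache[host] = parser
    return parser.can_fetch(user_agent, url)

def normalize(url: str) -> str:
    """URL normalization: lowercase scheme and host, drop the fragment, default the path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))
```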
DNS Resolver
- DNS resolution
  - Translating a URL's host name to a unique IP address
  - Before contacting a web server, we need to use DNS to get its IP address
- Challenge
  - DNS can become a bottleneck given the large number of URLs we need to crawl
- Solution
  - DNS caching: cache DNS results by building local DNS servers, so the crawler does not issue an external DNS request for every URL
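A toy illustration of DNS caching at the process level (a real deployment would run dedicated local DNS servers with TTL handling):

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=100_000)          # local DNS cache; real deployments also need TTL expiry
def resolve(hostname: str) -> str:
    """Translate a host name to an IP address, caching the result."""
    return socket.gethostbyname(hostname)

# resolve("wikipedia.org")  # -> an IP address string (the actual value will vary)
```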
Step 4: Detailed Component Design
- Crawling policies
- Politeness policy
- Distributed crawler design
Crawling Policies
Based on https://en.wikipedia.org/wiki/Web_crawler:
- Selection policy: which pages to download
- Re-visit policy: when to check for changes to the pages (web content keeps changing, so pages must be re-crawled)
- Politeness policy: how to avoid overloading web sites
- Parallelization policy: how to coordinate distributed web crawlers
Politeness Policy
- Crawlers can retrieve data much quicker and in greater depth than human visitors, so they can have a crippling impact on the performance of a site.
- If a single crawler performs multiple requests per second and/or downloads large files, a server can have a hard time keeping up with requests from multiple crawlers.
- Sending too many requests overloads the target server, which may ban your IP address and deny further access.
- Avoid sending parallel requests to the same server, and leave a gap between consecutive requests, e.g., 1 to n seconds.
- A distributed crawler can overload a website if not implemented carefully.
URL Frontier Design
- Implements the selection policy
  - By deciding the priorities of the URLs to be downloaded
- Implements the politeness policy
  - Should have only one connection to a given web server at any time
  - Should leave a time gap between subsequent visits to the same web server
- Functions
  - Insert new URLs into the queue
  - Provide URLs for workers to download
- Prioritizer: the entry point; unvisited URLs enter the queues through it. It assigns each URL a priority based on its importance, how long it has gone unvisited, and similar factors; for example, priority 1 is inserted at the head of a queue and priority n at the tail.
- Politeness selector: the exit point; it hands URLs to downstream worker threads that download the pages. It controls reads from the queues through a heap that stores, for each queue, the earliest time that queue may be accessed again. Each extraction takes the top of the min-heap, and every time a queue is read its next allowed access time is updated, which throttles the request rate.
- Front queues: implement prioritization, deciding which URLs are downloaded earlier and which later.
- Back queues: implement the politeness policy by controlling how often each site is visited. Each back queue (B) holds pages from a single domain only, e.g., B1 = Tencent, B2 = Baidu, B3 = Sina, and the selector controls the access frequency for each site.
- Mapping table: stores the back-queue ID for each domain; the politeness router uses it to route each URL to its corresponding back queue.
Mapping Table
- Maintains the mapping between domain name <-> back queue ID
- When a back queue becomes empty, it is refilled from the front queues and the map is updated accordingly

| Domain Name   | Back Queue # |
|---------------|--------------|
| wikipedia.org | 12           |
| stanford.edu  | 19           |
| ...           | ...          |
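A minimal sketch of the back-queue side of the frontier described above: one queue per domain (the in-memory dict plays the role of the mapping table) plus a min-heap of next-allowed-access times. Class and variable names, and the 1-second default delay, are assumptions; front-queue prioritization is omitted.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlsplit

class BackQueues:
    """Politeness side of the URL frontier: one back queue per domain, plus a
    min-heap of (next_allowed_time, domain) controlling when each queue may be
    read again. A sketch only; front-queue prioritization is not shown."""

    def __init__(self, delay_seconds: float = 1.0):
        self.delay = delay_seconds
        self.queues = {}          # domain -> deque of URLs (one domain per back queue)
        self.heap = []            # (earliest next access time, domain)

    def add(self, url: str) -> None:
        domain = urlsplit(url).netloc          # "mapping table": domain -> its back queue
        if domain not in self.queues:
            self.queues[domain] = deque()
            heapq.heappush(self.heap, (time.monotonic(), domain))
        self.queues[domain].append(url)

    def next_url(self) -> str | None:
        """Pop a URL whose domain may be contacted now, sleeping briefly if needed."""
        while self.heap:
            ready_at, domain = heapq.heappop(self.heap)
            if not self.queues[domain]:
                del self.queues[domain]        # empty back queue: drop (or refill from front queues)
                continue
            wait = ready_at - time.monotonic()
            if wait > 0:
                time.sleep(wait)               # respect the time gap for this domain
            url = self.queues[domain].popleft()
            # push the domain back with an updated next-allowed-time to throttle it
            heapq.heappush(self.heap, (time.monotonic() + self.delay, domain))
            return url
        return None
```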
Step 5: Fault Tolerance & Scalability
- Scalability
  - Database sharding
  - Consistent hashing (see the sketch below)
- Fault tolerance
  - Server replacement
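To illustrate how consistent hashing can spread URLs or records across crawler/storage nodes, here is a minimal ring sketch with virtual nodes (class and method names are assumptions, not from the source):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring for sharding URLs/records across servers.
    Virtual nodes smooth the distribution; adding or removing a server only
    remaps the keys that fall on its portion of the ring."""

    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self.ring = []            # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key: str) -> str:
        """Return the node owning this key (first ring position clockwise of its hash)."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

# ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
# ring.get_node("https://en.wikipedia.org/wiki/Web_crawler")  # -> owning server
```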
Design Solution from Scott
1. Requirements
- Empirical-results driven.
- From the database perspective, data size is the limitation.
  - gurugamer game replays: 20k, daily replays: 1.7M; these can be used to estimate the total data size and forecast the foreseeable future.
- From the whole-system perspective, server-side throttling is the main limitation.
  - Use different accounts
  - Use different IPs
  - Multi-threading
  - Be a good citizen, don't bring down the server
  - The same node crawls several sites in parallel
2. Strategy
The basic algorithm executed by any web crawler takes a list of whitelisted domains and seed URLs as its input and repeatedly executes the following steps (an async sketch follows this list):
- Pick a URL from the unvisited URL list.
- Asynchronously crawl the URL.
- Parse the document contents, store the data, and index the contents.
- Look for new URLs within the whitelisted domains and add them to the list of unvisited URLs.
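A sketch of the async crawl step under the assumption that the third-party aiohttp library is used; the function names and the concurrency limit are illustrative only:

```python
import asyncio

import aiohttp   # third-party dependency: pip install aiohttp

async def crawl_one(session: aiohttp.ClientSession, url: str) -> str | None:
    """Fetch a single URL asynchronously; parsing/storage hangs off the result."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return await resp.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None   # leave the record queued; the Sweeper/Compensator retries it

async def crawl_batch(urls: list[str], concurrency: int = 20) -> dict[str, str]:
    """Crawl a batch of unvisited URLs concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(concurrency)
    results: dict[str, str] = {}

    async def worker(url: str) -> None:
        async with sem:
            body = await crawl_one(session, url)
            if body is not None:
                results[url] = body

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(worker(u) for u in urls))
    return results

# asyncio.run(crawl_batch(["https://example.com"]))
```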
3. High Level Design
4. Service
- Scheduler:
  - Periodically or on demand, trigger a new crawl.
  - Insert seed URLs.
- Crawler:
  - Stateless crawler; async, scalable workers.
  - Deep copy vs. shallow copy
- Parser:
  - Async stateless worker: parse and validate the raw data, then write the validated data to a Clean Store.
- Notification:
  - Notify the user of current progress, stuck processes, unexpected schemas, etc.
- Sweeper/Compensator:
  - Monitor progress and re-trigger a crawl or parse if there is no response.
  - Generate a notification if an unknown schema or exception is met.
5. Database & Data Model
Raw Data Table (S3)
| Id   | data        | Created time |
|------|-------------|--------------|
| uuid | HTML / JSON | 21435343     |
Clean Data
| Hash(URL, iteration) (shard key) | type | URL       | iteration | headline | main_text   |
|----------------------------------|------|-----------|-----------|----------|-------------|
| string                           | News | cnn.com/1 | 1         | Uber     | What’s new? |
Metadata - Scheduling Table
| Id   | Cron      | user  | Creation | Name    | URLs         | state             |
|------|-----------|-------|----------|---------|--------------|-------------------|
| uuid | 0 8 * * * | scott | 214145   | replays | Link to YAML | {ON, PAUSED, OFF} |
Metadata - Iteration Progress
Primary key: schedule_id + iteration
| schedule | iteration | start  | Urls found | Stuck | crawled | processed |
|----------|-----------|--------|------------|-------|---------|-----------|
| uuid     | 9         | 214145 | 10000      | 10    | 9000    | 8600      |
Metadata - Record Job
- Shard key: Id = Hash(URL, iteration)
- STATE: {Queued, Crawled, Processed, Stuck}

| id  | URL       | iteration | state  | create_time | last_update | notified | Deleted |
|-----|-----------|-----------|--------|-------------|-------------|----------|---------|
| str | games.net | 9         | QUEUED | 123431      | 10          | No       |         |
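One possible in-code representation of a Record Job row and its state machine; field names mirror the table above, while the SHA-256-based shard key is an assumption:

```python
import hashlib
import time
from dataclasses import dataclass, field
from enum import Enum

class JobState(Enum):
    QUEUED = "Queued"
    CRAWLED = "Crawled"
    PROCESSED = "Processed"
    STUCK = "Stuck"

def record_id(url: str, iteration: int) -> str:
    """Shard key: Hash(URL, iteration) -- the hash function here is an assumption."""
    return hashlib.sha256(f"{url}|{iteration}".encode()).hexdigest()

@dataclass
class RecordJob:
    url: str
    iteration: int
    state: JobState = JobState.QUEUED
    create_time: float = field(default_factory=time.time)
    last_update: float = field(default_factory=time.time)
    notified: bool = False
    deleted: bool = False

    @property
    def id(self) -> str:
        return record_id(self.url, self.iteration)

# job = RecordJob(url="games.net", iteration=9)   # starts in JobState.QUEUED
```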
6. Fault Tolerance
Scheduler (Apache Airflow)
- Read YAML: what to crawl
- Insert seeds into the Record Job metadata table; no-op if they already exist
- Send and forget to the crawler
- Insert a progress row into the Iteration Progress table; no-op if it already exists
- On a fault, go back to the Queued state.
Heavy-load processing is usually stateless, send-and-forget: stateless, asynchronous, non-blocking, and requiring no response.
Crawler: Stateless
- Read the URL to be crawled; if it has already been crawled, no-op
- Crawl the URL and store the data in the Raw Data table
- Update the record to the Crawled state
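A sketch of this idempotent crawl step, reusing `JobState` from the data-model sketch above; `fetch`, `raw_data_store`, and `record_store` are hypothetical injected dependencies:

```python
import time
# JobState / RecordJob as defined in the data-model sketch above

def crawl_worker(record, fetch, raw_data_store, record_store):
    """Stateless, idempotent crawl step; safe to re-run if the Sweeper re-queues it."""
    if record.state != JobState.QUEUED:       # already crawled or processed: no-op
        return
    body = fetch(record.url)                  # download the page
    raw_data_store.put(record.id, body)       # write raw HTML/JSON to the Raw Data table (S3)
    record.state = JobState.CRAWLED           # advance the record's state machine
    record.last_update = time.time()
    record_store.save(record)                 # persist the new state
```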
Parser: Stateless
- Read the record to be parsed; if it is already Processed or Stuck, no-op
- Parse and validate the record:
  - If invalid, update the record to STUCK and return
- Store the validated record in Clean Data
- Update the record state to PROCESSED
Sweeper: Periodically wakes up
- Read records in the STUCK state that have not yet been notified:
  - Notify developers to take action.
  - Mark the record as notified.
- Read records that have been in the Queued state for a long period of time:
  - Re-queue them into the Crawler.
- Read records that have been in the Crawled state for a long period of time:
  - Re-queue them into the Parser.
- Update the iteration progress records:
  - Send a summary notification to users.
  - May need to consolidate across shards.
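A sketch of one sweeper pass, again reusing `JobState`; the store, queue, and notifier objects and the 30-minute timeout are assumptions:

```python
import time
# JobState / RecordJob as defined in the data-model sketch above

STUCK_TIMEOUT = 30 * 60   # assumption: 30 minutes before a record counts as stalled

def sweep(record_store, crawl_queue, parse_queue, notifier):
    """One periodic sweep over the Record Job table (hypothetical store/queue APIs)."""
    now = time.time()
    for record in record_store.scan():
        stalled = (now - record.last_update) > STUCK_TIMEOUT
        if record.state == JobState.STUCK and not record.notified:
            notifier.alert(record)                            # developers take action
            record.notified = True
            record_store.save(record)                         # mark as notified
        elif record.state == JobState.QUEUED and stalled:
            crawl_queue.enqueue(record)                       # re-queue into the Crawler
        elif record.state == JobState.CRAWLED and stalled:
            parse_queue.enqueue(record)                       # re-queue into the Parser
```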