- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
• METADATA tablets define logical Tables • Table (logical grouping)
– Tablet (S)
• Tablet Log (1)
– Written on GFS
• Tablet Servers
• memtable (1)
Building Blocks
• Chubby
– Distributed locking service – Uses Paxos algorithm for consensus
• GFS
– Runs on same box as Bigtable – Underlying file system
• Can use regex for row and column matching
Tablet Serving
• Master pings for liveness of tablet server
– Failure: tablet reports that it has lost its lock or fails to reach the server
3
C源自文库mpactions
– separate SSTable for each locality group – segregation of column families which are not usually accessed together more efficient reads – some SSTables can be declared to be in-memory (loaded lazily) – compress each SSTable block separately for a locality group (can read portions of SSTable wRo decompressing entire thing) – two pass compression scheme (1st pass Bentley and McIlroy’s scheme; 2nd pass fast compression algorithm) – emphasize speed over space reduction
– Block Cache
• High level cache to store key-value pairs returned by SSTable to tablet server • Useful when reading same data over and over again • Low level cache to store blocks read from GFS • Useful when reading data close to data recently read
– (row:string, column:string, time:int64) string
The Data Model
1
API
• Enables read, write, delete of tables, column families, rows, column family metadata (access control)
• Caching
– Improve read performance using 2-level cache – Scan Cache
Performance Improvements
• Commit-log
– single commit log per tablet server – append mutations to single file; co-mingling mutations for different tablets – complicates recovery
• Location of root tablet is in master lock file.
Tablet Assignment
• Master keeps track of the set of live tablet servers and tablet assignments
– When tablet server starts it acquires a lock on a unique file in a specific directory.
• High performance • High availability • Wide applicability
– 6; Google products using Bigtable (Analytics, Finance, Earth, OrkutK)
• Monitor temporal changes
2
Tablet Location
Bootstrapping
• Initiated by cluster management
– Master grabs master lock – Scans directory for live servers – Checks every tablet server for tablet assignment – Scans METADATA table to learn the set of tablets for table
Motivation
• High scalability
Bigtable: A Distributed Storage System for Structured Data
April 21, 2;;<
– Scale to petabytes of data – Thousands of machines
• Minor compaction:
– Memtable -c SSTable when too large
• Compression
• Major compaction:
– (S)SSTables e (1)memtable -c(1) SSTable
Performance Improvements
• Bloom filters
– Bloom filters for a particular locality group in SSTable – Reduce need to read from disk if SSTable not in memory
• tablets are offloaded to other tablet servers in case of failure rebuild tablets by reading and applying mutations from commit log • sort commit log • partition log into chunks to allow parallelism • two log writing threads per tablet server to prevent hiccups due to GFS latency
• Write operations increase the size of memtable and commit log
– Longer log, longer recovery
Performance Improvements
• Locality group (grouping multiple column families)
– A _tablet` is a row range (set of ordered rows)
Implementation
• Client library • Master Server
– – – – – Only 1 master everyK guaranteed by Chubby Assigning tablets to tablet servers Detecting additionRexpiration of tablet servers Balancing load of tablet servers Garbage collection of GFS files
• memtable row is copy-on-write • reads and writes occur in parallel
Real Applications
• Google Analytics • Google Earth • Personalized Search
• Exploiting Immutability
Ex: anchor:S.cnn.com
• Can execute client supplied scripts on server using Sawzall API
No write support back into Bigtable
Table Component Hierarchy
• Everything is composed of _Tablets`
RR Open the table Table ST = OpenOrDie("RbigtableRwebRwebtable"); RR Write a new anchor and delete an old anchor RowMutation r1(T, "com.cnn.www"); r1.Set("anchor:www.c-span.org", "CNN"); r1.Delete("anchor:www.abc.com"); Operation op; Apply([op, [r1);
Lessons
• Expect failures
– – – – – – – Fail-stop failures Memory and network corruption Clock skews Hung machines Extended and asymmetric network partitions Failures in underlying components (Chubby) Overflow of GFS quotas
– In memory
• SSTable (S)
– Written on GFS, immutatable.
– 1; to 1;;; tablets in 1 tablet server – Handles readRwrite request from client application – Splits tablets when tablets grow too large
4
Performance Improvements
• Tablet Recovery
– perform compaction on tablet before offloading to another tablet server – 2 minor compactions to remove need for recovery tablet server to deal with recovery log entries – No synchronization needed to read from SSTable – Only memtable is mutable – Use mark-and-sweep garbage collection for SSTables in METADATA table – Split tablets quickly by letting child tables share SSTable of parent tablet
Overview
• Similar to database, but no relational data model essentially a key value store • Sparse, distributed, persistent, multidimensional sorted map