Google Cloud Computing Technology: Bigtable (Overseas Courseware)


GOOGLE bigtable

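Abstract

Bigtable is designed for distributed storage of large-scale structured data. By design it can scale to petabytes (2^50 bytes) of data spread across thousands of commodity servers. Many Google projects store their data in Bigtable (BT), including web indexing, Google Earth, and Google Finance. These applications place very different demands on BT, both in data size (from URLs to web pages to satellite imagery) and in latency (from backend bulk processing to real-time data serving). Despite these varied demands, BT has provided a flexible, high-performance service for all of them. In this paper we describe BT's data model, which gives clients dynamic control over data layout and format, and we describe BT's design and implementation.

1. Introduction

Over the past two and a half years we have designed, implemented, and deployed BT, a system for distributed storage and management of structured data. BT is designed to manage petabytes (2^50 bytes) of data and to be deployed across thousands of machines. BT has achieved several goals: wide applicability, scalability, high performance, and high availability. More than sixty projects use BT, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These applications place very different demands on BT: some need high-throughput batch processing, while others need to serve data to users with low latency. The BT clusters they use also vary widely, from a handful of machines to thousands, storing data on the terabyte (2^40 bytes) scale.

In many ways BT resembles a database: it shares many implementation strategies with databases. Parallel databases [14] and main-memory databases [13] have achieved scalability and high performance, but BT provides a different interface. BT does not support a full relational data model. Instead it gives clients a simple data model that lets them dynamically control data layout and format {that is, BT stores only strings, and the client interprets the format}, and lets clients reason about the locality of the data in the underlying storage {to speed up access}. Data is indexed by row and column names, and the data itself can be any string. BT treats data as uninterpreted strings {no types, etc.}. Clients often serialize various forms of structured or semi-structured data {for example, date strings} into these strings. By choosing their data representation carefully, clients can control data locality. Finally, BT schema parameters let clients control whether data is served from memory or from disk. {In other words, with the schema you can place data close to the application. After all, a program only touches one block of data at a time. In architecture terms: locality, locality, locality.}

Section 2 describes the data model in detail. Section 3 gives an overview of the client API. Section 4 briefly describes the Google infrastructure on which BT depends. Section 5 describes the key parts of BT's implementation. Section 6 describes refinements made to improve BT's performance. Section 7 presents performance measurements. Section 8 gives several examples of how BT is used, and Section 9 discusses lessons learned. Section 10 lists related work, and we end with our conclusions.

2. Data Model

BT is a sparse, persistent {stored on disk}, multidimensional, sorted map. The map is indexed by a row key, a column key, and a timestamp. Each value is an uninterpreted array of bytes. {All data are strings with no types; clients who want to interpret them are on their own.}

(row:string, column:string, time:int64) -> string {anyone who can program can read this, so it is left untranslated}

// Section 2 as translated by 彼岸: We settled on this data model after carefully examining a variety of potential uses of a Bigtable-like system.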
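
The sorted map described above can be illustrated with a small in-memory sketch. This is not Bigtable's implementation; `SparseTable`, `CellKeyLess`, and `ScanRows` are hypothetical names, and the choice to order timestamps newest-first within a cell is an assumption taken from the Bigtable paper. The point is only that keeping keys sorted by row makes a scan over a row range touch one contiguous region of the map.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// (row, column, timestamp) -> value, the shape given in the text.
using CellKey = std::tuple<std::string, std::string, std::int64_t>;

// Orders keys by row ascending, column ascending, timestamp descending.
struct CellKeyLess {
  bool operator()(const CellKey& a, const CellKey& b) const {
    if (std::get<0>(a) != std::get<0>(b)) return std::get<0>(a) < std::get<0>(b);
    if (std::get<1>(a) != std::get<1>(b)) return std::get<1>(a) < std::get<1>(b);
    return std::get<2>(a) > std::get<2>(b);  // newer versions sort first
  }
};

using SparseTable = std::map<CellKey, std::string, CellKeyLess>;

// Returns every value stored in rows within [start_row, end_row).
std::vector<std::string> ScanRows(const SparseTable& table,
                                  const std::string& start_row,
                                  const std::string& end_row) {
  std::vector<std::string> values;
  // Smallest possible key for start_row: empty column, largest timestamp.
  CellKey first(start_row, "", std::numeric_limits<std::int64_t>::max());
  for (auto it = table.lower_bound(first);
       it != table.end() && std::get<0>(it->first) < end_row; ++it) {
    values.push_back(it->second);
  }
  return values;
}
```

Because only cells that were actually written occupy an entry, the map is sparse: a row with one column costs one entry regardless of how many columns other rows use.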

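Google BigTable Database

Bigtable has three major components: a library linked into every client, one master server, and many tablet servers.

Tablet servers can be dynamically added to (or removed from) a cluster to accommodate changes in workload.

The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage-collecting files in GFS.

In addition, it handles schema changes such as creating tables and column families.

Each tablet server manages a set of tablets (typically from tens up to about a thousand tablets per server).

Each tablet server handles read and write requests to the tablets it has loaded, and splits tablets that have grown too large.

As with many single-master distributed storage systems [17, 21], client data does not move through the master: clients communicate directly with tablet servers for reads and writes.

Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master at all.

As a result, the master is lightly loaded in practice.

A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all the data associated with a row range.

Initially, each table consists of just one tablet.

As a table grows, it is automatically split into multiple tablets, each approximately 100 MB to 200 MB in size by default.

We use a three-level hierarchy analogous to a B+ tree [10] to store tablet location information (see Figure 4).

The first level is a file stored in Chubby that contains the location of the root tablet.

The root tablet contains the location of all tablets in a special METADATA table.

Each METADATA tablet contains the location of a set of user tablets.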
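
One level of this lookup can be sketched as an ordered index. The sketch assumes, as the Bigtable paper describes, that each location entry is keyed by the last row of the tablet it describes; `TabletIndex`, `AddTablet`, and `Locate` are hypothetical names, and the server addresses are plain strings for illustration only.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for one table's slice of location metadata:
// maps each tablet's end row to the address of the server that holds it.
class TabletIndex {
 public:
  // Registers a tablet that covers every row up to and including end_row.
  void AddTablet(const std::string& end_row, const std::string& server) {
    by_end_row_[end_row] = server;
  }

  // Returns the server whose tablet contains row, or "" if no tablet does.
  std::string Locate(const std::string& row) const {
    // lower_bound finds the first tablet whose end row is >= row.
    auto it = by_end_row_.lower_bound(row);
    return it == by_end_row_.end() ? std::string() : it->second;
  }

 private:
  std::map<std::string, std::string> by_end_row_;
};
```

A client would repeat the same kind of lookup at each level: the Chubby file names the root tablet, the root tablet names a METADATA tablet, and the METADATA tablet names the user tablet.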

Google Cloud Computing Course Series, Lecture 1: Introduction (PPT Courseware)

What is the key attribute that all these examples have in common?
Parallel vs. Distributed
Parallel computing can mean:
Vector processing of data Multiple CPUs in a single computer
Computer Speedup
Moore's Law: “The density of transistors on a chip doubles every 18 months, for the same cost” (1965)
Image: Tom’s Hardware and not subject to the Creative Commons license applicable to the rest of this work.
A Brief History… 1985-95
“Massively parallel architectures” start rising in prominence
Distributed computing is multiple CPUs across many computers over the network
A Brief History… 1975-85
Parallel computing was favored in the early years
Course Overview
5 lectures
1 Introduction
2 Technical Side: MapReduce & GFS
2 Theoretical: Algorithms for distributed

BigTable Sharing Slides

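Improvement strategies
Q&A

The root tablet never splits, so the location hierarchy stays at three levels instead of growing to four, five, six, or more. The METADATA table also stores key/value pairs:
◦ The key is the tablet's table identifier together with its end row.
◦ The value is the tablet's location.
◦ Each METADATA row holds roughly 1 KB of data in memory.

With METADATA tablets limited to 128 MB, the three-level scheme is sufficient to address 2^34 tablets. The client library caches tablet locations; the slide author notes that the details of this caching were unclear to them.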
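
The 2^34 figure follows from the two numbers on the slide. A short worked calculation, with hypothetical constant and function names, assuming exactly 1 KB per METADATA row and exactly 128 MB per METADATA tablet:

```cpp
#include <cassert>
#include <cstdint>

// Figures taken from the slide; the names are illustrative only.
constexpr std::uint64_t kMetadataTabletBytes = 128ULL << 20;  // 128 MB per METADATA tablet
constexpr std::uint64_t kMetadataRowBytes = 1ULL << 10;       // about 1 KB per METADATA row

// Number of tablet locations that fit in one METADATA tablet: 2^27 / 2^10 = 2^17.
constexpr std::uint64_t RowsPerMetadataTablet() {
  return kMetadataTabletBytes / kMetadataRowBytes;
}

// The root tablet points at 2^17 METADATA tablets, and each of those
// points at 2^17 user tablets, giving 2^17 * 2^17 = 2^34.
constexpr std::uint64_t AddressableUserTablets() {
  return RowsPerMetadataTablet() * RowsPerMetadataTablet();
}
```

If each user tablet were itself 128 MB, 2^34 tablets would cover 2^34 * 2^27 = 2^61 bytes.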

Abstract
Introduction
Data model
The BigTable data model, with an example: Row, Column Family, Timestamps

Client API (read example)

Scanner scanner(T);
ScanStream *stream;
// Select the column family to read.
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
// Select the row.
scanner.Lookup("n.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}

Google Cloud Computing Technology: MapReduce (Overseas Courseware)

◦ Master pings workers, re-schedules failed tasks.
◦ Note: Completed map tasks are re-executed on failure because their output is stored on the local disk.
◦ Master failure: redo
◦ Semantics in the presence of failures:
result += ParseInt(v); Emit(AsString(result));
More Examples
Distributed grep:
◦ Map: (key, whole doc/a line) → (the matched line, key)
◦ Reduce: identity function
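A minimal sketch of the distributed grep pair above, written as plain functions rather than against any real MapReduce library. `GrepMap`, `IdentityReduce`, and `KeyValue` are hypothetical names, and substring search stands in for whatever pattern matching a real job would use.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, std::string>;

// Map: emits (matched line, key) when the line contains the pattern.
std::vector<KeyValue> GrepMap(const std::string& key, const std::string& line,
                              const std::string& pattern) {
  std::vector<KeyValue> out;
  if (line.find(pattern) != std::string::npos) {
    out.push_back(KeyValue(line, key));
  }
  return out;
}

// Reduce: the identity function, which passes every pair through unchanged.
std::vector<KeyValue> IdentityReduce(const std::string& key,
                                     const std::vector<std::string>& values) {
  std::vector<KeyValue> out;
  for (const std::string& value : values) {
    out.push_back(KeyValue(key, value));
  }
  return out;
}
```

All the filtering happens in the map step, which is why the reduce step has nothing left to do.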
outputs to achieve this property.
MapReduce: Fault Tolerance
Handled via re-execution of tasks.
Task completion committed through master
What happens if Mapper fails?
◦ Re-execute completed + in-progress map tasks
What happens if Reducer fails?
◦ Re-execute in-progress reduce tasks
What happens if Master fails?
◦ Potential trouble!
Thousands of machines read input at local disk speed

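Google Cloud Computing Principles 1 (Selected PPT Courseware)

How is logistics and delivery carried out? ◦ Orders are the key!
Chen Ping, chairman of 星辰急便
Jack Ma (马云)
Google Cloud Computing Principles
Background of Google cloud computing
The war between Google and Microsoft
Source of the conflict
The increasingly fierce rivalry between Google and Microsoft will be an epic corporate war. It will have a major effect on the success and growth of both companies, and will shape how consumers and businesses work, shop, communicate, and "live their digital lives."
Google cloud computing application scenarios
Google Wave
◦ An information sharing, collaboration, and publishing platform
Google cloud computing application scenarios
Google cloud computing in the PaaS category
◦ An application execution environment deployed in the cloud
◦ Supports two languages, Python and Java
◦ Provides various Google services through an SDK, such as images, mail, and data storage
◦ Users can quickly and cheaply (a limited amount of traffic and storage is free) deploy …
Microsoft CEO Steve Ballmer
◦ Will high-speed broadband become as widespread and reliable as Google asserts?
◦ Will businesses, universities, and consumers let Google hold their data?
Google's secret weapon
The importance of application scale to system architecture design
Characteristics of Google applications
◦ Massive numbers of users plus massive amounts of data
◦ Strong scalability is required
◦ How can service be delivered both quickly and well?
Small and medium businesses, universities, and consumers will move relatively quickly to web-based "cloud computing" technology
A new profit model
◦ Low-cost cloud computing brings Google more traffic, and in turn more advertising revenue
Acknowledges that "cloud computing" will not become widespread overnight
◦ Large companies usually change their habits slowly
◦ Other issues, such as the "airplane problem" and how users work when they cannot get online.
Google CEO Eric Schmidt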

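Courseware 05: Multi-Structured Data Management, Google Bigtable

Bigtable building blocks: Chubby
Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. The Chubby client library provides consistent caching of Chubby files. Each Chubby client maintains a session with the Chubby service and holds a lease; if the client cannot renew its session lease before the lease expires, the session expires. When a session expires, the client loses all of its locks and open file handles.
API: lookup and update
Client programs can iterate over multiple column families.
Scanner scanner(T);
ScanStream *stream;
// Select the column family.
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("n.www");  // Select the row.
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
// Iterates over all versions in the anchor column family of row n.www.
Data model: timestamps
Each cell can contain multiple versions of the same data, indexed by timestamp (int64). Timestamps are assigned either 1) by Bigtable, in which case they represent "real time" in microseconds, or 2) by the client application, which must then generate unique timestamps itself.
Different versions of a cell are stored in decreasing timestamp order, so the most recent version comes first.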
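The newest-first ordering of versions can be sketched with an ordered map that uses a descending comparator. `VersionedCell` and its methods are hypothetical names, not part of the Bigtable API; timestamps here are whatever int64 values the caller supplies.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// One cell holding several timestamped versions, newest first.
class VersionedCell {
 public:
  // Stores a version; writing the same timestamp again overwrites it.
  void Put(std::int64_t timestamp, const std::string& value) {
    versions_[timestamp] = value;
  }

  // Returns the most recent value, or "" if the cell is empty.
  std::string Latest() const {
    return versions_.empty() ? std::string() : versions_.begin()->second;
  }

  // Returns the newest value whose timestamp is <= as_of, or "" if none.
  std::string ReadAsOf(std::int64_t as_of) const {
    // With a descending comparator, lower_bound lands on the first
    // timestamp that is not greater than as_of.
    auto it = versions_.lower_bound(as_of);
    return it == versions_.end() ? std::string() : it->second;
  }

 private:
  std::map<std::int64_t, std::string, std::greater<std::int64_t>> versions_;
};
```

Keeping the newest version first means the common case, reading the current value, is the first entry reached.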

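Google Cloud Computing Platform: An Analysis (PPT Courseware)

3. Google's cloud applications
Characteristics:
Built on Google's own cloud computing infrastructure
Uses Web 2.0 technology
Strong multi-user interaction capabilities
Example: Google Docs
A web-based editing tool
An editing interface similar to Microsoft Office
Easy-to-use document permission management and multi-user operation history
Suited to collaborative editing by several people, project progress tracking, and more …
2. Product overview
Distributed large-scale database management system BigTable: introduction
A database system built on a distributed platform. Because typical relational databases require strong consistency, it is hard to scale them to very large sizes. To handle the large volume of structured and semi-structured data inside Google, BigTable is a large-scale database system with weak consistency requirements.
Google File System: structure
The figure below shows the structure of a single GFS.
Google File System: architecture
The figure below shows the system architecture of the Google File System.
A GFS cluster consists of one master and multiple chunkservers, and is accessed by multiple clients. Files are divided into fixed-size chunks. When a chunk is created, the master assigns it an immutable, globally unique 64-bit chunk handle that identifies it. Chunkservers store chunks as Linux files on local disk, and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers; by default three replicas are kept.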
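Because chunks have a fixed size, a byte range in a file maps to chunk positions by simple division. The sketch below assumes a 64 MB chunk size, which is the value given in the GFS paper and is not stated on the slide; `ChunkRange` and `SplitRead` are hypothetical names. Resolving a chunk index to its 64-bit handle and replica locations would be a separate request to the master and is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumption: 64 MB chunks, as in the GFS paper.
constexpr std::uint64_t kChunkSize = 64ULL << 20;

// One piece of a read, expressed relative to a single chunk.
struct ChunkRange {
  std::uint64_t chunk_index;  // position of the chunk within the file
  std::uint64_t offset;       // starting byte inside that chunk
  std::uint64_t length;       // number of bytes to read from it
};

// Splits a read of `length` bytes at `file_offset` into per-chunk ranges.
std::vector<ChunkRange> SplitRead(std::uint64_t file_offset,
                                  std::uint64_t length) {
  std::vector<ChunkRange> ranges;
  while (length > 0) {
    std::uint64_t index = file_offset / kChunkSize;
    std::uint64_t within = file_offset % kChunkSize;
    // Never read past the end of the current chunk.
    std::uint64_t n = std::min(length, kChunkSize - within);
    ranges.push_back({index, within, n});
    file_offset += n;
    length -= n;
  }
  return ranges;
}
```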
Google File System: characteristics
File read and write patterns in the Google File System differ from those of traditional file systems.
In Google applications (such as search), …

Cloud Computing PPT Courseware

[Figure: layered virtualization diagram. Labels: business processes; user interface and interfaces; virtualized applications; virtualized information; virtualized storage; virtualized processing; virtualized infrastructure; services; security; resource management]
Virtualization:
Simple access, better end-user management, and maximized utilization
Automation:
Improved speed and predictability, and reduced labor
Cloud computing's support for the dynamic IT architecture of the future
Cloud Applications ("Software-as-a-Service")
Cloud Platforms ("Platform-as-a-Service")
Cloud Collaboration
Cloud Storage
Cloud Servers (Processing)
Cloud Systems Infrastructure Software
Cloud computing for small and medium enterprises
Cloud computing and next-generation IT applications
Cloud computing should also include on-premise software, e.g. …

Google and Cloud Computing (Selected PPT Courseware)

• Shareability
– Make sharing as easy as creating and saving
• Freedom
– Users don’t want their data held hostage
• Simplicity
– Easy-to-learn, easy-to-use
• Essentially infinite amount of disk
• Essentially infinite amount of computation
• (Assuming they can be parallelized)
Google and Cloud Computing
• The Internet: From Hardware to Community
• The Innovation: A Computing Cloud
• Breakthroughs for Cloud Computing
• Google Apps for Cloud Computing
• Google Infrastructure for Cloud Computing
• Data stored on the cloud
• Software & services on the cloud - Access via web browser
• Based on standards and protocols - Linux, AJAX, LAMP, etc.
• Accessible from any device
1 User-Centric
2 Task-Centric
3 Powerful
4 Intelligent
5 Affordable
6 Programmable

3. Companion Slides for "Cloud Computing (3rd Edition)", Part 3: Chapter 2, Google Cloud Computing Principles and Applications (II)

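2.3 Distributed lock service Chubby: system constraints
Companion slides for "Cloud Computing" (3rd edition)
P1: Each acceptor accepts only the first proposal it receives.
P2: Once a proposal has been chosen, every proposal chosen afterwards must be consistent with it.
P2a: Once a proposal with value v has been chosen, every proposal subsequently accepted by any acceptor must have value v.
P2b: Once a proposal with value v has been chosen, every proposal subsequently issued by any proposer must have value v.
P2c: If a proposal numbered n has value v, then there exists a majority such that either none of its members has accepted any proposal numbered less than n, or the highest-numbered proposal below n that they have accepted has value v.
Contents
2.1 Google File System (GFS)
2.2 Distributed data processing: MapReduce
2.3 Distributed lock service: Chubby
2.4 Distributed structured data table: Bigtable
2.5 Distributed storage system: Megastore
2.6 Monitoring infrastructure for large-scale distributed systems: Dapper
2.7 Interactive analysis tool for massive data: Dremel
2.8 In-memory big data analysis system: PowerDrill
2.9 Google App Engine
To guarantee that the decision is unique, acceptors must also satisfy a constraint: an acceptor accepts a proposal numbered n if and only if it has not received a request numbered greater than n.
2.3 Distributed lock service Chubby: a decision proceeds in two phases
1. Prepare phase
Proposers choose a proposal, assign it the number n, and send it to a majority of the acceptors.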
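The acceptor-side constraint can be sketched as a small state machine. This follows the textbook single-decree Paxos acceptor and is not Chubby's actual implementation; `Acceptor`, `Promise`, `Prepare`, and `Accept` are hypothetical names, and proposal numbers are assumed to start at 1 so that 0 can mean "nothing seen yet".

```cpp
#include <cassert>
#include <string>

// Reply to a prepare request.
struct Promise {
  bool ok;                     // true if the acceptor promises to ignore lower numbers
  long accepted_n;             // number of the last accepted proposal, 0 if none
  std::string accepted_value;  // value of the last accepted proposal
};

class Acceptor {
 public:
  // Phase 1: promise not to accept proposals numbered below n, and report
  // the most recently accepted proposal so the proposer can adopt its value.
  Promise Prepare(long n) {
    if (n <= promised_n_) return Promise{false, accepted_n_, accepted_value_};
    promised_n_ = n;
    return Promise{true, accepted_n_, accepted_value_};
  }

  // Phase 2: accept proposal n only if no request numbered above n was seen.
  bool Accept(long n, const std::string& value) {
    if (n < promised_n_) return false;
    promised_n_ = n;
    accepted_n_ = n;
    accepted_value_ = value;
    return true;
  }

 private:
  long promised_n_ = 0;  // highest request number seen so far
  long accepted_n_ = 0;  // 0 means nothing has been accepted
  std::string accepted_value_;
};
```

A proposer that receives promises from a majority must propose the value attached to the highest `accepted_n` among the replies, if any, which is what keeps later proposals consistent with an already chosen one.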
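[Figure: Chubby client/server structure. Labels: remote procedure call (RPC); client; Chubby library; application; client process; master server]
On the client side, every client application links a Chubby library, and all of the client's operations are carried out by calling the relevant functions in this library.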

Cloud Computing and Big Data (Selected PPT Courseware)

Open Compute Project (Facebook)
Datacenters Cloud Computing
“…long-held dream of computing as a utility…”
From Mid 2006
Rent virtual computers in the “Cloud” On-demand machines, spot pricing
Summary
Focus on Storage vs. FLOPS Scale out with commodity components Pay-as-you-go model
Jeff Dean @ Google
How do we program this ?
Programming Models
Hardware
Hopper vs. Datacenter

        | Hopper | Datacenter
Nodes   | 6384   | 1000s to 10000s

Amazon EC2

Machine     | Memory (GB) | Compute Units (ECU) | Local Storage (GB) | Cost / hour
t1.micro    | 0.615       | 2                   | 0                  | $0.02
m1.xlarge   | 15          | 8                   | 1680               | $0.48
cc2.8xlarge | 60.5        | 88 (Xeon 2670)      |                    | $2.40

1 ECU = CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor

Selected: Companion Slides for "Cloud Computing (3rd Edition)", Part 4: Chapter 2, Google Cloud Computing Principles and Applications (III)

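Megastore has been deployed and used at Google for several years, and more than 100 products use Megastore as their storage system.
As the figure shows, the vast majority of products see extremely high availability (>99.999%). This indicates that the design of Megastore has been very successful and has largely met its goals.
2.5 Distributed storage system Megastore
Scalability
Google's services grow at an astonishing rate, so a system must be designed to meet the needs of Google's services and clusters for at least the next several years.
2.6 Dapper, a monitoring infrastructure for large-scale distributed systems
2.6.1 Basic design goals
2.6.2 Overview of the Dapper monitoring system
2.6.3 Key technologies
2.6.4 Common Dapper tools
2.6.5 Experience using Dapper
[Table: example of Megastore data stored in Bigtable rows. Surviving cells: "Dinner, Paris …"; 101,502; 12:15:22; "Betty, Paris …"; 102; Mary]
A Bigtable column name is formed by combining the table name and the property name, so entities from different tables can be stored in the same Bigtable row.
2.5 Distributed storage system Megastore
2.5.1 Design goals and design choices
2.5.2 Megastore data model
2.5.3 Transactions and concurrency control in Megastore
2.5.4 Megastore basic architecture
2.5.5 Core technology: replication
2.5.6 Product performance and control measures
Each schema consists of a set of tables, each table contains a set of entities, and each entity contains a set of properties.
Properties are named and typed; the types include strings, numbers, and Google's Protocol Buffers.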

Google's Three Core Technologies: BigTable

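How Google's BigTable Works (translation)

Preface: Google owes its success not only to one outstanding idea after another, but also to software architecture talents like Jeff Dean. (Editor)

The official Google Reader blog has an explanation of BigTable.

It is a system developed inside Google for handling large volumes of data.

Such a system is well suited to semi-structured data such as RSS feeds.

The following are notes written by Andrew Hitchcock on October 18, 2005, based on a talk given by Google engineer Jeff Dean at the University of Washington (Creative Commons License).

First, BigTable has been in development since early 2004 and has been in use for about eight months (since around February 2005).

Currently around 100 services use BigTable, for example Print, Search History, Maps, and Orkut.

In keeping with Google's usual practice, BigTable, developed in-house, is designed to run on inexpensive PCs.

BigTable lowers Google's operating costs when launching new services and makes the most of available computing capacity.

BigTable is built on top of GFS, Scheduler, Lock Service, and MapReduce.

Each table is a multidimensional sparse map.

A table consists of rows and columns, and each cell has a timestamp.

A cell can hold multiple copies of its data from different times, which makes it possible to record how the data changes.

In his example, the rows are URLs and the columns have names, such as "contents".

The contents column stores the document data.

Or a column may be named "language" and store a language code string such as "EN".

To manage huge tables, a table is split by rows; the resulting pieces are called tablets.

Each tablet is about 100-200 MB, and each machine stores around 100 tablets.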


Performance Improvements
• Tablet Recovery
– perform compaction on tablet before offloading to another tablet server
– 2 minor compactions to remove need for recovery tablet server to deal with recovery log entries
– No synchronization needed to read from SSTable
– Only memtable is mutable
– Use mark-and-sweep garbage collection for SSTables in METADATA table
– Split tablets quickly by letting child tablets share SSTable of parent tablet
• Write operations increase the size of memtable and commit log
– Longer log, longer recovery
Performance Improvements
• Locality group (grouping multiple column families)
Motivation
• High scalability
Bigtable: A Distributed Storage System for Structured Data
April 21, 2;;<
– Scale to petabytes of data – Thousands of machines
• Bloom filters
– Bloom filters for a particular locality group in SSTable – Reduce need to read from disk if SSTable not in memory
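A minimal Bloom filter sketch shows why the disk read can often be skipped: a negative answer is always correct, so an SSTable whose filter says "no" never needs to be opened for that key. `BloomFilter`, its parameters, and the salting scheme used to derive several hash functions from `std::hash` are assumptions for illustration, not Bigtable's actual filter.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
 public:
  // bits: size of the bit array (must be > 0); hashes: hash functions per key.
  BloomFilter(std::size_t bits, int hashes) : bits_(bits, false), hashes_(hashes) {}

  // Records a key by setting one bit per hash function.
  void Add(const std::string& key) {
    for (int i = 0; i < hashes_; ++i) bits_[Index(key, i)] = true;
  }

  // Returns false only if the key was definitely never added.
  // A true result may be a false positive.
  bool MayContain(const std::string& key) const {
    for (int i = 0; i < hashes_; ++i) {
      if (!bits_[Index(key, i)]) return false;
    }
    return true;
  }

 private:
  // Derives the i-th hash by salting the key with the hash index.
  std::size_t Index(const std::string& key, int i) const {
    return std::hash<std::string>()(key + '#' + std::to_string(i)) % bits_.size();
  }

  std::vector<bool> bits_;
  int hashes_;
};
```

A tablet server would consult the filter for a row/column pair before touching the SSTable, and fall through to the real read only when `MayContain` returns true.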
• tablets are offloaded to other tablet servers in case of failure
• rebuild tablets by reading and applying mutations from commit log
• sort commit log
• partition log into chunks to allow parallelism
• two log writing threads per tablet server to prevent hiccups due to GFS latency
– A "tablet" is a row range (set of ordered rows)
Implementation
• Client library • Master Server
– Only one master at any time, guaranteed by Chubby
– Assigning tablets to tablet servers
– Detecting addition/expiration of tablet servers
– Balancing load of tablet servers
– Garbage collection of GFS files
• High performance • High availability • Wide applicability
– 60 Google products using Bigtable (Analytics, Finance, Earth, Orkut, …)
• Monitor temporal changes
– Block Cache
• High level cache to store key-value pairs returned by SSTable to tablet server
• Useful when reading same data over and over again
• Low level cache to store blocks read from GFS
• Useful when reading data close to data recently read
– (row:string, column:string, time:int64) → string
The Data Model
API
• Enables read, write, delete of tables, column families, rows, column family metadata (access control)
• Caching
– Improve read performance using 2-level cache – Scan Cache
Performance Improvements
• Commit-log
– single commit log per tablet server – append mutations to single file; co-mingling mutations for different tablets – complicates recovery
• METADATA tablets define logical Tables
• Table (logical grouping)
– Tablet (S)
• Tablet Log (1)
– Written on GFS
• Tablet Servers
• memtable (1)
• Location of root tablet is stored in a Chubby file.
Tablet Assignment
• Master keeps track of the set of live tablet servers and tablet assignments
– When tablet server starts it acquires a lock on a unique file in a specific directory.
Tablet Serving
• Master pings for liveness of tablet server
– Failure: the tablet server reports that it has lost its lock, or the master fails to reach the server
Compactions
– In memory
• SSTable (S)
– Written on GFS, immutable.
– 10 to 1000 tablets in 1 tablet server
– Handles read/write requests from client application
– Splits tablets when tablets grow too large
Overview
• Similar to database, but no relational data model essentially a key value store • Sparse, distributed, persistent, multidimensional sorted map
Building Blocks
• Chubby
– Distributed locking service – Uses Paxos algorithm for consensus
• GFS
– Runs on same box as Bigtable – Underlying file system
• Can use regex for row and column matching
• memtable row is copy-on-write • reads and writes occur in parallel
Real Applications
• Google Analytics • Google Earth • Personalized Search
• Exploiting Immutability
Lessons
• Expect failures
– Fail-stop failures
– Memory and network corruption
– Clock skews
– Hung machines
– Extended and asymmetric network partitions
– Failures in underlying components (Chubby)
– Overflow of GFS quotas
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Write a new anchor and delete an old anchor
RowMutation r1(T, "n.www");
r1.Set("anchor:", "CNN");
r1.Delete("anchor:");
Operation op;
Apply(&op, &r1);
– separate SSTable for each locality group
– segregation of column families which are not usually accessed together gives more efficient reads
– some SSTables can be declared to be in-memory (loaded lazily)
– compress each SSTable block separately for a locality group (can read portions of SSTable w/o decompressing entire thing)
– two pass compression scheme (1st pass Bentley and McIlroy’s scheme; 2nd pass fast compression algorithm)
– emphasize speed over space reduction