Massively Distributed Database Systems
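Database Terminology (DBS)
- DBS: A database system (Database System) consists of software, hardware, and data, and is used to store, manage, and retrieve large volumes of organized data. Database systems come in different types, such as relational database systems (RDBMS) and non-relational database systems (NoSQL).
- Database: A collection of data organized according to a defined structure and set of rules and stored in a computer system. It can be thought of as a warehouse for organized data, able to store and manage large volumes of structured, semi-structured, and unstructured data.
- Database Management System (DBMS): Software that manages databases and provides functions for administering and operating on them. A DBMS is used to create, modify, and delete data in a database, to define and manage the database schema, and to process queries and transactions.
- Database Schema: The logical structure and organization of a database. The schema defines the tables, the relationships between tables, the attributes, and the constraints, and so determines how data in the database is stored and accessed.
- Table: An object in the database schema made up of columns and rows. Each column describes an attribute and each row represents a record. A table stores the data of an entity or object; every table has a unique name and may define constraints and indexes.
- Column: Also called a field or attribute. A column is a vertical set of data in a table; it defines the data type and constraints of one attribute of every record in the table.
- Row: Also called a record or tuple. A row is a horizontal set of data in a table; it holds the concrete value of each attribute for one record.
- Database Index: A data structure that speeds up data retrieval. An index can be built on one or more columns and works like the table of contents of a book, allowing data that matches a given condition to be located quickly.
- Database Query Language (DQL): A language for running queries against a database. Common query languages include Structured Query Language (SQL) and the query languages of NoSQL databases (such as the MongoDB query language).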
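The terms above (schema, table, column, row, index, query) can be seen together in a small sketch. This is only an illustration using Python's built-in `sqlite3` module; the `employee` table, its columns, and the sample rows are hypothetical and do not come from the glossary.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory database with one table, one index, and two rows."""
    conn = sqlite3.connect(":memory:")
    # Schema: the table name, its columns, their data types, and constraints.
    conn.execute(
        "CREATE TABLE employee ("
        " emp_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " city TEXT NOT NULL)"
    )
    # Index: an auxiliary structure that speeds up lookups on the city column.
    conn.execute("CREATE INDEX idx_employee_city ON employee (city)")
    # Rows: each tuple supplies one concrete value per column.
    conn.executemany(
        "INSERT INTO employee (emp_id, name, city) VALUES (?, ?, ?)",
        [(1, "Ada", "Waterloo"), (2, "Lin", "Beijing")],
    )
    conn.commit()
    return conn


def names_in_city(conn, city):
    """Query: return the names of employees located in the given city."""
    rows = conn.execute(
        "SELECT name FROM employee WHERE city = ? ORDER BY name", (city,)
    )
    return [name for (name,) in rows]


conn = build_demo_database()
print(names_in_city(conn, "Beijing"))
```

The DBMS here is SQLite itself; the Python code only plays the role of an application issuing SQL.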
Computer English: Distributed Database Systems

1. Introduction to Distributed Database Systems

2. Key Concepts in Distributed Database Systems

2.2 Data Replication: Data replication is the process of creating multiple copies of data and storing them at different sites in the network. Replication enhances fault tolerance and availability of data by allowing access to the nearest replica when a site or a network link fails.

2.3 Data Consistency: Ensuring data consistency is a major challenge in a distributed database system. Consistency refers to the correctness and integrity of data across different sites. Various techniques, such as distributed transaction management and replica synchronization, are used to maintain data consistency.

2.4 Data Transparency: Data transparency refers to the ability of users and applications to access and manipulate data without being aware of its distribution and location in the network. Transparency is achieved through the use of a distributed query processor that handles the distribution and retrieval of data.

3.1 Data Fragmentation and Allocation: Data fragmentation involves dividing the database into smaller parts, called fragments, which are distributed across different sites. The allocation process determines which fragment is stored at which site, based on factors such as data access patterns and network bandwidth.

3.4 Replica Management: Replica management involves the creation, maintenance, and coordination of replicas in a distributed database system. This includes replica synchronization, consistency management, and fault detection and recovery.

4. Advantages and Challenges of Distributed Database Systems

4.1 Advantages of Distributed Database Systems
- Improved performance and scalability: Distributed database systems can handle large amounts of data and provide high performance by distributing the workload across multiple nodes.
- Fault tolerance and high availability: Data replication and the distributed nature of the system make it resilient to failures, ensuring that data is available even if a site or a network link fails.
- Cost-effective: Distributed database systems can utilize existing hardware and network infrastructure, minimizing the need for additional resources.

4.2 Challenges of Distributed Database Systems
- Data consistency: Ensuring consistency across multiple sites is challenging, especially in the presence of concurrent transactions and replication.
- Network latency: Network latency and bandwidth constraints can impact the performance of distributed database systems.
- Security and privacy: Distributed database systems need to address security concerns such as access control, encryption, and authentication, to protect data from unauthorized access.

5. Conclusion
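The idea in 2.2, that a client falls back to the nearest surviving replica when a site fails, can be sketched in a few lines. This is a toy model under stated assumptions, not a real replication protocol: the `Site` and `ReplicatedStore` classes, the numeric `distance` field, and the site names are all hypothetical.

```python
class Site:
    """One node holding a full copy (replica) of the data."""

    def __init__(self, name, distance):
        self.name = name
        self.distance = distance  # hypothetical network distance to the client
        self.up = True            # False simulates a failed site or link
        self.data = {}


class ReplicatedStore:
    """Writes go to every live replica; reads go to the nearest live one."""

    def __init__(self, sites):
        self.sites = sites

    def write(self, key, value):
        """Update all live replicas and return how many were written."""
        live = [site for site in self.sites if site.up]
        for site in live:
            site.data[key] = value
        return len(live)

    def read(self, key):
        """Return (site name, value) from the nearest live replica with the key."""
        for site in sorted(self.sites, key=lambda s: s.distance):
            if site.up and key in site.data:
                return site.name, site.data[key]
        raise LookupError(f"no live replica holds {key!r}")


sites = [Site("beijing", 1), Site("shanghai", 5), Site("waterloo", 40)]
store = ReplicatedStore(sites)
store.write("order:1", "paid")
print(store.read("order:1"))   # served by the nearest site
sites[0].up = False            # simulate a site failure
print(store.read("order:1"))   # served by the next-nearest replica
```

A site that is down during a write misses the update and holds stale data when it returns. This is the consistency problem that 2.3 and 3.4 describe, and the sketch deliberately leaves it unsolved.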
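Smart Construction Site Management Plan and Technical Measures

A smart construction site is an approach to project management that uses information technology for precise design and construction simulation. Construction-process management is carried out on a 3D design platform, building an information ecosystem for construction projects based on interconnected collaboration, intelligent production, and scientific management. In a virtual-reality environment, this data is combined with engineering information collected through the Internet of Things for data mining and analysis, which provides trend forecasts and expert contingency plans and enables visual, intelligent management of construction. A smart site embeds artificial intelligence, sensing, virtual reality, and other advanced technologies into buildings, machinery, wearable equipment, site entry and exit gates, and other objects to form an "Internet of Things", which is then integrated with the Internet so that project stakeholders are connected with the construction site.

The overall architecture of smart construction can be divided into three layers.

The first is the terminal layer, which uses IoT technology and mobile applications to improve on-site control. RFID tags, sensors, cameras, mobile phones, and other terminal devices provide real-time monitoring, intelligent sensing, data collection, and efficient collaboration throughout the construction process, improving management of the work site.

The second is the platform layer, which uses a cloud platform for efficient computing, storage, and service delivery. It lets all parties to the project access data and work together more conveniently, making construction more intensive, flexible, and efficient.

The third is the application layer. Its core should always be improving engineering project management, the key line of business, so the PM (project management) system is one of the key systems for managing the site. BIM's visual, parametric, and data-driven characteristics make the management and delivery of building projects more efficient and lean, and BIM is an effective means of achieving lean management on site.

To achieve smart construction, information and data must be exchanged between different project members and between different software products. Only with an open information exchange standard can all software products exchange information with one another, and only then can information flow between different project members and different applications. This object-based information exchange standard comprises three kinds of standards: one defining the format of the exchange, one defining the information to be exchanged, and one ensuring that the information exchanged and the information required are the same thing.

2. BIM technology supports effective operation and maintenance management throughout a building's service life. It can locate positions in space and record data, so it can quickly and accurately locate building equipment components, perform accessibility analysis, select sustainable materials, and produce an effective maintenance plan. Combined with RFID technology, importing building information into an asset management system enables asset management of the building.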
How Does a Distributed Database Work?

A distributed database is a database whose files are stored in two or more locations, either on the same network or on entirely separate networks. It is a single large database in which portions of the data are stored at multiple physical locations and processing is spread across the nodes of the database.

A distributed database system is managed centrally by connecting the data logically, so that bulk data can be handled as if it were all stored in one place. The data are synchronized so that a delete or update made at one location is automatically propagated to the other copies. This is how a distributed database makes bulk data easier to manage.

Definition of Network

A network is a system that connects multiple devices so that they can communicate effectively. A network can be small, or it can consist of billions of connected devices. There are various types of network, each with its own role; the two major types are the LAN and the WAN. A local area network (LAN) covers a specific, limited area such as a home, an office, or a campus, and may be a single network or a larger one depending on the size of the area. A wide area network (WAN), on the other hand, is not limited to a single area and is spread over multiple locations.
A WAN typically consists of multiple LANs connected through the Internet. Access to a WAN can be restricted with authentication, firewalls, and other security systems.

Networks are also categorized by characteristics such as topology, protocol, and architecture, all of which are integral to a distributed database system.

Topology is the geometric arrangement of the network, such as a ring, star, or bus.

A protocol is the set of rules and signals that devices on a network use to communicate with each other. For example, Ethernet is a common LAN protocol.

Architecture describes the design of the network, such as peer-to-peer or client/server.

These characteristics matter in a distributed database because they determine how data at different locations are connected effectively and securely.

Features of Distributed Database

The parts of a distributed database are logically connected to each other and are usually described as a single database. The database is therefore presented to its users as one unified whole rather than as scattered pieces.

The sites depend on each other through their processors: the processor at one site connects to another site over the network, not through a multiprocessor configuration. A common misconception is that a distributed database is just a loosely connected collection of files. In reality it is not, because operating a distributed database system is a complicated process.
Based on these facts, a distributed database has several defining features:

- Location independence
- Distributed query processing
- Reliability and reduced data loss
- Internal and external security
- Cost-effectiveness through lower bandwidth costs
- Access to data even if part of the overall network fails
- Easy integration of additional nodes
- Efficient use of speed and resources

There are also some concerns: the database must be kept up to date, and remotely stored data must remain consistent when it is used.

Advantages of Distributed Database System

A distributed database helps a business maintain large volumes of data in a simpler, more systematic way. It supports modular development: the system can be expanded by connecting new computers or local data to a site, and the site then joins the distributed system with little interruption.

A distributed database also has an advantage over a centralized one in that a failure does not stop the whole system. When a centralized database fails, it stops completely. When part of a distributed database fails, the system slows down but continues to operate until the fault is fixed, so users can keep working during the failure.

A distributed database can also lower communication costs. Data can be accessed efficiently when it is located close to where it is used most, which reduces the cost borne by database administrators.
Communication is cheaper because the data are located closer to the point of use.

Response time also improves. Because data is distributed so that it is kept close to the users at a particular site, those users can retrieve it from their own site whenever they need it. These are some of the advantages a distributed database offers for handling large and complex data.

The Environment in Which a Distributed Database Works

The ability to create a distributed version of a database has existed since the 1980s. Distributed database environments are broadly categorized as homogeneous or heterogeneous.

A distributed database does not run on a single system; it is spread over sites, so multiple computers and networks are involved. This has led to the environment being divided into two categories.

Homogeneous database: every site stores the database in the same way. The operating system, database management system, and data structures are the same at all sites. This environment is further divided into autonomous and non-autonomous.

Autonomous: each DBMS works independently, passing messages back and forth to share data updates.

Non-autonomous: a central database management system coordinates database access across sites and updates the other nodes.

Heterogeneous database: different sites use different software to handle query processing and transactions.
In such an environment, the distributed database is stored so that one site may be unaware of what another site holds. Different sites may use different data models, so translation is needed to map from one model to another.

In a heterogeneous environment, a distributed database system is more complex and involves more steps than in a homogeneous one. There are two broad categories of nodes: systems and gateways. A system supports some or all of the functionality of the logical database. A gateway, on the other hand, only provides paths to other databases and offers few of the benefits of a single logical database.

Options for Distributing a Database

A database can be distributed across sites in several ways, depending on the characteristics of the data. There are four basic strategies: data replication, horizontal partitioning, vertical partitioning, and combinations of these. Each is easiest to explain in terms of relational databases.

Data Replication

With data replication, a full copy of a relation is stored at two or more sites. Storing a copy of all the data at several sites gives the system fault tolerance.

Replication is common in information systems where the database is moved from a central location to location-specific servers so that it is kept close to the users.
Replication can use either synchronous or asynchronous distributed database technologies. In full replication, a copy of the entire database is stored at every site from which the organization accesses it.

Replication offers several advantages:

Reliability: if one site holding the relation fails, a copy can be obtained from another site. The available copies can be updated as transactions occur, and failed nodes are brought up to date once they are repaired and return to service.

Fast response: the data is stored near the user, so requests can be processed quickly.

Node decoupling: each transaction can proceed without coordination across the network, because each site has access to the entire database.

Replication also has disadvantages: it requires a great deal of storage because every copy is large, and updates are complex and costly because every site must be updated whenever the relation changes.

Horizontal Partitioning

In horizontal partitioning, some rows of a relation are stored in a base relation at one site and other rows are stored in a base relation at another site, so the rows are distributed across a number of sites.

For example, the rows of a customer relation can be located at each customer's home branch. When a transaction is made at the home branch, it is processed locally and response time is reduced.
When the customer makes a transaction at another branch, the request is sent to the home branch for processing and the result is sent back to the initiating branch.

Horizontal partitioning has both advantages and disadvantages. The advantages are:

Efficiency: data is stored close to its users and separated from data used by other users.

Local optimization: data can be stored in a way that improves the performance of local access.

Security: data that is not relevant to a site is not stored there, so not all data is available in one place.

The disadvantages are inconsistent access speed, because a request that needs data from several partitions takes longer, and backup vulnerability, because the data is not replicated: if the data at one site is damaged, it is lost and cannot be recovered.

Vertical Partitioning

In vertical partitioning the data is partitioned column-wise: some columns of a relation are projected into a base relation at one site, and other columns are projected into a base relation at another site.

As in horizontal partitioning, each partition is stored separately at its own site.
The relations stored at the different sites share a common domain, typically a key column, so that the original relation can be reconstructed.

The advantages and disadvantages of vertical partitioning are similar to those of horizontal partitioning, because here too the data are kept separately without much replication. The one exception is that combining data across vertical partitions is more difficult than combining data across horizontal partitions.
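The two partitioning strategies described above can be made concrete with a small sketch. It is a minimal illustration only: the `CUSTOMERS` rows, the column names, and the three helper functions are hypothetical, and plain Python lists and dictionaries stand in for relations stored at different sites.

```python
CUSTOMERS = [
    {"cust_id": 1, "name": "Ada", "home_branch": "north", "balance": 520},
    {"cust_id": 2, "name": "Lin", "home_branch": "south", "balance": 480},
    {"cust_id": 3, "name": "Omar", "home_branch": "north", "balance": 610},
]


def horizontal_partition(rows, column):
    """Group whole rows by the value of one column (one fragment per site)."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical_partition(rows, key, column_groups):
    """Split columns into fragments; every fragment keeps the shared key."""
    return [
        [{col: row[col] for col in (key, *cols)} for row in rows]
        for cols in column_groups
    ]


def rejoin(fragments, key):
    """Rebuild full rows from vertical fragments by joining on the key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


by_branch = horizontal_partition(CUSTOMERS, "home_branch")
print(sorted(by_branch))  # one fragment per home branch

columns = vertical_partition(
    CUSTOMERS, "cust_id", [("name", "home_branch"), ("balance",)]
)
print(rejoin(columns, "cust_id") == CUSTOMERS)
```

Recombining horizontal fragments only needs concatenation, while recombining vertical fragments needs a join on the shared key. This is one way to see why the text calls vertical recombination the harder of the two.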
Distributed database

A distributed database is a database in which portions of the database are stored on multiple computers within a network. Users have access to the portion of the database at their location so that they can access the data relevant to their tasks without interfering with the work of others.

A distributed database system consists of a collection of sites, connected together via some kind of communications network, in which:
(1) Each site is a full database system site in its own right, but
(2) The sites have agreed to work together so that a user at any site can access data anywhere in the network exactly as if the data were all stored at the user's own site.
Modern Information Technology, Vol. 5, No. 1, January 10, 2021
Received: 2020-12-16

Study on Privacy Protection Techniques of Federated Learning
SHI Jin, ZHOU Ying, DENG Jialei
(Diankeyun (Beijing) Technology Co., Ltd., Beijing 100041, China)

Abstract: As a new artificial intelligence computing framework, federated learning aims to solve the problems of secure data exchange and privacy protection in distributed environments. However, federated learning still has security problems in application. In view of this, the paper analyzes the privacy and security issues of federated learning at multiple levels and proposes targeted defensive measures. For secure, high-speed data exchange in federated learning, it proposes a federated learning model based on an improved homomorphic encryption algorithm, which provides a reference for the practical deployment of federated learning.

Keywords: federated learning; user privacy; data security; homomorphic encryption
CLC number: TP309; TP181    Document code: A    Article ID: 2096-4706(2021)01-0138-05

0 Introduction

Federated learning meets the mobile Internet era's demand for security and privacy. It attracted wide attention as soon as it appeared, and its application in industries such as financial technology and healthcare is gradually expanding.
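To make the term "homomorphic encryption" in the abstract concrete, the following sketch shows the additive property that makes it attractive for federated learning: a server can sum encrypted client values without decrypting any of them. It uses the textbook Paillier scheme as a stand-in; the excerpt does not describe the paper's improved algorithm, so nothing here should be read as that method. The tiny primes, the fixed randomness values, and all function names are assumptions chosen for readability, and the parameters are far too small to be secure.

```python
import math


def keygen(p, q):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    # mu is the modular inverse of L(g^lam mod n^2), where L(x) = (x - 1) // n.
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(public, m, r):
    """Encrypt integer m (0 <= m < n) using randomness r coprime to n."""
    n, g = public
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(public, private, c):
    """Recover the plaintext from ciphertext c."""
    n, _ = public
    lam, mu = private
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add_encrypted(public, c1, c2):
    """Multiplying ciphertexts yields an encryption of the plaintext sum."""
    n, _ = public
    return (c1 * c2) % (n * n)


public, private = keygen(17, 19)
# Two clients encrypt their (hypothetical, integer-quantized) model updates.
c1 = encrypt(public, 12, r=5)
c2 = encrypt(public, 30, r=7)
# The server aggregates ciphertexts without seeing 12 or 30.
total = add_encrypted(public, c1, c2)
print(decrypt(public, private, total))
```

In a real deployment `r` would be drawn at random for every encryption and the primes would be hundreds of digits long; fixing them here only keeps the example reproducible. The modular inverse via `pow(x, -1, n)` requires Python 3.8 or later.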
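Distributed Systems Resources

If you wish to repost this list, you do not need to contact me, but please keep the link to the original, because this project is ongoing and is updated from time to time. I hope readers of this article can learn more from it.

• "Reconfigurable Distributed Storage for Dynamic Networks": a paper on reconfiguring distributed storage in dynamic networks. The author (an advisor) did distributed systems research during his PhD at MIT and now supervises students at NUS, working on wireless networks as well as distributed systems. His home page has more for those interested.
• "Distributed Programming Laboratory": a distributed programming lab that has published many papers, including industrial application papers as well as academic research.
• "MIT Theory of Distributed Systems": home page of MIT's Theory of Distributed Systems group. Nancy Lynch, together with Seth Gilbert, proved the CAP theorem in 2002; she is also the author of the book Distributed Algorithms.
• "Notes on Distributed Systems for Young Bloods": advice for the early stages of building distributed systems.
• "Principles of Distributed Computing": a course on the principles of distributed computing.
• "Google's Globally-Distributed Database": an introduction to Google's globally distributed database.
• "The Architecture Of Algolia's Distributed Search Network": an introduction to the architecture of Algolia's distributed search network.
• "Build up a High Availability Distributed Key-Value Store": building a highly available distributed key-value store.
• "Distributed Search Engine with Nanomsg and Bond": a distributed search engine built with Nanomsg and Bond.
• "Distributed Processing With MongoDB And Mongothon": distributed processing with MongoDB and Mongothon.
• "Salt: Combining ACID and BASE in a Distributed Database": combining ACID and BASE in a distributed database.
• "Makes it easy to understand Paxos for Distributed Systems": an accessible explanation of Paxos for distributed systems.
• "There is No Now: Problems with simultaneity in distributed systems": an article on the problems of simultaneity in distributed systems.
• "Distributed Systems": slides for the distributed systems course at University College London.
• "Distributed systems for fun and profit": an e-book on distributed systems.
• "Distributed Systems Spring 2015": home page of Carnegie Mellon University's spring distributed systems course.
• "Distributed Systems: Concepts and Design (5th Edition)": e-book, fifth edition.
• "走向分布式" (Moving Toward Distributed Systems): a short piece by ccshih, a writer from Taiwan, that introduces several key points of distributed systems.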
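Explanation of Common Big Data Terms (Full Text)
Hu Jingguo

Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a tolerable time frame. It is a high-volume, high-growth, and highly diverse information asset that requires new processing models to deliver stronger decision-making power, insight discovery, and process optimization. The rise of big data has produced many new terms, and these terms are often hard to understand. We have therefore compiled this article from the big data literature as a reference for readers getting to know the field.

1. Aggregation: the process of searching, combining, and presenting data.
2. Algorithms: mathematical formulas that can carry out a particular kind of data analysis.
3. Analytics: used to discover the underlying meaning of data.
4. Anomaly Detection: searching a data set for items that do not match an expected pattern or behavior. Besides "anomalies", other English words for such items include outliers, exceptions, surprises, and contaminants. They often provide critical, actionable information.
5. Anonymization: making data anonymous, that is, removing all data related to personal privacy.
6. Application: here, computer software that implements a particular function.
7. Artificial Intelligence: the development of intelligent machines and software that can perceive their surroundings, respond appropriately when required, and even learn by themselves.
8. Behavioural Analytics: an analytic discipline that draws conclusions from user behavior, such as "how", "why", and "what" users do, rather than only from who was involved and when. It focuses on the humanized patterns in data.
9. Big Data Scientist: a person who can design big data algorithms that make big data useful.
10. Big Data Startup: a newly founded company that develops the latest big data technology.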
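Item 4 (anomaly detection) can be illustrated with one very simple technique: flagging values that lie far from the mean, measured in standard deviations. The glossary does not prescribe any particular method; the z-score rule, the threshold of 2.0, and the sample readings are assumptions made for this sketch.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(find_anomalies(readings))
```

A rule like this works only when most of the data follows one pattern and anomalies are rare. A single extreme value also inflates the standard deviation, which is why the threshold here is set fairly low.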
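Introduction to Database Management Systems
Raghu Ramakrishnan

A database (sometimes spelled data base), also called an electronic database, is a collection of data or information specially organized for rapid search and retrieval by a computer. Databases are structured so that, together with various data-processing commands, they simplify the storage, retrieval, modification, and deletion of data. A database can be stored on magnetic disk, tape, optical disk, or some other secondary storage device.

A database consists of a file or a set of files. The information in these files can be broken down into records, each of which contains one or more fields. Fields are the basic units of data access. A database describes entities, and each field typically holds information about one attribute of an entity. Using keywords and various sorting commands, users can query, rearrange, group, or select the fields of many records in order to retrieve a particular class of data or to generate reports.

All but the simplest databases contain complex data relationships and linkages. The system software package that handles the difficult tasks of creating, accessing, and maintaining database records is called a database management system (DBMS). The programs in a DBMS package establish an interface between the database and its users. (These users may be application programmers, managers and others who need information, and various operating system programs.)

A DBMS can organize, process, and present data elements selected from the database. This capability lets decision makers search, probe, and query the contents of the database to answer nonrecurring and unplanned questions that are not covered in regular reports. These questions may at first be vague and/or poorly defined, but people can browse the database until they have the information they need. In short, the DBMS "manages" the stored data items and assembles the needed items from the common database in response to the queries of non-programmers.

A DBMS is composed of three major parts: (1) a storage subsystem that stores and retrieves data in files; (2) a modeling and manipulation subsystem that provides the means to organize the data and to add, delete, maintain, and update it; and (3) an interface between the DBMS and its users.

Several important trends are emerging that increase the value and usefulness of database management systems:
1. Managers need up-to-date information to make effective decisions.
2. Customers need increasingly sophisticated information services and more current information about their orders, invoices, and accounts.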
Computer English: Distributed Database Systems

The computer is truly one of humanity's great inventions and has improved our lives. The following passage of computer English is provided for reference only.

Distributed Database System

A decentralized system, known in the database world as a distributed system, can be highly responsive to differences in data gathering, storage, and access. It can adjust to differences in user psychology between, say, a multinational corporation's employees in individualistic Greece and in disciplined Japan. It can also adapt to the management styles of strong managers in different locations.

An organization that prefers homogeneity and top-down control will naturally choose a centralized database system. It may also prefer the hierarchical data model. By contrast, an organization that prefers local improvisation and freewheeling may well choose a distributed database system. It may also choose the network data model, which is well suited to searching and updating data in a distributed system.

Communication between distributed communities of computers is required for many reasons. At a national level, for example, computers located in different parts of the country use public communication services to exchange electronic messages (mail) and to transfer files of information from one computer to another. Similarly, at a local level within, say, a single building or establishment, distributed communities of computer-based workstations use local communication networks to access expensive shared resources (for example, printers, copiers, disks, and tapes) that are also managed by computers. Clearly, as the range of computer-based products and associated public and local communication networks proliferates, computer-to-computer communication will expand rapidly and ultimately dominate the field of distributed systems.

Historically, computers were so expensive that most large organizations did all their data processing on a single, centralized machine. While very efficient for such tasks as payroll and generating accounting reports, the centralized approach was not very useful to those who needed a quick response to a unique, local problem. With today's inexpensive micros and minis, there is no reason why a branch office, the engineering department, or any other group needing computer support cannot have its own computer. By linking these remote machines to a centralized computer via communication lines, local activity can be monitored and coordinated. This approach is called distributed data processing.
DISTRIBUTED DATABASE SYSTEMS

M. Tamer Özsu
University of Waterloo
Department of Computer Science
Waterloo, Ontario
Canada N2L 3G1
{tozsu@db.uwaterloo.ca}

Outline

In this article, we discuss the fundamentals of distributed DBMS technology. We address the data distribution and architectural design issues as well as the algorithms that need to be implemented to provide the basic DBMS functions such as query processing, concurrency control, reliability, and replication control.

Glossary

Atomicity: The property of transaction processing whereby either all the operations of a transaction are executed or none of them are (all-or-nothing).

Client/server architecture: A distributed/parallel DBMS architecture where a set of client machines with limited functionality access a set of servers which manage data.

Concurrency control algorithm: Algorithms that synchronize the operations of concurrent transactions that execute on a shared database.

Distributed database management system: A database management system that manages a database that is distributed across the nodes of a computer network and makes this distribution transparent to the users.

Deadlock: An occurrence where each transaction in a set of transactions circularly waits on locks that are held by other transactions in the set.

Durability: The property of transaction processing whereby the effects of successfully completed (i.e., committed) transactions endure subsequent failures.

Isolation: The property of transaction execution which states that the effects of one transaction on the database are isolated from other transactions until the first completes its execution.

Locking: A method of concurrency control where locks are placed on database units (e.g., pages) on behalf of transactions that attempt to access them.

Logging protocol: The protocol which records, in a separate location, the changes that a transaction makes to the database before the change is actually made.

One copy equivalence: Replica control policy which asserts that the values of all copies of a logical data item should be identical when the transaction that updates that item terminates.

Query optimization: The process by which the "best" execution strategy for a given query is found from among a set of alternatives.

Query processing: The process by which a declarative query is translated into low-level data manipulation operations.

Quorum-based voting algorithm: A replica control protocol where transactions collect votes to read and write copies of data items. They are permitted to read or write data items if they can collect a quorum of votes.

Read-Once/Write-All protocol: The replica control protocol which maps each logical read operation to a read on one of the physical copies and maps a logical write operation to a write on all of the physical copies.

Serializability: The concurrency control correctness criterion which requires that the concurrent execution of a set of transactions should be equivalent to the effect of some serial execution of those transactions.

Termination protocol: A protocol by which individual sites can decide how to terminate a particular transaction when they cannot communicate with other sites where the transaction executes.

Transaction: A unit of consistent and atomic execution against the database.

Transparency: Extension of data independence to distributed systems by hiding the distribution, fragmentation and replication of data from the users.

Two-phase commit: An atomic commitment protocol which ensures that a transaction is terminated the same way at every site where it executes. The name comes from the fact that two rounds of messages are exchanged during this process.

Two-phase locking: A locking algorithm where transactions are not allowed to request new locks once they release a previously held lock.

1. Introduction

The maturation of database management system (DBMS) technology has coincided with significant developments in computer network and distributed computing technologies.
The end result is the emergence of distributed database management systems. These systems have started to become the dominant data management tools for highly data-intensive applications. Many DBMS vendors have incorporated some degree of distribution into their products.

A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (distributed DBMS) is the software system that permits the management of the distributed database and makes the distribution transparent to the users. The term "distributed database system" (DDBS) is typically used to refer to the combination of DDB and the distributed DBMS. These definitions point to two identifying architectural principles. The first is that the system consists of a (possibly empty) set of query sites and a non-empty set of data sites. The data sites have data storage capability while the query sites do not. The latter only run the user interface routines in order to facilitate the data access at data sites. The second is that each site (query or data) is assumed to logically consist of a single, independent computer. Therefore, each site has its own primary and secondary storage, runs its own operating system (which may be the same or different at different sites), and has the capability to execute applications on its own. A computer network, rather than a multiprocessor configuration, interconnects the sites. The important point here is the emphasis on loose interconnection between processors that have their own operating systems and operate independently.

2. Data Distribution Alternatives

A distributed database is physically distributed across the data sites by fragmenting and replicating the data. Given a relational database schema, fragmentation subdivides each relation into horizontal or vertical partitions.
Horizontal fragmentation of a relation is accomplished by a selection operation that places each tuple of the relation in a different partition based on a fragmentation predicate (e.g., an Employee relation may be fragmented according to the location of the employees). Vertical fragmentation divides a relation into a number of fragments by projecting over its attributes (e.g., the Employee relation may be fragmented such that the Emp_number, Emp_name and Address information is in one fragment, and Emp_number, Salary and Manager information is in another fragment). Fragmentation is desirable because it enables the placement of data in close proximity to its place of use, thus potentially reducing transmission cost, and it reduces the size of relations that are involved in user queries. Based on the user access patterns, each of the fragments may also be replicated. This is preferable when the same data are accessed from applications that run at a number of sites. In this case, it may be more cost-effective to duplicate the data at a number of sites rather than continuously moving it between them. Figure 1 depicts a data distribution where Employee, Project and Assignment relations are fragmented, replicated and distributed across multiple sites of a distributed database.

Figure 1. A Fragmented, Replicated, and Distributed Database Example

3. Architectural Alternatives

There are many possible alternatives for architecting a distributed DBMS. The simplest is the client/server architecture, where a number of client machines access a single database server. The simplest client/server systems involve a single server that is accessed by a number of clients (these can be called multiple-client/single-server). In this case, the database management problems are considerably simplified since the database is stored on a single server. The pertinent issues relate to the management of client buffers and the caching of data and (possibly) locks.
The data management is done centrally at the single server. A more distributed, and more flexible, architecture is the multiple-client/multiple-server architecture where the database is distributed across multiple servers that have to communicate with each other in responding to user queries and in executing transactions. Each client machine has a “home” server to which it directs user requests. The communication of the servers among themselves is transparent to the users. Most current database management systems implement one or the other type of the client-server architectures. A truly distributed DBMS does not distinguish between client and server machines. Ideally, each site can perform the functionality of a client and a server. Such architectures, called peer-to-peer, require sophisticated protocols to manage data that is distributed across multiple sites. The complexity of required software has delayed the offering of peer-to-peer distributed DBMS products.

If the distributed database systems at various sites are autonomous and (possibly) exhibit some form of heterogeneity, they are usually referred to as multidatabase systems or federated database systems. If the data and DBMS functionality distribution is accomplished on a multiprocessor computer, then it is referred to as a parallel database system. These are different than a distributed database system where the logical integration among distributed data is tighter than is the case with multidatabase systems or federated database systems, but the physical control is looser than that in parallel DBMSs. In this article, we do not consider multidatabase systems or parallel database systems.

4. Overview of Technical Issues

A distributed DBMS has to provide the same functionality that its centralized counterparts provide, such as support for declarative user queries and their optimization, transactional access to the database involving concurrency control and reliability, enforcement of integrity constraints and others. In the remaining sections we discuss some of these functions; in this section we provide a brief overview.

Query processing deals with designing algorithms that analyze queries and convert them into a series of data manipulation operations. Besides the methodological issues, an important aspect of query processing is query optimization. The problem is how to decide on a strategy for executing each query over the network in the most cost-effective way, however cost is defined. The factors to be considered are the distribution of data, communication costs, and lack of sufficient locally available information. The objective is to optimize where the inherent parallelism of the distributed system is used to improve the performance of executing the query, subject to the above-mentioned constraints. The problem is NP-hard in nature, and the approaches are usually heuristic.

User accesses to shared databases are formulated as transactions, which are units of execution that satisfy four properties: atomicity, consistency, isolation, and durability, jointly known as the ACID properties. Atomicity means that a transaction is an atomic unit and either the effects of all of its actions are reflected in the database, or none of them are. Consistency generally refers to the correctness of the individual transactions; i.e., that a transaction does not violate any of the integrity constraints that have been defined over the database. Isolation addresses the concurrent execution of transactions and specifies that actions of concurrent transactions do not impact each other.
Finally, durability concerns the persistence of database changes in the face of failures. ACID properties are enforced by means of concurrency control algorithms and reliability protocols.

Concurrency control involves the synchronization of accesses to the distributed database, such that the integrity of the database is maintained. The concurrency control problem in a distributed context is somewhat different than in a centralized framework. One not only has to worry about the integrity of a single database, but also about the consistency of multiple copies of the database. The condition that requires all the values of multiple copies of every data item to converge to the same value is called mutual consistency.

Reliability protocols deal with the termination of transactions, in particular, their behavior in the face of failures. In addition to the typical failure types (i.e., transaction failures and system failures), distributed DBMSs have to account for communication (network) failures as well. The implication of communication failures is that, when a failure occurs and various sites become either inoperable or inaccessible, the databases at the operational sites must remain consistent and up to date. This complicates the picture, as the actions of these sites have to be eventually reconciled with those of failed ones. Therefore, recovery protocols coordinate the termination of transactions so that they terminate uniformly (i.e., they either abort or they commit) at all the sites where they execute. Furthermore, when the computer system or network recovers from the failure, the distributed DBMS should be able to recover and bring the databases at the failed sites up to date. This may be especially difficult in the case of network partitioning, where the sites are divided into two or more groups with no communication among them.

Distributed databases are typically replicated; that is, a number of the data items reside at more than one site.
Replication improves performance (since data access can be localized) and availability (since the failure of a site does not make a data item inaccessible). However, management of replicated data requires that the values of multiple copies of a data item are the same. This is called the one copy equivalence property. Distributed DBMSs that allow replicated data implement replication protocols to enforce one copy equivalence.

5. Distributed Query Optimization

Query processing is the process by which a declarative query is translated into low-level data manipulation operations. SQL is the standard query language that is supported in current DBMSs. Query optimization refers to the process by which the “best” execution strategy for a given query is found from among a set of alternatives.

In distributed DBMSs, the process typically involves four steps (Figure 2): query decomposition, data localization, global optimization, and local optimization. Query decomposition takes an SQL query and translates it into one expressed in relational algebra. In the process, the query is analyzed semantically so that incorrect queries are detected and rejected as early as possible, and correct queries are simplified. Simplification involves the elimination of redundant predicates that may be introduced as a result of query modification to deal with views, security enforcement and semantic integrity control. The simplified query is then restructured as an algebraic query.

The initial algebraic query generated by the query decomposition step is input to the second step: data localization. The initial algebraic query is specified on global relations irrespective of their fragmentation or distribution. The main role of data localization is to localize the query’s data using data distribution information. In this step, the fragments that are involved in the query are determined and the query is transformed into one that operates on fragments rather than global relations.
As indicated earlier, fragmentation is defined through fragmentation rules that can be expressed as relational operations (horizontal fragmentation by selection, vertical fragmentation by projection). A distributed relation can be reconstructed by applying the inverse of the fragmentation rules. This is called a localization program. The localization program for a horizontally (vertically) fragmented relation is the union (join) of the fragments. Thus, during the data localization step each global relation is first replaced by its localization program, and then the resulting fragment query is simplified and restructured to produce another “good” query. Simplification and restructuring may be done according to the same rules used in the decomposition step. As in the decomposition step, the final fragment query is generally far from optimal; the process has only eliminated “bad” algebraic queries.

Figure 2. Distributed Query Processing Methodology

For a given SQL query, there is more than one possible algebraic query. Some of these algebraic queries are “better” than others. The quality of an algebraic query is defined in terms of expected performance. The process of query optimization involves taking the initial algebraic query and, using algebraic transformation rules, transforming it into other algebraic queries until the “best” one is found. The “best” algebraic query is determined according to a cost function that calculates the cost of executing the query according to that algebraic specification. In a distributed setting, the process involves global optimization to handle operations that involve data from multiple sites (e.g., join) followed by local optimization for further optimizing operations that will be performed at a given site.

The input to the third step, global optimization, is a fragment query, that is, an algebraic query on fragments.
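
To make the notion of a fragment query concrete, the following sketch applies the localization programs described above (union for horizontal fragments, join on the key for vertical fragments) to a few hand-written fragments. It also shows one simplification that data localization can enable: a selection on the fragmentation attribute needs only the matching fragment. The fragment names, attribute names and helper functions are assumptions made for this example, and the fragments are plain Python lists, not a real storage layer.

```python
# Hypothetical fragments of an Employee relation (names and values are illustrative).
EMP_PARIS = [
    {"emp_no": 1, "name": "Ada", "location": "Paris"},
    {"emp_no": 3, "name": "Lea", "location": "Paris"},
]
EMP_TOKYO = [{"emp_no": 2, "name": "Ken", "location": "Tokyo"}]
EMP_NAME = [{"emp_no": 1, "name": "Ada"}, {"emp_no": 2, "name": "Ken"}]
EMP_PAY = [{"emp_no": 1, "salary": 5000}, {"emp_no": 2, "salary": 4200}]


def union_all(fragments):
    """Localization program for horizontal fragments: the union of the fragments."""
    return [row for fragment in fragments for row in fragment]


def join_on_key(left, right, key):
    """Localization program for vertical fragments: a join on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


def select(relation, attribute, value):
    """Relational selection on a single equality predicate."""
    return [row for row in relation if row[attribute] == value]


# Query on the global relation, expressed through its localization program.
full = select(union_all([EMP_PARIS, EMP_TOKYO]), "location", "Tokyo")

# Reduced fragment query: only the Tokyo fragment can contribute tuples.
reduced = select(EMP_TOKYO, "location", "Tokyo")

# Vertical fragments are put back together by joining on emp_no.
rebuilt = join_on_key(EMP_NAME, EMP_PAY, "emp_no")
```

Both `full` and `reduced` return the same tuples, but the reduced form never touches the Paris fragment, which is the kind of saving the localization step aims for before cost-based optimization begins.
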
The goal of query optimization is to find an execution strategy for the query that is close to optimal. Remember that finding the optimal solution is computationally intractable. An execution strategy for a distributed query can be described with relational algebra operations and communication primitives (send/receive operations) for transferring data between sites. The previous layers have already optimized the query, for example, by eliminating redundant expressions. However, this optimization is independent of fragment characteristics such as cardinalities. In addition, communication operations are not yet specified. By permuting the order of operations within one fragment query, many equivalent query execution plans may be found. Query optimization consists of finding the “best” one among candidate plans examined by the optimizer. (The difference between an optimal plan and the best plan is that the optimizer does not, because of computational intractability, examine all of the possible plans.)

The final step, local optimization, takes a part of the global query (called a subquery) that will run at a particular site and optimizes it further. This step is very similar to query optimization in centralized DBMSs. Thus, it is at this stage that local information about data storage, such as indexes, is used to determine the best execution strategy for that subquery.

The query optimizer is usually modeled as consisting of three components: a search space, a cost model, and a search strategy. The search space is the set of alternative execution plans to represent the input query. These plans are equivalent, in the sense that they yield the same result but they differ on the execution order of operations and the way these operations are implemented. The cost model predicts the cost of a given execution plan. To be accurate, the cost model must have accurate knowledge about the parallel execution environment.
The search strategy explores the search space and selects the best plan. It defines which plans are examined and in which order.

In a distributed environment, the cost function, often defined in terms of time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost, communication cost, and so on. Generally, it is a weighted combination of I/O, CPU, and communication costs. Nevertheless, a typical simplification made by distributed DBMSs is to consider communication cost as the most significant factor. This is valid for wide area networks, where the limited bandwidth makes communication much more costly than it is in local processing. To select the ordering of operations it is necessary to predict execution costs of alternative candidate orderings. Determining execution costs before query execution (i.e., static optimization) is based on fragment statistics and the formulas for estimating the cardinalities of results of relational operations. Thus the optimization decisions depend on the available statistics on fragments. An important aspect of query optimization is join ordering, since permutations of the joins within the query may lead to improvements of several orders of magnitude. One basic technique for optimizing a sequence of distributed join operations is through use of the semijoin operator. The main value of the semijoin in a distributed system is to reduce the size of the join operands and thus the communication cost. However, more recent techniques, which consider local processing costs as well as communication costs, do not use semijoins because they might increase local processing costs. The output of the query optimization layer is an optimized algebraic query with communication operations included on fragments.

6. Distributed Concurrency Control

Whenever multiple users access (read and write) a shared database, these accesses need to be synchronized to ensure database consistency.
The synchronization is achieved by means of concurrency control algorithms that enforce a correctness criterion such as serializability. User accesses are encapsulated as transactions, whose operations at the lowest level are a set of read and write operations to the database. Concurrency control algorithms enforce the isolation property of transaction execution, which states that the effects of one transaction on the database are isolated from other transactions until the first completes its execution.

The most popular concurrency control algorithms are locking-based. In such schemes, a lock, in either shared or exclusive mode, is placed on some unit of storage (usually a page) whenever a transaction attempts to access it. These locks are placed according to lock compatibility rules such that read-write, write-read, and write-write conflicts are avoided. It is a well known theorem that if lock actions on behalf of concurrent transactions obey a simple rule, then it is possible to ensure the serializability of these transactions: “No lock on behalf of a transaction should be set once a lock previously held by the transaction is released.” This is known as two-phase locking, since transactions go through a growing phase when they obtain locks and a shrinking phase when they release locks. In general, releasing of locks prior to the end of a transaction is problematic. Thus, most of the locking-based concurrency control algorithms are strict in that they hold on to their locks until the end of the transaction.

In distributed DBMSs, the challenge is to extend both the serializability argument and the concurrency control algorithms to the distributed execution environment. In these systems, the operations of a given transaction may execute at multiple sites where they access data. In such a case, the serializability argument is more difficult to specify and enforce.
The complication is due to the fact that the serialization order of the same set of transactions may be different at different sites. Therefore, the execution of a set of distributed transactions is serializable if and only if

1. the execution of the set of transactions at each site is serializable, and
2. the serialization orders of these transactions at all these sites are identical.

Distributed concurrency control algorithms enforce this notion of global serializability. In locking-based algorithms there are three alternative ways of enforcing global serializability: centralized locking, primary copy locking, and distributed locking.

In centralized locking, there is a single lock table for the entire distributed database. This lock table is placed, at one of the sites, under the control of a single lock manager. The lock manager is responsible for setting and releasing locks on behalf of transactions. Since all locks are managed at one site, this is similar to centralized concurrency control and it is straightforward to enforce the global serializability rule. These algorithms are simple to implement, but suffer from two problems. The central site may become a bottleneck, both because of the amount of work it is expected to perform and because of the traffic that is generated around it; and the system may be less reliable since the failure or inaccessibility of the central site would cause system unavailability.

Primary copy locking is a concurrency control algorithm that is useful in replicated databases where there may be multiple copies of a data item stored at different sites. One of the copies is designated as a primary copy and it is this copy that has to be locked in order to access that item. All the sites know the set of primary copies for each data item in the distributed system, and the lock requests on behalf of transactions are directed to the appropriate primary copy.
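
The routing step of primary copy locking can be sketched as follows. This is a minimal, single-process illustration under stated assumptions: the `LockManager` class, the `PRIMARY_SITE` directory, the site names and the `lock` helper are all hypothetical, conflicting requests are simply refused (a real lock manager would queue them), and no messages are actually sent between sites.

```python
class LockManager:
    """Lock table for one site, with shared ("S") and exclusive ("X") modes."""

    def __init__(self):
        self.table = {}  # item -> (mode, set of transaction ids holding the lock)

    def request(self, txn, item, mode):
        """Grant the lock if it is compatible with the locks already held."""
        held = self.table.get(item)
        if held is None:
            self.table[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # shared locks are compatible with each other
            return True
        if holders == {txn}:
            # The sole holder may re-request the lock or upgrade it to exclusive.
            new_mode = "X" if "X" in (mode, held_mode) else "S"
            self.table[item] = (new_mode, holders)
            return True
        return False  # read-write or write-write conflict

    def release_all(self, txn):
        """Release every lock held by txn at this site (end of transaction)."""
        for item in list(self.table):
            _, holders = self.table[item]
            holders.discard(txn)
            if not holders:
                del self.table[item]


# Hypothetical directory known to every site: data item -> site of its primary copy.
PRIMARY_SITE = {"x": "site_A", "y": "site_B"}
MANAGERS = {"site_A": LockManager(), "site_B": LockManager()}


def lock(txn, item, mode):
    """Direct the lock request to the lock manager at the item's primary-copy site."""
    return MANAGERS[PRIMARY_SITE[item]].request(txn, item, mode)
```

For example, `lock("T1", "x", "S")` is decided entirely by the lock manager at `site_A`, regardless of which site the transaction runs on or where other copies of `x` are stored.
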
If the distributed database is not replicated, primary copy locking degenerates into a distributed locking algorithm.

In distributed (or decentralized) locking, the lock management duty is shared by all the sites in the system. The execution of a transaction involves the participation and coordination of lock managers at more than one site. Locks are obtained at each site where the transaction accesses a data item. Distributed locking algorithms do not have the overhead of centralized locking ones. However, both the communication overhead to obtain all the locks, and the complexity of the algorithm are greater.

One side effect of all locking-based concurrency control algorithms is that they can cause deadlocks. The detection and management of deadlocks in a distributed system is difficult. Nevertheless, the relative simplicity and better performance of locking algorithms make them more popular than alternatives such as timestamp-based algorithms or optimistic concurrency control.

7. Distributed Reliability Protocols

Two properties of transactions are maintained by reliability protocols: atomicity and durability. Atomicity requires that either all the operations of a transaction are executed or none of them are (all-or-nothing property). Thus, the set of operations contained in a transaction is treated as one atomic unit. Atomicity is maintained in the face of failures. Durability requires that the effects of successfully completed (i.e., committed) transactions endure subsequent failures.

The underlying issue addressed by reliability protocols is how the DBMS can continue to function properly in the face of various types of failures. In a distributed DBMS, four types of failures are possible: transaction failures, site (system) failures, media (disk) failures and communication failures.
Transactions can fail for a number of reasons: due to an error in the transaction caused by input data or by an error in the transaction code, or the detection of a present or potential deadlock. The usual approach to take in cases of transaction failure is to abort the transaction, resetting the database to its state prior to the start of the transaction.

Site (or system) failures are due to a hardware failure (e.g., processor, main memory, power supply) or a software failure (bugs in system code). The effect of system failures is the loss of main memory contents. Therefore, any updates to the parts of the database that are in the main memory buffers (also called volatile database) are lost as a result of system failures. However, the database that is stored in secondary storage (also called stable database) is safe and correct. To achieve this, DBMSs typically employ logging protocols, such as Write-Ahead Logging, which record changes to the database in system logs and move these log records and the volatile database pages to stable storage at appropriate times. From the perspective of distributed transaction execution, site failures are important since the failed sites cannot participate in the execution of any transaction.

Media failures refer to the failure of secondary storage devices that store the stable database. Typically, these failures are addressed by introducing redundancy of storage devices and maintaining archival copies of the database. Media failures are frequently treated as problems local to one site and therefore are not specifically addressed in the reliability mechanisms of distributed DBMSs.

The three types of failures described above are common to both centralized and distributed DBMSs. Communication failures, on the other hand, are unique to distributed systems. There are a number of types of communication failures. The most common ones are errors in the messages, improperly ordered messages, lost (or undelivered) messages, and line failures.
Generally, the first two of these are considered to be the responsibility of the computer network protocols and are not addressed by the distributed DBMS. The last two, on the other hand, have an impact on the distributed DBMS protocols and, therefore, need to be considered in the design of these protocols. If one site is
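Distributed Database Systems (1)
A distributed database (Distributed Database, DDB) is a database whose data are stored across the individual computers of a computer network.
A distributed database system (Distributed Database System, DDBS) typically uses relatively small computer systems, each of which can be placed at a separate location. Each computer may hold a complete or partial copy of the DBMS and has its own local database. Many computers at different locations are interconnected through a network and together form a complete, global, large-scale database that is logically centralized and physically distributed.
A distributed database uses a high-speed computer network to connect multiple physically dispersed data storage units into a single, logically unified database.
The basic idea of a distributed database is to spread the data of a formerly centralized database across multiple data storage nodes connected by a network, in order to obtain greater storage capacity and support a higher volume of concurrent access.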
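In recent years, as data volumes have grown rapidly, distributed database technology has also developed quickly. Traditional relational databases have begun to move from a centralized model to a distributed architecture. Relational distributed databases retain the data model and basic characteristics of traditional databases while moving from centralized storage to distributed storage and from centralized computation to distributed computation.
On the other hand, as data volumes keep growing, relational databases have begun to show shortcomings that are hard to overcome. Non-relational databases, represented by NoSQL and offering advantages such as high scalability and high concurrency, have developed rapidly, and for a time a large number of NoSQL products such as key-value stores and document databases appeared on the market.
NoSQL databases are increasingly becoming the mainstay of the distributed database field in the big data era.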
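This way of organizing a database overcomes the weaknesses of organizing it as a single physical central database.
First, it lowers the cost of data transfer, because most database accesses are directed at the local database rather than at databases at other locations. Second, it greatly improves system reliability, because operations on local databases are still allowed when the network fails, and a failure at one location does not affect processing at other locations; only accesses to data at the failed location are affected to some degree. Third, it makes the system easy to extend: adding a new local database, or adding a suitable small computer at some location, is easy to do.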