数据仓库(中英文翻译)
奥运比赛项目-深圳大学分数线
DATA WAREHOUSE
Data warehousing
provides architectures and tools for business
executives to
systematically organize,
understand, and use their data to make strategic
decisions. A
large number of organizations
have found that data warehouse systems are
valuable tools
in today's competitive, fast
evolving world. In the last several years, many
firms have
spent millions of dollars in
building enterprise-wide data warehouses. Many
people feel
that with competition mounting in
every industry, data warehousing is the latest
must-have marketing weapon —— a way to keep
customers by learning more about their
needs.
“So
Data warehouses have been defined in
many ways, making it difficult to formulate a
rigorous definition. Loosely speaking, a data
warehouse refers to a database that is
maintained separately from an organization's
operational databases. Data warehouse
systems
allow for the integration of a variety of
application systems. They support
information
processing by providing a solid platform of
consolidated, historical data for
analysis.
According to W. H. Inmon, a leading architect
in the construction of data warehouse
systems,
“a data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management's
decision making process.
comprehensive
definition presents the major features of a data
warehouse. The four
keywords, subject-
oriented, integrated, time-variant, and
nonvolatile, distinguish data
warehouses from
other data repository systems, such as relational
database systems,
transaction processing
systems, and file systems. Let's take a closer
look at each of these
key features.
(1).Subject-oriented: A data warehouse is
organized around major subjects, such as
customer, vendor, product, and sales. Rather
than concentrating on the day-to-day
operations and transaction processing of an
organization, a data warehouse focuses on the
modeling and analysis of data for decision
makers. Hence, data warehouses typically
provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support
process.
(2) Integrated: A data warehouse is
usually constructed by integrating multiple
heterogeneous sources, such as relational
databases, flat files, and on-line transaction
records. Data cleaning and data integration
techniques are applied to ensure consistency
in naming conventions, encoding structures,
attribute measures, and so on.
(3).Time-
variant: Data are stored to provide information
from a historical perspective
(e.g., the past
5-10 years). Every key structure in the data
warehouse contains, either
implicitly or
explicitly, an element of time.
(4)Nonvolatile: A data warehouse is always a
physically separate store of data
transformed
from the application data found in the operational
environment. Due to this
separation, a data
warehouse does not require transaction processing,
recovery, and
concurrency control mechanisms.
It usually requires only two operations in data
accessing: initial loading of data and access
of data.
In sum, a data warehouse is a
semantically consistent data store that serves as
a
physical implementation of a decision
support data model and stores the information on
which an enterprise needs to make strategic
decisions. A data warehouse is also often
viewed as an architecture, constructed by
integrating data from multiple heterogeneous
sources to support structured andor ad
hoc queries, analytical reporting, and decision
making.
“OK
Based on the above, we
view data warehousing as the process of
constructing and
using data warehouses. The
construction of a data warehouse requires data
integration,
data cleaning, and data
consolidation. The utilization of a data warehouse
often
necessitates a collection of decision
support technologies. This allows “knowledge
workers
conveniently obtain an overview of
the data, and to make sound decisions based on
information in the warehouse. Some authors use
the term “data warehousing
only to the process
of data warehouse construction, while the term
warehouse DBMS is
used to refer to the
management and utilization of data warehouses. We
will not make
this distinction here.
“How
are organizations using the information from data
warehouses? Many
organizations are using this
information to support business decision making
activities,
including:
(1) increasing
customer focus, which includes the analysis of
customer buying
patterns (such as buying
preference, buying time, budget cycles, and
appetites for
spending),
(2)
repositioning products and managing product
portfolios by comparing the
performance of
sales by quarter, by year, and by geographic
regions, in order to fine-tune
production
strategies,
(3) analyzing operations and
looking for sources of profit,
(4) managing
the customer relationships, making environmental
corrections, and
managing the cost of
corporate assets.
Data warehousing is also
very useful from the point of view of
heterogeneous
database integration. Many
organizations typically collect diverse kinds of
data and
maintain large databases from
multiple, heterogeneous, autonomous, and
distributed
information sources. To integrate
such data, and provide easy and efficient access
to it is
highly desirable, yet challenging.
Much effort has been spent in the database
industry and
research community towards
achieving this goal.
The traditional database
approach to heterogeneous database integration is
to build
wrappers and integrators (or
mediators) on top of multiple, heterogeneous
databases. A
variety of data joiner and data
blade products belong to this category. When a
query is
posed to a client site, a metadata
dictionary is used to translate the query into
queries
appropriate for the individual
heterogeneous sites involved. These queries are
then
mapped and sent to local query
processors. The results returned from the
different sites
are integrated into a global
answer set. This query-driven approach requires
complex
information filtering and integration
processes, and competes for resources with
processing at local sources. It is inefficient
and potentially expensive for frequent queries,
especially for queries requiring aggregations.
Data warehousing provides an interesting
alternative to the traditional approach of
heterogeneous database integration described
above. Rather than using a query-driven
approach, data warehousing employs an update-
driven approach in which information
from
multiple, heterogeneous sources is integrated in
advance and stored in a warehouse
for direct
querying and analysis. Unlike on-line transaction
processing databases, data
warehouses
do not contain the most current information.
However, a data warehouse
brings high
performance to the integrated heterogeneous
database system since data are
copied,
preprocessed, integrated, annotated, summarized,
and restructured into one
semantic data store.
Furthermore, query processing in data warehouses
does not interfere
with the processing at
local sources. Moreover, data warehouses can store
and integrate
historical information and
support complex multidimensional queries. As a
result, data
warehousing has become very
popular in industry.
1.
Differences between operational database
systems and data warehouses
Since most people
are familiar with commercial relational database
systems, it is
easy to understand what a data
warehouse is by comparing these two kinds of
systems.
The major task of on-line operational
database systems is to perform on-line
transaction and query processing. These
systems are called on-line transaction processing
(OLTP) systems. They cover most of the day-to-
day operations of an organization, such
as,
purchasing, inventory, manufacturing, banking,
payroll, registration, and accounting.
Data
warehouse systems, on the other hand, serve users
or “knowledge workers
role of data analysis and
decision making. Such systems can organize and
present data in
various formats in order to
accommodate the diverse needs of the different
users. These
systems are known as on-line
analytical processing (OLAP) systems.
The
major distinguishing features between OLTP and
OLAP are summarized as
follows.
(1). Users
and system orientation: An OLTP system is
customer-oriented and is used
for transaction
and query processing by clerks, clients, and
information technology
professionals. An OLAP
system is market-oriented and is used for data
analysis by
knowledge workers, including
managers, executives, and analysts.
(2). Data
contents: An OLTP system manages current data
that, typically, are too
detailed to be easily
used for decision making. An OLAP system manages
large amounts
of historical data, provides
facilities for summarization and aggregation, and
stores and
manages information at different
levels of granularity. These features make the
data
easier for use in informed decision
making.
(3). Database design: An OLTP system
usually adopts an entity-relationship (ER)
data model and an application -oriented
database design. An OLAP system typically
adopts either a star or snowflake model, and a
subject-oriented database design.
(4). View:
An OLTP system focuses mainly on the current data
within an enterprise
or department, without
referring to historical data or data in different
organizations. In
contrast, an OLAP system
often spans multiple versions of a database
schema, due to the
evolutionary process of an
organization. OLAP systems also deal with
information that
originates from different
organizations, integrating information from many
data stores.
Because of their huge volume,
OLAP data are stored on multiple storage media.
(5). Access patterns: The access patterns of
an OLTP system consist mainly of short,
atomic
transactions. Such a system requires concurrency
control and recovery
mechanisms. However,
accesses to OLAP systems are mostly read-only
operations (since
most data warehouses store
historical rather than up-to-date information),
although many
could be complex queries.
Other features which distinguish between OLTP
and OLAP systems include database
size,
frequency of operations, and performance metrics
and so on.
2.
But, why have a
separate data warehouse?
“Since
operational databases store huge amounts of
data
perform on-line analytical processing
directly on such databases instead of spending
additional time and resources to construct a
separate data warehouse?
A major reason for
such a separation is to help promote the high
performance of both
systems. An operational
database is designed and tuned from known tasks
and workloads,
such as indexing and hashing
using primary keys, searching for particular
records, and
optimizing “cannedata warehouse
queries are often
complex. They involve the
computation of large groups of data at summarized
levels, and
may require the use of special
data organization, access, and implementation
methods
based on multidimensional views.
Processing OLAP queries in operational databases
would substantially degrade the performance of
operational tasks.
Moreover, an operational
database supports the concurrent processing of
several
transactions. Concurrency control and
recovery mechanisms, such as locking and logging,
are required to ensure the consistency and
robustness of transactions. An OLAP query
often needs read-only access of data records
for summarization and aggregation.
Concurrency
control and recovery mechanisms, if applied for
such OLAP operations,
may jeopardize the
execution of concurrent transactions and thus
substantially reduce the
throughput of an OLTP
system.
Finally, the separation of operational
databases from data warehouses is based on the
different structures, contents, and uses of
the data in these two systems. Decision support
requires historical data, whereas operational
databases do not typically maintain historical
data. In this context, the data in operational
databases, though abundant, is usually far
from complete for decision making. Decision
support requires consolidation (such as
aggregation and summarization) of data from
heterogeneous sources, resulting in high
quality, cleansed and integrated data. In
contrast, operational databases contain only
detailed raw data, such as transactions, which
need to be consolidated before analysis.
Since
the two systems provide quite different
functionalities and require different kinds
of
data, it is necessary to maintain separate
databases.
数据仓库
数据仓库为商务运作提供结构与工具,以便
系统地组织、理解和使用数据进
行决策。大量组织机构已经发现,在当今这个充满竞争、快速发展的世界
,数据
仓库是一个有价值的工具。在过去的几年中,许多公司已花费数百万美元,建立
企业范围
的数据仓库。许多人感到,随着工业竞争的加剧,数据仓库成了必备的
最新营销武器——通过更多地了解
客户需求而保住客户的途径。
“那么”,你可能会充满神秘地问,“到底什么是数据仓库?”
数据仓库已被多种方式定义,使得很难严格地定义它。宽松地讲,数据仓库
是一个数据库,它与组织机
构的操作数据库分别维护。数据仓库系统允许将各种
应用系统集成在一起,为统一的历史数据分析提供坚
实的平台,对信息处理提供
支持。
按照W. H. Inmon,一位数据仓库系统构造方面
的领头建筑师的说法,“数
据仓库是一个面向主题的、集成的、时变的、非易失的数据集合,支持管理决
策
制定”。这个简短、全面的定义指出了数据仓库的主要特征。四个关键词,面向
主题的、集成
的、时变的、非易失的,将数据仓库与其它数据存储系统(如,关
系数据库系统、事务处理系统、和文件
系统)相区别。让我们进一步看看这些关
键特征。
(1)
、
面向主题的:
数据仓库围绕一些主题,如顾客、供应商、产品和销售组
织。数据仓库关注决策者的数据建模与分析,而
不是构造组织机构的日常操作和
事务处理。因此,数据仓库排除对于决策无用的数据,提供特定主题的简
明视图。
(2)
、
集成的:通常,构造数据仓库是将多个异种数据源,如关系数据库
、一
般文件和联机事务处理记录,集成在一起。使用数据清理和数据集成技术,确保
命名约定、
编码结构、属性度量的一致性等。
(3)
、
时变的:数据存储从历史的角度(例如,过去5-10
年)提供信息。数
据仓库中的关键结构,隐式或显式地包含时间元素。
(4)
、 <
br>非易失的:数据仓库总是物理地分离存放数据;这些数据源于操作环境
下的应用数据。由于这种分
离,数据仓库不需要事务处理、恢复和并行控制机制。
通常,它只需要两种数据访问:数据的初始化装入
和数据访问。
概言之,数据仓库是一种语义上一致的数据存储,它充当决策支持数据模型
的物
理实现,并存放企业决策所需信息。数据仓库也常常被看作一种体系结构,
通过将异种数据源中的数据集
成在一起而构造,支持结构化和启发式查询、分析
报告和决策制定。
“好”,你现在问,“那么,什么是建立数据仓库(data warehousing)?”
根据上面的讨论,我们把建立数据仓库看作构造和使用数据仓库的过程。数
据仓库的构造需要数据集成
、数据清理、和数据统一。利用数据仓库常常需要一
些决策支持技术。这使得“知识工人”(例如,经理
、分析人员和主管)能够使
用数据仓库,快捷、方便地得到数据的总体视图,根据数据仓库中的信息做出
准
确的决策。有些作者使用术语“建立数据仓库”表示构造数据仓库的过程,而用
术语“仓库D
BMS”表示管理和使用数据仓库。我们将不区分二者。
“组织机构如何使用数据仓库中的信息?”许
多组织机构正在使用这些信息
支持商务决策活动,包括:
(1)、增加顾客关注,包括分析顾
客购买模式(如,喜爱买什么、购买时间、
预算周期、消费习惯);
(2)、
根据季度、年、地区的营销情况比较,重新配置产品和管理投资,调整
生产策略;
(3)、分析运作和查找利润源;
(4)、管理顾客关系、进行环境调整、管理合股人的资产开销。
从异种数据库集成的角度看
,数据仓库也是十分有用的。许多组织收集了形
形色色数据,并由多个异种的、自治的、分布的数据源维
护大型数据库。集成这
些数据,并提供简便、有效的访问是非常希望的,并且也是一种挑战。数据库工<
br>业界和研究界都正朝着实现这一目标竭尽全力。
对于异种数据库的集成,传统的数据库做法是:
在多个异种数据库上,建立
一个包装程序和一个集成程序(或仲裁程序)。这方面的例子包括IBM
的数据
连接程序 (Data Joiner) 和Informix的数据刀(DataBlade)
。当一个查询提交客户
站点,首先使用元数据字典对查询进行转换,将它转换成相应异种站点上的查询。
然后,将这些查询映射和发送到局部查询处理器。由不同站点返回的结果被集成
为全局回答。这
种查询驱动的方法需要复杂的信息过滤和集成处理,并且与局部
数据源上的处理竞争资源。这种方法是低
效的,并且对于频繁的查询,特别是需
要聚集操作的查询,开销很大。
对于异种数据库集成的
传统方法,数据仓库提供了一个有趣的替代方案。数
据仓库使用更新驱动的方法,而不是查询驱动的方法
。这种方法将来自多个异种
源的信息预先集成,并存储在数据仓库中,供直接查询和分析。与联机事务处
理
数据库不同,数据仓库不包含最近的信息。然而,数据仓库为集成的异种数据库
系统带来了高
性能,因为数据被拷贝、预处理、集成、注释、汇总,并重新组织
到一个语义一致的数据存储中。在数据
仓库中进行的查询处理并不影响在局部源
上进行的处理。此外,数据仓库存储并集成历史信息,支持复杂
的多维查询。这
样,建立数据仓库在工业界已非常流行。
1.
操作数据库系统与数据仓库的区别
由于大多数人都熟悉商品关系数据库系统,将数据仓库与之比较,就容易理
解什么是数据仓库。
联机操作数据库系统的主要任务是执行联机事务和查询处理。这种系统称为
联机事务处理(OL
TP)系统。它们涵盖了一个组织的大部分日常操作,如购买、
库存、制造、银行、工资、注册、记帐等
。另一方面,数据仓库系统在数据分析
和决策方面为用户或“知识工人”提供服务。这种系统可以用不同
的格式组织和
提供数据,以便满足不同用户的形形色色需求。这种系统称为联机分析处理
(OL
AP)系统。
OLTP 和OLAP 的主要区别概述如下。
(1)、用户和系统的面向性:OLTP
是面向顾客的,用于办事员、客户、和
信息技术专业人员的事务和查询处理。OLAP
是面向市场的,用于知识工人(包
括经理、主管、和分析人员)的数据分析。
(2)、
数据内容:OLTP
系统管理当前数据。通常,这种数据太琐碎,难以
方便地用于决策。OLAP 系统管理大量历史数据,
提供汇总和聚集机制,并在
不同的粒度级别上存储和管理信息。这些特点使得数据容易用于见多识广的决
策。
(3)、数据库设计:通常,OLTP 系统采用实体-
联系(ER)模型和面向应用
的数据库设计。而OLAP
系统通常采用星形或雪花模型和面向主题的数据库设
计。
(4)、
视图:OLTP 系统主要关注一个企业或部门内部的当前数据,而不涉
及历史数据或不同组织
的数据。相比之下,由于组织的变化,OLAP 系统常常
跨越数据库模式的多个版本。OLAP
系统也处理来自不同组织的信息,由多个
数据存储集成的信息。由于数据量巨大,OLAP
数据也存放在多个存储介质上。
(5)、访问模式:OLTP
系统的访问主要由短的、原子事务组成。这种系统需
要并行控制和恢复机制。然而,对OLAP 系统的
访问大部分是只读操作(由于
大部分数据仓库存放历史数据,而不是当前数据),尽管许多可能是复杂的
查询。
OLTP 和OLAP 的其它区别包括数据库大小、操作的频繁程度、性能度量等。
2.
但是,为什么需要一个分离的数据仓库
“既然
操作数据库存放了大量数据”,你注意到,“为什么不直接在这种数
据库上进行联机分析处理,而是另外
花费时间和资源去构造一个分离的数据仓
库?”
分离的主要原因是提高两个系统的性能。操作
数据库是为已知的任务和负载
设计的,如使用主关键字索引和散列,检索特定的记录,和优化“罐装的”
查询。
另一方面,数据仓库的查询通常是复杂的,涉及大量数据在汇总级的计算,可能
需要特殊
的数据组织、存取方法和基于多维视图的实现方法。在操作数据库上处
理OLAP
查询,可能会大大降低操作任务的性能。
此外,操作数据库支持多事务的并行处理,需要加锁和日志等
并行控制和恢
复机制,以确保一致性和事务的强健性。通常,OLAP
查询只需要对数据记录
进行只读访问,以进行汇总和聚集。如果将并行控制和恢复机制用于这种OLAP
操作,就会危害并行事务的运行,从而大大降低OLTP 系统的吞吐量。
最后,数据仓库与
操作数据库分离是由于这两种系统中数据的结构、内容和
用法都不相同。决策支持需要历史数据,而操作
数据库一般不维护历史数据。在
这种情况下,操作数据库中的数据尽管很丰富,但对于决策,常常还是远
远不够
的。决策支持需要将来自异种源的数据统一(如,聚集和汇总),产生高质量的、
纯净的
和集成的数据。相比之下,操作数据库只维护详细的原始数据(如事务),
这些数据在进行分析之前需要
统一。由于两个系统提供很不相同的功能,需要不
同类型的数据,因此需要维护分离的数据库。