计算机毕业设计外文翻译---数据仓库
马来亚大学-翅膀作文
DATA WAREHOUSE
Data
warehousing provides architectures and tools for
business executives to systematically
organize, understand, and use their data to
make strategic decisions. A large number of
organizations
have found that data warehouse
systems are valuable tools in today's competitive,
fast evolving world.
In the last several
years, many firms have spent millions of dollars
in building enterprise-wide data
warehouses.
Many people feel that with competition mounting in
every industry, data warehousing is
the latest
must-have marketing weapon —— a way to keep
customers by learning more about their
needs.
“So
Data warehouses have been defined in
many ways, making it difficult to formulate a
rigorous
definition. Loosely speaking, a data
warehouse refers to a database that is maintained
separately from
an organization's operational
databases. Data warehouse systems allow for the
integration of a variety
of application
systems. They support information processing by
providing a solid platform of
consolidated,
historical data for analysis.
According to W.
H. Inmon, a leading architect in the construction
of data warehouse systems, “a
data warehouse
is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in
support
of management's decision making process.
the
major features of a data warehouse. The four
keywords, subject-oriented, integrated, time-
variant,
and nonvolatile, distinguish data
warehouses from other data repository systems,
such as relational
database systems,
transaction processing systems, and file systems.
Let's take a closer look at each of
these key
features.
(1)Subject-oriented: A data
warehouse is organized around major subjects, such
as customer,
vendor, product, and sales.
Rather than concentrating on the day-to-day
operations and transaction
processing of an
organization, a data warehouse focuses on the
modeling and analysis of data for
decision
makers. Hence, data warehouses typically provide a
simple and concise view around
particular
subject issues by excluding data that are not
useful in the decision support process.
(2)Integrated: A data warehouse is usually
constructed by integrating multiple heterogeneous
sources, such as relational databases, flat
files, and on-line transaction records. Data
cleaning and data
integration techniques are
applied to ensure consistency in naming
conventions, encoding structures,
attribute
measures, and so on..
(3)Time-variant: Data
are stored to provide information from a
historical perspective (e.g., the
past 5-10
years). Every key structure in the data warehouse
contains, either implicitly or explicitly, an
element of time.
(4)Nonvolatile: A data
warehouse is always a physically separate store of
data transformed from
the application data
found in the operational environment. Due to this
separation, a data warehouse
does not require
transaction processing, recovery, and concurrency
control mechanisms. It usually
requires only
two operations in data accessing: initial loading
of data and access of data..
In sum, a data
warehouse is a semantically consistent data store
that serves as a physical
implementation of a
decision support data model and stores the
information on which an enterprise
needs to
make strategic decisions. A data warehouse is also
often viewed as an architecture,
constructed
by integrating data from multiple heterogeneous
sources to support structured andor ad
hoc
queries, analytical reporting, and decision
making.
“OK
Based on the above, we view
data warehousing as the process of constructing
and using data
warehouses. The construction of
a data warehouse requires data integration, data
cleaning, and data
consolidation. The
utilization of a data warehouse often necessitates
a collection of decision support
technologies.
This allows “knowledge workers
warehouse to
quickly and conveniently obtain an overview of the
data, and to make sound decisions
1
based on information in the warehouse.
Some authors use the term “data warehousing
the
process of data warehouse construction, while the
term warehouse DBMS is used to refer to the
management and utilization of data warehouses.
We will not make this distinction here.
“How
are organizations using the information from data
warehouses? Many organizations are
using this
information to support business decision making
activities, including:
(1) increasing customer
focus, which includes the analysis of customer
buying patterns (such as
buying preference,
buying time, budget cycles, and appetites for
spending).
(2) repositioning products and
managing product portfolios by comparing the
performance of
sales by quarter, by year, and
by geographic regions, in order to fine-tune
production strategies.
(3) analyzing
operations and looking for sources of profit.
(4) managing the customer relationships,
making environmental corrections, and managing the
cost of corporate assets.
Data warehousing
is also very useful from the point of view of
heterogeneous database
integration. Many
organizations typically collect diverse kinds of
data and maintain large databases
from
multiple, heterogeneous, autonomous, and
distributed information sources. To integrate such
data, and provide easy and efficient access to
it is highly desirable, yet challenging. Much
effort has
been spent in the database industry
and research community towards achieving this
goal.
The traditional database approach to
heterogeneous database integration is to build
wrappers and
integrators (or mediators) on top
of multiple, heterogeneous databases. A variety of
data joiner and
data blade products belong to
this category. When a query is posed to a client
site, a metadata
dictionary is used to
translate the query into queries appropriate for
the individual heterogeneous sites
involved.
These queries are then mapped and sent to local
query processors. The results returned from
the different sites are integrated into a
global answer set. This query-driven approach
requires complex
information filtering and
integration processes, and competes for resources
with processing at local
sources. It is
inefficient and potentially expensive for frequent
queries, especially for queries requiring
aggregations.
Data warehousing provides an
interesting alternative to the traditional
approach of heterogeneous
database integration
described above. Rather than using a query-driven
approach, data warehousing
employs an update-
driven approach in which information from
multiple, heterogeneous sources is
integrated
in advance and stored in a warehouse for direct
querying and analysis. Unlike on-line
transaction processing databases, data
warehouses do not contain the most current
information.
However, a data warehouse brings
high performance to the integrated heterogeneous
database system
since data are copied,
preprocessed, integrated, annotated, summarized,
and restructured into one
semantic data store.
Furthermore, query processing in data warehouses
does not interfere with the
processing at
local sources. Moreover, data warehouses can store
and integrate historical information
and
support complex multidimensional queries. As a
result, data warehousing has become very
popular in industry.
ences between
operational database systems and data warehouses
Since most people are familiar with commercial
relational database systems, it is easy to
understand what a data warehouse is by
comparing these two kinds of systems.
The
major task of on-line operational database systems
is to perform on-line transaction and
query
processing. These systems are called on-line
transaction processing (OLTP) systems. They
cover most of the day-to-day operations of an
organization, such as, purchasing, inventory,
manufacturing, banking, payroll, registration,
and accounting. Data warehouse systems, on the
other
hand, serve users or “knowledge
workersin the role of data analysis and decision
making. Such
systems can organize and present
data in various formats in order to accommodate
the diverse needs
of the different users.
These systems are known as on-line analytical
processing (OLAP) systems.
The major
distinguishing features between OLTP and OLAP are
summarized as follows.
(1)Users and system
orientation: An OLTP system is customer-oriented
and is used for
transaction and query
processing by clerks, clients, and information
technology professionals. An
OLAP system is
market-oriented and is used for data analysis by
knowledge workers, including
managers,
executives, and analysts.
2
(2)Data contents: An OLTP system
manages current data that, typically, are too
detailed to be
easily used for decision
making. An OLAP system manages large amounts of
historical data, provides
facilities for
summarization and aggregation, and stores and
manages information at different levels of
granularity. These features make the data
easier for use in informed decision making.
(3)Database design: An OLTP system usually
adopts an entity-relationship (ER) data model and
an application -oriented database design. An
OLAP system typically adopts either a star or
snowflake
model, and a subject-oriented
database design.
(4)View: An OLTP system
focuses mainly on the current data within an
enterprise or department,
without referring to
historical data or data in different
organizations. In contrast, an OLAP system
often spans multiple versions of a database
schema, due to the evolutionary process of an
organization.
OLAP systems also deal with
information that originates from different
organizations, integrating
information from
many data stores. Because of their huge volume,
OLAP data are stored on multiple
storage
media.
(5). Access patterns: The access
patterns of an OLTP system consist mainly of
short, atomic
transactions. Such a system
requires concurrency control and recovery
mechanisms. However,
accesses to OLAP systems
are mostly read-only operations (since most data
warehouses store
historical rather than up-to-
date information), although many could be complex
queries.
Other features which distinguish
between OLTP and OLAP systems include database
size,
frequency of operations, and performance
metrics and so on.
, why have a separate data
warehouse?
“Since operational databases store
huge amounts of data-line
analytical
processing directly on such databases instead of
spending additional time and resources to
construct a separate data warehouse?
A
major reason for such a separation is to help
promote the high performance of both systems.
An operational database is designed and tuned
from known tasks and workloads, such as indexing
and
hashing using primary keys, searching for
particular records, and optimizing
“canned
other hand, data warehouse queries are
often complex. They involve the computation of
large groups
of data at summarized levels, and
may require the use of special data organization,
access, and
implementation methods based on
multidimensional views. Processing OLAP queries in
operational
databases would substantially
degrade the performance of operational tasks.
Moreover, an operational database supports the
concurrent processing of several transactions.
Concurrency control and recovery mechanisms,
such as locking and logging, are required to
ensure
the consistency and robustness of
transactions. An OLAP query often needs read-only
access of data
records for summarization and
aggregation. Concurrency control and recovery
mechanisms, if applied
for such OLAP
operations, may jeopardize the execution of
concurrent transactions and thus
substantially
reduce the throughput of an OLTP system.
Finally, the separation of operational
databases from data warehouses is based on the
different
structures, contents, and uses of
the data in these two systems. Decision support
requires historical
data, whereas operational
databases do not typically maintain historical
data. In this context, the data
in operational
databases, though abundant, is usually far from
complete for decision making. Decision
support
requires consolidation (such as aggregation and
summarization) of data from heterogeneous
sources, resulting in high quality, cleansed
and integrated data. In contrast, operational
databases
contain only detailed raw data, such
as transactions, which need to be consolidated
before analysis.
Since the two systems provide
quite different functionalities and require
different kinds of data, it is
necessary to
maintain separate databases.
3
数据仓库
数据仓库为商务运作提供了组织结构和工具,以便系统地
组织、理解和使用
数据进行决策。许多组织发现在如今的具有竞争与快速发展的世界中数据仓库是
非常有用的工具。
在最近的几年里,许多公司花了几百万美元用于构建企业数据库。许多人也认为<
br>随着竞争加剧,数据仓库己成为营销必备的手段——一种了解顾客的需求的武
器。
“那么”,你可能会充满神秘地问,“到底什么是数据仓库?”
数据仓库有不同的定义,但却
很难有一个严格的定义。不严谨的说,数据仓
库是一个数据库,它与组织机构的操作数据库分别维护。数
据仓库允许不同应用
系统的集成,为统一的历史数据分析提供坚实的平台,对信息处理提供支持。
按照W.H Inmon,一位数据仓库构造方面的领头建筑师说,“数据仓库是一
个面向主题
的、集成的、随时间变化的、非易失的数据的集合,支持管理决策制
定。”这个简短,但是复合的定义表
述了数据仓库的主要特点。四个关键词,面
向主题的、集成的、时变的、非易失的,将数据仓库与其它数
据存储系统相区别。
让我们进下来认识它的四个特征。
(1)面向对象:数据仓库是围绕一些
主题,如顾客、供应商、产品和销售组
织。数据仓库关注决策者的数据建模与分析,而不是构造机构日常
操作和事务处
理。因此,数据仓库排除了在进程中提供的没有价值的决策。
(2)集成的:数
据仓库通常由多个数据源组成,如关系数据库、一般文件和
联机事务处理记录。数据清理和数据集成技术
被运用于确保命名的合理性、代码
的结构,结构尺度等。
(3)随时间变化:数据被存储是用
来提供变化历史角度的信息。数据仓库中
所包含的关键字,都显性或隐性的反映时间元素。
(
4)非易失性:数据仓库是物理地分离存放数据;基于这种分法,数据仓库
不需要传输进程,覆盖和并发
控制机制。它通常只需要两种数据访问:数据的初
使化装入和数据访问。
4
总得来说,数据仓库是一种语义上一致的数据存储,它充当了物理决策
数
据模型的实施关于哪种企业需要做战略决策。数据仓库经常被认作一种结
构,由集成的数据组合而成,支
持结构化和启发式查询、分析报告和决策制
定。
“好”,“现在你可以问什么是数据仓库。”
基于以上所讲的,我们把数据仓库视为构造和使用数据仓
库的过程。数
据仓库的构造需要数据集成、数据清理和数据统一。利用数据仓库常常需要
一些决
策支持技术。这使得知识工作者能够利用数据仓库,快捷方便地得到
数据总体视图,根据数据仓库中的信
息做出准确的决策。有些人使用术语“建
立数据库”表示构造数据仓库的过程,用仓库DBMS表示管理
和使用数据仓
库。我们将不区分二者。
“组织是如何从数据仓库中使用数据的?”许多组织使用这些信息支持
决策活动,包括: (1)增加顾客关注,包括分析顾客购买模式(如,喜爱买什么、购买时间、
预算周期、消费习惯)
;
(2)根据季度、年、地区的营销情况比较,重新配置产品和管理投资,调
整生产策略;
(3)分析运作和查找利润源;
(4)管理顾客关系、进行环境调整、管理合股人的资产开销。
从异种数据库集成的角度看,
数据仓库也是十分有用的。许多组织收集了不
同类的数据,并由多个异种的、自治的、分布的数据源维护
大型数据库。集成这
些数据,并提供简便、有效的访问是非常希望的,并且也是一种挑战。数据库工业界和研究界都正朝着实现这一目标竭尽全力。
对于异种数据库的集成,传统的数据库做法是:在
多个异种数据库上,建立
一个包装程序和一个集成程序(或仲裁程序)。这方面的例子包括IBM
的数据
连接程序 和Informix的数据刀。当一个查询提交客户站点,首先使用元数据字
典对查询进行转换,将它转换成相应异种站点上的查询。然后,将这些查询映射
和发送到局部查询处理器
。由不同站点返回的结果被集成为全局回答。这种查询
驱动的方法需要复杂的信息过滤和集成处理,并且
与局部数据源上的处理竞争资
5
源。这种方法是低效的,并且对于
频繁的查询,特别是需要聚集操作的查询,开
销很大。
对于异种数据库集成的传统方法,数据
仓库提供了一个有趣的替代方案。数
据仓库使用更新驱动的方法,而不是查询驱动的方法。这种方法将来
自多个异种
源的信息预先集成,并存储在数据仓库中,供直接查询和分析。与联机事务处理
数据
库不同,数据仓库不包含最近的信息。然而,数据仓库为集成的异种数据库
系统带来了高性能,因为数据
被拷贝、预处理、集成、注释、汇总,并重新组织
到一个语义一致的数据存储中。在数据仓库中进行的查
询处理并不影响在局部源
上进行的处理。此外,数据仓库存储并集成历史信息,支持复杂的查询。这样,
建立数据仓库在工业界就非常流行。
1.操作数据库系统与数据仓库的区别
由于大多数人都熟悉商品关系数据库系统,将数据仓库与之比较,就容易理
解什么是数据仓库。
联机操作数据库系统的主要任务是执行联机事务和查询处理。这种系统称为
联机事务处理(OL
TP)系统。它们涵盖了一个组织的大部分日常操作,如购买、
库存、制造、银行、工资、注册、记帐等
。另一方面,数据仓库系统在数据分析
和决策方面为用户或“知识工人”提供服务。这种系统可以用不同
的格式组织和
提供数据,以便满足不同用户的形形色色需求。这种系统称为联机分析处理
(OL
AP)系统。
OLTP 和OLAP 的主要区别概述如下。
(1)用户和系统定位:联机事务处理是以顾客为导向,用于给客户和信息
技术专家
传输和职员查询处理。在线分析系统是以市场为导向,用于知识工作者包括管理
员、执行官和分析员处理
数据。
(2)数据内容:联机事务处理系统管理当前数据,特别的,都是一些详细
并且简单可
以用于做决定。在线分析系统管理大量历史数据,提供总结和聚集的
设备,存储和管理不同水平的粒度。
这些特征使得用户在做决策上更简单。
(3)数据库的设计:联机处理系统通常采用实体数据模型和应
用联机系统
数据设计。在线分析系统采用星形或雪花模型和面向主题的数据库设计。
6
(4)视图:联机事务处理系统聚焦于当前企业或部门数据,而不涉及到历
史数据或在不同组织中的数据。总得来说,在线分析系统经常跨越许多数据库版
本,基于组织机
构的改革。在线分析系统同样处理来自不同组织的数据,从大量
数据存储中整合信息。由于体积的庞大,
在线分析系统在多个数据媒体上建立存
储。
(5)存储模式:联机处理系统组成短小,自动交
易。如此的一个系统需要
并发控制和恢复机制。然而,在线分析系统存储大部分是只读的,尽管大部分可
以复杂查询。
其它区分联机处理系统和在线分析系统包括数据大小,操作的频率,性能的指标。
2.但是,为什么需要一个分离的数据库?
“既然操作数据库存储了大量的数据”,你也看到
了,“为什么不直接执行
在线分析系统数据库替代花费大量时间和资源去构建一个分离的数据库? 这种分离的一个主要的原因是可以提高两个系统的性能。操作数据库是在己知的
任务和负载设计的,
如果用主关键字索引和散列,检索特定的记录和优化“罐装”
的查询。另一方面,数据仓库查询通常是复
杂的。它们涉及了一堆数据总括水平
的大量运算,它们中的一些需要特殊的算法,存储和基于多维视图的
实现方法。
在线分析系统进程查询在操作数据中可能需要降解大量的操作工作。
另外,操作数
据库支持几个交易的并行处理。并行控制和恢复机制,比如锁
定和测量,都需要确保交易的一致性和稳定
性。在线分析系统查询通常需要对数
据记录进行只读访问,以进行汇总和聚集。并行控制和恢复机制,如
果应用于联
机处理系统,可能会危害控制交易的执行,那样的话,会大大地了降低在线分析
系统
的吞吐量。
最后,从数据仓库中分离数据的操作是基于不同的数据结构,内容和用法。
决策支持系统需要历史的数据,而操作数据库一般不保存历史数据。在这里,在
操作系统中的数据尽管
很多,但对于决策,常常还是不够的。决策支持系统需要
把从异构数据中得来的数据进行统一,才能得以
高效的查询,纯净的和集成的数
据。相比之下,操作数据库包含的不仅仅是原始数据,比如交易数据,这
些数据
在进行分析之前需要统一。由于两系统提供的功能不一样,需要不同类型的数据,
因此需
要维护分离的数据库。
7