Brief Introduction to Method of Change Data Capture
Synopsis: The data warehouse system always needs to be refreshed by extracting the change data from the source databases. There are a number of strategies and methods often used to achieve this summarized as below:
1. Table (File) Differencing Full Scan
2. Value-change selection base on Timestamp
3. Base on DBMS Log
4. Base on Trigger Mechanism
5. Base on Replication Technology
All of these methods have their own situations for usage. Many DBMS have their own methods and tools to support Change Data Capture (CDC) and of course we can use our own program to capture the change data from source systems.
Key Word: CDC (Change Data Capture)
Table differencing involves transporting a copy of an entire table from the source (production) database to the staging database (where the change data is captured), where an older version of the table already exists. Often a “MINUS” method is adopted to identify the difference between the old version and the new version, or the target content is just replaced simply. This method is very simple to develop and it will be effective if the data volume is not very large.
However, there are several problems with this method:
· It requires that the new version of the entire table be transported to the staging database, not just the change data, thereby greatly increasing transport costs.
· The computational cost of performing the two MINUS operations on the staging database can be very high.
· Table differencing cannot capture data that have reverted to their old values. For example, suppose the price of a product changes several times between the old version and the new version of the product's table. If the price in the new version ends up being the same as the old, table differencing cannot detect that the price has fluctuated. Moreover, any intermediate price values between the old and new versions of the product's table cannot be captured using table differencing.
· There is no way to determine which changes were made as part of the same transaction. For example, suppose a sales manager creates a special discount to close a deal. The fact that the creation of the discount and the creation of the sale occurred as part of the same transaction cannot be captured, unless the source database is specifically designed to do so.
Change-value selection involves capturing the data on the source database by selecting the new and changed data from the source tables based on the value of a specific column. For example, suppose the source table has a LAST_UPDATE_DATE timestamp column. To capture changes, you base your selection from the source table on the LAST_UPDATE_DATE column value.
This method is a way we often use to do an incremental extraction from the source systems because of its simplicity, but it still has some shortcomings or restrictions:
· The overhead of capturing the change data must be borne on the source database, and you must run potentially expensive queries against the source table on the source database. The need for these queries may force you to add indexes that would otherwise be unneeded. There is no way to offload this overhead to the staging database.
· This method is no better at capturing intermediate values than the table differencing method. If the price in the product's table fluctuates, you will not be able to capture all the intermediate values, or even tell if the price had changed, if the ending value is the same as it was the last time that you captured change data.
· This method is also no better than the table differencing method at capturing which data changes were made together in the same transaction. If you need to capture information concerning which changes occurred together in the same transaction, you must include specific designs for this purpose in your source database.
· Some legacy source systems may not have this kind of timestamp columns to support this method.
By taking advantage of the data sent to the redo log files, change data is captured after a SQL statement that performs a DML operation is committed. In this mode, change data is not captured as part of the transaction that is modifying the source table, and therefore has no effect on that transaction. This kind of methods capture the change while they read and analyst the online log or archive log which record all the operations and the corresponding data values of the DBMS.
The following list describes the advantages of this model base on DBMS log:
This method can capture all effects of INSERT, UPDATE, and DELETE operations, including data values before and after UPDATE operations.
This method can be configured to have minimal performance impact on the source database because it is neither a part of a transaction nor requiring a query on the tables or views.
But this model still has some disadvantages such as:
· Not real time
· Complex and depending on the tools or interfaces DBMS provides
Typical technology instance: Oracle 9i (or above) Asynchronous Change Data Capture, Oracle Streams (Oracle 10g), DB2 Replication Center
Triggers on the source database allow change data to be captured immediately, as each SQL statement that performs a data manipulation language (DML) operation (INSERT, UPDATE, or DELETE) is made. In this mode, change data is captured as part of the transaction modifying the source table.
This method is often used when sending the data to ODS or Staging Database. It has several advantages listed as the following:
· Real time
· Completeness: Every change can be captured by the trigger on DML.
The obvious disadvantage is that it will impact and burden the performance of source system because it is a part of the transaction.
Typical technology instance: Oracle 9i (or above) Synchronous Change Data Capture, Oracle Streams (Oracle 10g)
Replication is the process of copying and maintaining database objects, such as tables, in multiple databases that make up a distributed database system. Changes applied at one site are captured and stored locally before being forwarded and applied at each of the remote locations. Although this technology is designed for sharing data among distributed databases, it can still be used to capture change data.
In SQL Server, this technology is basing on Publisher and Subscriber.
In Oracle 8i (or above), it includes Master Replication and Materialized View Replication. Oracle Streams can also support this technology.
Other tools: DB2 Replication Center.
来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/9665688/viewspace-497949/，如需转载，请注明出处，否则将追究法律责任。