Change Data Capture (CDC) uses Server Agents to record, insert, update, and delete activity applied to database tables. CDC provides details on changes in an easy-to-use relational format. It captures column information and metadata needed to apply the changes to the target environment for modified rows. A changing table that mirrors the column structure of the tracked source table stores this information.
Capturing change data is no easy feat. However, the open source Apache SeaTunnel project i is a data integration platform provides CDC function with a design philosophy and feature set that makes these captures possible, with features above and beyond existing solutions.
CDC usage scenarios
Classic use cases for CDC is data synchronization or backups between heterogeneous databases. You may synchronize data between MySQL, PostgreSQL, MariaDB, and similar databases in one scenario. You could synchronize the data to a full-text search engine in a different example. With CDC, you can create backups of data based on what CDC has captured.
When designed well, the data analysis system obtains data for processing by subscribing to changes in the target data tables. There's no need to embed the analysis process into the existing system.
Sharing data state between microservices
Microservices are popular, but sharing information between them is often complicated. CDC is a possible solution. Microservices can use CDC to obtain changes in other microservice databases, acquire data status updates, and execute the corresponding logic.
The concept of Command Query Responsibility Segregation (CQRS) is the separation of command activity from query activity. The two are fundamentally different:
- A command writes data to a data source.
- A query reads data from a data source.
The problem is, when does a read event happen in relation to when a write event happened, and what bears the burden of making those events occur?
It can be difficult to update a cache. You can use CDC to obtain data update events from a database and let that control the refresh or invalidation of the cache.
CQRS design usually uses two different storage instances to support business query and change operations. Because of the use of two stores, in order to ensure data consistency, we can use distributed transactions to ensure strong data consistency, at the cost of availability, performance, and scalability. You can also use CDC to ensure final consistency of data, which has better performance and scalability, but at the cost of data latency, which can currently be kept in the range of millisecond in the industry.
For example, you could use CDC to synchronize MySQL data to your full-text search engine, such as ElasticSearch. In this architecture, ElasticSearch searches all queries, but when you want to modify data, you don't directly change ElasticSearch. Instead, you modify the upstream MySQL data, which generates a data update event. This event is consumed by the ElasticSearch system as it monitors the database, and the event prompts an update within ElasticSearch.
In some CQRS systems, a similar method can be used to update the query view.
CDC isn't a new concept and various existing projects implement it. For many users, though, there are some disadvantages to the existing solutions.
Single table configuration
With some CDC software, you must configure each table separately. For example, to synchronize ten tables, you need to write ten source SQLs and Sink SQLs. To perform a transform, you also need to write the transform SQL.
Sometimes, a table can be written by hand, but only when the volume is small. When the volume is large, type mapping or parameter configuration errors may occur, resulting in high operation and maintenance costs.
Apache SeaTunnel is an easy-to-use data integration platform hoping to solve this problem.
Schema evolution is not supported
Some CDC solutions support DDL event sending but do not support sending to Sink so that it can make synchronous changes. Even a CDC that can get an event may not be able to send it to the engine because it cannot change the Type information of the transform based on the DDL event (so the Sink cannot follow the DDL event to change it).
Too many links
On some CDC platforms, when there are several tables, a link must be used to represent each table while one is synchronized. When there are many tables, a lot of links are required. This puts pressure on the source JDBC database and causes too many Binlogs, which may result in repeated log parsing.
SeaTunnel CDC architecture goals
Apache SeaTunnel is an open source high-performance, distributed, and massive data integration framework. To tackle the problems the existing data integration tool's CDC functions cannot solve, the community "reinvents the wheel" to develop a CDC platform with unique features. This architectural design is based on the strengths and weaknesses of existing CDC tools.
Apache Seatunnel supports:
- Lock-free parallel snapshot history data.
- Log heartbeat detection and dynamic table addition.
- Sub-database, sub-table, and multi-structure table reading.
- Schema evolution.
- All the basic CDC functions.
The Apache SeaTunnel reduces the operations and maintenance costs for users and can dynamically add tables.
For example, when you want to synchronize the entire database and add a new table later, you don't need to maintain it manually, change the job configuration, or stop and restart jobs.
Additionally, Apache SeaTunnel supports reading sub-databases, sub-tables, and multi-structure tables in parallel. It also allows schema evolution, DDL transmission, and changes supporting schema evolution in the engine, which can be changed to Transform and Sink.
SeaTunnel CDC current status
Currently, CDC has the basic capabilities to support incremental and snapshot phases. It also supports MySQL for real-time and offline use. The MySQL real-time test is complete, and the offline test is coming. The schema is not supported yet because it involves changes to Transform and Sink. The dynamic discovery of new tables is not yet supported, and some interfaces have been reserved for multi-structure tables.
As an Apache incubation project, the Apache SeaTunnel community is developing rapidly. The next community planning session has these main directions:
1. Expand and improve connector and catalog ecology
We're working to enhance many connector and catalog features, including:
- Support more connectors, including TiDB, Doris, and Stripe.
- Improving existing connectors in terms of usability and performance.
- Support CDC connectors for real-time, incremental synchronization scenarios.
Anyone interested in connectors can review Umbrella.
2. Support for more data integration scenarios (SeaTunnel Engine)
There are pain points that existing engines cannot solve, such as the synchronization of an entire database, the synchronization of table structure changes, and the large granularity of task failure.
We're working to solve those issues. Anyone interested in the CDC engine should look at issue 2272.
3. Easier to use (SeaTunnel Web)
We're working to provide a web interface to make operations easier and more intuitive. Through a web interface, we will make it possible to display Catalog, Connector, Job, and related information, in the form of DAG/SQL. We're also giving users access to the scheduling platform to easily tackle task management.
Visit the web sub-project for more information on the web UI.
Database activity often must be carefully tracked to manage changes based on activities such as record updates, deletions, or insertions. Change Data Capture provides this capability. Apache SeaTunnel is an open source solution that addresses these needs and continues to evolve to offer more features. The project and community are active and your participation is welcome.