If I had to poll data professionals on the tasks they enjoy working on the most, ETL logging probably wouldn’t make the list. However, establishing a proper logging strategy is essential to the success of any ETL architecture. I like to compare good logging infrastructure to the plumbing in a house: it’s not visible from the outside, and it’s not terribly exciting, but you’ll certainly know if it’s installed incorrectly (or not installed at all).
In this ETL best practice tip, I’ll share the benefits of an effective logging strategy and cover some of the variables, including what should be logged and how long those logs should be kept.
ETL Logging
Before I start describing best practices, let me clarify what ETL logging is. I think of the ETL log as an activity record of relevant events that occur before, during, and after the execution of an extract, transform, and load process. This logging typically occurs within the ETL software itself, but it can also include other logs (those from job scheduling tools, for example) that supplement the built-in ETL logging. These logs will vary in granularity and scope, with some describing load-level metrics (how long the overall process took) and others specific to a small segment of the load (how many rows were loaded in step 7).
There is no single approach to logging that fits every situation. Ideally, each process will be evaluated to determine its logging strategy based on data criticality, frequency of execution, probability of failure, and other factors important in your environment.
What should be logged?
As noted above, logging is rarely a one-size-fits-all endeavor, as each process may have slightly different logging needs. However, in the real world, my years of consulting have taught me that most organizations take an all-or-nothing approach across all of their ETL processes: either they don’t log at all, or they capture every possible metric and keep it forever. While there are valid edge cases for each of these, the right answer almost always falls somewhere in the middle (although hopefully closer to the “log everything” end).
A minimalist approach might simply involve recording the start and end times of the overall process. A narrow approach like this works best when the scope of possible errors is small and when the process can be rerun without causing damage. I have seen this used successfully for some staging loads, where data is loaded into volatile staging tables. Most staging loads involve first removing the old data from the destination tables, which means they can be executed repeatedly without inadvertently duplicating data. Therefore, a minimalist logging approach can be used on these types of loads, because the points of failure are fairly minimal and the process can easily be rerun if further diagnostics are required.
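As a concrete illustration, below is a minimal sketch of this pattern in T-SQL. All of the object names (EtlLoadLog, stg.Customer, src.Customer, and so on) are hypothetical, invented for this example rather than taken from any particular product:

```sql
-- Hypothetical run-level log table: one row per execution of the load.
CREATE TABLE dbo.EtlLoadLog (
    LoadLogID INT IDENTITY(1,1) PRIMARY KEY,
    LoadName  VARCHAR(100) NOT NULL,
    StartTime DATETIME2    NOT NULL,
    EndTime   DATETIME2    NULL
);
GO

DECLARE @LogID INT;

-- Record the start of the staging load.
INSERT INTO dbo.EtlLoadLog (LoadName, StartTime)
VALUES ('StageCustomerLoad', SYSDATETIME());
SET @LogID = SCOPE_IDENTITY();

-- The load itself: clear the volatile staging table, then reload it.
TRUNCATE TABLE stg.Customer;
INSERT INTO stg.Customer (CustomerID, CustomerName)
SELECT CustomerID, CustomerName
FROM   src.Customer;

-- Record successful completion.
UPDATE dbo.EtlLoadLog
SET    EndTime = SYSDATETIME()
WHERE  LoadLogID = @LogID;
```

Because the staging table is truncated at the start of each run, a failed execution can simply be rerun; the two timestamps are all the history this type of load really needs.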
Most processes require more intensive logging, though. To truly capture what happens within each load process, rather than measuring the process as a whole, there should be a step-by-step log for each task. If a particular ETL job has, say, ten steps, an ideal design would record the start and completion of each task, any task-level exceptions, and any necessary auditing information.
The following can be used as a rough guide to the types of events and metrics to capture during logging (a sketch of a log table covering these categories follows the list):
- Start and stop events. The start and end timestamp of the ETL process as a whole, as well as its individual steps, should be stored.
- Status. Process steps can succeed or fail individually, and as such, their status (not started, running, succeeded, or failed) should be logged individually.
- Errors and other exceptions. Error and anomaly logging is often the most time-consuming part of building a logging infrastructure. It is also the part that delivers the most value during testing and troubleshooting.
- Audit information. This can range from simply capturing the number of rows loaded in each process execution to a full analysis of row count and monetary value from source to destination.
- Testing and debugging information. This is particularly useful during the development and testing phase, especially for processes that are heavy on the transformation part of ETL.
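To make these categories concrete, here is one possible shape for a step-level log table, again sketched in T-SQL with hypothetical names (EtlStepLog, and a LoadLogID that refers back to the run-level table from the earlier example):

```sql
-- Hypothetical step-level log table: one row per task within an execution.
CREATE TABLE dbo.EtlStepLog (
    StepLogID    INT IDENTITY(1,1) PRIMARY KEY,
    LoadLogID    INT            NOT NULL,  -- points to the run-level log row
    StepName     VARCHAR(100)   NOT NULL,
    StepStatus   VARCHAR(20)    NOT NULL
        CONSTRAINT CK_EtlStepLog_Status
        CHECK (StepStatus IN ('NotStarted', 'Running', 'Succeeded', 'Failed')),
    StartTime    DATETIME2      NULL,      -- start and stop events
    EndTime      DATETIME2      NULL,
    RowsLoaded   INT            NULL,      -- audit information
    ErrorMessage NVARCHAR(4000) NULL,      -- errors and other exceptions
    DebugInfo    NVARCHAR(MAX)  NULL       -- testing and debugging detail
);
```

Each comment maps a column back to one of the event and metric types listed above; your own table will likely need more or fewer columns, depending on which of them matter for a given process.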
When planning a logging strategy, it is tempting to simply record everything. Some ETL tools, including SQL Server Integration Services, make it relatively easy to go into capture-everything mode. I don’t necessarily want to discourage you; having too much information logged is better than not enough, all other things being equal. However, there is always a cost associated with logging: processing overhead, network bandwidth, and storage requirements must all be considered when planning how ETL processes will be logged. Logging every available metric and event may be easier, but it’s not free. Remember: balance what you’d like to capture in the log against the costs (hard and soft) of overextending your logging strategy.
How should logs be consumed?
Suppose for a moment you have your logging infrastructure built and it’s working as expected. How do you get this information to the right people?
As a consultant, I’ve seen it dozens of times: ETL processes are fully logged, but no one really knows what’s logged or where the information is stored. If processes log into a black hole, the log information is just data; it must be accessible and understandable before it can be classified as information. Most data geeks like me would prefer to have access to the raw, unfiltered log data so we can write our own queries to get exactly what we want. Other people who aren’t as fond of writing SQL would probably do better with a more streamlined delivery of the log information. Most of us who develop ETL solutions create logging mechanisms primarily for ourselves and our colleagues, to aid in troubleshooting and bug reporting. Increasingly, however, others, often semi-technical or non-technical people, also need access to ETL logs. DBAs, help desk staff, business analysts, and auditors may all need to see what’s in the logs, and each group may have a very different expectation of how to get to that data. I’ve even worked with hands-on C-level executives who routinely reviewed ETL log data.
When considering how log information will be consumed, keep the following in mind:
- Who is the primary audience for these logs?
- Which other audiences might need access to this information?
- Will the data be reviewed on an ad-hoc basis, or will it be delivered through a push mechanism or a dashboard?
- Which format delivers the information in the clearest way possible?
- Are there any security concerns that require filtering the data at the user level?
The answers to these questions will drive the strategy for exposing log data. It can be as simple as providing other DBAs with the location of your log data, or more formalized, such as creating a SQL Server Reporting Services report or a Power BI dashboard. Regardless of which approach is used, the goal is to get the necessary information to the target audience as clearly as possible.
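As one example of the simpler end of that spectrum, a view over the hypothetical log tables sketched earlier could give less SQL-savvy consumers a friendly starting point; the view name and the seven-day window are, again, assumptions for illustration:

```sql
-- Hypothetical view: last seven days of failures, shaped for easy consumption.
CREATE VIEW dbo.vw_RecentEtlFailures
AS
SELECT l.LoadName,
       s.StepName,
       s.StartTime,
       s.ErrorMessage
FROM   dbo.EtlStepLog AS s
       INNER JOIN dbo.EtlLoadLog AS l
           ON l.LoadLogID = s.LoadLogID
WHERE  s.StepStatus = 'Failed'
  AND  s.StartTime >= DATEADD(DAY, -7, SYSDATETIME());
```

A view like this can also serve as the data source for an SSRS report or a Power BI dashboard, so the same logic feeds every audience.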
Retention Policy
My clients ask this a lot: “How long do we keep ETL logs?” My answer is always this: “Keep them as long as you need them, but no longer.” Yes, that’s a stereotypical consultant answer, but really, the answer depends entirely on expectations of how those logs will be used. Here’s a common dialog I share when discussing short-term versus long-term retention strategies for ETL logs:
Short retention period: “A short retention period ensures that you won’t fill your drives with a bunch of old log data that you’ll never use. Keeping just a few months or a year of logs provides enough history to fix any recent logic or performance issues in your ETL. Short retention periods may work well for you if your concern is operational support and you don’t use your logs for long-term performance analysis or auditing purposes.”
Long retention period: “Yes, it costs you more in storage space to keep the logs for a longer period of time. But who cares? Storage is relatively inexpensive, and you can archive the old stuff on cheaper, slower drives. By keeping your logs for a longer period, you can analyze ETL workloads and performance over time to see year-over-year trends. And if you’re in a heavily audited environment, long-term retention of logs can satisfy most, if not all, ETL auditing requirements.”
As with most of the other concepts I discuss here, there is no single approach that suits every situation. However, in most cases, I will steer clients toward a longer-term retention period for ETL logs, unless there is a clear reason to avoid it. It’s better to have data you’ll never use than to need data you don’t have. There are also degrees of log retention; I have successfully built logging structures with comprehensive short-term detail logging and a purging process that removes the detail while retaining summary information for years.
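That tiered approach can be sketched against the same hypothetical tables used above. The 90-day window, the EtlStepLogSummary table, and the choice of summary columns are all assumptions for illustration:

```sql
-- Hypothetical summary table: one compact row per load execution.
CREATE TABLE dbo.EtlStepLogSummary (
    LoadLogID       INT NOT NULL PRIMARY KEY,
    StepCount       INT NOT NULL,
    TotalRowsLoaded INT NULL,
    FailedSteps     INT NOT NULL
);
GO

-- Keep 90 days of step-level detail; older detail is summarized, then purged.
DECLARE @Cutoff DATETIME2 = DATEADD(DAY, -90, SYSDATETIME());

-- 1. Preserve summary information for runs that finished before the cutoff.
INSERT INTO dbo.EtlStepLogSummary (LoadLogID, StepCount, TotalRowsLoaded, FailedSteps)
SELECT s.LoadLogID,
       COUNT(*),
       SUM(s.RowsLoaded),
       SUM(CASE WHEN s.StepStatus = 'Failed' THEN 1 ELSE 0 END)
FROM   dbo.EtlStepLog AS s
       INNER JOIN dbo.EtlLoadLog AS l
           ON l.LoadLogID = s.LoadLogID
WHERE  l.EndTime < @Cutoff
GROUP BY s.LoadLogID;

-- 2. Remove the step-level detail for those same runs.
DELETE s
FROM   dbo.EtlStepLog AS s
       INNER JOIN dbo.EtlLoadLog AS l
           ON l.LoadLogID = s.LoadLogID
WHERE  l.EndTime < @Cutoff;
```

Filtering on the run’s end time (rather than on individual step timestamps) ensures each load execution is summarized exactly once, after it has fully aged out of the detail window.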
When creating an ETL log retention strategy, the key questions to ask yourself are:
- Will we use ETL log data for troubleshooting tactical issues, or for trend analysis and auditing?
- How much log data are we generating each day/month/year?
- How much does it cost us to store log data?
- How difficult and expensive would it be to archive cold logs to cheaper, slower storage?
- What are the auditing or regulatory requirements for ETL logs?
Conclusion
Creating a proper ETL logging architecture is one of the most basic components of an enterprise data management strategy. Too often, logging is an afterthought, added after everything else is complete or nearly complete. Just as the plumbing in your home is designed in parallel with the rest of the house, your logging strategy should be a central part of your ETL architecture.
ETL logging design will vary from one organization to another, and may even differ from one project to another within the same company. Considering all the variables (what level of logging is required, who will use it, and how long it should be kept) will help shape the design of your ETL logging architecture.