ETL vs. ELT

Cloud-based DWH solutions offer:

  1. Scalable compute.
  2. Scalable storage.
  3. Reduced transfer time (i.e., no more extracting and loading the data every time you want to transform it; the data already lives in the warehouse).

What are the components of a dbt DAG?

  1. Sources: The green nodes in the DAG. Sources represent the raw data tables or views in your data warehouse that serve as the starting point for your transformations.

    version: 2
    
    sources:
      - name: raw_data          # Name of the source (logical grouping of tables)
        database: raw           # Database where the source tables are located
        schema: public          # Schema where the source tables are located
        tables:                 # List of tables in the source
          - name: orders        # Name of the table
            description: "Raw orders data from the e-commerce platform"
            columns:            # List of columns in the table
              - name: order_id
                description: "Unique identifier for each order"
                tests:
                  - unique
                  - not_null
              - name: order_date
                description: "Date the order was placed"
              - name: amount
                description: "Total amount of the order"
                tests:
                  - not_null
          - name: customers     # Another table in the same source
            description: "Raw customer data from the CRM system"
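
    A model can then reference these source tables with dbt's `source()` function, which is also how dbt wires the green source nodes into the DAG. A minimal sketch (the file path and column selection are illustrative):

    -- models/staging/stg_orders.sql
    select
        order_id,
        order_date,
        amount
    from {{ source('raw_data', 'orders') }}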
    
  2. Models: The core building blocks used to transform and structure data in your data warehouse. They are essentially SQL files containing select statements that define how raw data should be transformed into meaningful, reusable datasets for analysis.

    1. Staging Models:
      • These models clean and standardize raw data from source systems.
      • They often perform tasks like renaming columns, casting data types, and filtering out unnecessary rows.
    2. Intermediate Models:
      • These models sit between staging and final models, often performing joins or aggregations to prepare data for downstream use.
    3. Mart Models:
      • These are the final, business-ready models that are directly used for reporting and analysis.
      • Examples include fact tables, dimension tables, or aggregated datasets.
    4. Seed Models:
      • These are static datasets (CSV files stored in your project) that are loaded into the data warehouse with `dbt seed` and can then be referenced like any other model.
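
    These layers chain together through dbt's `ref()` function: a downstream model selects from an upstream one by name, which is also how dbt infers the edges of the DAG. A minimal sketch of a mart model (the file path and model names are illustrative):

    -- models/marts/fct_daily_orders.sql
    select
        order_date,
        count(*) as order_count,
        sum(amount) as total_amount
    from {{ ref('stg_orders') }}
    group by order_date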
  3. Tests: They are used to validate the quality, accuracy, and integrity of your data and transformations. Tests ensure that your data models meet predefined expectations and help catch issues early in the data pipeline.

    version: 2
    
    models:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['pending', 'shipped', 'delivered']
    
    1. Schema Tests:
      • These tests validate the structure and constraints of your data models.
        • Common schema tests include:
          • not_null: Ensures a column does not contain null values.
          • unique: Ensures all values in a column are unique.
          • accepted_values: Ensures a column contains only specified values.
          • relationships: Ensures referential integrity between two models (e.g., foreign key constraints).
    2. Data Tests:
      • These tests validate the content of your data.
      • Data tests are written as SQL queries that return rows when the test fails.
      • Example: Checking if a column contains negative values.
    3. Custom Tests:
      • You can write custom tests using SQL or Jinja templating to address specific business logic or data quality requirements.
    4. Singular Tests:
      • These are one-off tests written as SQL files in the tests directory.
      • They are useful for complex or unique validation scenarios.
    5. Generic Tests:
      • These are reusable tests defined in the macros directory.
      • They can be applied to multiple columns or models.
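
    For example, a singular test is just a select statement that returns the rows violating the expectation; dbt marks the test as failed if the query returns any rows. A sketch of the negative-amount check mentioned above (the file path and model name are illustrative):

    -- tests/assert_no_negative_amounts.sql
    select
        order_id,
        amount
    from {{ ref('orders') }}
    where amount < 0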

Commands:

dbt run → runs your models

dbt test → runs the tests defined on your models and sources (after they have been built)

dbt docs generate → generates documentation for your project

dbt build → runs and tests your models (plus seeds and snapshots) in DAG order
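
Each of these commands also accepts node selection flags such as --select, so you can target a subset of the DAG. A few illustrative invocations (the model and source names are hypothetical):

    dbt run --select stg_orders           # run a single model
    dbt build --select stg_orders+        # build the model and everything downstream of it
    dbt test --select source:raw_data     # run only the tests declared on the raw_data source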