
Databricks x mindit.io on Governance in Action: How Traceability Builds Trust in Data, Analytics, and AI

Trust in data is one of those problems every organization talks about, but few solve at scale. Everyone wants to be “data-driven,” yet decision-making often slows down the moment people start questioning the numbers. Where did this KPI come from? Why does this dashboard show something different than it did last week? Which dataset is the real source of truth?

In the webinar Governance in Action, Vlad Mihalcea (BI Technical Lead at mindit.io) and Eileen Zhang (Senior Solutions Engineer at Databricks Switzerland) broke down a practical view of governance, not as a compliance exercise, but as a platform foundation for transparency, speed, and confidence across the organization.

This article summarizes the key ideas from the session and highlights the mechanisms that help governance scale in real-life environments.

The real problem is not data. It is trust.

Vlad opened with a reality most teams recognize immediately: data comes from many systems, it is transformed multiple times, ownership is often unclear, and definitions change over time. When trust drops, everything slows down. Decisions get delayed. Innovation gets cautious. Delivery gets stuck in endless validation cycles.

The goal of governance, as framed in the session, is to restore confidence through visibility and controls that scale. Not more process. Not more manual checks. A system that helps everyone understand what data exists, how it changes, who can access it, and whether it is reliable.

Why governance is hard to scale

Eileen emphasized that many companies do have governance of some kind, but scaling it across an entire enterprise is where things break. The reasons are familiar:

  • Fragmented data sources: legacy on-prem systems, cloud migrations, and often multi-cloud reality.
  • More than tables: governance is no longer just about relational data. It includes files, unstructured datasets, notebooks, queries, dashboards, and machine learning artifacts.
  • New assets: AI models and feature tables are now part of the data estate and require governance too.
  • Different formats, different tools: modern ecosystems need governance that can unify, not only document.

The conclusion is simple: manual governance does not scale. Anything that depends on people maintaining lineage, definitions, or classifications by hand will eventually lag behind reality.

Governance as a platform layer: the Unity Catalog approach

A core message from the webinar is that governance works best when it is built into the platform rather than layered on top.

Eileen positioned Databricks’ approach as an open, unified data platform where open formats (such as Delta tables and other open table formats) sit at the base, and a unifying governance layer sits above them. In Databricks, that governance layer is Unity Catalog.

Compared to traditional catalogs that focus mostly on access control and auditing for tables, Unity Catalog is presented as a broader governance layer that includes:

  • Data discovery
  • Data lineage
  • Data quality monitoring
  • Business semantics
  • Cost controls
  • Governance for multiple asset types (tables, views, files, notebooks, dashboards, models)
  • Openness through APIs and federation to external systems and catalogs

With the foundation set, the webinar took a deep dive into four practical pillars: lineage, data quality monitoring, classification and governed tags, and attribute-based access control.

1) Data lineage: answering “Where did this number come from?”

Vlad described the “simple question” most organizations cannot answer quickly: Where did this number actually come from? Data typically flows through ingestion, transformations, curated datasets, and semantic models. Without lineage, tracking a single metric across thousands of tables becomes painful and slow.

Eileen explained how Databricks lineage works in a way that avoids one of the biggest problems with lineage in the wild: manual maintenance.

Automated lineage at runtime

A key point: lineage is created automatically based on how workloads actually run on the platform.

  • Spark generates a compute plan for a job or query.
  • The platform logs metadata from execution.
  • A lineage service analyzes those logs and produces lineage automatically.

This means lineage is created dynamically as pipelines and queries run, reducing the risk of gaps caused by outdated documentation.
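
To make this concrete, here is a minimal sketch of what that looks like from a Databricks notebook, assuming Unity Catalog and lineage system tables are enabled; the table names are hypothetical. Nothing lineage-specific is written: an ordinary query produces the lineage edge, which can then be read back from the system.access.table_lineage system table.

    # `spark` is predefined in Databricks notebooks; table names are hypothetical.
    # Step 1: run an ordinary transformation. No lineage-specific code.
    spark.sql("""
        CREATE OR REPLACE TABLE main.gold.daily_revenue AS
        SELECT order_date, SUM(amount) AS revenue
        FROM main.silver.orders
        GROUP BY order_date
    """)

    # Step 2: the upstream/downstream edge appears automatically.
    spark.sql("""
        SELECT source_table_full_name, target_table_full_name, event_time
        FROM system.access.table_lineage
        WHERE target_table_full_name = 'main.gold.daily_revenue'
    """).show(truncate=False)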

Column-level lineage, not just table-level

Many tools stop at table-level lineage. The session highlighted that Unity Catalog can provide column-level lineage, letting teams trace individual fields from ingestion to consumption. This becomes critical in complex reporting environments, where a single business metric might be derived from multiple transformations and joins.

Time-scoped lineage

Another practical detail: lineage can be inspected over specific time windows. For example, you can view dependencies from the last two weeks vs the last year, which helps in investigations, audits, and change tracking.
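
Column-level and time-scoped lineage combine naturally in a single query against the column lineage system table. The sketch below, with hypothetical table and column names, traces the upstream sources of one field over a two-week window:

    # Which upstream columns feed `revenue`, via which tables,
    # over the last 14 days? (Hypothetical names.)
    spark.sql("""
        SELECT source_table_full_name, source_column_name,
               target_column_name, event_time
        FROM system.access.column_lineage
        WHERE target_table_full_name = 'main.gold.daily_revenue'
          AND target_column_name = 'revenue'
          AND event_time >= current_timestamp() - INTERVAL 14 DAYS
    """).show(truncate=False)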

External sources and federation

In Q&A, Eileen clarified that external sources appear in lineage when they are governed through Unity Catalog constructs such as external locations or external catalogs. If data is pulled via arbitrary connections in code outside governed definitions, it may not appear in lineage.

2) Data quality monitoring: lineage shows flow, quality shows trust

Lineage explains how data moves, but it does not guarantee the data is correct, fresh, or complete. This is where the second pillar comes in: data quality monitoring.

Eileen introduced Lakehouse Monitoring features, focusing on two areas:

Data quality monitoring at the schema level

This is positioned as an “easy start” approach:

  • Enable monitoring at the database (schema) level.
  • Get out-of-the-box tracking for freshness and completeness across the tables in that schema.
  • Use it primarily for production databases, because background monitoring introduces cost and should be prioritized where it matters most.

The system looks for expected refresh patterns and volume trends. If ingestion jobs fail, refresh patterns change, or data volumes drop unexpectedly, alerts can be triggered.
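
The built-in monitoring tracks these signals automatically, but the underlying idea of a freshness check is easy to express. As an illustration only, this sketch flags tables in one hypothetical production schema that have not been updated in the last day, using standard information_schema metadata:

    # Tables in main.prod not altered in 24 hours: a rough, manual stand-in
    # for the freshness signal the platform computes natively.
    spark.sql("""
        SELECT table_name, last_altered
        FROM system.information_schema.tables
        WHERE table_catalog = 'main'
          AND table_schema = 'prod'
          AND last_altered < current_timestamp() - INTERVAL 1 DAY
    """).show(truncate=False)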

Data profiling at the table and column level

Profiling goes deeper:

  • Calculates statistics for columns.
  • Detects drift and anomalies, such as sudden null spikes or distribution shifts.
  • Supports multiple table types and workloads, including streaming tables, time series monitoring, feature tables for ML, and inference tables.
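
As a sketch of what enabling a table-level profile monitor can look like with the Databricks Python SDK, assuming a recent databricks-sdk release (the same API surfaced as lakehouse_monitors in older versions); table names and paths are placeholders:

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.catalog import MonitorSnapshot

    w = WorkspaceClient()

    # Create a snapshot-style profile monitor on one (hypothetical) table.
    # Computed profile and drift metrics land in the given output schema.
    w.quality_monitors.create(
        table_name="main.prod.orders",
        assets_dir="/Workspace/Shared/monitoring/main.prod.orders",
        output_schema_name="main.monitoring",
        snapshot=MonitorSnapshot(),
    )

The snapshot profile suits plain tables; the SDK also exposes time-series and inference-log profile types corresponding to the other workload kinds listed above.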

A practical theme emerged here: quality monitoring is not just for engineers. Vlad highlighted that visibility into quality metrics helps business users too, because they can understand reliability through dashboards and trends rather than waiting for engineering confirmation.

3) Automatic classification and governed tags: finding sensitive data at scale

Next, the webinar tackled a governance challenge that often blocks adoption: sensitive data and responsibility.

Eileen described how fear of mishandling personal data creates hesitation and “protectionism,” where teams avoid sharing data because they do not know what is inside it or do not want accountability in case of a breach.

Automated PII detection

The session presented automated classification that scans metadata and sample data to identify sensitive data types, offering out-of-the-box detection categories (like email addresses, phone numbers, IP addresses, and other identifiers). This helps organizations gain visibility into what they hold, which is the first step toward protecting and safely sharing it.

Governed tags

Once identified, data needs consistent labeling. Governed tags support:

  • Standardized classification values (for example: internal, sensitive, commercially sensitive)
  • Better discovery and search
  • Compliance and governance policies
  • Cost attribution and organizational tagging (like cost center or business domain)

The key idea: tags should not be free-form chaos. Standardization is how you scale consistency.
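
In practice, applying a standardized tag and then discovering everything that carries it can look like this minimal sketch (table, column, and tag values are illustrative; governed tags and their allowed values are defined centrally by administrators, but applying and searching them uses the same syntax as regular tags):

    # Label a column with a standardized classification value.
    spark.sql("""
        ALTER TABLE main.silver.customers
        ALTER COLUMN email SET TAGS ('classification' = 'sensitive')
    """)

    # Discover every column across the metastore carrying that tag.
    spark.sql("""
        SELECT catalog_name, schema_name, table_name, column_name, tag_value
        FROM system.information_schema.column_tags
        WHERE tag_name = 'classification'
    """).show(truncate=False)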

4) ABAC: fine-grained control through policies that scale

After classification and tags, the next step is policy enforcement. Eileen walked through Attribute-Based Access Control (ABAC) as a dynamic model where access is determined by attributes of users, resources, and requests.

She described a simple three-step flow:

  1. Columns and tables get tagged (manually or via classification).
  2. Governance administrators define access policies based on those tags.
  3. When a user queries data, the platform applies policies dynamically.

Column masking and row filtering

Two practical forms were highlighted:

  • Column masking: mask sensitive columns (emails, phone numbers, identifiers), either fully or partially.
  • Row filtering: restrict rows based on attributes like geography (for example, restricting EU customer records from a US analyst group).

Together, these cover common enterprise privacy and access requirements without requiring teams to create multiple copies of datasets for different audiences.
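
Here is a minimal sketch of both controls, using hypothetical function, table, and group names; in Unity Catalog, masks and row filters are ordinary SQL functions attached to a column or table:

    # Column mask: members of `pii_readers` see real emails,
    # everyone else sees a redacted value.
    spark.sql("""
        CREATE OR REPLACE FUNCTION main.governance.mask_email(email STRING)
        RETURN CASE WHEN is_account_group_member('pii_readers') THEN email
                    ELSE '***REDACTED***' END
    """)
    spark.sql("""
        ALTER TABLE main.silver.customers
        ALTER COLUMN email SET MASK main.governance.mask_email
    """)

    # Row filter: only members of `eu_analysts` see EU rows.
    spark.sql("""
        CREATE OR REPLACE FUNCTION main.governance.region_filter(region STRING)
        RETURN is_account_group_member('eu_analysts') OR region <> 'EU'
    """)
    spark.sql("""
        ALTER TABLE main.silver.customers
        SET ROW FILTER main.governance.region_filter ON (region)
    """)

Because the functions evaluate at query time, one physical table serves every audience; combined with governed tags, ABAC policies can attach the same controls wherever a tag appears instead of table by table.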

What “lack of trust” looks like in real organizations

One of the most useful parts of the session came from the Q&A: how to recognize when an organization lacks data trust.

Vlad’s signal: spreadsheet proliferation. If many teams recreate the same reporting in Excel or create parallel versions of the truth, trust is broken.

Eileen’s signals:

  • Protectionism: teams hesitate to share data due to fear of responsibility and lack of controls.
  • Duplication: data copied everywhere because no one knows the ground truth or trusts the central dataset.
  • Transparency gaps: unclear lineage and unclear ownership lead to local forks and bloated volumes.

These behaviors do not just add cost. They slow analytics and derail AI initiatives by creating inconsistent inputs and fragmented interpretations.

Who owns data quality?

A common question came up: is data quality the responsibility of data engineers?

Both speakers agreed the answer is evolving. Engineers play a major role, but quality in production becomes a shared concern across teams. Some monitoring belongs closer to data engineering (freshness, completeness, schema expectations). Some belongs to data science or product teams (feature drift, inference drift). In mature organizations, governance functions may also play a role in defining standards and ensuring accountability.

The practical takeaway: platform-level monitoring makes ownership easier to distribute because visibility becomes shared and actionable.

The shift that happens when governance works

Vlad closed with a powerful framing: governance is not about slowing teams down. When governance is built into the platform, teams move faster because uncertainty drops and rework decreases.

When lineage, quality monitoring, classification, and fine-grained controls come together, people stop asking “Is this correct?” and start asking “What can we do with this?”

That is the moment governance becomes a business accelerator.

Final takeaway

Governance becomes impactful when it is practical, automated where possible, and integrated into the platform. The webinar’s message is that traceability and trust are not optional add-ons for modern analytics and AI. They are prerequisites for scale.

If your organization is investing in dashboards, data products, or AI initiatives, the question is not whether governance matters. The question is whether it is built to keep up with reality.

Watch the full webinar

If you want to see the full walkthrough and hear Vlad Mihalcea (mindit.io) and Eileen Zhang (Databricks Switzerland) explain these concepts with real examples and Q&A, watch the complete webinar recording here:

Next steps

/turn your vision into reality

The best way to start a long-term collaboration is with a Pilot project. Let’s talk.