Data Governance & Catalog
Data governance is crucial for ensuring that data remains secure, private, accurate, and usable across an organization. A robust data governance strategy involves several key components, including data cataloging, metadata management, access control, data quality, and data lineage. These elements work together to create a structured approach to managing and utilizing data assets effectively.
Metadata Management
Sparkflows' Own Catalog
Sparkflows offers an integrated catalog for capturing and managing metadata across a wide array of datasets. Whether the data is stored in file formats like CSV, JSON, XML, Parquet, or Avro, or resides in other storage systems such as JDBC databases, Sparkflows' catalog ensures that all relevant metadata is accurately captured and easily accessible.
Integration with External Catalog Systems
Sparkflows integrates with external metadata systems, allowing users to work with data from multiple sources in a unified environment. Supported systems include Hive, JDBC sources, Snowflake, Databricks Unity Catalog, and more. Users can browse these external catalogs and use their datasets directly in Sparkflows workflows.
Access to Data Assets
Access Control
Sparkflows offers granular access control over data assets and the resources around them: distributed storage, databases, Hive, Snowflake, ML models, and compute systems such as Apache Spark clusters. This ensures that each user has appropriate access.
Users, Groups, Roles, and Permissions
Access is managed through a system of users, groups, roles, and permissions, allowing tailored access based on user roles within the organization.
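The relationship between users, groups, roles, and permissions can be sketched as a small data model. This is an illustrative sketch only, not the Sparkflows implementation or API: the class names, the permission strings (such as `datasets.view`), and the rule that a user's effective permissions are the union of the roles attached to their groups are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    name: str
    permissions: set = field(default_factory=set)  # hypothetical permission strings


@dataclass
class Group:
    name: str
    roles: list = field(default_factory=list)


@dataclass
class User:
    name: str
    groups: list = field(default_factory=list)


def effective_permissions(user):
    """Union of the permissions of every role in every group the user belongs to.

    Assumption for this sketch: permissions are purely additive (no deny rules).
    """
    perms = set()
    for group in user.groups:
        for role in group.roles:
            perms |= role.permissions
    return perms


def can(user, permission):
    """True if the user holds the given permission through any group role."""
    return permission in effective_permissions(user)


# Hypothetical roles, groups, and users.
analyst = Role("analyst", {"datasets.view", "workflows.execute"})
admin = Role("admin", {"datasets.view", "datasets.modify", "users.manage"})

data_science = Group("data-science", [analyst])
platform = Group("platform", [admin])

alice = User("alice", [data_science])
bob = User("bob", [platform])

print(can(alice, "workflows.execute"))
print(can(alice, "users.manage"))
```

Modeling permissions on roles rather than on individual users keeps access changes local: moving a user between groups changes what they can do without editing any permission list.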
Projects
Projects in Sparkflows can include datasets, workflows, AutoML experiments, reports, and more. These can be shared across groups, promoting collaboration and resource sharing.
Connections
Sparkflows supports connections to compute systems (e.g., Apache Spark, Kubernetes), storage systems (e.g., SQL and NoSQL databases), and Generative AI platforms (e.g., OpenAI, Bedrock). Access to each connection can be configured at the global, group, or project level.
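One way to picture the three access levels is as a scope attached to each connection. The sketch below is a hypothetical model, not Sparkflows code: the `Connection` class, its field names, and the visibility rule (global connections are visible to everyone, group connections to members of that group, project connections inside that project) are assumptions chosen to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Connection:
    name: str
    kind: str                       # e.g. "compute", "storage", "genai"
    scope: str                      # "global", "group", or "project"
    scope_id: Optional[str] = None  # group or project name; None for global


def visible_connections(connections, user_groups, project):
    """Return the connections a user can see while working in a given project."""
    visible = []
    for conn in connections:
        if conn.scope == "global":
            visible.append(conn)
        elif conn.scope == "group" and conn.scope_id in user_groups:
            visible.append(conn)
        elif conn.scope == "project" and conn.scope_id == project:
            visible.append(conn)
    return visible


# Hypothetical connections at each level.
connections = [
    Connection("shared-spark", "compute", "global"),
    Connection("finance-warehouse", "storage", "group", "finance"),
    Connection("churn-llm", "genai", "project", "churn-model"),
]

names = [c.name for c in visible_connections(connections, {"marketing"}, "churn-model")]
print(names)
```

In this example a user in the hypothetical `marketing` group, working in the `churn-model` project, sees the global and project-level connections but not the one restricted to the `finance` group.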
Data Profiling
Sparkflows offers extensive data profiling capabilities, allowing users to automatically analyze and profile their data with just one click. This process generates valuable insights into the structure and quality of the data, helping users to identify potential issues and areas for improvement.
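To make concrete the kind of insight a profile provides, here is a minimal sketch of column-level profiling using only the Python standard library. It does not reproduce Sparkflows' profiler, whose exact metrics are not described here. The sample data, the `profile` function, and the chosen statistics (row count, missing values, distinct values, and min/max/mean for numeric columns) are assumptions for illustration.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data with a few missing values.
SAMPLE_CSV = """id,age,city
1,34,Austin
2,,Boston
3,29,Austin
4,41,
"""


def profile(rows):
    """Compute simple per-column statistics for a list of dict rows."""
    columns = list(rows[0].keys()) if rows else []
    report = {}
    for col in columns:
        values = [row[col] for row in rows]
        present = [v for v in values if v not in ("", None)]
        stats = {
            "count": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # Treat the column as numeric only if every present value parses as a float.
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = []
        if numbers:
            stats.update(min=min(numbers), max=max(numbers), mean=mean(numbers))
        report[col] = stats
    return report


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```

Even this small profile surfaces the issues a user would want to know about before building a workflow: the `age` and `city` columns each have one missing value, and `city` has only two distinct values.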
Data Lineage
Sparkflows tracks the movement and transformations of data throughout the system, integrating with OpenLineage for comprehensive visibility and compliance.
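OpenLineage describes lineage as run events: JSON documents that name a job, a run of that job, and the datasets it read and wrote. The sketch below builds such an event by hand to show its general shape. It is not how Sparkflows emits lineage, which is not detailed here. The namespaces, job name, dataset names, and producer URI are hypothetical, and the spec version pinned in `schemaURL` is an assumption that should be checked against the OpenLineage specification in use.

```python
import json
import uuid
from datetime import datetime, timezone

VALID_EVENT_TYPES = {"START", "RUNNING", "COMPLETE", "ABORT", "FAIL", "OTHER"}


def build_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a dict shaped like an OpenLineage run event.

    inputs and outputs are lists of (namespace, name) pairs identifying datasets.
    """
    if event_type not in VALID_EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        # Hypothetical job namespace, for illustration only.
        "job": {"namespace": "example-namespace", "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Hypothetical producer URI identifying the emitting system.
        "producer": "https://example.com/hypothetical-producer",
        # Assumed spec version; verify against the OpenLineage spec you target.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event(
    "COMPLETE",
    "clean_orders",
    inputs=[("s3://example-bucket", "raw/orders.csv")],
    outputs=[("s3://example-bucket", "clean/orders.parquet")],
)
print(json.dumps(event, indent=2))
```

Because every event names its input and output datasets, a lineage backend can stitch events from many runs into a graph that shows where each dataset came from and what depends on it.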
Data Quality Management
Sparkflows lets users define and manage data quality rules, generate quality scores, and monitor how those scores trend over time. Thresholds can be set to trigger notifications, so that quality problems are caught early rather than discovered downstream.
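The interplay of rules, scores, and thresholds can be sketched in a few lines. This is an illustration under stated assumptions, not Sparkflows' scoring method: the rule names, the definition of the score as the percentage of passing (row, rule) checks, and the `check_threshold` helper that stands in for a notification are all hypothetical.

```python
# Hypothetical rules: each maps a row (a dict) to True when the row passes.
RULES = {
    "age_not_null": lambda row: row.get("age") is not None,
    "age_in_range": lambda row: row.get("age") is not None and 0 <= row["age"] <= 120,
}


def quality_score(rows, rules):
    """Percentage of (row, rule) checks that pass; 100.0 when there is nothing to check."""
    total = len(rows) * len(rules)
    if total == 0:
        return 100.0
    passed = sum(1 for row in rows for rule in rules.values() if rule(row))
    return 100.0 * passed / total


def check_threshold(score, threshold):
    """Return an alert message if the score falls below the threshold, else None.

    A real system would send a notification here; this sketch only returns text.
    """
    if score < threshold:
        return f"ALERT: quality score {score:.1f} is below threshold {threshold:.1f}"
    return None


rows = [{"age": 34}, {"age": None}, {"age": 150}, {"age": 29}]
score = quality_score(rows, RULES)
alert = check_threshold(score, threshold=90.0)
print(score)
print(alert)
```

Recording the score from each run, rather than only the latest one, is what makes trend monitoring possible: a gradual decline can be flagged before the score crosses the alert threshold.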