Sparkflows integrates Great Expectations, an open-source Python library, through a set of nodes for implementing data quality checks in your workflow.
Using Great Expectations for data quality checks involves a simple workflow:
- Read your dataset in the desired format.
- Connect the read node to a GE data quality node of your choice, such as "Expect Column values to be null".
- When connecting nodes, remember that the lower edge provides the original DataFrame, while the higher edge provides the output DataFrame.
- Some Great Expectations nodes have a configuration called "mostly". If left empty, it is treated as 1, meaning 100% of the values in the configured columns must meet the expectation for the check to succeed. If it is set to 0.8, for example, the check passes as long as at least 80% of the values meet the expectation.
- Chain these Great Expectations nodes together to check for data quality issues in your dataset. The lower edge (original DataFrame) is passed on to the next nodes for further data quality checks, while the higher edge holds the output DataFrame.
- You can even create a CSV file with all your results and use the GE Decision node to determine if your data quality checks passed or failed. The results of these checks also appear under the data quality tab, providing a summary of the tests conducted.
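
To make the "mostly" setting concrete, here is a minimal sketch in plain Python of how such a threshold can be evaluated. It is an illustration only, not the Sparkflows or Great Expectations implementation: the helper `passes_mostly`, the sample `ages` column, and the choice to let an empty column pass are all assumptions made for this example.

```python
from typing import Callable, Iterable, Optional


def passes_mostly(
    values: Iterable[object],
    predicate: Callable[[object], bool],
    mostly: Optional[float] = None,
) -> bool:
    """Hypothetical helper: True when the passing fraction reaches `mostly`."""
    # An empty "mostly" is treated as 1, i.e. every value must pass.
    threshold = 1.0 if mostly is None else mostly
    items = list(values)
    if not items:
        return True  # assumption for this sketch: an empty column passes
    passed = sum(1 for value in items if predicate(value))
    return passed / len(items) >= threshold


# Sample column with one missing value out of ten (made-up data).
ages = [34, 29, None, 41, 52, 38, 27, 45, 31, 36]


def is_not_null(value: object) -> bool:
    return value is not None


strict = passes_mostly(ages, is_not_null)        # mostly left empty, treated as 1
relaxed = passes_mostly(ages, is_not_null, 0.8)  # at least 80% must pass

print(strict, relaxed)
```

With nine of the ten values present, the strict check fails while the check with "mostly" set to 0.8 passes, which mirrors the behavior described for the node configuration above.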