The following nodes in Sparkflows can help with Data Profiling:
Correlation - It displays the relationships between features, including those between the independent features and the dependent (target) feature. The relationships are plotted as a heatmap.
Summary - It calculates and prints summary statistics for each feature, such as Count, Mean, Min, and Max.
ML Model nodes - Various ML Model nodes can also give insight into the importance of each feature.
Flag Outlier - It flags outliers in the dataset.
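To make the profiling steps above concrete, here is a minimal plain-Python sketch of the same ideas: summary statistics, a pairwise correlation, and outlier flagging. It is not the Sparkflows implementation. The function names, the sample columns, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration.

```python
import math
import statistics


def summarize(values):
    """Return basic summary statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical sample columns, not taken from any real dataset.
age = [23, 25, 31, 35, 40, 42, 95]
income = [30, 32, 41, 48, 55, 57, 120]

stats = summarize(age)        # count, mean, min, max, stddev
corr = pearson(age, income)   # one cell of a correlation heatmap
flags = flag_outliers(age)    # True marks a suspected outlier
```

A correlation heatmap is this coefficient computed for every pair of columns and rendered as a colored grid.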
The following nodes in Sparkflows can help with Data Cleansing:
Imputing - Various imputing nodes are available to handle missing values. With these nodes, missing values can be replaced with either a constant or the Mean, Median, or Mode of the column.
Dedup - It resolves duplicate records that refer to the same entity.
Drop Duplicate Rows - It removes duplicate rows from the dataset.
Null Value Handling - Various nodes are available to handle null values in the dataset.
Find And Replace - Various nodes are available to remove unwanted characters, replace one string pattern with another, and so on.
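The cleansing operations above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not how the Sparkflows nodes are implemented. The helper names and sample data are hypothetical, and missing values are assumed to be represented as None.

```python
import re
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    fill = statistics.mean(present)
    return [fill if v is None else v for v in values]


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def find_and_replace(text, pattern, replacement):
    """Replace every regex match of `pattern` in `text`."""
    return re.sub(pattern, replacement, text)


# Hypothetical sample data.
ages = [20, None, 40]
rows = [("alice", 20), ("bob", 30), ("alice", 20)]

imputed = impute_mean(ages)           # the None is filled with the mean
deduped = drop_duplicate_rows(rows)   # the second ("alice", 20) is dropped
cleaned = find_and_replace("ID#00-42!", r"[^0-9A-Za-z]", "")  # strip symbols
```

Swapping statistics.mean for statistics.median or statistics.mode gives the other imputation strategies mentioned above.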
The following nodes in Sparkflows can help with Feature Engineering:
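String Indexer - It encodes string categorical data as numeric indices.
Min Max Scaler And Standard Scaler - They rescale incoming numeric data. Min Max Scaler maps each feature to a fixed range (typically 0 to 1), while Standard Scaler standardizes each feature to zero mean and/or unit standard deviation.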
Feature Extraction nodes
Feature Transformation nodes
Feature Selection nodes
Splitting Dataset nodes
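As a rough illustration of the indexing and scaling steps listed above, the following plain-Python sketch shows one way they can work. It is not the Sparkflows implementation. The function names and sample data are hypothetical, and ordering categories by descending frequency is an assumption modeled on the default behavior of Apache Spark's StringIndexer.

```python
import statistics
from collections import Counter


def string_index(values):
    """Map each category to an integer index, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically so the mapping is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: idx for idx, label in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values into the range [low, high]."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # A constant column has no spread; place it mid-range.
        return [(low + high) / 2 for _ in values]
    return [low + (v - v_min) * (high - low) / span for v in values]


def standard_scale(values):
    """Center to zero mean and scale to unit (population) std deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical sample columns.
colors = ["red", "blue", "red", "green", "red", "blue"]
heights = [10, 20, 30]

indexed, mapping = string_index(colors)
scaled_01 = min_max_scale(heights)
standardized = standard_scale(heights)
```

Min-max scaling preserves the shape of the distribution within a bounded range, whereas standardization is usually preferred when a model assumes features are centered with comparable spread.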