When building data pipelines, each node or step in the pipeline works off the schema of the data available to it.
Because each step transforms the schema, the pipeline can become hard to manage and update.
Sparkflows makes it easy to understand the schema transformations. Each node in the workflow provides the details of the schema transforms it makes.
Example of Schema Propagation
Below is a workflow for predicting spam in SMS messages. The StringIndexer, Tokenizer, HashingTF, and IDF nodes each update the incoming schema.
Below we see how the schema is transformed from one node to the next.
StringIndexer adds the column 'spam_idx'
Tokenizer adds the column 'tok'
HashingTF adds the column 'tf'
IDF adds the column 'idf'
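The column additions above can be sketched as a chain of schema transformations. The snippet below is a hypothetical plain-Python simulation, not Sparkflows or Spark ML code; the input column names and data types are assumptions made for illustration:

```python
# Hypothetical sketch: each node maps an input schema to an output
# schema by appending the column it produces.
def string_indexer(schema):
    # StringIndexer adds the numeric index column 'spam_idx'
    return {**schema, "spam_idx": "double"}

def tokenizer(schema):
    # Tokenizer adds the array-of-strings column 'tok'
    return {**schema, "tok": "array<string>"}

def hashing_tf(schema):
    # HashingTF adds the term-frequency vector column 'tf'
    return {**schema, "tf": "vector"}

def idf(schema):
    # IDF adds the weighted vector column 'idf'
    return {**schema, "idf": "vector"}

# Propagate an assumed starting schema through the pipeline,
# showing the columns available after each node.
schema = {"label": "string", "message": "string"}
for node in (string_indexer, tokenizer, hashing_tf, idf):
    schema = node(schema)
    print(node.__name__, "->", list(schema))
```

Viewing the column list after each node, as Sparkflows does, makes it clear which downstream nodes can consume which columns.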
As we see above, the ability to view the schema at each step of the workflow significantly simplifies building and modifying the workflow.
Schema and Fields in the Dialog
Sparkflows has widgets that allow users to select one or more columns from the incoming schema. For example, the modeling nodes generally have a field that specifies the label column, and VectorAssembler has a field that allows the user to select a list of input columns.
Sparkflows provides intelligence about the data types of the fields that are displayed. For example, in the case of VectorAssembler, only the numeric columns are applicable; columns of other types, such as String, should not be displayed to the user.
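This type-aware filtering can be sketched as follows. The snippet is a hypothetical illustration, not the Sparkflows implementation; the set of numeric type names and the sample schema are assumptions:

```python
# Hypothetical sketch: keep only the columns whose data type is numeric,
# so a dialog such as VectorAssembler's can offer just those columns.
NUMERIC_TYPES = {"int", "bigint", "float", "double", "decimal"}

def numeric_columns(schema):
    """Return the names of columns whose data type is numeric."""
    return [name for name, dtype in schema.items() if dtype in NUMERIC_TYPES]

# Assumed schema for a housing-price dataset (illustrative only).
housing_schema = {
    "id": "string",
    "sqft": "double",
    "bedrooms": "int",
    "neighborhood": "string",
    "price": "double",
}

print(numeric_columns(housing_schema))
```

Under these assumptions, only sqft, bedrooms, and price would be shown in the dialog, while the String columns are filtered out.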
Below is an example from Housing Price Prediction. We see that VectorAssembler displays only the numeric columns and ignores the rest.