Alteryx to Sparkflows migration
Overview
Sparkflows is a cloud-based data science platform that provides an end-to-end solution for building, deploying, and managing data science workflows. Sparkflows users can build and deploy machine learning models without writing code using its visual workflow designer and run them on Spark clusters, making it ideal for big data applications. These capabilities, together with its rich ML model-building features, make Sparkflows a strong alternative to Alteryx for performing data analytics tasks in an automated way.
Consider this document a quick guide to migrating from Alteryx to Sparkflows. Its intended audience is anyone looking for a solid alternative to Alteryx.
What you will accomplish
Get first-hand knowledge of the processors, or nodes, that Sparkflows offers as counterparts to the various Alteryx Tools for reading, writing, and manipulating data and for documenting workflows.
Have a glance at the processors Sparkflows offers for working with machine learning models.
Become familiar with Sparkflows' powerful visualization capabilities, offered through its wide range of nodes.
Become familiar with the steps to schedule jobs in Sparkflows using its built-in Scheduler feature.
Migration
Read Data from Local and Remote Data Sources
These nodes are needed to read data from local data sources and remote sources such as S3 buckets. The mapping below depicts the same.
In Alteryx, one can read data from files using different configurations of the Input Data Tool.
In Sparkflows, the same can be achieved using the specific read nodes.
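Sparkflows read nodes are configured visually, but they run on Spark underneath. As a rough sketch of the equivalent operation in PySpark (the file paths and bucket name here are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-example").getOrCreate()

# Read a CSV file from the local file system (hypothetical path)
local_df = spark.read.option("header", "true").csv("file:///data/customers.csv")

# Read the same format from an S3 bucket (hypothetical bucket; requires
# the hadoop-aws package and AWS credentials to be configured)
s3_df = spark.read.option("header", "true").csv("s3a://my-bucket/customers.csv")

local_df.show(5)
```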
Read Data from Databases
These nodes are needed to read data from databases. The mapping below depicts the same.
In Alteryx, one can read data from a database using different configurations of the Input Data Tool.
In Sparkflows, the same can be achieved via specific connector nodes.
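As an illustration of what a database connector node does, here is a minimal PySpark JDBC read; the connection details are hypothetical, and the matching JDBC driver jar must be on the Spark classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

# Read a table over JDBC (hypothetical connection details)
orders_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "analyst")
    .option("password", "secret")
    .load()
)
orders_df.printSchema()
```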
Classification algorithms
These nodes are needed to solve classification use cases. The mapping below depicts the same.
In Alteryx, one can perform classification using the configurations of the Classification Tool.
In Sparkflows, the same can be achieved using the specific classifier nodes.
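To make the mapping concrete, here is a minimal Spark ML classification sketch of the kind a classifier node encapsulates; the toy data and column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("classification").getOrCreate()

# Hypothetical training data: two numeric features and a binary label
train_df = spark.createDataFrame(
    [(1.0, 0.5, 0), (2.0, 1.5, 0), (3.0, 3.5, 1), (4.0, 4.5, 1)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into a single vector, then fit the classifier
assembled = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(train_df)
model = RandomForestClassifier(labelCol="label").fit(assembled)
model.transform(assembled).select("label", "prediction").show()
```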
Regression algorithms
These nodes are needed to solve regression use cases. The mapping below depicts the same.
In Alteryx, one can perform regression using the configurations of the regression tools.
In Sparkflows, the same can be achieved using the specific regression nodes.
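For comparison, a minimal Spark ML regression sketch (toy data and made-up column names):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("regression").getOrCreate()

# Hypothetical data: predict y from a single feature x
df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "y"])
features = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

# Fit a linear model and inspect its learned parameters
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients, model.intercept)
```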
Clustering algorithms
These nodes are needed to solve unsupervised clustering use cases. The mapping below depicts the same.
In Alteryx, one can create clusters on data using the clustering tool.
In Sparkflows, clusters can be created using the specific clustering nodes.
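A minimal Spark ML K-means sketch of what a clustering node does (toy data for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("clustering").getOrCreate()

# Hypothetical unlabeled data with two numeric features
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)], ["f1", "f2"]
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

# Group the points into two clusters
model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("f1", "f2", "prediction").show()
```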
Neural network algorithms
These nodes are needed to solve regression as well as classification use cases using deep learning. The mapping below depicts the same.
In Alteryx, one can build deep learning models using the Neural Network Tool.
In Sparkflows, the same can be achieved using the Keras, H2O, and Spark neural network nodes.
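As a rough analogue of the Keras compile/fit/predict workflow, a small Keras network on synthetic data (the architecture and data are made up for illustration):

```python
import numpy as np
from tensorflow import keras

# Hypothetical binary classification data
X = np.random.rand(100, 4)
y = (X.sum(axis=1) > 2.0).astype(int)

# Define, compile, and fit a small feed-forward network
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[:3]))
```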
Time series algorithms
These nodes are needed to solve time series use cases. The mapping below depicts the same.
In Alteryx, one can build time series models using the ARIMA Tool.
In Sparkflows, time series models can be created using any of the corresponding time series nodes.
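For instance, the kind of ARIMA fit-and-forecast such a node performs looks roughly like this in statsmodels (the series is synthetic):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly series
series = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
    index=pd.date_range("2023-01-01", periods=10, freq="MS"),
)

# Fit an ARIMA(1, 1, 1) model and forecast the next three periods
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=3))
```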
Scoring and evaluation of the models built
These nodes are needed to evaluate the regression, classification, and k-means models built. The mapping below depicts the same.
In Alteryx, one can evaluate a model using the Score Tool.
In Sparkflows, there are score and evaluate nodes for all the models that can be built.
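As an example of what an evaluate node computes, here is a minimal Spark ML evaluation sketch (the scored data is made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("evaluation").getOrCreate()

# Hypothetical scored output: a raw prediction score plus the true label
scored_df = spark.createDataFrame(
    [(0.9, 1.0), (0.2, 0.0), (0.7, 1.0), (0.4, 0.0)],
    ["rawPrediction", "label"],
)

# Area under the ROC curve for the scored data
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC"
)
print(evaluator.evaluate(scored_df))
```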
Write Data to Local and Remote Data Sources
These nodes are needed to write data to local data sources and remote sources such as S3 buckets. The mapping below depicts the same.
In Alteryx, one can write data to local and remote sources using the Output Data Tool.
In Sparkflows, the same can be achieved using the specific save nodes.
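A minimal PySpark sketch of what the save nodes do (hypothetical output paths):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-example").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Write to the local file system as Parquet (hypothetical path)
df.write.mode("overwrite").parquet("file:///output/results")

# Write to an S3 bucket (hypothetical bucket; requires the hadoop-aws
# package and AWS credentials to be configured)
df.write.mode("overwrite").parquet("s3a://my-bucket/results")
```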
Write Data to Databases
These nodes are needed to write data to databases. The mapping below depicts the same.
In Alteryx, one can write data to a database using different configurations of the Write Data In-DB Tool.
In Sparkflows, the same can be achieved via specific connector output nodes.
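A minimal PySpark sketch of the JDBC write a connector output node performs (hypothetical connection details):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Append the DataFrame to a table over JDBC (hypothetical connection details)
(
    df.write.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/sales")
    .option("dbtable", "public.results")
    .option("user", "analyst")
    .option("password", "secret")
    .mode("append")
    .save()
)
```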
Manipulate Data using Aggregate Function
These nodes are needed to perform various aggregate functions on data. The mapping below depicts the same.
In Alteryx, one can group, count, sum, concatenate, etc. the data using the Summarize Tool.
In Sparkflows, the same can be achieved via the specific Group, Add-Column, and similar nodes.
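The grouping, counting, summing, and concatenation that the Summarize Tool performs map naturally onto a PySpark groupBy/agg; a minimal sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregate").getOrCreate()
df = spark.createDataFrame(
    [("east", 10), ("east", 20), ("west", 5)], ["region", "sales"]
)

# Group by region, then count, sum, and concatenate within each group
df.groupBy("region").agg(
    F.count("*").alias("rows"),
    F.sum("sales").alias("total_sales"),
    F.concat_ws(",", F.collect_list(F.col("sales").cast("string"))).alias("all_sales"),
).show()
```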
Manipulate Data using String Function
These nodes are needed to transform column data using various string functions. The mapping below depicts the same.
In Alteryx, one can manipulate string data using the Formula Tool and the Text To Columns Tool.
In Sparkflows, the same can be achieved via various string-manipulation nodes.
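A small PySpark sketch of typical string manipulations, including the split that mirrors the Text To Columns idea (made-up data):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("strings").getOrCreate()
df = spark.createDataFrame([("john doe",), ("jane roe",)], ["full_name"])

# Uppercase a column and split it into separate columns
df.select(
    F.upper("full_name").alias("upper_name"),
    F.split("full_name", " ").getItem(0).alias("first_name"),
    F.split("full_name", " ").getItem(1).alias("last_name"),
).show()
```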
Manipulate Data using Sample Function
These nodes are needed to create samples when dealing with large amounts of data. The mapping below depicts the same.
In Alteryx, one can perform sampling using the Sample Tool and the Random % Sample Tool.
In Sparkflows, the same can be achieved via the specific Sample and Partition nodes.
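A minimal PySpark sketch of sampling and partitioning (the fractions are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sampling").getOrCreate()
df = spark.range(1000)

# Take an approximately 10% random sample without replacement
sample_df = df.sample(withReplacement=False, fraction=0.1, seed=42)

# Or split the data into train/test partitions
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
print(sample_df.count(), train_df.count(), test_df.count())
```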
Manipulate Data using Join Function
These nodes are needed to perform join function on the data. The mapping below depicts the same.
In Alteryx, one can join data using the Join Tool and the Join Multiple Tool.
In Sparkflows, the same can be achieved via the specific Join nodes.
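A minimal PySpark sketch of the joins behind these nodes (made-up data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("joins").getOrCreate()

customers = spark.createDataFrame([(1, "Ann"), (2, "Bob")], ["id", "name"])
orders = spark.createDataFrame([(1, 250), (1, 90), (3, 40)], ["id", "amount"])

# An inner join keeps only matching rows; a left join keeps all customers
customers.join(orders, on="id", how="inner").show()
customers.join(orders, on="id", how="left").show()
```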
Documenting Workflows in Alteryx Vs Sparkflows
Document your workflows with Sticky Notes or Notes nodes, and by renaming nodes in the Sparkflows workflow editor, much as you would with the Alteryx Documentation Tool.
Modeling and Machine Learning in Sparkflows
Machine learning is a subfield of artificial intelligence that involves the development of algorithms and statistical models that allow systems to automatically improve their performance with experience. Alteryx offers a Machine Learning Tool with limited features for working with ML models, whereas Sparkflows has a number of processors, or nodes, that enable both supervised and unsupervised use cases at scale. Sparkflows supports a variety of ML engines, such as Apache Spark ML, H2O, PyCaret, XGBoost, CatBoost, AdaBoost, Scikit-learn, Prophet, ARIMA, Keras, TensorFlow, and Statsmodels. Let’s have a look at them!
Sparkflows Predictor, Scorer, and Forecast nodes such as Spark Predict, Keras Predict, H2O Score, ARIMA Forecast, etc. let you predict, forecast, and score data using various kinds of models, e.g., ARIMA models, VAR models, and so on.
Work with ease on tree-based models for solving various regression and classification problems using a wide range of regression and classification nodes such as the Spark Decision Tree Regression and Spark Decision Tree Classification nodes, the Sklearn Random Forest Regression and Random Forest Classification nodes, the H2O Gradient Boosting Machine node, etc.
Work on Regression use cases using regression nodes offered in Sparkflows such as Spark Linear Regression, Spark GBT Regression, Sklearn Ridge Regression, Sklearn Lasso Regression, H2O Distributed Random Forest, H2O XGBoost and many more.
Work on classification use cases with the Spark and Sklearn Logistic Regression nodes to predict a categorical response, the Spark Random Forest and H2O Distributed Random Forest nodes to combine various trees and reduce the risk of overfitting, the Spark GBT Classifier and Sklearn Gradient Boosting Classifier nodes to train decision trees that minimize a loss function, and Spark Naïve Bayes to apply Bayes' theorem.
Work on clustering use cases with the H2O K-means and Spark K-means nodes to group similar data points together. There are various other nodes to help you solve clustering problems.
If you want to work with deep learning models such as neural networks, Sparkflows offers nodes like Keras Model Compile to define the loss function, Keras Model Fit to optimize your model parameters based on the training data, Keras Predict to make predictions once the model has been trained, and many more.
Evaluate your ML Model’s performance using a wide range of Evaluation nodes offered in Sparkflows like Spark Binary Classification Evaluator, Sklearn Regression Evaluator, H2O Clustering Evaluator, Spark Multiclass Classification Evaluator, H2O Binary Classification Evaluator and so on.
Visualization Capabilities of Sparkflows
Sparkflows supports powerful data visualization through a number of visualization nodes, listed below:
Box Plot node that can be used to represent variation of data between series as a box plot (see the sketch after this list).
Gauge node that can be used to represent data for different categories as a gauge plot.
Bubble chart node that can be used to represent variation of data between series as a Bubble Chart.
Graph Values node that can be used to represent variation between a pair of data series in graphical format.
Graph Group By Column node that can be used to represent counts of different groups of data in graphical format.
Graph SubPlots node that can be used to represent variation between multiple pairs of data series in one go in graphical format.
Print Rich Text node that can be used to print the output in rich format text. This node offers a variety of common formatting options, such as bold, color, italics, etc.
Print N Rows node that can be used to print the first N rows of the incoming DataFrame. The number of rows to print can be configured in the node.
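These nodes render their charts in the Sparkflows UI. As a rough analogue, the kind of chart the Box Plot node produces could be drawn in Python like this (synthetic data for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: one series per category
rng = np.random.default_rng(42)
series_a = rng.normal(50, 10, 200)
series_b = rng.normal(60, 15, 200)

# A box plot comparable to the output of the Box Plot node
plt.boxplot([series_a, series_b], labels=["Series A", "Series B"])
plt.title("Variation between series")
plt.show()
```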
Scheduling Jobs in Sparkflows
Sparkflows makes it extremely easy to schedule workflows and pipelines with its built-in Scheduler feature. The steps to schedule workflows and pipelines are listed below:
Open the Sparkflows web interface and navigate to the Workflows Page.
Click on the Action Menu (located adjacent to a particular workflow) to find the Schedule option to reach the Schedules Page.
Click on the Schedule New Jobs for Workflows option at the top right to reach the Scheduling Window.
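In the Scheduling Window, define when the workflow should run. Assuming the scheduler accepts standard cron-style expressions, an entry such as 0 2 * * * would run the workflow daily at 2:00 AM.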