Users today increasingly want to write and use native Python libraries when running on Big Data platforms. Python adoption is huge, and enabling Python on any such platform has become essential.
Overview
Sparkflows Fire now provides deep integration with Python. Apache Spark provides pipe for streaming RDDs to external processes. Sparkflows Fire takes advantage of this capability and builds upon it to make it seamless for users.
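As a minimal, Sparkflows-independent sketch of that primitive, pipe() in PySpark streams each element of an RDD to an external command's stdin and turns the command's stdout lines into a new RDD (the command cat is used here purely as an echo for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-sketch").getOrCreate()
sc = spark.sparkContext

# Each element is written as one line to the external command's stdin;
# every line the command writes to stdout becomes an element of the result.
rdd = sc.parallelize(["1,toyota", "2,honda", "3,ford"])
piped = rdd.pipe("cat")  # 'cat' simply echoes the lines back unchanged

print(piped.collect())   # ['1,toyota', '2,honda', '3,ford']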
More details are available here:
Seamless Integration with Python
Sparkflows Fire not only streams the data out to Python in its Pipe Python node, but also passes the schema into Python and receives the schema back from Python.
Workflow
Below is a workflow which reads in the cars dataset, processes it with Python, and finally prints the data received back from Python.
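The workflow is assembled in the Fire UI, but conceptually the Spark side performs steps along these lines. This is only a rough sketch under assumed names: the file path data/cars.csv, the script name process.py, and the use of comma-separated lines are illustrative, not Fire's actual generated code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cars-python-pipe").getOrCreate()

# 1. Read the cars dataset (the path is hypothetical).
cars = spark.read.csv("data/cars.csv", header=True, inferSchema=True)

# 2. Stream the rows to an external Python script as CSV lines
#    (the script name is hypothetical).
lines = cars.rdd.map(lambda row: ",".join(str(v) for v in row))
piped = lines.pipe("python process.py")

# 3. Split the returned lines back into columns; in Fire, the output schema
#    written by the script's footer would be applied here to build a DataFrame.
rows = piped.map(lambda line: line.split(","))

# 4. Print the first few rows (the role of the PrintNRows node).
for r in rows.take(10):
    print(r)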
Python Code
Sparkflows Fire allows splitting the Python code into three parts: header, body and footer.
The header reads the incoming data line by line and puts it into a data structure to be processed by the body. In this case it creates a Pandas DataFrame.
The body takes the Pandas DataFrame, processes it, and stores the result in another Pandas DataFrame. In this case no transformations are applied.
The footer takes the output Pandas DataFrame and writes its contents back to Spark. It also writes the output schema to a file; this schema is picked up by the Spark code and applied to the data read back from Python.
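A rough sketch of what the three parts might look like together is shown below. The column names and the schema file name fire_output_schema.json are assumptions for illustration; the actual Sparkflows template may differ.

# --- Header: read incoming lines from stdin into a Pandas DataFrame ---
import sys
import json
import pandas as pd

lines = [line.rstrip("\n").split(",") for line in sys.stdin]
df = pd.DataFrame(lines, columns=["id", "make", "model", "mpg"])  # example columns

# --- Body: transform the DataFrame; here it is passed through unchanged ---
out_df = df

# --- Footer: write the rows back to Spark via stdout and dump the output schema ---
for _, row in out_df.iterrows():
    print(",".join(str(v) for v in row))

with open("fire_output_schema.json", "w") as f:
    json.dump({"columns": list(out_df.columns)}, f)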
Executing the Workflow
Below is the output printed by the PrintNRows processor.
Python Code Syntax Validation
When writing code inside the workflow, syntax validation becomes especially important. Sparkflows Fire provides smooth syntax validation.
Below is the error displayed when a syntax error is introduced into the Python code.
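Fire's internal implementation is not shown here, but Python's standard library makes this kind of check easy to reproduce; a minimal sketch using compile():

def validate_python_syntax(code: str):
    """Return (True, None) if the code parses, else (False, an error message)."""
    try:
        compile(code, "<user code>", "exec")
        return True, None
    except SyntaxError as e:
        return False, f"line {e.lineno}: {e.msg}"

ok, err = validate_python_syntax("for row in df\n    print(row)")  # missing colon
print(ok, err)  # False line 1: ...  (exact message varies by Python version)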
Conclusion
Sparkflows Fire provides a very powerful Big Data integration with Python, and the possibilities it opens up are vast. It allows us to run arbitrary Python code on big data, use libraries like Pandas, perform machine learning in parallel, and more.
It makes this seamless by providing schema inference and syntax validation, and by running the code in a distributed fashion without any user intervention.