Netflix has open sourced Metaflow, a Python library for building and managing data science projects. Its source code is available on GitHub and it is ready for public use. Metaflow was originally developed internally by Netflix's Machine Learning Infrastructure team to boost data science productivity.
It is powered by the AWS cloud and ships with built-in integrations for storage and machine learning services.
Metaflow shares a number of similarities and concepts with projects such as Apache Airflow and Luigi, which developers may have experience with. For instance, with Metaflow, you can create and execute DAGs (directed acyclic graphs).
Code structure
One of Metaflow's key features is its ability to automatically snapshot all of your code, data, and dependencies to Amazon S3.
With this, users can pick up and inspect previous workflows exactly where they left off. Using the resume
command, you can restart a past run from a failed step, add new steps, and alter code.
Let’s have a look at the code structure itself. From the documentation:
Metaflow follows the dataflow paradigm which models a program as a directed graph of operations. This is a natural paradigm for expressing data processing pipelines, machine learning in particular.
We call the graph of operations a flow. You define the operations, called steps, which are nodes of the graph and contain transitions to the next steps, which serve as edges.
External Python libraries
Python has become one of the standard languages for data science, and with it comes a wealth of commonly used libraries. Metaflow allows importing external libraries and supports all common machine learning frameworks; per-step dependencies can be managed with the @conda decorator.
You can use all of the data science libraries that you are already familiar with, such as PyTorch and TensorFlow. Simply write your models using idiomatic Python code.
AWS cloud
Over the years, Netflix has been one of AWS's biggest customers, serving most member traffic from AWS-backed infrastructure. It is no surprise, then, that Metaflow supports Amazon Web Services as its remote backend, building on its tried and true services.
The documentation lists the current integrations, along with two more planned for the future.
With its powerful built-in S3 client, Metaflow can load data at up to 10 Gbps. No code changes are required to run on AWS.
Read the documentation regarding Metaflow on AWS.
If you do not have an AWS account, Metaflow provides a hosted sandbox environment for testing. You can use the sandbox to try out the tutorials and evaluate computation with AWS Batch. Request access to sandbox mode here. Only a limited number of sandboxes can exist at the same time, so you will be added to a waitlist after signing in with your GitHub account.
Get started
Test it out with the introductory tutorials. Start with Metaflow's local-machine capabilities, then graduate to AWS configuration and monitoring flows.
You can install Metaflow from PyPI using:
pip install metaflow
Visit the Metaflow site and learn more about its uses. View the roadmap and see what features might become open source next. (Potential candidates include support for R and a Metaflow Slack bot.)
A community chat exists on Gitter for all your questions and discussions.
The post Netflix open sources Metaflow, a Python library with AWS integration appeared first on JAXenter.