Using Omnipy for developing and deploying data flows (Part 2 - Intermediate level)

Learn to make use of the new and powerful Omnipy Python library to develop rerunnable and scalable ETL data flows that Extract research data from various sources, Transform the data in clearly defined steps, and Load the results where you want them.

Lunch will be served every day for participants between 12:00 and 13:00. 

Register

Workshop Description

Researchers often spend a significant amount of time on data wrangling tasks such as reformatting, cleaning, and integrating data from different sources. Despite the availability of software tools, they often end up with difficult-to-reuse workflows that require manual steps.

Omnipy is a new Python library that offers a systematic and scalable approach to research data and metadata wrangling. It allows researchers to import data in various formats and continuously reshape it through typed transformations. For large datasets, Omnipy seamlessly scales up data flows for deployment on external compute resources, with the user in full control of the orchestration.
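
To make the underlying pattern concrete, here is a minimal sketch of an ETL pipeline in plain Python: three small, typed steps chained into a rerunnable flow. The file names are hypothetical, and the bare functions merely stand in for the reusable task and flow templates that Omnipy itself provides; this illustrates the pattern, not Omnipy's API.

    import json
    from typing import Any

    def extract(path: str) -> list[dict[str, Any]]:
        # Extract: read raw records from a local JSON file (a stand-in
        # for a remote source such as a REST API).
        with open(path) as f:
            return json.load(f)

    def transform(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
        # Transform: keep only complete records and normalise the name field.
        return [
            {"id": r["id"], "name": r["name"].strip().lower()}
            for r in records
            if "id" in r and "name" in r
        ]

    def load(records: list[dict[str, Any]], path: str) -> None:
        # Load: write the cleaned records to the target location.
        with open(path, "w") as f:
            json.dump(records, f, indent=2)

    load(transform(extract("raw_records.json")), "clean_records.json")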

This workshop builds on the half-day workshop "Using Omnipy for data wrangling and metadata mapping (beginner level)" that we are holding before lunch. In this second workshop, participants will learn how to develop various types of data flows in Omnipy, including integration with web services. They will make use of the powerful, industry-developed Prefect orchestration engine to scale up and deploy high-throughput ETL flows on external compute resources.

The workshop is divided into three parts:

  1. The first part will introduce the slogan "parse, don't validate" and show how this concept is implemented in Omnipy (see the first sketch after this list). Against this background, we will introduce the three types of data flows supported by Omnipy: linear, DAG, and function flows. We will also, through hands-on examples, show how to make use of various job modifiers to customise and extend predefined tasks and flows into more complex data flows.
  2. The second part will focus on integrating data flows with web services through REST APIs. We will mainly focus on extracting data from data sources, but will also touch upon loading results into data sinks. Hands-on examples will introduce tasks and flows that allow flattening of JSON data into relational, tabular form for mapping, and restructuring of the results back to JSON (see the second sketch below).
  3. The last part will introduce Omnipy's integration with S3-based cloud storage and the Prefect ETL orchestration library (see the third sketch below). As a hands-on exercise, participants will scale up the data flow developed in the second part of the workshop by deploying it on external compute infrastructure, potentially the Kubernetes-based NIRD Toolkit from SIGMA2 (if Prefect integration in NIRD is finalised in time for the workshop).
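
First sketch: "parse, don't validate" means parsing raw input once, at the boundary, into a typed structure that downstream steps can trust, rather than repeatedly checking data that stays untyped. The example below uses plain pydantic with invented record fields; Omnipy applies the same idea through its own typed model and dataset classes.

    from pydantic import BaseModel, ValidationError

    class SampleRecord(BaseModel):
        sample_id: str
        organism: str
        read_count: int  # pydantic coerces the string "42" to the int 42

    raw = {"sample_id": "S001", "organism": "Homo sapiens", "read_count": "42"}

    try:
        record = SampleRecord(**raw)  # parse once, at the boundary...
    except ValidationError as err:
        raise SystemExit(f"Bad input record: {err}")

    # ...so every later step can rely on the types without re-checking.
    print(record.read_count + 1)  # 43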
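
Second sketch: the extract-and-flatten pattern from part two, with requests and pandas.json_normalize standing in for Omnipy's own tasks. The endpoint URL and the JSON structure are hypothetical.

    import pandas as pd
    import requests

    # Extract nested JSON records from a (hypothetical) REST endpoint.
    resp = requests.get("https://api.example.org/v1/samples", timeout=30)
    resp.raise_for_status()
    samples = resp.json()  # e.g. [{"id": 1, "attrs": {"organism": "human"}}]

    # Flatten nested objects into relational columns such as "attrs.organism".
    table = pd.json_normalize(samples)

    # After mapping on the tabular form, restructure back to nested JSON.
    restored = [
        {"id": row["id"], "attrs": {"organism": row["attrs.organism"]}}
        for row in table.to_dict(orient="records")
    ]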
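
Third sketch: a minimal Prefect flow of the kind part three scales up. The @task and @flow decorators are Prefect's documented API; the toy data and step bodies are invented here, and deployment to external compute is configuration layered on top of such a flow rather than extra code.

    from prefect import flow, task

    @task(retries=2)
    def extract() -> list[dict]:
        # Stand-in for a real extraction step, e.g. a REST API call.
        return [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

    @task
    def transform(records: list[dict]) -> list[dict]:
        # Upper-case each value as a placeholder transformation.
        return [r | {"value": r["value"].upper()} for r in records]

    @task
    def load(records: list[dict]) -> None:
        print(f"Loaded {len(records)} records")

    @flow(name="demo-etl")
    def etl_flow() -> None:
        # Each run is tracked by Prefect and can be inspected in its GUI.
        load(transform(extract()))

    if __name__ == "__main__":
        etl_flow()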

Learning outcomes

  • Put the concepts behind the slogan "parse, don't validate" into practice
  • Define the three fundamental flow types in Omnipy
  • Reuse and repurpose existing tasks and flows by applying job modifiers
  • Extract data from external REST APIs
  • Transform nested JSON output into normalised tables
  • Load results to external services
  • Scale up a data flow by deploying it on external compute resources
  • Orchestrate flow runs using the Prefect web-based GUI and inspect data output from external runs

Prerequisites

Participants should have some experience with Python programming and scripting. We will also assume a basic understanding of the JSON format and of how to make use of REST APIs. Experience with an Integrated Development Environment (IDE) and the command line is preferable, but not a prerequisite. Note that we will also assume that participants have completed the beginner-level workshop "Using Omnipy for data wrangling and metadata mapping" that we are holding before lunch.

Target audience

PhD students, postdocs, and technical personnel with an interest in, and experience with, programming in an academic setting. Data science will be a particular focus, but the workshop is open to any interested participants. The use cases will not assume any domain knowledge.

Required material

Participants should bring a laptop with the PyCharm IDE pre-installed and configured with Python 3.10. We will provide detailed instructions, as well as opportunities for installation support, ahead of the workshop.
