Hands-on introduction to the Python package uniFAIR: a systematic and scalable approach to research data wrangling

This half-day workshop will introduce you to the technical and conceptual background needed to make use of uniFAIR, including the new type hints in Python. Participants will follow hands-on tutorials that are based on a series of use cases from different research scenarios and scientific fields.
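Type hints, which the tutorials build on, can be written in plain Python without any extra libraries. A minimal sketch (the function names here are illustrative only, not part of uniFAIR):

```python
from typing import List, Optional

def mean_length(words: List[str]) -> float:
    """Return the average word length, or 0.0 for an empty list."""
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

def find_word(words: List[str], prefix: str) -> Optional[str]:
    """Return the first word starting with the prefix, or None."""
    for word in words:
        if word.startswith(prefix):
            return word
    return None
```

Static type checkers such as mypy use these annotations to catch type errors before the code runs, and libraries such as Pydantic reuse the same syntax to define data models.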


Researchers often need to extract, manipulate and integrate data and/or metadata from different sources, such as repositories, databases, or flat files. Much research time is spent on trivial and not-so-trivial details of data wrangling: reformatting data structures, cleaning up errors, removing duplicate data, or mapping and integrating dataset fields. Software for data wrangling and analysis, such as Pandas, R or Frictionless, is useful, but researchers still regularly end up with hard-to-reuse scripts, often with manual steps.

uniFAIR is a new Python library with a systematic and scalable approach to research data wrangling. With uniFAIR, researchers can import (meta)data in almost any shape or form: nested JSON; tabular (relational) data; binary streams; or other data structures. Data is continuously parsed and reshaped through a step-by-step process according to a series of data model transformations. uniFAIR provides a catalog of generic task and subflow templates that the researcher can refine and apply to carry out the transformations needed to wrangle data into the required shape.
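Since uniFAIR builds its data models on Pydantic, the data-model side of such a transformation step can be sketched with plain Pydantic models. The class names and field layout below are invented for illustration; they are not part of the uniFAIR catalog:

```python
from typing import List, Optional

from pydantic import BaseModel

class Author(BaseModel):
    name: str
    orcid: Optional[str] = None  # optional metadata field

class DatasetRecord(BaseModel):
    title: str
    year: int  # note: the raw input may carry this as a string
    authors: List[Author]

# A nested JSON-like record as it might arrive from a repository API
raw = {
    "title": "Sample dataset",
    "year": "2022",  # string in the source; coerced to int on parsing
    "authors": [{"name": "A. Researcher"}],
}

# Parsing reshapes the raw dict into a typed, validated object
record = DatasetRecord(**raw)
```

Each step in a flow can then be understood as parsing data conforming to one such model into data conforming to the next.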

For large datasets, uniFAIR allows local test jobs on sample-sized data to be seamlessly scaled up to the full datasets and offloaded to external compute resources. Persistent access to the state of the data is available at every step.

Learning outcomes

  • Use type hints in Python in general and to define data models in uniFAIR/Pydantic
  • Understand the ideas behind the slogan "parse, don't validate"
  • Know the architecture of uniFAIR and its main classes, and have an overview of the different modules and their usage
  • Define, refine, apply and revise tasks and flows in uniFAIR
  • Import data from external REST APIs and flat files
  • Develop data transformation flows to solve a selection of use cases
  • Inspect data after each transformation step and make informed choices on how to configure the next tasks
  • Transform nested JSON output into normalized tables (without duplicate data)
  • Map (meta)data fields from the input data model to the user-defined output model
Due to time constraints, the following outcomes will be demonstrated rather than practiced hands-on:
  • Scale up the data import from a representative sample to a large dataset and deploy the flow on external compute resources (e.g. NIRD service platform)
  • Orchestrate flow runs using the Prefect web-based GUI and inspect data output from external runs
  • Get started with contributing to the Open Source catalog of uniFAIR modules
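The "parse, don't validate" idea behind several of the outcomes above can be illustrated with the standard library alone: instead of checking a raw dict and passing it along unchanged, parse it once into typed structures so later steps cannot receive malformed data. The record layout and names below are hypothetical, chosen only to show the pattern of normalizing nested JSON into separate tables:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Study:
    study_id: str
    title: str

@dataclass
class Sample:
    sample_id: str
    organism: str

def parse_study(raw: dict) -> Tuple[Study, List[Sample]]:
    """Parse one nested JSON-like record into two normalized tables.

    Raises KeyError on missing required fields, so malformed input
    fails here rather than deep inside a later transformation step.
    """
    study = Study(study_id=str(raw["id"]), title=str(raw["title"]))
    samples = [
        Sample(sample_id=str(s["id"]), organism=str(s["organism"]))
        for s in raw.get("samples", [])
    ]
    return study, samples

raw = {
    "id": 42,
    "title": "Pilot study",
    "samples": [
        {"id": "s1", "organism": "E. coli"},
        {"id": "s2", "organism": "S. cerevisiae"},
    ],
}
study, samples = parse_study(raw)
```

The nested "samples" list is pulled out into its own table, keyed by identifiers, so the study metadata is stored once instead of being duplicated per sample.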

Prerequisites

The participant should have experience with Python programming. Experience with type hints in Python is useful, but not required.

Target audience

PhD students, postdocs, and technical personnel with an interest in and experience with Python programming in an academic setting. The use cases will be drawn from different scientific disciplines and will be introduced in a way that does not assume any prior experience with the particular fields.

Required Materials 

In addition to a laptop, participants should have an Integrated Development Environment (IDE) installed. We recommend PyCharm, as it will be used in the demonstrations, but any other IDE that supports Python is also acceptable. Installation instructions will be provided.


Organizers

Sveinung Gundersen, Federico Bianchini and Jeanne Cheneby
Published Dec. 9, 2022 1:19 PM - Last modified Dec. 20, 2022 12:56 PM