
Data Versioning for CD4ML – Part 1

(This post is the first of a two-part series where I touch on a component of CD4ML: data versioning, and how it can be incorporated into an end-to-end CI/CD workflow. This first part presents a preface and a high-level overview of the topic, ending with an introduction to DVC. The second part takes a deeper and more technical dive into the topic, to give the audience a better understanding of how the tool can be applied to the different aspects of a workflow.)

Table of Contents

  1. Preface
  2. The Data Of It All: Possible issues when working with data.
  3. Tracking & Versioning Data: Tools available for issues raised.
  4. Enter DVC: Introduction to the Data Version Control (DVC) tool.
  5. Conclusion

Preface

A few weeks ago, a colleague of mine at AI Singapore published this article highlighting the culture and organisational structure we adopt when working on data science/AI projects. Having been exposed to multiple projects across varying domains, there is no doubt that we need to go full-stack. Our AI Apprenticeship Programme, conducted in tandem with the 100E initiative, groups individuals into teams, with 3 or 4 of them doing effective engineering work; these individuals are the main parties churning out the deliverables set throughout the project stint. Several implications arise from the nature of the programme:

  • we have a team of apprentices where each individual is discovering and experiencing the fundamentals of a data science workflow, be it reproducibility, good coding practices, or CI/CD.
  • while differing roles & responsibilities are eventually delegated, there is an emphasis on them being cognizant of the synergy required to streamline the workflow across the team.

As the main purpose of this apprenticeship programme is for these individuals to be equipped with deep skills required to successfully deliver AI projects, it is only natural that they are exposed to the cross-functional roles of a data science team.

The Data Of It All

With the above being said, there is one thing common across all the projects: the first stage involves data shared by the project sponsor. To better understand the nature and complexity of the problem at hand, every apprentice and engineer has to be exposed to the tasks of data cleaning, Exploratory Data Analysis (EDA) and the creation of data pipelines. Even though this set of tasks would eventually be delegated to designated individuals, no member of the team should be excused from getting their hands dirty with one of the most time-consuming and crucial components of a data science workflow. After all, if one wishes to experience the realities of data science work, then there is no way they should be deprived of having to work with real enterprise data with all its grime, gunk and organisational mess. Ah, the short-lived bliss of downloading datasets with the click of a single button (hi Kaggle!).

So, it has been established that the whole team would, one way or another, come into contact with the data relevant to the problem statement. The consequence is that the initial raw state of the data branches off into different states, versions and structures, attributable to any of the following (non-exhaustive) factors:

  • engineered features that are formulated by different individuals.
  • differing preferences of directory structures.
  • differing definitions of states of data.

The aforementioned issues can be further exacerbated by:

  • altered schema/format/structure of raw data, which can occur multiple times if the project sponsor does not have a streamlined data engineering pipeline or no convention is formalised early on.
  • batched delivery of raw data by project sponsor/client.

All of this further adds to the mess that already exists within the raw data itself. Without proper deliberation in the data preparation stage, this could easily snowball into a huge obstacle for the team to overcome before progressing to the modelling stage. The thing is, with formats and structures of processed data, just like with the source code of the project, there is no single acceptable answer. Two different pieces of source code can still deliver the same outcome. What is important is to have these different versions of code tracked, versioned and committed to a repository, so that developers can move back and forth between them should they choose to. Hence the reason we have Git, SVN or Mercurial: version control systems (VCS) which allow us to track changes in source code files. But what about data?

See, code (text) files can easily be tracked as they do not take up much space, and every version of such small files can be retained by the repository. However, it is not advisable to do the same with data, as this could easily cause the size of your repository to blow up. So what are the different ways available for versioning and tracking differing iterations and formats of data? Let's explore.

Tracking & Versioning Data

Some ways to track and version data:

  1. Multiple (Backup) Folders

With this method, one keeps a separate folder for EVERY version of the processed data. A spreadsheet can be used to track the different versions of the processed data, as well as each version's dependencies and inputs. If a data scientist would like to refer to an old version of the data because a more recent one does not look promising, they simply consult the spreadsheet and provide the location of that old version to the script (a minimal sketch of such a script follows Figure 2). This is the most straightforward method of versioning and tracking datasets, and the one with the lowest learning curve and barrier, but it is in no way efficient or safe to implement. Data files can easily be corrupted this way and, unless proper permission controls are configured, it is all too easy for anyone to accidentally delete or modify the data. Many organisations still do this for data analysis or data science workflows; as it is the easiest and least costly approach, it is understandable why they resort to it. However, such a tedious and inelegant method should prompt one to look for better ways.

repository
├── src
├── data
│   ├── raw
│   │   ├── labels
│   │   └── images
│   └── processed
│       ├── 210620_v1
│       │   ├── train
│       │   └── test
│       ├── 210620_v2
│       │   ├── train
│       │   ├── test
│       │   └── holdout
│       ├── 010720_v1
│       └── final_v1
├── models
└── .gitignore

Figure 1: A sample folder tree containing multiple folders for different versions of processed datasets.

Data Folder Name | State     | Data Pipeline Commit ID
210620_v1        | Processed | 54b11d42e12dc6e9f070a8b5095a4492216d5320
210620_v2        | Processed | fd6cb176297acca4dbc69d15d6b7f78a2463482f
010720_v1        | Processed | ab0de062136da650ffc27cfb57febac8efb84b8d
final_v1         | Processed | 8cb8c4b672d0615d841d1c797779ee2e0768a7f3

Figure 2: A sample table/spreadsheet that can be used to track data folder versions with commit IDs of pipelines used to generate these artefacts.
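
In practice, the "provide the location to the script" step in this approach is usually nothing more than a manually supplied path. Here is a minimal, hypothetical sketch of such a script, assuming the folder layout in Figure 1 (the script name, argument and default path below are purely illustrative):

# prepare.py -- hypothetical: the data version is chosen by hand from the tracking spreadsheet
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument(
    "--data-dir",
    type=Path,
    default=Path("data/processed/final_v1"),
    help="Processed-data version to use, as recorded in the tracking spreadsheet",
)
args = parser.parse_args()

train_dir = args.data_dir / "train"  # assumed sub-folder layout (see Figure 1)
test_dir = args.data_dir / "test"
print(f"Training with data from {train_dir}, evaluating on {test_dir}")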

  2. Git Large File Storage (Git-LFS)

Git-LFS allows one to push files that exceed the size limit imposed by Git hosting platforms. The solution leverages special servers hosted by these platforms, and each platform sets its own limits for Git-LFS uploads. For example, GitHub Free allows a maximum file size of 2GB to be uploaded to GitHub's LFS servers. With that said, Git-LFS was not created with data science use cases in mind.
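
For completeness, tracking a file with Git-LFS typically looks like the following (the file pattern is just an illustration); note that every version of the file still counts towards the hosting platform's LFS storage and bandwidth quotas:

$ git lfs install
$ git lfs track "data/*.csv"
$ git add .gitattributes data/dataset.csv
$ git commit -m "Track dataset with Git-LFS"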

  3. Pachyderm

A more elegant solution than the aforementioned ones would be Pachyderm. Pachyderm allows for the formulation and chaining of the different parts of a data pipeline and, in addition to this orchestration function, it has a data versioning function that keeps track of data as it is processed. It is built on Docker and Kubernetes; while it can be installed locally, it is better deployed and run in a Kubernetes cluster. While this would be an optimal choice for many data science teams out there, smaller or less mature organisations might not find the solution that accessible. Moreover, having to maintain a Kubernetes cluster without an engineer trained in infrastructure or DevOps would pose additional hurdles. So, is there an alternative that does not pose such a high barrier to adoption? Well, that is where DVC comes in.

Enter DVC

DVC by Iterative.ai, three years on from its first release, is an open-source version control system that has gained a lot of traction within the data science community. It is built for versioning ML pipelines, with a great emphasis on reproducibility. One of the best things about it (at least in my opinion) is that it is incredibly easy to install, and its usage is mainly through the command-line interface.

(Note: My talking about DVC here is to bring awareness, summarise some of its key features and hopefully pique your interest. DVC is well-documented and has a forum that cultivates a community, with many of its users blogging about it. This introductory coverage of DVC may feel somewhat redundant, but it is my way of presenting a prequel to the deeper technical coverage to come in Part 2 of this series.)

Installation

Alright. So, how does one get started with DVC? Let’s start with the installation. To install DVC within a Python virtual environment…:

$ pip install dvc

…or to install it system-wide (for Linux):

$ sudo wget \
       https://dvc.org/deb/dvc.list \
       -O /etc/apt/sources.list.d/dvc.list
$ sudo apt update
$ sudo apt install dvc

# For Mac, using Brew:
# brew install dvc

# For Windows, using Chocolatey:
# choco install dvc

That's it. There is no need to host a server, run a container cluster, or pay for the tool. The cache for versions of data/artefacts can be stored locally (say in a tmp folder), or on remote storage for shared access. At AI Singapore, we use an on-premise S3-based data store (Dell ECS) as remote storage.
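
To give an idea of how little configuration is involved, pointing DVC at a remote is a couple of commands (the remote name, bucket and endpoint below are placeholders, not our actual setup):

$ dvc remote add -d storage s3://my-bucket/dvc-cache
$ dvc remote modify storage endpointurl https://my-ecs-endpoint.example.com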

Features

There is a lot one can do with this tool, and I am sure more is to come given the many issue tickets submitted to its GitHub repository. Let me list its key features in my own words:

  • data versioning with the option to store caches on supported storage types and tracking through Git.
  • language & framework agnostic (you do not need Python to use the tool).
  • ability to reproduce artefacts generated by pipeline runs tracked through DVC’s commands.
  • compare model performances across different training pipeline runs.
  • easy integration and usage with Iterative.ai’s other tool CML.

Mechanism

One might wonder: how does DVC enable the tracking of datasets? Well, each data file or artefact being tracked is linked to a file with the extension .dvc. These .dvc files are simply YAML-formatted text files containing hash values that track the states of the files. (Do look at this page of the documentation for a reference of the files relevant to DVC's mechanism.) Since these text files are just as lightweight as the code files tracked by Git, they can be committed, tracked and versioned.


Figure 3: Diagram showcasing the relationship between the components of DVC.

The diagram above (taken from DVC's website) shows how the file model.pkl is tracked through the file model.pkl.dvc, which is committed to a Git repository. The model file has a reflink to the local cache, and it changes according to the version you call upon through DVC's commands. (For an explanation of reflinks, see this article.) There is more to how DVC versions and tracks artefacts and pipelines, but this is the basic coverage.
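
In practice, switching between versions of a tracked file is a matter of checking out the corresponding .dvc file in Git and letting DVC sync the workspace to it, along these lines (the commit reference is a placeholder):

$ git checkout <older-commit> -- data/dataset.csv.dvc
$ dvc checkout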

Basic Workflow

So what does a typical basic workflow look like for the usage of DVC? Assuming one has installed DVC, the following commands can be executed to start the versioning of files:

Let’s create/assume this project folder structure:

$ mkdir sample_proj
$ cd sample_proj
$ mkdir data models
$ wget -c https://datahub.io/machine-learning/iris/r/iris.csv -O ./data/dataset.csv
$ git init
$ tree
.
├── data
│ └── dataset.csv
└── models

  1. Initialise DVC workspace
$ dvc init
$ ls -la
drwxr-xr-x user group .dvc
drwxr-xr-x user group 352 .git
drwxr-xr-x user group 128 data
drwxr-xr-x user group  64 models

This command creates a .dvc/ folder containing configuration and cache files which are mostly hidden from the user. The folder is automatically staged to be committed to the Git repository. Do note that this command by default expects a Git repository; the flag --no-scm is needed to initialise DVC in a folder that is not a Git repository.

  2. Start tracking files with DVC
$ dvc add data/dataset.csv
# For tracking a data folder
# dvc add ./data

To start tracking a file, run the command above with a reference to either a directory or a file that is meant to be tracked. In this case, we are tracking just the dataset.csv file itself. A file dataset.csv.dvc is created within the data/ folder, to be committed to the Git repo.
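
As an illustration, the generated dataset.csv.dvc file looks roughly like this (the hash value here is made up, and the exact fields vary slightly between DVC versions):

$ cat data/dataset.csv.dvc
outs:
- md5: 4d5e1b6bfa9d9a0f3ec2b2a97c6b5f10
  path: dataset.csv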

  3. Track a pipeline/workflow stage along with its dependencies and outputs

Let's get ahead of ourselves and assume that we have created a Python script to train a model, where one of its inputs is the dataset.csv file and its output is the predictive model itself: model.pkl.

$ dvc run \
    -n training \
    -d data/dataset.csv \
    -o models/model.pkl \
    python train.py
Running stage 'training' with command:
        python train.py
Creating 'dvc.yaml'
Adding stage 'training' in 'dvc.yaml'
Generating lock file 'dvc.lock'
$ tree
.
├── data
│ └── dataset.csv
├── models
├── dvc.lock
├── dvc.yaml
└── train.py

The contents of the dvc.yaml file would be as such:

stages:
  training:
    cmd: python train.py
    deps:
    - data/dataset.csv
    outs:
    - models/model.pkl

  4. Share and update the cache to remote and commit .dvc files
$ dvc push

This command pushes whatever is stored in the local cache to the local remote or whichever other storage you have configured for the project.

$ git add data/dataset.csv.dvc dvc.yaml
$ git commit -m "Commit DVC files"
$ git push

  5. Reproduce results

Let's say we push the cache to remote storage. Someone else can then clone the same Git repository (containing the relevant DVC files that have been generated), configure a connection to the remote storage, and run the pipelines listed as stages in dvc.yaml.

$ git clone <repo>
$ dvc pull # to get the exact version of the dataset.csv file that was used by the one who initially ran the pipeline
$ dvc repro

After the above commands are executed, the model.pkl file is regenerated from the same set of data (versioned through DVC) and the same pipeline (versioned through Git). This showcases the reproducibility end of the whole tooling mechanism.

Conclusion

What I have shared above is my attempt to highlight to you, the audience, how easy it can be to execute a workflow that leverages DVC as a data versioning tool. It is more of a selling point for you to look deeper into DVC's documentation and guides; it would be an inefficient use of my time to provide further coverage of a tool whose existing documentation already does an excellent job. What I am keener to do is to show you how DVC can be plugged into and integrated with an end-to-end workflow consisting of data preparation, model training and tracking, and then model serving, all through an attempt at leveraging a CI/CD pipeline. This is something much more complicated, and something the documentation does not cover. All of this comes in the much more technical Part 2, to be published very soon. So that's it for now. Until then, take care.

It’s A Wrap: AI vs COVID-19 Ideation Challenge!

The COVID-19 pandemic has affected everyone, claiming hundreds of thousands of lives and disrupting the livelihood of millions.

AI has the potential to transform our economy and improve lives. Given this unprecedented global crisis, AISG launched an ideation challenge with the aim of encouraging the curation of innovative ideas using AI methods and technologies to help Singapore emerge stronger and better from the current situation caused by the coronavirus.

The challenge was open to all students at Singapore public educational institutions i.e. secondary to university. They could participate as a team of 1-4 members. Each proposal was reviewed based on the following criteria: Innovation (novelty and creativity of the idea), Value-add (how much would society benefit from such a solution), Relevance (how AI-focused is the proposed approach, how well does it address the problems caused by Covid-19), Feasibility (in terms of implementation, scalability and cost). The Review Panel comprised representatives from AISG, National AI Office and MOH Office for Healthcare Transformation. The ten best proposals were selected and awarded SGD1000 each.

“We were heartened to receive more than 170 entries from secondary schools, ITE, JCs, and Polytechnics as well as Universities.  The submissions were very diverse in terms of topics, use cases, age groups as well as institutions.  In view of the current situation, one would expect that a lot of proposals focus on health, but a good number also considered the secondary effects of the pandemic e.g. mental health, home-based-learning and teaching, supporting people who have lost jobs, how to get groceries etc.” said Stefan Winkler, Deputy Director of AI Technology, AISG.

A virtual award ceremony was conducted on 30 July 2020 to acknowledge our winners. In attendance were 6 winners who each gave a short presentation on their winning proposal to the rest.

Group picture from the AI vs COVID-19 Ideation Challenge Virtual Award Ceremony
A screengrab on what our winners had to say about their experience working on the proposal for the challenge

Once again, we would like to thank everyone who has taken part in this Ideation Challenge and our heartiest congratulations to the following 10 winning teams.

Click here for the summary of each proposal.

PLUS-Skilling Not Re-Skilling Needs To Be the New Norm

AI Singapore runs a very successful apprenticeship programme – the AI Apprenticeship Programme (AIAP). Started in 2018, the AIAP is now into its 6th batch of apprentices. More than 90% of our graduates secure an AI job before they complete the 9-month programme.

More information about the AIAP programme can be found below.

AI Apprenticeship Programme

In short, we take in self-directed, highly motivated learners who have taught themselves the tools of the trade like Python, AI/ML libraries either through formal courses in the universities, online materials, MOOCs, or just from books and personal experimentation. We then take them on a 9-month journey to deepen their skills to build and deploy an AI model into production for a real-life “paying” customer.

Who are the Apprentices?

After 2 years of studying the trends and profiles of the apprentices, what is apparent is that the field of data science and AI/ML is not just the domain of computer science graduates. In fact, computer science graduates make up only 21% of our cohort. Yes, you can say those with a computer science degree are less likely to join the AIAP since they already have the background knowledge, and you would be partly right.

However, the work of the data scientist and AI engineer is not just about the development of novel machine learning or AI algorithms (we leave that to the researchers!). Data scientists and AI engineers are passionate about creating data products that provide the organization they work for with useful insights, actionable plans and/or products and services. The people who excel here are passionate about understanding the data and creating products; any AI/ML tool is just that, a tool or technique to be used as required.

Chart: Profile of AIAP apprentices from Batch 1 to Batch 6.

The chart above shows the profile of our apprentices from batch 1 to 6. What is clear from the data is that people from any field, with the right motivation and training, can get into the AI field and do very well (all of our AIAP graduates have landed data science/AI/ML roles with major private and public organizations).

PLUS-skill and not Re-skill

What is interesting in speaking with the current apprentices and maintaining a relationship with past AIAP graduates and tracking their progress, is that it was never about re-skilling (because they found their skills irrelevant or their industry disappeared) but about getting new skills which would allow them to perform better in their chosen profession. Many AIAP graduates continued in a similar profession after the AIAP, but now in an enhanced role with their new-found and validated data science and AI skills.

In PLUS-skilling, an accountant learns how to use data science and AI tools and techniques to analyse the accounts faster or detect fraud. Some accountants may learn to code and build those AI models themselves, but they will be a minority and will be highly sought after.

A sales executive, who now understands AI and is able to build AI models, would be well-positioned to sell AI products and services intelligently and correctly. (I have lost count of the number of salespeople whom I have met trying to sell me AI products without understanding how their products actually work).

A software engineer, who is trained in machine learning, can now build more intelligent systems in a robust and correct manner. It is not just about learning how to call another API. Building machine learning systems involves the correct use of data and interpretation of the output of various ML algorithms. Garbage into an API, garbage out of the API.

A mechanical engineer is now expected to design smart systems which use data to drive the design, and not necessarily from first principles or physics equations anymore. The engineer who is able to effectively use AI tools or build machine learning models will be the new norm.

A government policy planner will now use her new data science skills to develop better policies with data, instead of relying on gut feel, the popular vote, or often-biased published reports.

The AIAP is for professionals who have that passion for data, the ability to code and want to become full-fledged AI engineers to develop AI software, products and solutions. Creating a nation of AI Engineers is just not possible and is not wise. 

So you may ask

if I cannot code, how can I participate in the new world of AI and data?

Fallacy of programming for everyone

Do not learn programming! 

Not everyone needs to learn programming to be data-savvy and do data science and AI roles. You do not need to know how to code even “Hello World” to be part of the data revolution. 

5-day bootcamps which offer HTML and CSS coding and promise to land you a job as a web designer are just so wrong. In 5 days you cannot even hope to build a website which can be generated with tools like WordPress, Wix or Squarespace in an hour. You will need months, if not years, of training and hands-on work to become a “real” web designer.

3-week bootcamps which claim to train you to be a data scientist are just fake news! But that is a story for another day.

Working professionals will encounter more and more AI tools and systems in the next few years. Learning how to use them will be part-and-parcel of the job (just like how you learn to use Excel or switch from Lotus 1-2-3 to Excel). 

Tools like Azure Automated Machine Learning, Orange Data Mining and Rattle (a package in R) allow you to click through an analysis and perform machine learning/data mining without coding.

What you will need to learn here is not programming, but data analytics skills: understanding how to use the correct algorithm for the specific use case, how to pre-process the data, and how to analyse the output and then make a recommendation to management based on your analysis. Or you can view terabytes of data and present them intelligently with beautiful visualisations, without a single line of code, using highly intuitive GUI-based tools like Tableau, Qlik or PowerBI.

It is about learning beyond Y = MX + C (and who said you do not use data science!) and applying newer, more advanced algorithms, combined with the data you have and your domain expertise, to deliver value to your organization. Just as most of us who need to do forecasting may have used Excel's FORECAST() function at one time, you did not need to learn how to program the FORECAST function to use it. You only had to learn that there was such a tool and HOW TO USE IT correctly.

The push to get everyone to learn programming is not only ineffective but also a waste of time and resources. Not everyone is cut out to code, just like not everyone can be a lawyer, painter or doctor. Programming is an art, and you need to have that artist in you before you can do well.

If you are a PMET and think you want to learn Python programming in 3 months and transition into a data science role, where do you stand compared to fresh graduates who may have spent 3-4 years in a university programming every week, coding and working on data-related problems? Would an employer hire you instead of a fresh graduate for that entry-level programming role?

But if you PLUS-skill yourself and present yourself as someone who has the domain knowledge, and now proper data analysis skills and the ability to use the common tools mentioned above to better drive an organisation's strategy, you will likely be the manager of that fresh graduate, who will be doing all the lower-level programming to get the right data to you for your analysis.

Everyone is (or will be) an expert in their chosen domain after a few years, and the focus should be on PLUS-skilling so that you can take on new tasks in the new age of data, data science and machine learning.

Happy learning (and not necessarily programming!).

Newly Launched – AISG PhD Fellowship Programme

We are pleased to announce the launch of the AISG PhD Fellowship Programme which is part of AISG’s Research programme to support top AI research talents in pursuing their PhD in Singapore-based Autonomous Universities (AUs)*.

We aim to nurture and train local AI talents to perform advanced fundamental AI research and produce state-of-the-art AI algorithms, models and systems. In addition, we hope that these talents will contribute to the other pillars of AI Singapore, the local AI ecosystem and society at large.

Click here for more information.

 

Leveraging Kedro in 100E

Introduction

At AI Singapore, projects under the 100 Experiments (100E) programme delivering machine learning applications are usually staffed by engineers of different academic backgrounds and varying levels of experience. As a project technical lead and mentor to apprentice engineers within the programme, my first priority in each project is often to establish a foundation of practices that would both form an efficient, unified development workflow and provide opportunities to impart some engineering wisdom to the junior engineers. In the never-ending journey to address these challenges, I have found Kedro to be a useful tool, serving as a step-by-step guide for developing machine learning products, instilling sound software engineering principles, and ensuring production-level code.

The following is a summary of how I utilised Kedro in my 100E projects. For a full rundown of its features or instructions on how to set up and configure Kedro for your projects, refer to the official documentation.

What is Kedro?

Kedro is a Python library created by the boffins at QuantumBlack. It is an opinionated development workflow framework for building machine learning applications, with a strong focus on helping developers write production code. It seeks to provide a standardised approach to collaborative development by providing a template to structure projects, as well as by implementing its own paradigm for data management and parameter configuration. It also comes with a command-line interface to automate the execution of unit tests, the generation of documentation, packaging into libraries, and other processes. This suite of features and its intended workflow encapsulate what its creators consider the ideal approach to machine learning projects, and Kedro is the realisation of it.

The Kedro Workflow

Workflows differ from team to team. A team comprising diverse professional backgrounds will inevitably have some degree of dissonance with regard to approaching work. Those with an affinity for analysis may be inclined to devote more time doing exploration and prefer a more fluid manner of problem solving. Those who enjoy building applications may prefer to work on a well-defined set of features and focus on productising. In building machine learning products, both approaches have their merits and Kedro seeks to harmonise the two with its workflow. I found that members from both camps took to this workflow readily because it is simple and structured.

The Kedro workflow can be summarised into three simple steps:

  1. Explore the data
  2. Build Pipelines
  3. Package the project

Figure: The three-step Kedro workflow.

1. Explore Data

Projects typically begin with exploration of the given data and experimentation with viable models. There are myriad ways to solve a problem in machine learning: they depend on the available data, the features derived from that data, and the compatibility of the model with those features; they also differ in complexity, reliability and execution.

The first step to building an application is to address this ambiguity through an iterative process of statistical analysis, hypothesis testing, knowledge discovery and rapid prototyping. Generally, this is done through Jupyter notebooks, since the preference is for a tight feedback loop and notebooks provide immediate visualisation of outputs. Kedro streamlines this process by providing a structured way to ingest and version data through its Data Catalog, and integrates this into Jupyter.

As the name implies, the Data Catalog is a record of data in the project. It provides a consistent structure to define where data comes from, how it is parsed and how it is saved. This allows data to be formalised as it undergoes each stage of transformation and offers a shared reference point for collaborators to access each of those versions, in order to perform analysis or further engineering.

By ensuring all team members work on the same sets of data, they can focus on the objectives of data exploration: discovering insights to guide decision making, establishing useful feature engineering processes and assessing prototype models. Furthermore, this assurance on consistency also allows for easy synchronisation between individual work, since everyone references the same Data Catalog when reading and writing data.
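
For illustration, a minimal Data Catalog entry might look like the following in conf/base/catalog.yml (the dataset name and file path are assumptions for this sketch, and the exact dataset type string depends on your Kedro version):

iris_raw:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv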

2. Build Pipelines

Data exploration is meant to be rapid and rough: code written at this stage usually spawns from experiments and rarely survives the rigours of quality control. Hence, what follows exploration is a step to selectively refine the processes that have utility and ought to be implemented in the final product, and to formalise them as modules and scripts. Kedro has defined its own structure for processes, consisting of Nodes and Pipelines.

Nodes are the building blocks of Pipelines and represent tasks in a process. Each one is made up of a function, as well as specified input and output variable names. Pipelines are simply series of Nodes that operate on data sequentially and all data engineering processes should be consolidated into distinct Pipelines. Use of this structure allows for automatic resolution of dependencies between functions, so that more time can be spent on other aspects of refining code: ensuring it is reliable, maintainable and readable; and writing unit tests and documentation.
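
A minimal sketch of what this looks like in code, assuming the input is a pandas DataFrame and using illustrative dataset/parameter names (how the pipeline is then registered differs slightly between Kedro versions):

from kedro.pipeline import Pipeline, node


def split_data(raw_data, test_ratio):
    """Pure function: split a DataFrame into train and test sets."""
    test = raw_data.sample(frac=test_ratio, random_state=42)
    train = raw_data.drop(test.index)
    return train, test


def train_model(train):
    """Placeholder 'model': just the column means of the training set."""
    return train.mean(numeric_only=True)


def create_pipeline():
    return Pipeline(
        [
            node(split_data, inputs=["raw_data", "params:test_ratio"], outputs=["train", "test"]),
            node(train_model, inputs="train", outputs="model"),
        ]
    )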

The pipeline building process is iterative and forms the bulk of development work. At any point during exploration, when any process has been deemed useful enough to be reused or contribute to the finished product, there should be a concerted effort to implement the experimental code as a pipeline. This ensures quality by preventing dependence on non-production code, accelerates collaboration by only sharing and using reliable code, and forms a practice of steadily contributing features to the final product. This process of exploration and refinement is repeated until the product is fully formed.

3. Package the Project

The output of a complete project is a repository of source code. While it may have met the technical objectives, it is not exactly suited for general use. The final stage of the workflow is to bundle the project into a Python package that can be either delivered as a user-friendly library, or integrated into an application framework to be deployed and served. Kedro has built-in automation for packaging. As long as development complies with the Kedro workflow and structure, very little tinkering is required at this stage. Kedro also has automatic generation of API documentation, as well as extensions that allow projects to be containerised and shipped.

Kedro Standards

Kedro’s features and its workflow were intended to establish software engineering best practices — aspects of the project deemed essential to enhancing collaboration and ensuring production-level code.

The following are the most significant practices from my experience:

  1. Data Abstraction & Versioning
  2. Modularisation
  3. Test-Driven Development
  4. Configuration Management

1. Data Abstraction & Versioning

Data often undergo a variety of transformations during experimentation. Each data set derived from these transformations may have its own separate purpose and may be required to be accessed at different times or by different people. Furthermore, different members of the team may perform their own independent engineering and analysis, and would therefore produce their own versions of data. The aforementioned Data Catalog offers a means to selectively formalise each of these versions, and consolidate them into a central document with which collaborators can reference and contribute. With it, data can be shared and reproduced by different team members, simplifying organisation and ensuring there is a cohesive system to track data.
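
As a small example, marking a catalog entry as versioned is a one-line change (the entry below is hypothetical; dataset type names vary with the Kedro version):

model:
  type: pickle.PickleDataSet
  filepath: data/06_models/model.pkl
  versioned: true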

2. Modularisation

Modularisation is a staple of software engineering. It encompasses the deconstruction of code into more independent units, reducing complexity and improving maintainability. Kedro encourages modularisation by imposing the use of its Node and Pipeline structure. Nodes are required to be defined as pure functions, with specified input and outputs. When combined into a Pipeline, data is easily passed from one Node to the next. This especially prevents the construction of ‘god functions’ which are commonly found in experimental code.

3. Test-Driven Development

Test-driven development (TDD) establishes a feedback loop that ensures an application’s functional needs are met, promotes development of high-quality code and reduces operational risks through proactive testing. Pure TDD is difficult to achieve when building machine learning applications because it needs software requirements to be defined before features are built, but requirements in machine learning projects are typically nebulous and morph throughout development. Regardless, Kedro comes with pytest built in and its project template is structured to allow tests to be written and recognised by it.
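
A minimal example of such a test, assuming the hypothetical split_data node sketched earlier lives in a module importable as my_project.pipeline:

# src/tests/test_nodes.py
import pandas as pd

from my_project.pipeline import split_data


def test_split_data_ratio():
    data = pd.DataFrame({"x": range(10)})
    train, test = split_data(data, test_ratio=0.2)
    assert len(test) == 2
    assert len(train) == 8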

4. Configuration Management

Through the project template and automated commands, Kedro also provides the means to manage configuration of experiment parameters, logging and credentials. Not only does it abstract away the tedium of redefining each variable every time it changes, it also enables the separation of private information from the shared repository. This is a crucial security measure when developing code that will eventually be deployed. Kedro ensures that security principles such as this are implemented from the get-go.
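
For reference, the relevant part of the project template looks roughly like this (exact file names can differ between Kedro versions); everything under conf/local is git-ignored by default, which is what keeps credentials out of the shared repository:

conf
├── base
│   ├── catalog.yml      # shared data definitions
│   ├── logging.yml
│   └── parameters.yml   # e.g. test_ratio: 0.2
└── local
    └── credentials.yml  # e.g. S3 keys, database passwords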

Summary

I found Kedro to be, overall, a useful tool for developing machine learning applications, especially for facilitating collaboration and establishing good development practices. The various ways in which it automates and abstracts away the usually tedious organisational tasks enable the team to concentrate on more creative work. Using Kedro made building features straightforward, as it made sure that every time an experiment yielded a promising snippet, it would be assimilated into the codebase. I feel Kedro is fully capable of achieving its goal of helping developers produce production-level code, since its multitude of features ensures that code is tested, refined and reliably integrated.

There is one drawback, however, which is that Kedro’s Nodes and Pipelines structure must be strictly adhered to. This presents a problem when writing code that is intended to fit into a specification that deviates from this pattern. For instance, deep learning projects that utilise other libraries have their own separate framework and that would take precedence over Kedro’s data engineering framework. Considerable time must then be spent on ensuring any shared objects can be passed between the two paradigms. Fortunately, it is not too difficult to extract code from Kedro’s structure and implement your own. Kedro is a development workflow framework after all, so once the majority of development work is done, the features can be refactored to fit other frameworks. That way, Kedro’s convenient offerings can still be exploited. Regardless, I would still rely on Kedro to be a helpful guide to building production-level machine learning models.
