Automated Data Validation & Exploration

A presentation I did at work where we walk through different tools to automate or assist Data Exploration and Validation. A very exciting topic for sure!
machine-learning
data
Author

Hampus Londögård

Published

March 20, 2023

To run this as slides use the following command in the terminal:

jupyter nbconvert posts/2023-03-20-deepchecks/index.ipynb --to slides --post serve

N.B. this blog post was originally a presentation, hence it’s not really written in typical blog style. I may rewrite it into a proper blog post in the future.

Data Validation & Exploration

Today we’ll dive into automated Data Validation and Data Exploration.

Every day we work through a multitude of data using heuristics, statistics and many tools. But are there better tools out there? Is there a way to automate parts of the process so we can put greater emphasis on the important things?

Data Validation Tools

There are a few tools:

  1. Deepchecks Tests for Continuous Validation of ML Models & Data
  2. ydata-profiling (previously pandas-profiling) Create HTML profiling reports from pandas DataFrame objects
  3. greatexpectations Always know what to expect from your data.
  4. pandera A Statistical Data Testing Toolkit

We’ll focus on a few discussion points today

  • When does it make sense to introduce this type of tool?
  • How do you use this type of tool today?
  • How can it be improved?
  • Can it be used as part of Data Analysis?
  • Can it be used in any other part of the process?

(This was for the presentation/‘journal circle’)

Introduction

As we all know, data is incredibly important when developing Machine Learning applications.

Shit in, shit out

First, we’ll give a quick introduction to each tool and its strengths.

Second, I’ll share a few use-case examples (I didn’t have time to complete these).

Finally, we’ll discuss how we can use, or already use, these tools.

Deepchecks

Figure 1: Deepchecks Checks

Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort. This includes checks related to various types of issues, such as model performance, data integrity, distribution mismatches, and more.

Data Formats

Deepchecks supports the following formats:

  1. Tabular
  2. Computer Vision
  3. NLP (text)
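
For the tabular flavour, the typical entry point is wrapping a pandas DataFrame in deepchecks’ Dataset class so the checks know which column is the label and which features are categorical. A minimal sketch (the column names below are made up for illustration):

import pandas as pd
from deepchecks.tabular import Dataset

train_df = pd.read_csv('train_data.csv')

# Tell deepchecks which column is the label and which features are categorical
# (column names here are purely illustrative)
train_dataset = Dataset(train_df, label='target', cat_features=['gender', 'city'])

The resulting train_dataset is what the suite examples further down take as input.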

Example

Video of a Deepchecks Evaluation Suite

Types of checks

The types of checks are divided into three variants: data integrity, train-test validation, and model evaluation.

Deepchecks Types and where they run
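
In code, the three variants map to three built-in tabular suites. A minimal sketch (the dataset and model variables are assumed to already exist):

from deepchecks.tabular.suites import data_integrity, train_test_validation, model_evaluation

data_integrity().run(train_dataset)  # data-only checks, no model needed
train_test_validation().run(train_dataset=train_dataset, test_dataset=test_dataset)  # train vs. test checks
model_evaluation().run(train_dataset=train_dataset, test_dataset=test_dataset, model=model)  # needs a trained model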

Running a Deepcheck

Either you run a full suite or a single check. You choose!

Full Evaluation Suite

from deepchecks.tabular.suites import model_evaluation
suite = model_evaluation()
result = suite.run(train_dataset=train_dataset, test_dataset=test_dataset, model=model)
result.save_as_html() # replace this with result.show() or result.show_in_window() to see results inline or in window

Single Validation

from deepchecks.tabular.checks import FeatureDrift
import pandas as pd

train_df = pd.read_csv('train_data.csv')
test_df = pd.read_csv('test_data.csv')
# Initialize and run desired check
FeatureDrift().run(train_df, test_df)
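
Checks can also carry conditions so they act as pass/fail gates rather than just reports. A sketch of that, assuming the drift check’s condition helper is named add_condition_drift_score_less_than (the exact name and default thresholds may differ between deepchecks versions):

# Assumed condition helper name; uses the library's default drift thresholds
check = FeatureDrift().add_condition_drift_score_less_than()
result = check.run(train_df, test_df)
result.show()  # the report lists each condition as passed or failed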

ydata-profiling

ydata-profiling, previously pandas-profiling, is a tool that lets you quickly profile a dataset and grok the data.

Key features

  • Type inference: automatic detection of columns’ data types (Categorical, Numerical, Date, etc.)

  • Warnings: A summary of the problems/challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)

  • Univariate analysis: including descriptive statistics (mean, median, mode, etc) and informative visualizations such as distribution histograms

  • Multivariate analysis: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for variables pairwise interaction

  • Time-Series: including different statistical information related to time-dependent data, such as auto-correlation and seasonality, along with ACF and PACF plots.

  • Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)

  • File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata

  • Compare datasets: one-line solution to enable a fast and complete report on the comparison of datasets

  • Flexible output formats: all analysis can be exported to an HTML report that can be easily shared with different parties, as JSON for an easy integration in automated systems and as a widget in a Jupyter Notebook.

The report contains three additional sections:

  • Overview: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)
  • Alerts: a comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others)
  • Reproduction: technical details about the analysis (time, version and configuration)

How to use

ydata-profiling is incredibly simple to use!

All that needs to be done is:

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Profiling Report")
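
As a follow-up sketch (not from the original slides; it assumes ydata-profiling v4, where the compare API lives, plus a train_df/test_df split), exporting a report and comparing two splits could look like this:

train_report = ProfileReport(train_df, title="Train")
test_report = ProfileReport(test_df, title="Test")

train_report.to_file("train_report.html")  # standalone HTML report to share

# Compare the two splits in a single report
comparison = train_report.compare(test_report)
comparison.to_file("comparison_report.html")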

Examples can be found on GitHub.
For a specific example, see titanic.

Great Expectations (GX)

Great Expectations (GX) helps data teams build a shared understanding of their data through quality testing, documentation, and profiling.

GX is a well-known tool with a huge community. This means there are multiple plugins in other tools that support this framework.

It supports backends like Snowflake, BigQuery, Spark, Pandas, and more!

It’s easy to use and produces Data Docs for the tests, which can be saved in S3 or other places, giving everyone the possibility to view and share them!

Example of GX

# great expectations check example
# can also be JSON
expect_column_values_to_be_between(
    column="passenger_count",
    min_value=1,
    max_value=6
)
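
A minimal sketch of running that expectation directly against a pandas DataFrame, assuming GX’s legacy Pandas API (gx.from_pandas); newer GX versions use a context/validator-based workflow instead:

import great_expectations as gx
import pandas as pd

df = pd.read_csv('taxi_trips.csv')  # hypothetical dataset with a passenger_count column

gx_df = gx.from_pandas(df)  # wrap the dataframe so expectation methods are available on it

result = gx_df.expect_column_values_to_be_between(
    column="passenger_count",
    min_value=1,
    max_value=6
)
print(result.success)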

automated data docs

GX even has a Data Assistant that builds automated checks based on a golden dataset!

There are >50 built-in expectations, and >300 including community-added ones!

Our stakeholders would notice data issues before we did – which eroded trust in our data

pandera

  1. Define a schema once and use it to validate different dataframe types.
  2. Check the types and properties of columns/values.
  3. Perform more complex statistical validation like hypothesis testing.
  4. Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
  5. Define dataframe models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.
  6. Synthesize data from schema objects for property-based testing with pandas data structures.
  7. Lazily validate dataframes so that all validation rules are executed before raising an error (see the sketch after the schema examples below).
  8. Integrate with a rich ecosystem of python tools like pydantic, fastapi and mypy.

Pandera Dictionary Schema

import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],  # used by the class-based schema further below
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)
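
To illustrate the lazy-validation feature from the list above, a small sketch reusing the schema defined here (the invalid dataframe is made up for illustration):

bad_df = pd.DataFrame({
    "column1": [1, 4, 0, 11, 9],  # 11 violates the le(10) check
    "column3": ["value_1", "bad", "value_3", "value_2", "value_1"],  # "bad" violates str_startswith
})

try:
    # lazy=True collects every failure instead of raising on the first one
    schema.validate(bad_df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # dataframe listing each failing check and value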

Pandera (Pydantic) Class Schema

from pandera.typing import Series

class Schema(pa.DataFrameModel):
    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field(str_startswith="value_")

    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> Series[bool]:
        """Check that column3 values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2

Schema.validate(df)
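
To show the pipeline-integration point from the feature list (function decorators), a minimal sketch using pandera’s @pa.check_types decorator; the transform function below is made up for illustration:

from pandera.typing import DataFrame

@pa.check_types
def transform(data: DataFrame[Schema]) -> DataFrame[Schema]:
    # Input and output are validated against Schema automatically
    return data.assign(column1=data["column1"].clip(upper=10))

transform(df)  # raises a schema error if df does not match Schema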

Final Comparison Table

Comparing the tools, this is how they can be used, and whether I really like using them :wink:

| Tool | Data Stores (Pandas, Spark, DB, Other) | Steps (Analysis, Training, Production, Non-ML) | Drift | Hypothesis | Data Generation | Data Types | Personal Favorite(s) |
|------|----------------------------------------|------------------------------------------------|-------|------------|-----------------|------------|----------------------|
| deepchecks | ✅ 😐 ❌ ✅ | ✅ ✅ ✅ ✅ | | | | | |
| ydata-profiling | ✅ ✅ ✅ ❌ | ✅ ❌ ❌ ✅ | | | | | |
| greatexpectations | ✅ ✅ ✅ ✅ | ❌ ❌ ✅ ✅ | | | | | 😐 |
| pandera | ✅ ✅ ⏳ ✅ | ❌ ✅ ✅ ✅ | | | | | |

Bonus: Additional Great Frameworks

Bonus 2: PyGWalker

I recently found a new tool called PyGWalker which is really cool! It cannot handle really large data, but it’s excellent for smaller datasets :)

Turn your pandas dataframe into a Tableau-style User Interface for visual analysis
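
A minimal sketch of how it’s typically used in a notebook (the CSV path is made up for illustration):

import pandas as pd
import pygwalker as pyg

df = pd.read_csv('some_smallish_dataset.csv')

# Opens a Tableau-style drag-and-drop UI right inside the notebook
walker = pyg.walk(df)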

PyGWalker

PyGWalker GIF