How StaticFrame Enables Comprehensive DataFrame Type Hints
Since the advent of type hints in Python 3.5, statically typing a DataFrame has generally been limited to specifying just the type:
```python
def process(f: DataFrame) -> Series: ...
```
This is inadequate, as it ignores the types contained within the container. A DataFrame might have string column labels and three columns of integer, string, and floating-point values; these characteristics define the type. A function argument with such type hints provides developers, static analyzers, and runtime checkers with all the information needed to understand the expectations of the interface. StaticFrame 2 (an open-source project of which I am lead developer) now permits this:
```python
from typing import Any

import numpy as np
from static_frame import Frame, Index, TSeriesAny

def process(f: Frame[   # type of the container
        Any,            # type of the index labels
        Index[np.str_], # type of the column labels
        np.int_,        # type of the first column
        np.str_,        # type of the second column
        np.float64,     # type of the third column
        ]) -> TSeriesAny: ...
```
All core StaticFrame containers now support generic specifications. While these type hints are statically checkable, a new decorator, `@CallGuard.check`, also permits validating them at runtime on function interfaces. Further, using Annotated generics, the new Require class defines a family of powerful runtime validators, permitting per-column or per-row data checks. Finally, each container exposes a new `via_type_clinic` interface to derive and validate type hints. Together, these tools offer a cohesive approach to type-hinting and validating DataFrames.
Requirements of a Generic DataFrame
Python’s built-in generic types (e.g., tuple or dict) require specification of component types (e.g., tuple[int, str, bool] or dict[str, int]). Defining component types permits more accurate static analysis. While the same is true for DataFrames, there have been few attempts to define comprehensive type hints for DataFrames.
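As a small illustration of component types on built-in generics, the sketch below shows what a hint such as tuple[int, str, bool] exposes to tools. The alias and function names are hypothetical and chosen only for this example. The component types can be read back at runtime with typing.get_args, and a static checker uses the same information to infer the type of each element.

```python
from typing import get_args, get_origin

Record = tuple[int, str, bool]  # hypothetical alias: fixed length, typed positions
Counts = dict[str, int]         # hypothetical alias: key type and value type

def label_of(record: Record) -> str:
    # From the hint alone, a static checker can infer that record[1] is a str.
    return record[1]

print(get_args(Record))    # the component types of the tuple hint
print(get_origin(Counts))  # the container being parameterized
print(label_of((1, "a", True)))
```

With a bare tuple or dict hint, none of this component information would be available to a checker.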
Pandas, even with the pandas-stubs package, does not permit specifying the types of a DataFrame’s components. The Pandas DataFrame, permitting extensive in-place mutation, may not be sensible to type statically. Fortunately, immutable DataFrames are available in StaticFrame.
Further, Python’s tools for defining generics, until recently, have not been well-suited for DataFrames. That a DataFrame has a variable number of heterogeneous columnar types poses a challenge for generic specification. Typing such a structure became easier with the new TypeVarTuple, introduced in Python 3.11 (and back-ported in the typing_extensions package).
A TypeVarTuple permits defining generics that accept a variable number of types. (See PEP 646 for details.) With this new type variable, StaticFrame can define a generic Frame with a TypeVar for the index, a TypeVar for the columns, and a TypeVarTuple for zero or more columnar types.
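The shape of such a generic can be sketched with a toy class. MiniFrame and describe below are hypothetical names used only for illustration; this is not StaticFrame's actual Frame definition, just a minimal example of combining two TypeVars with a TypeVarTuple so that zero or more trailing types are accepted.

```python
from typing import Generic, TypeVar, get_args

try:  # TypeVarTuple and Unpack are in typing from Python 3.11
    from typing import TypeVarTuple, Unpack
except ImportError:  # older interpreters can use the back-port
    from typing_extensions import TypeVarTuple, Unpack

TIndex = TypeVar("TIndex")         # type of the index
TColumns = TypeVar("TColumns")     # type of the columns
TDtypes = TypeVarTuple("TDtypes")  # zero or more columnar types

class MiniFrame(Generic[TIndex, TColumns, Unpack[TDtypes]]):
    """Hypothetical stand-in for a generic DataFrame; not StaticFrame's Frame."""

def describe(hint) -> dict:
    """Split a parameterized MiniFrame hint into its three roles."""
    index, columns, *dtypes = get_args(hint)
    return {"index": index, "columns": columns, "dtypes": tuple(dtypes)}

three_columns = describe(MiniFrame[int, str, int, str, float])
no_columns = describe(MiniFrame[int, str])
print(three_columns)
print(no_columns)
```

The first two arguments always fill the index and columns positions; everything after them is collected by the TypeVarTuple, which is what lets one class describe tables of any width.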
A generic Series is defined with a TypeVar for the index and a TypeVar for the values. The StaticFrame Index and IndexHierarchy are also generic, the latter again taking advantage of TypeVarTuple to define a variable number of component Index for each depth level.
StaticFrame uses NumPy types to define the columnar types of a Frame, or the values of a Series or Index. This permits narrowly specifying sized numerical types, such as np.uint8 or np.complex128; or broadly specifying categories of types, such as np.integer or np.inexact. As StaticFrame supports all NumPy types, the correspondence is direct.
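The difference between narrow and broad specifications follows from NumPy's scalar type hierarchy, which can be explored directly. The helper name matches is hypothetical, and this sketch only demonstrates the hierarchy with np.issubdtype; it makes no claim about how StaticFrame performs its own checks.

```python
import numpy as np

def matches(dtype: np.dtype, hint: type) -> bool:
    """True if a concrete dtype falls under a (possibly abstract) NumPy scalar type."""
    return bool(np.issubdtype(dtype, hint))

uint8 = np.dtype(np.uint8)
float64 = np.dtype(np.float64)
complex128 = np.dtype(np.complex128)

print(matches(uint8, np.uint8))         # narrow: an exact sized type
print(matches(uint8, np.integer))       # broad: any integer type
print(matches(float64, np.inexact))     # broad: floating-point or complex
print(matches(complex128, np.inexact))
print(matches(uint8, np.inexact))       # integers are not inexact
```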
Interfaces Defined with Generic DataFrames
Extending the example above, the function interface below shows a Frame with three columns transformed into a dictionary of Series. With so much more information provided by component type hints, the function’s purpose is almost obvious.
```python
from typing import Any

import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth

def process(f: Frame[
        Any, Index[np.str_], np.int_, np.str_, np.float64,
        ]) -> dict[int, Series[ # type of the container
                IndexYearMonth, # type of the index labels
                np.float64,     # type of the values
                ]]: ...
```
This function processes a signal table from an Open Source Asset Pricing (OSAP) dataset (Firm Level Characteristics / Individual / Predictors). Each table has three columns: security identifier (labeled “permno”), year and month (labeled “yyyymm”), and the signal (with a name specific to the signal).
The function ignores the index of the provided Frame (typed as Any) and creates groups defined by the np.int_ values of the first column, “permno”. A dictionary keyed by “permno” is returned, where each value is a Series of np.float64 values for that “permno”; the index is an IndexYearMonth created from the np.str_ “yyyymm” column. (StaticFrame uses NumPy datetime64 values to define unit-typed indices: IndexYearMonth stores datetime64[M] labels.)
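To make the transformation concrete, here is a rough sketch using plain NumPy arrays in place of the Frame's three columns. It is not a StaticFrame implementation: the function name group_signal, the sample values, and the use of (labels, values) tuples instead of Series are all assumptions for illustration. It does show how “yyyymm” strings become datetime64[M] labels and how values are grouped by “permno”.

```python
import numpy as np

def group_signal(permno, yyyymm, signal):
    """Group signal values by permno; return {permno: (month_labels, values)}."""
    # "202301" becomes ISO "2023-01", which NumPy parses at month resolution.
    months = np.array(
        [f"{s[:4]}-{s[4:]}" for s in yyyymm],
        dtype="datetime64[M]",
    )
    groups = {}
    for key in np.unique(permno):
        mask = permno == key
        groups[int(key)] = (months[mask], signal[mask])
    return groups

# Made-up sample data standing in for the three columns of a signal table.
permno = np.array([10001, 10001, 10002])
yyyymm = np.array(["202301", "202302", "202301"])
signal = np.array([0.5, 0.7, 1.2])

groups = group_signal(permno, yyyymm, signal)
labels, values = groups[10001]
print(labels.dtype, labels, values)
```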
Rather than returning a dict, the function below returns a Series with a hierarchical index. The IndexHierarchy generic specifies a component Index for each depth level; here, the outer depth is an Index[np.int_] (derived from the “permno” column), the inner depth an IndexYearMonth (derived from the “yyyymm” column).
```python
from typing import Any

import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth, IndexHierarchy

def process(f: Frame[
        Any, Index[np.str_], np.int_, np.str_, np.float64,
        ]) -> Series[                    # type of the container
                IndexHierarchy[          # type of the index labels
                        Index[np.int_],  # type of index depth 0
                        IndexYearMonth], # type of index depth 1
                np.float64,              # type of the values
                ]: ...
```
Rich type hints provide a self-documenting interface that makes functionality explicit. Even better, these type hints can be used for static analysis with Pyright (now) and Mypy (pending full TypeVarTuple support). For example, calling this function with a Frame of two columns of np.float64 will fail a static analysis type check or deliver a warning in an editor.
Runtime Type Validation
Static type checking may not be enough: runtime evaluation provides even stronger constraints, particularly for dynamic or incompletely (or incorrectly) type-hinted values.
Building on a new runtime type checker named TypeClinic, StaticFrame 2 introduces @CallGuard.check, a decorator for runtime validation of type-hinted interfaces. All StaticFrame and NumPy generics are supported, and most built-in Python types are supported, even when deeply nested. The function below adds the @CallGuard.check decorator.
```python
from typing import Any

import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth, IndexHierarchy, CallGuard

@CallGuard.check
def process(f: Frame[
        Any, Index[np.str_], np.int_, np.str_, np.float64,
        ]) -> Series[
                IndexHierarchy[Index[np.int_], IndexYearMonth],
                np.float64,
                ]: ...
```
If the function above, now decorated with @CallGuard.check, is called with an unlabelled Frame of two columns of np.float64, a ClinicError exception will be raised, reporting that two columns were provided where three were expected, and that integer column labels were provided where string labels were expected. (To issue warnings instead of raising exceptions, use the @CallGuard.warn decorator.)
```
ClinicError:
In args of (f: Frame[Any, Index[str_], int64, str_, float64]) -> Series[IndexHierarchy[Index[int64], IndexYearMonth], float64]
└── Frame[Any, Index[str_], int64, str_, float64]
    └── Expected Frame has 3 dtype, provided Frame has 2 dtype
In args of (f: Frame[Any, Index[str_], int64, str_, float64]) -> Series[IndexHierarchy[Index[int64], IndexYearMonth], float64]
└── Frame[Any, Index[str_], int64, str_, float64]
    └── Index[str_]
        └── Expected str_, provided int64 invalid
```
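The general pattern behind a checking decorator can be sketched with the standard library alone. The version below is a deliberately simplified toy, not TypeClinic or CallGuard: the names check, HintError, and scale are hypothetical, and only plain (non-generic) class hints are validated, whereas the real tools handle nested generics.

```python
import functools
import inspect
from typing import get_type_hints

class HintError(TypeError):
    """Hypothetical error type for this sketch."""

def check(func):
    """Validate arguments and return value against plain class hints."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    def validate(label, value, expected):
        # Only plain classes are handled; anything else is skipped.
        if isinstance(expected, type) and not isinstance(value, expected):
            raise HintError(
                f"In {label} of {func.__name__}: "
                f"expected {expected.__name__}, provided {type(value).__name__}"
            )

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            validate(f"args ({name})", value, hints.get(name))
        result = func(*args, **kwargs)
        validate("return", result, hints.get("return"))
        return result

    return wrapper

@check
def scale(values: list, factor: float) -> list:
    return [v * factor for v in values]

print(scale([1, 2], 2.0))
try:
    scale([1, 2], "2")
except HintError as error:
    print(error)
```

Raising at the call boundary, rather than somewhere inside the function body, keeps the error message tied to the interface that was violated.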
Runtime Data Validation
Other characteristics can be validated at runtime, such as the shape or name attributes, or the sequence of labels on the index or columns. The StaticFrame Require class provides a family of configurable validators.
– Require.Name: Validate the `name` attribute of the container.
– Require.Len: