- Elliptic Envelope is suitable for normally distributed data with low dimensionality. As its name implies, it fits a multivariate normal distribution to the data and uses the resulting distance measure (the Mahalanobis distance) to separate outliers from inliers.
- Local Outlier Factor is a comparison of the local density of an observation with that of its neighbors. Observations with much lower density than their neighbors are considered outliers.
- One-Class Support Vector Machine (SVM) with Stochastic Gradient Descent (SGD) is an O(n) approximation of the One-Class SVM. Note that the exact, O(n²) One-Class SVM works well on our small example dataset but may be impractical for your actual use case.
- Isolation Forest is a tree-based approach where outliers are more quickly isolated by random splits than inliers.
- On-base percentage (OBP), the rate at which a batter reaches base (by hitting, walking, or being hit by a pitch) per plate appearance
- Slugging (SLG), the average number of total bases per at bat
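
To make the two definitions concrete, here is a small sketch that computes both metrics from counting stats. It uses the standard formulas: OBP = (H + BB + HBP) / (AB + BB + HBP + SF), whose denominator is slightly smaller than total plate appearances because it leaves out events such as sacrifice bunts, and SLG = total bases / AB. The function names and the season line are hypothetical, invented purely for illustration.

```python
def on_base_percentage(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


def slugging(singles, doubles, triples, home_runs, ab):
    """SLG = total bases / AB, where a home run counts as four bases."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / ab


# Hypothetical season line (made up for this example, not real player data):
# 150 hits = 90 singles + 35 doubles + 5 triples + 20 home runs.
obp = on_base_percentage(h=150, bb=60, hbp=5, ab=500, sf=5)
slg = slugging(singles=90, doubles=35, triples=5, home_runs=20, ab=500)
print(f"OBP: {obp:.3f}, SLG: {slg:.3f}")  # OBP: 0.377, SLG: 0.510
```
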
Code:
```python
from pybaseball import (cache, batting_stats_bref, batting_stats, playerid_reverse_lookup)
import pandas as pd

cache.enable()  # avoid unnecessary requests when re-running

MIN_PLATE_APPEARANCES = 200

# For readability and reasonable default sort order
df_bref = batting_stats_bref(2023).query(f"PA >= {MIN_PLATE_APPEARANCES}").rename(columns={"Lev":"League","Tm":"Team"})
df_bref["League"] = \
    df_bref["League"].str.replace("Maj-","").replace("AL,NL","NL/AL").astype('category')

df_fg = batting_stats(2023, qual=MIN_PLATE_APPEARANCES)

key_mapping = \
    playerid_reverse_lookup(df_bref["mlbID"].to_list(), key_type='mlbam')[["key_mlbam","key_fangraphs"]].rename(columns={"key_mlbam":"mlbID","key_fangraphs":"IDfg"})

df = df_fg.drop(columns="Team").merge(key_mapping, how="inner", on="IDfg").merge(df_bref[["mlbID","League","Team"]],how="inner", on="mlbID").sort_values(["League","Team","Name"])
```
Code:
```python
print(df[["OBP","SLG"]].describe().round(3))
print(f"\nCorrelation: {df[['OBP','SLG']].corr()['SLG']['OBP']:.3f}")
```
Output:
```
           OBP      SLG
count  362.000  362.000
mean     0.320    0.415
std      0.034    0.068
min      0.234    0.227
25%      0.300    0.367
50%      0.318    0.414
75%      0.340    0.460
max      0.416    0.654
```