preprocessing.tabular

ayniy.preprocessing.tabular.aggregation(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, groupby_dict: dict, nunique_dict: dict) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Add aggregation features, configured separately for groupby and nunique

Parameters
  • train (pd.DataFrame) – training data

  • test (pd.DataFrame) – test data

  • groupby_dict (dict) – settings for the groupby aggregations

  • nunique_dict (dict) – settings for the nunique (distinct-count) aggregations

Returns

train, test

Return type

Tuple[pd.DataFrame, pd.DataFrame]
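
The layout of groupby_dict and nunique_dict is not documented here, so the following is only a minimal pandas sketch of the general idea of group-wise aggregation features. The helper name add_group_stats and its key, var and aggs arguments are hypothetical and not part of ayniy; computing the statistics over train and test together is also an assumption.

```python
import pandas as pd


def add_group_stats(train, test, key, var, aggs):
    """Hypothetical helper: add statistics of `var` computed per `key` group."""
    n_train = len(train)
    # Assumption: statistics are computed over train and test together.
    df = pd.concat([train, test], sort=False).reset_index(drop=True)
    for agg in aggs:
        # transform() broadcasts the group statistic back to every row.
        df[f"{agg}_{var}_by_{key}"] = df.groupby(key)[var].transform(agg)
    train_out = df.iloc[:n_train].reset_index(drop=True)
    test_out = df.iloc[n_train:].reset_index(drop=True)
    return train_out, test_out


train = pd.DataFrame({"shop": ["a", "a", "b"], "price": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"shop": ["b", "a"], "price": [50.0, 30.0]})
train_out, test_out = add_group_stats(
    train, test, key="shop", var="price", aggs=["mean", "nunique"]
)
print(train_out)
```

Here "nunique" counts the distinct prices seen in each shop, while "mean" is an ordinary groupby statistic.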

ayniy.preprocessing.tabular.circle_encoding(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, encode_col: List[str]) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Circle encoding

Parameters
  • train (pd.DataFrame) – training data

  • test (pd.DataFrame) – test data

  • encode_col (List[str]) – columns to encode

Returns

train, test

Return type

Tuple[pd.DataFrame, pd.DataFrame]
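
Circle (cyclical) encoding usually means mapping a periodic value onto the unit circle with a sine/cosine pair, so that the end of a cycle sits next to its start. The sketch below shows that general technique, assuming this is what the function does. The helper cyclical_encode and its explicit period argument are hypothetical and not part of ayniy, which may derive the period differently.

```python
import numpy as np
import pandas as pd


def cyclical_encode(df, col, period):
    """Hypothetical helper: add sin/cos features for a periodic column."""
    out = df.copy()
    angle = 2 * np.pi * out[col] / period
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out


hours = pd.DataFrame({"hour": [0, 6, 12, 23]})
encoded = cyclical_encode(hours, "hour", period=24)
print(encoded)
```

With this encoding, hour 23 ends up close to hour 0, which a raw integer column cannot express.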

ayniy.preprocessing.tabular.count_null(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, encode_col: List[str]) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Count NaN

Parameters
  • train (pd.DataFrame) – training data

  • test (pd.DataFrame) – test data

  • encode_col (List[str]) – columns in which NaN values are counted

Returns

train, test

Return type

Tuple[pd.DataFrame, pd.DataFrame]
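
A common way to turn missingness into a feature is to count the NaN values per row across the selected columns. The sketch below assumes that reading. The helper add_null_count and the output column name count_null are hypothetical and not part of ayniy.

```python
import numpy as np
import pandas as pd


def add_null_count(df, cols, new_col="count_null"):
    """Hypothetical helper: count NaN values per row over `cols`."""
    out = df.copy()
    out[new_col] = out[cols].isnull().sum(axis=1)
    return out


df = pd.DataFrame(
    {"a": [1.0, np.nan, np.nan], "b": [np.nan, 2.0, np.nan], "c": [1, 2, 3]}
)
counted = add_null_count(df, ["a", "b"])
print(counted["count_null"].tolist())  # [1, 1, 2]
```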

ayniy.preprocessing.tabular.datetime_parser(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, encode_col: List[str]) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Datetime columns parser

Parameters
  • train (pd.DataFrame) – training data

  • test (pd.DataFrame) – test data

  • encode_col (List[str]) – datetime columns to parse

Returns

train, test

Return type

Tuple[pd.DataFrame, pd.DataFrame]
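
Which components the parser extracts is not stated here. As an illustration only, the sketch below converts a string column with pandas and expands it into a few calendar features. The helper expand_datetime and the generated column names are hypothetical and not part of ayniy.

```python
import pandas as pd


def expand_datetime(df, col):
    """Hypothetical helper: expand a datetime-like column into parts."""
    out = df.copy()
    ts = pd.to_datetime(out[col])
    out[f"{col}_year"] = ts.dt.year
    out[f"{col}_month"] = ts.dt.month
    out[f"{col}_day"] = ts.dt.day
    out[f"{col}_dayofweek"] = ts.dt.dayofweek  # Monday is 0
    out[f"{col}_hour"] = ts.dt.hour
    return out


df = pd.DataFrame({"created": ["2020-01-15 08:30:00", "2021-12-31 23:59:00"]})
parsed = expand_datetime(df, "created")
print(parsed)
```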

ayniy.preprocessing.tabular.detect_delete_cols(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, escape_col: List[str], threshold: float) → Tuple[List, List, List]

Detect unnecessary columns that can be deleted

Parameters
  • train (pd.DataFrame) – training data

  • test (pd.DataFrame) – test data

  • escape_col (List[str]) – columns exempt from detection

  • threshold (float) – correlation threshold above which a column is flagged for deletion

Returns

unique_cols, duplicated_cols, high_corr_cols

Return type

Tuple[List, List, List]
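
The three returned lists suggest three checks: columns with a single value, columns that duplicate another column, and columns that correlate strongly with another column. The sketch below is one simple way to compute such lists with pandas. It is not ayniy's implementation; the helper find_removable_columns is hypothetical, it works on a single DataFrame, and it does not model escape_col.

```python
import pandas as pd


def find_removable_columns(df, threshold):
    """Hypothetical helper: list constant, duplicated and correlated columns."""
    # Columns with a single distinct value carry no information.
    unique_cols = [c for c in df.columns if df[c].nunique() == 1]

    # Columns whose values exactly repeat an earlier column.
    dup_mask = df.T.duplicated()
    duplicated_cols = list(dup_mask[dup_mask].index)

    # For every pair with |correlation| above the threshold, flag the later column.
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    high_corr_cols = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            if corr.loc[left, right] > threshold and right not in high_corr_cols:
                high_corr_cols.append(right)
    return unique_cols, duplicated_cols, high_corr_cols


df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4],
        "x_copy": [1, 2, 3, 4],
        "x_scaled": [2, 4, 6, 8],
        "const": [0, 0, 0, 0],
        "noise": [3, 1, 4, 1],
    }
)
unique_cols, duplicated_cols, high_corr_cols = find_removable_columns(df, 0.95)
print(unique_cols, duplicated_cols, high_corr_cols)
```

The correlation of a constant column is undefined (NaN), so it never exceeds the threshold and only appears in the first list.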

ayniy.preprocessing.tabular.fillna(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, encode_col: List[str], how: str) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Replace NaN

Parameters
  • train (pd.DataFrame) – training data

  • test (pd.DataFrame) – test data

  • encode_col (List[str]) – columns whose NaN values are replaced

  • how (str) – statistic used to fill NaN, either ‘median’ or ‘mean’

Returns

train, test

Return type

Tuple[pd.DataFrame, pd.DataFrame]
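
The effect of the how argument can be illustrated with plain pandas. In the sketch below the helper fill_missing is hypothetical and not part of ayniy, and taking the fill value from the training data only is an assumption; the library may compute it differently.

```python
import numpy as np
import pandas as pd


def fill_missing(train, test, cols, how):
    """Hypothetical helper: fill NaN in `cols` with the median or the mean."""
    if how not in ("median", "mean"):
        raise ValueError("how must be 'median' or 'mean'")
    train, test = train.copy(), test.copy()
    for col in cols:
        # Assumption: the fill value is estimated on the training data only.
        value = train[col].median() if how == "median" else train[col].mean()
        train[col] = train[col].fillna(value)
        test[col] = test[col].fillna(value)
    return train, test


train = pd.DataFrame({"age": [20.0, np.nan, 40.0, 90.0]})
test = pd.DataFrame({"age": [np.nan, 30.0]})
train_median, test_median = fill_missing(train, test, ["age"], how="median")
train_mean, test_mean = fill_missing(train, test, ["age"], how="mean")
print(train_median["age"].tolist(), train_mean["age"].tolist())
```

With an outlier such as 90 in the column, the median (40) and the mean (50) give noticeably different fill values.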

ayniy.preprocessing.tabular.frequency_encoding(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, encode_col: List[str]) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Frequency encoding

Parameters
  • train (pd.DataFrame) – training data

  • test (pd.DataFrame) – test data

  • encode_col (List[str]) – columns to encode

Returns

train, test

Return type

Tuple[pd.DataFrame, pd.DataFrame]
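
Frequency encoding replaces each category with how often it occurs. The sketch below shows the technique with raw counts; the helper frequency_encode and the fe_ column prefix are hypothetical and not part of ayniy, and counting over train and test together is an assumption.

```python
import pandas as pd


def frequency_encode(train, test, col):
    """Hypothetical helper: add the occurrence count of each category."""
    n_train = len(train)
    # Assumption: counts are taken over train and test together.
    df = pd.concat([train, test], sort=False).reset_index(drop=True)
    counts = df[col].value_counts()
    df[f"fe_{col}"] = df[col].map(counts)
    train_out = df.iloc[:n_train].reset_index(drop=True)
    test_out = df.iloc[n_train:].reset_index(drop=True)
    return train_out, test_out


train = pd.DataFrame({"city": ["tokyo", "osaka", "tokyo"]})
test = pd.DataFrame({"city": ["tokyo", "kyoto"]})
train_fe, test_fe = frequency_encode(train, test, "city")
print(train_fe["fe_city"].tolist())  # [3, 1, 3]
```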

ayniy.preprocessing.tabular.matrix_factorization(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, encode_col: List[str], n_components_lda: int = 5, n_components_svd: int = 3) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Matrix factorization

Parameters
  • train (pd.DataFrame) – training data

  • test (pd.DataFrame) – test data

  • encode_col (List[str]) – columns used for the factorization

  • n_components_lda (int, optional) – number of output dimensions for LDA. Defaults to 5.

  • n_components_svd (int, optional) – number of output dimensions for SVD. Defaults to 3.

Returns

train, test

Return type

Tuple[pd.DataFrame, pd.DataFrame]
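
How the input matrix is built from encode_col is not described here, and the LDA part is not covered by this example. To show only what an output dimension such as n_components_svd means, the sketch below applies a truncated SVD to a small numeric matrix with NumPy. The helper svd_features is hypothetical and not part of ayniy.

```python
import numpy as np
import pandas as pd


def svd_features(df, cols, n_components):
    """Hypothetical helper: project `cols` onto the top singular directions."""
    X = df[cols].to_numpy(dtype=float)
    # Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :n_components] * s[:n_components]
    names = [f"svd_{i}" for i in range(n_components)]
    return pd.DataFrame(Z, columns=names, index=df.index)


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]})
features = svd_features(df, ["a", "b", "c"], n_components=2)
print(features.shape)  # (4, 2)
```

Each row keeps n_components values, ordered so that the first component carries the most variation.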

ayniy.preprocessing.tabular.standerize(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, encode_col: List[str]) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Standardization

Parameters
  • train (pd.DataFrame) – training data

  • test (pd.DataFrame) – test data

  • encode_col (List[str]) – columns to standardize

Returns

train, test

Return type

Tuple[pd.DataFrame, pd.DataFrame]
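
Standardization rescales a column to zero mean and unit variance. The sketch below shows the technique with pandas; the helper standardize_sketch is hypothetical and not part of ayniy, and estimating the mean and standard deviation on the training data only is an assumption.

```python
import pandas as pd


def standardize_sketch(train, test, cols):
    """Hypothetical helper: z-score `cols` using training statistics."""
    train, test = train.copy(), test.copy()
    for col in cols:
        # Assumption: statistics come from the training data only.
        mean = train[col].mean()
        std = train[col].std(ddof=0)  # population standard deviation
        train[col] = (train[col] - mean) / std
        test[col] = (test[col] - mean) / std
    return train, test


train = pd.DataFrame({"height": [150.0, 160.0, 170.0]})
test = pd.DataFrame({"height": [160.0, 180.0]})
train_std, test_std = standardize_sketch(train, test, ["height"])
print(train_std["height"].tolist())
```

Reusing the training statistics for the test set keeps both sets on the same scale.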