Functions used to create PyTorch `Dataset`s and `DataLoader`s.

class TestData[source]

TestData(t:array, b:Optional[array]=None, x:Optional[array]=None, t_scaler:MaxAbsScaler=None, x_scaler:StandardScaler=None) :: Dataset

Create a PyTorch `Dataset`.

parameters:

  • t: time elapsed
  • b: (optional) breakpoints where the hazard differs from the previous time segment. Must include 0 as the first element and the maximum time as the last element
  • x: (optional) features
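For illustration, a minimal sketch of building a `TestData` dataset from numpy arrays. The import path `torchlife.data` is an assumption here; use whatever module the class lives in within your install:

```python
import numpy as np
from torchlife.data import TestData  # hypothetical import path

t = np.array([1.0, 3.5, 7.2])   # time elapsed per observation
b = np.array([0.0, 2.0, 7.2])   # breakpoints: 0 first, max time (7.2) last
x = np.random.randn(3, 2)       # optional feature matrix (3 rows, 2 features)

ds = TestData(t, b=b, x=x)      # indexable like any PyTorch Dataset
```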

class Data[source]

Data(t:array, e:array, b:Optional[array]=None, x:Optional[array]=None, t_scaler:MaxAbsScaler=None, x_scaler:StandardScaler=None) :: TestData

Create a PyTorch `Dataset`.

parameters:

  • t: time elapsed
  • e: (death) event indicator. 1 if the event was observed, 0 otherwise (censored).
  • b: (optional) breakpoints where the hazard differs from the previous time segment.
  • x: (optional) features
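A matching sketch for `Data`, which adds the event indicator `e` (same hypothetical import path as above):

```python
import numpy as np
from torchlife.data import Data  # hypothetical import path

t = np.array([2.0, 5.0, 7.5])   # time elapsed
e = np.array([1, 0, 1])         # 1 = death observed, 0 = censored
ds = Data(t, e)                 # breakpoints and features are optional
```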

class TestDataFrame[source]

TestDataFrame(df:DataFrame, b:Optional[array]=None, t_scaler:MaxAbsScaler=None, x_scaler:StandardScaler=None) :: TestData

Wrapper around the `TestData` class that takes in a dataframe instead.

parameters:

  • df: dataframe. Must have `t` (time) and `e` (event) columns; other columns are optional.
  • b: (optional) breakpoints of time
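A sketch of the dataframe route, assuming columns other than `t` and `e` are treated as the features `x` (the import path is again hypothetical):

```python
import numpy as np
import pandas as pd
from torchlife.data import TestDataFrame  # hypothetical import path

df = pd.DataFrame({
    't':  [1.0, 4.0, 6.0],    # time column (required)
    'e':  [1, 0, 1],          # event column (required)
    'x0': [0.2, -1.3, 0.5],   # assumed to be picked up as a feature
})
ds = TestDataFrame(df, b=np.array([0.0, 3.0, 6.0]))
```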

class DataFrame[source]

DataFrame(data=None, index:Optional[Collection[T_co]]=None, columns:Optional[Collection[T_co]]=None, dtype:Union[ForwardRef('ExtensionDtype'), str, dtype, Type[Union[str, float, int, complex, bool]], NoneType]=None, copy:bool=False) :: NDFrame

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Parameters

data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects.

Changed in version 0.23.0: if data is a dict, column order follows insertion-order for Python 3.6 and later.

Changed in version 0.25.0: if data is a list of dicts, column order follows insertion-order for Python 3.6 and later.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, ..., n) if no column labels are provided.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input.

See Also

DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.

Examples

Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Create iterable data loaders / a fastai `DataBunch` using the classes above:

create_dl[source]

create_dl(df:DataFrame, b:Optional[array]=None, train_size:float=0.8, random_state=None, bs:int=128)

Take a dataframe, split it into train, test, and (optional) validation sets, and convert them to a fastai `DataBunch`.

parameters:

  • df: pandas dataframe
  • b: (optional) breakpoints of time. Must include 0 as the first element and the maximum time as the last element
  • train_size: proportion of the data used for training
  • random_state: (optional) seed for the train/validation split
  • bs: batch size
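A usage sketch; the return value is described only as a fastai databunch, so the comment below is an assumption about what it wraps (the import path is hypothetical):

```python
import numpy as np
import pandas as pd
from torchlife.data import create_dl  # hypothetical import path

df = pd.DataFrame({'t': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'e': [1, 0, 1, 1, 0]})
b = np.array([0.0, 2.5, 5.0])   # 0 first, max time last

db = create_dl(df, b=b, train_size=0.8, random_state=42, bs=2)
# db presumably wraps the train/validation DataLoaders, ready for a fastai Learner
```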

create_test_dl[source]

create_test_dl(df:DataFrame, b:Optional[array]=None, t_scaler:MaxAbsScaler=None, x_scaler:StandardScaler=None, bs:int=128, only_x:bool=False)

Take a dataframe and return a PyTorch `DataLoader`.

parameters:

  • df: pandas dataframe
  • b: (optional) breakpoints of time
  • t_scaler, x_scaler: (optional) scalers for t and x (e.g. as fitted on the training data)
  • bs: batch size
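A scoring-time sketch, assuming the same breakpoints used in training are passed back in (and, if scaling was used, the fitted scalers as well); the import path is hypothetical:

```python
import numpy as np
import pandas as pd
from torchlife.data import create_test_dl  # hypothetical import path

test_df = pd.DataFrame({'t': [1.5, 3.0, 4.5], 'e': [0, 1, 1]})
b = np.array([0.0, 2.5, 4.5])   # same breakpoints as used for training

test_dl = create_test_dl(test_df, b=b, bs=2)
for batch in test_dl:           # iterates like any PyTorch DataLoader
    pass
```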

get_breakpoints[source]

get_breakpoints(df:DataFrame, percentiles:list=[20, 40, 60, 80])

Gives the times at which death events occur at the given percentiles.

parameters:

  • df: must contain columns 't' (time) and 'e' (death event)
  • percentiles: list of percentiles at which breakpoints occur (do not include 0 and 100)
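A sketch with a toy dataframe; the comment states the intended reading of the output, not a checked value, and the import path is hypothetical:

```python
import pandas as pd
from torchlife.data import get_breakpoints  # hypothetical import path

df = pd.DataFrame({
    't': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    'e': [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
})
breakpoints = get_breakpoints(df, percentiles=[25, 50, 75])
# times by which 25% / 50% / 75% of the observed deaths have occurred
```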