Much of the work here is summarised from the notes in Generalised Linear Models by Germán Rodríguez, Chapter 7.
The Hazard Function
Another important concept is the hazard function, $\lambda(t)$, which is the instantaneous rate of occurrence of the event, given that the subject has survived until time $t$: $$ \lambda(t) = \lim\limits_{dt\to0} \frac{P(t<T<t+dt \mid T > t)}{dt} $$
The above can be simplified by using the definition of conditional probability and the definition of $S(t)$ above. Since the event $t<T<t+dt$ already implies $T > t$, the joint probability in the numerator reduces to $P(t<T<t+dt)$: $$ \begin{aligned} \lambda(t) =& \lim\limits_{dt\to0} \frac{P(t<T<t+dt, T > t)}{P(T > t)\, dt}\\ =& \lim\limits_{dt\to0} \frac{P(t<T<t+dt)}{dt} \frac{1}{S(t)}\\ =& \frac{f(t)}{S(t)} \end{aligned} $$
Since $S'(t) = -f(t)$ from the first equation, we can also state: $$ \lambda(t) = - \frac{d}{dt}\log S(t) $$
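The two expressions for the hazard can be checked against each other numerically. The sketch below assumes a Weibull distribution with shape `K` and scale `SCALE`; the distribution, the parameter values and the function names are illustrative choices and not part of the notes. It compares $f(t)/S(t)$ with a finite-difference approximation of $-\frac{d}{dt}\log S(t)$.

```python
import math

# Assumed example distribution: Weibull with shape K and scale SCALE.
K, SCALE = 1.5, 2.0


def survival(t):
    """S(t) = exp(-(t / SCALE)^K)."""
    return math.exp(-((t / SCALE) ** K))


def density(t):
    """f(t) = -S'(t) = (K / SCALE) * (t / SCALE)^(K - 1) * S(t)."""
    return (K / SCALE) * (t / SCALE) ** (K - 1) * survival(t)


def hazard_from_ratio(t):
    """Hazard as f(t) / S(t)."""
    return density(t) / survival(t)


def hazard_from_log_survival(t, eps=1e-5):
    """Hazard as -d/dt log S(t), using a central finite difference."""
    return -(math.log(survival(t + eps)) - math.log(survival(t - eps))) / (2 * eps)


times = [0.5, 1.0, 2.0, 4.0]
max_abs_diff = max(
    abs(hazard_from_ratio(t) - hazard_from_log_survival(t)) for t in times
)
print(max_abs_diff)  # should be tiny, limited only by the finite difference
```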
Therefore, integrating from $0$ to $t$ and using $S(0) = 1$ (survival times are non-negative), the survival function can be stated as: $$ S(t) = \exp\left(-\int_{0}^t \lambda(x)\, dx\right) $$ and this will come in handy when we are coding up the survival function.
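As a minimal sketch of how this might be coded up, the function below approximates the integral of the hazard with the trapezoidal rule. It assumes survival times are non-negative, so the integral effectively starts at 0. The name `survival_from_hazard` and the linear example hazard are hypothetical and chosen only because the exact answer is known.

```python
import math


def survival_from_hazard(hazard, t, n_steps=1000):
    """Approximate S(t) = exp(-integral_0^t hazard(x) dx) by the trapezoidal rule."""
    if t <= 0:
        return 1.0  # nothing has had time to happen yet
    dx = t / n_steps
    values = [hazard(i * dx) for i in range(n_steps + 1)]
    # Trapezoidal rule: end points get half weight.
    integral = dx * (sum(values) - 0.5 * (values[0] + values[-1]))
    return math.exp(-integral)


def linear_hazard(x):
    """Example hazard 0.2 * x, whose cumulative hazard is 0.1 * t^2."""
    return 0.2 * x


s_numeric = survival_from_hazard(linear_hazard, 3.0)
s_exact = math.exp(-0.1 * 3.0 ** 2)
print(s_numeric, s_exact)
```

In a real model the hazard would usually be a parametric or learned function, and a closed-form cumulative hazard is preferable whenever one exists.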
Likelihood Function
In any probabilistic framework we wish to maximise the likelihood of the observed data under the model. However, unlike classification or regression settings, we cannot simply use the density $f(t)$ as the likelihood, because some observations are censored. Let $d_i$ indicate whether death was observed ($d_i = 1$) or not ($d_i = 0$); we call the latter censored observations. In those cases we only know that death will occur at some point $T > t_i$. Therefore the likelihood contribution $L_i$ of observation $i$ can be defined as: $$ L_i = \begin{cases} f(t_i) = S(t_i)\lambda(t_i) &\text{ when }d_i = 1 \\ \int^{\infty}_{t_i} f(x)\, dx = S(t_i) &\text{ when }d_i = 0 \end{cases} $$
Both cases contain the factor $S(t_i)$, and $\lambda(t_i)$ appears only when $d_i = 1$, so the above can be simplified as: $$ \begin{aligned} L =& \prod_{i=1}^N L_i = \prod_{i=1}^N \lambda(t_i)^{d_i} S(t_i) \\ \log L =& \sum_{i=1}^N \left[ d_i \log \lambda(t_i) - \Lambda(t_i) \right] \\ -\log L =& \sum_{i=1}^N \left[ \Lambda(t_i) - d_i \log \lambda(t_i) \right] \end{aligned} $$ where $\Lambda(t)\equiv -\log S(t)$ is the cumulative hazard function.
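The hazard form of the negative log-likelihood translates almost directly into code. The sketch below assumes a constant hazard (an exponential model) and a small made-up data set. The function names and the data are illustrative only. For this model the maximum likelihood rate is known in closed form as the number of observed events divided by the total observed time, which gives a simple sanity check.

```python
import math


def neg_log_likelihood(times, events, log_hazard, cum_hazard):
    """-log L = sum_i [ Lambda(t_i) - d_i * log lambda(t_i) ]."""
    return sum(cum_hazard(t) - d * log_hazard(t) for t, d in zip(times, events))


# Made-up toy data: observation times and event indicators (1 = death observed).
times = [2.0, 3.5, 1.0, 4.0, 6.0]
events = [1, 0, 1, 1, 0]


def exponential_nll(rate):
    """Negative log-likelihood when lambda(t) = rate and Lambda(t) = rate * t."""
    return neg_log_likelihood(
        times,
        events,
        log_hazard=lambda t: math.log(rate),
        cum_hazard=lambda t: rate * t,
    )


# Closed-form maximum likelihood estimate for the exponential model.
rate_closed_form = sum(events) / sum(times)

# Crude grid search over candidate rates as an independent check.
candidate_rates = [0.001 * i for i in range(1, 1001)]
rate_grid = min(candidate_rates, key=exponential_nll)
print(rate_closed_form, rate_grid)
```

Note that censored observations still contribute through $\Lambda(t_i)$ even though they add no $\log \lambda(t_i)$ term.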
Similarly, if we wish to avoid the hazard function altogether, we can write the likelihood in terms of $f(t)$ and $S(t)$ only: $$ \begin{aligned} L =& \prod_{i=1}^N L_i = \prod_{i=1}^N f(t_i)^{d_i} S(t_i)^{1-d_i} \\ -\log L =& -\sum_{i=1}^N \left[ d_i \log f(t_i) + (1 - d_i) \log S(t_i) \right] \end{aligned} $$
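Because $f(t) = \lambda(t) S(t)$, the density form and the hazard form of the negative log-likelihood should give the same number for the same data. The sketch below checks this under an assumed Weibull model with made-up observations. The parameter values, data and function names are illustrative and do not come from the notes.

```python
import math

# Assumed Weibull model with shape K and scale SCALE.
K, SCALE = 1.5, 2.0


def log_survival(t):
    """log S(t) = -(t / SCALE)^K."""
    return -((t / SCALE) ** K)


def log_density(t):
    """log f(t) for the Weibull density."""
    return math.log(K / SCALE) + (K - 1) * math.log(t / SCALE) - (t / SCALE) ** K


def log_hazard(t):
    """log lambda(t) = log f(t) - log S(t)."""
    return math.log(K / SCALE) + (K - 1) * math.log(t / SCALE)


def nll_density_form(times, events):
    """-sum_i [ d_i log f(t_i) + (1 - d_i) log S(t_i) ]."""
    return -sum(
        d * log_density(t) + (1 - d) * log_survival(t)
        for t, d in zip(times, events)
    )


def nll_hazard_form(times, events):
    """sum_i [ Lambda(t_i) - d_i log lambda(t_i) ], with Lambda = -log S."""
    return sum(-log_survival(t) - d * log_hazard(t) for t, d in zip(times, events))


# Made-up toy data: observation times and event indicators (1 = death observed).
times = [2.0, 3.5, 1.0, 4.0, 6.0]
events = [1, 0, 1, 1, 0]

nll_f = nll_density_form(times, events)
nll_h = nll_hazard_form(times, events)
print(nll_f, nll_h)
```

Which form is more convenient depends on whether the model parameterises the density and survival function directly or the hazard.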
These two forms of the likelihood will be used when modelling censored data (in different modelling settings). We work with the negative log-likelihood because modern deep learning libraries are built around minimising a loss function by gradient descent.
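To make the gradient descent remark concrete, here is a minimal sketch that minimises the negative log-likelihood of a constant-hazard model by plain gradient descent. The data are made up, the gradient is derived by hand (a deep learning library would compute it automatically), and the parameterisation $\lambda = e^{\theta}$, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
import math

# Made-up toy data: observation times and event indicators (1 = death observed).
times = [2.0, 3.5, 1.0, 4.0, 6.0]
events = [1, 0, 1, 1, 0]

total_time = sum(times)
total_events = sum(events)


def nll(theta):
    """-log L for a constant hazard exp(theta): exp(theta) * sum(t_i) - theta * sum(d_i)."""
    return math.exp(theta) * total_time - theta * total_events


def nll_gradient(theta):
    """Derivative of nll with respect to theta."""
    return math.exp(theta) * total_time - total_events


theta = 0.0          # initial guess, i.e. a hazard of 1.0
learning_rate = 0.1  # assumed step size, small enough for this toy problem
for _ in range(500):
    theta -= learning_rate * nll_gradient(theta)

fitted_rate = math.exp(theta)
print(fitted_rate, total_events / total_time)  # the two values should agree closely
```

Optimising $\theta = \log \lambda$ rather than $\lambda$ itself keeps the hazard positive without needing a constraint.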