Generative and discriminative models (ENG)

This is a first attempt to summarize the very broad topic of generative and discriminative models. I will be glad for any suggestions or discussion on this topic. I am sorry about the math formulas - I haven't been able to make them render fully yet; I will try to improve this over time.

Generative models:

  • Estimates the joint probability distribution P(X, Y) of the observed data X and the corresponding labels Y
  • Provides a way to sample (X, Y) pairs
  • Can be used to impute missing data, compress a dataset, or generate unseen data

\[ \arg\max_y P(Y=y|X=x) = \arg\max_y \frac{P(Y=y,X=x)}{P(X=x)} \] Since P(X=x) is constant in the argmax over y, this is equal to: \[ \arg\max_y P(Y=y,X=x) \]
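
As a toy illustration, here is a minimal Python sketch (all numbers are made up): a small joint table P(X, Y) is used both to classify via the argmax over the joint probability and to sample (X, Y) pairs.

```python
import random

# Toy joint distribution P(X, Y); the numbers are invented for illustration.
joint = {
    ("sunny", "play"): 0.30, ("sunny", "stay"): 0.10,
    ("rainy", "play"): 0.05, ("rainy", "stay"): 0.55,
}

def predict(x, labels=("play", "stay")):
    # argmax_y P(Y=y, X=x); dividing by P(X=x) would not change the argmax
    return max(labels, key=lambda y: joint.get((x, y), 0.0))

def sample():
    # a generative model also lets us draw (x, y) pairs from P(X, Y)
    r, acc = random.random(), 0.0
    for pair, p in joint.items():
        acc += p
        if r < acc:
            return pair
    return pair

print(predict("sunny"))  # -> 'play'
print(sample())          # a random (x, y) pair
```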

Conditional Independence:

  • If we can say that P(X|Y,Z) = P(X|Z), then X is conditionally independent of Y given Z

\[ P(X|Y) = P(X_1,X_2|Y) \] = (general property of probabilities) = \[ P(X_1|X_2,Y)P(X_2|Y) \] = (conditional independence assumption) = \[ P(X_1|Y)P(X_2|Y) \]
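
As a quick numeric sanity check, the sketch below builds a made-up joint distribution P(X1, X2, Y) that is constructed to satisfy the conditional independence assumption, and then verifies that P(X1|X2, Y) indeed equals P(X1|Y).

```python
from itertools import product

# Toy joint P(X1, X2, Y) built so that X1 is conditionally independent of X2
# given Y; all numbers are made up for illustration.
p_y    = {"spam": 0.4, "ham": 0.6}
p_x1_y = {("spam", 1): 0.8, ("spam", 0): 0.2, ("ham", 1): 0.1, ("ham", 0): 0.9}
p_x2_y = {("spam", 1): 0.7, ("spam", 0): 0.3, ("ham", 1): 0.2, ("ham", 0): 0.8}

joint = {(x1, x2, y): p_y[y] * p_x1_y[(y, x1)] * p_x2_y[(y, x2)]
         for x1, x2, y in product([0, 1], [0, 1], p_y)}

def p(pred):
    # sum the joint probability over all outcomes matching the predicate
    return sum(prob for outcome, prob in joint.items() if pred(*outcome))

# P(X1=1 | X2=1, Y=spam) versus P(X1=1 | Y=spam): equal under the assumption.
lhs = p(lambda x1, x2, y: x1 == 1 and x2 == 1 and y == "spam") / \
      p(lambda x1, x2, y: x2 == 1 and y == "spam")
rhs = p(lambda x1, x2, y: x1 == 1 and y == "spam") / \
      p(lambda x1, x2, y: y == "spam")
print(lhs, rhs)  # both 0.8
```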

Methods to estimate parameters:

  • MLE is a special case of MAP, where the prior is uniform
  • MLE: \[ \hat{p} = \arg\max_p P(\text{data}|p) = \arg\max_p \sum_i \log P(x_i|p) \]
  • MAP: \[ \hat{p} = \arg\max_p P(\text{data}|p)\,P(p) = \arg\max_p \left( \sum_i \log P(x_i|p) + \log P(p) \right) \]
  • MLE vs MAP
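
A minimal sketch of both estimators for a single Bernoulli parameter (probability of heads), with a Beta distribution standing in for the prior P(p); the counts and the hyper-parameters are made up. With the uniform Beta(1, 1) prior the MAP estimate collapses to the MLE, which is the first bullet above.

```python
# MLE vs. MAP for a Bernoulli parameter p; data and prior are made up.
heads, tails = 3, 1        # observed coin flips
alpha, beta = 2, 2         # Beta(alpha, beta) prior; (1, 1) would be uniform

# MLE: argmax_p P(data | p) -> relative frequency
p_mle = heads / (heads + tails)

# MAP: argmax_p P(data | p) * P(p) -> mode of the Beta posterior.
# With alpha = beta = 2 this is exactly add-one (Laplace) smoothing.
p_map = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

print(p_mle)  # 0.75
print(p_map)  # 0.666... (pulled toward the prior's 0.5)
```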

Naive Bayes

  • Generative model (joint probability P(Y,X))
    • The basic idea is to first build a probability density model of the samples and then use that model for inference/prediction
    • We have to care about the distribution of Y and also the distribution of X
  • Needs to assume that the features are conditionally independent given the class
    • P(“win”, “buy”|spam) = P(“win”|spam) * P(“buy”|spam)
  • Sample complexity: needs on the order of O(log n) training examples to approach its asymptotic error
  • Discrete variant
    • Maximum likelihood estimates - count occurrences (can be smoothed with Laplace smoothing); see the sketch after this list
    • Maximum a posteriori estimation - assuming a Dirichlet prior behaves like Laplace smoothing
  • Continuous variant
    • Usually assumes a Gaussian distribution of the features
    • Estimates the mean and standard deviation
    • Section 2.4.
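
A minimal sketch of the discrete variant with Laplace smoothing, trained on a made-up four-message dataset by simple counting (no library needed):

```python
from collections import Counter, defaultdict
from math import log

# Tiny made-up training set: (text, label) pairs.
train = [
    ("win money now", "spam"),
    ("buy cheap meds", "spam"),
    ("meeting at noon", "ham"),
    ("lunch tomorrow", "ham"),
]

label_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)      # word_counts[label][word] = count
vocab = set()
for text, label in train:
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def predict(text):
    scores = {}
    for label in label_counts:
        # log P(Y) + sum_i log P(word_i | Y), with add-one (Laplace) smoothing
        score = log(label_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("win cheap meds"))    # -> 'spam'
print(predict("meeting tomorrow"))  # -> 'ham'
```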

Discriminative models:

  • Estimates the conditional probability P(tags|data) directly

Logistic regression

  • Discriminative model (conditional probability **P(Y|X)**)
    • The basic idea is to learn the discriminant function directly from a finite sample and to study the prediction model without modelling how the samples are generated.
    • Does not need a distribution of X
  • A sigmoid function is applied to the output of the model -> a threshold determines 0/1 (in the binomial case)
    • It can be shown that the decision boundary is linear: assign “Y=0” given the data if \[ 0 < w_0 + \sum_i w_i X_i \]
    • Section 3
  • Tries to find a relationship between the output and the input (via features)
  • Sample complexity: needs on the order of O(n) training examples to approach its asymptotic error
    • Trained with gradient descent with respect to the parameter vector W (with regularization); see the sketch below
  • Generally needs more data than Naive Bayes
  • The parametric form of logistic regression can be derived from the (Gaussian) Naive Bayes assumptions - Section 3.1
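
A minimal sketch of binary logistic regression trained with gradient descent and L2 regularization, using the common convention P(Y=1|x) = sigmoid(w_0 + sum_i w_i x_i); the two-feature dataset, learning rate and epoch count are made up.

```python
import math

# Made-up two-feature dataset; class 1 has larger feature values.
X = [(1.0, 2.0), (2.0, 1.5), (0.2, 0.3), (0.5, 0.1)]
Y = [1, 1, 0, 0]

w = [0.0, 0.0, 0.0]            # w[0] is the bias, w[1:] the feature weights
lr, l2, epochs = 0.1, 0.01, 1000

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob(x):
    # P(Y=1 | x) = sigmoid(w_0 + sum_i w_i * x_i)
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

for _ in range(epochs):
    for x, y in zip(X, Y):
        err = prob(x) - y                        # gradient of the log-loss
        w[0] -= lr * err
        for i, xi in enumerate(x, start=1):
            w[i] -= lr * (err * xi + l2 * w[i])  # L2 regularization on weights

print(prob((1.5, 1.8)))  # close to 1
print(prob((0.3, 0.2)))  # close to 0
```
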
Determine words from tag
Determine tag from words

HMM

  • Generative model (joint probability P(states,data))
    • Any observation sequence can be generated (sampled) from the model.
    • Also concerned with the distribution over the data
  • Predictions are obtained with the Viterbi algorithm (see the sketch below)
  • Forms a directed graph
  • Each state s_i depends only on the previous state s_i-1 (Markov property)
  • Each observation o_i depends only on the current state s_i
  • Probability distributions:
    • Initial - probability of a state at the beginning
    • Transition - probability of a transition from state A to state B
    • Observation - probability of emitting a visible output from a state
  • Easier to estimate than a CRF
  • Assumes conditional independence over the data (X_i doesn't correlate with X_i+1 given the states)
Hidden Markov Model
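
A minimal Viterbi sketch for a toy HMM with made-up initial, transition and observation (emission) distributions, in the spirit of the classic weather/activity example:

```python
# Toy HMM: hidden weather states emit observed activities; numbers are made up.
states = ("Rainy", "Sunny")
initial = {"Rainy": 0.6, "Sunny": 0.4}
transition = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emission = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

def viterbi(observations):
    # best[t][s] = probability of the best state sequence ending in s at time t
    best = [{s: initial[s] * emission[s][observations[0]] for s in states}]
    back = [{}]
    for t, obs in enumerate(observations[1:], start=1):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, best[t - 1][r] * transition[r][s]) for r in states),
                          key=lambda item: item[1])
            best[t][s] = p * emission[s][obs]
            back[t][s] = prev
    # backtrack from the most probable final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["walk", "shop", "clean"]))  # -> ['Sunny', 'Rainy', 'Rainy']
```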

CRF

  • Discriminative model (conditional probability P(states|data))
    • No assumptions about P(data)
  • Versions:
    • linear-chain CRF (uses only the previous item in the input sequence - similar to the Markov assumption)
    • general CRF (can use any item in the input sequence)
  • Feature/potential functions need to be specified manually
    • can be an indicator of some feature (fires when that feature occurs) - each is assigned a weight
  • Predictions are obtained with the Viterbi algorithm
  • Forms an undirected graph
  • The basic principle is multiplying feature functions by learned weights, exponentiating and normalizing (see the sketch below)
  • Logistic regression is a very simple CRF (only a single output variable)
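
A minimal sketch of a linear-chain CRF with two hand-made indicator feature functions and fixed (pretend "learned") weights; the normalizer Z is computed by brute force over all tag sequences, which is only feasible for toy-sized problems.

```python
from itertools import product
from math import exp

TAGS = ("NOUN", "VERB")

def f_word_tag(prev_tag, tag, words, i):
    # toy indicator: fires when a word ending in "s" is tagged VERB
    return 1.0 if words[i].endswith("s") and tag == "VERB" else 0.0

def f_transition(prev_tag, tag, words, i):
    # toy indicator: fires on a NOUN -> VERB transition
    return 1.0 if prev_tag == "NOUN" and tag == "VERB" else 0.0

features = [(f_word_tag, 1.2), (f_transition, 0.8)]  # (function, weight)

def score(tags, words):
    # unnormalized log-score: sum over positions of weighted feature functions
    total = 0.0
    for i in range(len(words)):
        prev_tag = tags[i - 1] if i > 0 else "<START>"
        total += sum(w * f(prev_tag, tags[i], words, i) for f, w in features)
    return total

def prob(tags, words):
    # P(tags | words) = exp(score) / Z, with Z summed over all tag sequences
    z = sum(exp(score(t, words)) for t in product(TAGS, repeat=len(words)))
    return exp(score(tags, words)) / z

words = ["dog", "runs"]
print(prob(("NOUN", "VERB"), words))  # the highest-probability tagging
print(prob(("VERB", "NOUN"), words))
```
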
Conditional Random Field
All three models

Resources:

Written on July 30, 2018