9  Machine learning

A hypothesis in the context of ML is a mathematical function that an algorithm uses to represent the relationship between the target variable and the features. Learning can be divided into supervised and unsupervised learning.

| Supervised learning | Unsupervised learning |
| --- | --- |
| Works on data that contains both the inputs and the expected outputs, i.e. labeled data | Works on data that contains no mappings from input to output, i.e. unlabeled data |
| Used to create models that can predict or classify things | Used to extract meaningful information out of large volumes of data |
| E.g. decision trees, logistic regression, SVM | E.g. k-means clustering, hierarchical clustering, the Apriori algorithm |

9.1 Regression and classification algorithms

  • Regression: linear (when the target variable is continuous and numeric) and logistic (when the target variable is categorical)

    • Linear regression is a supervised learning algorithm which helps in finding the linear relationship between two variables. It finds the line with the smallest sum of squared residuals that is possible for the dataset.
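A minimal sketch of this idea in plain Python: for simple (one-feature) linear regression, the slope and intercept that minimize the sum of squared residuals have a closed form.

```python
# Minimal sketch: ordinary least squares for simple linear regression,
# computed with the closed-form formulas (no external libraries needed).

def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Data generated from y = 2x + 1, so the fit recovers slope 2, intercept 1.
slope, intercept = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
```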
  • Classification refers to a predictive modeling process where a class label is predicted for a given example of input data. It helps categorize the provided input into the label shared by other observations with similar features.

    • Naive Bayes is a supervised classification ML algorithm based on Bayes' theorem, which deals with the probability of an event occurring given that another event has already occurred (i.e. the mathematical formula for determining conditional probability). It rests on two assumptions: first, each feature/attribute present in the dataset is independent of the others, and second, each feature carries equal importance. It is called „naive“ because it assumes that the occurrence of a certain feature is independent of the occurrence of other features (hence each feature contributes individually to the result), which is unrealistic for real-world data.

    • Support vector machine (SVM) is a supervised ML model used for two-group classification problems. It represents the training data as points in space that are separated into categories by a clear gap that should be as wide as possible. A kernel function is used to transform data that is not linearly separable into data that is: it is a generalized dot product used to compute the dot product of vectors x and y in a high-dimensional feature space. This transformation relies on the kernel trick (projecting the data into a higher-dimensional space where it can be linearly divided by a plane).
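A minimal sketch of one common kernel, the Gaussian (RBF) kernel: it returns the similarity of two points, which equals a dot product in an implicit high-dimensional feature space. The parameter name `gamma` and the default value are illustrative assumptions.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: equivalent to a dot product of the two
    points mapped into an infinite-dimensional feature space
    (the 'kernel trick'), without ever computing that mapping."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical points have maximal similarity 1; distant points approach 0.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
far = rbf_kernel([0.0, 0.0], [10.0, 10.0])
```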

    • Logistic regression is a classification algorithm used to predict the probability of certain classes based on some independent variables. The probability is estimated using its underlying logistic (sigmoid) function. In short, the logistic regression model computes a weighted sum of the input features and applies the logistic function to the result.

      Although it is a classification algorithm (it predicts a discrete class), it belongs to the regression family because it predicts outcomes based on quantitative relationships between variables. Unlike linear regression, it accepts both continuous and discrete variables as input, and its output is qualitative.

      Sigmoid function: \(\quad S(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1} = 1 - S(-x).\)
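The model described above can be sketched in a few lines of plain Python; the weights and bias here are arbitrary illustrative values, not a trained model.

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_proba(features, weights, bias):
    """Logistic regression: weighted sum of the inputs, then the sigmoid."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(z)

# z = 0.5*1.0 + (-0.25)*2.0 + 0 = 0, so the predicted probability is 0.5.
p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```

Note the symmetry S(x) = 1 − S(−x) from the formula above: a weighted sum of zero always maps to probability 0.5, which is why 0 is the natural decision boundary on the logit scale.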

  • The elbow method is used to select „k“ for k-means clustering. It plots the value of the cost function produced by different values of k (e.g. 1 to 15); the chosen k sits at the „elbow“, where increasing k stops yielding a significant drop in cost. The k-means cost function is the sum of squared distances of each data point to the centroid of the cluster to which it belongs.
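The cost function being plotted can be sketched directly (plain Python, points as tuples; the clustering itself is assumed to have been done already):

```python
def kmeans_cost(points, centroids, assignments):
    """Sum of squared distances of each point to its assigned centroid --
    the quantity plotted against k in the elbow method."""
    cost = 0.0
    for point, c in zip(points, assignments):
        centroid = centroids[c]
        cost += sum((p - q) ** 2 for p, q in zip(point, centroid))
    return cost

# Two tight clusters around x=0.5 and x=10.5; each point is 0.5 away
# from its centroid, so the cost is 4 * 0.5^2 = 1.0.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centroids = [(0.5, 0.0), (10.5, 0.0)]
cost = kmeans_cost(points, centroids, [0, 0, 1, 1])
```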

  • A ROC curve is a graph showing the performance of a classification model at all classification thresholds ([0,1]). It plots two parameters: the true positive rate (sensitivity) and the false positive rate (1 − specificity). Decreasing the threshold moves up along the curve. Classifiers whose curves are closer to the top-left corner perform better. Note that the ROC does not depend on the class distribution, which makes it useful for evaluating classifiers that predict rare events such as diseases or disasters; evaluating performance using accuracy would instead favor classifiers that always predict a negative outcome for rare events. To compare different classifiers, it is useful to summarize the performance of each into a single measure: the AUC (area under the ROC curve). The AUC is the probability that the model will score a randomly chosen positive example higher than a randomly chosen negative example.
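The probabilistic definition of AUC can be computed directly from that description, by counting how often a positive example outscores a negative one (a minimal all-pairs sketch in plain Python; ties count as half):

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect ranking (every positive outscores every negative) gives AUC 1.0;
# indistinguishable scores give the chance level 0.5.
perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
chance = auc([0.5, 0.5], [1, 0])
```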

  • CART stands for classification and regression trees

  • A decision tree is a non-parametric model that can be used for both classification and regression. Non-parametric means the model makes no strong assumptions about the form of the underlying data, and its complexity can grow with the training data rather than being fixed in advance. Trees are constructed from nodes and branches, where the root node tests the feature that best splits the data. Decision trees are built by recursively splitting the training samples using the features that work best for the specific task. The splits are chosen by evaluating certain metrics (such as „information entropy“), depending on whether the feature is discrete or continuous.

    Steps are:

    1. Take the entire data set as input
    2. Look for a split that maximizes the separation of the classes
    3. Apply the split (divide step)
    4. Re-apply steps 1) and 2) to the divided data
    5. Stop when you meet stopping criteria
    6. Pruning (clean up the tree if you went too far)

    Entropy in ML is a measurement of the disorder or impurity in the information being processed.

    \[ E = -\sum_{i=1}^{N} P_{i}\log_{2}P_{i}, \] where \(P_{i}\) is the probability of randomly selecting an example in class \(i\).

    Information gain is a measure of how much entropy is reduced when a particular feature is used to split the data. It calculates the difference between entropy before and after the split.
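Both quantities follow directly from the formula above; a minimal sketch in plain Python, working on raw class counts:

```python
import math

def entropy(class_counts):
    """Shannon entropy E = -sum_i P_i log2(P_i) of a label distribution."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def information_gain(parent_counts, child_splits):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_splits)
    return entropy(parent_counts) - weighted

# A perfectly mixed node (5 vs 5, entropy 1 bit) split into two pure
# children (entropy 0) yields the maximal gain of 1 bit.
gain = information_gain([5, 5], [[5, 0], [0, 5]])
```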

    Pruning is a technique that simplifies the decision tree by reducing its rules, removing branches that add little predictive power. It avoids unnecessary complexity and improves accuracy on unseen data.

  • Random forest is a model built from a number of decision trees: the data is split into different random subsets and a decision tree is built on each group of data. The random forest then brings all those trees together (the individual trees need to have low correlation with each other).

    Steps to build a model:

    1. Randomly select k features from a total of m features (k<<m)
    2. Among the k features, calculate the node using the best split point
    3. Split the node into daughter nodes using the best split
    4. Repeat steps two and three until leaf nodes are finalized
    5. Build forest by repeating steps one to four for n times to create n number of trees
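The steps above are what scikit-learn's `RandomForestClassifier` implements; a minimal sketch, assuming scikit-learn is installed (the toy data and parameter values are illustrative):

```python
# Each tree is grown on a bootstrap sample, and max_features controls how
# many randomly selected features are considered at every split (step 1).
from sklearn.ensemble import RandomForestClassifier

# Toy, clearly separable data: class 1 has large feature values.
X = [[0, 1], [1, 0], [1, 1], [8, 9], [9, 8], [9, 9]]
y = [0, 0, 0, 1, 1, 1]

forest = RandomForestClassifier(n_estimators=10, max_features=1, random_state=0)
forest.fit(X, y)
predictions = forest.predict([[0, 0], [10, 10]])
```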

9.2 Tuning model parameters, evaluation

Overfitting refers to a model that fits a very small amount of data too closely and ignores the bigger picture, so it fails to generalize to new data.

There are several methods to avoid it:
  - feature selection
  - cross-validation
  - data augmentation (creating more data samples from the existing set; e.g. in CNNs, producing new images by rotating, scaling, flipping, ...)
  - regularization
  - early stopping (a regularization technique that stops training at the point where further fitting of the training data starts to increase the generalization error)
  - dropout (a regularization technique used in neural networks where we randomly deactivate a proportion of neurons in each layer)

  • Dimensionality reduction helps in compressing data and removing redundant features. Feature selection is the method of reducing the input variables to your model by using only relevant data and getting rid of noise in the data.
    Feature selection methods split into supervised and unsupervised; the supervised methods are further divided into intrinsic, wrapper and filter methods.


    - Filter method: features are dropped based on their relation to the output, i.e. how strongly they correlate with the output.

    - Wrapper method: we split our data into subsets and train a model on them. Based on the model's output, we add or remove features and train the model again. It forms the subsets using a greedy approach and evaluates the accuracy of all the possible combinations of features.

  • Multicollinearity appears in a model when the independent variables in a multiple regression are highly correlated with each other. It can be overcome by removing a few of the highly correlated variables from the equation.

  • Feature scaling is one of the most important data preprocessing steps in ML. Algorithms that compute distances between features are biased towards numerically larger values if the data is not scaled. The most popular techniques are normalization and standardization. The sklearn library provides the transformers MinMaxScaler and StandardScaler for these.
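A minimal plain-Python sketch of both techniques, for a single feature (sklearn's MinMaxScaler and StandardScaler do the same per column, fitted on training data):

```python
def min_max_scale(values):
    """Normalization: rescale values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: shift to zero mean and scale to unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

scaled = min_max_scale([10.0, 20.0, 30.0])       # -> values in [0, 1]
standardized = standardize([10.0, 20.0, 30.0])   # -> mean 0, variance 1
```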

  • Feature engineering is the method used to create new features from the given dataset using the existing variables, e.g. imputation, discretization, categorical encoding, ...

  • Cross-validation is a statistical method used to estimate the performance of ML models. It is used to protect against overfitting in a predictive model, particularly in a case where the amount of data may be limited.

    k-fold cross-validation guarantees that the score of our model does not depend on the way we picked the train and test set. The data is first randomly divided into k subsets. For each subset, build your model on the other k−1 subsets and then test it on the kth subset to check its effectiveness. Repeat this until each of the k subsets has served as the test set. The average of your k recorded accuracies is called the cross-validation accuracy and serves as the performance metric for the model. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times. Also, it only estimates the accuracy, it does not improve it.
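The splitting step can be sketched in plain Python (sklearn's `KFold` provides the same thing, plus shuffling); each index appears in exactly one test fold:

```python
def k_fold_indices(n_samples, k):
    """Partition sample indices into k disjoint folds. Each fold serves
    once as the test set while the remaining k-1 folds form the train set."""
    indices = list(range(n_samples))
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits

splits = k_fold_indices(10, 5)   # 5 (train, test) pairs of indices
```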

  • Regularization is a form of regression which discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. The general idea is to penalize complicated models by adding an additional penalty to the loss function, generating a larger loss; this discourages the model from learning too many details and keeps it more general. Three popular methods are Ridge regression (L2 norm, most used), Lasso (L1 norm) and Dropout (used in neural networks). If there is noise in the training data, the estimated coefficients won't generalize well to future data, and this is where regularization comes in. It works by adding a tuning parameter λ that decides how much we want to penalize the flexibility of our model. As λ rises, it reduces the value of the coefficients and thus reduces the variance. Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence avoiding overfitting) without losing any important properties in the data. But beyond a certain value, the model starts losing important properties, giving rise to bias and thus underfitting.
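The shrinking effect of λ can be shown with the simplest possible case, one-feature ridge regression without an intercept, which has a closed form (a minimal sketch, not a general implementation):

```python
def ridge_slope(xs, ys, lam):
    """One-feature ridge regression without intercept: minimizing
    sum((y - w*x)^2) + lam * w^2 gives w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # data with true slope 2
ols = ridge_slope(xs, ys, lam=0.0)           # lam = 0: plain least squares
shrunk = ridge_slope(xs, ys, lam=10.0)       # larger lam shrinks the slope
```

With λ = 0 the penalty vanishes and the ordinary least-squares slope is recovered; any positive λ pulls the coefficient towards zero, which is exactly the variance reduction described above.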

  • Ensemble learning is combining several individual models together to improve performance.

    1. Boosting is an ensemble learning method where we create multiple models and train them sequentially, combining weak models iteratively so that training each new model depends on the models trained before it. In each iteration, we give more importance to the observations in the dataset that were incorrectly handled or predicted by the previous models. It is also useful for reducing bias.
    2. Bagging is an ensemble learning method where we generate data using the bootstrap method: from an existing dataset we generate multiple samples of size N, drawn with replacement. This bootstrapped data is then used to train multiple models in parallel, which makes the ensemble more robust than a single model. Once all the models are trained and it is time to make a prediction, we query all the trained models and average the results for regression; for classification, we choose the result with the highest frequency (majority vote).
    3. Stacking is an ensemble learning method where we combine weak models that may additionally use different learning algorithms. Such learners are called heterogeneous learners (boosting and bagging use homogeneous learners). Stacking works by training multiple, different weak models or learners, and then combining them by training another model, called a meta-model, on their predictions.
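Bagging's bootstrap-then-average loop can be sketched with a deliberately trivial "model" (the sample mean, a regression case); the structure is the same when the model is a decision tree:

```python
import random

def bootstrap_sample(data, rng):
    """Draw a sample of the same size as `data`, with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_mean(data, n_models=50, seed=0):
    """Bagging sketch: fit one 'model' (here just a mean) per bootstrap
    sample, then average all the predictions (the regression case)."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        predictions.append(sum(sample) / len(sample))
    return sum(predictions) / len(predictions)

# The bagged estimate stays close to the plain mean (3.0) of the data.
estimate = bagged_mean([1.0, 2.0, 3.0, 4.0, 5.0])
```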

Three commonly used methods for finding the sweet spot between simple and complicated models are: regularization, boosting and bagging.

  • Gradient descent, in ML, is an iterative method that minimizes the cost function parametrized by the model parameters. It improves the learning model by providing feedback so that the parameters can be adjusted to minimize the error and reach a local or global minimum. The gradient measures the change in error with respect to a change in the parameters. The learning rate, or step size, is the size of the steps taken to reach the minimum; it is typically a small value, evaluated and updated based on the behavior of the cost function. High learning rates result in larger steps but risk overshooting the minimum. There are three types of gradient descent:
    • batch gradient descent: the computation is carried out on the entire dataset
    • stochastic gradient descent: the computation is carried out on only one training sample at a time
    • mini-batch gradient descent: a small number/batch of training samples is used for the computation
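The core update rule is the same in all three variants; a minimal sketch minimizing a one-parameter cost function whose gradient is known in closed form:

```python
def gradient_descent(grad, w0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient: w <- w - lr * grad(w)."""
    w = w0
    for _ in range(n_steps):
        w -= learning_rate * grad(w)
    return w

# Minimize f(w) = (w - 3)^2; its gradient is 2*(w - 3), minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

The three variants differ only in how `grad` is estimated: from the whole dataset (batch), one sample (stochastic), or a small batch (mini-batch).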