
General ML interview questions


Questions

How to combat overfitting

  1. Regularization (Search for regularization coefficient)
  2. Dropout (Randomly drops neurons during training)
  3. Simplify Model (Reduce the number of trainable parameters)
  4. Early Stopping
  5. Ensemble Methods (Bagging like Random Forest, Boosting like XGBoost)
  6. Data Augmentation
  7. Feature Selection (Remove irrelevant or highly correlated features)
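
To make a few of these concrete, here is a minimal PyTorch sketch (my own illustration, not from the original post) that combines L2 regularization via weight_decay, dropout inside the model, and early stopping on a held-out validation set; the data is random just to keep the example self-contained.

import torch
import torch.nn as nn

X_train, y_train = torch.randn(256, 20), torch.randint(0, 3, (256,))
X_val, y_val = torch.randn(64, 20), torch.randint(0, 3, (64,))

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty
loss_fn = nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping: no improvement for `patience` epochs
            break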

Difference between Random Forest and XGBoost

  1. Random Forest trains multiple decision trees independently, each on a bootstrap subset of the whole dataset. During inference, the trees process the input independently and their results are aggregated (majority vote for classification, average for regression).
  2. XGBoost trains multiple decision trees sequentially, where each later tree tries to correct the errors made by the earlier ones. During inference, the first tree produces an initial prediction and each following tree produces an increment that adjusts the previous prediction; the final prediction is the sum of the outputs of all trees.
Feature | Random Forest | XGBoost
Ensemble method | Bagging | Boosting
Tree growth | Independent, deep trees | Sequential, shallow trees
Objective function | Averaging / voting | Gradient-based with regularization
Bias-variance tradeoff | Reduces variance | Reduces both bias and variance
Speed | Faster due to parallelism | Slower, but optimized with advanced techniques
Interpretability | More interpretable | Less interpretable without tools like SHAP
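
As a minimal illustration of the two training styles (a sketch, not from the original post; it assumes scikit-learn and the separate xgboost package are installed):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: deep trees trained independently on bootstrap samples, predictions voted/averaged.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X, y)

# Boosting: shallow trees added sequentially, each one fitting the residual error
# of the current ensemble; the final prediction is the sum of all tree outputs.
xgb = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1).fit(X, y)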

How to handle missing values?

  • Fill the missing value with the mean, median, or mode.
  • Fill the missing value by predicting it with another model.
  • Treat it as a special value; XGBoost and LightGBM can handle missing values automatically (see the sketch below).
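
A small sketch of the first and third options (my own illustration, assuming pandas and scikit-learn; the column names are made up):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31], "income": [3000, 5200, np.nan, 4100]})

# Statistical fill: mean / median / most frequent value.
df["age"] = df["age"].fillna(df["age"].median())
filled = SimpleImputer(strategy="mean").fit_transform(df[["age", "income"]])

# Tree libraries such as XGBoost / LightGBM accept NaN directly and learn a
# default split direction for missing values, so no explicit filling is needed there.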

How to train a regression tree?

A regression tree splits the feature space into several subspaces, and each subspace has a fixed prediction value. The split is usually chosen to minimize the MSE or MAE of the resulting partition, and the corresponding prediction for each subspace is the mean (for MSE) or the median (for MAE) of the samples that fall into it.
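
A minimal sketch (my own illustration, not from the post) of searching for the best split of a single feature by minimizing the squared error, with each side predicting its mean:

import numpy as np

def best_split(x, y):
    # Try every observed value as a threshold; each child subspace predicts the
    # mean of its targets, and we keep the split with the lowest total squared error.
    best_t, best_sse = None, np.inf
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t, best_sse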

Difference between Gradient Descent (GD) and Stochastic Gradient Descent (SGD)

GD uses the whole dataset to perform forward propagation, compute the gradient of the loss function, and run back propagation for every parameter update. Comparatively, SGD uses only one sample per propagation and parameter update.

Usually, GD converges more smoothly but the computation cost is high, especially for large datasets. SGD converges more quickly per unit of computation but is noisier and less stable, which also means the order of the samples can influence the final performance.
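
A toy numpy sketch on linear regression (my own illustration with made-up data) showing the difference between the two update rules:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad(w, Xb, yb):
    # Gradient of the mean squared error of a linear model on batch (Xb, yb).
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w_gd = np.zeros(3)
for _ in range(200):                      # GD: the whole dataset per update
    w_gd -= 0.1 * grad(w_gd, X, y)

w_sgd = np.zeros(3)
for _ in range(200):
    i = rng.integers(len(y))              # SGD: one random sample per update (noisier)
    w_sgd -= 0.01 * grad(w_sgd, X[i:i+1], y[i:i+1])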

More Variants of SGD

  • Mini-Batch Gradient Descent (MBGD)
    Uses a mini-batch of size $B$ to compute the gradient. When $B=1$ it reduces to SGD; when $B$ equals the dataset size it reduces to full-batch GD. Mini-batch GD is more stable than SGD and less resource-consuming than full-batch GD.
  • Momentum-Based SGD
    Momentum acts like a velocity vector that accumulates past gradients, so the parameters move more consistently, which accelerates convergence, dampens oscillation, and makes it more likely to escape shallow local minima.
  • Adagrad (Adaptive Gradient Algorithm)
    Scales down the updates of frequently updated parameters (each gradient is divided by the square root of the accumulated squared gradients), which dampens oscillation. However, the accumulated denominator only grows, so the effective learning rate keeps shrinking toward 0 and training can stall earlier than expected.
  • RMSprop (Root Mean Square Propagation)
    To prevent the effective learning rate from vanishing, RMSprop replaces Adagrad's accumulated sum with an exponentially decaying average of past squared gradients, which is stable and does not decay to zero.
  • Adam (Adaptive Moment Estimation)
    Combines RMSprop and Momentum: the numerator is an exponentially decaying average of past gradients (the momentum term) and the denominator is the square root of an exponentially decaying average of past squared gradients; both moments are bias-corrected. (A minimal update-rule sketch follows this list.)
  • AdamW
    With Adam, L2 regularization is implemented by adding the penalty term to the loss before computing gradients, so the decay passes through the adaptive scaling. AdamW instead applies the weight-decay term directly when each parameter is updated (decoupled weight decay).
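
A minimal numpy sketch (my own, not from the post) of a single Adam update, showing how the momentum term and the RMSprop-style denominator fit together:

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: exponentially decaying average of gradients (the momentum / numerator term)
    # v: exponentially decaying average of squared gradients (the RMSprop denominator)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)          # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v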

Non-differentiable?

  • Subgradient: a set of slopes that "lie under" the function, i.e. any $g$ satisfying $f(y) \ge f(x) + g(y-x), \forall y$. For a differentiable function the subgradient at a point contains only the gradient; for a convex function it is a non-empty set; for a non-convex function a subgradient may not exist.
  • Proximal Gradient: take a gradient step on the differentiable part, then apply the proximal operator of the non-differentiable part (see the soft-thresholding sketch after this list).
  • Smoothed Approximations: Find a differentiable approximation.
  • Gradient-Free Optimization Methods: Genetic Algorithm or Simulated Annealing.
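
For example, the proximal operator of the L1 norm is soft thresholding, which gives the proximal gradient (ISTA) step for lasso. A minimal sketch (my own illustration; `lam` is the L1 weight and `step` the step size):

import numpy as np

def soft_threshold(x, lam):
    # Proximal operator of lam * ||x||_1: shrink every coordinate toward zero,
    # setting the small ones exactly to zero.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# One proximal gradient (ISTA) step for lasso, 0.5 * ||Xw - y||^2 + lam * ||w||_1:
#   w = soft_threshold(w - step * X.T @ (X @ w - y), step * lam)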

L1 & L2 Regularization

L1 Regularization tends to produce sparse parameters where many weights are exactly zero, while L2 Regularization tends to produce small but non-zero weights.
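
A small PyTorch sketch (my own) of where the two penalties are added to a task loss; the data and coefficients are made up:

import torch

X, y = torch.randn(64, 10), torch.randn(64)
w = torch.zeros(10, requires_grad=True)

data_loss = ((X @ w - y) ** 2).mean()
l1_penalty = 1e-2 * w.abs().sum()      # tends to drive many weights to exactly zero
l2_penalty = 1e-2 * (w ** 2).sum()     # shrinks weights smoothly but rarely to exact zero
loss = data_loss + l1_penalty + l2_penalty
loss.backward()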

Batch Normalization (BN) vs Layer Normalization (LN)

BN is usually applied in CNNs and LN in RNNs or other models for sequential data. BN normalizes each channel (or each position in a sequence) across the batch. For sequential data, the lengths of the sequences are not guaranteed to be equal; even if we pad the sequences, there may not be enough data at every position to train robust BN statistics. Therefore, we usually apply LN to sequential data. LN normalizes the features within each sample, so it requires no batch statistics and is applicable to sequences of any length.

BN additionally maintains running (global) estimates of the mean and standard deviation for inference, to avoid the instability caused by small batch sizes. LN behaves identically during training and testing.

Normalization keeps the hidden states in a stable distribution and helps prevent gradient explosion and vanishing. Treating each block as an independent classifier, normalization makes sure that the input of each classifier follows a similar distribution; otherwise the input distribution may drift more and more as the network gets deeper.

Normalization also provides some regularization, because the batch statistics add a bit of noise to the hidden layers.
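
A small PyTorch sketch (my own) showing the different normalization axes and the inference-time behavior; the shapes are arbitrary:

import torch
import torch.nn as nn

x = torch.randn(8, 16, 32)                       # (batch, seq_len, channels)

ln = nn.LayerNorm(32)                            # normalizes the features of each sample
out_ln = ln(x)                                   # works for any batch size or sequence length

bn = nn.BatchNorm1d(32)                          # normalizes each channel across the batch
out_bn = bn(x.transpose(1, 2)).transpose(1, 2)   # BatchNorm1d expects (batch, channels, seq)

bn.eval()                                        # inference: BN switches to its running mean/var;
out_eval = bn(x.transpose(1, 2))                 # LayerNorm behaves the same in train and eval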

Dropout Principles & Implementation

During each training iteration, randomly set the outputs of some neurons to 0 as a form of regularization.

  • It discourages co-dependencies among neurons, making the network more generalizable.
  • At inference time, the full network can be viewed as an (approximate) ensemble of the sub-networks sampled during training.
  • During training, the remaining outputs need to be rescaled by $\frac{1}{1-p}$ because a fraction $p$ of the neurons is dropped, so the expected magnitude of the output stays consistent between training and inference (inverted dropout).
import numpy as np

def dropout(x, prob):
    # Inverted dropout: zero each element with probability `prob`, then rescale
    # by 1/(1 - prob) so the expected output magnitude is unchanged.
    mask = (np.random.rand(*x.shape) > prob).astype(x.dtype)
    return x * mask / (1 - prob)

Remember to use * to unpack the tuple x.shape.

Order of Normalization, Activation and Dropout

Usually normalization -> activation -> dropout. Normalization keeps the layer input in a stable distribution so that training is stable; dropout is then applied to the output of the activation layer for regularization.
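
A minimal sketch of a block in this order (my own example; the layer widths are arbitrary):

import torch.nn as nn

block = nn.Sequential(
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),   # normalization: keep the layer input in a stable distribution
    nn.ReLU(),             # activation
    nn.Dropout(0.1),       # dropout on the activation output, for regularization
)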

Implement Cross Entropy Loss

The definition of Cross Entropy Loss is $\mathcal{L}(p, q) = -\sum_i p_i \log q_i$. Literally, the computation should go through every value of the distribution. However, for most classification tasks the ground truth is a one-hot encoding in which exactly one class has probability 1 and all other classes have probability 0. Therefore, for each sample the loss is simply $-\log q_{\text{ground truth}}$.
Notice that here we assume the predicted probabilities have already gone through softmax, whereas torch.nn.CrossEntropyLoss applies log-softmax internally and therefore expects raw logits.

import torch

def CrossEntropy(y, y_pred):
    # y: (B,) integer class labels; y_pred: (B, C) predicted probabilities (after softmax)
    prob = y_pred[torch.arange(y_pred.shape[0]), y]  # probability assigned to the true class
    log_prob = torch.log(prob)
    return -torch.sum(log_prob)

Implement Focal Loss

Focal Loss is designed to assign more weight to badly predicted samples; it can be seen as a variant of Cross Entropy Loss. In detail, if the predicted probability of the ground-truth class is $p$, the assigned weight is $(1-p)^\gamma$, which emphasizes poorly predicted samples and down-weights well predicted ones.

def Focal_Loss(y, y_pred, gamma):
    # Weight each sample's log-likelihood by (1 - p)^gamma, where p is the
    # probability predicted for the ground-truth class.
    prob = y_pred[torch.arange(y_pred.shape[0]), y]
    log_prob = torch.log(prob)
    return -torch.sum(((1 - prob) ** gamma) * log_prob)

From: https://blog.csdn.net/ShadyPi/article/details/143766237
