Machine Learning for Python
Machine learning is the process of teaching computer software to build a statistical model from data. The purpose of machine learning (ML) is to transform data and extract the essential patterns or insights it contains.
Machine Learning Interview Questions
1. What is the purpose of Machine Learning?
The most straightforward answer is to make our lives simpler. In the early days of "intelligent" applications, many systems used hard-coded "if" and "else" rules to analyze data or respond to user input. Consider a spam filter responsible for moving incoming spam messages to a spam folder. With machine learning, instead of writing those rules by hand, we provide the algorithm with enough data for it to learn the patterns itself.
Unlike traditional rule-based programming, we don't need to define new rules for each machine learning problem; we can reuse the same approach with a different dataset.
For example, if we have a historical dataset of actual sales figures, we can train machine learning models to forecast future sales.
Principal Component Analysis, or PCA, is a dimensionality-reduction approach for reducing the dimensionality of big data sets by converting a large collection of variables into a smaller one that retains the majority of the information in the large set.
2. Define Supervised Learning?
Supervised learning is a machine learning technique that uses labeled training data to infer a function. A series of training examples make up the training data.
Example:
Knowing a person's height and weight might help determine their gender. The most common supervised learning algorithms are listed below (a short scikit-learn sketch follows the list).
Support Vector Machines
K-Nearest Neighbors
Neural Networks
Naive Bayes
Regression
Decision Trees
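As a rough illustration of supervised learning, here is a minimal sketch; the choice of scikit-learn, the bundled iris dataset, and a k-nearest-neighbors model are my own assumptions, not prescribed by the text.

```python
# Minimal supervised-learning sketch: labeled data in, fitted classifier out.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # features and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)  # a simple supervised learner
model.fit(X_train, y_train)                  # learn from labeled examples
print("test accuracy:", model.score(X_test, y_test))
```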
3. Explain Unsupervised Learning?
Unsupervised learning is a machine learning method that searches for patterns in a data set. There is no dependent variable or label to forecast in this case. Algorithms for unsupervised learning include:
Clustering
Latent Variable Models
Neural Networks
Anomaly Detection
Example:
For example, clustering T-shirts might group them by attributes such as collar style (V-neck vs. crew neck) and sleeve type.
4. What should you do if you're Overfitting or Underfitting?
Overfitting occurs when a model fits the training data too closely; in this scenario, we should resample the data and evaluate model accuracy using approaches such as k-fold cross-validation.
In underfitting, the model cannot interpret or capture the patterns in the data; here we should either try different algorithms or feed more data points to the model. A cross-validation sketch follows.
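A minimal k-fold cross-validation sketch, assuming scikit-learn; the iris data and the depth-limited decision tree are illustrative choices, not from the original text.

```python
# k-fold cross-validation sketch: estimate generalization accuracy
# instead of trusting the training-set score alone.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3)   # limiting depth also curbs overfitting
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("fold accuracies:", scores, "mean:", scores.mean())
```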
5. Define Neural Network?
It's a simplified representation of the human mind. It has neurons that activate when it encounters anything comparable to the brain. The many neurons are linked by connections that allow information to travel from one neuron to the next.
6. What is the meaning of Loss Function and Cost Function? What is the main distinction between them?
When computing loss, we consider only a single data point; this is the loss function. The cost function computes the total error across many data points, so the difference is mainly one of scope: a loss function captures the difference between the actual and predicted values for a single record, whereas a cost function aggregates that difference across the entire training dataset. The most widely used loss functions are mean squared error and hinge loss. Mean Squared Error (MSE) measures how far our model's predicted values are from the actual values.
MSE = (1/n) Σ (predicted value - actual value)²
Hinge loss: used to train classifiers such as SVMs, defined as L(y) = max(0, 1 - y·ŷ),
where y = -1 or +1 denotes the true class and ŷ denotes the classifier's raw output. For the straight-line equation y = mx + b, the most common cost function expresses the total cost as the sum of the fixed and variable costs. A small numerical sketch of both losses follows.
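A small numpy sketch of the two losses described above; the sample values are made up for illustration.

```python
import numpy as np

# Mean squared error: average of squared differences between prediction and truth.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.3])
mse = np.mean((y_pred - y_true) ** 2)

# Hinge loss: true labels are -1/+1, scores are the raw classifier outputs.
labels = np.array([-1, 1, 1])
scores = np.array([-0.8, 0.4, 2.0])
hinge = np.mean(np.maximum(0, 1 - labels * scores))

print("MSE:", mse, "hinge:", hinge)
```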
7. Define Ensemble Learning?
Ensemble learning is a strategy for creating more powerful machine learning models by combining numerous models.
Individual models can differ for several reasons, including:
Different populations
Different hypotheses
Different modelling approaches
We will encounter error when working with the model's training and testing data. Bias, variance, and irreducible error are all possible sources of this error. Ideally, the model should strike a balance between bias and variance, which we call the bias-variance trade-off, and ensemble learning is one way to achieve it.
There are a variety of ensemble approaches available. However, there are two main strategies for aggregating several models:
Bagging generates new training sets by resampling the existing one and trains a model on each.
Boosting iteratively re-weights the training set so that later models focus on the examples earlier models got wrong. A short sketch of both approaches follows.
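A minimal sketch comparing a bagging and a boosting ensemble, assuming scikit-learn; the synthetic dataset and estimator counts are illustrative assumptions.

```python
# Ensemble sketch: bagging and boosting evaluated on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)             # resampled training sets
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)   # re-weighted errors

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```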
8. How do you know the Machine Learning Algorithm you should use?
It depends entirely on the data we have: classification algorithms such as SVM suit a discrete target, while linear regression suits a continuous one. There is therefore no one-size-fits-all method for choosing a machine learning algorithm; it all relies on exploratory data analysis (EDA). EDA is similar to "interviewing" a dataset. As part of that interview, we do the following:
Sort our variables into categories such as continuous and categorical.
Summarize our variables with descriptive statistics.
Visualize our variables with charts.
Choose the best-fit method for the dataset based on these observations (a small pandas sketch follows).
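A minimal EDA sketch with pandas; the tiny made-up DataFrame is only for illustration, and the histogram call assumes matplotlib is installed.

```python
import pandas as pd

# A tiny made-up dataset for illustration.
df = pd.DataFrame({
    "age": [23, 35, 31, 52, 46],
    "income": [32000, 58000, 47000, 91000, 72000],
    "segment": ["A", "B", "A", "C", "B"],
})

print(df.dtypes)                    # which columns are continuous vs categorical
print(df.describe(include="all"))   # descriptive statistics for each column
df.hist(column=["age", "income"])   # quick histograms (requires matplotlib)
```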
9. How should Outlier Values be Handled?
An outlier is a dataset observation significantly different from the rest of the dataset. The following are some of the tools that are used to find outliers.
Z-score
Box plot
Scatter plot, etc.
To deal with outliers, we usually use one of three simple strategies (a z-score sketch follows the list):
Remove them from the dataset.
Label them as outliers and add that flag to the feature set.
Transform the feature to lessen the impact of the outlier.
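A small numpy sketch of z-score-based outlier handling; the sample values and the cut-off of 2 are assumptions for illustration.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 10], dtype=float)   # 95 is a likely outlier

z = (data - data.mean()) / data.std()   # z-score for each point
outliers = data[np.abs(z) > 2]          # flag points far from the mean
cleaned = data[np.abs(z) <= 2]          # or simply drop them

print("outliers:", outliers)
print("cleaned :", cleaned)
```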
10. Define Random Forest? What is the mechanism behind it?
Random forest is a machine learning approach that may be used for both regression and classification. It operates by combining many different tree models: each tree is built from a bootstrap sample of the training data rows and considers a random subset of the feature columns at each split.
The procedure for creating trees in a random forest is as follows:
Draw a bootstrap sample from the training data.
Begin by creating a single node.
From the start node, run the following algorithm:
a. Stop if the number of observations is fewer than the minimum node size.
b. Choose a random subset of variables.
c. Determine which variable does the "best" job of separating the data.
d. Split the observations into two child nodes.
e. Run step 'a' on each of these nodes.
A scikit-learn sketch follows.
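A minimal random forest sketch, assuming scikit-learn; the iris dataset and the hyperparameters shown are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of rows and a random subset of columns per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```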
11. What are SVM's different Kernels?
In SVM, there are six different types of kernels; four of them are listed below, followed by a short comparison sketch.
Linear kernel - when the data is linearly separable.
Polynomial kernel - when you have discrete data with no natural idea of smoothness.
Radial basis function (RBF) kernel - creates a decision boundary that can separate two classes considerably better than a linear kernel.
Sigmoid kernel - uses the sigmoid (neural network activation) function.
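A minimal sketch comparing these kernels with scikit-learn's SVC; the iris dataset and default kernel parameters are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Same data, different kernels; pick whichever cross-validates best.
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(kernel, round(score, 3))
```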
12. What is Machine Learning Bias?
Data bias indicates that there is an inconsistency in the data. This inconsistency can arise for different reasons, which are not mutually exclusive. For example, to speed up its recruiting process, Amazon built an engine that would take 100 resumes and return the best five candidates to hire. Once the company noticed the program wasn't producing gender-neutral results, it was adjusted to remove the bias.
13. What is the difference between regression and classification?
Classification is used to produce discrete outcomes and to categorize data into specific categories; an example is classifying emails into spam and non-spam groups. Regression, on the other hand, works with continuous data; an example is predicting stock prices at a specific point in time.
The term "classification" refers to the process of categorizing the output into a set of categories. For example, is it going to be cold or hot tomorrow? On the other hand, regression is used to forecast the connection that data reflects. An example is, what will the temperature be tomorrow?
14. What is Clustering, and how does it work?
Clustering is the process of dividing a collection of things into several groups. Objects in the same cluster should be similar to one another but not those in different clusters.
The following are some examples of clustering algorithms:
K-means clustering
Hierarchical clustering
Fuzzy clustering
Density-based clustering, etc.
15. What is the best way to choose K for K-means Clustering?
Direct procedures and statistical testing methods are the two types of approaches available:
Direct methods: the elbow and silhouette methods.
Statistical testing methods: the gap statistic.
The silhouette is most commonly used when selecting the ideal value of k; a short sketch follows.
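A minimal sketch of choosing k with the silhouette score, assuming scikit-learn; the synthetic blob data and the k range are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit k-means for several k and keep the k with the best silhouette.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```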
16. Define Recommender Systems
A recommendation engine is a program that predicts a user's preferences and suggests things that are likely to be of interest to them. Data for recommender systems comes from explicit user evaluations after seeing a movie or listening to music, implicit search engine inquiries and purchase histories, and other information about the users/items themselves.
17. How do you determine if a dataset is normal?
Plots can be used as a visual aid. The following are a few formal normality checks (a scipy sketch follows the list):
Shapiro-Wilk test
Anderson-Darling test
Martinez-Iglewicz test
Kolmogorov-Smirnov test
D'Agostino skewness test
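A minimal sketch of several of these tests using scipy.stats (the Martinez-Iglewicz test is not included there); the random sample is an illustrative assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=200)                 # data to check for normality

print(stats.shapiro(sample))                  # Shapiro-Wilk
print(stats.anderson(sample))                 # Anderson-Darling
print(stats.kstest(sample, "norm"))           # Kolmogorov-Smirnov vs a standard normal
print(stats.normaltest(sample))               # D'Agostino-Pearson (skewness and kurtosis)
```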
18. Is it possible to utilize logistic regression for more than two classes?
By default, logistic regression is a binary classifier, so it cannot handle more than two classes directly. It can, however, be extended to multi-class classification problems via multinomial logistic regression (or one-vs-rest schemes).
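A minimal multi-class sketch with scikit-learn's LogisticRegression, whose lbfgs solver handles the multinomial case; the iris dataset and max_iter value are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)   # three classes, not two

# One logistic regression model covering all three classes.
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```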
19. Explain covariance and correlation?
Correlation is a statistical technique for determining and quantifying the quantitative relationship between two variables. The strength of a relationship between two variables is measured by correlation. Income and spending, demand and supply, and so on are examples.
Covariance is a straightforward way of measuring the degree of association between two variables. The issue with covariance is that values on different scales are difficult to compare without normalization. A small numpy sketch follows.
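A small numpy sketch contrasting covariance and correlation; the income/spending figures are made up for illustration.

```python
import numpy as np

income = np.array([30, 45, 52, 61, 78], dtype=float)
spending = np.array([22, 31, 35, 40, 55], dtype=float)

cov = np.cov(income, spending)[0, 1]        # scale-dependent measure of co-movement
corr = np.corrcoef(income, spending)[0, 1]  # covariance normalized to the range [-1, 1]

print("covariance :", cov)
print("correlation:", corr)
```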
20. What is the meaning of P-value?
P-values are used to make decisions in hypothesis tests. The P-value is the smallest significance level at which the null hypothesis can be rejected: the lower the p-value, the stronger the evidence against the null hypothesis.
21. Define Parametric and Non-Parametric Models
Parametric models contain a fixed, small number of parameters, so all you need to forecast new data is the model's parameters.
Non-parametric models have no fixed limit on the number of parameters they may take, giving them additional flexibility; to forecast new data you need both the model parameters and the observed data itself.
22. Define Reinforcement Learning
Reinforcement learning differs from other forms of learning, such as supervised and unsupervised learning. In reinforcement learning we are not given data or labels; instead, an agent learns by interacting with an environment and maximizing a reward signal through trial and error.
23. What is the difference between the Sigmoid and Softmax functions?
The sigmoid function is used for binary classification: it outputs the probability of the positive class, and the two class probabilities (p and 1 - p) sum to 1. The softmax function is used for multi-class classification: it outputs a probability for every class, and those probabilities also sum to 1. A small numpy sketch follows.
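A small numpy sketch of the two functions; the input scores are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    # Single probability for the positive class; 1 - p covers the other class.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Probability distribution over all classes; the outputs sum to 1.
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

print(sigmoid(0.8))                        # binary case
print(softmax(np.array([2.0, 1.0, 0.1])))  # multi-class case
```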
PCA Interview Questions
1. What is the Dimensionality Curse?
All of the issues that arise when working with data in many dimensions are collectively called the curse of dimensionality. As the number of features grows, the number of samples required to cover the space grows as well, and the model becomes increasingly complicated. Overfitting also becomes more likely as the number of features increases: a machine learning model trained on a large number of features grows increasingly reliant on the data it was trained on, resulting in poor performance on real data and defeating the purpose. With fewer features in the training data, our model makes fewer assumptions and stays simpler.
2. Why do we need to reduce dimensionality? What are the disadvantages?
In machine learning, the number of features is referred to as the dimension. The process of lowering the number of features in your feature set is known as dimensionality reduction.
Dimensionality Reduction Benefits
With less misleading data, model accuracy improves.
Fewer dimensions mean less computation, so algorithms train more quickly.
Less data requires less storage space.
It removes redundant features and background noise.
Dimensionality reduction aids in visualizing data on 2D and 3D plots.
Dimensionality Reduction Drawbacks
Some data is lost, which might reduce the effectiveness of subsequent training algorithms.
It can be computationally demanding.
Transformed features are often difficult to interpret.
It makes the independent variables harder to understand.
3. Can PCA be used to reduce the dimensionality of a nonlinear dataset with many variables?
PCA may be used to dramatically reduce the dimensionality of most datasets, even if they are extremely nonlinear, by removing unnecessary dimensions. However, decreasing dimensionality with PCA will lose too much information if there are no unnecessary dimensions.
4. Is it required to rotate in PCA? If so, why do you think that is? What will happen if the components aren't rotated?
Yes, rotation (orthogonal rotation) is required to capture the maximum variance of the training set. If we don't rotate the components, PCA's effectiveness diminishes, and we will have to select a larger number of components to explain the variance in the training set.
5. Is standardization necessary before using PCA?
PCA finds new directions using the covariance matrix of the original variables, and the covariance matrix is sensitive to the scale of those variables. Standardization gives equal weight to all variables; if we combine features from different scales without it, we obtain misleading directions. However, if all variables are already on the same scale, standardization is unnecessary. A pipeline sketch follows.
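A minimal sketch of standardizing before PCA, assuming scikit-learn; the wine dataset and the choice of two components are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)   # features on very different scales

# Standardize first so no single large-scale feature dominates the components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pipeline.fit_transform(X)

print(X_2d.shape)
print(pipeline.named_steps["pca"].explained_variance_ratio_)
```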
6. Should strongly linked variables be removed before doing PCA?
No. PCA loads all strongly correlated variables onto the same principal component (eigenvector), not onto distinct ones.
7. What happens if the eigenvalues are almost equal?
If the eigenvalues are roughly equal, PCA cannot prefer one principal component over another, because every component explains a similar amount of variance.
8. How can you assess a Dimensionality Reduction Algorithm's performance on your dataset?
A dimensionality reduction technique performs well if it removes many dimensions from a dataset without sacrificing too much information. If you use dimensionality reduction as a preprocessing step before another machine learning algorithm (e.g., a Random Forest classifier), you can simply measure the performance of that second algorithm: if the reduction did not lose too much information, the second algorithm should perform about as well as it does on the original dataset. A sketch of this approach follows.
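A minimal sketch of this evaluation strategy, assuming scikit-learn; the digits dataset, the 20-component PCA, and the random forest are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)   # 64 original features

baseline = RandomForestClassifier(random_state=0)
reduced = make_pipeline(PCA(n_components=20), RandomForestClassifier(random_state=0))

# Compare downstream accuracy with and without dimensionality reduction.
print("original :", cross_val_score(baseline, X, y, cv=3).mean())
print("after PCA:", cross_val_score(reduced, X, y, cv=3).mean())
```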
The Fourier Transform is a useful image-processing method for decomposing an image into its sine and cosine components. The output of the transformation represents the image in the Fourier (frequency) domain, while the input image is its spatial-domain equivalent.
9. What do you mean when you say "FFT," and why is it necessary?
FFT is an acronym for fast Fourier transform, an algorithm for computing the DFT. It takes advantage of the symmetry and periodicity of the twiddle factors to drastically reduce the time needed to compute the DFT. As a result, the FFT reduces the number of expensive computations, which is why it is popular. A numpy sketch follows.
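A minimal FFT sketch using numpy; the sampling rate and the two sine components are made up for illustration.

```python
import numpy as np

# A 1 kHz-sampled signal containing 50 Hz and 120 Hz components.
fs = 1000
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.fft(signal)                # FFT computes the DFT efficiently
freqs = np.fft.fftfreq(len(signal), 1 / fs)  # frequency axis for each bin

# The four largest magnitudes correspond to the +/-50 Hz and +/-120 Hz peaks.
peaks = freqs[np.argsort(np.abs(spectrum))[-4:]]
print(sorted(abs(f) for f in peaks))
```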
10. You're well-versed in the DIT Algorithm. Could you tell us more about it?
It calculates the discrete Fourier transform of an N point series and is known as the decimation-in-time algorithm. It divides the sequence into two halves and then combines them to produce the original sequence's DFT. The sequence x(n) is frequently broken down into two smaller subsequences in DIT.
Curse of Dimensionality
When working with high-dimensional data, the "Curse of Dimensionality" refers to a series of issues. The number of attributes/features in a dataset corresponds to the dataset's dimension. High dimensional data contains many properties, usually on a hundred or more. Some of the challenges that come with high-dimensional data appear while analyzing or displaying the data to look for trends, and others show up when training machine learning models.
1. Describe some of the strategies for dimensionality reduction.
The following are some approaches for reducing the dimensionality of a dataset:
Feature Selection - We keep or discard existing features based on how valuable they are.
Feature Extraction - From the current features, we generate a smaller set of new features that summarizes most of the information in our dataset. A sketch comparing the two follows.
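A minimal sketch contrasting the two strategies with scikit-learn; the breast cancer dataset, SelectKBest with an F-test, and the choice of 10 dimensions are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)   # 30 original features

# Feature selection: keep 10 of the original columns.
X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Feature extraction: build 10 new columns as combinations of the originals.
X_extracted = PCA(n_components=10).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)
```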
2. What are the disadvantages of reducing dimensionality?
Dimensionality reduction has some drawbacks; they include:
The decrease may take a long time to complete.
The modified independent variables might be difficult to comprehend.
As the number of features is reduced, some information is lost, and the algorithms' performance suffers.
Support Vector Machine (SVM)
The Support Vector Machine (SVM) is a supervised machine learning algorithm used to solve classification and regression problems. SVMs are especially well-suited to classifying complex but small or medium-sized datasets.
Let's go through several SVM-related interview questions.
1. Could you explain SVM to me?
Support vector machines (SVMs) are supervised machine learning techniques that may be used to solve classification and regression problems. An SVM seeks to classify data by locating a hyperplane that maximizes the margin between the training-data classes; as a result, SVM is a large-margin classifier.
Support vector machines are based on the following principle:
The optimal hyperplane found for linearly separable patterns is extended to patterns that are not linearly separable by mapping the original data into a new space (the kernel trick).
2. In light of SVMs, how would you explain Convex Hull?
We construct a convex hull for each of classes A and B and draw a perpendicular bisector across the shortest line connecting their nearest points; that bisector is the maximum-margin separating hyperplane.
3. Should you train a model on a training set with millions of instances and hundreds of features using the primal or dual form of the SVM problem?
Because kernelized SVMs may only employ the dual form, this question applies only to linear SVMs. The primal form of the SVM problem has a computational complexity proportional to the number of training examples m, while the dual form has a complexity proportional to something between m² and m³. With millions of instances, you should use the primal form rather than the dual form, since the dual form would be far slower.
4. Describe when you want to employ an SVM over a Random Forest Machine Learning method.
The fundamental rationale for using an SVM rather than a Random Forest is that the problem may not be linearly separable; in that case we have to employ an SVM with a non-linear kernel. SVMs also work well in higher-dimensional spaces; for example, they have been shown to perform better in text classification.
5. Is it possible to use the kernel technique in logistic regression? So, why isn't it implemented in practice?
Logistic regression with the kernel trick is more expensive to compute than SVM: roughly O(N³) versus O(N²k), where k is the number of support vectors. In SVM the classifier is defined solely in terms of the support vectors, whereas in logistic regression it is defined over all points, not just the support vectors. This gives SVMs certain inherent speedups (in terms of efficient implementation) that logistic regression struggles to attain.
6. What are the difference between SVM without a kernel and logistic regression?
The only difference is in how they are implemented. SVM is substantially more efficient and comes with excellent optimization tools.
7. Is it possible to utilize any similarity function with SVM?
No, it must comply with Mercer's theorem.
8. Is there any probabilistic output from SVM?
SVMs do not offer probability estimates directly; instead, probabilities are derived through a relatively expensive internal five-fold cross-validation (Platt scaling) procedure. A sketch follows.
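A minimal sketch of requesting probabilities from scikit-learn's SVC, whose probability=True option triggers the internal calibration step; the iris dataset and RBF kernel are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True enables an internal cross-validated calibration step,
# which is why training becomes noticeably slower.
clf = SVC(kernel="rbf", probability=True, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))
```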
Overfitting and Underfitting
Overfitting and Underfitting: An Overview
Assuming an independent and identically distributed (i.i.d.) dataset, a model is said to underfit when the prediction error is large on both the training and test datasets. Modelling strategies such as boosting, which combine the outputs of an ensemble of machine learning models to produce improved predictions, can tackle the underfitting problem. Indicators of underfitting include low R-squared values, large standard errors of estimate in regression analysis, and tell-tale residual plots from linear or logistic regression model output.
A model is considered overfitted when its accuracy on the training dataset is much higher than its accuracy on the test dataset. When a machine learning algorithm overfits the training data, the model may perform well on that sample of your data and report high accuracy; however, when it encounters new input data, its performance suffers dramatically, because the model has memorized specific patterns in the training set rather than generalizing from them.
Regularization techniques such as LASSO (least absolute shrinkage and selection operator) penalize large coefficients, so overfitting can be avoided or reduced. Overfitting may be detected using approaches such as validation curves and cross-validation plots.
1. What are the many instances in which machine learning models might overfit?
Overfitting of machine learning models can occur in a variety of situations, including the following:
When a machine learning algorithm uses a considerably bigger training dataset than the testing set and learns patterns across that large input space, its accuracy on the small test set improves only marginally.
It occurs when a machine learning algorithm models the training data with too many parameters.
Suppose the learning algorithm searches a large hypothesis space. First, what is a hypothesis space, and what does searching it mean? A hypothesis is an estimator of the target function, so if the learning algorithm has many possible hyperparameters and can be trained on multiple training datasets drawn from the same data, a large number of models (hypotheses h(X)) can be fitted to the same dataset. This is known as a broader hypothesis space, and in this case the learning algorithm has access to it. Given a broader hypothesis space, the model has a greater chance of overfitting the training dataset.
2. What are the many instances in which machine learning models cause underfitting?
Underfitting of machine learning models can occur in a variety of situations, including the following:
Underfitting, or a high-bias model, can occur when the training set contains fewer observations than variables; because the machine learning algorithm is not complex enough to represent the data, it cannot identify any link between the input data and the output variable. It also occurs when the algorithm cannot detect a pattern between training-set and test-set variables, which can happen with many input variables or a high-dimensional dataset, again due to a lack of model complexity. Other contributing factors include a scarcity of training observations for pattern learning and a lack of computational power that restricts the algorithm's ability to search for patterns in a high-dimensional space.
3. What is a Neural Network, and how does it work?
Neural Networks are a simplified version of how people learn, inspired by how neurons in our brains work.
Three layers make up the most typical neural networks:
An input layer
A hidden layer (this is the most important layer, where feature extraction takes place and adjustments are made to train faster and perform better)
An output layer
4. What Are the Functions of Activation in a Neural Network?
At its most basic level, an activation function determines whether or not a neuron should activate. Any activation function can take the weighted sum of the inputs and bias as inputs. Activation functions include the step function, Sigmoid, ReLU, Tanh, and Softmax.
5. What is the MLP (Multilayer Perceptron)?
MLPs have an input layer, a hidden layer, and an output layer, just like other neural networks. An MLP has the same structure as a single-layer perceptron but with one or more hidden layers. A single-layer perceptron can only classify linearly separable classes with binary output (0, 1), whereas an MLP can identify nonlinear classes. Each node in the layers other than the input layer uses a nonlinear activation function, so the nodes and weights combine to produce the output from the input layer, the data flowing in, and the activation function. MLPs use backpropagation, a supervised learning method: the network estimates the error with the aid of the cost function and propagates that error backward from the output (adjusting the weights to train the model more accurately).
6. What is a Cost Function?
The cost function, sometimes known as "loss" or "error," is a metric used to assess how well your model performs. During backpropagation, it is used to calculate the error at the output layer; we then feed that error backward through the neural network and adjust the weights of the various functions.
7. What is the difference between a Recurrent Neural Network and a Feedforward Neural Network?
The interviewer wants you to respond thoroughly to this deep learning interview question. Signals in a Feedforward Neural Network travel in one direction, from input to output. The network has no feedback loops and only evaluates the current input, so it cannot remember prior inputs (e.g., a CNN).
The signals of a Recurrent Neural Network go in both directions, resulting in a looped network. It generates a layer's output by combining the present input with previously received inputs and can recall prior data thanks to its internal memory.
8. What can a Recurrent Neural Network (RNN) be used for?
Sentiment analysis, text mining, and picture captioning may benefit from the RNN. Recurrent Neural Networks may also be used to solve problems involving time-series data, such as forecasting stock values over a month or quarter.