Monday 28 November 2022

Interview question for Data Science 2

15. What if a dataset contains variables with more than 30% missing values? How would you deal with such a dataset?

We use one of the following methods, depending on the size of the dataset. If the dataset is small, the missing values are replaced with the mean of the remaining data. In pandas this can be done with mean = df.mean(), where df is the pandas DataFrame containing the dataset and mean() computes each column's mean; the missing values can then be filled with df.fillna(mean). For larger datasets, the rows with missing values can be dropped and the remaining data used for prediction.
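A minimal pandas sketch of both approaches; the DataFrame, its column names, and its values are made up purely for illustration:

```python
import pandas as pd

# Hypothetical DataFrame with missing values
df = pd.DataFrame({
    "age": [25, None, 32, None, 41],
    "salary": [50000, 62000, None, 58000, 61000],
})

# Small dataset: replace missing values with each column's mean
mean = df.mean(numeric_only=True)
df_filled = df.fillna(mean)

# Larger dataset: drop the rows that contain missing values instead
df_dropped = df.dropna()

print(df_filled)
print(df_dropped)
```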

 16. What is Cross-Validation, and how does it work? 

Cross-validation is a statistical approach for assessing how well a model generalizes to unseen data. The model is built and evaluated in rotation on different samples of the training dataset: the training data is divided into groups, and the model is trained and validated against each group in turn. The most commonly used techniques are the leave-p-out method, the k-fold method, the holdout method, and the leave-one-out method.
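A short sketch of k-fold cross-validation, assuming scikit-learn; the iris dataset and logistic regression are stand-ins chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train and validate the model on 5 rotating train/validation splits
scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy:", scores.mean(), "std:", scores.std())
```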

17. How do you go about tackling a data analytics project? 

In general, we follow the steps below. The first stage is to understand the company's problem or need. Then carefully examine and evaluate the data you have been given; if any data is missing, contact the company to clarify the requirements. The next stage is to clean and prepare the data that will be used for modelling; variables are transformed and missing values are imputed at this point. Run your model on the data, create meaningful visualizations, and evaluate the findings to obtain useful insights. Finally, release the model implementation and assess its usefulness by tracking the outcomes and performance over a set period, validating the model using cross-validation.

18. What is selection bias?

Selection bias occurs when proper randomization is not achieved while selecting a sample subset. This bias means that the sample used in the analysis does not represent the whole population being studied.

19. Why is data cleansing so important? What method do you use to clean the data?

To get good insights when running an algorithm on any data, it is critical to have correct and clean data that contains only essential information. Contaminated data frequently produces poor or erroneous insights and projections, which can have disastrous consequences. For example, if our data analysis tells us to build a large marketing campaign around a product that in reality has little demand, the campaign will almost certainly fail and the company's revenue will suffer. This is where the value of accurate and clean data becomes apparent. Cleaning data from many sources aids data transformation and produces data that data scientists can actually work on; clean data also improves the model's performance and results in more accurate predictions. When a dataset is very large, working with it becomes difficult: the data cleansing stage alone can take a long time (roughly 80% of a project's time), and that effort cannot simply be folded into the model's execution. Cleansing the data before running the model therefore improves the model's speed and efficiency. Data cleaning also helps detect and correct structural flaws in a dataset, remove duplicates, and maintain data consistency.
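A minimal pandas sketch of the kind of cleaning described above, on a small made-up table; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical raw data with duplicate rows, a text-typed numeric column,
# and a missing value
raw = pd.DataFrame({
    "customer_id": ["001", "002", "002", "003"],
    "spend": ["100", "250", "250", None],
})

clean = (
    raw.drop_duplicates()                                   # remove duplicate rows
       .assign(spend=lambda d: pd.to_numeric(d["spend"]))   # fix the column type
       .dropna(subset=["spend"])                            # drop rows still missing spend
)
print(clean)
```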

20. What feature selection strategies are available for picking the appropriate variables for creating effective prediction models? 

When utilizing a dataset in data science or machine learning techniques, not all of the variables may be required or relevant for the model being built. To eliminate redundant features and boost the efficiency of our model, we need smarter feature selection approaches. The three primary strategies for feature selection are as follows.

Filter approaches: these methods rely only on intrinsic attributes of features, assessed using univariate statistics rather than cross-validated performance. They are simple, typically quicker than wrapper approaches, and need fewer processing resources. Examples include the Chi-Square test, Fisher's Score, the correlation coefficient, the variance threshold, the Mean Absolute Difference (MAD) method, and dispersion ratios.

Wrapper approaches: these methods greedily search over possible feature subsets, assess their quality, and evaluate a classifier on each candidate subset, so the selection process requires a machine-learning algorithm suited to the given dataset. Wrapper approaches fall into three categories. Forward selection starts from one feature and keeps adding features until a good fit is found. Backward selection starts with all of the features and removes the ones that do not fit, one by one, to determine what works best. Recursive feature elimination examines and assesses features recursively to see how well they perform. These approaches are often computationally expensive and may need high-end computing resources, but they frequently result in more accurate prediction models than filter methods.

Embedded methods: by including feature interactions while retaining reasonable computing costs, embedded techniques combine the benefits of both filter and wrapper methods. These approaches are iterative, extracting the features that contribute most to training in each model iteration. LASSO regularization (L1) and Random Forest importance are two examples of embedded approaches.
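A brief sketch of a filter method and a wrapper method, assuming scikit-learn; the breast cancer dataset and the choice of 10 features are arbitrary examples:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 10 features with the highest chi-square scores
X_filter = SelectKBest(chi2, k=10).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a logistic regression
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapper = rfe.fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape)
```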

 21. Will reclassifying categorical variables as continuous variables improve the predictive model? 

Yes! A categorical variable has no particular ordering of its categories and can take two or more category values. An ordinal variable is a categorical variable whose categories have a defined and consistent ordering. If the variable is ordinal, treating it as a continuous variable should result in stronger predictive models.
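A small pandas sketch of encoding an ordinal variable as integer codes so a model can treat it as roughly continuous; the "education" column and its levels are invented for illustration:

```python
import pandas as pd

# Hypothetical ordinal variable: education level has a natural order
df = pd.DataFrame({"education": ["high school", "bachelor", "master", "bachelor"]})
order = ["high school", "bachelor", "master", "phd"]

df["education"] = pd.Categorical(df["education"], categories=order, ordered=True)

# Integer codes preserve the ordering, unlike arbitrary one-hot categories
df["education_code"] = df["education"].cat.codes
print(df)
```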

22. How will you handle missing values in your data analysis? 

After determining which variables contain missing values, the impact of those missing values can be assessed. If the data analyst can detect a pattern in the missing values, there is a chance to uncover useful information. If no pattern is detected, the missing values can be disregarded or replaced with default values such as the minimum, mean, maximum, or median. For categorical variables, default values are assigned; if the data follows a normal distribution, missing values are replaced with the mean. If around 80 percent of the values are missing, the analyst must decide whether to impute default values or drop the variable altogether.

 23. What is the ROC Curve, and how do you make one?

The ROC (Receiver Operating Characteristic) curve depicts the trade-off between the true-positive rate and the false-positive rate at various classification thresholds, and it is often used as a proxy for the sensitivity-specificity trade-off. Plotting the true-positive rate (TPR, or sensitivity) against the false-positive rate (FPR, or 1 - specificity) yields the ROC curve. The TPR is the proportion of positive observations correctly predicted as positive out of all positive observations, while the FPR is the proportion of negative observations mistakenly predicted as positive out of all negative observations. Take medical testing as an example: the TPR is the rate at which patients with an illness are correctly tested positive.
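A minimal sketch of building a ROC curve, assuming scikit-learn and matplotlib; the dataset and classifier are arbitrary stand-ins:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr)
plt.xlabel("False-positive rate (1 - specificity)")
plt.ylabel("True-positive rate (sensitivity)")
plt.show()
```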

 24. What are the differences between the Test and Validation sets?

The test set is used to evaluate the trained model's performance; it assesses the model's predictive ability on data it has never seen. The validation set is a subset of the training data used to tune parameters and avoid overfitting the model.
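A short sketch of carving out separate train, validation, and test sets, assuming scikit-learn; the array shapes and split ratios are arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(1000).reshape(500, 2), np.arange(500)

# First hold back a test set, then split the remainder into training and
# validation sets (roughly 60/20/20 overall)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```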

25. What exactly does the kernel trick mean? 

Kernel functions are generalized dot product functions used to compute the dot product of vectors x and y in a high-dimensional feature space. With the kernel trick, a linear classifier can solve a non-linear problem by implicitly mapping linearly inseparable data into a higher-dimensional space where it becomes separable.
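A minimal sketch of the kernel trick in action, assuming scikit-learn; concentric circles are a classic example of data that is not linearly separable in its original space:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)  # kernel trick: implicit high-dimensional mapping

print("linear accuracy:", linear_svm.score(X, y))
print("RBF accuracy:", rbf_svm.score(X, y))
```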

26. Recognize the differences between a box plot and a histogram.

Box plots and histograms are visualizations for displaying data distributions and communicating information effectively. A histogram is a type of bar chart that depicts the frequency of a numerical variable's values and can reveal the shape of the probability distribution, its variation, and outliers. A box plot communicates several features of the data distribution even when the exact shape of the distribution cannot be observed. Compared to histograms, box plots are handy for comparing many distributions at once because they take up less space.
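A small matplotlib sketch placing the two plots side by side; the normally distributed sample data is made up for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=30)        # histogram: frequency of values, shape of the distribution
ax1.set_title("Histogram")
ax2.boxplot(data, vert=False)  # box plot: median, quartiles, and outliers at a glance
ax2.set_title("Box plot")
plt.tight_layout()
plt.show()
```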

27. How will you balance/correct data that is unbalanced? 

Unbalanced data can be corrected or balanced using a variety of approaches: the sample size can be increased for minority classes, and the number of samples can be reduced for classes with many data points. The following are some of the methods used to handle unbalanced data.

Use the proper assessment metrics: when dealing with unbalanced data, it is critical to use evaluation metrics that give useful information. Precision measures what fraction of the selected examples are relevant, while sensitivity (recall) indicates how many of the relevant cases were selected. The F1 score is the harmonic mean of precision and sensitivity, and the MCC (Matthews correlation coefficient) is the correlation coefficient between the observed and predicted binary classifications. The AUC (Area Under the Curve) summarizes the relationship between the true-positive and false-positive rates.

Resample the training set: data can also be balanced by resampling to obtain different datasets. Under-sampling balances the data by reducing the size of the abundant class when the amount of data is adequate; the new balanced dataset can then be used for further modelling. Over-sampling is used when the amount of data available is insufficient: instead of removing excess samples, repetition, bootstrapping, and similar approaches are used to generate and introduce new samples until the dataset is balanced.

Perform k-fold cross-validation correctly: when employing over-sampling, cross-validation must be done carefully. Cross-validation should be performed before over-sampling, since doing it afterwards is equivalent to overfitting the model to a particular outcome. The data is resampled many times with varied ratios to avoid this.
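A minimal sketch of over- and under-sampling with scikit-learn's resample utility; the 95/5 class split and column names are invented for illustration:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 95 majority-class rows, 5 minority-class rows
df = pd.DataFrame({"feature": range(100), "label": [0] * 95 + [1] * 5})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Over-sampling: bootstrap the minority class up to the majority-class size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_over = pd.concat([majority, minority_up])

# Under-sampling: shrink the majority class down to the minority-class size
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_under = pd.concat([majority_down, minority])

print(balanced_over["label"].value_counts())
print(balanced_under["label"].value_counts())
```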

 28. Random forest or many decision trees: which is better? 

Because a random forest is an ensemble approach that combines many weak decision trees into a strong learner, it is far more robust, more accurate, and less prone to overfitting than individual decision trees (a minimal comparison sketch appears after the careers overview below).

Careers in Data Science

Database Manager: according to Indeed, a database manager is responsible for "maintaining an organization's databases, including detecting and repairing errors, simplifying information, and reporting." They also help select appropriate hardware and software solutions for their company's requirements.

Data Analyst: data analysts collect and analyze enormous volumes of data for businesses and make recommendations based on their findings. They can improve operations, reduce costs, discover patterns, and increase efficiency in many industries, including healthcare, IT, professional sports, and finance.

Data Modeler: data modelers are systems analysts who develop computer databases that transform complex corporate data into usable computer systems. They collaborate with data architects to create databases that fulfil organizational goals using conceptual, physical, and logical data models.

Machine Learning Engineer: a machine learning engineer is an IT professional who specializes in researching, developing, and building self-running artificial intelligence systems that automate predictive models. Machine learning engineers design and develop AI algorithms that can learn and make predictions, which is what machine learning is all about.

Business Intelligence Developer: business intelligence developers build systems and applications that allow users in an organization to find and interact with the information they need. Examples include dashboards, search functions, data modelling, and data visualization apps. BI developers must have a solid understanding of both data science and user-experience best practices.
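As mentioned under question 28, a minimal sketch comparing a single decision tree with a random forest, assuming scikit-learn; the breast cancer dataset and 200 trees are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Cross-validated accuracy: the ensemble is typically more robust than a single tree
print("single tree:  ", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```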
