Interview Questions for Data Science


Monday, 28 November 2022


1. What exactly does the term "Data Science" mean?

Data Science is an interdisciplinary field that combines scientific procedures, statistical and mathematical analysis, tools, and machine learning algorithms to uncover common patterns and extract useful insights from raw input data. Gathering the business requirements and the related data is the first step; data cleansing, data staging, data warehousing, and data architecture are all stages of the data acquisition process. Data processing then explores, mines, and analyzes the data, and the results can be used to summarize the insights the data contains. After the exploratory stages, the cleansed data is fed to various algorithms, such as predictive analysis, regression, text mining, and pattern recognition, depending on the requirements. In the final stage, the outcomes are communicated to the business in a visually appealing form. This is where data visualization, reporting, and other business intelligence tools come into play.
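As a rough, minimal sketch of this workflow in Python (using pandas and scikit-learn, with a bundled sample dataset standing in for gathered business data; the exact steps in a real project would differ):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Data acquisition (a bundled dataset stands in for gathered business data)
df = load_breast_cancer(as_frame=True).frame

# 2. Cleansing and exploration: drop missing rows, summarize the data's insights
df = df.dropna()
print(df.describe())

# 3. Modelling: apply a predictive algorithm to the cleansed data
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# 4. Communicating outcomes: report results back to the business
print(classification_report(y_test, model.predict(X_test)))
```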


2. What is the difference between data science and data analytics?

Data science involves transforming data with a variety of technical analysis approaches to derive useful insights that a data analyst can apply to business scenarios. Data analytics is concerned with verifying existing hypotheses and facts and answering questions for a more efficient and effective business decision-making process. Data science fosters innovation by answering questions that build connections and solve problems for the future. Data analytics focuses on extracting meaning from existing historical context, whereas data science focuses on predictive modelling. Data science is a broad field that uses a wide range of mathematical and scientific tools and methods to solve complex problems; in contrast, data analytics is a more focused area that uses fewer statistical and visualization techniques to solve specific problems.

3. What are some of the strategies utilized for sampling?

Data analysis cannot be done on an entire volume of data at a time, especially when it concerns larger datasets. It becomes crucial to take data samples that represent the full population and then carry out the analysis on them. While doing this, it is very important to carefully choose the sample data out of the enormous dataset so that it truly represents the complete dataset. There are two categories of sampling techniques, based on whether they make use of statistics:
Non-probability sampling techniques: convenience sampling, quota sampling, snowball sampling, etc.
Probability sampling techniques: simple random sampling, clustered sampling, stratified sampling.
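A small illustration of two of the probability techniques above, sketched in Python with pandas and scikit-learn (the DataFrame and its "segment" column are made up for the example):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "value": range(1000),
    "segment": ["A"] * 800 + ["B"] * 200,
})

# Simple random sampling: every row has an equal chance of selection.
simple = df.sample(n=100, random_state=42)

# Stratified sampling: preserve the 80/20 segment proportions in the sample.
sample, _ = train_test_split(
    df, train_size=100, stratify=df["segment"], random_state=42
)
print(sample["segment"].value_counts(normalize=True))
```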

4. List down the criteria for overfitting and underfitting.

Overfitting: The model performs well only on the sample training data; if any new data is supplied as input, it fails to produce correct results. These situations arise due to low bias and high variance in the model. Decision trees are usually prone to overfitting.
Underfitting: Here, the model is so simple that it cannot identify the correct relationships in the data, and consequently it does not perform well even on the test data. This can arise due to high bias and low variance. Linear regression is more prone to underfitting.
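A minimal sketch of overfitting with scikit-learn (synthetic data and arbitrary parameters): an unconstrained decision tree memorizes the training set, while a depth-limited one generalizes better:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)           # overfits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# The deep tree scores ~1.0 on training data but noticeably lower on test data.
print("deep:    train %.2f  test %.2f" % (deep.score(X_tr, y_tr), deep.score(X_te, y_te)))
print("shallow: train %.2f  test %.2f" % (shallow.score(X_tr, y_tr), shallow.score(X_te, y_te)))
```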

5. Distinguish between data in long and wide formats. 

Data in a long format: each row of the data holds one observation for one subject at one point in time, so each subject's data is spread across multiple rows. The data can be recognized by treating rows as groups. This format is most commonly used in R analyses and for writing to log files at the end of each experiment.
Data in a wide format: each subject's repeated responses are placed in separate columns, so the data can be recognized by treating columns as groups. This format is most widely used in statistics packages for repeated-measures ANOVAs and is seldom used in R analyses.
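A quick sketch of converting between the two formats with pandas (the subjects and the "week" columns are hypothetical):

```python
import pandas as pd

# Wide format: one row per subject, repeated measures in separate columns.
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "week1": [5.0, 6.1],
    "week2": [5.4, 6.0],
})

# Wide -> long: each row now holds one observation for one subject.
long_df = wide.melt(id_vars="subject", var_name="week", value_name="score")

# Long -> wide again.
back = long_df.pivot(index="subject", columns="week", values="score").reset_index()
print(long_df)
```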

6. What is the difference between eigenvectors and eigenvalues?

Eigenvectors are column vectors or unit vectors whose length/magnitude is equal to 1; they are also known as right vectors. Eigenvalues are the coefficients applied to the eigenvectors, i.e., the factors by which an eigenvector is scaled in length or magnitude. Breaking a matrix down into its eigenvectors and eigenvalues is known as eigendecomposition. These are then used in machine learning methods such as PCA (Principal Component Analysis) to extract valuable insights from a matrix.
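A minimal NumPy sketch of eigendecomposition, verifying that A v = λ v and that the returned eigenvectors have unit length (the matrix A is arbitrary):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Columns of `eigenvectors` are the (unit-length) eigenvectors of A.
eigenvalues, eigenvectors = np.linalg.eig(A)

v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))   # True: A v = lambda v
print(np.linalg.norm(v))             # ~1.0: unit length
```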

7. What does it mean to have high and low p-values?

A p-value measures the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It indicates the likelihood that the observed difference occurred by chance alone. When the p-value is less than 0.05, we say we have a low p-value: the null hypothesis can be rejected, because the observed data is unlikely under a true null. A p-value greater than 0.05 is a high p-value: it indicates that the data is consistent with the null hypothesis and gives no grounds to reject it. With a p-value of exactly 0.05, the hypothesis could go either way.
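A small sketch of obtaining and interpreting a p-value with SciPy's two-sample t-test (the groups are synthetic, with genuinely different means):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=100)
group_b = rng.normal(loc=52, scale=5, size=100)   # true means differ

t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value < 0.05:
    print(f"p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f}: fail to reject the null hypothesis")
```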

8. When is re-sampling done?

Re-sampling is a data sampling procedure that improves accuracy and quantifies the uncertainty of population parameters. It checks that a model is robust by training it on different patterns in a dataset, to ensure that variations are accounted for. It is also done when models need to be validated on random subsets, or in significance tests where labels are substituted on data points.
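A minimal sketch of one common re-sampling method, the bootstrap, using NumPy (the data is synthetic, and the 95% interval is one conventional choice):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=200)   # hypothetical observations

# Draw many samples *with replacement* and record each sample's mean.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10_000)]

# A 95% confidence interval for the population mean.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```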

9. What does it mean to have "imbalanced data"?

A dataset is highly imbalanced when the data is unevenly distributed across its categories, for example when one class vastly outnumbers the others. Such datasets cause performance problems and inaccuracies in a model.
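As a rough sketch, one common way to detect and compensate for imbalance in scikit-learn (synthetic data with a 95/5 class split; class weighting is just one of several possible remedies):

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 95/5 class split: a highly imbalanced dataset.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
print(Counter(y))   # roughly 1900 majority vs 100 minority examples

# class_weight="balanced" re-weights errors inversely to class frequency,
# so the minority class is not ignored by the model.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```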

10. Do the expected value and the mean value differ in any way?

Although there are not many differences between the two, it is worth noting that they are used in different contexts. In general, the mean value refers to the probability distribution of a sample, whereas the expected value is used in the context of random variables.

11. What does Survivorship bias mean to you?

Survivorship bias refers to the logical fallacy of focusing on the elements that survived some process while overlooking those that did not, owing to their lack of visibility. This bias can lead to incorrect conclusions being drawn.

12. Define key performance indicators (KPIs), lift, model fitting, robustness, and design of experiment (DOE).

KPI: a metric that measures how well a company meets its objectives. Lift: a measure of the target model's performance compared to a random choice model; it represents how much better the model predicts than having no model at all. Model fitting: a measure of how well the model under consideration fits the given data. Robustness: the system's capacity to handle variations and variances effectively. DOE: the design of a task that aims to describe and explain information variation under conditions hypothesized to reflect the variables.
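A small sketch of computing lift at the top decile with NumPy (the scores and labels are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
actual = rng.binomial(1, 0.1, size=1000)           # 10% base response rate
scores = actual * 0.5 + rng.random(1000) * 0.5     # a somewhat informative model

top_decile = np.argsort(scores)[::-1][:100]        # 100 highest-scored cases
lift = actual[top_decile].mean() / actual.mean()
print(f"Lift in top decile: {lift:.2f}x over random selection")
```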

13. Identify confounding variables.

Confounding variables are also known as confounders. They are extraneous variables that influence both the independent and the dependent variables, generating spurious associations and misleading mathematical correlations.

14. What distinguishes time-series issues from other regression problems? 

Time series data can be considered an extension of linear regression that uses techniques such as autocorrelation and moving averages to summarize past values of the target variable in order to forecast its future values. Forecasting and prediction are the main purposes of time series problems, where accurate forecasts can be made even though the underlying determining factors are not always known. The mere presence of time in a problem does not make it a time series problem; for that, there must be a relationship between the target and time. Observations that are close together in time are expected to be more similar than those far apart, which accounts for seasonality. Today's weather, for example, would be comparable to tomorrow's weather but not to the weather four months from now. Hence, forecasting the weather based on past data becomes a time series problem.
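A minimal pandas sketch of these ideas: autocorrelation confirms that nearby observations are similar, and a moving average gives a naive forecast (the temperature series is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2022-01-01", periods=365, freq="D")
# Seasonal signal plus noise: nearby days resemble each other.
temps = 20 + 10 * np.sin(2 * np.pi * np.arange(365) / 365) + rng.normal(0, 2, 365)
series = pd.Series(temps, index=dates)

print(series.autocorr(lag=1))     # high: tomorrow resembles today
print(series.autocorr(lag=120))   # low/negative: four months away differs

forecast = series.rolling(window=7).mean().iloc[-1]   # naive moving-average forecast
print(f"Naive next-day forecast: {forecast:.1f}")
```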

