Crack your Next Data Science Interview with these Top 22 Questions

Data science is one of the most advanced and sought-after fields in technology today. Businesses across every sector have realized that they need data science professionals who can work with data to maximize profitability, and large companies are actively hiring in this discipline. Data science jobs are projected to grow by around 30% in the coming years.
Top Data Science Interview Questions
Following are the top 22 data science interview questions you should prepare to crack your next interview.
Q1: What distinguishes data science from conventional application programming?
Data Science involves analyzing and modeling large and complex datasets to extract insights, patterns, and trends, whereas traditional application programming focuses on developing software applications to perform specific tasks or functions.
- Emphasis on statistical and machine learning techniques: When analysing data and formulating predictions or recommendations, data science heavily relies on statistical and machine learning techniques, as opposed to traditional application programming, which primarily concentrates on writing code to implement specific functionalities.
- Data-centric approach: Data Science revolves around a data-centric approach, where data is the key driver of decision-making and problem-solving. In contrast, traditional application programming may not always have data as the central focus, but rather focuses on implementing functionalities or features.
- Exploratory and iterative nature: Data Science often involves exploratory data analysis (EDA) and iterative modeling, where data scientists may need to experiment with different techniques and algorithms to find the best approach, whereas traditional application programming typically follows a more linear and structured development process.
- Business and domain knowledge integration: Data Science often requires the integration of business and domain knowledge to understand the context and implications of data analysis and modeling results. Traditional application programming, on the other hand, may not always require deep business or domain knowledge.
It's important to note that while there are differences between Data Science and traditional application programming, they can also overlap in some areas, and the boundaries between the two can sometimes be blurry. The specific roles and responsibilities of a Data Scientist or an application programmer may vary depending on the organization and the project requirements.
Q2: What is bias in Data Science?
Bias in data science refers to the presence of systematic errors in data or models that can lead to inaccurate or unfair results. There are several types of bias that can impact data science:
- Sampling bias: This occurs when the data collected for analysis is not representative of the entire population, leading to a skewed or incomplete view of reality. For example, if a survey on customer satisfaction is conducted only among customers who voluntarily provide feedback, it may not capture the opinions of less satisfied customers who chose not to participate.
- Measurement bias: This occurs when there are errors or inaccuracies in the measurement or recording of data. For example, if a temperature sensor used to collect weather data is not calibrated properly, it may introduce measurement bias, leading to inaccurate temperature readings.
- Labeling bias: This occurs when the labels or categories assigned to data samples are subjective or discriminatory, leading to biased training data for machine learning models. For example, if a resume screening model is trained on biased labeled data that favors male applicants, it may result in gender bias in the hiring process.
- Algorithmic bias: This occurs when machine learning models are trained on biased data, leading to biased predictions or decisions. For example, if a facial recognition system is trained on a dataset with predominantly light-skinned individuals, it may have reduced accuracy for darker-skinned individuals, leading to racial bias in its predictions.
- Confirmation bias: This occurs when data scientists or analysts selectively choose or interpret data that confirms their preconceived notions or beliefs, leading to biased conclusions or recommendations.
Bias in data science can have serious consequences, including perpetuating discrimination, unfair decision-making, and inaccurate insights. Therefore, it is important for data scientists to be aware of and address potential biases in their data, models, and interpretations to ensure that their work is fair, transparent, and reliable. Techniques such as re-sampling, re-labeling, re-calibrating, and using diverse datasets can be employed to mitigate bias in data science. Additionally, incorporating ethical considerations and diverse perspectives in the data science process can help minimize bias and promote fairness in data-driven decision-making.
Q3: Why is only Python used for Data Cleaning in DS?
It's not accurate to say that only Python is used for data cleaning in Data Science. Data cleaning, also known as data preprocessing, can be done in various programming languages and tools depending on the preferences, requirements, and expertise of data scientists and practitioners. However, Python is a popular choice for data cleaning for three main reasons (a short pandas sketch follows this list):
- Rich ecosystem of libraries: Python has a rich ecosystem of open-source libraries specifically designed for data manipulation and cleaning, such as Pandas, NumPy, and SciPy. These libraries provide powerful and efficient functions for handling missing values, filtering, transforming, and aggregating data, making data cleaning tasks easier and more convenient.
- Ease of use: Python is a popular choice among data scientists and practitioners because of its versatility and readability. Its accessible syntax and thorough documentation make it easy to learn and use, even for people with little programming experience, and it integrates easily with other tools and libraries commonly used in data science workflows.
- Large community support: Python has a large and active community of developers and users, so data scientists can easily find tutorials, forums, and other resources to help with data cleaning tasks. This community-driven ecosystem gives practitioners access to a wealth of knowledge and expertise.
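To make this concrete, here is a minimal pandas sketch of a few common cleaning steps; the dataset and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical messy dataset with missing values and inconsistent formatting
df = pd.DataFrame({
    "age": [25, None, 34, 29, None],
    "city": [" Delhi", "Mumbai ", "delhi", "Pune", None],
    "salary": ["50000", "62,000", "58000", None, "45000"],
})

# Handle missing values: fill numeric gaps with the median, drop rows with no city
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["city"])

# Clean up strings: strip whitespace and normalize case
df["city"] = df["city"].str.strip().str.title()

# Fix types: remove thousands separators and convert salary to numeric
df["salary"] = pd.to_numeric(df["salary"].str.replace(",", ""), errors="coerce")

print(df)
```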
Q4: How do you build a random forest model?
Start by preparing your data for model building. This typically involves tasks such as data cleaning, handling missing values, encoding categorical variables, and splitting the data into training and testing sets.
- Ensemble of decision trees: A random forest is an ensemble method that combines multiple decision trees to produce predictions. Each tree is trained on a random sample of the data drawn with replacement (bootstrapping) and on a random subset of the features.
- Tree building: For each decision tree in the ensemble, recursively split the data into subsets based on feature values that minimize the impurity or maximize the information gain at each split. Continue this process until a stopping criterion, such as a maximum depth or minimum number of samples per leaf, is met.
- Voting mechanism: When making predictions, each decision tree in the ensemble contributes its prediction, and the final prediction is determined through a voting mechanism. For classification tasks, the class with the most votes is chosen as the predicted class; for regression tasks, the average of the predicted values is taken as the final prediction.
Remember to always validate your model on unseen data to ensure its generalization performance and fine-tune the model as needed based on the evaluation results. Random forests are a popular and powerful machine learning technique for classification and regression tasks, known for their ability to handle complex data patterns, handle missing values, and reduce overfitting compared to single decision trees.
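In practice, these steps are rarely implemented from scratch. A rough scikit-learn sketch, using a synthetic dataset as a stand-in for your prepared data, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic data standing in for a cleaned and encoded dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Ensemble of 200 bootstrapped decision trees with a depth limit as the stopping criterion
model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
model.fit(X_train, y_train)

# Majority voting across the trees produces the final class predictions
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```

Hyperparameters such as n_estimators and max_depth are typically tuned with cross-validation rather than fixed up front.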
Q5: You are handed a dataset in which some variables have more than 30% of their values missing. How will you handle them?
- Assess the impact of missing values: Evaluate the impact of missing values on the analysis or modeling task, considering the type of missingness: missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR). This will help determine the appropriate handling approach.
- Impute missing values: Use imputation techniques, such as mean or median imputation, mode imputation, regression imputation, k-nearest neighbors imputation, or machine learning-based imputation, to fill in the missing values with estimated values. Choose the imputation method based on the data nature and the underlying assumptions of the analysis or modeling task.
- Consider multiple imputation: Alternatively, consider using multiple imputation, where missing values are imputed multiple times to account for uncertainty. This generates several complete datasets with imputed values, and the results from each are pooled into a final estimate.
- Create indicator variables: Create separate indicator variables to represent the presence or absence of missing values for each variable. This captures information about missingness and can be included as a separate feature in the analysis or modeling task.
- Document the handling approach: Thoroughly document the approach chosen for handling missing values, including any assumptions made, for transparency and reproducibility in the analysis or modeling task.
Remember that handling missing values should be done carefully, taking into consideration the specific characteristics of the data and the goals of the analysis or modeling task, and consulting with domain experts if possible.
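As a minimal sketch, median imputation plus a missingness indicator might look like this with pandas and scikit-learn (the column and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with roughly 40% of its values missing
df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan, 48000,
                              np.nan, 75000, 58000, np.nan, 67000]})

# Indicator variable capturing where values were missing
df["income_missing"] = df["income"].isna().astype(int)

# Median imputation (robust to outliers); multiple imputation would go further
imputer = SimpleImputer(strategy="median")
df["income"] = imputer.fit_transform(df[["income"]]).ravel()

print(df)
```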
Q6: What do you understand about the true-positive rate and false-positive rate?
In binary classification problems, the true-positive rate (TPR), also known as sensitivity, recall, or hit rate, quantifies the percentage of positive cases that are properly predicted as positive by a model.
TPR = True Positives / (True Positives + False Negatives)
In binary classification problems, the false-positive rate (FPR), also known as fall-out or false alarm rate, quantifies the percentage of negative cases that a model mistakenly predicts as positive.
FPR = False Positives / (False Positives + True Negatives)
Both the true-positive rate and the false-positive rate are used to assess the effectiveness of a binary classification model. They are frequently plotted on a receiver operating characteristic (ROC) curve, which illustrates the trade-off between TPR and FPR at various classification thresholds. The area under the ROC curve (AUC-ROC) is a commonly used metric to summarize the overall performance of a binary classifier, with higher values indicating better performance.
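As a quick illustration, here is a small sketch that computes TPR, FPR, and AUC-ROC with scikit-learn on made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical ground-truth labels, thresholded predictions, and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1, 0.95, 0.35])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # true-positive rate (sensitivity / recall)
fpr = fp / (fp + tn)  # false-positive rate (fall-out)

print("TPR:", tpr, "FPR:", fpr)
print("AUC-ROC:", roc_auc_score(y_true, y_score))
```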
Q7: What are Exploding Gradients and Vanishing Gradients?
Exploding Gradients: They occur when the gradients during backpropagation become very large, causing the weights to be updated by excessively large values. This can result in the model's weights being updated in a way that overshoots the optimal values, leading to unstable training and poor convergence.
Exploding gradients are typically caused by deep networks with large weight initialization or activation functions that amplify their inputs. Techniques to mitigate exploding gradients include gradient clipping, weight regularization (e.g., L1 or L2 regularization), and normalization layers (e.g., batch normalization) during training.
Vanishing Gradients: They occur when the gradients during backpropagation become very small, causing the weights to be updated by excessively small values. This can result in the model's weights being updated too slowly, leading to slow convergence and poor training performance.
Vanishing gradients are typically caused by deep networks with small weight initialization or activation functions that dampen the inputs, leading to gradients that approach zero. Techniques to mitigate vanishing gradients include using activation functions that have better gradient properties (e.g., ReLU, Leaky ReLU), initializing weights carefully (e.g., Xavier or He initialization), and using skip connections or residual connections to help gradients propagate more effectively through deep networks.
Both exploding and vanishing gradients are common challenges in deep learning and can severely impact the performance of neural networks. Proper weight initialization, activation functions, and regularization techniques can be used to mitigate these issues and ensure stable and effective training of deep neural networks.
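A toy numpy sketch makes the intuition concrete: repeatedly passing a gradient through layers whose effective scale is below or above 1 shrinks it toward zero or blows it up exponentially (this is an illustration, not a real training loop):

```python
import numpy as np

np.random.seed(0)
gradient = np.ones(4)  # stand-in for a gradient flowing backward through the network

for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    g = gradient.copy()
    for layer in range(30):
        # Each "layer" multiplies the gradient by a matrix whose effective gain is ~scale
        W = scale * np.random.uniform(0.9, 1.1, size=(4, 4)) / 4
        g = W @ g
    print(f"{label}: gradient norm after 30 layers = {np.linalg.norm(g):.3e}")
```

In a framework such as PyTorch, exploding gradients are commonly tamed by applying a clipping utility such as torch.nn.utils.clip_grad_norm_ between the backward pass and the optimizer step.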
Q8: The probability that you will see a shooting star (or a group of them) in a 15-minute interval is 0.2. What is the probability that you will see at least one shooting star if you watch the sky for about an hour?
The probability of spotting a shooting star or a group of them in a 15-minute interval is 0.2, so the probability of spotting none at all in that interval is 1 - 0.2 = 0.8.
The likelihood of not seeing a shooting star in four consecutive 15-minute periods (which total up to an hour, or 60 minutes) must now be determined. We can multiply the probabilities together because the intervals are independent of one another.
Probability of not seeing a shooting star in a 60-minute interval = (Probability of not seeing a shooting star in a 15-minute interval) ^ 4
= 0.8 ^ 4
= 0.4096
Therefore, the probability of seeing at least one shooting star in an hour (60-minute interval) is the complement of the probability of not seeing any shooting star, which is:
Probability of seeing at least one shooting star in an hour = 1 - Probability of not seeing any shooting star in an hour
= 1 - 0.4096
= 0.5904
So, there is approximately a 59.04% chance of seeing at least one shooting star in an hour if the probability of seeing a shooting star or a bunch of them in a 15-minute interval is 0.2.
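The same calculation in a couple of lines of Python:

```python
# Probability of at least one shooting star in an hour, given p = 0.2 per 15-minute interval
p_15 = 0.2
p_none_hour = (1 - p_15) ** 4   # four independent 15-minute intervals
p_at_least_one = 1 - p_none_hour
print(p_at_least_one)           # 0.5904
```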
Q9: Explain the difference between Normalization and Standardization with an example.
Let's say you have a dataset of exam scores for three subjects: math, science, and history. The scores for math range from 60 to 100, the scores for science range from 30 to 90, and the scores for history range from 40 to 80. You want to preprocess the data to ensure that all the scores are on a common scale.
If you choose to normalize the data (min-max scaling), you would rescale each subject's scores to the range 0 to 1 using (score - min) / (max - min). So, a score of 80 in math would be normalized to (80 - 60) / (100 - 60) = 0.5, a score of 60 in science to (60 - 30) / (90 - 30) = 0.5, and a score of 70 in history to (70 - 40) / (80 - 40) = 0.75.
If you decide to standardize the data, you would compute the mean and standard deviation of the scores for each subject, subtract the mean from each score, and divide the result by the standard deviation. This gives the scores for each subject a mean of 0 and a standard deviation of 1, making comparisons across subjects straightforward.
In short, normalization rescales data to a defined range (typically 0 to 1), while standardization centers data around zero with unit variance.
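A quick sketch with scikit-learn, mirroring the three-subject example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Exam scores: columns are math, science, history
scores = np.array([
    [60, 30, 40],
    [80, 60, 70],
    [100, 90, 80],
], dtype=float)

# Normalization (min-max scaling): each subject is mapped to the range [0, 1]
print(MinMaxScaler().fit_transform(scores))

# Standardization (z-scores): each subject gets mean 0 and standard deviation 1
print(StandardScaler().fit_transform(scores))
```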
Q10: Describe Markov chains
A Markov chain is a mathematical model that describes a system's probabilistic transitions between states, where the future state depends only on the present state and not on the sequence of past states (the Markov property). A Markov chain consists of a set of states, transition probabilities between those states, and an initial state. Markov chains are widely used in statistics, computer science, economics, biology, and many other disciplines to model and analyze systems that behave randomly or evolve over time in a stochastic way.
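A tiny simulation illustrates the idea; the two-state "weather" model below is a made-up example:

```python
import numpy as np

# Toy two-state weather model: states and a transition-probability matrix
states = ["Sunny", "Rainy"]
P = np.array([
    [0.8, 0.2],   # transition probabilities when the current state is Sunny
    [0.4, 0.6],   # transition probabilities when the current state is Rainy
])

rng = np.random.default_rng(42)
state = 0  # start in "Sunny"
chain = [states[state]]
for _ in range(10):
    # The next state depends only on the current state (the Markov property)
    state = rng.choice(2, p=P[state])
    chain.append(states[state])

print(" -> ".join(chain))
```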
Q11: Give one example where false positives and false negatives are equally important.
One example where both false positives and false negatives are equally important is in medical testing for a life-threatening disease, such as cancer. In cancer screening, a false positive occurs when a person is mistakenly identified as having cancer when they do not, while a false negative occurs when a person with cancer is mistakenly identified as not having the disease. Both false positives and false negatives can have significant consequences:
False Positives: If a screening test produces a high rate of false positives, it can lead to unnecessary follow-up tests, treatments, and psychological distress for patients who do not have cancer. This can result in increased healthcare costs, unnecessary interventions, and potential harm from unnecessary treatments.
False Negatives: On the other hand, if a screening test produces a high rate of false negatives, it can result in missed diagnoses and delayed treatment for patients who do have cancer. This can lead to progression of the disease, reduced treatment options, and poorer health outcomes.
In this scenario, both false positives and false negatives are equally important as they can have significant implications for patient care and outcomes. Balancing the trade-off between false positives and false negatives is crucial in designing and evaluating the performance of medical screening tests to ensure accurate and timely detection of the disease while minimizing unnecessary interventions or missed diagnoses.
Q12: How do you know if a coin is biased?
- Empirical Testing: This involves physically flipping the coin multiple times and recording the outcomes. If the coin is unbiased, it should produce roughly equal numbers of heads and tails over a large number of flips. However, if one side (heads or tails) consistently occurs more frequently than the other, it could indicate a biased coin.
- Statistical Analysis: Statistical tests can be applied to the observed data from coin flips to determine if the coin is biased. For example, the chi-squared test or the binomial test can be used to assess if the observed frequencies of heads and tails deviate significantly from the expected frequencies of an unbiased coin.
- Visual Inspection: Plotting the observed frequencies of heads and tails on a graph or a histogram can provide a visual indication of coin bias. If the distribution appears skewed or uneven, it may suggest that the coin is biased.
- Comparison with Expected Probabilities: Coins are expected to be unbiased, meaning they have a 50% chance of landing on heads and a 50% chance of landing on tails. Therefore, comparing the observed frequencies of heads and tails with the expected probabilities (50% for each) can help identify if a coin is biased. If the observed frequencies consistently deviate from the expected probabilities, it may suggest a biased coin.
It's important to note that identifying coin bias may require a large number of flips to obtain statistically meaningful results. Additionally, other factors such as the shape, weight, and surface properties of the coin, as well as the flipping technique, can also affect the outcomes and need to be carefully controlled during testing.
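A binomial test is the most direct way to formalize this. A minimal sketch with SciPy (binomtest is available in SciPy 1.7+; older versions expose a similar binom_test function), using made-up flip counts:

```python
from scipy.stats import binomtest

# Suppose 100 flips produced 65 heads; test against the fair-coin hypothesis p = 0.5
result = binomtest(k=65, n=100, p=0.5, alternative="two-sided")
print("p-value:", result.pvalue)

# A small p-value (e.g., below 0.05) means 65 heads out of 100 would be unlikely
# for a fair coin, which is evidence that the coin may be biased.
```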
Q13: You select a coin at random from a jar of 1,000 coins, of which 999 are fair and 1 is double-headed, and toss it 10 times. Assume you see 10 heads. Estimate the probability of getting a head on the next toss.
Based on the information provided, we can use Bayesian probability to estimate the probability of getting a head in the next coin toss.
Let's define the following events:
A: Coin selected from the jar is fair
B: Coin selected from the jar is double-headed
C: 10 heads are observed in 10 tosses
We need to calculate P(A|C), the probability that the selected coin is fair given that 10 heads are observed in 10 tosses.
According to Bayes' theorem, we have:
P(A|C) = P(C|A) * P(A) / P(C)
where:
P(C|A): Probability of observing 10 heads in 10 tosses given that the coin is fair. Since a fair coin has a 0.5 probability of landing on heads, we have P(C|A) = 0.5^10.
P(A): Probability of selecting a fair coin from the jar. Since there are 999 fair coins out of 1000, we have P(A) = 999/1000.
P(C): Probability of observing 10 heads in 10 tosses, regardless of the type of coin. This can be calculated by summing the probabilities of two scenarios: (1) selecting a fair coin and getting 10 heads, and (2) selecting the double-headed coin and getting 10 heads. We can write this as: P(C) = P(C|A) * P(A) + P(C|B) * P(B), where P(C|B) = 1 (since the double-headed coin always lands on heads) and P(B) = 1/1000 (since there is only 1 double-headed coin out of 1000).
Plugging in the values, we get:
P(A|C) = P(C|A) * P(A) / P(C)
= (0.5^10) * (999/1000) / [(0.5^10) * (999/1000) + 1/1000]
Evaluating this expression gives P(A|C) ≈ 0.494, so the posterior probability that the coin is double-headed is P(B|C) = 1 - P(A|C) ≈ 0.506. The probability of a head on the next toss is obtained by weighting each coin's head probability by its posterior probability: P(head) = P(A|C) × 0.5 + P(B|C) × 1 ≈ 0.494 × 0.5 + 0.506 ≈ 0.753. So, given 10 heads in 10 tosses from this jar of 999 fair coins and 1 double-headed coin, the estimated probability of getting a head on the next toss is about 0.75.
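The same calculation, written out in a short Python snippet:

```python
# Posterior probability that the selected coin is fair, given 10 heads in 10 tosses
p_heads_given_fair = 0.5 ** 10       # P(C|A)
p_fair = 999 / 1000                  # P(A)
p_double_headed = 1 / 1000           # P(B); note P(C|B) = 1

p_c = p_heads_given_fair * p_fair + 1.0 * p_double_headed
p_fair_given_c = p_heads_given_fair * p_fair / p_c      # ≈ 0.494
p_double_given_c = 1 - p_fair_given_c                   # ≈ 0.506

# Weight each coin's head probability by its posterior probability
p_next_head = p_fair_given_c * 0.5 + p_double_given_c * 1.0
print(round(p_fair_given_c, 3), round(p_next_head, 3))  # 0.494 0.753
```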