Free Databricks Databricks-Machine-Learning-Associate Exam Actual Questions

The questions for Databricks-Machine-Learning-Associate were last updated on Apr 27, 2025

At ValidExamDumps, we continuously monitor updates to the Databricks-Machine-Learning-Associate exam questions by Databricks. Whenever our team identifies changes in the exam questions, exam objectives, exam focus areas, or exam requirements, we immediately update our exam questions for both the PDF and online practice exams. This commitment ensures our customers always have access to the most current and accurate questions. By preparing with these actual questions, our customers can pass the Databricks Certified Machine Learning Associate exam on their first attempt without needing additional materials or study guides.

Other certification material providers often include outdated or retired Databricks questions in their Databricks-Machine-Learning-Associate practice sets. These outdated questions lead to customers failing their Databricks Certified Machine Learning Associate exam. In contrast, we ensure our question bank includes only precise and up-to-date questions, so you can expect to see them in your actual exam. Our main priority is your success in the Databricks-Machine-Learning-Associate exam, not profiting from selling obsolete exam questions in PDF or online practice tests.

 

Question No. 1

Which of the following statements describes a Spark ML estimator?

Correct Answer: D

In Spark MLlib, an Estimator is an algorithm that can be fit on a DataFrame to produce a Model (a Transformer), which can then transform one DataFrame into another, typically by appending predictions or model scores. This is a fundamental concept in Spark machine learning pipelines, where the workflow consists of fitting Estimators to data to produce Transformers.

Reference

Spark MLlib Documentation: https://spark.apache.org/docs/latest/ml-pipeline.html#estimators
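
For illustration, here is a minimal PySpark sketch of the Estimator/Transformer pattern, assuming hypothetical DataFrames train_df and test_df that already contain a features vector column and a label column:

from pyspark.ml.regression import LinearRegression

# LinearRegression is an Estimator: fit() consumes a DataFrame and
# returns a LinearRegressionModel, which is a Transformer.
lr = LinearRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train_df)

# The fitted model transforms a DataFrame into a new one with an
# added prediction column.
predictions = lr_model.transform(test_df)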


Question No. 2

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

Correct Answer: D

Spark ML (Spark's machine learning library) is designed for large-scale data processing and machine learning directly within Apache Spark. It provides tools and APIs for large-scale feature engineering without relying on user-defined functions (UDFs) or the pandas Function API, allowing data transformations to be distributed efficiently across a Spark cluster. Unlike Keras, pandas, PyTorch, and scikit-learn, Spark ML operates natively in a distributed environment suited to big-data scenarios.

Reference

Spark MLlib documentation (Feature Engineering with Spark ML).
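
As a sketch of what this looks like in practice (the input DataFrame df and its category and amount columns are hypothetical), a Spark ML Pipeline can index, encode, and assemble features natively on the cluster with no UDFs:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Each stage is a distributed Spark ML transformation; no Python UDFs
# or pandas functions are involved.
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
encoder = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_vec"])
assembler = VectorAssembler(inputCols=["category_vec", "amount"], outputCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler])
features_df = pipeline.fit(df).transform(df)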


Question No. 3

A data scientist is working with a feature set with the following schema:

The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature.

Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?

Correct Answer: B

For the provided schema, the columns that should be imputed with the most common value (mode) are the categorical ones. Here, loyalty_tier is the only categorical column and should be imputed using its most common value. customer_id is a unique identifier and should not be imputed, while spend and units are numerical columns that are typically imputed with the mean or median, not the mode.


Reference

Databricks documentation on missing value imputation: Handling Missing Data
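
As a minimal sketch (assuming the feature set is loaded as a Spark DataFrame df with the column names above), the mode of loyalty_tier can be computed and filled in with plain DataFrame operations, while the numeric columns would use a mean or median instead:

from pyspark.sql import functions as F

# Most common non-null value (mode) of the categorical column.
mode_tier = (df.where(F.col("loyalty_tier").isNotNull())
               .groupBy("loyalty_tier")
               .count()
               .orderBy(F.desc("count"))
               .first()["loyalty_tier"])

# Impute the categorical column with its mode.
df_filled = df.fillna({"loyalty_tier": mode_tier})

# Numeric columns such as spend and units would typically be imputed
# with the mean or median, e.g. via approxQuantile for the median.
spend_median = df_filled.approxQuantile("spend", [0.5], 0.0)[0]
df_filled = df_filled.fillna({"spend": spend_median})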


Question No. 4

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE

actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

A)

B)

C)

D)

E)

Correct Answer: C

The code block to compute the root mean-squared error (RMSE) for a linear regression model in Spark ML should use the RegressionEvaluator class with metricName set to 'rmse'. Given the schema of preds_df, the evaluator must be configured with predictionCol='prediction' and labelCol='actual'. The code block in Option C sets up RegressionEvaluator this way, so it is the correct choice: it measures the regression model's performance using the predictions and actual outcomes in the DataFrame.

Reference

Spark ML documentation (Using RegressionEvaluator to Compute RMSE).
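
A minimal sketch of that pattern, matching the preds_df schema above:

from pyspark.ml.evaluation import RegressionEvaluator

# Configure the evaluator against the prediction and actual columns
# and request the RMSE metric.
evaluator = RegressionEvaluator(predictionCol="prediction",
                                labelCol="actual",
                                metricName="rmse")
rmse = evaluator.evaluate(preds_df)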


Question No. 5

A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.

They have written the following incomplete code block:

Which of the following pieces of code can be used to fill in the above blank to complete the task?

Correct Answer: B

In the PySpark DataFrame API, mapInPandas applies a function to each partition of a DataFrame, while groupby followed by applyInPandas applies a function to each group, passing that group to the function as a pandas DataFrame. Since the code block uses groupby and the goal is to train one model per group, applyInPandas is the correct choice: it runs train_model once for each group produced by groupby, preserving the grouping integrity.

Reference

PySpark Documentation on applying functions to grouped data: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html
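
As a hedged sketch of the pattern (the grouping column group_id and the returned columns are hypothetical; the real train_model is whatever the engineer has written):

import pandas as pd

# Stub with the shape applyInPandas expects: one pandas DataFrame in
# per group, one pandas DataFrame out matching the declared schema.
def train_model(pdf: pd.DataFrame) -> pd.DataFrame:
    # ... fit a group-specific model on pdf here ...
    return pd.DataFrame({"group_id": [pdf["group_id"].iloc[0]],
                         "n_rows": [len(pdf)]})

# applyInPandas runs train_model once per group produced by groupby.
models_df = (df.groupby("group_id")
               .applyInPandas(train_model, schema="group_id string, n_rows long"))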