[2022] Pass Key features of Professional-Data-Engineer Course with Updated 253 Questions [Q60-Q77]


Professional-Data-Engineer Sample Practice Exam Questions 2022 Updated Verified

NEW QUESTION 60
Cloud Bigtable is Google's ______ Big Data database service.

  • A. mySQL
  • B. Relational
  • C. SQL Server
  • D. NoSQL

Answer: D

Explanation:
Cloud Bigtable is Google's NoSQL Big Data database service. It is the same database that Google uses for services such as Search, Analytics, Maps, and Gmail.
It suits workloads that need low latency and high throughput, including Internet of Things (IoT), user analytics, and financial data analysis.
Reference: https://cloud.google.com/bigtable/

 

NEW QUESTION 61
Your organization has been collecting and analyzing data in Google BigQuery for 6 months. The majority
of the data analyzed is placed in a time-partitioned table named events_partitioned. To reduce the
cost of queries, your organization created a view called events, which queries only the last 14 days of
data. The view is described in legacy SQL. Next month, existing applications will be connecting to
BigQuery to read the events data via an ODBC connection. You need to ensure the applications can
connect. Which two actions should you take? (Choose two.)

  • A. Create a new partitioned table using a standard SQL query
  • B. Create a new view over events using standard SQL
  • C. Create a new view over events_partitioned using standard SQL
  • D. Create a service account for the ODBC connection to use for authentication
  • E. Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection
    and shared "events"

Answer: C,D

 

NEW QUESTION 62
You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?

  • A. Workflow Templates on Cloud Dataproc
  • B. Cloud Scheduler
  • C. cron
  • D. Cloud Composer

Answer: D

 

NEW QUESTION 63
Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage.
You want to minimize the storage cost of the migration. What should you do?

  • A. Tune the Cloud Dataproc cluster so that there is just enough disk for all data.
  • B. Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.
  • C. Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.
  • D. Put the data into Google Cloud Storage.

Answer: D

 

NEW QUESTION 64
Which Google Cloud Platform service is an alternative to Hadoop with Hive?

  • A. Cloud Datastore
  • B. BigQuery
  • C. Cloud Bigtable
  • D. Cloud Dataflow

Answer: B

Explanation:
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis.
Google BigQuery is an enterprise data warehouse.
Reference: https://en.wikipedia.org/wiki/Apache_Hive

 

NEW QUESTION 65
You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally. You need to process, store and analyze these very large datasets in real time.
What should you do?

  • A. Send the data to Google Cloud Datastore and then export to BigQuery.
  • B. Export logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Cloud Storage, and run an analysis as needed.
  • C. Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.
  • D. Send the data to Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.

Answer: C

 

NEW QUESTION 66
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the
world. The company has patents for innovative optical communications hardware. Based on these patents,
they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to
overcome communications challenges in space. Fundamental to their operation, they need to create a
distributed data infrastructure that drives real-time analysis and incorporates machine learning to
continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the
network allowing them to account for the impact of dynamic regional politics on location availability and
cost.
Their management and operations teams are situated all around the globe, creating a many-to-many
relationship between data consumers and providers in their system. After careful consideration, they
decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
  • Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
  • Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production
- to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
  • Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
  • Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
  • Provide reliable and timely access to data for analysis from distributed research workers.
  • Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
  • Ensure secure and efficient transport and storage of telemetry data.
  • Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
  • Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
  • Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive
hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize
our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data
secure. We also need environments in which our data scientists can carefully study and quickly adapt our
models. Because we rely on automation to process our data, we also need our development and test
environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis.
Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to
work on our high-value problems instead of problems with our data pipelines.
You need to compose visualizations for operations teams with the following requirements:
  • The report must include telemetry data from all 50,000 installations for the most recent 6 weeks (sampling once every minute).
  • The report must not be more than 3 hours delayed from live data.
  • The actionable report should only show suboptimal links.
  • Most suboptimal links should be sorted to the top.
  • Suboptimal links can be grouped and filtered by regional geography.
  • User response time to load the report must be <5 seconds.

Which approach meets the requirements?

  • A. Load the data into Google Cloud Datastore tables, write a Google App Engine Application that queries
    all rows, applies a function to derive the metric, and then renders results in a table using the Google
    charts and visualization API.
  • B. Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates
    the metric, and shows only suboptimal rows in a table in Google Sheets.
  • C. Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to
    your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a
    table.
  • D. Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show
    only suboptimal links in a table.

Answer: C
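The data volume implied by the report requirements can be estimated with quick arithmetic. The sketch below only multiplies the figures stated in the question (50,000 installations, one sample per minute, 6 weeks); the function and constant names are illustrative and not part of any Google Cloud API.

```python
# Rough sizing of the report data set, using only figures from the question.
INSTALLATIONS = 50_000
SAMPLES_PER_MINUTE = 1
WEEKS = 6

MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes in a week


def report_rows(installations: int = INSTALLATIONS,
                weeks: int = WEEKS,
                samples_per_minute: int = SAMPLES_PER_MINUTE) -> int:
    """Number of telemetry rows the report has to cover."""
    return installations * weeks * MINUTES_PER_WEEK * samples_per_minute


if __name__ == "__main__":
    print(f"{report_rows():,} rows")
```

With the stated figures this comes to roughly three billion rows, which is the scale the storage and reporting layers have to handle within the 5-second response requirement.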

 

NEW QUESTION 67
You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?

  • A. Increase the share of the test sample in the train-test split.
  • B. Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.
  • C. Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.
  • D. Try to collect more data and increase the size of your dataset.

Answer: B
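The comparison the question relies on can be made concrete with a small sketch. It computes RMSE and applies a simple heuristic to the train/test pair; the `diagnose` function, its labels, and the 1.25 tolerance are assumptions made for illustration, not an established rule.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-squared error between targets and predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def diagnose(train_rmse: float, test_rmse: float, tolerance: float = 1.25) -> str:
    """Hypothetical heuristic: compare train and test error.

    A much higher test error suggests overfitting. A train error that is
    as high as or higher than the test error does not, so regularization
    is unlikely to help and more model capacity is the usual next step.
    """
    if test_rmse > train_rmse * tolerance:
        return "overfitting"
    if train_rmse > test_rmse * tolerance:
        return "not overfitting"
    return "comparable"
```

For the situation in the question, `diagnose(2.0, 1.0)` falls into the "not overfitting" branch, since the train error is the larger of the two.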

 

NEW QUESTION 68
Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than it was previously. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You were asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

  • A. Rewrite the job in Pig.
  • B. Increase the size of the Hadoop cluster.
  • C. Rewrite the job in Apache Spark.
  • D. Decrease the size of the Hadoop cluster but also rewrite the job in Hive.

Answer: C

 

NEW QUESTION 69
You store historic data in Cloud Storage. You need to perform analytics on the historic data. You want to use a solution to detect invalid data entries and perform data transformations that will not require programming or knowledge of SQL.
What should you do?

  • A. Use Cloud Dataproc with a Hadoop job to detect errors and perform transformations.
  • B. Use Cloud Dataprep with recipes to detect errors and perform transformations.
  • C. Use Cloud Dataflow with Beam to detect errors and perform transformations.
  • D. Use federated tables in BigQuery with queries to detect errors and perform transformations.

Answer: B

 

NEW QUESTION 70
You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud.
You want to support transactions that scale horizontally. You also want to optimize data for range queries on non-key columns. What should you do?

  • A. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
  • B. Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.
  • C. Use Cloud SQL for storage. Add secondary indexes to support query patterns.
  • D. Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.

Answer: A

Explanation:
Reference: https://cloud.google.com/solutions/data-lifecycle-cloud-platform

 

NEW QUESTION 71
How would you query specific partitions in a BigQuery table?

  • A. Use the EXTRACT(DAY) clause
  • B. Use the DAY column in the WHERE clause
  • C. Use the _PARTITIONTIME pseudo-column in the WHERE clause
  • D. Use DATE BETWEEN in the WHERE clause

Answer: C

Explanation:
Partitioned tables include a pseudo column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table. To limit a query to particular partitions (such as Jan 1st and 2nd of 2017), use a clause similar to this:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-01-02')
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables#the_partitiontime_pseudo_column
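As a minimal sketch of how such a clause might be generated for an arbitrary date range, the helper below formats two dates into the same WHERE clause shown in the explanation. The function name `partition_filter` is hypothetical, and the sketch only builds a string; it does not call BigQuery.

```python
from datetime import date


def partition_filter(start: date, end: date) -> str:
    """Build a WHERE clause restricting a query to a range of partitions."""
    if end < start:
        raise ValueError("end must not precede start")
    return (
        f"WHERE _PARTITIONTIME BETWEEN TIMESTAMP('{start.isoformat()}') "
        f"AND TIMESTAMP('{end.isoformat()}')"
    )


if __name__ == "__main__":
    print(partition_filter(date(2017, 1, 1), date(2017, 1, 2)))
```

Because the arguments are typed `date` values rather than free text, string formatting is acceptable for an illustration; in application code, query parameters are generally the safer choice.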

 

NEW QUESTION 72
You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose?

  • A. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
  • B. Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.
  • C. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.
  • D. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.

Answer: A

 

NEW QUESTION 73
Cloud Bigtable is a recommended option for storing very large amounts of
____________________________?

  • A. single-keyed data with very high latency
  • B. multi-keyed data with very low latency
  • C. single-keyed data with very low latency
  • D. multi-keyed data with very high latency

Answer: C

Explanation:
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key. Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very low latency. It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations.
Reference: https://cloud.google.com/bigtable/docs/overview
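The single-key access pattern described above can be illustrated with a toy in-memory table. This is only a sketch of the idea that rows are indexed by one row key and kept in key order; `ToyRowKeyTable` and the `dev1#0900` key layout are invented for the example and have nothing to do with the real Bigtable client library.

```python
import bisect


class ToyRowKeyTable:
    """Toy model: every row is reachable only through its single row key."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted
        self._rows = {}   # row key -> row contents

    def put(self, row_key: str, row: dict) -> None:
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = row

    def get(self, row_key: str):
        """Point lookup by row key."""
        return self._rows.get(row_key)

    def scan_prefix(self, prefix: str):
        """Range scan over contiguous keys that share a prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key]))
        return result
```

Point lookups and prefix scans are cheap here because both go through the sorted row key; a query on any other field would have to visit every row, which is why row-key design matters so much for this kind of store.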

 

NEW QUESTION 74
You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)

  • A. You expect future mutations to have similar features to the mutated samples in the database.
  • B. You expect future mutations to have different features from the mutated samples in the database.
  • C. There are roughly equal occurrences of both normal and mutated samples in the database.
  • D. There are very few occurrences of mutations relative to normal samples.
  • E. You already have labels for which samples are mutated and which are normal in the database.

Answer: A,D

Explanation:
Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set.
https://en.wikipedia.org/wiki/Anomaly_detection
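The assumption in the explanation (most instances are normal, and anomalies are the ones that fit the rest of the data least) can be sketched with a simple z-score rule on unlabeled values. This is one basic technique chosen for illustration, not the method the question has in mind; `find_anomalies` and the threshold of 2.0 are assumptions.

```python
import statistics
from typing import List, Sequence


def find_anomalies(values: Sequence[float], threshold: float = 2.0) -> List[int]:
    """Return indexes of values far from the bulk of the data.

    No labels are used: a value is flagged when it lies more than
    `threshold` population standard deviations from the mean.
    """
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) / spread > threshold]


if __name__ == "__main__":
    measurements = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 25.0]
    print(find_anomalies(measurements))
```

The rule only works because the outlying value is rare. If unusual samples made up a large share of the data, they would shift the mean and spread and stop standing out.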

 

NEW QUESTION 75
When a Cloud Bigtable node fails, ____ is lost.

  • A. all data
  • B. the last transaction
  • C. no data
  • D. the time dimension

Answer: C

Explanation:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. Each tablet is associated with a specific Cloud Bigtable node.
Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. As a result:
Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Cloud Bigtable simply updates the pointers for each node.
Recovery from the failure of a Cloud Bigtable node is very fast, because only metadata needs to be migrated to the replacement node.
When a Cloud Bigtable node fails, no data is lost.
Reference: https://cloud.google.com/bigtable/docs/overview
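The separation of nodes from storage described above can be modeled in a few lines. In this toy sketch a dictionary stands in for the shared file system and each node holds only a set of tablet ids; the class and method names are invented for illustration and do not reflect Bigtable internals beyond what the explanation states.

```python
class ToyCluster:
    """Toy model: nodes hold pointers to tablets, storage holds the data."""

    def __init__(self, node_names, tablets):
        # Shared storage (stand-in for Colossus): tablet id -> rows.
        self.storage = dict(tablets)
        # Each node only keeps the ids of the tablets it serves.
        self.assignments = {name: set() for name in node_names}
        for i, tablet_id in enumerate(sorted(self.storage)):
            self.assignments[node_names[i % len(node_names)]].add(tablet_id)

    def fail_node(self, name):
        """Remove a node and hand its tablet pointers to the survivors."""
        survivors = sorted(n for n in self.assignments if n != name)
        if not survivors:
            raise RuntimeError("no nodes left to serve tablets")
        orphaned = self.assignments.pop(name)
        for i, tablet_id in enumerate(sorted(orphaned)):
            self.assignments[survivors[i % len(survivors)]].add(tablet_id)

    def served_tablets(self):
        """All tablet ids currently assigned to some node."""
        return set().union(*self.assignments.values())
```

Calling `fail_node` never touches `storage`; only the pointer sets change, which mirrors why recovery is fast and why no data is lost.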

 

NEW QUESTION 76
Your company is using wildcard tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:
# Syntax error : Expected end of statement but got "-" at [4:11]
SELECT age
FROM
bigquery-public-data.noaa_gsod.gsod
WHERE
age != 99
AND _TABLE_SUFFIX = '1929'
ORDER BY
age DESC
Which table name will make the SQL statement work correctly?

  • A. 'bigquery-public-data.noaa_gsod.gsod'
  • B. bigquery-public-data.noaa_gsod.gsod*
  • C. 'bigquery-public-data.noaa_gsod.gsod'*
  • D. `bigquery-public-data.noaa_gsod.gsod*`

Answer: D
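A minimal sketch of assembling the corrected statement is shown below. It wraps the table prefix plus the `*` wildcard in backticks, so the hyphens in the project name are not parsed as operators, and filters on `_TABLE_SUFFIX`. The helper name `wildcard_query` is hypothetical and the code only builds the SQL text; it does not run it.

```python
def wildcard_query(table_prefix: str, suffix: str) -> str:
    """Build a standard SQL query over a wildcard table."""
    # The whole name, including the trailing *, goes inside one pair of backticks.
    table = f"`{table_prefix}*`"
    return (
        "SELECT age\n"
        f"FROM {table}\n"
        "WHERE age != 99\n"
        f"  AND _TABLE_SUFFIX = '{suffix}'\n"
        "ORDER BY age DESC"
    )


if __name__ == "__main__":
    print(wildcard_query("bigquery-public-data.noaa_gsod.gsod", "1929"))
```

Placing the `*` outside the backticks, or using single quotes instead, would leave the statement invalid, which is what separates the answer options.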

 

NEW QUESTION 77
......
