61 Practice Questions & Answers
You are designing a data pipeline that needs to process 500 GB of data daily with variable workloads. Which AWS service would be most cost-effective for running these batch ETL jobs?
-
A
AWS Lambda with S3 event triggers
-
B
AWS Glue with on-demand capacity
✓ Correct
-
C
Amazon Redshift with dense compute nodes
-
D
Amazon EMR with provisioned cluster running 24/7
Explanation
AWS Glue with on-demand capacity automatically scales for variable workloads without requiring you to manage infrastructure, making it cost-effective for batch ETL jobs. EMR running 24/7 would waste resources during low-demand periods.
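The cost difference comes down to paying only while the job runs versus paying around the clock. A back-of-the-envelope sketch (the DPU-hour rate and cluster price below are illustrative assumptions, not current AWS list prices):

```python
# Illustrative cost comparison: pay-per-job Glue vs. an always-on cluster.
# Both prices are assumptions for the sketch; check current AWS pricing.

GLUE_PRICE_PER_DPU_HOUR = 0.44   # assumed on-demand DPU-hour rate
CLUSTER_PRICE_PER_HOUR = 2.50    # assumed hourly cost of a 24/7 EMR cluster

def daily_glue_cost(dpus: int, job_hours: float) -> float:
    """Glue on-demand bills only while the job is actually running."""
    return dpus * job_hours * GLUE_PRICE_PER_DPU_HOUR

def daily_always_on_cost(price_per_hour: float) -> float:
    """A provisioned 24/7 cluster bills for all 24 hours, busy or idle."""
    return price_per_hour * 24

# A 10-DPU Glue job that finishes the daily 500 GB batch in 2 hours:
glue = daily_glue_cost(dpus=10, job_hours=2)
always_on = daily_always_on_cost(CLUSTER_PRICE_PER_HOUR)
```

Under these assumed rates the on-demand job costs a fraction of the idle-most-of-the-day cluster, which is the point the explanation makes.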
A data analyst needs to create dashboards showing real-time metrics from IoT devices publishing data to Kinesis Data Streams. Which combination of services would provide the lowest latency visualization?
-
A
Kinesis Data Streams → EMR → Redshift → QuickSight
-
B
Kinesis Data Streams → Kinesis Data Firehose → S3 → QuickSight
-
C
Kinesis Data Streams → Kinesis Data Analytics → QuickSight
✓ Correct
-
D
Kinesis Data Streams → Lambda → DynamoDB → QuickSight
Explanation
Kinesis Data Analytics runs SQL directly on the streaming data with minimal latency, so results reach the dashboard without the batch-oriented storage hops (EMR, S3, Redshift) that the other options introduce.
You need to ensure that sensitive financial data in Amazon Redshift is encrypted both at rest and in transit. Which encryption approach is most comprehensive?
-
A
Use AWS Key Management Service (KMS) for at-rest encryption and require SSL/TLS for client connections
✓ Correct
-
B
Enable encryption at rest only; TLS is handled automatically for connections
-
C
Use S3 server-side encryption for all data before loading into Redshift
-
D
Configure column-level encryption in Redshift and disable public network access
Explanation
KMS encryption at rest combined with enforced SSL/TLS for connections provides comprehensive protection for sensitive data. While S3 encryption is helpful, it doesn't encrypt data in the Redshift cluster itself.
A company uses AWS Glue to catalog 10,000 tables across multiple data sources. They experience slow metadata queries. What optimization would most improve Glue Catalog performance?
-
A
Increase the number of Glue crawlers running in parallel
-
B
Use Glue Data Catalog Partitions feature to organize tables hierarchically
✓ Correct
-
C
Switch from Glue Catalog to Athena-managed metadata
-
D
Partition the catalog database into smaller logical databases
Explanation
Organizing table metadata with partitions (and partition indexes) gives the catalog a hierarchical structure, significantly improving metadata query performance for large catalogs. Splitting the catalog into more logical databases would only fragment it further.
You are implementing a data lake on S3 with multiple teams accessing different datasets. What is the recommended approach for fine-grained access control?
-
A
Create separate S3 buckets for each team with bucket policies
-
B
Use S3 Object Lambda to intercept and filter data based on user identity
-
C
Implement Lake Formation with permission-based access control at the database and table level
✓ Correct
-
D
Use S3 ACLs combined with IAM roles for each team member
Explanation
AWS Lake Formation provides centralized permission management with database and table-level access control, which is more granular and manageable than bucket-level policies or ACLs for multi-team data lake scenarios.
A streaming analytics application needs to handle late-arriving data and out-of-order events. Which Kinesis Data Analytics feature addresses this requirement?
-
A
Watermarking with a grace period for delayed events
-
B
Session windows with automatic reordering
-
C
Event Time Windows with allowed lateness configuration
✓ Correct
-
D
Tumbling windows with event timestamp extraction
Explanation
Kinesis Data Analytics allows you to configure allowed lateness on event time windows, ensuring late-arriving data is still processed correctly rather than being discarded.
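Allowed lateness keeps a window open past its nominal end so stragglers are still counted. A minimal pure-Python simulation of the concept (this is not the Kinesis Data Analytics API; window size and lateness bound are arbitrary):

```python
# Event-time tumbling windows with allowed lateness, simulated in plain Python.

WINDOW = 10           # window size in seconds of event time
ALLOWED_LATENESS = 5  # late events within this bound are still accepted

def assign_windows(events, watermark):
    """events: (event_time, value) pairs; watermark: max event time seen so far.
    An event is kept if its window has not closed past the allowed lateness."""
    windows, dropped = {}, []
    for event_time, value in events:
        window_start = (event_time // WINDOW) * WINDOW
        window_close = window_start + WINDOW + ALLOWED_LATENESS
        if watermark < window_close:
            windows.setdefault(window_start, []).append(value)
        else:
            dropped.append(value)  # arrived after window + lateness: discarded
    return windows, dropped

# Watermark at t=22: window [0,10) closed for good at t=15, so a t=3 straggler
# is dropped, while a t=12 event (its window stays open until t=25) is kept.
events = [(12, "late-but-ok"), (3, "too-late"), (21, "on-time")]
windows, dropped = assign_windows(events, watermark=22)
```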
What is the primary limitation of using Amazon Athena for querying data in S3?
-
A
Maximum query execution time of 5 minutes
-
B
Cannot query data larger than 100 GB per query
-
C
Charges are based on the amount of data scanned, not retrieved
✓ Correct
-
D
Only supports simple SELECT statements, not complex joins
Explanation
Athena charges are based on the data scanned by queries, not the amount of data returned. This makes partitioning and columnar formats (Parquet) critical for cost optimization.
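Since the bill scales with bytes scanned, partition pruning and columnar formats translate directly into dollars. A quick arithmetic sketch, assuming the commonly cited $5 per TB scanned (confirm current pricing for your region):

```python
# Athena bills per byte scanned. Assumed rate: $5 per TB scanned.

PRICE_PER_TB = 5.00
TB = 1024 ** 4

def query_cost(bytes_scanned: int) -> float:
    return bytes_scanned / TB * PRICE_PER_TB

# Scanning a whole 500 GB dataset stored as CSV:
full_scan = query_cost(500 * 1024 ** 3)
# Partitioning by date plus Parquet column pruning might cut the read to ~2%:
pruned_scan = query_cost(int(500 * 1024 ** 3 * 0.02))
```

The absolute numbers are small per query, but at thousands of queries a day the ~50x reduction from partitioning and Parquet dominates the Athena bill.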
You need to perform complex machine learning transformations on data in a Glue job. Which approach is most efficient?
-
A
Export data from Glue to SageMaker for ML processing, then reimport results
-
B
Use AWS Glue DataBrew for automated ML-based data preparation
-
C
Create an EMR cluster and submit PySpark jobs with ML libraries
-
D
Write custom PySpark code in a Glue job using ML libraries like scikit-learn
✓ Correct
Explanation
Writing custom PySpark code directly in Glue jobs allows you to use Python ML libraries efficiently without data movement. This avoids the latency and complexity of exporting to SageMaker.
An organization wants to implement near-real-time reporting on customer transactions. The data arrives at 1 million events per second. Which architecture minimizes latency?
-
A
Kinesis Data Firehose → S3 → Athena → QuickSight
-
B
Kinesis Data Streams → DynamoDB Streams → Lambda → Redshift
-
C
Kinesis Data Streams → Kinesis Data Analytics (SQL) → Elasticsearch → Kibana
✓ Correct
-
D
Kinesis Data Streams (shards based on throughput) → Lambda → RDS → QuickSight
Explanation
Kinesis Data Analytics processes streaming SQL in real-time on Kinesis streams, and Elasticsearch/Kibana provide near-instant visualization capabilities suitable for high-volume event streams.
What is the key difference between AWS Glue DPU-based pricing and provisioned capacity?
-
A
There is no practical difference; they are interchangeable pricing models
-
B
Provisioned capacity pre-allocates resources for consistent workloads, while DPU scales automatically per second
-
C
Provisioned capacity guarantees minimum hourly cost regardless of usage; DPU pricing charges only for actual processing time
✓ Correct
-
D
DPU pricing is per-job; provisioned capacity is per-worker-hour
Explanation
Provisioned capacity commits to a minimum hourly cost for guaranteed capacity, while DPU-based pricing bills per second of actual job execution (subject to a short per-job minimum) at a per-DPU-hour rate, making it better for variable workloads.
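The two models can be sketched side by side. The rate and the 60-second minimum below are assumptions for illustration; check current AWS Glue pricing for real values:

```python
# Sketch of the two Glue billing styles (rate and minimum are assumed).

DPU_HOUR_RATE = 0.44       # assumed on-demand price per DPU-hour
MIN_BILLED_SECONDS = 60    # assumed per-job minimum billing duration

def dpu_job_cost(dpus: int, runtime_seconds: int) -> float:
    """On-demand: billed per second of actual execution, with a minimum."""
    billed = max(runtime_seconds, MIN_BILLED_SECONDS)
    return dpus * billed / 3600 * DPU_HOUR_RATE

def provisioned_cost(hourly_commitment: float, hours: int) -> float:
    """Provisioned: the hourly commitment accrues whether or not jobs run."""
    return hourly_commitment * hours

short_job = dpu_job_cost(dpus=10, runtime_seconds=45)   # billed as 60 s
long_job = dpu_job_cost(dpus=10, runtime_seconds=1800)  # 30-minute job
idle_day = provisioned_cost(hourly_commitment=5.0, hours=24)
```

With spiky workloads the per-second model wins; with steady, near-continuous jobs the committed capacity can come out cheaper.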
You are building a federated query solution across Redshift, RDS, and S3 using Redshift Spectrum. What is a critical consideration?
-
A
Spectrum cannot query encrypted data in S3
-
B
All data sources must use the same authentication mechanism
-
C
Query performance depends on the distance between Redshift and external data sources
-
D
External tables in Spectrum must be defined in an external data catalog such as the AWS Glue Data Catalog
✓ Correct
Explanation
Redshift Spectrum queries external data by defining external tables through the Glue Catalog, which enables federated queries without moving data into Redshift.
A data pipeline processes customer events with personally identifiable information (PII). Which AWS service should be used for automatic PII detection and masking?
-
A
Amazon Macie for discovering and protecting sensitive data
-
B
Amazon Comprehend for NLP-based PII identification in text
-
C
AWS Lambda with custom Python code for pattern matching
-
D
AWS Glue DataBrew with built-in PII detection recipes
✓ Correct
Explanation
AWS Glue DataBrew provides visual recipes specifically designed for PII detection and masking as part of data preparation workflows, with no coding required.
You need to optimize query performance on a 50 TB Redshift cluster. Which strategy would have the most significant impact?
-
A
Increase the number of slices per node to improve parallelism
-
B
Implement sort keys and distribution keys based on query patterns and data relationships
✓ Correct
-
C
Upgrade to dense storage (DS) nodes for more disk space
-
D
Enable result caching for all frequently run queries
Explanation
Sort keys and distribution keys directly impact query execution by minimizing data shuffling and enabling efficient joins and scans, which is more impactful than node type upgrades for most scenarios.
An organization collects sensor data from 100,000 IoT devices every 5 seconds. What is the recommended approach for ingesting this data into AWS?
-
A
Send to DynamoDB first, then replicate to S3 hourly
-
B
Batch upload to S3 every hour using AWS IoT Core rules
-
C
Upload to Redshift via COPY commands every minute
-
D
Stream directly to Kinesis Data Streams using AWS IoT Core integration
✓ Correct
Explanation
AWS IoT Core integrates directly with Kinesis Data Streams for high-throughput, low-latency ingestion of device data at scale, supporting the millions of concurrent connections typical of IoT workloads.
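The scenario's numbers make shard sizing concrete: 100,000 devices reporting every 5 seconds is 20,000 records per second, and a stream must be sized for whichever of the two per-shard write limits (1,000 records/s, 1 MB/s) binds first:

```python
import math

# Shard sizing for the scenario above, using the documented per-shard
# Kinesis Data Streams write limits.

RECORDS_PER_SHARD = 1000      # writes per second per shard
BYTES_PER_SHARD = 1_000_000   # ingest bytes per second per shard

def shards_needed(records_per_sec: float, avg_record_bytes: int) -> int:
    by_records = math.ceil(records_per_sec / RECORDS_PER_SHARD)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / BYTES_PER_SHARD)
    return max(by_records, by_bytes)

rps = 100_000 / 5  # 20,000 records per second
small_payloads = shards_needed(rps, avg_record_bytes=200)    # record-bound
large_payloads = shards_needed(rps, avg_record_bytes=2_000)  # byte-bound
```

With 200-byte sensor readings the record-count limit binds (20 shards); at 2 KB per record the byte limit takes over and doubles the shard count.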
You are designing data retention policies for a compliance-heavy industry. How should you implement time-based data deletion in S3 using Redshift Spectrum?
-
A
Create partitioned external tables and drop partitions when they expire
✓ Correct
-
B
Store expiration dates in Glue Catalog metadata and use Lambda for cleanup
-
C
Use S3 Lifecycle policies to archive and delete objects; query archived data via Spectrum
-
D
Implement triggers in Redshift to run DELETE commands on external tables
Explanation
Partitioned external tables allow you to drop entire partitions by date range, which is the most efficient approach to time-based retention in Spectrum; the underlying S3 objects can then be expired separately (for example, with lifecycle rules).
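Retention by partition drop reduces to generating the expired partition values and emitting one DDL statement per partition. A sketch in plain Python (the table name and `dt=` partition layout are hypothetical; adapt them to your external table):

```python
from datetime import date, timedelta

# Time-based retention via partition drops; naming here is illustrative.

def expired_partitions(today: date, retention_days: int, oldest: date):
    """Yield dt=YYYY-MM-DD partition values older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    d = oldest
    while d < cutoff:
        yield d.isoformat()
        d += timedelta(days=1)

def drop_statements(table: str, partitions):
    return [
        f"ALTER TABLE {table} DROP PARTITION (dt = '{p}');"
        for p in partitions
    ]

parts = list(expired_partitions(date(2024, 1, 10), retention_days=7,
                                oldest=date(2024, 1, 1)))
stmts = drop_statements("spectrum.events", parts)
```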
What is the primary advantage of using Amazon QuickSight's SPICE engine?
-
A
It eliminates the need for data preparation and schema definition
-
B
It automatically applies machine learning transformations to all datasets
-
C
It enables real-time connections to any data source without caching
-
D
It stores and indexes data in-memory, allowing sub-second query response times
✓ Correct
Explanation
SPICE (Super-fast, Parallel, In-memory Calculation Engine) caches data in-memory with optimized indexing, enabling extremely fast interactive dashboards and exploration without hitting the source database.
An analytics team needs to share a complex dataset with external partners while maintaining data security. Which approach is most appropriate?
-
A
Create an unencrypted copy of the dataset and share via email
-
B
Use AWS Lake Formation cross-account access with resource-based policies and row-level filters
✓ Correct
-
C
Create a read-only Redshift user account and share the cluster endpoint
-
D
Export to CSV and share through a secure file transfer service
Explanation
Lake Formation provides fine-grained access control across accounts with row and column-level filtering, enabling secure data sharing with external partners while maintaining data governance.
You are implementing a data warehouse that requires strict isolation of workloads to prevent analytical queries from impacting operational systems. Which Redshift feature addresses this?
-
A
Redshift RA3 nodes with managed storage
-
B
Redshift Workload Management (WLM) with query queues and resource allocation
✓ Correct
-
C
Redshift SSL/TLS encryption for network isolation
-
D
Redshift Spectrum for querying external operational data
Explanation
Redshift WLM allows you to define query queues with resource limits, CPU allocation, and query priorities, ensuring heavy analytical queries don't starve operational workloads of resources.
A company processes clickstream data with high cardinality dimensions (e.g., user IDs, session IDs). Which data storage format in S3 would minimize query costs in Athena?
-
A
CSV format with gzip compression
-
B
Parquet format with Snappy compression and partitioning by date
✓ Correct
-
C
ORC format without compression
-
D
JSON format with dictionary encoding
Explanation
Parquet format supports columnar storage and compression, allowing Athena to skip entire columns and partitions, significantly reducing scanned data and query costs compared to row-based formats like CSV.
What is the correct approach to handle slowly changing dimensions (Type 2) in an AWS data warehouse using Glue?
-
A
Overwrite existing records in the target table on each load
-
B
Store dimension changes in a separate delta table and apply them post-load
-
C
Use Glue bookmarks to only process new records since the last job run
-
D
Create surrogate keys and maintain version records with effective dates
✓ Correct
Explanation
Type 2 SCD requires maintaining historical records with surrogate keys and effective date ranges, enabling point-in-time analysis while preserving dimension change history in the data warehouse.
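The Type 2 mechanics — close the current version, append a new one under a fresh surrogate key — can be shown in a few lines of plain Python. The column layout (surrogate_key, effective dates, is_current flag) is the standard SCD2 pattern, not a specific Glue API:

```python
# Minimal Type 2 slowly-changing-dimension merge.

def apply_scd2(dimension, natural_key, new_attrs, load_date, next_key):
    """Close the current version of the row (if changed) and append a new one."""
    for row in dimension:
        if row["natural_key"] == natural_key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dimension              # no attribute change: no-op
            row["is_current"] = False         # expire the old version...
            row["effective_to"] = load_date
    dimension.append({                        # ...and version the new one
        "surrogate_key": next_key,
        "natural_key": natural_key,
        **new_attrs,
        "effective_from": load_date,
        "effective_to": None,
        "is_current": True,
    })
    return dimension

dim = [{"surrogate_key": 1, "natural_key": "C42", "city": "Austin",
        "effective_from": "2023-01-01", "effective_to": None,
        "is_current": True}]
apply_scd2(dim, "C42", {"city": "Denver"}, "2024-06-01", next_key=2)
```

After the merge both versions coexist, so a point-in-time query can still attribute historical facts to "Austin" while current reporting sees "Denver".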
An organization needs to monitor data quality for thousands of tables daily. Which service provides automated data quality checks?
-
A
Amazon CloudWatch with custom metrics and alarms
-
B
AWS Glue Crawler with duplicate detection only
-
C
AWS Glue Data Quality with rule recommendations
✓ Correct
-
D
Athena with SQL-based validation queries
Explanation
AWS Glue Data Quality automatically generates and applies data quality rules, monitoring completeness, uniqueness, and accuracy across large numbers of datasets without manual rule creation.
You need to implement a data catalog that supports multi-region disaster recovery. Which approach minimizes failover time?
-
A
Maintain Hive metastores in both regions and manually sync definitions
-
B
Replicate Glue Catalog metadata to a secondary AWS account in another region
-
C
Use AWS Glue Catalog with cross-region replication enabled and failover to standby region
✓ Correct
-
D
Store catalog metadata in DynamoDB with global tables and stream to secondary region
Explanation
Replicating Glue Catalog metadata to a standby region keeps both catalogs in sync, enabling rapid failover with minimal data loss and manual intervention.
What configuration is essential when using Amazon Kinesis Data Firehose for data transformation?
-
A
Configure CloudWatch Logs to capture all transformation errors automatically
-
B
Use only pre-built Firehose transformation templates
-
C
Enable Lambda transformation and ensure the Lambda function processes records successfully
✓ Correct
-
D
Set a very high buffer size to accommodate transformation latency
Explanation
Firehose can invoke Lambda functions for data transformation; these functions must complete successfully and return properly formatted records for delivery to the destination.
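The transformation Lambda's contract is the key detail: every incoming record must come back with the same recordId, a result of Ok, Dropped, or ProcessingFailed, and base64-encoded data. A minimal handler in that shape (the upper-cased "event" field is a stand-in for a real transformation):

```python
import base64
import json

# Shape of a Firehose transformation Lambda. Records missing any of
# recordId / result / data are treated as failed by Firehose.

def handler(event, context=None):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["event"] = payload["event"].upper()  # placeholder transform
        output.append({
            "recordId": record["recordId"],          # must echo the input id
            "result": "Ok",                          # or Dropped / ProcessingFailed
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}

raw = base64.b64encode(json.dumps({"event": "click"}).encode()).decode()
resp = handler({"records": [{"recordId": "1", "data": raw}]})
```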
A financial services company requires immutable audit logs for compliance. Which S3 configuration ensures data cannot be modified or deleted?
-
A
Enable S3 Versioning and configure S3 Object Lock in GOVERNANCE mode
-
B
Enable S3 Versioning and configure S3 Object Lock in COMPLIANCE mode with a retention period
✓ Correct
-
C
Enable MFA Delete and restrict bucket access to a specific IAM role
-
D
Use S3 bucket encryption with a CMK and deny all delete permissions via bucket policy
Explanation
S3 Object Lock in COMPLIANCE mode provides WORM (Write-Once-Read-Many) protection with a retention period, preventing any deletion or modification even by administrators, meeting immutable audit log requirements.
You are designing an ETL process that must handle schema evolution in source data. Which Glue feature best supports this?
-
A
Glue Job Bookmarks to track schema changes
-
B
Glue Crawlers with automatic schema detection and catalog updates
✓ Correct
-
C
Glue Data Catalog versioning with table version tracking
-
D
Glue DPU auto-scaling to handle schema complexity
Explanation
Glue Crawlers automatically detect schema changes by scanning source data and updating the catalog accordingly, eliminating the need for manual schema management as source data evolves.
A startup wants to build a serverless analytics solution with minimal operational overhead. Which technology stack would be most appropriate?
-
A
API Gateway → EventBridge → Redshift Spectrum → custom API endpoints
-
B
DynamoDB Streams → Lambda → RDS → custom web dashboard
-
C
Kinesis Data Firehose → S3 → Glue for ETL → QuickSight dashboards
✓ Correct
-
D
SNS topics → SQS queues → EMR cluster → Tableau
Explanation
This stack is fully serverless and managed: Firehose ingests data automatically, S3 stores it, Glue transforms it on-demand, and QuickSight provides visualization without managing infrastructure.
You need to ingest real-time streaming data from IoT devices into AWS. Which service would best handle high-throughput, low-latency ingestion with automatic scaling?
-
A
Amazon Kinesis Data Streams
✓ Correct
-
B
Amazon S3 with event notifications
-
C
AWS Lambda with SQS
-
D
AWS Glue ETL jobs
Explanation
Kinesis Data Streams is purpose-built for real-time data ingestion with automatic scaling, high throughput, and low latency, making it ideal for IoT device data.
A data analyst needs to perform interactive SQL queries on data stored in S3 without provisioning servers. Which service should be used?
-
A
Amazon Athena
✓ Correct
-
B
AWS Lake Formation
-
C
Amazon EMR
-
D
Amazon Redshift Spectrum
Explanation
Amazon Athena provides serverless SQL query capability for S3 data without requiring infrastructure setup, making it the most straightforward solution.
You are designing a data lake and need to ensure fine-grained access control across multiple AWS accounts. Which AWS service provides centralized governance and permission management?
-
A
AWS IAM policies exclusively
-
B
Amazon S3 bucket policies only
-
C
Amazon Athena access controls
-
D
AWS Lake Formation
✓ Correct
Explanation
AWS Lake Formation provides centralized data governance with fine-grained access control, cross-account access management, and simplified permission administration.
What is the primary advantage of using Amazon Redshift over traditional on-premises data warehouses for analytics workloads?
-
A
Lower query latency for all query types
-
B
Eliminates the need for data modeling
-
C
Massive parallel processing with elastic scaling and pay-as-you-go pricing
✓ Correct
-
D
Automatic query optimization without user configuration
Explanation
Redshift's MPP architecture enables horizontal scaling, cost-effective pay-per-use pricing, and eliminates upfront capital investment compared to on-premises systems.
A company uses AWS Glue to catalog and transform data. Which component automatically discovers and catalogues data schema changes?
-
A
AWS Glue Crawlers
✓ Correct
-
B
AWS Glue DataBrew
-
C
AWS Glue Triggers
-
D
AWS Glue Studio
Explanation
AWS Glue Crawlers automatically scan data sources, infer schemas, and populate the Glue Data Catalog, detecting schema changes over time.
You need to stream data from Kinesis Data Streams to multiple destinations including S3, Redshift, and Elasticsearch. What is the most efficient approach?
-
A
Use a single Kinesis Data Firehose delivery stream configured with multiple destinations
-
B
Use Kinesis Data Firehose with multiple delivery streams
✓ Correct
-
C
Configure Lambda functions triggered by Kinesis events to write to each destination
-
D
Create three separate Kinesis consumer applications
Explanation
Each Kinesis Data Firehose delivery stream targets exactly one destination, so delivering to S3, Redshift, and Elasticsearch requires one delivery stream per destination; Firehose still provides built-in transformation, batching, and retry for each stream.
Your organization needs to perform machine learning predictions on streaming data. Which combination of services provides the most integrated solution?
-
A
Kinesis Data Analytics with Apache Flink and SageMaker
✓ Correct
-
B
S3 → Kinesis Firehose → SageMaker training jobs
-
C
EventBridge → Step Functions → SageMaker
-
D
Kinesis Data Streams → Lambda → SageMaker Endpoint
Explanation
Kinesis Data Analytics integrates directly with Apache Flink for streaming analytics and can invoke SageMaker endpoints for real-time ML predictions within the streaming pipeline.
When configuring Amazon Redshift, which parameter directly affects query performance and storage capacity?
-
A
VPC security group rules
-
B
Parameter group encryption settings
-
C
Node type and cluster size
✓ Correct
-
D
Backup retention period
Explanation
Node type (dense compute vs. dense storage) and cluster size determine both the computational resources and total storage capacity available for query processing.
A data pipeline ingests data from multiple heterogeneous sources. Which AWS Glue feature best handles complex ETL transformations with minimal code?
-
A
AWS Glue Python Shell jobs
-
B
AWS Glue crawler configurations
-
C
AWS Glue Scala API
-
D
AWS Glue Studio with visual ETL builder
✓ Correct
Explanation
AWS Glue Studio provides a visual, drag-and-drop interface for building complex ETL pipelines without writing extensive code, supporting various source and target combinations.
You want to analyze data using QuickSight and need to establish real-time connectivity to a database. Which connection type provides the lowest latency?
-
A
Import the data into SPICE
-
B
Use AWS Glue to transform and cache data
-
C
Use a direct database connection via JDBC/ODBC
✓ Correct
-
D
Export data to S3 and refresh hourly
Explanation
Direct database connections via JDBC/ODBC provide real-time query execution with lowest latency, though SPICE offers better performance for repeated queries on cached data.
Which Amazon DMS feature enables continuous data replication from source to target with minimal downtime?
-
A
Full load migration followed by Change Data Capture (CDC)
✓ Correct
-
B
Schema conversion without data migration
-
C
Replication instances only
-
D
Batch export to CSV files
Explanation
DMS performs an initial full load and then captures ongoing changes (CDC), enabling continuous replication that minimizes application downtime during database migration.
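The full load + CDC pattern is easy to see in miniature: snapshot the source, then replay the change events (insert, update, delete) captured while the source stayed live. A conceptual sketch, not the DMS API:

```python
# Full load followed by change-data-capture apply, in miniature.

def full_load(source_rows):
    """Initial snapshot: copy every row, keyed by primary key."""
    return {row["id"]: dict(row) for row in source_rows}

def apply_cdc(target, changes):
    """Replay captured changes in order; inserts and updates are upserts."""
    for change in changes:
        op, row = change["op"], change["row"]
        if op == "delete":
            target.pop(row["id"], None)
        else:
            target[row["id"]] = dict(row)
    return target

target = full_load([{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}])
apply_cdc(target, [
    {"op": "update", "row": {"id": 2, "name": "bobby"}},
    {"op": "insert", "row": {"id": 3, "name": "carol"}},
    {"op": "delete", "row": {"id": 1}},
])
```

Because changes keep flowing until cutover, the application only stops for the final, tiny tail of changes rather than for the whole copy.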
In Amazon Kinesis Data Analytics, what is the primary purpose of a source group?
-
A
Configure shard allocation strategies
-
B
Combine data from multiple Kinesis streams into a single logical stream for processing
✓ Correct
-
C
Manage failover across multiple AWS regions
-
D
Set access control policies for stream consumers
Explanation
Source groups in Kinesis Data Analytics allow you to logically combine and join data from multiple input streams for integrated processing within SQL or Flink applications.
You need to compress data before loading it into Redshift to reduce storage costs. Which compression algorithm provides the best ratio for columnar data?
-
A
Snappy compression
-
B
GZIP compression
-
C
AZ64 compression
✓ Correct
-
D
LZO compression
Explanation
AZ64 is a modern compression algorithm optimized for columnar data in Redshift, providing better compression ratios than older methods like LZO or Snappy.
A company wants to detect anomalies in real-time sensor data. Which AWS service is purpose-built for anomaly detection in streaming time-series data?
-
A
AWS Forecast
-
B
Amazon CloudWatch Anomaly Detector
-
C
Amazon Lookout for Metrics
✓ Correct
-
D
SageMaker Autopilot with Kinesis
Explanation
Amazon Lookout for Metrics is specifically designed for automated anomaly detection in time-series metrics and can integrate with various data sources for real-time detection.
What are the default and maximum retention periods for data stored in Kinesis Data Streams?
-
A
30 days by default, extendable to 90 days
-
B
24 hours by default, extendable to 365 days
✓ Correct
-
C
7 days by default, no extension possible
-
D
1 hour by default, extendable to 24 hours
Explanation
Kinesis Data Streams retains data for 24 hours by default, and you can increase this to up to 365 days using the extended retention feature for an additional cost.
You are designing a cost-effective analytics solution for ad-hoc queries on large historical datasets. Which approach minimizes costs while maintaining query performance?
-
A
Store data in S3 and query with Athena, using Glue partitioning and projection
✓ Correct
-
B
Archive to Glacier and restore for each query
-
C
Use EMR on-demand instances for all queries
-
D
Keep all data in Redshift for instant access
Explanation
Storing data in S3 with Athena queries combined with proper partitioning and projection strategies provides cost-effective ad-hoc analysis without maintaining expensive data warehouse infrastructure.
In AWS Lake Formation, what does the 'LF-tag-based access control' feature enable?
-
A
Network-level access restrictions
-
B
Governance of data assets using metadata tags with automatic enforcement across integrated AWS services
✓ Correct
-
C
Encryption key rotation policies
-
D
Row-level security enforcement across analytics services
Explanation
LF-tag-based access control allows administrators to define permissions using metadata tags that are automatically enforced across Athena, Redshift, Glue, and other integrated services.
A data analyst needs to create visualizations with drill-down capability and complex calculations. Which QuickSight feature best supports this requirement?
-
A
QuickSight dashboards with parameters
-
B
QuickSight Q (natural language queries)
-
C
QuickSight analyses with hierarchical dimensions and calculated fields
✓ Correct
-
D
QuickSight paginated reports
Explanation
QuickSight analyses with hierarchical dimensions enable drill-down capability, and calculated fields support complex business logic and transformations at visualization time.
What is the primary difference between Kinesis Data Streams and Kinesis Data Firehose in terms of data delivery guarantee?
-
A
Both provide exactly-once delivery guarantees
-
B
Firehose guarantees at-least-once delivery to destinations; Streams provides at-least-once delivery with per-shard ordering to consumers
✓ Correct
-
C
Streams guarantees at-least-once; Firehose has no ordering guarantee
-
D
Streams guarantees exactly-once delivery; Firehose guarantees at-least-once delivery
Explanation
Kinesis Data Firehose guarantees at-least-once delivery to destinations, while Kinesis Data Streams provides at-least-once delivery to consumers with ordered records per shard.
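At-least-once delivery means consumers must tolerate redelivered records. The usual remedy is idempotent processing keyed on the record's sequence number, sketched here in plain Python (the sequence numbers are made up for illustration):

```python
# Idempotent consumer: deduplicate redelivered records by sequence number.

class IdempotentConsumer:
    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, record):
        seq = record["sequence_number"]
        if seq in self.seen:
            return False                     # duplicate redelivery: skip
        self.seen.add(seq)
        self.processed.append(record["payload"])
        return True

consumer = IdempotentConsumer()
for rec in [{"sequence_number": "s1", "payload": "a"},
            {"sequence_number": "s2", "payload": "b"},
            {"sequence_number": "s1", "payload": "a"}]:  # redelivered
    consumer.handle(rec)
```

In production the seen-set would live in a durable store (and be bounded by retention), but the principle is the same: make replays harmless rather than trying to prevent them.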
You need to join streaming data from two Kinesis streams with different arrival rates. What is the best approach in Kinesis Data Analytics?
-
A
Process each stream separately and join in S3
-
B
Use Lambda to correlate events from both streams
-
C
Merge both streams into a single stream first
-
D
Use a windowed join with appropriate time window definition
✓ Correct
Explanation
Kinesis Data Analytics supports windowed joins (tumbling, sliding, session windows) that handle streams with different arrival rates by buffering events within the window duration.
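A windowed join buffers events from both streams and matches keys whose timestamps fall in the same window, which is what absorbs the differing arrival rates. A toy simulation of a tumbling-window join (plain Python, not the KDA/Flink API):

```python
from collections import defaultdict

# Tumbling-window join of two event streams, simulated in plain Python.

WINDOW = 60  # window size in seconds

def windowed_join(left, right):
    """left/right: lists of (timestamp, key, value) tuples."""
    buckets = defaultdict(list)
    for ts, key, value in right:             # buffer the right-hand stream
        buckets[(ts // WINDOW, key)].append(value)
    joined = []
    for ts, key, value in left:              # match within the same window
        for rv in buckets.get((ts // WINDOW, key), []):
            joined.append((key, value, rv))
    return joined

clicks = [(10, "u1", "click"), (70, "u2", "click")]
purchases = [(55, "u1", "buy"), (200, "u2", "buy")]  # u2's buy is in a later window
result = windowed_join(clicks, purchases)
```

u1's click and purchase land in the same 60-second window and join; u2's purchase arrives two windows later and does not, which is exactly the trade-off the window duration controls.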
Which AWS Glue feature allows scheduling of ETL jobs based on triggers, including event-based execution?
-
A
Glue Studio scheduling
-
B
Glue triggers with event patterns
-
C
Glue job bookmarks
-
D
Glue Workflows
✓ Correct
Explanation
AWS Glue Workflows enable orchestration of multiple jobs with triggers (time-based, on-demand, or event-based), providing complex job dependency and sequencing management.
A company processes petabytes of data daily using EMR. To optimize costs, they should implement which cost management strategy?
-
A
Use only reserved instances
-
B
Mix spot instances, reserved instances, and on-demand; use auto-scaling based on workload
✓ Correct
-
C
Disable Hadoop optimization features to reduce compute
-
D
Use only on-demand instances
Explanation
Combining spot instances (for cost), reserved instances (for baseline capacity), on-demand (for peaks), and auto-scaling provides optimal cost efficiency for variable EMR workloads.
When using Amazon Redshift, what is the impact of row-level security (RLS) on query performance?
-
A
Significant performance degradation regardless of configuration
-
B
Performance impact varies based on the number of RLS policies and data distribution
✓ Correct
-
C
No performance impact due to built-in optimizations
-
D
Minimal performance impact when RLS policies are properly indexed
Explanation
RLS performance impact depends on policy complexity, the number of policies, and how they interact with query predicates and data distribution, requiring careful design and testing.
You need to implement data quality checks in an AWS Glue ETL pipeline. Which feature is purpose-built for this?
-
A
AWS Glue Data Quality Rules framework
✓ Correct
-
B
AWS Glue Data Catalog descriptions
-
C
AWS Glue crawler statistics
-
D
AWS Glue job bookmarks
Explanation
AWS Glue Data Quality Rules provide a framework to define, validate, and monitor data quality checks directly within ETL jobs, with automatic recommendations and remediation.
In QuickSight, what is the primary advantage of using SPICE (Super-fast, Parallel, In-memory Calculation Engine)?
-
A
Automatically scales to handle unlimited data volumes
-
B
Eliminates the need for data preparation
-
C
Enables real-time updates from databases
-
D
Provides in-memory caching for faster dashboard interactivity without querying source systems repeatedly
✓ Correct
Explanation
SPICE imports and caches data in memory, enabling fast interactive dashboard performance and drill-downs without repeatedly querying the source database or data warehouse.
You are designing a data lake on Amazon S3 for a healthcare organization. Which data format would be most efficient for analytical queries while maintaining ACID transactions and schema enforcement?
-
A
Apache Iceberg or Delta Lake format
✓ Correct
-
B
JSON files stored in separate folders
-
C
CSV files partitioned by date
-
D
Apache Parquet with AWS Glue catalog
Explanation
Apache Iceberg and Delta Lake provide ACID transactions, schema enforcement, and time-travel capabilities essential for healthcare data lakes, while Parquet alone lacks transaction support and CSV/JSON are less optimal for analytical queries.
Your company uses Amazon Redshift and needs to perform real-time analytics on streaming data. Which approach best combines streaming ingestion with Redshift analytical capabilities?
-
A
Use S3 event notifications to trigger Lambda functions for data transformation
-
B
Use Kinesis Data Firehose to load data directly into Redshift, with automatic batching and compression
✓ Correct
-
C
Stream data to DynamoDB first, then manually export to Redshift nightly
-
D
Implement a custom Kafka consumer that writes directly to Redshift tables
Explanation
Kinesis Data Firehose is purpose-built for streaming data delivery to Redshift, handling batching, compression, and automatic retry logic without requiring custom code.
You need to optimize costs for infrequent analytical queries on historical data in Amazon Redshift. Which strategy would be most cost-effective?
-
A
Use Redshift Spectrum to query data directly in S3 while maintaining the cluster always-on for consistency
-
B
Keep all data in dense compute nodes with SSD storage and use pause/resume functionality during off-hours
-
C
Migrate everything to RDS PostgreSQL which has lower per-hour costs than Redshift
-
D
Archive historical data to S3 and use Amazon Athena for occasional queries, keeping only recent data in Redshift
✓ Correct
Explanation
Archiving historical data to S3 and using Athena for infrequent queries eliminates Redshift cluster costs for cold data; Spectrum still requires an active cluster, making option D more cost-effective for this scenario.
An organization wants to implement a data quality framework in AWS Glue. Which combination of services would provide comprehensive data validation, cleansing, and quality metrics?
-
A
AWS Glue ETL jobs only, with quality checks implemented directly in PySpark transformations
-
B
AWS Lambda functions with custom Python scripts for all quality checks and transformations
-
C
AWS Glue DataBrew for profiling and cleansing, plus AWS Glue Data Quality rules for validation and metrics
✓ Correct
-
D
Amazon QuickSight anomaly detection combined with manual data review processes
Explanation
AWS Glue DataBrew and AWS Glue Data Quality are purpose-built for data profiling, cleansing, validation, and quality metric generation, providing a more comprehensive and maintainable solution than custom code or QuickSight alone.
You are implementing a machine learning pipeline where Amazon SageMaker needs access to training data stored in S3. What is the most secure way to grant this access without exposing AWS credentials?
-
A
Use temporary credentials generated by AWS STS and refresh them manually in notebook code
-
B
Store credentials in an encrypted RDS database and query them during training job execution
-
C
Embed AWS access keys in the SageMaker notebook environment variables
-
D
Create an IAM role with the appropriate S3 permissions and use it as the SageMaker execution role
✓ Correct
Explanation
Using an IAM role with S3 permissions as the SageMaker execution role follows AWS security best practices: AWS issues and rotates temporary credentials automatically, so no credentials need to be handled explicitly.
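Two policy documents sit behind such an execution role — a trust policy letting SageMaker assume the role, and a permissions policy scoped to the training bucket. A sketch, with a hypothetical bucket name:

```python
# Trust policy: allows the SageMaker service to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy: read-only access scoped to the training-data bucket.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-training-data",
            "arn:aws:s3:::example-training-data/*",
        ],
    }],
}
```

Scoping the `Resource` list to one bucket (rather than `*`) keeps the role least-privilege.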
Your organization needs to perform complex SQL transformations on data that spans multiple data sources including RDS, S3, and DynamoDB. Which service would best enable unified querying across these heterogeneous sources?
-
A
AWS Glue ETL jobs that extract, transform, and load all data into a single S3 data lake
-
B
Amazon Athena with Iceberg tables and federated query capabilities
✓ Correct
-
C
Amazon RDS with DMS to replicate all sources into a single PostgreSQL database
-
D
Elasticsearch with Logstash pipelines to aggregate data from multiple sources
Explanation
Amazon Athena's federated query feature provides SQL access to RDS, S3, and DynamoDB through data source connectors, querying each source in place without replication or data movement.
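A federated query might look like this sketch. The catalog names `rds_catalog` and `ddb_catalog` are hypothetical — they correspond to Lambda-based data source connectors registered in Athena — while `AwsDataCatalog` is the default Glue catalog over S3:

```python
# One SQL statement joining three heterogeneous sources via Athena
# federated query; connector catalog names are hypothetical.

federated_sql = """
SELECT o.order_id, c.customer_name, s.session_count
FROM "AwsDataCatalog"."sales"."orders" o      -- S3 table via Glue catalog
JOIN "rds_catalog"."crm"."customers" c        -- RDS connector
  ON o.customer_id = c.customer_id
JOIN "ddb_catalog"."default"."sessions" s     -- DynamoDB connector
  ON o.customer_id = s.customer_id
"""
# Submitted with athena.start_query_execution(QueryString=federated_sql, ...)
```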
You have implemented Amazon QuickSight dashboards that must comply with data masking requirements where certain users cannot see PII columns. What is the recommended approach?
-
A
Implement row-level security (RLS) using QuickSight's user and group policies
-
B
Create separate datasets with masked columns and apply different permissions per dataset
-
C
Implement field-level security with RLS rules in QuickSight, combined with dataset column selection
✓ Correct
-
D
Use column-level security in your source database and configure QuickSight dataset permissions accordingly
Explanation
QuickSight's column-level (field-level) security combined with row-level security rules provides granular control over which columns and rows each user can access, enabling compliance with PII masking requirements while maintaining a single dataset.
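The row-level half of that setup is driven by a rules dataset: a table with `UserName`/`GroupName` columns plus one column per restricted field, attached to the main dataset's RLS settings. A sketch with hypothetical users and a hypothetical `region` column:

```python
# Sketch of a QuickSight RLS rules dataset as CSV. Each row grants a user or
# group access to matching values only; an empty cell means "no restriction"
# for that rule's other column.

rls_rules_csv = """UserName,GroupName,region
analyst@example.com,,us-east
,FinanceTeam,eu-west
"""
# Uploaded as its own dataset and linked via the main dataset's
# row-level security configuration in QuickSight.
```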
An organization is building a real-time fraud detection system using Amazon Kinesis. Which configuration best supports low-latency processing with automatic scaling and exactly-once semantics?
-
A
Implement Kinesis Data Analytics with inline SQL transformations and auto-scaling capabilities
-
B
Stream to SQS for queuing and process with Lambda functions for parallel processing
-
C
Use Kinesis Data Streams with the Kinesis Client Library (KCL) for scaling and checkpointing
✓ Correct
-
D
Use Kinesis Data Firehose with batch transformations using Lambda for exactly-once processing
Explanation
Kinesis Data Streams with the KCL provides automatic horizontal scaling through shard-based processing and managed checkpointing that, combined with idempotent processing, yields effectively exactly-once semantics, with lower latency than Firehose, which is optimized for batch delivery.
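The checkpoint pattern can be sketched in miniature: process a batch, write results idempotently (keyed by sequence number), then checkpoint, so that a redelivered batch after a worker restart is a no-op. This in-memory class stands in for the real KCL record-processor interface; the fraud rule is a hypothetical placeholder.

```python
# Minimal sketch of the KCL checkpoint pattern. With at-least-once delivery,
# idempotent writes are what make reprocessing after a restart safe.

class FraudScorer:
    def __init__(self):
        self.results = {}        # idempotent sink keyed by sequence number
        self.checkpoint = None   # last sequence number safely processed

    def process_records(self, records):
        """records: list of (sequence_number, payload) tuples."""
        for seq, payload in records:
            if seq not in self.results:          # replay-safe: skip duplicates
                self.results[seq] = self.score(payload)
        if records:
            self.checkpoint = records[-1][0]     # checkpoint after the batch

    def score(self, payload):
        return payload.get("amount", 0) > 10_000  # hypothetical fraud rule

scorer = FraudScorer()
scorer.process_records([("seq-1", {"amount": 50_000}),
                        ("seq-2", {"amount": 20})])
scorer.process_records([("seq-2", {"amount": 20})])  # redelivery is a no-op
```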
You need to analyze unstructured log data from AWS services and application logs stored in S3. Which combination of services provides the most efficient solution for log analysis and visualization?
-
A
Amazon CloudWatch Logs with embedded metric filters and CloudWatch Insights for analysis
-
B
AWS Glue to catalog logs, Amazon Athena for SQL queries, and QuickSight for visualization
✓ Correct
-
C
Elasticsearch with Filebeat for log collection and Kibana for visualization
-
D
Store all logs in RDS and use SQL queries with QuickSight for visualization
Explanation
Using Glue for cataloging, Athena for SQL queries, and QuickSight for visualization provides a serverless, cost-effective solution that scales automatically and requires minimal infrastructure management for S3-based log analysis.
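In practice a Glue crawler infers the log schema; the resulting Data Catalog entry is roughly equivalent to DDL like this sketch (bucket path and columns are hypothetical):

```python
# Sketch of the external table Athena queries over S3 logs. Partitioning by
# date keeps scans (and Athena cost) proportional to the date range queried.

create_table_ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs.app_logs (
    ts string,
    level string,
    message string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-log-bucket/app-logs/'
"""
# Run once via athena.start_query_execution(), then queryable from QuickSight.
```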
Your organization maintains multiple data marts built on Redshift clusters across different AWS regions. To ensure consistency and reduce data duplication, what approach should be implemented?
-
A
Enable Redshift cross-region read replicas and configure automatic failover between regions
-
B
Set up AWS DMS to continuously replicate data from primary Redshift cluster to secondary regional clusters
-
C
Implement AWS DataSync to continuously sync Redshift tables between regions
-
D
Use AWS Glue to centralize ETL logic, write processed data to a central S3 data lake, and load region-specific marts from S3 using Redshift Spectrum
✓ Correct
Explanation
Centralizing ETL in Glue and using S3 as a single source of truth with Redshift Spectrum eliminates data duplication, ensures consistency across regions, and is more cost-effective than maintaining multiple cluster replicas or continuous replication.
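Each regional mart reads the shared lake through an external schema pointing at the central Glue catalog — a sketch, with hypothetical database, role, and region values:

```python
# Sketch of the Redshift Spectrum external schema each regional cluster
# creates once; after this, lake tables are queryable as lake.<table>
# without copying any data into the cluster.

spectrum_ddl = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
FROM DATA CATALOG
DATABASE 'central_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-role'
REGION 'us-east-1';
"""
```

Because every region points at the same S3 data and the same catalog, there is one copy of the truth and no replication pipeline to keep consistent.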