Book - Chapter 1 Flashcards
(15 cards)
A developer is planning a mobile application for your company’s customers to use to track information about their accounts. The developer is asking for your advice on storage technologies. In one case, the developer explains that they want to write messages each time a significant event occurs, such as the client opening, viewing, or deleting an account. This data is collected for compliance reasons, and the developer wants to minimize administrative overhead. What system would you recommend for storing this data?
A. Cloud SQL using MySQL
B. Cloud SQL using PostgreSQL
C. Cloud Datastore
D. Stackdriver Logging
D. The correct answer is D. Stackdriver Logging is the best option because it is a managed service designed for storing logging data.
Neither Option A nor B is as good a fit because the developer would have to design and maintain a relational data model and user interface to view and manage log data.
Option C, Cloud Datastore, would not require a fixed data model, but it would still require the developer to create and maintain a user interface to manage log events.
You are responsible for developing an ingestion mechanism for a large number of IoT sensors. The ingestion service should accept data up to 10 minutes late. The service should also perform some transformations before writing the data to a database. Which of the managed services would be the best option for managing late arriving data and performing transformations?
A. Cloud Dataproc
B. Cloud Dataflow
C. Cloud Dataprep
D. Cloud SQL
B. The correct answer is B. Cloud Dataflow is a stream and batch processing service that is used for transforming data and processing streaming data.
Option A, Cloud Dataproc, is a managed Hadoop and Spark service and not as well suited as Cloud Dataflow for the kind of stream processing specified.
Option C, Cloud Dataprep, is an interactive tool for exploring and preparing data sets for analysis.
Option D, Cloud SQL, is a relational database service, so it may be used to store data, but it is not a service specifically for ingesting and transforming data before writing to a database.
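The late-data requirement can be illustrated with a minimal sketch in plain Python. This is not the Cloud Dataflow or Apache Beam API. The 60-second window size, the `ingest` function, the doubling "transformation", and the sample events are all assumptions made for illustration; only the 10-minute lateness limit comes from the question.

```python
from collections import defaultdict

ALLOWED_LATENESS_S = 10 * 60  # from the question: accept data up to 10 minutes late
WINDOW_S = 60                 # assumed window size, for illustration only


def window_start(event_time_s):
    """Return the start of the fixed window that contains the event time."""
    return event_time_s - (event_time_s % WINDOW_S)


def ingest(events):
    """Group events into windows by event time, dropping data that is too late.

    events: iterable of (event_time_s, arrival_time_s, value) tuples.
    """
    windows = defaultdict(list)
    dropped = []
    for event_time, arrival_time, value in events:
        start = window_start(event_time)
        window_end = start + WINDOW_S
        if arrival_time - window_end <= ALLOWED_LATENESS_S:
            windows[start].append(value * 2)  # placeholder transformation
        else:
            dropped.append((event_time, value))
    return dict(windows), dropped


# Hypothetical sensor readings: (event time, arrival time, value), in seconds.
events = [
    (5, 6, 1.0),     # arrives on time
    (30, 500, 2.0),  # arrives 440 s after its window closed: accepted
    (45, 700, 3.0),  # arrives 640 s after its window closed: dropped
]
windows, dropped = ingest(events)
print(windows, dropped)
```

A stream processing service such as Cloud Dataflow handles this bookkeeping for you, which is why it fits the question better than a database or a data preparation tool.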
A team of analysts has collected several CSV datasets with a total size of 50 GB. They plan to store the datasets in GCP and use Compute Engine instances to run RStudio, an interactive statistical application. Data will be loaded into RStudio using an RStudio data loading tool. Which of the following is the most appropriate GCP storage service for the datasets?
A. Cloud Storage
B. Cloud Datastore
C. MongoDB
D. Bigtable
A. The correct answer is A, Cloud Storage, because the data in the files is treated as an atomic unit of data that is loaded into RStudio.
Options B and C are incorrect because those are document databases and there is no requirement for storing the data in semi-structured format with support for fully indexed querying. Also, MongoDB is not a GCP service.
Option D is incorrect because, although you could load CSV data into a Bigtable table, the volume of data is not sufficient to warrant using Bigtable.
A team of analysts has collected several terabytes of telemetry data in CSV datasets. They plan to store the datasets in GCP and query and analyze the data using SQL. Which of the following is the most appropriate GCP storage service for the datasets?
A. Cloud SQL
B. Cloud Spanner
C. BigQuery
D. Bigtable
C. The correct answer is C, BigQuery, which is a managed analytical database service that supports SQL and scales to petabyte volumes of data.
Options A and B are incorrect because both are used for transaction processing applications, not analytics.
Option D is incorrect because Bigtable does not support SQL.
You have been hired to consult with a startup that is developing software for self-driving vehicles. The company’s product uses machine learning to predict the trajectory of persons and vehicles. Currently, the software is being developed using 20 vehicles, all located in the same city. IoT data is sent from vehicles every 60 seconds to a MySQL database running on a Compute Engine instance using an n2-standard-8 machine type with 8 vCPUs and 16 GB of memory. The startup wants to review their architecture and make any necessary changes to support tens of thousands of self-driving vehicles, all transmitting IoT data every second. The vehicles will be located across North America and Europe. Approximately 4 KB of data is sent in each transmission. What changes to the architecture would you recommend?
A. None. The current architecture is well suited to the use case.
B. Replace Cloud SQL with Cloud Spanner.
C. Replace Cloud SQL with Bigtable.
D. Replace Cloud SQL with Cloud Datastore.
C. The correct answer is C. Bigtable is the best storage service for IoT data, especially when a large number of devices will be sending data at short intervals.
Option A is incorrect, because Cloud SQL is designed for transaction processing at a regional level.
Option B is incorrect because Cloud Spanner is designed for transaction processing, and although it scales to global levels, it is not the best option for IoT data.
Option D is incorrect because there is no need for indexed, semi-structured data.
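A rough calculation shows how much the write load grows. This is a sketch under stated assumptions: the question says only "tens of thousands" of vehicles, so the figure of 50,000 is hypothetical, and 1 KB is taken as 1,024 bytes.

```python
KB = 1024  # assumption: 1 KB = 1,024 bytes


def ingest_rate(vehicles, bytes_per_message, interval_s):
    """Return (writes per second, bytes per second) for a fleet of vehicles."""
    writes_per_s = vehicles / interval_s
    return writes_per_s, writes_per_s * bytes_per_message


# Current setup from the question: 20 vehicles, one 4 KB message every 60 seconds.
current_writes, current_bytes = ingest_rate(20, 4 * KB, 60)

# Target setup: 50,000 vehicles is an assumed value for "tens of thousands".
target_writes, target_bytes = ingest_rate(50_000, 4 * KB, 1)

print(f"current: {current_writes:.2f} writes/s, {current_bytes / KB:.1f} KB/s")
print(f"target:  {target_writes:,.0f} writes/s, {target_bytes / 1e6:.1f} MB/s")
print(f"target per day: {target_bytes * 86_400 / 1e12:.1f} TB")
```

Under these assumptions the load goes from less than one write per second to 50,000 writes per second, roughly 205 MB/s and close to 18 TB per day, which is the kind of sustained write volume Bigtable is built for.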
As a member of a team of game developers, you have been tasked with devising a way to track players’ possessions. Possessions may be purchased from a catalog, traded with other players, or awarded for game activities. Possessions are categorized as clothing, tools, books, and coins. Players may have any number of possessions of any type. Players can search for other players who have particular possession types to facilitate trading. The game designer has informed you that there will likely be new types of possessions and ways to acquire them in the future. What kind of a data store would you recommend using?
A. Transactional database
B. Wide-column database
C. Document database
D. Analytic database
C. The correct answer is C because the requirements call for a semi-structured schema.
You will need to search players’ possessions and not just look them up using a single key because of the requirement for facilitating trading.
Option A is not correct. Transactional databases have fixed schemas, and this use case calls for a semi-structured schema.
Option B is incorrect because wide-column databases do not support indexed lookup on attributes other than the row key, which is needed for searching.
Option D is incorrect. Analytical databases are structured data stores.
The CTO of your company wants to reduce the cost of running an HBase and Hadoop cluster on premises. Only one HBase application is run on the cluster. The cluster currently supports 10 TB of data, but it is expected to double in the next six months. Which of the following managed services would you recommend to replace the on-premises cluster in order to minimize migration and ongoing operational costs?
A. Cloud Bigtable using the HBase API
B. Cloud Dataflow using the HBase API
C. Cloud Spanner
D. Cloud Datastore
A. The correct answer is A. Cloud Bigtable using the HBase API would minimize migration efforts, and since Bigtable is a managed service, it would help reduce operational costs.
Option B is incorrect. Cloud Dataflow is a stream and batch processing service, not a database.
Options C and D are incorrect. Cloud Spanner is a relational database and Cloud Datastore is a document database; neither is likely to be an appropriate replacement for HBase, which is a wide-column NoSQL database, and migrating from a wide-column model to a different data model would incur unnecessary costs.
A genomics research institute is developing a platform for analyzing data related to genetic diseases. The genomics data is in a specialized format known as FASTQ, which stores nucleotide sequences and quality scores in a text format. Files may be up to 400 GB and are uploaded in batches. Once the files finish uploading, an analysis pipeline runs, reads the data in the FASTQ file, and outputs data to a database. What storage system is a good option for storing the uploaded FASTQ data?
A. Cloud Bigtable
B. Cloud Datastore
C. Cloud Storage
D. Cloud Spanner
C. The correct answer is C because the FASTQ files are treated as unstructured data: their internal format is not used to organize storage structures. Also, 400 GB is large enough that it is not efficient to store the files as objects in a database.
Options A and B are incorrect because a NoSQL database is not needed for the given requirements. Similarly, there is no need to store the data in a structured database like Cloud Spanner, so Option D is incorrect.
You are developing a new application and will be storing semi-structured data that will only be accessed by a single key. The total volume of data will be at least 40 TB. What GCP database service would you use?
A. BigQuery
B. Bigtable
C. Cloud Spanner
D. Cloud SQL
B. The correct answer is B. Bigtable is a wide-column NoSQL database that supports semi-structured data and works well with datasets over 1 TB.
Options A, C, and D are incorrect because they all are used for structured data. Option D is also incorrect because Cloud SQL does not currently scale to 40 TB in a single database.
A group of climate scientists is collecting weather data every minute from 10,000 sensors across the globe. Data often arrives near the beginning of a minute, and almost all data arrives within the first 30 seconds of a minute. The data ingestion process is losing some data because servers cannot ingest the data as fast as it is arriving. The scientists have scaled up the number of servers in their managed instance group, but that has not completely eliminated the problem. They do not wish to increase the maximum size of the managed instance group. What else can the scientists do to prevent data loss?
A. Write data to a Cloud Dataflow stream
B. Write data to a Cloud Pub/Sub topic
C. Write data to Cloud SQL table
D. Write data to Cloud Dataprep
B. The correct answer is B, write data to a Cloud Pub/Sub topic, which can scale automatically to match the incoming workload. The ingestion process can read data from the topic and then process it. Some data will likely accumulate early in every minute, but the ingestion process can catch up later in the minute after new data stops arriving.
Option A is incorrect; Cloud Dataflow is a batch and stream processing service—it is not a message queue for buffering data.
Option C is incorrect; Cloud SQL is not designed to scale for ingestion as needed in this example.
Option D is incorrect; Cloud Dataprep is a tool for cleaning and preparing datasets for analysis.
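The buffering argument can be checked with a small simulation in plain Python. It does not use the Cloud Pub/Sub client library. The arrival profile and the capacity of 200 messages per second are hypothetical; only the 10,000 messages per minute and the front-loaded arrival pattern come from the question.

```python
def simulate(arrivals, capacity, buffered):
    """Simulate one minute of ingestion, second by second.

    arrivals: messages arriving in each second.
    capacity: messages the ingestion servers can process per second.
    buffered: if True, unprocessed messages wait in a queue; if False, they are lost.
    Returns (processed, dropped, peak_backlog, final_backlog).
    """
    backlog = processed = dropped = peak = 0
    for arriving in arrivals:
        backlog += arriving
        handled = min(backlog, capacity)
        processed += handled
        backlog -= handled
        if not buffered:
            dropped += backlog  # no queue: anything not handled now is lost
            backlog = 0
        peak = max(peak, backlog)
    return processed, dropped, peak, backlog


# Hypothetical profile: 10,000 messages, all arriving in the first 30 seconds.
arrivals = [400] * 20 + [200] * 10 + [0] * 30
CAPACITY = 200  # assumed processing rate in messages per second

without_queue = simulate(arrivals, CAPACITY, buffered=False)
with_queue = simulate(arrivals, CAPACITY, buffered=True)
print("without queue:", without_queue)
print("with queue:   ", with_queue)
```

With these numbers the unbuffered servers lose 4,000 messages, while the buffered version loses none: the backlog peaks at 4,000 messages and is drained during the quiet second half of the minute without adding servers.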
A software developer asks your advice about storing data. The developer has hundreds of thousands of 1 KB JSON objects that need to be accessed in sub-millisecond times if possible. All objects are referenced by a key. There is no need to look up values by the contents of the JSON structure. What kind of NoSQL database would you recommend?
A. Key-value database
B. Analytical database
C. Wide-column database
D. Graph database
A. The correct answer is A. This is a good use case for key-value databases because the value is looked up by key only and the value is a JSON structure.
Option B is incorrect. Analytical databases are not a type of NoSQL database.
Option C is not a good option because wide-column databases work well with larger databases, typically in the terabyte range.
Option D is incorrect because the data is not modeled as nodes and links, such as a network model.
A software developer asks your advice about storing data. The developer has hundreds of thousands of 10 KB JSON objects that need to be searchable by most attributes in the JSON structure. What kind of NoSQL database would you recommend?
A. Key-value database
B. Analytical database
C. Wide-column database
D. Document database
D. The correct answer is D. A document database could store the volume of data, and it provides for indexing on attributes other than a single key.
Options A and C do not support indexing on non-key attributes.
Option B is incorrect because analytical is not a type of NoSQL database.
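The difference between key-only access and indexed access on other attributes can be sketched in plain Python. This is an illustration of the idea, not the API of any particular database; the sample documents, attribute names, and the `find` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical JSON-like objects, stored under a key.
docs = {
    "p1": {"name": "Ada", "city": "Paris", "plan": "pro"},
    "p2": {"name": "Lin", "city": "Oslo", "plan": "free"},
    "p3": {"name": "Sam", "city": "Paris", "plan": "free"},
}

# Key-value style access: the only efficient lookup is by key.
by_key = docs["p2"]

# Document style access: a secondary index over every attribute
# maps (attribute, value) pairs back to the keys of matching documents.
index = defaultdict(set)
for key, doc in docs.items():
    for attr, value in doc.items():
        index[(attr, value)].add(key)


def find(attr, value):
    """Return the keys of all documents whose attribute equals the value."""
    return sorted(index.get((attr, value), set()))


print(by_key["name"])
print(find("city", "Paris"))
print(find("plan", "free"))
```

Without the secondary index, answering `find("city", "Paris")` would require scanning every stored object, which is why a key-value database is a poor fit when most attributes must be searchable.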
A data modeler is designing a database to support ad hoc querying, including drilling down and slicing and dicing queries. What kind of data model is the data modeler likely to use?
A. OLTP
B. OLAP
C. Normalized
D. Graph
B. The correct answer is B; OLAP data models are designed to support drilling down and slicing and dicing.
Option A is incorrect; OLTP models are designed to facilitate storing, searching, and retrieving individual records in a database.
Option C is incorrect; OLAP databases often employ denormalization.
Option D is incorrect; graph data models are used to model nodes and their relationships, such as those in social networks.
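Drilling down and slicing can be shown on a tiny, denormalized fact table. This is a minimal sketch in plain Python; the sales rows, the dimension names, and the `aggregate` helper are hypothetical and only illustrate the kind of query an OLAP model supports.

```python
from collections import defaultdict

# Hypothetical denormalized fact rows: dimensions plus one measure.
sales = [
    {"year": 2023, "region": "EU", "city": "Paris", "amount": 100},
    {"year": 2023, "region": "EU", "city": "Oslo", "amount": 50},
    {"year": 2024, "region": "EU", "city": "Paris", "amount": 70},
    {"year": 2024, "region": "NA", "city": "Austin", "amount": 30},
]


def aggregate(rows, dims):
    """Sum the amount measure, grouped by the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)


# Summary level: totals per region.
by_region = aggregate(sales, ["region"])

# Drill down: add the city dimension to see more detail.
by_city = aggregate(sales, ["region", "city"])

# Slice: fix one dimension (year = 2024), then aggregate what remains.
slice_2024 = [row for row in sales if row["year"] == 2024]
by_region_2024 = aggregate(slice_2024, ["region"])

print(by_region)
print(by_city)
print(by_region_2024)
```

Each query is an ad hoc combination of dimensions over the same facts, in contrast to an OLTP model, which is organized around reading and writing individual records.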
A multinational corporation is building a global inventory database. The database will support OLTP type transactions at a global scale. Which of the following would you consider as possible databases for the system?
A. Cloud SQL and Cloud Spanner
B. Cloud SQL and Cloud Datastore
C. Cloud Spanner only
D. Cloud Datastore only
C. The correct answer is C. Cloud Spanner is the only globally scalable relational database for OLTP applications.
Options A and B are incorrect because Cloud SQL will not meet the scaling requirements.
Options B and D are incorrect because Cloud Datastore does not support OLTP models.
A genomics research institute is developing a platform for analyzing data related to genetic diseases. The genomics data is in a specialized format known as FASTQ, which stores nucleotide sequences and quality scores in a text format. Files may be up to 400 GB and are uploaded in batches. Once the files finish uploading, an analysis pipeline runs, reads the data in the FASTQ file, and outputs data to a database. The output is structured and will be queried with SQL; queries typically retrieve a large number of rows but few columns. What database is a good option for storing the output of the analysis pipeline?
A. Cloud Bigtable
B. Cloud Datastore
C. Cloud Storage
D. BigQuery
D. The correct answer is D because the output is structured, will be queried with SQL, and will retrieve a large number of rows but few columns, making this a good use case for columnar storage, which BigQuery uses.
Options A and B are not good options because neither database supports SQL.
Option C is incorrect because Cloud Storage is used for unstructured data and does not support querying the contents of objects.
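Why columnar storage suits queries that read many rows but few columns can be sketched in plain Python. This is an illustration of the layout only, not of how BigQuery is implemented; the column names and values are hypothetical.

```python
# Hypothetical pipeline output, shown first in a row-oriented layout.
rows = [
    {"sample": "s1", "gene": "BRCA1", "reads": 120, "quality": 37.5},
    {"sample": "s2", "gene": "BRCA1", "reads": 95, "quality": 35.0},
    {"sample": "s3", "gene": "TP53", "reads": 210, "quality": 38.2},
]

# Columnar layout: one list per column instead of one record per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row-oriented scan: every field of every row is read to sum one column.
values_read_row_store = sum(len(row) for row in rows)
total_reads_rows = sum(row["reads"] for row in rows)

# Column-oriented scan: only the one column the query needs is read.
values_read_column_store = len(columns["reads"])
total_reads_columns = sum(columns["reads"])

print(total_reads_rows, values_read_row_store)
print(total_reads_columns, values_read_column_store)
```

Both layouts give the same total, but the columnar scan reads a quarter of the values in this example, and the gap widens as the table gains columns that the query does not use.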