Parcial 3 Flashcards

Study (95 cards)

1
Q

Is a collection of facts, numbers, words, observations or other useful information. Through _____ processing and ____ analysis, organizations transform raw data points into valuable insights that improve decision-making and drive better business outcomes.

A

Data

2
Q

Consists of values that can be measured numerically. Examples of this type of data include discrete data points (such as the number of products sold) or continuous data points (such as temperature or revenue figures). Is often structured, making it easy to analyze using mathematical tools and algorithms.

A

Quantitative Data

3
Q

Is descriptive and non-numerical, capturing characteristics, concepts or experiences that numbers cannot measure. Examples include customer feedback, product reviews and social media comments. This type of data can be structured (such as coded survey responses) or unstructured (such as free-text responses or interview transcripts).

A

Qualitative Data

4
Q

Is organized in a clear, defined format, often stored in relational databases or spreadsheets. It can consist of both quantitative (such as sales figures) and qualitative data (such as categorical labels like “yes or no”)

A

Structured Data

5
Q

Lacks a strictly defined format. It often comes in complex forms such as text documents, images and videos. _______________ can include both qualitative information (such as customer comments) and quantitative elements (such as numerical values embedded in text).

A

Unstructured Data

6
Q

__________________ blends elements of structured and unstructured data. It doesn’t follow a rigid format but can include tags or markers that make it easier to organize and analyze. Examples of this type of data include XML files and JSON objects. Is widely used in scenarios such as web scraping and data integration projects because it offers flexibility while retaining some structure for search and analysis.

A

Semi-Structured Data
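As a concrete illustration (a hypothetical product record, not from the deck), a small JSON object shows how semi-structured data mixes tags with free-form values:

```python
import json

# A hypothetical product record: semi-structured JSON pairs tags (keys)
# with free-form values, giving partial structure without a rigid schema.
raw = '''
{
  "product": "laptop",
  "price": 999.99,
  "reviews": [
    {"user": "ana", "text": "Great battery life"},
    {"user": "luis", "text": "Runs hot under load"}
  ]
}
'''

record = json.loads(raw)
# The tags make the data easy to navigate even without a fixed table layout.
print(record["product"])       # laptop
print(len(record["reviews"]))  # 2
```

The same record would be awkward in a relational table (variable-length review lists), yet it is far more searchable than raw free text.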

7
Q

Is data about data. In other words, it is information about the attributes of a data point or data set, such as file names, authors, creation dates or data types.

A

Metadata
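A quick sketch using Python's standard library shows file metadata in practice: attributes *about* the data (name, size, modification time) rather than the data itself. The throwaway file and its contents are illustrative only.

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a throwaway file, then inspect its metadata via os.stat.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello, data")
    path = f.name

info = os.stat(path)
print("file name:", os.path.basename(path))
print("size in bytes:", info.st_size)  # 11
print("modified:", datetime.fromtimestamp(info.st_mtime, tz=timezone.utc))

os.remove(path)  # clean up the temporary file
```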

8
Q

Refers to massive, complex data sets that traditional systems can’t handle. It includes both structured and unstructured data from sources such as sensors, social media and transactions.

A

Big Data

9
Q

Helps organizations process and analyze these large data sets to systematically extract valuable insights. It often requires advanced tools such as machine learning.

A

Big Data Analytics

10
Q

The ___________ refer to five fundamental characteristics that describe the challenges of handling vast amounts of data.

A

5 Vs In Big Data

11
Q

Refers to the massive amount of data generated every second. From social media to commercial transactions, every action contributes to the __________ of Big Data.

A

First V: Volume

12
Q

Refers to the speed at which these data are generated and processed. In a world where information is power, speed is essential. Higher _________ allows companies to react in real time to emerging trends or issues, which can provide a significant competitive advantage.

A

Second V: Velocity

13
Q

For instance, financial trading platforms process millions of transactions per second, requiring high-speed data processing. Also, sensors in autonomous vehicles generate gigabytes of data per second that must be processed in real time to make navigation decisions.

A

Examples of “Velocity”

14
Q

A social network like Facebook generates terabytes of data every day through photos, status updates, and messages that users share. Imagine the volume of data that entails. On the other hand, Walmart, one of the largest retail chains, handles more than 1 million customer transactions per hour, which translates into large volumes of data.

A

Examples of “Volume”

15
Q

Refers to the different types of data, such as structured, unstructured, and semi-structured, that can be processed and analyzed. This V allows a more comprehensive and enriching understanding of the environment by considering multiple perspectives and sources of information.

A

Third V: Variety

16
Q

For a practical example, we could say that companies can collect data from various sources such as texts, images, sounds, transaction logs, emails, etc. Also, a hospital may have structured data like medical records, and unstructured data like doctors’ notes and medical imaging results.

A

Examples of “Variety”

17
Q

Refers to the quality and accuracy of the data. The data must be accurate and reliable to obtain valid insights. It’s essential for making informed decisions and avoiding erroneous conclusions that can be costly.

A

Fourth V: Veracity

18
Q

Veracity can be a challenge on social media, where information can be incorrect or misleading. In the medical field, incorrect or incomplete data can have severe consequences, making it crucial to ensure the veracity of the data.

A

Examples of “Veracity”

19
Q

Refers to the usefulness and importance of the data and how they can be used to gain benefits and insights. The ______ of the data lies in how they can be used to improve decision-making, optimize processes, and generate new opportunities.

A

Fifth V: Value

20
Q

Giants like Netflix or Amazon assign superior utility to data. Netflix, for example, uses Big Data to analyze user preferences and recommend movies and series, creating value through a better user experience. On the other hand, Amazon uses Big Data analytics to optimize its logistics and supply chain, resulting in faster delivery and better customer service.

A

Examples of “Value”

21
Q

Data enables organizations to transform raw information into actionable insights to predict customer behavior, optimize supply chains and fuel innovation. The term “data” comes from the plural of “datum”, a Latin word meaning “something given”: a definition that remains just as relevant today. Every day, millions of people provide data to businesses through interactions such as impressions, clicks, transactions, sensor readings or even just browsing online.

A

Why Data Is Important

22
Q

Organizations across industries use data for various purposes, including improving decision-making, streamlining operations and driving innovation.

A

How Data Is Used

23
Q

Is a branch of advanced analytics that predicts future trends and outcomes using historical data combined with statistical modeling, data mining and machine learning.

A

Predictive Analytics

24
Q

Sometimes called gen AI, is artificial intelligence (AI) that can create original content—such as text, images, video, audio or software code—in response to a user’s prompt or request. ___________ relies on sophisticated machine learning models called deep learning models. These models are trained on vast data sets, which allows them to do things such as understand users’ requests, generate personalized marketing content and write code.

A

Generative AI

25
Data analytics can help healthcare providers improve patient care, predict disease outbreaks and enhance treatment protocols. For instance, monitoring patients through time series data, such as tracking patient vitals over time, provides real-time insights into patient conditions.
Healthcare Innovations
26
Frequently analyze quantitative and qualitative data from surveys, census reports and social media. Examining these data sets allows them to study behaviors, trends and policy impacts. For instance, researchers might use census data to track population changes, survey responses to measure public opinion and social media data to analyze emerging trends.
Social Science Research
27
As cyberattacks and data breaches become more frequent, organizations are increasingly turning to data analysis to identify and respond to threats faster, minimizing damage and reducing downtime.
Cybersecurity and Risk Management
28
Machine learning algorithms, trained on vast data sets, can help organizations boost operational efficiency by optimizing logistics, predicting demand, improving scheduling and automating workflows. For example, e-commerce companies frequently collect and analyze real-time sales data to inform inventory management, reducing the likelihood of stockouts or overstocking.
Operational Efficiency
29
Data is the backbone of personalized customer experiences, particularly in marketing, where organizations can use data analytics to tailor content and ads to different users. For example, streaming services rely on machine learning algorithms to analyze viewing habits and recommend content.
Customer Experience
30
Governments worldwide frequently use open data policies to make valuable data sets publicly accessible, encouraging businesses and organizations to use these resources for research and innovation. For example, the US government's Data.gov platform provides access to various data sets across healthcare, education and transportation.
Government Initiatives
31
Is a set of technological processes for collecting, managing and analyzing data, turning raw data into insights that can guide business decisions.
Business Intelligence (BI)
32
Complements BI by helping organizations interpret and visualize data through graphs, dashboards and reports, making it easier to spot trends and make informed decisions.
Business analytics
33
Is the systematic process of gathering data from various sources while helping to ensure its quality and integrity. Typically performed by data scientists and analysts, it is the foundation for accurate and reliable data analysis. __________________ starts with setting clear objectives and identifying relevant sources. Data is then acquired, cleaned and integrated into a unified data set.
Data collection
34
Data storage systems and ongoing quality checks help ensure the collected data is...
accurate and reliable
35
Organizations handle vast amounts of data in multiple formats scattered across public and private clouds, making data fragmentation and mismanagement significant challenges. This is the practice of collecting, processing and using data securely and efficiently to improve business outcomes. It addresses critical challenges such as managing large data sets, breaking down silos and handling inconsistent data formats.
Data management
36
Are low-cost storage environments that house raw, unstructured data, which can later be processed and analyzed.
Data lakes
37
Store structured data from various sources, optimized for data mining and analysis tasks
Data warehouses
38
Merge the best aspects of data warehouses and data lakes, offering a unified solution for managing both structured and unstructured data.
Data lakehouses
39
Data management solutions typically integrate with existing infrastructure to help ensure access to high-quality, usable data for data scientists, analysts and other stakeholders. These solutions often incorporate data lakes, data warehouses or data lakehouses, combined in a unified __________.
Data fabric
40
Perform complex, foundational data tasks. For example, they create models and algorithms to find insights in large data sets, often using advanced tools such as machine learning and predictive modeling.
Data scientist
41
Focus on more immediate, practical tasks. They use statistics to analyze data and answer specific business questions. Their main goal is to find useful insights that help with everyday decisions and strategies.
Data analyst
42
Is the practice of safeguarding sensitive information from data loss, theft and corruption. ____________ is increasingly important as organizations handle larger volumes of sensitive data across complex, distributed environments.
Data Protection
43
Involves protecting digital information from unauthorized access, corruption or theft. It encompasses various aspects of information security, spanning physical security, organizational policies and access controls.
Data Security
44
Focuses on policies that support the general principle that a person should have control over their personal data, including the ability to decide how organizations collect, store and use their data.
Data Privacy
45
Employees or contractors with authorized access can pose significant risks. According to the Cost of a Data Breach Report, data breaches initiated by malicious insiders cost USD 4.99 million on average.
Insider Threats
46
Threat actors often use __________________ attacks such as phishing to exploit human weaknesses to trick individuals into revealing sensitive information. Generative AI tools can now craft highly convincing phishing emails, increasing the success rate of such attacks.
Social Engineering
47
Cybercriminals use __________ to encrypt an organization’s data and demand a ransom in exchange for the decryption key. Healthcare systems, financial institutions and government data agencies are particularly vulnerable to these attacks.
Ransomware
48
With the widespread adoption of cloud services, misconfigurations, insecure APIs and poor access control can lead to public data leaks. According to the Cost of a Data Breach Report, data breaches involving public clouds are the most expensive, costing USD 5.17 million on average.
Cloud Security
49
Organizations use various data protection technologies to defend against threat actors and help ensure data integrity, confidentiality and availability.
Data Protection Solutions
50
Uses symmetric encryption or asymmetric encryption to protect data during storage and transmission, preventing attackers from reading or misusing it. End-to-end encryption (E2EE) specifically encrypts data before transferring it to another endpoint, keeping it secure throughout its journey.
Encryption
51
Regularly create and store copies of critical data, allowing fast restoration if there is loss or corruption while minimizing downtime.
Data backups
52
Monitor and control network traffic, acting as the first line of defense to block unauthorized access.
Firewalls
53
Verify user identities and control access to sensitive information. Multi-factor authentication (MFA) adds an extra layer of security, requiring users to provide multiple forms of verification.
Authentication and authorization
54
Manages how users access digital resources and what they can do with those resources to reduce insider threats and prevent unauthorized access.
Identity and access management (IAM)
55
Detect, prevent and remove malicious software such as viruses, spyware and ransomware that could compromise data.
Antivirus and anti-malware tools
56
Monitor user activity and flag suspicious behavior to prevent unauthorized access, transmission or leakage of sensitive information.
Data loss prevention (DLP) tools
57
Real-time data from platforms such as Twitter and Facebook can be used to track brand engagement, gauge public opinion and discover consumer sentiment.
Data sources: Social Media Interactions
58
Freely available data sets from governments and organizations, such as census data and economic indicators, can help provide context for demographic shifts, market segmentation and financial analysis
Data sources: Public data
59
Data sets from academic institutions and governments on topics such as climate change and geospatial data are often used for research and policymaking.
Data sources: Open data sets
60
Data from business transactions, such as sales records, invoices and payment information, can help businesses track performance, optimize pricing and improve the customer experience.
Data sources: Transactional data
61
Qualitative or quantitative data collected through customer feedback or research surveys can provide insights into preferences, opinions and trends.
Data sources: Surveys and questionnaires
62
Data from website interactions, such as page views and click-through rates, help companies understand user behavior, optimize content and improve user experiences.
Data sources: Web analytics
63
Data from Internet of Things (IoT) devices such as smart meters and wearable trackers can support real-time analytics and predictive maintenance and prevent equipment downtime.
Data sources: IoT devices
64
Is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
Cloud Computing
65
Are delivered as services. A cloud is called a public cloud when it is made available to the general public in a pay-as-you-go manner, and a private cloud when the cloud infrastructure is operated solely for a single business or organization.
Cloud Computing Resources
66
-> Cloud Services + Applications are accessible from several client devices + The provider is responsible for the application + Examples: Salesforce.com, NetSuite, Google, IBM, etc.
Software-as-a-Service (SaaS)
67
-> Cloud Services + The client is responsible for the end-to-end life cycle in terms of developing, testing and deploying applications + The provider supplies all the systems (operating systems, applications, and development environment) + Examples: Google App Engine, Microsoft Azure, etc.
Platform-as-a-Service (PaaS)
68
-> Cloud Services + The service client has control over the operating system, storage, and applications, which are offered through a Web-based access point + In this type of service the client manages the storage and development environments for Cloud Computing applications, such as the Hadoop Distributed File System (HDFS) and the MapReduce development framework + Examples of infrastructure providers: GoGrid, AppNexus, Eucalyptus, Amazon EC2, etc.
Infrastructure-as-a-Service (IaaS)
69
Cloud Computing (CC) is a technology aimed at processing and storing very large amounts of data, also known as
Big Data (BD)
70
Refers to an eclectic and increasingly familiar group of nonrelational data management systems, where databases are not built primarily on tables and generally do not use SQL for data manipulation. These systems are distributed, non-relational databases designed for large-scale data storage and for massively parallel data processing across a large number of commodity servers.
NoSQL (Not Only SQL)
71
Organizations that collect large amounts of unstructured data are increasingly turning to non-relational databases, now frequently called ____________. These focus on analytical processing of large-scale datasets, offering increased scalability over commodity hardware.
NoSQL databases
72
1. The exponential growth of the volume of data generated by users, systems and sensors, further accelerated by the concentration of a large part of this volume on big distributed systems like Amazon, Google and other cloud services. 2. The increasing interdependency and complexity of data, accelerated by the Internet, Web 2.0, social networks and open, standardized access to data sources from a large number of different systems.
Two trends to use NoSQL databases
73
Conflicts arise between the different aspects of high availability in distributed systems that are not fully solvable; this is known as the _________
CAP-theorem
74
CAP-theorem ----------- "All clients see the same version of the data, even on updates to the dataset."
Strong Consistency
75
CAP-theorem ----------- "All clients can always find at least one copy of the requested data, even if some of the machines in a cluster are down."
High Availability
76
"The total system keeps its characteristics even when deployed across different servers, transparently to the client."
Partition-tolerance
77
1. Large-scale data processing (parallel processing over distributed systems). 2. Embedded I-R (basic machine-to-machine information look-up & retrieval). 3. Exploratory analytics on semi-structured data (expert level). 4. Large volume data storage (unstructured, semi-structured, small-packet structured).
Primary uses of NoSQL Database
78
Typically, these Data Management Systems (DMS) store items as alphanumeric identifiers (keys) and associated values in simple, standalone tables (referred to as "hash tables"). The values may be simple text strings or more complex lists and sets.
Key-Value stores
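A minimal in-memory sketch (illustrative only, not any particular product) shows the key-value model: the store treats values as opaque and only understands get/put/delete by key.

```python
# Minimal key-value store sketch backed by a Python dict (the "hash table").
class KeyValueStore:
    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Ana", "cart": ["book", "pen"]})  # complex value
store.put("session:9", "token-abc")                             # simple string
print(store.get("user:42")["name"])  # Ana
```

Note the store never inspects the values; querying *inside* a value is exactly what the document databases on the next card add.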
79
Are designed to manage and store documents which are encoded in a standard data exchange format such as XML, JSON (JavaScript Object Notation) or BSON (Binary JSON). Unlike the simple key-value stores described above, the value column in document databases contains semi-structured data, specifically attribute name/value pairs. Unlike simple key-value stores, both keys and values are fully searchable in document databases.
Document databases
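The searchability of attribute values can be sketched in a few lines; the `DocumentStore` class and sample products below are hypothetical, not a real database API.

```python
import json

# Sketch of a document store: documents are JSON-like dicts, and unlike a
# plain key-value store, the attributes inside documents are queryable.
class DocumentStore:
    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        self._docs[doc_id] = document

    def find(self, **criteria):
        # Return documents whose attributes match all name/value pairs.
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

db = DocumentStore()
db.insert("p1", json.loads('{"type": "laptop", "brand": "Acme", "price": 999}'))
db.insert("p2", json.loads('{"type": "phone", "brand": "Acme", "price": 499}'))

acme = db.find(brand="Acme")     # query by value, not by key
laptops = db.find(type="laptop")
print(len(acme), len(laptops))   # 2 1
```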
80
This type of NoSQL database employs a distributed, column-oriented data structure that accommodates multiple attributes per key. These NoSQL databases generally replicate not just Google's Bigtable data storage structure, but Google's distributed file system (GFS) and MapReduce parallel processing framework as well, as is the case with Hadoop, which comprises the Hadoop File System (HDFS, based on GFS) + HBase (a Bigtable-style storage system) + MapReduce.
Wide-Column (or Column-Family)
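The Bigtable-style layout can be sketched with nested dicts (a toy model, not how real wide-column stores are implemented): each row key maps to column families, each holding an open-ended set of column/value pairs, so different rows may carry different columns.

```python
from collections import defaultdict

# Toy wide-column layout: row key -> column family -> column -> value.
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    table[row_key][family][column] = value

# Rows need not share the same columns, or even the same families.
put("user#100", "profile", "name", "Ana")
put("user#100", "profile", "city", "Monterrey")
put("user#200", "profile", "name", "Luis")
put("user#200", "metrics", "logins", 7)

print(table["user#100"]["profile"]["city"])    # Monterrey
print(table["user#200"]["metrics"]["logins"])  # 7
```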
81
___________________ replace relational tables with structured relational graphs of interconnected key-value pairings. They are similar to object-oriented databases, as the graphs are represented as an object-oriented network of nodes (conceptual objects), node relationships ("edges") and properties (object attributes expressed as key-value pairs). In general, _________________ are useful when you are more interested in relationships between the data than in the data itself.
Graph Databases
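A minimal graph sketch (hypothetical classes and sample data) makes the node/edge/property model concrete: edges are first-class and labeled, and queries traverse relationships rather than scanning tables.

```python
# Minimal graph model: nodes carry properties, edges are labeled triples.
class Graph:
    def __init__(self):
        self.nodes = {}   # node_id -> properties (key-value pairs)
        self.edges = []   # (source, label, target) triples

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target):
        self.edges.append((source, label, target))

    def neighbors(self, node_id, label):
        # Traverse outgoing edges with the given label.
        return [t for (s, l, t) in self.edges if s == node_id and l == label]

g = Graph()
g.add_node("ana", role="engineer")
g.add_node("luis", role="analyst")
g.add_node("acme", kind="company")
g.add_edge("ana", "works_at", "acme")
g.add_edge("luis", "works_at", "acme")
g.add_edge("ana", "knows", "luis")

print(g.neighbors("ana", "knows"))  # ['luis']
```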
82
What is HBase?
Column-oriented database management system that runs on top of the Hadoop Distributed File System (HDFS). It is an open-source implementation of Google's Bigtable architecture.
83
HBase provides consistent read and write operations and thus can be used for high-speed requirements. This also helps to increase the overall throughput of the system.
Consistency
84
Means that only one process can perform a given task at a given time. For example, while one process is performing a write operation, no other process can write to that data.
Atomic Read and Write
85
HBase offers automatic and manual splitting of regions. This means that if a region reaches its threshold size, it automatically splits into smaller sub-regions.
Sharding
86
HBase provides replication across Local Area Networks (LAN) and Wide Area Networks (WAN), which supports failure recovery. There is a master server that monitors all the regions and the metadata of the cluster.
High Availability
87
HBase offers a Java API that allows it to be accessed programmatically.
Client API
88
This is one of the important characteristics of non-relational databases. HBase supports _____________ in both linear and modular form.
Scalability
89
This feature allows HBase to use distributed storage such as HDFS, and HBase can run on top of various file systems, including Hadoop/HDFS.
Distributed Storage and HDFS/Hadoop integration
90
The data in HBase are replicated over a number of clusters. This helps to recover data in case of loss and provides high availability.
Data Replication
91
HDFS is internally distributed and supports automatic recovery. As HBase runs on top of HDFS, it also recovers automatically.
Load sharing and Support for Failure
92
HBase supports a Java API, which makes it easy to use programmatically from Java. HBase also supports MapReduce, which helps in parallel processing of data.
API and MapReduce Support
93
HBase stores row keys in lexicographical order, thus optimizing range requests.
Sorted Row Keys
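A short sketch shows why lexicographically sorted row keys matter: rows that share a prefix sit next to each other, so a range scan is a cheap slice instead of a full scan. The `user#date` key scheme and `prefix_scan` helper are illustrative assumptions, modeled on the sorted-key idea rather than on HBase's actual API.

```python
import bisect

# Sorted row keys: all of one user's rows are contiguous.
row_keys = sorted([
    "ana#2024-01-03", "ana#2024-02-11", "ana#2024-03-09",
    "luis#2024-01-15", "luis#2024-02-20",
])

def prefix_scan(keys, prefix):
    # Binary-search the sorted keys for the prefix range.
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")  # just past the prefix
    return keys[lo:hi]

print(prefix_scan(row_keys, "ana#"))  # all of ana's rows, no full scan
```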
94
HBase performs real-time processing of data and supports block caching and Bloom filters.
Real Time Processing of Data
95
When working with big databases, it is often a requirement to design the schema differently. The terms that describe this principle are ________, __________ and ______________. It is about rethinking how data is stored in Bigtable-like storage systems, and how to make use of it in an appropriate way.
Denormalization, Duplication, and Intelligent Keys (DDI).
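The DDI idea can be sketched in a few lines; the order records and the `user#date#order_id` key scheme below are hypothetical examples of the pattern, not a prescribed design. Instead of joining a users table to an orders table at read time, each order row duplicates the user's name (denormalization and duplication), and the composite row key (an intelligent key) keeps one user's orders contiguous and time-ordered.

```python
# Denormalized rows with intelligent composite keys: user#date#order_id.
orders = {
    "ana#2024-01-03#o1":  {"user_name": "Ana",  "total": 120},
    "ana#2024-02-11#o2":  {"user_name": "Ana",  "total": 75},
    "luis#2024-01-15#o3": {"user_name": "Luis", "total": 40},
}

# A single prefix filter answers "all of Ana's orders" without a join,
# and the sorted keys return them in chronological order.
ana_orders = {k: v for k, v in sorted(orders.items()) if k.startswith("ana#")}
print(sum(o["total"] for o in ana_orders.values()))  # 195
```

The trade-off is deliberate: storage is cheap in Bigtable-like systems, so duplicating data to make reads a single contiguous scan is usually worth it.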