Week 12: Graph Processing and Machine Learning in the Cloud Flashcards

Goals and Objectives

Upon successful completion of this module, you will be able to:

Understand the basics of graph processing using Pregel, Giraph, and Spark GraphX.

Compare and contrast graph processing, machine learning, and deep learning.

Learn about cloud-based machine learning offerings.

Describe the machine learning and data science life cycles and workflows.

Explain human-in-the-loop AI.

Explain how machine learning works with examples from Mahout and Spark.

Explain K-means, Naive Bayes, and frequent pattern mining (FPM) as examples.

(49 cards)

1
Q

What distinguishes a graph database from a relational database?

A

A graph database stores data as an explicit network of nodes and relationships (each node directly “knows” its neighbors), offering constant‑time local traversals and no need for costly join operations. A relational database stores data in tables of rows and columns, and must use foreign keys plus join operators to navigate relationships.

2
Q

How does the graph model represent nodes, labels, and relationships?

A

Nodes are the fundamental entities (vertices).

Labels (or types) categorize nodes (e.g. :Person, :Page).

Relationships are first-class, directed edges linking two nodes, each with its own type (e.g. [:FRIEND_OF], [:LINKS_TO]).

3
Q

Why do graph databases eliminate the need for foreign keys and joins?

A

Because relationships are stored as direct pointers between nodes, traversals follow those pointers in constant time rather than performing set‐based joins via foreign keys.

4
Q

What are examples of real-world graphs at web, social, and brain scales?

A

Web graph: pages connected by hyperlinks

Social graph: people connected by friendships or “follows”

Brain graph: neurons connected by synapses

5
Q

How large can real-world graphs get?

A

Extremely large—on the order of billions of nodes and tens or hundreds of billions (even trillions) of edges. (E.g. the Web graph has ~1 billion+ pages and 50 billion+ links; the human brain ~10¹¹ neurons and 10¹⁵ synapses.)

6
Q

Why are many application domains a natural fit for graphs?

A

Because they involve richly interconnected data where the relationships themselves carry semantics—think networks, recommendations, fraud detection, knowledge bases—so a graph’s native representation maps directly to the problem.

7
Q

How does the concept of properties attached to nodes and edges enhance graph modeling?

A

By allowing arbitrary key‑value attributes on both nodes and relationships, you can record metadata (e.g. timestamps, weights, labels) directly in the graph, enabling flexible schemas and more expressive, property‐driven queries.

8
Q

What is the main difference between OLTP and OLAP graph systems?

A

OLTP graph systems support real‑time queries by traversing a small, localized subgraph (responding in milliseconds–seconds), whereas OLAP graph systems run batch analytics over the entire graph (taking minutes–hours).

9
Q

Why is local traversal critical for real-time OLTP graph queries?

A

Real‑time performance requires a local traversal—starting at a specific vertex (or small set) and touching only a limited number of connected vertices—so the system can answer in milliseconds to seconds.

10
Q

How does OLAP processing work by analyzing the entire graph, and what is its tradeoff?

A

OLAP processes every vertex and edge—often iteratively for recursive algorithms—using a Bulk Synchronous Parallel model (e.g., Pregel → Giraph → Spark GraphX → GraphFrames or Gremlin VertexProgram). The tradeoff is comprehensive analytics at the cost of real‑time speed: jobs can take minutes or hours on massive graphs.

11
Q

What are the four major communities in graph database ecosystems?

A

Semantic Web: RDF data model + SPARQL, declarative

Graph Databases (OLTP): Labeled Property Graphs with nodes, edges, properties; Gremlin (imperative) or Cypher (declarative)

Pattern‑Matching/Motif Finding: Traversal‑based queries to detect subgraph patterns

Big Data Graph Processing (OLAP): Bulk graph algorithms via BSP frameworks like Pregel, Giraph, GraphX, GraphFrames

12
Q

What features make Neo4j a popular graph database for OLTP workloads?

A

Neo4j is an open‑source, NoSQL, native graph database that provides an ACID‑compliant transactional backend and a declarative, graph‑optimized query language (Cypher), enabling fast, consistent real‑time traversals.

13
Q

How does the Cypher language describe graph traversals and pattern matching?

A

Nodes in () with optional :Label and {properties}

Relationships in [] with optional :TYPE and {properties}, directed via ->/<-

MATCH clauses specify graph patterns, WHERE filters, and RETURN projects results.

14
Q

How does Gremlin differ from Cypher in terms of imperative versus declarative programming?

A

Gremlin is primarily an imperative (step‑by‑step) traversal language—chaining functions to tell the engine how to walk the graph—though it also offers declarative constructs. Cypher, by contrast, is purely declarative, describing what pattern to match without prescribing execution steps.

15
Q

What is GraphSON, and how is it used to represent graph data in Gremlin-based systems?

A

GraphSON is the Gremlin standard JSON format for serializing graph elements—vertices, edges, and their single or multi‑valued properties—enabling language‑agnostic data exchange over the Gremlin protocol.

16
Q

How do AWS Neptune and Azure CosmosDB support scalable OLTP graph processing?

A

AWS Neptune: Fully managed, ACID-compliant service supporting both the Property Graph model (Gremlin over WebSocket) and RDF (SPARQL over REST), with data replicated six ways across three Availability Zones, automated backups, point-in-time recovery, and failover typically under 30 seconds.

Azure CosmosDB: Globally distributed Gremlin API with horizontal scaling via partitioning (vertices need a partition key; edges stored with their source), returning results in GraphSON format.

17
Q

What is the goal of the Semantic Web, and how does RDF support it?

A

The Semantic Web aims to link explicit “data” on the World Wide Web in a machine‑readable format—enabling targeted semantic search, automated agents, fraud detection, etc. RDF supports this by providing a triple‑based data model (subject‑predicate‑object) with multiple serializations (RDF/XML, Turtle, JSON‑LD) to express and interlink resources.

18
Q

How are RDF triples (subject-predicate-object) used to model knowledge?

A

Each RDF triple is a simple statement: the subject (resource) is connected via a predicate (property) to an object (value or another resource). Collections of these triples form a graph that models entities and their relationships in a structured, extensible way.

19
Q

What are the purposes of SPARQL and OWL in semantic web technology?

A

SPARQL is the query language for RDF: using PREFIX, SELECT, WHERE and graph‑pattern matching to retrieve and manipulate triples.

OWL (Web Ontology Language) provides an ontology layer—defining classes, properties, and axioms—to add formal semantics (e.g., class hierarchies, cardinalities) on top of RDF data.

20
Q

How does SPARQL syntax compare to Cypher and Gremlin for querying graph data?

A

SPARQL is a declarative, SQL‑like language for RDF graphs, using PREFIX declarations and SELECT…WHERE { … } triple patterns. Cypher is also declarative but uses ASCII‑art ()-[]->() graph patterns, while Gremlin is an imperative traversal API chaining steps to walk the graph.

21
Q

What competition exists among graph query languages like SPARQL, Cypher, and GQL?

A

There’s fierce competition among:

SPARQL for RDF graphs,

Cypher (and Gremlin) for property graphs,

GQL, a new ISO standard project (approved in 2019) rooted in Cypher and Oracle's PGQL.

No single language has yet become dominant for querying graph data.

22
Q

How do GraphFrames build on Spark DataFrames for easier graph computation?

A

They provide a DataFrame‑based API (instead of RDDs), with Scala and Python interfaces that tap into DataFrame optimizations, simpler syntax, SQL‑style queries, and built‑in support for motif finding—making graph ops more user‑friendly and performant.
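
For illustration only, a minimal sketch of the DataFrame-based API; the SparkSession `spark`, the toy vertices/edges, and the `relationship` column are assumptions, and the graphframes package must be on the classpath:

```scala
import org.graphframes.GraphFrame

// The vertex DataFrame needs an "id" column; the edge DataFrame needs "src" and "dst".
val v = spark.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol")
)).toDF("id", "name")

val e = spark.createDataFrame(Seq(
  ("a", "b", "FRIEND_OF"), ("b", "c", "FRIEND_OF")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(v, e)

// Results come back as DataFrames, so SQL-style operations apply directly.
g.inDegrees.show()
g.edges.filter("relationship = 'FRIEND_OF'").count()
```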

23
Q

How does GraphFrames differ from GraphX in terms of API, languages, and abstractions?

A

GraphFrames: DataFrames as the core abstraction; Scala, Java, and Python APIs; vertex IDs of any Catalyst type; arbitrary DataFrame columns for vertex/edge properties; operations return a GraphFrame or DataFrame.

GraphX: RDDs as core; Scala‑only; Long vertex IDs; VD/ED‑typed attributes; returns Graph[VD,ED] or RDDs.

24
Q

What are examples of graph algorithms that GraphFrames support?

A

PageRank, breadth‑first search (BFS), shortest paths, connected components, strongly connected components, label propagation, triangle counting, SVDPlusPlus—and wrappers around GraphX algorithms plus custom Pregel‑style routines.
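
As a hedged sketch of invoking two of these built-in algorithms, reusing the toy GraphFrame `g` from the construction sketch above (an assumption of this sketch, not part of the original card):

```scala
// PageRank: returns a new GraphFrame with a "pagerank" vertex column.
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.select("id", "pagerank").show()

// BFS: shortest unweighted path(s) between vertices matching two expressions.
val paths = g.bfs.fromExpr("id = 'a'").toExpr("id = 'c'").run()
paths.show()
```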

25
How is motif finding used for structural pattern matching in graphs?
By calling g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)"), GraphFrames locates subgraphs matching the pattern and lets you filter them via DataFrame expressions (e.g., motifs.filter("e1.delay > 20")).
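A self-contained sketch of that motif query; the airports and `delay` values are invented for illustration, and a SparkSession `spark` plus the graphframes package are assumed:

```scala
import org.graphframes.GraphFrame

val airports = spark.createDataFrame(Seq(
  ("SFO", "San Francisco"), ("ORD", "Chicago"), ("JFK", "New York")
)).toDF("id", "city")

val flights = spark.createDataFrame(Seq(
  ("SFO", "ORD", 35), ("ORD", "JFK", 5)
)).toDF("src", "dst", "delay")

val g = GraphFrame(airports, flights)

// Chains a -> b -> c with no edge closing the triangle back to a
val motifs = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")

// Each named element becomes a struct column, so DataFrame filters apply directly
motifs.filter("e1.delay > 20").select("a.id", "b.id", "c.id", "e1.delay").show()
```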
26
What methods exist for implementing algorithms in GraphFrames?
DataFrame & GraphFrame operations: motif finding combined with DataFrame joins, or the built-in algorithms.

Message passing: use aggregateMessages() to send and aggregate messages between vertices.

Pregel API: iterative BSP-style computation via graph.pregel for custom algorithms.
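
A hedged sketch of the message-passing route using aggregateMessages(); the tiny graph and its "age" column are made up, and a SparkSession `spark` plus the graphframes package are assumed:

```scala
import org.apache.spark.sql.functions.sum
import org.graphframes.GraphFrame
import org.graphframes.lib.AggregateMessages

val v = spark.createDataFrame(Seq(("a", 34), ("b", 36), ("c", 30))).toDF("id", "age")
val e = spark.createDataFrame(Seq(("a", "b"), ("b", "c"))).toDF("src", "dst")
val g = GraphFrame(v, e)

val AM = AggregateMessages
// Every edge sends its neighbour's age in both directions;
// the messages arriving at each vertex are then summed.
val summed = g.aggregateMessages
  .sendToDst(AM.src("age"))
  .sendToSrc(AM.dst("age"))
  .agg(sum(AM.msg).as("summedAges"))
summed.show()
```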
27
What kinds of machine learning and graph analytics tasks can GraphX support?
A broad range of graph‑parallel algorithms, including collaborative filtering (ALS, SGD), tensor factorization, structured prediction (loopy belief propagation, max‑product LPs, Gibbs sampling), semi‑supervised learning (graph SSL, CoEM), community detection, triangle counting, k‑core and k‑truss decomposition, PageRank (and personalized), shortest‑path, graph coloring, classification, and neural networks.
28
How can graphs be viewed as tables, with vertex and edge properties represented?
As two tables:

Vertex Property Table with columns (Id, V-properties…)

Edge Property Table with columns (SrcId, DstId, E-properties…)

This tabular view lets you use familiar table operations on graph data.
29
How is a graph constructed using RDDs for vertices and edges in Spark GraphX?
Create an RDD[(VertexId, V)] for the vertices and an RDD[Edge[E]] for the edges, then build the graph:

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Vertex RDD: (id, attributes)
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((1L, ("A", "student")), …))

// Edge RDD: Edge(srcId, dstId, attribute)
val relations: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(1L, 2L, "P8"), …))

// Assemble the property graph
val graph = Graph(users, relations)
```
30
What operators are available in GraphX to manipulate graphs?
Structural views: graph.vertices, graph.edges

Transformations: graph.reverse, graph.subgraph(pV, pE), graph.mapVertices(f), graph.mapEdges(f)

Joins: graph.joinVertices(otherTable)

Analytics: graph.connectedComponents(), plus all graph-parallel algorithms.
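
A brief sketch of composing a few of these operators, continuing the users/relations graph from the previous card (so not fully standalone):

```scala
val students = graph.subgraph(vpred = (_, attr) => attr._2 == "student")  // keep only student vertices
val renamed  = graph.mapVertices((_, attr) => attr._1.toUpperCase)        // transform vertex attributes
val reversed = graph.reverse                                              // flip every edge's direction
val cc       = graph.connectedComponents()                                // graph-parallel analytics
cc.vertices.collect().foreach(println)                                    // (vertexId, componentId) pairs
```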
31
How does the programming model of GraphX differ from GraphFrames in terms of abstraction and API style?
GraphX uses a low‑level RDD‑based API (Scala only), with explicit RDD[(VertexId, V)] and RDD[Edge[E]] and vertex‑centric programming. GraphFrames build on Spark DataFrames (Scala + Python), exposing a higher‑level DataFrame API with declarative motif‑finding and SQL‑style operations.
32
How do Data Mining and Machine Learning relate to Artificial Intelligence? What are some applications?
Data Mining and Machine Learning are subsets of Artificial Intelligence, drawing on fields like information retrieval, statistics, biology, linear algebra, and marketing to analyze large datasets, extract knowledge, and predict future trends. Common applications include recommending friends/dates/products; classifying content; finding similar items; uncovering patterns and associations; summarizing text corpora; detecting anomalies or fraud; and ranking search results.
33
What is the OSEMN Data Science Model? What happens in each of its stages? What cloud tools can you use in each stage?
Obtain: Gather data from cloud sources (AWS Open Data Registry, Azure Open Datasets, Google Public Datasets) via CLI, REST APIs, or Jupyter notebooks. Tools: cloud storage, SQL/NoSQL databases (e.g., MongoDB), big data formats (Parquet, HDFS, HDF), Pig, Hive.

Scrub: Clean and wrangle data (filter records, merge files, extract/replace values, split or merge columns). Tools: Jupyter notebooks; Python/R for single-machine work; Spark or MapReduce for big data.

Explore: Inspect the data with descriptive statistics, correlation and feature selection, and visualize distributions. Tools: Jupyter; Python/R libraries (NumPy, Pandas, Matplotlib, SciPy); Spark or EMR for large datasets.

Model: Perform feature engineering (e.g., dimensionality reduction) and train/test models (regression, classification, clustering, decision trees, random forests, XGBoost, deep learning). Tools: Scikit-Learn or H2O for small data; Spark MLlib, Mahout, or Google Cloud Dataproc for big data; AutoML/hyperparameter tuning via Azure ML, Google AutoML, AWS SageMaker Autopilot, H2O Driverless AI, or DataRobot.

iNterpret: Present findings and model results to stakeholders via visualizations. Tools: Matplotlib, Tableau, D3.js, Seaborn.
34
What is a hyperparameter? What is AutoML? How do they relate to one another?
Hyperparameters are settings that govern the training process of a model (e.g., number of iterations, neural-network topology/size, learning rate). AutoML refers to automated strategies, such as grid search, random search, and gradient-based methods, that explore the hyperparameter space for you, alleviating the very time-consuming manual tuning process.
35
What is the Google Cloud AI Platform? What tools does it provide?
Google Cloud AI Platform is a fully managed ML service offering:

AI Platform Notebooks: managed Jupyter environments

AI Platform Training: distributed training with built-in hyperparameter optimization

Continuous Evaluation: automatic model performance tracking

AI Platform Predictions: server-side model hosting and online inference

Kubeflow: end-to-end ML workflow orchestration on Kubernetes

AutoML Tables: no-code modeling for tabular data
36
What does Microsoft Azure offer for cloud-based machine learning?
Microsoft Azure provides Azure Machine Learning, a managed PaaS for ML that includes:

A visual designer for drag-and-drop, no-code model pipelines

Managed Jupyter notebooks that auto-provision compute instances

Built-in algorithm and hyperparameter support via the Azure ML workspace
37
Describe how AWS SageMaker supports ML on the cloud.
AWS SageMaker offers end-to-end ML by:

Training: packaging algorithms as Docker images, spinning up ML compute instances, injecting S3 data into the containers, running the training code, and storing model artifacts back to S3.

Deployment: creating model resources that tie S3 artifact paths to inference Docker images, then provisioning HTTPS endpoints on ML hosts.

Inference: client applications call the HTTPS endpoint for real-time predictions.
38
Why use Human‑in‑the‑Loop AI?
It provides transparency into the process, taps human judgment for edge cases, and removes the requirement to build a “perfect” fully automated model.
39
What is Human-in-the-Loop AI?
A design approach that selectively includes human participation in AI workflows—harnessing computer efficiency while leveraging human intelligence and reframing automation as a human–computer interaction problem.
40
What are the strengths of Human‑in‑the‑Loop AI?
Strengths: transparency into the process, access to human judgment for edge cases, and a relaxed need for a flawless, fully automated model.
41
What tools support Human‑in‑the‑Loop AI?
Amazon SageMaker Ground Truth for building accurately labeled datasets (using your own labelers or Amazon Mechanical Turk)

Amazon SageMaker Augmented AI for human review of model predictions (with your own reviewers or Mechanical Turk)
42
What are examples of unstructured data?
Vision (images/video), Voice (audio/speech), Language (text).
43
What ML cloud tools are specifically designed to handle unstructured data?
Vision: AWS Rekognition, Textract; Azure Form Recognizer, Computer Vision, Face API, Ink Recognizer, Video Indexer, Bing Visual & Video Search, Kinect SDK; Google Vision AI, Video AI; IBM Watson Visual Recognition.

Voice: AWS Polly, Transcribe, Translate, Lex; Google Cloud Speech-to-Text, Text-to-Speech; Azure Speech to Text, Text to Speech, Speech Translation; IBM Watson Speech to Text, Text to Speech.

Language: AWS Comprehend, Polly, Lex; Azure Language Understanding, QnA Maker, Translator Text, Immersive Reader; Google Natural Language, Translation; IBM Watson Natural Language Classifier, Watson Language Translator, Watson Knowledge Studio.
44
What is Spark? What are its goals?
Spark is a unified engine for distributed data processing. MLlib, its machine learning library, is designed for ease of use and scalable performance.
45
What is classification with respect to machine learning? What are a few use cases?
Classification is a supervised learning task that uses labeled data to assign inputs to discrete categories. Spark MLlib provides algorithms like SVMs, logistic regression, naïve Bayes, decision trees, and ensemble methods. Common applications include content categorization, friend/product recommendations, anomaly and fraud detection, and search‐result ranking.
46
What is clustering? What is it used for?
Clustering is an unsupervised learning technique that groups similar data points into clusters. Spark MLlib supports algorithms such as k‑means, Gaussian mixture models, power iteration clustering (PIC), latent Dirichlet allocation (LDA), and streaming k‑means to discover natural groupings in data.
47
What is the K‑Means algorithm used for? How does it work?
K-Means partitions a large dataset of points into K clusters by:

Map phase: assigning each point to its nearest centroid.

Reduce phase: recomputing each centroid as the mean of its assigned points.

Iteration: repeating assignment and centroid updates until convergence.

In Spark MLlib, you invoke KMeans.train(parsedData, numClusters, numIterations), as sketched below.
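
A minimal Spark MLlib sketch (assumes a SparkContext `sc`; the toy 2-D points are made up):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val parsedData = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
)).cache()

val numClusters   = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

clusters.clusterCenters.foreach(println)            // learned centroids
println(clusters.predict(Vectors.dense(8.9, 9.0)))  // cluster assignment for a new point
println(clusters.computeCost(parsedData))           // within-set sum of squared errors (WSSSE)
```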
48
Describe the Naïve Bayes algorithm and its input/output.
Naïve Bayes is a multiclass classification algorithm that assumes independence between features.

Input: an RDD of LabeledPoint and an optional smoothing parameter λ.

Output: a NaiveBayesModel ready for prediction.
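
A minimal sketch of training and using the model (assumes a SparkContext `sc`; the labeled points are made up):

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 0.0))
))

// lambda is the additive-smoothing parameter (λ); 1.0 is the default
val model = NaiveBayes.train(training, lambda = 1.0)

println(model.predict(Vectors.dense(1.5, 0.0)))  // predicted class label
```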
49
What is frequent pattern mining (FPM)? What are its algorithm, input/output, and use cases?
Frequent Pattern Mining finds recurrent itemsets in transactional data using the FP-growth algorithm:

1. Count item frequencies to identify frequent items.
2. Build an FP-tree (a compact suffix tree) encoding all transactions.
3. Extract frequent itemsets from the FP-tree.

Input: an RDD of transactions (arrays of items).

Output: an FPGrowthModel containing each frequent itemset and its frequency.

Use cases include market-basket analysis and association-rule mining.
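
A minimal FP-growth sketch (assumes a SparkContext `sc`; the market-basket transactions are made up):

```scala
import org.apache.spark.mllib.fpm.FPGrowth

val transactions = sc.parallelize(Seq(
  Array("bread", "milk"),
  Array("bread", "butter", "milk"),
  Array("butter", "jam")
))

val fpg = new FPGrowth()
  .setMinSupport(0.5)    // keep itemsets appearing in at least half the transactions
  .setNumPartitions(2)

val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", freq=" + itemset.freq)
}
```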