Week 12: Graph Processing and Machine Learning in the Cloud Flashcards
Goals and Objectives: Upon successful completion of this module, you will be able to: understand the basics of graph processing using Pregel, Giraph, and Spark GraphX; compare and contrast graph processing, machine learning, and deep learning; learn about cloud-based machine learning offerings, the machine learning and data science life cycles and workflows, and human-in-the-loop AI; explain how machine learning works with examples from Mahout and Spark; explain K-means, Naive Bayes, and fpm as e (49 cards)
What distinguishes a graph database from a relational database?
A graph database stores data as an explicit network of nodes and relationships (each node directly “knows” its neighbors), offering constant‑time local traversals and no need for costly join operations. A relational database stores data in tables of rows and columns, and must use foreign keys plus join operators to navigate relationships.
How does the graph model represent nodes, labels, and relationships?
Nodes are the fundamental entities (vertices).
Labels (or types) categorize nodes (e.g. :Person, :Page).
Relationships are first-class, directed edges linking two nodes, each with its own type (e.g. [:FRIEND_OF], [:LINKS_TO]).
Why do graph databases eliminate the need for foreign keys and joins?
Because relationships are stored as direct pointers between nodes, traversals follow those pointers in constant time rather than performing set‐based joins via foreign keys.
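The contrast above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the names people, friend_of, and Node are hypothetical, and a real relational engine would normally use an index rather than the full table scan shown here.

```python
# Relational style: relationships live in a separate table of foreign-key pairs.
people = {1: "Ann", 2: "Bob", 3: "Cy"}      # id -> name
friend_of = [(1, 2), (1, 3), (2, 3)]        # (person_id, friend_id) rows

def friends_via_join(person_id):
    # Filter the relationship table, then look up each matching row in people.
    return [people[dst] for src, dst in friend_of if src == person_id]

# Graph style: each node holds direct references to its neighbors.
class Node:
    def __init__(self, name):
        self.name = name
        self.friends = []   # direct pointers to neighbor Node objects

ann, bob, cy = Node("Ann"), Node("Bob"), Node("Cy")
ann.friends += [bob, cy]
bob.friends.append(cy)

def friends_via_pointers(node):
    # Work depends on this node's degree, not on the total number of edges.
    return [n.name for n in node.friends]

print(friends_via_join(1))         # ['Bob', 'Cy']
print(friends_via_pointers(ann))   # ['Bob', 'Cy']
```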
What are examples of real-world graphs at web, social, and brain scales?
Web graph: pages connected by hyperlinks
Social graph: people connected by friendships or “follows”
Brain graph: neurons connected by synapses
How large can real-world graphs get?
Extremely large—on the order of billions of nodes and tens or hundreds of billions (even trillions) of edges. (E.g. the Web graph has ~1 billion+ pages and 50 billion+ links; the human brain ~10¹¹ neurons and 10¹⁵ synapses.)
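As a rough back-of-the-envelope check on what this scale means for storage, the sketch below estimates the size of a bare edge list for the Web graph figure quoted above. The 8-byte vertex ID is an assumption made for illustration; real systems compress or encode edges differently.

```python
EDGES = 50_000_000_000   # 50 billion links, the figure quoted above
BYTES_PER_ID = 8         # assumption: 64-bit vertex IDs

# One (source, destination) pair per edge, with no properties or indexes.
edge_list_bytes = EDGES * 2 * BYTES_PER_ID
edge_list_gb = edge_list_bytes / 10**9

print(edge_list_gb)      # 800.0 (gigabytes)
```

Even this minimal representation exceeds the memory of a typical single machine, which is one reason such graphs are usually processed on clusters.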
Why are many application domains a natural fit for graphs?
Because they involve richly interconnected data where the relationships themselves carry semantics—think networks, recommendations, fraud detection, knowledge bases—so a graph’s native representation maps directly to the problem.
How does the concept of properties attached to nodes and edges enhance graph modeling?
By allowing arbitrary key‑value attributes on both nodes and relationships, you can record metadata (e.g. timestamps, weights, labels) directly in the graph, enabling flexible schemas and more expressive, property‐driven queries.
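A minimal sketch of a property graph, assuming hypothetical Vertex and Edge classes (not any particular database's API), shows how key-value attributes on edges enable property-driven queries:

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    id: str
    label: str
    props: dict = field(default_factory=dict)   # arbitrary key-value attributes

@dataclass
class Edge:
    src: str
    dst: str
    type: str
    props: dict = field(default_factory=dict)   # edges carry properties too

vertices = [
    Vertex("p1", "Person", {"name": "Ann"}),
    Vertex("p2", "Person", {"name": "Bob"}),
    Vertex("w1", "Page", {"url": "example.org"}),
]
edges = [
    Edge("p1", "p2", "FRIEND_OF", {"since": 2019, "weight": 0.9}),
    Edge("p2", "p1", "FRIEND_OF", {"since": 2021, "weight": 0.4}),
    Edge("p1", "w1", "LINKS_TO"),   # no properties: the schema is flexible
]

# Property-driven query: keep only strong friendships.
strong = [(e.src, e.dst) for e in edges
          if e.type == "FRIEND_OF" and e.props.get("weight", 0) > 0.5]
print(strong)   # [('p1', 'p2')]
```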
What is the main difference between OLTP and OLAP graph systems?
OLTP graph systems support real‑time queries by traversing a small, localized subgraph (responding in milliseconds–seconds), whereas OLAP graph systems run batch analytics over the entire graph (taking minutes–hours).
Why is local traversal critical for real-time OLTP graph queries?
Real‑time performance requires a local traversal—starting at a specific vertex (or small set) and touching only a limited number of connected vertices—so the system can answer in milliseconds to seconds.
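A local traversal can be sketched as a breadth-first search with a hop limit. The function name k_hop_neighbors and the adjacency-dictionary layout are assumptions for illustration only.

```python
from collections import deque

def k_hop_neighbors(adj, start, max_hops):
    """Return the vertices reachable from start within max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == max_hops:
            continue                  # do not expand past the hop limit
        for neighbor in adj.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen

adj = {"ann": ["bob", "cy"], "bob": ["dee"], "cy": [], "dee": ["eve"], "eve": []}
print(sorted(k_hop_neighbors(adj, "ann", 2)))   # ['bob', 'cy', 'dee']
```

Only the neighborhood of the start vertex is touched, so the cost is bounded by the local subgraph rather than by the size of the whole graph.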
How does OLAP processing work by analyzing the entire graph, and what is its tradeoff?
OLAP processes every vertex and edge—often iteratively for recursive algorithms—using a Bulk Synchronous Parallel model (e.g., Pregel → Giraph → Spark GraphX → GraphFrames or Gremlin VertexProgram). The tradeoff is comprehensive analytics at the cost of real‑time speed: jobs can take minutes or hours on massive graphs.
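The superstep idea behind Bulk Synchronous Parallel processing can be sketched in plain, single-process Python. This is a toy illustration, not the Pregel, Giraph, or GraphX API: each vertex adopts the smallest vertex ID it hears about, which labels connected components, and a vertex that does not change sends nothing (it "votes to halt").

```python
def bsp_min_label(adj):
    """Pregel-style sketch: each vertex converges to the smallest ID in its component."""
    label = {v: v for v in adj}
    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in adj}
    for v in adj:
        for n in adj[v]:
            inbox[n].append(label[v])
    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in adj}
        for v, msgs in inbox.items():
            if msgs and min(msgs) < label[v]:
                label[v] = min(msgs)           # update local state
                for n in adj[v]:
                    outbox[n].append(label[v]) # notify neighbors
            # otherwise the vertex stays silent (votes to halt)
        inbox = outbox                         # synchronization barrier
    return label

# Undirected graph with two components: {1, 2, 3} and {4, 5}.
adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels = bsp_min_label(adj)
print(labels)   # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

Every vertex is visited in every active superstep, which is why this style suits batch analytics rather than millisecond queries.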
What are the four major communities in graph database ecosystems?
Semantic Web: RDF data model + SPARQL, declarative
Graph Databases (OLTP): Labeled Property Graphs with nodes, edges, properties; Gremlin (imperative) or Cypher (declarative)
Pattern‑Matching/Motif Finding: Traversal‑based queries to detect subgraph patterns
Big Data Graph Processing (OLAP): Bulk graph algorithms via BSP frameworks like Pregel, Giraph, GraphX, GraphFrames
What features make Neo4j a popular graph database for OLTP workloads?
Neo4j is an open‑source, NoSQL, native graph database that provides an ACID‑compliant transactional backend and a declarative, graph‑optimized query language (Cypher), enabling fast, consistent real‑time traversals.
How does the Cypher language describe graph traversals and pattern matching?
Nodes in () with optional :Label and {properties}
Relationships in [] with optional :TYPE and {properties}, directed via ->/<-
MATCH clauses specify graph patterns, WHERE filters, and RETURN projects results.
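To make the clause roles concrete, the sketch below pairs an illustrative Cypher query (shown in the comment) with a plain-Python emulation over hypothetical in-memory data. It is not how a Cypher engine executes queries; it only mirrors what each part of the pattern means.

```python
# Cypher being emulated (illustrative):
#   MATCH (a:Person)-[:FRIEND_OF]->(b:Person)
#   WHERE a.name = 'Ann'
#   RETURN b.name
nodes = {
    1: {"label": "Person", "name": "Ann"},
    2: {"label": "Person", "name": "Bob"},
    3: {"label": "Page", "name": "Home"},
}
rels = [(1, "FRIEND_OF", 2), (1, "LINKS_TO", 3), (2, "FRIEND_OF", 1)]

result = [
    nodes[b]["name"]                       # RETURN b.name
    for a, rel_type, b in rels             # (a)-[r]->(b): directed relationship
    if rel_type == "FRIEND_OF"             # [:FRIEND_OF]
    and nodes[a]["label"] == "Person"      # (a:Person)
    and nodes[b]["label"] == "Person"      # (b:Person)
    and nodes[a]["name"] == "Ann"          # WHERE a.name = 'Ann'
]
print(result)   # ['Bob']
```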
How does Gremlin differ from Cypher in terms of imperative versus declarative programming?
Gremlin is primarily an imperative (step‑by‑step) traversal language—chaining functions to tell the engine how to walk the graph—though it also offers declarative constructs. Cypher, by contrast, is purely declarative, describing what pattern to match without prescribing execution steps.
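The step-chaining style can be illustrated with a toy traversal class. The class Traversal below is hypothetical and is not the TinkerPop API; it only imitates the shape of a Gremlin traversal such as g.V().has('name', 'Ann').out('FRIEND_OF').values('name'), where each step transforms the current set of vertices.

```python
class Traversal:
    """Toy step-chaining traversal; NOT the real Gremlin/TinkerPop API."""

    def __init__(self, nodes, edges, current=None):
        self.nodes, self.edges = nodes, edges
        # Start from all vertices unless a current position is given.
        self.current = list(nodes) if current is None else current

    def has(self, key, value):
        keep = [v for v in self.current if self.nodes[v].get(key) == value]
        return Traversal(self.nodes, self.edges, keep)

    def out(self, edge_label):
        nxt = [dst for v in self.current
               for src, label, dst in self.edges
               if src == v and label == edge_label]
        return Traversal(self.nodes, self.edges, nxt)

    def values(self, key):
        return [self.nodes[v][key] for v in self.current]

nodes = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
edges = [(1, "FRIEND_OF", 2), (2, "FRIEND_OF", 3)]

# Friends of friends of Ann, spelled out one step at a time.
fof = (Traversal(nodes, edges)
       .has("name", "Ann")
       .out("FRIEND_OF")
       .out("FRIEND_OF")
       .values("name"))
print(fof)   # ['Cy']
```

The caller dictates the order of steps, which is the imperative flavor the answer describes; a declarative query would state only the pattern.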
What is GraphSON, and how is it used to represent graph data in Gremlin-based systems?
GraphSON is the Gremlin standard JSON format for serializing graph elements—vertices, edges, and their single or multi‑valued properties—enabling language‑agnostic data exchange over the Gremlin protocol.
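The sketch below shows the general idea of serializing a vertex and an edge to JSON. The field layout is a simplified, GraphSON-inspired assumption for illustration and should not be read as the exact GraphSON specification, which varies by version.

```python
import json

# Assumed, simplified shape: properties map to lists so a key can hold
# one value (single-valued) or several (multi-valued).
vertex = {
    "id": "p1",
    "label": "Person",
    "type": "vertex",
    "properties": {
        "name": [{"value": "Ann"}],
        "email": [{"value": "a@x.org"}, {"value": "ann@y.org"}],
    },
}
edge = {
    "id": "e1",
    "label": "FRIEND_OF",
    "type": "edge",
    "outV": "p1",   # source vertex id
    "inV": "p2",    # target vertex id
    "properties": {"since": 2019},
}

wire = json.dumps([vertex, edge])   # text that any language can parse
decoded = json.loads(wire)
emails = [p["value"] for p in decoded[0]["properties"]["email"]]
print(emails)   # ['a@x.org', 'ann@y.org']
```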
How do AWS Neptune and Azure CosmosDB support scalable OLTP graph processing?
AWS Neptune: Fully managed, ACID-compliant service supporting both Property Graph (Gremlin over WebSocket) and RDF (SPARQL over an HTTP REST endpoint), with six copies of the data replicated across three Availability Zones, automated backups, point-in-time recovery, and failover that typically completes in under 30 seconds.
Azure CosmosDB: Globally distributed Gremlin API with horizontal scaling via partitioning (vertices need a partition key; edges stored with their source), returning results in GraphSON format.
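The partitioning rule described for CosmosDB (each vertex has a partition key, and each edge is stored with its source vertex) can be sketched as follows. The partition count, the CRC32 hash, and the sample keys are assumptions for illustration; they are not how either service is actually implemented.

```python
import zlib

NUM_PARTITIONS = 4   # assumption for illustration

def partition_of(partition_key):
    # Stable hash so the same key always maps to the same partition.
    return zlib.crc32(partition_key.encode("utf-8")) % NUM_PARTITIONS

vertices = {"ann": "us", "bob": "eu", "cy": "us"}   # vertex id -> partition key value
edges = [("ann", "bob"), ("bob", "cy"), ("cy", "ann")]

partitions = {p: {"vertices": [], "edges": []} for p in range(NUM_PARTITIONS)}
for vertex_id, key in vertices.items():
    partitions[partition_of(key)]["vertices"].append(vertex_id)
for src, dst in edges:
    # An edge is co-located with its source vertex.
    partitions[partition_of(vertices[src])]["edges"].append((src, dst))

for p, content in partitions.items():
    print(p, content)
```

Following an outgoing edge from a vertex therefore needs no cross-partition lookup, although reaching the target vertex may.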
What is the goal of the Semantic Web, and how does RDF support it?
The Semantic Web aims to publish and link data on the World Wide Web in an explicit, machine-readable format, enabling targeted semantic search, automated agents, fraud detection, etc. RDF supports this by providing a triple-based data model (subject-predicate-object) with multiple serializations (RDF/XML, Turtle, JSON-LD) to express and interlink resources.
How are RDF triples (subject-predicate-object) used to model knowledge?
Each RDF triple is a simple statement: the subject (resource) is connected via a predicate (property) to an object (value or another resource). Collections of these triples form a graph that models entities and their relationships in a structured, extensible way.
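A collection of triples can be sketched as a set of Python tuples. The ex: names and the match helper are hypothetical, and real RDF uses full IRIs rather than these shorthand strings.

```python
# Each tuple is one statement: (subject, predicate, object).
triples = {
    ("ex:Ann", "rdf:type", "ex:Person"),
    ("ex:Ann", "ex:knows", "ex:Bob"),
    ("ex:Ann", "ex:age", 34),              # object can be a literal value
    ("ex:Bob", "rdf:type", "ex:Person"),   # or another resource
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        (t for t in triples
         if (s is None or t[0] == s)
         and (p is None or t[1] == p)
         and (o is None or t[2] == o)),
        key=str,
    )

print(match(s="ex:Ann", p="ex:knows"))   # [('ex:Ann', 'ex:knows', 'ex:Bob')]
print(len(match(p="rdf:type")))          # 2
```

Adding knowledge means adding more tuples; no table schema has to change, which is the extensibility the answer refers to.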
What are the purposes of SPARQL and OWL in semantic web technology?
SPARQL is the query language for RDF: using PREFIX, SELECT, WHERE and graph‑pattern matching to retrieve and manipulate triples.
OWL (Web Ontology Language) provides an ontology layer—defining classes, properties, and axioms—to add formal semantics (e.g., class hierarchies, cardinalities) on top of RDF data.
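SPARQL's graph-pattern matching can be illustrated with a small binding solver. The query in the comment is an illustrative SPARQL query; the solve function is a hypothetical toy, not a SPARQL engine, and it treats any term starting with "?" as a variable.

```python
# SPARQL being emulated (illustrative):
#   PREFIX ex: <http://example.org/>
#   SELECT ?friend WHERE { ex:Ann ex:knows ?friend . ?friend a ex:Person . }

def solve(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every triple pattern."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if isinstance(term, str) and term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from solve(rest, triples, candidate)

triples = [
    ("ex:Ann", "ex:knows", "ex:Bob"),
    ("ex:Ann", "ex:knows", "ex:Rex"),
    ("ex:Bob", "rdf:type", "ex:Person"),
    ("ex:Rex", "rdf:type", "ex:Dog"),
]
query = [("ex:Ann", "ex:knows", "?friend"), ("?friend", "rdf:type", "ex:Person")]
friends = [b["?friend"] for b in solve(query, triples)]
print(friends)   # ['ex:Bob']
```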
How does SPARQL syntax compare to Cypher and Gremlin for querying graph data?
SPARQL is a declarative, SQL‑like language for RDF graphs, using PREFIX declarations and SELECT…WHERE { … } triple patterns. Cypher is also declarative but uses ASCII‑art ()-[]->() graph patterns, while Gremlin is an imperative traversal API chaining steps to walk the graph.
What competition exists among graph query languages like SPARQL, Cypher, and GQL?
There’s fierce competition among:
SPARQL for RDF graphs,
Cypher (and Gremlin) for property graphs,
GQL, a new ISO standard project (approved by vote in 2019) rooted in Cypher and Oracle's PGQL.
Multiple languages vie for dominance in querying graph data.
How do GraphFrames build on Spark DataFrames for easier graph computation?
They provide a DataFrame‑based API (instead of RDDs), with Scala and Python interfaces that tap into DataFrame optimizations, simpler syntax, SQL‑style queries, and built‑in support for motif finding—making graph ops more user‑friendly and performant.
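Motif finding means searching for a small structural pattern across the graph. GraphFrames expresses motifs as strings such as "(a)-[e]->(b); (b)-[e2]->(a)"; the sketch below emulates that one pattern in plain Python rather than calling GraphFrames, so it runs without Spark. Unlike a motif query, it reports each reciprocal pair once instead of in both orientations.

```python
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "c")]
edge_set = set(edges)

# Pattern: x points to y AND y points back to x (a reciprocal pair).
# The x < y check keeps one orientation of each pair.
reciprocal = sorted(
    (x, y) for x, y in edges if (y, x) in edge_set and x < y
)
print(reciprocal)   # [('a', 'b'), ('c', 'd')]
```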
How does GraphFrames differ from GraphX in terms of API, languages, and abstractions?
GraphFrames: DataFrames as core abstraction; Scala, Java, and Python APIs; vertex IDs of any Catalyst type; arbitrary DataFrame columns for vertex/edge properties; operations return GraphFrame/DataFrame.
GraphX: RDDs as core; Scala‑only; Long vertex IDs; VD/ED‑typed attributes; returns Graph[VD,ED] or RDDs.
What are examples of graph algorithms that GraphFrames support?
PageRank, breadth-first search (BFS), shortest paths, connected components, strongly connected components, label propagation, triangle counting, and SVD++ (SVDPlusPlus). Several of these are wrappers around GraphX algorithms, and custom Pregel-style routines are also supported.
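As an example of what one of these algorithms computes, here is a minimal power-iteration PageRank in plain Python. It is a sketch, not the GraphFrames implementation; the damping factor of 0.85 and the fixed iteration count are conventional assumptions, and the function assumes every vertex has at least one out-link.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank; assumes no vertex has an empty out-link list."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iterations):
        # Every vertex gets a baseline share from the random-jump term.
        new_rank = {v: (1.0 - damping) / n for v in out_links}
        for v, targets in out_links.items():
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share   # pass rank along each out-link
        rank = new_rank
    return rank

links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
top = max(ranks, key=ranks.get)
print(top)   # c
```

Vertex c ranks highest because three vertices link to it, while a comes second because it receives all of c's outgoing rank.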