Week 12: Graph Processing and Machine Learning in the Cloud Flashcards

Goals and Objectives

Upon successful completion of this module, you will be able to:

Understand the basics of graph processing using Pregel, Giraph, and Spark GraphX.

Compare and contrast graph processing, machine learning, and deep learning.

Learn about cloud-based machine learning offerings.

Describe the machine learning and data science life cycles and workflows.

Explain human-in-the-loop AI.

Explain how machine learning works with examples from Mahout and Spark.

Explain K-means, Naive Bayes, and frequent pattern mining (FPM) as examples.

(49 cards)

1
Q

What distinguishes a graph database from a relational database?

A

A graph database stores data as an explicit network of nodes and relationships (each node directly “knows” its neighbors), offering constant‑time local traversals and no need for costly join operations. A relational database stores data in tables of rows and columns, and must use foreign keys plus join operators to navigate relationships.

2
Q

How does the graph model represent nodes, labels, and relationships?

A

Nodes are the fundamental entities (vertices).

Labels (or types) categorize nodes (e.g. :Person, :Page).

Relationships are first-class, directed edges linking two nodes, each with its own type (e.g. [:FRIEND_OF], [:LINKS_TO]).

3
Q

Why do graph databases eliminate the need for foreign keys and joins?

A

Because relationships are stored as direct pointers between nodes, traversals follow those pointers in constant time rather than performing set‐based joins via foreign keys.

4
Q

What are examples of real-world graphs at web, social, and brain scales?

A

Web graph: pages connected by hyperlinks

Social graph: people connected by friendships or “follows”

Brain graph: neurons connected by synapses

5
Q

How large can real-world graphs get?

A

Extremely large—on the order of billions of nodes and tens or hundreds of billions (even trillions) of edges. (E.g. the Web graph has ~1 billion+ pages and 50 billion+ links; the human brain ~10¹¹ neurons and 10¹⁵ synapses.)

6
Q

Why are many application domains a natural fit for graphs?

A

Because they involve richly interconnected data where the relationships themselves carry semantics—think networks, recommendations, fraud detection, knowledge bases—so a graph’s native representation maps directly to the problem.

7
Q

How does the concept of properties attached to nodes and edges enhance graph modeling?

A

By allowing arbitrary key‑value attributes on both nodes and relationships, you can record metadata (e.g. timestamps, weights, labels) directly in the graph, enabling flexible schemas and more expressive, property‐driven queries.

8
Q

What is the main difference between OLTP and OLAP graph systems?

A

OLTP graph systems support real‑time queries by traversing a small, localized subgraph (responding in milliseconds–seconds), whereas OLAP graph systems run batch analytics over the entire graph (taking minutes–hours).

9
Q

Why is local traversal critical for real-time OLTP graph queries?

A

Real‑time performance requires a local traversal—starting at a specific vertex (or small set) and touching only a limited number of connected vertices—so the system can answer in milliseconds to seconds.

10
Q

How does OLAP processing work by analyzing the entire graph, and what is its tradeoff?

A

OLAP processes every vertex and edge—often iteratively for recursive algorithms—using a Bulk Synchronous Parallel model (e.g., Pregel → Giraph → Spark GraphX → GraphFrames or Gremlin VertexProgram). The tradeoff is comprehensive analytics at the cost of real‑time speed: jobs can take minutes or hours on massive graphs.

11
Q

What are the four major communities in graph database ecosystems?

A

Semantic Web: RDF data model + SPARQL, declarative

Graph Databases (OLTP): Labeled Property Graphs with nodes, edges, properties; Gremlin (imperative) or Cypher (declarative)

Pattern‑Matching/Motif Finding: Traversal‑based queries to detect subgraph patterns

Big Data Graph Processing (OLAP): Bulk graph algorithms via BSP frameworks like Pregel, Giraph, GraphX, GraphFrames

12
Q

What features make Neo4j a popular graph database for OLTP workloads?

A

Neo4j is an open‑source, NoSQL, native graph database that provides an ACID‑compliant transactional backend and a declarative, graph‑optimized query language (Cypher), enabling fast, consistent real‑time traversals.

13
Q

How does the Cypher language describe graph traversals and pattern matching?

A

Nodes in () with optional :Label and {properties}

Relationships in [] with optional :TYPE and {properties}, directed via ->/<-

MATCH clauses specify graph patterns, WHERE filters, and RETURN projects results.

14
Q

How does Gremlin differ from Cypher in terms of imperative versus declarative programming?

A

Gremlin is primarily an imperative (step‑by‑step) traversal language—chaining functions to tell the engine how to walk the graph—though it also offers declarative constructs. Cypher, by contrast, is purely declarative, describing what pattern to match without prescribing execution steps.

15
Q

What is GraphSON, and how is it used to represent graph data in Gremlin-based systems?

A

GraphSON is the Gremlin standard JSON format for serializing graph elements—vertices, edges, and their single or multi‑valued properties—enabling language‑agnostic data exchange over the Gremlin protocol.

16
Q

How do AWS Neptune and Azure CosmosDB support scalable OLTP graph processing?

A

AWS Neptune: Fully managed, ACID-compliant service supporting both the Property Graph model (Gremlin over WebSocket) and RDF (SPARQL over REST), with data replicated six ways across three Availability Zones, automated backups, point-in-time recovery, and failover typically under 30 seconds.

Azure CosmosDB: Globally distributed Gremlin API with horizontal scaling via partitioning (vertices need a partition key; edges stored with their source), returning results in GraphSON format.

17
Q

What is the goal of the Semantic Web, and how does RDF support it?

A

The Semantic Web aims to link explicit “data” on the World Wide Web in a machine‑readable format—enabling targeted semantic search, automated agents, fraud detection, etc. RDF supports this by providing a triple‑based data model (subject‑predicate‑object) with multiple serializations (RDF/XML, Turtle, JSON‑LD) to express and interlink resources.

18
Q

How are RDF triples (subject-predicate-object) used to model knowledge?

A

Each RDF triple is a simple statement: the subject (resource) is connected via a predicate (property) to an object (value or another resource). Collections of these triples form a graph that models entities and their relationships in a structured, extensible way.

19
Q

What are the purposes of SPARQL and OWL in semantic web technology?

A

SPARQL is the query language for RDF: using PREFIX, SELECT, WHERE and graph‑pattern matching to retrieve and manipulate triples.

OWL (Web Ontology Language) provides an ontology layer—defining classes, properties, and axioms—to add formal semantics (e.g., class hierarchies, cardinalities) on top of RDF data.

20
Q

How does SPARQL syntax compare to Cypher and Gremlin for querying graph data?

A

SPARQL is a declarative, SQL‑like language for RDF graphs, using PREFIX declarations and SELECT…WHERE { … } triple patterns. Cypher is also declarative but uses ASCII‑art ()-[]->() graph patterns, while Gremlin is an imperative traversal API chaining steps to walk the graph.

21
Q

What competition exists among graph query languages like SPARQL, Cypher, and GQL?

A

There’s fierce competition among:

SPARQL for RDF graphs,

Cypher (and Gremlin) for property graphs,

GQL, a new ISO standard project (approved in 2019) rooted in Cypher and Oracle's PGQL.

No single language has yet become dominant for querying graph data.

22
Q

How do GraphFrames build on Spark DataFrames for easier graph computation?

A

They provide a DataFrame‑based API (instead of RDDs), with Scala and Python interfaces that tap into DataFrame optimizations, simpler syntax, SQL‑style queries, and built‑in support for motif finding—making graph ops more user‑friendly and performant.
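
For illustration only, a minimal sketch of the DataFrame-based API; the SparkSession `spark`, the toy vertices/edges, and the `relationship` column are assumptions, and the graphframes package must be on the classpath:

```scala
import org.graphframes.GraphFrame

// The vertex DataFrame needs an "id" column; the edge DataFrame needs "src" and "dst".
val v = spark.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol")
)).toDF("id", "name")

val e = spark.createDataFrame(Seq(
  ("a", "b", "FRIEND_OF"), ("b", "c", "FRIEND_OF")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(v, e)

// Results come back as DataFrames, so SQL-style operations apply directly.
g.inDegrees.show()
g.edges.filter("relationship = 'FRIEND_OF'").count()
```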

23
Q

How does GraphFrames differ from GraphX in terms of API, languages, and abstractions?

A

GraphFrames: DataFrames as the core abstraction; Scala, Java, and Python APIs; vertex IDs of any Catalyst type; arbitrary DataFrame columns for vertex/edge properties; operations return a GraphFrame or DataFrame.

GraphX: RDDs as core; Scala‑only; Long vertex IDs; VD/ED‑typed attributes; returns Graph[VD,ED] or RDDs.

24
Q

What are examples of graph algorithms that GraphFrames support?

A

PageRank, breadth‑first search (BFS), shortest paths, connected components, strongly connected components, label propagation, triangle counting, SVDPlusPlus—and wrappers around GraphX algorithms plus custom Pregel‑style routines.
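
As a hedged sketch of invoking two of these built-in algorithms, reusing the toy GraphFrame `g` from the construction sketch above (an assumption of this sketch, not part of the original card):

```scala
// PageRank: returns a new GraphFrame with a "pagerank" vertex column.
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.select("id", "pagerank").show()

// BFS: shortest unweighted path(s) between vertices matching two expressions.
val paths = g.bfs.fromExpr("id = 'a'").toExpr("id = 'c'").run()
paths.show()
```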

25
How is motif finding used for structural pattern matching in graphs?
By calling g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)"), GraphFrames locates subgraphs matching the pattern and lets you filter them via DataFrame expressions (e.g., motifs.filter("e1.delay > 20")).
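A self-contained sketch of that motif query; the airports and `delay` values are invented for illustration, and a SparkSession `spark` plus the graphframes package are assumed:

```scala
import org.graphframes.GraphFrame

val airports = spark.createDataFrame(Seq(
  ("SFO", "San Francisco"), ("ORD", "Chicago"), ("JFK", "New York")
)).toDF("id", "city")

val flights = spark.createDataFrame(Seq(
  ("SFO", "ORD", 35), ("ORD", "JFK", 5)
)).toDF("src", "dst", "delay")

val g = GraphFrame(airports, flights)

// Chains a -> b -> c with no edge closing the triangle back to a
val motifs = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")

// Each named element becomes a struct column, so DataFrame filters apply directly
motifs.filter("e1.delay > 20").select("a.id", "b.id", "c.id", "e1.delay").show()
```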
26
What methods exist for implementing algorithms in GraphFrames?
DataFrame & GraphFrame operations: motif finding combined with DataFrame joins, or the built-in algorithms.

Message passing: use aggregateMessages() to send and aggregate messages between vertices.

Pregel API: iterative BSP-style computation via graph.pregel for custom algorithms.
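
A hedged sketch of the message-passing route using aggregateMessages(); the tiny graph and its "age" column are made up, and a SparkSession `spark` plus the graphframes package are assumed:

```scala
import org.apache.spark.sql.functions.sum
import org.graphframes.GraphFrame
import org.graphframes.lib.AggregateMessages

val v = spark.createDataFrame(Seq(("a", 34), ("b", 36), ("c", 30))).toDF("id", "age")
val e = spark.createDataFrame(Seq(("a", "b"), ("b", "c"))).toDF("src", "dst")
val g = GraphFrame(v, e)

val AM = AggregateMessages
// Every edge sends its neighbour's age in both directions;
// the messages arriving at each vertex are then summed.
val summed = g.aggregateMessages
  .sendToDst(AM.src("age"))
  .sendToSrc(AM.dst("age"))
  .agg(sum(AM.msg).as("summedAges"))
summed.show()
```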
27
What kinds of machine learning and graph analytics tasks can GraphX support?
A broad range of graph‑parallel algorithms, including collaborative filtering (ALS, SGD), tensor factorization, structured prediction (loopy belief propagation, max‑product LPs, Gibbs sampling), semi‑supervised learning (graph SSL, CoEM), community detection, triangle counting, k‑core and k‑truss decomposition, PageRank (and personalized), shortest‑path, graph coloring, classification, and neural networks.
28
How can graphs be viewed as tables, with vertex and edge properties represented?
As two tables:

Vertex Property Table with columns (Id, V-properties…)

Edge Property Table with columns (SrcId, DstId, E-properties…)

This tabular view lets you use familiar table operations on graph data.
29
How is a graph constructed using RDDs for vertices and edges in Spark GraphX?
Create an RDD[(VertexId, V)] for the vertices and an RDD[Edge[E]] for the edges, then build the graph:

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Vertex RDD: (id, attributes)
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((1L, ("A", "student")), …))

// Edge RDD: Edge(srcId, dstId, attribute)
val relations: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(1L, 2L, "P8"), …))

// Assemble the property graph
val graph = Graph(users, relations)
```
30
What operators are available in GraphX to manipulate graphs?
Structural views: graph.vertices, graph.edges

Transformations: graph.reverse, graph.subgraph(pV, pE), graph.mapVertices(f), graph.mapEdges(f)

Joins: graph.joinVertices(otherTable)

Analytics: graph.connectedComponents(), plus all graph-parallel algorithms.
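
A brief sketch of composing a few of these operators, continuing the users/relations graph from the previous card (so not fully standalone):

```scala
val students = graph.subgraph(vpred = (_, attr) => attr._2 == "student")  // keep only student vertices
val renamed  = graph.mapVertices((_, attr) => attr._1.toUpperCase)        // transform vertex attributes
val reversed = graph.reverse                                              // flip every edge's direction
val cc       = graph.connectedComponents()                                // graph-parallel analytics
cc.vertices.collect().foreach(println)                                    // (vertexId, componentId) pairs
```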
31
How does the programming model of GraphX differ from GraphFrames in terms of abstraction and API style?
GraphX uses a low‑level RDD‑based API (Scala only), with explicit RDD[(VertexId, V)] and RDD[Edge[E]] and vertex‑centric programming. GraphFrames build on Spark DataFrames (Scala + Python), exposing a higher‑level DataFrame API with declarative motif‑finding and SQL‑style operations.
32
How do Data Mining and Machine Learning relate to Artificial Intelligence? What are some applications?
Data Mining and Machine Learning are subsets of Artificial Intelligence, drawing on fields like information retrieval, statistics, biology, linear algebra, and marketing to analyze large datasets, extract knowledge, and predict future trends. Common applications include recommending friends/dates/products; classifying content; finding similar items; uncovering patterns and associations; summarizing text corpora; detecting anomalies or fraud; and ranking search results.
33
What is the OSEMN Data Science Model? What happens in each of its stages? What cloud tools can you use in each stage?
Obtain: Gather data from cloud sources (AWS Open Data Registry, Azure Open Datasets, Google Public Datasets) via CLI, REST APIs, or Jupyter notebooks. Tools: cloud storage, SQL/NoSQL databases (e.g., MongoDB), big data formats (Parquet, HDFS, HDF), Pig, Hive.

Scrub: Clean and wrangle data (filter records, merge files, extract/replace values, split or merge columns). Tools: Jupyter notebooks; Python/R for single-machine work; Spark or MapReduce for big data.

Explore: Inspect the data with descriptive statistics, correlation and feature selection, and visualize distributions. Tools: Jupyter; Python/R libraries (NumPy, Pandas, Matplotlib, SciPy); Spark or EMR for large datasets.

Model: Perform feature engineering (e.g., dimensionality reduction) and train/test models (regression, classification, clustering, decision trees, random forests, XGBoost, deep learning). Tools: Scikit-Learn or H2O for small data; Spark MLlib, Mahout, or Google Cloud Dataproc for big data; AutoML/hyperparameter tuning via Azure ML, Google AutoML, AWS SageMaker Autopilot, H2O Driverless AI, or DataRobot.

iNterpret: Present findings and model results to stakeholders via visualizations. Tools: Matplotlib, Tableau, D3.js, Seaborn.
34
What is a hyperparameter? What is AutoML? How do they relate to one another?
Hyperparameters are settings that govern the training process of a model (e.g., number of iterations, neural-network topology/size, learning rate). AutoML refers to automated strategies, such as grid search, random search, and gradient-based methods, that explore the hyperparameter space for you, alleviating the very time-consuming manual tuning process.
35
What is the Google Cloud AI Platform? What tools does it provide?
Google Cloud AI Platform is a fully managed ML service offering:

AI Platform Notebooks: managed Jupyter environments

AI Platform Training: distributed training with built-in hyperparameter optimization

Continuous Evaluation: automatic model performance tracking

AI Platform Predictions: server-side model hosting and online inference

Kubeflow: end-to-end ML workflow orchestration on Kubernetes

AutoML Tables: no-code modeling for tabular data
36
What does Microsoft Azure offer for cloud-based machine learning?
Microsoft Azure provides Azure Machine Learning, a managed PaaS for ML that includes:

A visual designer for drag-and-drop, no-code model pipelines

Managed Jupyter notebooks that auto-provision compute instances

Built-in algorithm and hyperparameter support via the Azure ML workspace
37
Describe how AWS SageMaker supports ML on the cloud.
AWS SageMaker offers end-to-end ML by:

Training: packaging algorithms as Docker images, spinning up ML compute instances, injecting S3 data into the containers, running the training code, and storing model artifacts back to S3.

Deployment: creating model resources that tie S3 artifact paths to inference Docker images, then provisioning HTTPS endpoints on ML hosts.

Inference: client applications call the HTTPS endpoint for real-time predictions.
38
Why use Human‑in‑the‑Loop AI?
It provides transparency into the process, taps human judgment for edge cases, and removes the requirement to build a “perfect” fully automated model.
39
What is Human-in-the-Loop AI?
A design approach that selectively includes human participation in AI workflows—harnessing computer efficiency while leveraging human intelligence and reframing automation as a human–computer interaction problem.
40
What are the strengths of Human‑in‑the‑Loop AI?
Strengths: transparency into the process, access to human judgment for edge cases, and a relaxed need for a flawless, fully automated model.
41
What tools support Human‑in‑the‑Loop AI?
Amazon SageMaker Ground Truth for building accurately labeled datasets (using your own labelers or Amazon Mechanical Turk)

Amazon SageMaker Augmented AI for human review of model predictions (with your own reviewers or Mechanical Turk)
42
What are examples of unstructured data?
Vision (images/video), Voice (audio/speech), Language (text).
43
What ML cloud tools are specifically designed to handle unstructured data?
Vision: AWS Rekognition, Textract; Azure Form Recognizer, Computer Vision, Face API, Ink Recognizer, Video Indexer, Bing Visual & Video Search, Kinect SDK; Google Vision AI, Video AI; IBM Watson Visual Recognition.

Voice: AWS Polly, Transcribe, Translate, Lex; Google Cloud Speech-to-Text, Text-to-Speech; Azure Speech to Text, Text to Speech, Speech Translation; IBM Watson Speech to Text, Text to Speech.

Language: AWS Comprehend, Polly, Lex; Azure Language Understanding, QnA Maker, Translator Text, Immersive Reader; Google Natural Language, Translation; IBM Watson Natural Language Classifier, Watson Language Translator, Watson Knowledge Studio.
44
What is Spark? What are its goals?
Spark is a unified engine for distributed data processing. MLlib, its machine learning library, is designed for ease of use and scalable performance.
45
What is classification with respect to machine learning? What are a few use cases?
Classification is a supervised learning task that uses labeled data to assign inputs to discrete categories. Spark MLlib provides algorithms like SVMs, logistic regression, naïve Bayes, decision trees, and ensemble methods. Common applications include content categorization, friend/product recommendations, anomaly and fraud detection, and search‐result ranking.
46
What is clustering? What is it used for?
Clustering is an unsupervised learning technique that groups similar data points into clusters. Spark MLlib supports algorithms such as k‑means, Gaussian mixture models, power iteration clustering (PIC), latent Dirichlet allocation (LDA), and streaming k‑means to discover natural groupings in data.
47
What is the K‑Means algorithm used for? How does it work?
K-Means partitions a large dataset of points into K clusters by:

Map phase: assigning each point to its nearest centroid.

Reduce phase: recomputing each centroid as the mean of its assigned points.

Iteration: repeating assignment and centroid updates until convergence.

In Spark MLlib, you invoke KMeans.train(parsedData, numClusters, numIterations), as sketched below.
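
A minimal Spark MLlib sketch (assumes a SparkContext `sc`; the toy 2-D points are made up):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val parsedData = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
)).cache()

val numClusters   = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

clusters.clusterCenters.foreach(println)            // learned centroids
println(clusters.predict(Vectors.dense(8.9, 9.0)))  // cluster assignment for a new point
println(clusters.computeCost(parsedData))           // within-set sum of squared errors (WSSSE)
```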
48
Describe the Naïve Bayes algorithm and its input/output.
Naïve Bayes is a multiclass classification algorithm that assumes independence between features.

Input: an RDD of LabeledPoint and an optional smoothing parameter λ.

Output: a NaiveBayesModel ready for prediction.
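
A minimal sketch of training and using the model (assumes a SparkContext `sc`; the labeled points are made up):

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 0.0))
))

// lambda is the additive-smoothing parameter (λ); 1.0 is the default
val model = NaiveBayes.train(training, lambda = 1.0)

println(model.predict(Vectors.dense(1.5, 0.0)))  // predicted class label
```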
49
What is frequent pattern mining (FPM)? What are its algorithm, input/output, and use cases?
Frequent Pattern Mining finds recurrent itemsets in transactional data using the FP-growth algorithm:

1. Count item frequencies to identify frequent items.
2. Build an FP-tree (a compact suffix tree) encoding all transactions.
3. Extract frequent itemsets from the FP-tree.

Input: an RDD of transactions (arrays of items).

Output: an FPGrowthModel containing each frequent itemset and its frequency.

Use cases include market-basket analysis and association-rule mining.
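
A minimal FP-growth sketch (assumes a SparkContext `sc`; the market-basket transactions are made up):

```scala
import org.apache.spark.mllib.fpm.FPGrowth

val transactions = sc.parallelize(Seq(
  Array("bread", "milk"),
  Array("bread", "butter", "milk"),
  Array("butter", "jam")
))

val fpg = new FPGrowth()
  .setMinSupport(0.5)    // keep itemsets appearing in at least half the transactions
  .setNumPartitions(2)

val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", freq=" + itemset.freq)
}
```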