Database Specialty - Neptune Flashcards

1
Q

Amazon Neptune – Overview

A
  • Fully managed graph database service (non-relational)
  • Relationships are first-class citizens
  • Can quickly navigate relationships and retrieve complex relations between highly
    connected datasets
  • Can query billions of relationships with millisecond latency
  • ACID compliant with immediate consistency
  • Supports transaction semantics for highly concurrent OLTP workloads (ACID transactions)
  • Supported graph query languages – Apache TinkerPop Gremlin and RDF/SPARQL
  • Supports up to 15 low-latency read replicas (Multi-AZ)
  • Use cases:
    • Social graph / Knowledge graph
    • Fraud detection
    • Real-time big data mining
    • Customer interests and recommendations (Recommendation engines)
2
Q

Graph Database

A
  • Models relationships between data
    • e.g. Subject / predicate / object / graph (quad)
    • Joe likes pizza
    • Sarah is friends with Joe
    • Sarah likes pizza too
    • Joe is a student and lives in London
    • Lets you ask questions like “identify Londoners who
      like pizza” or “identify friends of Londoners who like
      pizza”
  • Uses nodes (vertices) and edges (actions) to
    describe the data and relationships between
    them
  • DB stores – person / action / object (and a graph
    ID or edge ID)
  • Can filter or discover data based on strength,
    weight, or quality of relationships
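  A minimal sketch of the example above stored as RDF triples and queried over the SPARQL endpoint
  (the ex: namespace and <cluster_endpoint> are placeholders):

    # Insert the example facts as triples (SPARQL UPDATE)
    curl -X POST https://<cluster_endpoint>:8182/sparql --data-urlencode 'update=
      PREFIX ex: <http://example.org/>
      INSERT DATA {
        ex:Joe   ex:likes    ex:Pizza .
        ex:Sarah ex:friendOf ex:Joe .
        ex:Sarah ex:likes    ex:Pizza .
        ex:Joe   ex:livesIn  ex:London .
      }'

    # "Identify Londoners who like pizza"
    curl -X POST https://<cluster_endpoint>:8182/sparql --data-urlencode 'query=
      PREFIX ex: <http://example.org/>
      SELECT ?person WHERE {
        ?person ex:livesIn ex:London .
        ?person ex:likes   ex:Pizza .
      }'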
3
Q

Graph query languages

A
  • Neptune supports two popular modeling frameworks – Apache
    TinkerPop and RDF/SPARQL
  • TinkerPop uses Gremlin traversal language
  • RDF (W3C standard) uses SPARQL
  • SPARQL works well with multiple data sources and has a large variety of
    datasets available
  • We can use Gremlin or SPARQL to load data into Neptune and
    then to query it
  • You can store both Gremlin and SPARQL graph data on the same
    Neptune cluster
  • The two sets of data are stored separately on the cluster
  • Graph data inserted using one query language can only be queried
    with that query language (and not with the other)
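  A sketch of querying each store over its own endpoint (placeholder endpoint; the data is
  assumed to have been loaded separately with each language):

    # Gremlin sees only the property-graph data loaded via Gremlin
    curl -X POST https://<cluster_endpoint>:8182/gremlin \
      -d '{"gremlin": "g.V().hasLabel(\"person\").count()"}'

    # SPARQL sees only the RDF data loaded via SPARQL
    curl -X POST https://<cluster_endpoint>:8182/sparql \
      --data-urlencode 'query=SELECT (COUNT(?s) AS ?people) WHERE { ?s a <http://example.org/Person> }'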
4
Q

Neptune Architecture

A
  • 6 copies of your data across 3 AZs (distributed design)
    • Lock-free optimistic algorithm (quorum model)
    • 4 copies out of 6 needed for writes (4/6 write quorum - data
      considered durable when at least 4/6 copies acknowledge the write)
    • 3 copies out of 6 needed for reads (3/6 read quorum)
    • Since 4 + 3 > 6, any read quorum overlaps every write quorum, so
      reads always see the latest durable write
    • Self-healing with peer-to-peer replication; storage is striped across
      100s of volumes
  • One Neptune Instance takes writes (master)
  • Compute nodes on replicas do not need to write/replicate
    (=improved read performance)
  • Log-structured distributed storage layer – passes incremental
    log records from compute to storage layer (=faster)
  • Master + up to 15 Read Replicas serve reads
  • Data is continuously backed up to S3 in real time, using
    storage nodes (compute node performance is unaffected)
5
Q

Neptune Cluster

A
  • Loader endpoint – to load the data into Neptune (say, from S3)
    • e.g. https://<cluster_endpoint>:8182/loader
  • Gremlin endpoint – for Gremlin queries
    • e.g. https://<cluster_endpoint>:8182/gremlin
  • SPARQL endpoint – for SPARQL queries
    • e.g. https://<cluster_endpoint>:8182/sparql
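  For illustration, all three share the same cluster DNS name and port (8182) and differ only
  by path; the endpoint value below is a made-up placeholder:

    NEPTUNE=mycluster.cluster-abc123.us-east-1.neptune.amazonaws.com

    curl -X POST https://$NEPTUNE:8182/gremlin \
      -d '{"gremlin": "g.V().limit(1)"}'
    curl -X POST https://$NEPTUNE:8182/sparql \
      --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 1'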
6
Q

Bulk loading data into Neptune

A
  • Use the loader endpoint (HTTP POST to the loader endpoint)
    • e.g.
      curl -X POST -H 'Content-Type: application/json' \
        https://<cluster_endpoint>:8182/loader -d '{
          "source": "s3://bucket_name/key_name",
          "format": "csv",
          "iamRoleArn": "arn:aws:iam::<account_id>:role/<neptune_load_role>",
          "region": "<region>"
        }'
  • S3 data can be accessed using an S3 VPC endpoint (allows access to
    S3 resources from your VPC)
  • Neptune cluster must assume an IAM role with S3 read access
  • S3 VPC endpoint can be created using the VPC management console
  • S3 bucket must be in the same region as the Neptune cluster
  • Load data formats
    • csv (for Gremlin), ntriples / nquads / rdfxml / turtle (for SPARQL)
  • All files must be UTF-8 encoded
  • Multiple files can be loaded in a single job
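  A sketch of the supporting setup and of checking a load job (all identifiers are
  placeholders):

    # Gateway VPC endpoint so the cluster can reach S3
    aws ec2 create-vpc-endpoint \
      --vpc-id <vpc_id> \
      --service-name com.amazonaws.<region>.s3 \
      --route-table-ids <route_table_id>

    # Check the status of a bulk load job (the load id is returned by the loader POST)
    curl https://<cluster_endpoint>:8182/loader/<load_id>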
7
Q

Neptune Workbench

A
  • Lets you query your Neptune cluster
    using notebooks
  • Notebooks are Jupyter notebooks
    hosted by Amazon SageMaker
  • Available within the AWS console
  • Notebook runs behind the scenes on
    an EC2 host in the same VPC and has
    IAM authentication
  • The security group that you attach in
    the VPC where Neptune is running
    must have an additional rule that allows
    inbound connections from itself
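  A sketch of that self-referencing security group rule (the security group ID is a
  placeholder; 8182 is the Neptune port):

    aws ec2 authorize-security-group-ingress \
      --group-id sg-0123456789abcdef0 \
      --protocol tcp --port 8182 \
      --source-group sg-0123456789abcdef0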
8
Q

Neptune Replication

A
  • Up to 15 read replicas
  • ASYNC replication
  • Replicas share the same underlying
    storage layer
  • Replication lag is typically in the
    10s of milliseconds
  • Minimal performance impact on the
    primary due to replication process
  • Replicas double up as failover targets
    (standby instance is not needed)
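  A sketch of adding a read replica with the CLI (identifiers and instance class are
  placeholders):

    aws neptune create-db-instance \
      --db-instance-identifier my-neptune-replica-1 \
      --db-instance-class db.r5.large \
      --engine neptune \
      --db-cluster-identifier my-neptune-cluster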
9
Q

Neptune High Availability

A
  • Failovers occur automatically
  • A replica is automatically promoted to be the
    new primary during DR
  • Neptune flips the CNAME of the DB instance
    to point to the replica and promotes it
  • Failover to a replica typically completes within
    30-120 seconds (minimal downtime)
  • Creating a new instance takes about 15 minutes (post failover)
  • Failover to a new instance happens on a best-effort basis and can take longer
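  A sketch of triggering a failover manually, e.g. for a DR test (identifiers are
  placeholders):

    aws neptune failover-db-cluster \
      --db-cluster-identifier my-neptune-cluster \
      --target-db-instance-identifier my-neptune-replica-1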
10
Q

Neptune Backup and Restore

A
  • Supports automatic backups
  • Continuously backs up your data to S3 for
    PITR (max retention period of 35 days)
  • Latest restorable time for a PITR can be up
    to 5 mins in the past (RPO = 5 minutes)
  • The first backup is a full backup.
    Subsequent backups are incremental
  • Take manual snapshots to retain beyond 35 days
  • Backup process does not impact cluster performance
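  A sketch of setting the maximum retention and restoring to a point in time (identifiers
  and the timestamp are placeholders; PITR always creates a new cluster):

    aws neptune modify-db-cluster \
      --db-cluster-identifier my-neptune-cluster \
      --backup-retention-period 35 \
      --apply-immediately

    aws neptune restore-db-cluster-to-point-in-time \
      --db-cluster-identifier my-neptune-restored \
      --source-db-cluster-identifier my-neptune-cluster \
      --restore-to-time 2024-01-01T12:00:00Z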
11
Q

Neptune Backup and Restore

A
  • Can only restore to a new cluster
  • Can restore an unencrypted snapshot to an
    encrypted cluster (but not the other way
    round)
  • To restore a cluster from an encrypted
    snapshot, you must have access to the KMS
    key
  • Can only share manual snapshots (to share an
    automated snapshot, copy it first, then share
    the copy)
  • Can’t share a snapshot encrypted using the
    default KMS key of the account
  • Snapshots can be shared across accounts, but
    only within the same region
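  A sketch of taking a manual snapshot and sharing it with another account in the same
  region (identifiers and the account ID are placeholders):

    aws neptune create-db-cluster-snapshot \
      --db-cluster-identifier my-neptune-cluster \
      --db-cluster-snapshot-identifier my-neptune-snap-1

    aws neptune modify-db-cluster-snapshot-attribute \
      --db-cluster-snapshot-identifier my-neptune-snap-1 \
      --attribute-name restore \
      --values-to-add 123456789012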
12
Q

Neptune Scaling

A
  • Vertical scaling (scale up / down) – by resizing instances
  • Horizontal scaling (scale out / in) – by adding / removing up to 15 read replicas
  • Automatic storage scaling – 10 GB to 64 TB (no manual intervention needed)
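  A sketch of vertical scaling with the CLI (identifiers and the instance class are
  placeholders); horizontal scaling uses the same create-db-instance call shown under
  Neptune Replication:

    aws neptune modify-db-instance \
      --db-instance-identifier my-neptune-primary \
      --db-instance-class db.r5.2xlarge \
      --apply-immediately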
13
Q

Database Cloning in Neptune

A
  • Different from creating read replicas – clones
    support both reads and writes
  • Different from replicating a cluster – clones use
    same storage layer as the source cluster
  • Requires only minimal additional storage
  • Quick and cost-effective
  • Only within region (can be in different VPC)
  • Can be created from existing clones
  • Uses a copy-on-write protocol
    • both source and clone share the same data initially
    • data that changes is then copied at the time it changes, either on the source or on
      the clone (i.e. stored separately from the shared data)
    • delta of writes after cloning is not shared
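  A sketch of creating a clone with the CLI, assuming the clone operation is exposed (as in
  Aurora) through restore-db-cluster-to-point-in-time with a copy-on-write restore type;
  identifiers are placeholders:

    aws neptune restore-db-cluster-to-point-in-time \
      --db-cluster-identifier my-neptune-clone \
      --source-db-cluster-identifier my-neptune-cluster \
      --restore-type copy-on-write \
      --use-latest-restorable-time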
14
Q

Neptune Security – IAM

A
  • Uses IAM for authentication and
    authorization to manage Neptune
    resources
  • Supports IAM Authentication (with AWS
    SigV4)
  • You use temporary credentials obtained by
    assuming an IAM role (see the sketch below):
    • Create an IAM role
    • Set up the trust relationship
    • Retrieve temporary credentials
    • Sign requests using the credentials (SigV4)
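  A sketch of that flow from a shell, using the third-party awscurl tool for the SigV4
  signing (role ARN, region, and endpoint are placeholders; the signing service name is
  assumed to be neptune-db):

    CREDS=$(aws sts assume-role \
      --role-arn arn:aws:iam::123456789012:role/NeptuneAccessRole \
      --role-session-name neptune-query)
    export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | jq -r .Credentials.AccessKeyId)
    export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | jq -r .Credentials.SecretAccessKey)
    export AWS_SESSION_TOKEN=$(echo "$CREDS" | jq -r .Credentials.SessionToken)

    # Send a SigV4-signed Gremlin request with the temporary credentials
    awscurl --service neptune-db --region us-east-1 \
      -X POST -d '{"gremlin": "g.V().limit(1)"}' \
      https://<cluster_endpoint>:8182/gremlin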
15
Q

Neptune Security – Encryption & Network

A
  • Encryption in transit – using SSL / TLS
    • Cluster parameter neptune_enforce_ssl = 1 (the default)
  • Encryption at rest – with AES-256 using KMS
    • encrypts data, automated backups, snapshots, and replicas in the same cluster
  • Neptune clusters are VPC-only (use private subnets)
  • Clients can run on EC2 in public subnets within VPC
  • Can connect to your on-premises IT infra via VPN
  • Use security groups to control access
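  A sketch of enforcing TLS and creating an encrypted cluster (parameter group name, key
  alias, and security group ID are placeholders; encryption at rest is configured when the
  cluster is created):

    aws neptune modify-db-cluster-parameter-group \
      --db-cluster-parameter-group-name my-neptune-params \
      --parameters ParameterName=neptune_enforce_ssl,ParameterValue=1,ApplyMethod=pending-reboot

    aws neptune create-db-cluster \
      --db-cluster-identifier my-neptune-cluster \
      --engine neptune \
      --storage-encrypted \
      --kms-key-id alias/my-neptune-key \
      --vpc-security-group-ids sg-0123456789abcdef0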
16
Q

Neptune Monitoring

A
  • Integrated with CloudWatch
  • Can use audit log files by enabling the DB cluster parameter neptune_enable_audit_log
  • Must restart the DB cluster after enabling audit logs
  • Audit log files are rotated when they exceed 100 MB (not configurable)
  • Audit logs are not stored in sequential order
    (can be ordered using the timestamp value of each record)
  • Audit log data can be published (exported) to a CloudWatch Logs log group by enabling Log exports for your cluster
  • API calls logged with CloudTrail
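  A sketch of enabling the audit log and exporting it to CloudWatch Logs (identifiers are
  placeholders; the log export flag is assumed to mirror the RDS-style CLI option):

    aws neptune modify-db-cluster-parameter-group \
      --db-cluster-parameter-group-name my-neptune-params \
      --parameters ParameterName=neptune_enable_audit_log,ParameterValue=1,ApplyMethod=pending-reboot

    aws neptune modify-db-cluster \
      --db-cluster-identifier my-neptune-cluster \
      --cloudwatch-logs-export-configuration '{"EnableLogTypes":["audit"]}'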
17
Q

Query Queuing in Neptune

A
  • Max 8192 queries can be queued up per Neptune instance
  • Queries beyond 8192 will result in ThrottlingException
  • Use CloudWatch metric
    MainRequestQueuePendingRequests to get number of queries queued (5 min
    granularity)
  • Get acceptedQueryCount value using the Query Status API
    • For Gremlin, acceptedQueryCount = current count of queries queued
    • For SPARQL, acceptedQueryCount = all queries accepted since the server started
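  A sketch of checking queue depth from CloudWatch and the Gremlin query status API
  (instance identifier, times, and endpoint are placeholders; the dimension name is assumed
  to follow the RDS-style DBInstanceIdentifier convention):

    aws cloudwatch get-metric-statistics \
      --namespace AWS/Neptune \
      --metric-name MainRequestQueuePendingRequests \
      --dimensions Name=DBInstanceIdentifier,Value=my-neptune-primary \
      --statistics Maximum \
      --period 300 \
      --start-time 2024-01-01T11:00:00Z \
      --end-time 2024-01-01T12:00:00Z

    # Gremlin query status (returns acceptedQueryCount, among other fields)
    curl https://<cluster_endpoint>:8182/gremlin/status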
18
Q

Neptune Service Errors

A
  • Graph engine errors
    • Errors related to the cluster endpoints, returned as HTTP error codes
    • Query errors – QueryLimitException /
      MemoryLimitExceededException / TooManyRequestsException etc.
    • IAM Auth errors – Missing Auth / Missing token / Invalid Signature /
      Missing headers / Incorrect Policy etc
  • API errors
    • HTTP errors related to APIs (CLI / SDK)
    • InternalFailure / AccessDeniedException / MalformedQueryString /
      ServiceUnavailable etc
  • Loader errors
    • LOAD_NOT_STARTED / LOAD_FAILED /
      LOAD_S3_READ_ERROR / LOAD_DATA_DEADLOCK etc
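  A sketch of pulling error details for a failed bulk load, assuming the loader status API's
  details/errors query parameters (the load id is a placeholder):

    curl -G https://<cluster_endpoint>:8182/loader/<load_id> \
      -d details=true \
      -d errors=true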
19
Q

SPARQL federated query

A
  • Query across multiple Neptune clusters or external data sources that
    support the SPARQL protocol, and aggregate the results
  • Supports only read operations
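  A sketch of a federated query using the SPARQL 1.1 SERVICE keyword (both endpoints are
  placeholders):

    curl -X POST https://<cluster_endpoint>:8182/sparql --data-urlencode 'query=
      SELECT ?s ?p ?o WHERE {
        SERVICE <https://<other_cluster_endpoint>:8182/sparql> {
          ?s ?p ?o
        }
      } LIMIT 10'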
20
Q

Neptune Streams

A
  • Capture changes to your graph (change logs)
  • Similar to DynamoDB streams
  • Can be processed with Lambda (use Neptune Streams API)
  • SPARQL
    • https://<cluster_endpoint>:8182/sparql/stream
  • Gremlin
    • https://<cluster_endpoint>:8182/gremlin/stream
  • Only GET method is allowed
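  A sketch of reading change records from the Gremlin stream endpoint (the iteratorType and
  limit query parameters are assumptions based on the Streams API; the endpoint is a
  placeholder):

    curl -G https://<cluster_endpoint>:8182/gremlin/stream \
      -d iteratorType=TRIM_HORIZON \
      -d limit=10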

  • Use cases:
    • Amazon ES integration
      • To perform full-text search queries on Neptune data
      • Uses Streams + federated queries
      • Supported for both Gremlin and SPARQL
    • Neptune-to-Neptune replication

21
Q

Neptune Pricing

A
  • You only pay for what you use
  • On-demand instances – per hour pricing
  • IOPS – per million IO requests
    • Every DB page read operation = one IO
    • Each page is 16 KB in Neptune
    • Write IOs are counted in 4KB units
  • DB Storage – per GB per month
  • Backups (automated and manual) – per GB per month
  • Data transfer – per GB
  • Neptune Workbench – per instance hour