Amazon CloudSearch | Best Practices Flashcards

1
Q

Why is my domain in the “Processing” state?

Best Practices

Amazon CloudSearch | Analytics

A

A domain can be in one of three different states: “processing,” “active,” or “reindexing.” Normally, your domain will be in the “active” state, which indicates that no changes are currently being made, that the domain can be queried and updated, and that all previous changes are currently visible in the search results.

When a configuration change requires the domain to be re-indexed, Amazon CloudSearch must rebuild the index entirely. The domain does not enter the “processing” state, however, until you initiate reindexing. While it is processing, the domain can still be queried and updated, but the configuration changes won’t be visible in search results until indexing is completed and the domain’s status changes back to “active.”

You can also continue to upload document batches to your domain. However, if you submit a large volume of updates while your domain is in the “processing” state, it can increase the amount of time it takes for the updates to be applied to your search index. If this becomes an issue, slow down your update rate until the domain returns to the “active” state.

2
Q

What are the best practices for bootstrapping data into CloudSearch?

A

After you’ve launched your domain, the next step is loading your data into Amazon CloudSearch. You’ll likely need to upload a single large dataset, and then make smaller updates or additions as new data comes in. The following guidelines will help make bootstrapping your initial data into CloudSearch quick and easy.

  1. Use curl -v when preparing your script

During the upload of a dataset, the script you’ve written reads your data and uses it to create JSON or XML documents. We recommend preparing this script in advance and using curl or another simple command-line tool to check that the documents it creates can be uploaded. The “-v” option in curl often provides more detailed information about syntax problems than the AWS SDK or Boto, both of which suppress errors for production purposes. That extra detail helps identify the source of any issue.

  2. Use UTF-8 character encoding

Make sure that all data is encoded as UTF-8 and that any invalid characters have been removed before uploading to CloudSearch. Illegal characters will cause the document upload to fail.
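
A minimal Python sketch of that clean-up step follows. It assumes the legal set is the XML 1.0 character range (tab, line feed, carriage return, and the non-control Unicode ranges); the helper names clean_text and decode_utf8 are hypothetical and not part of any AWS library.

```python
import re

# Assumption: characters outside the XML 1.0 "Char" range are the ones rejected.
_INVALID_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_text(value: str) -> str:
    """Remove characters that fall outside the assumed legal range."""
    return _INVALID_CHARS.sub("", value)


def decode_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, dropping undecodable bytes, then clean the result."""
    return clean_text(raw.decode("utf-8", errors="ignore"))
```

Running every field value through a function like clean_text before serializing the batch keeps a single bad record from failing the whole upload.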

  3. Batch your documents

Batching your documents is perhaps the most important step in data bootstrapping. Submitting documents to CloudSearch individually is not only inefficient, but also leads to preventable errors.

A document batch is simply a collection of add and delete operations that represent the documents you want to add, update, or delete from your domain. Batches are described in either JSON or XML, and when you upload them to a domain, the data is indexed automatically, according to the domain’s indexing options. Since you’re billed for the total number of document batches uploaded to your search domain, it’s more cost-effective to fill each batch as close as possible to 5 MB, the maximum allowed per upload. You can also upload batches in parallel to reduce the amount of time it takes to upload your data.
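
As a rough illustration, the Python sketch below groups add and delete operations into JSON batches that stay under a size limit. The function name make_batches and the sample field names are hypothetical. The operation shape (type, id, fields) is assumed to follow the JSON batch format, and the limit is taken as 5 MB measured in bytes.

```python
import json
from typing import Iterable, Iterator

MAX_BATCH_BYTES = 5 * 1024 * 1024  # assumed reading of the 5 MB per-upload limit


def make_batches(operations: Iterable[dict],
                 max_bytes: int = MAX_BATCH_BYTES) -> Iterator[bytes]:
    """Yield UTF-8 JSON arrays of operations, each at most max_bytes long."""
    current = []   # encoded operations in the batch being built
    size = 2       # bytes for the enclosing "[" and "]"
    for op in operations:
        encoded = json.dumps(op, ensure_ascii=False).encode("utf-8")
        if len(encoded) + 2 > max_bytes:
            raise ValueError("a single operation exceeds the batch size limit")
        extra = len(encoded) + (1 if current else 0)  # +1 for the comma
        if current and size + extra > max_bytes:
            # Adding this operation would overflow: emit the batch and start over.
            yield b"[" + b",".join(current) + b"]"
            current, size = [], 2
            extra = len(encoded)
        current.append(encoded)
        size += extra
    if current:
        yield b"[" + b",".join(current) + b"]"
```

Each yielded value is a complete request body, so the batches can be handed to whatever upload function you use, one at a time or from several workers in parallel.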

  4. Pre-scale

It’s also important to pre-scale your domain before uploading data to CloudSearch. Pre-scaling involves selecting the appropriate instance type for the amount of data you wish to upload.

Choosing an instance with enough capacity to handle the size of your upload can help prevent errors and a high replication count. Although replication can help decrease search response time, it doesn’t increase the size of the data pipe or address core problems in data uploads.

CloudSearch will automatically scale up to larger instances as you send more data. Still, pre-selecting the appropriate instance type saves time later in the bootstrapping process, as scaling from one instance to another tends to be a slower process. Below is a sample script to pre-scale the domain for bootstrapping and to restore the instance type after data is loaded.

Pre-scale before bootstrapping:

aws cloudsearch update-scaling-parameters --domain-name foo --scaling-parameters DesiredInstanceType=search.m3.2xlarge

aws cloudsearch index-documents --domain-name foo

Restore after data loading:

aws cloudsearch update-scaling-parameters --domain-name foo --scaling-parameters DesiredInstanceType=search.m1.small

aws cloudsearch index-documents --domain-name foo

3
Q

What are some ways to avoid 504 errors?

A

If you’re seeing 504 errors or high replication counts, try moving to a larger instance type. For example, if you’re having problems with m3.large, move up to m3.xlarge. If you continue to get 504 errors even after pre-scaling, batch your data if you aren’t already doing so, and increase the delay between retries.
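
One way to increase the delay between retries is exponential backoff, sketched below in Python. The names upload_with_retry, send, and sleep are hypothetical: send stands in for whatever function posts a batch and returns the HTTP status code, and the starting delay and attempt count are illustrative values, not AWS recommendations.

```python
import time
from typing import Callable


def upload_with_retry(send: Callable[[bytes], int],
                      batch: bytes,
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send(batch), retrying on HTTP 504 with a delay that doubles each time.

    Returns the last status code seen, which is 504 if every attempt timed out.
    """
    status = 504
    for attempt in range(max_attempts):
        status = send(batch)
        if status != 504:
            return status
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return status
```

Passing sleep as a parameter keeps the function easy to exercise without real waiting. In production the default time.sleep applies.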
