EMR
Amazon Elastic MapReduce (EMR) is a managed service for running big data frameworks that helps you process and analyze large data sets more efficiently. Typical uses include log analysis, web indexing, data warehousing, machine learning (ML), financial analysis, scientific simulation, and bioinformatics.
aws_emr_cluster (EMR) attributes:

The following arguments are required:

  • name - (Required) Name of the job flow.
  • release_label - (Required) Release label for the Amazon EMR release.
  • service_role - (Required) IAM role that will be assumed by the Amazon EMR service to access AWS resources.
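
As a rough sketch, the three required arguments could be written as follows. The resource label, cluster name, and release label are placeholders chosen for illustration, and EMR_DefaultRole assumes the default service role (as created by `aws emr create-default-roles`) exists in the account. A usable cluster would normally also set ec2_attributes and an instance group or fleet.

```hcl
# Minimal sketch: only the required arguments of aws_emr_cluster.
# All values are illustrative placeholders, not recommendations.
resource "aws_emr_cluster" "minimal" {
  name          = "emr-minimal-example" # name of the job flow
  release_label = "emr-6.15.0"          # assumed release; pick one available to you
  service_role  = "EMR_DefaultRole"     # assumed to exist in the account
}
```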

The following arguments are optional:

  • additional_info - (Optional) JSON string for selecting additional features such as adding proxy information. Note: Currently there is no API to retrieve the value of this argument after EMR cluster creation from provider, therefore Terraform cannot detect drift from the actual EMR cluster if its value is changed outside Terraform.
  • applications - (Optional) A case-insensitive list of applications for Amazon EMR to install and configure when launching the cluster. For a list of applications available for each Amazon EMR release version, see the Amazon EMR Release Guide.
  • autoscaling_role - (Optional) IAM role for automatic scaling policies. The IAM role provides permissions that the automatic scaling feature requires to launch and terminate EC2 instances in an instance group.
  • auto_termination_policy - (Optional) An auto-termination policy for an Amazon EMR cluster. An auto-termination policy defines the amount of idle time in seconds after which a cluster automatically terminates. See auto_termination_policy below.
  • bootstrap_action - (Optional) Ordered list of bootstrap actions that will be run before Hadoop is started on the cluster nodes. See below.
  • configurations - (Optional) List of configurations supplied for the EMR cluster you are creating. Supply a configuration object for applications to override their default configuration. See AWS Documentation for more information.
  • configurations_json - (Optional) JSON string for supplying list of configurations for the EMR cluster.
  • core_instance_fleet - (Optional) Configuration block to use an Instance Fleet for the core node type. Cannot be specified if any core_instance_group configuration blocks are set. Detailed below.
  • core_instance_group - (Optional) Configuration block to use an Instance Group for the core node type.
  • custom_ami_id - (Optional) Custom Amazon Linux AMI for the cluster (instead of an EMR-owned AMI). Available in Amazon EMR version 5.7.0 and later.
  • ebs_root_volume_size - (Optional) Size in GiB of the EBS root device volume of the Linux AMI that is used for each EC2 instance. Available in Amazon EMR version 4.x and later.
  • ec2_attributes - (Optional) Attributes for the EC2 instances running the job flow. See below.
  • keep_job_flow_alive_when_no_steps - (Optional) Whether the cluster keeps running when it has no steps or when all steps are complete. Defaults to true.
  • kerberos_attributes - (Optional) Kerberos configuration for the cluster. See below.
  • list_steps_states - (Optional) List of step states used to filter returned steps.
  • log_encryption_kms_key_id - (Optional) AWS KMS customer master key (CMK) key ID or ARN used for encrypting log files. This attribute is only available with EMR version 5.30.0 and later, excluding EMR 6.0.0.
  • log_uri - (Optional) S3 bucket to write the log files of the job flow. If a value is not provided, logs are not created.
  • master_instance_fleet - (Optional) Configuration block to use an Instance Fleet for the master node type. Cannot be specified if any master_instance_group configuration blocks are set. Detailed below.
  • master_instance_group - (Optional) Configuration block to use an Instance Group for the master node type.
  • scale_down_behavior - (Optional) Way that individual Amazon EC2 instances terminate when an automatic scale-in activity occurs or an instance group is resized.
  • security_configuration - (Optional) Security configuration name to attach to the EMR cluster. Only valid for EMR clusters with release_label 4.8.0 or greater.
  • step - (Optional) List of steps to run when creating the cluster. See below. It is highly recommended to utilize the lifecycle configuration block with ignore_changes if other steps are being managed outside of Terraform. This argument is processed in attribute-as-blocks mode.
  • step_concurrency_level - (Optional) Number of steps that can be executed concurrently. You can specify a maximum of 256 steps. Only valid for EMR clusters with release_label 5.28.0 or greater (default is 1).
  • tags - (Optional) Map of tags to apply to the EMR cluster. If configured with a provider default_tags configuration block present, tags with matching keys will overwrite those defined at the provider level.
  • termination_protection - (Optional) Switch on/off termination protection (default is false, except when using multiple master nodes). Before attempting to destroy the resource when termination protection is enabled, this configuration must be applied with its value set to false.
  • visible_to_all_users - (Optional) Whether the job flow is visible to all IAM users of the AWS account associated with the job flow. Default value is true.

bootstrap_action

  • args - (Optional) List of command line arguments to pass to the bootstrap action script.
  • name - (Required) Name of the bootstrap action.
  • path - (Required) Location of the script to run during a bootstrap action. Can be either a location in Amazon S3 or on a local file system.

auto_termination_policy

  • idle_timeout - (Optional) Specifies the amount of idle time in seconds after which the cluster automatically terminates. You can specify a minimum of 60 seconds and a maximum of 604800 seconds (seven days).
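
A short sketch of how this block might look inside a cluster definition. The one-hour timeout, the resource label, and the release label are assumptions for illustration; EMR_DefaultRole is assumed to exist in the account.

```hcl
# Sketch: terminate the cluster after one hour of idleness.
# Names and values below are illustrative placeholders.
resource "aws_emr_cluster" "auto_terminating" {
  name          = "emr-auto-terminate-example"
  release_label = "emr-6.15.0"      # assumed release label
  service_role  = "EMR_DefaultRole" # assumed to exist

  auto_termination_policy {
    idle_timeout = 3600 # seconds; allowed range is 60 to 604800
  }
}
```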

configurations

A configuration classification that applies when provisioning cluster instances, which can include configurations for applications and software that run on the cluster. See Configuring Applications.

  • classification - (Optional) Classification within a configuration.
  • properties - (Optional) Map of properties specified within a configuration classification.

core_instance_fleet

  • instance_type_configs - (Optional) Configuration block for instance fleet.
  • launch_specifications - (Optional) Configuration block for launch specification.
  • name - (Optional) Friendly name given to the instance fleet.
  • target_on_demand_capacity - (Optional) The target capacity of On-Demand units for the instance fleet, which determines how many On-Demand instances to provision.
  • target_spot_capacity - (Optional) Target capacity of Spot units for the instance fleet, which determines how many Spot instances to provision.

instance_type_configs

  • bid_price - (Optional) Bid price for each EC2 Spot instance type as defined by instance_type. Expressed in USD. If neither bid_price nor bid_price_as_percentage_of_on_demand_price is provided, bid_price_as_percentage_of_on_demand_price defaults to 100%.
  • bid_price_as_percentage_of_on_demand_price - (Optional) Bid price, as a percentage of On-Demand price, for each EC2 Spot instance as defined by instance_type. Expressed as a number (for example, 20 specifies 20%). If neither bid_price nor bid_price_as_percentage_of_on_demand_price is provided, bid_price_as_percentage_of_on_demand_price defaults to 100%.
  • configurations - (Optional) Configuration classification that applies when provisioning cluster instances, which can include configurations for applications and software that run on the cluster. List of configuration blocks.
  • ebs_config - (Optional) Configuration block(s) for EBS volumes attached to each instance in the instance group. Detailed below.
  • instance_type - (Required) EC2 instance type, such as m4.xlarge.
  • weighted_capacity - (Optional) Number of units that a provisioned instance of this type provides toward fulfilling the target capacities defined in aws_emr_instance_fleet.

launch_specifications

  • on_demand_specification - (Optional) Configuration block for on demand instances launch specifications.
  • spot_specification - (Optional) Configuration block for spot instances launch specifications.

on_demand_specification

The launch specification for On-Demand instances in the instance fleet, which determines the allocation strategy. The instance fleet configuration is available only in Amazon EMR versions 4.8.0 and later, excluding 5.0.x versions. On-Demand instances allocation strategy is available in Amazon EMR version 5.12.1 and later.

  • allocation_strategy - (Required) Specifies the strategy to use in launching On-Demand instance fleets. Currently, the only option is lowest-price (the default), which launches the lowest price first.

spot_specification

The launch specification for Spot instances in the fleet, which determines the defined duration, provisioning timeout behavior, and allocation strategy.

  • allocation_strategy - (Required) Specifies the strategy to use in launching Spot instance fleets. Currently, the only option is capacity-optimized (the default), which launches instances from Spot instance pools with optimal capacity for the number of instances that are launching.
  • block_duration_minutes - (Optional) Defined duration for Spot instances (also known as Spot blocks) in minutes. When specified, the Spot instance does not terminate before the defined duration expires, and defined duration pricing for Spot instances applies. Valid values are 60, 120, 180, 240, 300, or 360. The duration period starts as soon as a Spot instance receives its instance ID. At the end of the duration, Amazon EC2 marks the Spot instance for termination and provides a Spot instance termination notice, which gives the instance a two-minute warning before it terminates.
  • timeout_action - (Required) Action to take when target_spot_capacity has not been fulfilled by the time timeout_duration_minutes expires; that is, when all Spot instances could not be provisioned within the Spot provisioning timeout. Valid values are TERMINATE_CLUSTER and SWITCH_TO_ON_DEMAND. SWITCH_TO_ON_DEMAND specifies that if no Spot instances are available, On-Demand Instances should be provisioned to fulfill any remaining Spot capacity.
  • timeout_duration_minutes - (Required) Spot provisioning timeout period in minutes. If Spot instances are not provisioned within this time period, the timeout_action is taken. Minimum value is 5 and maximum value is 1440. The timeout applies only during initial provisioning, when the cluster is first created.
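
The fleet arguments above can be combined roughly as follows. This is a sketch under stated assumptions, not a tuned configuration: the instance types, capacities, 80% bid, and 10-minute timeout are illustrative, and the cluster name, release label, and role name are placeholders.

```hcl
# Sketch: a cluster that uses instance fleets for both master and core nodes.
# All concrete values are illustrative assumptions.
resource "aws_emr_cluster" "fleet_example" {
  name          = "emr-fleet-example"
  release_label = "emr-6.15.0"      # assumed release label
  service_role  = "EMR_DefaultRole" # assumed to exist

  master_instance_fleet {
    instance_type_configs {
      instance_type = "m5.xlarge"
    }
    target_on_demand_capacity = 1
  }

  core_instance_fleet {
    name                      = "core fleet"
    target_on_demand_capacity = 2
    target_spot_capacity      = 2

    instance_type_configs {
      instance_type                              = "m5.xlarge"
      weighted_capacity                          = 1
      bid_price_as_percentage_of_on_demand_price = 80 # 80% of On-Demand price

      ebs_config {
        size                 = 100
        type                 = "gp2"
        volumes_per_instance = 1
      }
    }

    launch_specifications {
      spot_specification {
        allocation_strategy      = "capacity-optimized"
        timeout_action           = "SWITCH_TO_ON_DEMAND" # fall back if Spot is unavailable
        timeout_duration_minutes = 10
      }
    }
  }
}
```

With SWITCH_TO_ON_DEMAND, any Spot capacity that is not provisioned within the timeout is filled with On-Demand instances rather than terminating the cluster.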

core_instance_group

  • autoscaling_policy - (Optional) String containing the EMR Auto Scaling Policy JSON.
  • bid_price - (Optional) Bid price for each EC2 instance in the instance group, expressed in USD. By setting this attribute, the instance group is being declared as a Spot Instance, and will implicitly create a Spot request. Leave this blank to use On-Demand Instances.
  • ebs_config - (Optional) Configuration block(s) for EBS volumes attached to each instance in the instance group. Detailed below.
  • instance_count - (Optional) Target number of instances for the instance group. Must be at least 1. Defaults to 1.
  • instance_type - (Required) EC2 instance type for all instances in the instance group.
  • name - (Optional) Friendly name given to the instance group.

ebs_config

  • iops - (Optional) Number of I/O operations per second (IOPS) that the volume supports.
  • size - (Required) Volume size, in gibibytes (GiB).
  • type - (Required) Volume type. Valid options are gp3, gp2, io1, standard, st1 and sc1. See EBS Volume Types.
  • throughput - (Optional) The throughput, in mebibytes per second (MiB/s).
  • volumes_per_instance - (Optional) Number of EBS volumes with this configuration to attach to each EC2 instance in the instance group (default is 1).
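
For illustration, an ebs_config block that sets iops and throughput on a gp3 volume might look like this. The sizes and performance figures are assumed values, and whether gp3 is accepted depends on the EMR release in use.

```hcl
# Sketch: two gp3 volumes per core instance with explicit IOPS and throughput.
# Values are illustrative assumptions.
resource "aws_emr_cluster" "ebs_example" {
  name          = "emr-ebs-example"
  release_label = "emr-6.15.0"      # assumed release label
  service_role  = "EMR_DefaultRole" # assumed to exist

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type = "m5.xlarge"

    ebs_config {
      size                 = 100 # GiB
      type                 = "gp3"
      iops                 = 3000
      throughput           = 125 # MiB/s
      volumes_per_instance = 2
    }
  }
}
```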

ec2_attributes

Attributes for the Amazon EC2 instances running the job flow:

  • additional_master_security_groups - (Optional) String containing a comma separated list of additional Amazon EC2 security group IDs for the master node.
  • additional_slave_security_groups - (Optional) String containing a comma separated list of additional Amazon EC2 security group IDs for the slave nodes.
  • emr_managed_master_security_group - (Optional) Identifier of the Amazon EC2 EMR-Managed security group for the master node.
  • emr_managed_slave_security_group - (Optional) Identifier of the Amazon EC2 EMR-Managed security group for the slave nodes.
  • instance_profile - (Required) Instance profile for the EC2 instances of the cluster; the instances assume this role.
  • key_name - (Optional) Amazon EC2 key pair that can be used to SSH to the master node as the user called hadoop.
  • service_access_security_group - (Optional) Identifier of the Amazon EC2 service-access security group - required when the cluster runs on a private subnet.
  • subnet_id - (Optional) VPC subnet ID where you want the job flow to launch. Cannot specify the cc1.4xlarge instance type for nodes of a job flow launched in an Amazon VPC.
  • subnet_ids - (Optional) List of VPC subnet IDs where you want the job flow to launch. Amazon EMR identifies the best Availability Zone to launch instances according to your fleet specifications.

kerberos_attributes

  • ad_domain_join_password - (Optional) Active Directory password for ad_domain_join_user. Terraform cannot perform drift detection of this configuration.
  • ad_domain_join_user - (Optional) Required only when establishing a cross-realm trust with an Active Directory domain. A user with sufficient privileges to join resources to the domain. Terraform cannot perform drift detection of this configuration.
  • cross_realm_trust_principal_password - (Optional) Required only when establishing a cross-realm trust with a KDC in a different realm. The cross-realm principal password, which must be identical across realms. Terraform cannot perform drift detection of this configuration.
  • kdc_admin_password - (Required) Password used within the cluster for the kadmin service on the cluster-dedicated KDC, which maintains Kerberos principals, password policies, and keytabs for the cluster. Terraform cannot perform drift detection of this configuration.
  • realm - (Required) Name of the Kerberos realm to which all nodes in a cluster belong. For example, EC2.INTERNAL.
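
A minimal sketch of the two required Kerberos arguments. The variable kdc_admin_password and the security configuration name are hypothetical; the sketch assumes a Kerberos-enabled security configuration already exists under that name.

```hcl
# Hypothetical input so the password is not hard-coded in the configuration.
variable "kdc_admin_password" {
  type      = string
  sensitive = true
}

# Sketch: cluster with a cluster-dedicated KDC. Names are placeholders.
resource "aws_emr_cluster" "kerberized" {
  name          = "emr-kerberos-example"
  release_label = "emr-6.15.0"      # assumed release label
  service_role  = "EMR_DefaultRole" # assumed to exist

  # Assumed name of an existing security configuration with Kerberos enabled.
  security_configuration = "kerberos-enabled-config"

  kerberos_attributes {
    kdc_admin_password = var.kdc_admin_password
    realm              = "EC2.INTERNAL"
  }
}
```

Because Terraform cannot perform drift detection on these values, a password changed outside Terraform will not show up in a plan.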

master_instance_fleet

  • instance_type_configs - (Optional) Configuration block for instance fleet.
  • launch_specifications - (Optional) Configuration block for launch specification.
  • name - (Optional) Friendly name given to the instance fleet.
  • target_on_demand_capacity - (Optional) Target capacity of On-Demand units for the instance fleet, which determines how many On-Demand instances to provision.
  • target_spot_capacity - (Optional) Target capacity of Spot units for the instance fleet, which determines how many Spot instances to provision.

instance_type_configs

See instance_type_configs above, under core_instance_fleet.

launch_specifications

See launch_specifications above, under core_instance_fleet.

master_instance_group

Supported nested arguments for the master_instance_group configuration block:

  • bid_price - (Optional) Bid price for each EC2 instance in the instance group, expressed in USD. By setting this attribute, the instance group is being declared as a Spot Instance, and will implicitly create a Spot request. Leave this blank to use On-Demand Instances.
  • ebs_config - (Optional) Configuration block(s) for EBS volumes attached to each instance in the instance group. Detailed below.
  • instance_count - (Optional) Target number of instances for the instance group. Must be 1 or 3. Defaults to 1. Launching with multiple master nodes is only supported in EMR version 5.23.0+, and requires this resource's core_instance_group to be configured. Public (Internet accessible) instances must be created in VPC subnets that have map public IP on launch enabled. Termination protection is automatically enabled when launched with multiple master nodes and Terraform must have the termination_protection = false configuration applied before destroying this resource.
  • instance_type - (Required) EC2 instance type for all instances in the instance group.
  • name - (Optional) Friendly name given to the instance group.
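
The multi-master constraints described for instance_count can be sketched as follows. The subnet variable, instance types, and role names are assumptions for illustration.

```hcl
# Hypothetical input: ID of the VPC subnet to launch into.
variable "subnet_id" {
  type = string
}

# Sketch: three master nodes, which also requires a core_instance_group.
resource "aws_emr_cluster" "multi_master" {
  name          = "emr-multi-master-example"
  release_label = "emr-6.15.0"      # must be 5.23.0 or later for multiple masters
  service_role  = "EMR_DefaultRole" # assumed to exist

  ec2_attributes {
    subnet_id        = var.subnet_id
    instance_profile = "EMR_EC2_DefaultRole" # assumed default instance profile
  }

  master_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 3 # must be 1 or 3
  }

  core_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 2
  }
}
```

Termination protection is enabled automatically in this setup, so termination_protection = false has to be applied before the resource can be destroyed.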

ebs_config

See ebs_config under core_instance_group above.

step

This argument is processed in attribute-as-blocks mode.

  • action_on_failure - (Required) Action to take if the step fails. Valid values: TERMINATE_JOB_FLOW, TERMINATE_CLUSTER, CANCEL_AND_WAIT, and CONTINUE.
  • hadoop_jar_step - (Required) JAR file used for the step. See below.
  • name - (Required) Name of the step.

hadoop_jar_step

This argument is processed in attribute-as-blocks mode.

  • args - (Optional) List of command line arguments passed to the JAR file's main function when executed.
  • jar - (Required) Path to a JAR file run during the step.
  • main_class - (Optional) Name of the main class in the specified Java file. If not specified, the JAR file should specify a Main-Class in its manifest file.
  • properties - (Optional) Key-Value map of Java properties that are set when the step runs. You can use these properties to pass key value pairs to your main function.
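
A sketch of a step with its nested hadoop_jar_step block, together with the lifecycle setting recommended above for steps managed outside of Terraform. The step name and arguments are illustrative; command-runner.jar is the generic runner provided on EMR clusters.

```hcl
# Sketch: one step defined at cluster creation. Values are placeholders.
resource "aws_emr_cluster" "with_step" {
  name          = "emr-step-example"
  release_label = "emr-6.15.0"      # assumed release label
  service_role  = "EMR_DefaultRole" # assumed to exist

  step {
    name              = "Setup Hadoop Debugging"
    action_on_failure = "TERMINATE_CLUSTER"

    hadoop_jar_step {
      jar  = "command-runner.jar"
      args = ["state-pusher-script"]
    }
  }

  # Ignore steps that are added or changed outside of Terraform.
  lifecycle {
    ignore_changes = [step]
  }
}
```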

Associating resources with an EMR cluster:
Resources do not "belong" to an EMR cluster. Rather, one or more security groups are associated with the cluster's EC2 instances (see ec2_attributes above).
Create an EMR cluster via Terraform:
The following HCL creates a cluster that includes Spark.
Syntax:

resource "aws_emr_cluster" "cluster" {
 name          = "emr-test-arn"
 release_label = "emr-4.6.0"
 applications  = ["Spark"]

 additional_info = <<EOF
{
 "instanceAwsClientConfiguration": {
   "proxyPort": 8099,
   "proxyHost": "myproxy.example.com"
 }
}
EOF

 termination_protection            = false
 keep_job_flow_alive_when_no_steps = true

 ec2_attributes {
   subnet_id                         = aws_subnet.main.id
   emr_managed_master_security_group = aws_security_group.sg.id
   emr_managed_slave_security_group  = aws_security_group.sg.id
   instance_profile                  = aws_iam_instance_profile.emr_profile.arn
 }

 master_instance_group {
   instance_type = "m4.large"
 }

 core_instance_group {
   instance_type  = "c4.large"
   instance_count = 1

   ebs_config {
     size                 = "40"
     type                 = "gp2"
     volumes_per_instance = 1
   }

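   # Setting bid_price declares the core group as Spot Instances;
   # omit it to use On-Demand Instances.
   bid_price = "0.30"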

   autoscaling_policy = <<EOF
{
"Constraints": {
 "MinCapacity": 1,
 "MaxCapacity": 2
},
"Rules": [
 {
   "Name": "ScaleOutMemoryPercentage",
   "Description": "Scale out if YARNMemoryAvailablePercentage is less than 15",
   "Action": {
     "SimpleScalingPolicyConfiguration": {
       "AdjustmentType": "CHANGE_IN_CAPACITY",
       "ScalingAdjustment": 1,
       "CoolDown": 300
     }
   },
   "Trigger": {
     "CloudWatchAlarmDefinition": {
       "ComparisonOperator": "LESS_THAN",
       "EvaluationPeriods": 1,
       "MetricName": "YARNMemoryAvailablePercentage",
       "Namespace": "AWS/ElasticMapReduce",
       "Period": 300,
       "Statistic": "AVERAGE",
       "Threshold": 15.0,
       "Unit": "PERCENT"
     }
   }
 }
]
}
EOF
 }

 ebs_root_volume_size = 100

 tags = {
   role = "rolename"
   env  = "env"
 }

 bootstrap_action {
   path = "s3://elasticmapreduce/bootstrap-actions/run-if"
   name = "runif"
   args = ["instance.isMaster=true", "echo running on master node"]
 }

 configurations_json = <<EOF
 [
   {
     "Classification": "hadoop-env",
     "Configurations": [
       {
         "Classification": "export",
         "Properties": {
           "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
         }
       }
     ],
     "Properties": {}
   },
   {
     "Classification": "spark-env",
     "Configurations": [
       {
         "Classification": "export",
         "Properties": {
           "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
         }
       }
     ],
     "Properties": {}
   }
 ]
EOF

 service_role = aws_iam_role.iam_emr_service_role.arn
}

Create an EMR cluster via CLI:
Parameters:

create-cluster
--release-label <value>   | --ami-version <value>
--instance-fleets <value> | --instance-groups <value> | --instance-type <value> --instance-count <value>
[--os-release-label <value>]
[--auto-terminate | --no-auto-terminate]
[--use-default-roles]
[--service-role <value>]
[--configurations <value>]
[--name <value>]
[--log-uri <value>]
[--log-encryption-kms-key-id <value>]
[--additional-info <value>]
[--ec2-attributes <value>]
[--termination-protected | --no-termination-protected]
[--scale-down-behavior <value>]
[--visible-to-all-users | --no-visible-to-all-users]
[--enable-debugging | --no-enable-debugging]
[--tags <value>]
[--applications <value>]
[--emrfs <value>]
[--bootstrap-actions <value>]
[--steps <value>]
[--restore-from-hbase-backup <value>]
[--security-configuration <value>]
[--custom-ami-id <value>]
[--ebs-root-volume-size <value>]
[--repo-upgrade-on-boot <value>]
[--kerberos-attributes <value>]
[--managed-scaling-policy <value>]
[--placement-group-configs <value>]
[--auto-termination-policy <value>]

Example:

aws emr create-cluster \
   --release-label emr-5.9.0 \
   --applications Name=Spark \
   --ec2-attributes KeyName=myKey \
   --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
   --auto-terminate

Best Practices for EMR

Categorized by Availability, Security & Compliance and Cost

  • Info: Ensure EMR clusters archive log files to S3
  • Critical: Ensure EMR cluster master nodes are not publicly accessible
  • Warning: Ensure EMR clusters are encrypted in transit and at rest
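
One possible way to address these three rules in Terraform is sketched below. It is an illustration under assumptions rather than a complete hardening guide: every variable is a hypothetical input, the encryption modes (SSE-S3, AWS KMS, PEM certificates) are just one choice among several, and the role and profile names assume the EMR default roles exist.

```hcl
# Hypothetical inputs; replace with values from your own environment.
variable "log_bucket" { type = string }            # S3 bucket name for logs
variable "private_subnet_id" { type = string }     # subnet without public IPs
variable "service_access_sg_id" { type = string }  # service-access security group
variable "kms_key_arn" { type = string }           # KMS key for local disks
variable "tls_certs_s3_object" { type = string }   # s3:// path to a PEM zip

# Encryption at rest and in transit, expressed as a security configuration.
resource "aws_emr_security_configuration" "encrypted" {
  name = "emr-encryption-example"

  configuration = jsonencode({
    EncryptionConfiguration = {
      EnableAtRestEncryption    = true
      EnableInTransitEncryption = true
      AtRestEncryptionConfiguration = {
        S3EncryptionConfiguration = {
          EncryptionMode = "SSE-S3"
        }
        LocalDiskEncryptionConfiguration = {
          EncryptionKeyProviderType = "AwsKms"
          AwsKmsKey                 = var.kms_key_arn
        }
      }
      InTransitEncryptionConfiguration = {
        TLSCertificateConfiguration = {
          CertificateProviderType = "PEM"
          S3Object                = var.tls_certs_s3_object
        }
      }
    }
  })
}

resource "aws_emr_cluster" "hardened" {
  name          = "emr-hardened-example"
  release_label = "emr-6.15.0"      # assumed release label
  service_role  = "EMR_DefaultRole" # assumed to exist

  # Archive log files to S3.
  log_uri = "s3://${var.log_bucket}/emr-logs/"

  # Attach the encryption settings defined above.
  security_configuration = aws_emr_security_configuration.encrypted.name

  # Launch in a private subnet so the master node is not publicly reachable.
  ec2_attributes {
    subnet_id                     = var.private_subnet_id
    service_access_security_group = var.service_access_sg_id
    instance_profile              = "EMR_EC2_DefaultRole" # assumed default profile
  }

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type = "m5.xlarge"
  }
}
```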