Its not you Everytime, sometimes issue might be at AWS End

Today an issue reported to me that website of our client was loading very slow which was hosted on AWS Windows server and the same website was loading fine when accessed from outside AWS network,I just felt like might be a regular issue but it all together took me to an inside out of the network troubleshooting.

Initially, we checked for SSL certificate expiry, which was not the case, so below are the Two steps which we used to troubleshoot the issue:

Troubleshooting through Browser via Web developer Network tool

In browser we checked which part of code was taking time to load using Network option in developer tools:

  • Select web developer tools in firefox
  • Then select network

We identified one of the GET calls was taking time to load.
Then when this thing was reported to AWS support team they provided further analysis of this. We can save the report as (.HAR) file which tells us below things:

  • How long it takes to fetch DNS information
  • How long each object takes to be requested
  • How long it takes to connect to the server
  • How long it takes to transfer assets from the server to the browser of each object.

Troubleshooting using Traceroute

Then we tried to troubleshoot the AWS network flow using “tracert ” with below output:

Tracing route to example.gov [151.x.x.x] over a maximum of 15 hops:

1 <1 ms <1 ms <1 ms 10.x.x.x

2 * * * Request timed out.

3 * * * Request timed out.

4 * * * Request timed out.

5 * * * Request timed out.

6 * * * Request timed out.

7 <1 ms <1 ms <1 ms 100.x.x.x

8 <1 ms <1 ms 1 ms 52.x.x.x

9 * * * Request timed out.

10 2 ms 1 ms 1 ms example.net [67.x.x.x]

11 2 ms 2 ms 2 ms example.net [67.x.x.x]

12 2 ms 2 ms 2 ms example.net [205.x.x.x]

13 3 ms 3 ms 2 ms 63.x.x.x

14 3 ms 3 ms 3 ms 198.x.x.x

15 4 ms 4 ms 4 ms example.net [63.x.x.x]

And when this was reported to AWS team that RTO from 2-6 we were getting was due to connectivity with internal AWS network which needs to be byepass and was not an issue as packet still reached the next server within 1ms.

Traceroute gives an insight to your network problem.

  • The entire path that a packet travels through
  • Names and identity of routers and devices in your path
  • Network Latency or more specifically the time taken to send and receive data to each devices on the path.

Solution provided by AWS Team

After all the Razzle-Dazzle they just refreshed the network from their end and there was no more website latency after that while accessing from AWS internal network.

Tool recommended by AWS Support team for Network troubleshooting if the issue arises in future:

Wireshark along with .har file using network in web-developer tools from browser.

Wireshark is a network packet analyzer. A network packet analyzer will try to capture network packets and tries to display that packet data as detailed as possible.
You could think of a network packet analyzer as a measuring device used to examine what’s going on inside a network cable, just like a voltmeter is used by an electrician to examine what’s going on inside an electric cable (but at a higher level, of course).
In the past, such tools were either very expensive, proprietary, or both. However, with the advent of Wireshark, all that has changed.

Wireshark is perhaps one of the best open source packet analyzers available today.

Features

The following are some of the many features Wireshark provides:

  • Available for UNIX and Windows.
  • Capture live packet data from a network interface.
  • Open files containing packet data captured with tcpdump/WinDump, Wireshark, and a number of other packet capture programs.
  • Import packets from text files containing hex dumps of packet data.
  • Display packets with very detailed protocol information.
  • Save packet data captured.
  • Export some or all packets in a number of capture file formats.
  • Filter packets on many criteria.
  • Search for packets on many criteria.
  • Colorize packet display based on filters.
  • Create various statistics.

… and a lot more!

My stint with Runc vulnerability

Today I was given a task to set up a new QA environment. I said no issue should be done quickly as we use Docker, so I just need to provision VM run the already available QA ready docker image on this newly provisioned VM. So I started and guess what Today was not my day. I got below error while running by App image.

docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused “process_linux.go:293: copying bootstrap data to pipe caused \”write init-p: broken pipe\””: unknown.

I figured out my Valentine’s Day gone for a toss. As usual I took help of Google God to figure out what this issue is all about, after few minutes I found out a blog pretty close to issue that I was facing

https://medium.com/@dirk.avery/docker-error-response-from-daemon-1d46235ff61d

Bang on I got the issue identified. There is a new runc vulnerability identified few days back.

https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-5736

The fix of this vulnerability was released by Docker on February 11, but the catch was that this fix makes docker incompatible with 3.13 Kernel version.

While setting up QA environment I installed latest stable version of docker 18.09.2 and since the kernel version was 3.10.0-327.10.1.el7.x86_64 thus docker was not able to function properly.

So as suggested in the blog I upgraded the Kernel version to 4.x.

rpm –import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
yum repolist
yum –enablerepo=elrepo-kernel install kernel-ml
yum repolist all
awk -F\’ ‘$1==”menuentry ” {print i++ ” : ” $2}’ /etc/grub2.cfg
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

And here we go post that everything is working like a charm.

So word of caution to every even
We have a major vulnerability in docker CVE-2019-5736, for more details go through the link

https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-5736

As a fix, upgrade your docker to 18.09.2, as well make sure that you have Kernel 4+ as suggested in the blog.

https://docs.docker.com/engine/release-notes/

Now I can go for my Valentine Party 👫

Using Ansible Dynamic Inventory with Azure can save the day for you.

As a DevOps Engineer, I always love to make things simple and convenient by automating them. Automation can be done on many fronts like infrastructure, software, build and release etc.

Ansible is primarily a software configuration management tool which can also be used as an infrastructure provisioning tool.
One of the thing that I love about Ansible is its integration with different cloud providers. This integration makes things really loosely coupled, For ex:- we don’t require to manage whole information of cloud in Ansible (Like we don’t need instance metadata information for provisioning it).

Ansible Inventory

Ansible uses a term called inventory to refer to the set of systems or machines that our Ansible playbook or command work against. There are two ways to manage inventory:-
  • Static Inventory
  • Dynamic Inventory
By default, the static inventory is defined in /etc/ansible/hosts in which we provide information about the target system. In most of the cloud platform when the server gets reboot then it reassigns a new public address and again we have to update that in our static inventory, so this can’t be the lasting option.
Luckily Ansible supports the concept of dynamic inventory in which we have some python scripts and a .ini file through which we can provision machines dynamically without knowing its public or private address. Ansible Dynamic Inventory is fed by using external python scripts and .ini files provided by Ansible for cloud infrastructure platforms like Amazon, Azure, DigitalOcean, Rackspace.
In this blog, we will talk about how to configure dynamic inventory on the Azure Cloud Platform.

Ansible Dynamic Inventory on Azure

The first thing that always required to run anything is software and its dependencies. So let’s install the software and its dependencies first. First, we need the python modules of azure that we can install via pip.
 
$ pip install 'ansible[azure]'
After this, we need to download azure_rm.py

$ wget https://raw.githubusercontent.com/ansible/ansible/devel/contrib/inventory/azure_rm.py

Change the permission of file using chmod command.

$ chmod +x azure_rm.py

Then we have to log in to Azure account using azure-cli

$ az login
To sign in, use a web browser to open the page https://aka.ms/devicelogin and enter the code XXXXXXXXX to authenticate.

The az login command output will provide you a unique code which you have to enter in the webpage i.e.
https://aka.ms/devicelogin

As part of the best practice, we should always create an Active Directory for different services or apps to restrict privileges. Once you logged in Azure account you can create an Active Directory app for Ansible

$ az ad app create --password ThisIsTheAppPassword --display-name opstree-ansible --homepage ansible.opstree.com --identifier-uris ansible.opstree.com

Don’t forget to change your password ;). Note down the appID from the output of the above command.

Once the app is created, create a service principal to associate it with.

$ az ad sp create --id appID

Replace the appID with actual app id and copy the objectID from the output of the above command.
Now we just need the subscription id and tenant id, which we can get by a simple command

$ az account show

Note down the id and tenantID from the output of the above command.

Let’s assign a contributor role to service principal which is created above.

$ az role assignment create --assignee objectID --role contributor

Replace the objectID with the actual object id output.

All the azure side setup is done. Now we have to make some changes to your system.

Let’s start with creating an azure home directory

$ mkdir ~/.azure

In that directory, we have to create a credentials file

$ vim ~/.azure/credentials

[default]
subscription_id=id
client_id=appID
secret=ThisIsTheAppPassword
tenant=tenantID

Please replace the id, appID, password and tenantID with the above-noted things.

All set !!!! Now we can test it by below command

$ python ./azure_rm.py --list | jq

and the output should be like this:-

{
  "azure": [
    "ansibleMaster"
  ],
  "westeurope": [
    "ansibleMaster"
  ],
  "ansibleMasterNSG": [
    "ansibleMaster"
  ],
  "ansiblelab": [
    "ansibleMaster"
  ],
  "_meta": {
    "hostvars": {
      "ansibleMaster": {
        "powerstate": "running",
        "resource_group": "ansiblelab",
        "tags": {},
        "image": {
          "sku": "7.3",
          "publisher": "OpSTree",
          "version": "latest",
          "offer": "CentOS"
        },
        "public_ip_alloc_method": "Dynamic",
        "os_disk": {
          "operating_system_type": "Linux",
          "name": "osdisk_vD2UtEJhpV"
        },
        "provisioning_state": "Succeeded",
        "public_ip": "52.174.19.210",
        "public_ip_name": "masterPip",
        "private_ip": "192.168.1.4",
        "computer_name": "ansibleMaster",
        ...
      }
    }
  }
}

Now you are ready to use Ansible in Azure with dynamic inventory. Good Luck 🙂

Log Parsing of Windows Servers on Instance Termination

Introduction

Logs play a critical role in any application or system. They provide deep visibility into what the application is doing, how requests are processed, and what caused an error. Depending on how logging is configured, logs may contain transaction history, timestamps, request details, and even financial information such as debits or credits.

In enterprise environments, applications usually run across multiple hosts. Managing logs across hundreds of servers can quickly become complex. Debugging issues by manually searching log files on multiple instances is time consuming and inefficient. This is why centralizing logs is considered a best practice.

Recently, I encountered a common challenge in AWS environments where application logs need to be retained from instances running behind an Auto Scaling Group. This blog explains a practical solution to ensure logs are preserved even when instances are terminated.

Problem Scenario

Assume your application writes logs to the following directory on a Windows instance.

C:\Source\Application\web\logs

Traffic to the application is variable. At low traffic, two EC2 instances may be sufficient. During peak traffic, the Auto Scaling Group may scale out to twenty or more instances.

When traffic increases, new EC2 instances are launched and logs are generated normally. However, when traffic drops, Auto Scaling triggers scale-down events and terminates instances. When an instance is terminated, all logs stored locally on that instance are lost.

This makes post-incident debugging and auditing difficult.

Solution Overview

The goal is to synchronize logs from terminating EC2 instances before they are fully removed.

This solution uses AWS services to trigger a PowerShell script through AWS Systems Manager at instance termination time. The script archives logs and uploads them to an S3 bucket with identifying information such as IP address and date.

To achieve this, two prerequisites are required.

  1. Systems Manager must be able to communicate with EC2 instances

  2. EC2 instances must have permission to write logs to Amazon S3

Environment Used

For this setup, the following AMI was used.

 
Microsoft Windows Server 2012 R2 Base
AMI ID: ami-0f7af6e605e2d2db5

Step 1 Configuring Systems Manager Access on EC2

SSM Agent is installed by default on Windows Server 2016 and on Windows Server 2003 to 2012 R2 AMIs published after November 2016.

For older Windows AMIs, EC2Config must be upgraded and SSM Agent installed alongside it.

The following PowerShell script upgrades EC2Config, installs SSM Agent, and installs AWS CLI.
Use this script only for instructional and controlled environments.

PowerShell Script to Install Required Components

 
# Create temporary directory if not present
if (!(Test-Path -Path C:\Tmp)) {
New-Item -ItemType Directory -Path C:\Tmp
}

Set-Location C:\Tmp

# Download installers
Invoke-WebRequest "https://s3.ap-south-1.amazonaws.com/asg-termination-logs/Ec2Install.exe" -OutFile Ec2Config.exe
Invoke-WebRequest "https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/windows_amd64/AmazonSSMAgentSetup.exe" -OutFile ssmagent.exe
Invoke-WebRequest "https://s3.amazonaws.com/aws-cli/AWSCLISetup.exe" -OutFile awscli.exe

# Install EC2Config
Start-Process C:\Tmp\Ec2Config.exe -ArgumentList "/Ec /S /v/qn" -Wait
Start-Sleep -Seconds 20

# Install AWS CLI
Start-Process C:\Tmp\awscli.exe -ArgumentList "/Ec /S /v/qn" -Wait
Start-Sleep -Seconds 20

# Install SSM Agent
Start-Process C:\Tmp\ssmagent.exe -ArgumentList "/Ec /S /v/qn" -Wait
Start-Sleep -Seconds 10

Restart-Service AmazonSSMAgent

Remove-Item C:\Tmp -Recurse -Force

IAM Role for Systems Manager

The EC2 instance must have an IAM role that allows it to communicate with Systems Manager.

Attach the following managed policy to the instance role.

 
AmazonEC2RoleforSSM

Once attached, the role should appear under the instance IAM configuration.

index

Step 2 Allowing EC2 to Write Logs to S3

The EC2 instance also needs permission to upload logs to S3.

Attach the following policy to the same IAM role.

 
AmazonS3FullAccess

In production environments, it is recommended to scope this permission to a specific bucket.

index

PowerShell Script for Log Archival and Upload

Save the following script as shown below.

 
C:\Scripts\termination.ps1

This script performs the following actions.

  • Creates a date-stamped directory

  • Archives application logs

  • Uploads the archive to an S3 bucket

Log Synchronization Script

 
$Date = Get-Date -Format yyyy-MM-dd
$InstanceName = "TerminationEc2"
$LocalIP = Invoke-RestMethod -Uri "http://169.254.169.254/latest/meta-data/local-ipv4"

$WorkDir = "C:\Users\Administrator\workdir\$InstanceName-$LocalIP-$Date\$Date"

if (Test-Path $WorkDir) {
Remove-Item $WorkDir -Recurse -Force
}

New-Item -ItemType Directory -Path $WorkDir

$SourcePathWeb = "C:\Source\Application\web\logs"
$DestFileWeb = "$WorkDir\logs.zip"

Add-Type -AssemblyName "System.IO.Compression.FileSystem"
[System.IO.Compression.ZipFile]::CreateFromDirectory($SourcePathWeb, $DestFileWeb)

& "C:\Program Files\Amazon\AWSCLI\bin\aws.cmd" s3 cp `
"C:\Users\Administrator\workdir" `
"s3://terminationec2" `
--recursive `
--region us-east-1

Once executed manually, the script should complete successfully and upload logs to the S3 bucket.

index

index

Running the Script Using Systems Manager

To automate execution, run this script using Systems Manager Run Command.

Select the target instance and choose the document.

 
AWS-RunPowerShellScript

Configure the following.

 
Commands: .\termination.ps1
Working Directory: C:\Scripts
Execution Timeout: 3600

Auto Scaling Group Preparation

Ensure the AMI used by the Auto Scaling Group includes all the above configurations.

Create an AMI from a configured EC2 instance and update the launch configuration or launch template.

For this tutorial, the Auto Scaling Group is named.

 
group_kaien

Configuring CloudWatch Event Rule

Create a CloudWatch Event rule to trigger when an instance is terminated.

Event Pattern

 
{
"source": ["aws.autoscaling"],
"detail-type": [
"EC2 Instance Terminate Successful",
"EC2 Instance-terminate Lifecycle Action"
],
"detail": {
"AutoScalingGroupName": ["group_kaien"]
}
}
 
index

Event Target Configuration

Set the target as Systems Manager Run Command.

 
Document: AWS-RunPowerShellScript
Target: Instance ID
Command: .\termination.ps1
Working Directory: C:\Scripts

This ensures that whenever an instance is terminated, the PowerShell script runs and synchronizes logs to S3 before shutdown.

index

Validation

Trigger scale-out and scale-down events by adjusting Auto Scaling policies.

When instances are terminated, logs should appear in the S3 bucket with correct date and instance identifiers.

index

Conclusion

This setup ensures that application logs are safely preserved even when EC2 instances are terminated by an Auto Scaling Group. Logs are archived with proper timestamps and instance information, making debugging and auditing much easier.

With this approach, log retention is automated, reliable, and scalable for enterprise AWS environments.

Stay tuned for more practical infrastructure solutions.

Git-Submodule

Rocket Science has always fascinated me, but one thing which totally blows my mind is the concept of modules aka. modular rockets. The literal definition of modules statesA modular rocket is a type of multistage rocket which features components that can be interchanged for specific mission requirements.” In simple terms, you can say that the Super Rocket depends upon those Submodules to get the things done.
Similarly is the case in the Software world, where super projects have multiple dependencies on other objects. And if we talk about managing projects Git can’t be ignored, Moreover Git has a concept of Submodules which is slightly inspired by the amazing rocket science of modules.

Hour of Need

Being a DevOps Specialist we need to do provisioning of the Infrastructure of our clients which is sometimes common for most of the clients. We decided to Automate it, which a DevOps is habitual of. Hence, Opstree Solutions initiated an Internal project named OSM. In which we create Ansible Roles of different opensource software with the contribution of each member of our organization. So that those roles can be used in the provisioning of the client’s infrastructure.
This makes the client projects dependent on our OSM. Which creates a problem statement to manage all dependencies which might get updated over the period. And to do that there is a lot of copy paste, deleting the repository and cloning them again to get the updated version, which is itself a hair-pulling task and obviously not the best practice.
Here comes the git-submodule as a modular rocket to take our Super Rocket to its destination.

Let’s Liftoff with Git-Submodules

A submodule is a repository embedded inside another repository. The submodule has its own history; the repository it is embedded in is called a superproject.

In simple terms, a submodule is a git repository inside a Superproject’s git repository, which has its own .git folder which contains all the information that is necessary for your project in version control and all the information about commits, remote repository address etc. It is like an attached repository inside your main repository, which can be used to reuse a code inside it as a “module“.
Let’s get a practical use case of submodules.
We have a client let’s call it “Armstrong” who needs few of our ansible roles of OSM for their provisioning of Infrastructure. Let’s have a look at their git repository below.

$    cd provisioner
$    ls -a
     .  ..  ansible  .git  inventory  jenkins  playbooks  README.md  roles
$    cd roles
$    ls -a
     apache  java   nginx  redis  tomcat
We can see in this Armstrong’s provisioner repository(a git repository) depends upon five roles which are available in OSM’s repository to help Armstrong to provision their infrastructure. So we’ll add submodules osm_java and others.

$    cd java
$    git submodule add -b armstrong git@gitlab.com:oosm/osm_java.git osm_java
     Cloning into './provisioner/roles/java/osm_java'...
     remote: Enumerating objects: 23, done.
     remote: Counting objects: 100% (23/23), done.
     remote: Compressing objects: 100% (17/17), done.
     remote: Total 23 (delta 3), reused 0 (delta 0)
     Receiving objects: 100% (23/23), done.
     Resolving deltas: 100% (3/3), done.

With the above command, we are adding a submodule named osm_java whose URL is git@gitlab.com:oosm/osm_java.git and branch is armstrong. The name of the branch is coined armstrong because to keep the configuration of each of our client’s requirement isolated, we created individual branches of OSM’s repositories on the basis of client name.
Now if take a look at our superproject provisioner we can see a file named .gitmodules which has the information regarding the submodules.

$    cd provisioner
$    ls -a
     .  ..  ansible  .git  .gitmodules  inventory  jenkins  playbooks  README.md  roles
$    cat .gitmodules
     [submodule "roles/java/osm_java"]
     path = roles/java/osm_java
     url = git@gitlab.com:oosm/osm_java.git
     branch = armstrong

Here you can clearly see that a submodule osm_java has been attached to the superproject provisioner.

What if there was no submodule?

If that was a case, then we need to clone the repository from osm and paste it to the provisioner then add & commit it to the provisioner phew….. that would also have worked.
But what if there is some update has been made in the osm_java which have to be used in provisioner, we can not easily sync with the OSM. We would need to delete osm_java, again clone, copy, and paste in the provisioner which sounds clumsy and not a best way to automate the process.
Being a osm_java as a submodule we can easily update that this dependency without messing up the things.

$    git submodule status
     -d3bf24ff3335d8095e1f6a82b0a0a78a5baa5fda roles/java/osm_java
$    git submodule update --remote
     remote: Enumerating objects: 3, done.
     remote: Counting objects: 100% (3/3), done.
     remote: Total 2 (delta 0), reused 2 (delta 0), pack-reused 0
     Unpacking objects: 100% (2/2), done.
     From git@gitlab.com:oosm/osm_java.git     0564d78..04ca88b  armstrong     -> origin/armstrong
     Submodule path 'roles/java/osm_java': checked out '04ca88b1561237854f3eb361260c07824c453086'

By using the above update command we have successfully updated the submodule which actually pulled the changes from OSM’s origin armstrong branch.

What have we learned? 

In this blog post, we learned to make use of git-submodules to keep our dependent repositories updated with our super project, and not getting our hands dirty with gullible copy and paste.
Kick-off those practices which might ruin the fun, sit back and enjoy the automation.

Referred links:
Image: google.com
Documentation: https://git-scm.com/docs/gitsubmodules