
40+ Frequently asked questions in a Solutions Architect Interview

These frequently asked questions for an AWS Solutions Architect can help you in many ways: they can help you prepare for your next AWS interview, or simply keep you well placed among your peers when discussing the AWS cloud.

Tell me about yourself?

I am a professional Solutions Architect with over 15 years in IT and over 8 years of experience working as a Solutions Architect. I have architected various enterprise-level applications ranging from web applications to multi-threaded robotic applications. I have worked with stakeholders from businesses like Qantas, Virgin, Singapore Airlines, Etihad, Galileo, Sabre, and Amadeus, and with C-suite executives as well as project managers, functional analysts, and developers to implement various business-critical projects.

As a Solutions Architect, I fundamentally look after the whole design of a solution from concept to implementation. I possess a broad range of soft skills coupled with in-depth technical abilities, which makes me confident as a Solutions Architect, and I am sure I can provide a positive return on your investment.

The technology stack I have used during my career includes: as a Software Developer – .NET (from 1.1 to Core) with C#, VB, JavaScript, and HTML, and databases such as MS SQL Server, Oracle, Sybase, and MySQL; as a Solutions Architect – AWS services such as EC2, S3, CloudFront, CloudFormation, ELB, Route 53, and so on; for diagrams – Visio, draw.io, and Lucidchart; for code review – Veracode; for system monitoring – Nagios; for project management – MS Project; and for work delegation – Agile and Kanban methodologies with Bitrix24 and HubSpot.

What role does a Solutions Architect perform?

A Solutions Architect bridges the gap between technology solutions and business problems. The Solutions Architect leads the effort to integrate IT systems so that they conform to the organization's requirements.

A Solutions Architect owns the business problem end to end: understanding customer problems, identifying the IT technologies to use to solve them, initiating discussions in IT steering committees to choose the best technology options, working with the development team to make sure they follow the appropriate design patterns, designing a deployment architecture, and implementing the system in coordination with the implementation team.

What are your biggest weaknesses and strengths?

To be honest, early in my career as a Solutions Architect, I had so much curiosity and energy that I said yes to projects without understanding their depth or my ability to deliver. At some point I was juggling so many projects at once that I was working nights and weekends to complete them on time. This started to take a toll on me and my family life; I became more stressed, which eventually impacted the quality of my work.

At one point I realized I had serious issues to fix: my mental health was not in order, and my family life and my work performance were being impacted.

To fix this, I took part in mental health workshops, used a workload management tool (Bitrix24), and started distributing the workload among team members, which significantly improved my productivity and freed up my mind.

As for strengths, one of my strengths as a Solutions Architect is that I am always looking to improve. I am keen on learning about new technologies, which keeps me up to date with the latest technology trends so that I can apply them when designing solutions.

Which area of the job do you find very challenging?

With technology advancing rapidly, I anticipate a challenge where some technological solutions under development become obsolete before they are delivered to the client.

To overcome that, I intend to keep up to date with the newest technology advancements and engage with teams to identify the right technological designs to implement for a particular project.

Can you give me an example of a situation where you insisted on the highest standards?

(Behavioral question – Display resilience and courage – be open and honest – prepare to express views – willing to accept and commit to change)

I was working on an important web-based project that had to be completed on a tight schedule to meet demand for the upcoming Christmas holidays. At one point I identified that the product development team was trying to cut corners to meet the deadlines.

Whilst I could understand their reasons for pushing forward the launch date, I raised serious concerns that, unless we followed protocol and completed the project to my proposed specifications and solutions, we would encounter problems later on down the line due to the vulnerability of some aspects of the website.

I suggested that we maintain exacting standards but look into increasing the team size so that we could still attempt to meet the desired launch date. My proposal was appreciated by the CIO, the team size was increased, the development team met its deadlines without cutting corners, and the project was a success.

Tell me about a time at work when your integrity was challenged. How did you handle it and what was the outcome?

S – While I was working as a Senior Software Developer, one of the web servers hosting business-critical websites my team had implemented needed a security patch installed, which required a restart.

T – A change request was raised and I was included in it; my role was to be available for one hour at midnight to test, monitor, and make sure all the websites were up and running after the web server was restarted.

A – I woke up 20 minutes before midnight and tried to connect to the server, but my VPN was not working and I couldn't access it. I notified the team responsible for the patch upgrade and provided instructions on how to perform basic tests on the websites. The next day, however, I found out that an issue had been raised in the CID log (Changes, Issues, Decisions) challenging my integrity.

I didn't react immediately and remained calm. To prove my honesty, I contacted the network team managing the VPN and asked them to provide a written reason for the VPN failure on my laptop along with the logs of the VPN attempts from my laptop. I then presented my defense to the stakeholders, along with the proof from the network team.

R – My immediate manager and the CIO became aware of my honesty, and during a fortnightly Friday casual team meeting, a glass was raised to my integrity and to my handling of the situation without conflict.

Tell me about a time you drove a group of people towards a common goal, overcoming obstacles and difficulties

In my previous role, I was leading a team to complete a project that was tasked to refund over $20 million worth of Airline tickets and hotel bookings.

S – Due to COVID, refund requests were overwhelming the call centers and the backlog of requests stretched back over a month. The call center team was forced to hire casual employees to handle the increased workload, but was still struggling to cope.

T – I was given the task of leading a team to complete the project within a month and clear all the backlogs within 15 days using an automated system. As a Solutions Architect, it was my responsibility to find the best solution, design a system that could be developed by a small team of 8 members, implement it, and complete the task on time. I had to lead the limited team to work efficiently against tight deadlines and make sure the team members worked in coordination with a full understanding of the system; one small misunderstanding would cost our team valuable time.

A – I made sure all the team members were with me through the whole Solutions Architecture process. We adopted Agile methodologies, building the product in short cycles of work that allowed for rapid development and constant revision.

We analyzed the system using the C4 model (Context, Containers, Components, Code) plus a deployment view: we identified all the systems the solution had to communicate with, identified which containers to build, broke each container down into components, identified the technologies to use for development, and finally designed a cloud architecture for deployment.

R – We completed the build ahead of time, which gave us plenty of time to test the system; we implemented the project and cleared the refund backlog within 15 days. The call center team was able to reduce casual staff by 60%, resulting in huge savings during a period of revenue constraints.

Tell me about a time you took the initiative to solve a problem that was not originally in your scope of work?

In my previous role, I got the opportunity to act as project manager on a small business acquisition project worth $4M. The project manager who was meant to run the project was on annual leave, and the project had to be completed ASAP: it involved acquiring a business-critical piece of software so that a system could be upgraded to the latest technology.

Although it was out of my scope of work, I stepped up and took the initiative to act as project manager for the project.

I managed project timelines using the WBS approach – breaking each work package into tasks, determining project dependencies, estimating the time needed for each task, identifying resource availability, and identifying project milestones.

I identified the stakeholders of the project and communicated the project strategy and objectives with the various stakeholders (from our organization and the UK-based company).

I identified and defined project risks, implemented mitigation strategies using a project risk status matrix, and monitored project goals through KPIs.

The key challenge in the newly acquired project manager role was that I had to travel between Sydney and London on multiple occasions. That travel ate into valuable project time that had not been factored into the schedule at the start of the project, so I had to work extra hours over weekends to stay on the timeline.

I completed the acquisition deal successfully and our organization came to own the software required for its key ticketing operations. Later that year the same application was migrated to the latest technology, and it currently processes $2 billion worth of transactions every year. I feel proud to see a system I helped acquire grow that big.

Can you describe a time when you have gone above and beyond to please your customer? What was the situation and what did you do?

S – One of our customers (a retail travel agent) uses the customized web-based applications we provide as software as a service, and customers are charged for the components they subscribe to. Her admin login was left active, and one of her new employees used the same terminal to serve a client and unknowingly subscribed to services they were not using. At the end of the month, the customer was surprised to see the charges for the additional components.

T – I had worked directly with the customer in the past, and she contacted me to let me know about her issue. Even though I was not responsible for the customer's account, I knew I had to do something to give her some relief from her mishap.

A – I contacted the third-level customer support team to provide me with a usage log report for the customer, to find out what services she had actually used during the billing period. I then contacted the accounts department manager and explained the situation, along with the access log report. The accounts manager understood the problem and asked me to file a refund request on behalf of the customer. I sought permission from the customer, raised the refund request, and it was approved.

R – As a result, the customer was very happy and sent a personal thank-you gift hamper along with a very long email. I felt happy for her and felt a sense of achievement.

Can you recall a time when you experienced conflict with a co-worker? or Tell me about a recent problem you faced and how did you resolve it?

(STAR- Situation (what happened) – task (what was your task or role) – action (what you did to fix) – Result (outcome))

I was working as a Project Manager/Solutions Architect on a tightly scheduled application development project when one developer was consistently delivering his tasks late. When I approached him about it, he initially reacted defensively.

I kept calm, appreciated his efforts, and explained that the project deadline was challenging. I asked how I could assist him to help improve his performance. He then calmed down and explained that the workload from his parallel day-to-day tasks had increased in recent days, so he couldn't focus on the project tasks.

I then arranged a meeting with his immediate manager and we were able to come to a resolution that made the developer’s workload much more manageable. For the remainder of the project, the developer delivered great performance and was on time.

What is your most significant work achievement, and how did you accomplish this?

S – My biggest achievement was during the move from a Software Developer role to a Solutions Architect role. I had to work with the AWS cloud to migrate various applications to the cloud; some were mission-critical and some were reporting sites. We migrated the sites successfully with the help of AWS technical teams, but it was then my responsibility to manage the architecture to make it cost-efficient, highly scalable, resilient, and fault-tolerant.

T – I had no prior knowledge of AWS, and I knew there was more to the AWS cloud than just hosting websites as on a physical web server. I had to upskill myself to be able to continue working as a Solutions Architect.

A – I decided to study the AWS cloud. Initially I watched more than 100 hours of YouTube videos about different aspects of AWS, but I still hadn't gained an overall picture of the cloud and all that study wasn't amounting to much. I finally decided this would not work unless I got certified, so I joined an online certification course, completed it, and passed the certification.

R – After completing the certification, I realized there was so much to improve in our server environment that would save cost and make our servers high-performing. This was the biggest achievement of my career.

How do you approach if you have to choose between two different options or ideas (or how do you approach if someone disagrees with your idea)?

Working with people is the major role of a Solutions Architect. While working with people there will be disagreements or conflicting ideas, as different people have different thought processes.

When I encounter a disagreement or conflicting ideas, I remain calm and note down or understand the idea that has been presented. Then I ask the other person to give me some time and let them know I will come back to them later (same day or next). I go back to my desk and research the idea that has been presented, I will find out all the positives/negatives of implementing the idea. I will then do the same for my idea or the other option.

The next day I will go back to the person and present both sets of research and ideas, and together we come to a conclusion on which idea to adopt or implement.

Can you tell me about the governance process around the area (Solution Architecture) while working on your previous roles?

( Architecture Governance must ensure that: The Architecture representation of proposed concepts and strategies matches the future needs of the organization. The Architecture role is clear in governing the architecture alignment of projects)

Our organization is not a large enterprise with a core Enterprise Architecture team enforcing strict architecture governance principles. However, architecture governance is delivered by following these procedures:

  1. Functional analysts, in coordination with Solutions Architects, gather the business requirements.
  2. A Solutions Architect presents a high-level design (HLD), with the cost benefits of the different technology choices, to the IT steering committee. All the pros and cons of the project are discussed and debated during the IT steering committee meeting.
  3. The IT steering committee decides on the appropriate technology option and approves or disapproves the project.
  4. The Solutions Architect guides the project until delivery, working with the development and implementation teams and coordinating with the customers.

During the project’s progress, various standards have to be maintained such as documentation standard, coding standard, change management, incident management, and meeting various compliance requirements.

Explain how you decide which technology to use to solve a given problem?

There are various ways to identify a business's problems and decide which technology to use to solve them.

  1. First, I calculate the impact of the problem by asking questions like:
    1. Which processes, departments, or operations does the problem affect?
    2. How much money is the problem costing annually?
    3. What opportunities is the organisation missing? (opportunity cost)
    4. Is the problem likely to worsen over time or stay the same?
  2. Secondly, I talk to team members affected by the problem, understand their pain points, and identify the specific challenges they experience. Then I seek advice from SMEs (subject matter experts) to get different viewpoints about the problem and prospective solutions.
  3. Then I create a list of precisely what's needed in a solution.
  4. I look for solutions or technologies that can solve the problem. Does the technology already exist in the market, or will we have to build it from scratch?
  5. Once the technology options are narrowed down, I think about scalability. Is this technology able to scale up when the business grows, or would we need to replace it?
  6. Then I perform a cost-benefit analysis of the narrowed-down technologies, find competing vendors supplying the technology, and get different quotations. I figure out the strengths and weaknesses of choosing one vendor against another and discuss with the team and relevant stakeholders to come to a final decision on which technology to choose.

What is Cloud Computing?

Cloud computing is the on-demand delivery of IT resources over the internet with pay-as-you-go pricing. Instead of buying, owning, and maintaining infrastructure, you can access services such as computing power, storage, and databases on an as-needed basis from cloud providers like AWS, Azure, Google Cloud, and so on.

What is the difference between a Hybrid Cloud and a Full Cloud environment? and what are the advantages or disadvantages of each?

In a hybrid cloud environment, the organization has its own data centers consisting of servers, compute, storage, and network security, and also uses the cloud for certain tasks. The cloud and the organization's data centers are connected so that the organization can get the benefits of the cloud while still using its existing infrastructure. A hybrid approach can be beneficial for applications that require extremely low latency: hosting those applications in the local data centers provides a higher level of performance than hosting them in the cloud, where there is some latency due to the network lag between the data centers and the cloud. Stock-broking applications are one example.

In a full cloud environment, all the application load is hosted in the cloud which saves a lot of money as the organization won’t have to manage the infrastructure. Organizations will only have to pay for the resources that they use and can scale up or down their resources on a need basis.

What is the difference between IOPS and Throughput while working with storage in AWS?

IOPS measures the number of read and write operations per second, while throughput measures the amount of data read or written per second.

IOPS: It determines how frequently, or how fast, you can read from and write to the disk; it is the speed of disk access. IOPS is mainly related to latency: the higher the IOPS, the lower the latency (an inverse relationship). For example, databases need extreme speed for reads and writes and so require higher IOPS.

Throughput: It is the amount of data you can read from or write to the disk in one second. For example, video editors need to process big chunks of video files and so need higher throughput, but they can tolerate a little latency.
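As a rough back-of-the-envelope illustration (assuming a fixed I/O block size, which is a simplification), throughput is approximately IOPS multiplied by the block size:

    # Rough relationship (simplification for illustration): throughput ~ IOPS x block size
    iops = 3000           # read/write operations per second
    block_size_kib = 256  # size of each I/O operation in KiB (assumed)

    throughput_mib_per_s = iops * block_size_kib / 1024
    print(f"~{throughput_mib_per_s:.0f} MiB/s")  # 3000 * 256 KiB per second is about 750 MiB/s

So a workload doing many small random I/Os is IOPS-bound (databases), while a workload streaming large blocks is throughput-bound (video processing).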

What is the difference between RAID 0, RAID 1, RAID 5, and RAID 10 storage?

RAID: Redundant Array of Independent Disks. RAID is a way of logically putting multiple general-purpose HDDs (Hard Disk Drives) into an array to work together to create large reliable data stores.

RAID 0 (striping): This is the most basic kind of RAID and is called striping. Data is evenly split across two or more disks to improve performance, but it doesn’t provide fault tolerance or redundancy. The failure of one drive will cause the entire array to fail; as a result of having data striped (split) across all disks, the failure will result in total data loss. RAID 0 is chosen in the scenarios where speed is the ultimate requirement and the system can tolerate data loss.

RAID 1 (mirror): It contains the exact copy of a set of data on two or more disks. The data is copied from one drive to the other drive in real-time, so if one of the drives is lost, then another drive has all the data. The disadvantage of this type of RAID is it won’t provide an increase in capacity or speed but it provides a high level of availability and redundancy. This type of layout is used when read performance or reliability is more important than write performance or the resulting data storage capacity.

RAID 5 (striping with distributed parity): RAID 5 combines disk striping with parity (instead of mirroring) for data redundancy. It is designed to handle a single drive failure and is "hot-swappable", which means that if a drive fails in the array, it can be swapped without interruption. RAID 5 is the most common form of RAID used in enterprises, but AWS doesn't recommend running it on their network.

For example, in a 3-disk RAID 5 array the parity rotates: in the first stripe, disks 1 and 2 get data and disk 3 gets parity; in the next stripe, disks 2 and 3 get data and disk 1 gets parity; in the next, disks 3 and 1 get data and disk 2 gets parity.

RAID 5 gives good throughput and great redundancy, with a small latency penalty because parity has to be written across the disks.
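A toy sketch of that layout (not AWS-specific; the exact rotation order varies by implementation) showing how data blocks and the rotating parity block are spread across a 3-disk RAID 5 array:

    # Toy illustration of logical blocks mapped onto a 3-disk RAID 5 array,
    # with the parity block rotating to a different disk on each stripe.
    disks = 3

    def raid5_layout(num_stripes):
        layout = []
        block = 0
        for stripe in range(num_stripes):
            parity_disk = (disks - 1 - stripe) % disks  # parity rotates each stripe
            row = []
            for disk in range(disks):
                if disk == parity_disk:
                    row.append("P")          # parity block
                else:
                    row.append(f"D{block}")  # data block
                    block += 1
            layout.append(row)
        return layout

    for stripe in raid5_layout(3):
        print(stripe)
    # ['D0', 'D1', 'P']
    # ['D2', 'P', 'D3']
    # ['P', 'D4', 'D5']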

RAID 10 (RAID 1 + RAID 0): It combines the redundancy and availability of mirroring with the high performance and speed of RAID 0 (striping). The only downside is that it requires double the number of disks, so it gets very expensive very quickly.

What is the difference between Network NACL and Security group?

Network access control lists (NACLs): A network ACL keeps unwanted traffic out of a subnet. NACLs act as a firewall for associated subnets, controlling both inbound and outbound traffic at the subnet level. NACLs are stateless, which means they do not track the state of a connection: traffic allowed in is not remembered, so the NACL would not recognize the corresponding traffic trying to go back out. Therefore, NACL rules need to be applied in both directions.

Security group (SG): A security group keeps unwanted traffic out of a host. An SG acts as a firewall for associated EC2 instances, controlling both inbound and outbound traffic at the instance level. SGs are stateful, which means they track the traffic coming in and automatically allow the corresponding return traffic out, so you don't have to add a matching rule in the other direction.

The main difference to know is that a NACL is stateless and protects the subnet, whereas a security group is stateful and protects the server.
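A minimal boto3 sketch of the practical consequence, assuming hypothetical security group and NACL IDs: the stateful security group needs only an inbound rule, while the stateless NACL needs explicit entries in both directions (the outbound entry covering the ephemeral return ports):

    import boto3

    ec2 = boto3.client("ec2", region_name="ap-southeast-2")  # region is an assumption

    # Security group (stateful): one inbound rule is enough, return traffic is allowed automatically.
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # hypothetical ID
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        }],
    )

    # Network ACL (stateless): inbound AND outbound entries are both required.
    ec2.create_network_acl_entry(
        NetworkAclId="acl-0123456789abcdef0",  # hypothetical ID
        RuleNumber=100, Protocol="6", RuleAction="allow", Egress=False,
        CidrBlock="0.0.0.0/0", PortRange={"From": 443, "To": 443},
    )
    ec2.create_network_acl_entry(
        NetworkAclId="acl-0123456789abcdef0",
        RuleNumber=100, Protocol="6", RuleAction="allow", Egress=True,
        CidrBlock="0.0.0.0/0", PortRange={"From": 1024, "To": 65535},  # ephemeral return ports
    )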

What is CIDR?

CIDR or Classless Inter-Domain Routing is a method for allocating IP addresses more efficiently to allow for a flexible and simplified way to identify IP addresses and route network traffic. CIDR is basically saying we are no longer using the classful boundaries, meaning Class A/8, Class B/16, Class C/24. Instead, we are going to subnet them down and make the subnets the size that we actually need them. For example, 10.10.10.0/24 (1 subnet, 256 hosts) can be divided into /25 (2 subnets, 128 hosts ), /26 (4 subnets, 64 hosts), /27 (8 subnets, 32 hosts) and so on. Check out the following posts to know more about CIDR:

10 things to consider while setting up AWS VPC

What is the IP address or CIDR range
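A quick way to play with the CIDR math above is Python's standard ipaddress module; for example, splitting the /24 into four /26 subnets:

    import ipaddress

    network = ipaddress.ip_network("10.10.10.0/24")
    print(network.num_addresses)  # 256 addresses in the /24

    # Split the /24 into four /26 subnets of 64 addresses each.
    for subnet in network.subnets(new_prefix=26):
        print(subnet, subnet.num_addresses)
    # 10.10.10.0/26 64
    # 10.10.10.64/26 64
    # 10.10.10.128/26 64
    # 10.10.10.192/26 64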

What is the difference between public and private subnets?

Public subnets are subnets that are exposed to the public internet through an internet gateway. Public subnets have both inbound and outbound access to the internet, and you normally host public-facing applications like websites or APIs there.

Private subnets, as the name suggests, don't have direct internet access; usually they can only reach the internet outbound through a NAT gateway. The application layer and the database layer, which need to be kept safe and secure, can be hosted in private subnets.

Describe the AWS shared responsibility model?

This is about knowing which things AWS manages and which things organizations have to manage themselves.

AWS basically manages the network and its data centers. This means AWS is responsible for protecting the infrastructure that runs all the services offered in the AWS cloud, comprising hardware, software, networking, and maintenance. AWS's responsibility is "Security of the Cloud".

Organizations are responsible for managing their own applications and their security, without worrying about the underlying infrastructure where they host their applications. The customer's responsibility is "Security in the Cloud".

How can you make your application scalable for a big traffic day?

Put the EC2 instances in an Auto Scaling group with an Elastic Load Balancer distributing the load between them.

For known big-traffic days like Christmas or Easter there can be a high burst of traffic, and if you let the Elastic Load Balancer scale up naturally it might not keep up with the traffic volume, so it is better to contact AWS in advance to "pre-warm" the load balancer.

Similarly, you can implement a scheduled scaling policy on the Auto Scaling group to increase EC2 capacity just before the expected high-traffic days and scale down after the event is complete. For EC2 instances that need to scale up or down frequently, use lightweight AMIs with minimal resources so that the cooldown period is minimal and scaling is fast.
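A minimal boto3 sketch of such scheduled scaling, assuming a hypothetical Auto Scaling group name and example dates around the Christmas peak:

    import boto3
    from datetime import datetime, timezone

    autoscaling = boto3.client("autoscaling")

    # Scale out ahead of the expected Christmas peak, scale back in afterwards.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="web-asg",            # hypothetical ASG name
        ScheduledActionName="christmas-scale-out",
        StartTime=datetime(2021, 12, 23, 18, 0, tzinfo=timezone.utc),  # example dates
        MinSize=4, DesiredCapacity=8, MaxSize=16,
    )
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="web-asg",
        ScheduledActionName="christmas-scale-in",
        StartTime=datetime(2021, 12, 27, 6, 0, tzinfo=timezone.utc),
        MinSize=2, DesiredCapacity=2, MaxSize=16,
    )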

For database performance, you can use a database proxy (e.g., RDS Proxy). Amazon RDS Proxy allows applications to pool and share connections established with the database, improving database efficiency and application scalability.

Use AWS Infrastructure Event Management (IEM) for known events to help assess operational readiness, identify and mitigate risks, and execute the event confidently with AWS experts by your side.

What is your disaster recovery (DR) procedure for AWS cloud application?

Disaster recovery procedures may differ as per the organization’s RTO (Recovery time objective) or RPO (Recovery point objective).

RPO limits the maximum allowable amount of data loss by defining the backup frequency; if you back up every hour, then the maximum data loss will be one hour.

RTO is related to downtime and represents how long it takes to restore the workload to a fully functional state.

There are 4 kinds of DR strategies: (know more here – Disaster Recovery options in the Cloud)

1. Backup & Restore: This approach can also be used to mitigate a regional disaster by replicating data to other AWS Regions, or to mitigate the lack of redundancy for workloads deployed to a single Availability Zone. (A snapshot-copy sketch appears after the four strategies.)

2. Pilot Light: Resources required to support data replication and back-ups, such as databases and object storage, are always on. Other elements, such as application servers, are loaded with application code and configurations but are switched off and are only used during testing or when disaster recovery failover is invoked.

3. Warm standby: The warm standby approach involves ensuring that there is a scaled-down, but a fully functional, copy of your production environment in another Region. This approach extends the pilot light concept and decreases the time to recovery because your workload is always on in another Region.

4. Multi-site active/active: Multi-site active/active serves traffic from all regions to which it is deployed, whereas hot standby serves traffic only from a single region, and the other Region(s) are only used for disaster recovery. With a multi-site active/active approach, users are able to access your workload in any of the Regions in which it is deployed. This approach is the most complex and costly approach to disaster recovery, but it can reduce your recovery time to near zero for most disasters.
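For the Backup & Restore strategy, cross-Region data replication can be as simple as copying EBS snapshots into the DR Region. A minimal boto3 sketch, assuming hypothetical Region names and a hypothetical snapshot ID:

    import boto3

    # Copy an EBS snapshot from the primary Region into a DR Region (Backup & Restore pattern).
    dr_ec2 = boto3.client("ec2", region_name="us-west-2")  # DR Region (assumption)
    dr_ec2.copy_snapshot(
        SourceRegion="ap-southeast-2",              # primary Region (assumption)
        SourceSnapshotId="snap-0123456789abcdef0",  # hypothetical snapshot ID
        Description="Nightly cross-Region copy for DR",
        Encrypted=True,                              # encrypt the copy in the DR Region
    )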

How do you secure your application on the cloud?

While thinking about securing the application in the cloud you need to think of defending every layer of your system.

Above VPC Layer:

Use CloudFront for caching and distributing traffic, as CloudFront has basic native DDoS protection. CloudFront has a huge network of edge cache servers and it is almost impossible to take down the whole of CloudFront; if CloudFront is down, it basically means the whole AWS cloud is down. CloudFront is managed by AWS, which gives our application base-level security.

Further, you can have a Web Application Firewall (WAF) configured on CloudFront. WAF can protect against botnet-type attacks by identifying and blocking traffic from bad IP addresses (a certain range of IP addresses, traffic from a certain country, certain HTTP headers, or certain patterns of traffic). It will also protect against attacks such as cross-site scripting and SQL injection.

If traffic gets past CloudFront and WAF, then we have a defense at the VPC layer.

VPC Layer

You can use NACLs (Network Access Control Lists) at the subnet level, where you configure only the IP ranges you want to allow for inbound as well as outbound traffic. You can configure the database layer in private subnets to communicate only with the application layer, and the application layer to talk only to certain public subnets.

Instance level

At the instance level, such as EC2, a Lambda function, a container, or a virtual machine, we can implement security groups. Allow only the ports, protocols, and IP addresses that are required by those instances. By default a security group is implicit deny (deny all); you then specify which ports, protocols, or IP addresses are allowed into that instance.

Let's go one level further and say the intruder got past the security group too.

Then we fall back on authentication and authorisation security.

Authentication and Authorisation

At the user level, you can use IAM policies and implement secure authentication by utilizing MFA (multi-factor authentication). After login there is the authorization part: you can protect your servers and applications by implementing the principle of least privilege. This means only allowing the user to do the minimal tasks required on the server; say, a database admin can do CRUD operations on the tables, but a developer might only need to insert and read from the database.

While transferring and storing data

Secure data in transit by using SSL/TLS (HTTPS) certificates, and for data at rest utilize AWS KMS encryption.
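For example, a minimal boto3 sketch of encrypting an object at rest with a KMS key on upload (the bucket name, object key, and KMS alias are hypothetical); the upload itself travels over HTTPS/TLS:

    import boto3

    s3 = boto3.client("s3")

    # Encrypt the object at rest with a KMS key (SSE-KMS).
    s3.put_object(
        Bucket="my-app-bucket",           # hypothetical bucket
        Key="reports/refunds.csv",        # hypothetical object key
        Body=b"sample,data",
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/my-app-key",   # hypothetical KMS key alias
    )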

Common checks for Security

  • Check your access keys / secret keys – are they being rotated?
  • Do you have a password policy in place? (a scripted spot-check is sketched after this list)
  • Are you rotating your passwords?
  • Are your databases in public or private subnets? If public, why?
  • In a three-tier architecture, find out whether the web server communicates directly with the database layer; if so, why?
  • Don't make S3 buckets public unless there is a specific requirement.
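A minimal boto3 sketch of a couple of these account-level checks; treat it as a starting point rather than a full audit:

    import boto3

    iam = boto3.client("iam")
    s3control = boto3.client("s3control")
    sts = boto3.client("sts")

    # Is an account password policy in place?
    try:
        policy = iam.get_account_password_policy()["PasswordPolicy"]
        print("Password policy:", policy)
    except iam.exceptions.NoSuchEntityException:
        print("No account password policy configured!")

    # Is public access blocked for S3 at the account level?
    # (This call errors if no public access block configuration exists at all.)
    account_id = sts.get_caller_identity()["Account"]
    pab = s3control.get_public_access_block(AccountId=account_id)
    print(pab["PublicAccessBlockConfiguration"])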

How to know or monitor if the security protocols are followed?

You can utilize CloudTrail, which is responsible for logging all API events happening in your AWS account. You can create a rule on CloudTrail events to monitor certain activity and automate the defense mechanism. Let's say an EC2 instance has a security group that is only supposed to be accessible through a bastion host, but someone changes the SG to make it public: the CloudTrail event can catch it and we can revoke the rule automatically with a Lambda function.
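A minimal Lambda sketch of that kind of auto-remediation, assuming an EventBridge rule that forwards CloudTrail AuthorizeSecurityGroupIngress events and assuming TCP rules with port ranges; the exact event shape should be verified against a real CloudTrail record before relying on it:

    import boto3

    ec2 = boto3.client("ec2")

    def lambda_handler(event, context):
        # Hypothetical remediation policy: revoke any rule that was just added
        # opening the security group to 0.0.0.0/0.
        params = event["detail"]["requestParameters"]
        group_id = params["groupId"]
        for perm in params["ipPermissions"]["items"]:
            for ip_range in perm.get("ipRanges", {}).get("items", []):
                if ip_range.get("cidrIp") == "0.0.0.0/0":
                    # Revoke the world-open rule.
                    ec2.revoke_security_group_ingress(
                        GroupId=group_id,
                        IpPermissions=[{
                            "IpProtocol": perm["ipProtocol"],
                            "FromPort": perm["fromPort"],
                            "ToPort": perm["toPort"],
                            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
                        }],
                    )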

You can use Amazon GuardDuty, which reduces risk through intelligent and continuous threat detection across your AWS accounts, data, and workloads. GuardDuty acts as a central, consolidated location for monitoring logs such as CloudTrail logs, VPC Flow Logs, and DNS logs. It uses threat intelligence feeds, such as lists of malicious IP addresses and domains, and machine learning to identify unexpected and potentially unauthorized or malicious activity within your AWS environment. GuardDuty can also send its findings to AWS Security Hub, which ultimately gives you a single dashboard to manage security findings and identify the highest-priority security issues across your AWS environment.

Major Security issues you have come across?

There were various account-level compromises that I witnessed while working as an AWS Solutions Architect; I will share some common ones.

In my early years working on the AWS cloud, one of our AWS accounts was compromised and within five minutes the intruder had launched more than 200 EC2 instances (20 in each of the 10 Regions that existed at the time); the billing dashboard was already showing thousands of dollars. This was picked up by the AWS team, who notified us about the unusual behavior in our account.

How can autoscaling help with DDoS protection?

In DDoS attacks, the attacker overwhelms the server with a huge amount of bogus traffic, blocking the service for legitimate users.

Let's say you have a web server that can handle 10,000 web requests per second and on average you get 5,000 requests per second. A DDoS attack flooding the web server with 30,000 requests per second would overwhelm it and prevent real customers from accessing it.

If you have auto-scaling set up to spin up additional servers when traffic increases – say it spins up 10 more instances, each handling 5,000 requests on average – the fleet can now handle 50,000 requests per second, which keeps your website up and running.

What are the biggest challenges you have faced on the cloud?

During the initial stages of my career as an AWS cloud architect, we had implemented a high-traffic web application on AWS cloud, the web application was utilizing 60% of CPU on average and was running smoothly for months. Christmas holidays were close and I was asked by my manager to prepare the server for the increased load. The load was expected to grow by 50% during the peak Christmas holiday ticketing periods.

I checked the server capabilities and found an Auto Scaling setup based on a weighted scaling policy. I assumed the Auto Scaling group would automatically scale up on the high-traffic days, but I was wrong: our servers didn't scale up as expected and our application suffered significant performance issues. Because of my oversight, the whole team had to work extended hours to fix the issues.

After that experience, I documented all the necessary processes – using a step scaling policy, pre-warming the Elastic Load Balancer, using ElastiCache and database proxies to improve database performance, and always utilizing AWS Infrastructure Event Management (IEM) for known high-traffic events – and made a standard-procedures checklist so that all future high-traffic events would follow those procedures.

The other challenge I faced was optimizing the cost of the AWS resources used. In my previous role, I was responsible for setting up an environment for the development of web-based MVC applications with a SQL Server database. We set up a highly available architecture with a replication server for the database, EC2 instances, and load balancers in two different AZs. The architecture was good, but the development teams were using these servers and incurring costs that were not necessary during the development phase.

After the quarterly billing cycle, I reviewed the costs associated with the resources in use, utilizing Amazon CloudWatch insights and AWS Compute Optimizer to check whether resources were being used at optimal levels. I found that all the big-ticket resources were being used by the development teams while the production servers were sitting idle. I used AWS Cost Explorer to get detailed information on the associated costs and noticed that the development resources were incurring the highest charges.

I then consulted the development team and found out that the project had been delayed and had not gone to production yet.

I immediately consulted my manager and initiated a change request to hibernate the production servers, disable Multi-AZ replication of the database, and change the development teams' instances to Spot Instances.

This initiative saved more than 60% of the cost incurred the previous quarter, and when it was time to launch the application in production we reactivated the hibernated servers and implemented it successfully.

What is AWS Service … (can be any from below)?

Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes.

S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance.

CloudFormation provides an easy way to model a collection of related AWS and third-party resources, provision them quickly and consistently, and manage them throughout their lifecycles, by treating infrastructure as code.

EKS (Amazon Elastic Kubernetes Service) is a managed service that you can use to run Kubernetes (the open-source container-orchestration system) on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes.

etc.

Can we use Visual Studio for coding AWS cloud formation?

We can download the AWS Toolkit for Visual Studio, which provides Visual Studio project templates that can be used as a starting point for AWS console and web applications. As your application runs, you can use the AWS Explorer to view the AWS resources used by the application.

How to encrypt existing EBS volumes?

When you launch an EC2 instance, the EBS volume is not encrypted by default (you can manually choose to encrypt the volume while launching the instance), and once the instance is launched there is no action on the EBS volume to encrypt it in place. To encrypt an existing EBS volume (a boto3 sketch of these steps follows the list):

  • Create a snapshot of the existing EBS volume by right-clicking the EBS volume.
  • From the snapshot you just created, create a volume, tick Encrypt this volume, and provide the master key for the volume.
  • You now have an encrypted volume ready to be attached to the EC2 instance; right-click the volume and attach it to the instance.
  • Once attached, you can detach the previous unencrypted volume. (Check how to Replace an Amazon EBS volume.)
  • Alternatively, go to the instance > Actions > Monitor and troubleshoot > Replace root volume.
  • Create a replacement task without selecting a snapshot.
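A minimal boto3 sketch of the snapshot-and-recreate path above (the volume ID, instance ID, Availability Zone, and device name are hypothetical):

    import boto3

    ec2 = boto3.client("ec2")

    # 1. Snapshot the existing unencrypted volume.
    snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",
                               Description="Pre-encryption snapshot")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # 2. Create a new, encrypted volume from that snapshot in the instance's AZ.
    encrypted_vol = ec2.create_volume(
        SnapshotId=snap["SnapshotId"],
        AvailabilityZone="ap-southeast-2a",  # assumption: same AZ as the instance
        Encrypted=True,
        KmsKeyId="alias/aws/ebs",            # default EBS key; swap for a CMK if needed
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[encrypted_vol["VolumeId"]])

    # 3. Detach the old volume and attach the encrypted one.
    ec2.detach_volume(VolumeId="vol-0123456789abcdef0")
    ec2.attach_volume(VolumeId=encrypted_vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0", Device="/dev/xvdf")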

What is the difference between Internet Gateway and NAT?

The Internet Gateway is the gateway for internet access; it provides both inbound and outbound traffic access. NAT, on the other hand, is used to allow one-way outbound traffic from a private subnet out through the public subnet. If application servers need internet access to download patches or software updates, we can use a NAT gateway.

How to restart EC2 automatically if its CPU utilization reaches 80% or more?

This can be achieved in several different ways, but one approach is to use CloudWatch alarms (a boto3 sketch of this alarm follows the two approaches below).

Go to the CloudWatch console and create an alarm.

  • Select EC2 per-instance metrics > pick the CPUUtilization metric for the EC2 instance
  • Statistic: Minimum > Period: 1 minute > Greater/Equal to 80
  • In Configure actions > EC2 action > Reboot the instance

The second approach is to configure CloudWatch Events:

  • Create a CloudWatch Events rule: Service Name > EC2, Event Type > EC2 Instance State-change Notification
  • Specific state > stopping, stopped; select Specific instance ID > provide the instance ID of the EC2 instance.
  • Add Target > Lambda function > write a Lambda function to restart the EC2 instance and select it here.
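A minimal boto3 sketch of the first approach, creating the CloudWatch alarm with the built-in EC2 reboot action (the instance ID is hypothetical; verify the reboot action ARN for your Region):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when CPUUtilization >= 80% and trigger the built-in EC2 reboot action.
    cloudwatch.put_metric_alarm(
        AlarmName="reboot-on-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical
        Statistic="Minimum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=80,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:automate:ap-southeast-2:ec2:reboot"],  # Region is an assumption
    )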

What is the difference between AMI and a Snapshot?

AMI (Amazon Machine Image) is a complete EC2 backup: it contains all the resources used by the EC2 instance, including the EBS volumes and the installed applications. A snapshot, on the other hand, is just an EBS volume backup. A snapshot can be used to create a volume that can be attached to or detached from an EC2 instance, whereas an AMI is a complete package covering the EC2 compute configuration, all of its attached volumes, and networking.

What is high availability, how do you achieve it in EC2?

High availability means keeping applications available even during disasters. We can implement high availability by setting up EC2 instances in multiple AZs behind a load balancer, and by implementing Auto Scaling on the EC2 instances so that if traffic increases, new EC2 instances are automatically provisioned based on the metrics configured for the Auto Scaling group.

What is the difference between Launch template and launch configuration?

A launch template is similar to a launch configuration, which Auto Scaling groups can use to launch EC2 instances, but it has additional benefits over a launch configuration. A launch template allows you to create a saved instance configuration that can be reused, shared, and launched at a later time, and launch templates can have multiple versions.

A launch template contains information such as the AMI, instance type, key pair, VPC information, storage, and network interfaces so that you do not need to define these parameters every time you launch a new instance. The launch template is the newer replacement for the launch configuration, and AWS recommends using launch templates over launch configurations.

What is the difference between Spot Instance and Reserved Instance?

Reserved Instances are instances reserved for a longer-term commitment (1 or 3 years); you have the choice to pay all up-front, partially up-front, or nothing up-front, and you can have dedicated tenancy if required. For Reserved Instances, AWS provides significant discounts compared to On-Demand Instances. If you have mission-critical applications that run 24 hours a day for a year or more, a Reserved Instance is the best option.

Spot Instances are allocated by AWS from its unused capacity, which AWS sells via the Spot marketplace at very cheap prices (you can get up to 90% cost savings compared to On-Demand Instances). The downside is that the price of Spot Instances changes frequently, and if it rises above your target price your instances can be terminated. (The best use cases for Spot Instances are development work or processing non-critical workloads where interruptions can be tolerated, or using a small percentage of Spot Instances in an Auto Scaling group.)

How to stop and not terminate Spot instances when there are interruptions?

When purchasing the Spot Instance there is an option to select a "persistent request", and for the interruption behavior you can specify "stop". This provides time to move data off the Spot Instances and/or request a new Spot Instance.
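A minimal boto3 sketch of such a request (the AMI, instance type, and key pair are hypothetical):

    import boto3

    ec2 = boto3.client("ec2")

    # Persistent Spot request whose instances are stopped (not terminated) on interruption.
    ec2.request_spot_instances(
        InstanceCount=1,
        Type="persistent",                    # the request re-opens after an interruption
        InstanceInterruptionBehavior="stop",  # stop instead of terminate
        LaunchSpecification={
            "ImageId": "ami-0123456789abcdef0",  # hypothetical AMI
            "InstanceType": "t3.large",
            "KeyName": "my-key",                 # hypothetical key pair
        },
    )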

How do you monitor memory utilization of EC2?

By default, CloudWatch doesn't monitor the memory utilization of EC2 instances. You can download and install the CloudWatch agent on the EC2 instance (sudo yum install amazon-cloudwatch-agent) and have it send memory metrics to CloudWatch to monitor memory utilization.

What are the tools or techniques that can be utilized in AWS to optimize cost of the resources used?

When we talk about cost optimization, we need to understand that the most cost-consuming services in the cloud are compute (EC2, containers, API, Lambda, etc.), which consumes around 70% of the cost, with the remaining 30% going to storage and networking.

The starting point for optimizing cost is the AWS Compute Optimizer service, which recommends optimal AWS resources for your workloads to reduce costs and improve performance, using machine learning to analyze historical utilization metrics. It provides good recommendations on instances that are over- or under-utilized.

For example:

If you are using a 20-CPU instance but only utilizing 2 or 3 CPUs, you can downgrade the instance to save cost.

In another use case, you might be running a server 24/7 that utilizes all 20 CPUs on average on an On-Demand Instance; in that scenario it is best to get Reserved Instances (or a Savings Plan) to save cost.

Another use case is a workload that only runs Monday to Friday (9 am to 6 pm). You don't want those compute resources running during idle hours or over the weekend, so you can utilize the AWS Instance Scheduler to run the instances and RDS databases only on the scheduled times and days.

In terms of storage, in my experience, I have noticed that most of the storage is consumed by snapshots/ backup data or log files. You can go through the policies of saving the snapshots and identify the snapshots that may not be necessary or look into logs generated by AWS or applications and identify up to what level the logs are necessary. You can save costs by either archiving them or deleting the unnecessary backup or log data.

The tools that you can utilize to monitor costs can be:

  • Billing Dashboard – check the Top Free Tier Services usage table and figure out which services are costing you more.
  • AWS Cost Explorer – it helps to visualise the costs of the services used; it can report the past 13 months of costs associated with the resources used and also forecast the likely cost for the next 3 months (a boto3 sketch follows this list).
  • AWS Budgets – you can set up budgets and get notified if spending is about to exceed the budgeted amount, then take steps to reduce the cost of the resources that month.
  • Tag and allocate costs – you can assign a label to every AWS resource and organise them to keep track of your AWS costs; then, by looking at the tags, you know which department or which servers are exceeding their monthly budgets.
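A minimal boto3 sketch of pulling Cost Explorer data grouped by service, as mentioned above (the dates are just an example three-month window):

    import boto3

    ce = boto3.client("ce")  # Cost Explorer

    # Last three months of cost, grouped by service, to spot the big-ticket resources.
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2021-07-01", "End": "2021-10-01"},  # example dates
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    for result in response["ResultsByTime"]:
        for group in result["Groups"]:
            print(result["TimePeriod"]["Start"], group["Keys"][0],
                  group["Metrics"]["UnblendedCost"]["Amount"])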

What factors will you consider while migrating to AWS?

Before moving to the AWS cloud it is important to understand the business perspective, then move to the technical details. The first step is to identify the business requirement: is cost the priority? Is performance the priority? Or are high availability and flexibility the main focus of the business?

The next step is to identify the technical problems and find the solutions, for example:

Problem: You have terabytes of data, with many gigabytes of changes made every day. How would you migrate the data to AWS?

Solutions: Think about services like AWS DataSync and AWS Database Migration Service (for databases), AWS Storage Gateway (File Gateway, Tape Gateway, Volume Gateway), utilize VPC peering and AWS Direct Connect to accelerate network transfers between data centers, or use the AWS Snowball or Snowmobile services.

What kind of query functionality does DynamoDB support?

DynamoDB supports GET/PUT operations using a user-defined primary key.

DynamoDB provides query flexibility by letting you query on non-primary key attributes using global secondary indexes and local secondary indexes.

A primary key can be either a single-attribute partition key or a composite partition-sort key.

DynamoDB indexes a composite partition-sort key as a partition key element and a sort key element.
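A minimal boto3 sketch of both query styles, assuming a hypothetical Bookings table with a customer_id/booking_date composite primary key and a status-index global secondary index:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("Bookings")  # hypothetical table

    # Query on the primary key: partition key plus a condition on the sort key.
    primary = table.query(
        KeyConditionExpression=Key("customer_id").eq("C-1001")
        & Key("booking_date").begins_with("2021-12")
    )

    # Query on a non-key attribute through a (hypothetical) global secondary index.
    by_status = table.query(
        IndexName="status-index",
        KeyConditionExpression=Key("status").eq("REFUND_PENDING"),
    )
    print(len(primary["Items"]), len(by_status["Items"]))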
