Responsible for the availability, performance, scaling, monitoring and incident response of the E-Commerce platform and services.
Ensure the E-commerce sites are up 24x7 by building in site reliability engineering practices.
Provide engineering services support and application support for the open source or enterprise software stack of the E-commerce platform.
Troubleshoot exceptions, performance issues, latencies and errors across multiple technologies.
Debug code issues based on web service and API responses, errors, events, logs, etc.
Triage and work on the daily tickets related to application support.
Automate the critical jobs across the entire platform to minimize manual errors and human intervention.
Work closely with the technology stakeholders (Product, Application Development, QA, etc.) and offer the right feedback on the Java stack or enterprise E-commerce stack from a production engineering perspective.
Implement effective monitoring for all events and logs, with the right alerting and escalations for critical alerts.
Perform timely capacity planning and infrastructure upgrades to keep the site reliable.
Ensure proper reviews are in place to minimize the Mean Time to Recover (MTTR) and maximize the Mean Time to Failure (MTTF).
Implement ITIL processes such as incident management, problem management and change management.
Document runbooks, incident response and post-mortem reports, etc.
Understand the business flow and map technology problems to it to arrive at the right solutions.
Ability to understand the end-to-end product life cycle and map it to production engineering.
Support the engineering services of the entire technology platform from a scaling and performance perspective.
Manage the uptime of each microservice by building and implementing the right monitoring and alerts.
Ability to automate any repeatable job.
Strong incident management with short response and resolution times to keep the site up at all times.
Build in redundancy and proactively avoid downtime.
Strong problem management abilities: automate repeatable jobs and work with the stakeholders to ensure incidents do not recur.
BS degree in Computer Science or a related engineering discipline.
4-14 years of relevant work experience at online companies such as media, E-commerce or cloud-based product companies.
Strong understanding of the business flow and software design.
Experience in at least one programming and scripting language; Python preferred.
Experience in optimizing routine tasks through automation.
Experience with monitoring and alerting tools for both infrastructure and application monitoring.
Strong experience with the open source and Java stack.
Experience managing varied stakeholders like Business, Development, QA and Product.
Strong troubleshooting and debugging skills and proven experience in site reliability engineering of highly scalable, high-performance technology platforms.
Experience in open source technologies like Elasticsearch, Solr, etc.
Experience in building a monitoring and alerting framework for infrastructure, applications, databases and NoSQL stores is a plus.
Working knowledge of commercial technologies like Hybris, Sterling, etc.
Good to Have
Linux and database administration at an intermediate level.
Ability to review code and suggest improvements to the development teams.
Ability to use commercial APM tools to troubleshoot issues.