The road of SRE evolution

The manual era

In the earliest days, our front end used Layer-4 load balancing, static resources were cached by Varnish/Squid, and dynamic requests ran on a LAMP stack. At that time there were very few machines and very few processes, and there was no distinction between application operations and system operations. There were also very few operators, each covering the network, the machines, and the services at once. Most operations work was done by hand; in fact, there was no real operations system yet. Many startups today still run this kind of architecture.

Moving to the cloud

As the business evolved, our architecture was adjusted accordingly. Especially after entering the mobile era, mobile traffic accounted for a larger and larger share. The access layer no longer served only web resources; it also exposed many API services. The back-end development language was no longer limited to PHP: Java, Python, C++, and others were introduced as services required, and the overall business architecture began shifting toward microservices. As the business architecture changed, the underlying infrastructure changed as well. The biggest change was that by mid-2014 all of the business was already running on the cloud, as shown below.

[Figure]

One of the benefits of running on the cloud is that the underlying hosts and network are abstracted away: the cloud platform encapsulates host creation, network policy changes, and so on into corresponding systems and exposes a unified platform interface to users. When we do operations work, the previously complex processes can now be chained together. It was also at this time that the SRE team was initially established, and the operations-related work was split up. The cloud computing part (owned by Meituan Cloud) is mainly responsible for hosts, the network, and the operating system; SRE interfaces with the business side and is responsible for the machine environment, business-side architecture optimization, and business-side issues.

Problems & solutions

Next, we will introduce the problems we encountered while building on the cloud infrastructure and how we solved them.

[Figure]

As shown in the figure above, the first problem was resource isolation, which caused several failures. The CPU and NIC of our online VMs were shared. On one occasion, load-test traffic was very high and basically saturated the host NIC's bandwidth (most hosts at the time had Gigabit NICs, which are easy to fill). The VMs on the same host then contended for resources, the response times of services deployed on the other VMs shot up, and a payment service of ours (its VMs were deployed on the same host as the load-test VMs) went down outright.

In response to this problem we did two things. One was to isolate network resources, giving each VM its own quota. The other was to split the host clusters by business characteristics. Offline services do not need to worry about CPU contention, and nobody cares much about the exact response time of a single request as long as the job finishes within an allowed time window, so we placed these services in a separate offline cluster. Online services were divided into multiple small clusters according to the importance of each business.
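
To make the idea of per-VM network quotas concrete, here is a minimal sketch (not Meituan Cloud's actual mechanism) that caps the egress bandwidth of one VM's host-side tap interface with `tc`; the interface name and rate are hypothetical.

```python
import subprocess

def cap_vm_egress(tap_iface: str, rate_mbit: int, burst_kbit: int = 256) -> None:
    """Apply a simple token-bucket egress cap to a VM's host-side tap interface.

    This only sketches the idea of a per-VM network quota; a real cloud
    platform would manage such rules through its own network service.
    """
    # Remove any existing root qdisc, ignoring errors if none is present.
    subprocess.run(["tc", "qdisc", "del", "dev", tap_iface, "root"],
                   check=False, capture_output=True)
    # Attach a TBF (token bucket filter) qdisc that limits egress to rate_mbit.
    subprocess.run([
        "tc", "qdisc", "add", "dev", tap_iface, "root", "tbf",
        "rate", f"{rate_mbit}mbit",
        "burst", f"{burst_kbit}kbit",
        "latency", "50ms",
    ], check=True)

if __name__ == "__main__":
    # Hypothetical tap interface of one VM; cap it to 200 Mbit/s.
    cap_vm_egress("tap-vm-0001", rate_mbit=200)
```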

The second problem was spreading VMs across hosts. This problem was not obvious at the beginning: the online business had not yet been carefully split into services, and services were deployed in a few large clusters. In that situation, even if the VMs were not spread out (multiple VMs of the same service sat on the same host), a host going down did not have much impact. However, as the business evolved and the services were split, online services were basically no longer deployed as large clusters of hundreds of machines, but as small clusters of a dozen or a few dozen VMs.

If a service has 10 VMs and 5 of them sit on one host, then once that host goes down the service loses half of its capacity, which is a high risk; if that happens at peak time, the business is paralyzed. In response, the SRE team and the cloud computing team carried out an optimization that lasted more than half a year and brought the VM scatter rate above 90%: in the end, no host carries more than two VMs of the same service.
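
A minimal sketch of the scatter constraint described above, assuming a scheduler that knows the candidate hosts and where the service's existing VMs live; the function and data shapes are hypothetical, not the cloud platform's real scheduler.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

MAX_VMS_PER_HOST = 2  # at most two VMs of the same service on one host

def pick_host(service: str,
              candidate_hosts: Iterable[str],
              placements: Dict[str, str]) -> Optional[str]:
    """Pick a host for a new VM of `service`, honoring the scatter constraint.

    `placements` maps existing VM names of this service to their host.
    Returns the least-loaded host that still satisfies the constraint,
    or None if no candidate qualifies.
    """
    per_host = Counter(placements.values())
    eligible = [h for h in candidate_hosts if per_host[h] < MAX_VMS_PER_HOST]
    if not eligible:
        return None
    # Prefer hosts that currently hold the fewest VMs of this service.
    return min(eligible, key=lambda h: per_host[h])

if __name__ == "__main__":
    existing = {"svc-a-vm1": "host-01", "svc-a-vm2": "host-01", "svc-a-vm3": "host-02"}
    print(pick_host("svc-a", ["host-01", "host-02", "host-03"], existing))  # host-03
```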

The third problem was improving the scheduling success rate. Through the joint efforts of SRE and the cloud computing team, the success rate has reached three nines (99.9%).

Cloud infrastructure

[Figure]

The figure above is the architecture diagram of our cloud network infrastructure. At the top is the public network entrance, where most traffic comes in over BGP links. Below that are the high-speed lines between data centers. The stability of these lines has been validated by large-scale online businesses such as food delivery, group buying, and hotel & travel, all of which are deployed across multiple data centers.

In addition, the network architecture is highly redundant: basically every node has a redundant device, so traffic is unaffected when one device fails. The entrances and exits use some self-developed components such as MGW and NAT, which make our traffic control more flexible.

Meituan-Dianping is probably the largest user of Meituan Cloud. The benefits Meituan Cloud brings it include complete API support, highly customized resource isolation and scheduling mechanisms, direct fiber connections between data centers, and higher resource utilization.

Operation and maintenance automation

With orders and machines growing rapidly, we had to move toward automation to operate more efficiently.

In the course of this automation evolution, we distilled our own methodology.

Make complex things simple. For example, by introducing the cloud platform, basic device management is handled through the cloud platform's systems: everything related is encapsulated, and what is finally exposed to us is an API or a web interface.

Make simple things standard. If you want to build processes or automation, there is no single starting standard, and there are many points to consider. So we set many standards for things like host and domain naming, the base system environment, and bringing services online and offline. These standards were polished through production practice and finally converged into unified standards. Once the standards were in place, we introduced processes: for example, for creating machines, we list the required operations and write an SOP according to the standard, first as a process and then as automation. We turned the manual work into code and finally reached an automated level.
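
As a toy illustration of turning a standard into code before automating it, here is a sketch that validates a hypothetical hostname convention of the form `<service>-<idc>-<index>`; the actual Meituan naming standard is not described in the text.

```python
import re

# Hypothetical naming standard: <service>-<idc>-<index>, e.g. "order-api-bj01-003".
HOSTNAME_PATTERN = re.compile(
    r"^(?P<service>[a-z][a-z0-9-]+)-(?P<idc>[a-z]{2}\d{2})-(?P<index>\d{3})$"
)

def validate_hostname(hostname: str) -> dict:
    """Check a hostname against the naming standard and return its parts.

    Raises ValueError if the name does not conform, so an SOP script can
    fail fast instead of creating a non-standard machine.
    """
    match = HOSTNAME_PATTERN.match(hostname)
    if not match:
        raise ValueError(f"hostname {hostname!r} violates the naming standard")
    return match.groupdict()

if __name__ == "__main__":
    print(validate_hostname("order-api-bj01-003"))
```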

[Figure]

This is the service tree. It holds the mapping between online cloud hosts, services, and service owners, displayed as a tree at different levels. Because each node carries a label that uniquely identifies a service, the tree connects multiple peripheral systems. At present, the systems we have connected include the configuration management system, the capacity system, and the monitoring platform, as well as login permissions for online hosts.

Most recently, cost accounting has also been connected to the service tree: through the tree's nodes, a few simple operations are enough to see the cost of each business group.
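
A minimal sketch of what a service-tree node might look like as a data structure; the fields and the example label are hypothetical and are only meant to show how a unique service label lets peripheral systems hook onto the tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ServiceNode:
    """One node in the service tree: a business line, a team, or a service."""
    label: str                                        # unique label, e.g. "waimai.order-api"
    owner: str                                        # service owner
    hosts: List[str] = field(default_factory=list)    # cloud hosts under this node
    children: List["ServiceNode"] = field(default_factory=list)

    def find(self, label: str) -> Optional["ServiceNode"]:
        """Locate a node by its unique label (depth-first)."""
        if self.label == label:
            return self
        for child in self.children:
            found = child.find(label)
            if found:
                return found
        return None

# Peripheral systems (monitoring, configuration management, cost accounting, ...)
# would all key off the same unique label carried by the node.
```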

[Figure]

The figure above shows a simplified machine-creation process. First an engineer initiates the process, which goes to the process center. The process center obtains the basic information of the service from the service tree and sends it to the operations platform, which creates the machines on the cloud platform according to that information.

After that, the cloud platform returns the result to the operations platform. The operations platform adds the created machines to the service node provided by the process center and calls the configuration management system to initialize the machines' environment. After initialization, basic monitoring is added automatically. The deployment system is then invoked to deploy the service. After deployment, the service is registered with the service governance platform based on its service labels, and it can then serve online traffic. In effect, the whole process completes automatically once the engineer initiates it.
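
The following is a highly simplified sketch of how those steps chain together; every helper is a hypothetical stand-in for a real platform call (service tree, cloud platform, configuration management, monitoring, deployment, service governance), not Meituan's actual APIs.

```python
# All of the helpers below are hypothetical stand-ins for real platform calls.
def service_tree_lookup(label):        return {"label": label, "image": "base-img"}
def cloud_create_vms(info, n):          return [f"{info['label']}-vm{i}" for i in range(n)]
def service_tree_attach(label, vms):    print(f"attach {vms} to {label}")
def cmdb_init_environment(vm, info):    print(f"init environment on {vm}")
def monitoring_add_basic_checks(vm):    print(f"add basic monitoring for {vm}")
def deploy_service(label, vms):         print(f"deploy {label} to {vms}")
def governance_register(label, vms):    print(f"register {label}: {vms}")

def create_and_launch(service_label: str, count: int) -> list:
    """Chain the machine-creation steps described in the text."""
    info = service_tree_lookup(service_label)       # process center -> service tree
    vms = cloud_create_vms(info, count)             # ops platform -> cloud platform
    service_tree_attach(service_label, vms)         # add machines to the service node
    for vm in vms:
        cmdb_init_environment(vm, info)             # configuration management system
        monitoring_add_basic_checks(vm)             # basic monitoring added automatically
    deploy_service(service_label, vms)              # deployment system
    governance_register(service_label, vms)         # service governance platform
    return vms

if __name__ == "__main__":
    create_and_launch("waimai.order-api", 2)
```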

That is a brief introduction to automation; the following describes where we are now.

Data operations

[Figure]

As shown in the figure above, the company is now very large, and we have split responsibilities accordingly. The red part of the figure is owned entirely by the cloud platform, from the access layer down to the underlying infrastructure: the machine rooms, network, and hosts are all encapsulated by the cloud platform. The layer opened up in the middle is the responsibility of SRE.

Now that the process systems are in place, our new area of exploration is data operations. The first piece is fault management, which provides unified management of online faults, including the time, cause, and owner of each fault, and classifies faults into levels according to severity. We keep following up on the fixes after each failure to ensure that every TODO is actually implemented.

In addition, through the fault platform we aggregate all faults. The system can classify different faults based on the aggregated information and work out the proportion of each fault type in production, so that we can make targeted breakthroughs.
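
A toy sketch of that aggregation: given fault records with a hypothetical category field, compute each fault type's share so the largest category can be attacked first.

```python
from collections import Counter

def fault_type_share(faults: list) -> dict:
    """Return each fault category's share of all recorded faults.

    `faults` is a list of dicts with a hypothetical "category" field,
    e.g. "change", "capacity", or "hidden_danger".
    """
    counts = Counter(f["category"] for f in faults)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.most_common()}

if __name__ == "__main__":
    records = [
        {"id": 1, "category": "change"},
        {"id": 2, "category": "change"},
        {"id": 3, "category": "capacity"},
        {"id": 4, "category": "hidden_danger"},
    ]
    print(fault_type_share(records))  # {'change': 0.5, 'capacity': 0.25, 'hidden_danger': 0.25}
```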

After fault management, we did some data-mining work. In the initial stage our operations data came mainly from the monitoring platform or from what the business reported proactively. At this stage we actively dig for information ourselves, for example doing targeted analysis of the request volume and response times of online services.

Responsibility & Mission

[Figure]

As shown in the figure above, our mission has gradually shifted from "fight fires first, drive change second" to "drive change first, fight fires second." Through data operations we can drive the business in reverse. The core of the work remains stability; that has not changed.

We can read "operation and maintenance" as operation plus maintenance. Operation means improving overall service quality through accumulated experience and data analysis; maintenance means using professional technology to meet the needs of online services and the business.

Next, let's talk about our practice of guaranteeing stability.

Business stability guarantee practice

Failure causes & examples

First, let's summarize the causes of failure, with some examples to illustrate each.

1. Changes. Meituan-Dianping releases online services more than 300 times per day, in addition to basic operations changes such as network and service-component changes. As an example, taking a change made offline, let's write a simple Nginx configuration, as shown below.

[Figure]

Compared with the configuration running in production, the order of the highlighted (red) directives has changed. If the rewrite directive takes effect after the set directive, the result is as expected. But when the rewrite directive is placed first, its break is executed first and the whole rewrite phase ends there, so the set after the rewrite never runs. Once this configuration went live, Nginx could not find the backend service and the whole online service collapsed. If we had done a proper gray release we could have found and fixed the problem in time, but the gray step was missing from the rollout. The standard SOP should be the five steps shown in the figure above, but the engineer responsible for the change took things for granted, or was careless: after testing found no anomalies, the change was pushed straight to all of production, which eventually caused a major incident.
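
The original configuration is only shown in the figure, so the snippet below is a hedged reconstruction of the kind of ordering mistake described, with a hypothetical upstream name and variable: when rewrite ... break runs first, the set never executes and proxy_pass is left with an empty variable.

```nginx
# Variant A (broken order): rewrite ... break ends the rewrite phase,
# so the set below it never runs and $backend stays empty.
location /api/ {
    rewrite ^/api/(.*)$ /$1 break;
    set $backend "http://upstream_api";   # never executed after the break
    proxy_pass $backend;                  # empty target: backend cannot be found
}

# Variant B (expected order): set first, then rewrite, so proxy_pass has a valid target.
location /api/ {
    set $backend "http://upstream_api";
    rewrite ^/api/(.*)$ /$1 break;
    proxy_pass $backend;
}
```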

2. Capacity. Big holidays or flash sales bring a lot of traffic, and abnormal traffic from attacks or crawlers can also cause spikes. The figure below shows a major incident on Maoyan. The main cause of this failure was that back-end service capacity at the bottom layer was insufficient: when traffic changed dramatically it could not keep up. Traffic to the key service rose to five times its usual peak, and DAU doubled compared with New Year's Day, the previous historical peak.

[Figure]

This was mainly caused by two problems: our capacity estimate for the big event was inaccurate, and capacity was not matched across layers. The front-end applications were assessed as able to handle the load, but the bottom layer could not; when the front-end traffic hit the back end, the back end could not hold it and the whole service went down. So we must do at least two things. First, know ourselves: understand how much capacity we can carry, obtained through load testing or by reference to historical data. Second, know the other side: know exactly how much traffic will come from the front end. Through coordination between operations and engineering, when there is a big event or big holiday, we make a judgment based on the capacity assessment and historical data. In addition, we must understand the capacity of downstream systems; once a downstream service's capacity is lower, we must rate-limit and remind the downstream service to match capacity accordingly.
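
A minimal token-bucket sketch of the rate limiting mentioned above, placed in front of a lower-capacity downstream service; the limit values are hypothetical.

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter: allow at most `rate` requests/second,
    with bursts up to `capacity`, as a guard in front of a weaker downstream."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens according to elapsed time, up to the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical limit: the downstream can only take 500 requests/second.
limiter = TokenBucket(rate=500, capacity=1000)

def call_downstream(request):
    if not limiter.allow():
        return "429 Too Many Requests"   # shed load instead of overwhelming downstream
    return "forwarded"                   # stand-in for the real downstream call
```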

3. Hidden dangers. Hidden dangers are mainly defects in system design, as well as cross-dependencies between components, missing critical alarms, and asymmetric link capacity. Such problems are harder to find and require in-depth investigation. For an example, look at the figure below. Before the change, packets followed the green path; after the change, some packets followed the red path. The key effect of the change was that the sessions on the red link changed: the session state initially lived on MGW1, and after the link changed, the stateful TCP connection could no longer be found, so packets could not get through. Data was lost, the database could not be reached, and the service went down.

[Figure]

However, the business layer should have accounted for network instability when the architecture was designed. There are roughly three ways to deal with the hidden dangers above.

The first is full-link drills: simulate real scenarios, and after a drill the problems are more or less exposed. We can then fix them, improve our contingency plans, close loopholes online, and verify during the drill that our alarm system works properly.

The second is SLAs: set a stricter stability indicator for the service and keep optimizing toward it. For example, for online HTTP services we extract a stability indicator from the status codes and response times in the access log. This gives an additional reference for the stability of the service itself; when the stability indicator fluctuates, the service is bound to have a problem, and we analyze the points of fluctuation accordingly. Through that analysis we can eventually surface hidden dangers. The indicator uses real data to feed back the stability of the production service.
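
One possible way to compute such an indicator from an access log is sketched below; the log format, the status-code rule, and the latency threshold are assumptions, not the exact metric used in production.

```python
def stability_from_access_log(lines, latency_threshold=0.5):
    """Compute a crude stability indicator from access-log lines.

    A request counts as "good" if its status code is below 500 and its
    response time is under `latency_threshold` seconds. The log format is
    assumed: the last two fields of each line are <status> <request_time>.
    """
    total = good = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        status = int(fields[-2])
        request_time = float(fields[-1])
        total += 1
        if status < 500 and request_time < latency_threshold:
            good += 1
    return good / total if total else 1.0

if __name__ == "__main__":
    sample = [
        "GET /api/order 200 0.032",
        "GET /api/order 200 0.812",   # slow: counts against stability
        "GET /api/order 502 0.001",   # server error: counts against stability
    ]
    print(f"stability = {stability_from_access_log(sample):.2%}")
```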

The third is fault management: for every fault, find the root cause, make sure the TODOs are implemented, and share the lessons learned from each fault with other lines of business.

Summary of experience

Before an incident, the core (standard SOPs, capacity assessment, load testing, and so on) is prevention: stop problems before they happen.

During an incident, the core is stopping the loss quickly. Finding the root cause is a relatively difficult and long process whose duration is not controllable, but if we have a good contingency plan prepared in advance, we can stop the loss quickly. In addition, services must have self-protection. Another point is that communication matters a great deal. When a problem first breaks out things are actually quite chaotic, because everyone finds it urgent and many people keep asking why. At that moment asking for the cause is useless, because most people do not know it yet; what they can give is a mitigation. So we need a sound communication mechanism that feeds back the right information at the right time. The principle of that feedback is: talk less about surface phenomena, and say things that actually help locate the problem or stop the loss.

After an incident, the TODO follow-up, the improvement plan, and so on all come down to one point: learn from the fall, and make sure the same problem never happens twice.

User experience optimization

First, starting from the user side: when users access our online business, traffic goes from the public network to our private cloud and then to the servers. Public-network problems mainly include network hijacking, the multi-carrier environment, and the fact that public-network links are out of our control. On the server side, the issue is mainly the transport-layer and application-layer protocols: most business interactions still use HTTP/1.0 or HTTP/1.1, and the HTTP protocol has room for improvement here, since it is not well suited to frequent, small interactions.

We have made several attempts to address these issues:

First, BGP at the public-network entrance. We have built our own BGP network, so we no longer worry about multi-carrier access: with the BGP network, packets are routed optimally when they are addressed on the public network.

Against hijacking, we have tried an HTTPDNS solution, and also Shark, which is a kind of public-network link acceleration. It amounts to deploying servers near the user and embedding an SDK in the app: instead of doing DNS resolution, the app sends its requests to Shark (see the earlier blog post "Meituan-Dianping Mobile Network Optimization in Practice"), and Shark then interacts with the back-end services. Through continuous optimization by these means, hijacking is now much less of a problem.
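
For reference, a generic HTTPDNS-style lookup might look like the sketch below: resolve the domain through an HTTP API instead of the local resolver, then connect to the returned IP while carrying the real domain in the Host header. The endpoint URL and response shape are hypothetical, and this is not Shark or Meituan's actual implementation.

```python
import requests  # third-party HTTP client (pip install requests)

HTTPDNS_ENDPOINT = "https://httpdns.example.com/resolve"   # hypothetical HTTPDNS service

def resolve_over_http(domain: str) -> str:
    """Ask an HTTPDNS service for the domain's IP over HTTPS,
    bypassing the local resolver that might be hijacked."""
    resp = requests.get(HTTPDNS_ENDPOINT, params={"host": domain}, timeout=2)
    resp.raise_for_status()
    return resp.json()["ips"][0]          # hypothetical response shape

def fetch(domain: str, path: str) -> requests.Response:
    """Connect to the resolved IP directly and carry the real domain in the Host header."""
    ip = resolve_over_http(domain)
    return requests.get(f"http://{ip}{path}", headers={"Host": domain}, timeout=5)

if __name__ == "__main__":
    print(fetch("api.example.com", "/ping").status_code)
```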

For the business-interaction protocol, we have rolled out SPDY, and the improvement for services with frequent interactions is obvious. We are currently testing HTTP/2; the server side still has a few bugs in its HTTP/2 support that we are working to fix, and we hope to put it into use soon.

Future prospects

First, technically: we are currently doing reasonably well at automation and will keep going. The next step is intelligence. Why intelligence? Mainly because we face a bottleneck: some problems cannot be solved by automation alone. For example, the automatic fault localization mentioned above is highly decision-driven and requires many steps of judgment; it cannot be handled directly by a simple program. We are now experimenting with some AI algorithms, introducing machine intelligence to make a breakthrough.

In terms of products, the tools we now have, after large-scale validation by online business, are developing toward productization. We hope to turn them into finished products on Meituan Cloud and offer them as services to users, serving not only ourselves but also others.

Finally, on the technology front: the solutions to hard problems encountered during Meituan's growth, and the experience accumulated in meeting those challenges, have been validated by large-scale online business and can eventually form mature solutions, providing cutting-edge technical references for Meituan Cloud users.

The cloud is the general trend. It encapsulates many low-level issues and lets us spend more energy on the things that matter most.
