Qzone Social Advertising System: Technical Architecture in Practice
Note: This article is compiled from a talk given at QCon Beijing 2017, originally titled "Massive-Scale Service Practice of the Social Advertising System on the Qzone Platform, with Traffic in the Tens of Billions".
Apart from the performance ads displayed in Qzone, most Qzone advertising is run by our own team. Brand display ads are one example: in the standalone Qzone app we run brand ads for advertisers such as OPPO, as well as promotions for our own products, such as live-streaming pushes and Yellow Diamond revenue ads. Serving large advertisers and internal advertisers is the problem we have to solve.
The system is called QBOSS; the name reflects its positioning as the Qzone business support system. It handles about 30 billion requests a day, so the traffic it takes in is very large. Qzone and other products have connected 100 advertising channels to it, and the daily peak is 400,000 requests per second. The performance requirement is strict: the entire computation for a request must finish within 50 ms.
Before designing a technical architecture, pick out a few core features to look at. First, establish the concept of a channel, which gives us standardized management. Second, as a platform we must prevent harassment of users, so showing the same ad to a user repeatedly is not allowed. Third, targeting: not every ad should be visible to everyone, and this is where algorithms and data mining have plenty of room to contribute. Fourth, our system is a bridge between the display side and the advertisers. On the display side, each channel has its own development needs; on the advertising side, the system implements the core delivery logic, such as scheduled delivery.
Next, the definition of an order. At the center is the ad itself, and based on our business characteristics we describe it along four dimensions. On the left is the ad space, which is a very simple thing and is what the advertiser works with. On the right is the operations orientation of Qzone as a product, for example anti-harassment rules. At the top is the ad experience that real users actually perceive, the resources, which come in two parts. One is data, meaning the material: what the image looks like, what the video looks like. The other is aimed at developers: we give them a mechanism for defining display templates, so that they can implement whatever display logic they want by themselves. The fourth dimension, at the bottom, is user segmentation: an ad should be delivered to the users it suits best, and the data behind this is the user portrait and the number package.
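To make the four dimensions concrete, here is a minimal sketch of how such an order could be modeled. Every class and field name below is an assumption made for illustration; the talk does not show QBOSS's real schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Resources:
    # What the user sees: the material plus the template that renders it.
    material: Dict[str, str]          # hypothetical, e.g. {"image": "banner.png"}
    template_id: str                  # hypothetical display-template identifier


@dataclass
class OperationRules:
    # Product-side orientation, e.g. anti-harassment limits.
    max_exposures_per_user: int = 1   # assumed default, not from the talk


@dataclass
class Targeting:
    # User segmentation: portrait tags and number packages.
    required_tags: Set[str] = field(default_factory=set)
    package_ids: Set[int] = field(default_factory=set)


@dataclass
class AdOrder:
    ad_id: int
    channel_id: int                   # the ad space the order is booked on
    resources: Resources
    operation: OperationRules
    targeting: Targeting


order = AdOrder(
    ad_id=1,
    channel_id=42,
    resources=Resources(material={"image": "banner.png"}, template_id="feed_card"),
    operation=OperationRules(max_exposures_per_user=3),
    targeting=Targeting(required_tags={"yellow_diamond"}),
)
```

Keeping each dimension in its own type mirrors the separation of roles in the text: advertisers touch the ad space, operations own the rules, developers own the template, and data teams feed the targeting.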
Let's talk about Tencent's general technology accumulation. In the classic three-tier architecture, the access layer offers TGW, WNS, and TSW as access methods to choose among. In the middle layer, SPP is the most widely used multithreaded-model service framework within Tencent; it encapsulates the functions related to network communication. There is also a synchronization center, which is part of our self-developed technology selection. Products at this scale are now served from three regions, Shenzhen, Shanghai, and Tianjin, and every problem that involves data-level synchronization is abstracted into one service, the synchronization center. Another component we use a lot is SSVR. It is similar to a queue service, but unlike a cloud queue service it is based on a local flow (journal) file. This is simple and controllable, and many years of use have shown it to be trouble-free.
The storage layer has more self-developed products. CKV is the most widely used; CDB is a cloud database built on MySQL; Redis is also used a lot; TDW is Tencent's data warehouse for large volumes of data. On the left are some systems and components: Zhiyun ("Weaving Cloud") is our operations portal, L5 does load balancing, pplod is a reporting system (we use it a lot in the advertising system), DC reporting is a distributed system for reporting and collecting massive data, and Shendun ("God Shield") does recommendation. Building a massive-scale system still takes work beyond design: it is by using these components together that a massive-scale system becomes reliable.
The whole advertising system is, as the saying goes, a sparrow that is small but has all its organs, and after three years it contains more services than can be shown here. I have divided the online services into a strategy center, a user center, and a data center, plus offline processing, which mainly processes user data. Below I mainly introduce the three core services.
The strategy center is a very complex component. It serves as the central controller of the advertising system and consists mainly of logic.
The user center service is the DMP. One part, shown on the left, collects many platform-related activity features, such as whether you are a Yellow Diamond member or whether you are an active logged-in user. This data is turned into tags, and an ad can select tags to achieve targeting. The other part is the number package. Tencent has many internal data-mining teams that produce data, and the business is too large for every data producer to integrate with us directly, so their data has to be organized as number packages. This is the more widely used approach.
This was the simplest, initial architecture. The service exposes one protocol to the outside: is this user in this user group? It answers YES or NO, so it is just matching logic. The data is also very simple: the number packages a user belongs to are stored on that user's record.
But we ran into problems. First, for a DMP to be strong it needs a lot of tag data: age, gender, dress-up activity, login time, and login count are all in tag format. Tags come from many sources, many data teams will not necessarily hand their data over to you, and update frequencies differ. Even if a team grants you permission to cache its data offline, storing the whole platform's data is a heavy burden for our system. Second, cost and development efficiency: manpower was limited when we built this system. Unlike advertising teams that have a group dedicated to the DMP, we had only one or two developers for it. Third, memory cost: the only viable storage choice was memory, but memory is expensive, so how much data would these hundreds of indicators add up to?
The first question is how to improve the efficiency of building user portrait tags. The common practice at Tencent is still to bring in each piece of data, such as age or gender, under its own name, so adding each tag is very inefficient. We later gave up that structure and organized tags by ID. Adding a tag then just means adding an ID, and an ID maps naturally to a bit position: when 64 bits are used up, move on to 128. Everything is numbered sequentially, and adding a tag only means assigning it a position, so adding a batch of tags takes no work beyond agreeing on the allocation.
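A minimal sketch of this ID-to-bit mapping, assuming tags are numbered in registration order. `TagRegistry`, `has_all_tags`, and the tag names are hypothetical; the talk does not show the real protocol. Python integers grow automatically, so the step from 64 to 128 bits needs no extra code here, whereas a fixed-width implementation would add another word.

```python
class TagRegistry:
    """Hands out sequential tag IDs; each ID doubles as a bit position."""

    def __init__(self):
        self._ids = {}

    def register(self, name: str) -> int:
        # Adding a tag only allocates the next free ID.
        if name not in self._ids:
            self._ids[name] = len(self._ids)
        return self._ids[name]

    def mask(self, *names: str) -> int:
        # Build a bitmask with one bit set per named tag.
        result = 0
        for name in names:
            result |= 1 << self._ids[name]
        return result


def has_all_tags(user_bits: int, required_mask: int) -> bool:
    """True if every bit required by the ad is set for the user."""
    return (user_bits & required_mask) == required_mask


registry = TagRegistry()
registry.register("yellow_diamond")   # hypothetical tag, gets ID 0
registry.register("active_login")     # hypothetical tag, gets ID 1

user_bits = registry.mask("yellow_diamond")
ad_mask = registry.mask("yellow_diamond", "active_login")
print(has_all_tags(user_bits, registry.mask("yellow_diamond")))
print(has_all_tags(user_bits, ad_mask))
```

With this layout, matching a user against an ad is one bitwise AND regardless of how many tags exist, which is what makes adding tags cheap.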
We also added an adapter layer to handle data sources. Many teams hold data but have no channel for opening it up to other users; we check the caller's permissions and then open the data up. An Adapter Server simply has to satisfy the protocol, and the team that owns the data writes the logic behind it.
The third problem is number-package targeting. Our first idea was simple: store on each user which number packages the user belongs to, and check that data on every visit. It looks simple and easy to understand, but practice revealed its shortcomings. Running a number package is very hard: when an advertiser runs a package of 50 million or 100 million users, the data of every user in the package has to be read and written. A single delivery operation therefore affects the online service, because it changes online data and makes online capacity fluctuate. And we cannot predict when advertisers will run number packages, or how many.
The big risk is that a long-running load like this affects the online service. We serve from three regions, Shenzhen, Shanghai, and Tianjin, so a package loaded in Shenzhen also has to be synchronized to Shanghai and Tianjin. Reclaiming storage is troublesome too. Many number packages are tied to an ad, and once the ad goes offline the package has little meaning, but it is hard to reclaim: you can only go through the users in the package and update each one. We used passive updates before, meaning that when a user was accessed we took the record out and cleaned up its number packages, which is very passive.
To solve this problem, we used an in-memory storage tool developed in-house at Qzone, called the Friends Participation System, which is ideal for topic-based storage of number packages. A topic is a theme: a number package is a theme, and the users in the package are the participants in that theme. Take the once hugely popular QQ Farm: you can see how many of your friends are playing it. For a user with thousands of friends, checking whether each friend plays Farm is a very large amount of computation, which is why we built this storage system. Following the data-structure requirements of the storage system, we sort each number package and build it into a B-tree structure, and we use tmpfs to hold these packages, so that each B-tree file is one visible package. Synchronization also becomes very simple: what used to require cross-region data updates is now an operation on visible files.
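The idea of one sorted file per number package can be sketched as follows. This is not the Friends Participation System: a flat sorted array with binary search stands in for the B-tree, an ordinary temporary directory stands in for tmpfs, and the file name and function names are hypothetical.

```python
import os
import struct
import tempfile

RECORD = struct.Struct(">Q")  # one unsigned 64-bit user ID per record


def write_package(path, user_ids):
    """Write the package as a sorted, de-duplicated array of fixed-width IDs."""
    with open(path, "wb") as f:
        for uid in sorted(set(user_ids)):
            f.write(RECORD.pack(uid))


def package_contains(path, uid):
    """Binary-search the sorted file; no per-user record is touched."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)
            (value,) = RECORD.unpack(f.read(RECORD.size))
            if value == uid:
                return True
            if value < uid:
                lo = mid + 1
            else:
                hi = mid
    return False


package_dir = tempfile.mkdtemp()                   # stand-in for a tmpfs mount
package_path = os.path.join(package_dir, "package_1001.bin")
write_package(package_path, [30003, 10001, 20002, 10001])

print(package_contains(package_path, 20002))
print(package_contains(package_path, 99999))
os.remove(package_path)                            # reclaiming a package is one file delete
```

Because a package is a self-contained file, loading it never rewrites per-user records, copying it to another region is a file transfer, and reclaiming it after the ad goes offline is a delete.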
The difficulty of the data center service lies in managing users' ad feedback data. The biggest difference between the data center of an advertising system and UGC features is that it is a read-heavy and write-heavy model. Take an ad exposure: before exposing, you need to check whether this user has already seen the ad, which is one read; if you decide the user can be exposed, you have to write an exposure record. Reads and writes are both very frequent. By contrast, when I publish a post or an album, it is generally read many times and written rarely. In addition, the service has to be accessed from multiple regions, and data synchronization is a problem as well: because the volume is large, it is easy to block the channel.
Now the data center architecture. Every region has the same design. The structure is a simple logic service, with CKV storing exposure and click records; the read and write commands are deployed separately, although the code is shared. We also write a bypass flow. Every write triggers a lot of other work, such as counting exposures, clicks, and click-through rate, third-party monitoring, and tracking whether an ad has finished its run. These tasks are slow, as are some statistical functions and the synchronization to the other two regions, so they must never sit in the online write path. The only critical operation in the write path is writing CKV. s_server plays the role of a queue: it quickly writes the received data to a local file, and a set of processes then reads the file and communicates with the downstream channels on demand. The slow back-end services also consume on demand; all they need is to keep up with the flow generated over time.
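The split between the critical write and the bypass flow could look roughly like this. It is a sketch under assumptions: a plain dict stands in for CKV, a JSON-lines file stands in for the local flow file, and `LocalJournal`, `JournalReader`, and `record_exposure` are hypothetical names rather than s_server's API.

```python
import json
import os
import tempfile


class LocalJournal:
    """Append-only local file that slow consumers read later."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        with open(self.path, "ab") as f:
            f.write(json.dumps(event).encode("utf-8") + b"\n")


class JournalReader:
    """A slow consumer that remembers how far it has read."""

    def __init__(self, path):
        self.path = path
        self.offset = 0

    def poll(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
        end = data.rfind(b"\n") + 1      # consume complete lines only
        self.offset += end
        return [json.loads(line) for line in data[:end].splitlines()]


def record_exposure(store, journal, user_id, ad_id):
    key = (user_id, ad_id)
    store[key] = store.get(key, 0) + 1   # the critical online write (CKV stand-in)
    journal.append({"type": "exposure", "user": user_id, "ad": ad_id})


journal_path = os.path.join(tempfile.mkdtemp(), "flow.log")
store, journal = {}, LocalJournal(journal_path)
stats_reader = JournalReader(journal_path)   # e.g. a statistics consumer

record_exposure(store, journal, user_id=10001, ad_id=7)
record_exposure(store, journal, user_id=10002, ad_id=7)
print(len(stats_reader.poll()))
```

Each consumer keeps its own offset, so statistics, third-party monitoring, and cross-region synchronization can each read the same file at their own pace without slowing the write path.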
We compute statistics through two channels. The one on the left is done at the Tencent level and is based on the user dimension: which ads each user was actually exposed to and clicked on, with statistical reports produced on the pprow system. The other is a set of statistics we built ourselves along the ad ID dimension. We did not want to deal with massive-data problems, so we ignore user information and only track the quality of each ad ID. Writes can then be merged, data volume and traffic drop, and very simple statistics are enough to produce the report. The two channels use two different logics and their results agree, so we know the data in this report is very accurate.
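A minimal sketch of merging writes by ad ID before they reach storage. The event shape and the `AdStatsBuffer` class are assumptions for illustration.

```python
from collections import Counter


class AdStatsBuffer:
    """Merges events per ad ID in memory so one flush replaces many writes."""

    def __init__(self):
        self.exposures = Counter()
        self.clicks = Counter()

    def add(self, event):
        # User information is deliberately ignored; only the ad ID matters.
        if event["type"] == "exposure":
            self.exposures[event["ad"]] += 1
        elif event["type"] == "click":
            self.clicks[event["ad"]] += 1

    def flush(self):
        """Return one merged row per ad ID and reset the buffer."""
        rows = {}
        for ad in set(self.exposures) | set(self.clicks):
            shown = self.exposures[ad]
            clicked = self.clicks[ad]
            rows[ad] = {
                "exposures": shown,
                "clicks": clicked,
                "ctr": clicked / shown if shown else 0.0,
            }
        self.exposures.clear()
        self.clicks.clear()
        return rows


stats_buffer = AdStatsBuffer()
for user in range(4):
    stats_buffer.add({"type": "exposure", "user": user, "ad": 7})
stats_buffer.add({"type": "click", "user": 2, "ad": 7})
report = stats_buffer.flush()
print(report[7])
```

Five events collapse into a single row here; at production volume the same merge is what cuts both data size and write traffic.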
For cross-region synchronization, first look at the model used by ordinary feature products, which usually write in one region and read in three. For example, a photo can be viewed from all three regions, but publishing a photo must first go to Shenzhen and is then synchronized to Shanghai and Tianjin. That model is relatively simple, but it has to respect ordering: if an operation fails it is retried, which will certainly block the channel. Our service is both read-heavy and write-heavy, and while synchronization is still in progress the second request may already have arrived. For example, a user clicks an ad in Shanghai and the ad should not appear again, yet it does, because the data is still on its way between regions. So we would need to synchronize three writers and three readers, which means building six channels, and the synchronization volume would be much larger than for an ordinary product if we insisted on consistency and ordering.
Given this, we boldly chose synchronization logic that never blocks the channel. We also synchronize only part of the information. If a user clicks an ad in Shenzhen, we do not try to make Shanghai store exactly what Shenzhen stores; we only pass on the incremental information, because the channels each region serves are different, and in the end the storage in Shanghai and Tianjin is not identical either. With this approach, channel failures turn the three regions into three islands. The biggest problem is that a user who clicked an ad in Shanghai may be routed to Shenzhen on the next request and see that ad again. We boldly decided not to handle this case. When we verified it in practice, our statistics showed that the users affected in a day are only five per thousand, so we do not agonize over it. This is one of the distinctive features of our cross-region service.
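The non-blocking, increment-only behavior can be sketched like this. The `Region` class and `record_click` function are hypothetical, and real delivery would go through the flow file and a network channel rather than a direct method call.

```python
class Region:
    """One serving region with its own local record of (user, ad) clicks."""

    def __init__(self, name):
        self.name = name
        self.clicked = set()
        self.reachable = True

    def apply(self, delta):
        if not self.reachable:
            raise ConnectionError(self.name + " is unreachable")
        self.clicked.add(delta)


def record_click(local, peers, user_id, ad_id):
    """Store locally, then push only the increment to each peer, best effort."""
    delta = (user_id, ad_id)
    local.clicked.add(delta)
    dropped = 0
    for peer in peers:
        try:
            peer.apply(delta)
        except ConnectionError:
            dropped += 1        # no retry and no waiting: the channel never blocks
    return dropped


shenzhen, shanghai, tianjin = Region("Shenzhen"), Region("Shanghai"), Region("Tianjin")
tianjin.reachable = False       # simulate a failed channel

lost = record_click(shanghai, [shenzhen, tianjin], user_id=10001, ad_id=7)
print(lost, (10001, 7) in shenzhen.clicked, (10001, 7) in tianjin.clicked)
```

The dropped increment is exactly the case accepted in the text: if this user is later routed to Tianjin, the ad may be shown again, and the measured share of affected users was judged small enough to tolerate.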
Next, the ability to operate a massive-scale platform service. The first part is monitoring: interface-level monitoring, which measures real success rates and latency; user-side monitoring, which looks at the real latency of pulling ads; and ad data monitoring, which checks whether an ad appeared on time and whether its click-through rate is in line with expectations. The second part is service quality, which is really a capacity problem: with enough capacity quality is good, and without it quality suffers. Where does the capacity problem come from? The computing load of ads rises, and this is something every advertising system runs into, especially a platform-type one. Performance ads can maintain a smooth ad output, but ours are different: some ads cause the load on the system to surge or drop.
The amount of computation per ad also differs. What we can do is rely on the elasticity of cloud computing, but at present maintaining good service quality is still difficult. At the very least you need an early warning about future load, and you need a model to calculate it. Advertisers book in advance: an ad for the afternoon is submitted in the morning and does not take effect immediately, so there is always some time to check whether capacity is sufficient.
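One very simple form such a warning model could take is to sum the expected load of booked ads per hour and compare it with capacity. The fields `start_hour`, `end_hour`, `expected_qps`, and `cost_per_request`, and all the numbers, are invented for illustration; the talk does not describe the actual model.

```python
def forecast_load(scheduled_ads, hours=24):
    """Sum the expected computing load of all booked ads for each hour of the day."""
    load = [0.0] * hours
    for ad in scheduled_ads:
        for hour in range(ad["start_hour"], ad["end_hour"]):
            load[hour] += ad["expected_qps"] * ad["cost_per_request"]
    return load


def hours_over_capacity(load, capacity):
    """Hours in which the forecast exceeds what the cluster can serve."""
    return [hour for hour, value in enumerate(load) if value > capacity]


# Hypothetical bookings submitted in the morning for the afternoon.
bookings = [
    {"start_hour": 14, "end_hour": 18, "expected_qps": 1000, "cost_per_request": 1.0},
    {"start_hour": 16, "end_hour": 20, "expected_qps": 500, "cost_per_request": 2.0},
]
forecast = forecast_load(bookings)
print(hours_over_capacity(forecast, capacity=1500))
```

Because bookings arrive before they take effect, a check like this can run at submission time and leave a window for scaling out or rescheduling.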
Efficiency matters as well. You can leave a 5x or 8x buffer, but the more you reserve, the greater the pressure on operations, so single-machine performance has to improve. Some of our monitoring data is not encoded on the server; apart from ad analysis data, everything goes over UDP, and we simplify this logic as far as possible. We also use separate deployments of various kinds to reduce latency spikes; for example, data center reads, writes, and reclamation must be deployed separately. This guards against ads that suddenly ramp up and against operations that add cost. Averaged out, a success rate of three or four nines is no problem, but occasionally there are a few minutes when it dips.
For disaster tolerance, we do not store information tied to user routing on a single machine, so when a machine goes down, load balancing simply covers for it. This also includes equipment management: we manage the allocation of high-priority and low-priority equipment and do targeted work for each. Our support for city-level disaster tolerance has been tested by events such as the Tianjin explosion. Then there are the details. With disaster tolerance you may have the critical places covered and still be brought down by some detail, so pay close attention to details and look for the weak points.
Next, the ROI optimization path of the advertising system, including the detours we took. When people think of ROI they think of recommendation algorithms, and we also worked with teams inside Tencent on this; here are some of the things we did. First, before optimizing ROI, find your own bottleneck. For our recommendation algorithms the number of candidates is not large, because the demand comes from large businesses: there may be five or six ads, and no more than ten, but each one has a very large impact. This is different in complexity from systems that go through pre-selection and fine selection and can truly show every user something different. Demand is also volatile, but ad quality is manageable, because the advertisers are close to us and we understand their needs.
We made three attempts. The first was a detour: we analyzed incoming users in various ways, checked the data, and optimized online, only to find that even if you know what kind of ad a user likes, there is nothing you can do when no such ad is in inventory. The second was to run the same ad project through more channels: perhaps the ad's user base was not reached in this channel but can be reached in another. The cost of this for advertisers is very high, because every channel is different and each additional delivery brings development cost.
The third attempt proved very useful. The channel is still given to you, but you can no longer use a single ad: you have to design several ads that compete with one another, where ad A targets one user base and ad B targets another. Let the users vote. Within twenty or thirty minutes we have the click-through rate of each ad, and the ads with a high click-through rate are the ones that get through. The system acts as a bridge between advertisers and users: users give feedback on whether your work is good or bad, and this pushes advertisers to keep producing new ads.
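A sketch of letting users vote between competing ads. The talk only says that ads with a higher click-through rate get through; splitting traffic in proportion to click-through rate after a warm-up period is an assumption of this sketch, as are the function names and the `min_exposures` threshold.

```python
def click_through_rate(stats):
    return stats["clicks"] / stats["exposures"] if stats["exposures"] else 0.0


def traffic_shares(creatives, min_exposures=1000):
    """Split traffic evenly during warm-up, then in proportion to click-through rate."""
    even = {name: 1.0 / len(creatives) for name in creatives}
    if any(s["exposures"] < min_exposures for s in creatives.values()):
        return even                     # not enough votes collected yet
    rates = {name: click_through_rate(s) for name, s in creatives.items()}
    total = sum(rates.values())
    if total == 0:
        return even                     # nobody clicked anything
    return {name: rate / total for name, rate in rates.items()}


# Hypothetical counts after the first half hour of delivery.
competing = {
    "ad_A": {"exposures": 1000, "clicks": 30},
    "ad_B": {"exposures": 1000, "clicks": 10},
}
shares = traffic_shares(competing)
print(sorted(shares.items(), key=lambda item: item[1], reverse=True))
```

A stricter reading of the text would give all remaining traffic to the winner; the proportional split simply keeps some signal flowing for the weaker ad.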
Another important ROI feature is negative feedback. Some people will not click an ad even after seeing it seven times; however many more times you show it, they still will not click, because they selectively ignore it. Stop showing the ad to this group: their experience improves, and we lose no revenue. There is also frequency limiting, so that traffic is not consumed the moment it is released but is spread smoothly over a day or more. We found another benefit for third-party delivery. Third parties often do not know how much traffic is coming until their site is overwhelmed and the traffic is wasted. The advertiser does not know how much traffic Qzone has, but certainly knows how much its own system can withstand, so we give it a frequency limit.
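Both mechanisms fit in a few lines. The threshold of seven unclicked impressions comes from the text; the function and class names and the per-second limit are hypothetical.

```python
def should_show(impressions, clicks, max_unclicked=7):
    """Negative feedback: stop after max_unclicked impressions with no click."""
    return clicks > 0 or impressions < max_unclicked


class PerSecondCap:
    """Frequency limit: let at most `limit` requests through in any one second."""

    def __init__(self, limit):
        self.limit = limit
        self.second = None
        self.count = 0

    def allow(self, now):
        second = int(now)
        if second != self.second:       # a new second starts a new window
            self.second = second
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


print(should_show(impressions=7, clicks=0))
print(should_show(impressions=3, clicks=0))

cap = PerSecondCap(limit=2)             # what the advertiser's site can withstand
print([cap.allow(t) for t in (10.0, 10.5, 10.9, 11.0)])
```

The cap is expressed in terms the advertiser already knows, the load its own site can take, so it can be set without any knowledge of how much traffic the platform has.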
We also provide a tool called QBOSS crowd analysis. After you run an ad, we analyze how well the delivery went. Select a feature, then for each feature value look at its share of the exposed population and its share of the population that clicked. If the two are about equal, this feature does not distinguish the ad's audience. If the click share is higher than the exposure share, that is a very positive signal, and the next time you run this kind of ad you should give priority to that part of the population. What we provide is only a tool, a channel for positive and negative feedback; the optimization itself is work the advertiser has to do in its own way. Done well the result can be four hundred percent, and done poorly only ten percent.
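The comparison of exposure share and click share is straightforward to compute. In this sketch each input list holds one feature value per user; the `feature_lift` name, the age buckets, and the counts are invented for illustration.

```python
from collections import Counter


def feature_lift(exposed, clicked):
    """Compare each feature value's share among exposed users and among clickers."""
    exposed_counts = Counter(exposed)
    clicked_counts = Counter(clicked)
    result = {}
    for value, count in exposed_counts.items():
        exposure_share = count / len(exposed)
        click_share = clicked_counts[value] / len(clicked) if clicked else 0.0
        result[value] = {
            "exposure_share": exposure_share,
            "click_share": click_share,
            "lift": click_share / exposure_share,   # above 1 means over-represented among clickers
        }
    return result


# Hypothetical data: 100 exposed users and 10 clickers, bucketed by age.
exposed_ages = ["18-24"] * 50 + ["25-34"] * 50
clicked_ages = ["18-24"] * 8 + ["25-34"] * 2
analysis = feature_lift(exposed_ages, clicked_ages)
for value, row in sorted(analysis.items()):
    print(value, row)
```

A lift near 1 means the feature does not separate the audience, while a lift well above 1 marks the segment to prioritize in the next delivery.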
The system as a whole also reflects many of Tencent's massive-scale service principles, such as dynamic operation and network-wide scheduling. All of them are in use, and I will not go through them again here.
Finally, a summary. Technical value must show up in the business. The three systems we built may sound simple, but being able to solve business problems is what counts. Responsibilities should also be clearly defined: we split the advertising system into a few broad directions, each with its own responsibility. Abstraction also lets you build a big system out of small parts and make a complex problem simple. For example, the user center only needs to judge whether a user belongs to a user group and does not care about anything else.
Then there is the abstraction layer, which can absorb many kinds of requirements. Some tag handling is very special, but adding one layer of abstraction can often solve many complex problems. Be good at getting performance improvements from storage: with Redis, for example, the logic stays simple and the performance is very good, and the Friends Participation System introduced earlier likewise solved our problem quickly.
On keeping costs low: memory is expensive, so you may consider other storage media, or look at protocol compression and the like. We all pick up many such techniques, but I suggest you still look at the problem from the business point of view, which may save you ten or a hundred times the cost. Try to make services stateless, which ensures load balancing and consistency; once there is state, things really do end badly. Then there is the separation of deployments: in a low-traffic system this seems a bit roundabout, but in a massive-scale system it is very necessary. All of our architectural designs rely on our own experience with the data, but be sure to validate them against real data once you are online. The same applies to cross-region synchronization: the question is how well the architecture keeps working after it goes live.