Preparation of initial data for analysis. Requirements for an excel file for building a model / running a test

Preparation of initial data for analysis. Requirements for an excel file for building a model / running a test

The Scoring Machine system creates scoring models and tests for them by uploading excel files with the .xls / .xlsx extension, but it is very important that the files used meet the requirements for correct analysis and, it is desirable that the user also follow the recommendations during the formation file and data preparation so that the model is as efficient as possible.

Preparation and collection of data for creating a scoring model and testing. General information

The construction of scoring maps is based on statistical models. To build them, there must be sufficient and high-quality information about customers. The quality of the initial statistical data for building a statistical model determines its forecasting accuracy and the success of developing a scoring model as a whole.

The development of a scoring model is based on the analysis of previous experience. A sufficient amount of information is one of the main prerequisites for building a model. The amount of data may vary depending on specific models, but in general the data should satisfy the requirements of statistical significance and randomness. The initial data for building the model may contain internal questionnaire data, as well as external data received from partners or even computational data that the user / user system calculates on its side (for example, the difference and periods between calls to the company). Everything you know or can learn about the client at the time of their contact.

Ideally, scoring models should be applied to the same products, market sector, and economic situation that formed the basis of the past experience data. For example, information on consumer loans cannot be adequately used in the development of a scorecard for car loans. This requirement specifies the period for which data is collected. The historical period of data for building a model is usually determined by the type of scoring and the type of product, the type of client. This means that if your customers are segmented, it is important to create separate scorecards not only for different products (loan, credit card, car loan, early debt collection, late debt collection, email advertising, sms advertising, telemarketing use), but and for different customer segments. The simplest client segmentation: a new client and a repeat client who has already used the company's services before. For them, completely different attributes may be important.

Therefore, the use of several scoring cards and models often gives a much better result.

It is best to take data as fresh as possible, but where the target event has already been committed. For example, you have issued a loan and your goal is that the client does not go into delay for more than 1 month. Then it is best to take data for the longest possible period, but only for the period where all customers, in theory, could already either close the loan or go into arrears. It is desirable to avoid ambiguities as much as possible. Or you do SMS-mailing to clients, where the target task is to get the necessary activity from them during the week. Then you can use these clients for analysis at least after a week, not earlier.
Data about a certain type of customer must be excluded from the source infobase. These may be atypical clients - scammers, employees, VIP clients, deceased clients, i.e. all those customers who are knocked out of the mass with something bright. For each such type, if necessary, it is better to build a separate scoring model.

Definition of dependent variable

The choice of the dependent variable is determined by the purpose of constructing the scoring model. For example, exit or not exit on delay, response to activity from a certain channel, purchase of additional goods.At the stage of determining the dependent variable, customers are divided into three groups: "bad", "good" and "uncertain". Bad - those where the desired goal was not achieved. The good ones are those where the goal was achieved. Uncertain - those where the goal could not yet be achieved or there is little data on the data of clients, for example, an incomplete questionnaire with a lack of many attributes or the inability to calculate most of the data, etc.
When building a scorecard, only clients defined as "bad" and "good" are used. Undefined clients are excluded from the source data to create the model.

Formation of training and test lists

The information data available for building a scoring model is often called a historical sample. The historical sample should reflect the target population of customers as accurately as possible, i.e. be representative. So after preparing this customer data, breaking it down into different segments by customer type, region, or product, you can move on to the next step.
To check the adequacy and accuracy of the scoring model prediction at the stage of its development, the historical sample must be divided into two groups:
- training list - observations, according to which the model will be directly built;- test or control list - observations from which the value of the dependent variable will be known, but they will not participate in the construction of the model, but will be used to test the accuracy of the model prediction.The training and control samples should be formed on the basis of a random selection mechanism, usually in the ratio of 70–80% and 30–20%, respectively, of the initial volume of the historical sample.
Checking the reliability of the model consists in its application and comparison of the results on the control and test samples. The model should give correct predictions not only on the training set, but also in practice when it is applied. Typically, a two-sample model generalization strategy is used. Similar accuracy scores obtained on the training and test sets are a sign that in practice the scoring model will work in much the same way.
If the quality of the model is insufficient, then you can change the attributes, attribute weights, or coefficient weights for the final score to change the model and rerun the test.

Determination of the list size

It is possible to build a model on repeating dependencies. Accordingly, the larger the sample, the better. In practice, a sample size of at least 5,000–10,000 clients is recommended for model building. But there is a practice of building a model for 1,000 clients. In any case, try to use the maximum number of clients, but on the condition of dividing into the necessary segments and excluding those clients for which the data is not complete, as described above.A 50/50 ratio of good to bad is best, but this is usually not possible. In any case, we do not recommend specifically making such a ratio for the sample. It is better to take everyone for a certain period, even if the ratio of bad ones to good ones is an order of magnitude smaller.After preparing this data, you can find the necessary dependencies and build a scoring model, thereby significantly improving the achievement of desired results with lower costs.

File Requirements

To upload data and build a model and test, you have prepared all the data, and they are ready for analysis. Now you need to prepare the file for certain requirements so that the model can be built.
1. The file type must be .xls or .xlsx saved as an Excel workbook and nothing else.
2. The first column in the file should be filled in with a good or bad client. For example, GOOD or BAD or other values that are specified in the settings.
3. Use only the data that is known before the decision for which you are building a scoring model.
4. For data analysis, each product is 1 line. And this means that you do not need to combine the client into one line if the client contacted several times for products. In such cases, there should be exactly as many lines as there are products and the data in the file should be up-to-date exactly at the time the client asked product.
5. Some personal data are not needed for analysis, they will not be used to build the model anyway, so it is better not to include them in the file. For example, name, phone number and other similar personal data.
6. You can not leave empty cells, missing data! If there is an empty cell, write down some other value there, some kind of your own constant, for example "null" or "unknown". It is important! There should not be gaps in the uploaded file, all cells, if there is data in the line, must be filled with something.
7. No more than 20–30 value options are recommended, and it is best to have up to 10 attribute value options. This means that if any of the attributes consists of many options, then it is better to fill this data with groups of values. For example, the age of the client. This attribute is best served in groups of several years, for example: 18–21; 22–25; 26–30… But this does not mean that if required, you can not use a lot of values, just, as a rule, this will distort the model and will not give the best result.
8. To upload the test model, be sure to note that the attributes and their values in the test model must have exactly the same names as in the scoring model itself. If the scoring model has combined some attributes, then you do not need to do it yourself, Scoring Machine will combine the same attributes during testing, if necessary.
Uploading data example:

Illustration

When the data is ready, if you do not plan to use the same file for building the model and for the test, then we immediately recommend divide it by 80 / 20 or 90 / 10, where the first part is the percentage for the training sample with which you will build a model, and the second part is a test sample, on which you will check the finished model.

As soon as the data is ready, you can safely start building a scoring model using the Scoring Machine, you will be surprised how easy it is!