Factors Affecting Probability of Income Increase in Agricultural Holdings Specialised in Milk Production

The paper uses a logistic regression model to determine the factors affecting the probability of an income increase in agricultural holdings specialised in milk production. The analysis was dynamic in character and covered the period 2009-2011. Independent variables included both qualitative and quantitative features of farm activities. The analysis showed that, in each year, the most important variables in the logit model were the utilised agricultural area and the number of dairy cows, both of which had a positive impact on the probability of an income increase. All estimated models are of high quality and can thus be used to classify agricultural holdings correctly.


4(349) 2016
output. The share of animal production in the structure of commercial agricultural production followed a similar trend: it dropped from 56.5% in 2009 to 53.4% in 2011. By contrast, the share of cattle, calves and milk remained stable over the analysed period (23-24%, including milk at 17-18%).
The aim of this research is to identify the factors that improve the revenues of farms specialising in milk production. It is based on data from the Polish Farm Accountancy Data Network (Polish FADN). The surveyed group comprised 670 farms and the research period covered the years 2009-2011. The data concerned farms of type 45, i.e. holdings specialising in dairy cattle rearing. Because the level of specialisation differed across the surveyed farms, the share of sales of milk and milk products in the total output value was calculated for each farm. This coefficient was then used to correct costs, assuming their proportional allocation to the respective sections and branches.
The variables describing farming activity were both qualitative and quantitative, so a logit model was used to estimate the probability of a farm achieving higher revenues. Logistic regression is gaining recognition in many disciplines, such as medicine, psychology, technical sciences, banking, insurance, demography and economics. Examples of works applying the logit model include: (Jackowska and Wycinka, 2011; Kmieć, 2015; Kowerski, Bielak and Długosz, 2006; Kasprzyk and Fura, 2011).

Research method
The logistic regression model (like multiple linear regression) makes it possible to examine the impact of many independent variables X1, ..., Xk on the dependent variable Y. The dependent variable is dichotomous, i.e. it takes only two values, coded as 1 and 0, where 1 stands for the presence of a given property and 0 for its absence (Hosmer and Lemeshow, 2000). Logit regression uses the logistic function to describe this relationship (Stanisz, 2007):

$$P(Y = 1 \mid x_1, \dots, x_k) = \frac{e^{\beta_0 + \sum_{i=1}^{k} \beta_i x_i}}{1 + e^{\beta_0 + \sum_{i=1}^{k} \beta_i x_i}} \tag{1}$$

where $\beta_0, \beta_1, \dots, \beta_k$ are the model parameters. This function takes values in the range (0;1) and its graph resembles an elongated letter S. Three stages of change in the function value can be distinguished: up to a certain threshold, changes in the independent variables hardly affect the probability; once the threshold is reached, the probability rises sharply towards one; and it then stays at that level. Such a function has many applications in the description of phenomena in medicine, epidemiology, psychology and economics, e.g. disease risk, chance of recovery, ability to find a job, etc.
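The S-shaped behaviour described above can be illustrated with a minimal sketch that evaluates the logistic function for a single independent variable. The coefficients b0 and b1 are hypothetical values assumed for illustration only; they are not estimates from this study.

```python
import math

def logistic(z):
    """Logistic function: maps a real number z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (assumed for illustration, not estimated from data)
b0, b1 = -5.0, 0.5

# Evaluate P(Y = 1 | x) on a coarse grid of x values
for x in range(0, 21, 5):
    p = logistic(b0 + b1 * x)
    print(f"x = {x:2d}  P(Y=1|x) = {p:.3f}")
```

For small x the probability stays close to zero, it rises fastest around the point where b0 + b1 * x = 0, and it flattens out close to one for large x.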

Problems of Agricultural Economics
The logistic model is a very good tool for analysing the probability of the occurrence of a given event. It shows how this probability depends on variables that can be both quantitative and qualitative. An advantage of the logistic model is that it yields a mathematical formula that can be used to determine the strength and direction of the impact of individual variables on the modelled event.
In addition, the logistic regression model does not require some of the assumptions necessary for linear regression: neither the vector of independent variables nor the residuals have to be normally distributed. A further advantage of logistic regression is that the analysis and interpretation of results are similar to those in classical regression models.
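Transforming the logistic model gives the odds, which should be understood as the ratio of the probability of the occurrence of a given event to the probability of its non-occurrence:

$$\frac{P(Y = 1 \mid x_1, \dots, x_k)}{1 - P(Y = 1 \mid x_1, \dots, x_k)} = e^{\beta_0 + \sum_{i=1}^{k} \beta_i x_i} \tag{2}$$

The natural logarithm of the odds is known as the logit (Stanisz, 2007; Cramer, 2003; Kleinbaum and Klein, 2002):

$$\ln \frac{P(Y = 1 \mid x_1, \dots, x_k)}{1 - P(Y = 1 \mid x_1, \dots, x_k)} = \beta_0 + \sum_{i=1}^{k} \beta_i x_i \tag{3}$$

This equality is the logit form of the logistic model. In the logit model, the log-odds of the occurrence of an event is a linear function of the independent variables.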
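The logit form also makes the interpretation of a single parameter explicit. As a short worked derivation (writing $\beta_j$ for the parameter of the variable $x_j$, a notation assumed here), increasing $x_j$ by one unit while holding the other variables constant changes the odds by the factor

$$\frac{e^{\beta_0 + \dots + \beta_j (x_j + 1) + \dots + \beta_k x_k}}{e^{\beta_0 + \dots + \beta_j x_j + \dots + \beta_k x_k}} = e^{\beta_j}$$

This factor is the odds ratio for $x_j$: a positive $\beta_j$ gives an odds ratio above one (the odds rise), and a negative $\beta_j$ gives an odds ratio below one (the odds fall).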
Once the parameters of the logistic regression model have been estimated, the theoretical value of the variable Y can be established according to the standard forecasting rule:

$$\hat{y}_i = \begin{cases} 1 & \text{if } \hat{p}_i > 0.5 \\ 0 & \text{if } \hat{p}_i \le 0.5 \end{cases} \tag{4}$$

where $\hat{p}_i$ are the theoretical probabilities obtained from the logistic regression model estimated on a random sample. When the sample is unbalanced, i.e. the number of ones differs considerably from the number of zeros, a modified rule based on the optimum limit value α may be used to compute the forecasts:

$$\hat{y}_i = \begin{cases} 1 & \text{if } \hat{p}_i > \alpha \\ 0 & \text{if } \hat{p}_i \le \alpha \end{cases}$$

The limit value α is set as the share of ones in the sample. The correctness of the estimated model can then be assessed by counting the correctly and incorrectly classified cases (Table 1).
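The difference between the two forecasting rules can be sketched as follows. The observed outcomes and fitted probabilities are hypothetical and serve only to show how the cutoff changes the classification in an unbalanced sample; the function names are illustrative.

```python
def classify(probs, cutoff=0.5):
    """Assign 1 when the fitted probability exceeds the cutoff, else 0."""
    return [1 if p > cutoff else 0 for p in probs]

def optimal_cutoff(y):
    """Cutoff for unbalanced samples: the share of ones in the sample."""
    return sum(y) / len(y)

# Hypothetical observed outcomes and fitted probabilities (illustrative only)
y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
p_hat = [0.45, 0.10, 0.25, 0.05, 0.60, 0.15, 0.30, 0.08, 0.12, 0.18]

alpha = optimal_cutoff(y)           # 2 ones out of 10 cases
standard = classify(p_hat)          # standard rule, cutoff 0.5
adjusted = classify(p_hat, alpha)   # rule with the cutoff alpha

print("alpha    =", alpha)
print("standard =", standard)
print("adjusted =", adjusted)
```

With the standard rule only one case is forecast as 1, whereas the lower cutoff α detects both observed ones at the cost of two false positives.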
To measure the goodness of fit of the logistic regression model to the empirical data, one can use the count R², which takes values in the range [0;1] and is defined as follows (Maddala, 2008):

$$\text{count } R^2 = \frac{n_{11} + n_{00}}{n}$$

where $n_{11}$ and $n_{00}$ are the numbers of correctly classified ones and zeros, respectively, and $n$ is the sample size. The closer the value of the measure is to one, the better the fit of the logistic model to the empirical data; count R² is the percentage of correctly classified cases. The model works well in forecasting the researched event when count R² > 50%, which means that classification on the basis of the model is better than random classification. Other measures of goodness of fit can be found in (Sompolska-Rzechuła, Machowska-Szewczyk et al., 2014).
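As a minimal sketch, count R² can be computed directly from observed and forecast classes. The two lists below are hypothetical and are used only to make the arithmetic visible.

```python
def count_r2(y_true, y_pred):
    """Share of cases whose forecast class equals the observed class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical observed and forecast classes (illustrative only)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

r2 = count_r2(y_true, y_pred)
print(f"count R2 = {r2:.2%}")   # 6 of 8 cases are classified correctly
```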
Another method of assessing the quality of a logistic regression model is the Hosmer-Lemeshow test (Hosmer et al., 1989; Homer et al., 2008), which, for different subgroups of data, compares the observed number O_g of objects having the specified property in a given subgroup with the expected number E_g of such objects. If O_g and E_g are close enough, it can be assumed that a well-fitted model has been constructed. Usually, observations are divided into G subgroups for the calculations, using, e.g., deciles. The hypotheses of the test have the following form:
H_0: O_g = E_g for all categories,
H_1: O_g ≠ E_g for at least one category.
The value of the test statistic is calculated as follows:

$$H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g \left(1 - \frac{E_g}{N_g}\right)}$$

where $N_g$ is the number of observations in the g-th subgroup. This statistic has an asymptotic χ² distribution with G-2 degrees of freedom.
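The calculation of the test statistic can be sketched in a few lines. This is an illustrative implementation, not the procedure used in the study: it assumes subgroups of nearly equal size formed after sorting by fitted probability, fitted probabilities strictly between 0 and 1, and simulated data in place of farm records.

```python
import random

def hosmer_lemeshow(y, p, groups=10):
    """Return the Hosmer-Lemeshow statistic and its degrees of freedom.

    Assumes len(y) >= groups and fitted probabilities strictly in (0, 1).
    """
    pairs = sorted(zip(p, y))              # order cases by fitted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        # Slice boundaries give subgroups of (nearly) equal size
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        o_g = sum(yi for _, yi in chunk)   # observed number of ones
        e_g = sum(pi for pi, _ in chunk)   # expected number of ones
        stat += (o_g - e_g) ** 2 / (e_g * (1 - e_g / n_g))
    return stat, groups - 2

# Simulated, well-calibrated data (illustrative only)
random.seed(0)
p_hat = [random.uniform(0.05, 0.95) for _ in range(500)]
y = [1 if random.random() < pi else 0 for pi in p_hat]

stat, df = hosmer_lemeshow(y, p_hat, groups=10)
print(f"H = {stat:.3f}, degrees of freedom = {df}")
```

The statistic is then compared with the critical value of the χ² distribution for the given degrees of freedom (about 15.51 for 8 degrees of freedom at the 0.05 level); a value below the critical value gives no grounds to reject H_0.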
Apart from the measures above, the quality of a logit model is also assessed using the Receiver Operating Characteristic (ROC) curve, which is constructed from the values of the dependent variable and its expected probabilities. It makes it possible to assess the ability of the logistic regression model to classify cases into two groups: those having a specific property and those not having it. The ROC curve is created by joining points with the coordinates (1-specificity, sensitivity). Sensitivity means the ability to detect units that have the specified property:

$$\text{sensitivity} = \frac{n_{11}}{n_{11} + n_{10}}$$

whereas specificity describes the ability to detect units that do not have the specified property:

$$\text{specificity} = \frac{n_{00}}{n_{00} + n_{01}}$$

where $n_{11}$ and $n_{00}$ are the numbers of correctly classified ones and zeros, $n_{10}$ is the number of ones classified as zeros and $n_{01}$ is the number of zeros classified as ones. The curve created in this way, and especially the area underneath it, illustrates the classification quality of a model. When the ROC curve overlaps with the diagonal y = x, assigning a given case to class (1) or (0) on the basis of the model is only as good as distributing the cases between these groups at random. The classification quality of the model is good when the curve lies well above the diagonal y = x, i.e. when the area under the ROC curve is much larger than 0.5.
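A minimal sketch of these quantities is given below. It computes sensitivity and specificity for a chosen cutoff and approximates the area under the ROC curve with the trapezoidal rule over all distinct cutoffs. The data are hypothetical and the function names are illustrative.

```python
def sensitivity_specificity(y, p, cutoff):
    """Sensitivity and specificity of the rule 'forecast 1 when p > cutoff'."""
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi > cutoff)
    fn = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi <= cutoff)
    tn = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi <= cutoff)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(y, p):
    """Area under the ROC curve by the trapezoidal rule over all cutoffs."""
    # Cutoffs from high to low, so the curve runs from (0, 0) to (1, 1)
    cutoffs = [float("inf")] + sorted(set(p), reverse=True) + [float("-inf")]
    points = []
    for c in cutoffs:
        sens, spec = sensitivity_specificity(y, p, c)
        points.append((1 - spec, sens))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical observed classes and fitted probabilities (illustrative only)
y = [1, 1, 1, 1, 0, 0, 0, 0]
p_hat = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

sens, spec = sensitivity_specificity(y, p_hat, 0.5)
auc = roc_auc(y, p_hat)
print("sensitivity =", sens)
print("specificity =", spec)
print("AUC         =", auc)
```

In this example 15 of the 16 possible pairs of one positive and one negative case are ranked correctly by the fitted probabilities, which is why the area equals 15/16.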
In the research period, 2009-2011, the values of the variables Y, X1, X2, X3, X4, X5, X6, X7, X8 and X9 clearly increased. Sales revenues increased by 16.9% per year. The utilised agricultural area also increased, on average by 1.69%, and the number of cows by 1.61%. Yield-forming inputs increased on average by 3.75%. The costs of purchased fodder grew by an average of 9.36% per year. Among other costs, the highest increase was noted for energy costs (an average of 16.62%) and upkeep costs of machines and buildings (6.37%). Depreciation costs increased the least (by an average of 3.54%). The average age of a farm manager also changed over the analysed years, from 44 years in 2009 to 45.6 years in 2011.
In 2009, 18% of the surveyed farms did not have a successor, and by 2011 this share had risen to 21%. In 2009, 17% of farms had a successor and in 2011, 20%. The share of farms managed by women increased by one percentage point (from 13% to 14%). However, the structure of farms did not change in the research period as regards the education of the farm manager. The largest share of farms (29%) was run by managers with vocational agricultural education, 25% by managers with secondary agricultural education, and 10% by managers with primary education. Only 5% of farms had managers with higher or higher agricultural education.
¹ LU - Livestock Unit.
The dependent variable, i.e. the amount of farm revenues, is characterised in each of the analysed years by very strong right-sided asymmetry (Figs. 1-3). For this reason, farms with a higher level of revenues were identified using the median as the measure of the average level of the phenomenon, since, unlike the arithmetic mean, it is resistant to outliers².
The distributions of the dependent variable show very strong right-sided asymmetry (the strongest in 2011), which means that most farms have revenues lower than the average. Farms are very highly differentiated as regards the amount of revenues, from 82% in 2009 to over 86%
² Medians are often used in socio-economic research, given the asymmetric distribution of properties (Młodak, 2006).

Modelling results using logit regression
Because the independent variables include both qualitative and quantitative variables, a logit regression model was used to achieve the objective. The dependent variable was defined in a binary manner: it takes the value 1 when a farm's revenue is at least equal to the median and the value 0 when it is below the median.
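The construction of the binary dependent variable can be sketched as follows. The revenue figures are hypothetical (they are not Polish FADN data) and are deliberately right-skewed to show why the median, rather than the mean, is a sensible threshold.

```python
import statistics

# Hypothetical revenues (illustrative values only): one very large farm
# pulls the arithmetic mean far above the typical level
revenue = [40, 55, 60, 75, 90, 110, 130, 900]

mean = statistics.mean(revenue)
median = statistics.median(revenue)

# Binary dependent variable: 1 if revenue is at least the median, else 0
y = [1 if r >= median else 0 for r in revenue]

print("mean   =", mean)
print("median =", median)
print("y      =", y)
```

Here seven of the eight farms fall below the mean, whereas the median splits the sample into two equal groups, which also keeps the numbers of ones and zeros balanced.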
In order to find the best combination of variables having a significant impact on the probability of increasing revenue, a formal selection of features was carried out using stepwise regression, which gave the following sets of variables:
• for 2009 and 2010: X1, X2, X3, X4, X5, X9,
• for 2011: X1, X2, X3, X4, X5.
The selected sets form new lists of variables that are weakly correlated with each other and, at the same time, strongly correlated with the other variables. Table 2 presents the parameter estimates of the logit models. In all years, all independent variables have a positive, statistically significant impact on the dependent variable. Two variables have the greatest impact on the amount of farm revenue: X1 - UAA (ha) and X2 - number of dairy cows.
Interpreting the odds ratios of these variables (assuming that the other variables in the model are constant) gives the following information:
Table 3 presents an assessment of the correctness of the estimated models in terms of the accuracy of farm classification.
Based on the results presented in Table 3, it can be stated that the estimated logit models are characterised by very high sensitivity and specificity, i.e. they have a high ability to identify farms with actually higher or lower revenue. The values of the count R² coefficient are much higher than 50%, which means that classification based on the models is better than random classification.
The results of the Hosmer-Lemeshow test show that there are no significant differences between the empirical numbers and the theoretical numbers following from the estimated logistic regression models (Table 4). The quality assessment of the obtained logit models also uses the ROC curve and the area under the curve (Table 4 and [...]), which indicate the very high quality of the estimated models in each year. The area is significantly larger than 0.5 (at any significance level above 0.000001 for each model), so it is possible to classify farms on the basis of the constructed models.
On the basis of the estimated logit models it is possible to determine how each independent variable affects the probability of a farm achieving higher revenue. All variables have a clear positive impact: the higher the values of the independent variables, the higher the probability of classifying a farm as achieving a higher revenue level. This is illustrated in Fig. 5 for the number of dairy cows and UAA.
Based on Fig. 5, it can be stated that the probability of classifying a farm as achieving higher revenues depends largely on both the number of dairy cows and UAA. As the number of dairy cows grows, so does this probability. For example, for farms with four dairy cows the probability is 0.7, and for farms with 20 cows it rises to 0.99. The case is similar for UAA, but the rise in probability is more moderate. For a farm with the smallest UAA, ca. 4 ha, the probability of classifying it as a farm with higher revenue is already high, about 0.97, and when UAA increases to 20 ha the probability grows to 0.99.

Conclusions
The use of logit modelling made it possible to identify the factors improving the revenues of farms specialising in milk production and to establish the probability of classifying a farm into the group with higher revenues.
The research gives grounds for the following conclusions:
1. Throughout the research period the values of the features taken as independent variables in the logit model showed upward trends; this regularity concerned the quantitative variables.
2. The average revenue, taken as the dependent variable, also increased year on year, and its distribution was characterised by very strong right-sided asymmetry.
3. The greatest change concerned the presence of a successor. In 2011, 20% of farms declared that they had a successor, an increase of 3 percentage points against 2009.
4. The structure of the surveyed farms did not change significantly as regards the sex of the farm manager, i.e. 87% of farms were managed by men.
5. The structure of farms was also stable when it comes to the education level of the farm manager. The highest share referred to people having secondary agricultural education and the lowest to those having higher agricultural education.
6. Among the independent variables included in the logit regression model, the factors having a significant impact on the probability of increasing income in 2009-2011 were: UAA, number of dairy cows, yield-forming inputs, purchased fodder, other direct costs of animal production and depreciation costs.
7. In each year, the most important independent variables in the models were UAA and the number of dairy cows, both of which had a positive impact on the level of revenues.
8. All of the estimated logit regression models were of very good quality, as evidenced by the count R² coefficient amounting to over 90%, the Hosmer-Lemeshow test pointing to a lack of significant differences between empirical and theoretical numbers, and the area under the ROC curves amounting to over 90%, which indicates the high classification ability of the models.