A BAYESIAN APPROACH TO NORMAL REGRESSION MODELLING WITH AGGREGATE CLIMATIC DATA

Modeling the relationship between some climatic determinants in the wet or cropping season of Makurdi, Benue State, Nigeria requires data aggregation. The consequence of this aggregation is a reduction in sample size, which poses a serious challenge of lack of model fit when the Classical Linear Regression Modeling Approach is employed. The Bayesian Normal Regression Modeling Approach was therefore employed to surmount this problem. Three Bayesian Normal Regression Models were fitted, namely the Solar radiation, Total Rainfall Amount and Number of Dry Days models. Each model result was compared with that of its Classical counterpart. The discrepancies observed were attributed to the sample size reduction. The results of the Bayesian models revealed that Solar radiation increases by 0.791 MJ/m² for each unit increase in the natural logarithm of Relative humidity, and by 0.895 MJ/m² for each unit increase in the natural logarithm of Wind speed. Total Rainfall Amount increases by 66.280 mm for each unit increase in the natural logarithm of the Number of Dry Days, and by 2.912 mm for each unit increase in the natural logarithm of the Number of Wet Days. Furthermore, the Number of Dry Days decreases by 6.905 days for each unit increase in the natural logarithm of the Number of Wet Days, while it increases by 2.028 days for each unit increase in the natural logarithm of Total Rainfall Amount. The study affirmed that the cropping season climate of Makurdi is becoming warmer and drier, and recommends that the Bayesian Normal Regression Modelling approach be employed when sample size reduction due to data aggregation is a concern.


INTRODUCTION
Statistical modeling using individual-level data is paramount for obtaining accurate estimates, especially in Normal Regression Modeling (Moineddin and Urquia, 2014). However, there are circumstances where individual-level data are not available due to confidentiality concerns, or where the research problem requires that data be aggregated. The latter is the case with modeling the relationship between some climatic determinants in the wet or cropping season of Makurdi, Benue State, Nigeria, the focus of this research. The wet season is the cropping season of the area, whose inhabitants are predominantly farmers. These farmers can control most productivity variables except the climatic determinants, which have been stated to be out of their control due to the impact of climate change (Agada, Imande and Amedu, 2018). It is therefore important to understand how these climatic determinants interplay in the cropping season. We believe this can be achieved by modeling the relationships among them in that season. However, studying seasonal relationships among these determinants requires aggregating (averaging) the data over each rainy month of the year (March to October) across the entire data period of 34 years. The consequence of this aggregation is the reduction of the sample size to eight (8). This poses a serious challenge of lack of model fit or incorrect fit when the Classical Linear Regression Modeling Approach is employed. One way of surmounting this obstacle is to employ the Bayesian Normal Regression Modeling Approach, a method reported to be efficient in handling modeling with incomplete and sparse data (Taeryon, Mark, Ketra and Mitchell, 2008; Agada, Udoumoh and Gboga, 2019; Tripathi, Singh and Singh, 2019). This was implemented on the Windows Bayesian Inference Using Gibbs Sampling (WinBUGS) platform.
Some researchers (Moineddin et al., 2014) have demonstrated that Normal Regression Modeling using aggregate data yields regression coefficients similar to those of models based on individual-level data. Other researchers, such as Aymen and Mohammed (2019), argued that aggregating both the dependent and independent variable values smooths out variation at the individual level and therefore artificially inflates the correlation coefficient values. They also argued that assuming that relationships observed for groups necessarily hold for individuals is an ecological fallacy, so that models fitted to group (aggregate) level data cannot be applied to datasets at the individual level. Other supporters of this claim include Dias, Sutton, Welton and Aides (2013). Aymen and Mohammed (2019) therefore recommended partial aggregation of data to ward off the aforementioned problem. In the same vein, Robinson (1950) stated that the correlation between the properties of individual-level data has no bearing on the properties of groups or aggregates. Furthermore, Eric, Hanushek, John, Jackson and Kain (1974) expanded the scope of Robinson's independent variables of Negro or White race to include Mexican and Indian born, as well as the elementary school age population at the state level. They argued that Robinson's conclusion was wrong, stating that the problem is not whether the data are at the individual level but improper model specification. This is because their expanded model estimated the literacy rate of a race better than that of Robinson. The authors emphasized that, since individual-level data may not always be available due to confidentiality concerns, difficulties associated with the use of aggregate data should be dealt with through proper model specification.
A recent study in the area by Agada et al. (2018) revealed that the area is becoming warmer and drier. The authors, however, neither identified the climatic determinants responsible nor quantified the magnitude of their effects on the warming and drying climate. This is the gap in the literature that we have identified and intend to fill. From the studies reviewed above, the problems we must surmount in using aggregate data are inflated correlations among independent variables and improper model specification. We have handled these by proper data transformation and by using the correlation matrix of the climatic determinants to identify independent determinants that correlate significantly with the dependent determinant and/or with one another. Pathan (2015) emphasized the use of the correlations among climatic variables in accomplishing this task. The implications of this study for crop farming in the area led us to fit three normal regression models, namely the Solar Radiation, Total Rainfall Amount and Number of Dry Days models. Results of the Classical and Bayesian models employed in the study were compared. The rest of the paper is organized into Methodology, Results, Discussion, Conclusion and Recommendations.

Source of data and transformation
The data for this work are 34 years (1977 to 2010) of daily data on some climatic variables or determinants of the city of Makurdi, Nigeria. The climatic variables or determinants are: Solar radiation (MJ m⁻² day⁻¹), Relative humidity (%), Air temperature (°C), Wind speed (m/s), Sunshine Duration (hrs), Total Rainfall Amount (mm), Number of Wet Days and Number of Dry Days. The numbers of Wet and Dry Days were determined by counting the days on which the rainfall amount was, respectively, greater than or equal to 0.85 mm and less than 0.85 mm. These correspond to rainfall state transformations 1 and 0 respectively, which were used for the counts. The primary minimum and maximum Air temperature and Relative humidity data were averaged and used as the Air temperature and Relative humidity data in the study. Using the entire dataset, the climatic variables or determinants were aggregated (averaged) over the rainy months of Makurdi, Nigeria (March, April, May, June, July, August, September and October). This was done to reflect their association or interplay in the wet or cropping season of the area.
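The counting and aggregation steps described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the record layout `(year, month, rain_mm)` and all function names are assumptions made for the example, and only the 0.85 mm threshold and the March to October window come from the text.

```python
from collections import defaultdict

WET_DAY_THRESHOLD_MM = 0.85   # threshold stated in the text
RAINY_MONTHS = range(3, 11)   # March (3) to October (10)


def rainfall_state(rain_mm):
    """Return 1 for a wet day (rain >= 0.85 mm) and 0 for a dry day."""
    return 1 if rain_mm >= WET_DAY_THRESHOLD_MM else 0


def monthly_wet_dry_counts(records):
    """Count wet and dry days per (year, month).

    records: iterable of (year, month, rain_mm) daily observations
    (a hypothetical layout). Returns {(year, month): (wet_days, dry_days)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for year, month, rain_mm in records:
        state = rainfall_state(rain_mm)
        counts[(year, month)][0] += state
        counts[(year, month)][1] += 1 - state
    return {key: tuple(value) for key, value in counts.items()}


def aggregate_over_years(monthly_values):
    """Average a monthly quantity over all years, rainy months only.

    monthly_values: {(year, month): value}. Returns {month: mean}, which
    has at most eight entries (one per rainy month).
    """
    totals = defaultdict(float)
    n_years = defaultdict(int)
    for (year, month), value in monthly_values.items():
        if month in RAINY_MONTHS:
            totals[month] += value
            n_years[month] += 1
    return {month: totals[month] / n_years[month] for month in sorted(totals)}
```

Averaging each rainy month over all years is what reduces the sample to eight aggregate observations, one per month.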
The associations among these determinants were determined from their matrix of bivariate correlation coefficients. This was obtained nonparametrically using the Spearman Rank Correlation Statistic. The reason for adopting the Spearman Rank Correlation Statistic is that, unlike its Pearson counterpart, it does not depend on a distributional assumption. The specification of the right variables in the Bayesian Normal model employed in this work is greatly eased by the correlations among the climatic determinants. This is because predictor variables or determinants that relate significantly with the response variable can be easily identified, and those that correlate with one another can be identified for logarithm transformation in order to reduce multicollinearity.
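As an illustration of the statistic used here, the Spearman rank correlation can be computed as the Pearson correlation of the ranks, with tied values given their average rank. The sketch below uses only the Python standard library; the function names are illustrative and are not taken from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, any strictly increasing relationship gives a coefficient of 1, whatever the distribution of the variables.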

Specifying the independent prior distributions
According to Ioannis (2009), the simplest approach to normal regression modeling is to assume that all parameters are a priori independent, having the structure

$$f(\boldsymbol{\beta}, \tau) = f(\tau) \prod_{j=0}^{p} f(\beta_j), \qquad \beta_j \sim N(\mu_{\beta_j}, c_j^2), \qquad \tau \sim \text{Gamma}(a, b),$$

where $\mu_{\beta_j}$, $c_j^2$, $a$ and $b$ are, respectively, the prior mean of $\beta_j$, the prior variance of $\beta_j$, and the shape and rate parameters of the gamma distribution.
In this prior setup, we ensure compatibility with the WinBUGS notation by substituting the variance $\sigma^2$ by the corresponding precision parameter $\tau = 1/\sigma^2$. The gamma prior of the precision parameter induces a prior mean and variance given by

$$E(\tau) = \frac{a}{b}, \qquad V(\tau) = \frac{a}{b^2}.$$

The gamma prior used for $\tau$ corresponds to an inverse gamma prior distribution for the original variance parameter $\sigma^2$, with prior mean and variance given by

$$E(\sigma^2) = \frac{b}{a - 1}, \qquad V(\sigma^2) = \frac{b^2}{(a - 1)^2 (a - 2)},$$

respectively (these exist for $a > 1$ and $a > 2$). When no information is available, a usual choice for the prior mean is the zero value ($\mu_{\beta_j} = 0$). This prior choice centers our prior beliefs around zero, which corresponds to the assumption of no effect of $X_j$ on $Y$.
The prior variance $c_j^2$ of the effect $\beta_j$ is set equal to a large value (e.g., $10^4$) to represent high uncertainty or prior ignorance. Similarly, for $\tau$, equal low prior parameter values ($a = b$) are used, setting in this way its prior mean equal to one and its prior variance equal to $1/a$, which is large when $a$ is small.
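To make the prior setup concrete, the following is a minimal sketch of a componentwise Gibbs sampler for a normal regression with independent normal priors on the coefficients and a gamma prior on the precision. It is not the WinBUGS program used in the study: the function names, the default hyperparameters (`prior_var=1e4`, `a=b=0.01`), the iteration counts and the eight-point data set are all assumptions chosen for illustration, and the gamma prior is taken in its shape and rate form.

```python
import math
import random


def gibbs_normal_regression(X, y, prior_mean=0.0, prior_var=1e4,
                            a=0.01, b=0.01, n_iter=3000, burn_in=1000, seed=1):
    """Componentwise Gibbs sampler for y_i ~ N(sum_j beta_j * x_ij, 1 / tau).

    X is a list of rows; include a leading 1.0 in each row for the intercept.
    Priors (illustrative defaults): beta_j ~ N(prior_mean, prior_var) and
    tau ~ Gamma(shape=a, rate=b). Returns the draws kept after burn-in.
    """
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    tau = 1.0
    beta_draws, tau_draws = [], []
    for iteration in range(n_iter):
        for j in range(p):
            sum_xx = 0.0
            sum_xr = 0.0
            for row, y_i in zip(X, y):
                # Partial residual: remove the fit of every predictor except j.
                others = sum(beta[k] * row[k] for k in range(p) if k != j)
                sum_xx += row[j] ** 2
                sum_xr += row[j] * (y_i - others)
            # Normal full conditional of beta_j (normal prior, normal likelihood).
            post_precision = 1.0 / prior_var + tau * sum_xx
            post_mean = (prior_mean / prior_var + tau * sum_xr) / post_precision
            beta[j] = rng.gauss(post_mean, math.sqrt(1.0 / post_precision))
        # Gamma full conditional of tau; gammavariate takes (shape, scale).
        sse = sum((y_i - sum(b_k * x_k for b_k, x_k in zip(beta, row))) ** 2
                  for row, y_i in zip(X, y))
        tau = rng.gammavariate(a + n / 2.0, 1.0 / (b + sse / 2.0))
        if iteration >= burn_in:
            beta_draws.append(list(beta))
            tau_draws.append(tau)
    return beta_draws, tau_draws


def posterior_means(beta_draws):
    """Average each coefficient over the retained draws."""
    n_draws = len(beta_draws)
    return [sum(draw[j] for draw in beta_draws) / n_draws
            for j in range(len(beta_draws[0]))]


# Hypothetical eight-point data set (not the Makurdi data): y = 2 + 3x + noise.
x_values = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]
noise = [0.05, -0.03, 0.02, -0.04, 0.01, 0.03, -0.02, -0.02]
X_demo = [[1.0, x] for x in x_values]
y_demo = [2.0 + 3.0 * x + e for x, e in zip(x_values, noise)]
draws, _ = gibbs_normal_regression(X_demo, y_demo)
print(posterior_means(draws))
```

The predictor in the example is centered, which keeps the intercept and slope nearly uncorrelated and helps a componentwise sampler mix; with strongly correlated predictors a block update of the coefficients would usually be preferable.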

Interpretation of the regression coefficients
Each regression coefficient $\beta_j$ pertains to the effect of the explanatory variable $X_j$ on the mean of the response variable $Y$, adjusted for the remaining covariates. Inference concerning the model parameters was made by answering the following questions posed by Ioannis (2009).

Is the effect of $X_j$ important for the prediction or description of $Y$?
To answer this question, the posterior distribution of $\beta_j$ is examined to see whether or not it is scattered around zero. Posterior distributions far away from the zero value indicate an important contribution of $X_j$ to the prediction of the response variable. This can be judged by examining the proportion of times that $\beta_j$ exceeds zero, $p(\beta_j > 0)$, or the proportion of times that $\beta_j$ is less than zero, $p(\beta_j < 0)$. This analysis offers a first and reliable tool for tracing important variables in the model.

What is the association between $Y$ and $X_j$ (positive or negative)?
The task here is to identify whether the relationship is positive or negative. We base this on the signs of the posterior summaries of central and relative location (e.g., mean, median, 2.5% and 97.5% percentiles). If all of them are positive, or all are negative, then the corresponding association can be concluded. Positive association means that changes in the explanatory variable $X_j$ are accompanied by changes in the same direction in $Y$, while negative association means that changes in $X_j$ are accompanied by changes in the opposite direction in $Y$. Within this analysis, we calculate a posteriori the probability

$$\pi_0 = \min\{\, p(\beta_j < 0 \mid \mathbf{y}),\; p(\beta_j > 0 \mid \mathbf{y}) \,\} \qquad (6)$$

When the zero value lies at the center of the posterior distribution, $\pi_0$ will be close to 0.5, indicating that there is no clear positive or negative effect of $X_j$ on $Y$. When $\pi_0$ is low (e.g., lower than 2.5%, 1%, or 0.5%), we may conclude positive or negative association depending on the sign of the posterior location summaries. Within WinBUGS we calculate the posterior probability $p(\beta_j > 0 \mid \mathbf{y})$ using the step() function, which creates a binary node taking the value one when $\beta_j$ is positive and zero otherwise. Obtaining the posterior mean of this node via the sample monitor tool provides the estimate of the posterior probability $p(\beta_j > 0 \mid \mathbf{y})$.
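The posterior probability described above can be estimated from a set of MCMC draws by averaging a 0/1 indicator. The sketch below is written in Python rather than WinBUGS and assumes the draws of a coefficient are available as a list of numbers; the function names are illustrative.

```python
def step(value):
    """Indicator used for the sign check: 1 if value >= 0, else 0."""
    return 1 if value >= 0 else 0


def sign_probabilities(beta_draws):
    """Estimate p(beta > 0 | y), p(beta < 0 | y) and their minimum.

    beta_draws: posterior draws of a single regression coefficient.
    The mean of the indicator over the draws estimates p(beta > 0 | y).
    """
    n_draws = len(beta_draws)
    p_positive = sum(step(b) for b in beta_draws) / n_draws
    p_negative = 1.0 - p_positive
    return p_positive, p_negative, min(p_positive, p_negative)
```

For a continuous posterior the event that a draw equals zero exactly has probability zero, so treating zero as positive in the indicator does not affect the estimate.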

What is the magnitude of the effect of $X_j$ on $Y$?
The magnitude of the effect of variable $X_j$ on $Y$ is given by the posterior distribution of $\beta_j$ ($j = 1, 2, \ldots, p$), since

$$\Delta E(Y \mid X_j) = E(Y \mid X_1, \ldots, X_{j-1}, X_j = x + 1, X_{j+1}, \ldots, X_p) - E(Y \mid X_1, \ldots, X_{j-1}, X_j = x, X_{j+1}, \ldots, X_p) = \beta_j.$$

It therefore follows that the posterior mean or median of $\beta_j$ corresponds to the corresponding posterior measure of the expected change in the response variable $Y$. Hence, an increase of one unit in $X_j$, given that the remaining covariates remain stable, induces an a posteriori average change in the expectation of $Y$ equal to the posterior mean of $\beta_j$ (Ioannis, 2009).
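Because the predictors in this study enter the models as natural logarithms, a one-unit increase in a predictor term means that the underlying variable is multiplied by $e \approx 2.718$. The sketch below translates a coefficient into the expected change for a more familiar percentage increase. It assumes a model that is linear in $\ln X$, as the reported interpretations suggest; the function names are illustrative, and 0.791 is the Relative humidity coefficient of the Solar radiation model reported in this paper.

```python
import math


def change_for_multiplicative_factor(beta, factor):
    """Expected change in Y when X is multiplied by `factor`,
    for a model of the form E(Y) = b0 + beta * ln(X) + ..."""
    return beta * math.log(factor)


def change_for_percent_increase(beta, percent):
    """Expected change in Y when X rises by `percent` per cent."""
    return change_for_multiplicative_factor(beta, 1.0 + percent / 100.0)


# Coefficient of ln(Relative humidity) reported for the Solar radiation model.
beta_humidity = 0.791
print(change_for_multiplicative_factor(beta_humidity, math.e))
print(change_for_percent_increase(beta_humidity, 1.0))
```

Under this assumption, a 1% rise in Relative humidity corresponds to a change of about 0.0079 MJ/m² in expected Solar radiation, since $0.791 \times \ln(1.01) \approx 0.0079$.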
We verify the assumptions of the Bayesian normal models graphically, using the history, density and autocorrelation plots of the residuals computed from the WinBUGS program code. The normality of errors assumption is verified by examining the density plot obtained for each month of the season to see whether the residuals follow a normal distribution with mean zero and constant variance. The homogeneity or equal variance assumption of errors is checked by observing the history plots of the residuals for each month. A rectangular band with approximately constant width is an indication of equal or constant variance. The independence assumption of errors is checked by observing the autocorrelation plots for each month to see whether the bars cut off at each lag. If they do, it is an indication of zero autocorrelation at each lag, which implies independence of the errors. The idea of using this graphical approach to check the model assumptions was drawn from the stationarity, Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) concepts in time series modeling. Stationarity of a time series is determined graphically by plotting the actual or historic data (history plot) to see whether it exhibits a constant mean and variance, while the PACF and ACF plots are used to determine graphically the order (p) of the AR(p) and the order (q) of the MA(q) components of an ARMA(p, q) or ARIMA(p, d, q) process. This principle is based on the dependency property of the series (Klaus, 2016).

General goodness of fit
The general goodness of fit of the Bayesian model is determined using the model's coefficient of determination ($R_B^2$) computed using the WinBUGS code. The precision parameter $\tau$ and the variance $\sigma^2$ indicate the precision of the model: if the precision is high and the variance low, then the model can accurately predict (or describe) the expected values of $Y$. Using the coefficient of determination, this can be rescaled as

$$R_B^2 = 1 - \frac{\sigma^2}{s_Y^2},$$

where $s_Y^2$ is the sample variance of $Y$.
The quantity $R_B^2$ can be interpreted as the proportional reduction of uncertainty concerning the response variable $Y$ achieved by incorporating the explanatory variables in the model. Moreover, it can be regarded as the Bayesian analog of the adjusted coefficient of determination $R_{adj}^2$ (used in the frequentist approach to the normal regression model). It is given by

$$R_B^2 = 1 - \frac{\sigma^2}{s_Y^2},$$

or we can directly incorporate the precision parameter $\tau = 1/\sigma^2$ and use it as follows:

$$R_B^2 = 1 - \frac{1}{\tau s_Y^2},$$

where $s_Y^2$ is the sample variance of $Y$. We state here that the model equation (9) is the classical version of the Bayesian normal model. We implemented this model on the datasets using the Statistical Package for Social Science (SPSS) version 21 to enable us to compare its results with those of the Bayesian normal model.
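As a small illustration of the precision form of this measure, the sketch below computes a coefficient of determination for each posterior draw of the precision, assuming the form $R_B^2 = 1 - 1/(\tau s_Y^2)$. The function names and the inputs are hypothetical; in practice the draws of $\tau$ would come from the sampler.

```python
def sample_variance(y):
    """Unbiased sample variance of the response values."""
    n = len(y)
    mean_y = sum(y) / n
    return sum((value - mean_y) ** 2 for value in y) / (n - 1)


def bayesian_r2_draws(tau_draws, y):
    """Return one R_B^2 value per posterior draw of the precision tau,
    assuming R_B^2 = 1 - 1 / (tau * s_Y^2)."""
    s2_y = sample_variance(y)
    return [1.0 - 1.0 / (tau * s2_y) for tau in tau_draws]


def posterior_mean(draws):
    """Posterior mean estimated by the average of the draws."""
    return sum(draws) / len(draws)
```

Computing the measure draw by draw gives a full posterior distribution for it, so a posterior mean and percentiles can be reported alongside the other model parameters.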

The Bayesian normal model convergence diagnostics
Model convergence diagnostics were carried out using history plots, density plots and autocorrelation plots of the beta coefficients. The plots were produced by monitoring the model parameters and measures during the run of the WinBUGS program. Our approach for investigating convergence issues is to inspect the mixing and time trends within the chains of individual parameters. The history plots are the most accessible convergence diagnostics and are easy to inspect visually. The history plot of a parameter plots the simulated values of the parameter against the iteration number. The history plot of a well-mixing parameter should traverse the posterior domain rapidly and should have nearly constant mean and variance. The density plots of the model parameters were checked against their actual probability distributions to see whether the right distribution is simulated. Samples simulated using MCMC methods are correlated. The smaller the correlation, the more efficient the sampling process. Although the Gibbs MCMC algorithm typically generates less-correlated draws, there is a need to monitor the autocorrelation of each parameter to ensure that the samples are approximately independent. The autocorrelation of a well-mixing chain becomes negligible fairly quickly, after a few lags. This was achieved for each of the model parameters and measures.
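The autocorrelation check described above can be reproduced numerically from the stored draws of any parameter. The following is a minimal sketch of the sample autocorrelation function of a chain; the function name is illustrative and the chain is assumed to be a list of numbers that are not all equal.

```python
def autocorrelation(chain, max_lag):
    """Sample autocorrelation of a chain at lags 0, 1, ..., max_lag.

    Uses the usual estimator: the lagged sum of products of deviations
    from the chain mean, divided by the total sum of squared deviations.
    """
    n = len(chain)
    mean = sum(chain) / n
    deviations = [value - mean for value in chain]
    total = sum(d * d for d in deviations)
    return [
        sum(deviations[t] * deviations[t + lag] for t in range(n - lag)) / total
        for lag in range(max_lag + 1)
    ]
```

For a well-mixing chain the values returned for lags beyond the first few should be close to zero, which is the numerical counterpart of the bars dying out quickly in an autocorrelation plot.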

RESULTS AND DISCUSSION
As mentioned earlier, the specification of the right variables in the Bayesian Normal model employed in this work is greatly eased by the correlations among the climatic determinants. The reason is that the independent variables or determinants that relate significantly with the dependent variable can be easily identified, and those that correlate with one another can be identified for logarithm transformation in order to reduce multicollinearity. The coefficients of each Bayesian Normal Regression Model and those of its Classical counterpart are presented in Tables 2, 3 and 4 for the Solar radiation, Total Rainfall Amount and Number of Dry Days models respectively. Table 2 shows, for the Bayesian approach, that Solar radiation relates positively and significantly with Relative humidity and Wind speed (with prob(beta > 0) equal to 1 for each variable). For this approach it can be inferred from the table that, in the wet or cropping season of Makurdi, Solar radiation increases by 0.791 MJ/m² for each unit increase in the natural logarithm of Relative humidity, while it increases by 0.895 MJ/m² for each unit increase in the natural logarithm of Wind speed. The result for its Classical counterpart shows that Solar radiation relates significantly with neither Relative humidity nor Wind speed (p values > 0.05). This discrepancy exists even though the coefficient of determination is over 65% for both models. Table 3 shows, for the Bayesian model, that Total Rainfall Amount relates positively and significantly with Number of Wet Days and Number of Dry Days (with prob(beta > 0) equal to 1 for each variable).
Further results for the Bayesian model reveal that Total Rainfall Amount increases by 66.280 mm for each unit increase in the natural logarithm of the Number of Dry Days, while it increases by 2.912 mm for each unit increase in the natural logarithm of the Number of Wet Days. The result for its Classical counterpart in the same table shows that Total Rainfall Amount relates negatively and significantly with Number of Wet Days (p value < 0.05) but has no significant relationship with Number of Dry Days (p value > 0.05). This result differs from that of the Bayesian model despite the fact that its coefficient of determination is as high as 96%. A good coefficient of determination may not always mean a good fit.
We also state that the reason for the discrepancies between the results of the Bayesian models and their Classical counterparts lies in the aggregation of the entire dataset of 34 years into a dataset of sample size eight (8). This amounts to an incorrect fit with the Classical approach. As mentioned earlier, researchers have affirmed that the Classical or Frequentist Regression Modeling approach may not yield good results with sparse or incomplete datasets (Taeryon et al., 2008; Agada et al., 2019; Tripathi et al., 2019). According to these authors, this is not the case with the Bayesian modeling approach, as it does not depend chiefly on data sampling but on model parameter sampling.
We confirmed the correctness of the Bayesian Normal Model through the satisfaction of structural assumptions, goodness of fit tests and model diagnostic checks. We adopted a graphical approach in verifying the assumptions of the Bayesian normal models. The history, density and autocorrelation plots of the residuals computed from the WinBUGS program code were employed for this purpose. For the normality of errors assumption, observe for each model that the density plots of the residuals obtained for each month of the season follow a normal distribution in the neighborhood of a zero mean, as seen in Fig. 1. The homogeneity or equal variance assumption of errors can be checked by examining the history plots of the residuals for each model and for each month. Observe the rectangular band with approximately constant width that characterizes each plot (see Fig. 2). This is an indication of equal or constant variance. The independence assumption of errors can be checked by examining the autocorrelation plots for each model and for each month. Observe that the bars cut off at each lag (Fig. 3). This is an indication of zero autocorrelation at each lag, signifying the independence of the errors.
We judge the goodness of fit of the Bayesian models from the value of the Coefficient of Determination ($R^2$) for each model. Observe that the history plots show that the model parameters and measures are well mixed. This is because they traverse the posterior domain rapidly with nearly constant mean and variance. The prior distributions of the beta coefficients are normal(0, 0.001), where the second argument is the precision in WinBUGS notation. The density plots of the beta coefficients are normal in shape, in keeping with these normal priors. This further validates each model. The autocorrelation plots of each parameter and measure depict the independence of the samples generated. This is because the autocorrelations become negligible fairly quickly, after a few lags.

Conclusion
The Bayesian Normal Regression Climatic Model is more adequate for modeling the wet or cropping season climate of Makurdi than its Classical counterpart. Solar radiation increases by 0.791 MJ/m² for each unit increase in the natural logarithm of Relative humidity, while it increases by 0.895 MJ/m² for each unit increase in the natural logarithm of Wind speed. Total Rainfall Amount increases by 66.280 mm for each unit increase in the natural logarithm of the Number of Dry Days, while it increases by 2.912 mm for each unit increase in the natural logarithm of the Number of Wet Days. The Number of Dry Days decreases by 6.905 days for each unit increase in the natural logarithm of the Number of Wet Days, while it increases by 2.028 days for each unit increase in the natural logarithm of Total Rainfall Amount. Moreover, the wet or cropping season climate of Makurdi is becoming warmer and drier.

Recommendations
The Bayesian Normal Regression Climatic Models should be employed in modeling the relationships between climatic determinants in the wet or cropping season of Makurdi, Nigeria when sample size reduction due to data aggregation is a concern. Furthermore, the impact of climate change on the warming and drying climate of the cropping season should be checked if the area is to retain its potential for crop production in the long run.