Comparison of Gradient Descent, Stochastic Gradient Descent and L-BFGS

By Z.H. Fu
https://fuzihaofzh.github.io/blog/
# Abstract

This article compares four optimization approaches for logistic regression on the MNIST dataset. The results of Gradient Descent (GD), Stochastic Gradient Descent (SGD), and L-BFGS will be discussed in detail. We propose a Combined Stochastic Gradient Descent with L-BFGS (CL-BFGS), which is an improved combination of L-BFGS and SGD. We conclude that when the dataset is small, L-BFGS performs best; when the dataset is large, SGD is recommended.

# Dataset

The MNIST dataset is one of the most popular datasets in machine learning, especially for handwriting recognition systems. It contains 50,000 images of size 28x28 as the training set, with 10,000 more as the validation set and 10,000 as the test set. The benchmark error rate on MNIST for a traditional linear classification method is 7.6% [1]. However, the results improved dramatically after deep learning methods were proposed: 0.35% with DNNs [2] and 0.23% with CNNs [3]. In this article, we only compare optimization methods for logistic regression.

# Experiment

## Gradient Descent

GD is the most basic optimization method in machine learning. In each epoch, the gradient is computed over the whole dataset, and the parameter vector moves along the negative gradient direction to approach a local minimum (or along the gradient, for a maximum). However, among the methods compared here, GD turns out to be the worst choice. A minimal sketch follows.
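The sketch below shows full-batch gradient descent for softmax (multinomial logistic) regression. The function names, learning rate, and epoch count are illustrative placeholders, not the settings used in the experiments reported here.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gradient_descent(X, y, n_classes, lr=0.1, n_epochs=100):
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]              # one-hot targets from integer labels
    for _ in range(n_epochs):
        P = softmax(X @ W)                # predicted class probabilities
        grad = X.T @ (P - Y) / n          # gradient of the mean cross-entropy loss
        W -= lr * grad                    # one step along the negative gradient per epoch
    return W
```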

## Stochastic Gradient Descent

SGD does better than GD, especially when the dataset is large. Sometimes the data is too large to process all at once, so it must be handled in batches. SGD divides the data into batches and optimizes over one batch at each iteration; the parameters are then carried over to the next batch. The advantage of this mechanism is that, compared with GD, the parameters are updated at every iteration and improve steadily from the first batch onward, whereas in GD the parameters stay fixed throughout each epoch. A mini-batch sketch is given below.
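This sketch reuses `softmax` from the GD example above. The batch size and learning rate are illustrative assumptions, not the experiment's actual hyperparameters.

```python
def sgd(X, y, n_classes, lr=0.1, n_epochs=100, batch_size=600):
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(n_epochs):
        perm = np.random.permutation(n)   # shuffle the data once per epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            P = softmax(X[idx] @ W)
            grad = X[idx].T @ (P - Y[idx]) / len(idx)
            W -= lr * grad                # parameters improve after every batch
    return W
```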

## L-BFGS

L-BFGS stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno. It approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory. It can often reach a better solution than the two methods above in fewer iterations.
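As a sketch, the same softmax model can be trained with an off-the-shelf L-BFGS solver such as `scipy.optimize.minimize` with `method="L-BFGS-B"`; the solver actually used in these experiments is not specified, and the `w0` warm-start parameter is an addition for the combined method below.

```python
from scipy.optimize import minimize

def fit_lbfgs(X, y, n_classes, max_iter=200, w0=None):
    n, d = X.shape
    Y = np.eye(n_classes)[y]

    def loss_and_grad(w_flat):
        W = w_flat.reshape(d, n_classes)
        P = softmax(X @ W)
        loss = -np.log(P[np.arange(n), y] + 1e-12).mean()  # mean cross-entropy
        grad = (X.T @ (P - Y) / n).ravel()                  # flattened gradient
        return loss, grad

    if w0 is None:
        w0 = np.zeros(d * n_classes)
    res = minimize(loss_and_grad, w0, jac=True,             # jac=True: fun returns (loss, grad)
                   method="L-BFGS-B", options={"maxiter": max_iter})
    return res.x.reshape(d, n_classes)
```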

## Combined Stochastic Gradient Descent with L-BFGS

CL-BFGS works the same way as SGD, except that the optimization within each batch is replaced by L-BFGS. It yields a sharp improvement in the short term, but costs a lot of time, because it re-initializes L-BFGS at each epoch. A sketch follows.
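This is a hedged sketch of the CL-BFGS idea as described above, reusing `fit_lbfgs` from the previous example: the SGD batch loop, with each batch optimized by an L-BFGS run warm-started from the current parameters. The inner iteration count and other details are assumptions, not the author's exact settings.

```python
def cl_bfgs(X, y, n_classes, n_epochs=10, batch_size=600, inner_iter=20):
    n = X.shape[0]
    W = None
    for _ in range(n_epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            # L-BFGS starts with an empty curvature history on every call;
            # this repeated re-initialization is the time cost noted above
            W = fit_lbfgs(X[idx], y[idx], n_classes, max_iter=inner_iter,
                          w0=None if W is None else W.ravel())
    return W
```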

# Results

The results are as follows; the error rate of each method on the training, validation, and test sets at each epoch is listed.

|epoch|GD train|GD valid|GD test|SGD train|SGD valid|SGD test|L-BFGS train|L-BFGS valid|L-BFGS test|CL-BFGS train|CL-BFGS valid|CL-BFGS test|
|-|-|-|-|-|-|-|-|-|-|-|-|
|1|0.89404|0.8998|0.8966|0.46814|0.4499|0.4605|0.74056|0.7311|0.7273|0.15724|0.144|0.1531|
|10|0.81976|0.8195|0.817|0.17524|0.1645|0.1687|0.1587|0.1396|0.1494|0.08622|0.087|0.0925|
|20|0.79408|0.7964|0.7944|0.14498|0.1355|0.1373|0.12376|0.1173|0.1215|0.07998|0.084|0.0896|
|30|0.75974|0.7627|0.7588|0.13136|0.1243|0.1249|0.10712|0.1007|0.1094|0.07684|0.0821|0.089|
|40|0.71346|0.7171|0.7145|0.1229|0.1177|0.1182|0.09956|0.0982|0.102|0.0748|0.0818|0.0878|
|50|0.66438|0.664|0.6648|0.11716|0.112|0.1129|0.09642|0.0918|0.1001|0.07376|0.0816|0.0865|
|60|0.61738|0.6167|0.6184|0.11248|0.1084|0.11|0.08876|0.0872|0.0939|0.073|0.0809|0.0862|
|70|0.57716|0.5732|0.5755|0.10916|0.1062|0.1078|0.08532|0.0848|0.0907|0.07254|0.0806|0.0866|
|80|0.54128|0.5339|0.5393|0.10592|0.1042|0.1051|0.08094|0.0823|0.0881|0.07204|0.0812|0.0863|
|90|0.50982|0.4978|0.509|0.10296|0.1018|0.1029|0.08016|0.0803|0.0879|0.0718|0.0813|0.0865|
|100|0.48344|0.47|0.4815|0.10148|0.1004|0.1073|0.0782|0.0807|0.0877|0.07146|0.0809|0.0871|
|200|0.36656|0.3436|0.3437|0.08698|0.0891|0.0916|0.0635|0.0766|0.0791|0.06916|0.0814|0.0864|
|300|0.30336|0.2822|0.2827|0.08034|0.0843|0.084|0.06106|0.0743|0.0774|0.0687|0.0821|0.0881|
|400|0.26512|0.2451|0.245|0.07718|0.0801|0.0803|0.06016|0.0749|0.077|0.0687|0.0816|0.0891|
|500|0.23936|0.2206|0.2226|0.07414|0.0778|0.0788|0.05924|0.0755|0.0775|0.0684|0.0821|0.0892|
|600|0.21986|0.2048|0.2052|0.07164|0.0771|0.0778|0.05916|0.0757|0.0782|0.06826|0.0821|0.0897|
|700|0.20634|0.1932|0.1909|0.07002|0.0762|0.077|0.05848|0.0761|0.0784|0.0683|0.082|0.0894|
|800|0.1954|0.1833|0.1794|0.06876|0.075|0.0768|0.05834|0.0754|0.0789|0.06832|0.082|0.0893|
|900|0.18702|0.1756|0.1717|0.06782|0.0741|0.0767|0.05818|0.0765|0.0797|0.06828|0.0827|0.0902|
|1000|0.17968|0.1696|0.1655|0.06694|0.0733|0.0768|0.05804|0.0763|0.0796|0.06832|0.083|0.0905|
|1100|0.17342|0.1646|0.1586|0.06642|0.073|0.0773|0.05788|0.0769|0.0802|0.06842|0.0821|0.0909|
|1200|0.16904|0.1597|0.154|0.06568|0.0728|0.0776|0.05776|0.0766|0.0794|0.06836|0.0825|0.0916|
|1300|0.16414|0.1557|0.1503|0.06528|0.0729|0.0779|0.05756|0.0766|0.08|0.06838|0.0831|0.0907|
|1400|0.15982|0.1527|0.1457|0.06468|0.0729|0.0778|0.05768|0.0767|0.0796|0.06834|0.0832|0.0906|
|1500|0.15606|0.1479|0.143|0.06442|0.0729|0.0778|0.0577|0.0771|0.0797|0.06826|0.0828|0.0916|
|1600|0.15316|0.1439|0.1397|0.06402|0.0728|0.0777|0.05732|0.0771|0.08|0.06826|0.0828|0.0918|
|1700|0.15028|0.142|0.137|0.06362|0.0732|0.0775|0.05748|0.0771|0.0798|0.06832|0.0831|0.0915|
|1800|0.1476|0.1404|0.1354|0.0634|0.0731|0.0775|0.05718|0.0764|0.0798|0.06822|0.0832|0.0909|
|1900|0.14506|0.1386|0.1334|0.06296|0.0733|0.0777|0.05706|0.0771|0.0801|0.0686|0.0839|0.0917|
|2000|0.14276|0.1367|0.1314|0.0628|0.0732|0.0776|0.05676|0.0766|0.0797|0.06826|0.0837|0.0907|
|2100|0.14062|0.1345|0.1301|0.06278|0.0734|0.0776|0.0567|0.0772|0.0802|0.06808|0.0841|0.0915|
|2200|0.13868|0.1331|0.1287|0.06258|0.0736|0.0774|0.0562|0.0769|0.08|0.0679|0.0835|0.0923|
|2300|0.13676|0.1311|0.1274|0.06232|0.0736|0.0775|0.05662|0.0778|0.0802|0.06784|0.0832|0.0922|
|2400|0.1352|0.1295|0.1266|0.06222|0.0734|0.0774|0.05618|0.0772|0.0805|0.0678|0.0831|0.0919|
|2500|0.1338|0.1287|0.1257|0.06208|0.0733|0.0778|0.0564|0.0777|0.0803|0.0678|0.0832|0.0922|
|2600|0.1325|0.127|0.1244|0.06182|0.0735|0.0776|0.0563|0.0775|0.0804|0.06788|0.0831|0.0923|
|2700|0.1314|0.1257|0.1237|0.0616|0.0736|0.0772|0.05614|0.0781|0.0804|0.06788|0.0832|0.092|
|2800|0.1301|0.1247|0.1221|0.06148|0.0737|0.0773|0.05608|0.0776|0.0804|0.06786|0.0834|0.0921|
|2900|0.12886|0.1235|0.1214|0.06132|0.0733|0.077|0.05604|0.0779|0.0807|0.06766|0.0829|0.092|
|3000|0.12766|0.1232|0.1205|0.06118|0.0733|0.0769|0.0559|0.0778|0.0807|0.06756|0.083|0.0925|
|3100|0.12696|0.1217|0.1201|0.06104|0.0733|0.0771|0.05592|0.0781|0.081|0.06754|0.083|0.0922|
|3200|0.12618|0.1211|0.1186|0.06076|0.0737|0.0769|0.05586|0.0779|0.0805|0.0676|0.083|0.0921|
|3300|0.12548|0.1208|0.1179|0.06066|0.0735|0.077|0.05586|0.0781|0.0813|0.06758|0.083|0.092|
|3400|0.12442|0.1199|0.1174|0.06054|0.0737|0.077|0.05586|0.0779|0.0808|0.06732|0.0832|0.0919|
|3500|0.12364|0.1195|0.1161|0.06048|0.0734|0.0774|0.05588|0.078|0.0814|0.06762|0.0832|0.0922|
|3600|0.12294|0.1187|0.1158|0.06046|0.0733|0.0774|0.05584|0.078|0.0807|0.06808|0.0829|0.0916|
|3700|0.12218|0.1178|0.1149|0.06032|0.0734|0.0777|0.05578|0.0779|0.0812|0.06762|0.0833|0.0925|
|3800|0.12152|0.1169|0.1142|0.06026|0.0734|0.0778|0.05578|0.0779|0.0812|0.06746|0.0835|0.0924|
|3900|0.12072|0.1158|0.1135|0.06026|0.0736|0.0775|0.05578|0.0779|0.0812|0.06808|0.0832|0.0924|
|4000|0.11994|0.1145|0.1136|0.0602|0.0735|0.0773|0.05578|0.0779|0.0812|0.06772|0.0835|0.0927|
|4100|0.11926|0.1135|0.1132|0.06018|0.0736|0.0773|0.05578|0.0779|0.0812|0.06794|0.0833|0.0924|
|4200|0.11896|0.1128|0.1122|0.06008|0.0737|0.0772|0.05578|0.0779|0.0812|0.06786|0.0838|0.0924|
|4300|0.11804|0.1125|0.1115|0.06002|0.0739|0.0774|0.05578|0.0779|0.0812|0.06776|0.0835|0.0927|
|4400|0.1173|0.1122|0.1113|0.05986|0.0738|0.0774|0.05578|0.0779|0.0812|0.06756|0.0836|0.0926|
|4500|0.11664|0.1119|0.111|0.05976|0.0737|0.0774|0.05578|0.0779|0.0812|0.06792|0.0836|0.0923|
|4600|0.11614|0.1117|0.1104|0.05972|0.0738|0.0775|0.05578|0.0779|0.0812|0.06768|0.0838|0.0925|
|4700|0.11538|0.1109|0.1096|0.05968|0.0738|0.0774|0.05578|0.0779|0.0812|0.06768|0.0839|0.0925|
|4800|0.11502|0.1107|0.1094|0.05964|0.0739|0.0774|0.05578|0.0779|0.0812|0.06778|0.0839|0.0925|
|4900|0.11434|0.1101|0.1086|0.05966|0.074|0.0775|0.05578|0.0779|0.0812|0.06762|0.0839|0.0926|
|5000|0.11382|0.1088|0.1082|0.05968|0.0741|0.0775|0.05578|0.0779|0.0812|0.06766|0.0838|0.0927|
|5100|0.1134|0.1087|0.1081|0.05964|0.0743|0.0775|0.05578|0.0779|0.0812|0.06766|0.0839|0.0928|
|5200|0.11286|0.1083|0.1077|0.05962|0.0745|0.0776|0.05578|0.0779|0.0812|0.06762|0.0838|0.0928|
|5300|0.11262|0.1083|0.1075|0.05952|0.0744|0.0778|0.05578|0.0779|0.0812|0.0676|0.0839|0.0929|
|5400|0.1121|0.1079|0.107|0.05954|0.0744|0.0777|0.05578|0.0779|0.0812|0.06766|0.0843|0.0931|
|5500|0.11164|0.1074|0.1064|0.05954|0.0745|0.0777|0.05578|0.0779|0.0812|0.06772|0.0842|0.0927|
|5600|0.11128|0.1069|0.1059|0.05956|0.0745|0.0776|0.05578|0.0779|0.0812|0.06772|0.0844|0.093|
|5700|0.111|0.1069|0.1058|0.05958|0.0747|0.0774|0.05578|0.0779|0.0812|0.06768|0.0843|0.093|
|5800|0.11032|0.1068|0.1051|0.0595|0.0745|0.0775|0.05578|0.0779|0.0812|0.06772|0.0844|0.0931|
|5900|0.10988|0.1063|0.1047|0.0595|0.0744|0.0774|0.05578|0.0779|0.0812|0.06776|0.0844|0.0931|
|6000|0.10946|0.106|0.104|0.0595|0.0744|0.0776|0.05578|0.0779|0.0812|0.06774|0.0844|0.0931|
|6100|0.1091|0.1057|0.1036|0.0595|0.0744|0.0778|0.05578|0.0779|0.0812|0.06752|0.0846|0.0932|
|6200|0.10868|0.1055|0.1036|0.05944|0.0744|0.0778|0.05578|0.0779|0.0812|0.06764|0.0845|0.0931|
|6300|0.10822|0.1055|0.1035|0.05942|0.0741|0.0778|0.05578|0.0779|0.0812|0.06762|0.0848|0.0933|
|6400|0.10794|0.105|0.103|0.05938|0.0744|0.0779|0.05578|0.0779|0.0812|0.0676|0.0845|0.0926|

# Conclusion
We can see from the results above:

- In the short term, the batch methods work better than using the whole dataset. The parameters improve during the processing of each batch, so the error rate decreases faster.
- In the long term, however, using the whole dataset is better than a batch method, because the classifier must ultimately balance the error over all the data, and optimizing the parameters on one batch can make them worse on the others.
- L-BFGS works better than the GD and SGD methods.
- CL-BFGS works better than L-BFGS in the short term.

So, if the data is small enough to be handled at one time, L-BFGS is recommended. However, for many problems the data is too large, and a batch method is necessary. CL-BFGS performs better than SGD at each epoch but costs much more time, so SGD is recommended under such circumstances.

[1] LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-Based Learning Applied to Document Recognition" (PDF). Proceedings of the IEEE 86 (11): 2278–2324. doi:10.1109/5.726791
[2] Cireșan, Dan Claudiu; Ueli Meier; Luca Maria Gambardella; Jürgen Schmidhuber (December 2010). "Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition". Neural Computation 22 (12). doi:10.1162/NECO_a_00052
[3] Cireșan, Dan; Ueli Meier; Jürgen Schmidhuber (2012). "Multi-column deep neural networks for image classification" (PDF). 2012 IEEE Conference on Computer Vision and Pattern Recognition: 3642–3649. arXiv:1202.2745. doi:10.1109/CVPR.2012.6248110