By Z.H. Fu
https://fuzihaofzh.github.io/blog/

We introduced many tools and tricks in my previous post, Struggling in importing wikipedia into Elasticsearch. In that post, we successfully imported Wikipedia into Elasticsearch with Logstash. However, things are not settled. The biggest problem with Logstash is that it is extremely user-unfriendly: even filtering out an HTML tag can cost you half a day of searching Google for how to modify the config file. This became very annoying, since I had completely forgotten all the Logstash tricks. Therefore, in this post we will explore how to use Python to import Wikipedia into Elasticsearch directly.

The first step is to convert the Wikipedia source file into a more formatted one. I choose to use gensim’s Wikipedia tool to do this. It can be run like this:

python -m gensim.scripts.segment_wiki -i -f enwiki-20190320-pages-articles-multistream.xml.bz2 -o enwiki-20190320-pages-articles-multistream.json.gz -w 100
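The output is a gzipped JSON-lines file, one article per line. A minimal sketch of iterating over it (the key names follow gensim's documented `segment_wiki` output; the demo writes a tiny stand-in file so the sketch is self-contained):

```python
import gzip
import json

def iter_articles(path):
    # Each line of the segment_wiki output is one article serialized as JSON.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Demo with a tiny stand-in file instead of the real multi-GB dump.
sample = {"title": "Example", "section_titles": ["Introduction"],
          "section_texts": ["Example body text."]}
with gzip.open("sample.json.gz", "wt", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")

for article in iter_articles("sample.json.gz"):
    print(article["title"])  # prints: Example
```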

It will convert the latest

Read more »

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

In mathematical optimization, we often encounter constraints that require special techniques or methods to handle. This blog post explains in detail how to transform some common, complex constraints into more tractable forms, including integer constraints, either-or constraints, if-then constraints, enforcing K out of N constraints, functions taking one of N possible values, fixed-charge problems, and the binary representation of general integer variables. These transformation techniques help us better understand and solve optimization problems arising in practice. Let us study them one by one.
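As one concrete instance of these tricks (an illustration of my own, not taken from the post body): an either-or constraint can be encoded with a binary variable and a big-M constant:

\begin{aligned} &\text{either } g_1(x) \le 0 \text{ or } g_2(x) \le 0 \quad\Longrightarrow\\ &g_1(x) \le M z,\quad g_2(x) \le M(1-z),\quad z \in \{0, 1\} \end{aligned}

where $M$ is a sufficiently large constant: when $z=0$ the first constraint is enforced, and when $z=1$ the second is.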

Read more »

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

Dynamic programming is a powerful mathematical tool that can solve a range of seemingly complex optimization problems. This blog post shows how to use dynamic programming to solve various types of problems, including resource allocation, production planning, purchasing under uncertainty, the knapsack problem, composite-system reliability, equipment replacement, the traveling salesman problem, and integer programming. For each problem we start from its static programming model, then explain in detail how to convert it into a dynamic programming model, and walk through the solution process. I hope this article helps you better understand and apply dynamic programming to solve real-world problems.
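To make one of these concrete, here is a minimal sketch of the 0/1 knapsack problem solved by dynamic programming (the numbers are toy data of my own, not from the post):

```python
def knapsack(weights, values, capacity):
    # dp[c] = best value achievable with capacity c using the items seen so far.
    dp = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacity downwards so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]

print(knapsack([2, 3, 4], [3, 4, 5], 5))  # best is items 1 and 2: value 7
```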

Read more »

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

In operations research, modeling is the first step toward solving a real-world problem, and a set of commonly used mathematical models can help us better understand and solve all kinds of complex management and decision problems. This article introduces some of the mathematical models most frequently used in operations research, covering the transportation problem, the assignment problem, network problems, and their concrete applications and solution methods. A deeper understanding of these models will let you deal more effectively with challenges in resource allocation, route planning, and flow optimization. Below we explore these interesting and practical mathematical models together, hoping to offer more inspiration and help for applying operations research.
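As a tiny illustration of one of these models, a 3×3 assignment problem can be solved by brute force over permutations (the cost matrix is toy data of my own):

```python
from itertools import permutations

# cost[i][j]: cost of assigning worker i to job j (illustrative numbers).
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]

def solve_assignment(cost):
    # Enumerate all ways to give each worker a distinct job; keep the cheapest.
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return best, sum(cost[i][best[i]] for i in range(n))

assignment, total = solve_assignment(cost)
print(assignment, total)  # (1, 0, 2) 5
```

Brute force is only feasible for tiny instances; the Hungarian algorithm solves the same model in polynomial time.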

Read more »

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

In the realm of stochastic processes and queueing theory, understanding various mathematical models and their applications is essential for tackling real-world problems efficiently. This comprehensive “Stochastic Process Cheat Sheet” provides a valuable resource for grasping the fundamentals of Poisson processes, Markov chains, continuous-time Markov chains, Martingales, Queueing Theory, and more. This guide covers a wide range of topics, from basic definitions to intricate equations and practical examples.
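As one small example of the kind of computation such a cheat sheet covers, the stationary distribution of a two-state Markov chain can be approximated by power iteration (the transition matrix is a toy choice of my own):

```python
# Two-state transition matrix P (rows sum to 1); illustrative numbers.
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = [0.5, 0.5]  # start from any initial distribution
for _ in range(1000):
    # One step of pi <- pi P.
    pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
          pi[0] * P[0][1] + pi[1] * P[1][1]]

print([round(p, 4) for p in pi])  # converges to [0.8333, 0.1667]
```

The exact answer here is $\pi = (5/6, 1/6)$, obtained by solving $\pi = \pi P$ with $\pi_0 + \pi_1 = 1$.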

Read more »

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

This post summarizes some important problem types and concepts related to stochastic processes, covering Markov chains, countable Markov chains, continuous-time Markov chains, Martingales, and several common models. These notes should help in understanding and solving problems involving stochastic processes.
In stochastic processes, the Markov chain is a key concept. We discuss how to find the stationary distribution from a given transition matrix, how to compute state-transition probabilities, and how to compute the probability of reaching a target state from a given starting state. We also examine related concepts such as absorbing states, recurrent states, and the expected times of first arrival and first change.
For countable Markov chains, we show how to determine whether a chain is transient or positive recurrent, and how to find the limiting probabilities. We also discuss classic models such as the Gambler's Ruin problem and the Alarm Clock problem, with corresponding solution methods.
For continuous-time Markov chains, we study how to build the generator matrix, how to compute the transition matrix, and how to find the stationary distribution. We also cover computing the expected times of first change and first arrival.
Finally, we cover the Martingale concept and how to decide whether a stochastic process satisfies the Martingale property.
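For instance, the Gambler's Ruin problem mentioned above has a well-known closed form for the probability of reaching the target fortune $N$ before going broke; a small sketch (the parameters are toy choices of my own):

```python
def ruin_win_prob(i, N, p):
    # Probability of reaching N before 0, starting from fortune i,
    # winning each unit bet independently with probability p.
    if p == 0.5:
        return i / N          # fair game: linear in the starting fortune
    r = (1 - p) / p           # ratio q/p from the standard derivation
    return (1 - r**i) / (1 - r**N)

print(ruin_win_prob(5, 10, 0.5))   # fair game: 0.5
print(round(ruin_win_prob(5, 10, 0.4), 4))  # unfavorable game: well below 0.5
```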

Read more »

By Z.H. Fu
https://fuzihaofzh.github.io/blog/
Elasticsearch is a powerful search engine based on Lucene. It can build indexes on huge amounts of data, and we can query the data quickly by keywords. Unlike a common database, Elasticsearch builds an inverted index and is capable of searching keywords across all documents. Wikipedia's search tool is now Elasticsearch. In this post, we introduce how to set up a local Elasticsearch instance and import the Wikipedia dump into it. Some basic usages of Elasticsearch are also introduced.

Download Elasticsearch

We run Elasticsearch on a Linux system. Firstly, we download Elasticsearch from the web.

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.2.zip
unzip elasticsearch-5.5.2.zip
cd elasticsearch-5.5.2
bin/elasticsearch-plugin install analysis-icu
./bin/elasticsearch-plugin install org.wikimedia.search:extra:5.5.2

Here we install some plugins provided by Wikimedia. Afterwards, we start Elasticsearch by:

./bin/elasticsearch

Do not follow the official old post

This famous post provided by Elasticsearch, https://www.elastic.co/blog/loading-wikipedia, is out of date: many changes have been made in the development of Elasticsearch since it was written. I personally encountered a lot of problems while reproducing this tutorial, and I could not find solutions for them.

Use Logstash

Logstash is a log-processing tool. However, it can also be used as a processing tool for our data: since the Wikipedia dump is a text stream, Logstash can process it with its rich set of plugins.
The framework of Logstash is shown as follows
[Figure: the Logstash pipeline]
We can define our own Input, Filter, and Output stages to process the text stream into our desired data.
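As a sketch of what these three stages look like, here is a minimal Logstash pipeline configuration (the field names and index name are illustrative placeholders, not the exact ones used for the Wikipedia import):

```conf
input {
  # Read the decompressed dump from stdin, one JSON document per line.
  stdin { codec => "json" }
}
filter {
  # Drop bookkeeping fields added by Logstash itself.
  mutate { remove_field => ["@version", "host"] }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "enwiki"
  }
}
```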

Read more »

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

Optimization is the foundation of many other popular technologies and is widely used in our modern society. In this post, we will go through the basics of the foundations of optimization. We will start from the simplest linear programming, extend it to a more general conic version, and finally arrive at Lagrange optimization. In order to concentrate on the main framework of optimization, we leave out some long proofs which can easily be found in reference materials. Everything in this post comes from the course ENGG5501 taught by Prof. Anthony Man-Cho So at CUHK.

(I) Linear Programming

In linear programming (LP), the objective function and constraints are all linear w.r.t. variables. The standard form of linear programming is given as follows:

\begin{aligned} (P)\ \ \ \ \ \ v_p^* = \min_\mathbf{x}\ \ \ &\mathbf{c}^T \mathbf{x}\\ \text{s.t. }\ \ &A\mathbf{x}=\mathbf{b}\\ & \mathbf{x} \ge \mathbf{0} \end{aligned}

where $\mathbf{x},\mathbf{c}\in \mathbb{R}^n$, $\mathbf{b}\in \mathbb{R}^m$, and $A\in \mathbb{R}^{m \times n}$. Since $\mathbf{x}\ge \mathbf{0}$, the feasible region contains no line, which implies that it has at least one vertex. A standard theorem then states: if (P) has at least one vertex, then either the optimal value is $-\infty$ or there exists a vertex that is optimal.

Weak Duality

Observe that in problem (P) we are minimizing some function. A natural question is whether we can construct a lower bound such that, no matter how we change our variables, the objective function is always at least that bound. Now, let us construct such a lower bound.

If we can find a vector $\mathbf{y} \in \mathbb{R}^m$ s.t. $A^T \mathbf{y}\le \mathbf{c}$, then for any feasible $\mathbf{x}$ we have $\mathbf{b}^T\mathbf{y}=\mathbf{x}^TA^T\mathbf{y}\le \mathbf{c}^T\mathbf{x}$. This is in fact another optimization problem: the inequality above holds for every such $\mathbf{y}$, so obviously we want to find the maximal one to get the best lower bound. We first write the problem as follows:

\begin{aligned} (D)\ \ \ \ \ \ v_d^* = \max_\mathbf{y}\ \ \ &\mathbf{b}^T \mathbf{y}\\ \text{s.t. }\ \ &A^T \mathbf{y}\le \mathbf{c} \end{aligned}

Here, we find that for any LP primal problem (P), there exists a dual problem (D). The optimal value of the dual problem (D) is a lower bound on that of the primal problem (P). This is called weak duality. By "weak", we mean that only an inequality relation holds between the two problems ($v_d^* \le v_p^*$). A natural question is when the primal optimal value equals the dual optimal value ($v_p^* = v_d^*$). The equality is powerful: if $v_p^* = v_d^*$ and the primal problem is too complicated, we can solve it by solving the dual problem instead. When the equality holds is studied in strong duality theory.
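To make the primal-dual pair concrete, here is a tiny instance of my own (not from the course notes): take $\mathbf{c}=(1,2)^T$, $A=(1\ \ 1)$, $\mathbf{b}=1$. Then

\begin{aligned} (P)\ \ v_p^* = \min_{\mathbf{x}\ge \mathbf{0}}\ & x_1 + 2x_2 \quad \text{s.t. } x_1 + x_2 = 1\\ (D)\ \ v_d^* = \max_{y}\ & y \qquad\quad\ \ \text{s.t. } y \le 1,\ y \le 2 \end{aligned}

The primal optimum is $\mathbf{x}^*=(1,0)^T$ with $v_p^*=1$; the dual optimum is $y^*=1$ with $v_d^*=1$. Here the two values coincide, as the strong duality theorem below guarantees.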

Strong Duality

To get the strong duality, we first introduce Farkas’s Lemma without proof.

(Farkas’s Lemma) Let $A \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^m$ be given. Then, exactly one of the following systems has a solution:

\begin{aligned} &Ax=b,\ x\ge0 \\ &A^Ty\le 0,\ b^Ty>0 \end{aligned}

Farkas’s Lemma shows that one and only one of the two systems has a solution. This can be used to prove the strong duality of the LP problem.

(LP Strong Duality) Suppose that (P) has an optimal solution $x^* \in \mathbb{R}^n$. Then, (D) also has an optimal solution $y^* \in \mathbb{R}^m$, and $c^T x^* = b^T y^*$.

Read more »

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

SVM (Support Vector Machine) was an important tool in pattern recognition before the wave of various kinds of neural networks. In this post, we revisit the derivation of SVM and gain some important intuition about the result.

The original problem

Suppose we have some training data $\mathcal{D}=\{(\mathbf{x}_1, y_1),(\mathbf{x}_2, y_2),\cdots,(\mathbf{x}_n, y_n)\}$, in which $\mathbf{x}_i$ is a feature vector and $y_i \in \{-1, 1\}$ is the class label of each sample. Our task is to find a classifier that uses a separating hyperplane to divide the points in $\mathcal{D}$ into two parts, so that points on each side of the hyperplane share the same class label $y$. This classifier is denoted as $f(\mathbf{x})=\text{sign}(\mathbf{w}^T\mathbf{x}+b)$. For any hyperplane that successfully divides the two classes, we define the geometric margin of the $i$th point as the distance from the point to the hyperplane:

\gamma_i=\frac{y_i(\mathbf{w}^T\mathbf{x}_i+b)}{||\mathbf{w}||}


We observe that for both classes $y=1$ and $y=-1$, the geometric margin is larger than 0. We can define the geometric margin of our dataset $\mathcal{D}$ as:

\gamma=\min_{i=1,\cdots,n} \gamma_i
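As a quick numeric check, the two formulas above can be computed directly (the data, $\mathbf{w}$, and $b$ below are toy values of my own, chosen by hand rather than trained):

```python
import math

# Toy linearly separable data: (feature vector, label in {-1, 1}).
data = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((-1.0, -2.0), -1), ((-2.0, 0.0), -1)]
w = (1.0, 1.0)
b = -1.0

def geometric_margin(x, y, w, b):
    # gamma_i = y_i * (w^T x_i + b) / ||w||
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b) / math.hypot(*w)

margins = [geometric_margin(x, y, w, b) for x, y in data]
gamma = min(margins)  # margin of the whole dataset
print(round(gamma, 4))  # 3 / sqrt(2) ≈ 2.1213
```

All margins are positive, confirming that this hyperplane separates the two classes.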

Read more »

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

Jordan canonical form is an important matrix form, because not all matrices have an eigendecomposition, but it is guaranteed that all real matrices have a Jordan canonical form decomposition.

This article gives an example to show that when a matrix cannot be diagonalized, it is still similar to some approximately diagonal form. We can construct various approximately diagonal forms, but the Jordan canonical form is the simplest.

We have

A=\begin{bmatrix} 4 &3 &-4\\ -1 &0 &2\\ 1 &1 &0 \end{bmatrix}

We solve the equation $Av=\lambda v$ to get the eigenvalues and eigenvectors. We get:

\begin{aligned} &\lambda_1 = 2,\ \ v_1=(2, 0, 1)^T \\ &\lambda_2 = 1,\ \ v_2=(-1, 1, 0)^T \end{aligned}
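A quick sanity check of these two eigenpairs, sketched in plain Python (no linear-algebra library assumed):

```python
A = [[4, 3, -4],
     [-1, 0, 2],
     [1, 1, 0]]

def matvec(M, v):
    # Plain 3x3 matrix-vector product.
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

# Check A v = lambda v for both eigenpairs found above.
pairs = [(2, [2, 0, 1]), (1, [-1, 1, 0])]
for lam, v in pairs:
    print(matvec(A, v) == [lam * x for x in v])  # True for both pairs
```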

Read more »