We introduced many tools and tricks in my previous post Struggling in importing wikipedia into Elasticsearch, where we successfully imported Wikipedia into Elasticsearch with Logstash. However, things are not settled. The biggest problem with Logstash is that it is extremely user-unfriendly: even filtering out an HTML tag can cost half a day of Googling how to modify the config file. This became very annoying since I had completely forgotten all the Logstash tricks. Therefore, in this post we will explore how to use Python to import Wikipedia into Elasticsearch directly.
The first step is to convert the Wikipedia source file into a better-structured format. I chose gensim's Wikipedia tool to do this. It can be run like this:
```bash
python -m gensim.scripts.segment_wiki -i -f enwiki-20190320-pages-articles-multistream.xml.bz2 -o enwiki-20190320-pages-articles-multistream.json.gz -w 100
```
It converts the Wikipedia dump into a .json.gz file in which each line contains a JSON dict representing one page. The details can be found on gensim's official site.
Afterwards, we get a well-formatted file, enwiki-20190320-pages-articles-multistream.json.gz. We will then import it into Elasticsearch line by line.
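Before writing the importer, it is worth peeking at one record to see what each line looks like. Below is a small sketch that reads the first record with the standard library's gzip module; the field names title, section_titles, and section_texts are what segment_wiki produces as far as I recall, so verify them against your gensim version.

```python
import gzip
import json

# Peek at the first article in the segment_wiki output.
with gzip.open('enwiki-20190320-pages-articles-multistream.json.gz', 'rt', encoding='utf-8') as f:
    first = json.loads(next(f))

print(first['title'])               # article title
print(first['section_titles'][:3])  # first few section headings
print(len(first['section_texts']))  # number of sections
```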
Before we begin writing our code, we should first set some parameters for Elasticsearch so that it can better handle large amounts of data and heavy query loads. There are two modifications:
- In `config/elasticsearch.yml`, add a `thread_pool` section (a fuller sketch follows this list):

```yaml
thread_pool:
```
This makes the thread pool much larger than the default one, so Elasticsearch can handle many concurrent indexing and search requests.
- In `config/jvm.options`, change `-Xmx1g` to `-Xmx32g` to give Elasticsearch a larger JVM heap. If the JVM heap is too small, Elasticsearch will throw errors during the import.
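For concreteness, here is a minimal sketch of what the `thread_pool` section could look like. The specific numbers are assumptions for illustration rather than the exact settings from my config; `thread_pool.write.queue_size` and `thread_pool.search.queue_size` control how many pending requests each pool will queue before rejecting them.

```yaml
# config/elasticsearch.yml -- assumed example values, tune for your hardware
thread_pool:
  write:
    queue_size: 1000   # allow more pending bulk/index requests
  search:
    queue_size: 1000   # allow more pending search requests
```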
Finally, we can read from the new Wikipedia data file and import it into Elasticsearch. The script begins like this:
```python
from multiprocessing import Process, Queue, Pool
```
In this script, we use a producer-consumer model to read the data and import it into Elasticsearch.
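To make the producer-consumer idea concrete, here is a minimal sketch of such an importer, assuming the official elasticsearch Python client: one producer process parses the .json.gz dump line by line and pushes articles onto a bounded queue, while several consumer processes bulk-index them. The index name wikipedia, the number of consumers, the batch size of 500, and the title/text field layout are illustrative choices you may want to adjust.

```python
import gzip
import json
from multiprocessing import Process, Queue

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

DUMP = 'enwiki-20190320-pages-articles-multistream.json.gz'
INDEX = 'wikipedia'   # assumed index name
N_CONSUMERS = 4       # number of indexing processes
BATCH_SIZE = 500      # documents per bulk request
SENTINEL = None       # marks the end of the stream

def producer(queue):
    """Read the dump line by line and push parsed articles onto the queue."""
    with gzip.open(DUMP, 'rt', encoding='utf-8') as f:
        for line in f:
            queue.put(json.loads(line))
    for _ in range(N_CONSUMERS):
        queue.put(SENTINEL)  # tell every consumer to shut down

def consumer(queue):
    """Pull articles off the queue and bulk-index them into Elasticsearch."""
    es = Elasticsearch(['http://localhost:9200'])
    actions = []
    while True:
        article = queue.get()
        if article is SENTINEL:
            break
        actions.append({
            '_index': INDEX,
            '_source': {
                'title': article['title'],
                'text': '\n'.join(article['section_texts']),
            },
        })
        if len(actions) >= BATCH_SIZE:
            bulk(es, actions)
            actions = []
    if actions:  # flush whatever is left
        bulk(es, actions)

if __name__ == '__main__':
    queue = Queue(maxsize=1000)  # bounded so the producer cannot run far ahead
    consumers = [Process(target=consumer, args=(queue,)) for _ in range(N_CONSUMERS)]
    for p in consumers:
        p.start()
    producer(queue)              # the main process acts as the producer
    for p in consumers:
        p.join()
```

The bounded queue keeps memory usage stable, and batching the bulk requests is what makes the import fast; with the enlarged thread pool queue from the config change above, Elasticsearch should accept the concurrent bulk requests without rejecting them.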