Documentation for Wikiparse¶
Wikiparse is a tree-based approach to representing Wikipedia pages, using tightly connected nodes to allow flexible links between both the visible and invisible information available on a page (such as a link and the page it points to). Data is cached as raw wikitext and as JSON, both for speed and for flexibility, allowing other libraries to use the produced JSON instead of the front-end Python module if desired. The tree-based layout of pages and the caching of parsed data make Wikiparse both much faster and much more capable than similar libraries. This opens the door to using Wikipedia in new ways, furthering research into Natural Language Processing (NLP) with Wikipedia as a broad, open-source, and up-to-date corpus.
wikipage¶
The wikipage module exposes the WikiPage class, which is the primary tool for retrieving and analyzing Wikipedia pages. The constructor simply takes a page title and retrieves the page as a WikiPage object:
>>> page = wikipage.WikiPage('Python (programming language)')
The resulting object contains all the textual content of the specified
page. Note that to follow redirections, it is recommended that you use
WikiPage.resolve_page()
instead.
WikiPages are structured internally as trees. Each element in the tree
inherits from PageElement
. The most common type of
PageElement is a Context
, essentially a loose collection
of other page elements. Most types derive from Context, adding some
kind of peripheral information on the side. For example, an
InternalLink
is a Context
whose content is the textual
element of the link, but with an added attribute target
that
specifies which Wikipedia page that link points to. This kind of
separation of the text (e.g., “legislation”) from the metadata attached
to that text (e.g., a link to “List of United States federal legislation”)
allows for a very clean and simple presentation of the plaintext of a
page without any loss of information.
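To make the text/metadata separation concrete, the sketch below walks a page tree and collects (display text, link target) pairs. This is a hypothetical illustration of the traversal pattern, not part of the wikiparse API; it assumes only what is documented above, namely that Context nodes are iterable and that link nodes expose a target attribute, and get_plain_text stands in for calling str(element.get_text()) on real nodes.

```python
def get_plain_text(element):
    """Flatten an element to its display text; a hypothetical stand-in
    for calling str(element.get_text()) on real wikiparse nodes."""
    if isinstance(element, str):
        return element
    if hasattr(element, 'text'):
        return element.text
    # Contexts are iterable; concatenate the text of their children.
    return ''.join(get_plain_text(child) for child in element)

def collect_links(element, found=None):
    """Recursively gather (display text, link target) pairs from a
    page tree, assuming the duck-typed interface described above."""
    if found is None:
        found = []
    if hasattr(element, 'target'):
        found.append((get_plain_text(element), element.target))
    if not isinstance(element, str) and hasattr(element, '__iter__'):
        for child in element:
            collect_links(child, found)
    return found
```

On a real page, `collect_links(page.content)` would return every link paired with the plain text it displays, mirroring the "legislation" example above.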
WikiPage¶
-
class
wikiparse.wikipage.
WikiPage
(title, follow_redirections=True)[source]¶ Loads the data for and constructs a page object representing a page from Wikipedia. This process automatically obtains the wikitext and JSON cached representations of the page.
Parameters: title (str) – The name of the page to obtain. Titles cannot be corrected before retrieval of the page, so consider using wikiparse.filemanager.possible_titles()
if you want to make sure that you are using a cached page.-
all_elements
¶ A flat dictionary of every element contained in this page, indexed by ID.
-
content
¶ The entire content of the main body of this page.
-
external_links
¶ A list of all the external links contained in this page.
-
internal_links
¶ A list of all the internal links contained in this page.
-
intro
¶ A context containing the introductory section of the page.
-
redirection
¶ Gets which page this page redirects to, or None if this is not a redirection page.
-
refs
¶ A list of the references tagged in the References section of this page.
-
static
resolve_page
(title, follow_redirections=True)[source]¶ Retrieves the specified page, capable of following redirection pages.
Parameters: - title (str) – The title of the page to construct
- follow_redirections (bool) – Whether or not to follow redirection pages automatically
-
section_tree
¶ A dictionary tree of the sections in this page. Each key is the title of a section, with its value being a tuple of the section’s object and an ordered dictionary containing any subsections.
-
sections
¶ A flat list of all the sections (and subsections etc) contained in this page.
-
templates
¶ A list of the templates on this page.
-
title
¶ The title of this page, as was given to construct the page.
-
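As a sketch of working with the attributes above, the helper below turns a section_tree-shaped dictionary into an indented outline. It assumes only the documented shape (each value is a tuple of the section's object and an ordered dictionary of subsections) and is not part of the wikiparse API.

```python
def outline(section_tree, depth=0):
    """Yield indented section titles from a WikiPage.section_tree-style
    dict, where each value is a (section_object, subsections) tuple
    and subsections is itself a dict of the same shape."""
    for title, (section, subsections) in section_tree.items():
        yield '  ' * depth + title
        yield from outline(subsections, depth + 1)
```

On a real page this would be used as `for line in outline(page.section_tree): print(line)`.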
WikiPage Elements¶
The following types are available directly from the wikiparse.wikipage
module, but are all the possible
types of nodes that can exist inside WikiPage trees.
PageElement¶
-
class
wikiparse.wikipage.
PageElement
(page, cur_section, parent, json_data, make_fake=False)[source]¶ Represents any node in a
WikiPage
tree.-
is_part_of
(target_type)[source]¶ Checks whether or not this object belongs, at any level, in a node of the specified type.
Parameters: target_type (type) – The type of node to check against. Returns: True if any context in which this object lives is of the specified type. Return type: bool
-
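The containment check presumably walks up the chain of enclosing nodes; a minimal sketch of that logic follows, assuming each element keeps a reference to its parent (the constructor signature above does take a parent argument). This is an illustration of the documented behavior, not the actual implementation.

```python
def is_part_of(element, target_type):
    """Return True if any enclosing node of `element` is an instance
    of target_type. Assumes each element stores its parent, with None
    at the root of the tree."""
    node = element.parent
    while node is not None:
        if isinstance(node, target_type):
            return True
        node = node.parent
    return False
```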
Context¶
-
class
wikiparse.wikipage.
Context
(page, cur_section, parent, json_data, make_fake=False)[source]¶ Bases:
wikiparse.wikipage.PageElement
An iterable and indexable node that contains other nodes.
-
content
¶ The list of elements contained in this context. Note that Context itself is iterable and indexable, which is the preferred way to access the context’s contents.
-
label
¶ A string that identifies which kind of context this is. Checking the type of this node is preferred, but this string is available if desired. Possible prefixes include:
- nothing: just a context to group other elements
- link_: a link, either internal or external
- heading_: a header, usually to a section
- image_: a context that contains information about an image
- section_: a context that is a section division of its own
- template_: a template context, which usually contains template arguments
-
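Since a Context is iterable, filtering its children is a one-liner. The sketch below selects direct children whose label marks them as links, assuming the link_ prefix form listed above; as the documentation notes, checking types (e.g. isinstance(child, InternalLink)) is the preferred approach, and this label-based version is only for when labels are all you have.

```python
def links_in(context):
    """Collect the direct children of an iterable context whose label
    starts with the link_ prefix described above (an assumption about
    the exact prefix string, used here for illustration)."""
    return [child for child in context
            if getattr(child, 'label', '').startswith('link_')]
```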
Text¶
-
class
wikiparse.wikipage.
Text
(page, cur_section, parent, json_data)[source]¶ Bases:
wikiparse.wikipage.PageElement
An element representing displayed text, possibly with formatting properties.
-
properties
¶ The list of properties of this text, contained as named tuples of the form (id, 'name'); data can be pulled out of them using properties[n].id and properties[n].name. The following properties are possible:
Basic formatting
- bold - Bolded text.
- italics - Italicized text.
- link - The title of a link.
Definition lists
- defList - A definition list.
- term - The term being defined in a definition list.
- def - The definition of a term in a definition list.
Enumerations
- enum - An enumeration list.
- enumItem - An item in an enumeration list.
Itemized lists
- items - An itemized list.
- itemsItem - An item in an itemized list.
Tables
- table - Text within a table.
- tableCaption - The caption of a table.
- tableCell - The body of a table cell.
- tableHeader - The header of a table.
- tableRow - A row of a table.
XML (uncommon and poorly parsed; these should rarely appear)
- xml - An XML element.
- xmlClose - The close of an XML element.
- xmlEmpty - An XML element with no content.
- xmlOpen - The open of an XML element.
- xmlEntRef - A tag surrounding external references. These shouldn't survive parsing; please report them (with the wikitext) if you encounter one.
- xmlCharRef - A tag surrounding character references. These shouldn't survive parsing; please report them (with the wikitext) if you encounter one.
Templates
- tempParameter - A parameter to a template.
-
text
¶ The raw text in this object that gets displayed when printing the page.
-
Link¶
-
class
wikiparse.wikipage.
Link
(page, cur_section, parent, json_data)[source]¶ Bases:
wikiparse.wikipage.Context
A hyperlink of some sort. Links are never created directly, but provide an identical interface for both
InternalLink
and ExternalLink
.-
default_text
¶ What the display text would be for this link if no other text is explicitly defined.
-
target
¶ What the link points to. For an
InternalLink
, this is another Wikipedia page. For an ExternalLink, this is a URL.
-
InternalLink¶
-
class
wikiparse.wikipage.
InternalLink
(page, cur_section, parent, json_data)[source]¶ Bases:
wikiparse.wikipage.Link
A
Link
to another Wikipedia page.
ExternalLink¶
-
class
wikiparse.wikipage.
ExternalLink
(page, cur_section, parent, json_data)[source]¶ Bases:
wikiparse.wikipage.Link
A
Link
to a page outside of Wikipedia.
Heading¶
-
class
wikiparse.wikipage.
Heading
(page, cur_section, parent, json_data)[source]¶ Bases:
wikiparse.wikipage.Context
A heading or label in the text.
-
level
¶ The level of this heading relative to other headings
-
Section¶
Image¶
Template¶
-
class
wikiparse.wikipage.
Template
(page, cur_section, parent, json_data)[source]¶ Bases:
wikiparse.wikipage.Context
A Wikipedia template, often used to define common constructs such as latitude-longitude or quick-info sidebars.
-
title
¶ The title of this template
-
TemplateArg¶
-
class
wikiparse.wikipage.
TemplateArg
(page, cur_section, parent, json_data)[source]¶ Bases:
wikiparse.wikipage.Context
A name-value pair to be interpreted by a template
-
name
¶ The name of the argument in the template which this argument addresses
-
value
¶ The value passed to the template
-
Redirection¶
-
class
wikiparse.wikipage.
Redirection
(page, cur_section, parent, json_data)[source]¶ Bases:
wikiparse.wikipage.Text
An element indicating that this Wikipedia page should redirect to another. See
WikiPage.resolve_page()
for automatically following redirections.-
target
¶ The page to which this page redirects
-
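WikiPage.resolve_page() presumably loops on the redirection attribute until it reaches a content page. The sketch below shows that loop with a load callback standing in for page construction; it is an illustration of the documented behavior, not the actual implementation.

```python
def follow_redirects(load, title, limit=10):
    """Follow redirection pages until reaching a non-redirecting page.
    `load` stands in for page construction (e.g. WikiPage) and must
    return an object exposing `redirection` (the target title, or None
    as documented above). The limit guards against redirect cycles."""
    page = load(title)
    for _ in range(limit):
        if page.redirection is None:
            return page
        page = load(page.redirection)
    raise RuntimeError('too many redirects from %r' % title)
```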
RichText¶
-
class
wikiparse.wikipage.
RichText
(element)[source]¶ A flattened representation of the text belonging to a node in a
WikiPage
tree. Generate usingmy_page_element.get_text()
on anyPageElement
in your page. Converting this object to a string (usingstr()
) returns a raw text form, or you can index this object directly using the exact same indexing as you would on the raw string. Each value returned from indexing this object returns a tuple of the requested character in the string paired with the object from which that character’s text came.-
groups
¶ The list of strings and their associated objects from which this object is indexed.
-
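The indexing behavior can be pictured with a toy model built only from the documented groups attribute (a list of strings paired with their source objects). This is a sketch for intuition, not the real RichText implementation.

```python
class RichTextSketch:
    """Toy model of RichText: flat string indexing that also reports
    which source object each character came from."""
    def __init__(self, groups):
        self.groups = groups  # list of (string, source_object) pairs
    def __str__(self):
        # The raw text form: concatenate every group's string.
        return ''.join(text for text, _ in self.groups)
    def __getitem__(self, index):
        # Map a flat character index back to (character, source object).
        for text, source in self.groups:
            if index < len(text):
                return (text[index], source)
            index -= len(text)
        raise IndexError(index)
```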
Configuration¶
The installation directory for Wikiparse includes a configuration JSON file, config.json
. This file specifies
certain behaviors and default values for various tools in the Wikiparse toolset. Unless needed, it is recommended to
leave these values at their defaults. If any changes need to be made, make them before using (or better yet, before
installing) Wikiparse, as many settings can break the cache system if made after unpacking or live-fetching files.
The file is structured as a single object or dictionary with the following keys:
- try_pulls: Whether or not to live-fetch wikitext when the file isn't already cached.
- cache_pulls: Whether or not to cache files when they get generated.
- cache_dir: The directory in which the cache should live.
- page_index: The file in which to keep the page index. This file doesn't get used for much, but is maintained in case later implementations can make use of it. The index currently only holds details about pages that get unpacked by wikiparse.wikisplitter.
- dir_nesting: The maximum depth of the cache directory tree. Each subdirectory is chosen based on an alphanumeric character from a page's name; excessive nesting only creates more directories than is necessary for efficiency.
- fetch_url: The URL (as a Python format string) from which wikitext pages can be obtained. To use this library with a Wikimedia-backed site besides Wikipedia, change this setting.
- disallowed_file_names: A dictionary mapping filenames that aren't allowed for one reason or another (such as being reserved by the OS or filesystem) to the filenames that should be used instead.
- verbose_filemanager: Whether or not the wikiparse.filemanager should report what it's doing. Use only for debugging.
wikidownloader¶
A runnable script that guides the user through finding the download links for Wikipedia archive files. Execute it from a terminal and follow the directions to find the best links for downloading. Note that this script does not actually download any files itself, since the download can take long enough that using a proper download manager is recommended. For use with wikiparse, it is recommended to download English -> latest -> pages-articles -> single -> xml.bz2
Each menu in the process has the following features:
- Help documentation to instruct you in that particular menu's use; check it if you're not sure what the menu options mean.
- Back, to return to previous menus.
- Case-insensitive command completion: as long as your partial input unambiguously refers to a single command, it will be run.
If using wikiparse as an offline Wikipedia module, this is your first step to obtain the wikitext for all Wikipedia pages. If you don’t want to download the entire archive, individual pages can be obtained on demand, but this may be much slower for large numbers of pages.
This script has no command-line options.
Below shows a sample run of this script:
Language selection
English
Catalan
Chinese
French
German
Italian
Japanese
Polish
Portugese
Russian
Spanish
Other
BACK
> eng
Date selection
latest
20150602
20150515
20150403
20150304
20150205
20150204
20150112
20141208
20141106
BACK
> latest
Type selection (use 'pages-articles' for wikiparse toolchain)
abstract
all-titles-in-ns
all-titles
category
categorylinks
externallinks
flaggedpages
flaggedrevs
geo_tags
image
imagelinks
interwiki
iwlinks
langlinks
md5sums
page
page_props
page_restrictions
pagelinks
pages-articles-multistream-index
pages-articles-multistream
pages-articles
pages-logging
pages-meta-current
pages-meta-history
protected_titles
redirect
site_stats
stub-articles
stub-meta-current
stub-meta-history
templatelinks
user_groups
HELP
BACK
> pages-articles
File selection
Single file
Multiple files
BACK
> single
Extension selection
xml.bz2
xml.bz2-rss.xml
BACK
> xml.bz2
Download the following links:
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
wikisplitter¶
A runnable script that efficiently processes an entire Wikipedia archive file, extracting the wikitext for every contained page into its own file, which can then be retrieved quickly through the filemanager module. If downloading Wikipedia for offline caching, this is the second step (after downloading) to make the downloaded archive usable by the wikiparse library.
Wikisplitter needs to be given the path to the archive file. If you download anything, it is recommended that you download the single large archive of Wikipedia rather than the multiple split archives. The latter have not been tested, but the former works correctly.
The xml
(x
) flag allows you to unzip the archive yourself into just its xml
file (recommended approach if you didn’t download the single large zip file),
and then this script will skip the unzipping step. By default, the script
assumes that you are giving it the archive file that’s still zipped.
The update
(u
) flag specifies that you are updating the currently
cached files, which allows wikisplitter to overwrite existing files with
newly unpacked pages.
The no_redirects
(r
) flag skips outputting redirection pages. Note
that redirection pages are already handled correctly by wikiparse, so you
likely only want this flag if you're trying to save space, reduce the number
of files output, or are only interested in actual content pages.
The verbose
(v
) flag outputs file names as well as numerical progress
indications while unzipping, and also specifies when other major steps are happening
in the unpacking process.
usage: wikisplitter.py [-h] [-x] [-u] [-r] [-v] filename
Expand wikipedia file into page files
positional arguments:
filename The filepath to the wikipedia dump file
optional arguments:
-h, --help show this help message and exit
-x, --xml Indicates that the specified file is an already-
unzipped xml file (rather than bz2)
-u, --update Forces overwriting of pages that already exist
-r, --no_redirects Ignores redirection pages
-v, --verbose Prints page titles as they get output
filemanager¶
Manages all loading and caching of data for wikiparse. This module is meant primarily for use within wikiparse, but can be used from the outside to provide raw data or aid in debugging.
-
wikiparse.filemanager.
possible_titles
(partial_title=None, max_edit_distance=-1)[source]¶ Retrieves all cached pages starting with the specified title text.
Parameters: partial_title (str) – The beginning of a title. Returns: A generator that provides the possible title names matching the specified title beginning. Return type: Generator of str
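The prefix-matching part of this behavior can be sketched in a few lines. This stand-in is not the wikiparse implementation (and omits the max_edit_distance option); cached_titles stands in for the set of titles actually present in the cache.

```python
def possible_titles_sketch(cached_titles, partial_title):
    """Yield every cached title starting with the given prefix,
    mirroring the documented behavior of possible_titles."""
    for title in cached_titles:
        if title.startswith(partial_title):
            yield title
```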
-
wikiparse.filemanager.
read_json
(title)[source]¶ Reads the JSON for the specified page, generating it with the parser from the wikitext if no cached version is available.
Parameters: title (str) – The name of the Wikipedia page to retrieve JSON for. Returns: The JSON text. Return type: str
-
wikiparse.filemanager.
read_wikitext
(title)[source]¶ Reads the wikitext for the specified page, fetching it directly from Wikipedia if no cached version is available.
Parameters: title (str) – The name of the Wikipedia page to retrieve wikitext for. Returns: The wikitext. Return type: str
-
wikiparse.filemanager.
write_json
(title, content, overwrite=False)[source]¶ Writes a json page to its appropriate file
Parameters: - title (str) – The title of the page that is being written
- content (str) – The json as a string
- overwrite (bool) – Whether or not to overwrite the existing file if the file already exists
-
wikiparse.filemanager.
write_wikitext
(title, content, overwrite=False)[source]¶ Writes a wikitext page to its appropriate file
Parameters: - title (str) – The title of the page that is being written
- content (str) – The wikitext as a string
- overwrite (bool) – Whether or not to overwrite the existing file if the file already exists