Documentation for Wikiparse

Wikiparse is a tree-based approach to representing Wikipedia pages, using tightly connected nodes to allow for flexible connections between both the visible and the invisible information available on a page (such as a link and the page it points to). Data is cached as raw wikitext and as JSON, both for speed and for flexibility, allowing other libraries to use the produced JSON instead of the front-end Python module if desired. The tree-based layout of pages and the caching of parsed data make Wikiparse both much faster and much more capable than similar libraries. This opens the door to using Wikipedia in new ways, furthering research into Natural Language Processing (NLP) with Wikipedia as a broad, open-source, and up-to-date corpus.

wikipage

The wikipage module exposes the WikiPage class, which is the primary tool for retrieving and analyzing Wikipedia pages. The constructor takes a page title and retrieves that page as a WikiPage object:

>>> page = wikipage.WikiPage('Python (programming language)')

The resulting object contains all the textual content of the specified page. Note that to follow redirections, it is recommended that you use WikiPage.resolve_page() instead.

WikiPages are structured internally as trees. Each element in the tree inherits from PageElement. The most common type of PageElement is a Context, essentially a loose collection of other page elements. Most element types derive from Context, each adding some kind of peripheral information alongside the content. For example, an InternalLink is a Context whose content is the textual element of the link, but with an added attribute target that specifies which Wikipedia page the link points to. This separation of the text (e.g., “legislation”) from the metadata attached to that text (e.g., a link to “List of United States federal legislation”) allows for a very clean and simple presentation of the plaintext of a page without any loss of information.
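The tree idea described above can be sketched in a few lines of plain Python. These are not the library's actual classes, just an illustration of how separating visible text from attached metadata works; class and attribute names mirror the docs (Context, InternalLink, target), but the implementations are hypothetical.

```python
class PageElement:
    def get_text(self):
        raise NotImplementedError


class Text(PageElement):
    def __init__(self, text):
        self.text = text

    def get_text(self):
        return self.text


class Context(PageElement):
    def __init__(self, children):
        self.content = list(children)

    def get_text(self):
        # Concatenate the text of all children; a link contributes only
        # its visible text, so the plaintext stays clean.
        return ''.join(child.get_text() for child in self.content)


class InternalLink(Context):
    def __init__(self, children, target):
        super().__init__(children)
        self.target = target  # the page this link points to


page = Context([
    Text('Congress passed new '),
    InternalLink([Text('legislation')],
                 target='List of United States federal legislation'),
    Text(' last year.'),
])
print(page.get_text())
# -> Congress passed new legislation last year.
```

Note how the link's target never appears in the plaintext, but remains reachable by walking the tree.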

WikiPage

class wikiparse.wikipage.WikiPage(title, follow_redirections=True)[source]

Loads the data for and constructs a page object representing a page from Wikipedia. This process automatically obtains the wikitext and JSON cached representations of the page.

Parameters:
  • title (str) – The name of the page to obtain. Titles cannot be corrected before retrieval of the page, so consider using wikiparse.filemanager.possible_titles() if you want to make sure that you are using a cached page.
  • follow_redirections (bool) – Whether or not to follow redirection pages automatically
all_elements

A flat dictionary of every element contained in this page, indexed by ID.

content

The entire content of the main body of this page.

A list of all the external links contained in this page.

A list of all the internal links contained in this page.

intro

A context containing the introductory section of the page.

redirection

Gets which page this page redirects to, or None if this is not a redirection page.

refs

A list of the references tagged in the References section of this page.

static resolve_page(title, follow_redirections=True)[source]

Retrieves the specified page, capable of following redirection pages.

Parameters:
  • title (str) – The title of the page to construct
  • follow_redirections (bool) – Whether or not to follow redirection pages automatically
section_tree

A dictionary tree of the sections in this page. Each key is the title of a section, with its value being a tuple of the section’s object and an ordered dictionary containing any subsections.
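The nested shape described above can be illustrated with plain Python objects. The stand-in values below are hypothetical placeholders for real Section instances; only the dictionary-of-tuples structure comes from the docs.

```python
from collections import OrderedDict

# Stand-ins for the real Section objects.
history = object()
origins = object()

# Each key is a section title; each value is (section_object, subsections).
section_tree = OrderedDict([
    ('History', (history, OrderedDict([
        ('Origins', (origins, OrderedDict())),
    ]))),
])


def walk(tree, depth=0):
    """Walk the tree depth-first, printing section titles with indentation."""
    for title, (section, subsections) in tree.items():
        print('  ' * depth + title)
        walk(subsections, depth + 1)


walk(section_tree)
# prints:
# History
#   Origins
```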

sections

A flat list of all the sections (and subsections etc) contained in this page.

templates

A list of the templates on this page.

title

The title of this page, as was given to construct the page.

WikiPage Elements

The following types are all available directly from the wikiparse.wikipage module; together they are all the possible types of nodes that can exist inside WikiPage trees.

PageElement

class wikiparse.wikipage.PageElement(page, cur_section, parent, json_data, make_fake=False)[source]

Represents any node in a WikiPage tree.

get_text()[source]

Gets a RichText object representing this node and all its children as text.

is_part_of(target_type)[source]

Checks whether or not this object belongs, at any level, in a node of the specified type.

Parameters:target_type (type) – The type of node to check against.
Returns:True if any context in which this object lives is of the specified type.
Return type:bool
page

The WikiPage to which this node belongs.

parent

The immediate Context that contains this node.

part_of()[source]

Returns a set including this object’s type and all the types of the contexts in which this object exists.

section

The most immediate Section object in which this node lives.
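One plausible way to picture part_of() and is_part_of() is a walk up the parent chain. This is a hypothetical sketch of the mechanism, not the library's implementation; the Node, Section, and Link classes below are illustrative stand-ins.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent

    def part_of(self):
        # Collect this node's type and the types of all enclosing contexts.
        types, node = set(), self
        while node is not None:
            types.add(type(node))
            node = node.parent
        return types

    def is_part_of(self, target_type):
        return target_type in self.part_of()


class Section(Node):
    pass


class Link(Node):
    pass


section = Section()
link = Link(parent=section)
text = Node(parent=link)
print(text.is_part_of(Section))  # -> True
print(text.is_part_of(Link))     # -> True
```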

Context

class wikiparse.wikipage.Context(page, cur_section, parent, json_data, make_fake=False)[source]

Bases: wikiparse.wikipage.PageElement

An iterable and indexable node that contains other nodes.

content

The list of elements contained in this context. Note that Context itself is iterable and indexable, which is the preferred way to access the context’s contents.

label

A string that identifies which kind of context this is. Checking the type of this node is preferred, but this string is available if desired. Possible prefixes include:

  • nothing: Just a context to group other elements
  • __link_: A link, either internal or external
  • __heading_: A header, usually to a section
  • __image_: A context that contains information about an image
  • __section_: A context that is a section division of its own
  • __template_: A template context, which usually contains template arguments
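The prefixes above can be mapped to a readable kind with a small helper. The prefix strings come from the list in the docs; the helper function itself is hypothetical (and, as noted, checking the node's type is preferred over inspecting label).

```python
# Prefix strings as documented for Context.label.
LABEL_PREFIXES = {
    '__link_': 'link',
    '__heading_': 'heading',
    '__image_': 'image',
    '__section_': 'section',
    '__template_': 'template',
}


def context_kind(label):
    """Classify a Context label string by its documented prefix."""
    for prefix, kind in LABEL_PREFIXES.items():
        if label.startswith(prefix):
            return kind
    return 'plain context'  # no prefix: just a grouping context


print(context_kind('__link_internal'))  # -> link
print(context_kind(''))                 # -> plain context
```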

Text

class wikiparse.wikipage.Text(page, cur_section, parent, json_data)[source]

Bases: wikiparse.wikipage.PageElement

An element representing displayed text, possibly with formatting properties.

properties

The list of properties of this text, stored as tuples like (id, 'name'). These are named tuples, so values can also be accessed as properties[n].id and properties[n].name. The following properties are possible:

  • Basic formatting

    • bold - Bolded text.
    • italics - Italicized text.
    • link - The title of a link.
  • Definition lists

    • defList - A definition list.
    • term - The term being defined in a definition list.
    • def - The definition of a term in a definition list.
  • Enumerations

    • enum - An enumeration list.
    • enumItem - An item in an enumeration list.
  • Itemized lists

    • items - An itemized list.
    • itemsItem - An item in an itemized list.
  • Tables

    • table - Text within a table.
    • tableCaption - The caption of a table.
    • tableCell - The body of a table cell.
    • tableHeader - The header of a table.
    • tableRow - A row of a table.
  • XML (uncommon and poorly parsed, these probably shouldn’t be showing up very often)

    • xml - An XML element.
    • xmlClose - The close of an XML element.
    • xmlEmpty - An XML element with no content.
    • xmlOpen - The open of an XML element.
    • xmlEntRef - A tag surrounding external references. These shouldn’t survive parsing; please report the page’s wikitext if you encounter one.
    • xmlCharRef - A tag surrounding character references. These shouldn’t survive parsing; please report the page’s wikitext if you encounter one.
  • Templates

    • tempParameter - A parameter to a template.
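Since the docs describe these as named tuples of (id, name), both index- and attribute-style access work on them. The sketch below builds such tuples with collections.namedtuple; the particular property values are hypothetical examples, not taken from a real page.

```python
from collections import namedtuple

Property = namedtuple('Property', ['id', 'name'])

properties = [Property(0, 'bold'), Property(1, 'tableCell')]

# Both access styles the docs mention work on named tuples:
assert properties[0].name == 'bold'
assert properties[0][1] == 'bold'

# For example, collect a Text node's property names to decide formatting:
names = {p.name for p in properties}
print('bold' in names)  # -> True
```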
text

The raw text in this object that gets displayed when printing the page.

Heading

class wikiparse.wikipage.Heading(page, cur_section, parent, json_data)[source]

Bases: wikiparse.wikipage.Context

A heading or label in the text.

level

The level of this heading relative to other headings

Section

class wikiparse.wikipage.Section(page, cur_section, parent, json_data, make_fake=False)[source]

Bases: wikiparse.wikipage.Context

A section or subsection of the page

body

The content of this section

level

The level of this section relative to other sections

title

The name of this section

Image

class wikiparse.wikipage.Image(page, cur_section, parent, json_data)[source]

Bases: wikiparse.wikipage.Context

A region based on an image

page

The page for this image

target

The page that this image links to

title

The title of this image

url

The url for this image

Template

class wikiparse.wikipage.Template(page, cur_section, parent, json_data)[source]

Bases: wikiparse.wikipage.Context

A Wikipedia template, often used to define common constructs such as latitude-longitude or quick-info sidebars.

title

The title of this template

TemplateArg

class wikiparse.wikipage.TemplateArg(page, cur_section, parent, json_data)[source]

Bases: wikiparse.wikipage.Context

A name-value pair to be interpreted by a template

name

The name of the argument in the template which this argument addresses

value

The value passed to the template

Redirection

class wikiparse.wikipage.Redirection(page, cur_section, parent, json_data)[source]

Bases: wikiparse.wikipage.Text

An element indicating that this Wikipedia page should redirect to another. See WikiPage.resolve_page() for automatically following redirections.

target

The page to which this page redirects

RichText

class wikiparse.wikipage.RichText(element)[source]

A flattened representation of the text belonging to a node in a WikiPage tree. Generate one by calling my_page_element.get_text() on any PageElement in your page. Converting this object to a string (using str()) returns the raw text form, or you can index the object directly using exactly the same indexing as you would on the raw string. Indexing returns a tuple of the requested character paired with the object from which that character’s text came.

groups

The list of strings and their associated objects from which this object is indexed.
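The indexing contract above can be sketched with a minimal class: a flattened string in which every character remembers the object it came from. This is not the real RichText class, just an illustration of the behavior the docs describe.

```python
class RichTextSketch:
    def __init__(self, groups):
        # groups: list of (string, source_object) pairs, as in RichText.groups
        self.groups = groups

    def __str__(self):
        return ''.join(text for text, _ in self.groups)

    def __getitem__(self, index):
        # Map a flat character index back to its source object.
        flat = [(ch, obj) for text, obj in self.groups for ch in text]
        return flat[index]


link = 'some-link-object'
rich = RichTextSketch([('see ', None), ('here', link)])
print(str(rich))  # -> see here
print(rich[4])    # -> ('h', 'some-link-object')
```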

Configuration

The installation directory for Wikiparse includes a configuration JSON file, config.json. This file specifies certain behaviors and default values for various tools in the Wikiparse toolset. It is recommended to leave these values at their defaults unless a change is truly needed. If any changes must be made, make them before using (or better yet, before installing) Wikiparse, as changing many of these settings after unpacking or live-fetching files can break the cache system.

The file is structured as a single object or dictionary with the following keys:

  • try_pulls: Whether or not to live-fetch wikitext when the file isn’t already cached.
  • cache_pulls: Whether or not to cache files when they get generated.
  • cache_dir: The directory in which the cache should live.
  • page_index: The file in which to keep the page index. Note that this file doesn’t get used for much, but is maintained in case later implementations can make use of it. This index file currently only holds details about pages that get unpacked by wikiparse.wikisplitter.
  • dir_nesting: The maximum depth to which cache directories are nested. Each subdirectory is chosen based on an alphanumeric character from a page’s name; excessive nesting creates more directories than necessary without improving efficiency.
  • fetch_url: The URL (as a Python formatting string) from which wikitext pages can be obtained. To use this library on a Wikimedia-backed site besides Wikipedia, change this setting.
  • disallowed_file_names: A dictionary of filenames that aren’t allowed for one reason or another (such as being reserved by the OS or filesystem), and what such filenames should become instead.
  • verbose_filemanager: Whether or not the wikiparse.filemanager should report what it’s doing. Use only for debugging.
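To make dir_nesting concrete, here is a hedged sketch of how a title might be mapped to a nested cache path: one alphanumeric character of the title per directory level, up to the configured depth. The exact scheme wikiparse uses is not documented here, so treat the function below (and its lowercasing and file extension) as assumptions for illustration only.

```python
import os


def cache_path(title, cache_dir='cache', dir_nesting=2):
    """Hypothetical mapping from a page title to its nested cache file path."""
    # Take the first `dir_nesting` alphanumeric characters as directory names.
    chars = [c.lower() for c in title if c.isalnum()][:dir_nesting]
    return os.path.join(cache_dir, *chars, title + '.wikitext')


print(cache_path('Python (programming language)'))
# -> cache/p/y/Python (programming language).wikitext  (on POSIX systems)
```

The point of the nesting is to keep any single directory from holding millions of files; each extra level multiplies the number of directories, so a small depth usually suffices.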

wikidownloader

A runnable script that guides a user through finding the download links for Wikipedia archive files. Run it from a terminal and follow the directions to find the best links for downloading. Note that this script does not actually download any files itself, since downloads can take long enough that using a proper download manager is recommended. For use with wikiparse, it is recommended to download English -> latest -> pages-articles -> single -> xml.bz2

Each menu in the process has the following features:

  • Help documentation to instruct you in that particular menu’s use. Check here if you’re not sure what the menu options mean.
  • Back, to return to previous menus.
  • Case-insensitive command completion: as long as your partial input unambiguously refers to a single command, it will be run.
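The completion behavior can be sketched in a few lines: a partial input runs a command only when it matches exactly one option, case-insensitively. This is a hypothetical helper illustrating the rule, not the script's actual code.

```python
def complete(partial, commands):
    """Return the single command matching `partial`, or None if ambiguous/unknown."""
    matches = [c for c in commands if c.lower().startswith(partial.lower())]
    return matches[0] if len(matches) == 1 else None


menu = ['English', 'Catalan', 'Chinese', 'BACK']
print(complete('eng', menu))  # -> English
print(complete('c', menu))    # -> None (ambiguous: Catalan, Chinese)
```

This matches the sample run below, where typing `eng` selects English.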

If using wikiparse as an offline Wikipedia module, this is your first step to obtain the wikitext for all Wikipedia pages. If you don’t want to download the entire archive, individual pages can be obtained on demand, but this may be much slower for large numbers of pages.

This script has no command-line options.

Below shows a sample run of this script:

Language selection
  English
  Catalan
  Chinese
  French
  German
  Italian
  Japanese
  Polish
  Portuguese
  Russian
  Spanish
  Other
  BACK
> eng
Date selection
  latest
  20150602
  20150515
  20150403
  20150304
  20150205
  20150204
  20150112
  20141208
  20141106
  BACK
> latest
Type selection (use 'pages-articles' for wikiparse toolchain)
  abstract
  all-titles-in-ns
  all-titles
  category
  categorylinks
  externallinks
  flaggedpages
  flaggedrevs
  geo_tags
  image
  imagelinks
  interwiki
  iwlinks
  langlinks
  md5sums
  page
  page_props
  page_restrictions
  pagelinks
  pages-articles-multistream-index
  pages-articles-multistream
  pages-articles
  pages-logging
  pages-meta-current
  pages-meta-history
  protected_titles
  redirect
  site_stats
  stub-articles
  stub-meta-current
  stub-meta-history
  templatelinks
  user_groups
  HELP
  BACK
> pages-articles
File selection
  Single file
  Multiple files
  BACK
> single
Extension selection
  xml.bz2
  xml.bz2-rss.xml
  BACK
> xml.bz2
Download the following links:
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

wikisplitter

A runnable script that can efficiently process an entire Wikipedia archive file and extract the wikitext for every contained page into its own file, which can then be retrieved quickly through the filemanager module. If downloading Wikipedia for offline caching, this is the second step (after downloading) toward making the archive usable by the wikiparse library.

Wikisplitter needs to be given the path to the archive file. It is recommended that, if you download anything, you download the single large archive of Wikipedia rather than the archive broken into multiple files. The latter has not been tested; the former works correctly.

The xml (x) flag indicates that you have already unzipped the archive into its xml file yourself (the recommended approach if you didn’t download the single large archive), in which case this script will skip the unzipping step. By default, the script assumes that you are giving it the still-compressed archive file.

The update (u) flag specifies that you are updating the currently cached files, which allows wikisplitter to overwrite existing files with newly unpacked pages.

The no_redirects (r) flag skips outputting redirection pages. Note that wikiparse already handles redirection pages correctly, so unless you are trying to save space, reduce the number of files output, or are only interested in actual content pages, you probably do not want to use this flag.

The verbose (v) flag outputs file names as well as numerical progress indications while unzipping, and also specifies when other major steps are happening in the unpacking process.

usage: wikisplitter.py [-h] [-x] [-u] [-r] [-v] filename

Expand wikipedia file into page files

positional arguments:
  filename              The filepath to the wikipedia dump file

optional arguments:
  -h, --help            show this help message and exit
  -x, --xml             Indicates that the specified file is an already-
                        unzipped xml file (rather than bz2)
  -u, --update          Forces overwriting of pages that already exist
  -r, --no_redirects    Ignores redirection pages
  -v, --verbose         Prints page titles as they get output
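The kind of streaming extraction wikisplitter performs can be sketched with the standard library: iterate a MediaWiki XML dump without loading it all into memory, pulling out each page's title and wikitext. The <page>, <title>, <revision>, and <text> tag names follow the MediaWiki export format; everything else here is an illustration (real dumps also carry an XML namespace, omitted for clarity), not wikisplitter's actual code.

```python
import io
import xml.etree.ElementTree as ET

# A tiny in-memory stand-in for a (decompressed) dump file.
SAMPLE = b"""<mediawiki>
  <page><title>A</title><revision><text>alpha</text></revision></page>
  <page><title>B</title><revision><text>beta</text></revision></page>
</mediawiki>"""


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki export stream."""
    for _, elem in ET.iterparse(stream, events=('end',)):
        if elem.tag == 'page':
            title = elem.findtext('title')
            text = elem.findtext('revision/text')
            yield title, text
            elem.clear()  # free the finished page before parsing the next one


for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, text)
# -> A alpha
#    B beta
```

The elem.clear() call is what keeps memory flat across millions of pages; for a .bz2 archive, the stream would be wrapped in bz2.open() first.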

filemanager

Manages all loading and caching of data for wikiparse. This module is meant primarily for use within wikiparse, but can be used from the outside to provide raw data or aid in debugging.

wikiparse.filemanager.possible_titles(partial_title=None, max_edit_distance=-1)[source]

Retrieves the titles of all cached pages that begin with the specified partial title.

Parameters:partial_title (str) – The beginning of a title.
Returns:A generator that provides the possible title names matching the specified title beginning
Return type:Generator of str
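The prefix-matching behavior can be sketched against an in-memory list of titles. The real function reads the cache on disk (and also takes a max_edit_distance parameter not covered here); this hypothetical stand-in only illustrates the generator-of-matching-titles contract.

```python
def possible_titles_sketch(partial_title, titles):
    """Yield cached titles starting with the given partial title."""
    return (t for t in titles if t.startswith(partial_title))


cached = ['Python (programming language)', 'Python (genus)', 'Perl']
print(list(possible_titles_sketch('Python', cached)))
# -> ['Python (programming language)', 'Python (genus)']
```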
wikiparse.filemanager.read_json(title)[source]

Reads the json for the specified page, generating it with the parser from the wikitext if no cached version is available

Parameters:title (str) – The name of the wikipedia page to retrieve json for
Returns:The json text
Return type:str
wikiparse.filemanager.read_wikitext(title)[source]

Reads the wikitext for the specified page, fetching it directly from wikipedia if no cached version is available

Parameters:title (str) – The name of the wikipedia page to retrieve wikitext for
Returns:The wikitext
Return type:str
wikiparse.filemanager.write_json(title, content, overwrite=False)[source]

Writes a json page to its appropriate file

Parameters:
  • title (str) – The title of the page that is being written
  • content (str) – The json as a string
  • overwrite (bool) – Whether or not to overwrite the existing file if the file already exists
wikiparse.filemanager.write_wikitext(title, content, overwrite=False)[source]

Writes a wikitext page to its appropriate file

Parameters:
  • title (str) – The title of the page that is being written
  • content (str) – The wikitext as a string
  • overwrite (bool) – Whether or not to overwrite the existing file if the file already exists
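Taken together, the read_* and write_* functions describe a read-through cache: read from the cache if the file exists, otherwise generate or fetch the content and optionally write it back. The sketch below shows that pattern with hypothetical file naming and a caller-supplied fetch function; it is not the filemanager's actual code.

```python
import os
import tempfile


def read_through(title, cache_dir, fetch, cache_pulls=True):
    """Return cached content for `title`, fetching and caching on a miss."""
    path = os.path.join(cache_dir, title + '.txt')
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return f.read()
    content = fetch(title)  # e.g., live-fetch wikitext on a cache miss
    if cache_pulls:
        with open(path, 'w', encoding='utf-8') as f:
            f.write(content)
    return content


with tempfile.TemporaryDirectory() as d:
    # First call misses the cache and fetches; second call reads the file.
    print(read_through('Example', d, fetch=lambda t: 'wikitext for ' + t))
    print(read_through('Example', d, fetch=lambda t: 'MISS'))
# -> wikitext for Example
#    wikitext for Example
```

The try_pulls and cache_pulls settings in config.json correspond to the two halves of this pattern: whether a miss triggers a fetch at all, and whether the fetched result is written back.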