A ZIM archive.
This object can be used to read, write and/or modify a ZIM file.
NOTE on modifying ZIM archives: to ensure optimal compression, some modifications are not written immediately. As a result, reading a previously modified entry may not immediately reflect the change. You can force-write all outstanding changes by calling pyzim.archive.Zim.flush. This is also done automatically when the ZIM is closed.
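The deferred-write behaviour described in the note can be pictured with a small stand-in. This is only a sketch: `BufferedStore` and its `write`/`read` methods are hypothetical and not part of pyzim; only the ideas of `flush`, `close` and the with-statement come from the documentation above.

```python
class BufferedStore:
    """Hypothetical stand-in for an archive that defers writes (not pyzim code)."""

    def __init__(self):
        self._disk = {}     # state that has actually been written
        self._pending = {}  # modifications waiting for the next flush
        self.closed = False

    def write(self, key, value):
        # The modification is only buffered, nothing reaches "disk" yet.
        self._pending[key] = value

    def read(self, key):
        # In this sketch, reads only see flushed state.
        return self._disk.get(key)

    def flush(self):
        # Force-write all outstanding changes.
        self._disk.update(self._pending)
        self._pending.clear()

    def close(self):
        # Safe to call multiple times; flushes on the first call.
        if not self.closed:
            self.flush()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()


with BufferedStore() as store:
    store.write("entry", b"new content")
    before_flush = store.read("entry")  # the change is not visible yet
    store.flush()
    after_flush = store.read("entry")   # now the change is visible
```

If stale reads matter, calling flush right after a batch of modifications is the simplest way to avoid them, at the cost of less room for the archive to group content.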
| Class Method | open |
Open the Zim archive at the specified path. |
| Method | __enter__ |
Called upon entering a with-statement. Provides self as object for the context. |
| Method | __exit__ |
Called upon exiting a with-statement. Closes self. |
| Method | __init__ |
The default constructor for opening a ZIM file. |
| Method | acquire |
A context manager that locks the file access and provides the wrapped file object for the context. |
| Method | add |
Add a redirect from the source (full) url to the target (full) url. |
| Method | add |
Add an item to this archive. |
| Method | add |
Add a redirect from the source (non-full) url to the target (non-full) url. |
| Method | calculate |
Calculate the checksum of this ZIM file and return it. |
| Method | close |
Close the ZIM file. Can be safely called multiple times. |
| Method | entry |
Check if the entry at the specified full url is an article. |
| Method | flush |
Write all changes to disk. |
| Method | get |
Read the checksum of this ZIM file and return it. |
| Method | get |
Return the cluster at the specified location (offset) in the ZIM file. |
| Method | get |
Return the cluster for the specified index. |
| Method | get |
Return the cluster index for the cluster at the specified offset. |
| Method | get |
Return the entry at the specified (non-full) URL in the "C" namespace. |
| Method | get |
Calculate the size of this object when written to a file. |
| Method | get |
Return the entry at the specified location (offset) in the ZIM file. |
| Method | get |
Return the entry at the specified full URL. |
| Method | get |
Return the entry at the specified (non-full) URL. |
| Method | get |
Return the entry at the specified index in the URL pointer list. |
| Method | get |
Return the entry for the mainpage. |
| Method | get |
Read a metadata entry, returning its value. |
| Method | get |
Return a dict containing all metadata of this ZIM. |
| Method | get |
Read all metadata keys, returning them as a list. |
| Method | get |
Return the mimetype with the specified index. |
| Method | get |
Return the mimetype of the specified entry. |
| Method | get |
Return an object that can be used to search this ZIM. |
| Method | has |
Return True if this ZIM file contains an entry for the specified full URL. |
| Method | install |
Install a processor on this archive. |
| Method | iter |
Iterate over all article entries in this ZIM. |
| Method | iter |
Iterate over all clusters in this ZIM. |
| Method | iter |
Iterate over all entries in this ZIM. |
| Method | iter |
Iterate over all entries in this ZIM, ordered by full URL. |
| Method | iter |
Iterate over all mimetypes in this archive. |
| Method | new |
Add a new cluster to this archive. |
| Method | remove |
Remove the cluster with the specified index. |
| Method | remove |
Remove the entry at the specified url. |
| Method | set |
Set the mainpage url. |
| Method | set |
Set metadata of the ZIM archive. |
| Method | update |
Calculate and write the checksum. |
| Method | write |
Update an existing cluster in this zim. |
| Method | write |
Write an entry to this archive. |
| Instance Variable | cluster |
internal cache for clusters, mapping the full location to each cluster |
| Instance Variable | compression |
compression strategy for assigning new items to clusters |
| Instance Variable | entry |
internal cache for entries, mapping the full location to each entry |
| Instance Variable | filelock |
a lock to ensure file access works with multiple threads. Acquire it whenever any work is done on the file. |
| Instance Variable | header |
header of this ZIM file. |
| Instance Variable | mimetypelist |
the mimetype list |
| Instance Variable | mutable |
Undocumented |
| Instance Variable | policy |
policy to use |
| Instance Variable | spaceallocator |
an object responsible for managing storage space within the ZIM file, may be None if ZIM is read-only |
| Instance Variable | uncompressed |
compression strategy for assigning new items to clusters that are explicitly uncompressed |
| Property | closed |
Return True if this archive has already been closed, False otherwise. |
| Property | counter |
Return the counter used for counting mimetype occurrences. |
| Method | _check |
Check to ensure this ZIM file has not already been closed. |
| Method | _get |
Return the full URL for the entry at the specified location. |
| Method | _get |
Return the namespace+title for the entry at the specified index in the URL pointer list. |
| Method | _get |
Return the title for the entry at the specified index in the URL pointer list. |
| Method | _init |
Initializes internal caches according to policy. |
| Method | _init |
Initiate as a new, empty archive. |
| Method | _load |
Read the header. |
| Method | _load |
Load the mimetypelist. |
| Method | _load |
Load the URL and title pointer lists. |
| Method | _new |
Return the number of the next new cluster. |
| Method | _on |
Called when a cluster leaves the cache. |
| Method | _on |
Called when an entry leaves the cache. |
| Method | _update |
Update references to URL pointers. |
| Instance Variable | _article |
a pointerlist to article entries ordered by title |
| Instance Variable | _base |
base offset of ZIM archive within the underlying file object |
| Instance Variable | _closed |
a flag indicating whether this archive has already been closed |
| Instance Variable | _cluster |
next cluster number to assign |
| Instance Variable | _cluster |
a pointer list to the individual clusters |
| Instance Variable | _counter |
the counter counting mimetype occurrences |
| Instance Variable | _entry |
a pointerlist to entries ordered by title |
| Instance Variable | _f |
the underlying file object |
| Instance Variable | _mode |
the mode this archive has been opened in |
| Instance Variable | _operation |
Undocumented |
| Instance Variable | _operationbuffer |
buffer for not-yet-completable operations |
| Instance Variable | _processors |
list of processors that have been installed on this zim |
| Instance Variable | _url |
a pointer list to entries ordered by URL |
| Instance Variable | _writable |
a flag indicating whether this archive can be written to. |
Inherited from ModifiableMixIn:
| Method | add |
Add another modifiable object as a child of this one. |
| Method | after |
This method should be called after this object has been read and/or flushed to disk. In other words, it should be called at least once whenever this object matches the state of the object on the disk. |
| Method | dirty |
Setter for ModifiableMixIn.dirty |
| Method | ensure |
If this object is non-mutable, raise an Exception. |
| Method | get |
Return the size of this object on disk as it has been read. |
| Method | get |
Return the size of this object when written to a file, before any modifications have been made since the last read/flush. |
| Method | mark |
Convenience function to mark this object as dirty. |
| Method | remove |
Remove a submodifiable from this object. |
| Instance Variable | dirty |
True if this object or a sub-modifiable has been modified. |
| Instance Variable | _dirty |
a boolean flag that's nonzero if this object has been modified |
| Instance Variable | _old |
the size of this object on disk before any modifications since the last flush/read |
| Instance Variable | _submodifiables |
a list of child objects whose dirty state will affect this object's dirty state. |
Open the Zim archive at the specified path.
In addition to the modes listed in the documentation of pyzim.archive.Zim.__init__, the mode "x" is also supported. It behaves like mode "w", but raises an exception if the file already exists.
| Parameters | |
path:str | path to open |
mode:str | mode of the Zim archive (see pyzim.archive.Zim.__init__ for the supported modes, plus "x") |
offset:int | offset of the ZIM archive within the file. |
policy:pyzim.policy.Policy | policy to use, default to pyzim.policy.DEFAULT_POLICY |
| Returns | |
pyzim.archive.Zim | the Zim archive opened from the file |
| Raises | |
FileExistsError | if mode == "x" and path already exists |
ValueError | on invalid mode |
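The relationship between mode "x" and mode "w" can be sketched as a small validation step. This is not pyzim's implementation; `resolve_mode` and `VALID_MODES` are hypothetical names, and the set of modes is taken from the lists in this documentation.

```python
import os
import tempfile

# Modes named in this documentation: "r", "w", "u"/"a", plus "x" for open().
VALID_MODES = ("r", "w", "u", "a", "x")


def resolve_mode(path, mode):
    """Validate mode and map "x" to "w" once we know path is unused."""
    if mode not in VALID_MODES:
        raise ValueError("invalid mode: {!r}".format(mode))
    if mode == "x":
        if os.path.exists(path):
            raise FileExistsError(path)
        return "w"
    return mode


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "example.zim")
    mode_for_new_file = resolve_mode(path, "x")  # path is free, behaves like "w"
    open(path, "wb").close()                     # now the file exists
    try:
        resolve_mode(path, "x")
        second_attempt = "accepted"
    except FileExistsError:
        second_attempt = "rejected"
```

Note that a separate existence check followed by a later create is racy; Python's built-in `open(path, "xb")` performs the check and the creation in one step and is the safer building block when that matters.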
The default constructor for opening a ZIM file.
Multiple modes are supported:
- "r": read-only
- "w": create a new file for writing, truncating the old file
- "u"/"a": modify the existing file
| Parameters | |
f:file-like object | file-like object to read from (NOTE: must support reading) |
offset:int | offset of the ZIM archive within the file. |
mode:str | in which mode to open the ZIM file (e.g. read) |
policy:pyzim.policy.Policy | policy to use, default to pyzim.policy.DEFAULT_POLICY |
| Raises | |
ValueError | on invalid value for a parameter |
TypeError | on invalid type for value |
A context manager that locks the file access and provides the wrapped file object for the context.
| Raises | |
pyzim.exceptions.ZimFileClosed | when the ZIM file is already closed. |
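A lock-guarded context manager of this kind could look roughly like the following. The sketch uses only the standard library; `LockedFile`, `acquire_file` and `ArchiveClosed` are hypothetical names, with `ArchiveClosed` standing in for pyzim.exceptions.ZimFileClosed.

```python
import contextlib
import io
import threading


class ArchiveClosed(Exception):
    """Hypothetical stand-in for pyzim.exceptions.ZimFileClosed."""


class LockedFile:
    """Wraps a file object and hands it out only while a lock is held."""

    def __init__(self, f):
        self._f = f
        self._closed = False
        # A re-entrant lock lets the same thread nest acquisitions.
        self.filelock = threading.RLock()

    @contextlib.contextmanager
    def acquire_file(self):
        if self._closed:
            raise ArchiveClosed("file already closed")
        with self.filelock:
            yield self._f

    def close(self):
        self._closed = True


wrapper = LockedFile(io.BytesIO(b"ZIM data"))
with wrapper.acquire_file() as f:
    f.seek(0)
    first_bytes = f.read(3)
```

Keeping every seek/read pair inside one acquisition is what makes this useful: another thread cannot move the file position between the two calls.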
Add a redirect from the source (full) url to the target (full) url.
This method uses full urls. You'll likely want to use pyzim.archive.Zim.add_redirect if you want to work with non-full urls in the "C" namespace.
Be warned that a redirect that cannot be resolved will be buffered. This not only increases memory usage, but may also cause an exception to be raised later on if the redirect still cannot be resolved during the next flush.
| Parameters | |
source:str | full url to redirect from |
target:str | full url to redirect to |
title:str or None | title for the redirect, defaulting to the target entry title |
| Raises | |
TypeError | on type error |
ValueError | on invalid value |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
pyzim.exceptions.NonMutable | if this zim file is not mutable |
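The buffering of unresolved redirects can be illustrated with a toy model. Everything here is hypothetical (`RedirectBuffer`, its attributes, and the use of `LookupError`); it only mirrors the behaviour described above, where a redirect whose target is unknown waits until a flush and fails there if it still cannot be resolved.

```python
class RedirectBuffer:
    """Toy model of buffered redirects (not pyzim code)."""

    def __init__(self, known_urls):
        self.known = set(known_urls)  # full urls that already have an entry
        self.redirects = {}           # resolved redirects: source -> target
        self.pending = []             # redirects whose target is still unknown

    def add_redirect(self, source, target):
        if target in self.known:
            self.redirects[source] = target
            self.known.add(source)
        else:
            # Cannot be resolved yet: keep it in memory until flush().
            self.pending.append((source, target))

    def flush(self):
        progress = True
        while self.pending and progress:
            progress = False
            for source, target in list(self.pending):
                if target in self.known:
                    self.pending.remove((source, target))
                    self.redirects[source] = target
                    self.known.add(source)
                    progress = True
        if self.pending:
            raise LookupError("unresolved redirects: {!r}".format(self.pending))


buffer = RedirectBuffer(["Chome"])
buffer.add_redirect("Ca", "Cb")     # buffered, "Cb" does not exist yet
buffer.add_redirect("Cb", "Chome")  # resolved immediately
pending_before_flush = len(buffer.pending)
buffer.flush()                      # "Ca" can now be resolved
```

Adding targets before the redirects that point at them keeps the buffer empty and avoids surprises at flush time.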
Add an item to this archive.
The write may not happen immediately.
| Parameters | |
item:pyzim.item.Item | item to write |
forcebool | if nonzero, add the item to the compression strategy for uncompressed content, regardless of other options |
| Raises | |
TypeError | on type error |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
pyzim.exceptions.NonMutable | if this zim file is not mutable |
Add a redirect from the source (non-full) url to the target (non-full) url.
This method uses non-full urls and operates in the "C" namespace. Use pyzim.archive.Zim.add_full_url_redirect to work with full urls.
| Parameters | |
source:str | non-full url to redirect from |
target:str | non-full url to redirect to |
title:str or None | title for the redirect, defaulting to the target entry title |
| Raises | |
TypeError | on type error |
ValueError | on invalid value |
pyzim.exceptions.EntryNotFound | if the target url does not yet exist |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
pyzim.exceptions.NonMutable | if this zim file is not mutable |
Calculate the checksum of this ZIM file and return it.
NOTE: this reads the entire ZIM file and calculates its checksum. If you want to read the checksum listed in the ZIM file, use pyzim.archive.Zim.get_checksum instead.
| Returns | |
bytes | the calculated (md5) checksum of this ZIM |
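The difference between calculating a checksum and reading the stored one can be sketched with `hashlib`. The sketch assumes, following the openZIM format description, that the stored MD5 digest occupies the last 16 bytes and covers everything before the checksum position; the function names are hypothetical and any base offset is ignored.

```python
import hashlib
import io

CHECKSUM_LENGTH = 16  # size of an MD5 digest in bytes


def calculate_md5(f, checksum_position, chunk_size=64 * 1024):
    """Hash the first checksum_position bytes of f, reading in chunks."""
    f.seek(0)
    md5 = hashlib.md5()
    remaining = checksum_position
    while remaining > 0:
        chunk = f.read(min(chunk_size, remaining))
        if not chunk:
            break  # file is shorter than expected
        md5.update(chunk)
        remaining -= len(chunk)
    return md5.digest()


def read_stored_checksum(f, checksum_position):
    """Read the digest stored in the file without hashing anything."""
    f.seek(checksum_position)
    return f.read(CHECKSUM_LENGTH)


# Build a fake "file": some payload followed by its own MD5 digest.
payload = b"x" * 100000
fake_file = io.BytesIO(payload + hashlib.md5(payload).digest())
calculated = calculate_md5(fake_file, len(payload))
stored = read_stored_checksum(fake_file, len(payload))
```

Comparing the two values is the usual integrity check: they match only if the content has not changed since the checksum was written.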
Check if the entry at the specified full url is an article.
Articles are always in C namespace, thus the full url must start with a C.
This method returns False if the entry does not exist at all.
| Parameters | |
fullstr | full url of entry to check |
| Returns | |
bool | whether the entry is an article or not |
| Raises | |
TypeError | on type error |
ValueError | on value error. |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
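The relation between full and non-full urls can be sketched as follows. This assumes that a full url is the one-character namespace immediately followed by the non-full url, which is consistent with "must start with a C" above but is not spelled out here; both helper names are hypothetical.

```python
def to_full_url(namespace, url):
    """Build a full url, assuming the form <namespace><url>."""
    if len(namespace) != 1:
        raise ValueError("namespace must be a single character")
    return namespace + url


def in_content_namespace(full_url):
    """Necessary (but not sufficient) condition for being an article."""
    return full_url.startswith("C")


article_url = to_full_url("C", "index.html")
metadata_url = to_full_url("M", "Title")
```

Being in the "C" namespace does not by itself make an entry an article; the check above only rules out entries that cannot be one.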
Write all changes to disk.
| Raises | |
pyzim.exceptions.ZimFileClosed | when the ZIM file is already closed. |
pyzim.exceptions.NonMutable | if this ZIM file is set to be non-mutable |
Read the checksum of this ZIM file and return it.
NOTE: this reads the checksum from the ZIM file, it does not calculate the actual checksum of the file. If you want to calculate the checksum of the ZIM, use pyzim.archive.Zim.calculate_checksum instead.
| Returns | |
bytes | the (md5) checksum of this ZIM |
Return the cluster at the specified location (offset) in the ZIM file.
If caching is configured, an instance of a previous cluster may be returned. This cluster may already be modified and/or bound (even if bind=False).
| Parameters | |
location:int | location/offset of the cluster in the ZIM file |
| Returns | |
pyzim.cluster.Cluster | the cluster at the specified location |
Return the cluster index for the cluster at the specified offset.
Note that the offset must exactly match the offset of the cluster. This is not the full offset (the base offset must be subtracted manually).
This method is mostly used as a helper by clusters to determine their own index.
| Returns | |
int | the index of the cluster at the offset in the cluster pointer list |
| Raises | |
KeyError | if the offset does not refer to a cluster. |
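The lookup and the manual base-offset subtraction can be sketched on a plain list. The function name and the list-based pointer list are illustrative assumptions, not pyzim's data structures.

```python
def cluster_index_for_offset(cluster_pointers, offset):
    """Return the position of offset in the pointer list, else raise KeyError."""
    try:
        return cluster_pointers.index(offset)
    except ValueError:
        # The offset does not point at the start of any cluster.
        raise KeyError(offset) from None


# Offsets of three clusters, relative to the start of the archive.
cluster_pointers = [1024, 4096, 9000]

# An archive embedded in a larger file: subtract the base offset first.
base_offset = 512
full_offset = 4608
index = cluster_index_for_offset(cluster_pointers, full_offset - base_offset)
```

A linear search is enough for a sketch; a real pointer list is sorted by construction, so a binary search or a dictionary would be the natural optimisation.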
Return the entry at the specified (non-full) URL in the "C" namespace.
NOTE: "content" refers to an entry in the "C" namespace. This function may still return any type of pyzim.entry.BaseEntry and is NOT restricted to pyzim.entry.ContentEntry.
| Parameters | |
url:str | url of entry to get |
| Returns | |
pyzim.entry.BaseEntry | the entry at the specified url |
| Raises | |
pyzim.exceptions.EntryNotFound | when no entry matches the specified URL |
Calculate the size of this object when written to a file.
NOTE: in this context, size refers to the direct size of the object. If this object contains references to other objects, their sizes will not be included. For example, a pyzim.entry.ContentEntry also links to a blob, but this function will only return the size of the entry itself, excluding the referenced blob.
| Returns | |
int | the size, in bytes |
Return the entry at the specified location (offset) in the ZIM file.
If caching is configured, an instance of a previous entry may be returned. This entry may already be modified and/or bound (even if bind=False).
| Parameters | |
location:int | location/offset of the entry in the ZIM file |
bind:bool | if nonzero (default), bind this entry |
allowbool | if nonzero (default), allow cached entries to be replaced |
| Returns | |
pyzim.entry.BaseEntry | the entry at the specified location |
Return the entry at the specified full URL.
| Parameters | |
fullstr | full URL of entry to get |
| Returns | |
pyzim.entry.BaseEntry | the entry at the specified URL |
| Raises | |
pyzim.exceptions.EntryNotFound | when no entry matches the specified URL |
Return the entry at the specified (non-full) URL.
| Parameters | |
namespace:str of length 1 | namespace of entry to get |
url:str | url of entry to get |
| Returns | |
pyzim.entry.BaseEntry | the entry at the specified url |
| Raises | |
pyzim.exceptions.EntryNotFound | when no entry matches the specified URL |
Return the entry at the specified index in the URL pointer list.
| Parameters | |
i:int | index of entry in URL pointer list |
allowbool | if nonzero (default), allow cached entries to be replaced |
| Returns | |
pyzim.entry.BaseEntry | the entry at the specified location |
| Raises | |
pyzim.exceptions.EntryNotFound | when no entry matching the index was found |
Return the entry for the mainpage.
| Returns | |
pyzim.entry.BaseEntry | the entry for the mainpage |
| Raises | |
pyzim.exceptions.EntryNotFound | when no mainpage exists |
Read a metadata entry, returning its value.
See https://wiki.openzim.org/wiki/Metadata for metadata keys and values.
By default, this method returns unicode. You can set as_unicode=False to prevent this. If the key is not found, return None.
| Parameters | |
key:str | key/URL of metadata |
asbool | whether to decode value or not |
| Returns | |
str or bytes (or None if not found) | the metadata value |
| Raises | |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
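The effect of as_unicode can be shown with a small helper. `decode_metadata` is a hypothetical function, and UTF-8 is an assumption about the encoding of textual metadata values.

```python
def decode_metadata(raw, as_unicode=True, encoding="utf-8"):
    """Return raw decoded to str, raw unchanged, or None if the key was absent."""
    if raw is None:
        return None  # key not found
    if as_unicode:
        return raw.decode(encoding)
    return raw


title_text = decode_metadata(b"Caf\xc3\xa9 guide")
title_bytes = decode_metadata(b"Caf\xc3\xa9 guide", as_unicode=False)
missing = decode_metadata(None)
```

Binary metadata such as illustrations is a typical case for as_unicode=False, since decoding arbitrary bytes as text would fail.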
Return the mimetype with the specified index.
| Parameters | |
i:int | index of mimetype to get |
| Returns | |
str | the mimetype with the specified index |
| Raises | |
IndexError | when the index is invalid |
Return the mimetype of the specified entry.
If the entry is a redirect, this will be pyzim.constants.MIMETYPE_REDIRECT.
| Parameters | |
entry:pyzim.entry.BaseEntry | entry to get mimetype for |
| Returns | |
str | the mimetype of this entry |
Return an object that can be used to search this ZIM.
There are various ways to search a ZIM, for which pyzim tries to provide a unified interface. This method returns any available search. That search may, however, be more limited than other search implementations. It is therefore recommended not to use this method and instead to manually instantiate one of the child classes of pyzim.search.BaseSearch. Use this method only if you don't care about what search you get.
Currently, this method will try to provide you with a xapian fulltext search, falling back to a xapian title search and finally to a simple titlestart based search.
| Returns | |
pyzim.search.BaseSearch | a search object that can be used to search this ZIM |
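The fallback order described above follows a common "first backend that works" pattern, sketched here with placeholder factories. None of these names exist in pyzim; the factories stand in for constructing the respective search classes.

```python
def first_available(factories):
    """Return (name, object) from the first factory that succeeds."""
    failures = []
    for name, factory in factories:
        try:
            return name, factory()
        except Exception as exc:  # this backend is unavailable, try the next
            failures.append("{}: {}".format(name, exc))
    raise LookupError("no search available ({})".format("; ".join(failures)))


def unavailable():
    # Simulates a backend whose dependency or index is missing.
    raise ImportError("xapian bindings not installed")


name, search = first_available([
    ("xapian fulltext", unavailable),
    ("xapian title", unavailable),
    ("title start", lambda: "title-start search object"),
])
```

The caller cannot tell in advance which backend it will get, which is exactly why instantiating a specific search class is preferable when the capabilities matter.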
Install a processor on this archive.
See pyzim.processor for more details.
| Parameters | |
processor:bool | processor to install |
| Raises | |
TypeError | on type error |
Iterate over all article entries in this ZIM.
If start and end are specified, they reference the indexes of the first (inclusive) and last (exclusive) entry to return. In other words, this behavior matches the l[start:end] syntax.
This function does not guarantee any specific order of the entries yielded by this function, however it currently *should* be ordered by title.
| Parameters | |
start:int | index of first entry to return (inclusive) |
end:int | index of last entry to return (exclusive) |
| Yields | |
pyzim.entry.BaseEntry | the entries in the specified range |
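The start/end semantics match Python slicing, which can be checked with `itertools.islice` on any iterable. `iter_range` is a hypothetical helper; unlike list slicing, `islice` does not accept negative indexes.

```python
import itertools


def iter_range(items, start=None, end=None):
    """Yield items[start:end] lazily, without building a list first."""
    return itertools.islice(items, start, end)


titles = ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"]
window = list(iter_range(titles, 1, 3))  # start inclusive, end exclusive
everything = list(iter_range(titles))    # no bounds: all items
```

Here `window` holds the items at indexes 1 and 2, the same result as `titles[1:3]`.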
Iterate over all clusters in this ZIM.
If start and end are specified, they reference the indexes of the first (inclusive) and last (exclusive) clusters to return. In other words, this behavior matches the l[start:end] syntax.
| Parameters | |
start:int | index of first cluster to return (inclusive) |
end:int | index of last cluster to return (exclusive) |
| Yields | |
pyzim.cluster.Cluster | the clusters in the specified range |
| Raises | |
IndexError | on invalid/out of bound indexes |
Iterate over all entries in this ZIM.
If start and end are specified, they reference the indexes of the first (inclusive) and last (exclusive) entry to return. In other words, this behavior matches the l[start:end] syntax.
This function does not guarantee any specific order of the entries yielded by this function, however it currently *should* be ordered by URL.
Previously, this method iterated by title, but this was changed following the removal of the v0 entry title index.
| Parameters | |
start:int | index of first entry to return (inclusive) |
end:int | index of last entry to return (exclusive) |
| Yields | |
pyzim.entry.BaseEntry | the entries in the specified range |
Iterate over all entries in this ZIM, ordered by full URL.
If start and end are specified, they reference the indexes of the first (inclusive) and last (exclusive) entry to return. In other words, this behavior matches the l[start:end] syntax.
| Parameters | |
start:int | index of first entry to return (inclusive) |
end:int | index of last entry to return (exclusive) |
| Yields | |
pyzim.entry.BaseEntry | the entries in the specified range |
Add a new cluster to this archive.
NOTE: the cluster will not be cached until it is written at least once. Consequently, the autoflush function will not work until you've written it at least once.
| Returns | |
pyzim.cluster.ModifiableClusterWrapper | a new cluster |
| Raises | |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
pyzim.exceptions.NonMutable | if this zim file is not mutable |
Remove the entry at the specified url.
You can specify how the associated blob should be treated using the blob parameter:
- "keep": do nothing
- "empty": empty the associated blob (see pyzim.cluster.ModifiableClusterWrapper.empty_blob)
- "remove": delete the blob. Be warned that this will likely cause issues with other indexes.
If the entry has an associated blob, the cluster will be flushed.
Redirects pointing towards this url will also be removed. Buffered operations may interfere with this behavior, so be sure to call flush() beforehand.
| Parameters | |
fullstr | full url of entry to remove |
blob:str | how to treat the associated blob |
| Raises | |
TypeError | on type error |
ValueError | on value error. |
pyzim.exceptions.EntryNotFound | if the target entry does not exist |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
pyzim.exceptions.NonMutable | if this zim file is not mutable |
Set the mainpage url.
An entry for the specified url must already exist.
| Parameters | |
url:str or None | non-full url of the mainpage (the mainpage is always in the "C" namespace). Set to None to disable. |
| Raises | |
TypeError | on type error |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
pyzim.exceptions.NonMutable | if this zim file is not mutable |
Set metadata of the ZIM archive.
| Parameters | |
key:str | key of metadata to set |
value:str or bytes | value of metadata to set |
mimetype:str or bytes | mimetype of the associated blob |
| Raises | |
TypeError | on type error |
ValueError | on invalid value |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
pyzim.exceptions.NonMutable | if this zim file is not mutable |
Calculate and write the checksum.
NOTE: prior to this, pyzim.header.Header.checksum_position should already be set to the new position and the header flushed. This method does not take care of this.
Update an existing cluster in this zim.
The cluster must already be part of this archive. Use Zim.new_cluster for creating new clusters.
| Parameters | |
cluster:ModifiableClusterWrapper | cluster to write |
| cluster | the number/id of the cluster. Providing it speeds up the method. |
| Returns | |
int | the cluster number |
| Raises | |
TypeError | on type error |
ValueError | on invalid values (e.g. negative cluster numbers) |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
pyzim.exceptions.NonMutable | if this zim file is not mutable |
pyzim.exceptions.BindingError | if cluster is not bound to self |
Write an entry to this archive.
| Parameters | |
entry:pyzim.entry.BaseEntry | entry to write |
updatebool | if nonzero, update redirects to this article if necessary |
addbool | if nonzero (default), add the entry to the title pointer lists |
| Raises | |
TypeError | on type error |
pyzim.exceptions.ZimFileClosed | if archive is already closed |
pyzim.exceptions.NonMutable | if this zim file is not mutable |
pyzim.exceptions.BindingError | if entry is not bound to self |
a lock to ensure file access works with multiple threads. Acquire it whenever any work is done on the file.
an object responsible for managing storage space within the ZIM file, may be None if ZIM is read-only
Return the counter used for counting mimetype occurrences.
If no counter is available, return None instead.
Check to ensure this ZIM file has not already been closed.
| Raises | |
pyzim.exceptions.ZimFileClosed | when the ZIM file is already closed. |
Initiate as a new, empty archive.
This instantiates the header, pointer lists, etc.
TODO: find a better name for this method.
Return the number of the next new cluster.
This also increments the internal counter.
| Returns | |
int | the number of the next cluster |
Called when a cluster leaves the cache.
If the archive is writable and autoflush is enabled, write the cluster if it is dirty.
| Parameters | |
clusterint | total offset of cluster |
cluster:pyzim.cluster.Cluster | the cluster leaving the cache |
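A cache that notifies its owner when an item leaves can be sketched with an `OrderedDict`. `CallbackCache`, the dictionary-based "clusters" and the `written` list are all illustrative assumptions; pyzim's caches and policies may be organised differently.

```python
from collections import OrderedDict


class CallbackCache:
    """Least-recently-used cache that calls on_leave(key, value) on eviction."""

    def __init__(self, max_size, on_leave):
        self._items = OrderedDict()
        self._max_size = max_size
        self._on_leave = on_leave

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)  # mark as most recently used
        while len(self._items) > self._max_size:
            old_key, old_value = self._items.popitem(last=False)
            self._on_leave(old_key, old_value)

    def get(self, key):
        value = self._items[key]
        self._items.move_to_end(key)
        return value


written = []  # offsets of clusters that were "written" on eviction


def write_if_dirty(offset, cluster):
    # Mirrors the autoflush idea: only dirty items need writing.
    if cluster["dirty"]:
        written.append(offset)


cache = CallbackCache(max_size=2, on_leave=write_if_dirty)
cache.put(100, {"dirty": True})
cache.put(200, {"dirty": False})
cache.put(300, {"dirty": False})  # evicts the cluster at offset 100
cache.put(400, {"dirty": False})  # evicts the clean cluster at offset 200
```

Writing on eviction means a dirty item is never silently dropped, while clean items leave the cache without any I/O.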
Called when an entry leaves the cache.
If the archive is writable and autoflush is enabled, write the entry if it is dirty.
| Parameters | |
fullint | the full offset of the entry |
entry:pyzim.entry.BaseEntry | the entry leaving the cache |
Update references to URL pointers.
As several pointers point to the position of an entry within the URL pointer list, but said list is sorted, modifying it will likely cause said pointers to point to the wrong entries. This method takes care of updating said references.
| Parameters | |
start:int | lowest URL pointer index that needs updating |
diff:int | integer to update said references by (e.g. 1) |
editbool | if nonzero (default), update the entry title pointer list |
editbool | if nonzero (default), update the article title pointer list |
updatebool | if nonzero (default), update redirects |
skip:list or tuple of str | list or tuple of full urls not to update recursively |
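Why the references need updating can be seen on plain lists: inserting into the sorted URL list moves every later entry by one position, so every stored index at or above the insertion point must be shifted by the same amount. The helper below is a hypothetical simplification that covers only this index arithmetic.

```python
import bisect


def shift_pointers(pointers, start, diff):
    """Shift every index >= start by diff and leave lower indexes untouched."""
    return [p + diff if p >= start else p for p in pointers]


# URL list, sorted by full url.
urls = ["Ca", "Cc", "Cd"]
# Title list: indexes into urls, in title order.
title_pointers = [2, 0, 1]
titles_before = [urls[i] for i in title_pointers]

# Insert a new url while keeping the list sorted.
position = bisect.bisect_left(urls, "Cb")
urls.insert(position, "Cb")

# Everything at or after the insertion point moved up by one.
title_pointers = shift_pointers(title_pointers, start=position, diff=1)
titles_after = [urls[i] for i in title_pointers]
```

After the shift, the title pointers refer to the same urls as before the insertion. Removing an entry is the mirror case with diff=-1.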