tpdf/tcparser.py

import xml.etree.ElementTree as ET
import json
import datetime
import sys
import re

import glob

# Parse Tellico's XML and return a library struct:
# a stripped-down version of the library in a Python-friendly format
# (a dict with the last update timestamp and a list of books)
def getLibrary(path, lastUpdate):
    namespace = {'tellico': 'http://periapsis.org/tellico/' }

    xml = ET.parse(path)
    root = xml.getroot()

    collection = root.find('tellico:collection', namespace)

    library = dict()
    library['lastupdate'] = lastUpdate
    library['books'] = list()

    for i in collection.findall('tellico:entry', namespace):
        newbook = dict()
        newbook['id'] = int(i.attrib['id'])
        # Each loop below overwrites any previous value, which is harmless
        # because these elements occur at most once per entry.
        # There is no one-to-one mapping between XML and JSON, so the conversion
        # cannot be automated without some preliminary assumptions; it is done by hand here.
        # (Possibly useful module: https://github.com/martinblech/xmltodict)
        for j in i.iterfind('tellico:title', namespace):
            newbook['title'] = j.text
        for j in i.iterfind('tellico:publisher', namespace):
            newbook['publisher'] = j.text
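        for j in i.iterfind('tellico:pub_year', namespace):
            # Use the first 4-digit number in the field as the year;
            # leave 'year' unset if the field is empty or has no such number
            match = re.search(r'[0-9]{4}', j.text or '')
            if match:
                newbook['year'] = int(match.group())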
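        for j in i.iterfind('tellico:isbn', namespace):
            # j.text is None for an empty element
            if j.text:
                newbook['isbn'] = j.text.replace('-', '')
        for j in i.iterfind('tellico:pages', namespace):
            # Skip empty or non-numeric page counts instead of raising an error
            if j.text and re.fullmatch(r'[0-9]+', j.text.strip()):
                newbook['pages'] = int(j.text)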
        for j in i.iterfind('tellico:cover', namespace):
            newbook['cover'] = j.text
        newbook['authors'] = list()
        for j in i.iterfind('tellico:authors', namespace):
            for k in j.findall('tellico:author', namespace):
                newbook['authors'].append(k.text)

        library['books'].append(newbook)

    return library
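
The namespace-aware lookups used in getLibrary can be tried in isolation. The following is a minimal sketch, not part of the module: the sample XML is a hypothetical, heavily reduced Tellico-like document, and `findtext` is used here as an alternative to the one-shot `for` loops, assuming each element occurs at most once per entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a tiny document in the Tellico namespace used by getLibrary
SAMPLE = """<tellico xmlns="http://periapsis.org/tellico/">
  <collection>
    <entry id="7">
      <title>Example Book</title>
      <authors><author>Jane Doe</author><author>John Roe</author></authors>
    </entry>
  </collection>
</tellico>"""

NS = {'tellico': 'http://periapsis.org/tellico/'}

root = ET.fromstring(SAMPLE)
entry = root.find('tellico:collection/tellico:entry', NS)

# findtext returns the text of the first match, or the default if the element is absent
title = entry.findtext('tellico:title', default='', namespaces=NS)
publisher = entry.findtext('tellico:publisher', default=None, namespaces=NS)
authors = [a.text for a in entry.iterfind('tellico:authors/tellico:author', NS)]

print(entry.attrib['id'], title, publisher, authors)
```

Because every tag lives in the default namespace of the document, each path step needs the `tellico:` prefix; an unprefixed `title` would find nothing.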

# Filter results using the filter functions below and order them by title
####################################################################
def filter(library, title='', author=''):
    resultBookList = list()

    for i in library['books']:
        if filterTitle(i, title) and filterAuthor(i, author):
            resultBookList.append(i)

    # `or ''` also covers a title stored as None, which cannot be compared with str
    sortedResultBookList = sorted(resultBookList, key=lambda k: k.get('title') or '')

    result = dict()
    result['lastupdate'] = library['lastupdate']
    result['books'] = sortedResultBookList

    return result

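# Filter by title (case-insensitive substring match)
def filterTitle(book, filter):
    # A book may have no title, or an empty one; treat both as ''
    return filter.lower() in (book.get('title') or '').lower()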

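# Filter by author (case-insensitive substring match on any author)
def filterAuthor(book, filter):
    # An empty filter matches every book, including books without authors
    if not filter:
        return True
    for i in book['authors']:
        # Author text is None for an empty <author/> element
        if filter.lower() in (i or '').lower():
            return True
    return False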
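
To illustrate the filter-then-sort step on the library struct, here is a self-contained sketch. The function name `filter_books` and the sample data are hypothetical; the sketch also assumes that an empty author filter should match every book, including books with no authors.

```python
def filter_books(library, title='', author=''):
    """Hypothetical variant of filter(): case-insensitive substring match, sorted by title."""
    title, author = title.lower(), author.lower()

    def matches(book):
        title_ok = title in (book.get('title') or '').lower()
        authors = book.get('authors') or []
        # Assumption: an empty author filter matches any book
        author_ok = not author or any(author in (a or '').lower() for a in authors)
        return title_ok and author_ok

    books = sorted((b for b in library['books'] if matches(b)),
                   key=lambda b: b.get('title') or '')
    # Carry the timestamp over so the result has the same shape as the input
    return {'lastupdate': library['lastupdate'], 'books': books}


# Sample data for illustration only
library = {
    'lastupdate': '2018-10-28',
    'books': [
        {'id': 2, 'title': 'Zen Gardens', 'authors': ['A. Tanaka']},
        {'id': 1, 'title': 'Garden Birds', 'authors': ['B. Smith', 'C. Tanaka']},
        {'id': 3, 'title': 'Cooking', 'authors': []},
    ],
}

result = filter_books(library, title='garden', author='tanaka')
print([b['id'] for b in result['books']])
```

Returning a dict with the same `lastupdate` and `books` keys means the filtered result can be passed to the same output code as the full library.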

Commit history (from the blame view):
2018-10-20 16:46:05 +00:00  First commit
2018-10-21 09:30:59 +00:00  Removed JSON<->XML issue. Cannot be done automagically without some assumptions, so it is worth of doing the conversion by hand suited to our needs.
2018-10-21 10:26:28 +00:00  Return booklist ordered by title
2018-10-21 13:00:19 +00:00  Added images of covers
2018-10-28 10:03:26 +00:00  Database extraction only when needed. Keeps a timestamp of last database update, and extracts it only if changed from last extraction (more efficient). Modified library struct, now contains not only a book list but also some informations about the library (specific last update timestamp). Last database update is shown in html and json output. Added .htaccess for Apache webserver to prevent tellico files leaks from output directory.
2018-11-03 11:52:36 +00:00  Debian Apache adaptation - Proper UTF-8 handling - .htaccess CGI and environment variables
2018-11-03 15:27:40 +00:00  Global customized output directory
2018-12-23 11:00:14 +00:00  Fixed bug #1. Handling of unsafe publication year fields. If unsafe, then uses the first 4-digit number in the string as the year.