condor package

Submodules

condor.config module

condor.dbutil module

Utilities for working with a mongo database. The module keeps a global connection to the database so that a new one is not created on every request, which would add significant overhead. It also provides a tool for using a pymongo collection as a context manager.

condor.dbutil.engine()[source]

Creates an engine to handle our database.

condor.dbutil.find_one(db, model, eid)[source]

Finds exactly one of the given models in the db by eid.

This is useful when deleting or showing an item: if no matching item is found, the program might as well exit.

Parameters:
  • db – db connection instance
  • model – model to look for, should have an eid field
  • eid – part of the eid to look for
Returns:

the instance of the model found

condor.dbutil.one_or_latest(db, model, eid)[source]

Finds an instance of the given model if the eid is provided; otherwise returns the most recently created instance in the database. If there is no instance in the database, returns None.

Parameters:
  • db – db connection instance
  • model – model to look for, should have an eid field
  • eid – part of the eid to look for
Returns:

the instance of the model found
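The lookup rule described for one_or_latest can be sketched on an in-memory list. This is only an illustration, not the library's implementation: the Item class, the created field, prefix matching on the eid, and returning None when a given eid has no match are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical model with an eid and a creation counter."""
    eid: str
    created: int


def one_or_latest_sketch(items, eid=None):
    """Return the item whose eid starts with `eid`, else the newest item."""
    if eid:
        # Assumption: "part of the eid" is treated as a prefix.
        matches = [item for item in items if item.eid.startswith(eid)]
        return matches[0] if matches else None
    # No eid given: fall back to the most recently created item, if any.
    return max(items, key=lambda item: item.created, default=None)


items = [Item("ab12", created=1), Item("cd34", created=2)]
by_prefix = one_or_latest_sketch(items, "ab")
latest = one_or_latest_sketch(items)
nothing = one_or_latest_sketch([])
```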

condor.dbutil.requires_db(func)[source]

Injects a database session into a function as first argument.

Checks whether the database is available and a session can be created, and errors out otherwise. If the database is available, it runs the underlying function and commits any changes to the database. If something fails, it rolls back any uncommitted changes.

Note

For best results, use db.flush() instead of db.commit() in functions that require a database connection.
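The commit-or-rollback behaviour described for requires_db can be sketched as follows. This is a minimal illustration under stated assumptions, not condor's code: FakeSession, requires_db_sketch, and make_session are hypothetical names, and the sketch takes a session factory as a parameter whereas the documented decorator receives only the function.

```python
import functools


class FakeSession:
    """Hypothetical stand-in for a database session; records each call."""

    def __init__(self):
        self.events = []

    def flush(self):
        self.events.append("flush")

    def commit(self):
        self.events.append("commit")

    def rollback(self):
        self.events.append("rollback")

    def close(self):
        self.events.append("close")


def requires_db_sketch(session_factory):
    """Build a decorator that injects a session as the first argument."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            db = session_factory()
            try:
                result = func(db, *args, **kwargs)
                db.commit()  # commit once, after the wrapped function succeeds
                return result
            except Exception:
                db.rollback()  # discard uncommitted changes on any failure
                raise
            finally:
                db.close()
        return wrapper
    return decorator


sessions = []


def make_session():
    session = FakeSession()
    sessions.append(session)
    return session


@requires_db_sketch(make_session)
def add_record(db, title):
    db.flush()  # flush inside the function; the decorator owns the commit
    return title.upper()


@requires_db_sketch(make_session)
def broken(db):
    raise ValueError("simulated failure")
```

Because the decorator commits after the wrapped function returns, a function that only flushes leaves the decision to commit or roll back in one place, which is one plausible reading of the note above.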

condor.dbutil.session()[source]

Creates a session maker instance attached to our default engine.

condor.normalize module

This module contains utilities to normalize and simplify words. It is meant to remove punctuation, filter out stopwords, and transform some LaTeX accents into Unicode accented characters.

class condor.normalize.CompleteNormalizer(*args, **kwargs)[source]

Bases: condor.normalize.LatexAccentRemover, condor.normalize.PunctuationRemover, condor.normalize.Lowercaser, condor.normalize.StopwordRemover, condor.normalize.Stemmer

A Normalizer that aggregates all the effects described in this module
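One way a class can aggregate several normalizers through multiple inheritance is cooperative chaining with super(). The sketch below uses hypothetical classes (BaseNormalizer, LowercaseStep, StripPunctuationStep, CombinedNormalizer); whether condor's normalizers chain their apply_to calls this way is an assumption.

```python
class BaseNormalizer:
    """Hypothetical base class: the end of the cooperative chain."""

    def apply_to(self, text):
        return text


class LowercaseStep(BaseNormalizer):
    def apply_to(self, text):
        # Do this step, then hand the result to the next class in the MRO.
        return super().apply_to(text.lower())


class StripPunctuationStep(BaseNormalizer):
    def apply_to(self, text):
        cleaned = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
        return super().apply_to(cleaned)


class CombinedNormalizer(LowercaseStep, StripPunctuationStep):
    """Applies every step, in method resolution order."""


normalized = CombinedNormalizer().apply_to("Hola, Mundo!")
```

In this arrangement the order of the base classes determines the order in which the steps run.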

class condor.normalize.LatexAccentRemover(*args, **kwargs)[source]

Bases: condor.normalize.Normalizer

Replaces LaTeX accents such as \'{a} with the corresponding Unicode character (á).

accents = {"'": {'a': 'á', 'e': 'é', 'i': 'í', 'o': 'ó', 'u': 'ú', 'c': 'ć'}, '`': {'a': 'à', 'e': 'è', 'i': 'ì', 'o': 'ò', 'u': 'ù'}, '~': {'n': 'ñ', 'o': 'õ', 'a': 'ã'}, '^': {'a': 'â', 'e': 'ê', 'i': 'î', 'o': 'ô', 'u': 'û'}, '"': {'a': 'ä', 'e': 'ë', 'i': 'ï', 'o': 'ö', 'u': 'ü', 'y': 'ÿ'}, 'a': {'e': 'æ'}, 'c': {'c': 'ç'}, 'o': {'e': 'œ'}, 's': {'s': 'ß'}, 'v': {'s': 'š'}}
apply_to(text)[source]
formats = ['{{\{accent}{{{character}}}}}', '{{\{accent}{character}}}', '\{accent}{{{character}}}', '\{accent}{character}']
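The accents table and the formats templates suggest a simple substitution scheme: expand each template for every accent and character pair, then replace the resulting LaTeX sequence with the Unicode character. The sketch below illustrates that idea with a subset of the table. The function name replace_latex_accents and the use of str.replace are assumptions, not the library's actual apply_to.

```python
# Subset of the accent table, for illustration only.
ACCENTS = {
    "'": {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú"},
    "~": {"n": "ñ"},
    '"': {"u": "ü"},
}

# str.format templates, longest form first so braces are consumed whole.
FORMATS = [
    "{{\\{accent}{{{character}}}}}",  # {\'{a}}
    "{{\\{accent}{character}}}",      # {\'a}
    "\\{accent}{{{character}}}",      # \'{a}
    "\\{accent}{character}",          # \'a
]


def replace_latex_accents(text):
    """Replace each known LaTeX accent sequence with its Unicode character."""
    for accent, characters in ACCENTS.items():
        for character, replacement in characters.items():
            for template in FORMATS:
                pattern = template.format(accent=accent, character=character)
                text = text.replace(pattern, replacement)
    return text


example = replace_latex_accents("Mart\\'{i}nez, Nari{\\~n}o, Bogot\\'a")
```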
class condor.normalize.Lowercaser(language=None, tokenizer=None)[source]

Bases: condor.normalize.Normalizer

Changes case to lowercase through the normalizer API

apply_to(text)[source]
class condor.normalize.Normalizer(language=None, tokenizer=None)[source]

Bases: object

apply_to(text)[source]
default_language = 'spanish'
default_tokenizer

alias of SpaceTokenizer

class condor.normalize.PunctuationRemover(characters=None, **kwargs)[source]

Bases: condor.normalize.Normalizer

Removes punctuation from a text

apply_to(text)[source]
characters = ‘!”#$%&'()*+,-./:;<=>?@[\]^_`{|}~¡¿“”‘’—'’
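Removing a fixed set of characters can be done with a translation table. The sketch below is a hypothetical illustration: the name remove_punctuation, the shortened character set, and the choice to delete characters rather than replace them with spaces are assumptions.

```python
# Shortened character set, for illustration only.
CHARACTERS = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~¡¿"


def remove_punctuation(text, characters=CHARACTERS):
    """Delete every character listed in `characters` from `text`."""
    table = str.maketrans("", "", characters)
    return text.translate(table)


cleaned = remove_punctuation("¡Hola, mundo!")
```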
class condor.normalize.SpaceTokenizer[source]

Bases: object

Simple tokenization based on spaces

tokenize(text)[source]
class condor.normalize.Stemmer(**kwargs)[source]

Bases: condor.normalize.Normalizer

Changes words to their respective stems

apply_to(text)[source]
class condor.normalize.StopwordRemover(**kwargs)[source]

Bases: condor.normalize.Normalizer

Removes stopwords from a text

apply_to(text)[source]

condor.schemas module

class condor.schemas.ModelSchema(extra=None, only=(), exclude=(), prefix='', strict=None, many=False, context=None, load_only=(), dump_only=(), partial=False)[source]

Bases: marshmallow.schema.Schema

opts = <marshmallow.schema.SchemaOpts object>
class condor.schemas.RecordSchema(extra=None, only=(), exclude=(), prefix='', strict=None, many=False, context=None, load_only=(), dump_only=(), partial=False)[source]

Bases: marshmallow.schema.Schema

opts = <marshmallow.schema.SchemaOpts object>

condor.util module

Contains utility functions to work with tokens and decorators to work with XML, lists and generators.

class condor.util.LanguageGuesser(languages=None)[source]

Bases: object

Guesses the language of a record if the record field is not defined

counts(sentence)[source]
default_lang = 'english'
guess(sentence)[source]
languages = OrderedDict([('en_US', 'english'), ('en_GB', 'english'), ('es_ES', 'spanish'), ('es_CO', 'spanish'), ('es_MX', 'spanish'), ('pt_BR', 'portuguese'), ('pt_PT', 'portuguese'), ('fr_FR', 'french'), ('fr_BE', 'french'), ('it_IT', 'italian'), ('de_DE', 'german'), ('de_CH', 'german'), ('de_AT', 'german')])
condor.util.frequency(words, tokens)[source]

Computes the frequency list of a list of tokens in a dense representation.

Parameters:
  • words (list) – list of the words to look for
  • tokens (list) – list of the tokens to count

Note

This function applies a complete normalizer to the given tokens and guesses the language.
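A dense frequency list has one entry per word, including zeros for words that never occur. The sketch below shows only that counting step: dense_frequency is a hypothetical name, and the normalization and language guessing mentioned in the note are deliberately left out.

```python
from collections import Counter


def dense_frequency(words, tokens):
    """Return one count per entry in `words`, in the same order."""
    counts = Counter(tokens)
    # A Counter returns 0 for missing keys, which keeps the list dense.
    return [counts[word] for word in words]


vector = dense_frequency(["casa", "perro", "gato"], ["casa", "gato", "casa"])
```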

condor.util.full_text_from_pdf(filename)[source]

Tries to extract the full text from a PDF file.

condor.util.gen_to_list(func)[source]

Transforms a function that would return a generator into a function that returns a list of the generated values. Because the whole generator is consumed, do not use this decorator with infinite generators.
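A decorator of this kind can be sketched in a few lines. The names gen_to_list_sketch and squares are hypothetical and used only for illustration.

```python
import functools


def gen_to_list_sketch(func):
    """Wrap a generator function so that it returns a list instead."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # list() consumes the generator fully, so it must be finite.
        return list(func(*args, **kwargs))
    return wrapper


@gen_to_list_sketch
def squares(n):
    for i in range(n):
        yield i * i


result = squares(4)
```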

condor.util.isi_text_to_dic(text)[source]

Takes any ISI WOS plain-text formatted string and turns it into a dictionary whose keys are the leading two-letter field tags and whose values are lists of the strings under each tag.
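The mapping from tagged lines to a dictionary can be sketched as below. The sketch assumes the usual layout of ISI WOS plain-text exports, where a line starts with a two-letter tag followed by a space, and indented lines continue the previous tag. The name isi_text_to_dict_sketch and the handling of value-less tags such as ER are assumptions.

```python
def isi_text_to_dict_sketch(text):
    """Group the lines of one record under their two-letter tags."""
    record = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tag, value = line[:2], line[3:].strip()
        if tag.strip():
            # A new tag starts here; later indented lines belong to it.
            key = tag
            record.setdefault(key, [])
        if key is not None and value:
            record[key].append(value)
    return record


sample = "PT J\nAU Smith, J\n   Doe, A\nTI A title\nER"
parsed = isi_text_to_dict_sketch(sample)
```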

condor.util.to_list(obj)[source]

Transforms a non-iterable object into a singleton list, or an iterable into a list.
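A possible reading of this behaviour is sketched below. The name to_list_sketch is hypothetical, and treating strings and bytes as single values rather than iterables is an assumption that the documentation does not settle.

```python
from collections.abc import Iterable


def to_list_sketch(obj):
    """Wrap scalars in a list; convert other iterables to a list."""
    # Assumption: strings and bytes count as single values, not iterables.
    if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
        return list(obj)
    return [obj]


wrapped = to_list_sketch(3)
converted = to_list_sketch((1, 2))
```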

condor.util.xml_to_text(func)[source]

Transforms a function that would return an XML element into a function that returns the text content of the XML element as a string.
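Such a decorator might look like the sketch below, which uses the standard library's xml.etree.ElementTree. Which XML library condor actually uses, and how it treats a missing element, are not stated, so both are assumptions. The names xml_to_text_sketch and title_of are hypothetical.

```python
import functools
import xml.etree.ElementTree as ET


def xml_to_text_sketch(func):
    """Wrap a function returning an XML element so it returns its text."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        element = func(*args, **kwargs)
        if element is None:
            return ""  # assumption: a missing element yields an empty string
        # itertext() also collects the text of nested child elements.
        return "".join(element.itertext())
    return wrapper


@xml_to_text_sketch
def title_of(document):
    return ET.fromstring(document).find("title")


text = title_of("<doc><title>Hello <b>world</b></title></doc>")
```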

Module contents