
API - BackEnd.functions

BackEnd.functions.clustering

clustering

Functions

hover_with_keywords(research, list_id, embedding_2d, best_study_clusterer, tf_idf_sorted)

Create a cluster from research articles.

Creates a Cluster database object for every article in list_id, extracting the x, y coordinates from embedding_2d and assigning the cluster given by best_study_clusterer; every cluster is labelled with keywords extracted from tf_idf_sorted.

Parameters:

Name Type Description Default
research Research object Model database

The research object model used to store the search query, status and related data.

required
list_id list of str

List of article IDs related to the research.

required
embedding_2d numpy array

A 2-D embedding of the articles, with dimensionality reduced so the clusters can be visualized.

required
best_study_clusterer optuna.study.Study

An object used to optimize parameters and create clusters from embedding_2d.

required
tf_idf_sorted pandas DataFrame

Stores the tf-idf score for each word in the articles: a DataFrame with keywords as the row index and the related articles as the columns.

required

Returns:

Type Description
None
Notes

Creates a Cluster database object for every article in list_id, storing research, article_id, pos_x, pos_y and labels.
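A minimal usage sketch, assuming a saved Research instance and inputs produced by earlier pipeline steps (the article IDs and variable names are illustrative):

    import numpy as np
    from BackEnd.functions.clustering import hover_with_keywords

    # Hypothetical inputs: one (x, y) pair per article, e.g. produced by PaCMAP.
    embedding_2d = np.array([[0.1, 0.5], [1.2, -0.3], [0.9, 0.4]])

    # best_study and tf_idf_sorted are assumed to come from earlier steps
    # (see make_cluster and make_preprocessing in view_functions).
    hover_with_keywords(
        research=research,                # saved Research instance (assumed)
        list_id=["a1", "a2", "a3"],       # article IDs linked to the research
        embedding_2d=embedding_2d,
        best_study_clusterer=best_study,  # optuna.study.Study (assumed)
        tf_idf_sorted=tf_idf_sorted,      # pandas DataFrame (assumed)
    )
    # Side effect: one Cluster database object per article with pos_x, pos_y
    # and a label derived from the top tf-idf keywords.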

BackEnd.functions.journal_collectors.arxiv_collector

arxiv_collector

Classes

ArXivCollector

Bases: TheXivesCollector

Some methods for this class are defined at base.py:TheXivesCollector. Some of the methods for ArXivCollector overlap with the ones for medrXiv and biorXiv, so instead of defining them thrice we overwrite the methods specific for each one in their own files.

Functions
extract_author(entry)
Notes

Authors appear in the HTML document as:

Authors: Author 1 , Author 2 , Author 3

extract_date(entry)

As an example, date appears in the HTML document as:

Submitted 24 March, 2019; v1 submitted 17 August, 2018; originally announced August 2018.

generate_base_query(search_term, begin, end, page=None)

Generates the base query according to the search term, page number and articles per page.

get_dict_markup_lang(search_term, begin, end, entries=[])

Returns a list of article IDs.

get_max_count(soup)

Returns the number of articles resulting from a soupified HTML query.

Parameters:

Name Type Description Default
soup bs4.BeautifulSoup

The HTML page resulting from a query soupified using BeautifulSoup.

required

Returns:

Name Type Description
max_count int

The number of articles resulting from the query.

Notes

In the HTML document, the number of articles max_count appears in:

Showing `initial_from_page`–`final_from_page` of `max_count` results
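The parsing itself is internal; the following is only a sketch of how such a count could be recovered with BeautifulSoup and a regular expression (the sample HTML is illustrative, not arXiv's actual markup):

    import re
    from bs4 import BeautifulSoup

    # Illustrative only: recover max_count from "Showing 1-50 of 1,234 results".
    html = "<h1>Showing 1&ndash;50 of 1,234 results</h1>"
    soup = BeautifulSoup(html, "html.parser")
    match = re.search(r"of\s+([\d,]+)\s+results", soup.get_text())
    max_count = int(match.group(1).replace(",", "")) if match else 0
    print(max_count)  # 1234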

get_url_response(query)
Notes

arXiv's advanced search returns an HTML document.

BackEnd.functions.journal_collectors.base

base

Classes

BaseCollector

Functions
extract_pdf_full_text(article)

Extracts the full text from the PDF file of an article.

generate_threads(search, research, begin, end, number_threads=1)

Retrieve articles using parallel threads based on the provided search parameters.

get_articles_parallel(research, dict_id_soup)

Fetch articles in parallel and store them in the database.

get_search_term(search)

Translates the search into a readable search string for the API.

separate_ymd_from_date(date)

Separates year, month, and day as strings from the given date object.

PMAndPMCCollector

Bases: BaseCollector

This class mostly consists of methods for PMCollector. Some of the methods overlap with PMCCollector, so we define the methods for PMCollector here and overwrite the methods specific to PMCCollector in pmc_collector.py.

Functions
extract_abstract(entry)

Extracts the full abstract (not always available). itertext() takes all text between the abstract tags, so we remove all newline characters and superfluous spaces.

extract_author(entry)

Extracts the author names from the XML document.

extract_date(entry)

Extracts the date from the XML document and returns a datetime.date object.

extract_doi(entry)

Extracts the DOI from the XML document.

extract_metadata_from_identifier(identifier)

Extracts metadata from the given identifier and returns it as a dictionary.

extract_title(entry)

Extracts the title from the XML document.

extract_url(entry)

Extracts the URL based on the DOI.

generate_addr_fetch(identifier)

Generates the API fetch address for retrieving data based on the provided identifier.

generate_base_query(search_term, begin, end)

Generates the base query according to the API, database, search term, begin and end dates.

generate_id_list(entry)

Generates a list of IDs from the XML document.

get_list_id(search_term, begin, end)

Returns a list of article IDs.

get_max_articles(search, begin, end)

Returns the maximum number of articles for the given search query and date range.

get_url_response(query)

Returns an XML document.

prepare_articles_parallel(research, identifier, dict_id_soup={})

Returns True if the outer loop should keep going, or False if it should execute a continue statement.

TheXivesCollector

Bases: BaseCollector

Functions
generate_base_query(search_term, begin, end, page=0)

Generates the base query according to the API, database, search term, begin and end dates.

get_dict_markup_lang(search_term, begin, end, entries=[])

Returns a list of article IDs.

get_max_articles(search, begin, end)

Returns the maximum number of articles for the given search query and date range.

prepare_articles_parallel(research, identifier, dict_id_soup={})

Returns True if the outer loop should keep going, or False if it should execute a continue statement.

Functions

_get_articles_parallel_wrapper(args)

Execute get_articles_parallel from the given object.

This is a wrapper to BaseCollector.get_articles_parallel that needs to be called by multiprocessing.
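A sketch of the pattern: multiprocessing can only dispatch picklable, module-level callables, so the collector method is wrapped. The tuple layout of args below is an assumption, not the module's actual code:

    from multiprocessing import Pool

    # Illustrative pattern only. A module-level wrapper unpacks the work item
    # and forwards it to the bound method, so Pool.map can pickle the call.
    def _get_articles_parallel_wrapper(args):
        collector, research, dict_id_soup = args   # assumed tuple layout
        return collector.get_articles_parallel(research, dict_id_soup)

    # Hypothetical usage: one work item per chunk of soupified article entries.
    # with Pool(processes=4) as pool:
    #     pool.map(_get_articles_parallel_wrapper, work_items)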

BackEnd.functions.journal_collectors.biorxiv_collector

biorxiv_collector

Classes

BiorXivCollector

Bases: TheXivesCollector

The methods for this class are defined at base.py:TheXivesCollector. Some of the methods for BiorXivCollector overlap with the ones for medrXiv and arXiv, so instead of defining them thrice we overwrite the methods specific for each one in their own files.

BackEnd.functions.journal_collectors.medrxiv_collector

medrxiv_collector

Classes

MedrXivCollector

Bases: TheXivesCollector

The methods for this class are defined at base.py:TheXivesCollector. Some of the methods for MedrXivCollector overlap with the ones for biorXiv and arXiv, so instead of defining them thrice we overwrite the methods specific for each one in their own files.

BackEnd.functions.journal_collectors.pm_collector

pm_collector

Classes

PMCollector

Bases: PMAndPMCCollector

The methods for this class are defined at base.py:PMAndPMCCollector. Some of the methods for PMCollector overlap with the ones for PMCCollector, so instead of defining them twice we overwrite the methods specific to PMCCollector in pmc_collector.py.

BackEnd.functions.journal_collectors.pmc_collector

pmc_collector

Classes

PMCCollector

Bases: PMAndPMCCollector

Most of the methods for this class are defined at base.py:PMAndPMCCollector. Some of the methods for PMCCollector overlap with the ones for PMCollector, so instead of defining them twice we overwrite the methods specific to PMCCollector here.

Functions
extract_abstract(entry)

Extracts the full abstract (not always available). itertext() takes all text between the abstract tags, so we remove all newline characters and superfluous spaces.

extract_author(entry)

Extracts the author names from the XML document.

extract_date(entry)

Extracts the date from the XML document and returns a datetime.date object.

extract_doi(entry)

Extracts the DOI from the XML document.

BackEnd.functions.journal_collectors.types

types

BackEnd.functions.nlp

nlp

Functions

call_openai_chatgpt(question='', keywords='HIV, infant, pregnancy', api_key=settings.OPENAI_API_KEY)

Standard function to call ChatGPT through OpenAI's python package. More info: https://platform.openai.com/docs/guides/chat/introduction
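A minimal usage sketch (the question text is illustrative; api_key defaults to settings.OPENAI_API_KEY, and the return value is assumed here to be the model's reply text):

    from BackEnd.functions.nlp import call_openai_chatgpt

    # Hypothetical call -- the prompt is illustrative, not the one the
    # pipeline actually sends.
    answer = call_openai_chatgpt(
        question="Give a one-sentence description of the topic covered by these keywords.",
        keywords="HIV, infant, pregnancy",
    )
    print(answer)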

list_as_string(input_list, glue=', ')

Auxiliary function. Reads a list and concatenates it as a string, using glue as the character joining each element of the list.

nlp_topic_description(input_list, api_key)

Reads keywords and returns a description according to ChatGPT.
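The two helpers above can be combined as sketched here (assuming the behavior described above; the keyword list and key placeholder are illustrative):

    from BackEnd.functions.nlp import list_as_string, nlp_topic_description

    keywords = ["HIV", "infant", "pregnancy"]

    # list_as_string joins the elements with the given glue string.
    print(list_as_string(keywords))              # "HIV, infant, pregnancy"
    print(list_as_string(keywords, glue=" | "))  # "HIV | infant | pregnancy"

    # nlp_topic_description asks ChatGPT for a description of the topic
    # represented by the keywords (hypothetical key placeholder below).
    description = nlp_topic_description(keywords, api_key="YOUR_OPENAI_KEY")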

BackEnd.functions.scatter_with_hover

scatter_with_hover

Functions

scatter_with_hover(research, path, fig=None, name=None, marker='circle', fig_width=1500, fig_height=900)

Creates an HTML plot file with an interactive scatter plot of x vs y using Bokeh, with automatic tooltips showing columns from the Clusters database objects related to the given Research object.

Parameters:

Name Type Description Default
research Research object Model database

The research object model used to store the search query, status and related data.

required
path (str, Path)

Full path where the HTML plot file will be stored after being generated.

required
fig bokeh.plotting.Figure

Figure on which to plot (if not given, a new figure will be created).

None
name str

Bokeh series name to give to the scattered data.

None
marker str

Name of the marker to use for the scatter plot.

'circle'
fig_width int

Width of the resulting plot.

1500
fig_height int

Height of the resulting plot.

900

Returns:

Type Description
None
Notes

Creates an HTML plot file from the Clusters objects and stores it at the given path.

Acknowledgment

Original code from Robin Wilson (robin@rtwilson.com), with thanks to Max Albert for the original code example.
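A minimal usage sketch (the output path is illustrative; the Research instance and its related Clusters objects are assumed to exist already):

    from pathlib import Path
    from BackEnd.functions.scatter_with_hover import scatter_with_hover

    # Hypothetical call: writes an interactive Bokeh scatter plot of the
    # research's Clusters objects to the given HTML file.
    scatter_with_hover(
        research=research,                     # saved Research instance (assumed)
        path=Path("/tmp/research_plot.html"),  # illustrative output location
        fig_width=1500,
        fig_height=900,
    )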

BackEnd.functions.view_functions

view_functions

Classes

Functions

back_process(research)

Controls the whole research process: extracting articles, preprocessing text, clustering, and plotting. This function allows restarting processes that have been interrupted and continuing from the point where they stopped. When it finishes, it deletes all extra data used in intermediate steps.

Parameters:

Name Type Description Default
research Research object Model database
The research object model used to store the search query, status and related data.
required

Returns:

Type Description
None
Notes

Modifies the research object in the model database.

check(research)

Checks whether the thread of the research is alive.

make_cluster(research, list_id, tf_idf, n_trials, n_threads, tf_idf_sorted)

Prepares, from the input arguments, the data needed to create a Cluster database object, using a PaCMAP object to optimize and reduce dimensionality.

Parameters:

Name Type Description Default
research Research object Model database
The research object model used to store the search query, status and related data.
required
list_id list of str
List of article IDs related to the research.
required
tf_idf scipy sparse matrix
Matrix holding the tf-idf score for each relevant word in the articles.
required
n_trials int
Number of trials used to search for the optimal cluster distribution.
required
n_threads int
Number of parallel processes.
required
tf_idf_sorted pandas DataFrame
Stores the tf-idf score for each word in the articles: a DataFrame with keywords as the row index and the related articles as the columns.
required

Returns:

Type Description
None
Notes

Creates Cluster database objects; these objects will be used to generate the clusters plot.

make_preprocessing(research, corpus='abstract', number_threads=1)

Preprocesses the research data and returns the tf-idf matrix, sorted tf-idf matrix, list of ids and list of preprocessed text.

Parameters:

Name Type Description Default
research Research object Model database
The research object model used to store the search query, status and related data.
required
corpus str
The corpus used to preprocess the text; it can be 'full_text' (PDF), 'abstract' or 'both'.
'abstract'
number_threads int
Number of threads that will run in parallel.
1

Returns:

Name Type Description
tf_idf scipy sparse matrix

Matrix holding the tf-idf score for each relevant word in the articles.

tf_idf_sorted pandas DataFrame

Stores the tf-idf score for each word in the articles: a DataFrame with keywords as the row index and the related articles as the columns.

list_id_final list of str

List of article IDs related to the research.

list_final list of str

Each element of the list corresponds to the text extracted from an article.
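Taken together, the preprocessing and clustering steps can be chained as in the sketch below (a simplified flow; back_process is what actually orchestrates this, and the n_trials/n_threads values are illustrative):

    from BackEnd.functions.view_functions import make_preprocessing, make_cluster

    # Hypothetical flow, assuming a saved Research instance whose articles
    # have already been collected. Unpacking follows the Returns section above.
    tf_idf, tf_idf_sorted, list_id_final, list_final = make_preprocessing(
        research, corpus="abstract", number_threads=2
    )

    make_cluster(
        research,
        list_id_final,      # article IDs kept after preprocessing
        tf_idf,             # scipy sparse tf-idf matrix
        n_trials=50,        # illustrative optimization budget
        n_threads=2,
        tf_idf_sorted=tf_idf_sorted,
    )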

preprocessing_parallel(research, articles, corpus)

Preprocesses the text from the articles related to the research and creates a Preprocess_text database object for every preprocessed word and a Number_preprocess database object as a flag to indicate that the research has processed text. This method is run in parallel.

Parameters:

Name Type Description Default
research Research object Model database
The research object model used to store the search query, status and related data.
required
articles (List[Article], queryset)
List of articles related to the research.
required
corpus str
The corpus used to preprocess the text; it can be 'full_text' (PDF), 'abstract' or 'both'.
required

Returns:

Type Description
None
Notes

Creates Preprocess_text database objects for every preprocessed word and Number_preprocess database objects as a flag to indicate that the research has processed text.

print_research(log_text, research_id)

Print a log message in the research logfile.

relaunch_if_fault_all()

Checks whether there is a research that is running but no longer has a live thread.

update_research()

This is an infinite loop that restarts every finished research once a month. Between restarts it waits for some time so the host doesn't freeze.
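The pattern is roughly the one sketched here (the sleep durations and the restart call are assumptions about the actual implementation):

    import time

    # Illustrative pattern only: once a month, re-run every finished research,
    # pausing between restarts so the host is not overloaded.
    ONE_MONTH = 30 * 24 * 60 * 60       # assumed granularity, in seconds
    PAUSE_BETWEEN_RESTARTS = 5 * 60     # assumed pause, in seconds

    def update_research_sketch(get_finished_researches, restart):
        while True:
            for research in get_finished_researches():
                restart(research)                 # e.g. back_process(research)
                time.sleep(PAUSE_BETWEEN_RESTARTS)
            time.sleep(ONE_MONTH)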

BackEnd.functions.filter_article

filter_article

Functions

parsing(list_term, dict_keyword)

Each keyword has an associated list of numbers (article IDs). Output: a list of numbers corresponding to the result of the parsing. This method works recursively: it returns the result of the logical operation between the first term and the rest.

split_search_term(search_term)

Takes the search term and returns a list of its elements (AND, word, parenthesis, etc.) in order, plus a dictionary where each key is a keyword and the value is a list that will be filled with article IDs.
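The two functions work together roughly as sketched below (the search term, token shapes and article IDs are illustrative assumptions):

    from BackEnd.functions.filter_article import split_search_term, parsing

    # Hypothetical example: split a boolean search term into ordered tokens
    # plus a keyword -> article-ID-list dictionary.
    list_term, dict_keyword = split_search_term("(HIV AND infant) OR pregnancy")
    # list_term    -> e.g. ['(', 'HIV', 'AND', 'infant', ')', 'OR', 'pregnancy']
    # dict_keyword -> e.g. {'HIV': [], 'infant': [], 'pregnancy': []}

    # Once each keyword list is filled with matching article IDs,
    # parsing() evaluates the boolean expression over those lists.
    dict_keyword = {"HIV": [1, 2, 3], "infant": [2, 3], "pregnancy": [4]}
    matching_ids = parsing(list_term, dict_keyword)  # e.g. [2, 3, 4]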

BackEnd.functions.PDF_download

PDF_download

Classes

PDFHandler

Functions
convert_pdf(path_file)

Converts PDF to text to do text processing.

Parameters:

Name Type Description Default
path_file str
    path to the pdf file.
required

Returns:

Name Type Description
text str

the text conversion of the pdf

download_from_url(article, filename)

Downloads a PDF from a direct URL.

Parameters:

Name Type Description Default
article Article object Model
An object which has, among its attributes, the URL to the PDF file.
required
filename str
The filename to use for the downloaded file; it includes the full path and name.
required

Returns:

Name Type Description
status bool

Returns True if the PDF was saved successfully, False otherwise.

extract_full_text(article)

Downloads the corresponding PDF from the article's URL and extracts the full text. Returns the full text, or an empty string if there was a problem. If the DOWLOAD_PDF_ARTICLE flag is set to False, it just returns "".

Parameters:

Name Type Description Default
article Article object Model
the article object.
required

Returns:

Name Type Description
full_text str

the text conversion of the pdf

Notes

If the DOWLOAD_PDF_ARTICLE flag is set to False, it just returns "".
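A minimal usage sketch, assuming these functions are available as methods on a PDFHandler instance (the constructor and the Article object are assumptions):

    from BackEnd.functions.PDF_download import PDFHandler

    # Hypothetical usage: download the article's PDF and convert it to text.
    handler = PDFHandler()                           # constructor args unknown/assumed
    full_text = handler.extract_full_text(article)   # '' if download/parsing failed
    if full_text:
        print(full_text[:200])                       # first 200 characters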

get_url_from_doi(doi)
Description

This function gets a direct URL to a PDF based on the DOI.

Takes

DOI (string): the DOI of the article.

Returns:

Name Type Description
url_pdf str or None

URL to the PDF file.

remove_char(string)
Description

This function removes unwanted characters

Takes

string (string)

BackEnd.functions.Remove_references

Remove_references

BackEnd.functions.text_processing

text_processing

Functions

create_stopwords()

Create a new set of stopwords from a file.

Returns:

Name Type Description
stopwords Set[str]

A set of stopwords.

lemmatization(list_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

Lemmatize a list of words.

Parameters:

Name Type Description Default
list_words List[str]
  A list of words.
required
allowed_postags List[str]
  A list of allowed POS tags.
['NOUN', 'ADJ', 'VERB', 'ADV']

Returns:

Name Type Description
list_lemmatized List[str]

A list of lemmatized words.

pre_processing(df_to_list)

Preprocess a list of strings by removing special characters and digits.

Parameters:

Name Type Description Default
df_to_list List[str]
  A list of strings.
required

Returns:

Name Type Description
list_preprocessing List[str]

A list of preprocessed strings.

remove_misspelled(list_one_two)

Receives a list of words and, for each word, checks whether it is in the dictionary_compact.json file. If the word is not in the dictionary, returns the corrected word; otherwise, returns the original word.

remove_words(list_lemmatized, list_stopwords)

Receives a list of words and a list of stopwords, removes the stopwords from the list of words, and returns a list with the remaining words.

sent_to_words(list_languages)

Receives a list of sentences and converts each sentence into a list of lowercase tokens, ignoring tokens that are too short or too long (accents are removed as well). Returns, for each sentence, a list with the processed words.
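These helpers can be chained into a simple cleaning pipeline, as sketched below (the sample corpus and the exact order of steps are assumptions based on the descriptions above; the real pipeline lives in make_preprocessing / preprocessing_parallel):

    from BackEnd.functions.text_processing import (
        create_stopwords,
        lemmatization,
        pre_processing,
        remove_words,
        sent_to_words,
    )

    # Hypothetical pipeline over a tiny corpus.
    corpus = ["HIV transmission in infants during pregnancy was studied."]

    cleaned = pre_processing(corpus)        # strip special characters and digits
    tokenized = sent_to_words(cleaned)      # lowercase tokens per sentence
    stopwords = create_stopwords()          # stopword set loaded from file
    for tokens in tokenized:
        lemmas = lemmatization(tokens)      # keep NOUN/ADJ/VERB/ADV lemmas
        keywords = remove_words(lemmas, stopwords)
        print(keywords)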