
API - BackEnd.functions

BackEnd.functions.clustering

clustering

Functions

hover_with_keywords(research, list_id, embedding_2d, best_study_clusterer, tf_idf_sorted)

Create a cluster from research articles.

Creates a Cluster database object for every article in list_id, extracting the x, y coordinates from embedding_2d and assigning the cluster given by best_study_clusterer; every cluster is labelled with keywords extracted from tf_idf_sorted.

Parameters:

Name Type Description Default
research Research object Model database

The research object model used to store the search query, status and related data.

required
list_id list of str

List of article IDs related to the research.

required
embedding_2d numpy array

A 2-D embedding of the articles, with dimensionality reduced so the clusters can be visualized.

required
best_study_clusterer optuna.study.Study

An object used to optimize parameters and create clusters from embedding_2d.

required
tf_idf_sorted pandas DataFrame

Stores the tf-idf score for each word in the articles: a DataFrame with keywords as the row index and the related articles as the columns.

required

Returns:

Type Description
None
Notes

Creates a Cluster database object for every article in list_id, storing research, article_id, pos_x, pos_y and labels.
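A minimal usage sketch, assuming a saved Research instance and inputs produced by earlier pipeline steps (the article IDs and variable names are illustrative):

    import numpy as np
    from BackEnd.functions.clustering import hover_with_keywords

    # Hypothetical inputs: one (x, y) pair per article, e.g. produced by PaCMAP.
    embedding_2d = np.array([[0.1, 0.5], [1.2, -0.3], [0.9, 0.4]])

    # best_study and tf_idf_sorted are assumed to come from earlier steps
    # (see make_cluster and make_preprocessing in view_functions).
    hover_with_keywords(
        research=research,                # saved Research instance (assumed)
        list_id=["a1", "a2", "a3"],       # article IDs linked to the research
        embedding_2d=embedding_2d,
        best_study_clusterer=best_study,  # optuna.study.Study (assumed)
        tf_idf_sorted=tf_idf_sorted,      # pandas DataFrame (assumed)
    )
    # Side effect: one Cluster database object per article with pos_x, pos_y
    # and a label derived from the top tf-idf keywords.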

BackEnd.functions.journal_collectors.arxiv_collector

arxiv_collector

Classes

ArXivCollector

Bases: TheXivesCollector

Some methods for this class are defined at base.py:TheXivesCollector. Some of the methods for ArXivCollector overlap with the ones for medrXiv and biorXiv, so instead of defining them thrice we overwrite the methods specific for each one in their own files.

Functions
extract_author(entry)
Notes

Authors appear in the HTML document as:

Authors: Author 1 , Author 2 , Author 3

extract_date(entry)

As an example, date appears in the HTML document as:

Submitted 24 March, 2019; v1 submitted 17 August, 2018; originally announced August 2018.

generate_base_query(search_term, begin, end, page=None)

Generates the base query according to the search term, page number and articles per page.

get_dict_markup_lang(search_term, begin, end, entries=[])

Returns a list of article IDs.

get_max_count(soup)

Returns the number of articles resulting from a soupified HTML query.

Parameters:

Name Type Description Default
soup bs4.BeautifulSoup

The HTML page resulting from a query soupified using BeautifulSoup.

required

Returns:

Name Type Description
max_count int

The number of articles resulting from the query.

Notes

In the HTML document, the number of articles max_count appears in:

Showing `initial_from_page`–`final_from_page` of `max_count` results
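The parsing itself is internal; the following is only a sketch of how such a count could be recovered with BeautifulSoup and a regular expression (the sample HTML is illustrative, not arXiv's actual markup):

    import re
    from bs4 import BeautifulSoup

    # Illustrative only: recover max_count from "Showing 1-50 of 1,234 results".
    html = "<h1>Showing 1&ndash;50 of 1,234 results</h1>"
    soup = BeautifulSoup(html, "html.parser")
    match = re.search(r"of\s+([\d,]+)\s+results", soup.get_text())
    max_count = int(match.group(1).replace(",", "")) if match else 0
    print(max_count)  # 1234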

get_url_response(query)
Notes

arXiv's advanced search returns an HTML document.

BackEnd.functions.journal_collectors.base

base

Classes

BaseCollector

Functions
extract_pdf_full_text(article)

Extracts the full text from the PDF file of an article.

generate_threads(search, research, begin, end, number_threads=1)

Retrieve articles using parallel threads based on the provided search parameters.

get_articles_parallel(research, dict_id_soup)

Fetch articles in parallel and store them in the database.

get_search_term(search)

Translates the search into a readable search string for the API.

separate_ymd_from_date(date)

Separates year, month, and day as strings from the given date object.

PMAndPMCCollector

Bases: BaseCollector

This class mostly consists of methods for PMCollector. Some of the methods overlap with PMCCollector, so we define the methods for PMCollector here and overwrite the methods specific to PMCCollector in pmc_collector.py.

Functions
extract_abstract(entry)

Extracts the full abstract (not always available). itertext() takes all text between the abstract tags, so we remove all newline characters and superfluous spaces.

extract_author(entry)

Extracts the author names from the XML document.

extract_date(entry)

Extracts the date from the XML document and returns a datetime.date object.

extract_doi(entry)

Extracts the DOI from the XML document.

extract_metadata_from_identifier(identifier)

Extracts metadata from the given identifier and returns it as a dictionary.

extract_title(entry)

Extracts the title from the XML document.

extract_url(entry)

Extracts the URL based on the DOI.

generate_addr_fetch(identifier)

Generates the API fetch address for retrieving data based on the provided identifier.

generate_base_query(search_term, begin, end)

Generates the base query according to the API, database, search term, begin and end dates.

generate_id_list(entry)

Generates a list of IDs from the XML document.

get_list_id(search_term, begin, end)

Returns a list of article IDs.

get_max_articles(search, begin, end)

Returns the maximum number of articles for the given search query and date range.

get_url_response(query)

Returns an XML document.

prepare_articles_parallel(research, identifier, dict_id_soup={})

Returns True if the outer loop should keep going, or False if it should execute a continue statement.

TheXivesCollector

Bases: BaseCollector

Functions
generate_base_query(search_term, begin, end, page=0)

Generates the base query according to the API, database, search term, begin and end dates.

get_dict_markup_lang(search_term, begin, end, entries=[])

Returns a list of article IDs.

get_max_articles(search, begin, end)

Returns the maximum number of articles for the given search query and date range.

prepare_articles_parallel(research, identifier, dict_id_soup={})

Returns True if the outer loop should keep going, or False if it should execute a continue statement.

Functions

_get_articles_parallel_wrapper(args)

Execute get_articles_parallel from the given object.

This is a wrapper to BaseCollector.get_articles_parallel that needs to be called by multiprocessing.
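A sketch of the pattern: multiprocessing can only dispatch picklable, module-level callables, so the collector method is wrapped. The tuple layout of args below is an assumption, not the module's actual code:

    from multiprocessing import Pool

    # Illustrative pattern only. A module-level wrapper unpacks the work item
    # and forwards it to the bound method, so Pool.map can pickle the call.
    def _get_articles_parallel_wrapper(args):
        collector, research, dict_id_soup = args   # assumed tuple layout
        return collector.get_articles_parallel(research, dict_id_soup)

    # Hypothetical usage: one work item per chunk of soupified article entries.
    # with Pool(processes=4) as pool:
    #     pool.map(_get_articles_parallel_wrapper, work_items)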

BackEnd.functions.journal_collectors.biorxiv_collector

biorxiv_collector

Classes

BiorXivCollector

Bases: TheXivesCollector

The methods for this class are defined at base.py:TheXivesCollector. Some of the methods for BiorXivCollector overlap with the ones for medrXiv and arXiv, so instead of defining them thrice we overwrite the methods specific for each one in their own files.

BackEnd.functions.journal_collectors.medrxiv_collector

medrxiv_collector

Classes

MedrXivCollector

Bases: TheXivesCollector

The methods for this class are defined at base.py:TheXivesCollector. Some of the methods for MedrXivCollector overlap with the ones for biorXiv and arXiv, so instead of defining them thrice we overwrite the methods specific for each one in their own files.

BackEnd.functions.journal_collectors.pm_collector

pm_collector

Classes

PMCollector

Bases: PMAndPMCCollector

The methods for this class are defined at base.py:PMAndPMCCollector. Some of the methods for PMCollector overlap with the ones for PMCCollector, so instead of defining them twice we overwrite the methods specific to PMCCollector in pmc_collector.py.

BackEnd.functions.journal_collectors.pmc_collector

pmc_collector

Classes

PMCCollector

Bases: PMAndPMCCollector

Most of the methods for this class are defined at base.py:PMAndPMCCollector. Some of the methods for PMCCollector overlap with the ones for PMCollector, so instead of defining them twice we overwrite the methods specific to PMCCollector here.

Functions
extract_abstract(entry)

Extracts the full abstract (not always available). itertext() takes all text between the abstract tags, so we remove all newline characters and superfluous spaces.

extract_author(entry)

Extracts the author names from the XML document.

extract_date(entry)

Extracts the date from the XML document and returns a datetime.date object.

extract_doi(entry)

Extracts the DOI from the XML document.

BackEnd.functions.journal_collectors.types

types

BackEnd.functions.nlp

nlp

Functions

call_openai_chatgpt(question='', keywords='HIV, infant, pregnancy', api_key=settings.OPENAI_API_KEY)

Standard function to call ChatGPT through OpenAI's python package. More info: https://platform.openai.com/docs/guides/chat/introduction
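A minimal usage sketch (the question text is illustrative; api_key defaults to settings.OPENAI_API_KEY, and the return value is assumed here to be the model's reply text):

    from BackEnd.functions.nlp import call_openai_chatgpt

    # Hypothetical call -- the prompt is illustrative, not the one the
    # pipeline actually sends.
    answer = call_openai_chatgpt(
        question="Give a one-sentence description of the topic covered by these keywords.",
        keywords="HIV, infant, pregnancy",
    )
    print(answer)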

list_as_string(input_list, glue=', ')

Auxiliary function. Reads a list and concatenates it as a string, using glue as the character joining each element of the list.

nlp_topic_description(input_list, api_key)

Reads keywords and returns a description according to ChatGPT.
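The two helpers above can be combined as sketched here (assuming the behavior described above; the keyword list and key placeholder are illustrative):

    from BackEnd.functions.nlp import list_as_string, nlp_topic_description

    keywords = ["HIV", "infant", "pregnancy"]

    # list_as_string joins the elements with the given glue string.
    print(list_as_string(keywords))              # "HIV, infant, pregnancy"
    print(list_as_string(keywords, glue=" | "))  # "HIV | infant | pregnancy"

    # nlp_topic_description asks ChatGPT for a description of the topic
    # represented by the keywords (hypothetical key placeholder below).
    description = nlp_topic_description(keywords, api_key="YOUR_OPENAI_KEY")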

BackEnd.functions.scatter_with_hover

scatter_with_hover

Functions

scatter_with_hover(research, path, fig=None, name=None, marker='circle', fig_width=1500, fig_height=900)

Creates an HTML plot file with an interactive scatter plot of x vs y using Bokeh, with automatic tooltips showing columns from the Clusters database objects related to the given Research object.

Parameters:

Name Type Description Default
research Research object Model database

The research object model used to store the search query, status and related data.

required
path (str, Path)

Full path where the HTML plot file will be stored after being generated.

required
fig bokeh.plotting.Figure

Figure on which to plot (if not given, a new figure will be created).

None
name str

Bokeh series name to give to the scattered data.

None
marker str

Name of the marker to use for the scatter plot.

'circle'
fig_width int

Width of the resulting plot.

1500
fig_height int

Height of the resulting plot.

900

Returns:

Type Description
None
Notes

Creates an HTML plot file from the Clusters objects and stores it at the given path.

Acknowledgment

Original code from Robin Wilson (robin@rtwilson.com), with thanks to Max Albert for the original code example.
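A minimal usage sketch (the output path is illustrative; the Research instance and its related Clusters objects are assumed to exist already):

    from pathlib import Path
    from BackEnd.functions.scatter_with_hover import scatter_with_hover

    # Hypothetical call: writes an interactive Bokeh scatter plot of the
    # research's Clusters objects to the given HTML file.
    scatter_with_hover(
        research=research,                     # saved Research instance (assumed)
        path=Path("/tmp/research_plot.html"),  # illustrative output location
        fig_width=1500,
        fig_height=900,
    )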

BackEnd.functions.view_functions

view_functions

Classes

Functions

back_process(research)

Controls the whole research process: extracting articles, preprocessing text, clustering, and plotting. This function allows restarting processes that have been interrupted and continuing from the point where they stopped. When it finishes, it deletes all extra data used in intermediate steps.

Parameters:

Name Type Description Default
research Research object Model database
The research object model used to store the search query, status and related data.
required

Returns:

Type Description
None
Notes

Modifies the research object in the model database.

check(research)

Checks whether the thread of the research is alive.

make_cluster(research, list_id, tf_idf, n_trials, n_threads, tf_idf_sorted)

Prepares, from the input arguments, the data needed to create a Cluster database object, using a PaCMAP object to optimize and reduce dimensionality.

Parameters:

Name Type Description Default
research Research object Model database
The research object model used to store the search query, status and related data.
required
list_id list of str
List of article IDs related to the research.
required
tf_idf scipy sparse matrix
Matrix holding the tf-idf score for each relevant word in the articles.
required
n_trials int
Number of trials used to search for the optimal cluster distribution.
required
n_threads int
Number of parallel processes.
required
tf_idf_sorted pandas DataFrame
Stores the tf-idf score for each word in the articles: a DataFrame with keywords as the row index and the related articles as the columns.
required

Returns:

Type Description
None
Notes

Creates Cluster database objects; these objects will be used to generate the clusters plot.

make_preprocessing(research, corpus='abstract', number_threads=1)

Preprocesses the research data and returns the tf-idf matrix, sorted tf-idf matrix, list of ids and list of preprocessed text.

Parameters:

Name Type Description Default
research Research object Model database
The research object model used to store the search query, status and related data.
required
corpus str
The corpus used to preprocess the text; it can be 'full_text' (PDF), 'abstract' or 'both'.
'abstract'
number_threads int
Number of threads that will run in parallel.
1

Returns:

Name Type Description
tf_idf scipy sparse matrix

Matrix holding the tf-idf score for each relevant word in the articles.

tf_idf_sorted pandas DataFrame

Stores the tf-idf score for each word in the articles: a DataFrame with keywords as the row index and the related articles as the columns.

list_id_final list of str

List of article IDs related to the research.

list_final list of str

Each element of the list corresponds to the text extracted from an article.
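Taken together, the preprocessing and clustering steps can be chained as in the sketch below (a simplified flow; back_process is what actually orchestrates this, and the n_trials/n_threads values are illustrative):

    from BackEnd.functions.view_functions import make_preprocessing, make_cluster

    # Hypothetical flow, assuming a saved Research instance whose articles
    # have already been collected. Unpacking follows the Returns section above.
    tf_idf, tf_idf_sorted, list_id_final, list_final = make_preprocessing(
        research, corpus="abstract", number_threads=2
    )

    make_cluster(
        research,
        list_id_final,      # article IDs kept after preprocessing
        tf_idf,             # scipy sparse tf-idf matrix
        n_trials=50,        # illustrative optimization budget
        n_threads=2,
        tf_idf_sorted=tf_idf_sorted,
    )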

preprocessing_parallel(research, articles, corpus)

Preprocesses the text from the articles related to the research and creates a Preprocess_text database object for every preprocessed word and a Number_preprocess database object as a flag to indicate that the research has processed text. This method is run in parallel.

Parameters:

Name Type Description Default
research Research object Model database
The research object model used to store the search query, status and related data.
required
articles (List[Article], queryset)
List of articles related to the research.
required
corpus str
The corpus used to preprocess the text; it can be 'full_text' (PDF), 'abstract' or 'both'.
required

Returns:

Type Description
None
Notes

Creates Preprocess_text database objects for every preprocessed word and Number_preprocess database objects as a flag to indicate that the research has processed text.

print_research(log_text, research_id)

Print a log message in the research logfile.

relaunch_if_fault_all()

Checks whether there is a research that is running but no longer has a live thread.

update_research()

This is an infinite loop that restarts every finished research once a month. Between restarts it waits for some time so the host doesn't freeze.
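The pattern is roughly the one sketched here (the sleep durations and the restart call are assumptions about the actual implementation):

    import time

    # Illustrative pattern only: once a month, re-run every finished research,
    # pausing between restarts so the host is not overloaded.
    ONE_MONTH = 30 * 24 * 60 * 60       # assumed granularity, in seconds
    PAUSE_BETWEEN_RESTARTS = 5 * 60     # assumed pause, in seconds

    def update_research_sketch(get_finished_researches, restart):
        while True:
            for research in get_finished_researches():
                restart(research)                 # e.g. back_process(research)
                time.sleep(PAUSE_BETWEEN_RESTARTS)
            time.sleep(ONE_MONTH)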

BackEnd.functions.filter_article

filter_article

Functions

parsing(list_term, dict_keyword)

Each keyword has an associated list of numbers (article IDs). Output: a list of numbers corresponding to the result of the parsing. This method works recursively: it returns the result of the logical operation between the first term and the rest.

split_search_term(search_term)

Takes the search term and returns a list of its elements (AND, word, parenthesis, etc.) in order, plus a dictionary where each key is a keyword and the value is a list that will be filled with article IDs.
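The two functions work together roughly as sketched below (the search term, token shapes and article IDs are illustrative assumptions):

    from BackEnd.functions.filter_article import split_search_term, parsing

    # Hypothetical example: split a boolean search term into ordered tokens
    # plus a keyword -> article-ID-list dictionary.
    list_term, dict_keyword = split_search_term("(HIV AND infant) OR pregnancy")
    # list_term    -> e.g. ['(', 'HIV', 'AND', 'infant', ')', 'OR', 'pregnancy']
    # dict_keyword -> e.g. {'HIV': [], 'infant': [], 'pregnancy': []}

    # Once each keyword list is filled with matching article IDs,
    # parsing() evaluates the boolean expression over those lists.
    dict_keyword = {"HIV": [1, 2, 3], "infant": [2, 3], "pregnancy": [4]}
    matching_ids = parsing(list_term, dict_keyword)  # e.g. [2, 3, 4]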

BackEnd.functions.PDF_download

PDF_download

Classes

PDFHandler

Functions
convert_pdf(path_file)

Converts PDF to text to do text processing.

Parameters:

Name Type Description Default
path_file str
    path to the pdf file.
required

Returns:

Name Type Description
text str

the text conversion of the pdf

download_from_url(article, filename)

Downloads a PDF from a direct URL.

Parameters:

Name Type Description Default
article Article object Model
An object which has, among its attributes, the URL to the PDF file.
required
filename str
The filename to use for the downloaded file; it includes the full path and name.
required

Returns:

Name Type Description
status bool

Returns True if the PDF was saved successfully, False otherwise.

extract_full_text(article)

Downloads the corresponding PDF from the article's URL and extracts the full text. Returns the full text, or an empty string if there was a problem. If the DOWLOAD_PDF_ARTICLE flag is set to False, it just returns "".

Parameters:

Name Type Description Default
article Article object Model
the article object.
required

Returns:

Name Type Description
full_text str

the text conversion of the pdf

Notes

If the DOWLOAD_PDF_ARTICLE flag is set to False, it just returns "".
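A minimal usage sketch, assuming these functions are available as methods on a PDFHandler instance (the constructor and the Article object are assumptions):

    from BackEnd.functions.PDF_download import PDFHandler

    # Hypothetical usage: download the article's PDF and convert it to text.
    handler = PDFHandler()                           # constructor args unknown/assumed
    full_text = handler.extract_full_text(article)   # '' if download/parsing failed
    if full_text:
        print(full_text[:200])                       # first 200 characters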

get_url_from_doi(doi)
Description

This function gets a direct URL to a PDF based on the DOI.

Takes

DOI (string): the DOI of the article.

Returns:

Name Type Description
url_pdf str or None

URL to the PDF file.

remove_char(string)
Description

This function removes unwanted characters

Takes

string (string)

BackEnd.functions.Remove_references

Remove_references

BackEnd.functions.text_processing

text_processing

Functions

create_stopwords()

Create a new set of stopwords from a file.

Returns:

Name Type Description
stopwords Set[str]

A set of stopwords.

lemmatization(list_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

Lemmatize a list of words.

Parameters:

Name Type Description Default
list_words List[str]
  A list of words.
required
allowed_postags List[str]
  A list of allowed POS tags.
['NOUN', 'ADJ', 'VERB', 'ADV']

Returns:

Name Type Description
list_lemmatized List[str]

A list of lemmatized words.

pre_processing(df_to_list)

Preprocess a list of strings by removing special characters and digits.

Parameters:

Name Type Description Default
df_to_list List[str]
  A list of strings.
required

Returns:

Name Type Description
list_preprocessing List[str]

A list of preprocessed strings.

remove_misspelled(list_one_two)

Receives a list of words and, for each word, checks whether it is in the dictionary_compact.json file. If the word is not in the dictionary, returns the corrected word; otherwise, returns the original word.

remove_words(list_lemmatized, list_stopwords)

Receives a list of words and a list of stopwords, removes the stopwords from the list of words, and returns a list with the remaining words.

sent_to_words(list_languages)

Receives a list of sentences and converts each sentence into a list of lowercase tokens, ignoring tokens that are too short or too long (accents are removed as well). Returns, for each sentence, a list with the processed words.
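These helpers can be chained into a simple cleaning pipeline, as sketched below (the sample corpus and the exact order of steps are assumptions based on the descriptions above; the real pipeline lives in make_preprocessing / preprocessing_parallel):

    from BackEnd.functions.text_processing import (
        create_stopwords,
        lemmatization,
        pre_processing,
        remove_words,
        sent_to_words,
    )

    # Hypothetical pipeline over a tiny corpus.
    corpus = ["HIV transmission in infants during pregnancy was studied."]

    cleaned = pre_processing(corpus)        # strip special characters and digits
    tokenized = sent_to_words(cleaned)      # lowercase tokens per sentence
    stopwords = create_stopwords()          # stopword set loaded from file
    for tokens in tokenized:
        lemmas = lemmatization(tokens)      # keep NOUN/ADJ/VERB/ADV lemmas
        keywords = remove_words(lemmas, stopwords)
        print(keywords)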