API - BackEnd.functions
BackEnd.functions.clustering
clustering
Functions
hover_with_keywords(research, list_id, embedding_2d, best_study_clusterer, tf_idf_sorted)
Creates clusters from research articles.
Creates an object database model for every article according to list_id, extracts the x, y coordinates from embedding_2d and assigns the clusters given by best_study_clusterer; every cluster gets a label extracted from tf_idf_sorted.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| research | Research object (database model) | The research object model used to store the search query, status and related data. | required |
| list_id | list of str | List of article IDs related to the research. | required |
| embedding_2d | numpy array | A 2D embedding that reduces dimensionality to better visualize the clusters. | required |
| best_study_clusterer | optuna.study.Study | An object used to optimize parameters and create clusters from embedding_2d. | required |
| tf_idf_sorted | pandas DataFrame | Stores the tf-idf index for each word in the articles: a DataFrame with keywords as the row index and the related articles as columns. | required |
Returns:
| Type | Description |
|---|---|
| None | |
Notes
Creates a cluster object database model for every article according to list_id, with research, article_id, pos_x, pos_y and labels.
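The following is a minimal sketch of the record built per article, assuming the cluster labels from the best Optuna trial are already available as a flat sequence; it returns plain dicts instead of the real cluster database model, and the field names and keyword count are illustrative.

```python
import numpy as np
import pandas as pd

def hover_with_keywords_sketch(list_id, embedding_2d, labels, tf_idf_sorted, top_n=5):
    """Build one record per article: 2D coordinates, cluster label and top keywords."""
    records = []
    for i, article_id in enumerate(list_id):
        x, y = embedding_2d[i]                     # 2D coordinates of the article
        # Top tf-idf keywords for this article (keywords are the row index).
        keywords = tf_idf_sorted[article_id].nlargest(top_n).index.tolist()
        records.append({
            "article_id": article_id,
            "pos_x": float(x),
            "pos_y": float(y),
            "label": int(labels[i]),
            "keywords": ", ".join(keywords),
        })
    return records

# Tiny usage example with made-up data
emb = np.array([[0.1, 0.2], [0.8, 0.9]])
tfidf = pd.DataFrame({"a1": [0.5, 0.1], "a2": [0.0, 0.7]}, index=["hiv", "infant"])
print(hover_with_keywords_sketch(["a1", "a2"], emb, [0, 1], tfidf))
```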
BackEnd.functions.journal_collectors.arxiv_collector
arxiv_collector
Classes
ArXivCollector
Bases: TheXivesCollector
Some methods for this class are defined at base.py:TheXivesCollector.
Some of the methods for ArXivCollector overlap with the ones for medRxiv
and bioRxiv, so instead of defining them three times we overwrite the methods
specific to each one in their own files.
Functions
extract_author(entry)
Notes
Authors appear in the HTML document as:
extract_date(entry)
As an example, date appears in the HTML document as:
Submitted 24 March, 2019; v1 submitted 17 August, 2018; originally announced August 2018.
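As a hedged illustration, the first clause of such a string can be parsed into a datetime.date; only the example string above comes from the docs, the splitting logic is an assumption.

```python
from datetime import datetime

raw = "Submitted 24 March, 2019; v1 submitted 17 August, 2018; originally announced August 2018."
# Keep the first clause ("Submitted 24 March, 2019") and drop the leading keyword.
first_clause = raw.split(";")[0]
date_str = first_clause.replace("Submitted", "").strip()   # "24 March, 2019"
date = datetime.strptime(date_str, "%d %B, %Y").date()
print(date)  # 2019-03-24
```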
generate_base_query(search_term, begin, end, page=None)
Generates the base query according to the search term, page number and articles by page.
get_dict_markup_lang(search_term, begin, end, entries=[])
Returns a list of article IDs.
get_max_count(soup)
Returns the number of articles resulting from a soupified HTML query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| soup | bs4.BeautifulSoup | The HTML page resulting from a query, soupified using BeautifulSoup. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| max_count | int | The number of articles resulting from the query. |
Notes
In the HTML document, the number of articles max_count appears in:
Showing `initial_from_page`–`final_from_page` of `max_count` results
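A hedged sketch of how max_count could be pulled out of that sentence with BeautifulSoup and a regular expression; the HTML snippet and the regex are assumptions, only the "Showing … of … results" wording comes from the note above.

```python
import re
from bs4 import BeautifulSoup

html = "<h1 class='title'>Showing 1&ndash;50 of 1,234 results</h1>"   # made-up snippet
soup = BeautifulSoup(html, "html.parser")

# Look for the "Showing X–Y of Z results" sentence anywhere in the page text.
match = re.search(r"of\s+([\d,]+)\s+results", soup.get_text())
max_count = int(match.group(1).replace(",", "")) if match else 0
print(max_count)  # 1234
```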
get_url_response(query)
Notes
arXiv's advanced search returns an HTML document.
BackEnd.functions.journal_collectors.base
base
Classes
BaseCollector
Functions
extract_pdf_full_text(article)
Extracts the full text from the PDF file of an article.
generate_threads(search, research, begin, end, number_threads=1)
Retrieve articles using parallel threads based on the provided search parameters.
get_articles_parallel(research, dict_id_soup)
Fetch articles in parallel and store them in the database.
get_search_term(search)
Translates the search into a readable search string for the API.
separate_ymd_from_date(date)
Separates year, month, and day as strings from the given date object.
PMAndPMCCollector
Bases: BaseCollector
This class mostly contains the methods for PMCollector.
Some of the methods overlap with PMCCollector, so we define the
methods for PMCollector here and overwrite the methods specific to
PMCCollector in pmc_collector.py.
Functions
extract_abstract(entry)
Extracts the full abstract, which is not always available. Itertext takes all text between the abstract tags, so superfluous
characters and spaces are removed.
extract_author(entry)
Extracts the author names from the XML document.
extract_date(entry)
Extracts the date from the XML document and returns a datetime.date object.
extract_doi(entry)
Extracts the DOI from the XML document.
extract_metadata_from_identifier(identifier)
Extracts metadata from the given identifier and returns a dictionary with these data.
extract_title(entry)
Extracts the title from the XML document.
extract_url(entry)
Extracts the URL based on the DOI.
generate_addr_fetch(identifier)
Generates the API fetch address for retrieving data based on the provided identifier.
generate_base_query(search_term, begin, end)
Generates the base query according to the API, database, search term, begin and end dates.
generate_id_list(entry)
Generates a list of IDs from the XML document.
get_list_id(search_term, begin, end)
Returns a list of article IDs.
get_max_articles(search, begin, end)
Returns the maximum number of articles for the given search query and date range.
get_url_response(query)
Returns an XML document.
prepare_articles_parallel(research, identifier, dict_id_soup={})
Returns True if the outer loop should keep going, or False if
it should execute a continue statement.
TheXivesCollector
Bases: BaseCollector
Functions
generate_base_query(search_term, begin, end, page=0)
Generates the base query according to the API, database, search term, begin and end dates.
get_dict_markup_lang(search_term, begin, end, entries=[])
Returns a list of article IDs.
get_max_articles(search, begin, end)
Returns the maximum number of articles for the given search query and date range.
prepare_articles_parallel(research, identifier, dict_id_soup={})
Returns True if the outer loop should keep going, or False if
it should execute a continue statement.
Functions
_get_articles_parallel_wrapper(args)
Execute get_articles_parallel from the given object.
This is a wrapper to BaseCollector.get_articles_parallel that needs to be called by multiprocessing.
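A minimal sketch of why such a wrapper exists: Pool.map passes a single argument, so a module-level function unpacks the argument tuple and dispatches to the bound method. The tuple layout and the placeholder Collector class are assumptions.

```python
from multiprocessing import Pool

class Collector:
    def get_articles_parallel(self, research, dict_id_soup):
        # Placeholder for the real fetching/storing logic.
        return (research, len(dict_id_soup))

def _get_articles_parallel_wrapper(args):
    # Module-level function: unpack the single argument tuple and dispatch.
    collector, research, dict_id_soup = args
    return collector.get_articles_parallel(research, dict_id_soup)

if __name__ == "__main__":
    jobs = [(Collector(), "research-1", {"id1": "<xml/>"}),
            (Collector(), "research-1", {"id2": "<xml/>"})]
    with Pool(2) as pool:
        print(pool.map(_get_articles_parallel_wrapper, jobs))
```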
BackEnd.functions.journal_collectors.biorxiv_collector
biorxiv_collector
Classes
BiorXivCollector
Bases: TheXivesCollector
The methods for this class are defined at base.py:TheXivesCollector.
Some of the methods for BiorXivCollector overlap with the ones for medRxiv
and arXiv, so instead of defining them three times we overwrite the methods
specific to each one in their own files.
BackEnd.functions.journal_collectors.medrxiv_collector
medrxiv_collector
Classes
MedrXivCollector
Bases: TheXivesCollector
The methods for this class are defined at base.py:TheXivesCollector.
Some of the methods for MedrXivCollector overlap with the ones for bioRxiv
and arXiv, so instead of defining them three times we overwrite the methods
specific to each one in their own files.
BackEnd.functions.journal_collectors.pm_collector
pm_collector
Classes
PMCollector
Bases: PMAndPMCCollector
The methods for this class are defined at base.py:PMAndPMCCollector.
Some of the methods for PMCollector overlap with the ones for PMCCollector,
so instead of defining them twice we overwrite the methods specific to
PMCCollector in pmc_collector.py.
BackEnd.functions.journal_collectors.pmc_collector
pmc_collector
Classes
PMCCollector
Bases: PMAndPMCCollector
Most of the methods for this class are defined at base.py:
PMAndPMCCollector. Some of the methods for PMCCollector overlap with the
ones for PMCollector, so instead of defining them twice we overwrite the
methods specific to PMCCollector here.
Functions
extract_abstract(entry)
Extracts the full abstract, which is not always available. Itertext takes all text between the abstract tags, so superfluous
characters and spaces are removed.
extract_author(entry)
Extracts the author names from the XML document.
extract_date(entry)
Extracts the date from the XML document and returns a datetime.date object.
extract_doi(entry)
Extracts the DOI from the XML document.
BackEnd.functions.journal_collectors.types
types
BackEnd.functions.nlp
nlp
Functions
call_openai_chatgpt(question='', keywords='HIV, infant, pregnancy', api_key=settings.OPENAI_API_KEY)
Standard function to call ChatGPT through OpenAI's python package. More info: https://platform.openai.com/docs/guides/chat/introduction
list_as_string(input_list, glue=', ')
Auxiliary function. Reads a list and concatenates it as a string, using
glue as the character joining each element of the list.
nlp_topic_description(input_list, api_key)
Reads keywords and returns a description according to ChatGPT.
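A hedged sketch of this kind of call using the openai>=1.0 client interface; the model name, prompt wording and function name are assumptions, not the project's actual helpers.

```python
from openai import OpenAI

def nlp_topic_description_sketch(keywords, api_key, model="gpt-3.5-turbo"):
    """Ask the chat API for a short description of a cluster's keywords."""
    client = OpenAI(api_key=api_key)
    question = "Give a one-sentence topic description for these keywords: " + ", ".join(keywords)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Usage (requires a valid API key):
# print(nlp_topic_description_sketch(["HIV", "infant", "pregnancy"], api_key="sk-..."))
```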
BackEnd.functions.scatter_with_hover
scatter_with_hover
Functions
scatter_with_hover(research, path, fig=None, name=None, marker='circle', fig_width=1500, fig_height=900)
Creates an HTML plot file with an interactive scatter plot of x vs y
using Bokeh, with automatic tooltips showing columns from the Clusters database model
objects related to the given research database model object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| research | Research object (database model) | | required |
| path | str or Path | Full path where the HTML plot file will be stored after being generated. | required |
| fig | bokeh.plotting.Figure | Figure on which to plot (if not given, a new figure will be created). | None |
| name | str | Bokeh series name to give to the scattered data. | None |
| marker | str | Name of the marker to use for the scatter plot. | 'circle' |
| fig_width | int | | 1500 |
| fig_height | int | | 900 |
Returns:
| Type | Description |
|---|---|
| None | |
Notes
Creates an HTML plot file from the Clusters objects and stores it at the given path.
Acknowledgment
Original code from Robin Wilson (robin@rtwilson.com), with thanks to Max Albert for the original code example.
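A minimal sketch of the underlying Bokeh pattern (recent Bokeh versions): a ColumnDataSource feeds a scatter renderer and a HoverTool whose tooltips reference the source columns. The column names and output path are assumptions, not the actual Clusters fields.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, output_file, save

# Assumed columns; the real ones come from the Clusters objects of a research.
source = ColumnDataSource(data={
    "x": [0.1, 0.8], "y": [0.2, 0.9],
    "title": ["Article A", "Article B"], "keywords": ["hiv, infant", "pregnancy"],
})

fig = figure(width=1500, height=900)
fig.scatter("x", "y", source=source, marker="circle", name="clusters", size=8)
fig.add_tools(HoverTool(tooltips=[("title", "@title"), ("keywords", "@keywords")]))

output_file("clusters_plot.html")   # hypothetical output path
save(fig)
```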
BackEnd.functions.view_functions
view_functions
Classes
Functions
back_process(research)
Controls the whole research process, from extracting articles through preprocessing text, clustering, and plotting. This function allows restarting a process that has been interrupted and continuing from the point where it stopped. When it finishes, it deletes all extra data used in intermediate steps.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| research | Research object (database model) | | required |
Returns:
| Type | Description |
|---|---|
| None | |
Notes
Modifies the research object database model.
check(research)
Checks if the thread of the research is alive.
make_cluster(research, list_id, tf_idf, n_trials, n_threads, tf_idf_sorted)
Prepares, from the input arguments, the data needed to create a cluster object database model, and uses a PaCMAP object to optimize and reduce dimensionality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| research | Research object (database model) | | required |
| list_id | list of str | | required |
| tf_idf | scipy sparse matrix | | required |
| n_trials | int | | required |
| n_threads | int | | required |
| tf_idf_sorted | pandas DataFrame | | required |
Returns:
| Type | Description |
|---|---|
| None | |
Notes
Creates a clusters object database model; the clusters objects will be used to generate the clusters plot.
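A hedged sketch of the dimensionality-reduction and tuning step: PaCMAP projects the tf-idf matrix to 2D and an Optuna study tunes a clusterer on the embedding. KMeans and the silhouette objective are stand-ins, since the docs do not state which clusterer is optimized.

```python
import optuna
import pacmap
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def make_cluster_sketch(tf_idf, n_trials=20):
    # PaCMAP expects a dense array; tf_idf is a scipy sparse matrix.
    embedding_2d = pacmap.PaCMAP(n_components=2).fit_transform(tf_idf.toarray())

    def objective(trial):
        n_clusters = trial.suggest_int("n_clusters", 2, 15)
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embedding_2d)
        return silhouette_score(embedding_2d, labels)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return embedding_2d, study   # the study plays the role of best_study_clusterer
```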
make_preprocessing(research, corpus='abstract', number_threads=1)
Preprocesses the research data and returns the tf-idf matrix, the sorted tf-idf matrix, the list of article IDs and the list of preprocessed texts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| research | Research object (database model) | | required |
| corpus | str | | 'abstract' |
| number_threads | int | | 1 |
Returns:
| Name | Type | Description |
|---|---|---|
| tf_idf | scipy sparse matrix | Matrix with the tf-idf value for each relevant word in the articles. |
| tf_idf_sorted | pandas DataFrame | Stores the tf-idf index for each word in the articles: a DataFrame with keywords as the row index and the related articles as columns. |
| list_id_final | list of str | List of article IDs related to the research. |
| list_final | list of str | Every element of the list corresponds to the text extracted from an article. |
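A minimal sketch of how these return values could be produced with scikit-learn and pandas; the vectorizer settings are assumptions, and the real pipeline also performs the lemmatization and stopword removal described below.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def make_preprocessing_sketch(list_id_final, list_final):
    vectorizer = TfidfVectorizer()
    tf_idf = vectorizer.fit_transform(list_final)          # scipy sparse matrix

    # Keywords as the row index, one column per article id.
    tf_idf_sorted = pd.DataFrame(
        tf_idf.T.toarray(),
        index=vectorizer.get_feature_names_out(),
        columns=list_id_final,
    )
    return tf_idf, tf_idf_sorted, list_id_final, list_final

# Usage with made-up articles:
ids = ["a1", "a2"]
texts = ["hiv infant pregnancy", "pregnancy outcome infant"]
print(make_preprocessing_sketch(ids, texts)[1].head())
```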
preprocessing_parallel(research, articles, corpus)
Preprocesses the text from the articles related to the research and creates a Preprocess_text object database model for every preprocessed word, plus a Number_preprocess database model as a flag to indicate that the research has processed text. This method is parallelized.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| research | Research object (database model) | | required |
| articles | List[Article] or queryset | | required |
| corpus | str | | required |
Returns:
| Type | Description |
|---|---|
| None | |
Notes
Creates Preprocess_text object database models for every preprocessed word and Number_preprocess object database models as a flag to indicate that the research has processed text.
print_research(log_text, research_id)
Print a log message in the research logfile.
relaunch_if_fault_all()
Checks if there is a research that is running but no longer has a live thread.
update_research()
This is an infinite loop that, every month, restarts all research objects that have finished. Between each restart it waits some time so the host doesn't freeze.
BackEnd.functions.filter_article
filter_article
Functions
parsing(list_term, dict_keyword)
Each keyword has a list of numbers. Output: a list of numbers corresponding to the result of the parsing. This method works through recursive calls: it returns the result of the logical operation between the first term and the rest.
split_search_term(search_term)
Takes the search term and returns a list of each element (and, word, parenthesis, etc.) in order, and a dictionary where each key is a keyword and each value is a list that will be filled with article IDs.
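A hedged sketch of the recursive evaluation described for parsing: each keyword maps to a list of article IDs, and the first term is combined with the result of the rest via set intersection (and) or union (or). Tokenisation is assumed already done, and parentheses/operator precedence are omitted for brevity.

```python
def parsing_sketch(list_term, dict_keyword):
    """Recursively evaluate a flat token list like ['hiv', 'and', 'infant', 'or', 'malaria']."""
    first = set(dict_keyword[list_term[0]])      # article ids for the first keyword
    if len(list_term) == 1:
        return sorted(first)
    operator, rest = list_term[1], list_term[2:]
    rest_ids = set(parsing_sketch(rest, dict_keyword))
    return sorted(first & rest_ids) if operator == "and" else sorted(first | rest_ids)

# Usage with made-up article id lists:
dict_keyword = {"hiv": [1, 2, 3], "infant": [2, 3, 4], "malaria": [9]}
print(parsing_sketch(["hiv", "and", "infant", "or", "malaria"], dict_keyword))  # [2, 3]
```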
BackEnd.functions.PDF_download
PDF_download
Classes
PDFHandler
Functions
convert_pdf(path_file)
download_from_url(article, filename)
Downloads a PDF from a direct URL.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| article | Article object (database model) | | required |
| filename | str | | required |
Returns:
| Name | Type | Description |
|---|---|---|
| status | bool | Returns True if the PDF was saved successfully, False otherwise. |
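A minimal sketch of such a direct download with requests; the timeout, chunk size and function name are assumptions.

```python
import requests

def download_from_url_sketch(url, filename, timeout=30):
    """Stream a PDF from a direct URL to disk; return True on success."""
    try:
        response = requests.get(url, stream=True, timeout=timeout)
        response.raise_for_status()
        with open(filename, "wb") as handle:
            for chunk in response.iter_content(chunk_size=8192):
                handle.write(chunk)
        return True
    except requests.RequestException:
        return False

# Usage:
# download_from_url_sketch("https://example.org/paper.pdf", "paper.pdf")
```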
extract_full_text(article)
Downloads the corresponding PDF from the article's URL and extracts the full text. Returns the full text, or an empty string if there was a problem. If the DOWLOAD_PDF_ARTICLE flag is set to False, just returns "".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| article | Article object (database model) | | required |
Returns:
| Name | Type | Description |
|---|---|---|
| full_text | str | The text conversion of the PDF. |
Notes
If the DOWLOAD_PDF_ARTICLE flag is set to False, just returns "".
get_url_from_doi(doi)
Description
This function gets a direct URL to a PDF based on a DOI.
Takes
DOI (string): the DOI of the article
Returns:
| Name | Type | Description |
|---|---|---|
| url_pdf | str or None | URL to the PDF file. |
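One plausible way to resolve a DOI to a URL is the public doi.org redirect; this is illustrative only and not necessarily what get_url_from_doi does, since some publishers require extra steps to reach the actual PDF.

```python
import requests

def get_url_from_doi_sketch(doi):
    """Follow the doi.org redirect and return the landing URL, or None on failure."""
    try:
        response = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
        response.raise_for_status()
        return response.url
    except requests.RequestException:
        return None

# Usage:
# print(get_url_from_doi_sketch("10.1000/xyz123"))
```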
remove_char(string)
Description
This function removes unwanted characters.
Takes
string (string)
BackEnd.functions.Remove_references
Remove_references
BackEnd.functions.text_processing
text_processing
Functions
create_stopwords()
lemmatization(list_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
pre_processing(df_to_list)
remove_misspelled(list_one_two)
Receives a list of words and, for each word, checks if the word is in the dictionary_compact.json file. If the word is not in the dictionary, returns the corrected word; otherwise, returns the original word.
remove_words(list_lemmatized, list_stopwords)
Receives a list of words and a list of stopwords, removes the stopwords from the list of words, and returns a list with the remaining words.
sent_to_words(list_languages)
Receives a list of sentences and converts each sentence into a list of lowercase tokens, ignoring tokens that are too short or too long (accents are removed as well). For each sentence, returns a list with the processed words.
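A hedged sketch of the tokenisation, lemmatization and stopword-removal steps using gensim and spaCy, which the allowed_postags default suggests; the en_core_web_sm model and the length bounds are assumptions.

```python
import gensim
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def sent_to_words_sketch(sentences, min_len=2, max_len=15):
    # Lowercase tokens, strip accents, drop tokens outside the length bounds.
    return [gensim.utils.simple_preprocess(s, deacc=True, min_len=min_len, max_len=max_len)
            for s in sentences]

def lemmatization_sketch(list_words, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    doc = nlp(" ".join(list_words))
    return [token.lemma_ for token in doc if token.pos_ in allowed_postags]

def remove_words_sketch(list_lemmatized, list_stopwords):
    return [word for word in list_lemmatized if word not in set(list_stopwords)]

# Usage:
tokens = sent_to_words_sketch(["Pregnant women and HIV-positive infants were studied."])[0]
lemmas = lemmatization_sketch(tokens)
print(remove_words_sketch(lemmas, ["be", "study"]))
```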