BVSalud_Documents_save_into_mongo package¶

Crawl_Records module¶

Crawl_Records.count_records(url)¶

Extracts the total number of records from the XML downloaded from the given URL.

Parameters:	url (string) – A URL for downloading documents as an XML file.
Returns:	Number of total records.
Return type:	Int

Note

It is better to pass a URL that contains just 1 to 5 documents, so the count is quick. Otherwise it may take longer because of the file size.

Example:

>>> count_records('http://pesquisa.bvsalud.org/portal/?output=xml&lang=pt&from=0&sort=&format=summary&count=5&fb=&page=1&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw')
1032452
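
As a rough illustration of the counting step, the sketch below reads the total from an XML string instead of a live download. It assumes a Solr-style response in which a `result` element carries a `numFound` attribute; that layout, the sample XML, and the helper name `count_records_in_xml` are assumptions for illustration and are not taken from the module.

```python
import xml.etree.ElementTree as ET


def count_records_in_xml(xml_text):
    """Return the total record count declared in an XML response.

    Assumes a <result numFound="..."> element (hypothetical layout).
    """
    root = ET.fromstring(xml_text)
    # The root itself may be the <result> element, so check it first.
    result = root if root.tag == "result" else root.find(".//result")
    if result is None or "numFound" not in result.attrib:
        raise ValueError("no <result numFound=...> element found")
    return int(result.attrib["numFound"])


# Hypothetical sample response; only the attribute value mirrors the example above.
sample = '<response><result name="response" numFound="1032452" start="0"/></response>'
total = count_records_in_xml(sample)
```

Because only one attribute is read, a response holding a handful of documents is enough, which is consistent with the note about passing a small URL.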

Crawl_Records.get_records(doc_type, mode=None)¶

This method works as the main entry point, because all the other methods are called from here. It receives two arguments. The first (doc_type) is required and must be the type of document (journal). The second (mode) is optional: use it if you only want to count records without downloading and saving them.

Parameters:	doc_type (string) – The type of documents (Ex: ibecs, lilacs, none_indexed_ibecs, none_indexed_lilacs, all, all_none_indexed).
	mode (string (Ex: "count")) – If you only want to count records, mode should be “count”.
Returns:	Number of records.
Return type:	Int

Crawl_Records.make_base_url(doc_type)¶

Given the type of Pesquisa articles, it returns a data name, a base URL, and a folder to save to, depending on the type received.

Parameters:	doc_type (string) – Receives documents type (ibecs, lilacs, none_indexed_ibecs, none_indexed_lilacs, all, all_none_indexed).
Returns:	data_name, base_url, folder_to_save
Return type:	string, string, string

Example:

Crawl_Records.make_url(base_url, start_record, per_page, page)¶

Builds a URL by joining the base_url, the start position of the records, the number of documents per page, and the page number. All parameters are required.

Parameters:	base_url (string) – A base URL from which you want to download all contents.
	start_record (Int) – Start position for records. Records will start from this number.
	per_page (Int) – Number of records per page.
	page (Int) – Number of the page.
Returns:	final_url
Return type:	string
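
A minimal sketch of how such a URL might be assembled is shown below. The query parameter names `from`, `count`, and `page` are taken from the example URL in the count_records section; whether the module joins them in exactly this way is an assumption, and `make_url_sketch` is a hypothetical name.

```python
def make_url_sketch(base_url, start_record, per_page, page):
    """Append paging parameters to a base URL that already has a query string.

    Parameter names are assumed from the documented example URL.
    """
    return f"{base_url}&from={start_record}&count={per_page}&page={page}"


url = make_url_sketch(
    "http://pesquisa.bvsalud.org/portal/?output=xml&lang=en", 0, 5, 1
)
```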

Crawl_Records.save_all_xml(data_name, base_url, folder_to_save, total_records, per_page)¶

This method downloads all XML files and saves them in a folder whose path is received as an argument.

Parameters:	data_name (string) – Journal’s name (IBECS, LILACS, or IBECS_LILACS).
	base_url (string (Ex: ‘http://pesquisa.bvsalud.org/portal/?output=xml&lang=en’)) – A base URL used to build a new URL with the number of documents, start position, and page number.
	folder_to_save (string ('./crawled/')) – Path of the folder where all the documents will be stored. If it doesn’t exist, it will be created.
	total_records (Int) – Number of all records.
	per_page (Int) – Number of records per page.
Returns:	True. It always returns True.
Return type:	Boolean

Note

All records are saved under a name built from data_name + date + file number + .xml (Ex: IBECS_LILACS_17072019_pg_1.xml).
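
The naming rule and the number of files to expect can be sketched as follows. Reading 17072019 as day, month, year is an assumption based on the example name, the page size of 500 is hypothetical, and both helper names are made up for illustration.

```python
import math
from datetime import date


def page_file_name(data_name, day, page):
    """Build a file name like IBECS_LILACS_17072019_pg_1.xml."""
    # %d%m%Y (ddmmyyyy) is assumed from the documented example name.
    return f"{data_name}_{day.strftime('%d%m%Y')}_pg_{page}.xml"


def number_of_pages(total_records, per_page):
    """Pages needed to cover every record, counting a partial last page."""
    return math.ceil(total_records / per_page)


name = page_file_name("IBECS_LILACS", date(2019, 7, 17), 1)
pages = number_of_pages(1032452, 500)  # 500 per page is a hypothetical value
```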

parse_xml_new_and_update module¶

MongoDB:

Warning

MongoDB must be running. Otherwise you will get an error.

Note

MongoDB is initialized just by calling the module parse_xml_new_and_update.

Database	Collection
bvc	training_collection_All
	training_collection_None_Indexed_t1
	training_collection_None_Indexed_t2
	training_collection_Update_info
	errors_training

parse_xml_new_and_update.change_collections_name_mongo(old_name, new_name)¶

It changes the name of a collection. If a collection with the target name already exists, that collection is deleted. (Ex: vs.training_collection_old -> vs.training_collection_new).

Parameters:	old_name – The name of the collection to be renamed.
	new_name – The new name for the collection.
Type:	string
Returns:	Nothing to return

Warning

Please do not pass the same value for new_name and old_name; they must be different.
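
One way this behaviour could be expressed with pymongo is sketched below. It relies on `Collection.rename` with `dropTarget=True`; whether the module uses this call is an assumption, and `rename_collection` is a hypothetical helper that expects a pymongo `Database` object.

```python
def rename_collection(db, old_name, new_name):
    """Rename a collection, dropping any existing collection called new_name.

    `db` is expected to be a pymongo Database object (assumption).
    """
    if old_name == new_name:
        # Mirrors the warning above: the two names must differ.
        raise ValueError("new_name must differ from old_name")
    # dropTarget=True asks MongoDB to remove an existing target collection first.
    db[old_name].rename(new_name, dropTarget=True)
```

With a running server this might be called as `rename_collection(MongoClient()["bvc"], "training_collection_old", "training_collection_new")`; the connection details are assumptions.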

parse_xml_new_and_update.document_compare()¶

This method compares all non-indexed documents between the time1 and time2 collections, in both directions (time1 against time2 and time2 against time1). New documents are inserted into the main database, and the others are updated by id, mh, sh, and alternate_id, unless the time2 document has mh set to None.

Note

It receives no parameters and returns nothing. It just compares the two non-indexed collections of time1 and time2.
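
The insert-or-update rule can be illustrated with plain dictionaries instead of MongoDB collections. This is a simplified, one-direction sketch under stated assumptions: documents are keyed by id, the field names follow the description above, and `compare_snapshots` is a hypothetical name.

```python
def compare_snapshots(main, time2):
    """Merge a time2 snapshot into main; both map document id -> document dict.

    Returns (inserted_ids, updated_ids).
    """
    inserted, updated = [], []
    for doc_id, doc in time2.items():
        if doc_id not in main:
            # New document: copy it into the main store.
            main[doc_id] = dict(doc)
            inserted.append(doc_id)
        elif doc.get("mh") is not None:
            # Known document with mh present: refresh the tracked fields.
            for field in ("mh", "sh", "alternate_id"):
                if field in doc:
                    main[doc_id][field] = doc[field]
            updated.append(doc_id)
        # Known document whose mh is None in time2: left untouched.
    return inserted, updated


main = {"a": {"mh": None, "sh": None}, "b": {"mh": ["X"], "sh": None}}
time2 = {"a": {"mh": ["Y"], "sh": ["Z"]}, "b": {"mh": None}, "c": {"mh": None}}
inserted, updated = compare_snapshots(main, time2)
```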

parse_xml_new_and_update.download_document(id)¶

This method downloads a single article document in XML format, by the id of the article.

Parameters:	id (string) – Article’s alternate id. If it’s a normal id, then it will return the same. (Ex: biblio-986217).
Returns:	url, xml (xml is an article document downloaded by id)
Return type:	string, xml

parse_xml_new_and_update.find_id_by_alternate_id(alternate_id)¶

Obtains an article’s id from its alternate id. It finds a document by the document’s id or alternate_id.

Parameters:	alternate_id (string) – Article’s alternate id. If it’s a normal id, then it will return the same (Ex: biblio-986217).
Returns:	Article’s id.
Return type:	string (Ex: biblio-1001042)
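
The lookup rule (match on either id or alternate_id, return the id) can be sketched over a list of dictionaries. The sample records, the pairing of the two ids, and the helper name `find_id` are hypothetical; the module itself presumably queries MongoDB.

```python
def find_id(documents, some_id):
    """Return the id of the first document whose id or alternate_id matches."""
    for doc in documents:
        if doc.get("id") == some_id or doc.get("alternate_id") == some_id:
            return doc["id"]
    return None  # nothing matched


# Hypothetical records; pairing these two ids is an assumption.
records = [{"id": "biblio-1001042", "alternate_id": "biblio-986217"}]
by_alternate = find_id(records, "biblio-986217")
by_normal = find_id(records, "biblio-1001042")
```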

parse_xml_new_and_update.find_new_documents()¶

parse_xml_new_and_update.main(arguments)¶

The main method just calls all the other methods. It receives an argument, which is not required. If it receives the argument “first_time”, it downloads all documents and parses them to save them in MongoDB. Otherwise it only downloads documents to be compared with those that already exist.

Parameters:	argument – Indicates whether the program is being executed for the first time.
Type:	string
Returns:	Nothing to return

Note

If the program is being executed for the first time, you must pass the argument first_time. Otherwise no argument is needed.
First time: python parse_xml_new_and_update.py first_time
Otherwise: python parse_xml_new_and_update.py
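
The argument check could look roughly like the sketch below. The helper name `choose_mode` and the label "update" for the non-first run are hypothetical; only the literal first_time comes from the description above.

```python
def choose_mode(argv):
    """Return 'first_time' when that argument is given, else 'update'.

    `argv` follows the sys.argv convention: argv[0] is the script name.
    """
    if len(argv) > 1 and argv[1] == "first_time":
        return "first_time"
    return "update"  # hypothetical label for every later run


first = choose_mode(["parse_xml_new_and_update.py", "first_time"])
later = choose_mode(["parse_xml_new_and_update.py"])
```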

parse_xml_new_and_update.parse_file(path_to_file, mode=None)¶

This method parses a file and extracts all documents one by one, then converts each document by calling the function xml_to_dictionary. The documents are then saved one by one in the MongoDB database, as are all errors.

Parameters:	path_to_file (string (Ex: ./crawled/IBECS_LILACS_17072019_pg_1.xml)) – The path of the file to be parsed.
	mode (string) – If it receives “compare”, the documents are saved into the time 2 collection. Otherwise they are saved into the normal collection, presumably time 1. By default it is None.
Returns:	Nothing to return.
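
The extract, convert, and record-errors loop can be sketched without a database. The `doc` element name, the toy converter, and the list-based stores are assumptions for illustration; in the module the results go to MongoDB collections.

```python
import xml.etree.ElementTree as ET


def parse_documents(xml_text, convert):
    """Convert each <doc> element; return (documents, errors)."""
    documents, errors = [], []
    for element in ET.fromstring(xml_text).iter("doc"):
        try:
            documents.append(convert(element))
        except Exception as exc:
            # Keep going and record the failure next to the offending XML.
            errors.append(
                {"error": str(exc), "xml": ET.tostring(element, encoding="unicode")}
            )
    return documents, errors


def to_id(element):
    # Toy converter: fails when the document has no id attribute.
    return {"id": element.attrib["id"]}


docs, errs = parse_documents('<r><doc id="biblio-1"/><doc/></r>', to_id)
```

Collecting errors instead of stopping means one malformed record does not block the rest of the file.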

parse_xml_new_and_update.process_dir_t1(path_to_dir)¶

Gets all files from a folder. The files are passed one by one to the method parse_file without any mode (None).

Parameters:	path_to_dir – The path of the directory where all the XML files are saved.
Type:	string (Ex: ./crawled/)
Returns:	Nothing to return.
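
A directory walk of this kind can be sketched with pathlib. The `*.xml` filter, the sorted order, and the callback argument are assumptions; `process_dir` is a hypothetical stand-in that takes the per-file handler as a parameter so it can run without MongoDB.

```python
import tempfile
from pathlib import Path


def process_dir(path_to_dir, handle_file):
    """Call handle_file on every .xml file in path_to_dir, in name order."""
    paths = sorted(Path(path_to_dir).glob("*.xml"))
    for path in paths:
        handle_file(str(path))
    return len(paths)


# Small demonstration in a temporary folder.
seen = []
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.xml", "a.xml", "notes.txt"):
        (Path(tmp) / name).write_text("<r/>")
    count = process_dir(tmp, lambda p: seen.append(Path(p).name))
```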

parse_xml_new_and_update.xml_to_dictionary(document_xml)¶

Parameters:	document_xml (xml) – A single article document in the xml format.
Returns:	A single article document in the dictionary (json) format.
Return type:	dictionary/json
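
A conversion from one XML document to a dictionary, as done by the function xml_to_dictionary that parse_file calls, might look like the sketch below. The flat child-tag layout, the sample document, and the helper name `element_to_dict` are assumptions; the real records may be nested differently.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Map child tags to their text; repeated tags become lists."""
    result = {}
    for child in element:
        value = (child.text or "").strip()
        if child.tag in result:
            # Second occurrence of a tag: promote the stored value to a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Hypothetical document layout, for illustration only.
doc = ET.fromstring("<doc><id>biblio-986217</id><mh>A</mh><mh>B</mh></doc>")
as_dict = element_to_dict(doc)
```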