BVSalud_Documents_save_into_mongo package

Crawl_Records module

Crawl_Records.count_records(url)

The method extracts the total number of records from the xml downloaded with the url received.

Parameters:url (string) – A url for downloading documents as an xml file.
Returns:Number of total records.
Return type:Int

Note

It’s better to pass a url which contains just 1 to 5 documents, so that it counts quickly. Otherwise it may take more time because of the file’s size.

  • Example:
    >>> count_records('http://pesquisa.bvsalud.org/portal/?output=xml&lang=pt&from=0&sort=&format=summary&count=5&fb=&page=1&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw')
    1032452
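A minimal sketch of how such a count could be read, assuming the portal returns Solr-style XML whose `<result>` element carries the total in a `numFound` attribute. That layout, the helper name `count_records_in_xml`, and the sample string are assumptions for illustration and are not taken from the module. It parses a string rather than downloading, so it runs without network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real response layout may differ.
SAMPLE_XML = """<response>
  <result name="response" numFound="1032452" start="0">
    <doc><str name="id">biblio-986217</str></doc>
  </result>
</response>"""


def count_records_in_xml(xml_text):
    """Return the total advertised by the first <result> element, or 0."""
    root = ET.fromstring(xml_text)
    result = root.find(".//result")
    if result is None:
        return 0
    return int(result.get("numFound", 0))


print(count_records_in_xml(SAMPLE_XML))
```

Because only the attribute is read, the count does not depend on how many documents the page holds, which fits the advice to request a small page.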
    
Crawl_Records.get_records(doc_type, mode=None)

The method works as main, because all other methods are called from here. It receives just two arguments: the first (doc_type) is required and must be the type of document (journal). The other one (mode) is for when you just want to count records, not download and save them.

Parameters:
  • doc_type (string) – The type of documents (Ex: ibecs, lilacs, none_indexed_ibecs, none_indexed_lilacs, all, all_none_indexed).
  • mode (string (Ex: "count")) – If you just want to count records mode should be “count”.
Returns:

Number of records.

Return type:

Int

Crawl_Records.make_base_url(doc_type)

Given the type of Pesquisa articles, it returns a data name, a base url, and a folder to save, depending on the type received.

Parameters:doc_type (string) – Receives documents type (ibecs, lilacs, none_indexed_ibecs, none_indexed_lilacs, all, all_none_indexed).
Returns:data_name, base_url, folder_to_save
Return type:string, string, string
Crawl_Records.make_url(base_url, start_record, per_page, page)

Method to make a url, joining the base_url, start position of records, number of documents per page and page number. All parameters are required.

Parameters:
  • base_url (string) – A base url from where you want to download all contents.
  • start_record (Int) – Start position for records. Records will start from this number.
  • per_page (Int) – Number of records per page.
  • page (Int) – Number of the page.
Returns:

final_url

Return type:

string
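As an illustration, the joining step might look like the sketch below. The query parameter names `from`, `count`, and `page` are assumed from the example url shown for count_records, and `build_url` is a hypothetical helper, not the module's own function.

```python
def build_url(base_url, start_record, per_page, page):
    """Append the paging parameters to base_url (parameter names assumed)."""
    separator = "&" if "?" in base_url else "?"
    paging = "from={}&count={}&page={}".format(start_record, per_page, page)
    return base_url + separator + paging


print(build_url("http://pesquisa.bvsalud.org/portal/?output=xml&lang=en", 0, 5, 1))
```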

Crawl_Records.save_all_xml(data_name, base_url, folder_to_save, total_records, per_page)

This method downloads all XML files and saves them in a folder, the path of which is received as an argument.

Parameters:
  • data_name (string) – Journal’s name like (IBECS, LILACS, or IBECS_LILACS).
  • base_url (string (Ex: ‘http://pesquisa.bvsalud.org/portal/?output=xml&lang=en’)) – A base url to make a new url with number of documents, start position and page number.
  • folder_to_save (string ('./crawled/')) – Path of a folder where all the documents will be stored. If it doesn’t exist it will be created.
  • total_records (Int) – Number of all records.
  • per_page (Int) – Number of records per page.
Returns:

True. It always returns True.

Return type:

Boolean

Note

All records will be saved under a name created with data_name + date + file number + .xml (Ex: IBECS_LILACS_17072019_pg_1.xml)
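The naming rule can be sketched as follows. The sketch assumes one file per page, a page count obtained by rounding total_records / per_page up, and a day-month-year date stamp (inferred from 17072019 in the example name). `page_file_names` is a hypothetical helper that only builds names and does not download anything.

```python
import math
from datetime import date


def page_file_names(data_name, total_records, per_page, day):
    """List one file name per page: data_name + date + page number + .xml."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    pages = math.ceil(total_records / per_page)
    stamp = day.strftime("%d%m%Y")  # assumed day-month-year, as in 17072019
    return ["{}_{}_pg_{}.xml".format(data_name, stamp, n) for n in range(1, pages + 1)]


names = page_file_names("IBECS_LILACS", 12, 5, date(2019, 7, 17))
print(names)
```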

parse_xml_new_and_update module

MongoDB:

Warning

MongoDB must be running. Otherwise the module will give you an error.

Note

MongoDB is initialized just by calling the module parse_xml_new_and_update.

Data base    Collection
bvc          training_collection_All
             training_collection_None_Indexed_t1
             training_collection_None_Indexed_t2
             training_collection_Update_info
             errors_training
parse_xml_new_and_update.change_collections_name_mongo(old_name, new_name)

It changes the name of a collection. If a collection with the target name already exists, that collection will be deleted. (Ex: vs.training_collection_old -> vs.training_collection_new).

Parameters:
  • old_name – The name of the collection to be renamed.
  • new_name – A new name for the collection.
Type:

string

Returns:

Nothing to return

Warning

Please do not pass the same value for new_name and old_name; they must be different.
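A minimal sketch of this behaviour, assuming the module relies on pymongo: `Collection.rename` accepts `dropTarget=True`, which drops an existing collection with the target name before renaming. The helper `rename_collection` and the explicit check for equal names are illustrative assumptions, not the module's code.

```python
def rename_collection(db, old_name, new_name):
    """Rename db[old_name] to new_name, dropping any existing target."""
    if old_name == new_name:
        raise ValueError("new_name must differ from old_name")
    # pymongo forwards dropTarget to the renameCollection command.
    db[old_name].rename(new_name, dropTarget=True)


# Possible usage with a running MongoDB (not executed here):
#   from pymongo import MongoClient
#   db = MongoClient()["bvc"]
#   rename_collection(db, "training_collection_old", "training_collection_new")
```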

parse_xml_new_and_update.document_compare()
This method just compares all non-indexed documents of the data base, time1 against time2 and time2 against time1.
New documents will be inserted into the main data base, and the others will be updated by id, mh, sh, and alternate_id, unless the time2 document has mh as None.

Note

It receives no parameters and returns nothing. It just compares the two non-indexed collections of time1 and time2.
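The comparison rule can be illustrated with plain dictionaries instead of MongoDB collections. This is a simplified sketch under stated assumptions: it walks the time2 snapshot only, treats an id missing from time1 as new, and reads "unless mh is None" as "skip the update". The function name and the sample data are hypothetical.

```python
def compare_snapshots(time1_docs, time2_docs):
    """Split time2 documents into new ones and updates to existing ones.

    Both arguments map a document id to a document dictionary.
    """
    new_docs = {}
    updates = {}
    for doc_id, doc in time2_docs.items():
        if doc_id not in time1_docs:
            new_docs[doc_id] = doc
        elif doc.get("mh") is not None:
            # Only documents with mh set in time2 trigger an update.
            updates[doc_id] = {key: doc.get(key) for key in ("mh", "sh", "alternate_id")}
    return new_docs, updates


time1 = {"a": {"mh": None}, "b": {"mh": ["x"]}}
time2 = {
    "a": {"mh": ["y"], "sh": None, "alternate_id": "a"},
    "b": {"mh": None},
    "c": {"mh": None},
}
new_docs, updates = compare_snapshots(time1, time2)
print(sorted(new_docs), sorted(updates))
```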

parse_xml_new_and_update.download_document(id)

This method is for downloading a single article document in xml format, by the id of the article.

Parameters:id (string) – Article’s alternate id. If it’s a normal id then it will return the same. (Ex: biblio-986217).
Returns:url, xml (xml is an article document downloaded by id)
Return type:string, xml
parse_xml_new_and_update.find_id_by_alternate_id(alternate_id)
Method for obtaining an article’s id from its alternate id. It finds a document by the document’s id or alternate_id.
Parameters:alternate_id (string) – Article’s alternate id. If it’s a normal id then it will return the same (Ex: biblio-986217).
Returns:Article’s id.
Return type:string (Ex: biblio-1001042)
parse_xml_new_and_update.find_new_documents()
parse_xml_new_and_update.main(arguments)

The method main just calls all the other methods. It receives an argument, but it is not required. If it receives the argument “first_time”, then it will download all documents and parse them to save in MongoDB. Otherwise it will just download them to be compared with the ones that already exist.

Parameters:argument – A condition indicating whether the program is being executed for the first time.
Type:string
Returns:Nothing to return

Note

If the program is being executed for the first time, you must pass the argument first_time. Otherwise it doesn’t need any.
First time: python parse_xml_new_and_update.py first_time
Otherwise: python parse_xml_new_and_update.py

parse_xml_new_and_update.parse_file(path_to_file, mode=None)
The method parses a file and extracts all documents one by one,
and then converts each document by calling the function xml_to_dictionary. Afterwards the documents are saved one by one in the MongoDB data base, as are all errors.
Parameters:
  • path_to_file (string (Ex: ./crawled/IBECS_LILACS_17072019_pg_1.xml)) – The path of the file to be parsed.
  • mode (string) – A condition: if it receives “compare”, the documents will be saved into the time 2 collection. Otherwise they go into the normal collection, maybe time 1. By default it’s None.
Returns:

Nothing to return.

parse_xml_new_and_update.process_dir_t1(path_to_dir)

Method to get all files from a folder. All files will be passed one by one to the method parse_file without any condition (None).

Parameters:path_to_dir – The root of the directory where all files in xml format are saved.
Type:string (Ex: ./crawled/)
Returns:Nothing to return.

See also

You should take a look at the method parse_file with mode “compare”. It would help you handle this method better.

parse_xml_new_and_update.process_dir_t2(path_to_dir)

Method to get all files from a folder. All files will be passed one by one to the method parse_file with the condition “compare”.

Parameters:path_to_dir – The root of the directory where all files in xml format are saved.
Type:string (Ex: ./crawled_no_indexed/)
Returns:Nothing to return.

See also

You should take a look at the method parse_file with mode “compare”. It would help you handle this method better.
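Both process_dir_t1 and process_dir_t2 follow the same pattern, which might be sketched as one loop with the mode passed through. The helper `process_dir`, its `handle_file` callback (standing in for parse_file), the `.xml` filter, and the returned list are assumptions for illustration; the documented methods return nothing.

```python
import os


def process_dir(path_to_dir, handle_file, mode=None):
    """Call handle_file(path, mode) for every .xml file, in name order."""
    processed = []
    for name in sorted(os.listdir(path_to_dir)):
        if name.lower().endswith(".xml"):
            path = os.path.join(path_to_dir, name)
            handle_file(path, mode)
            processed.append(path)
    return processed
```

Calling it with `mode=None` would correspond to process_dir_t1 and with `mode="compare"` to process_dir_t2.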

parse_xml_new_and_update.save_to_mongo_updated_info(id, type, db)

This method saves data such as _id, type, db, and date in a collection of the MongoDB data base bvc.

Parameters:
  • id (string) – Article document’s id.
  • type (string (new or update)) – Type is new or update. It depends on article if it’s new or just being updated.
  • db (string) – The name of the article’s data base (LILACS or IBECS)
Returns:

Nothing to return

Note

The date will be saved automatically. It will be the current date obtained by datetime.utcnow().
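The stored record might be assembled as in the sketch below. The field names come from the description above; the helper `build_update_record`, its validation of the type, and the optional `now` parameter (added so the example is reproducible) are assumptions. Inserting the dictionary into MongoDB is left out.

```python
from datetime import datetime


def build_update_record(doc_id, change_type, db_name, now=None):
    """Build the dictionary to store; change_type is 'new' or 'update'."""
    if change_type not in ("new", "update"):
        raise ValueError("type must be 'new' or 'update'")
    return {
        "_id": doc_id,
        "type": change_type,
        "db": db_name,
        # The text names datetime.utcnow(); recent Python versions
        # recommend datetime.now(timezone.utc) instead.
        "date": now if now is not None else datetime.utcnow(),
    }


record = build_update_record("biblio-986217", "new", "LILACS", now=datetime(2019, 7, 17))
print(record)
```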

parse_xml_new_and_update.xml_to_dictionary(document_xml)

The method converts an xml document to a dictionary (json) format. The method is intended only for BVSalud LILACS or IBECS articles. difference_between_entry_update_date.

Parameters:document_xml (xml) – A single article document in the xml format.
Returns:A single article document in the dictionary (json) format.
Return type:dictionary/json
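A minimal sketch of such a conversion, assuming a Solr-style `<doc>` element in which single values are `<str name="...">` children and repeated values are `<arr name="...">` children. The element layout, the helper `doc_to_dict`, and the sample document are assumptions for illustration; the real method may handle more field types.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real layout may differ.
SAMPLE_DOC = """<doc>
  <str name="id">biblio-986217</str>
  <str name="db">LILACS</str>
  <arr name="mh"><str>Asthma</str><str>Child</str></arr>
</doc>"""


def doc_to_dict(document_xml):
    """Map each named child to its text, or to a list for <arr> children."""
    root = ET.fromstring(document_xml)
    record = {}
    for child in root:
        name = child.get("name")
        if name is None:
            continue  # skip children without a field name
        if child.tag == "arr":
            record[name] = [item.text for item in child]
        else:
            record[name] = child.text
    return record


print(doc_to_dict(SAMPLE_DOC))
```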