BVSalud_Documents_save_into_mongo package
Crawl_Records module
-
Crawl_Records.count_records(url)
Extracts the total number of records from the XML downloaded from the given URL.
Parameters: url (string) – A URL for downloading documents as an XML file.
Returns: Total number of records.
Return type: Int
Note
Pass a URL that requests only 1 to 5 documents so that the count returns quickly. A larger request may take longer because of the file’s size.
- Example:
>>> count_records('http://pesquisa.bvsalud.org/portal/?output=xml&lang=pt&from=0&sort=&format=summary&count=5&fb=&page=1&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw')
1032452
-
Crawl_Records.get_records(doc_type, mode=None)
This method acts as the main entry point, because all other methods are called from here. It takes two arguments. The first (doc_type) is required and must be the type of document (journal). The second (mode) is optional and is used when you only want to count records, without downloading and saving them.
Parameters: - doc_type (string) – The type of documents (Ex: ibecs, lilacs, none_indexed_ibecs, none_indexed_lilacs, all, all_none_indexed).
- mode (string (Ex: "count")) – Set to “count” if you only want to count records.
Returns: Number of records.
Return type: Int
-
Crawl_Records.make_base_url(doc_type)
Given the type of Pesquisa articles, returns a base URL, the folder to save to, and the data name, depending on the type received.
Parameters: doc_type (string) – The document type (ibecs, lilacs, none_indexed_ibecs, none_indexed_lilacs, all, all_none_indexed).
Returns: data_name, base_url, folder_to_save
Return type: string, string, string
- Example:
-
Crawl_Records.make_url(base_url, start_record, per_page, page)
Builds a URL by joining the base_url, the start position of the records, the number of documents per page, and the page number. All parameters are required.
Parameters: - base_url (string) – The base URL from which you want to download all contents.
- start_record (Int) – Start position for records. Records start from this number.
- per_page (Int) – Number of records per page.
- page (Int) – Page number.
Returns: final_url
Return type: string
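A minimal sketch of how such a URL could be assembled. The query parameter names (from, count, page) are an assumption taken from the example URL shown for count_records, and make_url_sketch is a hypothetical name, not the module's own implementation.

```python
from urllib.parse import urlencode


def make_url_sketch(base_url, start_record, per_page, page):
    """Append paging parameters to a base URL (illustrative only)."""
    # Assumed mapping: start_record -> from, per_page -> count, page -> page.
    query = urlencode({"from": start_record, "count": per_page, "page": page})
    # Reuse the existing query string if the base URL already has one.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + query


final_url = make_url_sketch(
    "http://pesquisa.bvsalud.org/portal/?output=xml&lang=en", 0, 5, 1
)
```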
-
Crawl_Records.save_all_xml(data_name, base_url, folder_to_save, total_records, per_page)
Downloads all XML files and saves them in a folder whose path is received as an argument.
Parameters: - data_name (string) – Journal’s name, such as IBECS, LILACS, or IBECS_LILACS.
- base_url (string (Ex: ‘http://pesquisa.bvsalud.org/portal/?output=xml&lang=en’)) – A base URL used to build a new URL with the number of documents, start position, and page number.
- folder_to_save (string (Ex: './crawled/')) – Path of the folder where all documents will be stored. If it doesn’t exist, it will be created.
- total_records (Int) – Total number of records.
- per_page (Int) – Number of records per page.
Returns: True. It always returns True.
Return type: Boolean
Note
All records are saved under a name built from data_name + date + file number + .xml (Ex: IBECS_LILACS_17072019_pg_1.xml)
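The naming rule in the note can be sketched as follows. The day-month-year date format is inferred from the example file name, and the rounding-up page count is an assumption about how total_records and per_page are combined. Both helper names are hypothetical.

```python
import math
from datetime import date


def page_count(total_records, per_page):
    """Number of pages needed to cover all records (assumed: round up)."""
    return math.ceil(total_records / per_page)


def build_file_name(data_name, day, page):
    """Compose data_name + date + file number + .xml, as in the note."""
    # Assumed date format: DDMMYYYY, matching IBECS_LILACS_17072019_pg_1.xml.
    return f"{data_name}_{day.strftime('%d%m%Y')}_pg_{page}.xml"


example_name = build_file_name("IBECS_LILACS", date(2019, 7, 17), 1)
```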
parse_xml_new_and_update module
MongoDB:
Warning
MongoDB must be running. Otherwise you will get an error.
Note
MongoDB is initialized just by calling the module parse_xml_new_and_update.
| Database | Collection |
|---|---|
| bvc | training_collection_All |
| | training_collection_None_Indexed_t1 |
| | training_collection_None_Indexed_t2 |
| | training_collection_Update_info |
| | errors_training |
-
parse_xml_new_and_update.change_collections_name_mongo(old_name, new_name)
Renames a collection. If a collection with the target name already exists, that collection is deleted. (Ex: vs.training_collection_old -> vs.training_collection_new).
Parameters: - old_name – The name of the collection to be renamed.
- new_name – The new name for the collection.
Type: string
Returns: Nothing to return
Warning
Do not pass the same value for new_name and old_name; they must be different.
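A sketch of the rename-and-replace behaviour, assuming the db argument behaves like a pymongo Database, whose collections expose rename(new_name, dropTarget=True). The helper name rename_collection and the explicit guard for identical names are illustrative, not taken from the module.

```python
def rename_collection(db, old_name, new_name):
    """Rename old_name to new_name, replacing any existing target collection."""
    # Guard against the case the warning describes.
    if old_name == new_name:
        raise ValueError("new_name must be different from old_name")
    # With pymongo, dropTarget=True drops an existing collection named new_name.
    db[old_name].rename(new_name, dropTarget=True)
```

With a real connection, db would typically be obtained as `pymongo.MongoClient()["bvc"]`.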
-
parse_xml_new_and_update.document_compare()
- This method compares all non-indexed documents between the two databases, time1 against time2 and time2 against time1.
- New documents are inserted into the main database; the others are updated by id, mh, sh, alternat_id, unless the time2 document has mh as None.
Note
It takes no parameters and returns nothing. It only compares the two non-indexed collections, time1 and time2.
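The comparison rule can be illustrated with plain dictionaries in place of the two collections. This is a sketch under assumptions: documents are keyed by id, a document present only in time2 counts as new, and an existing document is updated only when its time2 version has a non-None mh. The function name and return shape are hypothetical.

```python
def compare_snapshots(time1, time2):
    """Classify time2 documents against time1.

    Both arguments map a document id to a document dictionary.
    Returns three lists of ids: new, to update, and skipped.
    """
    new_ids, update_ids, skipped_ids = [], [], []
    for doc_id, doc in time2.items():
        if doc_id not in time1:
            new_ids.append(doc_id)        # not seen at time1: insert as new
        elif doc.get("mh") is None:
            skipped_ids.append(doc_id)    # mh is None at time2: leave untouched
        else:
            update_ids.append(doc_id)     # refresh the stored fields
    return new_ids, update_ids, skipped_ids


t1 = {"a": {"mh": ["x"]}, "b": {"mh": None}}
t2 = {"a": {"mh": ["x", "y"]}, "b": {"mh": None}, "c": {"mh": ["z"]}}
result = compare_snapshots(t1, t2)
```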
-
parse_xml_new_and_update.download_document(id)
Downloads a single article document in XML format, by the article’s id.
Parameters: id (string) – Article’s alternate id. If it’s a normal id, then it will return the same. (Ex: biblio-986217).
Returns: url, xml (xml is the article document downloaded by id)
Return type: string, xml
-
parse_xml_new_and_update.find_id_by_alternate_id(alternate_id)
- Obtains an article’s id from its alternate id. It finds a document by the document’s id or alternate_id.
- The logic of this method is used to find an id from an alternate id.
Parameters: alternate_id (string) – Article’s alternate id. If it’s a normal id, then it will return the same (Ex: biblio-986217).
Returns: Article’s id.
Return type: string (Ex: biblio-1001042)
-
parse_xml_new_and_update.find_new_documents()
-
parse_xml_new_and_update.main(arguments)
The main method only calls all the other methods. It receives an argument, which is not required. If it receives the argument “first_time”, it downloads all documents and parses them to save them in MongoDB. Otherwise it only downloads documents to be compared with those that already exist.
Parameters: argument – Indicates whether the program is being executed for the first time.
Type: string
Returns: Nothing to return
Note
If the program is being executed for the first time, you must pass the argument first_time. Otherwise it doesn’t need any.
First time: python parse_xml_new_and_update.py first_time
Otherwise: python parse_xml_new_and_update.py
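The argument handling described in the note could look roughly like this. select_mode and the returned label "compare" are hypothetical names used only for illustration; the module's actual dispatch may differ.

```python
def select_mode(argv):
    """Decide which workflow to run from the command-line arguments."""
    # argv[0] is the script name; the optional first argument is "first_time".
    if len(argv) > 1 and argv[1] == "first_time":
        return "first_time"   # download everything and load it into MongoDB
    return "compare"          # download only to compare with existing documents
```

In a script this would be called as `select_mode(sys.argv)`.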
-
parse_xml_new_and_update.parse_file(path_to_file, mode=None)
- Parses a file and extracts all documents one by one,
- then converts each document by calling the function xml_to_dictionary. Afterwards, the documents are saved one by one in the MongoDB database, as are all errors.
Parameters: - path_to_file (string (Ex: ./crawled/IBECS_LILACS_17072019_pg_1.xml)) – The path of the file to be parsed.
- mode (string) – If it receives “compare”, documents are saved into the time 2 collection. Otherwise they are saved into the normal collection, possibly time 1. By default it’s None.
Returns: Nothing to return.
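A rough sketch of the parse, convert, and collect-errors flow described above. The element name doc, the labels time1 and time2, and the in-memory lists standing in for MongoDB collections are all assumptions made for illustration; convert stands in for xml_to_dictionary.

```python
import xml.etree.ElementTree as ET


def parse_documents(xml_text, convert, mode=None):
    """Convert every document in xml_text and collect conversion errors."""
    # "compare" routes documents to the time 2 collection, anything else to time 1.
    target = "time2" if mode == "compare" else "time1"
    saved, errors = [], []
    root = ET.fromstring(xml_text)
    for element in root.iter("doc"):  # assumed element name for one document
        try:
            saved.append((target, convert(element)))
        except Exception as exc:  # keep going and record the failure
            errors.append(str(exc))
    return saved, errors
```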
-
parse_xml_new_and_update.process_dir_t1(path_to_dir)
Gets all files from a folder. The files are passed one by one to the method parse_file without any condition (None).
Parameters: path_to_dir – The path of the directory where all XML files are saved.
Type: string (Ex: ./crawled/)
Returns: Nothing to return.
See also
Take a look at the method parse_file with mode “compare”. It will help you understand this method better.
-
parse_xml_new_and_update.process_dir_t2(path_to_dir)
Gets all files from a folder. The files are passed one by one to the method parse_file with the condition “compare”.
Parameters: path_to_dir – The path of the directory where all XML files are saved.
Type: string (Ex: ./crawled_no_indexed/)
Returns: Nothing to return.
See also
Take a look at the method parse_file with mode “compare”. It will help you understand this method better.
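Both directory methods follow the same pattern and differ only in the mode they pass on. A minimal sketch, assuming the files of interest end in .xml and taking the parsing function as a parameter; process_dir is a hypothetical helper, not part of the module.

```python
from pathlib import Path


def process_dir(path_to_dir, parse_file, mode=None):
    """Pass every XML file in path_to_dir to parse_file, one by one."""
    # Sorting makes the processing order predictable.
    for path in sorted(Path(path_to_dir).glob("*.xml")):
        parse_file(str(path), mode)
```

Under this sketch, process_dir_t1 would correspond to `process_dir(path, parse_file)` and process_dir_t2 to `process_dir(path, parse_file, "compare")`.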
-
parse_xml_new_and_update.save_to_mongo_updated_info(id, type, db)
Saves data such as _id, type, db and date to the MongoDB database (bvc) and collection.
Parameters: - id (string) – Article document’s id.
- type (string (new or update)) – Either new or update, depending on whether the article is new or just being updated.
- db (string) – The name of the article’s database (LILACS or IBECS)
Returns: Nothing to return
Note
The date is saved automatically. It is the current date obtained from datetime.utcnow().
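A sketch of the record that might be stored, using the four fields named above. build_update_record is a hypothetical helper, and it uses datetime.now(timezone.utc) rather than datetime.utcnow(), since the latter is deprecated in recent Python versions; the type check is also an addition for illustration.

```python
from datetime import datetime, timezone


def build_update_record(doc_id, change_type, db, now=None):
    """Build the dictionary to store for one new or updated article."""
    if change_type not in ("new", "update"):
        raise ValueError("type must be 'new' or 'update'")
    # The date is filled in automatically unless one is supplied.
    if now is None:
        now = datetime.now(timezone.utc)
    return {"_id": doc_id, "type": change_type, "db": db, "date": now}
```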
-
parse_xml_new_and_update.xml_to_dictionary(document_xml)
Converts an XML document to a dictionary (JSON) format. It is intended only for BVSalud LILACS or IBECS articles. difference_between_entry_update_date.
Parameters: document_xml (xml) – A single article document in XML format.
Returns: A single article document in dictionary (JSON) format.
Return type: dictionary/json