Jupyter Notebooks
https://jupyter.readthedocs.io/en/latest/install.html#install-and-use
Has anyone managed to get getpapers or ami running in a notebook?
Contributor: Ambreen H
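One possible starting point, offered only as a minimal untested sketch: command-line tools such as getpapers or ami can in principle be invoked from a notebook cell through Python's subprocess module, assuming they are already installed and on the PATH. The --help call below is purely illustrative.
PYTHON CODE
# Minimal sketch (assumption: getpapers is installed and on the PATH).
import subprocess

# Run the tool as a shell command from within the notebook and show its output.
result = subprocess.run(["getpapers", "--help"], capture_output=True, text=True)
print(result.stdout)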
Python was used to remove flag symbols from XML Dictionaries:
- The SPARQL endpoint file was first converted into the standard format using amidict (for reference see above).
- The new XML file was imported into Python and all characters within the grandchild elements (i.e. synonyms) were converted to ASCII, dropping non-ASCII characters such as the flag symbols. This left the flag-only synonym elements empty.
PYTHON CODE
import re

iname = "E:\\ami_try\\Dictionaries\\country_converted.xml"
oname = "E:\\ami_try\\Dictionaries\\country_converted2.xml"

# Capture the opening whitespace/tag, the synonym text, and the closing tag.
pat = re.compile(r'(\s*<synonym>)(.*?)(</synonym>\s*)', re.U)

with open(iname, "rb") as fin:
    with open(oname, "wb") as fout:
        for line in fin:
            # Drop any non-ASCII characters (e.g. flag symbols) from the line.
            line = line.decode('ascii', errors='ignore')
            m = pat.search(line)
            if m:
                g = m.groups()
                # Rebuild the synonym line, lowercasing the remaining text.
                line = g[0].lower() + g[1].lower() + g[2].lower()
            fout.write(line.encode('utf-8'))
- The empty elements were then deleted using Python to create a new XML file containing all synonyms except the flags.
PYTHON CODE
from lxml import etree

def remove_empty_tag(tag, original_file, new_file):
    root = etree.parse(original_file)
    # Remove every element with the given tag that has no text or children.
    for element in root.xpath(f".//*[self::{tag} and not(node())]"):
        element.getparent().remove(element)
    # Serialize "root" and create a new tree using an XMLParser to clean up
    # formatting caused by removing elements.
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.fromstring(etree.tostring(root), parser=parser)
    # Write to new file.
    etree.ElementTree(tree).write(new_file, pretty_print=True,
                                  xml_declaration=True, encoding="utf-8")

remove_empty_tag("synonym",
                 "E:\\ami_try\\Dictionaries\\country_converted2.xml",
                 "E:\\ami_try\\Dictionaries\\country_converted3.xml")
All code is reusable with a little modification.
Tester: Ambreen H
Python code was written to import the data from the XML files and cleanse it to create a CSV file for binary classification of the data.
Proper data preparation is necessary before running the machine learning model.
- The following libraries were used: xml.etree.ElementTree (as ET), string, os and re.
- A function was written to locate the XML files and extract the abstract from each.
- This was done on a small number of papers (11 positives and 11 negatives).
- Each abstract was cleaned by removing unnecessary characters, converting the text to lowercase, and removing subheadings such as 'abstract'.
- Finally, a single data file was created in CSV format with three columns: the name of the file, the entire cleaned abstract text, and whether the paper is a false positive or a true positive (a rough sketch of these steps is shown below).
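The original notebook is not reproduced here; the following is only a minimal sketch of the preparation steps above. The folder names ("positives"/"negatives"), the <abstract> tag, the output filename and column names are all assumptions, and the csv module is used in addition to the libraries listed.
PYTHON CODE
# Minimal sketch of the data preparation pipeline -- not the original notebook.
import csv
import os
import re
import string
import xml.etree.ElementTree as ET

def extract_abstract(xml_path):
    """Return the text of the first <abstract> element (assumed tag name)."""
    tree = ET.parse(xml_path)
    node = tree.getroot().find(".//abstract")
    return "".join(node.itertext()) if node is not None else ""

def clean_text(text):
    """Lowercase, drop a leading 'abstract' subheading and strip punctuation."""
    text = text.lower()
    text = re.sub(r"^\s*abstract\s*", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

rows = []
for label, folder in [(1, "positives"), (0, "negatives")]:  # hypothetical folders
    for name in os.listdir(folder):
        if name.endswith(".xml"):
            abstract = clean_text(extract_abstract(os.path.join(folder, name)))
            rows.append([name, abstract, label])

# Write one CSV row per paper: filename, cleaned abstract, binary label.
with open("abstracts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "text", "label"])
    writer.writerows(rows)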
Jupyter was also used to run a smoke test for the binary classification using machine learning. More Information
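The smoke test itself is not documented on this page; the sketch below only illustrates what such a test could look like on the CSV produced above. It assumes scikit-learn with TF-IDF features and logistic regression, which is an assumption rather than the actual model used.
PYTHON CODE
# Hypothetical smoke test on the prepared CSV (assumes scikit-learn is installed).
import csv

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts, labels = [], []
with open("abstracts.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        texts.append(row["text"])
        labels.append(int(row["label"]))

# Hold out a quarter of the papers to check the pipeline end to end.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# TF-IDF features fed into a simple logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))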