Python: extract italic content from HTML
I am trying to extract 'italic' content from a PDF in Python. I have converted the PDF to HTML so that I can use the italic font tag to extract the text. Here is how the HTML looks:
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:225px; width:422px; height:15px;"><span style="font-family: ttpgfa+symbol; font-size:12px">•</span><span style="font-family: yuwtqx+arialmt; font-size:14px"> kornai, janos. 1992. </span><span style="font-family: pucjzv+arial-italicmt; font-size:14px">the socialist system: political economy of communism</span><span style="font-family: yuwtqx+arialmt; font-size:14px">.
This is how my code looks:
from bs4 import BeautifulSoup

# Parse the converted HTML file and collect the text of every <span>
soup = BeautifulSoup(open("/../..myfile.html"), "html.parser")
btags = []
for i in soup.find_all('span'):
    btags.append(i.text)
I am not sure how I can get only the italic text.
Try this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
btags = []
# Keep only the spans whose style attribute mentions an italic font
for i in soup.find_all('span', style=lambda x: x and 'italic' in x):
    btags.append(i.text)
print(btags)
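Passing a function as the style argument filters the results by the result of that function; its input is the value of each tag's style attribute. Here we check whether the string italic is inside the attribute and, if so, return True. The leading `x and` guards against spans that have no style attribute at all, for which the function receives None.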
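As a rough, self-contained illustration of that filter, here is a sketch run against a trimmed copy of the HTML from the question. The enclosing `<div>`, the extra span without a style attribute, and the name `has_italic_font` are made up for the example, and the `.lower()` call assumes that real converter output may spell the font name with capitals.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the HTML in the question; the last span is a
# made-up addition to show what happens when there is no style attribute.
html = (
    '<div>'
    '<span style="font-family: yuwtqx+arialmt; font-size:14px"> kornai, janos. 1992. </span>'
    '<span style="font-family: pucjzv+arial-italicmt; font-size:14px">'
    'the socialist system: political economy of communism</span>'
    '<span>no style attribute here</span>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")


def has_italic_font(style):
    """Return True when the style attribute value mentions an italic font."""
    # bs4 passes None when the tag has no style attribute
    return style is not None and "italic" in style.lower()


italic_texts = [
    span.get_text(strip=True)
    for span in soup.find_all("span", style=has_italic_font)
]
print(italic_texts)
```

Only the span whose font-family contains "italic" should be kept; the plain Arial span and the unstyled span are skipped.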
You may need a more sophisticated check depending on what the rest of your HTML looks like.
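One possible direction, offered only as a sketch: rather than searching the whole style string, split it into individual CSS declarations and look at just the font-related ones, so that "italic" appearing elsewhere (say, in a URL) does not match. The helper names `parse_style` and `is_italic_style`, and the choice to also treat "oblique" as italic, are assumptions for illustration and not part of the original answer.

```python
# Substrings assumed to indicate an italic face in a font name
ITALIC_MARKERS = ("italic", "oblique")


def parse_style(style):
    """Split an inline style string into a {property: value} dict (lowercased)."""
    declarations = {}
    for part in (style or "").split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip().lower()] = value.strip().lower()
    return declarations


def is_italic_style(style):
    """Return True if font-style or font-family indicates an italic face."""
    props = parse_style(style)
    if props.get("font-style") in ITALIC_MARKERS:
        return True
    family = props.get("font-family", "")
    return any(marker in family for marker in ITALIC_MARKERS)


print(is_italic_style("font-family: pucjzv+arial-italicmt; font-size:14px"))
print(is_italic_style("font-family: yuwtqx+arialmt; font-size:14px"))
```

Because the function accepts None, it can be passed directly as the filter, for example `soup.find_all('span', style=is_italic_style)`.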