Python: extract italic content from HTML


I am trying to extract the 'italic' content from a PDF in Python. I have converted the PDF to HTML so that I can use the italic tag to extract the text. Here is how the HTML looks:

<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:225px; width:422px; height:15px;"><span style="font-family: ttpgfa+symbol; font-size:12px">•</span><span style="font-family: yuwtqx+arialmt; font-size:14px">  kornai, janos. 1992. </span><span style="font-family: pucjzv+arial-italicmt; font-size:14px">the socialist system: political economy of communism</span><span style="font-family: yuwtqx+arialmt; font-size:14px">. 

This is how my code looks:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("/../..myfile.html"))
    btags = []
    # collects the text of every span, italic or not
    for i in soup.findAll('span'):
        btags.append(i.text)

I am not sure how I can get only the italic text.

Try this:

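    from bs4 import BeautifulSoup

    # html is assumed to hold the converted page as a string
    soup = BeautifulSoup(html, "html.parser")
    btags = []
    # keep only spans whose style attribute mentions an italic font
    for i in soup.find_all('span', style=lambda x: x and 'italic' in x.lower()):
        btags.append(i.text)

    print(btags)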

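Passing a function as the style argument filters the results by the return value of that function; its input is the value of each tag's style attribute. Here we check whether the string 'italic' appears inside the attribute and, if so, return True. The function also has to cope with spans that have no style attribute at all, for which the value passed in is None.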
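To make the filtering step concrete, here is a minimal sketch run against a shortened copy of the markup from the question. The helper name `has_italic_font` and the extra span without a style attribute are illustrative assumptions, not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question, plus one span with no
# style attribute (added here only to exercise the None case).
html = (
    '<div>'
    '<span style="font-family: yuwtqx+arialmt; font-size:14px">kornai, janos. 1992. </span>'
    '<span style="font-family: pucjzv+arial-italicmt; font-size:14px">'
    'the socialist system: political economy of communism</span>'
    '<span>no style attribute here</span>'
    '</div>'
)


def has_italic_font(style):
    """Return True when a style attribute value mentions an italic font."""
    # The value is None for tags that have no style attribute.
    return style is not None and "italic" in style.lower()


soup = BeautifulSoup(html, "html.parser")
italic_texts = [span.get_text() for span in soup.find_all("span", style=has_italic_font)]
print(italic_texts)
```

A named function behaves the same as the lambda; it is just easier to test on its own and to extend later.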

You may need a more sophisticated check depending on what the rest of your HTML looks like.
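As one possible direction, the sketch below parses the style attribute into separate declarations, so that only font-family and font-style are inspected, and joins consecutive italic spans into a single string. The helper names (`parse_style`, `is_italic`, `italic_runs`) are hypothetical, and the sketch assumes the converter emits flat, non-nested spans like the ones in the question.

```python
from bs4 import BeautifulSoup


def parse_style(style):
    """Split an inline style string into a {property: value} dict (lowercased)."""
    declarations = {}
    for part in (style or "").split(";"):
        name, sep, value = part.partition(":")
        if sep:  # skip empty or malformed fragments that have no colon
            declarations[name.strip().lower()] = value.strip().lower()
    return declarations


def is_italic(style):
    """Heuristic: font-style is italic/oblique, or the font name says so."""
    decl = parse_style(style)
    if decl.get("font-style") in ("italic", "oblique"):
        return True
    family = decl.get("font-family", "")
    return "italic" in family or "oblique" in family


def italic_runs(html):
    """Return italic text, joining consecutive italic spans into one string."""
    soup = BeautifulSoup(html, "html.parser")
    runs, current = [], []
    for span in soup.find_all("span"):
        if is_italic(span.get("style")):
            current.append(span.get_text())
        elif current:
            # a non-italic span ends the current italic run
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs
```

Compared with a plain substring test on the whole attribute, this avoids matching the word "italic" in unrelated declarations and keeps a title together when the converter splits it across several spans.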

