pdf - How to get the font height/weight from TextRenderInfo?
When I parse an existing PDF using iText(Sharp), I create an object that implements IRenderListener and pass it to PdfReaderContentParser.ProcessContent(). Sure enough, my object's RenderText() gets called repeatedly with the text in the PDF.
The problem is, TextRenderInfo tells me the base font (in my case, Helvetica), but I can't tell the height of the font nor its weight (regular vs. bold). Is this a known deficiency of iText(Sharp), or am I missing something?
"TextRenderInfo tells me the base font (in my case, Helvetica), but I can't tell the height of the font nor its weight (regular vs. bold)"
Height
Unfortunately, iTextSharp does not provide a public font size method or member in TextRenderInfo. Some people have worked around this by using the distance between GetAscentLine() and GetDescentLine().
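The ascent/descent workaround could look roughly like the sketch below. It assumes the iTextSharp 5.x parser API (TextRenderInfo.GetAscentLine(), GetDescentLine(), LineSegment.GetStartPoint(), Vector.Subtract() and Vector.Length); the class TextHeightEstimator and its method are hypothetical names introduced only for illustration. Note that the result is the ascent-to-descent distance of the font's metrics, which is related to the font size but not necessarily equal to it.

```csharp
using iTextSharp.text.pdf.parser;

public static class TextHeightEstimator
{
    // Hypothetical helper (not part of iTextSharp): approximates the height of a
    // text chunk as the distance between its ascent line and its descent line.
    public static float EstimateHeight(TextRenderInfo renderInfo)
    {
        Vector ascentStart = renderInfo.GetAscentLine().GetStartPoint();
        Vector descentStart = renderInfo.GetDescentLine().GetStartPoint();

        // Using the length of the difference vector (instead of only comparing
        // y coordinates) keeps the result meaningful for rotated text as well.
        return ascentStart.Subtract(descentStart).Length;
    }
}
```

It could be called from inside RenderText(), e.g. as TextHeightEstimator.EstimateHeight(renderInfo).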
If you are ready to use reflection, though, you can do better by exposing and using the private TextRenderInfo member GraphicsState gs, e.g. in your render listener:
    using System.Collections.Generic;
    using System.Reflection;
    using iTextSharp.text.pdf.parser;

    public class LocationTextSizeExtractionStrategy : LocationTextExtractionStrategy
    {
        // Hold the size, text, and font name of each chunk
        public List<SizeAndTextAndFont> myChunks = new List<SizeAndTextAndFont>();

        // Automatically called for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo wholeRenderInfo)
        {
            base.RenderText(wholeRenderInfo);
            // Read the private graphics state of the render info via reflection
            GraphicsState gs = (GraphicsState) GsField.GetValue(wholeRenderInfo);
            myChunks.Add(new SizeAndTextAndFont(gs.FontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
        }

        FieldInfo GsField = typeof(TextRenderInfo).GetField("gs", BindingFlags.NonPublic | BindingFlags.Instance);
    }

    // Helper class that stores our size, text, and font
    public class SizeAndTextAndFont
    {
        public float size;
        public string text;
        public string font;

        public SizeAndTextAndFont(float size, string text, string font)
        {
            this.size = size;
            this.text = text;
            this.font = font;
        }
    }
You can extract the information using such a render listener like this:
    using (var pdfReader = new PdfReader(testFile))
    {
        // Loop through each page of the document
        for (var page = startPage; page < endPage; page++)
        {
            Console.WriteLine("\n page {0}", page);

            LocationTextSizeExtractionStrategy strategy = new LocationTextSizeExtractionStrategy();
            PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

            foreach (SizeAndTextAndFont p in strategy.myChunks)
            {
                Console.WriteLine(string.Format("<{0}> in {2} @ {1}", p.text, p.size, p.font));
            }
        }
    }
This produces output like this:
     page 1
    < philippine stock exchange, inc> in Helvetica-Bold @ 8
    < daily quotations report> in Helvetica-Bold @ 8
    < march 23 , 2015> in Helvetica-Bold @ 8
    <name> in Helvetica @ 7
    <symbol> in Helvetica @ 7
    <bid> in Helvetica @ 7
    [...]
Considering transformations
The numbers you see in the output as font sizes are the values of the font size property in the PDF graphics state at the time the respective text was drawn.
Due to the flexibility of PDF, this may not be the font size you see in the rendered output, though, because a custom transformation may stretch the output considerably. Some PDF producers always use a font size of 1 and use transformations to stretch the output accordingly.
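To make the effect of such a transformation concrete, here is a minimal, library-free sketch of the underlying arithmetic. It assumes the usual PDF matrix convention [a b 0; c d 0; e f 1] applied to row vectors; the class FontSizeMath and its method are hypothetical names used only for this illustration.

```csharp
using System;

public static class FontSizeMath
{
    // Hypothetical helper: the effective size of a font with nominal size fontSize
    // after a transformation whose linear part has the rows (a b) and (c d).
    // The translation part (e f) does not change lengths and is ignored.
    public static double EffectiveFontSize(double fontSize, double a, double b, double c, double d)
    {
        // The vertical vector (0, fontSize) is mapped to (fontSize * c, fontSize * d);
        // a and b only act on the horizontal component, which is zero here.
        double x = fontSize * c;
        double y = fontSize * d;
        return Math.Sqrt(x * x + y * y);
    }
}
```

For example, a nominal font size of 1 combined with a uniform scaling by 8 (a = d = 8, b = c = 0) yields an effective size of 8, whereas the graphics state alone would report 1.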
To get meaningful font sizes in such documents, you can improve the LocationTextSizeExtractionStrategy method RenderText like this:
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);
        GraphicsState gs = (GraphicsState) GsField.GetValue(wholeRenderInfo);
        Matrix textToUserSpaceTransformMatrix = (Matrix) TextToUserSpaceTransformMatrixField.GetValue(wholeRenderInfo);
        // Transform a vertical vector of font size length and measure the result
        float transformedFontSize = new Vector(0, gs.FontSize, 0).Cross(textToUserSpaceTransformMatrix).Length;
        myChunks.Add(new SizeAndTextAndFont(transformedFontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
    }
with this additional reflection FieldInfo member:
    FieldInfo TextToUserSpaceTransformMatrixField = typeof(TextRenderInfo).GetField("textToUserSpaceTransformMatrix", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
Weight
As you can see in the output above, the name of the font may contain more than the mere font family name, namely a weight indicator:
    < march 23 , 2015> in Helvetica-Bold @ 8
In your example, therefore,
"TextRenderInfo tells me the base font (in my case, Helvetica)"
the Helvetica without decorations would imply a regular weight.
Helvetica is one of the standard 14 fonts every PDF viewer must provide out of the box: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique. Thus, these names are pretty dependable.
Unfortunately, font names in general may be chosen arbitrarily; a bold font may have "Bold" or "Black" or other indicators of boldness in its name, or none at all.
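A name-based check along these lines might be sketched as follows. The class FontNameHeuristics, the method LooksBold and the marker list are hypothetical and deliberately incomplete; as explained above, a font name carries no guarantee, so the result is only a guess.

```csharp
using System;

public static class FontNameHeuristics
{
    // Illustrative, incomplete list of name fragments that often indicate a heavy weight.
    static readonly string[] BoldMarkers = { "bold", "black", "heavy" };

    // Hypothetical helper: guesses from the PostScript font name whether a font is bold.
    public static bool LooksBold(string postscriptFontName)
    {
        if (string.IsNullOrEmpty(postscriptFontName)) return false;

        // Names of embedded font subsets start with a tag such as "ABCDEF+"; drop it
        // so that the tag letters cannot accidentally match a marker.
        int plus = postscriptFontName.IndexOf('+');
        string name = plus >= 0 ? postscriptFontName.Substring(plus + 1) : postscriptFontName;

        foreach (string marker in BoldMarkers)
        {
            if (name.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0) return true;
        }
        return false;
    }
}
```

With the fonts from the output above, LooksBold("Helvetica-Bold") is true and LooksBold("Helvetica") is false; a bold font with an arbitrary name would be missed.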
One might also try to use the font's FontDescriptor dictionary, in which the entry FontWeight may be specified. Unfortunately, this entry is optional, so you cannot count on it being there at all.
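Reading that optional entry might be sketched like this. The sketch assumes that your iTextSharp 5.x version exposes DocumentFont.FontDictionary as well as PdfName.FONTDESCRIPTOR and PdfName.FONTWEIGHT; the class FontWeightReader and its method are hypothetical names. It returns null whenever the descriptor or the entry is missing, which is to be expected for the standard 14 fonts, and it does not look into the descendant fonts of composite (Type 0) fonts.

```csharp
using iTextSharp.text.pdf;

public static class FontWeightReader
{
    // Hypothetical helper: returns the FontWeight value from the font descriptor,
    // or null if the font has no descriptor or the optional entry is absent.
    public static int? TryGetFontWeight(DocumentFont font)
    {
        PdfDictionary fontDictionary = font.FontDictionary;
        if (fontDictionary == null) return null;

        PdfDictionary descriptor = fontDictionary.GetAsDict(PdfName.FONTDESCRIPTOR);
        if (descriptor == null) return null;

        PdfNumber weight = descriptor.GetAsNumber(PdfName.FONTWEIGHT);
        return weight == null ? (int?) null : weight.IntValue;
    }
}
```

In a render listener it could be called as FontWeightReader.TryGetFontWeight(renderInfo.GetFont()). The PDF specification defines 400 as normal and 700 as bold for this entry.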
Furthermore, a font in a PDF can be artificially bold'ed, cf. this answer:
"All these numbers are drawn using the same font, merely adding a rising outline line width."
Thus, I'm afraid there is no dependable way to find the exact font weight, merely a number of heuristics which may or may not return acceptable approximations.