How to add Unicode in TrueType0Font on PDFBox 2.0.0?


I've been using PDFBox version 2.0.0 in a Java project to convert PDFs to text.

Several of the PDFs are missing the ToUnicode mapping, so they come out as gibberish when I export them.

2016-09-14 10:44:55 WARN org.apache.pdfbox.pdmodel.font.PDSimpleFont(1):322 - No Unicode mapping for 694 (30) in font MPBAAA+F1

In the warning above, a gibberish character (code 30) was output instead of the real character.

I was able to overcome this by editing the additional.txt file in PDFBox: through trial and error I worked out that the character code (694 in this case) represents a Hebrew letter (צ).

Here's a short example of what I've edited inside the file:

    -694;05E6 # hexadecimal value of the letter צ
    -695;05E7
    -696;05E8
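The hexadecimal values in such entries are just the Unicode code points of the target characters, written as four hex digits. As a minimal sketch (the class and method names are hypothetical and not part of PDFBox), they can be computed instead of looked up by hand:

```java
// Hypothetical helper, not part of PDFBox: formats "name;hex" entries
// in the style of the additional.txt example above.
class AdditionalTxtLines {

    // Returns e.g. "694;05E6" for the name "694" and the code point U+05E6.
    static String line(String glyphName, int codePoint) {
        return glyphName + ";" + String.format("%04X", codePoint);
    }

    public static void main(String[] args) {
        // The three Hebrew letters from the example: tsadi, qof, resh.
        String letters = "\u05E6\u05E7\u05E8";
        int firstName = 694; // assumption: the names are consecutive, as in the example
        for (int i = 0; i < letters.length(); i++) {
            System.out.println(line(String.valueOf(firstName + i), letters.codePointAt(i)));
        }
    }
}
```

The glyph name is passed through unchanged, so whatever prefix the file expects can be included by the caller.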

Later I encountered the same warning on a different PDF, but instead of gibberish characters I got no characters at all. A more detailed explanation of this issue can be seen here - pdf reading via pdfbox in java

2016-09-14 11:07:10 WARN org.apache.pdfbox.pdmodel.font.PDType0Font(1):431 - No Unicode mapping for CID+694 (694) in font ABCDEE+Tahoma,Bold

As you can see, this warning came from a different class (PDType0Font) than the first one (PDSimpleFont), but the code name (694) is the same in both, and both refer to the same character.
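To see which codes need a mapping in which font, the warnings can be collected from the log. The following is only a sketch: it assumes the message has the shape "No Unicode mapping for <name> (<code>) in font <font>", as in the two lines quoted here, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper, not part of PDFBox.
class UnmappedGlyphLog {

    // Assumed message shape: "No Unicode mapping for <name> (<code>) in font <font>"
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for (\\S+) \\((\\d+)\\) in font (.+)$");

    // Returns {name, code, font} for a matching line, or null if the line does not match.
    static String[] parse(String logLine) {
        Matcher m = WARNING.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3).trim() };
    }

    public static void main(String[] args) {
        String[] lines = {
            "2016-09-14 10:44:55 WARN org.apache.pdfbox.pdmodel.font.PDSimpleFont(1):322 - "
                + "No Unicode mapping for 694 (30) in font MPBAAA+F1",
            "2016-09-14 11:07:10 WARN org.apache.pdfbox.pdmodel.font.PDType0Font(1):431 - "
                + "No Unicode mapping for CID+694 (694) in font ABCDEE+Tahoma,Bold"
        };
        for (String line : lines) {
            String[] hit = parse(line);
            if (hit != null) {
                System.out.println(hit[2] + ": name=" + hit[0] + ", code=" + hit[1]);
            }
        }
    }
}
```

Note that the two warnings report different things in the name position: a glyph name (694) for the simple font and a CID (CID+694) for the Type0 font.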

Is there a different file, other than additional.txt, that I should edit to point the 694 code (the Hebrew letter צ) to its correct Unicode value?

Thanks.

[Screenshot: main root]

[Screenshot: drill down on first Type0 font]

Here's some code to add a ToUnicode CMap stream to the font. I can't do it with your file, so I used one of the test files, which can be found here. I had to work on each entry separately and I didn't do them all. The result is enough to extract the first word in green print ("Bedingungen").

The scenario is tailored for you:

  • Identity-H entry
  • no ToUnicode entry
  • specific font name

    // f is the java.io.File of the input PDF
    try (PDDocument doc = PDDocument.load(f))
    {
        for (int p = 0; p < doc.getNumberOfPages(); ++p)
        {
            PDPage page = doc.getPage(p);
            PDResources res = page.getResources();
            for (COSName fontName : res.getFontNames())
            {
                PDFont font = res.getFont(fontName);
                COSBase encoding = font.getCOSObject().getDictionaryObject(COSName.ENCODING);
                if (!COSName.IDENTITY_H.equals(encoding))
                {
                    continue;
                }
                // get real name (strip the subset prefix before the '+')
                String fname = font.getName();
                int plus = fname.indexOf('+');
                if (plus != -1)
                {
                    fname = fname.substring(plus + 1);
                }
                // skip fonts that already have a ToUnicode entry
                if (font.getCOSObject().containsKey(COSName.TO_UNICODE))
                {
                    continue;
                }
                System.out.println("File '" + f.getName() + "', page " + (p + 1) + ", " + fontName.getName() + ", " + font.getName());
                if (!fname.startsWith("Calibri-Bold"))
                {
                    continue;
                }
                COSStream toUnicodeStream = new COSStream();
                try (PrintWriter pw = new PrintWriter(toUnicodeStream.createOutputStream(COSName.FLATE_DECODE)))
                {
                    // "9.10 Extraction of Text Content" in the PDF 32000 specification
                    pw.println ("/CIDInit /ProcSet findresource begin\n" +
                            "12 dict begin\n" +
                            "begincmap\n" +
                            "/CIDSystemInfo\n" +
                            "<< /Registry (Adobe)\n" +
                            "/Ordering (UCS) /Supplement 0 >> def\n" +
                            "/CMapName /Adobe-Identity-UCS def\n" +
                            "/CMapType 2 def\n" +
                            "1 begincodespacerange\n" +
                            "<0000> <FFFF>\n" +
                            "endcodespacerange\n" +
                            "10 beginbfchar\n" + // number is count of entries
                            "<0001><0020>\n" + // space
                            "<0002><0041>\n" + // A
                            "<0003><0042>\n" + // B
                            "<0004><0044>\n" + // D
                            "<0013><0065>\n" + // e
                            "<0012><0064>\n" + // d
                            "<0017><0069>\n" + // i
                            "<001B><006E>\n" + // n
                            "<0015><0067>\n" + // g
                            "<0020><0075>\n" + // u
                            "endbfchar\n" +
                            "endcmap CMapName currentdict /CMap defineresource pop end end");
                }
                font.getCOSObject().setItem(COSName.TO_UNICODE, toUnicodeStream);
            }
        }
        doc.save("huhu.pdf");
    }
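Writing the bfchar entries by hand gets tedious for more than a few characters. As a rough sketch, the CMap text could be generated from a map of character codes to Unicode strings and then written to the stream in place of the hard-coded string. The class and method names below are hypothetical, the example assumes that 694 to 696 really are the character codes of צ, ק and ר in your Type0 font, and the limit of 100 entries per bfchar block is my understanding of the CMap format rather than something PDFBox enforces.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox: builds the text of a ToUnicode CMap.
class ToUnicodeCMapBuilder {

    // Assumption: a single bfchar block should hold at most 100 entries.
    private static final int MAX_ENTRIES_PER_BLOCK = 100;

    // Hex of the UTF-16BE code units, e.g. "\u05E6" -> "05E6".
    static String hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    // Builds a CMap for two-byte character codes (0000 to FFFF).
    static String build(Map<Integer, String> codeToText) {
        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n")
          .append("12 dict begin\n")
          .append("begincmap\n")
          .append("/CIDSystemInfo\n")
          .append("<< /Registry (Adobe)\n")
          .append("/Ordering (UCS) /Supplement 0 >> def\n")
          .append("/CMapName /Adobe-Identity-UCS def\n")
          .append("/CMapType 2 def\n")
          .append("1 begincodespacerange\n")
          .append("<0000> <FFFF>\n")
          .append("endcodespacerange\n");

        // Sort by code so the output is stable, then emit blocks of limited size.
        TreeMap<Integer, String> sorted = new TreeMap<>(codeToText);
        List<Map.Entry<Integer, String>> entries = new ArrayList<>(sorted.entrySet());
        for (int start = 0; start < entries.size(); start += MAX_ENTRIES_PER_BLOCK) {
            int end = Math.min(start + MAX_ENTRIES_PER_BLOCK, entries.size());
            sb.append(end - start).append(" beginbfchar\n");
            for (Map.Entry<Integer, String> e : entries.subList(start, end)) {
                sb.append(String.format("<%04X><%s>\n", e.getKey(), hex(e.getValue())));
            }
            sb.append("endbfchar\n");
        }
        sb.append("endcmap CMapName currentdict /CMap defineresource pop end end\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> mapping = new LinkedHashMap<>();
        mapping.put(694, "\u05E6"); // assumed code for tsadi
        mapping.put(695, "\u05E7"); // assumed code for qof
        mapping.put(696, "\u05E8"); // assumed code for resh
        System.out.print(build(mapping));
    }
}
```

With such a helper, `pw.print(ToUnicodeCMapBuilder.build(mapping))` would replace the literal string in the snippet above; the actual codes still have to be read from the font, for example with PDFDebugger.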

Btw, the unreleased 2.1 version of PDFDebugger has improved features to show fonts; you can get it here:

You can use it to verify that the ToUnicode CMap makes sense. Here are the changes: [Screenshot]

