How to add Unicode in TrueType0Font on PDFBox 2.0.0?
I've been using PDFBox version 2.0.0 in a Java project to convert PDFs to text.
Several of the PDFs are missing a ToUnicode mapping, so they come out as gibberish when I export them.
2016-09-14 10:44:55 WARN org.apache.pdfbox.pdmodel.font.PDSimpleFont(1):322 - No Unicode mapping for 694 (30) in font MPBAAA+F1
In the warning above, instead of the real character, a gibberish Unicode character (30) is output.
I was able to overcome this by editing the additional.txt file in PDFBox, since through trial and error I understood that the code of the character (694 in this case) represents a Hebrew letter (צ).
Here's a short example of what I've edited inside the file:
-694;05e6 #hexadecimal value of the letter צ
-695;05e7
-696;05e8
Later I encountered the same warning on a different PDF, but instead of gibberish characters I got no characters at all. A more detailed explanation of the issue can be seen here - PDF reading via PDFBox in Java
2016-09-14 11:07:10 WARN org.apache.pdfbox.pdmodel.font.PDType0Font(1):431 - No Unicode mapping for CID+694 (694) in font ABCDEE+Tahoma,Bold
As you can see, the warning came from a different class (PDType0Font) than the first warning (PDSimpleFont), but the code name (694) is the same in both of them, and both are talking about the same character.
Is there a different file I should edit, other than additional.txt, to point the 694 code (the Hebrew letter צ) to its correct Unicode?
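For reference, the additional.txt entries quoted above appear to follow the glyph list convention of a name, a semicolon, and one or more hexadecimal code points, with `#` starting a comment. The following is a minimal sketch of decoding such a line, not PDFBox's own parser; the class name `GlyphListEntry` is hypothetical, and the sample line leaves out the leading `-` on the assumption that it is list formatting rather than part of the entry.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class GlyphListEntry {

    /**
     * Parses a line of the form "name;XXXX [XXXX ...]" into name -> text.
     * Returns null for blank lines and lines holding only a comment.
     */
    static Map.Entry<String, String> parse(String line) {
        // Drop a trailing "# comment", if present.
        int hash = line.indexOf('#');
        String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (body.isEmpty()) {
            return null;
        }
        String[] parts = body.split(";");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected name;hex but got: " + line);
        }
        // Each space-separated token is one code point in hexadecimal.
        StringBuilder text = new StringBuilder();
        for (String hex : parts[1].trim().split("\\s+")) {
            text.appendCodePoint(Integer.parseInt(hex, 16));
        }
        return new SimpleEntry<>(parts[0].trim(), text.toString());
    }

    public static void main(String[] args) {
        Map.Entry<String, String> e = parse("694;05e6 #hexadecimal value of the letter");
        // Prints: 694 -> U+05E6
        System.out.println(e.getKey() + " -> U+"
                + String.format("%04X", e.getValue().codePointAt(0)));
    }
}
```

This only illustrates how the hexadecimal value 05e6 corresponds to the letter צ; it does not explain why the second PDF ignores the file.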
Here's some code to add a ToUnicode CMap stream to the font. I don't have your file, so I used one of the test files, which can be found here. I had to work on each entry separately and didn't do them all. The result is enough to extract the first word in the green print ("Bedingungen").
The scenario is tailored to you:
- an Identity-H entry
- no ToUnicode entry
- a specific font name
    import java.io.PrintWriter;
    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.cos.COSStream;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.font.PDFont;

    // f is the java.io.File of the input PDF
    try (PDDocument doc = PDDocument.load(f))
    {
        for (int p = 0; p < doc.getNumberOfPages(); ++p)
        {
            PDPage page = doc.getPage(p);
            PDResources res = page.getResources();
            for (COSName fontName : res.getFontNames())
            {
                PDFont font = res.getFont(fontName);
                // only handle fonts with an Identity-H encoding
                COSBase encoding = font.getCOSObject().getDictionaryObject(COSName.ENCODING);
                if (!COSName.IDENTITY_H.equals(encoding))
                {
                    continue;
                }
                // get the real name (strip the subset prefix before '+')
                String fname = font.getName();
                int plus = fname.indexOf('+');
                if (plus != -1)
                {
                    fname = fname.substring(plus + 1);
                }
                // skip fonts that already have a ToUnicode entry
                if (font.getCOSObject().containsKey(COSName.TO_UNICODE))
                {
                    continue;
                }
                System.out.println("File '" + f.getName() + "', page " + (p + 1) + ", " + fontName.getName() + ", " + font.getName());
                if (!fname.startsWith("Calibri-Bold"))
                {
                    continue;
                }
                COSStream toUnicodeStream = new COSStream();
                try (PrintWriter pw = new PrintWriter(toUnicodeStream.createOutputStream(COSName.FLATE_DECODE)))
                {
                    // "9.10 Extraction of Text Content" in the PDF 32000 specification
                    pw.println("/CIDInit /ProcSet findresource begin\n"
                            + "12 dict begin\n"
                            + "begincmap\n"
                            + "/CIDSystemInfo\n"
                            + "<< /Registry (Adobe)\n"
                            + "/Ordering (UCS) /Supplement 0 >> def\n"
                            + "/CMapName /Adobe-Identity-UCS def\n"
                            + "/CMapType 2 def\n"
                            + "1 begincodespacerange\n"
                            + "<0000> <FFFF>\n"
                            + "endcodespacerange\n"
                            + "10 beginbfchar\n" // the number is the count of entries
                            + "<0001><0020>\n" // space
                            + "<0002><0041>\n" // A
                            + "<0003><0042>\n" // B
                            + "<0004><0044>\n" // D
                            + "<0013><0065>\n" // e
                            + "<0012><0064>\n" // d
                            + "<0017><0069>\n" // i
                            + "<001B><006E>\n" // n
                            + "<0015><0067>\n" // g
                            + "<0020><0075>\n" // u
                            + "endbfchar\n"
                            + "endcmap CMapName currentdict /CMap defineresource pop end end");
                }
                font.getCOSObject().setItem(COSName.TO_UNICODE, toUnicodeStream);
            }
        }
        doc.save("huhu.pdf");
    }
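Typing each bfchar line by hand gets tedious once more than a handful of glyphs are involved. As a rough sketch, the same CMap text could be generated from a map of CID to Unicode text. The class name `ToUnicodeCMapBuilder`, its `build` method, and the 694 to צ pairing in `main` are illustrative assumptions taken from the question, not part of PDFBox. The sketch also splits the entries into blocks of at most 100, which is the limit the CMap format documentation gives for a single bfchar block as far as I know.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ToUnicodeCMapBuilder {

    // Assumed per-block limit for bfchar sections (see lead-in).
    private static final int MAX_PER_BLOCK = 100;

    /** Builds the text of a ToUnicode CMap with one bfchar entry per CID. */
    static String build(Map<Integer, String> cidToText) {
        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n")
          .append("12 dict begin\n")
          .append("begincmap\n")
          .append("/CIDSystemInfo\n")
          .append("<< /Registry (Adobe)\n")
          .append("/Ordering (UCS) /Supplement 0 >> def\n")
          .append("/CMapName /Adobe-Identity-UCS def\n")
          .append("/CMapType 2 def\n")
          .append("1 begincodespacerange\n")
          .append("<0000> <FFFF>\n")
          .append("endcodespacerange\n");

        List<Map.Entry<Integer, String>> entries = new ArrayList<>(cidToText.entrySet());
        for (int start = 0; start < entries.size(); start += MAX_PER_BLOCK) {
            int end = Math.min(start + MAX_PER_BLOCK, entries.size());
            sb.append(end - start).append(" beginbfchar\n");
            for (Map.Entry<Integer, String> e : entries.subList(start, end)) {
                // Source code: the CID as a 2-byte hex string.
                sb.append(String.format("<%04X><", e.getKey()));
                // Destination: the text as UTF-16BE code units in hex.
                for (char c : e.getValue().toCharArray()) {
                    sb.append(String.format("%04X", (int) c));
                }
                sb.append(">\n");
            }
            sb.append("endbfchar\n");
        }

        sb.append("endcmap CMapName currentdict /CMap defineresource pop end end\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical mapping based on the question: CID 694 is the letter tsadi.
        Map<Integer, String> map = new LinkedHashMap<>();
        map.put(694, "\u05E6");
        System.out.println(build(map));
    }
}
```

The returned string could be written through the same PrintWriter as in the code above instead of the hard-coded literal. Which CID belongs to which character still has to be worked out per font, as described in the answer.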
BTW, the unreleased 2.1 version of PDFDebugger has improved features to show fonts; you can get it here:
You can use it to verify that the ToUnicode CMap makes sense. Here are the changes:

