REGEX to Remove Embedded Header/Footer in the Text from a PDF?

Well, that title probably makes no sense. I've got the text from a PDF file that has the header and footer information embedded within it (so the header/footer looks just like actual text). Something like:

Quote:

blah blah blah

[page number] <== ex Footer

[title] <== ex Header

blah blah blah.

I'd like to convert that text to something passably readable as an EPUB via Calibre. But, before conversion, I need to get rid of those header/footer combinations. My REGEX knowledge is only microscopically above the zero point, and the best I could figure out as a way to find those headers/footers is:

Code:

\s+\d+\s+TITLE\s+

For my own future knowledge, I'll put what I think those codes mean in here:

\s means to match a whitespace character (space, tab, CR, LF)
+ means to match 1 or more of whatever comes right before it
\d means to match a digit
TITLE is the title of the document that was stuck in what used to be a header

So, it looks like that REGEX should grab from the start of the whitespace before the page number and run through the title to the end of the whitespace where the actual text picks up again. Probably not the best bit of REGEX, but it seems to work.
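To check that reading of the pattern, here is a minimal Python sketch. It assumes the header text is literally TITLE and the page number is a bare integer on its own line; the sample string and the variable names are made up for illustration.

```python
import re

# Assumption: the header is the literal text "TITLE". re.escape keeps a real
# title containing characters like "." or "?" from being read as regex syntax.
title = "TITLE"
header_footer = re.compile(r"\s+\d+\s+" + re.escape(title) + r"\s+")

# Hypothetical sample: text, page number (footer), title (header), more text.
sample = "Too mcuh of his chohdliod\n\n11\n\nTITLE\n\nhad been snept in fere flla."

match = header_footer.search(sample)
matched_text = match.group(0)

# The match covers the whitespace before the page number, the number,
# the title, and the whitespace after the title.
print(repr(matched_text))  # '\n\n11\n\nTITLE\n\n'
```

Note that the same pattern would also match a number followed by the title inside ordinary text, so it is worth eyeballing the matches before replacing anything.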

If the text before that header/footer combination is the end of a paragraph, that's fine. But, if the header/footer combination occurs right in the middle of a sentence, then removing it will result in the continuation "paragraph" being smashed right up against the paragraph that was before the header/footer.

For instance:

Quote:

Lit lognued in one of the gseut criahs in N’kcis ofcife, his lnog lges spilwarng
far asorcs the rgu. He was attauneted rehtar tahn bgi. Too mcuh of his chohdliod

11

TITLE

had been snept in fere flla. Now he cluod not fit itno a stadnard prussere siut
or sparcecaft cniba; and whvereer he sta, he lekood lkie he was tniyrg to tkae orev.

would be transmogrified to:

Quote:

Lit lognued in one of the gseut criahs in N’kcis ofcife, his lnog lges spilwarng
far asorcs the rgu. He was attauneted rehtar tahn bgi. Too mcuh of his chohdliodhad been snept in fere flla. Now he cluod not fit itno a stadnard prussere siut
or sparcecaft cniba; and whvereer he sta, he lekood lkie he was tniyrg to tkae orev.

Can anyone come up with a better way to strip out all those headers/footers?

EDIT: I guess if I replace the selection with a CR LF (\r\n), that would work reasonably well. It doesn't look like it would be any worse than all the other lines ending with CR LF. I'll have to check and see if Calibre's conversion routine gets rid of those.
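The replace-with-a-line-break idea from the EDIT can be sketched in Python like this. It is only a sketch under assumptions: the header is literally TITLE, the text uses CR LF line endings, and strip_headers is a made-up helper name, not anything from Calibre.

```python
import re

# Same pattern as in the post; TITLE stands in for the real document title.
header_footer = re.compile(r"\s+\d+\s+TITLE\s+")

def strip_headers(text, replacement="\r\n"):
    """Replace every footer/header run with `replacement` (CR LF by default)."""
    return header_footer.sub(replacement, text)

# Hypothetical sample where the page break falls in the middle of a sentence.
sample = ("Too mcuh of his chohdliod\r\n\r\n11\r\n\r\nTITLE\r\n\r\n"
          "had been snept in fere flla.")

cleaned = strip_headers(sample)
print(repr(cleaned))  # 'Too mcuh of his chohdliod\r\nhad been snept in fere flla.'

# Passing a single space instead joins the two halves into one line.
joined = strip_headers(sample, " ")
print(repr(joined))   # 'Too mcuh of his chohdliod had been snept in fere flla.'
```

Either replacement keeps the two words from being smashed together; which one reads better after conversion depends on how the rest of the line breaks get handled.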
