I need to convert 24 PDF files in a folder into txt files so that I can perform semantic analysis on them. I took a look at this question, and proceeded from there. However, after getting the code to work the first time, I then changed some things around, and now I am getting the following error:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
Because of this, what is saved in the bodies variable in the code below is just a list of 24 blanks, and I end up with 24 blank text files (in addition to the 24 text files that are created by converting the PDFs into txt). I'm not sure what I've done wrong - at one point, this code worked!
I've already looked through what I could find about this error, but the existing answers all concern read.csv, and the fixes they suggest (setting white.space=TRUE and quote="") did not work for me.
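For reference, here is a minimal sketch of how I understand this warning can arise with scan(). It assumes the cause is a token that opens a quote which is never closed. The string `txt` is made up for illustration and is not taken from my files. In this toy case quote = "" does silence the warning, even though it did not fix my real problem.

```r
# Made-up text: the token "hello opens a double quote that is never closed.
txt <- 'He said "hello and never closed the quote'

# With the default quote characters, collect any warning messages
# instead of letting them print to the console.
msgs <- character()
tokens_default <- withCallingHandlers(
  scan(text = txt, what = character(), quiet = TRUE),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(msgs)

# quote = "" turns off quote handling, so the text is split on whitespace only.
tokens_noquote <- scan(text = txt, what = character(), quote = "", quiet = TRUE)
print(tokens_noquote)
```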
Here's my code (the warning comes from the scan() call in the block that builds bodies):
# folder with journal articles
PDFfolder_path <- "~/Dropbox/The Egoist PDFs/PDFs"
# vector of PDF file names
PDFfiles <- list.files(path=PDFfolder_path, pattern="*.pdf", full.names=TRUE)
# location of pdftotext.exe file
converter <- "~/Widgets/PDFConverter/bin64/pdftotext"
# folder with text files
textfolder_path <- "~/Dropbox/The Egoist PDFs/textfiles"
# convert PDFs in origin folder into txt files
lapply(PDFfiles, function(i) {
  system(paste(converter, paste0('"', i, '"')), wait=FALSE)
})
# it takes DropBox a bit of time to catch all of the folders
# without this we only end up with 23 txt files for some reason
Sys.sleep(.5)
txtfiles_in_PDFfolder_path <- list.files(path=PDFfolder_path, pattern="*.txt", full.names=TRUE)
# extracting only the Bodies of the articles
bodies <- lapply(txtfiles_in_PDFfolder_path, function(i){
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=Published).*?(?=Prepaid Advertisements)", j, perl=TRUE))
})
# write article-bodies into txt files
lapply(1:length(bodies), function(i){
  write.table(bodies[i],
              file=paste(txtfiles_in_PDFfolder_path[i], "body", "txt", sep="."),
              quote=FALSE, row.names=FALSE, col.names=FALSE, eol=" ")
})
EDIT: A bit more on the bodies variable: it is a list of 24, which is displayed as follows (this is how RStudio shows it; I'm not sure what this view is actually called):
bodies: list of 24
:List of 1
..$ : chr(0)
:List of 1
..$ : chr(0)
(repeating 24 times)
But I can't for the life of me figure out why it's chr(0). I think it has something to do with the same kind of thing that's going on here: I'm definitely not capturing all of the lines.
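As far as I can tell, a list holding a single chr(0) is exactly what regmatches() with gregexpr() gives back when the pattern finds no match at all. Here is a small sketch on made-up strings (`with_markers` and `without_markers` are hypothetical, not my real text), using the same pattern as in my code:

```r
# Same lookaround pattern as in the question's code.
pattern <- "(?<=Published).*?(?=Prepaid Advertisements)"

# Hypothetical inputs: one contains both markers, the other contains neither.
with_markers    <- "Header Published BODY TEXT Prepaid Advertisements footer"
without_markers <- "No markers in this string at all"

# When both markers are present, the text between them is extracted.
hit <- regmatches(with_markers, gregexpr(pattern, with_markers, perl = TRUE))
str(hit)

# When nothing matches, the result is a list of one empty character vector.
miss <- regmatches(without_markers, gregexpr(pattern, without_markers, perl = TRUE))
str(miss)
```

So each chr(0) would be consistent with j not containing "Published" followed later by "Prepaid Advertisements", for example if j is empty or incomplete when the pattern is applied. That is only my reading of the output, not something I have confirmed on the real files.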
I've tried everything I can think of, even switching readLines() for scan(), and I've looked to see if that might help. I've even switched scan() for read.table(), but it turns out that read.table() itself relies on scan! So... I'm stuck, and am just working my way in circles.