Convert a PDF File to a Table

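library(pdftools)

# Clear the workspace and keep strings as character vectors (matters on R < 4.0).
rm(list = ls())
options(stringsAsFactors = FALSE)
setwd("C:/Users/daizao/Desktop/pdf_table_read")

# pdf_text() returns one character string per page.
dz <- pdf_text("mmc3.pdf")

pdf2table <- function(x) {
    # Split each page into lines; accept both \r\n and \n line endings.
    process1 <- strsplit(x, "\\r?\\n")
    # Rewrite double-quoted runs of five whitespace-separated fields as tab-separated text.
    process2 <- lapply(process1, function(x) {
        gsub("\"\\s+(.*?)\\s+(.*?)\\s+(.*?)\\s+(.*?)\\s+(.*?)\"", "\\1\t\\2\\3\\4\t\\5", x, perl = TRUE)
    })
    # One line of text per row.
    test <- data.frame(matrix(unlist(process2), nrow = length(unlist(process2)), byrow = TRUE))
    data <- data.frame()
    # Skip the first line; seq_len() avoids iterating over 2:1 when there is only one line.
    for (i in seq_len(nrow(test))[-1]) {
        a <- unlist(strsplit(test[i, ], " "))
        # Keep only tokens that start with a letter (drops empty strings and numbers).
        b <- a[grepl("^[a-zA-Z]", a)]
        # A row needs a first token, at least one middle token, and a last token.
        if (length(b) < 3) next
        temp1 <- paste(b[2:(length(b) - 1)], collapse = " ")
        temp2 <- cbind(b[1], temp1, b[length(b)])
        data <- rbind(data, temp2)
    }
    return(data)
}

data <- pdf2table(dz)
# The first parsed row holds the column names.
colnames(data) <- unlist(data[1, ])
data <- data[-1, ]
write.table(data, file = "gene.txt", sep = "\t", row.names = FALSE, quote = FALSE)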
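
The row-parsing step inside pdf2table can be tried on its own, without a PDF. Below is a minimal sketch that applies the same token filter to a single made-up line; the sample text and the variable names (line, tokens, words, first, middle, last) are assumptions for illustration and do not come from mmc3.pdf.

```r
# Hypothetical line of extracted PDF text: an identifier, a numeric column,
# a multi-word description, and a final label, separated by runs of spaces.
line <- "GENE1   0.0021   example gene description   up"

# Splitting on a single space turns each run of spaces into empty strings.
tokens <- unlist(strsplit(line, " "))

# Keep only tokens that start with a letter; this drops the empty strings
# and the numeric column.
words <- tokens[grepl("^[a-zA-Z]", tokens)]

# First token, everything in between joined by spaces, and last token.
first  <- words[1]
middle <- paste(words[2:(length(words) - 1)], collapse = " ")
last   <- words[length(words)]

print(c(first, middle, last))
```

Because the filter keeps only tokens that begin with a letter, any purely numeric column is discarded. This approach may suit tables whose columns of interest are all text, but it would need a different rule if numeric columns had to be kept.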