search multiple pdf files for a string
I'm trying to help a local museum search its PDF files. I need a way to search multiple PDF files for a specific string.
The users are not technical, so it would be helpful for the script to look in a specific location, e.g. /mnt/home/newsletter, and search for a string, e.g. 'Howard Hughes', case-insensitively. The user could choose to send the output to the screen or to a text file.
Thanks
or maybe not: https://www.online-tech-tips.com/comput ... s-at-once/
Of course, Burn_It is correct if the goal is to write your own program, especially as this was posted to the Programming Section. But I'm still working on my first cup of coffee (still more right-brain gestalt than left-brain analytical), so the part of the problem that stood out to me was "Users are not technical". PDFs have been around for ages, so somebody must have thought about batch searching them before. I Googled expecting to find a technical discussion, and the above was the first post found.
Why re-invent the wheel? I use FoxitReader all the time. Didn't know it had a batch search capability.
- fabrice_035
- Posts: 765
- Joined: Mon 28 Apr 2014, 17:54
- Location: Bretagne / France
Hello,
You can try https://www.howtogeek.com/228531/how-to ... -in-linux/
(How to Convert a PDF File to Editable Text Using the Command Line in Linux)
https://poppler.freedesktop.org/
Or install poppler-utils with PPM (Puppy Package Manager).
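As a rough illustration of the pdftotext approach: pdftotext writes to standard output when the output file is given as `-`, so the extracted text can be piped straight into grep. The helper name `pdf_contains` is made up for this sketch, and the folder in the commented example is the one from the first post.

```bash
#!/bin/bash
# Sketch only: pdf_contains is a hypothetical helper name.
# Succeeds (exit status 0) if the PDF's text contains the string,
# ignoring case. Needs pdftotext from poppler-utils.
pdf_contains() {
    local needle="$1" file="$2"
    # "-" makes pdftotext write the text to stdout instead of a file
    pdftotext "$file" - 2>/dev/null | grep -qi -- "$needle"
}

# Example use (folder taken from the first post):
# for f in /mnt/home/newsletter/*.pdf; do
#     pdf_contains "Howard Hughes" "$f" && echo "$f"
# done
```

Because grep works line by line, a phrase that pdftotext splits across two lines would be missed by this simple version.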
Bionicpup64-8.0 _ Kernel 5.4.27-64oz _ Asus Rog GL752
- technosaurus
- Posts: 4853
- Joined: Mon 19 May 2008, 01:24
- Location: Blue Springs, MO
MuPDF has most of what you need in its tools, along with a JavaScript interface. The source is pretty easy to follow if you have done any C programming, so you could implement your own tools for specific purposes using the existing ones as a template.
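Before writing any C, the stock command-line tool may already be enough. The sketch below assumes that `mutool draw -F txt` prints a document's text to standard output when no output file is named, which is how I understand the tool to behave; check `mutool draw` on your own install. The function name `mutool_grep` is hypothetical.

```bash
#!/bin/bash
# Sketch only: mutool_grep is a hypothetical name.
# Prints the name of every given PDF whose text contains the string,
# ignoring case. Needs mutool from the MuPDF tools.
mutool_grep() {
    local needle="$1" f
    shift
    for f in "$@"; do
        # -F txt asks for plain-text output (assumed to go to stdout)
        if mutool draw -F txt "$f" 2>/dev/null | grep -qi -- "$needle"; then
            printf '%s\n' "$f"
        fi
    done
}

# Example use:
# mutool_grep "Howard Hughes" /mnt/home/newsletter/*.pdf
```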
Check out my github repositories (https://github.com/technosaurus). I may eventually get around to updating my blogspot (http://bashismal.blogspot.com).
- fabrice_035
- Posts: 765
- Joined: Mon 28 Apr 2014, 17:54
- Location: Bretagne / France
To complete my answer, here is a (simple) program:
#!/bin/bash
#
# Requires pdftotext from poppler-utils ( https://poppler.freedesktop.org/ )
# In Puppy: install POPPLER-UTILS with PPM (Puppy Package Manager)
#
# Search for text in PDF files with the pdftotext tool.
# Usage: <this script> "text to find" [folder]

temp=$(mktemp /tmp/pdfsearch.XXXXXX)

# Remove the temporary file and leave.
sortir() {
    rm -f "$temp"
    echo -e "\n Bye."
    exit
}
trap sortir INT

# Accept pdftotext wherever it is on the PATH.
binary=$(command -v pdftotext)
if [ -z "$binary" ] ; then
    echo "pdftotext not found. End."
    sortir
fi

path="$2"
if [ "$path" = "" ] ; then
    echo "1) Use default path $PWD, you can also specify folder."
    path="$PWD"
else
    echo "search path:$path"
fi
if [ ! -d "$path" ] ; then
    echo "Directory not found!"
    sortir
fi
if [ "$1" = "" ] ; then
    echo "2) This tool searches for text in PDF files: enter a search string!"
    sortir
else
    echo "Search \"$1\" in all .pdf files "
fi

# File names arrive NUL-separated on descriptor 3, so names with spaces
# are safe and the keyboard stays free for the prompt below.
while IFS= read -r -d '' file <&3
do
    echo ">Look in:$file"
    "$binary" "$file" "$temp"
    # -i ignores case, -q only reports whether there was a match
    if grep -qi -- "$1" "$temp" ; then
        echo -e "- Found \"$1\" in $file \nPress [Enter] to continue [o]pen pdf or e[x]it"
        read -r x
        if [ "$x" = "o" ] ; then
            defaultpdfviewer "$file" &
        fi
        if [ "$x" = "x" ] ; then
            sortir
        fi
    fi
done 3< <(find "$path" -iname '*.pdf' -print0)
sortir
fabrice_035, does that include PDFs where the text content is stored as images? (I suspect not, as I don't see any OCR-type links or code.)
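pdftotext only reads a PDF's text layer, so a scanned page that is purely an image comes out empty. One way to find out which files are affected is to check whether pdftotext produces any non-whitespace output at all. This is a minimal sketch; the helper name `has_text_layer` is hypothetical.

```bash
#!/bin/bash
# Sketch only: has_text_layer is a hypothetical helper name.
# Succeeds if pdftotext extracts at least one non-whitespace character.
# Needs pdftotext from poppler-utils.
has_text_layer() {
    # Form feeds between pages count as whitespace, so an image-only
    # PDF yields nothing that matches here.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Example use: list the PDFs that would need OCR before they can be searched.
# for f in /mnt/home/newsletter/*.pdf; do
#     has_text_layer "$f" || echo "no text layer: $f"
# done
```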
( ͡° ͜ʖ ͡°) :wq
Fatdog multi-session usb: http://murga-linux.com/puppy/viewtopic.php?p=1028256#1028256
echo url|sed -e 's/^/(c/' -e 's/$/ hashbang.sh)/'|sh (https://hashbang.sh)
Hi
I find this quite useful myself: https://pdfgrep.org/
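For the request in the first post, pdfgrep could be wrapped in a few lines. The sketch below uses its `-r` (recurse into the folder), `-i` (ignore case) and `-n` (show page numbers) options; the function name `search_newsletters` and its optional third argument are made up for illustration.

```bash
#!/bin/bash
# Sketch only: search_newsletters and its arguments are hypothetical.
#   $1 = text to find, $2 = folder, $3 = optional output file
# Needs pdfgrep.
search_newsletters() {
    local needle="$1" dir="$2" outfile="$3"
    if [ -n "$outfile" ]; then
        # Third argument given: write the results to that text file
        pdfgrep -r -i -n -- "$needle" "$dir" > "$outfile"
    else
        # Otherwise print the results to the screen
        pdfgrep -r -i -n -- "$needle" "$dir"
    fi
}

# Example use:
# search_newsletters "Howard Hughes" /mnt/home/newsletter
# search_newsletters "Howard Hughes" /mnt/home/newsletter results.txt
```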
The plot thickens with the search in PDF files. Once the folks at the museum used the advanced search in Acrobat, they posed the question:
If the pdf files were online, would there be a way to search those files?
I can give them all the space they need on one of my servers, but I simply don't have the skills to write a script that would allow the online search. Anyone know of such a script?
thanks
My elementary internet search skills have not turned one up.
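One possible shape for such a script, offered as a starting point rather than a finished tool: a small CGI script that greps text files extracted from the PDFs in advance. Everything specific here is an assumption, not something from this thread: that the web server can run bash CGI scripts, that each PDF has a matching .txt file beside it (for example made once with pdftotext), that GNU grep is available, and the `TEXT_DIR` and `URL_BASE` values, which are placeholders.

```bash
#!/bin/bash
# Sketch only. TEXT_DIR, URL_BASE and the function names are placeholders.
TEXT_DIR="${TEXT_DIR:-/var/www/newsletter}"   # folder holding the .pdf and .txt files
URL_BASE="${URL_BASE:-/newsletter}"           # URL under which the PDFs are served

# Turn "q=Howard+Hughes" into "Howard Hughes" (decodes "+" and %XX)
get_query() {
    local q
    case "$QUERY_STRING" in
        q=*) q="${QUERY_STRING#q=}" ;;
        *)   q="" ;;
    esac
    q="${q%%&*}"                 # drop any further parameters
    q="${q//+/ }"                # "+" encodes a space
    printf '%b' "${q//%/\\x}"    # "%41" -> "\x41" -> "A"
}

# Escape characters that are special in HTML
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

search_page() {
    local query name
    query=$(get_query)
    printf 'Content-Type: text/html; charset=utf-8\r\n\r\n'
    printf '<form><input name="q"><input type="submit" value="Search"></form>\n'
    [ -z "$query" ] && return 0
    printf '<p>Results for "%s":</p>\n<ul>\n' "$(printf '%s' "$query" | html_escape)"
    # -r recurse, -i ignore case, -l file names only, -F literal string
    grep -rilF --include='*.txt' -- "$query" "$TEXT_DIR" | sort |
    while IFS= read -r name; do
        name="${name#"$TEXT_DIR"/}"
        name=$(printf '%s' "${name%.txt}.pdf" | html_escape)
        printf '<li><a href="%s/%s">%s</a></li>\n' "$URL_BASE" "$name" "$name"
    done
    printf '</ul>\n'
}

# A real CGI script would end by calling: search_page
```

Searching pre-extracted text keeps each request fast, since the PDFs are converted once rather than on every search. File names containing spaces or unusual characters would still need URL-encoding in the links.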