mirror of
https://github.com/novoid/guess-filename.py.git
synced 2026-02-16 13:24:15 +00:00
341 lines
14 KiB
Org Mode
341 lines
14 KiB
Org Mode
* guessfilename.py
|
|
|
|
This Python script tries to come up with a new file name for each
|
|
file from command line argument.
|
|
|
|
It does this with several methods: first, the current file name is
|
|
analyzed and any [[https://en.wikipedia.org/wiki/Iso_date][ISO date/timestamp]] and [[https://github.com/novoid/filetags/][filetags]] are re-used.
|
|
Secondly, if the parsing of the file name did not lead to any new file
|
|
name, the content of the file is analyzed. Following file types are
|
|
supported by now:
|
|
- PDF files
|
|
|
|
The script accepts an arbitrary number of files (see your shell for
|
|
possible length limitations).
|
|
|
|
- *Target group*: users who are able to use command line tools and who
|
|
are using tags in file names.
|
|
- Hosted on github: [[https://github.com/novoid/guess-filename.py]] and PyPi: [[https://pypi.org/project/guessfilename/]]
|
|
|
|
** Why
|
|
|
|
I do scan almost all paper mail. Many of those documents are sent to
|
|
me regularily. Such documents are bills or insurance informations, for
|
|
example.
|
|
|
|
Being too lazy to name those files manually with high chances of
|
|
getting many variants for the same document type, I came up with a
|
|
method to derive file names from either the old file name (cues I
|
|
enter without knowing the exact target file name) or the file content.
|
|
|
|
Analyzing the content enables this script to recognize bills via
|
|
customer numbers or phone numbers, amounts to pay, and so on.
|
|
|
|
** Examples
|
|
|
|
Here are some examples that demonstrate the purpose of this script.
|
|
The generated file names are following [[https://www.karl-voit.at/managing-digital-photographs/][my file name convention]].
|
|
|
|
For better user experience, I like to *define an abbreviation* in [[https://karl-voit.at/apps-I-am-using/][my
|
|
shell]] which also makes the examples easier to read:
|
|
|
|
: alias gf=guessfilename.py
|
|
|
|
A very simple example is a *simple bill*:
|
|
|
|
: gf "2016-03-05 phone 12,34 €.pdf"
|
|
: → "2016-03-05 COMPANY landline 12,34€ -- scan bill.pdf"
|
|
|
|
Some mobile apps generate weird formatted file names. Here is some *recording*:
|
|
|
|
: gf "rec_20171129-0902 A nice recording .wav"
|
|
: → "2017-11-29T09.02 A nice recording.wav"
|
|
|
|
*Android screenshot* files tend to look like that:
|
|
|
|
: gf "Screenshot_2017-11-29_10-32-12.png"
|
|
: → "2017-11-29T10.32.12 -- screenshots.png"
|
|
|
|
*Android photographs* are handled similarly:
|
|
|
|
: gf "IMG_20190118_133928.jpg"
|
|
: → "2019-01-18T13.39.28.jpg"
|
|
|
|
*Files saved from [[https://signal.org/][Signal]]* do have strange default names as well:
|
|
|
|
: gf "signal-2018-03-08-102332.jpg"
|
|
: → "2018-03-08T10.23.32.jpg"
|
|
|
|
Many companies like to generate *really silly file names*. This is from my bank:
|
|
|
|
: gf "C110014365208EUR20150930001.pdf"
|
|
: → "2015-09-30 Bank statement 2015-001 10014365208.pdf"
|
|
|
|
This script is able to *parse content of PDF* file in order to get
|
|
meta-data to generate the new file name. This can be applied to you
|
|
*salary*, for example:
|
|
|
|
: gf "2020-03-04 salary.pdf"
|
|
: → "2020-02-29 MYCOMPANY salary for February 1234,56€ -- finance.pdf"
|
|
|
|
As you can see, "guessfilename" makes your digital life easier when
|
|
you do have recurring file rename tasks.
|
|
|
|
** Usage
|
|
|
|
#+BEGIN_SRC sh :results output :wrap src
|
|
guessfilename --help
|
|
#+END_SRC
|
|
|
|
#+BEGIN_src
|
|
Usage:
|
|
guessfilename [<options>] <list of files>
|
|
|
|
This little Python script tries to rename files according to pre-defined rules.
|
|
|
|
It does this with several methods: first, the current file name is analyzed and
|
|
any ISO date/timestamp and filetags are re-used. Secondly, if the parsing of the
|
|
file name did not lead to any new file name, the content of the file is analyzed.
|
|
|
|
You have to adapt the rules in the Python script to meet your requirements.
|
|
The default rule-set follows the filename convention described on
|
|
http://karl-voit.at/managing-digital-photographs/
|
|
|
|
|
|
:copyright: (c) by Karl Voit
|
|
:license: GPL v3 or any later version
|
|
:URL: https://github.com/novoid/guess-filename.py
|
|
:bugreports: via github or <tools@Karl-Voit.at>
|
|
|
|
|
|
Options:
|
|
-h, --help show this help message and exit
|
|
-d, --dryrun enable dryrun mode: just simulate what would happen, do not
|
|
modify files
|
|
-v, --verbose enable verbose mode
|
|
-q, --quiet enable quiet mode
|
|
--version display version and exit
|
|
#+END_src
|
|
|
|
** Pixel Images and Videos
|
|
:PROPERTIES:
|
|
:CREATED: [2020-11-15 Sun 17:07]
|
|
:END:
|
|
|
|
I added handling for [[https://karl-voit.at/2020/11/15/pixel4a-migration/][my Pixel 4a]] camera results: JPEG images and MP4 videos.
|
|
|
|
Due to [[https://www.reddit.com/r/Pixel4a/comments/jubshe/fixing_the_messy_timestamps_of_pixel_4a_camera/][a somewhat messy meta data situation]] I had to use the
|
|
=File:FileModifyDate= [[https://en.wikipedia.org/wiki/Exif][Exif]] meta-data in order to get time-stamps from
|
|
the local time zone. If you happen to apply guessfilename after
|
|
modifying the file due to copying or editing, you will get wrong
|
|
time-stamps. Therefore, use [[https://syncthing.net/][Syncthing]] or similar synchronzation tools
|
|
that preserve file modification time to get the files from the mobile
|
|
to your computer. Apply guessfilename before modifying the files any
|
|
further.
|
|
|
|
Furthermore, you will need to install [[https://exiftool.org/][ExifTool]] as an external
|
|
dependency. I was not able to find a Python-only Exif library that
|
|
provided me read access to advanced Exif values the Pixel is using.
|
|
|
|
** MediathekView
|
|
:PROPERTIES:
|
|
:CREATED: [2018-05-10 Thu 17:03]
|
|
:END:
|
|
|
|
When downloading TV shows using [[https://github.com/mediathekview/MediathekView][MediathekView]], you should use the following download pattern:
|
|
|
|
- MediathekView v11:
|
|
: %DT%d %s - %t - %T -ORIGINAL- %N.mp4
|
|
|
|
- MediathekView v13:
|
|
- Einstellungen > Aufzeichnen und Abspielen > Set bearbeiten
|
|
- [Set-Name] > Hilfsprogramme:
|
|
- ffmpeg > Zieldateiname > =%DT%d %s - %t - %T -ORIGINALhd- %N.mp4=
|
|
- ffmpeg > Schalter > =-user_agent "Mozilla" -i %f -c copy -bsf:a aac_adtstoasc **=
|
|
|
|
When applying =guessfilename= on the resulting files, you will get something like this:
|
|
|
|
#+BEGIN_EXAMPLE
|
|
20180509T235000 ORF - ZIB 24 - Auswirkungen nach US-Aus für Atomdeal -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Auswirkungen-na__13976363__o__1735069995__s14297628_8__BCK1HD_23514710P_23540405P_Q4A.mp4 ...
|
|
→ 2018-05-09T23.51.47 ORF - ZIB 24 - Auswirkungen nach US-Aus für Atomdeal -- lowquality.mp4
|
|
|
|
20180509T235000 ORF - ZIB 24 - Hirntoter Bub plötzlich aufgewacht -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Hirntoter-Bub-p__13976363__o__5119815115__s14297631_1__BCK1HD_00045915P_00072303P_Q4A.mp4 ...
|
|
→ 2018-05-09T00.04.59 ORF - ZIB 24 - Hirntoter Bub plötzlich aufgewacht -- lowquality.mp4
|
|
|
|
20180509T235000 ORF - ZIB 24 - Meldungen -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Meldungen__13976363__o__1117657593__s14297632_2__BCK1HD_00072303P_00085816P_Q4A.mp4 ...
|
|
→ 2018-05-09T00.07.23 ORF - ZIB 24 - Meldungen -- lowquality.mp4
|
|
|
|
20180509T235000 ORF - ZIB 24 - Neuerung bei Filmfestspielen in Cannes -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Neuerung-bei-Fi__13976363__o__1941003027__s14297634_4__BCK1HD_00085816P_00111715P_Q4A.mp4 ...
|
|
→ 2018-05-09T00.08.58 ORF - ZIB 24 - Neuerung bei Filmfestspielen in Cannes -- lowquality.mp4
|
|
|
|
20180509T235000 ORF - ZIB 24 - Trumps CIA-Kandidatin umstritten -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Trumps-Kandidat__13976363__o__1488806017__s14297630_0__BCK1HD_00020922P_00045915P_Q4A.mp4 ...
|
|
→ 2018-05-09T00.02.09 ORF - ZIB 24 - Trumps CIA-Kandidatin umstritten -- lowquality.mp4
|
|
|
|
20180509T235000 ORF - ZIB 24 - Wetter -ORIGINAL- 2018-05-09_2350_tl_01_ZIB-24_Wetter__13976363__o__2966973785__s14297635_5__BCK1HD_00111715P_00120000P_Q4A.mp4 ...
|
|
→ 2018-05-09T00.11.17 ORF - ZIB 24 - Wetter -- lowquality.mp4
|
|
#+END_EXAMPLE
|
|
|
|
As you can see, the temporal order of the chunks is extracted so that
|
|
the files are in their correct order.
|
|
|
|
Please note that this does not work with a show whose chunks do cross
|
|
midnight since the date is always taken from the start of the show and
|
|
the time from the actual time being shown.
|
|
|
|
** .info.json Meta-Data Files
|
|
:PROPERTIES:
|
|
:CREATED: [2019-10-19 Sat 15:21]
|
|
:END:
|
|
|
|
If you do download a media file and its associated separate
|
|
=.info.json= file (both base-names without file extension need to
|
|
match), this tool is able to parse the meta-data to derive a new file
|
|
name.
|
|
|
|
Currently, there are two meta-data formats supported: ORG TVthek and
|
|
YouTube, both via http://rg3.github.io/youtube-dl/
|
|
|
|
: youtube-dl --write-info-json <URL>
|
|
|
|
This results, for example, with files like these:
|
|
|
|
: Durchbruch bei Brexit-Verhandlungen-14577219.info.json
|
|
: Durchbruch bei Brexit-Verhandlungen-14577219.mp4
|
|
: Isolierte Familie - 58-jähriger Österreicher in U-Haft-14577221.info.json
|
|
: Isolierte Familie - 58-jähriger Österreicher in U-Haft-14577221.mp4
|
|
: The Star7 PDA Prototype-Ahg8OBYixL0.info.json
|
|
: The Star7 PDA Prototype-Ahg8OBYixL0.mp4
|
|
|
|
Please notice the associated =mp4= files as well as the =info.json=
|
|
files.
|
|
|
|
Applying guess-filename on these files look like this:
|
|
|
|
#+BEGIN_EXAMPLE
|
|
vk@sherri ~tmp % guessfilename *mp4
|
|
|
|
Durchbruch bei Brexit-Verhandlungen-14577219.mp4 ...
|
|
→ 2019-10-17T16.59.07 ORF - ZIB 17 00 - Durchbruch bei Brexit-Verhandlungen -- highquality.mp4
|
|
|
|
Isolierte Familie - 58-jähriger Österreicher in U-Haft-14577221.mp4 ...
|
|
→ 2019-10-17T17.01.44 ORF - ZIB 17 00 - Isolierte Familie: 58-jähriger Österreicher in U-Haft -- highquality.mp4
|
|
|
|
The Star7 PDA Prototype-Ahg8OBYixL0.mp4 ...
|
|
→ 2007-09-13 youtube - The Star7 PDA Prototype - Ahg8OBYixL0.mp4
|
|
#+END_EXAMPLE
|
|
|
|
The =info.json= files are not removed or renamed.
|
|
|
|
** Extending with your own regular expressions
|
|
|
|
The structure of the script is like the following:
|
|
|
|
- general header, command-line argument parser, ...
|
|
- =handle_logging()=
|
|
- =error_exit()=
|
|
- =FileSizePlausibilityException()=
|
|
- =class GuessFilename()=
|
|
- *a long list of regular expression definitions*
|
|
- =derive_new_filename_from_old_filename()=
|
|
- here, you can *add code to interpret the regular expressions*
|
|
- =derive_new_filename_from_content()=
|
|
- if you want to parse PDF content, add your code here
|
|
- =derive_new_filename_from_json_metadata()=
|
|
- this handles the JSON meta-data files generated by [[https://ytdl-org.github.io/youtube-dl/index.html][youtube-dl]] (see above)
|
|
- =handle_file()=
|
|
- the function that loops over all files is probing for new file names until a function is returning with a new name:
|
|
1. =derive_new_filename_from_old_filename()=
|
|
2. =derive_new_filename_from_content()=
|
|
3. =derive_new_filename_from_json_metadata()=
|
|
4. if no name returned until here: prints out a warning that no new name could be derived
|
|
- The rest of the class consist of a bunch of tool functions, e.g., for parsing and querying:
|
|
- =adding_tags()=
|
|
- =split_filename_entities()=
|
|
- =contains_one_of()=
|
|
- =contains_all_of()=
|
|
- =fuzzy_contains_one_of()=
|
|
- =fuzzy_contains_all_of()=
|
|
- =has_euro_charge()=
|
|
- =get_euro_charge()=
|
|
- =get_euro_charge_from_context_or_basename()=
|
|
- =get_euro_charge_from_context()=
|
|
- =rename_file()=
|
|
- =get_datetime_string_from_named_groups()=
|
|
- =get_date_string_from_named_groups()=
|
|
- =get_datetime_description_extension_filename()=
|
|
- =get_date_description_extension_filename()=
|
|
- =NumToMonth()=
|
|
- =translate_ORF_quality_string_to_tag()=
|
|
- =get_file_size()=
|
|
- =warn_if_ORF_file_seems_to_small_according_to_duration_and_quality_indicator()=
|
|
- =move_to_success_dir()=
|
|
- =move_to_error_dir()=
|
|
- =main()=
|
|
|
|
For the most basic pattern matching, you just have to add regular
|
|
expressions to the =GuessFilename()= class and add the regex matching
|
|
code to =derive_new_filename_from_old_filename()=.
|
|
|
|
Do not forget to add simple tests to =guessfilename_test.py= as well!
|
|
|
|
* Related tools and workflows
|
|
|
|
This tool is part of a tool-set which I use to manage my digital files
|
|
such as photographs. My work-flows are described in [[http://karl-voit.at/managing-digital-photographs/][this blog posting]]
|
|
you might like to read.
|
|
|
|
In short:
|
|
|
|
For *tagging*, please refer to [[https://github.com/novoid/filetags][filetags]] and its documentation.
|
|
|
|
See [[https://github.com/novoid/date2name][date2name]] for easily adding ISO *time-stamps or date-stamps* to
|
|
files.
|
|
|
|
For *easily naming and tagging* files within file browsers that allow
|
|
integration of external tools, see [[https://github.com/novoid/appendfilename][appendfilename]] (once more) and
|
|
[[https://github.com/novoid/filetags][filetags]].
|
|
|
|
Moving to the archive folders is done using [[https://github.com/novoid/move2archive][move2archive]].
|
|
|
|
Having tagged photographs gives you many advantages. For example, I
|
|
automatically [[https://github.com/novoid/set_desktop_background_according_to_season][choose my *desktop background image* according to the
|
|
current season]].
|
|
|
|
Files containing an ISO time/date-stamp gets indexed by the
|
|
filename-module of [[https://github.com/novoid/Memacs][Memacs]].
|
|
|
|
-------------
|
|
|
|
[[http://www.jonasjberg.com/][Jonas Sjöberg]] took my idea and developed the much more advanced (and
|
|
thus a bit more complicated) [[https://github.com/jonasjberg/autonameow][autonameow]]. It uses rule-based renaming,
|
|
analyzes content of plain text, epub, pdf and rtf files, extracts
|
|
meta-data from many different file formats via [[https://www.sno.phy.queensu.ca/%257Ephil/exiftool/][exiftool]] and so forth.
|
|
|
|
-------------
|
|
|
|
[[https://www.reddit.com/r/datacurator/comments/f6ku5p/building_an_auto_file_sorter_need_requirements/][This reddit thread]] brought me to [[https://github.com/unreadablewxy/fs-curator][fs-curator]] whose [[https://github.com/unreadablewxy/fs-curator/wiki][documentation]] looks
|
|
promising. I did not test it and it's still in an early stage.
|
|
However, it could be a future user-friendly part of a workflow that
|
|
watches folders for file changes and applies processes like
|
|
guessfilename.
|
|
|
|
* Alternatives
|
|
|
|
I you don't need the full power of a programming language,
|
|
[[https://github.com/tfeldmann/organize][organize]] might do the trick for you.
|
|
Instead of coding Python, you define your rules within a text file.
|
|
|
|
* Contribute!
|
|
|
|
I am looking for your ideas!
|
|
|
|
If you want to contribute to this cool project, please fork and
|
|
contribute!
|
|
|
|
|
|
* Local Variables :noexport:
|
|
# Local Variables:
|
|
# mode: auto-fill
|
|
# mode: flyspell
|
|
# eval: (ispell-change-dictionary "en_US")
|
|
# End:
|