Публикации / Publications

36 thoughts on “Публикации / Publications”

  1. We can get value both at the static stage and at the dynamic one
    (fastapi, pydantic, msgspec, beartype, swagger)

    — There are often faster counterparts to popular packages within the language itself (pandas -> polars,
    uvicorn -> granian, fastapi -> (I will cover it in another talk), pydantic v2 -> msgspec); see the sketch below for the pydantic -> msgspec case.
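
    To make that last substitution concrete, here is a minimal sketch comparing the same JSON validation in pydantic v2 and msgspec. The `User`/`UserModel` schema and the payload are purely illustrative, and both libraries are assumed to be installed:

    ```python
    from pydantic import BaseModel
    import msgspec

    payload = b'{"id": 1, "name": "Alice"}'

    # pydantic v2: declarative model, validated from JSON
    class UserModel(BaseModel):
        id: int
        name: str

    user_pd = UserModel.model_validate_json(payload)

    # msgspec: the same schema as a Struct, decoded and validated from JSON
    class User(msgspec.Struct):
        id: int
        name: str

    user_ms = msgspec.json.decode(payload, type=User)

    print(user_pd, user_ms)
    ```

    The two snippets express the same schema; msgspec typically decodes and validates noticeably faster, which is the point of the substitution.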

  2. import pandas as pd
    import numpy as np
    import json

    # Read in the Excel file
    df = pd.read_excel('file.xlsx')

    # Convert the 'date' column to datetime
    df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')

    # Sort the dataframe by date
    df.sort_values('date', inplace=True)

    # List of holidays (modify this according to your needs)
    holidays = ['2019-12-25', '2020-01-01']

    # Calculate the number of business days between each operation and the next one
    next_dates = df['date'].shift(-1)
    df['business_days_between_operations'] = [
        np.busday_count(current.date(), following.date(), holidays=holidays)
        if pd.notna(following) else np.nan
        for current, following in zip(df['date'], next_dates)
    ]

    # Resample the data to get the maximum number of business days between operations for each month
    max_business_days_between_operations = df.resample('M', on='date')['business_days_between_operations'].max()

    # Convert the result to a dictionary with string keys so it can be serialized to JSON
    result = {str(ts.date()): value for ts, value in max_business_days_between_operations.items()}

    # Convert dictionary to JSON
    json_result = json.dumps(result)

    print(json_result)

  3. To automatically select the date format from a list of candidates when reading a DataFrame from an Excel file, you can use the pandas library in Python. Here’s a code snippet that demonstrates this:

    import pandas as pd

    # List of date formats to check
    date_formats = ['%m/%d/%Y', '%d/%m/%Y', '%Y-%m-%d']

    # Read the Excel file into a DataFrame
    df = pd.read_excel('your_file.xlsx')

    # Iterate over each column in the DataFrame
    for col in df.columns:
        # Iterate over each date format candidate
        for date_format in date_formats:
            try:
                # Attempt to convert the column to datetime using the current format
                df[col] = pd.to_datetime(df[col], format=date_format)
                # If successful, break the loop and move to the next column
                break
            except ValueError:
                pass

    # Print the updated DataFrame with converted dates
    print(df)

    In this code snippet, we first define a list of date formats (date_formats) that you want to check for each column. Then, we read the Excel file into a DataFrame using pd.read_excel(). Next, we iterate over each column in the DataFrame and try to convert it to datetime using each date format candidate. If the conversion is successful, we break the loop and move on to the next column. Finally, we print the updated DataFrame with converted dates.

    Note: Make sure to replace ‘your_file.xlsx’ with the actual path and filename of your Excel file.

  4. Sure! To handle a column with mixed date formats, you can modify the previous code snippet to check each value in the column individually and attempt to convert it using each date format candidate. Here’s an updated version of the code:

    import pandas as pd

    # List of date formats to check
    date_formats = ['%m/%d/%Y', '%d/%m/%Y', '%Y-%m-%d']

    # Read the Excel file into a DataFrame
    df = pd.read_excel('your_file.xlsx')

    # Iterate over each column in the DataFrame
    for col in df.columns:
        # Iterate over each value in the column
        for idx, value in df[col].items():
            # Iterate over each date format candidate
            for date_format in date_formats:
                try:
                    # Attempt to convert the value to datetime using the current format
                    df.at[idx, col] = pd.to_datetime(value, format=date_format)
                    # If successful, break the loop and move to the next value
                    break
                except ValueError:
                    pass

    # Print the updated DataFrame with converted dates
    print(df)

    In this updated code, we added an additional loop to iterate over each value in the column (for idx, value in df[col].items()). For each value, we try to convert it to datetime using each date format candidate. If the conversion is successful, we update the value in the DataFrame (df.at[idx, col] = pd.to_datetime(value, format=date_format)), which avoids chained-assignment issues, and break the loop to move on to the next value.

    Note: Make sure to replace ‘your_file.xlsx’ with the actual path and filename of your Excel file.

  5. For an experienced Python developer working on a machine learning classification service, optimizing data processing can provide significant speedups and potentially enhance robustness. Here are some upgrades you could consider:

    1. **Efficient Data Structures**: Use pandas DataFrames for structured data. Consider utilizing `categorical` data types for categorical features, as they can be more memory- and speed-efficient.

    2. **Vectorization**: Avoid loops where possible and leverage vectorized operations provided by NumPy, pandas, or similar libraries, which are designed to be fast and efficient (see the short sketch after this list).

    3. **Use Efficient Libraries**: For numerical computation, libraries like NumPy and SciPy are optimized for performance. For machine learning, libraries like Scikit-learn, TensorFlow, or PyTorch offer optimized routines for data processing and model training.

    4. **Concurrent Processing**: Utilize Python’s multiprocessing or concurrent.futures module to parallelize data processing tasks that are independent of each other.

    5. **Optimize I/O Operations**: Reading and writing data can be a bottleneck. Use efficient formats like HDF5, Parquet, or Feather for storing and accessing your data. These formats are optimized for fast read and write operations and allow for selective loading of data.

    6. **Memory Mapping**: For extremely large datasets that cannot fit into memory, use memory mapping (e.g., NumPy’s memmap function) to only load parts of the data into memory as needed.

    7. **Incremental Learning**: If your dataset is too large to process at once, consider incremental learning methods, where the model is updated in batches, reducing memory overhead.

    8. **Data Caching**: When working with data that require expensive preprocessing steps, consider caching the processed data on disk to avoid repeating these steps in subsequent runs.

    9. **Profiling and Optimization**: Profile your code to identify bottlenecks. Use profiling tools like cProfile, line_profiler, or memory_profiler to find inefficiencies.

    10. **Use Just-In-Time Compilation**: Try just-in-time (JIT) compilation provided by libraries such as Numba to compile performance-critical code sections into machine code.

    11. **Algorithmic Improvements**: Review the algorithms and data-processing steps you’re using. Sometimes, using more efficient algorithms or data structures can result in significant speed improvements.

    12. **Robust Error Handling**: When processing data, implement try-except blocks to handle potential data quality issues without crashing your processing pipeline.

    13. **Data Quality Checks**: Implement robust preprocessing to handle missing data, outliers, and anomalies, which can improve both the speed (by preventing reprocessing) and the robustness of your ML service.

    14. **Feature Engineering**: Intelligent feature selection can greatly reduce model complexity and data processing time, focusing only on the most informative features.

    15. **Hardware Acceleration**: If you’re already optimized on the software end, consider using hardware acceleration (e.g., GPUs or TPUs) for data processing and model training.

    16. **Distributed Processing**: For extremely large datasets or complex processing pipelines, consider a distributed processing framework like Apache Spark or Dask, which can handle data that exceeds a machine’s memory capacity and can distribute the workload across clusters.

    By strategically implementing these upgrades, you can significantly speed up your data processing and increase the robustness of your machine learning classification service. The appropriateness of each of these upgrades will depend on the specific bottlenecks and performance issues you are encountering with your current setup.
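
    As a small illustration of the vectorization point above, here is a minimal sketch; the column names and sizes are made up for the example:

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical feature table used only for the comparison
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"amount": rng.random(100_000), "rate": rng.random(100_000)})

    # Row-by-row apply: each row triggers Python-level work
    slow = df.apply(lambda row: row["amount"] * row["rate"], axis=1)

    # Vectorized: one operation over whole columns, executed in optimized C code
    fast = df["amount"] * df["rate"]

    assert np.allclose(slow, fast)
    ```

    On frames of this size the vectorized version is typically orders of magnitude faster, which is usually the cheapest win in a preprocessing pipeline.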

  6. If you’re dealing with deprecation warnings coming from within your virtual environment (for example, from packages installed in the `.venv` folder), you can still use the methods outlined previously, but you might want to be more specific in your filtering so that you don’t suppress warnings from your own code.

    Here’s how you can modify the Python `warnings` filter to ignore warnings specifically from modules in the `.venv` folder:

    1. **Using `warnings.filterwarnings` with a Specific Prefix:**
    You can suppress warnings by specifying a pattern that matches the file paths of your virtual environment. You’ll often find that the file paths of the installed packages within a virtual environment start with the path to the virtual environment itself.

    ```python
    import warnings
    import os
    import re

    # Get the path to the virtual environment
    venv_path = os.path.join(os.path.dirname(__file__), '.venv')

    # Escape backslashes for Windows paths
    venv_path = re.escape(venv_path)

    # Suppress DeprecationWarnings coming from packages within the virtual environment
    warnings.filterwarnings(
        "ignore",
        category=DeprecationWarning,
        module=r"{}.*".format(venv_path)
    )
    ```

    Replace `.venv` with the name of your virtual environment folder if it’s different.

    2. **Using a Context Manager for a Specific Code Block:**

    If you want to suppress the warnings only for a specific block of code and you know that the code will execute functions from the virtual environment, you can use a context manager:

    ```python
    import warnings

    # venv_path is computed (and escaped) as in the previous snippet
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            category=DeprecationWarning,
            module=r"{}.*".format(venv_path)
        )
        # Code that imports or uses modules from the virtual environment goes here
    ```

    Again, it’s important to note that you should use these techniques with care, as they might hide important warnings that could affect the future compatibility and functionality of your code. It’s always best to stay up-to-date with the libraries you’re using and refactor your code to avoid the use of deprecated features.

  7. -Xms6048m
    -Xmx8048m
    -Xss128m
    -XX:NewRatio=2
    -Dfile.encoding=UTF-8
    -XX:SoftRefLRUPolicyMSPerMB=250
    -XX:NewSize=1024m
    -XX:MaxNewSize=1024m
    -XX:PermSize=1024m
    -XX:MaxPermSize=2048m
    -XX:ParallelGCThreads=4
    -XX:MaxTenuringThreshold=1
    -XX:SurvivorRatio=8
    -XX:+UseCodeCacheFlushing

  8. > Evgenii Varseev:
    You are an experienced Python developer. Your app receives a well-formed XML file via a message broker, but for debugging you only have a plain XML message. How would you convert a conventional XML FILE to a WELL-FORMED XML FILE? Please provide a snippet of code.

    > Нейрокот:
    To convert a conventional XML file to a well-formed XML file, you can make use of the `xml.dom.minidom` module in Python. Here’s a snippet of code that demonstrates this conversion:

    ```python
    import xml.dom.minidom

    def convert_to_well_formed_xml(file_path):
        # Parse the conventional XML file
        xml_doc = xml.dom.minidom.parse(file_path)

        # Create an unformatted string representation of the XML document
        xml_str = xml_doc.toprettyxml(indent="")

        # Remove leading/trailing whitespace and newlines
        well_formed_xml_str = xml_str.strip()

        return well_formed_xml_str

    # Example usage
    conventional_xml_file_path = "conventional.xml"

    # Convert the conventional XML file to a well-formed XML string
    well_formed_xml_str = convert_to_well_formed_xml(conventional_xml_file_path)

    # Print the well-formed XML string to console
    print(well_formed_xml_str)
    ```

    In this code snippet, we first parse the conventional XML file using `xml.dom.minidom.parse()`. Then, we use the `toprettyxml()` method to create a string representation of the XML document with indentation. Finally, we remove any leading/trailing whitespace and newlines to obtain the well-formed XML string.

    Note: The `xml.dom.minidom` module is part of Python’s standard library, so no separate installation is required.

    • As an experienced Python developer, it’s important to follow consistent naming conventions to ensure clarity and readability in your code. While it is generally recommended to use verbs or verb phrases for method names to indicate an action or behavior, there are exceptions, and using a name without a verb can be acceptable in certain cases.

      In the context of checking whether a file is oversized, both naming conventions can be suitable depending on your preference and the overall context of your codebase. Here are some considerations for each option:

      1. Naming with a Verb:
      – If you choose a name like `check_if_file_is_oversized`, the method name explicitly conveys the action being performed (checking) and the condition being evaluated (file is oversized).
      – This naming style provides a clear indication of the method’s purpose and behavior, making it easier for other developers (or your future self) to understand the function at a glance.

      2. Naming without a Verb:
      – If you opt for a name like `file_is_oversized`, the method name serves as a descriptive statement about the state or condition of the file (oversized).
      – While there is no explicit verb indicating an action, the name still conveys information about the outcome of the check, which can be sufficient, especially if the call site provides enough context.

      Ultimately, the choice between including a verb in the method name or using a name without a verb is a matter of personal preference and style guidelines within your codebase or organization. Just ensure that the chosen name accurately reflects the purpose and behavior of the method. Whichever naming convention you choose, consistency and clarity are key for effective code maintenance and collaboration; a small illustrative sketch of both options follows below.
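
      For instance, a minimal sketch of the two styles (the size limit, the `Path`-based signature, and the `upload_path`/`reject_upload` names at the call site are assumptions made only for the example):

      ```python
      from pathlib import Path

      MAX_SIZE_BYTES = 5 * 1024 * 1024  # hypothetical limit, for illustration only

      # Option 1: verb-style name, reads as an action being performed
      def check_if_file_is_oversized(path: Path) -> bool:
          return path.stat().st_size > MAX_SIZE_BYTES

      # Option 2: predicate-style name, reads naturally inside a condition
      def file_is_oversized(path: Path) -> bool:
          return path.stat().st_size > MAX_SIZE_BYTES

      # At the call site the predicate style often reads most fluently:
      # if file_is_oversized(upload_path):
      #     reject_upload(upload_path)
      ```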

  9. In Python using pandas, you can achieve this by using the `loc` function to subset the dataframe based on the specified conditions and then update the ‘operation_code’ column with the value 52. Here is a code snippet to accomplish this:

    ```python
    import pandas as pd

    # Create a sample dataframe
    data = {'operation_description': ['credit', 'purchase', 'withdrawn', 'balance'],
            'operation_code': [10, 20, 30, 40]}
    df = pd.DataFrame(data)

    # Set operation code to 52 if 'credit', 'debit', or 'withdrawn' is present in operation_description
    df.loc[df['operation_description'].isin(['credit', 'debit', 'withdrawn']), 'operation_code'] = 52

    # Display the updated dataframe
    print(df)
    ```

    This code snippet first creates a sample dataframe with ‘operation_description’ and ‘operation_code’ columns. It then uses the `loc` function to locate the rows where the ‘operation_description’ is ‘credit’, ‘debit’, or ‘withdrawn’, and sets the ‘operation_code’ to 52 for those rows. Finally, the updated dataframe is displayed.

  10. ```python
    import pandas as pd

    # Create a sample DataFrame
    data = {'operation_description': ['credit card payment', 'withdrawn from atm', 'debit transaction', 'transfer to savings']}
    df = pd.DataFrame(data)

    # Define the list of words to check for in operation_description
    words_to_check = ['credit', 'debit', 'withdrawn']

    # Set the operation code to 52 if any of the words_to_check is present in the operation_description column
    df['operation_code'] = df['operation_description'].apply(lambda x: 52 if any(word in x.lower() for word in words_to_check) else None)

    print(df)
    ```

    • Assignment for a Python ML developer position.
      Focus area: audio analysis and recognition.

      What commercial development experience do you have in an ML position?
      Do you have experience developing your own neural network models or applying existing open source solutions? If so, describe the projects or tasks in which you applied neural networks.
      Which neural network architectures are you familiar with?
      Which programming language do you use in your work?
      Which tools/frameworks do you use, and for which tasks?
      Which libraries do you have to work with?
      Have you worked with SQL before (raw queries, ORMs)? Which ORMs have you used? Which DBMSs have you worked with (SQLite, PostgreSQL, MySQL, MongoDB, ClickHouse, etc.)? What is the largest database you have worked with?
      What difficulties have you encountered when developing and applying neural networks?
      Describe a model training pipeline.
      Link to the additional assignment for this track: https://gitlab.zvonok.in/-/snippets/30.
